There are bound to be situations in which this isn’t enough, such as when you want to read in a large amount of text from a file. Using the OpenAI API allows you to send many more tokens in a messages array, with the maximum number depending on your chosen model. This lets you provide large amounts of text to ChatGPT using chunking. Here’s how.
The gpt-4 model currently has a maximum context length of 8,192 tokens. (Here are the docs containing current limits for all the models.) Remember that you can first apply text preprocessing techniques to reduce your input size – in my previous post I achieved a 28% size reduction without losing meaning, with just a little tokenization and pruning.
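To see where you stand before chunking, you can count tokens with tiktoken, the same tokenizer library used later in this post. A minimal sketch (the file name here is just a placeholder):

import tiktoken

# Count how many tokens a text will consume for a given model
enc = tiktoken.encoding_for_model("gpt-4")
text = open("big_file.txt", encoding="utf-8").read()
print(len(enc.encode(text)))  # compare against the 8,192-token limit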
When this isn’t enough to fit your message within the maximum message token limit, you can take a general programmatic approach that sends your input in message chunks. The goal is to divide your text into sections that each fit within the model’s token limit, then send each chunk as a separate message in the conversation thread.
You send your chunks to ChatGPT using the OpenAI library’s ChatCompletion endpoint. ChatGPT returns individual responses for each message, so you may want to process these by, for example, replacing \n escape sequences with line breaks.
Using the OpenAI API, you can send multiple messages to ChatGPT and ask it to wait for you to provide all of the data before answering your prompt. Because ChatGPT is a language model, you can give it these instructions in plain language. Here’s a suggested script:
Prompt: Summarize the following text for me
To provide the context for the above prompt, I will send you text in parts. When I am finished, I will tell you “ALL PARTS SENT”. Do not answer until you have received all the parts.
I created a Python module, chatgptmax, that puts all this together. It breaks up a large amount of text by a given maximum token length and sends it in chunks to ChatGPT. You can install it with pip install chatgptmax, but here’s the juicy part:
import os
import openai
import tiktoken
# Set up your OpenAI API key
# Load your API key from an environment variable or secret management service
openai.api_key = os.getenv("OPENAI_API_KEY")
def send(
    prompt=None,
    text_data=None,
    chat_model="gpt-3.5-turbo",
    model_token_limit=8192,
    max_tokens=2500,
):
    """
    Send the prompt at the start of the conversation and then send chunks
    of text_data to ChatGPT via the OpenAI API. If the text_data is too
    long, it splits it into chunks and sends each chunk separately.

    Args:
    - prompt (str, optional): The prompt to guide the model's response.
    - text_data (str, optional): Additional text data to be included.
    - max_tokens (int, optional): Maximum tokens for each API call. Default is 2500.

    Returns:
    - list or str: A list of model's responses for each chunk or an error message.
    """
    # Check if the necessary arguments are provided
    if not prompt:
        return "Error: Prompt is missing. Please provide a prompt."
    if not text_data:
        return "Error: Text data is missing. Please provide some text data."

    # Initialize the tokenizer
    tokenizer = tiktoken.encoding_for_model(chat_model)

    # Encode the text_data into token integers
    token_integers = tokenizer.encode(text_data)

    # Split the token integers into chunks based on max_tokens,
    # reserving room for the prompt itself
    chunk_size = max_tokens - len(tokenizer.encode(prompt))
    chunks = [
        token_integers[i : i + chunk_size]
        for i in range(0, len(token_integers), chunk_size)
    ]

    # Decode token chunks back to strings
    chunks = [tokenizer.decode(chunk) for chunk in chunks]

    responses = []
    messages = [
        {"role": "user", "content": prompt},
        {
            "role": "user",
            "content": "To provide the context for the above prompt, I will send you text in parts. When I am finished, I will tell you 'ALL PARTS SENT'. Do not answer until you have received all the parts.",
        },
    ]

    for chunk in chunks:
        messages.append({"role": "user", "content": chunk})

        # Check if total tokens exceed the model's limit and remove oldest chunks if necessary
        while (
            sum(len(tokenizer.encode(msg["content"])) for msg in messages)
            > model_token_limit
        ):
            messages.pop(1)  # Remove the oldest chunk

        response = openai.ChatCompletion.create(model=chat_model, messages=messages)
        chatgpt_response = response.choices[0].message["content"].strip()
        responses.append(chatgpt_response)

    # Add the final "ALL PARTS SENT" message
    messages.append({"role": "user", "content": "ALL PARTS SENT"})
    response = openai.ChatCompletion.create(model=chat_model, messages=messages)
    final_response = response.choices[0].message["content"].strip()
    responses.append(final_response)

    return responses
Here’s an example of how you can use this module with text data read from a file. (chatgptmax also provides a convenience method for getting text from a file.)
# First, import the necessary modules and the function
import os
from chatgptmax import send

# Define a function to read the content of a file
def read_file_content(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Use the function
if __name__ == "__main__":
    # Specify the path to your file
    file_path = "path_to_your_file.txt"

    # Read the content of the file
    file_content = read_file_content(file_path)

    # Define your prompt
    prompt_text = "Summarize the following text for me:"

    # Send the file content to ChatGPT
    responses = send(prompt=prompt_text, text_data=file_content)

    # Print the responses
    for response in responses:
        print(response)
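To post-process the per-chunk responses mentioned earlier, you might simply join them – a trivial sketch:

# Combine the chunked responses into a single block of text
summary = "\n\n".join(responses)
print(summary)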
While the module is designed to handle most standard use cases, there are potential pitfalls to be aware of.
As with any process, there’s always room for improvement. One way you might optimize the module’s chunking and sending process further is to send chunks concurrently instead of one at a time; if you’re using 32k models or need to use small chunk sizes, however, parallelism gains are likely to be minimal.
If you found your way here via search, you probably already have a use case in mind. Here are some other (startup) ideas:
Do you have a use case I didn’t list? Let me know about it! In the meantime, have fun sending lots of text to ChatGPT.
Text preprocessing can help shorten and refine your input, ensuring that ChatGPT can grasp the essence without getting overwhelmed. In this article, we’ll explore these techniques, understand their importance, and see how they make your interactions with tools like ChatGPT more reliable and productive.
Text preprocessing prepares raw text data for analysis by NLP models. Generally, it distills everyday text (like full sentences) to make it more manageable, concise, and meaningful. Techniques include tokenization and various forms of pruning.
While all these techniques can help reduce the size of raw text data, some are easier to apply to general use cases than others. Let’s examine how text preprocessing can help us send a large amount of text to ChatGPT.
In the realm of Natural Language Processing (NLP), a token is the basic unit of text that a system reads. At its simplest, you can think of a token as a word, but depending on the language and the specific tokenization method used, a token can represent a word, part of a word, or even multiple words.
While in English we often equate tokens with words, in NLP, the concept is broader. A token can be as short as a single character or as long as a word. For example, with word tokenization, the sentence “Unicode characters such as emojis are not indivisible. ✂️” can be broken down into tokens like this: [“Unicode”, “characters”, “such”, “as”, “emojis”, “are”, “not”, “indivisible”, “.”, “✂️”]
In another form called Byte-Pair Encoding (BPE), the same sentence is tokenized as: [“Un”, “ic”, “ode”, " characters", " such", " as", " em", “oj”, “is”, " are", " not", " ind", “iv”, “isible”, “.”, " �", “�️”]. The emoji itself is split into tokens containing its underlying bytes, which display here as replacement characters.
Depending on the ChatGPT model chosen, your text input size is restricted by tokens. Here are the docs containing current limits. BPE is used by ChatGPT to determine token count, and we’ll discuss it more thoroughly later. First, we can programmatically apply some preprocessing techniques to reduce our text input size and use fewer tokens.
For a general approach that can be applied programmatically, pruning is a suitable preprocessing technique. One form is stop word removal, or removing common words that might not add significant meaning in certain contexts. For example, consider the sentence:
“I always enjoy having pizza with my friends on weekends.”
Stop words are often words that don’t carry significant meaning on their own in a given context. In this sentence, words like “I”, “always”, “enjoy”, “having”, “with”, “my”, “on” are considered stop words.
After removing the stop words, the sentence becomes:
“pizza friends weekends.”
Now, the sentence is distilled to its key components, highlighting the main subject (pizza) and the associated context (friends and weekends). If you find yourself wishing you could convince people to do this in real life (coughmeetingscough)… you aren’t alone.
Stop word removal is straightforward to apply programmatically: given a list of stop words, examine some text input to see if it contains any of the stop words on your list. If it does, remove them, then return the altered text.
def clean_stopwords(text: str) -> str:
    stopwords = ["a", "an", "and", "at", "but", "how", "in", "is", "on", "or", "the", "to", "what", "will"]
    tokens = text.split()
    clean_tokens = [t for t in tokens if t not in stopwords]
    return " ".join(clean_tokens)
To see how effective stop word removal can be, I took the entire text of my Tech Leader Docs newsletter (17,230 words consisting of 104,892 characters) and processed it using the above function. How effective was it? The resulting text contained 89,337 characters, which is about a 15% reduction in size.
Other pruning techniques can also be applied programmatically. Removing punctuation, numbers, HTML tags, URLs and email addresses, or non-alphabetical characters are all valid pruning techniques that can be straightforward to apply. Here is a function that does just that:
import re

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Remove everything that's not a letter (a-z, A-Z) or whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse whitespace, tabs, and new lines into single spaces
    text = ' '.join(text.split())
    return text
What measure of length reduction might we get from this additional processing? Applying these techniques to the remaining characters of Tech Leader Docs results in just 75,217 characters – an overall reduction of about 28% from the original text.
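Putting both passes together, here’s a quick way to check the reduction on your own text (the input file name is hypothetical):

# Measure the size reduction from both preprocessing passes
raw = open("newsletter.txt", encoding="utf-8").read()
processed = clean_text(clean_stopwords(raw))
print(f"Reduced by {1 - len(processed) / len(raw):.0%}")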
More opinionated pruning, such as removing short words or specific words or phrases, can be tailored to a specific use case. These don’t lend themselves well to general functions, however.
Now that you have some text processing techniques in your toolkit, let’s look at how a reduction in characters translates to fewer tokens used when it comes to ChatGPT. To understand this, we’ll examine Byte-Pair Encoding.
Byte-Pair Encoding (BPE) is a subword tokenization method. It was originally introduced for data compression but has since been adapted for tokenization in NLP tasks. It represents common words as single tokens and splits rarer words into subword units. This enables a balance between character-level and word-level tokenization.
Let’s make that more concrete. Imagine you have a big box of LEGO bricks, and each brick represents a single letter or character. You’re tasked with building words using these LEGO bricks. At first, you might start by connecting individual bricks to form words. But over time, you notice that certain combinations of bricks (or characters) keep appearing together frequently, like “th” in “the” or “ing” in “running.”
BPE is like a smart LEGO-building buddy who suggests, “Hey, since ’th’ and ‘ing’ keep appearing together a lot, why don’t we glue them together and treat them as a single piece?” This way, the next time you want to build a word with “the” or “running,” you can use these glued-together pieces, making the process faster and more efficient.
Colloquially, the BPE algorithm looks like this: start with a vocabulary of individual characters, count how often each adjacent pair of tokens appears in the text, merge the most frequent pair into a single new token, and repeat until the vocabulary reaches a target size.
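To make the merge step concrete, here’s a toy sketch of a few BPE training iterations. It’s a simplification – real tokenizers like ChatGPT’s operate on bytes and learn their merges from a massive corpus:

from collections import Counter

def bpe_merge_step(tokens):
    # Count adjacent pairs and merge the single most frequent one
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)  # glue the pair into one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("the thin thing")
for _ in range(3):
    tokens = bpe_merge_step(tokens)
print(tokens)  # "th" merges first, then longer pieces build up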
BPE is a particularly powerful tokenization method, especially when dealing with diverse and extensive vocabularies: common words stay whole as single tokens, rare or unseen words can still be represented as combinations of subword units, and the overall vocabulary stays compact enough to be computationally practical.
In essence, BPE strikes a balance, offering the granularity of character-level tokenization and the context-awareness of word-level tokenization. This hybrid approach ensures that NLP models like ChatGPT can understand a wide range of texts while maintaining computational efficiency.
At time of writing, a message to ChatGPT via its web interface has a maximum token length of 4,096 tokens. If we take the previously mentioned percent reduction as an average, this means you could reduce text of up to 5,712 tokens down to the appropriate size with text preprocessing alone.
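The arithmetic behind that figure, using the roughly 28.3% reduction measured earlier:

# Largest raw input that preprocessing alone could shrink to 4,096 tokens
print(4096 / (1 - 0.283))  # ≈ 5,712 tokens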
What about when this isn’t enough? Beyond text preprocessing, larger input can be sent in chunks using the OpenAI API. In my next post, I’ll show you how to build a Python module that does exactly that.
Most organizations want to improve productivity and output, but few technical teams seem to take a data-driven approach to discovering productivity bottlenecks. If you’re looking to improve development velocity, a couple key metrics could help your team get unblocked. Here’s how you can apply a smidge of data science to visualize how your repository is doing, and where improvements can be made.
The first and most difficult part, as any data scientist would likely tell you, is ensuring the quality of your data. It’s especially important to consider consistency: are dates throughout the dataset presented in a consistent format? Have tags or labels been applied under consistent rules? Does the dataset contain repeated values, empty values, or unmatched types?
If your repository has previously changed up processes or standards, consider the timeframe of the data you collect. If labels have been applied arbitrarily, they may not be a useful feature. While cleaning data is outside the scope of this article, I can, at least, help you painlessly collect it.
I wrote a straightforward Python utility that uses the GitHub API to pull data for any repository. You can use this on the command line and output the data to a file. It uses the list repository issues endpoint (docs), which, perhaps confusingly, includes both issues and pull requests (PRs) for the repository. I get my data like this:
$ python fetch.py -h
usage: fetch.py [-h] [--token TOKEN] repository months
$ python fetch.py OWASP/wstg 24 > data.json
Using the GitHub API means less worry about standardization, for example, all the dates are expressed as ISO 8601. Now that you have some data to process, it’s time to play with Pandas.
You can use a Jupyter Notebook to do some simple calculations and data visualization.
First, create the Notebook file:
touch stats.ipynb
Open the file in your favorite IDE, or in your browser by running jupyter notebook.
In the first code cell, import Pandas and load your data:
import pandas as pd
data = pd.read_json("data.json")
data
You can then run that cell to see a preview of the data you collected.
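Before computing anything, it’s worth a quick data-quality pass along the lines discussed earlier – a minimal sketch:

# Spot-check for empty values, duplicates, and type consistency
print(data.isna().sum())
print(data.duplicated("id").sum())
print(data.dtypes)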
Pandas is a well-documented data analysis library. With a little imagination and a few keyword searches, you can begin to measure all kinds of repository metrics. For this walk-through, here’s how you can calculate and create a graph that shows the number of days an issue or PR remains open in your repository.
Create a new code cell and, for each item in your Series, subtract the date it was closed from the date it was created:
duration = pd.Series(data.closed_at - data.created_at)
duration.describe()
Series.describe() will give you some summary statistics that look something like these (from mypy on GitHub):
count 514
mean 5 days 08:04:17.239299610
std 14 days 12:04:22.979308668
min 0 days 00:00:09
25% 0 days 00:47:46.250000
50% 0 days 06:18:47
75% 2 days 20:22:49.250000
max 102 days 20:56:30
Series.plot() uses a specified plotting backend (matplotlib by default) to visualize your data. A histogram can be a helpful way to examine issue duration:
duration.apply(lambda x: x.days).plot(kind="hist")
This will plot a histogram that represents the frequency distribution of issues over days, which is one way you can tell how long most issues take to close. For example, mypy seems to handle the majority of issues and PRs within 10 days, with some outliers taking more than three months.
It would be interesting to visualize other repository data, such as its most frequent contributors, or most often used labels. Does a relationship exist between the author or reviewers of an issue and how quickly it is resolved? Does the presence of particular labels predict anything about the duration of the issue?
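As a sketch of how you might probe one of those questions with the same DataFrame (the “bug” label name is just an example):

# Does a given label predict how long an item stays open?
data["duration_days"] = (data.closed_at - data.created_at).dt.days
data["is_bug"] = data.labels.apply(
    lambda ls: any(label["name"] == "bug" for label in ls)
)
print(data.groupby("is_bug").duration_days.median())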
Now that you have some data-driven superpowers, remember that they come with great responsibility. Deciding what to measure is just as, if not more, important than measuring it.
Consider how to translate the numbers you gather into productivity improvements. For example, if your metric is closing issues and PRs faster, what actions can you take to encourage the right behavior in your teams? I’d suggest encouraging issues to be clearly defined, and pull requests to be small and have a well-contained scope, making them easier to understand and review.
To prepare to accurately take measurements for your repository, establish consistent standards for labels, tags, milestones, and other features you might want to examine. Remember that meaningful results are more easily gleaned from higher quality data.
Finally, have fun exercising your data science skills. Who knows what you can discover and improve upon next!
If you’re interested in managing your own mailing list or newsletter, you can set up Simple Subscribe on your own AWS resources to collect email addresses. This open source API is written in Go, and runs on AWS Lambda. Visitors to your site can sign up to your list, which is stored in a DynamoDB table, ready to be queried or exported at your leisure.
When someone signs up, they’ll receive an email asking them to confirm their subscription. This is sometimes called “double opt-in,” although I prefer the term “verified.” Simple Subscribe works on serverless infrastructure and uses an AWS Lambda to handle subscription, confirmation, and unsubscribe requests.
You can find the Simple Subscribe project, with its fully open-source code, on GitHub. I encourage you to pull up the code and follow along! In this post I’ll share each build step, the thought process behind the API’s single-responsibility functions, and security considerations for an AWS project like this one.
A non-verified email sign up process is straightforward. Someone puts their email into a box on your website, then that email goes into your database. However, if I’ve taught you anything about not trusting user input, the very idea of a non-verified sign up process should raise your hackles. Spam may be great when fried in a sandwich, but no fun when it’s running up your AWS bill.
While you can use a strategy like a CAPTCHA or puzzle for is-it-a-human verification, these can create enough friction to turn away your potential subscribers. Instead, a confirmation email can help to ensure both address correctness and user sentience.
To build a subscription flow with email confirmation, create single-responsibility functions that satisfy each logical step. Those are: accepting a subscription request and storing it as unconfirmed, sending a confirmation email, verifying the confirmation and marking the subscription as confirmed, and unsubscribing.
To achieve each of these goals, Simple Subscribe uses the official AWS SDK for Go to interact with DynamoDB and SES.
At each stage, consider what the data looks like and how you store it. This can help to handle conundrums like, “What happens if someone tries to subscribe twice?” or even threat-modeling such as, “What if someone subscribes with an email they don’t own?”
Ready? Let’s break down each step and see how the magic happens.
The subscription process begins with a humble web form, like the one on my site’s main page. A form input with attributes type="email" required helps with validation, thanks to the browser. When submitted, the form sends a GET request to the Simple Subscribe subscription endpoint.
Simple Subscribe receives a GET request to this endpoint with a query string containing the intended subscriber’s email. It then generates an id value and adds both email and id to your DynamoDB table.
The table item now looks like:
email | confirm | id | timestamp
---|---|---|---
subscriber@example.com | false | uuid-xxxxx | 2020-11-01 00:27:39
The confirm column, which holds a boolean, indicates that the item is a subscription request that has not yet been confirmed. To verify an email address in the database, you’ll need to find the correct item and change confirm to true.
As you work with your data, consider the goal of each manipulation and how you might compare an incoming request to existing data.
For example, if someone made a subsequent subscription request for the same email address, how would you handle it? You might say, “Create a new line item with a new id,” however, this might not be the best strategy when your serverless application database is paid for by request volume.
Since DynamoDB Pricing depends on how much data you read and write to your tables, it’s advantageous to avoid piling on excess data.
With that in mind, it would be prudent to handle subscription requests for the same email by performing an update instead of adding a new line. Simple Subscribe actually uses the same function to either add or update a database item. This is typically referred to as “update or insert.” In a database like SQLite this is accomplished with the UPSERT syntax. In the case of DynamoDB, you use an update operation. For the Go SDK, its syntax is UpdateItem.
When a duplicate subscription request is received, the database item is matched on the email only. If an existing line item is found, its id and timestamp are overridden, which updates the existing database record and avoids flooding your table with duplicate requests.
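For illustration, here’s roughly what that update-or-insert step looks like in Python with boto3. Simple Subscribe itself uses the Go SDK; the table name and attribute layout below just follow this post’s examples:

import datetime
import uuid

import boto3

table = boto3.resource("dynamodb").Table("subscribers")  # table name assumed

def upsert_subscription(email):
    # update_item creates the item if none exists for this key, so duplicate
    # requests overwrite id and timestamp instead of adding new rows
    table.update_item(
        Key={"email": email},
        UpdateExpression="SET #i = :i, #c = :c, #t = :t",
        # Aliases sidestep DynamoDB reserved words such as TIMESTAMP
        ExpressionAttributeNames={"#i": "id", "#c": "confirm", "#t": "timestamp"},
        ExpressionAttributeValues={
            ":i": str(uuid.uuid4()),
            ":c": False,
            ":t": datetime.datetime.utcnow().isoformat(),
        },
    )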
After submitting the form, the intended subscriber then receives an email from SES containing a link. This link is built using the email and id from the table, and takes the format:
<BASE_URL><VERIFY_PATH>/?email=subscriber@example.com&id=uuid-xxxxx
In this setup, the id is a UUID that acts as a secret token. It provides an identifier you can match that is sufficiently complex and hard to guess. This approach deters people from subscribing with email addresses they don’t control.
Visiting the link sends a request to your verification endpoint with the email and id in the query string. This time, it’s important to compare both the incoming email and id values to the database record. This verifies that the recipient of the confirmation email is initiating the request.
The verification endpoint ensures that these values match an item in your database, then performs another update operation to set confirm to true, and update the timestamp. The item now looks like:

email | confirm | id | timestamp
---|---|---|---
subscriber@example.com | true | uuid-xxxxx | 2020-11-01 00:37:39
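Continuing the boto3 sketch from earlier – an illustration of that matching logic, not Simple Subscribe’s actual Go code:

def verify_subscription(email, token):
    # Confirm only when BOTH the email and the id match the stored item
    item = table.get_item(Key={"email": email}).get("Item")
    if item is None or item.get("id") != token:
        return False
    table.update_item(
        Key={"email": email},
        UpdateExpression="SET #c = :c, #t = :t",
        ExpressionAttributeNames={"#c": "confirm", "#t": "timestamp"},
        ExpressionAttributeValues={
            ":c": True,
            ":t": datetime.datetime.utcnow().isoformat(),
        },
    )
    return True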
You can now query your table to build your email list. Depending on your email sending solution, you might do this manually, with another Lambda, or even from the command line.
Since data for requested subscriptions (where confirm is false) is stored in the table alongside confirmed subscriptions, it’s important to differentiate this data when querying for email addresses to send to. You’ll want to ensure you only return emails where confirm is true.
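A sketch of that query using boto3’s condition helpers (a real implementation should also follow LastEvaluatedKey to paginate larger tables):

from boto3.dynamodb.conditions import Attr

# Collect only confirmed subscriber addresses
response = table.scan(FilterExpression=Attr("confirm").eq(True))
emails = [item["email"] for item in response["Items"]]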
Similar to verifying an email address, Simple Subscribe uses email and id as arguments to the function that deletes an item from your DynamoDB table in order to unsubscribe an email address. To allow people to remove themselves from your list, you’ll need to provide a URL in each email you send that includes their email and id as a query string to the unsubscribe endpoint. It would look something like:
<BASE_URL><UNSUBSCRIBE_PATH>/?email=subscriber@example.com&id=uuid-xxxxx
When the link is clicked, the query string is passed to the unsubscribe endpoint. If the provided email and id match a database item, that item will be deleted.
Providing a method for your subscribers to automatically remove themselves from your list, without any human intervention necessary, is part of an ethical and respectful philosophy towards handling the data that’s been entrusted to you.
Once you decide to accept other people’s data, it becomes your responsibility to care for it. This is applicable to everything you build. For Simple Subscribe, it means maintaining the security of your database, and periodically pruning your table.
In order to avoid retaining email addresses where confirm is false past a certain time frame, it would be a good idea to set up a cleaning function that runs on a regular schedule. This can be achieved manually, with an AWS Lambda function, or using the command line.
To clean up, find database items where confirm is false and timestamp is older than a particular point in time. Depending on your use case and request volumes, the frequency at which you choose to clean up will vary.
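One more boto3 sketch, assuming a 30-day cutoff; since the timestamps are ISO 8601 strings, they compare correctly with a simple less-than:

import datetime

from boto3.dynamodb.conditions import Attr

# Delete unconfirmed requests older than the cutoff
cutoff = (datetime.datetime.utcnow() - datetime.timedelta(days=30)).isoformat()
stale = table.scan(
    FilterExpression=Attr("confirm").eq(False) & Attr("timestamp").lt(cutoff)
)["Items"]
for item in stale:
    table.delete_item(Key={"email": item["email"]})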
Also depending on your use case, you may wish to keep backups of your data. If you are particularly concerned about data integrity, you can explore On-Demand Backup or Point-in-Time Recovery for DynamoDB.
Building your own subscriber list can be an empowering endeavor! Whether you intend to start a newsletter, send out notifications for new content, or want to create a community around your work, there’s nothing more personal or direct than an email from me to you.
I encourage you to start building your subscriber base with Simple Subscribe today! Like most of my work, it’s open source and free for your personal use. Dive into the code at the GitHub repository or learn more at SimpleSubscribe.org.
Among the several Actions I’ve built, I have two current favorites. One is hugo-remote, which lets you continuously deploy a Hugo static site from a private source repository to a public GitHub Pages repository. This keeps the contents of the source repository private, such as your unreleased drafts, while still allowing you to have a public open source site using GitHub Pages.
The second is django-security-check. It’s an effortless way to continuously check that your production Django application is free from a variety of security misconfigurations. You can think of it as your little CI/CD helper for busy projects – a security linter!
When I was a kid, I spent several summer vacations coding a huge medieval fantasy world MUD (Multi-User Dungeon, like a multiplayer role-playing game) written in LPC, with friends. It was entirely text-based, and built and played via Telnet. I fell in love with the terminal and learned a lot about object-oriented programming and prototype-based programming early on.
I became a freelance developer and had the privilege of working on a wide variety of client projects. Realizing the difficulty that companies have with hiring experienced developers, I built ApplyByAPI.com to help. As you might imagine, it allows candidates to apply for jobs via API, instead of emailing a resume. It’s based on the Django framework, so in the process, I learned even more about building reusable units of software.
When I became a co-author and a core maintainer for the Open Web Application Security Project (OWASP) Web Security Testing Guide (WSTG), I gained an even broader appreciation for how a prototype-based, repeatable approach can help build secure web applications. Organizations worldwide consider the WSTG the foremost open source resource for testing the security of web applications. We’ve applied this thinking via the use of GitHub Actions in our repository – I’ll tell you more about that later.
Whether I’m creating an open source tool or leading a development team, my childhood experience still informs how I think about programming today. I strive to create repeatable units of software like GitHub Actions – only now, I make them for large enterprises in the real world!
Developers take on a lot of responsibility when it comes to building secure applications these days. I’m a full-time senior software developer at a cybersecurity company. I’ve found that I’m maximally productive when I create systems and processes that help myself and my team make desired outcomes inevitable. So I spend my free time building tools that make it easy for other developers to build secure software as well. My Actions help to automate contained, repeatable units of work that can make a big difference in a developer’s day.
Yes! I’m always finding ways for tools like GitHub Actions to boost the velocity of technical teams, whether at work or in my open source projects. Remember the Open Web Application Security Project? In the work I’ve led with OWASP, I’ve championed the effort to increase automation using GitHub Actions to maintain quality, securely deploy new versions to the web, and even build PDFs of the WSTG. We’re constantly looking into new ways that GitHub Actions can make our lives easier and our readers’ projects more secure.
I like that I can build an Action using familiar and portable technologies, like Docker. Actions are easy for collaborators to work on too, since in the case of a Dockerized Action, you can use any language your team is comfortable with. This is especially useful in large organizations with polyglot teams and environments. There aren’t any complicated dependencies for running these portable tasks, and you don’t need to learn any special frameworks to get started.
One of my first blog posts about GitHub Actions even describes how I used an Action to run a Makefile! This is especially useful for large legacy applications that want to modernize their pipeline by using GitHub Actions.
The largest challenge of GitHub Actions isn’t really in GitHub Actions, but in the transition of legacy software and company culture.
Migrating legacy software is always challenging, particularly with large legacy applications. Moving to modern CI/CD processes requires changes at the software level, team level, and even a shift in thinking when it comes to individual developers. It can help to have a tool like GitHub Actions, which is at once seamlessly modern and familiar, when transitioning legacy code to a modern pipeline.
I’m happiest when I’m solving a challenge that makes developing secure software less challenging in the future, both for myself and for the technology organization I’m leading. With tools like GitHub Actions, a lot of mental overhead can be offloaded to automatic processes – like getting a whole other brain, for free! This can massively help organizations that are ready to scale up their development output.
In the realm of cybersecurity, not only does creating portable and reusable software make developers’ lives easier, it helps to make whole workflows repeatable, which in turn makes software development processes more secure. With smart processes in place, technical teams are happier. As an inevitable result, they’ll build better software for customers, too.
With the general availability of GitHub Actions, we have a chance to programmatically access and preserve GitHub event data in our repository. Making the data part of the repository itself is a way of preserving it outside of GitHub, and also gives us the ability to feature the data on a front-facing website, such as with GitHub Pages, through an automated process that’s part of our CI/CD pipeline.
And, if you’re like me, you can turn GitHub issue comments into an awesome 90s guestbook page.
No matter the usage, the principal concepts are the same. We can use Actions to access, preserve, and display GitHub event data - with just one workflow file. To illustrate the process, I’ll take you through the workflow code that makes my guestbook shine on.
For an introductory look at GitHub Actions including how workflows are triggered, see A lightweight, tool-agnostic CI/CD flow with GitHub Actions.
An Action workflow runs in an environment with some default environment variables. A lot of convenient information is available here, including event data. The most complete way to access the event data is using the $GITHUB_EVENT_PATH variable, the path of the file with the complete JSON event payload.
The expanded path looks like /home/runner/work/_temp/_github_workflow/event.json and its data corresponds to its webhook event. You can find the documentation for webhook event data in GitHub REST API Event Types and Payloads. To make the JSON data available in the workflow environment, you can use a tool like jq to parse the event data and put it in an environment variable.
Below, I grab the comment ID from an issue comment event:
ID="$(jq '.comment.id' $GITHUB_EVENT_PATH)"
Most event data is also available via the github.event context variable without needing to parse JSON. The fields are accessed using dot notation, as in the example below where I grab the same comment ID:
ID=${{ github.event.comment.id }}
For my guestbook, I want to display entries with the user’s handle, and the date and time. I can capture this event data like so:
AUTHOR=${{ github.event.comment.user.login }}
DATE=${{ github.event.comment.created_at }}
Shell variables are handy for accessing data, however, they’re ephemeral. The workflow environment is created anew each run, and even shell variables set in one step do not persist to other steps. To persist the captured data, you have two options: use artifacts, or commit it to the repository.
Using artifacts, you can persist data between workflow jobs without committing it to your repository. This is handy when, for example, you wish to transform or incorporate the data before putting it somewhere more permanent.
Two actions assist with using artifacts: upload-artifact and download-artifact. You can use these actions to make files available to other jobs in the same workflow. For a full example, see passing data between jobs in a workflow.
The upload-artifact action’s action.yml contains an explanation of the keywords. The uploaded files are saved in .zip format. Another job in the same workflow run can use the download-artifact action to utilize the data in another step.
You can also manually download the archive on the workflow run page, under the repository’s Actions tab.
Persisting workflow data between jobs does not make any changes to the repository files, as the artifacts generated live only in the workflow environment. Personally, being comfortable working in a shell environment, I see a narrow use case for artifacts, though I’d have been remiss not to mention them. Besides passing data between jobs, they could be useful for creating .zip format archives of, say, test output data. In the case of my guestbook example, I simply ran all the necessary steps in one job, negating any need for passing data between jobs.
To preserve data captured in the workflow in the repository itself, it is necessary to add and push this data to the Git repository. You can do this in the workflow by creating new files with the data, or by appending data to existing files, using shell commands.
To work with the repository files in the workflow, use the checkout action to first get a copy to work with:

- uses: actions/checkout@master
  with:
    fetch-depth: 1
To add comments to my guestbook, I turn the event data captured in shell variables into proper files, using substitutions in shell parameter expansion to sanitize user input and translate newlines to paragraphs. I wrote previously about why user input should be treated carefully.
- name: Turn comment into file
  run: |
    ID=${{ github.event.comment.id }}
    AUTHOR=${{ github.event.comment.user.login }}
    DATE=${{ github.event.comment.created_at }}
    COMMENT=$(echo "${{ github.event.comment.body }}")
    NO_TAGS=${COMMENT//[<>]/\`}
    FOLDER=comments
    printf '%b\n' "<div class=\"comment\"><p>${AUTHOR} says:</p><p>${NO_TAGS//$'\n'/\<\/p\>\<p\>}</p><p>${DATE}</p></div>\r\n" > ${FOLDER}/${ID}.html
By using printf and directing its output with > to a new file, the event data is transformed into an HTML file, named with the comment ID number, that contains the captured event data. Formatted, it looks like:
<div class="comment">
  <p>victoriadrake says:</p>
  <p>This is a comment!</p>
  <p>2019-11-04T00:28:36Z</p>
</div>
When working with comments, one effect of naming files using the comment ID is that a new file with the same ID will overwrite the previous. This is handy for a guestbook, as it allows any edits to a comment to replace the original comment file.
If you’re using a static site generator like Hugo, you could build a Markdown format file, stick it in your content/ folder, and the regular site build will take care of the rest. In the case of my simplistic guestbook, I have an extra step to consolidate the individual comment files into a page. Each time it runs, it overwrites the existing index.html with the header.html portion (>), then finds and appends (>>) all the comment files’ contents in descending order, and lastly appends the footer.html portion to end the page.
- name: Assemble page
  run: |
    cat header.html > index.html
    find comments/ -name "*.html" | sort -r | xargs -I % cat % >> index.html
    cat footer.html >> index.html
Since the checkout action is not quite the same as cloning the repository, at time of writing, there are some issues still to work around. A couple extra steps are necessary to pull, checkout, and successfully push changes back to the master branch, but this is pretty trivially done in the shell.
Below is the step that adds, commits, and pushes changes made by the workflow back to the repository’s master branch.
- name: Push changes to repo
  run: |
    REMOTE=https://${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}
    git config user.email "${{ github.actor }}@users.noreply.github.com"
    git config user.name "${{ github.actor }}"
    git pull ${REMOTE}
    git checkout master
    git add .
    git status
    git commit -am "Add new comment"
    git push ${REMOTE} master
The remote, in fact, our repository, is specified using the github.repository context variable. For our workflow to be allowed to push to master, we give the remote URL using the default secrets.GITHUB_TOKEN variable.
Since the workflow environment is shiny and newborn, we need to configure Git. In the above example, I’ve used the github.actor context variable to input the username of the account initiating the workflow. The email is similarly configured using the default noreply GitHub email address.
If you’re using GitHub Pages with the default secrets.GITHUB_TOKEN variable and without a site generator, pushing changes to the repository in the workflow will only update the repository files. The GitHub Pages build will fail with an error, “Your site is having problems building: Page build failed.”
To enable Actions to trigger a Pages site build, you’ll need to create a Personal Access Token. This token can be stored as a secret in the repository settings and passed into the workflow in place of the default secrets.GITHUB_TOKEN variable. I wrote more about Actions environment and variables in this post.
With the use of a Personal Access Token, a push initiated by the Actions workflow will also update the Pages site. You can see it for yourself by leaving a comment in my guestbook! The comment creation event triggers the workflow, which then takes around 30 seconds to run and update the guestbook page.
Where a site build is necessary for changes to be published, such as when using Hugo, an Action can do this too. However, in order to avoid creating unintended loops, one Action workflow will not trigger another (see what will). Instead, it’s extremely convenient to handle the process of building the site with a Makefile, which any workflow can then run. Simply add running the Makefile as the final step in your workflow job, with the repository token where necessary:
- name: Run Makefile
  env:
    TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: make all
This ensures that the final step of your workflow builds and deploys the updated site.
GitHub Actions provides a neat way to capture and utilize event data so that it’s not only available within GitHub. The possibilities for what you can create with it are only as limited as your imagination!
Did I mention I made a 90s guestbook page? My inner-Geocities-nerd is a little excited.