To support continuous delivery, no human should have direct push permissions on your master
branch. If you develop on GitHub, the latest tag of this branch gets deployed when you create a release – which is hopefully very often, and very automated.
You’re already doing a great job of tracking future features and current bugs as issues (right?). To take a quick aside, an issue is a well-defined piece of work that can be merged to the main branch and deployed without breaking anything. It could be a new piece of functionality, a button component update, or a bug fix.
A short-lived branch-per-issue helps ensure that its resulting pull request doesn't get too large, making it unwieldy and hard to review carefully. The definition of "short" varies depending on the team or project's development velocity: for a small team producing a commercial app (like a startup), the time from issue branch creation to PR probably won't exceed a week. For open source projects like the OWASP WSTG that depend on volunteers working around busy schedules, branches may live for a few weeks to a few months, depending on the contributor. Generally, strive to iterate in as little time as possible.
Here's what this looks like practically. For an issue named (#28) Add user settings page, check out a new branch from master:
# Get all the latest work locally
git checkout master
git pull
# Start your new branch from master
git checkout -b 28/add-settings-page
Work on the issue, and periodically merge in master to resolve conflicts early and avoid bigger ones later:
# Commit to your issue branch
git commit ...
# Get the latest work on master
git checkout master
git pull
# Return to your issue branch and merge in master
git checkout 28/add-settings-page
git merge master
You may prefer to use rebasing instead of merging in master. This happens to be my personal preference as well; however, I've found that people generally have a harder time wrapping their heads around how rebasing works than they do with merging. Interactive rebasing can easily introduce confusing errors, and rewriting history can be confusing to begin with. Since I'm all about reducing cognitive load in developers' processes, I recommend using a merge strategy.
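For completeness, if your team does prefer rebasing, the equivalent of the merge steps above would look something like this:

```sh
# Alternative to merging: replay your issue branch's commits on top of the latest master
git checkout master
git pull
git checkout 28/add-settings-page
git rebase master
```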
When the issue work is ready to PR, open the request against master. Automated tests run. Teammates review the work (using inline comments and suggestions if you're on GitHub). Depending on the project, you may deploy a preview version as well.
Once everything checks out, the PR is merged, the issue is closed, and the branch is deleted.
A common pitfall I've seen that can undermine this flow is not removing branches that are stale or have already been merged. Leftover branches cause confusion and make it more difficult than necessary to differentiate new ones.

If this sounds like a process you'd use, or if you have anything to add, let me know via Webmention!
A slight change in thinking can create a sea change in security. Let's examine how.
When it comes to cybersecurity, I take a pragmatic approach. There aren’t enough sheaves of NIST recommendations in the world to help you if you aren’t accustomed to thinking like the bad guy. To best lead your team to defend against hacking, first know how to hack yourself.
A perusal of the resources linked at the end of this article can help you with a starting point, as will general consideration of your application through the lens of an outsider. Are there potentially vulnerable forms or endpoints you might examine first? Is there someone at your company you could call on the phone and surreptitiously get helpful information from? Defense is a difficult position to hold in any battle. If you aren’t the first person to consider how your application might be attacked, you’ve already lost.
Develop your sense of how to be the bad guy. Every component of software, every interaction, every bit of data, can be useful to the bad guy. The more you hone your ability to consider how a thing can be used for ill, the better able you’ll be to protect it.
When looking at information, ask, “How can I use this information to gain access to more important information?” When considering a user story, ask, “What happens if I do something unexpected?”
In all things, channel your inner four-year-old. Push all the buttons.
Playing offense on your own application lets you find and fix vulnerabilities before they're exploited. That's a luxury you won't get from the real bad guys.
Every part of a system will fail with 100% certainty on a long enough timescale. Thinking a step ahead can help to ensure that when it does, the one failure doesn’t leave your application wide open to others.
To fail secure means that when a system or code fails to perform or does something unexpected, any follow-on effects are halted rather than permitted. This likely takes many forms in many areas of your application, so here are the more common ones I see.
When gating access, deny by default. This most often takes the form of whitelisting, or colloquially, “no one is allowed, except for the people on this list.” In terms of code flow, everything should be denied first. Only allow any particular action after proper credentials are verified.
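Here's a minimal sketch of what deny-by-default can look like in code (the role names and user attributes are illustrative, not from any particular framework):

```python
ALLOWED_ROLES = {"admin", "editor"}

def can_publish(user) -> bool:
    """Deny by default: only explicitly allowed, authenticated roles may publish."""
    if user is None or not getattr(user, "is_authenticated", False):
        return False
    return user.role in ALLOWED_ROLES
```

Every path that isn't explicitly allowed falls through to a denial, so a forgotten case fails closed rather than open.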
For automated workflows such as deployments, ensure each step is dependent on the last. Don’t make the (rather common) mistake of connecting actions to triggers that can kick off a workflow before all the necessary pieces are in place. With the smorgasbord of cloud and CI tools available, failure events may not be obvious or noisy.
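For example, in GitHub Actions you can chain jobs with needs so that a later job simply cannot start until the one it depends on succeeds. A simplified sketch (the job names and commands are placeholders):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - run: make build
  deploy:
    needs: build  # deploy never runs unless build completed successfully
    runs-on: ubuntu-latest
    steps:
      - run: make deploy
```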
Be careful to avoid running flows on timed triggers unless they are completely self-contained. Workflows that unpredictably run faster or slower than expected can throw a whole series of events into disarray, leaving processes half-run and states insecure.
Errors are a frequent gold mine for attackers. Ensure your team's code returns "pretty" errors with content that you can control. "Ugly" errors, returned by default by databases, frameworks, and the like, try to be helpful by exposing lots of debugging detail, and that detail can be extremely valuable to an attacker.
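As a rough sketch of the idea (the fetch callable and the response shape are placeholders, not any particular framework's API):

```python
import logging

logger = logging.getLogger(__name__)

def get_account(fetch, account_id):
    """fetch is whatever data access callable your app uses."""
    try:
        return fetch(account_id)
    except Exception:
        # Log the stack trace privately for your team...
        logger.exception("Failed to fetch account %s", account_id)
        # ...and return a controlled, generic message to the client
        return {"error": "Something went wrong. Please try again later."}
```

The raw exception, complete with stack trace and query details, stays in your logs; the client only ever sees content you chose.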
If your development team doesn’t currently have one central source of information when it comes to keeping track of all your application components, here’s a tip you really need. In software security, less is more (secure).
The more modular an application is, the better its various components can be isolated, protected, or changed out. With a central source of truth for what all those components are (and preferably one that doesn’t rely on manual updates), it’s easier to ensure that your application is appropriately minimalist. Dependency managers, such as Pipenv, are a great example.
Few industries besides technology seem to have produced as many acronyms. Philosophies like Don’t Repeat Yourself (DRY), Keep It Simple Stupid (KISS), You Aren’t Going to Need It (YAGNI), and countless other methodologies all build upon one very basic principle: minimalism. It’s a principle that warrants incorporation in every aspect of an application.
There’s a reason it takes little skill to shoot the broad side of a barn: barns are rather large, and there’s quite a lot of one to hit. Applications bloated by excessive third-party components, repeated code, and unnecessary assets make similarly large targets. The more there is to maintain and protect, the easier it is to hit.
Like Marie Kondo’s method for dispatching the inevitable creep of household clutter, you can reduce your application’s attack surface by considering each component and asking whether it brings you joy. Do all of this component’s functions benefit your application? Is there unnecessary redundancy here? Assess each component and decide how integral it is to the application. Every component is a risk; your job is to decide if it’s a worthwhile risk.
With the basic principles of learning to think like the bad guy, failing securely, and practicing software minimalism, you’re now ready to steep in the specifics. Keeping the fundamentals in mind can help you lead your team to focus your cybersecurity efforts where it matters most.
No Jedi succeeds without a little help from friends. Whether you’re a beginner in the battle against the dark side or a twice-returned-home Jedi Master, these resources provide continuing training and guidance.
I hope you find these thought systems helpful! If you find your interest piqued as well, you can read more of what I’ve written about cybersecurity here.
Can a Makefile improve your DevOps and keep developers happy? How awesome would it be if a new developer working on your project didn't start out by copying and pasting commands from your README? What if instead of:
pip3 install pipenv
pipenv shell --python 3.8
pipenv install --dev
npm install
pre-commit install --install-hooks
# look up how to install Framework X...
# copy and paste from README...
npm run serve
… you could just type:
make start
…and then start working?
I use make every day to take the tedium out of common development activities like updating programs, installing dependencies, and testing. To do all this with a Makefile (GNU make), we use Makefile rules and recipes. Parallels exist for POSIX-flavor make, like Target Rules; here's a great article on POSIX-compatible Makefiles.
Here are some examples of things we can make easier (sorry):
update: ## Do apt upgrade and autoremove
sudo apt update && sudo apt upgrade -y
sudo apt autoremove -y
env:
pip3 install pipenv
pipenv shell --python 3.8
install: ## Install or update dependencies
pipenv install --dev
npm install
pre-commit install --install-hooks
serve: ## Run the local development server
hugo serve --enableGitInfo --disableFastRender --environment development
initial: update env install serve ## Install tools and start development server
Now we have some command-line aliases that you can check in! Great idea! If you’re wondering what’s up with that weird ##
comment syntax, it gets better.
Aliases are great, if you remember what they all are and what they do without constantly typing cat Makefile. Naturally, you need a help command:
.PHONY: help
help: ## Show this help
@egrep -h '\s##\s' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'
With a little command-line magic, this egrep command searches the files in MAKEFILE_LIST for lines that follow the ## pattern, sorts them, and uses awk to split each target name from its description. It then prints a helpful formatted version of the comments.
We'll put it at the top of the file so it's the default target. Now to see all our handy shortcuts and what they do, we just run make, or make help:
help Show this help
initial Install tools and start development server
install Install or update dependencies
serve Run the local development server
update Do apt upgrade and autoremove
Now we have our very own personalized and project-specific CLI tool!
The possibilities for improving your DevOps flow with a self-documenting Makefile are almost endless. You can use one to simplify any workflow and produce some very happy developers.
Please enjoy the (live!) Makefile I use to manage and develop this Hugo site. I hope it inspires you!
SHELL := /bin/bash
.POSIX:
.PHONY: help env shell install upgrade-hugo dev future build start initial
help: ## Show this help
@egrep -h '\s##\s' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'
env:
pip3 install pipenv
shell: ## Enter the virtual environment
pipenv shell
install: ## Install or update dependencies
pipenv install --dev
npm install
npm install -g markdownlint-cli
pre-commit install --install-hooks
HUGO_VERSION:=$(shell curl -s https://api.github.com/repos/gohugoio/hugo/releases/latest | grep 'tag_name' | cut -d '"' -f 4 | cut -c 2-)
upgrade-hugo: ## Get the latest Hugo
mkdir tmp/ && \
cd tmp/ && \
curl -sSL https://github.com/gohugoio/hugo/releases/download/v$(HUGO_VERSION)/hugo_extended_$(HUGO_VERSION)_Linux-64bit.tar.gz | tar -xvzf- && \
sudo mv hugo /usr/local/bin/ && \
cd .. && \
rm -rf tmp/
hugo version
dev: ## Run the local development server
hugo serve --enableGitInfo --disableFastRender --environment development
future: ## Run the local development server in the future
hugo serve --enableGitInfo --buildFuture --disableFastRender --environment development
build: ## Lock dependencies and build site
git pull --recurse-submodules
pipenv lock
hugo --minify --cleanDestinationDir
start: upgrade-hugo dev ## Update Hugo and start development server
initial: env install upgrade-hugo dev ## Install tools and start development server
At front-and-center on your GitHub profile, your README is a great opportunity to let folks know what you’re about, what you find important, and to showcase some highlights of your work. You might like to show off your latest repositories, tweet, or blog post. Keeping it up to date doesn’t have to be a pain either, thanks to continuous delivery tools like GitHub Actions.
My current README refreshes itself daily with a link to my latest blog post. Here's how I'm creating a self-updating README.md with Go and GitHub Actions.
I’ve been writing a lot of Python lately, but for some things I really like using Go. You could say it’s my go-to language for just-for-func
projects. Sorry. Couldn’t stop myself.
To create my README.md, I’m going to get some static content from an existing file, mash it together with some new dynamic content that we’ll generate with Go, then bake the whole thing at 400 degrees until something awesome comes out.
Here’s how we read in a file called static.md
and put it in string
form:
// Unwrap Markdown content
content, err := ioutil.ReadFile("static.md")
if err != nil {
log.Fatalf("cannot read file: %v", err)
return err
}
// Make it a string
stringyContent := string(content)
The possibilities for your dynamic content are only limited by your imagination! Here, I’ll use the github.com/mmcdole/gofeed
package to read the RSS feed from my blog and get the newest post.
fp := gofeed.NewParser()
feed, err := fp.ParseURL("https://victoria.dev/index.xml")
if err != nil {
log.Fatalf("error getting feed: %v", err)
}
// Get the freshest item
rssItem := feed.Items[0]
To join these bits together and produce stringy goodness, we use fmt.Sprintf()
to create a formatted string.
// Whisk together static and dynamic content until stiff peaks form
blog := "Read my latest blog post: **[" + rssItem.Title + "](" + rssItem.Link + ")**"
data := fmt.Sprintf("%s\n%s\n", stringyContent, blog)
Then to create a new file from this mix, we use os.Create(). There are more things to know about deferring file.Close(), but we don't need to get into those details here. We'll add file.Sync() to ensure our README gets written.
// Prepare file with a light coating of os
file, err := os.Create("README.md")
if err != nil {
return err
}
defer file.Close()
// Bake at n bytes per second until golden brown
_, err = io.WriteString(file, data)
if err != nil {
return err
}
return file.Sync()
View the full code here in my README repository.
Mmmm, doesn’t that smell good? 🍪 Let’s make this happen on the daily with a GitHub Action.
You can create a GitHub Action workflow that triggers both on a push to your master
branch as well as on a daily schedule. Here’s a slice of the .github/workflows/update.yaml
that defines this:
on:
push:
branches:
- master
schedule:
- cron: '0 11 * * *'
To run the Go program that rebuilds our README, we first need a copy of our files. We use actions/checkout
for that:
steps:
- name: 🍽️ Get working copy
uses: actions/checkout@master
with:
fetch-depth: 1
This step runs our Go program:
- name: 🍳 Shake & bake README
run: |
cd ${GITHUB_WORKSPACE}/update/
go run main.go
Finally, we push the updated files back to our repository. Learn more about the variables shown at Using variables and secrets in a workflow.
- name: 🚀 Deploy
run: |
git config user.name "${GITHUB_ACTOR}"
git config user.email "${GITHUB_ACTOR}@users.noreply.github.com"
git add .
git commit -am "Update dynamic content"
git push --all -f https://${{ secrets.GITHUB_TOKEN }}@github.com/${GITHUB_REPOSITORY}.git
View the full code for this Action workflow here in my README repository.
Congratulations and welcome to the cool kids’ club! You now know how to build an auto-updating GitHub profile README. You may now go forth and add all sorts of neat dynamic elements to your page – just go easy on the GIFs, okay?
Among the several Actions I've built, I have two current favorites. One is hugo-remote, which lets you continuously deploy a Hugo static site from a private source repository to a public GitHub Pages repository. This keeps the contents of the source repository private, such as your unreleased drafts, while still allowing you to have a public open source site using GitHub Pages.
The second is django-security-check. It’s an effortless way to continuously check that your production Django application is free from a variety of security misconfigurations. You can think of it as your little CI/CD helper for busy projects – a security linter!
When I was a kid, I spent several summer vacations coding a huge medieval fantasy world MUD (Multi-User Dungeon, like a multiplayer role-playing game) written in LPC, with friends. It was entirely text-based, and built and played via Telnet. I fell in love with the terminal and learned a lot about object-oriented programming and prototype-based programming early on.
I became a freelance developer and had the privilege of working on a wide variety of client projects. Realizing the difficulty that companies have with hiring experienced developers, I built ApplyByAPI.com to help. As you might imagine, it allows candidates to apply for jobs via API, instead of emailing a resume. It’s based on the Django framework, so in the process, I learned even more about building reusable units of software.
When I became a co-author and a core maintainer for the Open Web Application Security Project (OWASP) Web Security Testing Guide (WSTG), I gained an even broader appreciation for how a prototype-based, repeatable approach can help build secure web applications. Organizations worldwide consider the WSTG the foremost open source resource for testing the security of web applications. We’ve applied this thinking via the use of GitHub Actions in our repository – I’ll tell you more about that later.
Whether I’m creating an open source tool or leading a development team, my childhood experience still informs how I think about programming today. I strive to create repeatable units of software like GitHub Actions – only now, I make them for large enterprises in the real world!
Developers take on a lot of responsibility when it comes to building secure applications these days. I’m a full-time senior software developer at a cybersecurity company. I’ve found that I’m maximally productive when I create systems and processes that help myself and my team make desired outcomes inevitable. So I spend my free time building tools that make it easy for other developers to build secure software as well. My Actions help to automate contained, repeatable units of work that can make a big difference in a developer’s day.
Yes! I'm always finding ways for tools like GitHub Actions to boost the velocity of technical teams, whether at work or in my open source projects. Remember the Open Web Application Security Project? In the work I've led with OWASP, I've championed the effort to increase automation using GitHub Actions to maintain quality, securely deploy new versions to the web, and even build PDFs of the WSTG. We're constantly looking into new ways that GitHub Actions can make our lives easier and our readers' projects more secure.
I like that I can build an Action using familiar and portable technologies, like Docker. Actions are easy for collaborators to work on too, since in the case of a Dockerized Action, you can use any language your team is comfortable with. This is especially useful in large organizations with polyglot teams and environments. There aren’t any complicated dependencies for running these portable tasks, and you don’t need to learn any special frameworks to get started.
One of my first blog posts about GitHub Actions even describes how I used an Action to run a Makefile! This is especially useful for large legacy applications that want to modernize their pipeline by using GitHub Actions.
The largest challenge of GitHub Actions isn’t really in GitHub Actions, but in the transition of legacy software and company culture.
Migrating legacy software is always challenging, particularly with large legacy applications. Moving to modern CI/CD processes requires changes at the software level, team level, and even a shift in thinking when it comes to individual developers. It can help to have a tool like GitHub Actions, which is at once seamlessly modern and familiar, when transitioning legacy code to a modern pipeline.
I’m happiest when I’m solving a challenge that makes developing secure software less challenging in the future, both for myself and for the technology organization I’m leading. With tools like GitHub Actions, a lot of mental overhead can be offloaded to automatic processes – like getting a whole other brain, for free! This can massively help organizations that are ready to scale up their development output.
In the realm of cybersecurity, not only does creating portable and reusable software make developers’ lives easier, it helps to make whole workflows repeatable, which in turn makes software development processes more secure. With smart processes in place, technical teams are happier. As an inevitable result, they’ll build better software for customers, too.
Multiple threads in Python is a bit of a bitey subject (not sorry) in that the Python interpreter doesn't actually let multiple threads execute at the same time. Python's Global Interpreter Lock, or GIL, prevents multiple threads from executing Python bytecodes at once. Each thread that wants to execute must first wait for the GIL to be released by the currently executing thread. The GIL is pretty much the microphone in a low-budget conference panel, except where no one gets to shout.
This has the advantage of preventing race conditions. It does, however, lack the performance advantages afforded by running multiple tasks in parallel. (If you’d like a refresher on concurrency, parallelism, and multithreading, see Concurrency, parallelism, and the many threads of Santa Claus.) While I prefer Go for its convenient first-class primitives that support concurrency (see Goroutines), this project’s recipients were more comfortable with Python. I took it as an opportunity to test and explore!
Simultaneously performing multiple tasks in Python isn’t impossible; it just takes a little extra work. For Hydra, the main advantage is in overcoming the input/output (I/O) bottleneck.
In order to get web pages to check, Hydra needs to go out to the Internet and fetch them. When compared to tasks that are performed by the CPU alone, going out over the network is comparatively slower. How slow?
Here are approximate timings for tasks performed on a typical PC:
Type | Task | Time |
---|---|---|
CPU | execute typical instruction | 1/1,000,000,000 sec = 1 nanosec |
CPU | fetch from L1 cache memory | 0.5 nanosec |
CPU | branch misprediction | 5 nanosec |
CPU | fetch from L2 cache memory | 7 nanosec |
RAM | Mutex lock/unlock | 25 nanosec |
RAM | fetch from main memory | 100 nanosec |
Network | send 2K bytes over 1Gbps network | 20,000 nanosec |
RAM | read 1MB sequentially from memory | 250,000 nanosec |
Disk | fetch from new disk location (seek) | 8,000,000 nanosec (8ms) |
Disk | read 1MB sequentially from disk | 20,000,000 nanosec (20ms) |
Network | send packet US to Europe and back | 150,000,000 nanosec (150ms) |
Peter Norvig first published these numbers some years ago in Teach Yourself Programming in Ten Years. Since computers and their components change year over year, the exact numbers shown above aren’t the point. What these numbers help to illustrate is the difference, in orders of magnitude, between operations.
Compare the difference between fetching from main memory and sending a simple packet over the Internet. While both these operations occur in less than the blink of an eye (literally) from a human perspective, you can see that sending a simple packet over the Internet is over a million times slower than fetching from RAM. It’s a difference that, in a single-thread program, can quickly accumulate to form troublesome bottlenecks.
In Hydra, the task of parsing response data and assembling results into a report is relatively fast, since it all happens on the CPU. The slowest portion of the program’s execution, by over six orders of magnitude, is network latency. Not only does Hydra need to fetch packets, but whole web pages! One way of improving Hydra’s performance is to find a way for the page fetching tasks to execute without blocking the main thread.
Python has a couple options for doing tasks in parallel: multiple processes, or multiple threads. These methods allow you to circumvent the GIL and speed up execution in a couple different ways.
To execute parallel tasks using multiple processes, you can use Python's ProcessPoolExecutor. A concrete subclass of Executor from the concurrent.futures module, ProcessPoolExecutor uses a pool of processes spawned with the multiprocessing module to avoid the GIL.

This option uses a pool of worker subprocesses that, by default, maxes out at the number of processors on the machine. The multiprocessing module allows you to maximally parallelize function execution across processes, which can really speed up compute-bound (or CPU-bound) tasks.
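Hydra doesn't take this route, but for contrast, here's a minimal sketch of handing a CPU-bound task to a process pool:

```python
from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # Stand-in for a compute-bound task
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # max_workers defaults to the machine's CPU count
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(crunch, [10_000_000] * 4))
    print(results)
```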
Since the main bottleneck for Hydra is I/O and not the processing to be done by the CPU, I’m better served by using multiple threads.
Fittingly named, Python's ThreadPoolExecutor uses a pool of threads to execute asynchronous tasks. Also a subclass of Executor, it uses a defined number of maximum worker threads (at least five by default, according to the formula min(32, os.cpu_count() + 4)) and reuses idle threads before starting new ones, making it pretty efficient.

Here is a snippet of Hydra with comments showing how Hydra uses ThreadPoolExecutor to achieve parallel multithreaded bliss:
# Create the Checker class
class Checker:
    # Queue of links to be checked
    TO_PROCESS = Queue()
    # Maximum workers to run
    THREADS = 100
    # Maximum seconds to wait for HTTP response
    TIMEOUT = 60

    def __init__(self, url):
        ...
        # Create the thread pool
        self.pool = futures.ThreadPoolExecutor(max_workers=self.THREADS)

    def run(self):
        # Run until the TO_PROCESS queue is empty
        while True:
            try:
                target_url = self.TO_PROCESS.get(block=True, timeout=2)
                # If we haven't already checked this link
                if target_url["url"] not in self.visited:
                    # Mark it as visited
                    self.visited.add(target_url["url"])
                    # Submit the link to the pool
                    job = self.pool.submit(self.load_url, target_url, self.TIMEOUT)
                    job.add_done_callback(self.handle_future)
            except Empty:
                return
            except Exception as e:
                print(e)
You can view the full code in Hydra’s GitHub repository.
If you'd like to see the full effect, I compared the run times for checking my website between a prototype single-thread program and the multiheaded (multithreaded) Hydra.
time python3 slow-link-check.py https://victoria.dev
real 17m34.084s
user 11m40.761s
sys 0m5.436s
time python3 hydra.py https://victoria.dev
real 0m15.729s
user 0m11.071s
sys 0m2.526s
The single-thread program, which blocks on I/O, ran in about seventeen minutes. When I first ran the multithreaded version, it finished in 1m13.358s - after some profiling and tuning, it took a little under sixteen seconds. Again, the exact times don’t mean all that much; they’ll vary depending on factors such as the size of the site being crawled, your network speed, and your program’s balance between the overhead of thread management and the benefits of parallelism.
The more important thing, and the result I’ll take any day, is a program that runs some orders of magnitude faster.
I've been helping out a group called the Open Web Application Security Project (OWASP). They're a non-profit foundation that produces some of the foremost application testing guides and cybersecurity resources. OWASP's publications, checklists, and reference materials are a help to security professionals, penetration testers, and developers all over the world. Most of the individual teams that create these materials are run almost entirely by volunteers.
OWASP is a great group doing important work. I’ve seen this firsthand as part of the core team that produces the Web Security Testing Guide. However, while OWASP inspires in its large volunteer base, it lacks in the area of central organization.
This lack of organization was most recently apparent in the group’s website, OWASP.org. A big organization with an even bigger website to match, OWASP.org enjoys hundreds of thousands of visitors. Unfortunately, many of its pages - individually managed by disparate projects - are infrequently updated. Some are abandoned. The website as a whole lacks a centralized quality assurance process, and as a result, OWASP.org is peppered with broken links.
Customers don't like broken links; attackers really do. That's because broken links are a security vulnerability. Broken links can signal opportunities for attacks like broken link hijacking and subdomain takeovers. At their least effective, these attacks can be embarrassing; at their worst, severely damaging to businesses and organizations. One OWASP group, the Application Security Verification Standard (ASVS) project, writes about integrity controls that can help to mitigate the likelihood of these attacks. This knowledge, unfortunately, has not yet propagated throughout the rest of OWASP.
This is the story of how I created a fast and efficient tool to help OWASP solve this problem.
I took on the task of creating a program that could run as part of a CI/CD process to detect and report broken links. Essentially, I needed to build a web crawler.
My original journey through this process was also in Python, as that was a comfortable language choice for everyone in the OWASP group. Personally, I prefer to use Go for higher performance as it offers more convenient concurrency primitives. Between the task and this talk, I wrote three programs: a prototype single-thread Python program, a multithreaded Python program, and a Go program using goroutines. We’ll see a comparison of how each worked out near the end of the talk - first, let’s explore how to build a web crawler.
Here's what our web crawler will need to do:

- Take a starting URL (https://victoria.dev)
- Check only links within the same domain (https://victoria.dev and not https://github.com, for instance)

Here's what the execution flow will look like:
As you can see, the nodes “GET page” -> “HTML” -> “Parse links” -> “Valid link” -> “Check visited” all form a loop. These are what enable our web crawler to continue crawling until all the links on the site have been accounted for in the “Check visited” node. When the crawler encounters links it’s already checked, it will “Stop.” This loop will become more important in a moment.
For now, the question on everyone’s mind (I hope): how do we make it fast?
Here are some approximate timings for tasks performed on a typical PC:
Type | Task | Time |
---|---|---|
CPU | execute typical instruction | 1/1,000,000,000 sec = 1 nanosec |
CPU | fetch from L1 cache memory | 0.5 nanosec |
CPU | branch misprediction | 5 nanosec |
CPU | fetch from L2 cache memory | 7 nanosec |
RAM | Mutex lock/unlock | 25 nanosec |
RAM | fetch from main memory | 100 nanosec |
RAM | read 1MB sequentially from memory | 250,000 nanosec |
Disk | fetch from new disk location (seek) | 8,000,000 nanosec (8ms) |
Disk | read 1MB sequentially from disk | 20,000,000 nanosec (20ms) |
Network | send packet US to Europe and back | 150,000,000 nanosec (150ms) |
Peter Norvig first published these numbers some years ago in Teach Yourself Programming in Ten Years. They typically crop up now and then in articles titled along the lines of, “Latency numbers every developer should know.”
Since computers and their components change year over year, the exact numbers shown above aren’t the point. What these numbers help to illustrate is the difference, in orders of magnitude, between operations.
Compare the difference between fetching from main memory and sending a simple packet over the Internet. While both these operations occur in less than the blink of an eye (literally) from a human perspective, you can see that sending a simple packet over the Internet is over a million times slower than fetching from RAM. It’s a difference that, in a single-thread program, can quickly accumulate to form troublesome bottlenecks.
The numbers above mean that the difference in time it takes to send something over the Internet compared to fetching data from main memory is over six orders of magnitude. Remember the loop in our execution chart? The “GET page” node, in which our crawler fetches page data over the network, is going to be a million times slower than the next slowest thing in the loop!
We don’t need to run our prototype to see what that means in practical terms; we can estimate it. Let’s take OWASP.org, which has upwards of 12,000 links, as an example:
150 milliseconds
x 12,000 links
---------
1,800,000 milliseconds (30 minutes)
A whole half hour, just for the network tasks. It may even be much slower than that, since web pages are frequently much larger than a packet. This means that in our single-thread prototype web crawler, our biggest bottleneck is network latency. Why is this problematic?
I previously wrote about feedback loops. In essence, in order to improve at doing anything, you first need to be able to get feedback from your last attempt. That way, you have the necessary information to make adjustments and get closer to your goal on your next iteration.
For a software developer, bottlenecks can contribute to long and inefficient feedback loops. If I'm waiting on a process that's part of a CI/CD pipeline, in our bottlenecked web crawler example, I'd be sitting around for a minimum of a half hour before learning whether or not changes in my last push were successful, or whether they broke master (hopefully just staging).
Multiply a slow and inefficient feedback loop by many runs per day, over many days, and you’ve got a slow and inefficient developer. Multiply that by many developers in an organization bottlenecked on the same process, and you’ve got a slow and inefficient company.
To add insult to injury, not only are you waiting on a bottlenecked process to run; you’re also paying to wait. Take the serverless example - AWS Lambda, for instance. Here’s a chart showing the cost of functions by compute time and CPU usage.
Again, the numbers change over the years, but the main concepts remain the same: the bigger the function and the longer its compute time, the bigger the cost. For applications taking advantage of serverless, these costs can add up dramatically.
Bottlenecks are a recipe for failure, for both productivity and the bottom line.
The good news is that bottlenecks are mostly unnecessary. If we know how to identify them, we can strategize our way out of them. To understand how, let’s get some tacos.
Everyone, meet Bob. He’s a gopher who works at the taco stand down the street as the cashier. Say “Hi,” Bob.
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 ╔══════════════╗
🌮 Hi I'm Bob 🌳
🌮 ╚══════════════╝ \
🌮 🐹 🌮
🌮
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
Bob works very hard at being a cashier, but he’s still just one gopher. The customers who frequent Bob’s taco stand can eat tacos really quickly; but in order to get the tacos to eat them, they’ve got to order them through Bob. Here’s what our bottlenecked, single-thread taco stand currently looks like:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵
🌮
🌮
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
As you can see, all the customers are queued up, right out the door. Poor Bob handles one customer’s transaction at a time, starting and finishing with that customer completely before moving on to the next. Bob can only do so much, so our taco stand is rather inefficient at the moment. How can we make Bob faster?
We can try splitting the queue:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮 🐹
🌮
🌮 🧑💵🧑💵🧑💵🧑💵🧑💵
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
Now Bob can do some multitasking. For example, he can start a transaction with a customer in one queue; then, while that customer counts their bills, Bob can pop over to the second queue and get started there. This arrangement, known as a concurrency model, helps Bob go a little bit faster by jumping back and forth between lines. However, it’s still just one Bob, which limits our improvement possibilities. If we were to make four queues, they’d all be shorter; but Bob would be very thinly stretched between them. Can we do better?
We could get two Bobs:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
With twice the Bobs, each can handle a queue of his own. This is our most efficient solution for our taco stand so far, since two Bobs can handle much more than one Bob can, even if each customer is still attended to one at a time.
We can do even better than that:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
With quadruple the Bobs, we have some very short queues, and a much more efficient taco stand. In computing, the concept of having multiple workers do tasks in parallel is called multithreading.
In Go, we can apply this concept using goroutines. Here are some illustrative snippets from my Go solution.
In order to share data between our goroutines, we'll need to create some data structures. Our Checker structure will be shared, so it will have a Mutex (mutual exclusion) to allow our goroutines to lock and unlock it. The Checker structure will also hold a list of brokenLinks results, and visitedLinks. The latter will be a map of strings to booleans, which we'll use to directly and efficiently check for visited links. By using a map instead of iterating over a list, our visitedLinks lookup will have a constant complexity of O(1) as opposed to a linear complexity of O(n), thus avoiding the creation of another bottleneck. For more on time complexity, see my coffee-break introduction to time complexity of algorithms article.
type Checker struct {
startDomain string
brokenLinks []Result
visitedLinks map[string]bool
workerCount, maxWorkers int
sync.Mutex
}
...
// Page allows us to retain parent and sublinks
type Page struct {
parent, loc, data string
}
// Result adds error information for the report
type Result struct {
Page
reason string
code int
}
To extract links from HTML data, here's a parser I wrote on top of package html:
// Extract links from HTML
func parse(parent, data string) ([]string, []string) {
doc, err := html.Parse(strings.NewReader(data))
if err != nil {
fmt.Println("Could not parse: ", err)
}
goodLinks := make([]string, 0)
badLinks := make([]string, 0)
var f func(*html.Node)
f = func(n *html.Node) {
if n.Type == html.ElementNode && checkKey(string(n.Data)) {
for _, a := range n.Attr {
if checkAttr(string(a.Key)) {
j, err := formatURL(parent, a.Val)
if err != nil {
badLinks = append(badLinks, j)
} else {
goodLinks = append(goodLinks, j)
}
break
}
}
}
for c := n.FirstChild; c != nil; c = c.NextSibling {
f(c)
}
}
f(doc)
return goodLinks, badLinks
}
If you're wondering why I didn't use a more full-featured package for this project, I highly recommend the story of left-pad. The short of it: more dependencies, more problems.
Here are snippets of the main
function, where we pass in our starting URL and create a queue (or channels, in Go) to be filled with links for our goroutines to process.
func main() {
...
startURL := flag.String("url", "http://example.com", "full URL of site")
...
firstPage := Page{
parent: *startURL,
loc: *startURL,
}
toProcess := make(chan Page, 1)
toProcess <- firstPage
var wg sync.WaitGroup
The last significant piece of the puzzle is to create our workers, which we’ll do here:
for i := range toProcess {
wg.Add(1)
checker.addWorker()
go worker(i, &checker, &wg, toProcess) // 🐹
if checker.workerCount > checker.maxWorkers {
time.Sleep(1 * time.Second) // throttle down
}
}
wg.Wait()
A WaitGroup does just what it says on the tin: it waits for our group of goroutines to finish. When they have, we’ll know our Go web crawler has finished checking all the links on the site.
Here’s a comparison of the three programs I wrote on this journey. First, the prototype single-thread Python version:
time python3 slow-link-check.py https://victoria.dev
real 17m34.084s
user 11m40.761s
sys 0m5.436s
This finished crawling my website in about seventeen-and-a-half minutes, which is rather long for a site at least an order of magnitude smaller than OWASP.org.
The multithreaded Python version did a bit better:
time python3 hydra.py https://victoria.dev
real 1m13.358s
user 0m13.161s
sys 0m2.826s
My multithreaded Python program (which I dubbed Hydra) finished in one minute and thirteen seconds.
How did Go do?
time ./go-link-check --url=https://victoria.dev
real 0m7.926s
user 0m9.044s
sys 0m0.932s
At just under eight seconds, I found the Go version to be extremely palatable.
As fun as it is to simply enjoy the speedups, we can directly relate these results to everything we’ve learned so far. Consider taking a process that used to soak up seventeen minutes and turning it into an eight-second-affair instead. Not only will that give developers a much shorter and more efficient feedback loop, it will give companies the ability to develop faster, and thus grow more quickly - while costing less. To drive the point home: a process that runs in seventeen-and-a-half minutes when it could take eight seconds will also cost over a hundred and thirty times as much to run!
A better work day for developers, and a better bottom line for companies. There’s a lot of benefit to be had in making functions, code, and processes as efficient as possible - by breaking bottlenecks.
With the general availability of GitHub Actions, we have a chance to programmatically access and preserve GitHub event data in our repository. Making the data part of the repository itself is a way of preserving it outside of GitHub, and also gives us the ability to feature the data on a front-facing website, such as with GitHub Pages, through an automated process that's part of our CI/CD pipeline.
And, if you’re like me, you can turn GitHub issue comments into an awesome 90s guestbook page.
No matter the usage, the principal concepts are the same. We can use Actions to access, preserve, and display GitHub event data - with just one workflow file. To illustrate the process, I'll take you through the workflow code that makes my guestbook shine on.
For an introductory look at GitHub Actions including how workflows are triggered, see A lightweight, tool-agnostic CI/CD flow with GitHub Actions.
An Action workflow runs in an environment with some default environment variables. A lot of convenient information is available here, including event data. The most complete way to access the event data is using the $GITHUB_EVENT_PATH
variable, the path of the file with the complete JSON event payload.
The expanded path looks like /home/runner/work/_temp/_github_workflow/event.json
and its data corresponds to its webhook event. You can find the documentation for webhook event data in GitHub REST API Event Types and Payloads. To make the JSON data available in the workflow environment, you can use a tool like jq
to parse the event data and put it in an environment variable.
Below, I grab the comment ID from an issue comment event:
ID="$(jq '.comment.id' $GITHUB_EVENT_PATH)"
Most event data is also available via the github.event
context variable without needing to parse JSON. The fields are accessed using dot notation, as in the example below where I grab the same comment ID:
ID=${{ github.event.comment.id }}
For my guestbook, I want to display entries with the user’s handle, and the date and time. I can capture this event data like so:
AUTHOR=${{ github.event.comment.user.login }}
DATE=${{ github.event.comment.created_at }}
Shell variables are handy for accessing data; however, they're ephemeral. The workflow environment is created anew each run, and even shell variables set in one step do not persist to other steps. To persist the captured data, you have two options: use artifacts, or commit it to the repository.
Using artifacts, you can persist data between workflow jobs without committing it to your repository. This is handy when, for example, you wish to transform or incorporate the data before putting it somewhere more permanent.
Two actions assist with using artifacts: upload-artifact and download-artifact. You can use these actions to make files available to other jobs in the same workflow. For a full example, see passing data between jobs in a workflow.
The upload-artifact
action’s action.yml
contains an explanation of the keywords. The uploaded files are saved in .zip
format. Another job in the same workflow run can use the download-artifact
action to utilize the data in another step.
You can also manually download the archive on the workflow run page, under the repository’s Actions tab.
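A minimal sketch of that pattern might look like this (the job names and the jq filter are illustrative):

```yaml
jobs:
  capture:
    runs-on: ubuntu-latest
    steps:
      - name: Save the comment payload
        run: jq '.comment' "$GITHUB_EVENT_PATH" > comment.json
      - uses: actions/upload-artifact@v1
        with:
          name: comment
          path: comment.json
  assemble:
    needs: capture
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v1
        with:
          name: comment
      # comment.json is now available in the comment/ directory
```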
Persisting workflow data between jobs does not make any changes to the repository files, as the artifacts generated live only in the workflow environment. Personally, being comfortable working in a shell environment, I see a narrow use case for artifacts, though I’d have been remiss not to mention them. Besides passing data between jobs, they could be useful for creating .zip
format archives of, say, test output data. In the case of my guestbook example, I simply ran all the necessary steps in one job, negating any need for passing data between jobs.
To preserve data captured in the workflow in the repository itself, it is necessary to add and push this data to the Git repository. You can do this in the workflow by creating new files with the data, or by appending data to existing files, using shell commands.
To work with the repository files in the workflow, use the checkout
action to first get a copy to work with:
- uses: actions/checkout@master
with:
fetch-depth: 1
To add comments to my guestbook, I turn the event data captured in shell variables into proper files, using substitutions in shell parameter expansion to sanitize user input and translate newlines to paragraphs. I wrote previously about why user input should be treated carefully.
- name: Turn comment into file
run: |
ID=${{ github.event.comment.id }}
AUTHOR=${{ github.event.comment.user.login }}
DATE=${{ github.event.comment.created_at }}
COMMENT=$(echo "${{ github.event.comment.body }}")
NO_TAGS=${COMMENT//[<>]/\`}
FOLDER=comments
printf '%b\n' "<div class=\"comment\"><p>${AUTHOR} says:</p><p>${NO_TAGS//$'\n'/\<\/p\>\<p\>}</p><p>${DATE}</p></div>\r\n" > ${FOLDER}/${ID}.html
By using printf
and directing its output with >
to a new file, the event data is transformed into an HTML file, named with the comment ID number, that contains the captured event data. Formatted, it looks like:
<div class="comment">
<p>victoriadrake says:</p>
<p>This is a comment!</p>
<p>2019-11-04T00:28:36Z</p>
</div>
When working with comments, one effect of naming files using the comment ID is that a new file with the same ID will overwrite the previous. This is handy for a guestbook, as it allows any edits to a comment to replace the original comment file.
If you’re using a static site generator like Hugo, you could build a Markdown format file, stick it in your content/
folder, and the regular site build will take care of the rest. In the case of my simplistic guestbook, I have an extra step to consolidate the individual comment files into a page. Each time it runs, it overwrites the existing index.html
with the header.html
portion (>
), then finds and appends (>>
) all the comment files’ contents in descending order, and lastly appends the footer.html
portion to end the page.
- name: Assemble page
run: |
cat header.html > index.html
find comments/ -name "*.html" | sort -r | xargs -I % cat % >> index.html
cat footer.html >> index.html
Since the checkout action is not quite the same as cloning the repository, at time of writing, there are some issues still to work around. A couple of extra steps are necessary to pull, checkout, and successfully push changes back to the master branch, but this is pretty trivially done in the shell.
Below is the step that adds, commits, and pushes changes made by the workflow back to the repository’s master
branch.
- name: Push changes to repo
run: |
REMOTE=https://${{ secrets.GITHUB_TOKEN }}@github.com/${{ github.repository }}
git config user.email "${{ github.actor }}@users.noreply.github.com"
git config user.name "${{ github.actor }}"
git pull ${REMOTE}
git checkout master
git add .
git status
git commit -am "Add new comment"
git push ${REMOTE} master
The remote, in fact, our repository, is specified using the github.repository
context variable. For our workflow to be allowed to push to master, we give the remote URL using the default secrets.GITHUB_TOKEN
variable.
Since the workflow environment is shiny and newborn, we need to configure Git. In the above example, I’ve used the github.actor
context variable to input the username of the account initiating the workflow. The email is similarly configured using the default noreply
GitHub email address.
If you’re using GitHub Pages with the default secrets.GITHUB_TOKEN
variable and without a site generator, pushing changes to the repository in the workflow will only update the repository files. The GitHub Pages build will fail with an error, “Your site is having problems building: Page build failed.”
To enable Actions to trigger a Pages site build, you’ll need to create a Personal Access Token. This token can be stored as a secret in the repository settings and passed into the workflow in place of the default secrets.GITHUB_TOKEN
variable. I wrote more about Actions environment and variables in this post.
With the use of a Personal Access Token, a push initiated by the Actions workflow will also update the Pages site. You can see it for yourself by leaving a comment in my guestbook! The comment creation event triggers the workflow, which then takes around 30 seconds to run and update the guestbook page.
Where a site build is necessary for changes to be published, such as when using Hugo, an Action can do this too. However, in order to avoid creating unintended loops, one Action workflow will not trigger another (see what will). Instead, it’s extremely convenient to handle the process of building the site with a Makefile, which any workflow can then run. Simply add running the Makefile as the final step in your workflow job, with the repository token where necessary:
- name: Run Makefile
env:
TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: make all
This ensures that the final step of your workflow builds and deploys the updated site.
GitHub Actions provides a neat way to capture and utilize event data so that it's not only available within GitHub. The possibilities for what this lets us create are only as limited as your imagination!
Did I mention I made a 90s guestbook page? My inner-Geocities-nerd is a little excited.
Of course, having your CI/CD work everywhere is a tall order. Popular CI apps for GitHub repositories alone use a multitude of configuration languages spanning Groovy, YAML, TOML, JSON, and more… all with differing syntax, of course. Porting workflows from one tool to another is more than a one-cup-of-coffee process.
The introduction of GitHub Actions has the potential to add yet another tool to the mix; or, for the right set up, greatly simplify a CI/CD workflow.
Prior to this article, I accomplished my CD flow with several lashed-together apps. I used AWS Lambda to trigger site builds on a schedule. I had Netlify build on push triggers, as well as run image optimization, and then push my site to the public Pages repository. I used Travis CI in the public repository to test the HTML. All this worked in conjunction with GitHub Pages, which actually hosts the site.
I’m now using the GitHub Actions beta to accomplish all the same tasks, with one portable Makefile of build instructions, and without any other CI/CD apps.
What do most CI/CD tools have in common? They run your workflow instructions in a shell environment! This is wonderful, because that means that most CI/CD tools can do anything that you can do in a terminal… and you can do pretty much anything in a terminal.
Especially for a contained use case like building my static site with a generator like Hugo, running it all in a shell is a no-brainer. To tell the magic box what to do, we just need to write instructions.
While a shell script is certainly the most portable option, I use the still-very-portable Make to write my process instructions. This provides me with some advantages over simple shell scripting, like the use of variables and macros, and the modularity of rules.
I got into the nitty-gritty of my Makefile in my last post. Let’s look at how to get GitHub Actions to run it.
To our point on portability, my magic Makefile is stored right in the repository root. Since it’s included with the code, I can run the Makefile locally on any system where I can clone the repository, provided I set the environment variables. Using GitHub Actions as my CI/CD tool is as straightforward as making Make go worky-worky.
I found the GitHub Actions workflow syntax guide to be pretty straightforward, though also lengthy on options. Here’s the necessary set up for getting the Makefile to run.
The workflow file at .github/workflows/make-master.yml
contains the following:
name: make-master
on:
push:
branches:
- master
schedule:
- cron: '20 13 * * *'
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@master
with:
fetch-depth: 1
- name: Run Makefile
env:
TOKEN: ${{ secrets.TOKEN }}
run: make all
I’ll explain the components that make this work.
Actions support multiple triggers for a workflow. Using the on
syntax, I’ve defined two triggers for mine: a push event to the master
branch only, and a scheduled cron
job.
Once the make-master.yml
file is in your repository, either of your triggers will cause Actions to run your Makefile. To see how the last run went, you can also add a fun badge to the README.
Because the Makefile runs on every push to master, I sometimes would get errors when the site build had no changes. When Git, via my Makefile, attempted to commit to the Pages repository, no changes were detected and the commit would fail annoyingly:
nothing to commit, working tree clean
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
Makefile:62: recipe for target 'deploy' failed
make: *** [deploy] Error 1
##[error]Process completed with exit code 2.
I came across some solutions that proposed using diff
to check if a commit should be made, but this may not work for reasons. As a workaround, I simply added the current UTC time to my index page so that every build would contain a change to be committed.
You can define the virtual environment for your workflow to run in using the runs-on syntax. The obvious best choice (well, the one I chose) is Ubuntu. Using ubuntu-latest gets me the most updated version, whatever that happens to be when you're reading this.
GitHub sets some default environment variables for workflows. The actions/checkout action with fetch-depth: 1 creates a copy of just the most recent commit of your repository in the directory given by the GITHUB_WORKSPACE variable. This allows the workflow to access the Makefile at GITHUB_WORKSPACE/Makefile. Without using the checkout action, the Makefile won't be found, and I get an error that looks like this:
make: *** No rule to make target 'all'. Stop.
Running Makefile
##[error]Process completed with exit code 2.
While there is a default GITHUB_TOKEN
secret, this is not the one I used. The default is only locally scoped to the current repository. To be able to push to my separate GitHub Pages repository, I created a personal access token scoped to public_repo
and pass it in as the secrets.TOKEN
encrypted variable. For a step-by-step, see Creating and using encrypted secrets.
The nice thing about using a simple Makefile to define the bulk of my CI/CD process is that it’s completely portable. I can run a Makefile anywhere I have access to an environment, which is most CI/CD apps, virtual instances, and, of course, on my local machine.
One of the reasons I like GitHub Actions is that getting my Makefile to run was pretty straightforward. I think the syntax is well done - easy to read, and intuitive when it comes to finding an option you’re looking for. For someone already using GitHub Pages, Actions provides a pretty seamless CD experience; and if that should ever change, I can run my Makefile elsewhere. ¯\_(ツ)_/¯
]]>Since then, we’ve grown together. From early cringe-worthy commit messages, through eighty-six versions of Hugo, and up until last week, a less-than-streamlined multi-app continuous integration and deployment (CI/CD) workflow.
If you know me at all, you know I love to automate things. I’ve been using a combination of AWS Lambda, Netlify, and Travis CI to automatically build and publish this site. My workflow for the task includes:
Thanks to the introduction of GitHub Actions, I’m able to do all the above with just one portable Makefile.
Next week I'll cover my Actions setup; today, I'll take you through the nitty-gritty of my Makefile so you can write your own.
POSIX-standard-flavour Make runs on every Unix-like system out there. Make derivatives, such as GNU Make and several flavours of BSD Make, also run on Unix-like systems, though their particular use requires installing the respective program. To write a truly portable Makefile, mine follows the POSIX standard. (For a more thorough summation of POSIX-compatible Makefiles, I found this article helpful: A Tutorial on Portable Makefiles.) I run Ubuntu, so I've tested the portability aspect using the BSD Make programs bmake
, pmake
, and fmake
. Compatibility with non-Unix-like systems is a little more complicated, since shell commands differ. With derivatives such as Nmake, it’s better to write a separate Makefile with appropriate Windows commands.
While much of my particular use case could be achieved with shell scripting, I find Make offers some worthwhile advantages. I enjoy the ease of using variables and macros, and the modularity of rules when it comes to organizing my steps.
The writing of rules mostly comes down to shell commands, which is the main reason Makefiles are as portable as they are. The best part is that you can do pretty much anything in a terminal, and certainly handle all the workflow steps listed above.
Here’s the portable Makefile that handles my workflow. Yes, I put emojis in there. I’m a monster.
.POSIX:
DESTDIR=public
HUGO_VERSION=0.58.3
OPTIMIZE = find $(DESTDIR) -not -path "*/static/*" \( -name '*.png' -o -name '*.jpg' -o -name '*.jpeg' \) -print0 | \
	xargs -0 -P8 -n2 mogrify -strip -thumbnail '1000>'

.PHONY: all
all: get_repository clean get build test deploy

.PHONY: get_repository
get_repository:
	@echo "🛎 Getting Pages repository"
	git clone https://github.com/victoriadrake/victoriadrake.github.io.git $(DESTDIR)

.PHONY: clean
clean:
	@echo "🧹 Cleaning old build"
	cd $(DESTDIR) && rm -rf *

.PHONY: get
get:
	@echo "❓ Checking for hugo"
	@if ! [ -x "$$(command -v hugo)" ]; then\
		echo "🤵 Getting Hugo";\
		wget -q -P tmp/ https://github.com/gohugoio/hugo/releases/download/v$(HUGO_VERSION)/hugo_extended_$(HUGO_VERSION)_Linux-64bit.tar.gz;\
		tar xf tmp/hugo_extended_$(HUGO_VERSION)_Linux-64bit.tar.gz -C tmp/;\
		sudo mv -f tmp/hugo /usr/bin/;\
		rm -rf tmp/;\
		hugo version;\
	fi

.PHONY: build
build:
	@echo "🍳 Generating site"
	hugo --gc --minify -d $(DESTDIR)
	@echo "🧂 Optimizing images"
	$(OPTIMIZE)

.PHONY: test
test:
	@echo "🍜 Testing HTML"
	docker run -v $(GITHUB_WORKSPACE)/$(DESTDIR)/:/mnt 18fgsa/html-proofer mnt --disable-external

.PHONY: deploy
deploy:
	@echo "🎁 Preparing commit"
	@cd $(DESTDIR) \
	&& git config user.email "hello@victoria.dev" \
	&& git config user.name "Victoria via GitHub Actions" \
	&& git add . \
	&& git status \
	&& git commit -m "🤖 CD bot is helping" \
	&& git push -f -q https://$(TOKEN)@github.com/victoriadrake/victoriadrake.github.io.git master
	@echo "🚀 Site is deployed!"
Sequentially, this workflow clones my GitHub Pages repository, cleans out the previous build, downloads Hugo if it isn't already installed, generates the site and optimizes images, tests the generated HTML, and deploys the result to the Pages repository.
If you're comfortable on the command line, most of this will look familiar. Here are a couple of bits that might warrant a little explanation.
I think this bit is pretty tidy:
if ! [ -x "$$(command -v hugo)" ]; then\
...
fi
I use a negated if
conditional in conjunction with command -v
to check if an executable (-x
) called hugo
exists. If one is not present, the script gets the specified version of Hugo and installs it. This Stack Overflow answer has a nice summation of why command -v
is a more portable choice than which
.
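You can see the behaviour for yourself in a terminal; the path will vary depending on where Hugo lives on your system:
$ command -v hugo
/usr/bin/hugo
$ command -v not-a-real-program; echo $?
1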
My Makefile uses mogrify
to batch resize and compress images in particular folders. It finds them automatically using the file extension, and only modifies images that are larger than the target size of 1000px in any dimension. I wrote more about the batch-processing one-liner in this post.
There are a few different ways to achieve this same task, one of which, theoretically, is to take advantage of Make’s suffix rules to run commands only on image files. I find the shell script to be more readable.
HTMLProofer is installed with gem
, and uses Ruby and Nokogiri, which adds up to a lot of installation time for a CI workflow. Thankfully, 18F has a Dockerized version that is much faster to implement. Its usage requires starting the container with the built site directory mounted as a data volume, which you can do with the -v flag on the docker run command, passing the mounted path as the argument:
docker run -v /absolute/path/to/site/:/mounted-site 18fgsa/html-proofer /mounted-site
In my Makefile, I specify the absolute site path using the default environment variable GITHUB_WORKSPACE
. I’ll dive into this and other GitHub Actions features in the next post.
In the meantime, happy Making!
]]>Here’s my one-liner:
find public/ -not -path "*/static/*" \( -name '*.png' -o -name '*.jpg' -o -name '*.jpeg' \) -print0 | xargs -0 -P8 -n2 mogrify -strip -thumbnail '1000>' -format jpg
I use find
to target only certain image file formats in certain directories. With mogrify
, part of ImageMagick, I resize only the images that are larger than a certain dimension, compress them, and strip the metadata. I tack on the format
flag to create jpg copies of the images.
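If you'd like to try the transformation on a single image before pointing it at a whole directory, the mogrify portion works on its own (the file path here is just a placeholder):
mogrify -strip -thumbnail '1000>' -format jpg path/to/test-image.png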
Here’s the one-liner again (broken up for better reading):
# Look in the public/ directory
find public/ \
# Ignore directories called "static" regardless of location
-not -path "*/static/*" \
# Print the file paths of all files ending with any of these extensions
\( -name '*.png' -o -name '*.jpg' -o -name '*.jpeg' \) -print0 \
# Pipe the file paths to xargs and use 8 parallel workers to process 2 arguments
| xargs -0 -P8 -n2 \
# Tell mogrify to strip metadata, and...
mogrify -strip \
# ...compress and resize any images larger than the target size (1000px in either dimension)
-thumbnail '1000>' \
# Convert the files to jpg format
-format jpg
That’s it. That’s the post.
]]>I’ve used Hugo to build my site for years, but until this past week I’d never hooked up my Pages repository to any deployment service. Why? Because using a tool that built my site before deploying it seemed to require having the whole recipe in one place - and if you’re using GitHub Pages with the free version of GitHub, that place is public. That means that all my three-in-the-morning bright ideas and messy unfinished (and unfunny) drafts would be publicly available - and no amount of continuous convenience was going to convince me to do that.
So I kept things separated, with Hugo’s messy behind-the-scenes stuff in a local Git repository, and the generated public/
folder pushing to my GitHub Pages remote repository. Each time I wanted to deploy my site, I’d have to get on my laptop and hugo
to build my site, then cd public/ && git add . && git commit
… etc etc. And all was well, except for the nagging feeling that there was a better way to do this.
I wrote another article a little while back about using GitHub and Working Copy to make changes to my repositories on my iPad whenever I’m out and about. It seemed off to me that I could do everything except deploy my site from my iPad, so I set out to change that.
A couple three-in-the-morning bright ideas and a revoked access token later (oops), I now have not one but two ways to deploy to my public GitHub Pages repository from an entirely separated, private GitHub repository. In this post, I’ll take you through achieving this with Travis CI or using Netlify and Make.
There’s nothing hackish about it - my public GitHub Pages repository still looks the same as it does when I pushed to it locally from my terminal. Only now, I’m able to take advantage of a couple great deployment tools to have the site update whenever I push to my private repo, whether I’m on my laptop or out and about with my iPad.
This article assumes you have working knowledge of Git and GitHub Pages. If not, you may like to spin off some browser tabs from my articles on using GitHub and Working Copy and building a site with Hugo and GitHub Pages first.
Let’s do it!
Travis CI has the built-in ability (♪) to deploy to GitHub Pages following a successful build. They do a decent job in the docs of explaining how to add this feature, especially if you’ve used Travis CI before… which I haven’t. Don’t worry, I did the bulk of the figuring-things-out for you.
In short, the setup involves three things: a .travis.yml configuration file, the travis tool on the command line, and the repo configuration variable.
(note the leading “.”). These scripts are very customizable and I struggled to find a relevant example to use as a starting point - luckily, you don’t have that problem!
Here’s my basic .travis.yml
:
git:
  depth: false
env:
  global:
    - HUGO_VERSION="0.54.0"
  matrix:
    - YOUR_ENCRYPTED_VARIABLE
install:
  - wget -q https://github.com/gohugoio/hugo/releases/download/v${HUGO_VERSION}/hugo_${HUGO_VERSION}_Linux-64bit.tar.gz
  - tar xf hugo_${HUGO_VERSION}_Linux-64bit.tar.gz
  - mv hugo ~/bin/
script:
  - hugo --gc --minify
deploy:
  provider: pages
  skip-cleanup: true
  github-token: $GITHUB_TOKEN
  keep-history: true
  local-dir: public
  repo: gh-username/gh-username.github.io
  target-branch: master
  verbose: true
  on:
    branch: master
This script downloads and installs Hugo, builds the site with the garbage collection and minify flags, then deploys the public/
directory to the specified repo
- in this example, your public GitHub Pages repository. You can read about each of the deploy
configuration options here.
To add the GitHub personal access token as an encrypted variable, you don’t need to manually edit your .travis.yml
. The travis
gem commands below will encrypt and add the variable for you when you run them in your repository directory.
First, install travis
with sudo gem install travis
.
Then generate your GitHub personal access token, copy it (it only shows up once!) and run the commands below in your repository root, substituting your token for the kisses:
travis login --pro --github-token xxxxxxxxxxxxxxxxxxxxxxxxxxx
travis encrypt GITHUB_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxx --add env.matrix
Your encrypted token magically appears in the file. Once you’ve committed .travis.yml
to your private Hugo repository, Travis CI will run the script and if the build succeeds, will deploy your site to your public GitHub Pages repo. Magic!
Travis will always run a build each time you push to your private repository. If you don't want a particular commit to trigger this behavior, add [skip ci] to your commit message.
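For example, a commit like this won't kick off a build:
git commit -m "Fix typo in draft post [skip ci]"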
Yo that’s cool but I like Netlify.
Okay fine.
We can get Netlify to do our bidding by using a Makefile, which we’ll run with Netlify’s build command.
Here’s what our Makefile
looks like:
SHELL:=/bin/bash
BASEDIR=$(CURDIR)
OUTPUTDIR=public

.PHONY: all
all: clean get_repository build deploy

.PHONY: clean
clean:
	@echo "Removing public directory"
	rm -rf $(BASEDIR)/$(OUTPUTDIR)

.PHONY: get_repository
get_repository:
	@echo "Getting public repository"
	git clone https://github.com/gh-username/gh-username.github.io.git public

.PHONY: build
build:
	@echo "Generating site"
	hugo --gc --minify

.PHONY: deploy
deploy:
	@echo "Preparing commit"
	@cd $(OUTPUTDIR) \
	&& git config user.email "you@youremail.com" \
	&& git config user.name "Your Name" \
	&& git add . \
	&& git status \
	&& git commit -m "Deploy via Makefile" \
	&& git push -f -q https://$(GITHUB_TOKEN)@github.com/gh-username/gh-username.github.io.git master
	@echo "Pushed to remote"
To preserve the Git history of our separate GitHub Pages repository, we’ll first clone it, build our new Hugo site to it, and then push it back to the Pages repository. This script first removes any existing public/
folder that might contain files or a Git history. It then clones our Pages repository to public/
, builds our Hugo site (essentially updating the files in public/
), then takes care of committing the new site to the Pages repository.
In the deploy
section, you’ll notice lines starting with &&
. These are chained commands. Since Make invokes a new sub-shell for each line, it starts over with every new line from our root directory. To get our cd
to stick and avoid running our Git commands in the project root directory, we’re chaining the commands and using the backslash character to break long lines for readability.
By chaining our commands, we’re able to configure our Git identity, add all our updated files, and create a commit for our Pages repository.
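To illustrate why the chaining is necessary, here's a toy rule (not part of the real Makefile) where the cd has no effect on the following line, because each recipe line gets its own sub-shell:
demo:
	cd public
	pwd    # still prints the project root, not public/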
Similarly to using Travis CI, we’ll need to pass in a GitHub personal access token to push to our public GitHub Pages repository - only Netlify doesn’t provide a straightforward way to encrypt the token in our Makefile.
Instead, we’ll use Netlify’s Build Environment Variables, which live safely in our site settings in the Netlify app. We can then call our token variable in the Makefile. We use it to push (quietly, to avoid printing the token in logs) to our Pages repository by passing it in the remote URL.
To avoid printing the token in Netlify’s logs, we suppress recipe echoing for that line with the leading @
character.
With your Makefile in the root of your private GitHub repository, you can set up Netlify to run it for you.
Getting set up with Netlify via the web UI is straightforward. Once you sign in with GitHub, choose the private GitHub repository where your Hugo site lives. The next page Netlify takes you to lets you enter deploy settings:
You can specify the build command that will run your Makefile (make all
for this example). The branch to deploy and the publish directory don’t matter too much in our specific case, since we’re only concerned with pushing to a separate repository. You can enter the typical master
deploy branch and public
publish directory.
Under “Advanced build settings” click “New variable” to add your GitHub personal access token as a Build Environment Variable. In our example, the variable name is GITHUB_TOKEN
. Click “Deploy site” to make the magic happen.
If you’ve already previously set up your repository with Netlify, find the settings for Continuous Deployment under Settings > Build & deploy.
Netlify will build your site each time you push to the private repository. If you don’t want a particular commit to trigger a build, add [skip ci]
in your Git commit message.
One effect of using Netlify this way is that your site will be built in two places: one is the separate, public GitHub Pages repository that the Makefile pushes to, and the other is your Netlify site that deploys on their CDN from your linked private GitHub repository. The latter is useful if you’re going to play with Deploy Previews and other Netlify features, but those are outside the scope of this post.
The main point is that your GitHub Pages site is now updated in your public repo. Yay!
I hope the effect of this new information is that you feel more able to update your sites, wherever you happen to be. The possibilities are endless - at home on your couch with your laptop, out cafe-hopping with your iPad, or in the middle of a first date on your phone. Endless!
]]>$ file IMG* | awk 'BEGIN{a=0} {print substr($1, 1, length($1)-5),a++"_"substr($8,1, length($8)-1)}' | while read fn fr; do echo $(rename -v "s/$fn/img_$fr/g" *); done
IMG_20170808_172653_425.jpg renamed as img_0_4032x3024.jpg
IMG_20170808_173020_267.jpg renamed as img_1_3024x3506.jpg
IMG_20170808_173130_616.jpg renamed as img_2_3024x3779.jpg
IMG_20170808_173221_425.jpg renamed as img_3_3024x3780.jpg
IMG_20170808_173417_059.jpg renamed as img_4_2956x2980.jpg
IMG_20170808_173450_971.jpg renamed as img_5_3024x3024.jpg
IMG_20170808_173536_034.jpg renamed as img_6_4032x3024.jpg
IMG_20170808_173602_732.jpg renamed as img_7_1617x1617.jpg
IMG_20170808_173645_339.jpg renamed as img_8_3024x3780.jpg
IMG_20170909_170146_585.jpg renamed as img_9_3036x3036.jpg
IMG_20170911_211522_543.jpg renamed as img_10_3036x3036.jpg
IMG_20170913_071608_288.jpg renamed as img_11_2760x2760.jpg
IMG_20170913_073205_522.jpg renamed as img_12_2738x2738.jpg
// ... etc etc
The last item on the aforementioned list is “TODO: come up with a shorter title for this list.”
I previously wrote about the power of command line tools like sed. This post expands on how to string all this magical functionality into one big, long, rainbow-coloured, viscous stream of awesome.
The tool that actually handles the renaming of our files is, appropriately enough, rename
. The syntax is: rename -n "s/original_filename/new_filename/g" *
where -n
does a dry-run, and substituting -v
would rename the files. The s
indicates our substitution string, and g
for “global” finds all occurrences of the string. The *
at the end is a shell glob that expands to the files in the current directory, giving rename the list of files to check against our substitution.
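For instance, here's a dry run that would preview renaming anything starting with IMG_, without touching any files:
rename -n "s/IMG_/img_/g" *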
We’ll come back to this later.
When I run $ file IMG_20170808_172653_425.jpg
in the image directory, I get this output:
IMG_20170808_172653_425.jpg: JPEG image data, baseline, precision 8, 4032x3024, frames 3
Since we can get the image resolution (“4032x3024” above), we know that we’ll be able to use it in our new filename.
I love awk
for its simplicity. It takes lines of text and makes individual bits of information available to us with built-in variables that we can then refer to as column numbers denoted by $1
, $2
, etc. By default, awk
splits up columns on whitespace. To take the example above:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
-------------------------------------------------------------------------------------------------------------
| IMG_20170808_172653_425.jpg: | JPEG | image | data, | baseline, | precision | 8, | 4032x3024, | frames | 3 |
We can denote different values to use as a splitter with, for example, -F','
if we wanted to use commas as the column divisions. For our current project, spaces are fine.
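For example, splitting on commas instead of whitespace:
$ echo "one,two,three" | awk -F',' '{print $2}'
two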
There are a couple issues we need to solve before we can plug the information into our new filenames. Column $1
has the original filename we want, but there’s an extra “:” character on the end. We don’t need the “.jpg” either. Column $8
has an extra "," that we don't want as well. To get just the information we need, we'll take a substring of the column with substr()
:
substr($1, 1, length($1)-5)
- This gives us the file name from the beginning of the string to the end of the string, minus 5 characters (“length minus 5”).
substr($8,1, length($8)-1)
- This gives us the image size, without the extra comma (“length minus 1”).
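You can check the substring logic on its own, too:
$ echo "IMG_20170808_172653_425.jpg:" | awk '{print substr($1, 1, length($1)-5)}'
IMG_20170808_172653_425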
To ensure that two images with the same resolutions don’t create identical, competing file names, we’ll append a unique incrementing number to the filename.
BEGIN{a=0}
- Using BEGIN
tells awk
to run the following code only once, at the (drumroll) beginning. Here, we’re declaring the variable a
to be 0
.
a++
- Later in our code, at the appropriate spot for our file name, we call a
and increment it.
When awk
prints a string, it concatenates everything that isn’t separated by a comma. {print a b c}
would create “abc” and {print a,b,c}
would create “a b c”, for example.
We can add additional characters to our file name, such as an underscore, by inserting it in quotations: "_"
.
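For example:
$ echo "img 42" | awk '{print $1"_"$2}'
img_42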
To feed the output of one command into another command, we use “pipe,” written as |
.
If we only used pipe in this instance, all our data from file
and awk
would get fed into rename
all at once, making for one very, very long and probably non-compiling file name. To run the rename
command line by line, we can use while
and read
. Similarly to awk
, read
takes input and splits it into variables we can assign and use. In our code, it takes the first bit of output from awk
(the original file name) and assigns it the variable name $fn
. It takes the second output (our incrementing number and the image resolution) and assigns that to $fr
. The variable names are arbitrary; you can call them whatever you want.
To run our rename
commands as if we’d manually entered them in the terminal one by one, we can use echo $(some command)
. Finally, done
ends our while
loop.
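Here's a toy version of the same pattern, with made-up file names and sizes:
$ printf 'a.jpg 0_100x100\nb.jpg 1_200x200\n' | while read fn fr; do echo "name: $fn resolution: $fr"; done
name: a.jpg resolution: 0_100x100
name: b.jpg resolution: 1_200x200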
I wasn’t kidding with that “rainbow-coloured” bit…
$ pip install lolcat
Here’s our full code:
$ file IMG* | awk 'BEGIN{a=0} {print substr($1, 1, length($1)-5),a++"_"substr($8,1, length($8)-1)}' | while read fn fs; do echo $(rename -v "s/$fn/img_$fs/g" *); done | lolcat
Enjoy!
]]>