There are bound to be situations in which this isn’t enough, such as when you want to read in a large amount of text from a file. Using the OpenAI API allows you to send many more tokens in a messages array, with the maximum number depending on your chosen model. This lets you provide large amounts of text to ChatGPT using chunking. Here’s how.
The `gpt-4` model currently has a maximum context length of 8,192 tokens. (Here are the docs containing current limits for all the models.) Remember that you can first apply text preprocessing techniques to reduce your input size – in my previous post I achieved a 28% size reduction without losing meaning with just a little tokenization and pruning.
When this isn’t enough to fit your message within the maximum token limit, you can take a general programmatic approach that sends your input in chunks. The goal is to divide your text into sections that each fit within the model’s token limit; each chunk is then sent as a separate message in the conversation thread.
You send your chunks to ChatGPT using the OpenAI library’s `ChatCompletion`. ChatGPT returns individual responses for each message, so you may want to post-process these, for example by replacing `\n` with line breaks.

Using the OpenAI API, you can send multiple messages to ChatGPT and ask it to wait for you to provide all of the data before answering your prompt. Because ChatGPT is a language model, you can provide these instructions in plain language. Here’s a suggested script:
Prompt: Summarize the following text for me
To provide the context for the above prompt, I will send you text in parts. When I am finished, I will tell you “ALL PARTS SENT”. Do not answer until you have received all the parts.
I created a Python module, `chatgptmax`, that puts all this together. It breaks up a large amount of text by a given maximum token length and sends it in chunks to ChatGPT. You can install it with `pip install chatgptmax`, but here’s the juicy part:
import os
import openai
import tiktoken

# Load your OpenAI API key from an environment variable or secret management service
openai.api_key = os.getenv("OPENAI_API_KEY")

def send(
    prompt=None,
    text_data=None,
    chat_model="gpt-3.5-turbo",
    model_token_limit=8192,
    max_tokens=2500,
):
    """
    Send the prompt at the start of the conversation and then send chunks of text_data to ChatGPT via the OpenAI API.
    If the text_data is too long, it splits it into chunks and sends each chunk separately.

    Args:
    - prompt (str, optional): The prompt to guide the model's response.
    - text_data (str, optional): Additional text data to be included.
    - chat_model (str, optional): The chat model to use. Default is "gpt-3.5-turbo".
    - model_token_limit (int, optional): The model's maximum context length. Default is 8192.
    - max_tokens (int, optional): Maximum tokens for each API call. Default is 2500.

    Returns:
    - list or str: A list of the model's responses for each chunk, or an error message.
    """
    # Check if the necessary arguments are provided
    if not prompt:
        return "Error: Prompt is missing. Please provide a prompt."
    if not text_data:
        return "Error: Text data is missing. Please provide some text data."

    # Initialize the tokenizer
    tokenizer = tiktoken.encoding_for_model(chat_model)

    # Encode the text_data into token integers
    token_integers = tokenizer.encode(text_data)

    # Split the token integers into chunks based on max_tokens,
    # leaving room for the prompt's own tokens
    chunk_size = max_tokens - len(tokenizer.encode(prompt))
    chunks = [
        token_integers[i : i + chunk_size]
        for i in range(0, len(token_integers), chunk_size)
    ]

    # Decode token chunks back to strings
    chunks = [tokenizer.decode(chunk) for chunk in chunks]

    responses = []
    messages = [
        {"role": "user", "content": prompt},
        {
            "role": "user",
            "content": "To provide the context for the above prompt, I will send you text in parts. When I am finished, I will tell you 'ALL PARTS SENT'. Do not answer until you have received all the parts.",
        },
    ]

    for chunk in chunks:
        messages.append({"role": "user", "content": chunk})

        # Check if total tokens exceed the model's limit and remove oldest chunks if necessary
        while (
            sum(len(tokenizer.encode(msg["content"])) for msg in messages)
            > model_token_limit
        ):
            messages.pop(1)  # Remove the oldest chunk

        response = openai.ChatCompletion.create(model=chat_model, messages=messages)
        chatgpt_response = response.choices[0].message["content"].strip()
        responses.append(chatgpt_response)

    # Add the final "ALL PARTS SENT" message
    messages.append({"role": "user", "content": "ALL PARTS SENT"})
    response = openai.ChatCompletion.create(model=chat_model, messages=messages)
    final_response = response.choices[0].message["content"].strip()
    responses.append(final_response)

    return responses
Here’s an example of how you can use this module with text data read from a file. (`chatgptmax` also provides a convenience method for getting text from a file.)
# First, import the necessary modules and the function
import os
from chatgptmax import send

# Define a function to read the content of a file
def read_file_content(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Use the function
if __name__ == "__main__":
    # Specify the path to your file
    file_path = "path_to_your_file.txt"

    # Read the content of the file
    file_content = read_file_content(file_path)

    # Define your prompt
    prompt_text = "Summarize the following text for me:"

    # Send the file content to ChatGPT
    responses = send(prompt=prompt_text, text_data=file_content)

    # Print the responses
    for response in responses:
        print(response)
While the module is designed to handle most standard use cases, there are potential pitfalls to be aware of:
As with any process, there’s always room for improvement. Here are a couple of ways you might optimize the module’s chunking and sending process further:
If you’re using 32k models or need to use small chunk sizes, however, parallelism gains are likely to be minimal.

If you found your way here via search, you probably already have a use case in mind. Here are some other (startup) ideas:
Do you have a use case I didn’t list? Let me know about it! In the meantime, have fun sending lots of text to ChatGPT.
Text preprocessing can help shorten and refine your input, ensuring that ChatGPT can grasp the essence without getting overwhelmed. In this article, we’ll explore these techniques, understand their importance, and see how they make your interactions with tools like ChatGPT more reliable and productive.
Text preprocessing prepares raw text data for analysis by NLP models. Generally, it distills everyday text (like full sentences) to make it more manageable or concise and meaningful. Techniques include:
While all these techniques can help reduce the size of raw text data, some of these techniques are easier to apply to general use cases than others. Let’s examine how text preprocessing can help us send a large amount of text to ChatGPT.
In the realm of Natural Language Processing (NLP), a token is the basic unit of text that a system reads. At its simplest, you can think of a token as a word, but depending on the language and the specific tokenization method used, a token can represent a word, part of a word, or even multiple words.
While in English we often equate tokens with words, in NLP, the concept is broader. A token can be as short as a single character or as long as a word. For example, with word tokenization, the sentence “Unicode characters such as emojis are not indivisible. ✂️” can be broken down into tokens like this: [“Unicode”, “characters”, “such”, “as”, “emojis”, “are”, “not”, “indivisible”, “.”, “✂️”]
In another form called Byte-Pair Encoding (BPE), the same sentence is tokenized as: [“Un”, “ic”, “ode”, “ characters”, “ such”, “ as”, “ em”, “oj”, “is”, “ are”, “ not”, “ ind”, “iv”, “isible”, “.”, “ �”, “�️”]. The emoji itself is split into tokens containing its underlying bytes.
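You can verify the emoji’s byte-level anatomy yourself – the scissors emoji is one visible character (a code point plus a variation selector) that encodes to six UTF-8 bytes, which is why a byte-level tokenizer can split it apart:

```python
# U+2702 (black scissors) followed by U+FE0F (emoji variation selector)
emoji = "\u2702\ufe0f"  # ✂️

print(emoji.encode("utf-8"))       # b'\xe2\x9c\x82\xef\xb8\x8f'
print(len(emoji.encode("utf-8")))  # 6
```

Token boundaries fall between bytes, not characters, so a single emoji can become several tokens.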
Depending on the ChatGPT model chosen, your text input size is restricted by tokens. Here are the docs containing current limits. BPE is used by ChatGPT to determine token count, and we’ll discuss it more thoroughly later. First, we can programmatically apply some preprocessing techniques to reduce our text input size and use fewer tokens.
For a general approach that can be applied programmatically, pruning is a suitable preprocessing technique. One form is stop word removal, or removing common words that might not add significant meaning in certain contexts. For example, consider the sentence:
“I always enjoy having pizza with my friends on weekends.”
Stop words are often words that don’t carry significant meaning on their own in a given context. In this sentence, words like “I”, “always”, “enjoy”, “having”, “with”, “my”, “on” are considered stop words.
After removing the stop words, the sentence becomes:
“pizza friends weekends.”
Now, the sentence is distilled to its key components, highlighting the main subject (pizza) and the associated context (friends and weekends). If you find yourself wishing you could convince people to do this in real life (coughmeetingscough)… you aren’t alone.
Stop word removal is straightforward to apply programmatically: given a list of stop words, examine some text input to see if it contains any of the stop words on your list. If it does, remove them, then return the altered text.
def clean_stopwords(text: str) -> str:
    stopwords = ["a", "an", "and", "at", "but", "how", "in", "is", "on", "or", "the", "to", "what", "will"]
    tokens = text.split()
    clean_tokens = [t for t in tokens if t not in stopwords]
    return " ".join(clean_tokens)
To see how effective stop word removal can be, I took the entire text of my Tech Leader Docs newsletter (17,230 words consisting of 104,892 characters) and processed it using the above function. How effective was it? The resulting text contained 89,337 characters, which is about a 15% reduction in size.
Other pruning techniques can also be applied programmatically. Removing punctuation, numbers, HTML tags, URLs and email addresses, or non-alphabetical characters are all valid pruning techniques that can be straightforward to apply. Here is a function that does just that:
import re

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Remove everything that's not a letter (a-z, A-Z)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove whitespace, tabs, and new lines
    text = ''.join(text.split())
    return text
What measure of length reduction might we be able to get from this additional processing? Applying these techniques to the remaining characters of Tech Leader Docs results in just 75,217 characters – an overall reduction of about 28% from the original text.
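As a quick sanity check, both percentages follow directly from the character counts above:

```python
original_chars = 104_892   # full newsletter text
after_stopwords = 89_337   # after clean_stopwords
after_pruning = 75_217     # after clean_text

print(f"{(original_chars - after_stopwords) / original_chars:.1%}")  # 14.8%
print(f"{(original_chars - after_pruning) / original_chars:.1%}")    # 28.3%
```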
More opinionated pruning, such as removing short words or specific words or phrases, can be tailored to a specific use case. These don’t lend themselves well to general functions, however.
Now that you have some text processing techniques in your toolkit, let’s look at how a reduction in characters translates to fewer tokens used when it comes to ChatGPT. To understand this, we’ll examine Byte-Pair Encoding.
Byte-Pair Encoding (BPE) is a subword tokenization method. It was originally introduced for data compression but has since been adapted for tokenization in NLP tasks. It allows representing common words as tokens and splits more rare words into subword units. This enables a balance between character-level and word-level tokenization.
Let’s make that more concrete. Imagine you have a big box of LEGO bricks, and each brick represents a single letter or character. You’re tasked with building words using these LEGO bricks. At first, you might start by connecting individual bricks to form words. But over time, you notice that certain combinations of bricks (or characters) keep appearing together frequently, like “th” in “the” or “ing” in “running.”
BPE is like a smart LEGO-building buddy who suggests, “Hey, since ’th’ and ‘ing’ keep appearing together a lot, why don’t we glue them together and treat them as a single piece?” This way, the next time you want to build a word with “the” or “running,” you can use these glued-together pieces, making the process faster and more efficient.
Colloquially, the BPE algorithm looks like this:
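The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a toy illustration of the idea (count adjacent pairs, glue the most frequent pair, repeat) – not the actual byte-level tokenizer ChatGPT uses:

```python
from collections import Counter

def bpe_merges(word: str, num_merges: int = 3):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    symbols = list(word)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Re-scan the sequence, gluing every occurrence of the chosen pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

print(bpe_merges("banana"))  # (['banan', 'a'], ['an', 'ban', 'banan'])
```

Real BPE learns its merges from a large corpus rather than a single word, then applies that fixed merge table to new text.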
BPE is a particularly powerful tokenization method, especially when dealing with diverse and extensive vocabularies.
In essence, BPE strikes a balance, offering the granularity of character-level tokenization and the context-awareness of word-level tokenization. This hybrid approach ensures that NLP models like ChatGPT can understand a wide range of texts while maintaining computational efficiency.
At time of writing, a message to ChatGPT via its web interface has a maximum length of 4,096 tokens. If we take the roughly 28% reduction demonstrated above as an average, this means you could reduce text of up to about 5,712 tokens down to the appropriate size with just text preprocessing.
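The 5,712 figure follows from the measured character reduction, assuming token count shrinks roughly in proportion to character count:

```python
web_limit = 4096                   # ChatGPT web interface token limit
before, after = 104_892, 75_217    # character counts from the preprocessing above

# Largest original input that preprocessing could squeeze under the limit
print(round(web_limit * before / after))  # 5712
```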
What about when this isn’t enough? Beyond text preprocessing, larger input can be sent in chunks using the OpenAI API. In my next post, I’ll show you how to build a Python module that does exactly that.
Most organizations want to improve productivity and output, but few technical teams seem to take a data-driven approach to discovering productivity bottlenecks. If you’re looking to improve development velocity, a couple key metrics could help your team get unblocked. Here’s how you can apply a smidge of data science to visualize how your repository is doing, and where improvements can be made.
The first and most difficult part, as any data scientist would likely tell you, is ensuring the quality of your data. It’s especially important to consider consistency: are dates throughout the dataset presented in a consistent format? Have tags or labels been applied under consistent rules? Does the dataset contain repeated values, empty values, or unmatched types?
If your repository has previously changed up processes or standards, consider the timeframe of the data you collect. If labeling issues is done arbitrarily, those may not be a useful feature. While cleaning data is outside the scope of this article, I can, at least, help you painlessly collect it.
I wrote a straightforward Python utility that uses the GitHub API to pull data for any repository. You can use this on the command line and output the data to a file. It uses the list repository issues endpoint (docs), which, perhaps confusingly, includes both issues and pull requests (PRs) for the repository. I get my data like this:
$ python fetch.py -h
usage: fetch.py [-h] [--token TOKEN] repository months
$ python fetch.py OWASP/wstg 24 > data.json
Using the GitHub API means less worry about standardization – for example, all the dates are expressed in ISO 8601 format. Now that you have some data to process, it’s time to play with Pandas.
You can use a Jupyter Notebook to do some simple calculations and data visualization.
First, create the Notebook file:
touch stats.ipynb
Open the file in your favorite IDE, or in your browser by running `jupyter notebook`.
In the first code cell, import Pandas and load your data:
import pandas as pd
data = pd.read_json("data.json")
data
You can then run that cell to see a preview of the data you collected.
Pandas is a well-documented data analysis library. With a little imagination and a few keyword searches, you can begin to measure all kinds of repository metrics. For this walk-through, here’s how you can calculate and create a graph that shows the number of days an issue or PR remains open in your repository.
Create a new code cell and, for each item in your Series, subtract the date it was closed from the date it was created:
duration = pd.Series(data.closed_at - data.created_at)
duration.describe()
`Series.describe()` will give you some summary statistics that look something like these (from mypy on GitHub):
count 514
mean 5 days 08:04:17.239299610
std 14 days 12:04:22.979308668
min 0 days 00:00:09
25% 0 days 00:47:46.250000
50% 0 days 06:18:47
75% 2 days 20:22:49.250000
max 102 days 20:56:30
`Series.plot()` uses a specified plotting backend (`matplotlib` by default) to visualize your data. A histogram can be a helpful way to examine issue duration:
duration.apply(lambda x: x.days).plot(kind="hist")
This will plot a histogram that represents the frequency distribution of issues over days, which is one way you can tell how long most issues take to close. For example, mypy seems to handle the majority of issues and PRs within 10 days, with some outliers taking more than three months.
It would be interesting to visualize other repository data, such as its most frequent contributors, or most often used labels. Does a relationship exist between the author or reviewers of an issue and how quickly it is resolved? Does the presence of particular labels predict anything about the duration of the issue?
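Both of those explorations are short pandas expressions. Here’s a sketch using hypothetical sample rows shaped like the GitHub issues endpoint’s `user` and `labels` fields:

```python
import pandas as pd

# Hypothetical sample rows; in practice, "user" and "labels" come straight
# from the GitHub issues endpoint's JSON
data = pd.DataFrame({
    "user": [{"login": "alice"}, {"login": "bob"}, {"login": "alice"}],
    "labels": [[{"name": "bug"}], [], [{"name": "bug"}, {"name": "docs"}]],
})

# Most frequent issue/PR authors
top_authors = data["user"].apply(lambda u: u["login"]).value_counts()
print(top_authors)

# Most often used labels (explode the per-issue label lists first)
top_labels = data["labels"].explode().dropna().apply(lambda l: l["name"]).value_counts()
print(top_labels)
```

With the real dataset loaded via `pd.read_json`, the same expressions answer both questions directly.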
Now that you have some data-driven superpowers, remember that they come with great responsibility. Deciding what to measure is just as, if not more, important than measuring it.
Consider how to translate the numbers you gather into productivity improvements. For example, if your metric is closing issues and PRs faster, what actions can you take to encourage the right behavior in your teams? I’d suggest encouraging issues to be clearly defined, and pull requests to be small and have a well-contained scope, making them easier to understand and review.
To prepare to accurately take measurements for your repository, establish consistent standards for labels, tags, milestones, and other features you might want to examine. Remember that meaningful results are more easily gleaned from higher quality data.
Finally, have fun exercising your data science skills. Who knows what you can discover and improve upon next!
The right answer will depend on the goals of your application logic. You want to ensure your Python code doesn’t fail silently, saving you and your teammates from having to hunt down deeply entrenched errors.
Here’s the difference between `raise` and `return` when handling failures in Python.
The `raise` statement allows the programmer to force a specific exception to occur. (8.4 Raising Exceptions)
Use `raise` when you know you want a specific behavior, such as:
raise TypeError("Wanted strawberry, got grape.")
Raising an exception terminates the flow of your program, allowing the exception to bubble up the call stack. In the above example, this would let you explicitly handle `TypeError` later. If `TypeError` goes unhandled, code execution stops and you’ll get an unhandled exception message.
Raise is useful in cases where you want to define a certain behavior to occur. For example, you may choose to disallow certain words in a text field:
if "raisins" in text_field:
    raise ValueError("That word is not allowed here")
Raise takes an instance of an exception, or a derivative of the Exception class. Here are all of Python’s built-in exceptions.
Raise can help you avoid writing functions that fail silently. For example, this code will not raise an exception if `JAM` doesn’t exist:
import os

def sandwich_or_bust(bread: str) -> str:
    jam = os.getenv("JAM")
    return bread + str(jam) + bread

s = sandwich_or_bust("\U0001F35E")
print(s)
# Prints "🍞None🍞" which is not very tasty.
To cause the `sandwich_or_bust()` function to actually bust, add a `raise`:
import os

def sandwich_or_bust(bread: str) -> str:
    jam = os.getenv("JAM")
    if not jam:
        raise ValueError("There is no jam. Sad bread.")
    return bread + str(jam) + bread

s = sandwich_or_bust("\U0001F35E")
print(s)
# ValueError: There is no jam. Sad bread.
Any time your code interacts with an external variable, module, or service, there is a possibility of failure. You can use `raise` in an `if` statement to help ensure those failures aren’t silent.

`try` and `except`

To handle a possible failure by taking an action if there is one, use a `try` … `except` statement.
try:
    s = sandwich_or_bust("\U0001F35E")
    print(s)
except ValueError:
    buy_more_jam()
    raise
This lets you `buy_more_jam()` before re-raising the exception. If you want to propagate a caught exception, use `raise` without arguments to avoid possible loss of the stack trace.
If you don’t know that the exception will be a `ValueError`, you can also use a bare `except:` or catch any derivative of the `Exception` class with `except Exception:`. Whenever possible, it’s better to raise and handle exceptions explicitly.
Use `else` for code to execute if the `try` does not raise an exception. For example:
try:
    s = sandwich_or_bust("\U0001F35E")
    print(s)
except ValueError:
    buy_more_jam()
    raise
else:
    print("Congratulations on your sandwich.")
You could also place the print line within the `try` block; however, this is less explicit.
When you use `return` in Python, you’re giving back a value. A function returns to the location it was called from.
While it’s more idiomatic to `raise` errors in Python, there may be occasions where you find `return` to be more applicable.
For example, if your Python code is interacting with other components that do not handle exception classes, you may want to return a message instead. Here’s an example using a `try` … `except` statement:
from typing import Union

def share_sandwich(sandwich: int) -> Union[float, Exception]:
    try:
        bad_math = sandwich / 0
        return bad_math
    except Exception as e:
        return e

s = share_sandwich(1)
print(s)
# Prints "division by zero"
Note that when you return an `Exception` class object, you’ll get a representation of its associated value, usually the first item in its list of arguments. In the example above, this is the string explanation of the exception. In some cases, it may be a tuple with other information about the exception.
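You can inspect that associated value directly on a caught exception object:

```python
try:
    1 / 0
except Exception as e:
    err = e

print(str(err))   # division by zero
print(err.args)   # ('division by zero',)
```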
You may also use `return` to give a specific error object, such as with `HttpResponseNotFound` in Django. For example, you may want to return a `404` instead of a `403` for security reasons:
if object.owner != request.user:
    return HttpResponseNotFound()
Using `return` can help you write appropriately noisy code when your function is expected to give back a certain value, and when interacting with outside elements.
Silent failures create some of the most frustrating bugs to find and fix. You can help create a pleasant development experience for yourself and your team by using `raise` and `return` to ensure that errors are handled in your Python code.
I write about good development practices and how to improve productivity as a software developer. You can get these tips right in your inbox by signing up below!
The Django framework in particular offers your team the opportunity to create an efficient testing practice. Based on the Python standard library’s `unittest`, proper tests in Django are fast to write, faster to run, and can offer you a seamless continuous integration solution for taking the pulse of your developing application.
With comprehensive tests, developers have higher confidence when pushing changes. I’ve seen firsthand in my own teams that good tests can boost development velocity as a direct result of a better developer experience.
In this article, I’ll share my own experiences in building useful tests for Django applications, from the basics to the best possible execution. If you’re using Django or building with it in your organization, you might like to read the rest of my Django series.
Tests are extremely important. Far beyond simply letting you know if a function works, tests can form the basis of your team’s understanding of how your application is intended to work.
Here’s the main goal: if you hit your head and forgot everything about how your application works tomorrow, you should be able to regain most of your understanding by reading and running the tests you write today.
Here are some questions that may be helpful to ask as you decide what to test:
Tests that make sense for your application can help build developer confidence. With these sensible safeguards in place, developers make improvements more readily, and feel confident introducing innovative solutions to product needs. The result is an application that comes together faster, and features that are shipped often and with confidence.
If you only have a few tests, you may organize your test files similarly to Django’s default app template by putting them all in a file called `tests.py`. This straightforward approach is best for smaller applications.
As your application grows, you may like to split your tests into different files, or test modules. One method is to use a directory to organize your files, such as `projectroot/app/tests/`. The name of each test file within that directory should begin with `test`, for example, `test_models.py`.
Besides being aptly named, these files will be found by Django’s built-in test discovery, which is based on the `unittest` module. All files in your application with names that begin with `test` will be collected into a test suite.
This convenient test discovery allows you to place test files anywhere that makes sense for your application. As long as they’re correctly named, Django’s test utility can find and run them.
Use docstrings to explain what a test is intended to verify at a high level. For example:
def test_create_user(self):
    """Creating a new user object should also create an associated profile object"""
    # ...
These docstrings help you quickly understand what a test is supposed to be doing. Besides navigating the codebase, this helps to make it obvious when a test doesn’t verify what the docstring says it should.
Docstrings are also shown when the tests are being run, which can be helpful for logging and debugging.
Django tests can be quickly set up using data created in the `setUpTestData()` method. You can use various approaches to create your test data, such as utilizing external files, or even hard-coding silly phrases or the names of your staff. Personally, I much prefer to use a fake-data-generation library, such as `faker`.
The proper set up of arbitrary testing data can help you ensure that you’re testing your application’s functionality instead of accidentally testing your test data. Because generators like `faker` add some degree of unexpectedness to your inputs, your tests can be more representative of real-world use.
Here is an example set up for a test:
from django.test import TestCase
from faker import Faker
from app.models import MyModel, AnotherModel

fake = Faker()

class MyModelTest(TestCase):
    @classmethod
    def setUpTestData(cls):
        """Quickly set up data for the whole TestCase"""
        cls.user_first = fake.first_name()
        cls.user_last = fake.last_name()

    def test_create_models(self):
        """Creating a MyModel object should also create AnotherModel object"""
        # In test methods, use the variables created above
        test_object = MyModel.objects.create(
            first_name=self.user_first,
            last_name=self.user_last,
            # ...
        )
        another_model = AnotherModel.objects.get(my_model=test_object)
        self.assertEqual(another_model.first_name, self.user_first)
        # ...
Tests pass or fail based on the outcome of the assertion methods. You can use Python’s `unittest` methods, and Django’s assertion methods.
For further guidance on writing tests, see Testing in Django.
Django’s test suite is manually run with:
./manage.py test
I rarely run my Django tests this way.
The best, or most efficient, testing practice is one that occurs without you or your developers ever thinking, “I need to run the tests first.” The beauty of Django’s near-effortless test suite set up is that it can be seamlessly run as a part of regular developer activities. This could be in a pre-commit hook, or in a continuous integration or deployment workflow.
I’ve previously written about how to use pre-commit hooks to improve your developer ergonomics and save your team some brainpower. Django’s speedy tests can be run this way, and they become especially efficient if you can run tests in parallel.
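To run Django’s tests from a pre-commit hook, a local hook entry along these lines could be added to `.pre-commit-config.yaml` (a hypothetical sketch – adjust the entry command and options to your project):

```
- repo: local
  hooks:
    - id: django-tests
      name: django tests
      entry: python manage.py test --parallel --failfast
      language: system
      pass_filenames: false
      types: [python]
```

`--failfast` keeps the commit loop snappy by stopping at the first failure; for slower suites, you may prefer to leave the full run to CI.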
Tests that run as part of a CI/CD workflow, for example, on pull requests with GitHub Actions, require no regular effort from your developers to remember to run tests at all. I’m not sure how plainly I can put it – this one’s literally a no-brainer.
Tests are extremely important, and underappreciated. They can catch logical errors in your application. They can help explain and validate how concepts and features of your product actually function. Best of all, tests can boost developer confidence and development velocity as a result.
The best tests are ones that are relevant, help to explain and define your application, and are run continuously without a second thought. I hope I’ve now shown you how testing in Django can help you to achieve these goals for your team!
I’ve been developing with Django for years, and I’ve never been happier with my Django project set up than I am right now. Here’s how I’m making a day of developing with Django the most relaxing and enjoyable development experience possible for myself and my engineering team.
Instead of typing:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
python3 manage.py makemigrations
python3 manage.py migrate
python3 manage.py collectstatic
python3 manage.py runserver
Wouldn’t it be much nicer to type:
make start
…and have all that happen for you? I think so!
We can do that with a self-documenting Makefile! Here’s one I frequently use when developing my Django applications, like ApplyByAPI.com:
VENV := env
BIN := $(VENV)/bin
PYTHON := $(BIN)/python
SHELL := /bin/bash

include .env

.PHONY: help
help: ## Show this help
	@egrep -h '\s##\s' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

.PHONY: venv
venv: ## Make a new virtual environment
	python3 -m venv $(VENV) && source $(BIN)/activate

.PHONY: install
install: venv ## Make venv and install requirements
	$(BIN)/pip install --upgrade -r requirements.txt

freeze: ## Pin current dependencies
	$(BIN)/pip freeze > requirements.txt

migrate: ## Make and run migrations
	$(PYTHON) manage.py makemigrations
	$(PYTHON) manage.py migrate

db-up: ## Pull and start the Docker Postgres container in the background
	docker pull postgres
	docker-compose up -d

db-shell: ## Access the Postgres Docker database interactively with psql. Pass in DBNAME=<name>.
	docker exec -it container_name psql -d $(DBNAME)

.PHONY: test
test: ## Run tests
	$(PYTHON) manage.py test application --verbosity=0 --parallel --failfast

.PHONY: run
run: ## Run the Django server
	$(PYTHON) manage.py runserver

start: install migrate run ## Install requirements, apply migrations, then start development server
You’ll notice the presence of the line `include .env` above. This ensures `make` has access to environment variables stored in a file called `.env`. This allows Make to utilize these variables in its commands, for example, the name of my virtual environment, or to pass in `$(DBNAME)` to `psql`.
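For reference, the included `.env` file is just simple variable assignments (hypothetical values shown – keep this file out of version control):

```
DBNAME=applybyapi
DEBUG=True
```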
What’s with that weird “##” comment syntax? A Makefile like this gives you a handy suite of command-line aliases you can check in to your Django project. It’s very useful, so long as you’re able to remember what all those aliases are.
The `help` command above, which runs by default, prints a helpful list of available commands when you run `make` or `make help`:
help Show this help
venv Make a new virtual environment
install Make venv and install requirements
migrate Make and run migrations
db-up Pull and start the Docker Postgres container in the background
db-shell Access the Postgres Docker database interactively with psql
test Run tests
run Run the Django server
start Install requirements, apply migrations, then start development server
All the usual Django commands are covered, and we’ve got a `test` command that runs our tests with the options we prefer. Brilliant.
You can read my full post about self-documenting Makefiles here, which also includes an example Makefile using `pipenv`.
I previously wrote about some technical ergonomics that can make it a lot easier for teams to develop great software.
One area that’s a no-brainer is using pre-commit hooks to lint code prior to checking it in. This helps to ensure the quality of the code your developers check in, but most importantly, ensures that no one on your team is spending time trying to remember if it should be single or double quotes or where to put a line break.
The confusingly-named pre-commit framework is an otherwise fantastic way to keep hooks (which are not included in cloned repositories) consistent across local environments.
Here is my configuration file, .pre-commit-config.yaml
, for my Django projects:
fail_fast: true
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.1.0
    hooks:
      - id: detect-aws-credentials
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/asottile/blacken-docs
    rev: v1.7.0
    hooks:
      - id: blacken-docs
        additional_dependencies: [black==19.3b0]
  - repo: local
    hooks:
      - id: markdownlint
        name: markdownlint
        description: "Lint Markdown files"
        entry: markdownlint '**/*.md' --fix --ignore node_modules --config "./.markdownlint.json"
        language: node
        types: [markdown]
These hooks check for accidental secret commits, format Python files using Black, format Python snippets in Markdown files using blacken-docs
, and lint Markdown files as well. To install them, just type pre-commit install
.
There are likely even more useful hooks available for your particular use case: see supported hooks to explore.
An underappreciated way to improve your team’s daily development experience is to make sure your project uses a well-rounded .gitignore
file. It can help prevent files containing secrets from being committed, and can additionally save developers hours of tedium by ensuring you’re never sifting through a git diff
of generated files.
To efficiently create a gitignore for Python and Django projects, Toptal’s gitignore.io can be a nice resource for generating a robust .gitignore
file.
I still recommend examining the generated results yourself to ensure that ignored files suit your use case, and that nothing you want ignored is commented out.
If your team works on GitHub, setting up a testing process with Actions is low-hanging fruit.
Tests that run in a consistent environment on every pull request can help eliminate “works on my machine” conundrums, as well as ensure no one’s sitting around waiting for a test to run locally.
A hosted CI environment like GitHub Actions can also help when running integration tests that require using managed services resources. You can use encrypted secrets in a repository to grant the Actions runner access to resources in a testing environment, without worrying about creating testing resources and access keys for each of your developers to use.
I’ve written on many occasions about setting up Actions workflows, including using one to run your Makefile, and how to integrate GitHub event data. GitHub even interviewed me about Actions once.
For Django projects, here’s a GitHub Actions workflow that runs tests with a consistent Python version whenever someone opens a pull request in the repository.
name: Run Django tests
on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Install dependencies
        run: make install
      - name: Run tests
        run: make test
For the installation and test commands, I’ve simply utilized the Makefile that’s been checked in to the repository. A benefit of using your Makefile commands in your CI test workflows is that you only need to keep them updated in one place – your Makefile! No more “why is this working locally but not in CI??!?” headaches.
If you want to step up your security game, you can add Django Security Check as an Action too.
Want to help keep your development team happy? Set them up for success with these best practices for Django development. Remember, an ounce of brainpower is worth a pound of software!
---

Django models help humans work with data in a way that makes sense to our brains, and the framework offers plenty of classes you can inherit to help you rapidly develop a robust application from scratch. As for developing on existing Django applications, there’s a feature for that, too. In this article, we’ll cover how to use Django migrations to update your existing models and database.
Django migrations are Python files that help you add and change things in your database tables to reflect changes in your Django models. To understand how Django migrations help you work with data, it may be helpful to understand the underlying structures we’re working with.
If you’ve laid eyes on a spreadsheet before, you’re already most of the way to understanding a database table. In a relational database, for example, a PostgreSQL database, you can expect to see data organized into columns and rows. A relational database table may have a set number of columns and any number of rows.
In Django, each model is its own table. For example, here’s a Django model:
from django.db import models
class Lunch(models.Model):
    left_side = models.CharField(max_length=100, null=True)
    center = models.CharField(max_length=100, null=True)
    right_side = models.CharField(max_length=100, null=True)
Each field is a column, and each row is a Django object instance of that model. Here’s a representation of a database table for the Django model “Lunch” above. In the database, its name would be lunch_table
.
id | left_side | center | right_side |
---|---|---|---|
1 | Fork | Plate | Spoon |
The model Lunch
has three fields: left_side
, center
, and right_side
. One instance of a Lunch
object would have “Fork” for the left_side
, a “Plate” for the center
, and “Spoon” for the right_side
. Django automatically adds an id
field if you don’t specify a primary key.
If you wanted to change the name of your Lunch model, you would do so in your models.py
code. For example, change “Lunch” to “Dinner,” then run python manage.py makemigrations
. You’ll see:
python manage.py makemigrations
Did you rename the backend.Lunch model to Dinner? [y/N] y
Migrations for 'backend':
backend/migrations/0003_auto_20200922_2331.py
- Rename model Lunch to Dinner
Django automatically generates the appropriate migration files. The relevant line of the generated migrations file in this case would look like:
migrations.RenameModel(old_name="Lunch", new_name="Dinner"),
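For context, Django wraps operations like this in a `Migration` class inside the generated file. The sketch below is illustrative rather than the verbatim generated file; the dependency name is a placeholder for whatever migration preceded it:

```python
# backend/migrations/0003_auto_20200922_2331.py (illustrative sketch)
from django.db import migrations


class Migration(migrations.Migration):

    # Placeholder: points at the migration that came before this one
    dependencies = [
        ("backend", "0002_previous"),
    ]

    operations = [
        migrations.RenameModel(old_name="Lunch", new_name="Dinner"),
    ]
```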
This operation would rename our “Lunch” model to “Dinner” while keeping everything else the same. But what if you also wanted to change the structure of the database table itself, its schema, as well as make sure that existing data ends up in the right place on your Dinner table?
Let’s explore how to turn our Lunch model into a Dinner model that looks like this:
from django.db import models
class Dinner(models.Model):
    top_left = models.CharField(max_length=100, null=True)
    top_center = models.CharField(max_length=100, null=True)
    top_right = models.CharField(max_length=100, null=True)
    bottom_left = models.CharField(max_length=100, null=True)
    bottom_center = models.CharField(max_length=100, null=True)
    bottom_right = models.CharField(max_length=100, null=True)
…with a database table that would look like this:
id | top_left | top_center | top_right | bottom_left | bottom_center | bottom_right |
---|---|---|---|---|---|---|
1 | Bread plate | Spoon | Glass | Fork | Plate | Knife |
Before you begin to manipulate your data, it’s always a good idea to create a backup of your database that you can restore in case something goes wrong. There are various ways to do this depending on the database you’re using. You can typically find instructions by searching for <your database name>
and keywords like backup
, recovery
, or snapshot
.
In order to design your migration, it’s helpful to become familiar with the available migration operations. Migrations are run step-by-step, and each operation is some flavor of adding, removing, or altering data. Like a strategic puzzle, it’s important to make model changes one step at a time so that the generated migrations have the correct result.
We’ve already renamed our model successfully. Now, we’ll rename the fields that hold the data we want to retain:
class Dinner(models.Model):
    bottom_left = models.CharField(max_length=100, null=True)
    bottom_center = models.CharField(max_length=100, null=True)
    top_center = models.CharField(max_length=100, null=True)
Django is sometimes smart enough to determine the old and new field names correctly. You’ll be asked for confirmation:
python manage.py makemigrations
Did you rename dinner.center to dinner.bottom_center (a CharField)? [y/N] y
Did you rename dinner.left_side to dinner.bottom_left (a CharField)? [y/N] y
Did you rename dinner.right_side to dinner.top_center (a CharField)? [y/N] y
Migrations for 'backend':
backend/migrations/0004_auto_20200914_2345.py
- Rename field center on dinner to bottom_center
- Rename field left_side on dinner to bottom_left
- Rename field right_side on dinner to top_center
In some cases, you’ll want to try renaming the field and running makemigrations
one at a time.
Now that the existing fields have been migrated to their new names, add the remaining fields to the model:
class Dinner(models.Model):
    top_left = models.CharField(max_length=100, null=True)
    top_center = models.CharField(max_length=100, null=True)
    top_right = models.CharField(max_length=100, null=True)
    bottom_left = models.CharField(max_length=100, null=True)
    bottom_center = models.CharField(max_length=100, null=True)
    bottom_right = models.CharField(max_length=100, null=True)
Running makemigrations
again now gives us:
python manage.py makemigrations
Migrations for 'backend':
backend/migrations/0005_auto_20200914_2351.py
- Add field bottom_right to dinner
- Add field top_left to dinner
- Add field top_right to dinner
You’re done! By generating Django migrations, you’ve successfully set up your dinner_table
and moved existing data to its new spot.
You’ll notice that our Lunch and Dinner models are not very complex. Out of Django’s many model field options, we’re just using CharField
. We also set null=True
to let Django store empty values as NULL
in the database.
Django migrations can handle additional complexity, such as changing field types, and whether a blank or null value is permitted. I keep Django’s model field reference handy as I work with varying types of data and different use cases.
I hope this article has helped you better understand Django migrations and how they work!
Now that you can change models and manipulate existing data in your Django application, be sure to use your powers wisely! Backup your database, research and plan your migrations, and always run tests before working with customer data. By doing so, you have the potential to enable your application to grow – with manageable levels of complexity.
---

Understanding these main features gives you the building blocks for maximizing development efficiency with Django. They’ll build the foundation for you to test efficiently and create an awesome development experience for your engineers. Let’s look at how these tools let you create a performant Django application that’s pleasant to build and maintain.
Remember that Django is all Python under the hood. When it comes to views, you’ve got two choices: view functions (sometimes called “function-based views”), or class-based views.
Years ago when I first built ApplyByAPI, it was initially composed entirely of function-based views. These offer granular control, and are good for implementing complex logic; just as in a Python function, you have complete control (for better or worse) over what the view does. With great control comes great responsibility, and function-based views can be a little tedious to use. You’re responsible for writing all the necessary methods for the view to work - this is what allows you to completely tailor your application.
In the case of ApplyByAPI, there were only a sparse few places where that level of tailored functionality was really necessary. Everywhere else, function-based views began making my life harder. Writing what is essentially a custom view for run-of-the-mill operations like displaying data on a list page became tedious, repetitive, and error-prone.
With function-based views, you’ll need to figure out which Django methods to implement in order to handle requests and pass data to views. Unit tests can take some work to write as well. In short, the granular control that function-based views offer also requires some granular tedium to properly implement.
I ended up holding back ApplyByAPI while I refactored the majority of the views into class-based views. This was not a small amount of work and refactoring, but when it was done, I had a bunch of tiny views that made a huge difference. I mean, just look at this one:
class ApplicationsList(ListView):
    model = Application
    template_name = "applications.html"
It’s three lines. My developer ergonomics, and my life, got a lot easier.
You may think of class-based views as templates that cover most of the functionality any app needs. There are views for displaying lists of things, for viewing a thing in detail, and editing views for performing CRUD (Create, Read, Update, Delete) operations. Because implementing one of these generic views takes only a few lines of code, my application logic became dramatically more succinct. This gave me less repeated code, fewer places for something to go wrong, and a more manageable application in general.
Class-based views are fast to implement and use. The built-in class-based generic views may require less work to test, since you don’t need to write tests for the base view Django provides. (Django does its own tests for that; no need for your app to double-check.) To tweak a generic view to your needs, you can subclass a generic view and override attributes or methods. In my case, since I only needed to write tests for any customizations I added, my test files became dramatically shorter, as did the time and resources it took to run them.
When you’re weighing the choice between function-based or class-based views, consider the amount of customization the view needs, and the future work that will be necessary to test and maintain it. If the logic is common, you may be able to hit the ground running with a generic class-based view. If you need sufficient granularity that re-writing a base view’s methods would make it overly complicated, consider a function-based view instead.
Models organize your Django application’s central concepts to help make them flexible, robust, and easy to work with. If used wisely, models are a powerful way to collate your data into a definitive source of truth.
Like views, Django provides some built-in model types for the convenience of implementing basic authentication, including the User and Permission models. For everything else, you can create a model that reflects your concept by inheriting from a parent Model class.
class StaffMember(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    company = models.OneToOneField(Company, on_delete=models.CASCADE)

    def __str__(self):
        return self.company.name + " - " + self.user.email
When you create a custom model in Django, you subclass Django’s Model class and take advantage of all its power. Each model you create generally maps to a database table. Each attribute is a database field. This gives you the ability to create objects to work with that humans can better understand.
You can make a model useful to you by defining its fields. Many built-in field types are conveniently provided. These help Django figure out the data type, the HTML widget to use when rendering a form, and even form validation requirements. If you need to, you can write custom model fields.
Database relationships can be defined using a ForeignKey field (many-to-one), or a ManyToManyField (give you three guesses). If those don’t suffice, there’s also a OneToOneField. Together, these allow you to define relations between your models with levels of complexity limited only by your imagination. (Depending on the imagination you have, this may or may not be an advantage.)
Use your model’s Manager (objects
by default) to construct a QuerySet. This is a representation of objects in your database that you can refine, using methods, to retrieve specific subsets. All available methods are in the QuerySet API and can be chained together for even more fun.
Post.objects.filter(
    type="new"
).exclude(
    title__startswith="Blockchain"
)
Some methods return new QuerySets, such as filter()
, or exclude()
. Chaining these can give you powerful queries without affecting performance, as QuerySets aren’t fetched from the database until they are evaluated. Methods that evaluate a QuerySet include get()
, count()
, len()
, list()
, or bool()
.
Iterating over a QuerySet also evaluates it, so avoid doing so where possible to improve query performance. For instance, if you just want to know if an object is present, you can use exists()
to avoid iterating over database objects.
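QuerySet laziness has a plain-Python analogue in generators. Here’s a toy sketch (no Django involved) showing that building a pipeline does no work until you evaluate it:

```python
# Plain Python, no Django: generators are lazy in the same spirit as QuerySets
work_done = []

def fetch_rows():
    for i in range(5):
        work_done.append(i)  # side effect stands in for hitting the database
        yield i

rows = fetch_rows()                      # like building a QuerySet: no work yet
evens = (r for r in rows if r % 2 == 0)  # like chaining .filter(): still no work
print(len(work_done))                    # 0 -- nothing "queried" yet
result = list(evens)                     # evaluation, like list(queryset)
print(result, len(work_done))            # [0, 2, 4] 5
```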
Use get()
in cases where you want to retrieve a specific object. This method raises MultipleObjectsReturned
if something unexpected happens, as well as the DoesNotExist
exception, if, take a guess.
If you’d like to get an object that may not exist in the context of a user’s request, use the convenient get_object_or_404()
or get_list_or_404()
which raises Http404
instead of DoesNotExist
. These helpful shortcuts are suited to just this purpose. To fetch an object, creating it first if it doesn’t already exist, there’s also the convenient get_or_create()
.
You’ve now got a handle on these three essential tools for building your efficient Django application – congratulations! You can make Django work even better for you by learning about manipulating data with migrations, testing effectively, and setting up your team’s Django development for maximum happiness.
If you’re going to build on GitHub, you may like to set up my django-security-check GitHub Action. In the meantime, you’re well on your way to building a beautiful software project.
---

Multithreading in Python is a bit of a bitey subject (not sorry) in that the Python interpreter doesn’t actually let multiple threads execute at the same time. Python’s Global Interpreter Lock, or GIL, prevents multiple threads from executing Python bytecodes at once. Each thread that wants to execute must first wait for the GIL to be released by the currently executing thread. The GIL is pretty much the microphone in a low-budget conference panel, except where no one gets to shout.
This has the advantage of preventing race conditions. It does, however, lack the performance advantages afforded by running multiple tasks in parallel. (If you’d like a refresher on concurrency, parallelism, and multithreading, see Concurrency, parallelism, and the many threads of Santa Claus.) While I prefer Go for its convenient first-class primitives that support concurrency (see Goroutines), this project’s recipients were more comfortable with Python. I took it as an opportunity to test and explore!
Simultaneously performing multiple tasks in Python isn’t impossible; it just takes a little extra work. For Hydra, the main advantage is in overcoming the input/output (I/O) bottleneck.
In order to get web pages to check, Hydra needs to go out to the Internet and fetch them. When compared to tasks that are performed by the CPU alone, going out over the network is comparatively slower. How slow?
Here are approximate timings for tasks performed on a typical PC:
Type | Task | Time |
---|---|---|
CPU | execute typical instruction | 1/1,000,000,000 sec = 1 nanosec |
CPU | fetch from L1 cache memory | 0.5 nanosec |
CPU | branch misprediction | 5 nanosec |
CPU | fetch from L2 cache memory | 7 nanosec |
RAM | Mutex lock/unlock | 25 nanosec |
RAM | fetch from main memory | 100 nanosec |
Network | send 2K bytes over 1Gbps network | 20,000 nanosec |
RAM | read 1MB sequentially from memory | 250,000 nanosec |
Disk | fetch from new disk location (seek) | 8,000,000 nanosec (8ms) |
Disk | read 1MB sequentially from disk | 20,000,000 nanosec (20ms) |
Network | send packet US to Europe and back | 150,000,000 nanosec (150ms) |
Peter Norvig first published these numbers some years ago in Teach Yourself Programming in Ten Years. Since computers and their components change year over year, the exact numbers shown above aren’t the point. What these numbers help to illustrate is the difference, in orders of magnitude, between operations.
Compare the difference between fetching from main memory and sending a simple packet over the Internet. While both these operations occur in less than the blink of an eye (literally) from a human perspective, you can see that sending a simple packet over the Internet is over a million times slower than fetching from RAM. It’s a difference that, in a single-thread program, can quickly accumulate to form troublesome bottlenecks.
In Hydra, the task of parsing response data and assembling results into a report is relatively fast, since it all happens on the CPU. The slowest portion of the program’s execution, by over six orders of magnitude, is network latency. Not only does Hydra need to fetch packets, but whole web pages! One way of improving Hydra’s performance is to find a way for the page fetching tasks to execute without blocking the main thread.
Python has a couple options for doing tasks in parallel: multiple processes, or multiple threads. These methods allow you to circumvent the GIL and speed up execution in a couple different ways.
To execute parallel tasks using multiple processes, you can use Python’s ProcessPoolExecutor
. A concrete subclass of Executor
from the concurrent.futures
module, ProcessPoolExecutor
uses a pool of processes spawned with the multiprocessing
module to avoid the GIL.
This option uses a pool of worker subprocesses whose size defaults to the number of processors on the machine. The multiprocessing
module lets you parallelize function execution across processes, which can really speed up compute-bound (or CPU-bound) tasks.
Since the main bottleneck for Hydra is I/O and not the processing to be done by the CPU, I’m better served by using multiple threads.
Fittingly named, Python’s ThreadPoolExecutor
uses a pool of threads to execute asynchronous tasks. Also a subclass of Executor
, it uses a defined number of maximum worker threads (at least five by default, according to the formula min(32, os.cpu_count() + 4)
) and reuses idle threads before starting new ones, making it pretty efficient.
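To see the shape of the win without Hydra’s full machinery, here’s a minimal, self-contained sketch; the URLs and `fetch` function are stand-ins, with `time.sleep` simulating the network wait:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for a network request; sleep simulates I/O wait
    time.sleep(0.1)
    return f"fetched {url}"

urls = [f"https://example.com/page/{i}" for i in range(10)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

# The ten 0.1-second waits overlap, so the whole batch takes
# roughly 0.1s instead of the ~1s a single thread would need
print(len(results))
```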
Here is a snippet of Hydra with comments showing how Hydra uses ThreadPoolExecutor
to achieve parallel multithreaded bliss:
# Create the Checker class
class Checker:
    # Queue of links to be checked
    TO_PROCESS = Queue()
    # Maximum workers to run
    THREADS = 100
    # Maximum seconds to wait for HTTP response
    TIMEOUT = 60

    def __init__(self, url):
        ...
        # Create the thread pool
        self.pool = futures.ThreadPoolExecutor(max_workers=self.THREADS)

    def run(self):
        # Run until the TO_PROCESS queue is empty
        while True:
            try:
                target_url = self.TO_PROCESS.get(block=True, timeout=2)
                # If we haven't already checked this link
                if target_url["url"] not in self.visited:
                    # Mark it as visited
                    self.visited.add(target_url["url"])
                    # Submit the link to the pool
                    job = self.pool.submit(self.load_url, target_url, self.TIMEOUT)
                    job.add_done_callback(self.handle_future)
            except Empty:
                return
            except Exception as e:
                print(e)
You can view the full code in Hydra’s GitHub repository.
If you’d like to see the full effect, I compared the run times for checking my website between a prototype single-thread program and the ~~multiheaded~~ multithreaded Hydra.
time python3 slow-link-check.py https://victoria.dev
real 17m34.084s
user 11m40.761s
sys 0m5.436s
time python3 hydra.py https://victoria.dev
real 0m15.729s
user 0m11.071s
sys 0m2.526s
The single-thread program, which blocks on I/O, ran in about seventeen minutes. When I first ran the multithreaded version, it finished in 1m13.358s - after some profiling and tuning, it took a little under sixteen seconds. Again, the exact times don’t mean all that much; they’ll vary depending on factors such as the size of the site being crawled, your network speed, and your program’s balance between the overhead of thread management and the benefits of parallelism.
The more important thing, and the result I’ll take any day, is a program that runs some orders of magnitude faster.
---

I’ve been helping out a group called the Open Web Application Security Project (OWASP). They’re a non-profit foundation that produces some of the foremost application testing guides and cybersecurity resources. OWASP’s publications, checklists, and reference materials are a help to security professionals, penetration testers, and developers all over the world. Most of the individual teams that create these materials are run almost entirely by volunteers.
OWASP is a great group doing important work. I’ve seen this firsthand as part of the core team that produces the Web Security Testing Guide. However, while OWASP inspires in its large volunteer base, it lacks in the area of central organization.
This lack of organization was most recently apparent in the group’s website, OWASP.org. A big organization with an even bigger website to match, OWASP.org enjoys hundreds of thousands of visitors. Unfortunately, many of its pages - individually managed by disparate projects - are infrequently updated. Some are abandoned. The website as a whole lacks a centralized quality assurance process, and as a result, OWASP.org is peppered with broken links.
Customers don’t like broken links; attackers really do. That’s because broken links are a security vulnerability. Broken links can signal opportunities for attacks like broken link hijacking and subdomain takeovers. At their least effective, these attacks can be embarrassing; at their worst, severely damaging to businesses and organizations. One OWASP group, the Application Security Verification Standard (ASVS) project, writes about integrity controls that can help to mitigate the likelihood of these attacks. This knowledge, unfortunately, has not yet propagated throughout the rest of OWASP.
This is the story of how I created a fast and efficient tool to help OWASP solve this problem.
I took on the task of creating a program that could run as part of a CI/CD process to detect and report broken links. Essentially, I needed to build a web crawler.
My original journey through this process was also in Python, as that was a comfortable language choice for everyone in the OWASP group. Personally, I prefer to use Go for higher performance as it offers more convenient concurrency primitives. Between the task and this talk, I wrote three programs: a prototype single-thread Python program, a multithreaded Python program, and a Go program using goroutines. We’ll see a comparison of how each worked out near the end of the talk - first, let’s explore how to build a web crawler.
Here’s what our web crawler will need to do:
- Start from a given URL (https://victoria.dev)
- Only check links on the same site (https://victoria.dev and not https://github.com, for instance)

Here’s what the execution flow will look like:
As you can see, the nodes “GET page” -> “HTML” -> “Parse links” -> “Valid link” -> “Check visited” all form a loop. These are what enable our web crawler to continue crawling until all the links on the site have been accounted for in the “Check visited” node. When the crawler encounters links it’s already checked, it will “Stop.” This loop will become more important in a moment.
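The loop above can be sketched in a few lines of Python. The `site` dict below is a made-up stand-in for real pages, with a linked-but-missing page playing the part of a broken link:

```python
from collections import deque

# Made-up link graph standing in for a website; "/missing" is linked
# to but has no page behind it, playing the part of a broken link
site = {
    "/": ["/about", "/posts"],
    "/about": ["/"],
    "/posts": ["/about", "/missing"],
}

visited = set()
broken = []
to_process = deque(["/"])

while to_process:
    url = to_process.popleft()
    if url in visited:        # "Check visited" -> Stop for this link
        continue
    visited.add(url)
    links = site.get(url)     # "GET page" -> "HTML" -> "Parse links"
    if links is None:         # no page behind the link: it's broken
        broken.append(url)
        continue
    to_process.extend(links)  # feed new links back into the loop

print(sorted(visited), broken)
```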
For now, the question on everyone’s mind (I hope): how do we make it fast?
Here are some approximate timings for tasks performed on a typical PC:
Type | Task | Time |
---|---|---|
CPU | execute typical instruction | 1/1,000,000,000 sec = 1 nanosec |
CPU | fetch from L1 cache memory | 0.5 nanosec |
CPU | branch misprediction | 5 nanosec |
CPU | fetch from L2 cache memory | 7 nanosec |
RAM | Mutex lock/unlock | 25 nanosec |
RAM | fetch from main memory | 100 nanosec |
RAM | read 1MB sequentially from memory | 250,000 nanosec |
Disk | fetch from new disk location (seek) | 8,000,000 nanosec (8ms) |
Disk | read 1MB sequentially from disk | 20,000,000 nanosec (20ms) |
Network | send packet US to Europe and back | 150,000,000 nanosec (150ms) |
Peter Norvig first published these numbers some years ago in Teach Yourself Programming in Ten Years. They typically crop up now and then in articles titled along the lines of, “Latency numbers every developer should know.”
Since computers and their components change year over year, the exact numbers shown above aren’t the point. What these numbers help to illustrate is the difference, in orders of magnitude, between operations.
Compare the difference between fetching from main memory and sending a simple packet over the Internet. While both these operations occur in less than the blink of an eye (literally) from a human perspective, you can see that sending a simple packet over the Internet is over a million times slower than fetching from RAM. It’s a difference that, in a single-thread program, can quickly accumulate to form troublesome bottlenecks.
The numbers above mean that the difference in time it takes to send something over the Internet compared to fetching data from main memory is over six orders of magnitude. Remember the loop in our execution chart? The “GET page” node, in which our crawler fetches page data over the network, is going to be a million times slower than the next slowest thing in the loop!
We don’t need to run our prototype to see what that means in practical terms; we can estimate it. Let’s take OWASP.org, which has upwards of 12,000 links, as an example:
150 milliseconds
x 12,000 links
---------
1,800,000 milliseconds (30 minutes)
A whole half hour, just for the network tasks. It may even be much slower than that, since web pages are frequently much larger than a packet. This means that in our single-thread prototype web crawler, our biggest bottleneck is network latency. Why is this problematic?
I previously wrote about feedback loops. In essence, in order to improve at doing anything, you first need to be able to get feedback from your last attempt. That way, you have the necessary information to make adjustments and get closer to your goal on your next iteration.
As a software developer, bottlenecks can contribute to long and inefficient feedback loops. If I’m waiting on a process that’s part of a CI/CD pipeline, in our bottlenecked web crawler example, I’d be sitting around for a minimum of a half hour before learning whether or not changes in my last push were successful, or whether they broke master
(hopefully staging
).
Multiply a slow and inefficient feedback loop by many runs per day, over many days, and you’ve got a slow and inefficient developer. Multiply that by many developers in an organization bottlenecked on the same process, and you’ve got a slow and inefficient company.
To add insult to injury, not only are you waiting on a bottlenecked process to run; you’re also paying to wait. Take the serverless example - AWS Lambda, for instance. Here’s a chart showing the cost of functions by compute time and CPU usage.
Again, the numbers change over the years, but the main concepts remain the same: the bigger the function and the longer its compute time, the bigger the cost. For applications taking advantage of serverless, these costs can add up dramatically.
Bottlenecks are a recipe for failure, for both productivity and the bottom line.
The good news is that bottlenecks are mostly unnecessary. If we know how to identify them, we can strategize our way out of them. To understand how, let’s get some tacos.
Everyone, meet Bob. He’s a gopher who works at the taco stand down the street as the cashier. Say “Hi,” Bob.
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 ╔══════════════╗
🌮 Hi I'm Bob 🌳
🌮 ╚══════════════╝ \
🌮 🐹 🌮
🌮
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
Bob works very hard at being a cashier, but he’s still just one gopher. The customers who frequent Bob’s taco stand can eat tacos really quickly; but in order to get the tacos to eat them, they’ve got to order them through Bob. Here’s what our bottlenecked, single-thread taco stand currently looks like:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵
🌮
🌮
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
As you can see, all the customers are queued up, right out the door. Poor Bob handles one customer’s transaction at a time, starting and finishing with that customer completely before moving on to the next. Bob can only do so much, so our taco stand is rather inefficient at the moment. How can we make Bob faster?
We can try splitting the queue:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮 🐹
🌮
🌮 🧑💵🧑💵🧑💵🧑💵🧑💵
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
Now Bob can do some multitasking. For example, he can start a transaction with a customer in one queue; then, while that customer counts their bills, Bob can pop over to the second queue and get started there. This arrangement, known as a concurrency model, helps Bob go a little bit faster by jumping back and forth between lines. However, it’s still just one Bob, which limits our improvement possibilities. If we were to make four queues, they’d all be shorter; but Bob would be very thinly stretched between them. Can we do better?
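The single-Bob, two-queue arrangement can be sketched in a few lines of Python (the queue contents and round-robin step are illustrative, not from the original post): one worker hops back and forth, taking one step of a transaction from each line in turn.

```python
from collections import deque

def serve_concurrently(queue_a, queue_b):
    """One worker round-robins between two queues, interleaving service."""
    order = []
    queues = deque([deque(queue_a), deque(queue_b)])
    while queues:
        q = queues.popleft()
        if q:
            order.append(q.popleft())  # serve one customer from this line
            queues.append(q)           # then hop over to the other line
    return order

served = serve_concurrently(["a1", "a2"], ["b1", "b2"])
# Customers are interleaved across the two queues: a1, b1, a2, b2
```

No customer gets served faster, but nobody’s line sits completely idle — which is exactly Bob’s situation: one worker, interleaved progress.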
We could get two Bobs:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
With twice the Bobs, each can handle a queue of his own. This is our most efficient solution for our taco stand so far, since two Bobs can handle much more than one Bob can, even if each customer is still attended to one at a time.
We can do even better than that:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
With quadruple the Bobs, we have some very short queues, and a much more efficient taco stand. In computing, the concept of having multiple workers do tasks in parallel is called multithreading.
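As a quick illustration in Python (the taco-stand names are made up for the example), multiple workers can each take a queue of their own and serve them in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_queue(name, customers):
    """One worker serves every customer in its own queue."""
    return [f"{name} served {c}" for c in customers]

queues = {
    "Bob1": ["c1", "c2"],
    "Bob2": ["c3", "c4"],
    "Bob3": ["c5", "c6"],
    "Bob4": ["c7", "c8", "c9"],
}

# Four Bobs, four queues: each worker thread handles its own line.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_queue, queues.keys(), queues.values()))
```

Because a link checker is I/O-bound, Python threads genuinely help here even with the GIL; for CPU-bound work you’d reach for multiprocessing instead.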
In Go, we can apply this concept using goroutines. Here are some illustrative snippets from my Go solution.
In order to share data between our goroutines, we’ll need to create some data structures. Our Checker structure will be shared, so it will have a Mutex (mutual exclusion) to allow our goroutines to lock and unlock it. The Checker structure will also hold a list of brokenLinks results, and visitedLinks. The latter will be a map of strings to booleans, which we’ll use to directly and efficiently check for visited links. By using a map instead of iterating over a list, our visitedLinks lookup will have a constant time complexity of O(1) as opposed to a linear O(n), thus avoiding the creation of another bottleneck. For more on time complexity, see my coffee-break introduction to time complexity of algorithms article.
type Checker struct {
    startDomain  string
    brokenLinks  []Result
    visitedLinks map[string]bool
    workerCount, maxWorkers int
    sync.Mutex
}
...
// Page allows us to retain parent and sublinks
type Page struct {
    parent, loc, data string
}

// Result adds error information for the report
type Result struct {
    Page
    reason string
    code   int
}
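The map-versus-list lookup difference translates directly to Python, for illustration (the URLs here are placeholders): checking membership in a dict or set is a single hash lookup on average, while a list has to be scanned element by element.

```python
# Mirrors the visitedLinks idea: a dict gives O(1) average-case
# membership checks, while a list gives O(n) scans.
visited_list = []
visited_map = {}

def visit(url):
    visited_list.append(url)
    visited_map[url] = True

for i in range(10_000):
    visit(f"https://example.com/page/{i}")

# O(n): walks the whole list in the worst case
slow_check = "https://example.com/page/9999" in visited_list
# O(1) average: one hash lookup, regardless of how many links we've seen
fast_check = "https://example.com/page/9999" in visited_map
```

Both checks return the same answer; only the map’s cost stays flat as the crawl grows.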
To extract links from HTML data, here’s a parser I wrote on top of package html:
// Extract links from HTML
func parse(parent, data string) ([]string, []string) {
    doc, err := html.Parse(strings.NewReader(data))
    if err != nil {
        fmt.Println("Could not parse: ", err)
    }
    goodLinks := make([]string, 0)
    badLinks := make([]string, 0)
    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && checkKey(string(n.Data)) {
            for _, a := range n.Attr {
                if checkAttr(string(a.Key)) {
                    j, err := formatURL(parent, a.Val)
                    if err != nil {
                        badLinks = append(badLinks, j)
                    } else {
                        goodLinks = append(goodLinks, j)
                    }
                    break
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
    f(doc)
    return goodLinks, badLinks
}
If you’re wondering why I didn’t use a more full-featured package for this project, I highly recommend the story of left-pad. The short of it: more dependencies, more problems.
Here are snippets of the main function, where we pass in our starting URL and create a queue (a channel, in Go) to be filled with links for our goroutines to process.
func main() {
    ...
    startURL := flag.String("url", "http://example.com", "full URL of site")
    ...
    firstPage := Page{
        parent: *startURL,
        loc:    *startURL,
    }
    toProcess := make(chan Page, 1)
    toProcess <- firstPage
    var wg sync.WaitGroup
The last significant piece of the puzzle is to create our workers, which we’ll do here:
for i := range toProcess {
    wg.Add(1)
    checker.addWorker()
    go worker(i, &checker, &wg, toProcess)
    if checker.workerCount > checker.maxWorkers {
        time.Sleep(1 * time.Second) // throttle down
    }
}
wg.Wait()
A WaitGroup does just what it says on the tin: it waits for our group of goroutines to finish. When they have, we’ll know our Go web crawler has finished checking all the links on the site.
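The same queue-of-work, wait-for-the-workers pattern can be sketched in Python (a loose analogue under my own naming, not a translation of the crawler): queue.Queue stands in for the channel, and Queue.join() plays the WaitGroup’s role.

```python
import queue
import threading

# A loose Python analogue of the Go pattern: the channel becomes a
# queue.Queue, and Queue.join() waits like a WaitGroup.
to_process = queue.Queue()
results = []
results_lock = threading.Lock()  # plays the role of the Checker's Mutex

def worker():
    while True:
        try:
            page = to_process.get(timeout=0.5)
        except queue.Empty:
            return  # queue drained; this worker is done
        with results_lock:
            results.append(f"checked {page}")
        to_process.task_done()

for n in range(20):
    to_process.put(f"page-{n}")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

to_process.join()   # blocks until every queued page is marked done
for t in threads:
    t.join()        # workers exit once the queue stays empty
```

When join() returns, every queued page has been processed — the same guarantee wg.Wait() gives the Go crawler.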
Here’s a comparison of the three programs I wrote on this journey. First, the prototype single-thread Python version:
time python3 slow-link-check.py https://victoria.dev
real 17m34.084s
user 11m40.761s
sys 0m5.436s
This finished crawling my website in about seventeen-and-a-half minutes, which is rather long for a site at least an order of magnitude smaller than OWASP.org.
The multithreaded Python version did a bit better:
time python3 hydra.py https://victoria.dev
real 1m13.358s
user 0m13.161s
sys 0m2.826s
My multithreaded Python program (which I dubbed Hydra) finished in one minute and thirteen seconds.
How did Go do?
time ./go-link-check --url=https://victoria.dev
real 0m7.926s
user 0m9.044s
sys 0m0.932s
At just under eight seconds, I found the Go version to be extremely palatable.
As fun as it is to simply enjoy the speedups, we can relate these results directly to everything we’ve learned so far. Consider taking a process that used to soak up seventeen-and-a-half minutes and turning it into an eight-second affair. Not only does that give developers a much shorter and more efficient feedback loop, it gives companies the ability to develop, and thus grow, faster - while costing less. To drive the point home: a process that runs in seventeen-and-a-half minutes when it could take eight seconds also costs over a hundred and thirty times as much to run!
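The back-of-the-envelope arithmetic behind that claim, assuming billing scales linearly with compute time:

```python
# Same work, billed by compute time.
slow_seconds = 17.5 * 60   # ~17.5 minutes for the single-threaded crawl
fast_seconds = 8           # ~8 seconds for the Go version

speedup = slow_seconds / fast_seconds
# 1050 s / 8 s ≈ 131 — pay per second, and the slow version costs ~131x.
```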
A better work day for developers, and a better bottom line for companies. There’s a lot of benefit to be had in making functions, code, and processes as efficient as possible - by breaking bottlenecks.
When tackling new concepts, I find concrete examples to be most useful. I’ll share some in this post and discuss appropriate situations for each. (Pun intended.)
First, in pseudocode:
for iterating_variable in iterable:
    statement(s)
I find for loops to be the most readable way to iterate in Python. This is especially nice when you’re writing code that someone else needs to read and understand, which is always.
An iterating_variable, loosely speaking, takes on the value of each thing in a group, one at a time. For example: a letter in a string, an item from a list, or an integer in a range of integers.
An iterable houses the things you iterate on. This can also take different forms: a string with multiple characters, a range of numbers, a list, and so on.
A statement, or multiple statements, indicates doing something with the iterating variable. This could be anything from a mathematical expression to simply printing a result.
Here are a couple of simple examples that print each iterating_variable of an iterable:
for letter in "Hello world":
    print(letter)

for i in range(10):
    print(i)

breakfast_menu = ["toast", "eggs", "waffles", "coffee"]
for choice in breakfast_menu:
    print(choice)
You can even use a for loop in a more compact situation, such as this one-liner:
breakfast_buffet = " ".join(str(item) for item in breakfast_menu)
The downside to for loops is that they can be a bit verbose, depending on how much you’re trying to achieve. Still, for anyone hoping to make their Python code as easily understood as possible, for loops are the most straightforward choice.
A pseudocode example:
new_list = [statement(s) for iterating_variable in iterable]
List comprehensions are a concise and elegant way to create a new list by iterating on variables. Once you have a grasp of how they work, you can perform efficient iterations with very little code.
List comprehensions will always return a list, which may or may not be appropriate for your situation.
For example, you could use a list comprehension to quickly calculate and print tip percentage on a few bar tabs at once:
tabs = [23.60, 42.10, 17.50]
tabs_incl_tip = [round(tab*1.15, 2) for tab in tabs]
print(tabs_incl_tip)
>>> [27.14, 48.41, 20.12]
In one concise line, we’ve taken each tab amount, added a 15% tip, rounded it to the nearest cent, and made a new list of the tabs plus the tip values.
List comprehensions can be an elegant tool if output to a list is useful to you. Be advised that the more statements you add, the more complicated your list comprehension begins to look, especially once you get into nested list comprehensions. If your code isn’t well annotated, it may become difficult for another reader to figure out.
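To illustrate the readability trade-off (the matrix here is a made-up example), compare a nested list comprehension with its expanded for-loop equivalent:

```python
# A nested list comprehension: concise, but harder to scan at a glance.
matrix = [[1, 2, 3], [4, 5, 6]]
flat = [n * 2 for row in matrix for n in row]

# The equivalent for loops are longer, but read top to bottom:
flat_loops = []
for row in matrix:
    for n in row:
        flat_loops.append(n * 2)
```

Both produce the same flattened, doubled list; which form is clearer depends on your reader more than on the machine.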
How to map, in pseudocode:
map(statement, iterable)
Map is pretty compact, for better or worse. It can be harder to read and understand, especially if your line of code has a lot of parentheses.
In terms of efficiency for character count, map is hard to beat. It applies your statement to every element of your iterable and returns an iterator.
Here’s an example casting each element of input() (the iterable) from a string representation to an integer representation. Since map returns an iterator, you also cast the result to a list representation.
values = list(map(int, input().split()))
weights = list(map(int, input().split()))
It’s worth noting that you can also use for loops, list comprehensions, and map all together:
output = sum([x[0] * x[1] for x in zip(values, weights)]) / sum(weights)
print(round(output, 1))
Each of these methods of iteration in Python has a special place in the code I write every day. I hope these examples have helped you see how to use for loops, list comprehensions, and map in your own Python code!
If you like this post, there’s a lot more where that came from! I write about efficient programming for coders and for leading technical teams. Check out the posts below!