There are bound to be situations in which this isn’t enough, such as when you want to read in a large amount of text from a file. Using the OpenAI API allows you to send many more tokens in a messages array, with the maximum number depending on your chosen model. This lets you provide large amounts of text to ChatGPT using chunking. Here’s how.
The `gpt-4` model currently has a maximum context length of 8,192 tokens. (Here are the docs containing current limits for all the models.) Remember that you can first apply text preprocessing techniques to reduce your input size – in my previous post I achieved a 28% size reduction without losing meaning with just a little tokenization and pruning.
When this isn’t enough to fit your message within the maximum token limit, you can take a general programmatic approach that sends your input in chunks. The goal is to divide your text into sections that each fit within the model’s token limit; each chunk is then sent as a separate message in the conversation thread.
You send your chunks to ChatGPT using the OpenAI library’s `ChatCompletion`. ChatGPT returns individual responses for each message, so you may want to post-process these, for example by replacing `\n` with line breaks.

Using the OpenAI API, you can send multiple messages to ChatGPT and ask it to wait for you to provide all of the data before answering your prompt. Because ChatGPT is a language model, you can provide these instructions in plain language. Here’s a suggested script:
Prompt: Summarize the following text for me
To provide the context for the above prompt, I will send you text in parts. When I am finished, I will tell you “ALL PARTS SENT”. Do not answer until you have received all the parts.
I created a Python module, `chatgptmax`, that puts all this together. It breaks up a large amount of text by a given maximum token length and sends it in chunks to ChatGPT. You can install it with `pip install chatgptmax`, but here’s the juicy part:
import os
import openai
import tiktoken

# Load your OpenAI API key from an environment variable or secret management service
openai.api_key = os.getenv("OPENAI_API_KEY")

def send(
    prompt=None,
    text_data=None,
    chat_model="gpt-3.5-turbo",
    model_token_limit=8192,
    max_tokens=2500,
):
    """
    Send the prompt at the start of the conversation and then send chunks of text_data to ChatGPT via the OpenAI API.
    If the text_data is too long, it splits it into chunks and sends each chunk separately.

    Args:
    - prompt (str, optional): The prompt to guide the model's response.
    - text_data (str, optional): Additional text data to be included.
    - chat_model (str, optional): The chat model to use. Default is "gpt-3.5-turbo".
    - model_token_limit (int, optional): The model's maximum context length. Default is 8192.
    - max_tokens (int, optional): Maximum tokens for each API call. Default is 2500.

    Returns:
    - list or str: A list of the model's responses for each chunk, or an error message.
    """
    # Check if the necessary arguments are provided
    if not prompt:
        return "Error: Prompt is missing. Please provide a prompt."
    if not text_data:
        return "Error: Text data is missing. Please provide some text data."

    # Initialize the tokenizer
    tokenizer = tiktoken.encoding_for_model(chat_model)

    # Encode the text_data into token integers
    token_integers = tokenizer.encode(text_data)

    # Split the token integers into chunks based on max_tokens,
    # leaving room for the prompt's own tokens
    chunk_size = max_tokens - len(tokenizer.encode(prompt))
    chunks = [
        token_integers[i : i + chunk_size]
        for i in range(0, len(token_integers), chunk_size)
    ]

    # Decode token chunks back to strings
    chunks = [tokenizer.decode(chunk) for chunk in chunks]

    responses = []
    messages = [
        {"role": "user", "content": prompt},
        {
            "role": "user",
            "content": "To provide the context for the above prompt, I will send you text in parts. When I am finished, I will tell you 'ALL PARTS SENT'. Do not answer until you have received all the parts.",
        },
    ]

    for chunk in chunks:
        messages.append({"role": "user", "content": chunk})

        # Check if total tokens exceed the model's limit and remove oldest chunks if necessary
        while (
            sum(len(tokenizer.encode(msg["content"])) for msg in messages)
            > model_token_limit
        ):
            messages.pop(1)  # Remove the oldest chunk

        response = openai.ChatCompletion.create(model=chat_model, messages=messages)
        chatgpt_response = response.choices[0].message["content"].strip()
        responses.append(chatgpt_response)

    # Add the final "ALL PARTS SENT" message
    messages.append({"role": "user", "content": "ALL PARTS SENT"})
    response = openai.ChatCompletion.create(model=chat_model, messages=messages)
    final_response = response.choices[0].message["content"].strip()
    responses.append(final_response)

    return responses
Here’s an example of how you can use this module with text data read from a file. (`chatgptmax` also provides a convenience method for getting text from a file.)
# First, import the necessary modules and the function
import os
from chatgptmax import send

# Define a function to read the content of a file
def read_file_content(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Use the function
if __name__ == "__main__":
    # Specify the path to your file
    file_path = "path_to_your_file.txt"

    # Read the content of the file
    file_content = read_file_content(file_path)

    # Define your prompt
    prompt_text = "Summarize the following text for me:"

    # Send the file content to ChatGPT
    responses = send(prompt=prompt_text, text_data=file_content)

    # Print the responses
    for response in responses:
        print(response)
While the module is designed to handle most standard use cases, there are potential pitfalls to be aware of:
As with any process, there’s always room for improvement. Here are a couple of ways you might optimize the module’s chunking and sending process further:
If you’re using 32k models or need to use small chunk sizes, however, parallelism gains are likely to be minimal.

If you found your way here via search, you probably already have a use case in mind. Here are some other (startup) ideas:
Do you have a use case I didn’t list? Let me know about it! In the meantime, have fun sending lots of text to ChatGPT.
Text preprocessing can help shorten and refine your input, ensuring that ChatGPT can grasp the essence without getting overwhelmed. In this article, we’ll explore these techniques, understand their importance, and see how they make your interactions with tools like ChatGPT more reliable and productive.
Text preprocessing prepares raw text data for analysis by NLP models. Generally, it distills everyday text (like full sentences) to make it more manageable or concise and meaningful. Techniques include:
While all these techniques can help reduce the size of raw text data, some of these techniques are easier to apply to general use cases than others. Let’s examine how text preprocessing can help us send a large amount of text to ChatGPT.
In the realm of Natural Language Processing (NLP), a token is the basic unit of text that a system reads. At its simplest, you can think of a token as a word, but depending on the language and the specific tokenization method used, a token can represent a word, part of a word, or even multiple words.
While in English we often equate tokens with words, in NLP, the concept is broader. A token can be as short as a single character or as long as a word. For example, with word tokenization, the sentence “Unicode characters such as emojis are not indivisible. ✂️” can be broken down into tokens like this: [“Unicode”, “characters”, “such”, “as”, “emojis”, “are”, “not”, “indivisible”, “.”, “✂️”]
In another form called Byte-Pair Encoding (BPE), the same sentence is tokenized as: [“Un”, “ic”, “ode”, “ characters”, “ such”, “ as”, “ em”, “oj”, “is”, “ are”, “ not”, “ ind”, “iv”, “isible”, “.”, “ �”, “�️”]. The emoji itself is split into tokens containing its underlying bytes.
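You can verify the emoji’s byte-level anatomy yourself – the scissors emoji is one visible character (a code point plus a variation selector) that encodes to six UTF-8 bytes, which is why a byte-level tokenizer can split it apart:

```python
# U+2702 (black scissors) followed by U+FE0F (emoji variation selector)
emoji = "\u2702\ufe0f"  # ✂️

print(emoji.encode("utf-8"))       # b'\xe2\x9c\x82\xef\xb8\x8f'
print(len(emoji.encode("utf-8")))  # 6
```

Token boundaries fall between bytes, not characters, so a single emoji can become several tokens.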
Depending on the ChatGPT model chosen, your text input size is restricted by tokens. Here are the docs containing current limits. BPE is used by ChatGPT to determine token count, and we’ll discuss it more thoroughly later. First, we can programmatically apply some preprocessing techniques to reduce our text input size and use fewer tokens.
For a general approach that can be applied programmatically, pruning is a suitable preprocessing technique. One form is stop word removal, or removing common words that might not add significant meaning in certain contexts. For example, consider the sentence:
“I always enjoy having pizza with my friends on weekends.”
Stop words are often words that don’t carry significant meaning on their own in a given context. In this sentence, words like “I”, “always”, “enjoy”, “having”, “with”, “my”, “on” are considered stop words.
After removing the stop words, the sentence becomes:
“pizza friends weekends.”
Now, the sentence is distilled to its key components, highlighting the main subject (pizza) and the associated context (friends and weekends). If you find yourself wishing you could convince people to do this in real life (coughmeetingscough)… you aren’t alone.
Stop word removal is straightforward to apply programmatically: given a list of stop words, examine some text input to see if it contains any of the stop words on your list. If it does, remove them, then return the altered text.
def clean_stopwords(text: str) -> str:
    stopwords = ["a", "an", "and", "at", "but", "how", "in", "is", "on", "or", "the", "to", "what", "will"]
    tokens = text.split()
    clean_tokens = [t for t in tokens if t not in stopwords]
    return " ".join(clean_tokens)
To see how effective stop word removal can be, I took the entire text of my Tech Leader Docs newsletter (17,230 words consisting of 104,892 characters) and processed it using the above function. How effective was it? The resulting text contained 89,337 characters, which is about a 15% reduction in size.
Other pruning techniques can also be applied programmatically. Removing punctuation, numbers, HTML tags, URLs and email addresses, or non-alphabetical characters are all valid pruning techniques that can be straightforward to apply. Here is a function that does just that:
import re

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Remove everything that's not a letter (a-z, A-Z)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove whitespace, tabs, and new lines
    text = ''.join(text.split())
    return text
What measure of length reduction might we be able to get from this additional processing? Applying these techniques to the remaining characters of Tech Leader Docs results in just 75,217 characters – an overall reduction of about 28% from the original text.
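As a quick sanity check, both percentages follow directly from the character counts above:

```python
original_chars = 104_892   # full newsletter text
after_stopwords = 89_337   # after clean_stopwords
after_pruning = 75_217     # after clean_text

print(f"{(original_chars - after_stopwords) / original_chars:.1%}")  # 14.8%
print(f"{(original_chars - after_pruning) / original_chars:.1%}")    # 28.3%
```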
More opinionated pruning, such as removing short words or specific words or phrases, can be tailored to a specific use case. These don’t lend themselves well to general functions, however.
Now that you have some text processing techniques in your toolkit, let’s look at how a reduction in characters translates to fewer tokens used when it comes to ChatGPT. To understand this, we’ll examine Byte-Pair Encoding.
Byte-Pair Encoding (BPE) is a subword tokenization method. It was originally introduced for data compression but has since been adapted for tokenization in NLP tasks. It allows representing common words as tokens and splits more rare words into subword units. This enables a balance between character-level and word-level tokenization.
Let’s make that more concrete. Imagine you have a big box of LEGO bricks, and each brick represents a single letter or character. You’re tasked with building words using these LEGO bricks. At first, you might start by connecting individual bricks to form words. But over time, you notice that certain combinations of bricks (or characters) keep appearing together frequently, like “th” in “the” or “ing” in “running.”
BPE is like a smart LEGO-building buddy who suggests, “Hey, since ’th’ and ‘ing’ keep appearing together a lot, why don’t we glue them together and treat them as a single piece?” This way, the next time you want to build a word with “the” or “running,” you can use these glued-together pieces, making the process faster and more efficient.
Colloquially, the BPE algorithm looks like this:
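The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a toy illustration of the idea (count adjacent pairs, glue the most frequent pair, repeat) – not the actual byte-level tokenizer ChatGPT uses:

```python
from collections import Counter

def bpe_merges(word: str, num_merges: int = 3):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    symbols = list(word)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Re-scan the sequence, gluing every occurrence of the chosen pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

print(bpe_merges("banana"))  # (['banan', 'a'], ['an', 'ban', 'banan'])
```

Real BPE learns its merges from a large corpus rather than a single word, then applies that fixed merge table to new text.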
BPE is a particularly powerful tokenization method, especially when dealing with diverse and extensive vocabularies.
In essence, BPE strikes a balance, offering the granularity of character-level tokenization and the context-awareness of word-level tokenization. This hybrid approach ensures that NLP models like ChatGPT can understand a wide range of texts while maintaining computational efficiency.
At time of writing, a message to ChatGPT via its web interface has a maximum length of 4,096 tokens. If we take the roughly 28% reduction demonstrated above as an average, this means you could reduce text of up to about 5,712 tokens down to the appropriate size with just text preprocessing.
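The 5,712 figure follows from the measured character reduction, assuming token count shrinks roughly in proportion to character count:

```python
web_limit = 4096                   # ChatGPT web interface token limit
before, after = 104_892, 75_217    # character counts from the preprocessing above

# Largest original input that preprocessing could squeeze under the limit
print(round(web_limit * before / after))  # 5712
```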
What about when this isn’t enough? Beyond text preprocessing, larger input can be sent in chunks using the OpenAI API. In my next post, I’ll show you how to build a Python module that does exactly that.
Most organizations want to improve productivity and output, but few technical teams seem to take a data-driven approach to discovering productivity bottlenecks. If you’re looking to improve development velocity, a couple key metrics could help your team get unblocked. Here’s how you can apply a smidge of data science to visualize how your repository is doing, and where improvements can be made.
The first and most difficult part, as any data scientist would likely tell you, is ensuring the quality of your data. It’s especially important to consider consistency: are dates throughout the dataset presented in a consistent format? Have tags or labels been applied under consistent rules? Does the dataset contain repeated values, empty values, or unmatched types?
If your repository has previously changed up processes or standards, consider the timeframe of the data you collect. If labeling issues is done arbitrarily, those may not be a useful feature. While cleaning data is outside the scope of this article, I can, at least, help you painlessly collect it.
I wrote a straightforward Python utility that uses the GitHub API to pull data for any repository. You can use this on the command line and output the data to a file. It uses the list repository issues endpoint (docs), which, perhaps confusingly, includes both issues and pull requests (PRs) for the repository. I get my data like this:
$ python fetch.py -h
usage: fetch.py [-h] [--token TOKEN] repository months
$ python fetch.py OWASP/wstg 24 > data.json
Using the GitHub API means less worry about standardization – for example, all the dates are expressed in ISO 8601 format. Now that you have some data to process, it’s time to play with Pandas.
You can use a Jupyter Notebook to do some simple calculations and data visualization.
First, create the Notebook file:
touch stats.ipynb
Open the file in your favorite IDE, or in your browser by running `jupyter notebook`.
In the first code cell, import Pandas and load your data:
import pandas as pd
data = pd.read_json("data.json")
data
You can then run that cell to see a preview of the data you collected.
Pandas is a well-documented data analysis library. With a little imagination and a few keyword searches, you can begin to measure all kinds of repository metrics. For this walk-through, here’s how you can calculate and create a graph that shows the number of days an issue or PR remains open in your repository.
Create a new code cell and, for each item in your Series, subtract the date it was closed from the date it was created:
duration = pd.Series(data.closed_at - data.created_at)
duration.describe()
`Series.describe()` will give you some summary statistics that look something like these (from mypy on GitHub):
count 514
mean 5 days 08:04:17.239299610
std 14 days 12:04:22.979308668
min 0 days 00:00:09
25% 0 days 00:47:46.250000
50% 0 days 06:18:47
75% 2 days 20:22:49.250000
max 102 days 20:56:30
`Series.plot()` uses a specified plotting backend (`matplotlib` by default) to visualize your data. A histogram can be a helpful way to examine issue duration:
duration.apply(lambda x: x.days).plot(kind="hist")
This will plot a histogram that represents the frequency distribution of issues over days, which is one way you can tell how long most issues take to close. For example, mypy seems to handle the majority of issues and PRs within 10 days, with some outliers taking more than three months.
It would be interesting to visualize other repository data, such as its most frequent contributors, or most often used labels. Does a relationship exist between the author or reviewers of an issue and how quickly it is resolved? Does the presence of particular labels predict anything about the duration of the issue?
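Both of those explorations are short pandas expressions. Here’s a sketch using hypothetical sample rows shaped like the GitHub issues endpoint’s `user` and `labels` fields:

```python
import pandas as pd

# Hypothetical sample rows; in practice, "user" and "labels" come straight
# from the GitHub issues endpoint's JSON
data = pd.DataFrame({
    "user": [{"login": "alice"}, {"login": "bob"}, {"login": "alice"}],
    "labels": [[{"name": "bug"}], [], [{"name": "bug"}, {"name": "docs"}]],
})

# Most frequent issue/PR authors
top_authors = data["user"].apply(lambda u: u["login"]).value_counts()
print(top_authors)

# Most often used labels (explode the per-issue label lists first)
top_labels = data["labels"].explode().dropna().apply(lambda l: l["name"]).value_counts()
print(top_labels)
```

With the real dataset loaded via `pd.read_json`, the same expressions answer both questions directly.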
Now that you have some data-driven superpowers, remember that they come with great responsibility. Deciding what to measure is just as, if not more, important than measuring it.
Consider how to translate the numbers you gather into productivity improvements. For example, if your metric is closing issues and PRs faster, what actions can you take to encourage the right behavior in your teams? I’d suggest encouraging issues to be clearly defined, and pull requests to be small and have a well-contained scope, making them easier to understand and review.
To prepare to accurately take measurements for your repository, establish consistent standards for labels, tags, milestones, and other features you might want to examine. Remember that meaningful results are more easily gleaned from higher quality data.
Finally, have fun exercising your data science skills. Who knows what you can discover and improve upon next!
The right answer will depend on the goals of your application logic. You want to ensure your Python code doesn’t fail silently, saving you and your teammates from having to hunt down deeply entrenched errors.
Here’s the difference between `raise` and `return` when handling failures in Python.
The `raise` statement allows the programmer to force a specific exception to occur. (8.4 Raising Exceptions)
Use `raise` when you know you want a specific behavior, such as:
raise TypeError("Wanted strawberry, got grape.")
Raising an exception terminates the flow of your program, allowing the exception to bubble up the call stack. In the above example, this would let you explicitly handle `TypeError` later. If `TypeError` goes unhandled, code execution stops and you’ll get an unhandled exception message.
Raise is useful in cases where you want to define a certain behavior to occur. For example, you may choose to disallow certain words in a text field:
if "raisins" in text_field:
    raise ValueError("That word is not allowed here")
Raise takes an instance of an exception, or a derivative of the Exception class. Here are all of Python’s built-in exceptions.
Raise can help you avoid writing functions that fail silently. For example, this code will not raise an exception if `JAM` doesn’t exist:
import os

def sandwich_or_bust(bread: str) -> str:
    jam = os.getenv("JAM")
    return bread + str(jam) + bread

s = sandwich_or_bust("\U0001F35E")
print(s)
# Prints "🍞None🍞" which is not very tasty.
To cause the `sandwich_or_bust()` function to actually bust, add a `raise`:
import os

def sandwich_or_bust(bread: str) -> str:
    jam = os.getenv("JAM")
    if not jam:
        raise ValueError("There is no jam. Sad bread.")
    return bread + str(jam) + bread

s = sandwich_or_bust("\U0001F35E")
print(s)
# ValueError: There is no jam. Sad bread.
Any time your code interacts with an external variable, module, or service, there is a possibility of failure. You can use `raise` in an `if` statement to help ensure those failures aren’t silent.

`try` and `except`

To handle a possible failure by taking an action if there is one, use a `try` … `except` statement.
try:
    s = sandwich_or_bust("\U0001F35E")
    print(s)
except ValueError:
    buy_more_jam()
    raise
This lets you `buy_more_jam()` before re-raising the exception. If you want to propagate a caught exception, use `raise` without arguments to avoid possible loss of the stack trace.
If you don’t know that the exception will be a `ValueError`, you can also use a bare `except:` or catch any derivative of the `Exception` class with `except Exception:`. Whenever possible, it’s better to raise and handle exceptions explicitly.
Use `else` for code to execute if the `try` does not raise an exception. For example:
try:
    s = sandwich_or_bust("\U0001F35E")
    print(s)
except ValueError:
    buy_more_jam()
    raise
else:
    print("Congratulations on your sandwich.")
You could also place the print line within the `try` block; however, this is less explicit.
When you use `return` in Python, you’re giving back a value. A function returns to the location it was called from.
While it’s more idiomatic to `raise` errors in Python, there may be occasions where you find `return` to be more applicable.
For example, if your Python code is interacting with other components that do not handle exception classes, you may want to return a message instead. Here’s an example using a `try` … `except` statement:
from typing import Union

def share_sandwich(sandwich: int) -> Union[float, Exception]:
    try:
        bad_math = sandwich / 0
        return bad_math
    except Exception as e:
        return e

s = share_sandwich(1)
print(s)
# Prints "division by zero"
Note that when you return an `Exception` class object, you’ll get a representation of its associated value, usually the first item in its list of arguments. In the example above, this is the string explanation of the exception. In some cases, it may be a tuple with other information about the exception.
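You can inspect that associated value directly on a caught exception object:

```python
try:
    1 / 0
except Exception as e:
    err = e

print(str(err))   # division by zero
print(err.args)   # ('division by zero',)
```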
You may also use `return` to give a specific error object, such as with `HttpResponseNotFound` in Django. For example, you may want to return a `404` instead of a `403` for security reasons:
if object.owner != request.user:
    return HttpResponseNotFound()
Using `return` can help you write appropriately noisy code when your function is expected to give back a certain value, and when interacting with outside elements.
Silent failures create some of the most frustrating bugs to find and fix. You can help create a pleasant development experience for yourself and your team by using `raise` and `return` to ensure that errors are handled in your Python code.
I write about good development practices and how to improve productivity as a software developer. You can get these tips right in your inbox by signing up below!
The Django framework in particular offers your team the opportunity to create an efficient testing practice. Based on the Python standard library’s `unittest`, proper tests in Django are fast to write, faster to run, and can offer you a seamless continuous integration solution for taking the pulse of your developing application.
With comprehensive tests, developers have higher confidence when pushing changes. I’ve seen firsthand in my own teams that good tests can boost development velocity as a direct result of a better developer experience.
In this article, I’ll share my own experiences in building useful tests for Django applications, from the basics to the best possible execution. If you’re using Django or building with it in your organization, you might like to read the rest of my Django series.
Tests are extremely important. Far beyond simply letting you know if a function works, tests can form the basis of your team’s understanding of how your application is intended to work.
Here’s the main goal: if you hit your head and forgot everything about how your application works tomorrow, you should be able to regain most of your understanding by reading and running the tests you write today.
Here are some questions that may be helpful to ask as you decide what to test:
Tests that make sense for your application can help build developer confidence. With these sensible safeguards in place, developers make improvements more readily, and feel confident introducing innovative solutions to product needs. The result is an application that comes together faster, and features that are shipped often and with confidence.
If you only have a few tests, you may organize your test files similarly to Django’s default app template by putting them all in a file called `tests.py`. This straightforward approach is best for smaller applications.
As your application grows, you may like to split your tests into different files, or test modules. One method is to use a directory to organize your files, such as `projectroot/app/tests/`. The name of each test file within that directory should begin with `test`, for example, `test_models.py`.
Besides being aptly named, these files will be found by Django’s built-in test discovery, which is based on the `unittest` module. All files in your application with names that begin with `test` will be collected into a test suite.
This convenient test discovery allows you to place test files anywhere that makes sense for your application. As long as they’re correctly named, Django’s test utility can find and run them.
Use docstrings to explain what a test is intended to verify at a high level. For example:
def test_create_user(self):
    """Creating a new user object should also create an associated profile object"""
    # ...
These docstrings help you quickly understand what a test is supposed to be doing. Besides navigating the codebase, this helps to make it obvious when a test doesn’t verify what the docstring says it should.
Docstrings are also shown when the tests are being run, which can be helpful for logging and debugging.
Django tests can be quickly set up using data created in the `setUpTestData()` method. You can use various approaches to create your test data, such as utilizing external files, or even hard-coding silly phrases or the names of your staff. Personally, I much prefer to use a fake-data-generation library, such as `faker`.
The proper set up of arbitrary testing data can help you ensure that you’re testing your application’s functionality instead of accidentally testing your test data. Because generators like `faker` add some degree of unexpectedness to your inputs, your tests can be more representative of real-world use.
Here is an example set up for a test:
from django.test import TestCase
from faker import Faker
from app.models import MyModel, AnotherModel

fake = Faker()

class MyModelTest(TestCase):
    @classmethod
    def setUpTestData(cls):
        """Quickly set up data for the whole TestCase"""
        cls.user_first = fake.first_name()
        cls.user_last = fake.last_name()

    def test_create_models(self):
        """Creating a MyModel object should also create AnotherModel object"""
        # In test methods, use the variables created above
        test_object = MyModel.objects.create(
            first_name=self.user_first,
            last_name=self.user_last,
            # ...
        )
        another_model = AnotherModel.objects.get(my_model=test_object)
        self.assertEqual(another_model.first_name, self.user_first)
        # ...
Tests pass or fail based on the outcome of the assertion methods. You can use Python’s `unittest` methods, and Django’s assertion methods.
For further guidance on writing tests, see Testing in Django.
Django’s test suite is manually run with:
./manage.py test
I rarely run my Django tests this way.
The best, or most efficient, testing practice is one that occurs without you or your developers ever thinking, “I need to run the tests first.” The beauty of Django’s near-effortless test suite set up is that it can be seamlessly run as a part of regular developer activities. This could be in a pre-commit hook, or in a continuous integration or deployment workflow.
I’ve previously written about how to use pre-commit hooks to improve your developer ergonomics and save your team some brainpower. Django’s speedy tests can be run this way, and they become especially efficient if you can run tests in parallel.
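To run Django’s tests from a pre-commit hook, a local hook entry along these lines could be added to `.pre-commit-config.yaml` (a hypothetical sketch – adjust the entry command and options to your project):

```
- repo: local
  hooks:
    - id: django-tests
      name: django tests
      entry: python manage.py test --parallel --failfast
      language: system
      pass_filenames: false
      types: [python]
```

`--failfast` keeps the commit loop snappy by stopping at the first failure; for slower suites, you may prefer to leave the full run to CI.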
Tests that run as part of a CI/CD workflow, for example, on pull requests with GitHub Actions, require no regular effort from your developers to remember to run tests at all. I’m not sure how plainly I can put it – this one’s literally a no-brainer.
Tests are extremely important, and underappreciated. They can catch logical errors in your application. They can help explain and validate how concepts and features of your product actually function. Best of all, tests can boost developer confidence and development velocity as a result.
The best tests are ones that are relevant, help to explain and define your application, and are run continuously without a second thought. I hope I’ve now shown you how testing in Django can help you to achieve these goals for your team!
I’ve been developing with Django for years, and I’ve never been happier with my Django project set up than I am right now. Here’s how I’m making a day of developing with Django the most relaxing and enjoyable development experience possible for myself and my engineering team.
Instead of typing:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
python3 manage.py makemigrations
python3 manage.py migrate
python3 manage.py collectstatic
python3 manage.py runserver
Wouldn’t it be much nicer to type:
make start
…and have all that happen for you? I think so!
We can do that with a self-documenting Makefile! Here’s one I frequently use when developing my Django applications, like ApplyByAPI.com:
VENV := env
BIN := $(VENV)/bin
PYTHON := $(BIN)/python
SHELL := /bin/bash

include .env

.PHONY: help
help: ## Show this help
	@egrep -h '\s##\s' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

.PHONY: venv
venv: ## Make a new virtual environment
	python3 -m venv $(VENV) && source $(BIN)/activate

.PHONY: install
install: venv ## Make venv and install requirements
	$(BIN)/pip install --upgrade -r requirements.txt

freeze: ## Pin current dependencies
	$(BIN)/pip freeze > requirements.txt

migrate: ## Make and run migrations
	$(PYTHON) manage.py makemigrations
	$(PYTHON) manage.py migrate

db-up: ## Pull and start the Docker Postgres container in the background
	docker pull postgres
	docker-compose up -d

db-shell: ## Access the Postgres Docker database interactively with psql. Pass in DBNAME=<name>.
	docker exec -it container_name psql -d $(DBNAME)

.PHONY: test
test: ## Run tests
	$(PYTHON) manage.py test application --verbosity=0 --parallel --failfast

.PHONY: run
run: ## Run the Django server
	$(PYTHON) manage.py runserver

start: install migrate run ## Install requirements, apply migrations, then start development server
You’ll notice the presence of the line `include .env` above. This ensures `make` has access to environment variables stored in a file called `.env`. This allows Make to utilize these variables in its commands, for example, the name of my virtual environment, or to pass in `$(DBNAME)` to `psql`.
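For reference, the included `.env` file is just simple variable assignments (hypothetical values shown – keep this file out of version control):

```
DBNAME=applybyapi
DEBUG=True
```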
What’s with that weird “##” comment syntax? A Makefile like this gives you a handy suite of command-line aliases you can check in to your Django project. It’s very useful, so long as you’re able to remember what all those aliases are.
The `help` command above, which runs by default, prints a helpful list of available commands when you run `make` or `make help`:
help Show this help
venv Make a new virtual environment
install Make venv and install requirements
migrate Make and run migrations
db-up Pull and start the Docker Postgres container in the background
db-shell Access the Postgres Docker database interactively with psql
test Run tests
run Run the Django server
start Install requirements, apply migrations, then start development server
All the usual Django commands are covered, and we’ve got a `test` command that runs our tests with the options we prefer. Brilliant.
You can read my full post about self-documenting Makefiles here, which also includes an example Makefile using `pipenv`.
I previously wrote about some technical ergonomics that can make it a lot easier for teams to develop great software.
One area that’s a no-brainer is using pre-commit hooks to lint code prior to checking it in. This helps to ensure the quality of the code your developers check in, but most importantly, ensures that no one on your team is spending time trying to remember if it should be single or double quotes or where to put a line break.
The confusingly-named pre-commit framework is an otherwise fantastic way to keep hooks (which are not included in cloned repositories) consistent across local environments.
Here is my configuration file, .pre-commit-config.yaml
, for my Django projects:
fail_fast: true
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.1.0
    hooks:
      - id: detect-aws-credentials
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/asottile/blacken-docs
    rev: v1.7.0
    hooks:
      - id: blacken-docs
        additional_dependencies: [black==19.3b0]
  - repo: local
    hooks:
      - id: markdownlint
        name: markdownlint
        description: "Lint Markdown files"
        entry: markdownlint '**/*.md' --fix --ignore node_modules --config "./.markdownlint.json"
        language: node
        types: [markdown]
These hooks check for accidental secret commits, format Python files using Black, format Python snippets in Markdown files using blacken-docs
, and lint Markdown files as well. To install them, just type pre-commit install
.
There are likely even more useful hooks available for your particular use case: see supported hooks to explore.
An underappreciated way to improve your team’s daily development experience is to make sure your project uses a well-rounded .gitignore
file. It can help prevent files containing secrets from being committed, and can additionally save developers hours of tedium by ensuring you’re never sifting through a git diff
of generated files.
To efficiently create a gitignore for Python and Django projects, Toptal’s gitignore.io can be a nice resource for generating a robust .gitignore
file.
I still recommend examining the generated results yourself to ensure that ignored files suit your use case, and that nothing you want ignored is commented out.
If your team works on GitHub, setting up a testing process with Actions is low-hanging fruit.
Tests that run in a consistent environment on every pull request can help eliminate “works on my machine” conundrums, as well as ensure no one’s sitting around waiting for a test to run locally.
A hosted CI environment like GitHub Actions can also help when running integration tests that require using managed services resources. You can use encrypted secrets in a repository to grant the Actions runner access to resources in a testing environment, without worrying about creating testing resources and access keys for each of your developers to use.
I’ve written on many occasions about setting up Actions workflows, including using one to run your Makefile, and how to integrate GitHub event data. GitHub even interviewed me about Actions once.
For Django projects, here’s a GitHub Actions workflow that runs tests with a consistent Python version whenever someone opens a pull request in the repository.
name: Run Django tests
on: pull_request

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Install dependencies
        run: make install
      - name: Run tests
        run: make test
For the installation and test commands, I’ve simply utilized the Makefile that’s been checked in to the repository. A benefit of using your Makefile commands in your CI test workflows is that you only need to keep them updated in one place – your Makefile! No more “why is this working locally but not in CI??!?” headaches.
If you want to step up your security game, you can add Django Security Check as an Action too.
Want to help keep your development team happy? Set them up for success with these best practices for Django development. Remember, an ounce of brainpower is worth a pound of software!
---

Django models help humans work with data in a way that makes sense to our brains, and the framework offers plenty of classes you can inherit to help you rapidly develop a robust application from scratch. As for developing on existing Django applications, there’s a feature for that, too. In this article, we’ll cover how to use Django migrations to update your existing models and database.
Django migrations are Python files that help you add and change things in your database tables to reflect changes in your Django models. To understand how Django migrations help you work with data, it may be helpful to understand the underlying structures we’re working with.
If you’ve laid eyes on a spreadsheet before, you’re already most of the way to understanding a database table. In a relational database, for example, a PostgreSQL database, you can expect to see data organized into columns and rows. A relational database table may have a set number of columns and any number of rows.
In Django, each model is its own table. For example, here’s a Django model:
from django.db import models
class Lunch(models.Model):
    left_side = models.CharField(max_length=100, null=True)
    center = models.CharField(max_length=100, null=True)
    right_side = models.CharField(max_length=100, null=True)
Each field is a column, and each row is a Django object instance of that model. Here’s a representation of a database table for the Django model “Lunch” above. In the database, its name would be lunch_table
.
id | left_side | center | right_side |
---|---|---|---|
1 | Fork | Plate | Spoon |
The model Lunch
has three fields: left_side
, center
, and right_side
. One instance of a Lunch
object would have “Fork” for the left_side
, a “Plate” for the center
, and “Spoon” for the right_side
. Django automatically adds an id
field if you don’t specify a primary key.
If you wanted to change the name of your Lunch model, you would do so in your models.py
code. For example, change “Lunch” to “Dinner,” then run python manage.py makemigrations
. You’ll see:
python manage.py makemigrations
Did you rename the backend.Lunch model to Dinner? [y/N] y
Migrations for 'backend':
backend/migrations/0003_auto_20200922_2331.py
- Rename model Lunch to Dinner
Django automatically generates the appropriate migration files. The relevant line of the generated migrations file in this case would look like:
migrations.RenameModel(old_name="Lunch", new_name="Dinner"),
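For context, Django wraps operations like this in a `Migration` class inside the generated file. The sketch below is illustrative rather than the verbatim generated file; the dependency name is a placeholder for whatever migration preceded it:

```python
# backend/migrations/0003_auto_20200922_2331.py (illustrative sketch)
from django.db import migrations


class Migration(migrations.Migration):

    # Placeholder: points at the migration that came before this one
    dependencies = [
        ("backend", "0002_previous"),
    ]

    operations = [
        migrations.RenameModel(old_name="Lunch", new_name="Dinner"),
    ]
```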
This operation would rename our “Lunch” model to “Dinner” while keeping everything else the same. But what if you also wanted to change the structure of the database table itself, its schema, as well as make sure that existing data ends up in the right place on your Dinner table?
Let’s explore how to turn our Lunch model into a Dinner model that looks like this:
from django.db import models
class Dinner(models.Model):
    top_left = models.CharField(max_length=100, null=True)
    top_center = models.CharField(max_length=100, null=True)
    top_right = models.CharField(max_length=100, null=True)
    bottom_left = models.CharField(max_length=100, null=True)
    bottom_center = models.CharField(max_length=100, null=True)
    bottom_right = models.CharField(max_length=100, null=True)
…with a database table that would look like this:
id | top_left | top_center | top_right | bottom_left | bottom_center | bottom_right |
---|---|---|---|---|---|---|
1 | Bread plate | Spoon | Glass | Fork | Plate | Knife |
Before you begin to manipulate your data, it’s always a good idea to create a backup of your database that you can restore in case something goes wrong. There are various ways to do this depending on the database you’re using. You can typically find instructions by searching for <your database name>
and keywords like backup
, recovery
, or snapshot
.
In order to design your migration, it’s helpful to become familiar with the available migration operations. Migrations are run step-by-step, and each operation is some flavor of adding, removing, or altering data. Like a strategic puzzle, it’s important to make model changes one step at a time so that the generated migrations have the correct result.
We’ve already renamed our model successfully. Now, we’ll rename the fields that hold the data we want to retain:
class Dinner(models.Model):
    bottom_left = models.CharField(max_length=100, null=True)
    bottom_center = models.CharField(max_length=100, null=True)
    top_center = models.CharField(max_length=100, null=True)
Django is sometimes smart enough to determine the old and new field names correctly. You’ll be asked for confirmation:
python manage.py makemigrations
Did you rename dinner.center to dinner.bottom_center (a CharField)? [y/N] y
Did you rename dinner.left_side to dinner.bottom_left (a CharField)? [y/N] y
Did you rename dinner.right_side to dinner.top_center (a CharField)? [y/N] y
Migrations for 'backend':
backend/migrations/0004_auto_20200914_2345.py
- Rename field center on dinner to bottom_center
- Rename field left_side on dinner to bottom_left
- Rename field right_side on dinner to top_center
In some cases, you’ll want to try renaming the field and running makemigrations
one at a time.
Now that the existing fields have been migrated to their new names, add the remaining fields to the model:
class Dinner(models.Model):
    top_left = models.CharField(max_length=100, null=True)
    top_center = models.CharField(max_length=100, null=True)
    top_right = models.CharField(max_length=100, null=True)
    bottom_left = models.CharField(max_length=100, null=True)
    bottom_center = models.CharField(max_length=100, null=True)
    bottom_right = models.CharField(max_length=100, null=True)
Running makemigrations
again now gives us:
python manage.py makemigrations
Migrations for 'backend':
backend/migrations/0005_auto_20200914_2351.py
- Add field bottom_right to dinner
- Add field top_left to dinner
- Add field top_right to dinner
You’re done! By generating Django migrations, you’ve successfully set up your dinner_table
and moved existing data to its new spot.
You’ll notice that our Lunch and Dinner models are not very complex. Out of Django’s many model field options, we’re just using CharField
. We also set null=True
to let Django store empty values as NULL
in the database.
Django migrations can handle additional complexity, such as changing field types, and whether a blank or null value is permitted. I keep Django’s model field reference handy as I work with varying types of data and different use cases.
I hope this article has helped you better understand Django migrations and how they work!
Now that you can change models and manipulate existing data in your Django application, be sure to use your powers wisely! Backup your database, research and plan your migrations, and always run tests before working with customer data. By doing so, you have the potential to enable your application to grow – with manageable levels of complexity.
---

Understanding these main features gives you the building blocks for maximizing development efficiency with Django. They’ll build the foundation for you to test efficiently and create an awesome development experience for your engineers. Let’s look at how these tools let you create a performant Django application that’s pleasant to build and maintain.
Remember that Django is all Python under the hood. When it comes to views, you’ve got two choices: view functions (sometimes called “function-based views”), or class-based views.
Years ago when I first built ApplyByAPI, it was initially composed entirely of function-based views. These offer granular control, and are good for implementing complex logic; just as in a Python function, you have complete control (for better or worse) over what the view does. With great control comes great responsibility, and function-based views can be a little tedious to use. You’re responsible for writing all the necessary methods for the view to work - this is what allows you to completely tailor your application.
In the case of ApplyByAPI, there were only a sparse few places where that level of tailored functionality was really necessary. Everywhere else, function-based views began making my life harder. Writing what is essentially a custom view for run-of-the-mill operations like displaying data on a list page became tedious, repetitive, and error-prone.
With function-based views, you’ll need to figure out which Django methods to implement in order to handle requests and pass data to views. Unit tests can take some work to write as well. In short, the granular control that function-based views offer also requires some granular tedium to properly implement.
I ended up holding back ApplyByAPI while I refactored the majority of the views into class-based views. This was not a small amount of work and refactoring, but when it was done, I had a bunch of tiny views that made a huge difference. I mean, just look at this one:
class ApplicationsList(ListView):
    model = Application
    template_name = "applications.html"
It’s three lines. My developer ergonomics, and my life, got a lot easier.
You may think of class-based views as templates that cover most of the functionality any app needs. There are views for displaying lists of things, for viewing a thing in detail, and editing views for performing CRUD (Create, Read, Update, Delete) operations. Because implementing one of these generic views takes only a few lines of code, my application logic became dramatically more succinct. This gave me less repeated code, fewer places for something to go wrong, and a more manageable application in general.
Class-based views are fast to implement and use. The built-in class-based generic views may require less work to test, since you don’t need to write tests for the base view Django provides. (Django does its own tests for that; no need for your app to double-check.) To tweak a generic view to your needs, you can subclass a generic view and override attributes or methods. In my case, since I only needed to write tests for any customizations I added, my test files became dramatically shorter, as did the time and resources it took to run them.
When you’re weighing the choice between function-based or class-based views, consider the amount of customization the view needs, and the future work that will be necessary to test and maintain it. If the logic is common, you may be able to hit the ground running with a generic class-based view. If you need sufficient granularity that re-writing a base view’s methods would make it overly complicated, consider a function-based view instead.
Models organize your Django application’s central concepts to help make them flexible, robust, and easy to work with. If used wisely, models are a powerful way to collate your data into a definitive source of truth.
Like views, Django provides some built-in model types for the convenience of implementing basic authentication, including the User and Permission models. For everything else, you can create a model that reflects your concept by inheriting from a parent Model class.
class StaffMember(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    company = models.OneToOneField(Company, on_delete=models.CASCADE)

    def __str__(self):
        return self.company.name + " - " + self.user.email
When you create a custom model in Django, you subclass Django’s Model class and take advantage of all its power. Each model you create generally maps to a database table. Each attribute is a database field. This gives you the ability to create objects to work with that humans can better understand.
You can make a model useful to you by defining its fields. Many built-in field types are conveniently provided. These help Django figure out the data type, the HTML widget to use when rendering a form, and even form validation requirements. If you need to, you can write custom model fields.
Database relationships can be defined using a ForeignKey field (many-to-one), or a ManyToManyField (give you three guesses). If those don’t suffice, there’s also a OneToOneField. Together, these allow you to define relations between your models with levels of complexity limited only by your imagination. (Depending on the imagination you have, this may or may not be an advantage.)
Use your model’s Manager (objects
by default) to construct a QuerySet. This is a representation of objects in your database that you can refine, using methods, to retrieve specific subsets. All available methods are in the QuerySet API and can be chained together for even more fun.
Post.objects.filter(
    type="new"
).exclude(
    title__startswith="Blockchain"
)
Some methods return new QuerySets, such as filter()
, or exclude()
. Chaining these can give you powerful queries without affecting performance, as QuerySets aren’t fetched from the database until they are evaluated. Methods that evaluate a QuerySet include get()
, count()
, len()
, list()
, or bool()
.
Iterating over a QuerySet also evaluates it, so avoid doing so where possible to improve query performance. For instance, if you just want to know if an object is present, you can use exists()
to avoid iterating over database objects.
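QuerySet laziness has a plain-Python analogue in generators. Here’s a toy sketch (no Django involved) showing that building a pipeline does no work until you evaluate it:

```python
# Plain Python, no Django: generators are lazy in the same spirit as QuerySets
work_done = []

def fetch_rows():
    for i in range(5):
        work_done.append(i)  # side effect stands in for hitting the database
        yield i

rows = fetch_rows()                      # like building a QuerySet: no work yet
evens = (r for r in rows if r % 2 == 0)  # like chaining .filter(): still no work
print(len(work_done))                    # 0 -- nothing "queried" yet
result = list(evens)                     # evaluation, like list(queryset)
print(result, len(work_done))            # [0, 2, 4] 5
```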
Use get()
in cases where you want to retrieve a specific object. This method raises MultipleObjectsReturned
if something unexpected happens, as well as the DoesNotExist
exception, if, take a guess.
If you’d like to get an object that may not exist in the context of a user’s request, use the convenient get_object_or_404()
or get_list_or_404()
which raises Http404
instead of DoesNotExist
. These helpful shortcuts are suited to just this purpose. To fetch an object, creating it first if it doesn’t already exist, there’s also the convenient get_or_create()
.
You’ve now got a handle on these three essential tools for building your efficient Django application – congratulations! You can make Django work even better for you by learning about manipulating data with migrations, testing effectively, and setting up your team’s Django development for maximum happiness.
If you’re going to build on GitHub, you may like to set up my django-security-check GitHub Action. In the meantime, you’re well on your way to building a beautiful software project.
---

Multithreading in Python is a bit of a bitey subject (not sorry) in that the Python interpreter doesn’t actually let multiple threads execute at the same time. Python’s Global Interpreter Lock, or GIL, prevents multiple threads from executing Python bytecodes at once. Each thread that wants to execute must first wait for the GIL to be released by the currently executing thread. The GIL is pretty much the microphone in a low-budget conference panel, except where no one gets to shout.
This has the advantage of preventing race conditions. It does, however, lack the performance advantages afforded by running multiple tasks in parallel. (If you’d like a refresher on concurrency, parallelism, and multithreading, see Concurrency, parallelism, and the many threads of Santa Claus.) While I prefer Go for its convenient first-class primitives that support concurrency (see Goroutines), this project’s recipients were more comfortable with Python. I took it as an opportunity to test and explore!
Simultaneously performing multiple tasks in Python isn’t impossible; it just takes a little extra work. For Hydra, the main advantage is in overcoming the input/output (I/O) bottleneck.
In order to get web pages to check, Hydra needs to go out to the Internet and fetch them. When compared to tasks that are performed by the CPU alone, going out over the network is comparatively slower. How slow?
Here are approximate timings for tasks performed on a typical PC:
Type | Task | Time |
---|---|---|
CPU | execute typical instruction | 1/1,000,000,000 sec = 1 nanosec |
CPU | fetch from L1 cache memory | 0.5 nanosec |
CPU | branch misprediction | 5 nanosec |
CPU | fetch from L2 cache memory | 7 nanosec |
RAM | Mutex lock/unlock | 25 nanosec |
RAM | fetch from main memory | 100 nanosec |
Network | send 2K bytes over 1Gbps network | 20,000 nanosec |
RAM | read 1MB sequentially from memory | 250,000 nanosec |
Disk | fetch from new disk location (seek) | 8,000,000 nanosec (8ms) |
Disk | read 1MB sequentially from disk | 20,000,000 nanosec (20ms) |
Network | send packet US to Europe and back | 150,000,000 nanosec (150ms) |
Peter Norvig first published these numbers some years ago in Teach Yourself Programming in Ten Years. Since computers and their components change year over year, the exact numbers shown above aren’t the point. What these numbers help to illustrate is the difference, in orders of magnitude, between operations.
Compare the difference between fetching from main memory and sending a simple packet over the Internet. While both these operations occur in less than the blink of an eye (literally) from a human perspective, you can see that sending a simple packet over the Internet is over a million times slower than fetching from RAM. It’s a difference that, in a single-thread program, can quickly accumulate to form troublesome bottlenecks.
In Hydra, the task of parsing response data and assembling results into a report is relatively fast, since it all happens on the CPU. The slowest portion of the program’s execution, by over six orders of magnitude, is network latency. Not only does Hydra need to fetch packets, but whole web pages! One way of improving Hydra’s performance is to find a way for the page fetching tasks to execute without blocking the main thread.
Python has a couple options for doing tasks in parallel: multiple processes, or multiple threads. These methods allow you to circumvent the GIL and speed up execution in a couple different ways.
To execute parallel tasks using multiple processes, you can use Python’s ProcessPoolExecutor
. A concrete subclass of Executor
from the concurrent.futures
module, ProcessPoolExecutor
uses a pool of processes spawned with the multiprocessing
module to avoid the GIL.
This option uses a pool of worker subprocesses whose size defaults to the number of processors on the machine. The multiprocessing
module lets you parallelize function execution across processes, which can really speed up compute-bound (or CPU-bound) tasks.
Since the main bottleneck for Hydra is I/O and not the processing to be done by the CPU, I’m better served by using multiple threads.
Fittingly named, Python’s ThreadPoolExecutor
uses a pool of threads to execute asynchronous tasks. Also a subclass of Executor
, it uses a defined number of maximum worker threads (at least five by default, according to the formula min(32, os.cpu_count() + 4)
) and reuses idle threads before starting new ones, making it pretty efficient.
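To see the shape of the win without Hydra’s full machinery, here’s a minimal, self-contained sketch; the URLs and `fetch` function are stand-ins, with `time.sleep` simulating the network wait:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Hypothetical stand-in for a network request; sleep simulates I/O wait
    time.sleep(0.1)
    return f"fetched {url}"

urls = [f"https://example.com/page/{i}" for i in range(10)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

# The ten 0.1-second waits overlap, so the whole batch takes
# roughly 0.1s instead of the ~1s a single thread would need
print(len(results))
```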
Here is a snippet of Hydra with comments showing how Hydra uses ThreadPoolExecutor
to achieve parallel multithreaded bliss:
# Create the Checker class
class Checker:
    # Queue of links to be checked
    TO_PROCESS = Queue()
    # Maximum workers to run
    THREADS = 100
    # Maximum seconds to wait for HTTP response
    TIMEOUT = 60

    def __init__(self, url):
        ...
        # Create the thread pool
        self.pool = futures.ThreadPoolExecutor(max_workers=self.THREADS)

    def run(self):
        # Run until the TO_PROCESS queue is empty
        while True:
            try:
                target_url = self.TO_PROCESS.get(block=True, timeout=2)
                # If we haven't already checked this link
                if target_url["url"] not in self.visited:
                    # Mark it as visited
                    self.visited.add(target_url["url"])
                    # Submit the link to the pool
                    job = self.pool.submit(self.load_url, target_url, self.TIMEOUT)
                    job.add_done_callback(self.handle_future)
            except Empty:
                return
            except Exception as e:
                print(e)
You can view the full code in Hydra’s GitHub repository.
If you’d like to see the full effect, I compared the run times for checking my website between a prototype single-thread program and the ~~multiheaded~~ multithreaded Hydra.
time python3 slow-link-check.py https://victoria.dev
real 17m34.084s
user 11m40.761s
sys 0m5.436s
time python3 hydra.py https://victoria.dev
real 0m15.729s
user 0m11.071s
sys 0m2.526s
The single-thread program, which blocks on I/O, ran in about seventeen minutes. When I first ran the multithreaded version, it finished in 1m13.358s - after some profiling and tuning, it took a little under sixteen seconds. Again, the exact times don’t mean all that much; they’ll vary depending on factors such as the size of the site being crawled, your network speed, and your program’s balance between the overhead of thread management and the benefits of parallelism.
The more important thing, and the result I’ll take any day, is a program that runs some orders of magnitude faster.
---

I’ve been helping out a group called the Open Web Application Security Project (OWASP). They’re a non-profit foundation that produces some of the foremost application testing guides and cybersecurity resources. OWASP’s publications, checklists, and reference materials are a help to security professionals, penetration testers, and developers all over the world. Most of the individual teams that create these materials are run almost entirely by volunteers.
OWASP is a great group doing important work. I’ve seen this firsthand as part of the core team that produces the Web Security Testing Guide. However, while OWASP inspires in its large volunteer base, it lacks in the area of central organization.
This lack of organization was most recently apparent in the group’s website, OWASP.org. A big organization with an even bigger website to match, OWASP.org enjoys hundreds of thousands of visitors. Unfortunately, many of its pages - individually managed by disparate projects - are infrequently updated. Some are abandoned. The website as a whole lacks a centralized quality assurance process, and as a result, OWASP.org is peppered with broken links.
Customers don’t like broken links; attackers really do. That’s because broken links are a security vulnerability. Broken links can signal opportunities for attacks like broken link hijacking and subdomain takeovers. At their least effective, these attacks can be embarrassing; at their worst, severely damaging to businesses and organizations. One OWASP group, the Application Security Verification Standard (ASVS) project, writes about integrity controls that can help to mitigate the likelihood of these attacks. This knowledge, unfortunately, has not yet propagated throughout the rest of OWASP.
This is the story of how I created a fast and efficient tool to help OWASP solve this problem.
I took on the task of creating a program that could run as part of a CI/CD process to detect and report broken links. Essentially, I needed to build a web crawler.
My original journey through this process was also in Python, as that was a comfortable language choice for everyone in the OWASP group. Personally, I prefer to use Go for higher performance as it offers more convenient concurrency primitives. Between the task and this talk, I wrote three programs: a prototype single-thread Python program, a multithreaded Python program, and a Go program using goroutines. We’ll see a comparison of how each worked out near the end of the talk - first, let’s explore how to build a web crawler.
Here’s what our web crawler will need to do:
- Start from a given URL (https://victoria.dev)
- Only check links on the same site (https://victoria.dev and not https://github.com, for instance)

Here’s what the execution flow will look like:
As you can see, the nodes “GET page” -> “HTML” -> “Parse links” -> “Valid link” -> “Check visited” all form a loop. These are what enable our web crawler to continue crawling until all the links on the site have been accounted for in the “Check visited” node. When the crawler encounters links it’s already checked, it will “Stop.” This loop will become more important in a moment.
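The loop above can be sketched in a few lines of Python. The `site` dict below is a made-up stand-in for real pages, with a linked-but-missing page playing the part of a broken link:

```python
from collections import deque

# Made-up link graph standing in for a website; "/missing" is linked
# to but has no page behind it, playing the part of a broken link
site = {
    "/": ["/about", "/posts"],
    "/about": ["/"],
    "/posts": ["/about", "/missing"],
}

visited = set()
broken = []
to_process = deque(["/"])

while to_process:
    url = to_process.popleft()
    if url in visited:        # "Check visited" -> Stop for this link
        continue
    visited.add(url)
    links = site.get(url)     # "GET page" -> "HTML" -> "Parse links"
    if links is None:         # no page behind the link: it's broken
        broken.append(url)
        continue
    to_process.extend(links)  # feed new links back into the loop

print(sorted(visited), broken)
```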
For now, the question on everyone’s mind (I hope): how do we make it fast?
Here are some approximate timings for tasks performed on a typical PC:
Type | Task | Time |
---|---|---|
CPU | execute typical instruction | 1/1,000,000,000 sec = 1 nanosec |
CPU | fetch from L1 cache memory | 0.5 nanosec |
CPU | branch misprediction | 5 nanosec |
CPU | fetch from L2 cache memory | 7 nanosec |
RAM | Mutex lock/unlock | 25 nanosec |
RAM | fetch from main memory | 100 nanosec |
RAM | read 1MB sequentially from memory | 250,000 nanosec |
Disk | fetch from new disk location (seek) | 8,000,000 nanosec (8ms) |
Disk | read 1MB sequentially from disk | 20,000,000 nanosec (20ms) |
Network | send packet US to Europe and back | 150,000,000 nanosec (150ms) |
Peter Norvig first published these numbers some years ago in Teach Yourself Programming in Ten Years. They typically crop up now and then in articles titled along the lines of, “Latency numbers every developer should know.”
Since computers and their components change year over year, the exact numbers shown above aren’t the point. What these numbers help to illustrate is the difference, in orders of magnitude, between operations.
Compare the difference between fetching from main memory and sending a simple packet over the Internet. While both these operations occur in less than the blink of an eye (literally) from a human perspective, you can see that sending a simple packet over the Internet is over a million times slower than fetching from RAM. It’s a difference that, in a single-thread program, can quickly accumulate to form troublesome bottlenecks.
The numbers above mean that the difference in time it takes to send something over the Internet compared to fetching data from main memory is over six orders of magnitude. Remember the loop in our execution chart? The “GET page” node, in which our crawler fetches page data over the network, is going to be a million times slower than the next slowest thing in the loop!
We don’t need to run our prototype to see what that means in practical terms; we can estimate it. Let’s take OWASP.org, which has upwards of 12,000 links, as an example:
150 milliseconds
x 12,000 links
---------
1,800,000 milliseconds (30 minutes)
A whole half hour, just for the network tasks. It may even be much slower than that, since web pages are frequently much larger than a packet. This means that in our single-thread prototype web crawler, our biggest bottleneck is network latency. Why is this problematic?
I previously wrote about feedback loops. In essence, in order to improve at doing anything, you first need to be able to get feedback from your last attempt. That way, you have the necessary information to make adjustments and get closer to your goal on your next iteration.
As a software developer, bottlenecks can contribute to long and inefficient feedback loops. If I’m waiting on a process that’s part of a CI/CD pipeline, in our bottlenecked web crawler example, I’d be sitting around for a minimum of a half hour before learning whether or not changes in my last push were successful, or whether they broke master
(hopefully staging
).
Multiply a slow and inefficient feedback loop by many runs per day, over many days, and you’ve got a slow and inefficient developer. Multiply that by many developers in an organization bottlenecked on the same process, and you’ve got a slow and inefficient company.
To add insult to injury, not only are you waiting on a bottlenecked process to run; you’re also paying to wait. Take the serverless example - AWS Lambda, for instance. Here’s a chart showing the cost of functions by compute time and CPU usage.
Again, the numbers change over the years, but the main concepts remain the same: the bigger the function and the longer its compute time, the bigger the cost. For applications taking advantage of serverless, these costs can add up dramatically.
Bottlenecks are a recipe for failure, for both productivity and the bottom line.
The good news is that bottlenecks are mostly unnecessary. If we know how to identify them, we can strategize our way out of them. To understand how, let’s get some tacos.
Everyone, meet Bob. He’s a gopher who works at the taco stand down the street as the cashier. Say “Hi,” Bob.
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 ╔══════════════╗
🌮 Hi I'm Bob 🌳
🌮 ╚══════════════╝ \
🌮 🐹 🌮
🌮
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
Bob works very hard at being a cashier, but he’s still just one gopher. The customers who frequent Bob’s taco stand can eat tacos really quickly; but in order to get the tacos to eat them, they’ve got to order them through Bob. Here’s what our bottlenecked, single-thread taco stand currently looks like:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵🧑💵
🌮
🌮
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
As you can see, all the customers are queued up, right out the door. Poor Bob handles one customer’s transaction at a time, starting and finishing with that customer completely before moving on to the next. Bob can only do so much, so our taco stand is rather inefficient at the moment. How can we make Bob faster?
We can try splitting the queue:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮 🐹
🌮
🌮 🧑💵🧑💵🧑💵🧑💵🧑💵
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
Now Bob can do some multitasking. For example, he can start a transaction with a customer in one queue; then, while that customer counts their bills, Bob can pop over to the second queue and get started there. This arrangement, known as a concurrency model, helps Bob go a little bit faster by jumping back and forth between lines. However, it’s still just one Bob, which limits our improvement possibilities. If we were to make four queues, they’d all be shorter; but Bob would be very thinly stretched between them. Can we do better?
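The single-Bob, two-queue arrangement can be sketched in a few lines of Python (the queue contents and round-robin step are illustrative, not from the original post): one worker hops back and forth, taking one step of a transaction from each line in turn.

```python
from collections import deque

def serve_concurrently(queue_a, queue_b):
    """One worker round-robins between two queues, interleaving service."""
    order = []
    queues = deque([deque(queue_a), deque(queue_b)])
    while queues:
        q = queues.popleft()
        if q:
            order.append(q.popleft())  # serve one customer from this line
            queues.append(q)           # then hop over to the other line
    return order

served = serve_concurrently(["a1", "a2"], ["b1", "b2"])
# Customers are interleaved across the two queues: a1, b1, a2, b2
```

No customer gets served faster, but nobody’s line sits completely idle — which is exactly Bob’s situation: one worker, interleaved progress.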
We could get two Bobs:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵🧑💵🧑💵
🌮 🌳
🌮
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
With twice the Bobs, each can handle a queue of his own. This is our most efficient solution for our taco stand so far, since two Bobs can handle much more than one Bob can, even if each customer is still attended to one at a time.
We can do even better than that:
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵
🌮 🌳
🌮 🐹 🧑💵🧑💵🧑💵
🌮 🌳
🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮🌮
With quadruple the Bobs, we have some very short queues, and a much more efficient taco stand. In computing, the concept of having multiple workers do tasks in parallel is called multithreading.
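As a quick illustration in Python (the taco-stand names are made up for the example), multiple workers can each take a queue of their own and serve them in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_queue(name, customers):
    """One worker serves every customer in its own queue."""
    return [f"{name} served {c}" for c in customers]

queues = {
    "Bob1": ["c1", "c2"],
    "Bob2": ["c3", "c4"],
    "Bob3": ["c5", "c6"],
    "Bob4": ["c7", "c8", "c9"],
}

# Four Bobs, four queues: each worker thread handles its own line.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_queue, queues.keys(), queues.values()))
```

Because a link checker is I/O-bound, Python threads genuinely help here even with the GIL; for CPU-bound work you’d reach for multiprocessing instead.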
In Go, we can apply this concept using goroutines. Here are some illustrative snippets from my Go solution.
In order to share data between our goroutines, we’ll need to create some data structures. Our Checker structure will be shared, so it will have a Mutex (mutual exclusion) to allow our goroutines to lock and unlock it. The Checker structure will also hold a list of brokenLinks results, and visitedLinks. The latter will be a map of strings to booleans, which we’ll use to directly and efficiently check for visited links. By using a map instead of iterating over a list, our visitedLinks lookup will have a constant time complexity of O(1) as opposed to a linear O(n), thus avoiding the creation of another bottleneck. For more on time complexity, see my coffee-break introduction to time complexity of algorithms article.
type Checker struct {
    startDomain  string
    brokenLinks  []Result
    visitedLinks map[string]bool
    workerCount, maxWorkers int
    sync.Mutex
}
...
// Page allows us to retain parent and sublinks
type Page struct {
    parent, loc, data string
}

// Result adds error information for the report
type Result struct {
    Page
    reason string
    code   int
}
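The map-versus-list lookup difference translates directly to Python, for illustration (the URLs here are placeholders): checking membership in a dict or set is a single hash lookup on average, while a list has to be scanned element by element.

```python
# Mirrors the visitedLinks idea: a dict gives O(1) average-case
# membership checks, while a list gives O(n) scans.
visited_list = []
visited_map = {}

def visit(url):
    visited_list.append(url)
    visited_map[url] = True

for i in range(10_000):
    visit(f"https://example.com/page/{i}")

# O(n): walks the whole list in the worst case
slow_check = "https://example.com/page/9999" in visited_list
# O(1) average: one hash lookup, regardless of how many links we've seen
fast_check = "https://example.com/page/9999" in visited_map
```

Both checks return the same answer; only the map’s cost stays flat as the crawl grows.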
To extract links from HTML data, here’s a parser I wrote on top of package html:
// Extract links from HTML
func parse(parent, data string) ([]string, []string) {
    doc, err := html.Parse(strings.NewReader(data))
    if err != nil {
        fmt.Println("Could not parse: ", err)
    }
    goodLinks := make([]string, 0)
    badLinks := make([]string, 0)
    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && checkKey(string(n.Data)) {
            for _, a := range n.Attr {
                if checkAttr(string(a.Key)) {
                    j, err := formatURL(parent, a.Val)
                    if err != nil {
                        badLinks = append(badLinks, j)
                    } else {
                        goodLinks = append(goodLinks, j)
                    }
                    break
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
    f(doc)
    return goodLinks, badLinks
}
If you’re wondering why I didn’t use a more full-featured package for this project, I highly recommend the story of left-pad. The short of it: more dependencies, more problems.
Here are snippets of the main function, where we pass in our starting URL and create a queue (a channel, in Go) to be filled with links for our goroutines to process.
func main() {
    ...
    startURL := flag.String("url", "http://example.com", "full URL of site")
    ...
    firstPage := Page{
        parent: *startURL,
        loc:    *startURL,
    }
    toProcess := make(chan Page, 1)
    toProcess <- firstPage
    var wg sync.WaitGroup
The last significant piece of the puzzle is to create our workers, which we’ll do here:
for i := range toProcess {
    wg.Add(1)
    checker.addWorker()
    go worker(i, &checker, &wg, toProcess)
    if checker.workerCount > checker.maxWorkers {
        time.Sleep(1 * time.Second) // throttle down
    }
}
wg.Wait()
A WaitGroup does just what it says on the tin: it waits for our group of goroutines to finish. When they have, we’ll know our Go web crawler has finished checking all the links on the site.
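The same queue-of-work, wait-for-the-workers pattern can be sketched in Python (a loose analogue under my own naming, not a translation of the crawler): queue.Queue stands in for the channel, and Queue.join() plays the WaitGroup’s role.

```python
import queue
import threading

# A loose Python analogue of the Go pattern: the channel becomes a
# queue.Queue, and Queue.join() waits like a WaitGroup.
to_process = queue.Queue()
results = []
results_lock = threading.Lock()  # plays the role of the Checker's Mutex

def worker():
    while True:
        try:
            page = to_process.get(timeout=0.5)
        except queue.Empty:
            return  # queue drained; this worker is done
        with results_lock:
            results.append(f"checked {page}")
        to_process.task_done()

for n in range(20):
    to_process.put(f"page-{n}")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

to_process.join()   # blocks until every queued page is marked done
for t in threads:
    t.join()        # workers exit once the queue stays empty
```

When join() returns, every queued page has been processed — the same guarantee wg.Wait() gives the Go crawler.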
Here’s a comparison of the three programs I wrote on this journey. First, the prototype single-thread Python version:
time python3 slow-link-check.py https://victoria.dev
real 17m34.084s
user 11m40.761s
sys 0m5.436s
This finished crawling my website in about seventeen-and-a-half minutes, which is rather long for a site at least an order of magnitude smaller than OWASP.org.
The multithreaded Python version did a bit better:
time python3 hydra.py https://victoria.dev
real 1m13.358s
user 0m13.161s
sys 0m2.826s
My multithreaded Python program (which I dubbed Hydra) finished in one minute and thirteen seconds.
How did Go do?
time ./go-link-check --url=https://victoria.dev
real 0m7.926s
user 0m9.044s
sys 0m0.932s
At just under eight seconds, I found the Go version to be extremely palatable.
As fun as it is to simply enjoy the speedups, we can relate these results directly to everything we’ve learned so far. Consider taking a process that used to soak up seventeen-and-a-half minutes and turning it into an eight-second affair. Not only does that give developers a much shorter and more efficient feedback loop, it gives companies the ability to develop, and thus grow, faster - while costing less. To drive the point home: a process that runs in seventeen-and-a-half minutes when it could take eight seconds also costs over a hundred and thirty times as much to run!
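The back-of-the-envelope arithmetic behind that claim, assuming billing scales linearly with compute time:

```python
# Same work, billed by compute time.
slow_seconds = 17.5 * 60   # ~17.5 minutes for the single-threaded crawl
fast_seconds = 8           # ~8 seconds for the Go version

speedup = slow_seconds / fast_seconds
# 1050 s / 8 s ≈ 131 — pay per second, and the slow version costs ~131x.
```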
A better work day for developers, and a better bottom line for companies. There’s a lot of benefit to be had in making functions, code, and processes as efficient as possible - by breaking bottlenecks.
When tackling new concepts, I find concrete examples to be most useful. I’ll share some in this post and discuss appropriate situations for each. (Pun intended.)
First, in pseudocode:
for iterating_variable in iterable:
    statement(s)
I find for loops to be the most readable way to iterate in Python. This is especially nice when you’re writing code that someone else needs to read and understand, which is always.
An iterating_variable, loosely speaking, takes on the value of each thing in a group, one at a time. For example: a letter in a string, an item from a list, or an integer in a range of integers.
An iterable houses the things you iterate on. This can also take different forms: a string with multiple characters, a range of numbers, a list, and so on.
A statement, or multiple statements, indicates doing something with the iterating variable. This could be anything from a mathematical expression to simply printing a result.
Here are a couple of simple examples that print each iterating_variable of an iterable:
for letter in "Hello world":
    print(letter)

for i in range(10):
    print(i)

breakfast_menu = ["toast", "eggs", "waffles", "coffee"]
for choice in breakfast_menu:
    print(choice)
You can even use a for loop in a more compact situation, such as this one-liner:
breakfast_buffet = " ".join(str(item) for item in breakfast_menu)
The downside to for loops is that they can be a bit verbose, depending on how much you’re trying to achieve. Still, for anyone hoping to make their Python code as easily understood as possible, for loops are the most straightforward choice.
A pseudocode example:
new_list = [statement(s) for iterating_variable in iterable]
List comprehensions are a concise and elegant way to create a new list by iterating on variables. Once you have a grasp of how they work, you can perform efficient iterations with very little code.
List comprehensions will always return a list, which may or may not be appropriate for your situation.
For example, you could use a list comprehension to quickly calculate and print tip percentage on a few bar tabs at once:
tabs = [23.60, 42.10, 17.50]
tabs_incl_tip = [round(tab*1.15, 2) for tab in tabs]
print(tabs_incl_tip)
>>> [27.14, 48.41, 20.12]
In one concise line, we’ve taken each tab amount, added a 15% tip, rounded it to the nearest cent, and made a new list of the tabs plus the tip values.
List comprehensions can be an elegant tool if output to a list is useful to you. Be advised that the more statements you add, the more complicated your list comprehension begins to look, especially once you get into nested list comprehensions. If your code isn’t well annotated, it may become difficult for another reader to figure out.
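To illustrate the readability trade-off (the matrix here is a made-up example), compare a nested list comprehension with its expanded for-loop equivalent:

```python
# A nested list comprehension: concise, but harder to scan at a glance.
matrix = [[1, 2, 3], [4, 5, 6]]
flat = [n * 2 for row in matrix for n in row]

# The equivalent for loops are longer, but read top to bottom:
flat_loops = []
for row in matrix:
    for n in row:
        flat_loops.append(n * 2)
```

Both produce the same flattened, doubled list; which form is clearer depends on your reader more than on the machine.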
How to map, in pseudocode:
map(statement, iterable)
Map is pretty compact, for better or worse. It can be harder to read and understand, especially if your line of code has a lot of parentheses.
In terms of efficiency for character count, map is hard to beat. It applies your statement to every element of your iterable and returns an iterator.
Here’s an example casting each element of input() (the iterable) from a string representation to an integer representation. Since map returns an iterator, you also cast the result to a list representation.
values = list(map(int, input().split()))
weights = list(map(int, input().split()))
It’s worth noting that you can also use for loops, list comprehensions, and map all together:
output = sum([x[0] * x[1] for x in zip(values, weights)]) / sum(weights)
print(round(output, 1))
Each of these methods of iteration in Python has a special place in the code I write every day. I hope these examples have helped you see how to use for loops, list comprehensions, and map in your own Python code!
If you like this post, there’s a lot more where that came from! I write about efficient programming for coders and for leading technical teams. Check out the posts below!