
An analysis of python package manifest files

Python packaging is messy and fragmented. Lots of people have been writing about it recently and there have been some great articles that have attracted a lot of attention. For example, I've particularly enjoyed:

Gregory Szorc also captured the frustrating experience many developers face trying to navigate the modern python packaging landscape in My User Experience Porting Off setup.py.

It is a topic I also spend a lot of time thinking about, but I decided to take a look at the topic from a slightly different angle. Instead of lamenting the proliferation of different tools, or attempting to round them all up and compare them, I decided to ask: What are package authors actually doing out there in the wild, and how is the community responding to this change and fragmentation?

So I conducted a bit of research. I looked at a sample of 31,474 public GitHub repos associated with one or more python packages on PyPI and analysed the manifests to find out a bit more about how people are actually specifying their package metadata and building their packages. For the purposes of this research, I'm focussing on packages. You could probably ask and answer some similarly interesting questions about applications, but I haven't done it here. There's a bit more information about how and why I arrived at this sample of ~30k GitHub repos in the methodology notes, but I'm not going to bury the lead. Let's just jump straight into the good stuff.

Manifest files

I looked for the presence of 3 files: pyproject.toml, setup.py and setup.cfg. Most of the repos I looked at contained more than one.

File Count Percent
setup.py 20,684 66%
pyproject.toml 17,245 55%
setup.cfg 10,406 33%
Total 31,474 -

Pyproject.toml

One of the big pushes in python is for adoption of pyproject.toml. So how is that going out there in the real world?

First of all, it is worth reviewing some of the ways pyproject.toml is or can be used.

  • PEP 517 defines a way to declare a package build backend in pyproject.toml.
  • PEP 621 defines a way to specify the package metadata in pyproject.toml.
  • PEP 518 defines a way to declare package build requirements in pyproject.toml, as well as a way for python tools (which may or may not be related to packaging) to store configuration in the tool.* namespace. Many python tools like pytest, black and mypy allow their configuration to be stored in pyproject.toml this way.
  • Additionally, poetry allows package metadata to be specified in pyproject.toml in a tool.poetry declaration, but this predates and does not conform to PEP 621. I'm going to consider poetry separately.

A point to note here is that these can be combined in various ways. For example, it is possible to declare a build backend in pyproject.toml following PEP 517 and also declare PEP 621 package metadata. However using setuptools it is also possible to declare a build backend in pyproject.toml but specify the rest of the package metadata in setup.py or setup.cfg. Some repos only use pyproject.toml for storing linter configuration and everything to do with packaging is stored in setup.py or setup.cfg. Some repos specify package metadata in pyproject.toml (either following PEP 621 or using poetry), but don't declare a build system. One does not necessarily imply another. I found examples of pretty much every combination. This makes it difficult to conduct a completely coherent analysis or arrive at universally valid assumptions.
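To make those distinctions concrete, here's a rough sketch (not the exact code used for this analysis) of how you might classify a single pyproject.toml using Python's built-in tomllib module (Python 3.11+), checking for each of the declarations described above:

import tomllib

# Sketch only: classify which packaging-related declarations a pyproject.toml
# contains. The path and the exact classification rules are illustrative.
with open("pyproject.toml", "rb") as f:
    data = tomllib.load(f)

build_system = data.get("build-system", {})
has_build_requirements = "requires" in build_system        # PEP 518
has_build_backend = "build-backend" in build_system        # PEP 517
has_pep621_metadata = "project" in data                     # PEP 621
has_poetry_metadata = "poetry" in data.get("tool", {})      # tool.poetry (not PEP 621)

if not any([has_build_requirements, has_build_backend, has_pep621_metadata, has_poetry_metadata]):
    print("no packaging metadata - probably just tool configuration")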

In the sample of repos I looked at 17,245 (55%) contained a pyproject.toml file. 15,754 (91%) of those declare a build backend, requirements, and/or package metadata. 1,497 (9%) did not contain any of those things. Presumably in basically all of those cases, pyproject.toml is being used exclusively as a configuration file for dev tooling.

Feature Count Percent
Has build requirements 15,427 89%
Has build backend 14,328 83%
Has PEP 621 metadata 6,563 38%
Has Poetry metadata 4,890 28%
Has no packaging metadata 1,497 9%
Total 17,245 -

There are a few interesting results here. The first is that most repos containing a pyproject.toml declare a build backend and/or build requirements. I was actually surprised that more files declare build requirements than a build backend. I expected repos declaring build requirements to be basically a subset of those declaring a build backend. It turns out the reverse is true.

Many repos are declaring package metadata in pyproject.toml using either PEP 621 or the Poetry format, but adoption of pyproject.toml for this purpose is less common than using it to declare a build backend or build requirements.

My hunch is that a lot of the repos which are only specifying build backend/requirements may have adopted pyproject.toml primarily as a configuration format (as opposed to a package manifest format) and then added a minimal build-system declaration for compatibility purposes. However that is just my conjecture.

Setup.py and setup.cfg

The oldest way to specify package metadata is using setup.py. This has served the community well for many years, but the package metadata is mixed with executable python code. The python community's first attempt at a declarative manifest format was setup.cfg. This was a format specific to setuptools rather than a standard and the setuptools project plans to eventually deprecate setup.cfg. One of the big pushes in python is for moving away from setup.py and setup.cfg to specify package metadata, and towards pyproject.toml. So how is that going out there in the real world?

Of the repos I looked at, 20,684 (66%) contained a setup.py and 10,406 (33%) contained a setup.cfg file. Many contained both. As with pyproject.toml, presence or absence of the file in a repo doesn't necessarily tell us the full story. Some repos that are primarily using pyproject.toml also have a stub setup.py that just contains

import setuptools
setuptools.setup()

for backwards compatibility reasons. This may be needed, for example for compatibility with tools that don't support PEP 660 editable installs.

As with pyproject.toml, many python-based tools like isort and flake8 allow their configuration to be stored in setup.cfg, so some repos contain a setup.cfg but aren't using it to store any information related to packaging - it is just there to store linter configuration. Again, basically every combination of scenarios exists in the sample of repos I looked at.

I haven't attempted to parse the setup.py and setup.cfg files. I am perhaps missing a bit of nuance here, but I have made some assumptions:

  • A repo which declares poetry or PEP 621 package metadata in pyproject.toml is using pyproject.toml as the package manifest.
  • A repo that has a setup.py but not a setup.cfg and either doesn't have a pyproject.toml at all or has a pyproject.toml which does not contain poetry or PEP 621 package metadata is using setup.py as the package manifest.
  • A repo that has a setup.py and setup.cfg and either doesn't have a pyproject.toml at all or has a pyproject.toml which does not contain poetry or PEP 621 package metadata is using setup.py or setup.cfg as the package manifest.
  • There were also just over 1,000 repos doing some other combination of things. A lot of these were a pyproject.toml declaring a build system or build requirements only, with metadata in setup.cfg. I didn't attempt to break them down any further.

Manifest type Count Percent
pyproject.toml with metadata 11,349 36%
setup.py only 10,695 34%
setup.py and setup.cfg 8,235 26%
Other 1,195 4%
Total 31,474 100%

18,930 (60%) of the repos I looked at are sticking with setup.py and/or setup.cfg as the package manifest.

Using only a setup.py is still a very popular method of packaging at 34%. This is nearly equal with storing package metadata in pyproject.toml at 36%, despite efforts to transition the community away from executable package manifests and towards declarative manifest formats.

Build Backends

14,328 of the repos I looked at are using a pyproject.toml that declares a PEP-517 build backend. So next I dug into that. Which build backends are these repos using?

Build backend Count Percent
Setuptools 6,732 47%
Poetry 4,671 33%
Hatch 1,592 11%
Flit 687 5%
Other 223 2%
Pdm 215 2%
Maturin 208 1%
Total 14,328 100%

There are more interesting findings here:

  • Among repos using pyproject.toml, setuptools is by far the most commonly declared build backend, accounting for nearly half the repos I looked at.
  • New shiny tools like poetry, hatch and flit have some adoption, but account for a much smaller share of the ecosystem.
  • By far the most widely used of these more modern packaging tools is poetry, accounting for 33% of the repos I looked at declaring a build backend in pyproject.toml.

Setuptools

Finally, I wanted to look at those repos using setuptools and pyproject.toml. Broadly, these are going to divide into 2 camps:

  • Those specifying package metadata in pyproject.toml, following PEP-621
  • Those specifying a build backend only in pyproject.toml, following PEP-517, but storing the package metadata in setup.py or setup.cfg.

Metadata location Count Percent
Outside pyproject.toml 3,615 54%
Inside pyproject.toml (PEP-621) 3,117 46%
Total 6,732 100%

Among repos using setuptools and pyproject.toml, only a minority have adopted PEP-621 for declaring package metadata. In the sample of repos I looked at which declare setuptools as a build backend in pyproject.toml, the most popular approach (albeit by a small margin) is to declare only the build backend details in pyproject.toml and store the package metadata elsewhere.

Conclusions

Based on the analysis I've done here, it seems reasonable to say that adoption of pyproject.toml has been slow, particularly as a package manifest format. Most of the repos I looked at are only or primarily using setup.py and/or setup.cfg. Modern packaging tools are generating blog posts, debate, and mindshare. Out there in the real world we are seeing limited adoption in comparison to more traditional approaches. While a blog post about setuptools is less likely to hit the front page of hackernews, setuptools is the real workhorse when it comes to getting packages shipped.

As noted at the start of this article, python packaging is a confusing and fragmented space at the moment. There are a lot of ways to skin this cat. It seems reasonable to infer that as a response to this, many developers are choosing to stick with an existing working solution, rather than make sense of the chaos. Who can blame them?

The python community often moves slowly in response to change. For example the migration from python 2 to 3 dragged on for about a decade, but in that case the direction of travel was at least clear. There was a single linear path. When it comes to modernizing the packaging space, progress is also hindered by the fact that for some projects there are many possible directions of travel. For some projects, there are still zero. Perhaps this is a journey that will take even longer to shake out.

Methodology notes

This research was based on a convenience sample. I looked at a selection of repos that made it quick and easy to harvest data, rather than the most robust sample or a complete census of PyPI.

As a starting point, I used the 2023-10-22 Ecosyste.ms Open Data Release (which is licensed under CC BY-SA 4.0). This was an easy place to get a bulk list of python packages with GitHub repos attached. I then applied a few filters.

First I excluded any packages which didn't have one or more releases published during 2023. I'm really looking into modern packaging practices, so packages without a recent release are less useful to consider here.

Then I excluded any packages that had less than 100 downloads in the last month. There is a lot of junk on PyPI. This is a low bar for popularity, but I wanted to apply some kind of measure of "someone is actually using this code for something". Applying even this modest filter excluded a surprisingly large number of packages.

Then finally, I looked only at packages which had a GitHub repo attached to them in the Ecosyste.ms data. This was mainly about making it easy to fetch data in bulk. This means I excluded repos hosted on GitLab, BitBucket, CodeBerg, etc from this analysis. I also did not attempt to look at packages that had no repository_url attached in the data. As such, the sample contains some blind spots.

After de-duplicating, this gave me 35,732 GitHub repository URLs.

I then used the GitHub GraphQL API to attempt to fetch a setup.py, setup.cfg and pyproject.toml if they existed in the repo root. After excluding any repos that were private, did not exist at all, or repos that didn't contain any of those files in the root, I was left with the 31,474 repos that formed the basis of this analysis. Another obvious blind spot here is repos that host a package in a subdirectory instead of the repo root. Those will have been excluded too.

Finally, I grabbed whatever files were at the HEAD of the default branch in GitHub. I didn't attempt to find a latest release, or the release that would have been current at the time of the ecosyste.ms open data release. I don't think this makes a huge difference, but it is worth noting.

Future work

This has been an interesting process, but it only represents a snapshot in a landscape that is shifting over time. I'd like to repeat this analysis again in future to see how things have changed. It's been a blast. Let's do it again some time.

Arq and TaskIQ

At work, we recently found ourselves in the market for an asynchronous task queue for python. A traditional task queue like Celery can be said to be "asynchronous" in the sense that your web server can kick a task into the queue and continue processing the web request without waiting for the task to complete. However it is "synchronous" in the sense that the task functions in your queue must be synchronous functions (declared with def rather than async def). If you want to queue an async function, you need an async worker to process it.

The two contenders we've been looking at in this space are arq and taskiq. These two solutions take slightly different approaches to solving the same problem.

Taskiq takes a conceptually simple push/pop approach to interacting with the queue. This is the same model used by popular synchronous packages like Celery and rq. When a worker is free to take a task, it pops a task off the queue and then executes it. The downside of this approach is that if a worker pops a task off the queue and then shuts down without processing the task to completion, that task is already gone from your queue without having been run to completion. Another worker can't try it again.
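As a point of reference, here's a minimal sketch of what defining and enqueuing a task with taskiq might look like, using its Redis broker plugin (the broker URL and the task itself are placeholders and the setup is simplified):

import asyncio
from taskiq_redis import ListQueueBroker

# Sketch only: a Redis-backed taskiq broker. URL and task body are placeholders.
broker = ListQueueBroker("redis://localhost:6379")

@broker.task
async def send_welcome_email(user_id: int) -> None:
    # An async task function. A taskiq worker process pops this off the queue and awaits it.
    ...

async def main() -> None:
    await broker.startup()
    # .kiq() pushes the task onto the queue for a worker to pick up.
    await send_welcome_email.kiq(user_id=42)
    await broker.shutdown()

if __name__ == "__main__":
    asyncio.run(main())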

Arq takes a different approach called "pessimistic execution" which solves that specific problem. When an arq worker takes a task from the queue, it doesn't remove it from the queue yet. The task stays in the queue while it is being run. The task is finally deleted from the queue in a post hook after the task is complete. This means a task is only removed from the queue after it has run to completion.

To ensure every worker in your cluster is not trying to process the same task at once, arq also maintains some additional shared state. When a worker takes a task, the worker acquires a lock on that task. The lock is automatically set to release after a timeout. If a worker never deletes the task in the post hook, the task is eventually unlocked and becomes available for another worker to process once the timeout expires.

This gives arq some slightly different characteristics than taskiq.

Arq will ideally try to deliver your task exactly once, but guarantees "at least once execution". Executing your task multiple times is considered preferable to executing it zero times. This means no lost tasks, but it also means if you use arq, your tasks must be written to be idempotent.
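For illustration, an arq task and worker definition looks roughly like this (the task name and Redis settings are placeholders; the worker itself is started with the arq CLI, e.g. arq myapp.WorkerSettings):

import asyncio
from arq import create_pool
from arq.connections import RedisSettings

# Sketch only: because arq guarantees at-least-once execution, the task body
# should be idempotent - running it twice must have the same effect as running it once.
async def send_welcome_email(ctx, user_id: int) -> None:
    ...

class WorkerSettings:
    # Picked up by the worker process, started with: arq myapp.WorkerSettings
    functions = [send_welcome_email]
    redis_settings = RedisSettings()

async def enqueue() -> None:
    redis = await create_pool(RedisSettings())
    await redis.enqueue_job("send_welcome_email", user_id=42)

if __name__ == "__main__":
    asyncio.run(enqueue())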

The simple push/pop relationship with the queue employed by taskiq lends itself to being compatible with a wide range of backends. Taskiq already has plugins for using NATS, Redis, RabbitMQ and Kafka as brokers. Taskiq defines a plugin interface for brokers, so it would be possible to write plugins for additional backends like SQS, for example.

Conversely, arq has a more complicated relationship with the data store. The additional shared state required to implement the locking behaviour needs a richer set of operations. As such, arq is tightly coupled to redis as a backend. There is no mechanism to substitute another broker.

So here's a summary of those tradeoffs:

  • Arq is tied to redis. It provides stronger guarantees about eventual task execution and requires you to write your tasks with the assumption they could be attempted multiple times.
  • Taskiq follows a model similar to Celery or rq. It provides weaker assurances, but this conceptually simpler model means you can assume your tasks will never be executed more than once. This setup also allows for compatibility with a wider range of brokers.

Our project has a lot of long-running tasks, which are vulnerable to being killed off before running to completion by deploy or scale-in events. Because of this, we prefer the pessimistic execution model offered by arq. We ended up moving forward with arq for our project.

So, you want to start a side project?

Over the years, I have started or worked on many side projects and spent a lot of time maintaining them. This has taught me a lot about what makes a project easy or hard to maintain. This post is a reflection on some of the lessons learned over that time.

First, let's start off with some assumptions:

  1. You want to start a side project, not a side hustle. A side hustle is an entrepreneurial exercise. It is something you eventually want to turn into a business or job, even if it is not on day one. The objective of a side project is the creative satisfaction of the project itself. It might be a learning experience, or exploration of personal interests.
  2. Your side project is software or programming related.
  3. The project is in some way public. It might be open source. You want to make something with a userbase or community beyond just yourself.
  4. Crucially, you care about maintaining this project over a period of time. It is not just a throwaway learning exercise.

If you are looking to start a side project, this implies you have some time on your hands right now. This is a great place to be, but it won't be true forever. Life happens, and it usually happens unexpectedly.

So, how can we optimise for projects that are low-maintenance or at least projects that require the type of maintenance that can be done on our own terms? A side project that may require attention urgently and unexpectedly can quickly become a burden.

Third party APIs

Third Party APIs are one of the most likely sources of sudden and unexpected maintenance tasks. You may or may not get warning when changes happen. If you use an API with authentication or credentials, you probably had to sign up for an account so the upstream service provider probably does at least hold some contact details they could use to inform you of changes. Expect the unexpected from any API where you use public or anonymous access.

Sometimes an API you depend on will:

  • Make a non-backwards compatible change.
  • Introduce rate limits or enforce stricter rate limits.
  • Change their terms of service such that your project now violates them in some way.
  • Withdraw service completely or shut down.

All of this is very in vogue at the moment.

Scrapers

Web scrapers are like third party APIs, only worse. API authors expect that other people's code depends on their API. They may still choose to make a breaking change anyway, but there is some incentive there to maintain a stable platform for their users. Nobody assumes or cares that your code relies on scraping their website. It certainly won't be a consideration in changing it. You definitely won't get any communication informing you of a change that impacts your code. Website authors may even be actively trying to prevent you from scraping them. Code that relies on web scraping is certain to break at some point. It is a matter of when, rather than if.

Infrastructure

Any kind of infrastructure you run (web servers, DB, cache, etc) is going to come with some maintenance overhead. Applying security upgrades, backup and restore, ensuring uptime, etc. The exact tasks that come up will vary a bit depending on the type of infrastructure, but could include a mix of tasks that can be planned in advance and things that happen unexpectedly. You can plan or defer applying an upgrade, but data loss requiring a restore from backup will happen when you least expect it.

There are some tradeoffs to be made here. Using a managed service can outsource some of this maintenance. For example, if we consider something like a Postgres DB: Running your own Postgres instance on a VPS leaves everything up to you (of course, with a side project, managing this yourself could be part of the joy or satisfaction). A fully managed service like Heroku Postgres will handle most of this stuff for you transparently. Something like RDS or Fly.io's "semi-managed" Postgres sits somewhere in the middle of those two extremes.

A fully managed service comes with some costs though. A managed service can be its own source of breaking changes or deprecations. Some platforms have a greater or lesser reputation for stability (think AWS vs GCP 😀). The more obvious cost is the literal financial cost though, which brings us on to...

Finances

If you go down the route of a project that requires some sort of infrastructure, that has a cost associated with it, and somebody needs to cover that. Maybe you will pay for it out of your own pocket. In general, side projects are not revenue generating and need to run on a modest budget. Often that budget will be zero. Maybe you can run a service using free tier offerings. However, even if you're using it for free, someone is still paying. For the moment, the company offering that "free" service is covering that cost out of marketing budget because offering a free tier is good promotion, but that might not stay true forever. If your project is popular, you might also consider soliciting some sponsorship to cover your costs, either from your community or a corporate sponsor.

Again, funding concerns are another common source of suddenly urgent work.

  • Your project may become more popular, outgrowing your current pricing plan or the level of sponsorship your project currently attracts.
  • If your project runs on a free tier service, that offering will probably be withdrawn at some point. Most of them are, eventually.
  • If you have a corporate sponsor, also assume it will not last forever. Sponsorships of open source and community projects are often the first things to be cut when times are tough.

All of this can leave you quickly scrambling to migrate to another service or find ways to consume fewer resources, and can even present an immediate existential threat to your project.

Personal data

nopenopenopenopenope

If your side project stores personal data, congratulations. You now have compliance obligations. Choo choo 🚂 All aboard the fun train! A service that stores any kind of personal data (e.g: user accounts) is not a good choice for a side project. This one is a hard no from me.

It's not all doom and gloom

That is a list of things that can, to one degree or another, generate some maintenance overhead. So what are some types of projects that don't have any of those characteristics (or at least as few of them as possible)?

Command line applications (compiled language)

If your project is distributed as a compiled binary and it doesn't call any external APIs, there are very few externalities that can break this type of project or require attention from you as a maintainer. This is even more true for a statically linked binary. The only real exception to this might be needing to respond to a security issue.

Command line applications (dynamic language) or Libraries

This type of project has similar properties to a compiled command line tool. However, with anything that the user installs via pip install, npm install, etc, your dependencies are resolved at install time. This means your previously working code can be broken for some users by non-backwards compatible changes made in an in-range dependency version. In theory SemVer saves us here, but in most languages (other than javascript) it is necessary to support wide ranges. This type of breakage is not super common, but it does happen.

Static sites

Static content is good content. If you have the type of static site that can be served from an S3 bucket, there are multiple places that will host it for free and scale it to handle as much traffic as the internet can throw at it. If you do need to move it, it is relatively easy and you have zero infrastructure to maintain.

For a low-maintenance side project that involves a website, "could this be a static site?" is generally a good question to ask. Sometimes by making a compromise or two, it is possible to get rid of a web server and DB and replace them with a static site. This is usually an advisable tradeoff. A good example of this might be choosing an SSG for your blog, instead of hosting a CMS. (Edit: A couple of months after I wrote this I came across the article "simple lasts longer" by Przemek, which gives a great concrete example of making some tradeoffs to allow a project to be delivered as a static site in preference to running a database).

It is worth noting that this is not true of the type of "static site" which is heavily tied to the specific features of a platform like Vercel or Netlify. These basically have the same tradeoffs as managed infrastructure with the additional downside of vendor lock-in.

End

So, that's some thoughts on the characteristics of a low-maintenance side project. Go forth. May your side project bring you many hours of joy and few unexpected urgent maintenance issues.

Querying GitHub Repos in Bulk

I've recently been working on a tool called pip-abandoned, which allows you to search for abandoned and deprecated python packages.

In order to make this, one of the things I needed to do was fetch the archived property for lots of GitHub repositories. Using the GitHub v3 (rest) API, this would need one round-trip per repo. So if I want to fetch the archived property for

  • pygments/pygments
  • pycqa/mccabe
  • readthedocs/commonmark.py

then I need to make three API calls:

  • GET https://api.github.com/repos/pygments/pygments
  • GET https://api.github.com/repos/pycqa/mccabe
  • GET https://api.github.com/repos/readthedocs/commonmark.py

...and if I had a list of 200 GitHub repos, that would require 200 individual network requests.

Fortunately, GitHub also has a GraphQL API. Using the GraphQL API, we can query more than one repo in a single request. To fetch the same data as above using the GraphQL API, I can make the following query

query {
  pygments: repository(
    owner: "pygments", name: "pygments"
  ) {
    isArchived
  }
  mccabe: repository(
    owner: "pycqa", name: "mccabe"
  ) {
    isArchived
  }
  commonmark: repository(
    owner: "readthedocs", name: "commonmark.py"
  ) {
    isArchived
  }
}

to retrieve the archived property for those three repos in a single round-trip over the network. When querying a large number of repos this is a big benefit.
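As a rough sketch, building and sending a batched query like that from Python might look something like this (the repo list, alias naming and token handling are simplified; note that GraphQL aliases can't contain characters like dots or slashes, so they need sanitising):

import os
import requests

# Sketch only: fetch the archived status of several repos in one GraphQL request.
repos = [("pygments", "pygments"), ("pycqa", "mccabe"), ("readthedocs", "commonmark.py")]

fields = [
    f'repo{i}: repository(owner: "{owner}", name: "{name}") {{ isArchived }}'
    for i, (owner, name) in enumerate(repos)
]
query = "query {\n" + "\n".join(fields) + "\n}"

response = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
    timeout=30,
)
response.raise_for_status()
print(response.json()["data"])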

Always bet on SQL

I don't often write a spicy opinion piece, but it is spicy opinion time: ORMs are fine, but it is not worth investing too much time into becoming an expert in one.

I first learned SQL over 20 years ago. Everything I learned then is still true now. Lots of other things have also become true in the meantime, but everything I learned then still holds.

I've also used a number of different ORMs in some depth. Some free-standing. Some attached to a particular framework.

  • Django ORM
  • SQLAlchemy
  • Doctrine
  • CakePHP ORM

All of those use different design patterns and provide conceptually quite different abstractions over the same underlying SQL. There are some common ideas between them, but if you pick any two from that list there are probably more differences between them than similarities. Only a subset of the knowledge from learning one is directly transferable to another ORM.

ORMs come and go as different languages and frameworks gain or lose traction, but the underlying SQL is a constant.

So here's a rule of thumb: ORMs are fine. They do a useful job. Even if you don't like them, you'll probably end up using one. But any given ORM doesn't really warrant too much energy or attention. Focus on SQL. Understand how an ORM maps onto the underlying SQL and learn your current ORM well enough to make it generate the SQL statement you want under the hood.
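As a small example of what I mean, many ORMs give you a way to inspect the SQL they will generate. Here's a sketch using SQLAlchemy's expression language with a made-up table, where printing a statement shows the query it maps to:

from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()
users = Table(
    "users",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

# Printing the statement shows the SQL it maps to, e.g.:
# SELECT users.name FROM users WHERE users.id = :id_1
stmt = select(users.c.name).where(users.c.id == 1)
print(stmt)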

There is a good chance that 5 years from now:

  • Anything you learn about SQL today will remain valid and useful, even when transitioning to a new ORM
  • You are likely to be using a different ORM, and much of your current ORM knowledge may not apply

relay.fedi.buzz

I self-host my own mastodon instance. I am the only user on my server and I only follow a handful of accounts. This means I inhabit a somewhat weakly connected corner of the fediverse. For example, following a tag doesn't act as a useful way for me to discover new posts because it almost exclusively shows me posts from people I am already following, which I know about anyway.

This is a problem that is theoretically solved by ActivityPub Relays, although practically I was never really able to get a handle on which (if any) made sense for me to add to my instance.

That changed recently when I found out about relay.fedi.buzz which allows you to generate ad-hoc follow-only ActivityPub relays for specific tags or instances (the most useful of those being tags IMO). For example, if I want to follow the #python tag, I can add https://relay.fedi.buzz/tag/python to my instance's relays. That ingests a much wider range of posts featuring that tag into my federated timeline, including from instances I was not already federated with via my follows. Following that hashtag from my personal account then allows me to discover new posts on that topic.

Adding a custom tag to a Sentry event

Sentry allows you to enrich captured events by applying custom tags and attributes. I was recently working on a python application where I needed a re-usable abstraction to express the logic "if function X throws exception Y then apply this custom key=value tag when we log the exception to Sentry" in a bunch of places. Here's what I came up with:

from functools import wraps
from sentry_sdk import capture_exception, push_scope

def tag_error(exc_class, key, value):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except (KeyboardInterrupt, SystemExit):
                raise
            except Exception as err:
                if isinstance(err, exc_class):
                    with push_scope() as scope:
                        scope.set_tag(key, value)
                        capture_exception(err)
                raise
        return wrapper
    return decorator

This gives us a @tag_error decorator, which can be applied to any function. For example:

@tag_error(ValueError, "custom-key", "custom-value")
def do_a_thing():
    ...
    raise ValueError("Oh no. A terrible thing has happened.")
    ...

This will tag any ValueErrors raised by calling do_a_thing() with custom-key=custom-value when we log the exception to sentry.

Generating a GitHub Markdown Summary from Mocha

I recently wanted to migrate some CI builds running mocha tests from CircleCI to GitHub Actions. I also wanted to use Job Summaries to produce a markdown summary of the build. This allows you to output a summary of your workflow run by appending markdown to the file referenced by the special $GITHUB_STEP_SUMMARY environment variable, e.g: echo '### Hello world! :rocket:' >> $GITHUB_STEP_SUMMARY

We run our tests with mocha, which doesn't ship with a markdown output formatter. The "min" formatter was quite close to what I wanted (a summary of any failed tests but a "quiet" output if everything passed). Dumping that to a code fence would have probably been acceptable. Unfortunately our test suite has a number of tests which log output to stdout, which made things a bit messy as the "min" formatter also dumps to stdout. So I decided to write a quick script to parse mocha's json output and produce a markdown summary. Doing this also allowed me to use some nicer formatting than dumping console output into a code fence.

// mocha2md.js

import fs from 'fs'

const title = process.argv[2]
const data = JSON.parse(fs.readFileSync(process.argv[3]))

process.stdout.write(`# ${title}\n\n`)

if (data.stats.passes > 0) {
  process.stdout.write(`✔ ${data.stats.passes} passed\n`)
}
if (data.stats.failures > 0) {
  process.stdout.write(`✖ ${data.stats.failures} failed\n\n`)
}

if (data.stats.failures > 0) {
  for (const test of data.tests) {
    if (test.err && Object.keys(test.err).length > 0) {
      process.stdout.write(`### ${test.title}\n\n`)
      process.stdout.write(`${test.fullTitle}\n\n`)
      process.stdout.write('```\n')
      process.stdout.write(`${test.err.stack}\n`)
      process.stdout.write('```\n\n')
    }
  }
}

Combine that with some workflow yaml to run the tests with the json reporter and use our script to write the report.

- name: Run tests
  run: npm run test:core -- --reporter json --reporter-option 'output=reports/test.json'

- name: Write Markdown Summary
  run: node mocha2md.js Tests reports/test.json >> $GITHUB_STEP_SUMMARY

and we've got ourselves a nice little summary report from our mocha tests.

[image: example markdown summary]

Diagrams

Last week I tried out diagrams to knock up some cloud infrastructure diagrams. There are several things I really like about this tool:

  • The learning curve is very easy. I was able to absorb the key concepts and produce a useful diagram showing the AWS setup for an application I am working on within about 30 mins of installing it for the first time.
  • The [effort in]:[pretty pictures out] ratio is very satisfying.
  • Because the diagram is generated from code, it can live in your repo. The diff changing the diagram could be in the same commit as the updates to your CDK definitions or ansible playbooks or whatever it is that actually makes the infrastructure changes.

For example, the following diagram

[image: example diagram]

is generated from this short python snippet:

from diagrams import Cluster, Diagram
from diagrams.aws.compute import Fargate
from diagrams.aws.database import RDS, ElastiCache
from diagrams.aws.engagement import SES
from diagrams.aws.network import ELB, Route53
from diagrams.aws.storage import S3

with Diagram("", show=False):
    ses = SES("Mail Transport (SES)")
    dns = Route53("Route 53 (DNS)")
    s3 = S3("S3")

    with Cluster("VPC"):
        lb = ELB("Load Balancer (ALB)")
        elasticache = ElastiCache("Redis (ElastiCache)")

        with Cluster("ECS"):
            web = Fargate("web")

        with Cluster("DB Cluster (RDS)"):
            db_primary = RDS("primary")
            db_primary - RDS("read replica")

    dns >> lb
    lb >> web

    web >> elasticache
    web >> db_primary
    web >> s3
    web >> ses