
An analysis of python package manifest files

Python packaging is messy and fragmented. Lots of people have been writing about it recently and there have been some great articles that have attracted a lot of attention. For example, I've particularly enjoyed:

Gregory Szorc also captured the frustrating experience many developers face trying to navigate the modern python packaging landscape in My User Experience Porting Off setup.py.

It is a topic I also spend a lot of time thinking about, but I decided to take a look at the topic from a slightly different angle. Instead of lamenting the proliferation of different tools, or attempting to round them all up and compare them, I decided to ask: What are package authors actually doing out there in the wild, and how is the community responding to this change and fragmentation?

So I conducted a bit of research. I looked at a sample of 31,474 public GitHub repos associated with one or more python packages on PyPI and analysed the manifests to find out a bit more about how people are actually specifying their package metadata and building their packages. For the purposes of this research, I'm focussing on packages. You could probably ask and answer some similarly interesting questions about applications, but I haven't done it here. There's a bit more information about how and why I arrived at this sample of ~30k GitHub repos in the methodology notes, but I'm not going to bury the lead. Let's just jump straight into the good stuff.

Manifest files

I looked for the presence of 3 files: pyproject.toml, setup.py and setup.cfg. Most of the repos I looked at contained more than one.

File Count Percent
setup.py 20,684 66%
pyproject.toml 17,245 55%
setup.cfg 10,406 33%
Total 31,474 -

Pyproject.toml

One of the big pushes in python is for adoption of pyproject.toml. So how is that going out there in the real world?

First of all, it is worth reviewing some of the ways pyproject.toml is or can be used.

  • PEP 517 defines a way to declare a package build backend in pyproject.toml.
  • PEP 621 defines a way to specify the package metadata in pyproject.toml.
  • PEP 518 defines a way to declare package build requirements in pyproject.toml, as well as a way for python tools (which may or may not be related to packaging) to store configuration in the tool.* namespace. Many python tools like pytest, black and mypy allow their configuration to be stored in pyproject.toml this way.
  • Additionally, poetry allows package metadata to be specified in pyproject.toml in a tool.poetry declaration, but this predates and does not conform to PEP 621. I'm going to consider poetry separately.

A point to note here is that these can be combined in various ways. For example, it is possible to declare a build backend in pyproject.toml following PEP 517 and also declare PEP 621 package metadata. However using setuptools it is also possible to declare a build backend in pyproject.toml but specify the rest of the package metadata in setup.py or setup.cfg. Some repos only use pyproject.toml for storing linter configuration and everything to do with packaging is stored in setup.py or setup.cfg. Some repos specify package metadata in pyproject.toml (either following PEP 621 or using poetry), but don't declare a build system. One does not necessarily imply another. I found examples of pretty much every combination. This makes it difficult to conduct a completely coherent analysis or arrive at universally valid assumptions.
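To make those distinctions concrete, here's a rough sketch (not the exact code used for this analysis) of how you might classify a single pyproject.toml using Python's built-in tomllib module (Python 3.11+), checking for each of the declarations described above:

import tomllib

# Sketch only: classify which packaging-related declarations a pyproject.toml
# contains. The path and the exact classification rules are illustrative.
with open("pyproject.toml", "rb") as f:
    data = tomllib.load(f)

build_system = data.get("build-system", {})
has_build_requirements = "requires" in build_system        # PEP 518
has_build_backend = "build-backend" in build_system        # PEP 517
has_pep621_metadata = "project" in data                     # PEP 621
has_poetry_metadata = "poetry" in data.get("tool", {})      # tool.poetry (not PEP 621)

if not any([has_build_requirements, has_build_backend, has_pep621_metadata, has_poetry_metadata]):
    print("no packaging metadata - probably just tool configuration")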

In the sample of repos I looked at 17,245 (55%) contained a pyproject.toml file. 15,754 (91%) of those declare a build backend, requirements, and/or package metadata. 1,497 (9%) did not contain any of those things. Presumably in basically all of those cases, pyproject.toml is being used exclusively as a configuration file for dev tooling.

Feature Count Percent
Has build requirements 15,427 89%
Has build backend 14,328 83%
Has PEP 621 metadata 6,563 38%
Has Poetry metadata 4,890 28%
Has no packaging metadata 1,497 9%
Total 17,245 -

There are a few interesting results here. The first is that most repos containing a pyproject.toml declare a build backend and/or build requirements. I was actually surprised that more files declare build requirements than a build backend. I expected repos declaring build requirements to be basically a subset of those declaring a build backend. It turns out the reverse is true.

Many repos are declaring package metadata in pyproject.toml using either PEP 621 or the Poetry format, but adoption of pyproject.toml for this purpose is less common than using it to declare a build backend or build requirements.

My hunch is that a lot of the repos which are only specifying build backend/requirements may have adopted pyproject.toml primarily as a configuration format (as opposed to a package manifest format) and then added a minimal build-system declaration for compatibility purposes. However that is just my conjecture.

Setup.py and setup.cfg

The oldest way to specify package metadata is using setup.py. This has served the community well for many years, but the package metadata is mixed with executable python code. The python community's first attempt at a declarative manifest format was setup.cfg. This was a format specific to setuptools rather than a standard and the setuptools project plans to eventually deprecate setup.cfg. One of the big pushes in python is for moving away from setup.py and setup.cfg to specify package metadata, and towards pyproject.toml. So how is that going out there in the real world?

Of the repos I looked at, 20,684 (66%) contained a setup.py and 10,406 (33%) contained a setup.cfg file. Many contained both. As with pyproject.toml, presence or absence of the file in a repo doesn't necessarily tell us the full story. Some repos that are primarily using pyproject.toml also have a stub setup.py that just contains

import setuptools
setuptools.setup()

for backwards compatibility reasons. This may be needed, for example for compatibility with tools that don't support PEP 660 editable installs.

As with pyproject.toml, many python-based tools like isort and flake8 allow their configuration to be stored in setup.cfg, so some repos contain a setup.cfg but aren't using it to store any information related to packaging - it is just there to store linter configuration. Again, basically every combination of scenarios exists in the sample of repos I looked at.

I haven't attempted to parse the setup.py and setup.cfg files. I am perhaps missing a bit of nuance here, but I have made some assumptions:

  • A repo which declares poetry or PEP 621 package metadata in pyproject.toml is using pyproject.toml as the package manifest.
  • A repo that has a setup.py but not a setup.cfg and either doesn't have a pyproject.toml at all or has a pyproject.toml which does not contain poetry or PEP 621 package metadata is using setup.py as the package manifest.
  • A repo that has a setup.py and setup.cfg and either doesn't have a pyproject.toml at all or has a pyproject.toml which does not contain poetry or PEP 621 package metadata is using setup.py or setup.cfg as the package manifest.
  • There were also just over 1,000 repos doing some other combination of things. A lot of these were a pyproject.toml declaring a build system or build requirements only, with metadata in setup.cfg. I didn't attempt to break them down any further.

Manifest type Count Percent
pyproject.toml with metadata 11,349 36%
setup.py only 10,695 34%
setup.py and setup.cfg 8,235 26%
Other 1,195 4%
Total 31,474 100%

18,930 (60%) of the repos I looked at are sticking with setup.py and/or setup.cfg as the package manifest.

Using only a setup.py is still a very popular method of packaging at 34%. This is nearly equal with storing package metadata in pyproject.toml at 36%, despite efforts to transition the community away from executable package manifests and towards declarative manifest formats.

Build Backends

14,328 of the repos I looked at are using a pyproject.toml that declares a PEP-517 build backend. So next I dug into that. Which build backends are these repos using?

Build backend Count Percent
Setuptools 6,732 47%
Poetry 4,671 33%
Hatch 1,592 11%
Flit 687 5%
Other 223 2%
Pdm 215 2%
Maturin 208 1%
Total 14,328 100%

There are more interesting findings here:

  • Among repos using pyproject.toml, setuptools is by far the most commonly declared build backend, accounting for nearly half the repos I looked at.
  • New shiny tools like poetry, hatch and flit have some adoption, but account for a much smaller share of the ecosystem.
  • By far the most widely used of these more modern packaging tools is poetry, accounting for 33% of the repos I looked at declaring a build backend in pyproject.toml.

Setuptools

Finally, I wanted to look at those repos using setuptools and pyproject.toml. Broadly, these are going to divide into 2 camps:

  • Those specifying package metadata in pyproject.toml, following PEP-621
  • Those specifying a build backend only in pyproject.toml, following PEP-517, but storing the package metadata in setup.py or setup.cfg.

Metadata location Count Percent
Outside pyproject.toml 3,615 54%
Inside pyproject.toml (PEP-621) 3,117 46%
Total 6,732 100%

Among repos using setuptools and pyproject.toml, only a minority have adopted PEP-621 for declaring package metadata. In the sample of repos I looked at which declare setuptools as a build backend in pyproject.toml, the most popular approach (albeit by a small margin) is to declare only the build backend details in pyproject.toml and store the package metadata elsewhere.

Conclusions

Based on the analysis I've done here, it seems reasonable to say that adoption of pyproject.toml has been slow, particularly as a package manifest format. Most of the repos I looked at are only or primarily using setup.py and/or setup.cfg. Modern packaging tools are generating blog posts, debate, and mindshare. Out there in the real world we are seeing limited adoption in comparison to more traditional approaches. While a blog post about setuptools is less likely to hit the front page of hackernews, setuptools is the real workhorse when it comes to getting packages shipped.

As noted at the start of this article, python packaging is a confusing and fragmented space at the moment. There are a lot of ways to skin this cat. It seems reasonable to infer that as a response to this, many developers are choosing to stick with an existing working solution, rather than make sense of the chaos. Who can blame them?

The python community often moves slowly in response to change. For example the migration from python 2 to 3 dragged on for about a decade, but in that case the direction of travel was at least clear. There was a single linear path. When it comes to modernizing the packaging space, progress is also hindered by the fact that for some projects there are many possible directions of travel. For some projects, there are still zero. Perhaps this is a journey that will take even longer to shake out.

Methodology notes

This research was based on a convenience sample. I looked at a selection of repos that made it quick and easy to harvest data, rather than the most robust sample or a complete census of PyPI.

As a starting point, I used the 2023-10-22 Ecosyste.ms Open Data Release (which is licensed under CC BY-SA 4.0). This was an easy place to get a bulk list of python packages with GitHub repos attached. I then applied a few filters.

First I excluded any packages which didn't have one or more releases published during 2023. I'm really looking into modern packaging practices, so packages without a recent release are less useful to consider here.

Then I excluded any packages that had less than 100 downloads in the last month. There is a lot of junk on PyPI. This is a low bar for popularity, but I wanted to apply some kind of measure of "someone is actually using this code for something". Applying even this modest filter excluded a surprisingly large number of packages.

Then finally, I looked only at packages which had a GitHub repo attached to them in the Ecosyste.ms data. This was mainly about making it easy to fetch data in bulk. This means I excluded repos hosted on GitLab, BitBucket, CodeBerg, etc from this analysis. I also did not attempt to look at packages that had no repository_url attached in the data. As such, the sample contains some blind spots.

After de-duplicating, this gave me 35,732 GitHub repository URLs.

I then used the GitHub GraphQL API to attempt to fetch a setup.py, setup.cfg and pyproject.toml if they existed in the repo root. After excluding any repos that were private, did not exist at all, or repos that didn't contain any of those files in the root, I was left with the 31,474 repos that formed the basis of this analysis. Another obvious blind spot here is repos that host a package in a subdirectory instead of the repo root. Those will have been excluded too.

Finally, I grabbed whatever files were at the HEAD of the default branch in GitHub. I didn't attempt to find a latest release, or the release that would have been current at the time of the ecosyste.ms open data release. I don't think this makes a huge difference, but it is worth noting.

Future work

This has been an interesting process, but it only represents a snapshot in a landscape that is shifting over time. I'd like to repeat this analysis again in future to see how things have changed. It's been a blast. Let's do it again some time.

Arq and TaskIQ

At work, we recently found ourselves in the market for an asynchronous task queue for python. A traditional task queue like Celery can be said to be "asynchronous" in the sense that your web server can kick a task into the queue and continue processing the web request without waiting for the task to complete. However it is "synchronous" in the sense that the task functions in your queue must be synchronous functions (declared with def rather than async def). If you want to queue an async function, you need an async worker to process it.

The two contenders we've been looking at in this space are arq and taskiq. These two solutions take slightly different approaches to solving the same problem.

Taskiq takes a conceptually simple push/pop approach to interacting with the queue. This is the same model used by popular synchronous packages like Celery and rq. When a worker is free to take a task, it pops a task off the queue and then executes it. The downside of this approach is that if a worker pops a task off the queue and then shuts down without processing the task to completion, that task is already gone from your queue without having been run to completion. Another worker can't try it again.
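As a point of reference, here's a minimal sketch of what defining and enqueuing a task with taskiq might look like, using its Redis broker plugin (the broker URL and the task itself are placeholders and the setup is simplified):

import asyncio
from taskiq_redis import ListQueueBroker

# Sketch only: a Redis-backed taskiq broker. URL and task body are placeholders.
broker = ListQueueBroker("redis://localhost:6379")

@broker.task
async def send_welcome_email(user_id: int) -> None:
    # An async task function. A taskiq worker process pops this off the queue and awaits it.
    ...

async def main() -> None:
    await broker.startup()
    # .kiq() pushes the task onto the queue for a worker to pick up.
    await send_welcome_email.kiq(user_id=42)
    await broker.shutdown()

if __name__ == "__main__":
    asyncio.run(main())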

Arq takes a different approach called "pessimistic execution" which solves that specific problem. When an arq worker takes a task from the queue, it doesn't remove it from the queue yet. The task stays in the queue while it is being run. The task is finally deleted from the queue in a post hook after the task is complete. This means a task is only removed from the queue after it has run to completion.

To ensure every worker in your cluster is not trying to process the same task at once, arq also maintains some additional shared state. When a worker takes a task, the worker acquires a lock on that task. The lock is automatically set to release after a timeout. If a worker never deletes the task in the post hook, the task is eventually unlocked and becomes available for another worker to process once the timeout expires.

This gives arq some slightly different characteristics than taskiq.

Arq will ideally try to deliver your task exactly once, but guarantees "at least once execution". Executing your task multiple times is considered preferable to executing it zero times. This means no lost tasks, but it also means if you use arq, your tasks must be written to be idempotent.
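For illustration, an arq task and worker definition looks roughly like this (the task name and Redis settings are placeholders; the worker itself is started with the arq CLI, e.g. arq myapp.WorkerSettings):

import asyncio
from arq import create_pool
from arq.connections import RedisSettings

# Sketch only: because arq guarantees at-least-once execution, the task body
# should be idempotent - running it twice must have the same effect as running it once.
async def send_welcome_email(ctx, user_id: int) -> None:
    ...

class WorkerSettings:
    # Picked up by the worker process, started with: arq myapp.WorkerSettings
    functions = [send_welcome_email]
    redis_settings = RedisSettings()

async def enqueue() -> None:
    redis = await create_pool(RedisSettings())
    await redis.enqueue_job("send_welcome_email", user_id=42)

if __name__ == "__main__":
    asyncio.run(enqueue())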

The simple push/pop relationship with the queue employed by taskiq lends itself to being compatible with a wide range of backends. Taskiq already has plugins for using NATS, Redis, RabbitMQ and Kafka as brokers. Taskiq defines a plugin interface for brokers, so it would be possible to write plugins for additional backends like SQS, for example.

Conversely, arq has a more complicated relationship with the data store. The additional shared state required to implement the locking behaviour needs a richer set of operations. As such, arq is tightly coupled to redis as a backend. There is no mechanism to substitute another broker.

So here's a summary of those tradeoffs:

  • Arq is tied to redis. It provides stronger guarantees about eventual task execution and requires you to write your tasks with the assumption they could be attempted multiple times.
  • Taskiq follows a model similar to Celery or rq. It provides weaker assurances, but this conceptually simpler model means you can assume your tasks will never be executed more than once. This setup also allows for compatibility with a wider range of brokers.

Our project has a lot of long-running tasks, which are vulnerable to being killed off before running to completion by deploy or scale-in events. Because of this, we prefer the pessimistic execution model offered by arq. We ended up moving forward with arq for our project.

So, you want to start a side project?

Over the years, I have started or worked on many side projects and spent a lot of time maintaining them. This has taught me a lot about what makes a project easy or hard to maintain. This post is a reflection on some of the lessons learned over that time.

First, let's start off with some assumptions:

  1. You want to start a side project, not a side hustle. A side hustle is an entrepreneurial exercise. It is something you eventually want to turn into a business or job, even if it is not on day one. The objective of a side project is the creative satisfaction of the project itself. It might be a learning experience, or exploration of personal interests.
  2. Your side project is software or programming related.
  3. The project is in some way public. It might be open source. You want to make something with a userbase or community beyond just yourself.
  4. Crucially, you care about maintaining this project over a period of time. It is not just a throwaway learning exercise.

If you are looking to start a side project, this implies you have some time on your hands right now. This is a great place to be, but it won't be true forever. Life happens, and it usually happens unexpectedly.

So, how can we optimise for projects that are low-maintenance or at least projects that require the type of maintenance that can be done on our own terms? A side project that may require attention urgently and unexpectedly can quickly become a burden.

Third party APIs

Third Party APIs are one of the most likely sources of sudden and unexpected maintenance tasks. You may or may not get warning when changes happen. If you use an API with authentication or credentials, you probably had to sign up for an account so the upstream service provider probably does at least hold some contact details they could use to inform you of changes. Expect the unexpected from any API where you use public or anonymous access.

Sometimes an API you depend on will:

  • Make a non-backwards compatible change.
  • Introduce rate limits or enforce stricter rate limits.
  • Change their terms of service such that your project now violates them in some way.
  • Withdraw service completely or shut down.

All of this is very in vogue at the moment.

Scrapers

Web scrapers are like third party APIs, only worse. API authors expect that other people's code depends on their API. They may still choose to make a breaking change anyway, but there is some incentive there to maintain a stable platform for their users. Nobody assumes or cares that your code relies on scraping their website. It certainly won't be a consideration in changing it. You definitely won't get any communication informing you of a change that impacts your code. Website authors may even be actively trying to prevent you from scraping them. Code that relies on web scraping is certain to break at some point. It is a matter of when, rather than if.

Infrastructure

Any kind of infrastructure you run (web servers, DB, cache, etc) is going to come with some maintenance overhead. Applying security upgrades, backup and restore, ensuring uptime, etc. The exact tasks that come up will vary a bit depending on the type of infrastructure, but could include a mix of tasks that can be planned in advance and things that happen unexpectedly. You can plan or defer applying an upgrade, but data loss requiring a restore from backup will happen when you least expect it.

There are some tradeoffs to be made here. Using a managed service can outsource some of this maintenance. For example, if we consider something like a Postgres DB: Running your own Postgres instance on a VPS leaves everything up to you (of course, with a side project, managing this yourself could be part of the joy or satisfaction). A fully managed service like Heroku Postgres will handle most of this stuff for you transparently. Something like RDS or Fly.io's "semi-managed" Postgres sits somewhere in the middle of those two extremes.

A fully managed service comes with some costs though. A managed service can be its own source of breaking changes or deprecations. Some platforms have a greater or lesser reputation for stability (think AWS vs GCP 😀). The more obvious cost is the literal financial cost though, which brings us on to...

Finances

If you go down the route of a project that requires some sort of infrastructure, that has a cost associated with it, and somebody needs to cover that. Maybe you will pay for it out of your own pocket. In general, side projects are not revenue generating and need to run on a modest budget. Often that budget will be zero. Maybe you can run a service using free tier offerings. However, even if you're using it for free, someone is still paying. For the moment, the company offering that "free" service is covering that cost out of marketing budget because offering a free tier is good promotion, but that might not stay true forever. If your project is popular, you might also consider soliciting some sponsorship to cover your costs, either from your community or a corporate sponsor.

Again, funding concerns are another common source of suddenly urgent work.

  • Your project may become more popular, outgrowing your current pricing plan or the level of sponsorship your project currently attracts.
  • If your project runs on a free tier service, that offering will probably be withdrawn at some point. Most of them are, eventually.
  • If you have a corporate sponsor, also assume it will not last forever. Sponsorships of open source and community projects are often the first things to be cut when times are tough.

All of this can leave you quickly scrambling to migrate to another service or find ways to consume fewer resources, and can even present an immediate existential threat to your project.

Personal data

nopenopenopenopenope

If your side project stores personal data, congratulations. You now have compliance obligations. Choo choo 🚂 All aboard the fun train! A service that stores any kind of personal data (e.g: user accounts) is not a good choice for a side project. This one is a hard no from me.

It's not all doom and gloom

That is a list of things that can, to one degree or another, generate some maintenance overhead. So what are some types of projects that don't have any of those characteristics (or at least as few of them as possible)?

Command line applications (compiled language)

If your project is distributed as a compiled binary and it doesn't call any external APIs, there are very few externalities that can break this type of project or require attention from you as a maintainer. This is even more true for a statically linked binary. The only real exception to this might be needing to respond to a security issue.

Command line applications (dynamic language) or Libraries

This type of project has similar properties to a compiled command line tool. However, with anything that the user installs via pip install, npm install, etc, your dependencies are resolved at install time. This means your previously working code can be broken for some users by non-backwards compatible changes made in an in-range dependency version. In theory SemVer saves us here, but in most languages (other than javascript) it is necessary to support wide ranges. This type of breakage is not super common, but it does happen.

Static sites

Static content is good content. If you have the type of static site that can be served from an S3 bucket, there are multiple places that will host it for free and scale it to handle as much traffic as the internet can throw at it. If you do need to move it, it is relatively easy and you have zero infrastructure to maintain.

For a low-maintenance side project that involves a website, "could this be a static site?" is generally a good question to ask. Sometimes by making a compromise or two, it is possible to get rid of a web server and DB and replace them with a static site. This is usually an advisable tradeoff. A good example of this might be choosing an SSG for your blog, instead of hosting a CMS. (Edit: A couple of months after I wrote this I came across the article "simple lasts longer" by Przemek, which gives a great concrete example of making some tradeoffs to allow a project to be delivered as a static site in preference to running a database).

It is worth noting that this is not true of the type of "static site" which is heavily tied to the specific features of a platform like Vercel or Netlify. These basically have the same tradeoffs as managed infrastructure with the additional downside of vendor lock-in.

End

So, that's some thoughts on the characteristics of a low-maintenance side project. Go forth. May your side project bring you many hours of joy and few unexpected urgent maintenance issues.

Querying GitHub Repos in Bulk

I've recently been working on a tool called pip-abandoned, which allows you to search for abandoned and deprecated python packages.

In order to make this, one of the things I needed to do was fetch the archived property for lots of GitHub repositories. Using the GitHub v3 (rest) API, this would need one round-trip per repo. So if I want to fetch the archived property for

  • pygments/pygments
  • pycqa/mccabe
  • readthedocs/commonmark.py

then I need to make three API calls:

  • GET https://api.github.com/repos/pygments/pygments
  • GET https://api.github.com/repos/pycqa/mccabe
  • GET https://api.github.com/repos/readthedocs/commonmark.py

...and if I had a list of 200 GitHub repos, that would require 200 individual network requests.

Fortunately, GitHub also has a GraphQL API. Using the GraphQL API, we can query more than one repo in a single request. To fetch the same data as above using the GraphQL API, I can make the following query

query {
  pygments: repository(
    owner: "pygments", name: "pygments"
  ) {
    isArchived
  }
  mccabe: repository(
    owner: "pycqa", name: "mccabe"
  ) {
    isArchived
  }
  commonmark: repository(
    owner: "readthedocs", name: "commonmark.py"
  ) {
    isArchived
  }
}

to retrieve the archived property for those three repos in a single round-trip over the network. When querying a large number of repos this is a big benefit.
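As a rough sketch, building and sending a batched query like that from Python might look something like this (the repo list, alias naming and token handling are simplified; note that GraphQL aliases can't contain characters like dots or slashes, so they need sanitising):

import os
import requests

# Sketch only: fetch the archived status of several repos in one GraphQL request.
repos = [("pygments", "pygments"), ("pycqa", "mccabe"), ("readthedocs", "commonmark.py")]

fields = [
    f'repo{i}: repository(owner: "{owner}", name: "{name}") {{ isArchived }}'
    for i, (owner, name) in enumerate(repos)
]
query = "query {\n" + "\n".join(fields) + "\n}"

response = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
    timeout=30,
)
response.raise_for_status()
print(response.json()["data"])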

Always bet on SQL

I don't often write a spicy opinion piece, but it is spicy opinion time: ORMs are fine, but it is not worth investing too much time into becoming an expert in one.

I first learned SQL over 20 years ago. Everything I learned then is still true now. Lots of other things have also become true in the meantime, but everything I learned then still holds.

I've also used a number of different ORMs in some depth. Some free-standing. Some attached to a particular framework.

  • Django ORM
  • SQLAlchemy
  • Doctrine
  • CakePHP ORM

All of those use different design patterns and provide conceptually quite different abstractions over the same underlying SQL. There are some common ideas between them, but if you pick any two from that list there are probably more differences between them than similarities. Only a subset of the knowledge from learning one is directly transferable to another ORM.

ORMs come and go as different languages and frameworks gain or lose traction, but the underlying SQL is a constant.

So here's a rule of thumb: ORMs are fine. They do a useful job. Even if you don't like them, you'll probably end up using one. But any given ORM doesn't really warrant too much energy or attention. Focus on SQL. Understand how an ORM maps onto the underlying SQL and learn your current ORM well enough to make it generate the SQL statement you want under the hood.
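As a small example of what I mean, many ORMs give you a way to inspect the SQL they will generate. Here's a sketch using SQLAlchemy's expression language with a made-up table, where printing a statement shows the query it maps to:

from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()
users = Table(
    "users",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

# Printing the statement shows the SQL it maps to, e.g.:
# SELECT users.name FROM users WHERE users.id = :id_1
stmt = select(users.c.name).where(users.c.id == 1)
print(stmt)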

There is a good chance that 5 years from now:

  • Anything you learn about SQL today will remain valid and useful, even when transitioning to a new ORM
  • You are likely to be using a different ORM, and much of your current ORM knowledge may not apply

relay.fedi.buzz

I self-host my own mastodon instance. I am the only user on my server and I only follow a handful of accounts. This means I inhabit a somewhat weakly connected corner of the fediverse. For example, following a tag doesn't act as a useful way for me to discover new posts because it almost exclusively shows me posts from people I am already following, which I know about anyway.

This is a problem that is theoretically solved by ActivityPub Relays, although practically I was never really able to get a handle on which (if any) made sense for me to add to my instance.

That changed recently when I found out about relay.fedi.buzz which allows you to generate ad-hoc follow-only ActivityPub relays for specific tags or instances (the most useful of those being tags IMO). For example, if I want to follow the #python tag, I can add https://relay.fedi.buzz/tag/python to my instance's relays. That ingests a much wider range of posts featuring that tag into my federated timeline, including from instances I was not already federated with via my follows. Following that hashtag from my personal account then allows me to discover new posts on that topic.

Adding a custom tag to a Sentry event

Sentry allows you to enrich captured events by applying custom tags and attributes. I was recently working on a python application where I needed a re-usable abstraction to express the logic "if function X throws exception Y then apply this custom key=value tag when we log the exception to Sentry" in a bunch of places. Here's what I came up with:

from functools import wraps
from sentry_sdk import capture_exception, push_scope

def tag_error(exc_class, key, value):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except (KeyboardInterrupt, SystemExit):
                raise
            except Exception as err:
                if isinstance(err, exc_class):
                    with push_scope() as scope:
                        scope.set_tag(key, value)
                        capture_exception(err)
                raise
        return wrapper
    return decorator

This gives us a @tag_error decorator, which can be applied to any function. For example:

@tag_error(ValueError, "custom-key", "custom-value")
def do_a_thing():
    ...
    raise ValueError("Oh no. A terrible thing has happened.")
    ...

This will tag any ValueErrors raised by calling do_a_thing() with custom-key=custom-value when we log the exception to sentry.

Generating a GitHub Markdown Summary from Mocha

I recently wanted to migrate some CI builds running mocha tests from CircleCI to GitHub Actions. I also wanted to use Job Summaries to produce a markdown summary of the build. This allows you to output a summary of your workflow run by appending markdown to the file referenced by the special $GITHUB_STEP_SUMMARY environment variable, e.g: echo '### Hello world! :rocket:' >> $GITHUB_STEP_SUMMARY

We run our tests with mocha, which doesn't ship with a markdown output formatter. The "min" formatter was quite close to what I wanted (a summary of any failed tests but a "quiet" output if everything passed). Dumping that to a code fence would have probably been acceptable. Unfortunately our test suite has a number of tests which log output to stdout, which made things a bit messy as the "min" formatter also dumps to stdout. So I decided to write a quick script to parse mocha's json output and produce a markdown summary. Doing this also allowed me to use some nicer formatting than dumping console output into a code fence.

// mocha2md.js

import fs from 'fs'

const title = process.argv[2]
const data = JSON.parse(fs.readFileSync(process.argv[3]))

process.stdout.write(`# ${title}\n\n`)

if (data.stats.passes > 0) {
  process.stdout.write(`✔ ${data.stats.passes} passed\n`)
}
if (data.stats.failures > 0) {
  process.stdout.write(`✖ ${data.stats.failures} failed\n\n`)
}

if (data.stats.failures > 0) {
  for (const test of data.tests) {
    if (test.err && Object.keys(test.err).length > 0) {
      process.stdout.write(`### ${test.title}\n\n`)
      process.stdout.write(`${test.fullTitle}\n\n`)
      process.stdout.write('```\n')
      process.stdout.write(`${test.err.stack}\n`)
      process.stdout.write('```\n\n')
    }
  }
}

Combine that with some workflow yaml to run the tests with the json reporter and use our script to write the report.

- name: Run tests
  run: npm run test:core -- --reporter json --reporter-option 'output=reports/test.json'

- name: Write Markdown Summary
  run: node mocha2md.js Tests reports/test.json >> $GITHUB_STEP_SUMMARY

and we've got ourselves a nice little summary report from our mocha tests.

[image: example markdown summary]

Diagrams

Last week I tried out diagrams to knock up some cloud infrastructure diagrams. There are several things I really like about this tool:

  • The learning curve is very easy. I was able to absorb the key concepts and produce a useful diagram showing the AWS setup for an application I am working on within about 30 mins of installing it for the first time.
  • The [effort in]:[pretty pictures out] ratio is very satisfying.
  • Because the diagram is generated from code, it can live in your repo. The diff changing the diagram could be in the same commit as the updates to your CDK definitions or ansible playbooks or whatever it is that actually makes the infrastructure changes.

For example, the following diagram

[image: example diagram]

is generated from this short python snippet:

from diagrams import Cluster, Diagram
from diagrams.aws.compute import Fargate
from diagrams.aws.database import RDS, ElastiCache
from diagrams.aws.engagement import SES
from diagrams.aws.network import ELB, Route53
from diagrams.aws.storage import S3

with Diagram("", show=False):
    ses = SES("Mail Transport (SES)")
    dns = Route53("Route 53 (DNS)")
    s3 = S3("S3")

    with Cluster("VPC"):
        lb = ELB("Load Balancer (ALB)")
        elasticache = ElastiCache("Redis (ElastiCache)")

        with Cluster("ECS"):
            web = Fargate("web")

        with Cluster("DB Cluster (RDS)"):
            db_primary = RDS("primary")
            db_primary - RDS("read replica")

    dns >> lb
    lb >> web

    web >> elasticache
    web >> db_primary
    web >> s3
    web >> ses