Querying GitHub Repos in Bulk

I've recently been working on a tool called pip-abandoned, which allows you to search for abandoned and deprecated python packages.

In order to make this, one of the things I needed to do was fetch the archived property for lots of GitHub repositories. Using the GitHub v3 (REST) API, this would need one round-trip per repo. So if I want to fetch the archived property for

  • pygments/pygments
  • pycqa/mccabe
  • readthedocs/commonmark.py

then I need to make three API calls:

  • https://api.github.com/repos/pygments/pygments
  • https://api.github.com/repos/pycqa/mccabe
  • https://api.github.com/repos/readthedocs/commonmark.py

...and if I had a list of 200 GitHub repos, that would require 200 individual network requests.
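For illustration, the REST version looks something like this (a rough sketch using the requests library; unauthenticated for brevity, although in practice you'd want to send a token to avoid rate limiting):

import requests

repos = ["pygments/pygments", "pycqa/mccabe", "readthedocs/commonmark.py"]

# One GET request per repo - this is the round-trip cost we want to avoid
for repo in repos:
    resp = requests.get(f"https://api.github.com/repos/{repo}")
    resp.raise_for_status()
    print(repo, resp.json()["archived"])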

Fortunately, GitHub also has a GraphQL API, which lets us query more than one repo in a single request. To fetch the same data as above, I can make the following query

query {
  pygments: repository(
    owner: "pygments", name: "pygments"
  ) {
    isArchived
  }
  mccabe: repository(
    owner: "pycqa", name: "mccabe"
  ) {
    isArchived
  }
  commonmark: repository(
    owner: "readthedocs", name: "commonmark.py"
  ) {
    isArchived
  }
}

to retrieve the archived property for those three repos in a single round-trip over the network. When querying a large number of repos this is a big benefit.
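To make that concrete, here's a minimal sketch of sending that query from Python with requests. The GITHUB_TOKEN environment variable is my assumption; the GraphQL API requires authentication, so any token with read access to public repos will do:

import os
import requests

query = """
query {
  pygments: repository(owner: "pygments", name: "pygments") { isArchived }
  mccabe: repository(owner: "pycqa", name: "mccabe") { isArchived }
  commonmark: repository(owner: "readthedocs", name: "commonmark.py") { isArchived }
}
"""

# Unlike the REST API, all GraphQL requests must be authenticated
resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()

# One response contains the answer for every aliased repository
for alias, repo in resp.json()["data"].items():
    print(alias, repo["isArchived"])

For a longer list of repos, you would build the aliased repository(...) blocks up programmatically; aliases just need to be unique within the query.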

Always bet on SQL

I don't often write spicy opinion pieces, but it is spicy opinion time: ORMs are fine, but it is not worth investing too much time in becoming an expert in one.

I first learned SQL over 20 years ago. Everything I learned then is still true now. Lots of other things have also become true in the meantime, but everything I learned then still holds.

I've also used a number of different ORMs in some depth. Some free-standing. Some attached to a particular framework.

  • Django ORM
  • SQLAlchemy
  • Doctrine
  • CakePHP ORM

All of those use different design patterns and provide conceptually quite different abstractions over the same underlying SQL. There are some common ideas between them, but if you pick any two from that list there are probably more differences between them than similarities. Only a subset of the knowledge from learning one is directly transferable to another ORM.

ORMs come and go as different languages and frameworks gain or lose traction, but the underlying SQL is a constant.

So here's a rule of thumb: ORMs are fine. They do a useful job. Even if you don't like them, you'll probably end up using one. But any given ORM doesn't really warrant too much energy or attention. Focus on SQL. Understand how an ORM maps onto the underlying SQL and learn your current ORM well enough to make it generate the SQL statement you want under the hood.
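For example, here's what that looks like in SQLAlchemy (the Package model is hypothetical, purely for illustration):

from sqlalchemy import Boolean, Column, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Package(Base):
    # A hypothetical table, just to have something to query
    __tablename__ = "package"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    is_archived = Column(Boolean)

# The ORM expression...
query = select(Package).where(Package.is_archived.is_(True)).order_by(Package.name)

# ...and the SQL it will generate under the hood
print(query)

The ORM-specific API here (select(), where()) is the part that won't transfer to your next ORM. The SELECT ... WHERE ... ORDER BY it prints is the part that will.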

There is a good chance that 5 years from now:

  • Anything you learn about SQL today will remain valid and useful, even when transitioning to a new ORM
  • You are likely to be using a different ORM, and much of your current ORM knowledge may not apply

relay.fedi.buzz

I self-host my own mastodon instance. I am the only user on my server and I only follow a handful of accounts. This means I inhabit a somewhat weakly connected corner of the fediverse. For example, following a tag doesn't act as a useful way for me to discover new posts because it almost exclusively shows me posts from people I am already following, which I know about anyway.

This is a problem that is theoretically solved by ActivityPub Relays, although practically I was never really able to get a handle on which (if any) made sense for me to add to my instance.

That changed recently when I found out about relay.fedi.buzz, which allows you to generate ad-hoc follow-only ActivityPub relays for specific tags or instances (the most useful of those being tags, IMO). For example, if I want to follow the #python tag, I can add https://relay.fedi.buzz/tag/python to my instance's relays. That ingests a much wider range of posts featuring that tag into my federated timeline, including from instances I was not already federated with via my follows. Following that tag from my personal account then allows me to discover new posts on that topic.

Adding a custom tag to a Sentry event

Sentry allows you to enrich captured events by applying custom tags and attributes. I was recently working on a python application where I needed a re-usable abstraction to express the logic "if function X throws exception Y then apply this custom key=value tag when we log the exception to Sentry" in a bunch of places. Here's what I came up with:

from functools import wraps
from sentry_sdk import capture_exception, push_scope

def tag_error(exc_class, key, value):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except (KeyboardInterrupt, SystemExit):
                # Never report interpreter-exiting exceptions to Sentry
                raise
            except Exception as err:
                if isinstance(err, exc_class):
                    # Apply the tag on an isolated scope so it doesn't
                    # leak onto unrelated events, then report the error
                    with push_scope() as scope:
                        scope.set_tag(key, value)
                        capture_exception(err)
                # Always re-raise so normal error handling still happens
                raise
        return wrapper
    return decorator

This gives us a @tag_error decorator, which can be applied to any function. For example:

@tag_error(ValueError, "custom-key", "custom-value")
def do_a_thing():
    ...
    raise ValueError("Oh no. A terrible thing has happened.")
    ...

This will tag any ValueError raised by calling do_a_thing() with custom-key=custom-value when we log the exception to Sentry.

Generating a GitHub Markdown Summary from Mocha

I recently wanted to migrate some CI builds running mocha tests from CircleCI to GitHub Actions. I also wanted to use Job Summaries to produce a markdown summary of the build. This allows you to output a summary of your workflow run by appending markdown to the file referenced by a special environment variable called $GITHUB_STEP_SUMMARY, e.g. echo '### Hello world! :rocket:' >> $GITHUB_STEP_SUMMARY

We run our tests with mocha, which doesn't ship with a markdown output formatter. The "min" formatter was quite close to what I wanted (a summary of any failed tests but a "quiet" output if everything passed). Dumping that to a code fence would probably have been acceptable. Unfortunately our test suite has a number of tests which log to stdout, which made things a bit messy because the "min" formatter also writes to stdout. So I decided to write a quick script to parse mocha's json output and produce a markdown summary. Doing this also allowed me to use some nicer formatting than dumping console output into a code fence.

// mocha2md.js

import fs from 'fs'

// Usage: node mocha2md.js <title> <path to mocha json report>
const title = process.argv[2]
const data = JSON.parse(fs.readFileSync(process.argv[3]))

process.stdout.write(`# ${title}\n\n`)

if (data.stats.passes > 0) {
  process.stdout.write(`✔ ${data.stats.passes} passed\n`)
}

if (data.stats.failures > 0) {
  process.stdout.write(`✖ ${data.stats.failures} failed\n\n`)

  // Write a heading and stack trace for each failing test
  for (const test of data.tests) {
    if (test.err && Object.keys(test.err).length > 0) {
      process.stdout.write(`### ${test.title}\n\n`)
      process.stdout.write(`${test.fullTitle}\n\n`)
      process.stdout.write('```\n')
      process.stdout.write(`${test.err.stack}\n`)
      process.stdout.write('```\n\n')
    }
  }
}

Combine that with some workflow yaml to run the tests with the json reporter and use our script to write the report.

- name: Run tests
  run: npm run test:core -- --reporter json --reporter-option 'output=reports/test.json'

- name: Write Markdown Summary
  run: node mocha2md.js Tests reports/test.json >> $GITHUB_STEP_SUMMARY

and we've got ourselves a nice little summary report from our mocha tests.

[image: example markdown summary]

Diagrams

Last week I tried out diagrams to knock up some cloud infrastructure diagrams. There are several things I really like about this tool:

  • The learning curve is very easy. I was able to absorb the key concepts and produce a useful diagram showing the AWS setup for an application I am working on within about 30 mins of installing it for the first time.
  • The [effort in]:[pretty pictures out] ratio is very satisfying.
  • Because the diagram is generated from code, it can live in your repo. The diff changing the diagram could be in the same commit as the updates to your CDK definitions or ansible playbooks or whatever it is that actually makes the infrastructure changes.

For example, the following diagram

[image: example diagram]

is generated from this short python snippet:

from diagrams import Cluster, Diagram
from diagrams.aws.compute import Fargate
from diagrams.aws.database import RDS, ElastiCache
from diagrams.aws.engagement import SES
from diagrams.aws.network import ELB, Route53
from diagrams.aws.storage import S3

with Diagram("", show=False):
    ses = SES("Mail Transport (SES)")
    dns = Route53("Route 53 (DNS)")
    s3 = S3("S3")

    with Cluster("VPC"):
        lb = ELB("Load Balancer (ALB)")
        elasticache = ElastiCache("Redis (ElastiCache)")

        with Cluster("ECS"):
            web = Fargate("web")

        with Cluster("DB Cluster (RDS)"):
            db_primary = RDS("primary")
            db_primary - RDS("read replica")

    dns >> lb
    lb >> web

    web >> elasticache
    web >> db_primary
    web >> s3
    web >> ses

Three Rich tips

I've mentioned Will McGugan's excellent library Rich on this blog before. It is a great tool for building nice terminal interfaces, but it is also an important local development tool. Here are three top tips, with a short code sketch after the list:

  1. Rich can be registered as a handler to render stacktraces. As well as the aesthetics, using Rich to handle stacktraces like this provides additional context which improves the usefulness of error messages in comparison to python's default handler.
  2. rich.inspect can be used to examine a python object at runtime. I used to use dir() or vars() for this, but rich.inspect() is a big step up.
  3. Rich can be used as a log handler. The docs cover how to use it with python's logging module, but Will has also published this blog post showing how to configure Django to use Rich as the default log handler.
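Here's a minimal sketch of all three together (the dict passed to inspect is just an arbitrary example object):

import logging

from rich import inspect
from rich.logging import RichHandler
from rich.traceback import install

# 1. Render stacktraces with Rich; show_locals adds local variables
#    to each frame, which is where much of the extra context comes from
install(show_locals=True)

# 2. Examine any python object at runtime
inspect({"spam": "eggs"}, methods=True)

# 3. Use Rich as a log handler
logging.basicConfig(level="INFO", handlers=[RichHandler()])
logging.getLogger(__name__).info("Hello from Rich")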

HTML 5 Kitchen Sink

HTML 5 Kitchen Sink is really useful for testing out themes and stylesheets. It also has the helpful side effect of introducing me to (or reminding me about) some of the less common HTML5 elements that exist in the spec as I use it.

Composite Actions vs Reusable Workflows

A few days after I blogged about GitHub Composite Actions, GitHub launched another similar feature: Reusable Workflows.

There is a lot of overlap between these features and there are certainly some tasks that could be accomplished with either. At the same time, there are some important differences that drive a bit of a wedge between them.

  • A composite action is presented as one "step" when it is invoked in a workflow, even if the action yaml contains multiple steps. Invoking a reusable workflow presents each step separately in the summary output. This can make debugging a failed composite action run harder.

  • Reusable workflows can use secrets, whereas a composite action can't: you have to pass secrets into a composite action as parameters. Reusable workflows are also always implicitly passed secrets.GITHUB_TOKEN. This is often convenient, but another way to see this tradeoff would be to say: if you're using a reusable workflow published by someone else, it can always read your GITHUB_TOKEN with whatever scopes it is granted, which may not always be desirable. A composite action can only read what you explicitly pass it.

  • Both can only take string, number or boolean as a param. Arrays are not allowed.

  • Only a subset of job keywords can be used when calling a reusable workflow. This places some restrictions on how they can be used. To give an example, reusable workflows can't be used with a matrix but composite actions can, so

jobs:
  build:
    strategy:
      matrix:
        param: ['foo', 'bar']

    uses: chris48s/my-reusable-workflow/.github/workflows/reuse-me.yml@main
    with:
      param: ${{ matrix.param }}

will throw an error, but

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        param: ['foo', 'bar']
    steps:
      - uses: chris48s/my-shared-action@main
        with:
          param: ${{ matrix.param }}

is valid.

  • Steps in a composite action cannot use if: conditions, although there are workarounds. Update: thanks to @bewuethr, who pointed out that composite actions now support conditionals 🎉

  • A composite action is called as a job step, so a job that calls a composite action can also have other steps (including calling other composite actions). A job can only call one reusable workflow, and a job that calls a reusable workflow can't contain other steps.