May & June 2026: Defer, Scoring, and Visible Abs
Two months heads-down for AI Engineer World's Fair, plus a digital garden and a fitness app that wants abs.
Work
May and June were almost entirely consumed by getting two new Inngest features ready for the AI Engineer World’s Fair, which kicks off this week in San Francisco. I was a primary contributor on both, so I’ll keep the details light and let the docs do the talking:
- Defer lets a function spin up follow-up work inline, without coordinating a separate function over events. The interesting part was modeling it as a new opcode class: “lazy ops” that piggyback on a host op in the same SDK response instead of standing alone, and “priority ops” that generalize the existing WaitForEvent pattern so a defer registration persists before finalize fires. I shipped the core DeferAdd/DeferCancel opcodes, wired deferred runs into the dev-server UI, and wrote the announcement post.
- Scoring & Experiments let you A/B test and grade outcomes (LLM judges, guardrails, engagement signals) right in the execution layer. I built the Scoring Dashboard end to end: backend, UI, and a perf pass once real traffic showed up, tightening a query predicate for the runs view only so the steps view’s semantics didn’t shift underneath anyone.
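To make the "lazy op" idea a bit more concrete, here is a toy TypeScript model. It is only a sketch under assumptions, not Inngest's wire format or SDK code: the `Op` shape, the `ResponseBuilder` class, and the rule that a response with no host op carries nothing are all hypothetical. The only name taken from the work above is the `DeferAdd` opcode.

```typescript
// Toy model only: the Op shape and ResponseBuilder are hypothetical,
// not Inngest's actual wire format.
type Op = { id: string; op: string };

class ResponseBuilder {
  private pendingLazy: Op[] = [];

  // Lazy ops are buffered; they never produce a response by themselves.
  addLazy(op: Op): void {
    this.pendingLazy.push(op);
  }

  // Build the op list for one SDK response. Buffered lazy ops ride along
  // only when at least one host op is present to carry them.
  build(hostOps: Op[]): Op[] {
    if (hostOps.length === 0) return [];
    const out = [...hostOps, ...this.pendingLazy];
    this.pendingLazy = [];
    return out;
  }
}
```

In this model, calling `addLazy` with a `DeferAdd` op and then `build` with a single host op returns both ops in one list, and the buffer is empty for the next response.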
Alongside that, a real slice of May and June went into a thread of checkpointing-reliability bugs: an HTTP timeout could cause the executor to requeue a run while the original dispatch was still checkpointing, producing duplicate steps or stranding the run entirely. The fix fences off stale dispatches by echoing a request ID on every checkpoint POST and treating a 409 as stale (SDK); a later pass renamed the mechanism to GenerationID to better match what it actually tracks.
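The fencing idea can be sketched in a few lines of TypeScript. This is a minimal illustration, assuming the server tracks one live generation ID per run. `CheckpointServer`, `interpret`, and the outcome names are hypothetical stand-ins rather than the real executor or SDK code; only the "echo an ID, treat 409 as stale" behavior comes from the fix described above.

```typescript
// Hypothetical in-memory stand-in for the executor's view of each run.
class CheckpointServer {
  // runId -> generation ID of the dispatch that currently owns the run
  private current = new Map<string, string>();

  // Called whenever the executor dispatches (or requeues) a run.
  dispatch(runId: string, generationId: string): void {
    this.current.set(runId, generationId);
  }

  // Mimics the checkpoint POST: 200 if the echoed ID matches the live
  // dispatch, 409 if a newer dispatch has taken over.
  checkpoint(runId: string, generationId: string): number {
    return this.current.get(runId) === generationId ? 200 : 409;
  }
}

type CheckpointOutcome = "saved" | "stale" | "error";

// SDK-side reading of the status code (outcome names are illustrative).
function interpret(status: number): CheckpointOutcome {
  if (status === 409) return "stale"; // a newer dispatch owns the run: stop, do not retry
  if (status >= 200 && status < 300) return "saved";
  return "error";
}
```

The point of the design is that a slow original dispatch cannot write steps after a requeue: once the run has been dispatched again with a new ID, every checkpoint carrying the old ID is rejected instead of producing duplicates.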
That stretch also included a websocket memory leak in Realtime, which was a bigger deal than the one-line PR title suggests: two Realtime pods were slowly leaking memory until they OOMed, which the load balancer surfaced as plain 503s with nothing obviously wrong on our end. Chasing it down was my first real hands-on experience with Go memory profiling: pulling heap and goroutine dumps via pprof and diffing snapshots with Riadh’s help to trace the leak to a Redis pipe receive loop in the broadcaster. It had been quietly behind a handful of customer support tickets, so finding it felt good.
Smaller stuff: a checkpoint-resume bug after parallelism.
Honestly, it was a slog — the whole Inngest team worked its ass off to get so much hot new stuff stage-ready for the Fair. I’m proud of what we shipped, though.
Personal
I’ve started thinking of my non-work coding time as tending a digital garden: a few small, deliberately unpolished things I grow for myself rather than an audience, revisited and reworked instead of stamped with a publish date and forgotten. The current resident is Voodoo, a “code garden” written in Go with Echo, sqlc, and goose, server-rendered with no JavaScript at all. It’s not open source, and that’s sort of the point. (An earlier attempt at the same idea, a TanStack Start + Hono stack called Kudzu, didn’t stick.)
The other one is Jocko, a fitness coaching app that pulls Apple Health data for deep analysis and hands it to an LLM for personalized advice, because I want visible abs, damn it. Not on GitHub.
For fun, but doubling as real demos of the scoring work above, I also built Scoop (live at scoop.thelinell.com), an ice-cream-themed RSS reader that A/B tests its own summary strategies and scores them live. It’s TanStack Start on Cloudflare Workers with D1, and it leans on Inngest’s actual experiment and scoring primitives rather than anything bespoke: each new story gets summarized by one of two variants, questionLed (opens with a curiosity-gap question) or factLed (leads with the sharpest fact), picked by a weighted, run-seeded experiment.
Guardrail scores (length, em-dash abuse, refusal) run inline; an LLM-judge faithfulness score runs as a deferred scorer so a slow judge call never blocks the saved summary; engagement scores (opens, clickthroughs, saves) arrive later as their own event-driven runs, re-attributed back to the original summary run. The early numbers make the case for multi-metric scoring on their own: factLed wins faithfulness (0.97 vs 0.83), but questionLed wins every engagement metric, so grading on faithfulness alone would ship the less-clicked variant.
The README goes deeper on the wiring if you want to poke around, and I also recorded a Loom walkthrough showing Scoop off alongside the Inngest features it’s demoing. There’s also Demotivational Quotes, which picks Claude or GPT via experiment and scores the result, and a tiny change-set-image Cloudflare Worker that turns release changesets into Slack-ready images, because someone had to.
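For readers wondering what "weighted, run-seeded" means in practice for Scoop's variant choice, here is a minimal TypeScript sketch of the general technique. It is not Inngest's experiment API and not Scoop's code: `fnv1a`, `pickVariant`, the `Variant` shape, and any weights you pass in are assumptions for illustration. The idea is simply that hashing the run ID gives a stable number, so a retried or replayed run lands on the same variant.

```typescript
type Variant = { name: string; weight: number };

// FNV-1a-style 32-bit hash over UTF-16 code units. Deterministic, so the
// same run ID always maps to the same number.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Pick a variant in proportion to its weight, seeded only by the run ID.
function pickVariant(runId: string, variants: Variant[]): string {
  const usable = variants.filter((v) => v.weight > 0);
  const total = usable.reduce((sum, v) => sum + v.weight, 0);
  if (usable.length === 0) {
    throw new Error("need at least one positively weighted variant");
  }
  // Map the hash into [0, total) and walk the cumulative weights.
  let point = (fnv1a(runId) / 0x100000000) * total;
  let fallback = "";
  for (const v of usable) {
    if (point < v.weight) return v.name;
    point -= v.weight;
    fallback = v.name;
  }
  return fallback; // floating-point safety net: last usable variant
}
```

A call such as `pickVariant(runId, [{ name: "questionLed", weight: 1 }, { name: "factLed", weight: 1 }])` returns the same variant name every time for a given `runId`, which is what lets later engagement events be attributed back to the variant that produced the summary. The equal weights here are made up for the example.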