Monthly Retrospective · 15 min read

May 2026: The Month the Classifier Caught Itself — A Mid-Month Calibration Reckoning, 27 Posts, and 964 Commits

May 2026 retrospective: the tier classifier caught itself inflating. Calibration found the rubric structurally broken mid-month. 27 posts, 964 commits.

April’s retrospective ended with a question: “If May settles to 5–15% Tier 3, the system is calibrated correctly and April was a genuine peak. If May stays at 40%+ Tier 3, the classifier’s anti-inflation flags need re-tuning.”

May answered it, but not the way the question was framed. The Tier 3 rate did fall — from April’s 44% to 33% of classified posts. But the headline isn’t the number. It’s that the measurement system caught itself. On May 16th, the monthly calibration machinery — restored from dormancy that same week — graded the month’s posts against the strict rubric and returned a one-word verdict: MISCALIBRATED. Not “drifting.” Not “needs watching.” Structurally broken, with a named root cause: the TCH and SCP dimensions had a floor at 3, which made the Tier 2 gate mathematically always-true and Tier 1 unreachable.

That finding is the most important thing that happened to this blog in May. Not a single post. The discovery that the thing deciding what the posts are had a provable bug — and the fact that the system surfaced it without anyone going looking.

The rest of the month is the cleanup. The back half shows the classifier fighting its own inflation in the audit trail, day after day, until on May 26th it finally produced the month’s first and only Tier 1 — a 56-line field note it had every structural reason to inflate and didn’t. That is the system doing its job. It is also the first time the calibration loop closed end to end: a bug detected by the grader, a fix injected into the classifier, and a behavior change visible in the data within ten days.

Twenty-seven posts. Nine hundred sixty-four commits across fifty-one active repositories — the heaviest build month since the trend started being tracked. And a four-day hole at the end of the month where the pipeline went dark because it was being repaired. All three of those facts are connected, and the connection is the story.

Velocity Dashboard

| Metric | May 2026 | April 2026 | Delta |
| --- | --- | --- | --- |
| Posts published | 27 | 30 | -3 |
| Non-merge commits (Jeremy-authored, all active repos) | 964 | 443 | +118% |
| Active repos (≥1 commit) | 51 | 25 | +26 |
| Commits per post | 35.7 | 14.8 | +21 |
| Posts classified | 21 of 27 (78%) | 18 of 30 (60%) | +18 pp |
| Tier 1 (Field Note) | 1 (5%) | 5 (28%) | -23 pp |
| Tier 2 (Deep-Dive) | 13 (62%) | 4 (22%) | +40 pp |
| Tier 3 (Case Study) | 7 (33%) | 8 (44%) | -11 pp |
| Avg confidence (classified) | 0.83 | 0.81 | +0.02 |
| Structural sweep ↔ classifier agreement (145-record backlog) | 63% | (first measurement) | |
| Classification errors in the inflation direction | 47 of 47 (100%) | | |

A note on the commit numbers, because this is the second retrospective in a row where the accounting matters more than the figure.

The commit method changed, and the change is itself a lesson. Prior retrospectives published a descending commit trend — 627 in January, 446 in February, 340 in March, 162 in April — and read it as a deliberate signal: less code, denser writing. That trend cannot be reproduced. Running a consistent, scriptable count today (non-merge commits authored by Jeremy across every active repo under the projects root, excluding forked and archived third-party code) yields 639 for March, 443 for April, and 964 for May. The previously published figures used a narrower accounting that left no record of its method, so there is no way to re-derive them. The April retrospective’s own Lessons Learned flagged exactly this — “active-repo count without a depth filter overcounts… May’s retrospective should report both numbers.” This month makes the fix permanent: the method is now a committed script, and every future retrospective will report this number and only this number. The April column above is restated under the new method (443, not the published 162) so the delta is honest.
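For readers who want the shape of a reproducible count, here is a minimal sketch under the counting rules quoted above. The committed script itself is not reproduced in this post, so the function names, the author string, the `excluded` set, and the directory layout (one repo per child of the projects root) are all hypothetical. Only the `git rev-list` flags are real.

```python
import subprocess
from pathlib import Path


def rev_list_args(author: str, since: str, until: str) -> list:
    """Build a git call that counts non-merge commits by one author in a window."""
    return [
        "git", "rev-list", "--count", "--no-merges",
        f"--author={author}", f"--since={since}", f"--until={until}",
        "HEAD",
    ]


def count_commits(repo: Path, author: str, since: str, until: str) -> int:
    """Run the count inside one repo. Raises if git fails (e.g. an empty repo)."""
    result = subprocess.run(
        rev_list_args(author, since, until),
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def monthly_total(root: Path, author: str, since: str, until: str,
                  excluded: frozenset = frozenset()) -> int:
    """Sum counts over every child repo of root, skipping excluded names.

    `excluded` stands in for the forked and archived third-party repos;
    how the real script identifies them is not stated in the post.
    """
    total = 0
    for git_dir in sorted(root.glob("*/.git")):
        repo = git_dir.parent
        if repo.name in excluded:
            continue
        total += count_commits(repo, author, since, until)
    return total
```

A call might look like `monthly_total(Path.home() / "projects", "Jeremy", "2026-05-01 00:00:00", "2026-06-01 00:00:00")`. Spelling out the time of day matters, because git reads a bare date relative to the current clock time, which would shift the window between runs.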

Under that consistent lens, May was not a quiet, dense month. It was a sprawling, heavy build month, the busiest on record. Guidewire MCP shipped three releases. The Knowledge OS (intentional-cognition-os) took 132 commits in a single month. The Slack channel shipped a full production day. Partner deliverables moved across five repos. The blog pipeline rebuilt its own CI. Commits per post more than doubled, from 14.8 to 35.7. That volume is the precondition for the calibration story: when every day produces a real ship with a named artifact, a classifier with a broken floor will call every day Tier 2.

The Calibration Reckoning

This section replaces the usual “Methodology Calibration” note because in May it was the month.

What the grader found

On May 16th, /blog-calibrate ran for the first time in weeks — the skill had been moved to an inactive directory and was restored that week. It re-graded May’s posts against the strict tier rubric and reported:

  • Tier distribution, classifier vs. rubric: the classifier had produced 0% Tier 1, 40% Tier 2, 60% Tier 3 across its classified posts; a strict rubric pass over the 15 posts published through mid-month produced 60% Tier 1, 40% Tier 2, 0% Tier 3. (The two passes cover overlapping-but-not-identical pools — 10 classifier-graded vs. 15 rubric-graded — but the direction of the gap is the finding, not the exact denominators.) The expected baseline for this blog is 60–70% Tier 1 — most days are field notes, by design.
  • Strict-rubric ↔ classifier agreement on that 15-post pass: 47% against a target of ≥80%. (The 63% in the velocity table is a different, looser measurement — the deterministic structural sweep over the full 145-record backlog. The strict mid-month rubric pass is harsher, which is why its agreement is lower.)
  • Root cause, stated as arithmetic: the Tier 2 gate fires when max(NOV, TCH, NAR) ≥ 3 AND 2+ dimensions ≥ 3. The classifier was scoring TCH ≥ 3 every single day and SCP ≥ 3 every single day. With two dimensions permanently at or above 3, the gate is structurally always true. Tier 1 was not unlikely. It was unreachable.
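The arithmetic in the root-cause bullet can be checked by brute force. The sketch below is an illustration, not the classifier’s code: it assumes a 1–5 score scale and assumes the “2+ dimensions” clause counts NOV, TCH, NAR, and SCP. It enumerates every score combination and reports what fraction passes the gate, with and without a floor of 3 on TCH and SCP.

```python
from itertools import product

SCORES = range(1, 6)  # assumed 1-5 scale per dimension


def tier2_gate(nov: int, tch: int, nar: int, scp: int) -> bool:
    """Gate as quoted: max(NOV, TCH, NAR) >= 3 AND at least two dimensions >= 3."""
    dims = (nov, tch, nar, scp)
    return max(nov, tch, nar) >= 3 and sum(d >= 3 for d in dims) >= 2


def gate_pass_rate(tch_floor: int = 1, scp_floor: int = 1) -> float:
    """Fraction of reachable score combinations that pass the Tier 2 gate."""
    combos = [
        (nov, tch, nar, scp)
        for nov, tch, nar, scp in product(SCORES, repeat=4)
        if tch >= tch_floor and scp >= scp_floor
    ]
    return sum(tier2_gate(*c) for c in combos) / len(combos)


print(gate_pass_rate())      # no floor: some combinations fail the gate
print(gate_pass_rate(3, 3))  # floor of 3 on TCH and SCP: every combination passes
```

With the floor in place, TCH alone satisfies the `max` clause, and TCH plus SCP satisfy the two-dimension clause, so no assignment of NOV and NAR can make the gate false.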

The anchor evidence was damning in its specificity. transitive-cve-clearance — 107 lines, a single-repo dependency bump explaining one pattern — was scored TCH and SCP high enough for Tier 2, when the rubric anchors put a single-PR dep bump at TCH 1–2, SCP 1–2. coherence-day — 92 lines, below the Tier 2 minimum of 150 — was classified Tier 2. The classifier was reading “I shipped something and learned from it” as “this changes how a category of problems is understood,” which is the TCH=5 anchor. Routine competence was being scored as framework-level insight.

The unidirectional bias

The single cleanest signal in the whole month: across 145 graded records, every one of the 47 disagreements with the classifier pointed the same way — the classifier rated too high. Zero pointed the other way. Not one post in the entire decision backlog was under-rated. A classifier that errs randomly produces a mix of inflation and deflation. A classifier with a stuck floor produces pure inflation. The data shows pure inflation. That is not a calibration wobble; it is a structural bias with a known mechanism, and the 100%-one-direction figure is what proves the mechanism rather than merely suggesting it.
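To put a rough number on “errs randomly,” here is a back-of-the-envelope check. It assumes a toy null model that is mine, not the calibration report’s: each disagreement is independent and equally likely to be inflation or deflation.

```python
def same_direction_probability(n_errors: int) -> float:
    """Chance that n independent 50/50 errors all land on the same side.

    The factor of 2 covers both outcomes: all inflation or all deflation.
    """
    return 2 * 0.5 ** n_errors


p = same_direction_probability(47)
print(f"{p:.2e}")
```

Under that model, 47 of 47 in one direction has a probability on the order of 1e-14. The independence assumption is generous to the classifier, and the result still leaves no room for “random wobble” as an explanation.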

(The 145 records are 130 from the deterministic structural sweep plus 15 from the mid-month calibration’s retrospective rubric grading, and they cover the full decision history, not just the 27 May posts. The weekly structural sweep, added May 16th, walks every prior classification that lacked a feedback record. May is when the backlog got graded, which is why the calibration signal is corpus-wide and not month-scoped.)

The fix, and the proof it worked

The mid-month report made one recommendation that could be tested before month-end: inject the running 14-day tier distribution into the classifier prompt, and when Tier 1 has been 0% for two weeks, force the classifier to justify why today isn’t Tier 1. That landed, and the audit trail from May 20th onward reads completely differently from the first half of the month:

  • May 20: “distribution-inflation: past 14d 0% Tier 1 vs 60–70% target — strict anchors applied… NOV held at 3… SCP held at 3.”
  • May 22: “tier-3-share-inflated-applying-strict-anchor-enforcement… confidence-gated-downgrade fired.”
  • May 23: “14-day window at 0% Tier 1 — inflation acknowledged… SCP held at 2… TCH capped at 3.”
  • May 24: “distribution-signal: past 7 days Tier 1 at 0% — applied anchors strictly, held SCP at 2.”
  • May 26: “14-day-distribution: Tier 1 at 0% across 11 classifier decisions, severe inflation… volume-not-quality: 10 commits + a v1.6.0 release is busy not distinguished.” Result: Tier 1.

The May 26th post — a 56-line note about a CI lint gap that shellcheck and ruff caught — is the proof. It shipped on a 10-commit day with a version release. Under the broken floor, that day classifies Tier 2 without hesitation. Under the corrected prompt, the classifier looked at the 14-day Tier-1 drought, recognized “busy is not distinguished,” and called it what it was: a field note. The loop closed.
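The distribution feedback described in this section can be sketched in a few lines. This is a hypothetical reconstruction, not the patch that landed: the record shape (`date`, `tier`), the function names, and the wording of the injected note are all assumptions. The 14-day window, the 60–70% target, and the “justify why today isn’t Tier 1” trigger come from the mid-month recommendation.

```python
from collections import Counter
from datetime import date, timedelta


def tier_shares(decisions, today: date, window_days: int = 14):
    """Return ({tier: share}, total) over decisions in the trailing window."""
    start = today - timedelta(days=window_days)
    recent = [d["tier"] for d in decisions if start <= d["date"] < today]
    counts = Counter(recent)
    total = len(recent)
    shares = {t: (counts[t] / total if total else 0.0) for t in (1, 2, 3)}
    return shares, total


def distribution_note(decisions, today: date, window_days: int = 14) -> str:
    """Build the text that gets prepended to the classifier prompt."""
    shares, total = tier_shares(decisions, today, window_days)
    summary = ", ".join(f"Tier {t} {shares[t]:.0%}" for t in (1, 2, 3))
    lines = [
        f"Past {window_days}d distribution over {total} decisions: {summary}. "
        "Target Tier 1 share: 60-70%."
    ]
    # The forcing clause only fires when there is history and none of it is Tier 1.
    if total and shares[1] == 0.0:
        lines.append(
            "Tier 1 has been 0% for the whole window. "
            "Before assigning Tier 2 or 3, justify why today is not Tier 1."
        )
    return "\n".join(lines)
```

The design point is that the note is computed outside the model and handed to it as input, since the classifier has no other way to see its own aggregate behavior.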

The honest read on the full-month distribution: 5% Tier 1 is still wrong. The target is 60–70%, and a single Tier 1 in a month does not fix a structural floor — it proves the prompt-injection patch can produce the right answer when the distribution pressure is high enough, not that the rubric is repaired. The patch is a brake applied late, not a rebuilt engine. The actual fix — hard anchor enforcement in the classifier agent so TCH ≥ 4 requires a named transferable artifact and SCP ≥ 3 requires genuine multi-system coordination — is June’s first job.
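One possible shape for that hard enforcement, offered as a sketch rather than the planned implementation: a deterministic clamp applied to the scores after the model proposes them. The `Evidence` fields are hypothetical, and reading “genuine multi-system coordination” as “two or more systems touched” is my simplification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Evidence:
    """Hypothetical facts about the day's work, gathered outside the model."""
    named_artifact: Optional[str] = None  # e.g. a named, transferable pattern
    systems_touched: int = 1
    single_pr_single_repo: bool = False


def enforce_anchors(tch: int, scp: int, ev: Evidence) -> Tuple[int, int]:
    """Clamp proposed TCH/SCP scores to what the evidence supports."""
    if ev.single_pr_single_repo:
        # Single-PR, single-repo work defaults to the 1-2 band.
        tch, scp = min(tch, 2), min(scp, 2)
    if tch >= 4 and not ev.named_artifact:
        tch = 3  # TCH >= 4 requires a named transferable artifact
    if scp >= 3 and ev.systems_touched < 2:
        scp = 2  # SCP >= 3 requires multi-system coordination
    return tch, scp
```

For example, `enforce_anchors(5, 4, Evidence())` comes back as `(3, 2)`: high proposed scores with no supporting evidence are pulled down, which is the behavior the floor bug made impossible.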

Top 3 Posts by Teaching Potential

Ranked by the TCH dimension, with the month’s calibration finding applied as a discount — these are the posts whose teaching value survives a skeptical re-read, not just the ones the classifier scored highest.

1. Self-Improving Skills: Three Schema Versions in One Day (May 14)

The post. The only post all month to score TCH=5, and the rare case where that score holds up. Three schema versions of a skill validator shipped in a single day, each one generated by feeding the prior version’s failures back into the spec. The transferable artifact is real and named: a feedback loop where the validator’s own rejections become the next version’s test cases. This is the post a reader can apply to a system they haven’t built yet — which is the actual TCH=5 bar, and almost nothing else this month cleared it.

2. An Anti-Slop Framework Found Three Bugs Inside Itself on Day One (May 3)

The post. 292 lines, and the highest reproducibility score of the month (RPR=5). An OSS-contribution framework built to catch low-quality AI output shipped to v0.1.0, then immediately shipped two more releases the same day — because its first real run found three silent failures in its own gates. The teaching value is the candor: a quality tool’s credibility comes from publishing the bugs it found in itself, not from claiming it has none. The narrative earns its length.

3. Spec Graduation: When a Partner Email Rewrites Architecture (May 9)

The post. A partner emailed to remind Jeremy what the engagement actually contracted for. The instinct is to defend the work already done; the right move was to re-read the contract and discover that the constraint was the architecture — the thing that looked like a limitation was the design. The transferable lesson — when a stakeholder pushes back, re-read the spec before defending the build — is the kind that applies far outside this one engagement.

Projects: Started / Shipped / In Progress

Shipped This Month

  • Guidewire MCP v0.1.0 → v0.1.2 (85 commits). A carrier-native MCP server for the insurance vertical — three public releases, the v0.1.0→v0.1.1 sequence in 76 minutes. Carrier-vocabulary tool design plus a hash-chained audit moat. The discipline of shipping the version sequence in public is the product story.
  • server-ops-mcp — a 16-tool privileged-action ops MCP built safety-model-first: write denylist, dry-run default, argument validation, key-only auth. The post’s thesis — design the 7-point safety model before you expose a single tool — is the right order of operations for any MCP that can touch production.
  • Claude Code Slack channel — a full production day (74 commits). The “ship dormant, wire later” pattern: infrastructure landed behind feature flags days earlier, so activation day was wiring plus a CLI, not architecture plus wiring plus a CLI.
  • contributing-clanker v0.1.0 (43 commits) — the anti-slop OSS contribution framework (see Top Posts #2).
  • Blog pipeline CI hardening — shellcheck + ruff + Hugo-build gates now run on plugin/skills/**/*.{sh,py}, closing a lint gap that covered every language except the pipeline’s own scripts (the May 26 Tier 1).

In Progress

  • intentional-cognition-os (Knowledge OS) — the single heaviest repo of the month at 132 commits. Dogfooding surfaced a silent search-engagement bug that unit tests missed; the fix was a strict-then-broad FTS5 fallback. The eval stack around it (intent-eval-lab, intent-eval-core, j-rig-binary-eval) moved in parallel.
  • VPS-as-the-home program (28 commits in the runbook). Day one alone took eight deploy iterations to get Tailscale OIDC + a reusable workflow right. The per-priority discipline held: code → CLAUDE.md → 000-docs → runbook → AAR → bead+GH closure.
  • Partner engagements — Kobiton (61 commits), partner-portals (22), Nixtla (25), Lilly (17), Mandy (10). The “three guards against shipping slop” post came out of a day shipping seven partner PRs without shipping slop: adversarial pre-flight, empirical verification, post-delivery consistency check.
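The strict-then-broad FTS5 fallback mentioned for the Knowledge OS is worth a small illustration. The sketch below is not that project’s code. It assumes “strict” means every term must match (AND) and “broad” means any term may match (OR), and the `notes` table with `title` and `body` columns is invented for the example. It needs an SQLite build with FTS5 enabled.

```python
import sqlite3


def fts_query(terms, operator: str) -> str:
    """Join terms into an FTS5 query, quoting each so punctuation is not parsed as syntax."""
    quoted = ['"' + term.replace('"', '""') + '"' for term in terms]
    return f" {operator} ".join(quoted)


def search(conn: sqlite3.Connection, text: str) -> list:
    """Try the strict query first; fall back to the broad one only if it finds nothing."""
    terms = text.split()
    if not terms:
        return []
    for operator in ("AND", "OR"):
        rows = conn.execute(
            "SELECT title FROM notes WHERE notes MATCH ? ORDER BY rank",
            (fts_query(terms, operator),),
        ).fetchall()
        if rows:
            return [row[0] for row in rows]
    return []
```

Against a table created with `CREATE VIRTUAL TABLE notes USING fts5(title, body)`, a query whose terms all appear in one note returns via the strict pass, while a query with one unknown term still surfaces the partial match instead of an empty result. An empty result on a reasonable query is the kind of silent failure that dogfooding tends to catch and unit tests tend to miss.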

Started This Month

  • The calibration feedback loop itself. The weekly structural sweep (May 16) and the distribution-injection patch are new this month. They are the reason this retrospective can say “the system caught itself” instead of “I noticed the tiers looked high.”

Content Strategy Metrics

  • Cross-post queue: 18 entries pending, all currently skipping cleanly because MEDIUM_INTEGRATION_TOKEN is unset — graceful no-op, not failure. The canonical URL strategy (Dev.to/Hashnode +24h, Medium +48h) is unchanged; the queue is built but the Medium leg is dark pending a token.
  • The pipeline repaired itself, loudly, at month-end. The four-day gap (May 28–31) is not missing work — it is the pipeline being rebuilt. The commits dated May 27–28 tell the story: the daily backfill timeout was raised from 1800s to 5400s, claude -p was pty-wrapped so a timeout leaves a diagnosable trail, phase markers were added so a long run can be bisected, and a commit titled “eliminate the two largest avoidable costs in the blog pipeline” landed. The mid-month calibration had already named the cause — bookkeeping steps skipped under headless runs, “possibly token-budget exhaustion before the final audit-record write.” The end-of-month repair is the fix for that exact failure. The cost of the fix was four days of no posts. That is a fair trade and it is on the record.

Wins

  • The measurement system found a bug in itself, unprompted. A grader that only ever rubber-stamps is theater. This one returned MISCALIBRATED on its own decisions and named the arithmetic. That is the single most valuable thing a calibration system can do, and it did it.
  • The prompt patch was visible in the data within ten days. Detection on the 16th, prompt patch shortly after, first correct Tier 1 on the 26th. A closed loop with a timestamp on every step — proof of concept, not full repair.
  • The unidirectional-bias finding is publishable on its own. “100% of our classification errors are inflation, zero are deflation” is a clean, falsifiable diagnostic that any team running an LLM classifier can steal.
  • The commit-accounting method is now reproducible and committed. The April lesson is fixed, not just noted. No future retrospective will publish an underivable number.
  • Volume held under genuine load. 964 commits, 51 repos, three MCP releases, a production Slack day, five partner repos — and still 27 posts. The heaviest build month did not stop the daily cadence until the pipeline itself needed surgery.

Lessons Learned

  • A classifier’s distribution prior is a load-bearing input, not a sanity check. The whole inflation problem was invisible until the 14-day distribution got injected into the prompt. The model could not see that it had produced zero Tier 1 posts in two weeks because nothing told it. Models do not have memory of their own aggregate behavior; you have to feed it back.
  • Restore your calibration tooling before you need it, not when you suspect a problem. /blog-calibrate and /blog-feedback had been parked in an inactive directory. They caught a structural bug the first time they ran after being restored. How long had the floor been broken before May 16th? The audit trail can’t say, because the grader wasn’t running. Dormant instrumentation is the same as no instrumentation.
  • A prompt-injection patch is a brake, not a repair. The distribution-feedback loop produced one correct Tier 1, which proves it can work under pressure. It did not fix the underlying floor — TCH and SCP still default high. Patching the symptom bought a correct answer; it did not rebuild the rubric. Don’t mistake the brake for the engine.
  • Bookkeeping is the first thing to fail under a token budget, and it fails silently. Six of 27 posts have no classification record; the audit-trail addendum was missing on roughly a third of days mid-month. The headless runs were spending their budget on the post and running dry before the audit write. The fix (raise the timeout, cut the costs, hard-fail on a missing record) is the right shape — verify the record exists before pushing, not after.
  • A heavy build month is exactly when a broken classifier does the most damage. When every day ships something real, an always-true Tier 2 gate has maximum surface area to inflate. The calibration bug was most dangerous precisely in the month with the most work.
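The bookkeeping lesson above (verify the record exists before pushing, not after) can be sketched as a pre-push guard. This is a guess at the shape, not the pipeline’s code: the post does not describe the record schema, so the `date` field and both function names are hypothetical.

```python
import json
from pathlib import Path


def has_record(lines, target_date: str) -> bool:
    """True if any JSONL line is an object whose 'date' equals target_date."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a corrupt line should not count as a record
        if isinstance(record, dict) and record.get("date") == target_date:
            return True
    return False


def require_record(log_path: Path, target_date: str) -> None:
    """Exit non-zero, loudly, when the day's classification record is missing."""
    lines = log_path.read_text(encoding="utf-8").splitlines() if log_path.exists() else []
    if not has_record(lines, target_date):
        raise SystemExit(
            f"no classification record for {target_date} in {log_path}; refusing to push"
        )
```

Called just before the push step, `require_record` turns a silent skip into a failed run, which is the whole point: a missing audit write should stop the pipeline rather than disappear into it.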

Next Month Focus

  1. Repair the rubric, not just the prompt. Land hard anchor enforcement in blog-classifier: TCH ≥ 4 requires an explicit named transferable artifact; SCP ≥ 3 requires genuine multi-system coordination; single-PR, single-repo work defaults to TCH/SCP 1–2. Re-grade May under the repaired rubric and confirm the distribution moves toward 60–70% Tier 1.
  2. Close the bookkeeping gap to zero. Hard-fail the daily cron if decisions.jsonl did not gain a record for the target date before the push step. No silent skips. Target 100% classification coverage and 100% audit-addendum coverage in June.
  3. Backfill the four-day hole and the six unclassified posts. May 28–31 need posts; May 1, 4, 7, 10, 15, and 25 need classification records. Close both before they calcify into permanent gaps.
  4. Get a human in the calibration loop. The 145 structural grades are agent-judged. Sweep the highest-signal disagreements through /blog-feedback --correct|--wrong so the next calibration report has a real Brier score instead of a structural proxy.
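For reference, the Brier score named in the last item is the mean squared gap between stated confidence and what actually happened. A minimal sketch follows, assuming each human-reviewed decision reduces to a pair of the classifier’s confidence and a correct/wrong verdict. How /blog-feedback stores those verdicts is not shown in this post, so the input shape is hypothetical.

```python
def brier_score(graded) -> float:
    """Mean of (confidence - outcome)^2 over (confidence, was_correct) pairs.

    outcome is 1.0 when the human marked the decision correct, else 0.0.
    Lower is better: 0.0 is perfect, and always answering 0.5 scores 0.25.
    """
    pairs = list(graded)
    if not pairs:
        raise ValueError("need at least one human-graded decision")
    return sum((conf - (1.0 if ok else 0.0)) ** 2 for conf, ok in pairs) / len(pairs)
```

The reason this beats the structural proxy: a classifier that reports high confidence while a third of its calls are wrong gets penalized heavily on exactly those calls, so overconfidence shows up in the number instead of hiding behind an agreement rate.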

April was the month Tier 3 found its footing. May was the month the floor under it turned out to be tilted — and the month the system that measures the floor proved it could find that out by itself. The most useful blog post of the month was the calibration report nobody published. This retrospective is the version that gets read.