Monthly Retrospective

June 2026: The First Full Month Under the Repaired Rubric — Tier 1 Went From 5% to 50%, and the Last Mile Got Louder

June 2026 retrospective: the first full month running under May's repaired tier rubric. The distribution corrected from 5% to 50% Tier 1, exactly as predicted — but the structural sweep still flags every disagreement as inflation, and no human has adjudicated one. 27 posts, 1,166 commits.

May’s retrospective ended with an instruction disguised as a hope: “Land hard anchor enforcement… re-grade May under the repaired rubric and confirm the distribution moves toward 60–70% Tier 1.” It named the fix, admitted the fix wasn’t done, and called it June’s first job.

Here is the thing the June data makes clear: the fix was already in the code. The classifier agent and its rubric reference were last modified on May 16th — the same day the calibration reckoning fired — and they have not been touched since. The hard anchor enforcement May described as “June’s first job” was in the classifier the whole back half of May. What May didn’t have was a clean month of it. Half of May predated the patch, and the four-day repair hole at the end meant the corrected rubric never got a full, uninterrupted run.

June is that run. Twenty-seven days, one rubric, no code changes to the thing doing the deciding. It is the first honest test of whether the May fix actually works when applied to a whole month instead of a fortnight.

It works. Tier 1 went from 5% of classified posts in May to 50% in June. That is the single most important number in this retrospective, and it is exactly the direction May predicted. The classifier that could not physically produce a field note in May — the floor was mathematically stuck — produced eleven of them in June, and its audit trail shows it choosing to, day after day, against the pull of a busy build month.

And yet. The same deterministic structural sweep that caught May’s inflation ran every week in June, and it disagreed with the classifier on exactly half the classified posts — eleven of twenty-two. Every one of those eleven disagreements points the same direction it did in May: the classifier rated higher; the line-count grader says lower. No human has looked at a single one. June proves the fix works at the level of the distribution. It also proves the last mile — a human in the calibration loop — is still not built, and the structural sweep is now loud enough that you can’t pretend otherwise.

Velocity Dashboard

| Metric | June 2026 | May 2026 | Delta |
| --- | --- | --- | --- |
| Posts published | 27 | 27 | 0 |
| Non-merge commits (Jeremy-authored, active repos, one reproducible command) | 1,166 | 911 | +28% |
| Active repos (≥1 authored commit) | 43 | 45 | −2 |
| Commits per post | 43.2 | 33.7 | +9.5 |
| Posts classified | 22 of 27 (81%) | 21 of 27 (78%) | +3 pp |
| Tier 1 (Field Note) | 11 (50%) | 1 (5%) | +45 pp |
| Tier 2 (Deep-Dive) | 10 (45%) | 13 (62%) | −17 pp |
| Tier 3 (Case Study) | 1 (5%) | 7 (33%) | −28 pp |
| Avg confidence (classified) | 0.85 | 0.83 | +0.02 |
| Structural sweep ↔ classifier agreement | 50% (11 of 22) | 63% | −13 pp |
| Sweep disagreements in the inflation direction | 11 of 11 (100%) | 47 of 47 (100%) | |

A note on the commit numbers, because it is the third retrospective in a row where the accounting is part of the story. May’s retrospective declared victory over the commit-counting problem: “the method is now a committed script, and every future retrospective will report this number and only this number.” That script was never actually committed. There is no commit-count script in the pipeline. So June could not “report this number” — it had to re-derive one.

The figures above come from a single command run identically over both months: non-merge commits authored by Jeremy (-i --author=jeremy\|longshore), across every git repository under the projects root at depth ≤4, excluding forked third-party clones, archived trees, vendored dependencies, and submodules. Under that one lens, June logged 1,166 authored commits to May’s 911 — a 28% increase and the heaviest build month yet measured. May’s published figure was 964 across 51 repos, under a method that left no reproducible record; restated under this command May is 911 across 45. The delta above is honest because both columns come from the same command, not because either matches a prior publication. The lesson repeats itself, so this time it goes in Next Month with a checkbox: actually commit the script.
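For readers who want to see the shape of that one command, here is a minimal sketch of how it could be scripted. Only the author pattern and the depth limit come from the text above. The function names, the excluded directory names, and the choice of `git rev-list --count` are assumptions for illustration, not the command this retrospective actually ran. Forked third-party clones are not detected here and would need an explicit exclusion list.

```python
import os
import subprocess

AUTHOR_PATTERN = r"jeremy\|longshore"  # quoted in the text above
EXCLUDED_DIR_NAMES = {"node_modules", "vendor", "archive"}  # hypothetical names


def find_repos(root, max_depth=4):
    """Return directories under root (depth <= max_depth) holding a .git directory."""
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    repos = []
    for dirpath, dirnames, _ in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if ".git" in dirnames:
            repos.append(dirpath)
            dirnames[:] = []  # never descend into a repo, so submodules are skipped
            continue
        if depth >= max_depth:
            dirnames[:] = []  # depth limit reached, stop walking this branch
            continue
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIR_NAMES]
    return sorted(repos)


def count_cmd(since, until):
    """Build the git invocation: non-merge commits by the author, case-insensitive."""
    return ["git", "rev-list", "--count", "--no-merges", "-i",
            f"--author={AUTHOR_PATTERN}",
            f"--since={since}", f"--until={until}", "HEAD"]


def count_commits(repo, since, until):
    """Run the count in one repo; repos without a usable HEAD count as zero."""
    result = subprocess.run(count_cmd(since, until), cwd=repo,
                            capture_output=True, text=True)
    return int(result.stdout.strip()) if result.returncode == 0 else 0


def monthly_total(root, since, until):
    """Sum authored non-merge commits across every discovered repo."""
    return sum(count_commits(repo, since, until) for repo in find_repos(root))
```

Calling `monthly_total(projects_root, "2026-06-01", "2026-07-01")` with the same arguments for every month is what keeps the two columns comparable. Note that `--since` and `--until` filter on commit dates as git records them, which is one more reason to freeze the command in a file instead of retyping it.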

Under either accounting, June was not a quiet month. Commits-per-post climbed to 43. The Knowledge OS, the Slack channel, and a wall of agent-governance work all moved at once. That volume matters for the calibration story the same way it did in May: a busy month is the hardest test of an anti-inflation rubric, because every day ships something real and every day tempts the classifier to call routine competence “distinguished.” The distribution held at 50% Tier 1 through that pressure. That is the corrected rubric earning its keep.

The Distribution Corrected — And What the Sweep Still Sees

This section replaces the usual methodology note, because in June the calibration story is the whole point.

The correction is real, large, and visible in the audit trail

The May patch had two parts: hard anchor enforcement (TCH ≥ 4 requires a named transferable artifact; SCP ≥ 3 requires genuine multi-system coordination; single-PR, single-repo work is capped at 1–2) and a distribution feedback loop (inject the running 14-day tier mix into the prompt so the classifier can see its own drift). June’s decision records show both firing, constantly and specifically:

  • “TCH anchor enforced: no named transferable artifact, capped at 2.”
  • “SCP anchor enforced: cross-repo re-key is two near-identical chores, capped at 2.”
  • “Tier 3 rejected at hard gate max(NOV,TCH)=3<4.”
  • “busy-not-distinguished: 4 commits across 4 repos is breadth without depth.”
  • “Step-0 distribution: Tier 1 at 29% signals inflation bias; Tier 2 justified by three independent threshold dims.”

That last one is the machine reasoning correctly under pressure: it saw its own Tier-1 share, recognized the inflation risk, and then justified a Tier 2 on three anchored dimensions rather than the always-true two-dimension floor that broke May. The single Tier 3 of the month — The Wrong Product, Built Perfectly — cleared the hard NAR ≥ 4 gate on genuine merits and nothing else did. Eleven days got called what they were: field notes. The floor that made Tier 1 unreachable in May is gone.
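The two parts of the patch can be sketched in a few lines. This is an illustration under stated assumptions, not the classifier's actual code: the `Scores` type, the function names, and the boolean inputs are hypothetical, and the cap value of 2 is taken from the quoted decision records, not from the rubric itself.

```python
from collections import Counter
from dataclasses import dataclass, replace
from datetime import date, timedelta

ANCHOR_CAP = 2  # assumption: the cap value seen in the quoted decision records


@dataclass(frozen=True)
class Scores:
    nov: int
    tch: int
    scp: int


def enforce_anchors(scores, has_named_artifact, multi_system, single_pr_single_repo):
    """Apply the hard anchors and return (capped scores, audit notes)."""
    tch, scp, notes = scores.tch, scores.scp, []
    if tch >= 4 and not has_named_artifact:
        tch = ANCHOR_CAP
        notes.append("TCH anchor enforced: no named transferable artifact")
    if scp > ANCHOR_CAP and (single_pr_single_repo or not multi_system):
        scp = ANCHOR_CAP
        notes.append("SCP anchor enforced: no genuine multi-system coordination")
    return replace(scores, tch=tch, scp=scp), notes


def tier3_gate_passes(scores):
    """Hard gate as written in the quoted record: max(NOV, TCH) must reach 4."""
    return max(scores.nov, scores.tch) >= 4


def tier_mix(decisions, today, window_days=14):
    """Share of each tier among (day, tier) decisions in the trailing window."""
    start = today - timedelta(days=window_days)
    recent = [tier for day, tier in decisions if start < day <= today]
    counts = Counter(recent)
    return {t: counts[t] / len(recent) if recent else 0.0 for t in (1, 2, 3)}


# Illustrative inputs only, not real June records.
capped, notes = enforce_anchors(Scores(nov=3, tch=4, scp=3), False, False, True)
mix = tier_mix([(date(2026, 6, 10), 1), (date(2026, 6, 12), 2)], date(2026, 6, 14))
```

The mix returned by `tier_mix` is the kind of figure the feedback loop would inject into the prompt, so the classifier sees its own recent distribution before scoring the next day.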

What the structural sweep still sees

The weekly deterministic sweep — pure line-count and title heuristics, no LLM — graded all 22 classified June posts and agreed with the classifier on 11. On the other 11 it said the same thing it said 47 times in May: the classifier rated too high, and never too low. Zero disagreements point the other way. The unidirectional signature that proved May’s structural bias is still present in June.
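The unidirectional signature is easy to state as arithmetic. The sketch below tallies agreement and the direction of each disagreement from pairs of tiers. The function name and the sample data are hypothetical: the pairs are constructed to match the published totals (11 agreements, 11 disagreements) and are not the real per-post records.

```python
def sweep_summary(pairs):
    """Summarize (classifier_tier, sweep_tier) pairs for one month."""
    agree = sum(1 for c, s in pairs if c == s)
    higher = sum(1 for c, s in pairs if c > s)  # classifier rated above the sweep
    lower = sum(1 for c, s in pairs if c < s)   # classifier rated below the sweep
    return {
        "agreement": agree / len(pairs),
        "classifier_higher": higher,
        "classifier_lower": lower,
    }


# Constructed data shaped like June: 11 matching calls, 11 where the
# classifier said Tier 2 and the line-count grader said Tier 1.
june_like = [(1, 1)] * 11 + [(2, 1)] * 11
print(sweep_summary(june_like))
# {'agreement': 0.5, 'classifier_higher': 11, 'classifier_lower': 0}
```

A healthy disagreement pattern would show some mass in both `classifier_higher` and `classifier_lower`. A zero in one of them is the signature the sweep keeps reporting.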

There are two honest readings, and the retrospective owes both:

  1. The pessimistic read: the rubric repair moved a lot of clearly-Tier-1 days down, but on the borderline days the classifier still defaults to Tier 2, and residual inflation persists in exactly the region where it’s hardest to see.
  2. The generous read: the structural sweep is a crude proxy. It maps tier by line count and cannot see an anchored, named TCH artifact. A 160-line post that genuinely teaches a transferable pattern is Tier 2 even though a line-count grader calls it Tier 1. May’s own retrospective warned that the structural sweep runs looser than a strict rubric pass.

The only instrument that can decide between those readings is a human, and no human adjudicated a single one of the 22 June posts. Every feedback record is source: structural_auto_confirm — the machine grading the machine. May’s Next-Month-Focus #4 was “get a human in the calibration loop.” It did not happen. The 11-of-11 inflation signature is precisely the evidence that says it needs to. This is the June finding: the distribution moved to where it should be, and we still cannot prove each individual call is right, because the loop that would prove it is still open.

May’s four commitments, graded

  • #1 — Repair the rubric, confirm the distribution moves toward 60–70% Tier 1. Directionally delivered. 5% → 50% is the right direction and most of the distance. It is not yet 60–70%; the classifier still leans Tier 2 on borderline days. Half-repaired is the honest grade — the engine runs, the timing isn’t perfect.
  • #2 — Close the bookkeeping gap to zero. Missed. 22 of 27 posts have clean classification records (81%, up from 78%). The gap narrowed; it did not close. Two of the misses are manual posts that never entered the pipeline; the rest are genuine holes.
  • #3 — Backfill the four-day May hole (May 28–31). Not done, now permanent. May 28, 29, and 30 have no posts and never will. A commitment deferred into the next month is a commitment that calcifies.
  • #4 — Get a human in the calibration loop. Not done. Still 100% structural auto-confirm. This is the one that matters most, and it’s the one with nothing to show.

Top 3 Posts by Teaching Potential

Ranked by the TCH dimension with the calibration discount applied — the posts whose teaching value survives a skeptical re-read, not just the ones scored highest.

1. The Wrong Product, Built Perfectly (June 5)

The month’s only Tier 3, and it earned the tier on merits the hard gate now actually enforces. The thesis is the most transferable idea of the month: a clean pipeline faithfully amplifies a requirements-level error all the way to production. Good engineering does not protect you from building the wrong thing; it makes the wrong thing arrive faster and more polished. Every team that has ever shipped a beautiful implementation of a misread spec can steal this directly.

2. Human-in-the-Loop Is a Delivery Guarantee, Not a UI Feature (June 7)

The reframe is the teaching. Most teams treat human-in-the-loop as an approval button in a UI. This post argues it is a fail-closed, exactly-once delivery problem: when a human’s Allow/Deny is the gate, the system has to guarantee the action fires once and only once on approval, and never on silence. That moves HITL out of the frontend and into the same reliability discipline as message queues, a genuinely non-obvious, portable lesson.
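A minimal sketch of what "fires once on approval, never on silence" can look like in code. Everything here is hypothetical (the `ApprovalGate` class, the `Verdict` enum, the in-memory store) and is not taken from the post. A real system would need a durable record of settled requests. Because this sketch records the verdict before firing, a crash between the two steps would leave it at-most-once, not exactly-once.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"


class ApprovalGate:
    """Fail-closed gate: the action runs only on an explicit first ALLOW."""

    def __init__(self, action):
        self._action = action
        self._settled = {}  # request_id -> Verdict; the first verdict wins

    def resolve(self, request_id, verdict):
        """Return True only if this call fired the action."""
        if request_id in self._settled:
            return False  # duplicate delivery of a verdict: do nothing
        if verdict not in (Verdict.ALLOW, Verdict.DENY):
            return False  # silence, timeout, or garbage: stay closed and pending
        self._settled[request_id] = verdict
        if verdict is Verdict.ALLOW:
            self._action(request_id)
            return True
        return False


fired = []
gate = ApprovalGate(fired.append)
gate.resolve("req-1", None)           # timeout: nothing fires
gate.resolve("req-1", Verdict.ALLOW)  # fires once
gate.resolve("req-1", Verdict.ALLOW)  # retry of the same approval: ignored
print(fired)  # ['req-1']
```

The design choice worth noticing is that the default branch is "do nothing". Approval is the only path that reaches the action, which is what makes the gate fail closed.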

3. Coverage Said 69%, Mutation Testing Said 25% (June 28)

The most broadly stealable post of the month. A rules engine with 69% line coverage scored 25% on mutation testing, meaning roughly three-quarters of the deliberately injected faults survived without a single test failing. The lesson that coverage measures execution, not verification, is one nearly every engineer has heard and nearly none has felt. This post makes it concrete with a number gap you can’t argue with.
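A toy illustration of the gap between execution and verification. The function, the mutant, and both tests are invented for this example and have nothing to do with the rules engine in the post. The point is only that a test can run every line and still let a changed operator through.

```python
def shipping_fee(total):
    """Orders of 50 or more ship free; smaller orders pay a flat fee of 5."""
    if total >= 50:
        return 0
    return 5


def shipping_fee_mutant(total):
    """Same function with one injected fault: >= became >."""
    if total > 50:
        return 0
    return 5


def weak_test(fn):
    # Runs both branches, so line coverage is 100%, but never probes the boundary.
    return fn(100) == 0 and fn(10) == 5


def strong_test(fn):
    # Adds the boundary case that the mutant gets wrong.
    return weak_test(fn) and fn(50) == 0


print(weak_test(shipping_fee), weak_test(shipping_fee_mutant))      # True True
print(strong_test(shipping_fee), strong_test(shipping_fee_mutant))  # True False
```

Under the weak test the mutant survives, so the mutation score is 0 of 1 despite full coverage. The strong test kills it with one extra assertion.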

Projects: Shipped / In Progress / Started

Shipped This Month

In Progress

Started This Month

  • The honest confrontation with the last mile. The structural sweep ran clean all month and produced the 11-of-11 signal. Nothing acted on it yet — but June is the first month where the gap between “the distribution is right” and “each call is verified” is measured, visible, and impossible to file under later.

Content Strategy Metrics

  • The pipeline ran the whole month. After May’s four-day repair hole, June published 27 straight days with no dark gap. The daily backfill, weekly feedback sweep, and cross-post queue all fired on schedule. Whatever the May end-of-month surgery cost, it bought a stable June.
  • Cross-posting came back online. Hashnode was re-enabled on June 30 (Pro plan live); the Dev.to/Hashnode +24h, Medium +48h canonical-URL cadence is intact. The queue processed cleanly into July.
  • Bookkeeping is the recurring soft failure. Five posts without clean classification records is the same class of gap May flagged. It is small, it is boring, and it is exactly the kind of thing that only gets fixed when the cron hard-fails on a missing record instead of soft-warning.
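The difference between soft-warning and hard-failing on a missing record can be sketched as follows. The file name `decisions.jsonl` appears later in this retrospective; the `date` field, the function names, and the exception type are assumptions for illustration, since the real record schema is not shown here.

```python
import json


class MissingClassificationRecord(Exception):
    """Raised when a published date has no classification record."""


def missing_record_dates(post_dates, jsonl_lines):
    """Return post dates that have no record among the JSONL lines."""
    recorded = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "date" in record:  # assumed field name
            recorded.add(record["date"])
    return sorted(set(post_dates) - recorded)


def require_records(post_dates, jsonl_lines):
    """Hard-fail variant: refuse to continue if any date lacks a record."""
    missing = missing_record_dates(post_dates, jsonl_lines)
    if missing:
        raise MissingClassificationRecord(", ".join(missing))


lines = ['{"date": "2026-06-01", "tier": 1}', '{"date": "2026-06-02", "tier": 2}']
print(missing_record_dates(["2026-06-01", "2026-06-02", "2026-06-03"], lines))
# ['2026-06-03']
```

Wiring `require_records` in front of the publish step means an unhandled exception stops the run with a non-zero exit, where a soft warning would have scrolled past in the log.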

Wins

  • The corrected rubric survived a full, heavy month. 5% → 50% Tier 1, held through 1,166 commits and a build-heavy June. The floor that made Tier 1 unreachable is provably gone, and the audit trail shows the classifier reasoning about its own distribution in real time.
  • The prediction was falsifiable and it came true. May said “confirm the distribution moves toward 60–70% Tier 1.” June is the confirmation — directionally exact, and honest about the remaining gap to target.
  • The measurement system is still adversarial to itself. The structural sweep did not rubber-stamp the corrected classifier. It found 11 disagreements and kept the unidirectional signature visible. A grader that stopped disagreeing the moment the numbers looked better would be worthless; this one didn’t.
  • 27 days, no dark gap. The pipeline that went silent for nine days earlier in the year and four days at the end of May ran clean for all of June.

Lessons Learned

  • A fix in the code is not a fix in the data until a full clean cycle runs. The hard anchors were in the classifier since May 16th, but the distribution didn’t visibly correct until June gave them 27 uninterrupted days. Don’t declare a calibration fix on a partial month — declare it on the first full one.
  • “The script is committed now” must be verified, not asserted. May said the commit-count method was a committed script. It wasn’t. One retrospective later the same manual re-derivation happened again. A claim that a process is automated is worthless until the file exists in the tree.
  • A distribution being right is not the same as each decision being right. 50% Tier 1 is the correct aggregate. The 11-of-11 inflation signature says the individual borderline calls may still skew high. Aggregate calibration and per-item calibration are different properties; only a human loop proves the second.
  • Deferred commitments calcify. The May hole (May 28–30) was going to be backfilled “next month.” It never was, and now it can’t be honestly dated. A gap you don’t close in the month it opens becomes a permanent hole in the record.
  • Automated self-grading has a ceiling, and June hit it. The structural sweep can flag that something’s off; it cannot decide whether a short-but-anchored post is Tier 1 or Tier 2. Past that ceiling you need judgment, and judgment means a person. The auto-confirm loop has taken calibration as far as it can alone.

Next Month Focus

  1. Get a human in the loop. This is the only priority carrying a two-month debt. Sweep the 11 structural-sweep disagreements through /blog-feedback --correct|--wrong and record real verdicts. Turn the 100%-inflation-direction signal into an actual per-item judgment so the July calibration report can publish a Brier score instead of a structural proxy.
  2. Actually commit the commit-count script. Write the one command this retrospective used into scripts/blog/, commit it, and have the monthly retro call it. Three months of re-deriving the same number ends here.
  3. Hard-fail the daily cron on a missing classification record. Close bookkeeping to zero by making the pipeline refuse to push a post whose date has no decisions.jsonl record. No more soft warnings; make the gap impossible.
  4. Push Tier 1 the rest of the way to baseline. 50% is most of the way from the broken floor to the 60–70% target, but not all of it. Watch whether July’s borderline days keep drifting Tier 2, and if the human loop confirms residual inflation, tighten the Tier 2 gate one more notch.
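For reference, the Brier score named in the first item is the mean squared difference between a stated probability and what actually happened. The sketch below assumes the classifier's confidence is read as the probability that its tier call is right, and that a human verdict of correct or wrong is the outcome. That pairing is an assumption about how the July report might be built, and the sample numbers are invented.

```python
def brier_score(forecasts):
    """Mean squared error between confidence and outcome.

    forecasts: iterable of (confidence, was_correct) pairs, where confidence
    is in [0, 1] and was_correct is the human verdict on that call.
    Lower is better; 0.0 means perfectly confident and always right.
    """
    items = list(forecasts)
    if not items:
        raise ValueError("need at least one adjudicated call")
    total = sum((p - (1.0 if ok else 0.0)) ** 2 for p, ok in items)
    return total / len(items)


# Invented example: one confident call confirmed, one confident call overturned.
score = brier_score([(0.9, True), (0.8, False)])
print(round(score, 3))  # 0.325
```

The score cannot be computed from structural auto-confirm records alone, because it needs an outcome that is independent of the machine doing the grading. That is why it depends on the human verdicts.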

May was the month the classifier caught itself. June was the month it lived a full cycle under the correction and proved the aggregate fix holds — while making it undeniable that the last mile, the human check, is the one thing automation cannot finish on its own. The distribution is right. Proving each call is right is July’s job, and it’s overdue.