Forty-Four Minutes Before First Pitch: An LLM Fallback Chain and a Live Probability Gauge in One Session
TL;DR
At 22:31 on 2026-04-29, usage against Groq’s daily token quota for braves-booth (on-demand tier, 100K tokens/day) hit 96.7% with first pitch 44 minutes away. The site was up and the data was flowing, but the storylines pane — the LLM-generated narrative the broadcast actually consumes — was about to go blank. The fix that landed before the broadcast started was a four-stage fallback chain: NVIDIA NIM Llama 3.3 70B as new primary, Groq 70B as warm secondary, Groq 8B as cold tertiary, and a deterministic facts-only renderer as the final floor. By the time the chain was deployed and serving, the producer asked for a second feature — a live AB-by-AB Monte Carlo probability gauge tracking a named on-air prediction by executive producer Jonathan Chadwick. JChad’s Challenge shipped in the same session, prior-art-checked, math-modelled, and live on the dashboard before the third inning. The thesis of this post is that when your free LLM tier dies 44 minutes before a live broadcast, the resilience pattern is the product. The same posture — ship the floor, then ship the feature — carried us from incident response into a genuinely novel artifact in one window.
Section 1 — The 44-minute crisis
The Braves Booth Intelligence dashboard runs at dixieroad.org (since this session, also at the new canonical scorecardecho.com). It is the radio booth’s pre-game and live-game cockpit: a GUMBO poller pulling the MLB Stats API every five seconds, a FastAPI Python sidecar driving pybaseball lookups, an Express backend assembling pre-game briefings, an SSE-pushed React frontend. The narrative copy that the booth reads on air is generated by an LLM call — callLLM() — which hits Groq’s free tier serving Llama 3.3 70B Versatile. That had been fine for nineteen straight game days.
The check-in at 22:31 was routine. All three containers green. /api/health returning 200 in 40ms. SSE streaming. Logs clean except for one line that wasn’t:
groq.error.RateLimitError: 429 Too Many Requests
{
  "error": {
    "type": "tokens",
    "message": "Rate limit reached for model llama-3.3-70b-versatile in organization
                org_xxx: Limit 100000 tokens per day, Used 96749, Requested 4108"
  }
}
96,749 of 100,000 tokens used. Daily ceiling. Thirty-two hundred tokens of headroom against a single pre-game request that costs over four thousand. First pitch in 44 minutes. The broadcast doesn’t care that we’re on a free tier.
There were three real options. Pay Groq for a paid tier — fast, but it requires a billing decision the project doesn’t have approval for and a credit card that nobody had on file at 22:31. Wire Vertex AI on GCP — the dormant Vertex code path was still in the codebase from the Cloud Run era — but that would mean re-enabling a billable surface area we’d intentionally walked away from on 2026-04-14. Or build resilience: a fallback chain that survives any single provider’s quota event, with the floor low enough that storylines never come back empty.
The third option was the only one that bought us something past tonight.
Section 2 — The 4-tier fallback design
The first version of the chain was Groq 70B → Groq 8B → facts-only. That’s what landed first in PR #52. The second version — landed within the same hour, before broadcast — slotted NVIDIA NIM in as the new primary after a quick spec compare:
| Provider | Daily token cap | Per-minute cap | Authentication |
|---|---|---|---|
| Groq Llama 3.3 70B (on-demand) | 100K tokens/day | 6K TPM | API key |
| NVIDIA NIM Llama 3.3 70B free | None | ~40 RPM | API key |
| Groq Llama 3.1 8B free | 500K tokens/day | 30K TPM | API key |
The big win is the absent daily cap on NVIDIA’s free tier. The braves-booth never bursts past 40 requests per minute — pregame is one large call, live narratives fire on inning boundaries — so request-rate limiting is not the binding constraint. Tokens-per-day is the constraint that just killed us, and NVIDIA doesn’t have one.
The wrapper that made this practical is small. The whole thing is one TypeScript module:
// backend/src/services/llm.ts (excerpt)
type LLMResult =
  | { ok: true; text: string; provider: Provider; model: string }
  | { ok: false; reason: 'all-providers-failed' };

const CHAIN: Provider[] = ['nvidia', 'groq-70b', 'groq-8b'];

export async function callLLM(prompt: string, opts: CallOpts): Promise<LLMResult> {
  for (const provider of CHAIN) {
    if (!isConfigured(provider)) continue; // skip if no API key
    if (isCircuitOpen(provider)) continue; // skip if recent fail
    try {
      const text = await invokeProvider(provider, prompt, opts);
      closeCircuit(provider);
      return { ok: true, text, provider, model: modelFor(provider) };
    } catch (err) {
      const status = (err as ProviderError)?.status ?? 0; // 0 if no HTTP status
      const transient = status === 429 || status === 413 || status >= 500;
      if (transient) {
        openCircuit(provider, /* coolMs= */ 60_000);
        log.warn({ provider, status }, 'llm provider transient — falling through');
        continue;
      }
      // 4xx that isn't 429/413 — unrecoverable for THIS call; fall through anyway
      log.error({ provider, status, err }, 'llm provider unrecoverable — falling through');
    }
  }
  return { ok: false, reason: 'all-providers-failed' };
}
Three things in that block earn their place. First, isCircuitOpen(provider) — a sixty-second cool-down per provider after any transient error. Without it the chain re-pings a dead provider on every request and you get cascading 429s that consume what little quota recovers. Second, the explicit handling of HTTP 413 alongside 429. Groq’s 8B fallback turned out to be 413-prone on parallel narrative calls that fit the tokens-per-minute window but exceeded the per-request size limit; treating 413 like 429 keeps the chain moving instead of bailing. Third, returning a tagged { ok: false } result rather than throwing — the caller is responsible for deciding what “no LLM” means in context.
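The breaker state itself is tiny. A minimal sketch of the three helpers the wrapper calls, assuming a module-level timestamp map keyed by provider (the shipped module may hold its state differently):

// Hypothetical sketch: the per-provider cool-down as a timestamp map.
// Helper names match the wrapper above; the storage is an assumption.
const circuitOpenUntil = new Map<Provider, number>();

function isCircuitOpen(provider: Provider): boolean {
  return Date.now() < (circuitOpenUntil.get(provider) ?? 0);
}

function openCircuit(provider: Provider, coolMs: number): void {
  circuitOpenUntil.set(provider, Date.now() + coolMs);
}

function closeCircuit(provider: Provider): void {
  // A success makes the provider immediately eligible again.
  circuitOpenUntil.delete(provider);
}

Sixty seconds is long enough to stop hammering a provider that is actively returning 429s, and short enough that a recovered provider rejoins within one narrative cycle.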
Above this wrapper sits the storylines pipeline, which handles ok: false by rendering from facts alone:
// backend/src/services/pregame.ts (excerpt)
const llmResult = await callLLM(buildStorylinesPrompt(state), {
  maxTokens: 1200,
  temperature: 0.7,
});

if (llmResult.ok) {
  return {
    storylines: parseStorylinesJSON(llmResult.text),
    provenance: {
      source: 'llm',
      provider: llmResult.provider,
      model: llmResult.model,
      generatedAt: Date.now(),
    },
  };
}

// All LLM providers exhausted. Render from raw stats.
log.warn('llm chain exhausted — rendering facts-only briefing');
return {
  storylines: renderFactsOnly(state), // deterministic, no model
  provenance: {
    source: 'facts-only',
    generatedAt: Date.now(),
  },
};
The renderFactsOnly() renderer is the floor. It’s a pure function over the same state object the LLM prompt is built from. It produces shorter, drier paragraphs — “Tarik Skubal enters with a 2.72 ERA and 9.41 K/9 over 36.1 IP. Ronald Acuña Jr. is hitting .276 with a .894 OPS at home” — but nothing on the broadcast page goes blank. The producer can read it on air without apology.
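For flavor, a compressed sketch of the shape of that floor; the real renderFactsOnly() covers more stat lines, and the state fields here are illustrative, not the actual GameState type:

// Illustrative sketch only: the field names on `state` are assumptions.
function renderFactsOnlySketch(state: {
  probableStarter: { name: string; era: number; kPer9: number; ip: number };
  leadHitter: { name: string; avg: number; ops: number };
}): string[] {
  const p = state.probableStarter;
  const h = state.leadHitter;
  return [
    `${p.name} enters with a ${p.era.toFixed(2)} ERA and ${p.kPer9.toFixed(2)} K/9 over ${p.ip} IP.`,
    // Baseball convention drops the leading zero on averages: .276, not 0.276.
    `${h.name} is hitting ${h.avg.toFixed(3).replace(/^0/, '')} with a ${h.ops.toFixed(3).replace(/^0/, '')} OPS at home.`,
  ];
}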
The provenance field is load-bearing. Whichever path served the request gets stamped on the response. The frontend can choose to show “narrative by NVIDIA Llama 3.3 70B” in a footer or hide it entirely, but the audit trail is always there, and the in-memory cache key incorporates it so a facts-only briefing doesn’t permanently shadow a healthy LLM call. As soon as a provider comes back, the next request blows past the cache.
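The cache interaction is the subtle part, so here is the idea in miniature, with a hypothetical key helper (the real cache layer may shape its keys differently):

// Hypothetical sketch of a provenance-aware cache key. Because the
// source is part of the key, a facts-only entry occupies its own slot:
// the next request still looks for (and misses) the llm-keyed entry,
// so the chain gets retried as soon as a provider recovers.
function briefingCacheKey(gamePk: number, source: 'llm' | 'facts-only'): string {
  return `pregame:${gamePk}:${source}`;
}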
The deploy itself was uneventful. pnpm test — 105 tests pass. pnpm typecheck — clean. Image rebuild, rolling restart, smoke test against /api/pregame/824930 showed a fresh narrative arriving from NVIDIA: Acuña, Skubal, real numbers, no error log lines. PR #52 merged at 23:11. First pitch was 7:15 PM ET (23:15 on the session clock), four minutes later. The chain has held since.
Section 3 — JChad’s Challenge
The voice memo at 23:13 was characteristically ambitious: “Executive producer Jonathan Chadwick made a prediction that Matt Olson would be at .300 by the time they got to Denver batting average. I need you to pull all the stats, the Braves schedule, figure out all the possibilities, different ways in which… break it down into highly likely… make it fun though, fun project. We’re gonna post a little something maybe on the website. Called Jay Chad’s Challenge. Jay Chad Olson’s BA Challenge. I don’t know.”
That’s a one-paragraph prompt for a feature with three unknowns: the math model, the prior art, and the live update mechanism. Plan mode kicked in.
Prior art research
Before writing a line of TypeScript I ran the originality question — has anyone shipped a live, AB-by-AB probability tracker tied to a specific player milestone with a hard deadline? Searched MLB.com, FanGraphs, Baseball-Reference, MLB Player Tracker, Baseball Savant Game Feed, OddsTrader, FanGraphs Steamer Rest-of-Season projections. The closest analogs:
| Existing project | What it does | What it is missing |
|---|---|---|
| MLB.com 2022 Aaron Judge HR pace tracker | Daily pace projection toward 60+ HRs | Daily, not live; pace not probability; no on-air tie |
| FanGraphs ZiPS / Steamer RoS | Daily-updated player projections | Daily; aggregate; no fan-facing widget |
| Baseball-Reference Milestone Watch | Static career-milestone countdowns | Static; no probability layer |
| OddsTrader player props | Sportsbook single-game prop projections | Single-game; sportsbook UX, not booth UX |
| Baseball Savant Game Feed | Live pitch-by-pitch Statcast | Pitch-level, not season-arc-milestone |
None of them surface a live probability gauge that moves with each at-bat, anchored to a named on-air call by a specific person with a hard date deadline. That’s a small, defensible niche. The combination of live, named-call-anchored, and deadline-shaped clears the originality bar even though every individual ingredient (Monte Carlo, probability gauge, pace tracker) is well-trodden.
Math model
The first instinct — single binomial against remaining at-bats — undersells what’s actually happening. Olson’s path to .300 by Denver depends on which games are left, who he’s facing, and where they’re being played. The model that landed is a per-game Monte Carlo walking the actual schedule with per-game adjustments stacked multiplicatively against a blended baseline pTrue.
// backend/src/services/olson-challenge.ts (excerpt)
function blendTrueTalent(
  seasonAvg: number, // 2026 to date
  careerAvg: number, // last 3 seasons from getPlayerBio()
  rolling21: number, // rolling last-21-days from game log
): number {
  const blended = 0.45 * seasonAvg + 0.30 * careerAvg + 0.25 * rolling21;
  return clamp(blended, 0.220, 0.330);
}

function pTrueForGame(
  baseline: number,
  game: ScheduledGame,
  splits: PlayerSplits,
): number {
  let p = baseline;

  // Opponent starter handedness
  if (game.opponentSP.throws === 'L') p *= splits.vsLHP_factor ?? 0.84;
  else p *= splits.vsRHP_factor ?? 1.07;

  // Opponent starter ERA tier
  const era = game.opponentSP.seasonERA;
  if (era < 3.0) p *= 0.92;
  else if (era < 4.5) p *= 1.00;
  else p *= 1.08;

  // Park factor (small static table keyed by venueId)
  p *= PARK_FACTORS[game.venueId] ?? 1.00;

  // Home / away
  p *= game.isHome ? splits.homeFactor : splits.awayFactor;

  return clamp(p, 0.180, 0.380);
}
The clamps are not cosmetic. Without them, a brutal matchup (lefty starter, sub-3.00 ERA, pitcher’s park, on the road) can multiplicatively zero out the per-game pTrue, and a juicy one (righty starter, 5.00+ ERA, hitter’s park, at home) can peg it at one. Both extremes are wrong as priors. Olson is a real hitter against real pitching with real variance. Clamping to [0.180, 0.380] is the cheap way to keep the simulator honest.
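For completeness, the clamp is the obvious three-liner, and one worked pass through the multiplicative stack shows why it earns its keep (the park and home/away factors below are illustrative numbers, not the production table):

// The clamp referenced above. Trivial, but load-bearing.
function clamp(x: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, x));
}

// Worked example with illustrative factors: baseline .285 against a
// lefty ace in a pitcher's park on the road:
//   0.285 * 0.84 (vs LHP) * 0.92 (ERA < 3.00) * 0.95 (park) * 0.96 (away)
//   = 0.201, still above the 0.180 floor. Stack one or two more bad
//   factors and the clamp is the only thing stopping the slide toward zero.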
The Monte Carlo loop is barely thirty lines:
function simulate(
  state: ChallengeState,
  remaining: ScheduledGame[],
  splits: PlayerSplits,
  baseline: number,
  rng: () => number, // deterministic xorshift32, seeded as described below
  N = 10_000,
): SimulationResult {
  const finals: number[] = [];
  let aboveTarget = 0;

  for (let run = 0; run < N; run++) {
    let H = state.hitsActual;
    let AB = state.atBatsActual;

    for (const game of remaining) {
      const ab = sampleABDistribution(rng); // 5/4/3/2/0 weighted
      if (ab === 0) continue;
      const p = pTrueForGame(baseline, game, splits);
      const h = sampleBinomial(ab, p, rng); // pure JS, ~10 lines
      H += h;
      AB += ab;
    }

    const finalAvg = H / AB;
    finals.push(finalAvg);
    if (finalAvg >= 0.300) aboveTarget++;
  }

  finals.sort((a, b) => a - b);
  return {
    probAtOrAbove: aboveTarget / N,
    median: finals[N >> 1],
    p10: finals[Math.floor(N * 0.10)],
    p90: finals[Math.floor(N * 0.90)],
  };
}
10,000 runs × ~30 remaining games × ~4 ABs per game × the per-game adjustment work runs in well under 50ms in pure JS. There’s no external dependency, no Python sidecar call, no statistical library. The RNG is a deterministic xorshift32 seeded from (playerId, gamePk, h, ab, lastUpdateMs) so the same input state produces the same probability — important for caching and for the “if today flops” hypothetical the booth wanted to surface.
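The samplers and the RNG are the only moving parts not shown above. A self-contained sketch consistent with that description; the AB-count weights are assumptions (the shipped distribution is only described as “5/4/3/2/0 weighted”):

// Deterministic xorshift32. Returns a function yielding floats in [0, 1).
function makeXorshift32(seed: number): () => number {
  let s = seed >>> 0 || 1; // avoid the all-zeros fixed point
  return () => {
    s ^= s << 13; s >>>= 0;
    s ^= s >>> 17;
    s ^= s << 5; s >>>= 0;
    return s / 0x1_0000_0000;
  };
}

// ABs per game. The weights here are illustrative assumptions.
function sampleABDistribution(rng: () => number): number {
  const r = rng();
  if (r < 0.05) return 0; // rest day / late pinch-hit only
  if (r < 0.20) return 2;
  if (r < 0.55) return 3;
  if (r < 0.90) return 4;
  return 5;
}

// Binomial by direct simulation: n Bernoulli draws. Fine at n <= 5.
function sampleBinomial(n: number, p: number, rng: () => number): number {
  let k = 0;
  for (let i = 0; i < n; i++) if (rng() < p) k++;
  return k;
}

In this sketch, simulate() receives makeXorshift32(seed) directly; deriving seed from (playerId, gamePk, h, ab, lastUpdateMs) with any stable integer hash preserves the determinism the caching depends on.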
Frontend gauge
The card lives in IdleView between NextGameCard and StandingsSection. It subscribes to an SSE stream relayed off the existing GUMBO poller’s per-player boxscore extraction, which already detects line changes; the new event type is olson-challenge, the route is /api/olson-challenge, and the polling fallback for between-games hits the same route every five minutes:
// frontend/src/hooks/useOlsonChallenge.ts (excerpt)
import { useState } from 'react';

export function useOlsonChallenge() {
  const [state, setState] = useState<OlsonChallengeState | null>(null);

  // Live mode: subscribe to SSE
  useEventStream('/api/events', {
    onEvent: (ev) => {
      if (ev.type === 'olson-challenge') {
        setState((prev) => ({
          ...ev.data,
          trend: prev ? Math.sign(ev.data.probAtOrAbove - prev.probAtOrAbove) : 0,
        }));
      }
    },
  });

  // Idle mode: poll every 5 min when no live game
  usePolling(
    () => fetchOlsonChallenge().then(setState),
    /* intervalMs= */ 5 * 60_000,
    /* enabled= */ !state?.liveGameInProgress,
  );

  return state;
}
The trend pill (▲ +3% after a hit, ▼ -2% after an out, animating briefly) is the bit that gives the gauge its 538-election-needle charm. The probability ticking up during the at-bat is the thing nobody else has shipped, because nobody else has tied a probability gauge to a named on-air call with a hard deadline. That’s the originality budget the prior-art research bought.
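The pill itself is a few lines of React. A minimal sketch, assuming the hook above; the class names and rounding are illustrative, not the shipped component:

// Hypothetical TrendPill sketch. `trend` is the Math.sign() value from
// the hook; `delta` is the probability change since the previous event.
function TrendPill({ trend, delta }: { trend: number; delta: number }) {
  if (trend === 0) return null;
  const up = trend > 0;
  return (
    <span className={up ? 'pill pill-up' : 'pill pill-down'}>
      {up ? '▲ +' : '▼ -'}
      {Math.round(Math.abs(delta) * 100)}%
    </span>
  );
}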
By the end of the session: 22 new backend tests passing, the full 127-test backend suite green, frontend typecheck clean, frontend tests green, and the card rendering live on the IdleView. PR #53, covering the JChad’s Challenge backend and frontend, was merged. The same session also included a vanity-but-warranted domain rebrand to scorecardecho.com via the Porkbun API (DNS records updated, Caddy block rewritten, dixieroad.org 301-redirected to the new canonical), so the JChad’s Challenge endpoint went live at both https://dixieroad.org/api/olson-challenge and https://scorecardecho.com/api/olson-challenge.
Section 4 — What we gave up
The fallback chain and the gauge each carry tradeoffs that deserve naming, because resilience patterns and live widgets both have a way of accumulating debt that’s invisible on launch day.
A latency hop on cold-path requests. The chain’s circuit-breaker is per-provider, not global. An NVIDIA blip that opens its breaker re-routes the next request to Groq 70B, with a small added latency from the failed first call. We measured roughly 200-400ms of dead time per cold-path request, which on a 1.5-second pre-game generation is a 13-26% tail-latency tax. We accepted it. The alternative — racing all three providers in parallel — burns three times the token budget and is the exact failure mode the chain exists to prevent.
Model variance across providers. NVIDIA’s Llama 3.3 70B and Groq’s Llama 3.3 70B should return identical or near-identical text for identical prompts. They don’t. The serving stacks differ, sampling defaults differ, and even at temperature 0 we observe small lexical differences that occasionally surface in the booth’s narrative voice. We handle this by writing the storylines prompt to be model-portable (no provider-specific markdown quirks, no chain-of-thought prompting that varies in how each backend handles it) and by accepting that a cold-path narrative may read slightly differently from a hot-path one. A purist might want strict equivalence and a single primary; we’d rather have the chain.
Infrastructure surface area. Three API keys instead of one. Three sets of failure modes to monitor. Three providers’ terms of service to track. The NVIDIA NIM key landed in .env, gitignored, never committed; the rotation discipline is on the human, not the code. We added a startup log line that prints which providers are configured (llm chain: nvidia=ok groq-70b=ok groq-8b=ok facts-only=ok) so a missing key is obvious in a fresh deploy. That’s the cheapest version of monitoring; a Prometheus counter per provider per outcome is what we’d add next.
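The startup line is one map over the chain. A sketch, reusing the wrapper’s isConfigured() helper (the exact log wiring is an assumption):

// Sketch of the startup visibility line described above.
const providerStatus = CHAIN
  .map((p) => `${p}=${isConfigured(p) ? 'ok' : 'MISSING'}`)
  .join(' ');
log.info(`llm chain: ${providerStatus} facts-only=ok`); // floor is always on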
JChad’s Challenge is single-player. The math model is hard-coded around Matt Olson and around the .300-by-Denver call. We considered making it generic — any player, any milestone, any deadline — and explicitly rejected the generalization. Generic widgets are not stories. The story is Chad called this on-air, Matt Olson is hitting .297, the Denver series is in eleven games, here is the live probability. A generic version of that is a stat tool for analysts. The specific version is a fan-shaped, broadcast-shaped artifact. If a second on-air call surfaces, we’ll lift the abstraction then.
The probability surfaces a real number to a public audience. This is the tradeoff that took the longest to think through. A 38% probability tells a story; a 38% probability that drifts to 22% mid-game tells a different and slightly cruel story; a 38% probability that drifts to 22% mid-game and is wrong because the per-game adjustments are imperfect tells a story we don’t want to tell. We anchored on transparency over precision: the card has an expandable detail panel showing the median final AVG, the P10/P90 band, the hardest and easiest remaining matchups, and the deterministic “if today goes well” / “if today flops” hypotheticals. The model is openly imperfect; the imperfection is part of the story; the booth has more to talk about, not less.
Also shipped
The day’s cycle of work touched four other workstreams in parallel with the broadcast-day work, in roughly descending order of stakes.
The stale-briefing schema fix that ran 22:09–22:20 was the warm-up act for the LLM chain. The producer flagged “Casey Mize is the starter tonight” — except Mize was scratched and Skubal was the actual probable. The 22:08 briefing from the night before still had four “Mize” mentions and zero “Skubal” because the briefing was cached in SQLite without any pitcher-identity stamp. The fix: a migration that adds away_pitcher_id and home_pitcher_id columns to pregame_briefings, a write-path stamp on every briefing generation, and a read-path validator that compares stored pitcher IDs against the current gameState.probablePitchers and invalidates plus regenerates on mismatch. Branch fix/pregame-pitcher-id-validation, PR #51, merged. Validation proven live by tampering with stored IDs and watching the log emit "Pregame briefing invalidated", reason="probable pitcher changed (stored away=666157 home=702275, current away=669373 home=702275)". Same posture as the LLM chain: durable schema, observable log, no human-only failsafe.
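The validator is a two-field comparison once the columns exist. A sketch of the read path, with illustrative names for the stored row and the regeneration helper:

// Hypothetical sketch of the read-path validation described above.
// Field names on the stored row, the shape of probablePitchers, and the
// regenerate helper are assumptions.
function pitchersChanged(
  stored: { awayPitcherId: number; homePitcherId: number },
  current: { awayId: number; homeId: number },
): boolean {
  return stored.awayPitcherId !== current.awayId
    || stored.homePitcherId !== current.homeId;
}

if (pitchersChanged(storedBriefing, gameState.probablePitchers)) {
  log.warn(
    { stored: storedBriefing, current: gameState.probablePitchers },
    'Pregame briefing invalidated',
  );
  await regeneratePregameBriefing(gamePk); // illustrative helper
}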
Twenty CRM v1.22.4 → v2.1.0 major upgrade with rollback discipline. The Twenty deploy had been pinned at v1.22.4 for reproducibility. The user asked for current; current was v2.1.0, which crosses a major and includes DB migrations. The sequence: pre-upgrade pg_dump to a 211KB gzipped file (the rollback artifact), release-notes scan for breaking changes (mostly fixes + i18n + AI feature defaults — no compose-config breakage), TAG=v2.1 in .env, docker compose up -d. Post-upgrade verification: API health green, all 13 cohort People records preserved, the cohort Opportunity intact, 16 Notes intact, all 7 EMAIL_* env vars survived (SMTP wiring lived through the major upgrade because it was env-driven, not config-file-driven), all 5 containers healthy. Watchtower reconfigured to track v2.1 floating tag for patches at 03:15 UTC. The 13 cohort members also got re-attributed from the placeholder “API CcA” creator to “Jeremy Longshore” via direct DB update on createdByWorkspaceMemberId and createdByName while preserving createdBySource='API' for audit honesty.
# Rollback recipe (what we'd run if v2.1 broke after restart)
docker compose down
gunzip < .twenty-pre-v2-backup.sql.gz | docker exec -i twenty-db-1 psql -U postgres twenty
sed -i 's/^TAG=v2.1$/TAG=v1.22.4/' .env
docker compose up -d
# Verify against the backup-time row counts in person/opportunity/note/noteTarget
The rollback recipe is the durable artifact. If you don’t write the rollback recipe before the upgrade, the upgrade is a one-way door pretending to be a two-way door.
Kobiton plugin pilot — JWT REST API breakthrough. Earlier in the M1 review window we’d assumed a credential gap would block end-to-end Appium drive-throughs against Kobiton’s automate API. That assumption was overturned this session: the OAuth flow yields a JWT, the JWT works against Kobiton’s REST endpoints, and a full Appium session against a real device completed. Eight GitHub issues were backfilled with the citations, the snapshot tag was cut, and the R1 sitrep email went out to Stu and Frank. The pilot is now positioned for R2 mid-cycle re-validation on May 4.
Nixtla v1.9.0 release with public-readiness audit + 13-epic build plan. The day’s nixtla work landed v1.9.0, including the F1 SDK migration baseline (the audit table that was corrected mid-PR via Gemini and CodeRabbit review comments), a Black-formatted timegpt-finetune-lab, a CI fix dropping the backslash escapes in the jq split filter for the plugin-validator, and a dependency floor bump to nixtla>=0.7.3 across 7 plugins plus root. The 35-skill migration to the new validator landed alongside; a Tessl PR was audited and declined; the F1+F2 foundation work was scaffolded with a 13-epic build plan committed to beads.
# nixtla v1.9.0 release sanity check (post-tag)
git fetch --tags
git tag -l | grep '^v1.9.0$' # exists
git log v1.8.0..v1.9.0 --oneline | wc -l # commit count
gh release view v1.9.0 --json publishedAt,assets,body | jq . # release artifact
pnpm -r exec audit-harness verify # no drift
Closing — the thesis revisited
The thesis at the top of this post was that the resilience pattern is the product. By the end of the session that statement reads less like a slogan and more like a description. The fallback chain is the broadcast-grade narrative engine — the LLM is just the fastest implementation of one of the four tiers. JChad’s Challenge is a feature only because the resilience floor was solid enough by 23:11 to free up the next 90 minutes for novel work. The chain bought the gauge. The gauge wouldn’t exist without the chain.
The deeper lesson, the one I want to hold onto: when a free tier dies 44 minutes before a broadcast, “ship it” and “ship the floor” are the same instruction. You don’t get to choose between the urgent fix and the durable fix. You ship the durable fix because there isn’t time for two passes. The version of this story where we swiped a card for Groq’s paid tier and went back to bed gets us through tonight; the version where we built the fallback chain gets us through every tonight after that. Same window, same hour, two completely different artifacts at the end of it. The producer’s prediction-tracking widget is the receipt for picking the second version.
Related Posts
- Four Primitives, Three Reviews: How a Contributor PR Reshaped a Roadmap — design-rationale case study on convergent review reshaping a planned release
- Braves Postgame Expansion and Two AI Lessons — earlier cycle of broadcast-grade resilience work in the same booth dashboard
- audit-harness v0.10: Enforcement Travels With the Code — the durable-fix-not-urgent-fix posture as applied to test enforcement infrastructure