AI EngineeringJul 3, 20263 min read

Shipping gpt-5.4 as One Config Line

Deploy gpt-5.4 to production as a one-line config with eval-gated rollout. Key finding: report word-ceilings came from the model, not the prompt.

On 2026-07-03, DiagnosticPro shipped gpt-5.4 in production. The structural finding: gpt-4o silently capped reports at ~1,225 words across 36 benchmark runs—no matter the prompt. gpt-5.4 delivers depth. A real stored report came back at 15 sections, 3,393 words.

The rollout was step 2 of a two-axis, independently eval-gated ladder:

Step 1 (June): Prompt framework v3.0 (v3-b few-shot exemplar) beat v2.0 18/18 blind. Later corroborated by adding a third model to the grid.

Step 2 (this day): Model gpt-5.4. v3-b on gpt-5.4 beat v2.0 9-0-0 on both judges (+44 to +47 points)—the only cross-family-validated model change in RESULTS.md.

The engineering to keep LLM_MODEL a pure config switch: OpenAI’s newer models reject max_tokens and want max_completion_tokens; some reasoning models also reject a non-default temperature. The createChatCompletionAdaptive function detects the model family, picks the param shape up front, and — if the model still 400s on the token param or the temperature — flips that one param and retries once before giving up.

async function createChatCompletionAdaptive(openai, { model, messages, maxTokens, temperature }) {
  const usesMaxCompletionTokens = /^(gpt-5|o\d)/i.test(model);
  const state = { maxCompletion: usesMaxCompletionTokens, dropTemp: false };
  const build = () => {
    const body = { model, messages };
    if (!state.dropTemp) body.temperature = temperature;
    body[state.maxCompletion ? 'max_completion_tokens' : 'max_tokens'] = maxTokens;
    return body;
  };
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await openai.chat.completions.create(build());
    } catch (err) {
      if (err?.status === 400) {
        const msg = String(err?.message).toLowerCase();
        if (!state.maxCompletion && msg.includes('max_completion_tokens')) { state.maxCompletion = true; continue; }
        if (!state.dropTemp && msg.includes('temperature')) { state.dropTemp = true; continue; }
      }
      throw err;
    }
  }
}

Crucially, this mirrors the eval harness (tests/prompt-eval/lib/common.mjs), so production sends the exact same request the judges scored. Eval-prod parity is non-negotiable.

Verification: llm.test.js adds three test cases—gpt-4o keeps max_tokens; gpt-5.x uses max_completion_tokens with no wasted 400; an unrecognized model adapts on the 400 then succeeds. 98/98 pass. A twin run (gpt-5.4 + v3.0) completed a full paid customer journey 11/11. Stored row: model=gpt-5.4, 15 sections, 3,393 words, zero exemplar leak, DTC P0301 extracted, ~48s analysis latency inside the poll budget. Prod flipped: LLM_MODEL=gpt-5.4 set + systemctl restart. Public healthz + unpaid smoke-test green.

Cost: ~$0.079/report (~3x gpt-4o) but under 2% of the $4.99 ticket. Rollback is one env line + restart; v2.0 prompt and gpt-4o model stay selectable for A/B or rollback.

Takeaways:

Gate each axis of a change on its own eval. Ship each as a one-line config switch, one-command rollback.
A prompt’s length target can be silently capped by the model. Check word-count distributions in your eval, not just quality scores.
Make production send the same request your eval judged, or the eval doesn’t transfer.

Also shipped: Across intent-os and prompts-intent-solutions, the doc-filing standard reached v4.4. Deleted a never-used shasum cross-repo rule; replaced it with a positive cluster-archetype definition (§3.1.2) for nested 000-docs folders. The skill becomes canonical; the master copy becomes a published mirror.

BM25 Before Vectors: An Eval-Gated Retrieval ADR — the same “gate the decision on an eval” discipline, applied to a retrieval backend.
Run the Readiness Audit Before You Flip DNS — the DiagnosticPro self-host rollout, one door earlier.
The LLM Should Never Do the Math — where to draw the line on what the model is allowed to decide.

Related Posts