Gemini PR Reviews and Four Releases in One Day

Posted on Dec 6, 2025

54 commits. Four releases. And a debugging journey through three different Gemini configurations before PR reviews actually worked.

The Gemini PR Review Pipeline

The goal: automated code review on every PR to the Nixtla repo. An AI reviewer reads the diff, checks for compliance violations, flags potential issues, and posts a structured review comment. The reviewer is Gemini, because the free tier is generous enough for a repo this size.

What followed was three iterations of “this should work” followed by “it does not work.”

Attempt 1: Gemini 3

Started with Gemini 3 via the public API. The model handled small diffs fine – under 500 lines, reviews came back structured and useful. But the Nixtla repo has PRs that touch 50+ files. At that scale, Gemini 3 hit context length limits and returned truncated reviews that cut off mid-sentence.

The truncation wasn’t a hard error. The API returned 200 OK with partial content. The GitHub Actions workflow posted the truncated review as if it were complete. Looked like a working pipeline until you read the output.

Attempt 2: Gemini 2.5 Pro

Upgraded to Gemini 2.5 Pro for the larger context window. The context limit was no longer the problem. The new problem was rate limiting. Gemini 2.5 Pro’s free tier has tighter rate limits than 3, and the review pipeline makes multiple API calls per PR – one for the overall diff summary, one per file for detailed review, one for the compliance check.

A 20-file PR made 22 API calls. The rate limiter kicked in at call 15, and the remaining reviews came back as 429 errors. The workflow retried with exponential backoff, but the GitHub Actions job timed out after 10 minutes of waiting.

Could have batched the file reviews into fewer calls, but that defeats the purpose. Per-file reviews produce better feedback than “here’s everything wrong with all 20 files in one blob.”

Attempt 3: Vertex AI

Moved to Vertex AI’s Gemini endpoint. Same model (2.5 Pro), but Vertex AI’s rate limits are tied to project quotas, not free tier restrictions. The project already had Vertex AI configured for other workloads, so the quota was already allocated.

The migration required swapping the API client library (from google-generativeai to google-cloud-aiplatform), updating authentication from API key to service account, and changing the model identifier format. About 40 lines of code changed. The rest of the pipeline – prompt templates, output parsing, GitHub comment formatting – stayed identical.

Vertex AI worked. Reviews came back complete, within rate limits, for PRs up to 80 files.

Daily Audit Workflow

With the model sorted, the pipeline got a daily audit mode. A scheduled GitHub Actions workflow runs nightly, reviews any PRs that were opened or updated in the last 24 hours, and posts a summary to Slack. The summary includes: number of PRs reviewed, total issues flagged, compliance score per PR, and a list of PRs that need human attention.

The daily audit catches PRs that were opened just before end of day and didn’t get the automatic review in time (GitHub’s webhook delivery has occasional delays).

Four Releases

The review pipeline debugged, four releases shipped in sequence:

v1.3.0: Prediction Markets

The Polymarket Analyst skill from Dec 5 promoted from draft to released. Full skill implementation, demo project with scored benchmarks, and the Global Standard Skill Schema integrated into the release validation process.

This is the first release where the skill schema was enforced by CI. Previous releases validated skills manually. v1.3.0 was the proof that automated validation works.

v1.3.1: Hotfix

A config file in the Polymarket skill referenced a deprecated API endpoint. Caught by the daily audit workflow within hours of the v1.3.0 release. One-line fix, tagged, shipped.

This is the release cadence working as designed. Small fix, small release. No batching corrections into the next feature release.

v1.4.0: Marketplace Compliance

Brought the entire repo into compliance with the marketplace’s quality standards. This meant updating plugin manifests, adding missing metadata fields, standardizing version strings, and ensuring every published skill had a minimum of three usage examples.

The marketplace compliance checker runs externally – it’s not the repo’s own compliance tool. Different standards, different validator. Several skills that scored 100% on the internal checker scored 70-80% on the marketplace checker because the marketplace requires fields the internal spec considers optional.

Resolved by promoting those marketplace-required fields to required in the Global Standard Skill Schema. The internal and external standards now align. v1.4.0 is the release where they converged.

v1.4.1: Documentation Patch

Stale references in three documents that still pointed to the archived baseline-lab plugin. Updated all cross-references to point to the skills framework equivalents.

Master Claude Skills Standard

The 098-SPEC-MASTER document. A master specification that defines what a “Claude Skill” is across all repos, not just Nixtla. Covers:

  • Skill identity (name, version, author, license)
  • Interface contract (inputs, outputs, permissions)
  • Quality requirements (compliance score thresholds, test coverage minimums)
  • Lifecycle states (draft, published, deprecated, archived)
  • Versioning rules (when to bump major, minor, patch)

This spec exists because skills are proliferating across multiple repos. Without a shared standard, each repo invents its own conventions and they gradually diverge. The master spec is the authoritative reference. Individual repos can extend it but can’t contradict it.

De-Hyping PRDs for Nixtla Review

The last batch of commits: reviewing every Product Requirements Document in the repo and removing inflated language. “Revolutionary” became “novel.” “Industry-disrupting” became “differentiated.” “Unprecedented accuracy” became “competitive accuracy.”

Max is going to show these documents to Nixtla’s engineering team. Engineers notice when a PRD promises things the architecture can’t deliver. Inflated language doesn’t just sound bad – it undermines credibility with the exact audience that needs to trust the document.

The de-hyping pass touched 6 PRDs and removed or replaced 23 instances of marketing superlatives. The content didn’t change. The confidence in the content increased.

The Model Fallback Lesson

The Gemini debugging journey is a pattern worth naming. When an AI API works in testing but fails in production, the failure mode is almost never “the model is bad.” It’s one of three things:

  1. Context limits – your test data was small, your production data isn’t
  2. Rate limits – your test volume was low, your production volume isn’t
  3. Authentication model – your test creds work, your production creds don’t

The fix is usually the same: move from the consumer API to the enterprise API. The model is identical. The infrastructure around it is not.


Related Posts: