<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Evaluation on Start AI Tools - Presented by Intent Solutions</title><link>https://startaitools.com/tags/evaluation/</link><description>Recent content in Evaluation on Start AI Tools - Presented by Intent Solutions</description><generator>Hugo</generator><language>en-US</language><copyright>Intent Solutions. All rights reserved.</copyright><lastBuildDate>Thu, 09 Apr 2026 23:29:08 -0500</lastBuildDate><atom:link href="https://startaitools.com/tags/evaluation/index.xml" rel="self" type="application/rss+xml"/><item><title>j-rig Binary Eval Framework: Ten Epics, One Day</title><link>https://startaitools.com/posts/j-rig-binary-eval-framework-ten-epics-one-day/</link><pubDate>Sun, 29 Mar 2026 10:00:00 -0500</pubDate><guid>https://startaitools.com/posts/j-rig-binary-eval-framework-ten-epics-one-day/</guid><description>&lt;p&gt;Twenty-eight commits. Ten epics. A TypeScript monorepo that went from &lt;code&gt;pnpm init&lt;/code&gt; to drift detection, eval packs, and a calibration engine in one calendar day. j-rig is a binary evaluation framework: given a skill definition and an execution trace, did the agent demonstrate the skill or not? Yes/no. Binary.&lt;/p&gt;
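&lt;p&gt;As a rough sketch of that contract (hypothetical code, not j-rig&amp;rsquo;s actual API), the whole idea fits in a few lines of TypeScript: a skill is a named predicate over the execution trace, and evaluation returns pass or fail, never a score.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;// Hypothetical sketch of binary evaluation; names and shapes are illustrative.
interface Skill {
  id: string;
  // predicate over the raw execution trace
  check: (trace: string) =&gt; boolean;
}

function evaluate(skill: Skill, trace: string): "pass" | "fail" {
  return skill.check(trace) ? "pass" : "fail";
}

// Example: did the agent actually parse the config file?
const parsesConfig: Skill = {
  id: "parse-config",
  check: (trace) =&gt; trace.includes("config parsed"),
};

console.log(evaluate(parsesConfig, "config parsed: 12 keys")); // "pass"
&lt;/code&gt;&lt;/pre&gt;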
&lt;p&gt;The &amp;ldquo;binary&amp;rdquo; part is the whole point. Eval frameworks love to produce scores — 0.73 out of 1.0, 4 out of 5 stars, &amp;ldquo;mostly correct.&amp;rdquo; These numbers feel precise but they&amp;rsquo;re not actionable. When your agent scores 0.73 on &amp;ldquo;can it parse a config file,&amp;rdquo; what do you do? Is that good? Is that a regression? Binary evaluation strips away the false precision. The agent either parsed the config file or it didn&amp;rsquo;t. Pass or fail. Now you can count, trend, and alert on real signals.&lt;/p&gt;</description></item></channel></rss>