Eight Deploy Iterations: Tailscale OIDC + Reusable Workflow
Day 1 of VPS-as-the-Home was the launch of a 9-priority program: consolidate Intent Solutions hosting onto a single Contabo VPS, execute the GCP exodus, and harden every active repo’s deploy posture. The deploy pipeline cost 8 GitHub Actions run iterations across two priorities to land what should have taken one. The plan-v4.1 rewrite — adding testing as a first-class requirement after the user surfaced that `cd <random-repo> && git push` should “just work” — was the discipline that survived contact with reality.
Twenty PRs on braves-booth. Fifteen commits on a brand-new runbook repo. A new jeremylongshore/.github repo with a reusable workflow. Eight propagation repos picked up the testing harness. All twenty-five production containers stayed healthy through every iteration. No production downtime.
This is the deploy-side story.
The program shape
VPS-as-the-Home is nine priorities under epic OPS-5nm. The umbrella problem: Intent Solutions production was scattered across a dev box that kept tripping OOM cascades, three GCP projects with mismatched billing, and ad-hoc deploy scripts that didn’t agree on anything. The fix is a single Contabo VPS (intentsolutions, 167.86.106.29, 24 GiB RAM) running every Intent Solutions production stack behind a single Caddy ingress, with deploys driven from GitHub Actions through Tailscale.
Today’s priorities:
- P0 Rotate leaked tokens — closed scope-modified. User accepted residual risk on Tailscale + GitHub PAT values still in `secrets.prod.sops.yaml`. Memory entry locked so no future session re-asks.
- P1 Braves baseline + foundational docs — closed. Established the `CLAUDE.md` format that every other priority’s repo would inherit.
- P2 Tailscale OIDC migration — closed after 3 deploy-run iterations.
- P3 Netdata + ntfy tailnet-only monitoring — closed 2026-05-02.
- P4 Slack split + sops-encrypt notify env — partial. VPS-side complete, firehose-channel split pending.
- P5 Reusable workflow + braves refactor — closed after 5 deploy-run iterations.
- P6 SOPS+age propagation + repo testing baseline — in flight, 11/23 done by end of day.
- P7 GCP exodus — unblocked, tracker landed.
- P8 Final cleanup — open.
The runbook repo intentsolutions-vps-runbook was bootstrapped at 3:15 AM with commit e715f5f Initial bootstrap. By morning, the Phase 0 tracking infrastructure was in place: 00-plan.md at v4.1, 01-tracking-index.md with bead ↔ GitHub issue ↔ AAR mapping per priority, and AAR scaffolding waiting to be filled. The discipline came first; the iterations followed.
Priority 2: the Tailscale OIDC migration
The starting state: braves-booth deploys authenticated to Tailscale via a long-lived TS_OAUTH_CLIENT_SECRET GitHub secret. The goal: replace it with a GitHub-issued OIDC token Tailscale can verify on each run via workload identity federation — short-lived, no static secret.
Should have been one PR.
| Run | Audience sent | Subject sent | Result | Lesson |
|---|---|---|---|---|
| 25235116847 (PR #86) | https://github.com/jeremylongshore | refs/heads/main | HTTP 400 “invalid request” | Wrong client_id — legacy OAuth kAhtjSrYrz11CNTRL doesn’t work in OIDC mode |
| 25235249300 (workflow_dispatch) | https://github.com/jeremylongshore | refs/heads/fix/tailscale-oidc-client-id | HTTP 403 “Unauthorized” | Credential format accepted; some other claim mismatch |
| 25235414350 (PR #87 → main) | https://github.com/jeremylongshore | refs/heads/main | HTTP 403 “token has invalid audience” | Subject hypothesis was wrong — audience was the actual mismatch |
| 25237418475 (workflow_dispatch) | api.tailscale.com/T54Ta7mgLc11CNTRL-kgkBgu2cSi11CNTRL | refs/heads/feat/re-enable-... | HTTP 403 “Cannot validate subject” | Audience now passes; subject mismatch expected on non-main |
| 25237516269 (PR #90 → main) | (correct audience) | refs/heads/main | SUCCESS | All claims match; OIDC handshake → SSH → docker compose → health 200 |
The first two runs burned on a wrong assumption: that the client_id field in the legacy OAuth flow was the same identifier OIDC wanted. It is not. OIDC mode wants the OIDC-flow client identifier from the Tailscale OAuth-trust UI — a separate value with a different prefix. HTTP 400 was the symptom; the assumption was the bug.
Run #3 swapped to the correct OIDC client id and immediately got HTTP 403 “token has invalid audience.” The assumption this time: the subject claim must be wrong, since GitHub’s OIDC tokens carry a documented subject format that varies by trigger event and ref. Spent an hour testing subject variants. None worked.
The actual fix surfaced from reading the Tailscale OAuth-trust UI directly instead of inferring: Tailscale auto-generates the OIDC trust audience as api.tailscale.com/<oidc-client-id>. It is not a value you choose. It is not documented in the Quick Start. You read it from the trust card after creating the trust, and you paste it verbatim into the GitHub Actions step’s audience input.
Run #4 was a deliberate verification probe — not a real deploy attempt. The audience was now correct, the branch name was intentionally feat/re-enable-..., and the expected outcome was an HTTP 403 “Cannot validate subject” (because the wildcard subject pattern only matches main). Confirming that the failure mode flipped from “audience invalid” to “subject mismatch” verified the audience fix in isolation before touching main. Run #5 was the same code, merged to main, and OIDC handshake succeeded for the first time. The single-line root cause: the audience was always going to be api.tailscale.com/<client-id> because Tailscale generates it; treating it as a chosen string sent us through three failed runs.
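For the record, the claim pair Tailscale validates can be inspected from inside a run before involving Tailscale at all. A minimal debug-step sketch, not part of the production workflow; it assumes the job has `id-token: write` and that the audience lives in the `TS_AUDIENCE` secret:

```yaml
# Hypothetical debug step: request a token for the Tailscale audience and print
# the claims the trust will check. Not the deploy step itself.
- name: Inspect OIDC claims
  env:
    # the auto-generated api.tailscale.com/<oidc-client-id> value from the trust card
    TS_AUDIENCE: ${{ secrets.TS_AUDIENCE }}
  run: |
    token=$(curl -sf -H "Authorization: Bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
      "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=${TS_AUDIENCE}" | jq -r '.value')
    # base64url-decode the payload segment so the claims are readable
    payload=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
    while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
    printf '%s' "$payload" | base64 -d | jq '{aud, sub, ref, repository}'
```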
Two follow-on wins:
- Wildcard subject pattern. The Tailscale trust got configured with `repo:jeremylongshore/*:ref:refs/heads/main` — one trust covers every `jeremylongshore` repo’s `main` branch. The reusable workflow can serve the whole portfolio without per-repo trusts.
- Secret deletion. With OIDC verified, `TS_OAUTH_CLIENT_SECRET` was removed from `braves-booth` GitHub Actions secrets. The long-lived static credential is gone.
The hot-fix revert (PR #88) restored the OAuth-secret path partway through iteration so production deploys never lost capability while the OIDC story was still being figured out. That mattered: 25 containers were live, scorecardecho.com was serving real users, and the cost of breaking deploys mid-experiment would have been losing the ability to ship a fix.
Cost: 3 failed real deploy attempts (runs #1, #2, #3) plus one deliberate verification probe (run #4), 4 PRs (#86, #87, #88, #90) to land what should have been 1. Worth it.
Priority 5: the cross-repo reusable workflow
With OIDC working in braves-booth, the next move was extracting the deploy job into a reusable workflow living in a new repo: jeremylongshore/.github. The reasoning: every Intent Solutions repo is going to need the same OIDC → tailnet-routability poll → SSH → docker compose → smoke check sequence. Copy-paste across N repos is how drift happens.
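For orientation, this is roughly the contract the reusable workflow exposes. A sketch, not the actual `vps-deploy.yml`: the secret names match the caller shown later, but the steps are summarized as comments.

```yaml
# jeremylongshore/.github/.github/workflows/vps-deploy.yml -- illustrative shape only
name: vps-deploy
on:
  workflow_call:
    secrets:
      TS_OIDC_CLIENT_ID:
        required: true
      TS_AUDIENCE:
        required: true
      VPS_DEPLOY_KEY:
        required: true
      VPS_HOST_KEY:
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # the caller must grant this too; it enables the OIDC handshake
      contents: read
    steps:
      # 1. Join the tailnet with the GitHub-issued OIDC token (no static secret).
      # 2. Poll until the VPS hostname is routable over the tailnet.
      # 3. Install VPS_DEPLOY_KEY and the host key, SSH in.
      # 4. docker compose pull && docker compose up -d on the VPS.
      # 5. Smoke check the public endpoint with the jq predicate.
      - name: Deploy (steps summarized above)
        run: echo "illustrative placeholder"
```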
Should have been two PRs (one to create the reusable, one to call it from braves-booth).
| Run | Result | What it taught / what changed |
|---|---|---|
| 25237819383 (PR #92) | “Workflow file issue” | jeremylongshore/.github actions/permissions/access defaulted to none — blocked cross-repo reusable workflow call |
| 25237847204 | SSH “No ED25519 host key is known for intentsolutions” | secrets: inherit didn’t forward VPS_HOST_KEY to the reusable workflow |
| 25237950256 (PR #93) | Same SSH host-key error | Explicit per-secret pass-through alone didn’t fix — secret arrived (978 bytes) but its hashed entries didn’t match intentsolutions lookup |
| 25238044624 (PR #94) | Same | Added size-debug step. 3 hashed entries; none decode to intentsolutions |
| 25238103973 (PR #95) | Same | Added hostname-field debug. Confirmed the pinned secret has wrong hash format |
| 25238177470 (PR #96) | SUCCESS | Switched to inline ssh-keyscan -t ed25519 intentsolutions >> known_hosts; SSH match works; deploy completes; smoke passes |
The first failure was the most misleading. The error message GitHub returned was “Workflow file issue” — which sounds like a YAML problem in the reusable. It isn’t. It’s GitHub’s repo-level setting actions/permissions/access. The brand-new jeremylongshore/.github repo had it set to none out of the box (GitHub’s docs don’t explicitly document the default for new private repos, but none was what the API returned when we read it before changing it). That setting controls whether other repos in the same org or user account can call the reusable workflow. With none, the call fails before the workflow file is even parsed.
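Reading the setting back is a single call; this is how the `none` above was observed (the `--jq` filter just trims the response):

```bash
# Check what cross-repo access the new .github repo currently allows.
gh api repos/jeremylongshore/.github/actions/permissions/access --jq .access_level
```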
Fix:
gh api -X PUT repos/jeremylongshore/.github/actions/permissions/access -f access_level=user
One API call. Took twenty minutes to find because the error pointed at the wrong thing.
Run #2 surfaced `secrets: inherit`. The caller workflow had:
jobs:
deploy:
uses: jeremylongshore/.github/.github/workflows/vps-deploy.yml@709a07f
secrets: inherit
inherit is documented as “pass all caller secrets to the called workflow.” In our run, VPS_HOST_KEY did not arrive in the reusable. GitHub’s documented inherit failure modes are chained workflows (A→B→C) and environment-scoped secrets — neither applied here, so this was an undocumented edge or runner state. (The Actions log group Setting up job shows which secrets were resolved on the runner — useful diagnostic surface for this class of issue.) Either way, PR #93 switched to explicit per-secret pass-through:
jobs:
deploy:
uses: jeremylongshore/.github/.github/workflows/vps-deploy.yml@709a07f
secrets:
TS_OIDC_CLIENT_ID: ${{ secrets.TS_OIDC_CLIENT_ID }}
TS_AUDIENCE: ${{ secrets.TS_AUDIENCE }}
VPS_DEPLOY_KEY: ${{ secrets.VPS_DEPLOY_KEY }}
VPS_HOST_KEY: ${{ secrets.VPS_HOST_KEY }}
Same SSH error.
Run #3 added a size-debug step: print the byte count of VPS_HOST_KEY after it arrived in the reusable. Result: 978 bytes. The secret was arriving. So why couldn’t SSH match it?
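The debug step was roughly this shape; a sketch, since the exact step from that run isn’t reproduced here:

```yaml
# Illustrative size-debug step: prove the secret arrived in the reusable
# workflow without printing its contents.
- name: Debug VPS_HOST_KEY size
  env:
    VPS_HOST_KEY: ${{ secrets.VPS_HOST_KEY }}
  run: printf '%s' "$VPS_HOST_KEY" | wc -c
```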
Run #4 added hostname-field debugging — decode the hashed entries in the known_hosts file and check what hostname they hash against. Three entries; none of them, when computed against the literal string intentsolutions, produced a matching hash.
The diagnosis (inferred from symptoms — we did not recompute the salt to confirm): the pinned VPS_HOST_KEY secret was generated months ago by piping ssh-keyscan -H output into the secret value. At some point in that pipeline, a trailing newline or whitespace got included before the hostname. The most likely explanation, consistent with how HMAC-SHA1 known_hosts hashing works (the hostname is the exact message string hashed with a per-entry salt), is that the captured hostname-with-whitespace produced hashes that don’t match a clean intentsolutions lookup. The pin was perfect; the input string had drifted.
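The mechanism is easy to reproduce locally. `ssh-keygen -F` recomputes the per-entry HMAC over the exact hostname string you give it, so an entry scanned with stray whitespace can never match a clean lookup again. File names here are illustrative:

```bash
# A hashed known_hosts entry: |1|<base64 salt>|<base64 HMAC-SHA1(salt, hostname)> <keytype> <key>

# Does the pinned file have an entry matching the clean hostname?
ssh-keygen -F intentsolutions -f ./pinned_known_hosts || echo "no match in pinned file"

# Scan a fresh hashed entry for the same host and confirm it does match.
ssh-keyscan -H -t ed25519 intentsolutions > ./fresh_known_hosts
ssh-keygen -F intentsolutions -f ./fresh_known_hosts
```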
The fix in PR #96 stopped trying to use the pinned hashed entries. Instead, the workflow does a live ssh-keyscan against the tailnet hostname and appends the result inline:
- name: Add VPS host key
run: |
mkdir -p ~/.ssh
chmod 700 ~/.ssh
echo "${{ secrets.VPS_HOST_KEY }}" >> ~/.ssh/known_hosts
ssh-keyscan -t ed25519 intentsolutions >> ~/.ssh/known_hosts
chmod 600 ~/.ssh/known_hosts
Both entries land. The pinned-but-stale entry stays (defense in depth). The live-scanned entry is what SSH actually matches. Tailscale tunnel authentication is the real trust layer here — the Tailnet identity verification means the live scan is meeting a server that already proved it’s the right machine. The strict known_hosts pin was less load-bearing than it looked.
Cost: 5 PRs (#92 through #96), 5 failed deploy runs, the longest single block of the day. Worth it because the resulting reusable workflow is now the canonical deploy path for every jeremylongshore/* repo.
The smoke check that catches degraded states
The reusable workflow ends with a smoke check. The naive version is curl -sf $URL && exit 0. That is a lie detector that doesn’t detect lies.
- name: Smoke check
run: |
for i in 1 2 3 4 5; do
response=$(curl -sf --resolve scorecardecho.com:443:$VPS_PUBLIC_IP https://scorecardecho.com/api/health || echo "{}")
if echo "$response" | jq -e '.status == "ok" and .gumbo.running == true' > /dev/null; then
echo "Smoke passed on attempt $i"
exit 0
fi
sleep 5
done
echo "Smoke failed after 5 attempts"
exit 1
Three things this does that a plain status-code check doesn’t:
- `--resolve scorecardecho.com:443:$VPS_PUBLIC_IP` pins the curl request to the VPS’s public IP. No DNS caching surprise. We are testing this server, not whichever server DNS happened to point at.
- The jq filter `.status == "ok" and .gumbo.running == true` catches the degraded state where the HTTP server is up (returns 200) but the pregame narrative job (gumbo) is dead. A plain 200 OK passes when the application is half-dead. The custom predicate catches it.
- Five-retry warm-up allows `docker compose up` containers to finish boot. Each retry waits 5 seconds. Hard fail at 25 seconds.
This is what “deploy succeeded” should mean. Container running is not enough. HTTP responding is not enough. Application functioning — that’s the bar.
Plan v4.1: testing as first-class
Mid-iteration on P5, the user surfaced the success criterion that reframed the whole program:
“I want to be able to `cd <random-repo> && git push` and have CI just deploy without hiccups.”
That sentence was not in the original plan. The original plan was about consolidating hosting and running the GCP exodus. Deploys were treated as an outcome of getting the pieces in place.
The user’s framing put deploys at the center: every repo, every push, just works. That demands testing as a first-class gate, not a polish step. Plan v4.1 (a same-day rewrite of the program plan) baked it in:
- Pre-deploy: `needs: test` on the caller workflow. No deploy runs unless tests pass (caller sketch after this list).
- Post-deploy: smoke check with custom jq predicate (above).
- Auto-rollback: smoke fail → `exit 1` → deferred VPS-side wrapper tags the previous-known-good and ntfy escalates.
- P6 expanded: audit-tests rollout per repo paired with the SOPS+age pass.
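What the caller side looks like with that pre-deploy gate wired in; a sketch that reuses the real secret names and pinned `uses:` path from braves-booth, with the test job itself left generic:

```yaml
# Caller workflow: tests gate the deploy, the reusable workflow does the rest.
name: deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test   # placeholder for the repo's actual test entrypoint

  deploy:
    needs: test          # pre-deploy gate: no green tests, no deploy
    uses: jeremylongshore/.github/.github/workflows/vps-deploy.yml@709a07f
    secrets:
      TS_OIDC_CLIENT_ID: ${{ secrets.TS_OIDC_CLIENT_ID }}
      TS_AUDIENCE: ${{ secrets.TS_AUDIENCE }}
      VPS_DEPLOY_KEY: ${{ secrets.VPS_DEPLOY_KEY }}
      VPS_HOST_KEY: ${{ secrets.VPS_HOST_KEY }}
```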
The discipline isn’t testing as a step in the build. It’s testing as the gate the deploy must pass to even start, plus the gate the deploy must pass to be considered successful. The two halves change what “merged to main” means.
Parallel work: P6 propagation
While P2 and P5 were eating GitHub Actions minutes, P6 was moving in parallel via a script: scripts/p6-install-harness.sh. The script:
- Vendored install of `@intentsolutions/audit-harness` v0.1.0 (the enforcement-travels-with-code package).
- Appended a `## Testing` section to the repo’s `CLAUDE.md`.
- Created an auto-numbered `000-docs/` entry recording the install.
- Used a worktree-based install for repos with dirty trees so iteration didn’t disturb in-flight work (sketched after this list).
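The worktree trick is worth a sketch. Branch name, temp path, and commit message below are illustrative, not lifted from `p6-install-harness.sh`; the point is that a dirty checkout never gets touched:

```bash
# Sketch: land the harness from a throwaway worktree so in-flight work is untouched.
repo_dir="$1"                            # path to the target repo checkout
branch="chore/audit-harness-install"     # illustrative branch name
wt="$(mktemp -d)"

git -C "$repo_dir" fetch origin main
git -C "$repo_dir" worktree add -b "$branch" "$wt" origin/main

# ...vendored install, CLAUDE.md append, 000-docs/ entry happen inside "$wt"...

git -C "$wt" add -A
git -C "$wt" commit -m "chore: install @intentsolutions/audit-harness v0.1.0"
git -C "$wt" push -u origin "$branch"
git -C "$repo_dir" worktree remove "$wt"
```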
By end of day, 11 of 23 testing-baseline candidates were done: hybrid-ai-stack (pilot), j-rig-binary-eval, claude-code-slack-channel, intent-blueprint-docs, intent-genai-project-template, executive-intent, perception, moat, git-with-intent, intentional-cognition-os, intent-solutions-landing.
One cross-repo discovery surfaced from this rollout: gitleaks was flagging .beads/issues.jsonl as containing credentials. The bd memory store includes string content from past sessions, and one such string matched a gitleaks pattern. False positive. Fix: a path-based .gitleaks.toml allowlist that excludes the bd state directory. Filed as OPS-x6n and replicated across the other repos that hit the same false positive. The harness propagation was the thing that surfaced it — running the same gates across many repos exposes the gate’s blind spots.
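The allowlist itself is small. A sketch assuming gitleaks v8 config conventions; the description text and exact path regex are illustrative:

```bash
# Sketch: path-based allowlist so gitleaks skips the bd state directory (OPS-x6n).
cat > .gitleaks.toml <<'EOF'
[extend]
useDefault = true   # keep the stock ruleset

[allowlist]
description = "bd memory store replays old session strings; not live credentials"
paths = ['''\.beads/''']
EOF
```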
What’s deferred
Honest counterweight. Not everything landed today.
- VPS-side wrapper `/usr/local/sbin/deploy-srv-app` with per-repo allowlist, flock per repo, drop privileges. Needs VPS sudo + scope extension. Currently the SSH command from the reusable workflow runs unwrapped.
- Filesystem split: `/srv/code/<app>` (code) + `/var/lib/intentsolutions/<app>` (state). Today everything is under `/srv/<app>/`.
- Per-repo SSH `command=` restrictions in the VPS `authorized_keys` file. A leaked deploy key currently grants any command, not just deploy (standard pattern sketched after this list).
- Port allocation registry. Allocated by hand right now. Will get a `ports.yaml` source-of-truth file with a `port-check` script.
- `vps-deploy-canary` throwaway test repo. A repo whose only job is exercising the reusable workflow without touching production traffic.
- Auto-rollback semantics. Today the smoke check exits 1 on failure; the deploy is recorded as “failed” but the previous version isn’t automatically promoted back. Full rollback (tag-previous-deploy + ntfy escalation) needs the VPS-side wrapper.
- `bd-sync` close-mirror bug (OPS-nhi). The bd → GitHub mirror tool doesn’t propagate the close-comment. Workaround used all day: `bd close` + manual `gh issue comment`. Fix queued for P5 follow-up.
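For the `command=` item, the likely shape is the standard OpenSSH forced-command pattern; a sketch of the model, not the actual VPS config, with placeholder key material:

```
# ~/.ssh/authorized_keys on the VPS (illustrative). sshd ignores whatever command
# the client requested and runs the forced command instead; the wrapper can then
# dispatch on SSH_ORIGINAL_COMMAND against a per-repo allowlist.
command="/usr/local/sbin/deploy-srv-app",no-port-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA...placeholder deploy@braves-booth
```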
These are not blockers for tomorrow. They’re the next layer of armor on a deploy path that already works end-to-end.
Lessons
- Tailscale OIDC audience is auto-generated. It is `api.tailscale.com/<oidc-client-id>`. You read it from the Tailscale trust UI, not invent it from documentation conventions. Three failed runs say so.
- GitHub cross-repo private workflow access can be locked by default. Our brand-new `.github` repo was set to `none`; the error message said “workflow file issue,” which sounds like YAML. It isn’t. `gh api -X PUT repos/<org>/<repo>/actions/permissions/access -f access_level=user`.
- `secrets: inherit` is fragile across reusable workflow boundaries. Use explicit per-secret pass-through. It’s verbose, it’s strictly clearer, and it actually delivers the secrets.
- Pinned `ssh-keyscan` output drifts. A trailing whitespace at scan time produces a hashed entry that doesn’t match a clean lookup later. Live `ssh-keyscan` against the tailnet hostname inside the workflow is the right model when Tailscale’s tunnel authentication is the real trust layer.
- Wildcard subjects scale. `repo:<org>/*:ref:refs/heads/main` lets one Tailscale trust serve every repo’s main branch. Per-repo trusts would have been a maintenance bog.
- Custom smoke predicates beat status-code checks. `.status == "ok" and .gumbo.running == true` catches half-dead applications that return 200. A plain status-code check is a lie detector that can’t detect lies.
- Plan-v4.1 testing-as-first-class is what made 8 iterations safe instead of expensive. The runbook bootstrapped at 3 AM with `00-plan.md` + `01-tracking-index.md` + per-priority AAR templates. Mid-iteration on P5, the user surfaced the `git push` “just works” criterion, and the plan was rewritten on the spot to bake in pre-deploy `needs: test` gates and post-deploy smoke predicates. Each iteration could fail in flight without producing a half-broken production state, because the smoke predicate would refuse to call any deploy successful that didn’t return `.status == "ok" and .gumbo.running == true`. The 25 production containers stayed healthy through every retry. The discipline came first; the iterations followed; the test gates kept failure local to the iteration instead of propagating to users.
- Hot-fix reverts protect production during experiments. PR #88 restored the OAuth-secret path mid-iteration so 25 production containers never lost the ability to ship a fix. The cost of an experiment is bounded when you keep the previous-known-good path warm.
Day 1 cost summary
- `braves-booth`: 20 PRs in one day (deploy work clustered 16:15 to 19:51 ET).
- `intentsolutions-vps-runbook`: 15 commits (initial bootstrap + 8 plan-document iterations).
- `jeremylongshore/.github`: new repo with the reusable workflow, pinned by 40-char SHA `709a07fbebb1d51806e171204e63f5332abcb0da` from the caller.
- 11 propagation repos got the audit-harness install.
- 35+ commits across the relevant repos.
- All 25 production containers stayed healthy throughout.
The deploy pipeline is now: git push → CI tests pass → reusable workflow → Tailscale OIDC handshake → SSH over tailnet → docker compose pull + up → smoke check with custom jq predicate → exit 0 or exit 1. From the caller’s perspective: one job invocation, one set of explicit secrets, one 40-char-pinned SHA. The whole jeremylongshore portfolio can adopt it by changing four lines.
Day 1 ended with eight priorities open, three closed, one closed-scope-modified, and a working deploy path that didn’t exist twelve hours earlier. Day 2 starts with monitoring (P3) and the SOPS+age propagation push (P6).