🔥 Fireteam — example deliberation

Canned transcript. This is the static vignette unit test, not the live service.

Session details

Input type: Question / architectural decision
Effort: eco (2 rounds)
Synthesis voice: Veer (Synthesizer, slight Architect)
Team: Kern (Architect, slight Critic) · Lattice (Builder, slight Reasoner) · Veer (Synthesizer, slight Architect)
Input: "Should we add a fireteam multi-model deliberation tool to Artifex?"

Synthesis — by Veer

Add fireteam as an experimental feature behind a flag. The default for "second opinion" stays with Doxa; fireteam is the opt-in for the cases where disagreement-as-a-service is worth the extra latency.

Three concrete commitments from this session:

  1. Kern's prompt-drift signal becomes a CI check — three sessions on a fixed question, structural diff across syntheses.
  2. Lattice builds in 4 PRs with the size caps committed in each (plumbing, orchestration, frontend, proof).
  3. Doxa gets a fireteam dispatch path in the WOC step for architectural decisions, opt-in via flag.

Build order: plumbing → orchestration → frontend → proof, one PR per week. After four weeks, evaluate against the prompt-drift signal. If stable, promote out of experimental; if not, the system prompts need another pass. We can ship a working tool in a month and have a real answer on whether the prompts hold within six.

Full deliberation

Round 1

kern · draft direct mimo
We already have Soma, Doxa, Giles, Vibetelligence. Each one is single-persona or single-model. The specific failure mode we're missing: a model generates a plausible-sounding answer that another model would have caught. That's what fireteam solves. The cost is 9 LLM calls per round, but the value is disagreement-as-a-service — we don't have that anywhere else in the stack. Architecturally it slots in next to Doxa (both are deliberation tools), uses the same LiteLLM gateway, same memory rooms. Long-term: this is the seed for everything else. Multi-model verification, multi-agent coordination, agent-on-agent review. Without it, every other tool that needs second opinions has to reinvent the integration.
lattice · draft direct minimax
Here's what it actually looks like to build. Three FastAPI endpoints, one Postgres schema, one SSE stream, one HTML page. Three hundred lines of Python if you don't overthink it. The hard part is the system prompts — those are the real product. Each persona needs a voice that holds across rounds. I've been writing prompts for three days. Lattice's voice is "what's the smallest thing that's actually true and verifiable" — that has to be stable enough that you can call it six times in a row and it doesn't drift into philosopher mode. The other hard part is the round loop: each phase fires 3 parallel calls, each phase depends on the prior, and the persistence has to survive a container restart mid-deliberation. Doable but not trivial — maybe 800 lines if you're careful.
veer · draft claude-opus-4.8
Both of you are right and the disagreement is about timeline, not destination. Kern sees this as the seed of multi-agent verification across the stack. Lattice sees it as a concrete 300-line FastAPI app. Neither is wrong. The way to ship is: treat fireteam as the smallest version of Kern's vision — three personas, fixed team, one synthesis voice — and let the system prompts be the actual R&D. If the prompts hold, the architecture generalizes for free. If the prompts don't hold, no amount of FastAPI scaffolding saves us.
kern · critique of lattice direct mimo
Your "three days writing prompts" admission is the actual risk. Prompts that hold across rounds aren't a solved problem in 2026 — every serious multi-agent system has prompt drift as its #1 failure mode. Are we treating prompt iteration as the engineering work, or are we hoping Lattice's voice emerges from one shot?
kern · critique of veer direct mimo
"Smallest version of the vision" is exactly right framing, but you didn't name the test. What's the failure signal that tells us prompts aren't holding? Without that, we'll ship and watch them drift for three months.
lattice · critique of kern direct minimax
"Disagreement-as-a-service" is good naming but vague on demand. Who calls fireteam? Doxa could, but Doxa is itself a synthesis tool — is fireteam a peer or a service Doxa dispatches? I want to see one concrete caller in the first PR or the value is hypothetical.
lattice · critique of veer direct minimax
You named the architectural choice but skipped the economic one. 9 LLM calls per round, deep = 3 rounds = 27 calls. At direct-mimo rates (token plan) it's free, but any OpenRouter-routed persona caps at $20/scenario. Who's paying? When does the budget kick in? If fireteam is the new default for "I want a second opinion," we should know the cost ceiling before the build.
veer · critique of kern claude-opus-4.8
The "seed for everything else" framing is what I want to keep, but you're underselling the cost. Yes, 9 calls per round is cheap when mimo is one of them. But fireteam is also the first tool in the stack with three different models deliberating — that's a new failure surface (one model hallucinates a critique that another model believes). Is prompt engineering the lever, or is it model diversity?
veer · critique of lattice claude-opus-4.8
The "300 lines" estimate is optimistic. The state machine alone (draft → critique → revise, with parallel dispatch within each phase, with per-persona error containment) is 200 lines. Add DB schema, SSE, frontend, memory wiring, the system prompts you're still writing — closer to 1000. That's fine if the 1000 is the right shape, but the PR size matters for review and rollback.
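The state machine Veer describes (draft → critique → revise, parallel dispatch within each phase, per-persona error containment) could be sketched roughly as below. This is an illustration under assumptions, not the fireteam orchestrator: `call_persona`, `run_round`, the `fail` test hook, and the result shape are all hypothetical, and the model call is a stub.

```python
import asyncio

PHASES = ("draft", "critique", "revise")
PERSONAS = ("kern", "lattice", "veer")


async def call_persona(persona: str, phase: str, context: dict) -> str:
    """Hypothetical stand-in for one LLM call; a real version would call the gateway."""
    await asyncio.sleep(0)  # yield control, as a network call would
    if context.get("fail") == (persona, phase):  # test hook to simulate a failing call
        raise RuntimeError(f"{persona} failed during {phase}")
    return f"{persona}:{phase}"


async def run_round(context: dict) -> dict:
    """Run the three phases in order; each phase dispatches all personas in parallel."""
    results: dict = {}
    for phase in PHASES:  # sequential: each phase sees the output of the prior ones
        phase_context = {**context, "prior": dict(results)}
        outputs = await asyncio.gather(
            *(call_persona(p, phase, phase_context) for p in PERSONAS),
            return_exceptions=True,  # one persona's failure does not cancel the others
        )
        results[phase] = {
            p: ({"ok": False, "error": str(o)} if isinstance(o, Exception)
                else {"ok": True, "text": o})
            for p, o in zip(PERSONAS, outputs)
        }
    return results
```

Three phases with three personas each gives the 9 calls per round cited in the discussion. Persistence across a container restart is deliberately left out of this sketch.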
kern · revise direct mimo
I'll commit to a prompt-drift failure signal: if Veer's synthesis across three sessions on the same question produces noticeably different framings (different paragraph structure, different emphasized tradeoffs), prompts aren't holding. That's measurable without infra — three sessions, eyeball the diffs. Until we have that signal we shouldn't promote fireteam past experimental. On Lattice's question: Doxa is the first concrete caller. Doxa's WOC step currently averages 5 models → synthesis; for hard architectural decisions Doxa should be able to dispatch a fireteam as the third pass. Peer, not subordinate.
lattice · revise direct minimax
On the build size: Veer's right, 1000 lines is the honest number; 300 was the "happy path" estimate. I'll commit to the size by breaking the work into 4 PRs (plumbing, orchestration, frontend, proof) so each one is reviewable. On the cost: the $20/scenario cap applies only when one of the three personas is OpenRouter-routed. If all three are direct (mimo + minimax), the only cost is token-plan minutes, which we already pay for. The economic case is "free for token-plan, capped for OpenRouter, never unbounded."
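The cost arithmetic from this round can be written down as a small sketch. The call counts, round counts, and the $20 cap come from the transcript; the function names and the route labels `"direct"` and `"openrouter"` are assumptions made for illustration.

```python
from typing import Optional

CALLS_PER_ROUND = 9             # from the discussion: 3 phases x 3 personas
ROUNDS = {"eco": 2, "deep": 3}  # effort levels named in the transcript
OPENROUTER_CAP_USD = 20.0       # per-scenario cap quoted by Lattice


def total_calls(effort: str) -> int:
    """Total LLM calls for one session at the given effort level."""
    return CALLS_PER_ROUND * ROUNDS[effort]


def cost_ceiling_usd(routes: dict) -> Optional[float]:
    """Return the per-scenario dollar cap, or None when every persona is direct.

    `routes` maps persona name to a route label (hypothetical labels).
    None means the session only consumes the existing token plan.
    """
    if any(route == "openrouter" for route in routes.values()):
        return OPENROUTER_CAP_USD
    return None
```

For example, `total_calls("deep")` gives the 27 calls Lattice mentions, and an eco session like this one uses 18.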

Round 2

kern · draft direct mimo
Lattice's PR-size commitment addresses Veer's critique of the "300 lines" optimism. The four-PR split is the right shape. On Veer's "model diversity as a new failure surface": I disagree with the framing but agree with the underlying concern. Model diversity is the lever, not a risk. The risk is unverified diversity: three models that all default to the same training priors (e.g., three GPT-family models) are one model with extra latency. Verified diversity (different training corpora, different RLHF regimes) is what catches the hallucination case. The persona system prompts should specify this: Kern and Lattice should not both be Claude-family, for example. The prompt-drift signal I committed to last round is the operationalization. Three sessions, eyeball the diffs, ship the tool that automates the eyeball.
lattice · draft direct minimax
Kern's "verified diversity" framing is the right constraint — I'll add it to the system prompts as a hard rule. Two personas from the same model family is a configuration error, not a style choice. Veer's "1000 lines honest" estimate matches what I actually wrote. The state machine is the bulk; the SSE stream is fiddly but small; the frontend is half of it. For the CI check on prompt drift: three sessions on the same fixed question, compute a structural diff (paragraph count, section headings, named entities) and fail if the variance exceeds a threshold. Eyeball is fine for the first month; CI is the durable form.
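A rough sketch of the structural-drift check Lattice outlines might look like the following. It is not the project's CI check: the feature set is reduced to paragraph count and numbered-list items (heading and named-entity extraction are omitted), and the function names and the 0.5 variance threshold are assumptions chosen only for illustration.

```python
import re
from statistics import pvariance


def structure(text: str) -> dict:
    """Crude structural fingerprint of one synthesis."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    numbered = re.findall(r"(?m)^[ \t]*\d+\.\s", text)  # lines starting like "1. "
    return {"paragraphs": len(paragraphs), "numbered_items": len(numbered)}


def drift(syntheses: list) -> dict:
    """Population variance of each structural feature across sessions."""
    prints = [structure(s) for s in syntheses]
    return {key: pvariance([p[key] for p in prints]) for key in prints[0]}


def check_drift(syntheses: list, threshold: float = 0.5) -> None:
    """Raise if any feature varies more than the (assumed) threshold."""
    exceeded = {k: v for k, v in drift(syntheses).items() if v > threshold}
    if exceeded:
        raise AssertionError(f"prompt drift detected: {exceeded}")
```

In CI this would run over the three syntheses produced for the fixed question; a raised `AssertionError` fails the job.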
veer · draft claude-opus-4.8
Kern and Lattice are converging on the same shape from different angles. Kern sees it from the strategic frame — verified diversity, prompt drift as CI. Lattice sees it from the build frame — 1000 lines, four PRs, structural diff as the durable check. Both arrive at "ship the small version, measure the failure modes, promote based on signal." That's the synthesis. I'm going to write it as my final revision, but the through-line is: fireteam ships as experimental, three concrete commitments anchor it (CI drift check, four-PR build order, Doxa dispatch path), and the four-week evaluation against the drift signal is the promotion gate. One thing we didn't discuss: who maintains the system prompts after launch? If prompts are the actual R&D, prompt iteration is ongoing work, not a launch checkbox. The right owner is the person who uses fireteam most — which is currently us, so we own it. But that's an implicit handoff that should be explicit.
kern · critique of veer direct mimo
The "implicit handoff should be explicit" point is the only soft thing in your draft. Everything else is sharp. Make the prompt-ownership call in the synthesis — name the owner, name the cadence (monthly review? quarterly?), and Veer becomes the prompt-steward role explicitly, not implicitly.
kern · critique of lattice direct mimo
"Two personas from the same model family is a configuration error" is the right rule. Add it to the spec as a hard constraint, not a soft suggestion. The UI should warn or block when the team_config violates it.
lattice · critique of veer direct minimax
The "monthly review cadence" implication is right but the mechanism is missing. Prompt iteration without a feedback signal is just editing in the dark. Tie the monthly review to a real metric — pull the last 30 days of fireteam sessions, run the structural-drift check on the syntheses, surface what drifted. The review happens against data, not vibes.
lattice · critique of kern direct minimax
The "warn or block" UI behavior is a real choice with real tradeoffs. Warning is permissive (users override and we learn why); blocking is conservative (users complain but we never ship a same-family pair). My instinct: warn loudly, don't block. Blocking a same-family pair would also block the legit case of "I only have one model available and I still want to deliberate." Warn is the right default.
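The warn-don't-block behavior could be sketched as a validation helper like the one below. The shape of `team_config` (persona name mapped to a model-family label), the function name, and the message wording are assumptions; the transcript only names `team_config` and the rule itself.

```python
import warnings
from collections import defaultdict


def check_family_diversity(team_config: dict) -> list:
    """Warn (never raise) when two or more personas share a model family.

    Returns the warning messages so a UI could display them as well.
    """
    by_family = defaultdict(list)
    for persona, family in team_config.items():
        by_family[family].append(persona)

    messages = []
    for family, personas in sorted(by_family.items()):
        if len(personas) > 1:
            msg = (
                f"{', '.join(sorted(personas))} share model family '{family}'; "
                "fireteam relies on verified diversity across personas."
            )
            warnings.warn(msg, UserWarning, stacklevel=2)
            messages.append(msg)
    return messages
```

Because the helper only warns, a team with a single available model can still run a deliberation, which is the case Lattice wants to keep open.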
veer · critique of kern claude-opus-4.8
Naming Veer the "prompt-steward" role makes sense for the multi-prompt system but underweights that Kern and Lattice's prompts are doing real work too. The right framing is a rotating ownership: each persona's prompt is reviewed quarterly by a different persona (Kern reviews Veer's, Veer reviews Lattice's, Lattice reviews Kern's). The reviewer-as-other-persona is itself a fireteam pattern — third-party review of work you've emotionally committed to.
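The rotation Veer proposes can be expressed as a cyclic shift over the team. In this sketch the function name and the idea of alternating between the two possible shifts each quarter are assumptions; the transcript only specifies that each prompt is reviewed by a different persona and gives one example assignment.

```python
PERSONAS = ["kern", "veer", "lattice"]  # ordered so shift 1 matches the example in the text


def review_assignments(quarter: int) -> dict:
    """Map reviewer -> persona whose prompt they review in the given quarter."""
    shift = 1 + (quarter % 2)  # 1 or 2, never 0, so nobody reviews their own prompt
    n = len(PERSONAS)
    return {PERSONAS[i]: PERSONAS[(i + shift) % n] for i in range(n)}
```

With `quarter=0` this yields the assignment from the turn above: Kern reviews Veer's prompt, Veer reviews Lattice's, and Lattice reviews Kern's.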
veer · critique of lattice claude-opus-4.8
The "review against data not vibes" call is sharp. I'd add: the structural-drift check should fail loudly, not silently. If Veer's syntheses start drifting more than the threshold, the CI fails and the prompt-owner rotation kicks in. Silent drift is the failure mode we're trying to prevent — don't replicate it in the monitoring.
kern · revise direct mimo
Veer's rotating-ownership proposal is better than my single-steward idea — third-party review of emotionally-committed work is the actual pattern we want. I'll commit to that in the synthesis. Lattice's "warn loudly, don't block" on same-family pairs is the right default — matches the opt-in experimental framing. The warn message names the diversity principle so users understand why. I'm done. My position is: ship small, measure drift, rotate prompt ownership quarterly, evaluate after four weeks.
lattice · revise direct minimax
I agree with Kern's final position. Adding one operational detail: the monthly review should run on the first Monday of the month and produce a one-page report (drift metrics, prompt diffs since last review, recommended edits). The cadence and the artifact make it real work, not aspirational. The four-PR build order stands. Plumbing first (~250 lines), orchestration (~400 lines including prompts + state machine), frontend (~300 lines), proof (~50 lines for the vignette plus the smoke transcript). I commit to those size caps.
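Scheduling the monthly review on the first Monday is easy to compute with the standard library. A minimal sketch, with a hypothetical function name:

```python
from datetime import date, timedelta


def first_monday(year: int, month: int) -> date:
    """Return the date of the first Monday in the given month."""
    first = date(year, month, 1)
    # weekday() is 0 for Monday, so this is the number of days until
    # the next Monday (0 if the 1st is already a Monday).
    offset = (7 - first.weekday()) % 7
    return first + timedelta(days=offset)
```

A scheduler could compare `date.today()` against `first_monday(...)` for the current month to decide whether to generate the one-page report.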
veer · revise (synthesis) claude-opus-4.8
Add fireteam as an experimental feature behind a flag. The default for "second opinion" stays with Doxa; fireteam is the opt-in for the cases where disagreement-as-a-service is worth the extra latency.

Three concrete commitments from this session:

  1. Kern's prompt-drift signal becomes a CI check: three sessions on a fixed question, structural diff across syntheses. Loud failures, not silent drift.
  2. Lattice builds in 4 PRs with the size caps committed in each: plumbing (~250), orchestration (~400), frontend (~300), proof (~50 for vignette + smoke).
  3. Doxa gets a fireteam dispatch path in the WOC step for architectural decisions, opt-in via flag. Peer, not subordinate.

Two operational commitments added this round:

  4. Same-family warning in the UI when two personas share a model family: warn loudly, don't block. This matches the opt-in experimental framing.
  5. Rotating prompt ownership: each persona's prompt is reviewed quarterly by a different persona. Third-party review of emotionally-committed work.

Build order: plumbing → orchestration → frontend → proof, one PR per week. After four weeks, evaluate against the prompt-drift signal. If stable, promote out of experimental; if not, the system prompts need another pass.

Operational cadence: monthly review on the first Monday, with a one-page report (drift metrics, prompt diffs, recommended edits). The prompt-ownership rotation kicks in quarterly.

We can ship a working tool in a month and have a real answer on whether the prompts hold within six.