When Claude becomes
part of the test suite
Beyond using Claude to write code. Patterns for making the agentic workflow itself catch bugs, enforce quality, and improve the system without you asking.
Product-specific test skills
Instead of running tests ad-hoc, encode the full test suite for each product as a reusable skill. The skill owns the test sequence, the assertions, and the pass/fail report.
The pattern
Each product gets a /test-[product] skill. It runs autonomously, produces a PASS/FAIL report with evidence, and blocks deploy if incomplete.
A suite that crashes or disconnects mid-run is INCOMPLETE, not "8/8 before crash." Report which tests did not run. Never claim a result from a partial run. This is the most violated rule in agentic testing.
Test tiers — what to build and when
| Tier | What it covers | When it runs | Status |
|---|---|---|---|
| API smoke | Every endpoint returns expected shape with real-shape payload | After every deploy, agentically | Build first |
| Regression | Every known bug from lessons.md — repro case, then assert fixed | After every fix, before merge | Build first |
| First-user path | Fresh account, zero pre-configuration, complete core flow | Before any external share | Build first |
| DB round-trip | Constrained columns tested in live DB, not mocks | Any PR touching CHECK / RLS / enum columns | Add after |
| Real E2E | Human-in-the-loop through production (not staging) | Pre-launch and post major change | Add after |
Self-Review Gate — parallel agents as reviewers
After code is written, before it reaches you, 5 parallel agents review it against specific dimensions. P1 issues are auto-fixed. P2/P3 are surfaced as a list.
How it works
DB + schema
CHECK constraints, FK targets, enum values, RLS USING/WITH CHECK match the values the code writes. Round-trip test required for any constrained column.
Security
SQL injection, XSS, exposed secrets, OWASP top 10. Any any types introduced. No auth bypasses.
Consistency
New code matches existing patterns (naming, error handling, response shapes). No dangling imports. No duplicate logic.
Type safety
tsc / mypy passes. No implicit any. Return types declared. New union members handled at every callsite.
New return values handled everywhere
When a classifier or fast-path function gets a new return value (new enum case, new union member), grep every callsite that consumes it and verify the new case is handled. The failure mode: add 'jd_qa' to a return type, but the parent dispatcher has no branch for it — silently falls through to default. A missed case here requires a hotfix PR.
You stop reviewing code line by line and start reviewing a prioritized issue list. The diff is Claude's job. Judgment calls on P2/P3 are yours. This is the producer-to-curator shift — and it only works if the review agents are actually spawned, not skipped under time pressure.
Hooks: enforcement over discipline
Text rules in CLAUDE.md are advisory. Hooks are runtime enforcement. The gap between "Claude should do X" and "Claude cannot do the wrong thing" is a hook.
The principle
When the proposed fix is "remember to X" or "be more careful" — that's the wrong fix. Every discipline-based rule is a hook waiting to be written.
High-leverage hook patterns
| Rule as text | Rule as hook | Trigger |
|---|---|---|
| Never push type errors | Run tsc --noEmit on PostToolUse(Edit) — block push if non-zero | Any file edit |
| No direct commits to main | PreToolUse(Bash) matching git commit — check branch, block if main |
git commit |
| Count affected rows before bulk ops | Pre-fire blast radius: run SELECT COUNT before any UPDATE/DELETE without WHERE | Bash with SQL pattern |
| Regression check before deploy | Pre-push hook greps lessons.md for OPEN entries — block if found | git push |
| No incomplete test reports | PostToolUse on test commands — parse output, flag if suite terminated early | pytest / vitest |
Every rule that gets violated twice has proven it needs a hook. The signal: if you've caught Claude doing the wrong thing more than once despite the rule being in CLAUDE.md, the rule needs to be in settings.json instead.
DB round-trip testing
The failure mode that bites silently: code writes a value, the DB silently rejects it due to a constraint mismatch, the ORM reports success. You discover it in production when the column is always NULL.
A child_gender column had a CHECK accepting 'niña'/'niño' (with ñ). Every code path wrote ASCII 'nina'/'nino'. 100% of writes failed silently for 24 hours. PostgREST returned {data: null, error: null} for every UPDATE. The feature never worked in production. Found by hand, not by tests.
The rule
Any PR touching a column with CHECK, unique constraint, FK, custom enum, or RLS policy must include a round-trip test: write to the live DB from the same client surface (same session context, same RLS role), re-fetch the row, assert the value was written correctly.
Session insights — the self-improving loop
The system gets better without you manually updating it. Session insights extract patterns from what went wrong, propose rule changes, and over time convert advisory rules into enforced hooks.
How the loop works
End-of-session debrief (daily)
Run at the end of every Claude Code session. Output appended to tasks/lessons.md.
Weekly review
Every Friday. Finds patterns across the week and identifies which rules have failed enough to become hooks.
After one month of daily debriefs, lessons.md is the most valuable file in the project. It contains the exact failure modes of your specific codebase — not generic best practices. The weekly review converts that into a tightening enforcement layer. The system starts telling you what to automate next.
Generator-Evaluator: the next step
The Self-Review Gate is one generator pass, one evaluator pass. The full pattern adds iteration: if the evaluator scores below threshold, the generator reruns with the eval feedback. No human in the loop until convergence.
Current state vs full pattern
| Current (Self-Review Gate) | Full Generator-Evaluator | |
|---|---|---|
| Generator | Claude writes code once | Claude writes code, evaluator scores it, Claude regenerates with feedback if below threshold |
| Evaluator | 5 parallel agents, one pass | Same 5 agents, but output feeds back into generator until score clears |
| Iteration | None — one round | N rounds with budget cap (e.g. max 3 attempts) |
| Human touch | P2/P3 list surfaced every time | Human only sees output that cleared the eval threshold |
| Prerequisite | None | Eval needs a scoring signal — /audit-infra or a quality rubric per output type |
The generator-evaluator loop is the highest-leverage upgrade to the build workflow. But it requires the evaluator to have a reliable scoring signal — without that, iteration just compounds the wrong direction. Ship the Self-Review Gate first, use it long enough to calibrate what a P1 looks like, then add the loop.
Collaboration patterns — governing the quality of direction
The test and hook patterns catch bad code. These four patterns prevent code from being written in the wrong direction in the first place. They operate at the collaboration layer, not the output layer.
Tiered Autonomy — calibrate ceremony to blast radius
Classify every task before starting. Different gates fire at different tiers. Without this, Claude applies the same ceremony to a CSS tweak as to a payment flow change.
| Tier | Blast radius | Gates that fire | Examples |
|---|---|---|---|
| T1 Autopilot | Reversible, single file, no external effects | Zero human touchpoints. Fully autonomous. | Bug fixes, CSS tweaks, copy edits, lint fixes |
| T2 Supervised | Multi-file, internal only | Approve the plan, then autonomous. Self-Review Gate fires. | New API endpoint, DB migration, internal component |
| T3 Gated | External effects — users, money | Design preview + plan approval + post-build review. All three. | New feature with UI, payment logic, auth changes |
| T4 Ceremony | New product or system, high uncertainty | Full gate sequence. Architecture doc. Cross-team alignment. | Greenfield project, major architecture change |
State the tier at the start of every task: "T2 (new API endpoint, internal only). Plan: [one sentence]. Starting." The user can override: "this is T1, just fix it." T1 means T1 — no design previews, no elegance pauses, no CEO review. Fix it, verify it works, move on.
ISC Rule — define success before writing a line
Before any T2+ build, write 3 verifiable success criteria. The failure mode without this: success is debated after the diff exists instead of before it.
Silent failures (RLS mismatch, wrong WHERE clause, PostgREST returning null with no error) pass visual inspection and unit tests. Only a real request through the same client surface catches them. This is the single most common gap between "works in development" and "works in production."
PRE-BUILD PARAPHRASE — one sentence before the diff
Before writing any multi-file or new-component code, Claude states in one sentence what it's about to build, framed so you can challenge it. Catches direction errors before they're baked into code.
GROUNDED PROPOSAL RULE — no options without reading first
Before presenting "Option A vs B" or any architecture recommendation, Claude must have read the relevant code in this session. The proposal ends with a citation line. If the citation can't be written, the proposal can't be made yet.
Claude proposes Option A. You ask a clarifying question. Claude flips to Option B. You ask another. Claude flips back to A, rephrased. Three different recommendations on the same decision in ten minutes — not because the problem changed, but because the original proposal was built from inference, not from reading the code. The citation line is the gate: no reads, no proposal.
SINGLE-ENTRY-POINT GATE — guard the choke point, not every path
When N coupled paths share a single entry point, place the guard at that entry point — not scattered across each path. Grep the call graph first to find the choke point.
A campaign-pause bug affected 4 media paths + text paths + continuation paths. One gate at dispatchInboundMessage fixed all of them in ~70 lines. Scattering checks across each path would have created N new places to forget the check on the next feature. Pattern: grep the call graph, find the single entry point, guard there once.
Parallel agents — how to actually do it
Three distinct mechanisms with different scope, cost, and use case. Picking the wrong one is the most common mistake. The mandatory rule before any spawn: create a file-ownership map.
The three mechanisms
| Mechanism | What it is | Best for | How to invoke |
|---|---|---|---|
| Subagents | Lightweight, focused reads. One task per agent. Report back to main. | Codebase scans, research, checking references. Keeps main context clean. | Agent tool (subagent_type=Explore) or Task tool with run_in_background |
| Agent Teams | Multiple Claude Code instances coordinating on the same problem from different angles. | Review/challenge patterns — each agent finds what others miss. Self-Review Gate. | CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 then Task tool spawning N background tasks |
| Headless CLI | Claude Code running programmatically from shell, cron, or CI — no interactive session. | Scheduled intelligent work (not just bash scripts). Crons that need Claude analysis. | claude -p "prompt" --bare --allowedTools "Read,Bash" --output-format json |
Agents reading independently and returning summaries → subagents. Agents challenging each other's findings on the same codebase → agent teams. Scheduled work that needs Claude reasoning at fire time → headless CLI. Don't use agent teams for work with a clear dependency chain — parallel spawns on dependent tasks produce conflicts, not speed.
Mandatory pre-spawn: the file-ownership map
Before spawning ANY parallel agents that write code, create this map. If two agents share a file, either reassign or sequence them. Present the map before spawning.
Subagents and task agents are for research, analysis, and proposals only. They read files, identify root causes, and propose fixes with exact file paths and line numbers — but they do NOT edit files. The main agent applies all changes sequentially. This prevents permission errors, file conflicts, and plan-mode loops where subagents block waiting for approval.
The four parallel patterns
Headless CLI — Claude in cron and CI
Run Claude programmatically from any shell. The same agent loop as interactive mode, scriptable. The gap this closes: scheduled work that needs reasoning at fire time, not just bash execution.
Every cron firing at K tokens/call compounds. Log total_cost_usd from every response to a daily ledger. Alert when weekly total exceeds your threshold. Don't bulk-convert all crons at once — adopt the primitive with one proof-of-concept, measure cost, then expand. --allowedTools should always be the minimum set, never blanket Bash.
When NOT to parallelize
| Situation | Mode | Why |
|---|---|---|
| Building a feature (DB → API → UI) | Sequential only | Hard dependencies. API can't be built before DB schema exists. |
| Tasks where agents share files | Sequential or reassign | File conflicts. Two agents editing the same file produces merge chaos. |
| Planning and task breakdown | Sequential only | Each step informs the next. Can't plan in parallel. |
| Deploy pipeline | Sequential only | Hard pipeline dependencies. Tests must pass before push. |
| Independent bug fixes | Parallel via worktrees | Isolated branches, no shared state. |
| Code review (same codebase, different dimensions) | Parallel agent teams | Each agent finds what others miss. Independent reads. |
| Codebase research before planning | Parallel subagents | Independent reads. Keeps main context clean for decisions. |
BUILD-USE-TRUST-ORCHESTRATE — the governance rule
Before plugging any project folder into parallel/automated multi-agent work, it must pass through four sequential phases. Skipping phases produces agents opening PRs for already-fixed bugs and an evaluation bottleneck on the human — because AI agents have no speed limit but the person managing them does.
Has the foundation
CLAUDE.md, scoped skills, docs/runbooks, accumulated context. The folder has opinions.
Worked manually for weeks
Real issues found and fixed. You know where it's fragile. No surprises from the folder.
Output can be skimmed
You no longer fully review every line. A scan is enough to catch problems. Skim-trust threshold reached.
Only now: automate
Plug into parallel dispatch, multi-agent coordination, or unattended cron work. Not before.
What to build, in order
Each step unlocks the next. Don't skip.
| Step | What | Why now |
|---|---|---|
| 1 | Tiered Autonomy + PRE-BUILD PARAPHRASE + ISC | Habits, not builds. Governs everything else. Wrong tier = wrong gates. Wrong paraphrase = wrong diff. |
| 2 | End-of-session debrief + lessons.md | Seeds every automated system downstream. No lessons.md = nothing to compile into rules or hooks. |
| 3 | First /test-[product] skill for your main product | Makes testing repeatable and agentically runnable. Base for the regression suite. |
| 4 | Pre-push hook: type check + open lessons block | First enforcement gate. Converts two common discipline failures into automatic blocks. |
| 5 | Self-Review Gate (Pattern B — 5 parallel agent teams) | Shifts you from reviewer of code to reviewer of a prioritized issue list. Requires CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1. |
| 6 | Brainstorm subagents before any significant feature (Pattern A) | 3 parallel reads before writing a line. Catches reuse opportunities and constraints before the diff exists. |
| 7 | Worktree isolation for multi-bug sessions (Pattern D) | Independent branches, no conflicts. Parallel bug fixes that would otherwise be sequential. |
| 8 | Weekly review cron (Friday, automated) | System starts identifying its own improvement candidates without you asking. |
| 9 | One headless claude -p --bare proof-of-concept |
Closes the gap between scheduled bash scripts and scheduled intelligent work. Start with one cron, measure cost, expand. |
| 10 | Generator-Evaluator loop on build workflow | Closes the last producer-to-curator gap. Build after step 5 is well-calibrated. Requires claude -p --resume. |