Agentic Development Patterns
Testing, QA, and self-improving workflows

When Claude becomes part of the test suite

Beyond using Claude to write code: patterns for making the agentic workflow itself catch bugs, enforce quality, and improve the system without being asked.

Pattern 01

Product-specific test skills

Instead of running tests ad-hoc, encode the full test suite for each product as a reusable skill. The skill owns the test sequence, the assertions, and the pass/fail report.

The pattern

Each product gets a /test-[product] skill. It runs autonomously, produces a PASS/FAIL report with evidence, and blocks deploy if incomplete.

SKILL.md — /test-sophie (structure)
## Steps

1. Run API smoke test
   curl /api/health → expect 200
   curl /api/test-buffer → send test payload, verify response shape

2. WhatsApp pipeline E2E
   Trigger inbound message via WAHA test endpoint
   Assert: intent classified, reply dispatched within 3s
   Assert: send_at timestamp written to DB (dedup guard)

3. Regression suite
   For every entry in tasks/lessons.md marked OPEN:
     Run the specific repro case
     Assert it no longer triggers the failure

4. First-user path
   Create fresh org (no pre-configuration)
   Complete full onboarding flow
   Assert: all expected rows created in DB

5. Report
   PASS: all assertions green, suite completed fully
   FAIL: list which assertions failed + which tests did not run
   INCOMPLETE: suite terminated early; flag explicitly, do not report as partial pass
INCOMPLETE TEST RULE

A suite that crashes or disconnects mid-run is INCOMPLETE, not "8/8 before crash." Report which tests did not run. Never claim a result from a partial run. This is the most violated rule in agentic testing.
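A minimal sketch of the rule in Python, assuming the skill knows the planned test names up front and collects a pass/fail flag for each test that actually reported. The names `summarize`, `SuiteReport`, and `Verdict` are hypothetical and not part of any skill API.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    INCOMPLETE = "INCOMPLETE"


@dataclass
class SuiteReport:
    verdict: Verdict
    failed: list[str]
    not_run: list[str]


def summarize(planned: list[str], results: dict[str, bool]) -> SuiteReport:
    """Compare the tests that were planned against the ones that reported."""
    not_run = [name for name in planned if name not in results]
    failed = [name for name in planned if results.get(name) is False]
    if not_run:
        # A partial run is never PASS or FAIL, even if every test that ran was green.
        return SuiteReport(Verdict.INCOMPLETE, failed, not_run)
    if failed:
        return SuiteReport(Verdict.FAIL, failed, [])
    return SuiteReport(Verdict.PASS, [], [])
```

Checking `not_run` first is the design choice that matters: a crash after eight green tests can only ever produce INCOMPLETE.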

Test tiers — what to build and when

Tier | What it covers | When it runs | Status
API smoke | Every endpoint returns expected shape with real-shape payload | After every deploy, agentically | Build first
Regression | Every known bug from lessons.md: repro case, then assert fixed | After every fix, before merge | Build first
First-user path | Fresh account, zero pre-configuration, complete core flow | Before any external share | Build first
DB round-trip | Constrained columns tested in live DB, not mocks | Any PR touching CHECK / RLS / enum columns | Add after
Real E2E | Human-in-the-loop through production (not staging) | Pre-launch and post major change | Add after
Pattern 02

Self-Review Gate — parallel agents as reviewers

After code is written, before it reaches you, 5 parallel agents review it against specific dimensions. P1 issues are auto-fixed. P2/P3 are surfaced as a list.

How it works

Code written
5 agents spawn
P1 auto-fixed
P2/P3 listed
You review the list, not the diff
Agent 1

DB + schema

CHECK constraints, FK targets, enum values, RLS USING/WITH CHECK match the values the code writes. Round-trip test required for any constrained column.

Agent 2

Security

SQL injection, XSS, exposed secrets, OWASP Top 10. Flags every newly introduced `any` type. No auth bypasses.

Agent 3

Consistency

New code matches existing patterns (naming, error handling, response shapes). No dangling imports. No duplicate logic.

Agent 4

Type safety

tsc / mypy passes. No implicit any. Return types declared. New union members handled at every callsite.

Agent 5 — Dispatcher completeness

New return values handled everywhere

When a classifier or fast-path function gets a new return value (new enum case, new union member), grep every callsite that consumes it and verify the new case is handled. The failure mode: add 'jd_qa' to a return type, but the parent dispatcher has no branch for it — silently falls through to default. A missed case here requires a hotfix PR.
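One way to turn that fall-through into a loud failure, sketched in Python under assumptions: the intent names other than 'jd_qa' and the handler strings are hypothetical, and `assert_never` is defined locally so the block runs on older Python versions.

```python
from typing import Literal, NoReturn

Intent = Literal["greeting", "faq", "jd_qa"]


def assert_never(value: NoReturn) -> NoReturn:
    """Fail loudly at runtime; a type checker flags the call if it is reachable."""
    raise AssertionError(f"unhandled intent: {value!r}")


def dispatch(intent: Intent) -> str:
    if intent == "greeting":
        return "handle_greeting"
    if intent == "faq":
        return "handle_faq"
    if intent == "jd_qa":
        return "handle_jd_qa"
    # No silent default branch: a new Intent member without a branch ends up here.
    assert_never(intent)
```

With this shape, adding a member to `Intent` without adding a branch should surface as a type-check error at the `assert_never` call, which Agent 4 would then report before merge.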

The shift this enables

You stop reviewing code line by line and start reviewing a prioritized issue list. The diff is Claude's job. Judgment calls on P2/P3 are yours. This is the producer-to-curator shift — and it only works if the review agents are actually spawned, not skipped under time pressure.

Pattern 03

Hooks: enforcement over discipline

Text rules in CLAUDE.md are advisory. Hooks are runtime enforcement. The gap between "Claude should do X" and "Claude cannot do the wrong thing" is a hook.

The principle

When the proposed fix is "remember to X" or "be more careful" — that's the wrong fix. Every discipline-based rule is a hook waiting to be written.

settings.json — hook structure
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/pre-fire-blast-radius.sh" }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit",
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/tsc-check.sh" }
        ]
      }
    ]
  }
}

High-leverage hook patterns

Rule as text | Rule as hook | Trigger
Never push type errors | Run tsc --noEmit on PostToolUse(Edit); block push if non-zero | Any file edit
No direct commits to main | PreToolUse(Bash) matching git commit: check branch, block if main | git commit
Count affected rows before bulk ops | Pre-fire blast radius: run SELECT COUNT before any UPDATE/DELETE without WHERE | Bash with SQL pattern
Regression check before deploy | Pre-push hook greps lessons.md for OPEN entries; block if found | git push
No incomplete test reports | PostToolUse on test commands: parse output, flag if suite terminated early | pytest / vitest
The real gap: rules without hooks are bets on discipline

Every rule that gets violated twice has proven it needs a hook. The signal: if you've caught Claude doing the wrong thing more than once despite the rule being in CLAUDE.md, the rule needs to be in settings.json instead.
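As an illustration of converting one rule ("no direct commits to main") into a hook, here is a Python sketch. It assumes the hook receives the tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, and that exit code 2 blocks the call; check both against the hook documentation for your Claude Code version. The function names are hypothetical.

```python
import json
import re
import subprocess
import sys

PROTECTED = {"main"}
COMMIT_RE = re.compile(r"\bgit\s+commit\b")


def should_block(payload: dict, branch: str) -> bool:
    """Block only Bash calls that run `git commit` on a protected branch."""
    if payload.get("tool_name") != "Bash":
        return False
    command = payload.get("tool_input", {}).get("command", "")
    return bool(COMMIT_RE.search(command)) and branch in PROTECTED


def current_branch() -> str:
    out = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=False,
    )
    return out.stdout.strip()


def main() -> int:
    payload = json.load(sys.stdin)  # assumed: hook input arrives as JSON on stdin
    if should_block(payload, current_branch()):
        print("Blocked: commit on a protected branch.", file=sys.stderr)
        return 2  # assumed: exit code 2 means "block this tool call"
    return 0

# As a hook script, the file would end with: sys.exit(main())
```

Keeping the decision in a pure function (`should_block`) means the rule itself can be unit-tested without a git repository or a running session.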

Pattern 04

DB round-trip testing

The failure mode that bites silently: code writes a value, the DB silently rejects it due to a constraint mismatch, the ORM reports success. You discover it in production when the column is always NULL.

The exact failure

A child_gender column had a CHECK accepting 'niña'/'niño' (with ñ). Every code path wrote ASCII 'nina'/'nino'. 100% of writes failed silently for 24 hours. PostgREST returned {data: null, error: null} for every UPDATE. The feature never worked in production. Found by hand, not by tests.

The rule

Any PR touching a column with CHECK, unique constraint, FK, custom enum, or RLS policy must include a round-trip test: write to the live DB from the same client surface (same session context, same RLS role), re-fetch the row, assert the value was written correctly.

Round-trip test checklist — any constrained column
1. Read the constraint:
   CHECK list, FK target, enum values, RLS USING/WITH CHECK clauses.

2. Confirm the value the code writes matches what the constraint accepts:
   - Case: 'Niña' vs 'niña'
   - Encoding: accented chars vs ASCII ('niña' vs 'nina')
   - Type: string vs integer
   - Length: varchar(50) vs longer

3. Write from the same client surface:
   Same browser session, same function context, same RLS role.
   Not from psql as superuser. Not from a mocked client.

4. Re-fetch the row:
   Assert the column holds the expected value, not NULL.
   The absence of a client-side toast/error is NOT proof of success.

5. PostgREST gotcha:
   An UPDATE that touches zero rows (RLS mismatch, wrong eq filter)
   returns {data: null, error: null}, which the client treats as success.
   Only a re-fetch proves the write landed.
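The write-then-re-fetch step from the checklist above can be sketched as follows. This uses an in-memory SQLite database purely to stay self-contained; the rule itself requires the live DB and the same RLS role. The table name `children` and the helper `write_and_verify` are hypothetical. Note that SQLite raises on a CHECK violation, so the silent case shown here is the zero-row UPDATE.

```python
import sqlite3


def write_and_verify(conn, row_id: int, value: str) -> None:
    """Write, then re-fetch and compare; raise if the write did not land."""
    conn.execute("UPDATE children SET child_gender = ? WHERE id = ?", (value, row_id))
    conn.commit()
    row = conn.execute(
        "SELECT child_gender FROM children WHERE id = ?", (row_id,)
    ).fetchone()
    if row is None or row[0] != value:
        raise AssertionError(f"round-trip failed for id={row_id}: got {row!r}")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE children ("
    "id INTEGER PRIMARY KEY, "
    "child_gender TEXT CHECK (child_gender IN ('niña', 'niño')))"
)
conn.execute("INSERT INTO children (id) VALUES (1)")
conn.commit()

write_and_verify(conn, 1, "niña")  # value matches the CHECK list, re-fetch agrees
```

Calling `write_and_verify(conn, 999, "niño")` executes without a database error because the UPDATE matches no rows; only the re-fetch turns that into a failure.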
Pattern 05

Session insights — the self-improving loop

The system gets better without you manually updating it. Session insights extract patterns from what went wrong, propose rule changes, and over time convert advisory rules into enforced hooks.

How the loop works

Bug / mistake happens → Captured in lessons.md → Session insights surfaces pattern → Rule added to CLAUDE.md → Hook added to settings.json → Pattern never recurs

End-of-session debrief (daily)

Run at the end of every Claude Code session. Output appended to tasks/lessons.md.

Daily debrief prompt
Read the git diff from this session and tasks/lessons.md. Answer:

1. What was built? One sentence.
2. What mistakes were made and corrected during this session?
3. Which of these mistakes could have been caught by a hook or a test?
4. What is the single most important rule to add or update in CLAUDE.md?
5. Is that rule a hook candidate? If yes, what is the trigger and the check?
6. Were any tasks deferred? Write them to tasks/todo.md now; deferred items mentioned only in chat are lost on context reset.

Weekly review

Every Friday. Finds patterns across the week and identifies which rules have failed enough to become hooks.

Weekly review prompt
Read tasks/lessons.md and git log --oneline --since="7 days ago".

1. Which lessons appeared more than once this week? These are pattern-class bugs.
2. Which rules in CLAUDE.md were violated despite being written down? Each violation = hook candidate. List them.
3. Which lessons are still OPEN (not yet a rule, not yet a hook)? Prioritize by frequency.
4. What is the one rule-to-hook conversion with highest leverage this week? Give the exact trigger type (PreToolUse / PostToolUse / Bash pattern) and the shell check it would run.
The compounding effect

After one month of daily debriefs, lessons.md is the most valuable file in the project. It contains the exact failure modes of your specific codebase — not generic best practices. The weekly review converts that into a tightening enforcement layer. The system starts telling you what to automate next.
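The "appeared more than once" question from the weekly review can also be pre-computed mechanically. The sketch below assumes a lessons.md entry format that the source does not specify: one line per lesson, shaped like `- [OPEN] tag: description`. Both the format and the function name are hypothetical; adapt the regex to however your entries are actually written.

```python
import re
from collections import Counter

# Hypothetical entry format: "- [OPEN] rls-mismatch: UPDATE hit zero rows"
ENTRY_RE = re.compile(r"^- \[(OPEN|CLOSED)\] ([\w-]+):")


def repeated_lessons(text: str, min_count: int = 2) -> list[tuple[str, int]]:
    """Return (tag, count) pairs for tags seen at least min_count times."""
    counts = Counter(
        m.group(2) for line in text.splitlines() if (m := ENTRY_RE.match(line))
    )
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]
```

Feeding this list into the weekly review prompt gives the review a frequency signal to start from instead of asking it to count by reading.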

Pattern 06

Generator-Evaluator: the next step

The Self-Review Gate is one generator pass, one evaluator pass. The full pattern adds iteration: if the evaluator scores below threshold, the generator reruns with the eval feedback. No human in the loop until convergence.

Current state vs full pattern

Aspect | Current (Self-Review Gate) | Full Generator-Evaluator
Generator | Claude writes code once | Claude writes code, evaluator scores it, Claude regenerates with feedback if below threshold
Evaluator | 5 parallel agents, one pass | Same 5 agents, but output feeds back into generator until score clears
Iteration | None (one round) | N rounds with budget cap (e.g. max 3 attempts)
Human touch | P2/P3 list surfaced every time | Human only sees output that cleared the eval threshold
Prerequisite | None | Eval needs a scoring signal: /audit-infra or a quality rubric per output type
Why build this in order

The generator-evaluator loop is the highest-leverage upgrade to the build workflow. But it requires the evaluator to have a reliable scoring signal — without that, iteration just compounds the wrong direction. Ship the Self-Review Gate first, use it long enough to calibrate what a P1 looks like, then add the loop.
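The control flow of the loop is small enough to sketch. This Python version assumes the generator and evaluator are plain callables and that the evaluator returns a numeric score plus feedback text; the function name and signature are hypothetical, not an existing API.

```python
from typing import Callable, Optional


def generate_until_pass(
    generate: Callable[[Optional[str]], str],
    evaluate: Callable[[str], tuple[float, str]],
    threshold: float,
    max_attempts: int = 3,
) -> tuple[Optional[str], list[float]]:
    """Regenerate with evaluator feedback until the score clears the threshold.

    Returns (output, scores); output is None if the attempt budget ran out.
    """
    feedback: Optional[str] = None
    scores: list[float] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        score, feedback = evaluate(candidate)
        scores.append(score)
        if score >= threshold:
            return candidate, scores
    return None, scores  # budget exhausted: escalate to a human instead of looping
```

Returning the score history alongside the output makes a badly calibrated evaluator visible: scores that do not rise across attempts suggest the feedback is not steering the generator.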

Pattern 07

Collaboration patterns — governing the quality of direction

The test and hook patterns catch bad code. These four patterns prevent code from being written in the wrong direction in the first place. They operate at the collaboration layer, not the output layer.

Tiered Autonomy — calibrate ceremony to blast radius

Classify every task before starting. Different gates fire at different tiers. Without this, Claude applies the same ceremony to a CSS tweak as to a payment flow change.

Tier | Blast radius | Gates that fire | Examples
T1 Autopilot | Reversible, single file, no external effects | Zero human touchpoints. Fully autonomous. | Bug fixes, CSS tweaks, copy edits, lint fixes
T2 Supervised | Multi-file, internal only | Approve the plan, then autonomous. Self-Review Gate fires. | New API endpoint, DB migration, internal component
T3 Gated | External effects: users, money | Design preview + plan approval + post-build review. All three. | New feature with UI, payment logic, auth changes
T4 Ceremony | New product or system, high uncertainty | Full gate sequence. Architecture doc. Cross-team alignment. | Greenfield project, major architecture change
How to use it

State the tier at the start of every task: "T2 (new API endpoint, internal only). Plan: [one sentence]. Starting." The user can override: "this is T1, just fix it." T1 means T1 — no design previews, no elegance pauses, no CEO review. Fix it, verify it works, move on.
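The tier table can be read as a decision procedure that checks the riskiest condition first. A Python sketch, assuming a task can be described by four attributes; the `Task` fields and the `classify` function are hypothetical simplifications of the judgment described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    files_touched: int
    reversible: bool
    external_effects: bool  # users, money, auth
    new_system: bool        # greenfield or major architecture change


GATES = {
    "T1": [],
    "T2": ["plan approval", "self-review gate"],
    "T3": ["design preview", "plan approval", "post-build review"],
    "T4": ["full gate sequence", "architecture doc", "cross-team alignment"],
}


def classify(task: Task) -> str:
    """Pick the highest tier whose condition matches, checking riskiest first."""
    if task.new_system:
        return "T4"
    if task.external_effects:
        return "T3"
    if task.files_touched > 1 or not task.reversible:
        return "T2"
    return "T1"
```

A user override ("this is T1, just fix it") would simply replace the return value of `classify`; the sketch only covers the default classification.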

ISC Rule — define success before writing a line

Before any T2+ build, write 3 verifiable success criteria. The failure mode without this: success is debated after the diff exists instead of before it.

ISC format — required before any T2+ build
ISC (Iterative Success Criteria) before build:

1. [Functional criterion], e.g. "User can complete checkout flow end-to-end"
2. [State criterion], e.g. "Order row written to DB with correct status + timestamps"
3. [API round-trip criterion], REQUIRED: real request to the endpoint, verified response shape

"It works" is not a criterion. Each ISC must be independently verifiable without asking the builder whether it passed.
Why criterion 3 must always be an API round-trip

Silent failures (RLS mismatch, wrong WHERE clause, PostgREST returning null with no error) pass visual inspection and unit tests. Only a real request through the same client surface catches them. This is the single most common gap between "works in development" and "works in production."

PRE-BUILD PARAPHRASE — one sentence before the diff

Before writing any multi-file or new-component code, Claude states in one sentence what it's about to build, framed so you can challenge it. Catches direction errors before they're baked into code.

The paraphrase format (T2+)
"I'll add a nullable dismissed_at column to the campaigns table and surface it in the dashboard list view as a grey 'Dismissed' badge."

- One sentence. Specific enough to challenge.
- If you can't challenge it, it's not specific enough.
- Skip for T1 (single-file fix, refactor, copy edit).
- If the user says "wait, that's not right", the redirect costs 5 seconds. If the paraphrase is skipped, the redirect costs undoing a diff.

GROUNDED PROPOSAL RULE — no options without reading first

Before presenting "Option A vs B" or any architecture recommendation, Claude must have read the relevant code in this session. The proposal ends with a citation line. If the citation can't be written, the proposal can't be made yet.

Grounded proposal format
Recommendation: [specific approach]
Reasoning: [2-3 sentences from what was read]

Verified by reading:
  src/api/routes/orders.py:142 (existing session lookup logic)
  src/models/order.py:87 (Order model constraints)
  migrations/0042_order_status.sql:12 (status enum values)

- If this citation block can't be written, stop and read first.
- A clarifying question from you ("what does X mean?") is a request to explain, not a request to revise the recommendation.
The failure mode this prevents

Claude proposes Option A. You ask a clarifying question. Claude flips to Option B. You ask another. Claude flips back to A, rephrased. Three different recommendations on the same decision in ten minutes — not because the problem changed, but because the original proposal was built from inference, not from reading the code. The citation line is the gate: no reads, no proposal.

SINGLE-ENTRY-POINT GATE — guard the choke point, not every path

When N coupled paths share a single entry point, place the guard at that entry point — not scattered across each path. Grep the call graph first to find the choke point.

Real example

A campaign-pause bug affected 4 media paths + text paths + continuation paths. One gate at dispatchInboundMessage fixed all of them in ~70 lines. Scattering checks across each path would have created N new places to forget the check on the next feature. Pattern: grep the call graph, find the single entry point, guard there once.
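A Python rendering of the same idea, as a sketch only: the handler names, message fields, and the `PAUSED_CAMPAIGNS` set are hypothetical, and `dispatch_inbound_message` is a snake_case stand-in for the `dispatchInboundMessage` entry point named above.

```python
from typing import Callable

PAUSED_CAMPAIGNS: set[str] = {"camp-42"}  # hypothetical paused campaign ids


def handle_text(msg: dict) -> str:
    return "text handled"


def handle_media(msg: dict) -> str:
    return "media handled"


HANDLERS: dict[str, Callable[[dict], str]] = {
    "text": handle_text,
    "image": handle_media,
    "audio": handle_media,
}


def dispatch_inbound_message(msg: dict) -> str:
    # The one guard: every path below inherits it, including handlers added later.
    if msg["campaign_id"] in PAUSED_CAMPAIGNS:
        return "skipped: campaign paused"
    return HANDLERS[msg["kind"]](msg)
```

None of the handlers know about pausing, which is the point: a new message kind registered in `HANDLERS` is covered without anyone remembering to add a check.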

Pattern 08

Parallel agents — how to actually do it

Three distinct mechanisms with different scope, cost, and use case. Picking the wrong one is the most common mistake. The mandatory rule before any spawn: create a file-ownership map.

The three mechanisms

Mechanism | What it is | Best for | How to invoke
Subagents | Lightweight, focused reads. One task per agent. Report back to main. | Codebase scans, research, checking references. Keeps main context clean. | Agent tool (subagent_type=Explore) or Task tool with run_in_background
Agent Teams | Multiple Claude Code instances coordinating on the same problem from different angles. | Review/challenge patterns: each agent finds what others miss. Self-Review Gate. | CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 then Task tool spawning N background tasks
Headless CLI | Claude Code running programmatically from shell, cron, or CI, with no interactive session. | Scheduled intelligent work (not just bash scripts). Crons that need Claude analysis. | claude -p "prompt" --bare --allowedTools "Read,Bash" --output-format json
Rule of thumb

Agents reading independently and returning summaries → subagents. Agents challenging each other's findings on the same codebase → agent teams. Scheduled work that needs Claude reasoning at fire time → headless CLI. Don't use agent teams for work with a clear dependency chain — parallel spawns on dependent tasks produce conflicts, not speed.

Mandatory pre-spawn: the file-ownership map

Before spawning ANY parallel agents that write code, create this map. If two agents share a file, either reassign or sequence them. Present the map before spawning.

File-ownership map format
Agent 1 (DB fixes):  migrations/0043_add_index.sql, models/order.py
Agent 2 (API fixes): api/routes/orders.py, api/services/order_service.py
Agent 3 (Tests):     tests/test_orders.py, tests/fixtures/orders.py

Shared files: None ✓
Conflict check: CLEAR, spawn approved
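The conflict check on a map like the one above is mechanical and can be sketched in a few lines of Python. The function name `find_conflicts` is hypothetical; the agent labels and paths are the ones from the example map.

```python
from collections import defaultdict


def find_conflicts(ownership: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each file claimed by more than one agent to the agents claiming it."""
    claimed_by: dict[str, list[str]] = defaultdict(list)
    for agent, files in ownership.items():
        for path in set(files):  # ignore a file listed twice by the same agent
            claimed_by[path].append(agent)
    return {path: agents for path, agents in claimed_by.items() if len(agents) > 1}


ownership = {
    "Agent 1 (DB fixes)": ["migrations/0043_add_index.sql", "models/order.py"],
    "Agent 2 (API fixes)": ["api/routes/orders.py", "api/services/order_service.py"],
    "Agent 3 (Tests)": ["tests/test_orders.py", "tests/fixtures/orders.py"],
}
conflicts = find_conflicts(ownership)
print("CLEAR" if not conflicts else f"CONFLICT: {conflicts}")
```

A non-empty result names the file and the agents sharing it, which is the information needed to either reassign the file or sequence those agents.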
Task agents READ. Main agent WRITES.

Subagents and task agents are for research, analysis, and proposals only. They read files, identify root causes, and propose fixes with exact file paths and line numbers — but they do NOT edit files. The main agent applies all changes sequentially. This prevents permission errors, file conflicts, and plan-mode loops where subagents block waiting for approval.

The four parallel patterns

Pattern A — Brainstorm (3 subagents, read-only)
Subagent 1 (Pattern Scanner): grep for existing patterns, utilities, components in src/
  → returns: reusable patterns + file paths
Subagent 2 (API Scanner): scan api/ routes and services for reusable endpoints
  → returns: available endpoints + data shapes
Subagent 3 (Spec Reader): read CLAUDE.md, PRODUCT_SPECS.md, ROADMAP.md for constraints
  → returns: constraints, dependencies, risks

Orchestrator merges → 2-3 approaches with tradeoffs → present to user
Pattern B — Self-Review Gate (5 agent teams, read-only + type check)
CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1

Agent 1 (DB): migrations, RLS policies, org_id filtering, constraints
Agent 2 (Security): auth on every route, injection, exposed secrets, OWASP
Agent 3 (UI): broken refs, dangling imports, loading/error states
Agent 4 (Types): run mypy/tsc, check for any types, interface coverage
Agent 5 (Parity): verify every UI action has an API backing it

Each returns structured findings: P1 / P2 / P3
P1 → main agent auto-fixes, re-runs that agent only
P2/P3 → surfaced to human as a list (not the diff)
Pattern C — Build (sequential, one agent, dependency order)
# Building is NOT parallelizable: strict dependency order required
Step 1: Database (migrations → user runs manually)
Step 2: API (routes + services) → mypy/tsc check
Step 3: Logic (processing, validation, workflows) → mypy/tsc check
Step 4: UI (components, pages) → mypy/tsc check → full build

Rule: 3 consecutive type-check failures on the same step → STOP.
Surface the error. Do not loop or accumulate hacks.
Pattern D — Multi-bug fix (N agents, isolated git worktrees)
# Each bug gets an isolated worktree: no shared state, no conflicts
git worktree add ../project-fix-1 -b fix/bug-1
git worktree add ../project-fix-2 -b fix/bug-2
git worktree add ../project-fix-3 -b fix/bug-3

# Claude Code instance per worktree (separate tmux panes)
# Each runs independently: fix → type check → PR
# Merge sequentially via PR after all pass

# Clean up each worktree after its branch is merged
git worktree remove ../project-fix-1
git worktree remove ../project-fix-2
git worktree remove ../project-fix-3

Headless CLI — Claude in cron and CI

Run Claude programmatically from any shell. The same agent loop as interactive mode, scriptable. The gap this closes: scheduled work that needs reasoning at fire time, not just bash execution.

claude -p key flags
# Minimal locked-down cron invocation
#   --bare                       skip CLAUDE.md, hooks, skills (deterministic CI)
#   --allowedTools "Read,Bash"   minimum required tools only
#   --permission-mode dontAsk    no interactive prompts
#   --output-format json         structured output, no parsing fragility
# (comments sit above the command: a trailing comment after "\" breaks the line continuation)
claude -p "$(cat prompt.txt)" \
  --bare \
  --allowedTools "Read,Bash" \
  --permission-mode dontAsk \
  --output-format json

# With structured schema (replaces jq/awk chains)
claude -p "analyze this output and return findings" \
  --bare \
  --output-format json \
  --json-schema '{"issues": [{"severity": "P1|P2|P3", "file": "str", "description": "str"}]}'

# Multi-turn (generator-evaluator loop): take session_id from the first call's JSON output
SESSION_ID=$(claude -p "first pass" --bare --output-format json | jq -r '.session_id')
claude -p "evaluate and improve" --resume "$SESSION_ID" --bare

# Cost tracking (log this, alert on weekly total)
# Response JSON includes: total_cost_usd per call
Cost discipline for headless Claude

Every cron firing at K tokens/call compounds. Log total_cost_usd from every response to a daily ledger. Alert when weekly total exceeds your threshold. Don't bulk-convert all crons at once — adopt the primitive with one proof-of-concept, measure cost, then expand. --allowedTools should always be the minimum set, never blanket Bash.
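A sketch of the daily ledger and weekly alert in Python. It assumes each headless call's JSON response has been captured as a string and contains the `total_cost_usd` field mentioned above; the in-memory dict stands in for whatever file or table you persist to, and the function names are hypothetical.

```python
import json
from datetime import date, timedelta


def record_cost(ledger: dict[str, float], response_json: str, day: date) -> None:
    """Add one call's total_cost_usd to that day's running total."""
    cost = float(json.loads(response_json)["total_cost_usd"])
    key = day.isoformat()
    ledger[key] = ledger.get(key, 0.0) + cost


def weekly_total(ledger: dict[str, float], today: date) -> float:
    """Sum the last 7 days, today included."""
    days = {(today - timedelta(days=i)).isoformat() for i in range(7)}
    return sum(cost for key, cost in ledger.items() if key in days)


def over_budget(ledger: dict[str, float], today: date, weekly_limit_usd: float) -> bool:
    return weekly_total(ledger, today) > weekly_limit_usd
```

Running `over_budget` at the end of each cron firing keeps the alert next to the spend, so a runaway job is noticed within one cycle instead of at the end of the week.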

When NOT to parallelize

Situation | Mode | Why
Building a feature (DB → API → UI) | Sequential only | Hard dependencies. API can't be built before DB schema exists.
Tasks where agents share files | Sequential or reassign | File conflicts. Two agents editing the same file produces merge chaos.
Planning and task breakdown | Sequential only | Each step informs the next. Can't plan in parallel.
Deploy pipeline | Sequential only | Hard pipeline dependencies. Tests must pass before push.
Independent bug fixes | Parallel via worktrees | Isolated branches, no shared state.
Code review (same codebase, different dimensions) | Parallel agent teams | Each agent finds what others miss. Independent reads.
Codebase research before planning | Parallel subagents | Independent reads. Keeps main context clean for decisions.

BUILD-USE-TRUST-ORCHESTRATE — the governance rule

Before plugging any project folder into parallel/automated multi-agent work, it must pass through four sequential phases. Skipping phases produces agents opening PRs for already-fixed bugs and an evaluation bottleneck on the human — because AI agents have no speed limit but the person managing them does.

Phase 1 — Build

Has the foundation

CLAUDE.md, scoped skills, docs/runbooks, accumulated context. The folder has opinions.

Phase 2 — Use

Worked manually for weeks

Real issues found and fixed. You know where it's fragile. No surprises from the folder.

Phase 3 — Trust

Output can be skimmed

You no longer fully review every line. A scan is enough to catch problems. Skim-trust threshold reached.

Phase 4 — Orchestrate

Only now: automate

Plug into parallel dispatch, multi-agent coordination, or unattended cron work. Not before.


What to build, in order

Each step unlocks the next. Don't skip.

Step | What | Why now
1 | Tiered Autonomy + PRE-BUILD PARAPHRASE + ISC | Habits, not builds. Governs everything else. Wrong tier = wrong gates. Wrong paraphrase = wrong diff.
2 | End-of-session debrief + lessons.md | Seeds every automated system downstream. No lessons.md = nothing to compile into rules or hooks.
3 | First /test-[product] skill for your main product | Makes testing repeatable and agentically runnable. Base for the regression suite.
4 | Pre-push hook: type check + open lessons block | First enforcement gate. Converts two common discipline failures into automatic blocks.
5 | Self-Review Gate (Pattern B, 5 parallel agent teams) | Shifts you from reviewer of code to reviewer of a prioritized issue list. Requires CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
6 | Brainstorm subagents before any significant feature (Pattern A) | 3 parallel reads before writing a line. Catches reuse opportunities and constraints before the diff exists.
7 | Worktree isolation for multi-bug sessions (Pattern D) | Independent branches, no conflicts. Parallel bug fixes that would otherwise be sequential.
8 | Weekly review cron (Friday, automated) | System starts identifying its own improvement candidates without you asking.
9 | One headless claude -p --bare proof-of-concept | Closes the gap between scheduled bash scripts and scheduled intelligent work. Start with one cron, measure cost, expand.
10 | Generator-Evaluator loop on build workflow | Closes the last producer-to-curator gap. Build after step 5 is well-calibrated. Requires claude -p --resume.
10 Generator-Evaluator loop on build workflow Closes the last producer-to-curator gap. Build after step 5 is well-calibrated. Requires claude -p --resume.