Agentic Development Patterns

Pattern 01

Product-specific test skills

Instead of running tests ad hoc, encode the full test suite for each product as a reusable skill. The skill owns the test sequence, the assertions, and the pass/fail report.

The pattern

Each product gets a /test-[product] skill. It runs autonomously, produces a PASS/FAIL report with evidence, and blocks deploy if incomplete.

SKILL.md — /test-sophie (structure)

## Steps

1. Run API smoke test
   curl /api/health → expect 200
   curl /api/test-buffer → send test payload, verify response shape

2. WhatsApp pipeline E2E
   Trigger inbound message via WAHA test endpoint
   Assert: intent classified, reply dispatched within 3s
   Assert: send_at timestamp written to DB (dedup guard)

3. Regression suite
   For every entry in tasks/lessons.md marked OPEN:
     Run the specific repro case
     Assert it no longer triggers the failure

4. First-user path
   Create fresh org (no pre-configuration)
   Complete full onboarding flow
   Assert: all expected rows created in DB

5. Report
   PASS: all assertions green, suite completed fully
   FAIL: list which assertions failed + which tests did not run
   INCOMPLETE: suite terminated early — flag explicitly, do not report as partial pass

INCOMPLETE TEST RULE

A suite that crashes or disconnects mid-run is INCOMPLETE, not "8/8 before crash." Report which tests did not run. Never claim a result from a partial run. This is the most violated rule in agentic testing.
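The three verdicts can be sketched as a small function. This is a minimal illustration, assuming the runner knows every planned test name up front and records a boolean only for tests that actually ran; `classify_run` and its arguments are hypothetical names, not part of any skill format.

```python
from typing import Dict, List, Tuple


def classify_run(planned: List[str], results: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return (verdict, details) for a test run.

    planned: every test the suite was supposed to execute.
    results: test name -> passed?, only for tests that actually ran.
    """
    not_run = [name for name in planned if name not in results]
    if not_run:
        # A partial run is never reported as a pass, whatever ran before the crash.
        return "INCOMPLETE", not_run
    failed = [name for name in planned if not results[name]]
    if failed:
        return "FAIL", failed
    return "PASS", []


planned = ["api_smoke", "whatsapp_e2e", "regression", "first_user"]
# The suite died after two tests: the verdict names the tests that never ran.
print(classify_run(planned, {"api_smoke": True, "whatsapp_e2e": True}))
```

Checking for missing tests before checking for failures is the design choice that matters: a crashed suite can never be summarized by the tests that happened to finish.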

Test tiers — what to build and when

Tier	What it covers	When it runs	Status
API smoke	Every endpoint returns expected shape with real-shape payload	After every deploy, agentically	Build first
Regression	Every known bug from lessons.md — repro case, then assert fixed	After every fix, before merge	Build first
First-user path	Fresh account, zero pre-configuration, complete core flow	Before any external share	Build first
DB round-trip	Constrained columns tested in live DB, not mocks	Any PR touching CHECK / RLS / enum columns	Add after
Real E2E	Human-in-the-loop through production (not staging)	Pre-launch and post major change	Add after

Pattern 02

Self-Review Gate — parallel agents as reviewers

After code is written, before it reaches you, 5 parallel agents review it against specific dimensions. P1 issues are auto-fixed. P2/P3 are surfaced as a list.

How it works

Code written → 5 agents spawn → P1 auto-fixed → P2/P3 listed → You review the list, not the diff

Agent 1

DB + schema

CHECK constraints, FK targets, enum values, RLS USING/WITH CHECK match the values the code writes. Round-trip test required for any constrained column.

Agent 2

Security

SQL injection, XSS, exposed secrets, OWASP Top 10. Flags any `any` types introduced. No auth bypasses.

Agent 3

Consistency

New code matches existing patterns (naming, error handling, response shapes). No dangling imports. No duplicate logic.

Agent 4

Type safety

tsc / mypy passes. No implicit any. Return types declared. New union members handled at every callsite.

Agent 5 — Dispatcher completeness

New return values handled everywhere

When a classifier or fast-path function gets a new return value (new enum case, new union member), grep every callsite that consumes it and verify the new case is handled. The failure mode: add 'jd_qa' to a return type, but the parent dispatcher has no branch for it — silently falls through to default. A missed case here requires a hotfix PR.
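One way to make the missing branch visible is to compare the members of the return type against the dispatcher's handlers. The sketch below is an illustration in Python, assuming the classifier's return type is a `Literal` and the dispatcher is a dict of handlers; the intent names other than 'jd_qa' and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Literal, get_args

# Hypothetical intent type; 'jd_qa' is the newly added member.
Intent = Literal["greeting", "faq", "jd_qa"]


def handle_greeting(text: str) -> str:
    return "hello"


def handle_faq(text: str) -> str:
    return "faq answer"


# The parent dispatcher's branches. Note that 'jd_qa' has no entry yet.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "greeting": handle_greeting,
    "faq": handle_faq,
}


def unhandled_intents(handlers: Dict[str, Callable[[str], str]]) -> List[str]:
    """List every member of Intent that the dispatcher has no branch for."""
    return [intent for intent in get_args(Intent) if intent not in handlers]


def dispatch(intent: str, text: str) -> str:
    """Fail loudly on an unknown intent instead of falling through to a default."""
    if intent not in HANDLERS:
        raise ValueError(f"no handler for intent {intent!r}")
    return HANDLERS[intent](text)


print(unhandled_intents(HANDLERS))
```

Run as a test, `unhandled_intents` turns the silent fall-through into a failing assertion at review time rather than a hotfix later.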

The shift this enables

You stop reviewing code line by line and start reviewing a prioritized issue list. The diff is Claude's job. Judgment calls on P2/P3 are yours. This is the producer-to-curator shift — and it only works if the review agents are actually spawned, not skipped under time pressure.

Pattern 03

Hooks: enforcement over discipline

Text rules in CLAUDE.md are advisory. Hooks are runtime enforcement. The gap between "Claude should do X" and "Claude cannot do the wrong thing" is a hook.

The principle

When the proposed fix is "remember to X" or "be more careful" — that's the wrong fix. Every discipline-based rule is a hook waiting to be written.

settings.json — hook structure

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "~/.claude/hooks/pre-fire-blast-radius.sh"
        }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit",
        "hooks": [{
          "type": "command",
          "command": "~/.claude/hooks/tsc-check.sh"
        }]
      }
    ]
  }
}
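Each command entry above points at a script. As a minimal sketch of the decision logic such a script might contain, here is a check that blocks commits on main. It assumes the hook receives the pending tool call as JSON with the shell command under tool_input.command, and that exit code 2 blocks the call; verify both against your Claude Code version. `decide` is a hypothetical helper name.

```python
from typing import Any, Dict

PROTECTED_BRANCHES = ("main", "master")


def decide(payload: Dict[str, Any], branch: str) -> int:
    """Return the hook's exit code: 0 allows the tool call, 2 blocks it.

    payload: the parsed hook input (assumed shape, see the note above).
    branch:  the currently checked-out git branch.
    """
    command = payload.get("tool_input", {}).get("command", "")
    if "git commit" in command and branch in PROTECTED_BRANCHES:
        return 2
    return 0


# Example payload, written by hand for illustration.
payload = {"tool_name": "Bash", "tool_input": {"command": "git commit -m 'wip'"}}
print(decide(payload, branch="main"))
print(decide(payload, branch="fix/bug-1"))
```

A real hook script would parse the payload from stdin, read the branch from git, print a reason to stderr, and exit with the returned code. Keeping the decision in a pure function makes the rule itself testable.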

High-leverage hook patterns

Rule as text	Rule as hook	Trigger
Never push type errors	Run tsc --noEmit on PostToolUse(Edit) — block push if non-zero	Any file edit
No direct commits to main	PreToolUse(Bash) matching `git commit` — check branch, block if main	git commit
Count affected rows before bulk ops	Pre-fire blast radius: run SELECT COUNT before any UPDATE/DELETE without WHERE	Bash with SQL pattern
Regression check before deploy	Pre-push hook greps lessons.md for OPEN entries — block if found	git push
No incomplete test reports	PostToolUse on test commands — parse output, flag if suite terminated early	pytest / vitest
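The detection half of the blast-radius row can be sketched with a regular expression. This is a heuristic under stated assumptions, not a SQL parser: it only recognizes `UPDATE <table> SET` and `DELETE FROM`, and a WHERE inside a string literal or subquery would fool it. `unguarded_bulk_statement` is a hypothetical name.

```python
import re
from typing import Optional

# Matches an UPDATE ... SET or DELETE FROM statement up to the next semicolon.
_BULK = re.compile(r"\b(?:UPDATE\s+\S+\s+SET|DELETE\s+FROM)\b[^;]*", re.IGNORECASE)
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def unguarded_bulk_statement(command: str) -> Optional[str]:
    """Return the first UPDATE/DELETE statement with no WHERE clause, else None."""
    for match in _BULK.finditer(command):
        statement = match.group(0)
        if not _WHERE.search(statement):
            return statement.strip()
    return None


print(unguarded_bulk_statement('psql -c "DELETE FROM orders;"'))
print(unguarded_bulk_statement("UPDATE orders SET status = 'x' WHERE id = 1"))
```

A hook built on this would run the matching SELECT COUNT, or simply block, whenever the function returns a statement.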

The real gap: rules without hooks are bets on discipline

Every rule that gets violated twice has proven it needs a hook. The signal: if you've caught Claude doing the wrong thing more than once despite the rule being in CLAUDE.md, the rule needs to be in settings.json instead.

Pattern 04

DB round-trip testing

The failure mode that bites silently: code writes a value, the DB silently rejects it due to a constraint mismatch, the ORM reports success. You discover it in production when the column is always NULL.

The exact failure

A child_gender column had a CHECK accepting 'niña'/'niño' (with ñ). Every code path wrote ASCII 'nina'/'nino'. 100% of writes failed silently for 24 hours. PostgREST returned {data: null, error: null} for every UPDATE. The feature never worked in production. Found by hand, not by tests.

The rule

Any PR touching a column with CHECK, unique constraint, FK, custom enum, or RLS policy must include a round-trip test: write to the live DB from the same client surface (same session context, same RLS role), re-fetch the row, assert the value was written correctly.

Round-trip test checklist — any constrained column

1. Read the constraint:
   CHECK list, FK target, enum values, RLS USING/WITH CHECK clauses.

2. Confirm the value the code writes matches what the constraint accepts:
   - Case: 'Niña' vs 'niña'
   - Encoding: accented 'niña' vs ASCII 'nina'
   - Type: string vs integer
   - Length: varchar(50) vs longer

3. Write from the same client surface:
   Same browser session, same function context, same RLS role.
   Not from psql as superuser. Not from mocked client.

4. Re-fetch the row:
   Assert the column holds the expected value, not NULL.
   A success toast, or the absence of a client-side error, is NOT proof of success.

5. PostgREST gotcha:
   UPDATE that touches zero rows (RLS mismatch, wrong eq filter) returns
   {data: null, error: null} — client treats as success.
   Only a re-fetch proves the write landed.
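Steps 3 to 5 can be sketched as a write followed by a re-fetch. The example uses an in-memory SQLite database purely so it runs anywhere; a real round-trip test would go through the live DB and the same client surface. The `children` table and `update_and_verify` are hypothetical. Note that SQLite raises an error on a CHECK violation, so the silent case shown here is the zero-row UPDATE.

```python
import sqlite3


def update_and_verify(conn: sqlite3.Connection, row_id: int, value: str) -> bool:
    """Write child_gender, then re-fetch the row and compare.

    Returns True only if the stored value equals what was written.
    """
    conn.execute("UPDATE children SET child_gender = ? WHERE id = ?", (value, row_id))
    conn.commit()
    row = conn.execute(
        "SELECT child_gender FROM children WHERE id = ?", (row_id,)
    ).fetchone()
    return row is not None and row[0] == value


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE children ("
    " id INTEGER PRIMARY KEY,"
    " child_gender TEXT CHECK (child_gender IN ('niña', 'niño'))"
    ")"
)
conn.execute("INSERT INTO children (id) VALUES (1)")

print(update_and_verify(conn, 1, "niña"))   # the write landed
print(update_and_verify(conn, 99, "niña"))  # zero rows touched, yet no error was raised
```

The second call is the shape of the PostgREST gotcha: the UPDATE itself reports nothing wrong, and only the re-fetch shows that no value was stored.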

Pattern 05

Session insights — the self-improving loop

The system gets better without you manually updating it. Session insights extract patterns from what went wrong, propose rule changes, and over time convert advisory rules into enforced hooks.

How the loop works

Bug / mistake happens → Captured in lessons.md → Session insights surfaces pattern → Rule added to CLAUDE.md → Hook added to settings.json → Pattern never recurs

End-of-session debrief (daily)

Run at the end of every Claude Code session. Output appended to tasks/lessons.md.

Daily debrief prompt

Read the git diff from this session and tasks/lessons.md.

Answer:
1. What was built? One sentence.
2. What mistakes were made and corrected during this session?
3. Which of these mistakes could have been caught by a hook or a test?
4. What is the single most important rule to add or update in CLAUDE.md?
5. Is that rule a hook candidate? If yes, what is the trigger and the check?
6. Were any tasks deferred? Write them to tasks/todo.md now — deferred items
   mentioned only in chat are lost on context reset.

Weekly review

Every Friday. Finds patterns across the week and identifies which rules have failed enough to become hooks.

Weekly review prompt

Read tasks/lessons.md and git log --oneline --since="7 days ago".

1. Which lessons appeared more than once this week? These are pattern-class bugs.
2. Which rules in CLAUDE.md were violated despite being written down?
   Each violation = hook candidate. List them.
3. Which lessons are still OPEN (not yet a rule, not yet a hook)?
   Prioritize by frequency.
4. What is the one rule-to-hook conversion with highest leverage this week?
   Give the exact trigger type (PreToolUse / PostToolUse / Bash pattern)
   and the shell check it would run.
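Question 1 of the weekly review is a counting problem, so part of it can be done mechanically before Claude reads the file. The sketch below assumes a lessons.md entry format of "- [OPEN] tag: description", one per line. That format is an assumption made for this example; adapt the pattern to whatever your file actually uses.

```python
import re
from collections import Counter
from typing import Dict

# Assumed entry format, one per line: "- [OPEN] tag: free-text description"
_ENTRY = re.compile(r"^- \[(OPEN|CLOSED)\] ([\w-]+):")


def recurring_lessons(text: str, min_count: int = 2) -> Dict[str, int]:
    """Count entries per tag and keep tags seen at least min_count times."""
    counts: Counter = Counter()
    for line in text.splitlines():
        match = _ENTRY.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return {tag: n for tag, n in counts.items() if n >= min_count}


# Hand-written sample entries, for illustration only.
sample = """\
- [OPEN] incomplete-report: claimed pass after suite crashed
- [CLOSED] constraint-mismatch: wrote ASCII value into accented CHECK list
- [OPEN] incomplete-report: reported partial run as green
"""
print(recurring_lessons(sample))
```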

The compounding effect

After one month of daily debriefs, lessons.md is the most valuable file in the project. It contains the exact failure modes of your specific codebase — not generic best practices. The weekly review converts that into a tightening enforcement layer. The system starts telling you what to automate next.

Pattern 06

Generator-Evaluator: the next step

The Self-Review Gate is one generator pass, one evaluator pass. The full pattern adds iteration: if the evaluator scores below threshold, the generator reruns with the eval feedback. No human in the loop until convergence.

Current state vs full pattern

	Current (Self-Review Gate)	Full Generator-Evaluator
Generator	Claude writes code once	Claude writes code, evaluator scores it, Claude regenerates with feedback if below threshold
Evaluator	5 parallel agents, one pass	Same 5 agents, but output feeds back into generator until score clears
Iteration	None — one round	N rounds with budget cap (e.g. max 3 attempts)
Human touch	P2/P3 list surfaced every time	Human only sees output that cleared the eval threshold
Prerequisite	None	Eval needs a scoring signal — /audit-infra or a quality rubric per output type

Why build this in order

The generator-evaluator loop is the highest-leverage upgrade to the build workflow. But it requires the evaluator to have a reliable scoring signal — without that, iteration just compounds the wrong direction. Ship the Self-Review Gate first, use it long enough to calibrate what a P1 looks like, then add the loop.
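The control flow of the full pattern is short. This is a minimal sketch, assuming the generator accepts the previous round's feedback and the evaluator returns a numeric score plus feedback; `generate_until_pass` and the toy stand-ins are hypothetical.

```python
from typing import Callable, Optional, Tuple


def generate_until_pass(
    generate: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Regenerate with evaluator feedback until the score clears the threshold.

    Returns (output, attempts_used). output is None if the budget ran out,
    which is the point where a human should step in.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        score, feedback = evaluate(output)
        if score >= threshold:
            return output, attempt
    return None, max_attempts


# Toy stand-ins: each round of feedback makes the draft one step better.
def toy_generate(feedback: Optional[str]) -> str:
    return "draft" if feedback is None else feedback + "+fix"


def toy_evaluate(output: str) -> Tuple[float, str]:
    return 0.4 + 0.3 * output.count("+fix"), output


print(generate_until_pass(toy_generate, toy_evaluate, threshold=0.9))
```

The budget cap is what keeps an unreliable scoring signal from looping indefinitely: when the evaluator never clears the threshold, the loop returns None rather than the last attempt.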

Pattern 07

Collaboration patterns — governing the quality of direction

The test and hook patterns catch bad code. These four patterns prevent code from being written in the wrong direction in the first place. They operate at the collaboration layer, not the output layer.

Tiered Autonomy — calibrate ceremony to blast radius

Classify every task before starting. Different gates fire at different tiers. Without this, Claude applies the same ceremony to a CSS tweak as to a payment flow change.

Tier	Blast radius	Gates that fire	Examples
T1 Autopilot	Reversible, single file, no external effects	Zero human touchpoints. Fully autonomous.	Bug fixes, CSS tweaks, copy edits, lint fixes
T2 Supervised	Multi-file, internal only	Approve the plan, then autonomous. Self-Review Gate fires.	New API endpoint, DB migration, internal component
T3 Gated	External effects — users, money	Design preview + plan approval + post-build review. All three.	New feature with UI, payment logic, auth changes
T4 Ceremony	New product or system, high uncertainty	Full gate sequence. Architecture doc. Cross-team alignment.	Greenfield project, major architecture change

How to use it

State the tier at the start of every task: "T2 (new API endpoint, internal only). Plan: [one sentence]. Starting." The user can override: "this is T1, just fix it." T1 means T1 — no design previews, no elegance pauses, no CEO review. Fix it, verify it works, move on.
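The tier table can be read as a decision function that checks the widest blast radius first. The sketch below is one possible encoding; the `Task` fields are assumptions chosen to mirror the table's columns, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class Task:
    files_touched: int
    reversible: bool
    external_effects: bool  # reaches users or money
    new_system: bool        # new product or system, high uncertainty


def classify_tier(task: Task) -> str:
    """Map blast radius to a tier, widest radius first."""
    if task.new_system:
        return "T4 Ceremony"
    if task.external_effects:
        return "T3 Gated"
    if task.files_touched > 1 or not task.reversible:
        return "T2 Supervised"
    return "T1 Autopilot"


css_tweak = Task(files_touched=1, reversible=True, external_effects=False, new_system=False)
db_migration = Task(files_touched=1, reversible=False, external_effects=False, new_system=False)
print(classify_tier(css_tweak))
print(classify_tier(db_migration))
```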

ISC Rule — define success before writing a line

Before any T2+ build, write 3 verifiable success criteria. The failure mode without this: success is debated after the diff exists instead of before it.

ISC format — required before any T2+ build

ISC (Iterative Success Criteria) before build:

1. [Functional criterion] — e.g. "User can complete checkout flow end-to-end"
2. [State criterion] — e.g. "Order row written to DB with correct status + timestamps"
3. [API round-trip criterion] — REQUIRED: real request to the endpoint, verified response shape

"It works" is not a criterion. Each ISC must be independently verifiable
without asking the builder whether it passed.

Why criterion 3 must always be an API round-trip

Silent failures (RLS mismatch, wrong WHERE clause, PostgREST returning null with no error) pass visual inspection and unit tests. Only a real request through the same client surface catches them. This is the single most common gap between "works in development" and "works in production."

PRE-BUILD PARAPHRASE — one sentence before the diff

Before writing any multi-file or new-component code, Claude states in one sentence what it's about to build, framed so you can challenge it. Catches direction errors before they're baked into code.

The paraphrase format (T2+)

"I'll add a nullable dismissed_at column to the campaigns table
and surface it in the dashboard list view as a grey 'Dismissed' badge."

— One sentence. Specific enough to challenge.
— If you can't challenge it, it's not specific enough.
— Skip for T1 (single-file fix, refactor, copy edit).
— If the user says "wait, that's not right" — redirect costs 5 seconds.
   If the paraphrase is skipped — redirect costs undoing a diff.

GROUNDED PROPOSAL RULE — no options without reading first

Before presenting "Option A vs B" or any architecture recommendation, Claude must have read the relevant code in this session. The proposal ends with a citation line. If the citation can't be written, the proposal can't be made yet.

Grounded proposal format

Recommendation: [specific approach]

Reasoning: [2-3 sentences from what was read]

Verified by reading:
  src/api/routes/orders.py:142 — existing session lookup logic
  src/models/order.py:87 — Order model constraints
  migrations/0042_order_status.sql:12 — status enum values

— If this citation block can't be written, stop and read first.
— A clarifying question from you ("what does X mean?") is a request
  to explain, not a request to revise the recommendation.

The failure mode this prevents

Claude proposes Option A. You ask a clarifying question. Claude flips to Option B. You ask another. Claude flips back to A, rephrased. Three different recommendations on the same decision in ten minutes — not because the problem changed, but because the original proposal was built from inference, not from reading the code. The citation line is the gate: no reads, no proposal.

SINGLE-ENTRY-POINT GATE — guard the choke point, not every path

When N coupled paths share a single entry point, place the guard at that entry point — not scattered across each path. Grep the call graph first to find the choke point.

Real example

A campaign-pause bug affected 4 media paths + text paths + continuation paths. One gate at dispatchInboundMessage fixed all of them in ~70 lines. Scattering checks across each path would have created N new places to forget the check on the next feature. Pattern: grep the call graph, find the single entry point, guard there once.
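In code, the pattern looks like one guard ahead of the routing table. This is an illustrative Python sketch, not the original implementation: the handler names, the message shape, and the in-memory set of paused campaigns are all hypothetical.

```python
from typing import Callable, Dict, Set

PAUSED_CAMPAIGNS: Set[str] = {"campaign-42"}  # hypothetical paused-campaign store


def handle_text(msg: Dict[str, str]) -> str:
    return "text reply"


def handle_media(msg: Dict[str, str]) -> str:
    return "media reply"


def handle_continuation(msg: Dict[str, str]) -> str:
    return "continuation reply"


ROUTES: Dict[str, Callable[[Dict[str, str]], str]] = {
    "text": handle_text,
    "media": handle_media,
    "continuation": handle_continuation,
}


def dispatch_inbound_message(msg: Dict[str, str]) -> str:
    """Single entry point: the pause guard runs once, before any path is chosen."""
    if msg["campaign_id"] in PAUSED_CAMPAIGNS:
        return "skipped: campaign paused"
    return ROUTES[msg["kind"]](msg)


print(dispatch_inbound_message({"campaign_id": "campaign-42", "kind": "media"}))
print(dispatch_inbound_message({"campaign_id": "campaign-7", "kind": "text"}))
```

A new handler added to `ROUTES` inherits the guard without anyone remembering to add it.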

Pattern 08

Parallel agents — how to actually do it

Three distinct mechanisms with different scope, cost, and use case. Picking the wrong one is the most common mistake. The mandatory rule before any spawn: create a file-ownership map.

The three mechanisms

Mechanism	What it is	Best for	How to invoke
Subagents	Lightweight, focused reads. One task per agent. Report back to main.	Codebase scans, research, checking references. Keeps main context clean.	Agent tool (subagent_type=Explore) or Task tool with run_in_background
Agent Teams	Multiple Claude Code instances coordinating on the same problem from different angles.	Review/challenge patterns — each agent finds what others miss. Self-Review Gate.	`CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` then Task tool spawning N background tasks
Headless CLI	Claude Code running programmatically from shell, cron, or CI — no interactive session.	Scheduled intelligent work (not just bash scripts). Crons that need Claude analysis.	`claude -p "prompt" --bare --allowedTools "Read,Bash" --output-format json`

Rule of thumb

Agents reading independently and returning summaries → subagents. Agents challenging each other's findings on the same codebase → agent teams. Scheduled work that needs Claude reasoning at fire time → headless CLI. Don't use agent teams for work with a clear dependency chain — parallel spawns on dependent tasks produce conflicts, not speed.

Mandatory pre-spawn: the file-ownership map

Before spawning ANY parallel agents that write code, create this map. If two agents share a file, either reassign or sequence them. Present the map before spawning.

File-ownership map format

Agent 1 (DB fixes):    migrations/0043_add_index.sql, models/order.py
Agent 2 (API fixes):   api/routes/orders.py, api/services/order_service.py
Agent 3 (Tests):       tests/test_orders.py, tests/fixtures/orders.py

Shared files: None ✓
Conflict check: CLEAR — spawn approved
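The conflict check itself is mechanical and can be sketched in a few lines. `shared_files` is a hypothetical helper; the ownership data is the map shown above.

```python
from collections import defaultdict
from typing import Dict, List


def shared_files(ownership: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return every file claimed by more than one agent, with the claimants."""
    claimants: Dict[str, List[str]] = defaultdict(list)
    for agent, files in ownership.items():
        for path in files:
            claimants[path].append(agent)
    return {path: agents for path, agents in claimants.items() if len(agents) > 1}


ownership = {
    "Agent 1 (DB fixes)": ["migrations/0043_add_index.sql", "models/order.py"],
    "Agent 2 (API fixes)": ["api/routes/orders.py", "api/services/order_service.py"],
    "Agent 3 (Tests)": ["tests/test_orders.py", "tests/fixtures/orders.py"],
}
conflicts = shared_files(ownership)
print("CLEAR" if not conflicts else f"CONFLICT: {conflicts}")
```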

Task agents READ. Main agent WRITES.

Subagents and task agents are for research, analysis, and proposals only. They read files, identify root causes, and propose fixes with exact file paths and line numbers — but they do NOT edit files. The main agent applies all changes sequentially. This prevents permission errors, file conflicts, and plan-mode loops where subagents block waiting for approval.

The four parallel patterns

Pattern A — Brainstorm (3 subagents, read-only)

Subagent 1 (Pattern Scanner):
  grep for existing patterns, utilities, components in src/
  → returns: reusable patterns + file paths

Subagent 2 (API Scanner):
  scan api/ routes and services for reusable endpoints
  → returns: available endpoints + data shapes

Subagent 3 (Spec Reader):
  read CLAUDE.md, PRODUCT_SPECS.md, ROADMAP.md for constraints
  → returns: constraints, dependencies, risks

Orchestrator merges → 2-3 approaches with tradeoffs → present to user

Pattern B — Self-Review Gate (5 agent teams, read-only + type check)

CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1

Agent 1 (DB):       migrations, RLS policies, org_id filtering, constraints
Agent 2 (Security): auth on every route, injection, exposed secrets, OWASP
Agent 3 (UI):       broken refs, dangling imports, loading/error states
Agent 4 (Types):    run mypy/tsc, check for any types, interface coverage
Agent 5 (Parity):   verify every UI action has an API backing it

Each returns structured findings: P1 / P2 / P3
P1 → main agent auto-fixes, re-runs that agent only
P2/P3 → surfaced to human as a list (not the diff)

Pattern C — Build (sequential, one agent, dependency order)

# Building is NOT parallelizable — strict dependency order required

Step 1: Database   (migrations → user runs manually)
Step 2: API        (routes + services) → mypy/tsc check
Step 3: Logic      (processing, validation, workflows) → mypy/tsc check
Step 4: UI         (components, pages) → mypy/tsc check → full build

Rule: 3 consecutive type-check failures on the same step → STOP.
Surface the error. Do not loop or accumulate hacks.

Pattern D — Multi-bug fix (N agents, isolated git worktrees)

# Each bug gets an isolated worktree — no shared state, no conflicts

git worktree add ../project-fix-1 -b fix/bug-1
git worktree add ../project-fix-2 -b fix/bug-2
git worktree add ../project-fix-3 -b fix/bug-3

# Claude Code instance per worktree (separate tmux panes)
# Each runs independently: fix → type check → PR
# Merge sequentially via PR after all pass

# Clean up after merge
git worktree remove ../project-fix-1

Headless CLI — Claude in cron and CI

Run Claude programmatically from any shell. The same agent loop as interactive mode, scriptable. The gap this closes: scheduled work that needs reasoning at fire time, not just bash execution.

claude -p key flags

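# Minimal locked-down cron invocation
#   --bare                       skip CLAUDE.md, hooks, skills (deterministic CI)
#   --allowedTools "Read,Bash"   minimum required tools only
#   --permission-mode dontAsk    no interactive prompts
#   --output-format json         structured output, no parsing fragility
# (Comments sit above the command: a comment after a trailing backslash breaks the line continuation.)
claude -p "$(cat prompt.txt)" \
  --bare \
  --allowedTools "Read,Bash" \
  --permission-mode dontAsk \
  --output-format json

# With structured schema (replaces jq/awk chains); schema written as standard JSON Schema
claude -p "analyze this output and return findings" \
  --bare \
  --output-format json \
  --json-schema '{"type":"object","properties":{"issues":{"type":"array","items":{"type":"object","properties":{"severity":{"enum":["P1","P2","P3"]},"file":{"type":"string"},"description":{"type":"string"}},"required":["severity","file","description"]}}},"required":["issues"]}'

# Multi-turn (generator-evaluator loop): take session_id from the first call's JSON output (needs jq)
session_id=$(claude -p "first pass" --bare --output-format json | jq -r '.session_id')
claude -p "evaluate and improve" --resume "$session_id" --bare

# Cost tracking (log this, alert on weekly total)
# Response JSON includes: total_cost_usd per call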

Cost discipline for headless Claude

Every cron firing spends tokens, and the spend compounds with every schedule you add. Log total_cost_usd from every response to a daily ledger. Alert when the weekly total exceeds your threshold. Don't bulk-convert all crons at once: adopt the primitive with one proof-of-concept, measure cost, then expand. --allowedTools should always be the minimum set, never blanket Bash.
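A daily ledger with a weekly alert can be sketched as follows. It assumes each response is a JSON object carrying total_cost_usd, as described above; the in-memory dict stands in for whatever storage you use, and the function names are hypothetical.

```python
import json
from datetime import date, timedelta
from typing import Dict


def record_cost(ledger: Dict[str, float], response_json: str, day: date) -> None:
    """Add one call's total_cost_usd to that day's running total."""
    cost = float(json.loads(response_json).get("total_cost_usd", 0.0))
    key = day.isoformat()
    ledger[key] = ledger.get(key, 0.0) + cost


def weekly_total(ledger: Dict[str, float], today: date) -> float:
    """Sum the last seven days, today included."""
    days = [(today - timedelta(days=offset)).isoformat() for offset in range(7)]
    return sum(ledger.get(day, 0.0) for day in days)


def over_budget(ledger: Dict[str, float], today: date, threshold_usd: float) -> bool:
    """True when the trailing seven-day spend exceeds the threshold."""
    return weekly_total(ledger, today) > threshold_usd


# Illustrative values only, not measured costs.
ledger: Dict[str, float] = {}
record_cost(ledger, '{"total_cost_usd": 0.25}', date(2025, 1, 6))
record_cost(ledger, '{"total_cost_usd": 0.5}', date(2025, 1, 7))
print(weekly_total(ledger, date(2025, 1, 7)))
```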

When NOT to parallelize

Situation	Mode	Why
Building a feature (DB → API → UI)	Sequential only	Hard dependencies. API can't be built before DB schema exists.
Tasks where agents share files	Sequential or reassign	File conflicts. Two agents editing the same file produces merge chaos.
Planning and task breakdown	Sequential only	Each step informs the next. Can't plan in parallel.
Deploy pipeline	Sequential only	Hard pipeline dependencies. Tests must pass before push.
Independent bug fixes	Parallel via worktrees	Isolated branches, no shared state.
Code review (same codebase, different dimensions)	Parallel agent teams	Each agent finds what others miss. Independent reads.
Codebase research before planning	Parallel subagents	Independent reads. Keeps main context clean for decisions.

BUILD-USE-TRUST-ORCHESTRATE — the governance rule

Before plugging any project folder into parallel/automated multi-agent work, it must pass through four sequential phases. Skipping phases produces agents opening PRs for already-fixed bugs and an evaluation bottleneck on the human — because AI agents have no speed limit but the person managing them does.

Phase 1 — Build

Has the foundation

CLAUDE.md, scoped skills, docs/runbooks, accumulated context. The folder has opinions.

Phase 2 — Use

Worked manually for weeks

Real issues found and fixed. You know where it's fragile. No surprises from the folder.

Phase 3 — Trust

Output can be skimmed

You no longer fully review every line. A scan is enough to catch problems. Skim-trust threshold reached.

Phase 4 — Orchestrate

Only now: automate

Plug into parallel dispatch, multi-agent coordination, or unattended cron work. Not before.

What to build, in order

Each step unlocks the next. Don't skip.

Step	What	Why now
1	Tiered Autonomy + PRE-BUILD PARAPHRASE + ISC	Habits, not builds. Governs everything else. Wrong tier = wrong gates. Wrong paraphrase = wrong diff.
2	End-of-session debrief + lessons.md	Seeds every automated system downstream. No lessons.md = nothing to compile into rules or hooks.
3	First /test-[product] skill for your main product	Makes testing repeatable and agentically runnable. Base for the regression suite.
4	Pre-push hook: type check + open lessons block	First enforcement gate. Converts two common discipline failures into automatic blocks.
5	Self-Review Gate (Pattern B — 5 parallel agent teams)	Shifts you from reviewer of code to reviewer of a prioritized issue list. Requires `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1`.
6	Brainstorm subagents before any significant feature (Pattern A)	3 parallel reads before writing a line. Catches reuse opportunities and constraints before the diff exists.
7	Worktree isolation for multi-bug sessions (Pattern D)	Independent branches, no conflicts. Parallel bug fixes that would otherwise be sequential.
8	Weekly review cron (Friday, automated)	System starts identifying its own improvement candidates without you asking.
9	One headless `claude -p --bare` proof-of-concept	Closes the gap between scheduled bash scripts and scheduled intelligent work. Start with one cron, measure cost, expand.
10	Generator-Evaluator loop on build workflow	Closes the last producer-to-curator gap. Build after step 5 is well-calibrated. Requires `claude -p --resume`.

When Claude becomes part of the test suite
