Agentic Development Patterns
Testing, QA, and self-improving workflows

When Claude becomes
part of the test suite

Beyond using Claude to write code. Patterns for making the agentic workflow itself catch bugs, enforce quality, and improve the system without you asking.

Pattern 01

Product-specific test skills

Instead of running tests ad-hoc, encode the full test suite for each product as a reusable skill. The skill owns the test sequence, the assertions, and the pass/fail report.

The pattern

Each product gets a /test-[product] skill. It runs autonomously, produces a PASS/FAIL report with evidence, and blocks deploy if incomplete.

SKILL.md — /test-sophie (structure)
## Steps 1. Run API smoke test curl /api/health → expect 200 curl /api/test-buffer → send test payload, verify response shape 2. WhatsApp pipeline E2E Trigger inbound message via WAHA test endpoint Assert: intent classified, reply dispatched within 3s Assert: send_at timestamp written to DB (dedup guard) 3. Regression suite For every entry in tasks/lessons.md marked OPEN: Run the specific repro case Assert it no longer triggers the failure 4. First-user path Create fresh org (no pre-configuration) Complete full onboarding flow Assert: all expected rows created in DB 5. Report PASS: all assertions green, suite completed fully FAIL: list which assertions failed + which tests did not run INCOMPLETE: suite terminated early — flag explicitly, do not report as partial pass
INCOMPLETE TEST RULE

A suite that crashes or disconnects mid-run is INCOMPLETE, not "8/8 before crash." Report which tests did not run. Never claim a result from a partial run. This is the most violated rule in agentic testing.

Test tiers — what to build and when

TierWhat it coversWhen it runsStatus
API smoke Every endpoint returns expected shape with real-shape payload After every deploy, agentically Build first
Regression Every known bug from lessons.md — repro case, then assert fixed After every fix, before merge Build first
First-user path Fresh account, zero pre-configuration, complete core flow Before any external share Build first
DB round-trip Constrained columns tested in live DB, not mocks Any PR touching CHECK / RLS / enum columns Add after
Real E2E Human-in-the-loop through production (not staging) Pre-launch and post major change Add after
Pattern 02

Self-Review Gate — parallel agents as reviewers

After code is written, before it reaches you, 5 parallel agents review it against specific dimensions. P1 issues are auto-fixed. P2/P3 are surfaced as a list.

How it works

Code written
5 agents spawn
P1 auto-fixed
P2/P3 listed
You review the list, not the diff
Agent 1

DB + schema

CHECK constraints, FK targets, enum values, RLS USING/WITH CHECK match the values the code writes. Round-trip test required for any constrained column.

Agent 2

Security

SQL injection, XSS, exposed secrets, OWASP top 10. Any any types introduced. No auth bypasses.

Agent 3

Consistency

New code matches existing patterns (naming, error handling, response shapes). No dangling imports. No duplicate logic.

Agent 4

Type safety

tsc / mypy passes. No implicit any. Return types declared. New union members handled at every callsite.

Agent 5 — Dispatcher completeness

New return values handled everywhere

When a classifier or fast-path function gets a new return value (new enum case, new union member), grep every callsite that consumes it and verify the new case is handled. The failure mode: add 'jd_qa' to a return type, but the parent dispatcher has no branch for it — silently falls through to default. A missed case here requires a hotfix PR.

The shift this enables

You stop reviewing code line by line and start reviewing a prioritized issue list. The diff is Claude's job. Judgment calls on P2/P3 are yours. This is the producer-to-curator shift — and it only works if the review agents are actually spawned, not skipped under time pressure.

Pattern 03

Hooks: enforcement over discipline

Text rules in CLAUDE.md are advisory. Hooks are runtime enforcement. The gap between "Claude should do X" and "Claude cannot do the wrong thing" is a hook.

The principle

When the proposed fix is "remember to X" or "be more careful" — that's the wrong fix. Every discipline-based rule is a hook waiting to be written.

settings.json — hook structure
{ "hooks": { "PreToolUse": [ { "matcher": "Bash", "hooks": [{ "type": "command", "command": "~/.claude/hooks/pre-fire-blast-radius.sh" }] } ], "PostToolUse": [ { "matcher": "Edit", "hooks": [{ "type": "command", "command": "~/.claude/hooks/tsc-check.sh" }] } ] } }

High-leverage hook patterns

Rule as textRule as hookTrigger
Never push type errors Run tsc --noEmit on PostToolUse(Edit) — block push if non-zero Any file edit
No direct commits to main PreToolUse(Bash) matching git commit — check branch, block if main git commit
Count affected rows before bulk ops Pre-fire blast radius: run SELECT COUNT before any UPDATE/DELETE without WHERE Bash with SQL pattern
Regression check before deploy Pre-push hook greps lessons.md for OPEN entries — block if found git push
No incomplete test reports PostToolUse on test commands — parse output, flag if suite terminated early pytest / vitest
The real gap: rules without hooks are bets on discipline

Every rule that gets violated twice has proven it needs a hook. The signal: if you've caught Claude doing the wrong thing more than once despite the rule being in CLAUDE.md, the rule needs to be in settings.json instead.

Pattern 04

DB round-trip testing

The failure mode that bites silently: code writes a value, the DB silently rejects it due to a constraint mismatch, the ORM reports success. You discover it in production when the column is always NULL.

The exact failure

A child_gender column had a CHECK accepting 'niña'/'niño' (with ñ). Every code path wrote ASCII 'nina'/'nino'. 100% of writes failed silently for 24 hours. PostgREST returned {data: null, error: null} for every UPDATE. The feature never worked in production. Found by hand, not by tests.

The rule

Any PR touching a column with CHECK, unique constraint, FK, custom enum, or RLS policy must include a round-trip test: write to the live DB from the same client surface (same session context, same RLS role), re-fetch the row, assert the value was written correctly.

Round-trip test checklist — any constrained column
1. Read the constraint: CHECK list, FK target, enum values, RLS USING/WITH CHECK clauses. 2. Confirm the value the code writes matches what the constraint accepts: - Case: 'niña' vs 'nina' - Encoding: accented chars vs ASCII - Type: string vs integer - Length: varchar(50) vs longer 3. Write from the same client surface: Same browser session, same function context, same RLS role. Not from psql as superuser. Not from mocked client. 4. Re-fetch the row: Assert the column holds the expected value, not NULL. A toast/error caught client-side is NOT proof of success. 5. PostgREST gotcha: UPDATE that touches zero rows (RLS mismatch, wrong eq filter) returns {data: null, error: null} — client treats as success. Only a re-fetch proves the write landed.
Pattern 05

Session insights — the self-improving loop

The system gets better without you manually updating it. Session insights extract patterns from what went wrong, propose rule changes, and over time convert advisory rules into enforced hooks.

How the loop works

Bug / mistake happens Captured in lessons.md Session insights surfaces pattern Rule added to CLAUDE.md Hook added to settings.json Pattern never recurs

End-of-session debrief (daily)

Run at the end of every Claude Code session. Output appended to tasks/lessons.md.

Daily debrief prompt
Read the git diff from this session and tasks/lessons.md. Answer: 1. What was built? One sentence. 2. What mistakes were made and corrected during this session? 3. Which of these mistakes could have been caught by a hook or a test? 4. What is the single most important rule to add or update in CLAUDE.md? 5. Is that rule a hook candidate? If yes, what is the trigger and the check? 6. Were any tasks deferred? Write them to tasks/todo.md now — deferred items mentioned only in chat are lost on context reset.

Weekly review

Every Friday. Finds patterns across the week and identifies which rules have failed enough to become hooks.

Weekly review prompt
Read tasks/lessons.md and git log --oneline --since="7 days ago". 1. Which lessons appeared more than once this week? These are pattern-class bugs. 2. Which rules in CLAUDE.md were violated despite being written down? Each violation = hook candidate. List them. 3. Which lessons are still OPEN (not yet a rule, not yet a hook)? Prioritize by frequency. 4. What is the one rule-to-hook conversion with highest leverage this week? Give the exact trigger type (PreToolUse / PostToolUse / Bash pattern) and the shell check it would run.
The compounding effect

After one month of daily debriefs, lessons.md is the most valuable file in the project. It contains the exact failure modes of your specific codebase — not generic best practices. The weekly review converts that into a tightening enforcement layer. The system starts telling you what to automate next.

Pattern 06

Generator-Evaluator: the next step

The Self-Review Gate is one generator pass, one evaluator pass. The full pattern adds iteration: if the evaluator scores below threshold, the generator reruns with the eval feedback. No human in the loop until convergence.

Current state vs full pattern

Current (Self-Review Gate)Full Generator-Evaluator
Generator Claude writes code once Claude writes code, evaluator scores it, Claude regenerates with feedback if below threshold
Evaluator 5 parallel agents, one pass Same 5 agents, but output feeds back into generator until score clears
Iteration None — one round N rounds with budget cap (e.g. max 3 attempts)
Human touch P2/P3 list surfaced every time Human only sees output that cleared the eval threshold
Prerequisite None Eval needs a scoring signal — /audit-infra or a quality rubric per output type
Why build this in order

The generator-evaluator loop is the highest-leverage upgrade to the build workflow. But it requires the evaluator to have a reliable scoring signal — without that, iteration just compounds the wrong direction. Ship the Self-Review Gate first, use it long enough to calibrate what a P1 looks like, then add the loop.


What to build, in order

Each step unlocks the next. Don't skip.

StepWhatWhy now
1 End-of-session debrief habit + lessons.md Seeds every other system. No lessons.md = nothing to compile into rules or hooks.
2 First /test-[product] skill for your main product Makes testing repeatable and agentically runnable. Base for regression suite.
3 Pre-push hook: type check + open lessons block First enforcement gate. Converts two common discipline failures into automatic blocks.
4 Self-Review Gate inside your build workflow Shifts you from reviewer of code to reviewer of a prioritized issue list.
5 Weekly review cron (Friday, automated) System starts identifying its own improvement candidates without you asking.
6 Generator-Evaluator loop on /build-feature Closes the last producer-to-curator gap. Build after step 4 is well-calibrated.