Agentic Development Patterns

Pattern 01

Product-specific test skills

Instead of running tests ad hoc, encode the full test suite for each product as a reusable skill. The skill owns the test sequence, the assertions, and the pass/fail report.

The pattern

Each product gets a /test-[product] skill. It runs autonomously, produces a PASS/FAIL report with evidence, and blocks deploy if incomplete.

SKILL.md — /test-sophie (structure)

## Steps

1. Run API smoke test
   curl /api/health → expect 200
   curl /api/test-buffer → send test payload, verify response shape

2. WhatsApp pipeline E2E
   Trigger inbound message via WAHA test endpoint
   Assert: intent classified, reply dispatched within 3s
   Assert: send_at timestamp written to DB (dedup guard)

3. Regression suite
   For every entry in tasks/lessons.md marked OPEN:
     Run the specific repro case
     Assert it no longer triggers the failure

4. First-user path
   Create fresh org (no pre-configuration)
   Complete full onboarding flow
   Assert: all expected rows created in DB

5. Report
   PASS: all assertions green, suite completed fully
   FAIL: list which assertions failed + which tests did not run
   INCOMPLETE: suite terminated early — flag explicitly, do not report as partial pass

INCOMPLETE TEST RULE

A suite that crashes or disconnects mid-run is INCOMPLETE, not "8/8 before crash." Report which tests did not run. Never claim a result from a partial run. This is the most violated rule in agentic testing.
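The report rule above can be sketched as a small classifier. This is a minimal sketch, not the skill's actual implementation: `SuiteReport`, `classify_run`, and the idea of comparing a planned test list against recorded results are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SuiteReport:
    status: str          # "PASS", "FAIL", or "INCOMPLETE"
    failed: List[str]    # tests that ran and failed
    not_run: List[str]   # planned tests with no recorded result


def classify_run(planned: List[str], results: Dict[str, bool]) -> SuiteReport:
    """Classify a run from the planned test list and the results actually recorded.

    `results` maps test name -> True (passed) or False (failed). A planned test
    that is missing from `results` never ran, so the whole run is INCOMPLETE.
    """
    not_run = [name for name in planned if name not in results]
    failed = [name for name in planned if results.get(name) is False]
    if not_run:
        # Never report a partial pass, even if everything that ran was green.
        return SuiteReport("INCOMPLETE", failed, not_run)
    if failed:
        return SuiteReport("FAIL", failed, not_run)
    return SuiteReport("PASS", failed, not_run)
```

A run that crashed after two green tests out of four comes back as INCOMPLETE with the two missing tests listed in `not_run`, which is the report the rule asks for.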

Test tiers — what to build and when

Tier	What it covers	When it runs	Priority
API smoke	Every endpoint returns expected shape with real-shape payload	After every deploy, agentically	Build first
Regression	Every known bug from lessons.md — repro case, then assert fixed	After every fix, before merge	Build first
First-user path	Fresh account, zero pre-configuration, complete core flow	Before any external share	Build first
DB round-trip	Constrained columns tested in live DB, not mocks	Any PR touching CHECK / RLS / enum columns	Add after
Real E2E	Human-in-the-loop through production (not staging)	Pre-launch and post major change	Add after

Pattern 02

Self-Review Gate — parallel agents as reviewers

After code is written and before it reaches you, five parallel agents each review it against one dimension. P1 (highest-priority) issues are auto-fixed. P2/P3 issues are surfaced as a list.

How it works

Code written → 5 agents spawn → P1 auto-fixed → P2/P3 listed → You review the list, not the diff
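The flow above can be sketched as a fan-out followed by a triage step. This is a minimal sketch under stated assumptions: each reviewer is modeled as a plain function that takes a diff and returns issues, and `Issue`, `Reviewer`, and `run_gate` are hypothetical names. In a real workflow each function would wrap an agent call.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Issue:
    reviewer: str
    priority: int   # 1 = P1, 2 = P2, 3 = P3
    message: str


# Hypothetical reviewer signature: diff text in, list of issues out.
Reviewer = Callable[[str], List[Issue]]


def run_gate(diff: str, reviewers: Dict[str, Reviewer]) -> Tuple[List[Issue], List[Issue]]:
    """Run every reviewer on the same diff in parallel, then split by priority.

    Returns (auto_fix, for_human): P1 issues go to the auto-fix queue,
    P2/P3 issues go to the list a human reads, sorted by priority.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = [pool.submit(fn, diff) for fn in reviewers.values()]
        issues = [issue for future in futures for issue in future.result()]
    auto_fix = [i for i in issues if i.priority == 1]
    for_human = sorted((i for i in issues if i.priority > 1), key=lambda i: i.priority)
    return auto_fix, for_human
```

Keeping the triage in code rather than in a prompt makes the split auditable: nothing tagged P1 can end up in the human list by accident.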

Agent 1

DB + schema

CHECK constraints, FK targets, enum values, RLS USING/WITH CHECK match the values the code writes. Round-trip test required for any constrained column.

Agent 2

Security

SQL injection, XSS, exposed secrets, and the rest of the OWASP Top 10. Flags any newly introduced `any` types and any auth bypass.

Agent 3

Consistency

New code matches existing patterns (naming, error handling, response shapes). No dangling imports. No duplicate logic.

Agent 4

Type safety

tsc / mypy passes. No implicit any. Return types declared. New union members handled at every callsite.

Agent 5 — Dispatcher completeness

New return values handled everywhere

When a classifier or fast-path function gets a new return value (new enum case, new union member), grep every callsite that consumes it and verify the new case is handled. The failure mode: add 'jd_qa' to a return type, but the parent dispatcher has no branch for it — silently falls through to default. A missed case here requires a hotfix PR.
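One way to make the missed case fail loudly instead of falling through is sketched below. The `Intent` members other than 'jd_qa', the `HANDLERS` table, and both function names are hypothetical; the sketch assumes the dispatcher can be written as a lookup table rather than an if/else chain.

```python
from typing import Callable, Dict, List, Literal, get_args

# Hypothetical intent type; 'jd_qa' is the newly added member.
Intent = Literal["greeting", "schedule", "jd_qa"]

HANDLERS: Dict[str, Callable[[str], str]] = {
    "greeting": lambda text: "hello",
    "schedule": lambda text: "scheduling",
    # "jd_qa" is deliberately missing to show the gap being caught.
}


def missing_handlers(handlers: Dict[str, Callable[[str], str]]) -> List[str]:
    """Return every Intent member that has no handler registered."""
    return sorted(set(get_args(Intent)) - set(handlers))


def dispatch(intent: str, text: str) -> str:
    """Route to a handler; raise instead of silently using a default branch."""
    handler = HANDLERS.get(intent)
    if handler is None:
        raise ValueError(f"no handler for intent {intent!r}")
    return handler(text)
```

Running `missing_handlers(HANDLERS)` in a test or a review step reports `['jd_qa']` before the change ships, which is the check Agent 5 performs by grep.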

The shift this enables

You stop reviewing code line by line and start reviewing a prioritized issue list. The diff is Claude's job. Judgment calls on P2/P3 are yours. This is the producer-to-curator shift — and it only works if the review agents are actually spawned, not skipped under time pressure.

Pattern 03

Hooks: enforcement over discipline

Text rules in CLAUDE.md are advisory. Hooks are runtime enforcement. The gap between "Claude should do X" and "Claude cannot do the wrong thing" is a hook.

The principle

When the proposed fix is "remember to X" or "be more careful" — that's the wrong fix. Every discipline-based rule is a hook waiting to be written.

settings.json — hook structure

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "~/.claude/hooks/pre-fire-blast-radius.sh"
        }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Edit",
        "hooks": [{
          "type": "command",
          "command": "~/.claude/hooks/tsc-check.sh"
        }]
      }
    ]
  }
}
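A PreToolUse command hook is just a script. The sketch below shows what a blast-radius check could look like, written in Python rather than the shell script named in the config. It assumes the hook receives the tool call as JSON on stdin with the command under `tool_input.command`, and that exit code 2 blocks the call; confirm both against the Claude Code hooks reference. The regex is deliberately naive and the function names are hypothetical.

```python
import json
import re
import sys
from typing import List

# Matches an UPDATE ... SET or DELETE FROM statement up to the next semicolon.
_STATEMENT = re.compile(r'\b(?:UPDATE\s+[\w."]+\s+SET|DELETE\s+FROM)\b[^;]*', re.IGNORECASE)
_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


def unbounded_writes(command: str) -> List[str]:
    """Return UPDATE/DELETE statements in `command` that have no WHERE clause."""
    return [
        match.group(0).strip()
        for match in _STATEMENT.finditer(command)
        if not _WHERE.search(match.group(0))
    ]


def decide(event: dict) -> int:
    """Return the exit code for one hook event: 2 to block (assumed), 0 to allow."""
    command = event.get("tool_input", {}).get("command", "")
    hits = unbounded_writes(command)
    for statement in hits:
        print(f"Blocked, no WHERE clause: {statement}", file=sys.stderr)
    return 2 if hits else 0


def main() -> int:
    """Entry point when installed as a hook: read the event from stdin."""
    return decide(json.load(sys.stdin))
```

Installed as a hook, the file would end with `raise SystemExit(main())`. A pattern match like this misses SQL built at runtime or hidden inside a script file, so treat it as a tripwire, not a guarantee.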

High-leverage hook patterns

Rule as text	Rule as hook	Trigger
Never push type errors	Run tsc --noEmit on PostToolUse(Edit) — block push if non-zero	Any file edit
No direct commits to main	PreToolUse(Bash) matching `git commit` — check branch, block if main	git commit
Count affected rows before bulk ops	Pre-fire blast radius: run SELECT COUNT before any UPDATE/DELETE without WHERE	Bash with SQL pattern
Regression check before deploy	Pre-push hook greps lessons.md for OPEN entries — block if found	git push
No incomplete test reports	PostToolUse on test commands — parse output, flag if suite terminated early	pytest / vitest
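The "Regression check before deploy" row can be sketched as a script called from a git pre-push hook, where a non-zero exit aborts the push. The entry format is an assumption: this sketch treats a list item starting with `[OPEN]` as an open lesson, so adjust the pattern to whatever lessons.md actually uses. Function names are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Assumed entry format: a list item such as "- [OPEN] dedup guard skipped on retry".
_OPEN_ENTRY = re.compile(r"^[ \t]*[-*][ \t]*\[OPEN\][ \t]*(.+)$", re.MULTILINE)


def open_lessons(text: str) -> List[str]:
    """Return the description of every entry still marked OPEN."""
    return [entry.strip() for entry in _OPEN_ENTRY.findall(text)]


def check_file(path: str = "tasks/lessons.md") -> int:
    """Return 0 when no OPEN entries remain, 1 otherwise."""
    lessons_file = Path(path)
    if not lessons_file.exists():
        return 0  # nothing recorded yet, nothing to block on
    entries = open_lessons(lessons_file.read_text(encoding="utf-8"))
    for entry in entries:
        print(f"OPEN lesson blocks push: {entry}")
    return 1 if entries else 0
```

Called from `.git/hooks/pre-push`, the script would end with `raise SystemExit(check_file())`.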

The real gap: rules without hooks are bets on discipline

Every rule that gets violated twice has proven it needs a hook. The signal: if you've caught Claude doing the wrong thing more than once despite the rule being in CLAUDE.md, the rule needs to be in settings.json instead.

Pattern 04

DB round-trip testing

The failure mode that bites hardest is the silent one: code writes a value, the DB rejects it because of a constraint mismatch, and the client still reports success. You discover it in production when the column is always NULL.

The exact failure

A child_gender column had a CHECK accepting 'niña'/'niño' (with ñ). Every code path wrote ASCII 'nina'/'nino'. 100% of writes failed silently for 24 hours. PostgREST returned {data: null, error: null} for every UPDATE. The feature never worked in production. Found by hand, not by tests.

The rule

Any PR touching a column with CHECK, unique constraint, FK, custom enum, or RLS policy must include a round-trip test: write to the live DB from the same client surface (same session context, same RLS role), re-fetch the row, assert the value was written correctly.

Round-trip test checklist — any constrained column

1. Read the constraint:
   CHECK list, FK target, enum values, RLS USING/WITH CHECK clauses.

2. Confirm the value the code writes matches what the constraint accepts:
   - Case: 'Niña' vs 'niña'
   - Encoding: accented chars vs ASCII ('niña' vs 'nina')
   - Type: string vs integer
   - Length: varchar(50) vs longer

3. Write from the same client surface:
   Same browser session, same function context, same RLS role.
   Not from psql as superuser. Not from mocked client.

4. Re-fetch the row:
   Assert the column holds the expected value, not NULL.
   A success toast, or the absence of a client-side error, is NOT proof of success.

5. PostgREST gotcha:
   UPDATE that touches zero rows (RLS mismatch, wrong eq filter) returns
   {data: null, error: null} — client treats as success.
   Only a re-fetch proves the write landed.
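The checklist boils down to: write, ignore the response, re-fetch, compare. A minimal sketch follows, assuming a client that exposes one update call and one fetch call. `RowClient`, `assert_round_trip`, and `SilentlyRejectingClient` are hypothetical; the last is an in-memory stand-in that mimics the silent failure so the helper can be exercised without a database. A real round-trip test must run against the live DB with the same role the application uses.

```python
from typing import Any, Dict, Optional, Protocol


class RowClient(Protocol):
    """Hypothetical minimal client surface: one update call, one fetch call."""

    def update(self, row_id: int, column: str, value: Any) -> Dict[str, Any]: ...
    def fetch(self, row_id: int) -> Optional[Dict[str, Any]]: ...


def assert_round_trip(client: RowClient, row_id: int, column: str, value: Any) -> None:
    """Write a value, re-fetch the row, and fail unless the stored value matches."""
    client.update(row_id, column, value)  # the response is deliberately not trusted
    row = client.fetch(row_id)
    if row is None:
        raise AssertionError(f"row {row_id} not visible after write")
    if row.get(column) != value:
        raise AssertionError(f"{column}: wrote {value!r}, stored {row.get(column)!r}")


class SilentlyRejectingClient:
    """In-memory stand-in: rejected writes return the same shape as accepted ones."""

    def __init__(self, allowed: set) -> None:
        self.allowed = allowed
        self.rows: Dict[int, Dict[str, Any]] = {1: {"child_gender": None}}

    def update(self, row_id: int, column: str, value: Any) -> Dict[str, Any]:
        if value in self.allowed:
            self.rows[row_id][column] = value
        return {"data": None, "error": None}  # identical for success and rejection

    def fetch(self, row_id: int) -> Optional[Dict[str, Any]]:
        return self.rows.get(row_id)
```

With `SilentlyRejectingClient({"niña", "niño"})`, writing 'niña' passes and writing 'nina' raises, even though both `update` calls return the same response.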

Pattern 05

Session insights — the self-improving loop

The system gets better without you manually updating it. Session insights extract patterns from what went wrong, propose rule changes, and over time convert advisory rules into enforced hooks.

How the loop works

Bug / mistake happens → Captured in lessons.md → Session insights surfaces pattern → Rule added to CLAUDE.md → Hook added to settings.json → Pattern never recurs

End-of-session debrief (daily)

Run at the end of every Claude Code session. Output appended to tasks/lessons.md.

Daily debrief prompt

Read the git diff from this session and tasks/lessons.md.

Answer:
1. What was built? One sentence.
2. What mistakes were made and corrected during this session?
3. Which of these mistakes could have been caught by a hook or a test?
4. What is the single most important rule to add or update in CLAUDE.md?
5. Is that rule a hook candidate? If yes, what is the trigger and the check?
6. Were any tasks deferred? Write them to tasks/todo.md now — deferred items
   mentioned only in chat are lost on context reset.

Weekly review

Every Friday. Finds patterns across the week and identifies which rules have failed enough to become hooks.

Weekly review prompt

Read tasks/lessons.md and git log --oneline --since="7 days ago".

1. Which lessons appeared more than once this week? These are pattern-class bugs.
2. Which rules in CLAUDE.md were violated despite being written down?
   Each violation = hook candidate. List them.
3. Which lessons are still OPEN (not yet a rule, not yet a hook)?
   Prioritize by frequency.
4. What is the one rule-to-hook conversion with highest leverage this week?
   Give the exact trigger type (PreToolUse / PostToolUse / Bash pattern)
   and the shell check it would run.

The compounding effect

After one month of daily debriefs, lessons.md is the most valuable file in the project. It contains the exact failure modes of your specific codebase — not generic best practices. The weekly review converts that into a tightening enforcement layer. The system starts telling you what to automate next.

Pattern 06

Generator-Evaluator: the next step

The Self-Review Gate is one generator pass, one evaluator pass. The full pattern adds iteration: if the evaluator scores below threshold, the generator reruns with the eval feedback. No human in the loop until convergence.

Current state vs full pattern

	Current (Self-Review Gate)	Full Generator-Evaluator
Generator	Claude writes code once	Claude writes code, evaluator scores it, Claude regenerates with feedback if below threshold
Evaluator	5 parallel agents, one pass	Same 5 agents, but output feeds back into generator until score clears
Iteration	None — one round	N rounds with budget cap (e.g. max 3 attempts)
Human touch	P2/P3 list surfaced every time	Human only sees output that cleared the eval threshold
Prerequisite	None	Eval needs a scoring signal — /audit-infra or a quality rubric per output type

Why build this in order

The generator-evaluator loop is the highest-leverage upgrade to the build workflow. But it requires the evaluator to have a reliable scoring signal — without that, iteration just compounds the wrong direction. Ship the Self-Review Gate first, use it long enough to calibrate what a P1 looks like, then add the loop.
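The loop itself is small; the hard part is the scoring signal. A minimal sketch, assuming the generator and evaluator are plain callables: `generate` takes the previous feedback (or None on the first attempt) and `evaluate` returns a score plus feedback text. `LoopResult` and `generate_until_pass` are hypothetical names, and the budget cap of 3 mirrors the example in the table.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LoopResult:
    output: str
    score: float
    attempts: int
    converged: bool  # False means the budget ran out below threshold


def generate_until_pass(
    generate: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Tuple[float, str]],
    threshold: float,
    max_attempts: int = 3,
) -> LoopResult:
    """Regenerate with evaluator feedback until the score clears the threshold
    or the attempt budget is spent. On exhaustion, return the best attempt seen."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: Optional[str] = None
    best_output, best_score = "", float("-inf")
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        score, feedback = evaluate(output)
        if score >= threshold:
            return LoopResult(output, score, attempt, True)
        if score > best_score:
            best_output, best_score = output, score
    return LoopResult(best_output, best_score, max_attempts, False)
```

Returning `converged=False` instead of raising keeps the human in the loop exactly where the pattern wants them: only when the budget runs out.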

What to build, in order

Each step unlocks the next. Don't skip.

Step	What	Why now
1	End-of-session debrief habit + lessons.md	Seeds every other system. No lessons.md = nothing to compile into rules or hooks.
2	First /test-[product] skill for your main product	Makes testing repeatable and agentically runnable. Base for regression suite.
3	Pre-push hook: type check + open lessons block	First enforcement gate. Converts two common discipline failures into automatic blocks.
4	Self-Review Gate inside your build workflow	Shifts you from reviewer of code to reviewer of a prioritized issue list.
5	Weekly review cron (Friday, automated)	System starts identifying its own improvement candidates without you asking.
6	Generator-Evaluator loop on /build-feature	Closes the last producer-to-curator gap. Build once step 4 is well calibrated.

When Claude becomes part of the test suite
