When Claude becomes
part of the test suite
Beyond using Claude to write code. Patterns for making the agentic workflow itself catch bugs, enforce quality, and improve the system without you asking.
Product-specific test skills
Instead of running tests ad-hoc, encode the full test suite for each product as a reusable skill. The skill owns the test sequence, the assertions, and the pass/fail report.
The pattern
Each product gets a /test-[product] skill. It runs autonomously, produces a PASS/FAIL report with evidence, and blocks deploy if incomplete.
A suite that crashes or disconnects mid-run is INCOMPLETE, not "8/8 before crash." Report which tests did not run. Never claim a result from a partial run. This is the most violated rule in agentic testing.
Test tiers — what to build and when
| Tier | What it covers | When it runs | Status |
|---|---|---|---|
| API smoke | Every endpoint returns expected shape with real-shape payload | After every deploy, agentically | Build first |
| Regression | Every known bug from lessons.md — repro case, then assert fixed | After every fix, before merge | Build first |
| First-user path | Fresh account, zero pre-configuration, complete core flow | Before any external share | Build first |
| DB round-trip | Constrained columns tested in live DB, not mocks | Any PR touching CHECK / RLS / enum columns | Add after |
| Real E2E | Human-in-the-loop through production (not staging) | Pre-launch and post major change | Add after |
Self-Review Gate — parallel agents as reviewers
After code is written, before it reaches you, 5 parallel agents review it against specific dimensions. P1 issues are auto-fixed. P2/P3 are surfaced as a list.
How it works
DB + schema
CHECK constraints, FK targets, enum values, RLS USING/WITH CHECK match the values the code writes. Round-trip test required for any constrained column.
Security
SQL injection, XSS, exposed secrets, OWASP top 10. Any any types introduced. No auth bypasses.
Consistency
New code matches existing patterns (naming, error handling, response shapes). No dangling imports. No duplicate logic.
Type safety
tsc / mypy passes. No implicit any. Return types declared. New union members handled at every callsite.
New return values handled everywhere
When a classifier or fast-path function gets a new return value (new enum case, new union member), grep every callsite that consumes it and verify the new case is handled. The failure mode: add 'jd_qa' to a return type, but the parent dispatcher has no branch for it — silently falls through to default. A missed case here requires a hotfix PR.
You stop reviewing code line by line and start reviewing a prioritized issue list. The diff is Claude's job. Judgment calls on P2/P3 are yours. This is the producer-to-curator shift — and it only works if the review agents are actually spawned, not skipped under time pressure.
Hooks: enforcement over discipline
Text rules in CLAUDE.md are advisory. Hooks are runtime enforcement. The gap between "Claude should do X" and "Claude cannot do the wrong thing" is a hook.
The principle
When the proposed fix is "remember to X" or "be more careful" — that's the wrong fix. Every discipline-based rule is a hook waiting to be written.
High-leverage hook patterns
| Rule as text | Rule as hook | Trigger |
|---|---|---|
| Never push type errors | Run tsc --noEmit on PostToolUse(Edit) — block push if non-zero | Any file edit |
| No direct commits to main | PreToolUse(Bash) matching git commit — check branch, block if main |
git commit |
| Count affected rows before bulk ops | Pre-fire blast radius: run SELECT COUNT before any UPDATE/DELETE without WHERE | Bash with SQL pattern |
| Regression check before deploy | Pre-push hook greps lessons.md for OPEN entries — block if found | git push |
| No incomplete test reports | PostToolUse on test commands — parse output, flag if suite terminated early | pytest / vitest |
Every rule that gets violated twice has proven it needs a hook. The signal: if you've caught Claude doing the wrong thing more than once despite the rule being in CLAUDE.md, the rule needs to be in settings.json instead.
DB round-trip testing
The failure mode that bites silently: code writes a value, the DB silently rejects it due to a constraint mismatch, the ORM reports success. You discover it in production when the column is always NULL.
A child_gender column had a CHECK accepting 'niña'/'niño' (with ñ). Every code path wrote ASCII 'nina'/'nino'. 100% of writes failed silently for 24 hours. PostgREST returned {data: null, error: null} for every UPDATE. The feature never worked in production. Found by hand, not by tests.
The rule
Any PR touching a column with CHECK, unique constraint, FK, custom enum, or RLS policy must include a round-trip test: write to the live DB from the same client surface (same session context, same RLS role), re-fetch the row, assert the value was written correctly.
Session insights — the self-improving loop
The system gets better without you manually updating it. Session insights extract patterns from what went wrong, propose rule changes, and over time convert advisory rules into enforced hooks.
How the loop works
End-of-session debrief (daily)
Run at the end of every Claude Code session. Output appended to tasks/lessons.md.
Weekly review
Every Friday. Finds patterns across the week and identifies which rules have failed enough to become hooks.
After one month of daily debriefs, lessons.md is the most valuable file in the project. It contains the exact failure modes of your specific codebase — not generic best practices. The weekly review converts that into a tightening enforcement layer. The system starts telling you what to automate next.
Generator-Evaluator: the next step
The Self-Review Gate is one generator pass, one evaluator pass. The full pattern adds iteration: if the evaluator scores below threshold, the generator reruns with the eval feedback. No human in the loop until convergence.
Current state vs full pattern
| Current (Self-Review Gate) | Full Generator-Evaluator | |
|---|---|---|
| Generator | Claude writes code once | Claude writes code, evaluator scores it, Claude regenerates with feedback if below threshold |
| Evaluator | 5 parallel agents, one pass | Same 5 agents, but output feeds back into generator until score clears |
| Iteration | None — one round | N rounds with budget cap (e.g. max 3 attempts) |
| Human touch | P2/P3 list surfaced every time | Human only sees output that cleared the eval threshold |
| Prerequisite | None | Eval needs a scoring signal — /audit-infra or a quality rubric per output type |
The generator-evaluator loop is the highest-leverage upgrade to the build workflow. But it requires the evaluator to have a reliable scoring signal — without that, iteration just compounds the wrong direction. Ship the Self-Review Gate first, use it long enough to calibrate what a P1 looks like, then add the loop.
What to build, in order
Each step unlocks the next. Don't skip.
| Step | What | Why now |
|---|---|---|
| 1 | End-of-session debrief habit + lessons.md | Seeds every other system. No lessons.md = nothing to compile into rules or hooks. |
| 2 | First /test-[product] skill for your main product | Makes testing repeatable and agentically runnable. Base for regression suite. |
| 3 | Pre-push hook: type check + open lessons block | First enforcement gate. Converts two common discipline failures into automatic blocks. |
| 4 | Self-Review Gate inside your build workflow | Shifts you from reviewer of code to reviewer of a prioritized issue list. |
| 5 | Weekly review cron (Friday, automated) | System starts identifying its own improvement candidates without you asking. |
| 6 | Generator-Evaluator loop on /build-feature | Closes the last producer-to-curator gap. Build after step 4 is well-calibrated. |