The PR Checklist We Run on Every AI-Generated Merge

Three months into building VitalRegistry with Claude Code, we shipped a PR that passed all its tests. Lint clean. Type-safe. Merged. Deployed.

Two weeks later we were untangling four different error handling patterns spread across eight files. All generated in the same sprint, all technically correct, none consistent with each other. The codebase had started to fork from itself and no test had noticed.

We use AI to write roughly 60% of our code across VitalRegistry and every client project we take on. We are not stopping. The checklist exists because we learned what AI-generated code gets wrong in ways that tests cannot catch.

The Three Degradation Patterns

AI-generated code degrades in ways that do not show up in tests: inconsistent error handling patterns, copied abstractions that do not fit the codebase's conventions, and dead code from half-implemented approaches that compiled but were never wired up.

The code works and the logic is sound. The problem is scope: the model generates each file with its own local coherence, reading adjacent context but not holding the full pattern of the codebase in memory. The result is a slow divergence from the conventions the codebase had before the AI touched it.

Error handling drift is the most common. A model asked to add a new Firebase listener produces try/catch blocks matching what it sees in the current file. If that file was written six months ago with an older pattern, you get the old convention cloned into the new feature. Multiply this across a sprint and you have four error handling styles in one codebase, all valid, none intentional.
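One way to make the convention explicit, so the model has something to copy other than whatever the current file happens to do, is a single wrapper that every async call goes through. This is a minimal sketch, not VitalRegistry's actual code: AppError, withErrorPolicy, and saveContract are hypothetical names, and the policy it encodes (log once, rethrow in one shape) is an assumption about what a team might standardise on.

```typescript
// Hypothetical shared error shape. Names and policy are illustrative assumptions.
class AppError extends Error {
  readonly operation: string;
  readonly original: unknown;

  constructor(operation: string, original: unknown) {
    const detail = original instanceof Error ? original.message : String(original);
    super(`${operation} failed: ${detail}`);
    this.name = "AppError";
    this.operation = operation;
    this.original = original;
  }
}

// Wrap an async operation so a failure is logged where it is first caught
// and always rethrown as an AppError.
async function withErrorPolicy<T>(
  operation: string,
  fn: () => Promise<T>,
  log: (message: string) => void = console.error,
): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    // An inner wrapper already logged and shaped this error: pass it through.
    if (err instanceof AppError) throw err;
    const wrapped = new AppError(operation, err);
    log(wrapped.message);
    throw wrapped;
  }
}

// Hypothetical usage: a Firebase-facing call goes through the same policy
// as every other call, instead of carrying its own try/catch.
async function saveContract(upload: () => Promise<string>): Promise<string> {
  return withErrorPolicy("saveContract", upload);
}
```

With something like this in place, a reviewer only has to ask whether new code uses the wrapper, rather than comparing try/catch shapes by eye.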

Abstraction duplication is slower to find. AI reaches for the nearest tool in its context window. If formatCurrencyHUF() lives in lib/format.ts but the model was focused on a different part of the tree, you get displayHUFAmount() in components/payments/utils.ts. Same logic, different name, no shared test. Both survive review because both look reasonable in isolation.
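To make the duplication concrete, here is a sketch of what the two helpers might look like and a cheap way to confirm a suspicion. The function bodies are assumptions (only the names and paths come from our codebase), and behavesAlike is a hypothetical helper. Agreement on a handful of samples is a hint for the reviewer, not a proof of equivalence.

```typescript
// lib/format.ts (body is an assumption for illustration)
function formatCurrencyHUF(amount: number): string {
  const digits = Math.round(amount).toString();
  // Insert a space before every group of three digits.
  return digits.replace(/\B(?=(\d{3})+(?!\d))/g, " ") + " Ft";
}

// components/payments/utils.ts (the duplicate; body is also an assumption)
function displayHUFAmount(value: number): string {
  const rounded = Math.round(value);
  const sign = rounded < 0 ? "-" : "";
  let rest = Math.abs(rounded).toString();
  const groups: string[] = [];
  while (rest.length > 3) {
    groups.unshift(rest.slice(-3));
    rest = rest.slice(0, -3);
  }
  groups.unshift(rest);
  return `${sign}${groups.join(" ")} Ft`;
}

// Hypothetical helper: do two functions agree on every sample input?
function behavesAlike<A, R>(f: (a: A) => R, g: (a: A) => R, samples: A[]): boolean {
  return samples.every((s) => f(s) === g(s));
}

const samples = [0, 7, 999, 1000, 1234.5, 1234567, -4200];
const duplicate = behavesAlike(formatCurrencyHUF, displayHUFAmount, samples);
console.log(duplicate ? "same concept, two names" : "genuinely different");
```

Different names, different implementations, identical output on every sample. Neither a linter nor a name search would connect them.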

Dead code from half-completed refactors is the hardest to see. The model starts a pattern, writes the scaffolding, then finishes the task a different way. The scaffolding compiles. Nothing calls it. Later, someone tracing a data flow follows a path that leads nowhere.
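The pattern is easy to show in miniature. The sketch below is a deliberately naive scan, assuming plain `function name()` declarations and matching on text only; findUncalledFunctions and the sample file contents are hypothetical. A real linter or the compiler resolves scopes and exports properly, so treat this as an illustration of the idea rather than a replacement for either.

```typescript
// Report function names that are declared but never mentioned anywhere else.
// Text matching only: comments and strings can cause false negatives.
function findUncalledFunctions(sources: Record<string, string>): string[] {
  const declared = new Set<string>();
  const declaration = /\bfunction\s+([A-Za-z_]\w*)/g;

  for (const text of Object.values(sources)) {
    let match: RegExpExecArray | null;
    while ((match = declaration.exec(text)) !== null) {
      declared.add(match[1]);
    }
  }

  const allText = Object.values(sources).join("\n");
  return Array.from(declared).filter((name) => {
    const mentions = allText.match(new RegExp("\\b" + name + "\\b", "g")) ?? [];
    return mentions.length === 1; // the declaration is the only mention
  });
}

// Hypothetical sample: a queue scaffold was written, then the task was
// finished with a direct call instead.
const sources = {
  "contracts/upload.ts": `
    function buildUploadQueue() { return []; }
    function uploadContract() { return sendNow(); }
    function sendNow() { return true; }
  `,
  "contracts/index.ts": `uploadContract();`,
};

const unused = findUncalledFunctions(sources);
console.log(unused); // [ 'buildUploadQueue' ]
```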

The Incident That Built the Checklist

The error handling drift caught us on VitalRegistry's contract generation feature. Three contributors in one sprint: Zsombor on the frontend, me on the Firebase functions, Claude Code on the glue. Each context window was coherent. The seams were not.

We found it during a manual trace on a completely unrelated bug. The contract upload path caught and logged in one shape. The device status update rethrew in a different shape. The notification trigger swallowed errors silently, in a pattern copied from a Streamlit prototype we'd translated months earlier.

Nothing broke. The patterns just accumulated until the codebase stopped feeling like one codebase.

We saw a different failure mode with a local model. Asked to make a small edit, Qwen Coder 27B spent twenty minutes reading through twenty thousand lines of code. Running locally on a 3090, the model did not trust its own tool call grammar and compensated by reading everything repeatedly. The output was unusable. The 3090 found its real job indexing our internal knowledge base instead. Different failure mode, same root: AI-generated code needs a human reviewer with codebase-wide context, because the model's context window does not supply it.

The Checklist

We run five checks on every AI-touched PR before merge. Not a CI job. A five-minute read.

1. Error handling consistency. Does this PR's error handling match the pattern in the rest of the codebase? Pick two unrelated files from outside the PR's scope. Compare. If the new code invented its own convention, rewrite it before it spreads.

2. Abstraction fit. Did the AI introduce a utility that already exists elsewhere under a different name? Search the repo for the concept, not just the function name. Duplication at this level is invisible to linters because both versions are valid code.

3. Dead code audit. Any functions defined but not called? Any unused imports? Run the linter with unused-code rules before merge. AI scaffolds and then changes approach mid-generation. The scaffold stays.

4. Convention alignment. Variable naming, file structure, export patterns. Does this PR match the shape of the module it lives in? The model optimises for the file it is writing, not the directory it sits in.

5. The adversarial loop. Run the automated review pass before merge. Our GSD ship phase runs an adversarial review-and-fix loop up to twice before pinging a human. Most builds clear in one loop. The ceiling is two loops because by the third the agent starts agreeing with itself instead of finding the bug.
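The cap in check 5 can be sketched as a bounded loop. This is not the GSD implementation: runReviewLoop, Finding, and ReviewOutcome are hypothetical names, the reviewer and fixer are passed in as plain synchronous functions to keep the example self-contained (real agent calls would be async), and counting one review plus one fix as "a loop" is our reading for the purposes of the sketch.

```typescript
type Finding = { file: string; message: string };

type ReviewOutcome =
  | { status: "clean"; loops: number }
  | { status: "needs-human"; loops: number; findings: Finding[] };

// Review, fix, and repeat at most maxLoops times. If a review still reports
// findings on the last allowed loop, stop and hand over to a human.
function runReviewLoop(
  review: () => Finding[],
  fix: (findings: Finding[]) => void,
  maxLoops = 2,
): ReviewOutcome {
  let findings: Finding[] = [];
  for (let loop = 1; loop <= maxLoops; loop++) {
    findings = review();
    if (findings.length === 0) {
      return { status: "clean", loops: loop };
    }
    fix(findings);
  }
  return { status: "needs-human", loops: maxLoops, findings };
}

// Hypothetical stub reviewer: finds one issue on the first pass, none after.
let stubPasses = 0;
const stubReview = (): Finding[] =>
  stubPasses++ === 0
    ? [{ file: "notify.ts", message: "error swallowed silently" }]
    : [];

const outcome = runReviewLoop(stubReview, () => {
  /* the fixing agent would apply changes here */
});
console.log(outcome.status, outcome.loops);
```

The design choice is the hard ceiling: the loop never tries a third pass on its own, so a stuck review reaches a person with its findings attached instead of being argued away.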

Five checks. Most PRs clear in under ten minutes. The ones that do not are where we catch the error handling drift before it compounds.

The Decision: Manual Checklist vs. Automated Review Only

Options considered: rely entirely on the automated adversarial review loop, run the manual checklist on every PR, or run the checklist only on AI-heavy PRs above a line count threshold.

Chosen: manual checklist on every AI-touched PR, run alongside the automated loop.

Rationale: the automated review catches logic bugs and test coverage gaps. It does not catch convention drift. The reviewer agent sees the PR, not the year of incremental decisions that preceded it. The manual checklist covers what the agent cannot: whether this PR matches what we were doing three months ago, before this sprint's context window started.

Tradeoff: five to ten minutes per PR, every PR. That cost is real. The alternative is a codebase that forks from itself slowly enough that no single commit is the problem.

What the Checklist Is Not

The checklist is not a sign that we distrust the tool. We run it because we trust Claude Code enough to use it for 60% of our output. That means trusting it to be wrong in the specific ways models are wrong: locally coherent, globally inconsistent.

The adversarial review finds bugs. The checklist finds drift. Both run. Neither replaces the other.

With the checklist, AI-assisted development keeps its speed and pays a five-to-ten-minute tax on every merge. That is the cheaper option: code that forks from your conventions costs more to maintain than code that does not.

The question worth sitting with: in your codebase right now, is the AI generating code that fits your conventions, or is it slowly writing new ones you did not choose?