Reviewing AI-written code, where the bug is never where you look

AI code shows no signs of struggle, which inverts how a reviewer hunts for bugs. The failure modes it repeats, and where to spend scarce attention.

The cleanest pull request I reviewed this spring had its bug in the safest-looking line. It was a delete-file endpoint, maybe forty lines, and a teammate had generated most of it with an assistant. The diff read like documentation: typed end to end, named well, a green test sitting next to it. I almost approved it on the first pass. The permission check was right there at the top, so the part I’d normally slow down for looked handled.

It checked user.canDelete. A real flag, and the wrong one. canDelete is an account-level capability, the thing that decides whether the trash button renders at all. It says nothing about whether this particular user may delete this particular file, which for a file shared into the account by someone else is the entire question. The endpoint would remove a file out from under its owner.

// it looks like an authorization check. it even is one. it just guards the wrong thing.
async function deleteFile(fileId: string, user: User) {
  const file = await files.get(fileId);
  if (!file) throw new NotFound();
  if (!user.canDelete) throw new Forbidden();  // account capability, not "this file"
  await files.remove(fileId);                  // file.ownerId never enters the decision
}

A missing check would have been easy to catch; an empty spot where a guard belongs is the first thing your eye snags on. This one was present and wrong, and a present check reads, at a glance, as a handled one. That’s the trap.

The tell is gone

I’ve reviewed a lot of code, and for years my instincts were tuned to a human author. People telegraph where their bugs are. The function someone fought with looks fought-with: a comment arguing with itself, a variable renamed twice, a // TODO: revisit, a commit message that just says “finally.” You learn to slow down where the author slowed down, and most of the time it works.

A model writes the hard part and the trivial part in the same even voice. There’s no hesitation to read, so the scent I’d been following isn’t there. The bug isn’t where the author struggled, because in the text nobody struggled; it’s wherever the plausible code happens to be wrong, which tends to be the part that looks most done. The old heuristic, read the messy bits closely and trust the clean ones, runs backwards now. I read the clean ones first.

The shapes the misses come in

After enough of these PRs the misses stop feeling random. Models fail in a handful of recognizable shapes, and naming them lets you go looking instead of hoping to trip over the problem.

Authorization that guards the wrong thing. A model knows, statistically, that endpoints have permission checks, so it writes one; it doesn’t know your authorization model, so it guesses a subject. The check is present and the right shape, and it answers a different question than the one that matters.

APIs and packages that don’t exist. A model writes against an average of every version of every library, not the one in your lockfile. So it calls a method renamed two majors ago, or imports a package whose name sounds right and 404s on npm. Both are real failure modes, and both are the kind a machine catches better than a reviewer does, which I’ll come back to.

// the debounce for the domain-availability check. the package name sounds plausible
// but 404s on npm — invented by analogy from real ones like `p-debounce`.
import { debounceAsync } from 'debounce-async-promise';

The happy path is perfect and nothing else exists. Loading, empty, error, the request that should have been cancelled. The model nails the demo and skips the states that only show up in production. On the frontend the one that bites most is the stale-response race, because it never throws and never appears in a quick click-through.

// flawless when you type slowly. type fast and a slow check for "ma" can resolve
// after the quick one for "mail", flipping the field back to "available" for a name
// the user already finished. nothing cancels the request that's now stale.
useEffect(() => {
  checkDomain(name).then(setAvailable);
}, [name]);

Code that fits a generic pattern but not this codebase. It reaches for fetch and a hand-rolled try/catch when there’s a typed client three imports away that already does retries, auth, and error normalization. It re-derives a selector that already exists. Each line is locally reasonable, and the codebase ends up with one more way of doing something it already did.

Over-engineering. Ask a junior for a one-off and you tend to get something too simple. Ask a model and you get a factory, a config object, a strategy interface, and an extension point nobody requested. It optimizes for looking like mature code, because that’s what the training rewarded, rather than for the smallest thing that would have worked. The junior and the model both need redirecting, just in opposite directions.

Subtle logic at the boundaries. Off-by-one, an inclusive range that should be exclusive, a >= that should be >. These are hard to spot precisely because the code around them is tidy enough that the eye keeps sliding off.

Reviewing a PR you can’t read line by line

A model produces a 900-line PR in an afternoon, and all 900 lines look finished. Read every line with equal weight and most of your attention goes to boilerplate while the one line that matters slips by, so I stopped reading these top to bottom.

I start from intent. What was this supposed to do, and where does correctness actually live? For a delete endpoint that’s the authorization decision and whatever it reads to make it. For a list view it’s the empty, loading, and error branches and the keys. I read those few places carefully and skim the rest to confirm it’s as boring as it looks. Often I’ll write the three or four invariants that have to hold into the PR description first, then go find each one in the code. If I can’t find where an invariant is enforced, that absence is usually the bug, and finding it that way beats reading everything.

Tests are a gate, with a catch most people miss: a test written by the same model that wrote the code is a tautology. It asserts the behavior the code happens to have, bug included. So I read the test harder than the implementation, and adversarially. Does it touch the empty case, the stale response, the unauthorized user? If the only assertions are the happy path, a green run tells you very little.

// the model wrote the code and this test. it's green — and it only checks the happy
// path: type a free name, expect "available". nothing here touches the race from
// earlier, so "all tests pass" means "my bug is untested", not "there's no bug".
test('shows the domain as available', async () => {
  render(<DomainField />);
  await userEvent.type(screen.getByRole('textbox'), 'mail');
  expect(await screen.findByText('available')).toBeVisible();
});

In practice I write the assertions, or at least name the cases, and let the model fill in the scaffolding. The part that says what correct means has to stay with a human.

One more thing, and it’s free: ask the model why. “Why this approach, what did you consider and reject, what happens if the input is empty.” The explanation isn’t trustworthy, and that isn’t the point. The question drags the assumption into the open, and a wrong assumption is easier to catch in a sentence than buried in code. When it can’t justify a line, that line moves to the top of my list.

Trust tracks blast radius, not how clean it looks

Not all code earns the same scrutiny, and cleanliness is the wrong axis to calibrate on. A pure formatting helper with a property test: I trust the model and move on, since the worst case is a wrong label. An authorization boundary, a money flow, a data migration, a cache-invalidation rule: I trust nothing, because the worst case is a deleted file, a double charge, or data that’s quietly wrong for a week before anyone notices.

I did enough billing and subscription work to have a permanent flinch about money paths. A model has no flinch; it writes refund logic as calmly as it writes a button color. So I calibrate to where a mistake is expensive. Generated code pulls the other way, toward whatever reads as most finished, and that’s the pull to resist.

Push the repeatable catches into CI

A reviewer can’t be the only gate. A model produces faster than anyone reads, and most of its failures are mechanical enough that a tool catches them more reliably than I do. Automating them frees human attention for the judgment calls and hands the rote checks to something that runs on every commit without getting bored.

Those first two gates also defuse the failure mode that sounds the scariest. An invented import looks like a supply-chain risk, because a package name that’s free today is a typosquat waiting to happen. But it’s the cheapest of the model’s mistakes to catch: a method that doesn’t exist fails the compile, a package that isn’t in the lockfile fails the install, and no human has to notice either. The scariest-sounding category is the one the machine already owns.

This is the team-level version of something I learned standardizing patterns across a group of engineers: you scale judgment by encoding it where it runs on its own, not by repeating it in review.

The most prolific junior you’ll ever have

Reviewing a model’s code feels familiar once you stop fighting it: it’s mentoring with the human parts removed. It’s the most prolific junior you’ll ever work with, and also one that can’t be grown. You catch the same mistakes this week that you caught last week, because nothing carried over.

That shifts the weight of the job from writing toward reviewing, and it raises the value of what mentoring trains: reading code skeptically, knowing where correctness lives, telling a real reason from a plausible one. My bet is that skill ages slower than the syntax to write fast, though it isn’t immune. A lot of my instincts are wired to a particular stack and its APIs, and those move too. The half-life is just longer.

There’s a tension under all of this that I haven’t resolved, and I don’t think the tooling has either. Generation got cheap and review didn’t. CI absorbs the mechanical half, the invented imports and the missing types. The expensive half, is this the right authorization subject, is this abstraction worth its weight, what happens when the input is empty, is exactly the half a tool can’t judge, and it doesn’t get cheaper because the code arrived faster. The easy code ships sooner, and the bottleneck slides onto the few decisions that were the actual work. On a quiet day I give a generated PR the full treatment. On a day with six of them I triage by blast radius and merge some things I’d have read harder a year ago, and I know I’m doing it. Pretending the judgment scaled with the generation is how teams merge fast now and pay for it later.

You also stop reviewing the author. With a person, review is half a conversation; you calibrate to who wrote it and what they’ll take from your comments. With a model there’s no one to calibrate to and nothing to teach, so it comes down to the code and whether it’s right.

What I check before I approve an AI PR

I still get fooled. A few weeks ago a generated PR wrapped a reasonable-looking retry around a call that wasn’t idempotent, and it sat in production for a day before the duplicate writes turned up in support tickets. In code I’d written myself I’d have remembered the call wasn’t safe to repeat; in someone else’s clean diff I didn’t think to ask. The goal was never zero misses, I missed things back when the code was all human too. It’s to keep the misses out of the expensive places, and to not let a tidy diff talk me into trusting it. The model won’t flag its own bug, so that part of the job stays mine.