26 June 2026 · Alphabench

How to debug a failing test with a coding agent

A failing test is rarely broken where it complains. The assertion blows up in one file while the actual cause sits in another, and you only see the fix once you understand both ends. A flaky test is harder still, since the trigger is some timing or shared-state quirk that a single read of the code tends to miss. An agent can chase this down for you, but you have to treat it as the investigation it is rather than firing a one-line request at it.

Give it something it can reproduce. Hand over the failing assertion or the CI output and let it actually run the test instead of theorising from the source. If the failure is intermittent, tell it to run the test twenty or fifty times, because a flake that fails three times in twenty will pass any single run about 85% of the time. Until it can make the test fail on demand, it is guessing.
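The repeated-run step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the helper name `count_failures` is made up for this example, and it assumes your test runner signals failure with a non-zero exit code, as most runners do.

```python
import subprocess


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and return how many runs exited non-zero.

    Assumes the test runner reports failure through its exit code.
    """
    failures = 0
    for _ in range(runs):
        # Capture output so fifty runs do not flood the terminal.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
    return failures


# Example call (hypothetical test id; substitute your own runner and test):
# failed = count_failures(
#     ["python", "-m", "pytest", "-q", "tests/test_orders.py::test_checkout"], 50
# )
# print(f"{failed}/50 runs failed")
```

A count above zero gives the agent a reproduction to work from, and the same count is the baseline it has to beat once a fix is in.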

Push it to find the cause, not just to quiet the symptom. The behaviour you want is a loop: form a hypothesis, narrow it, follow the trail across files to the root, rather than slapping a patch on whatever line threw. You can tell which one you got from the explanation. A real diagnosis says why it failed and where; a patch just makes the red disappear until it is back next week. Ask for the why.

Make it prove the fix holds. A change that greens the broken test and quietly breaks its neighbour has not fixed anything, so the agent should run the surrounding suite. For a flake, it should also re-run the once-failing test enough times to show it is steadily green and not just lucky once. Keep every step visible and gated behind your approval, so what lands in front of you is a reasoned change, not a hopeful one.
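How many re-runs count as "enough" can be estimated with a little arithmetic. The sketch below assumes each run fails independently at a fixed rate, which is a simplification (real flakes often cluster under load), and both function names are invented for this illustration.

```python
import math


def chance_of_lucky_streak(failure_rate, runs):
    """Probability that a still-flaky test passes `runs` times in a row.

    Assumes independent runs with a constant failure rate.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, max_luck):
    """Smallest run count whose all-green streak has at most `max_luck`
    probability of happening while the flake is still present."""
    if not 0.0 < failure_rate < 1.0 or not 0.0 < max_luck < 1.0:
        raise ValueError("both arguments must be between 0 and 1")
    return math.ceil(math.log(max_luck) / math.log(1.0 - failure_rate))


# A flake that failed 3 times in 20 has an observed failure rate of 0.15.
print(chance_of_lucky_streak(0.15, 1))   # one green run proves little
print(chance_of_lucky_streak(0.15, 50))  # fifty green runs are hard to get by luck
print(runs_needed(0.15, 0.01))           # runs needed for under a 1% chance of luck
```

Under these assumptions, a test that used to fail 15% of the time needs 29 clean passes before luck becomes a less than 1% explanation, and fifty passes leave it well below 0.1%.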

Pier debugs in exactly this rhythm. The "track down a failing test" walkthrough runs the full loop: reproduce under repeated runs, catch the race, fix it, then confirm fifty clean passes. When the red is in your pipeline rather than your code, the "fixing a failing CI build" guide takes the same approach.

There is a happy side effect: this kind of work is cheap to run. An investigation is mostly reading (code, logs, and prior output), with only a small fix written at the end. Reading is the inexpensive part, so you can let the agent dig as long as it needs without watching the bill. The cost-per-task page has real figures for jobs like this.
