Level 5 Coding Agents
At the highest levels of AI-assisted development, humans stop reading code. What replaces them? Adversarial agents that verify your software at runtime. Your CI green check should actually mean something.
I like to think about coding agent autonomy the same way we talk about self-driving: the real question is not just what the system can do, but how much liability a human is willing to take for its mistakes. How much do I actually trust this system’s output?
At lower levels of autonomy, the human is still in the loop enough to absorb the risk. They prompt, inspect, verify, and ultimately take responsibility for the result. At higher levels, that breaks down. The system is producing too much, too quickly, and with too much independence for line-by-line human review to remain meaningful.
Dan Shapiro published a framework for this last January: five levels of AI-assisted development, from "spicy autocomplete" to "the dark factory." Most teams live at Level 2 — collaborative partnership, developer and AI in a flow state, shipping faster than ever. Some are reaching Level 3, where the developer shifts from writing code to reviewing it.
At Level 2, code review still makes sense. The human is deeply embedded in the work. They wrote the spec, they prompted the agent, they understand the context. Reviewing the output is a natural part of the loop.
The problem is that Level 2 doesn't stay Level 2. Agents are getting better. Output volume is climbing. Stripe's coding agents — they call them "minions" — merge 1,300 PRs per week. Ramp's Inspect agent authors 30% of all merged PRs. Some engineers at Anthropic report 100% AI-written code.
The Dark Factory
Shapiro's Level 5 is the dark factory — named after Fanuc's lights-out manufacturing facility where robots build robots with nobody on the floor. At Level 5, code is generated autonomously. No human writes it. No human reviews it.
StrongDM's AI team is already there. Three explicit rules: code must not be written by humans, code must not be reviewed by humans, and each engineer must spend at least $1,000 in tokens per day. They replaced human code review entirely with "scenario testing" — thousands of end-to-end user stories validated per hour against a Digital Twin Universe that replicates their production environment, third-party APIs included.
You don't have to agree that Level 5 is where every team should be today. But it forces the question that matters: if humans stop reviewing code, what replaces them?
Adversarial Verification
Jain's latent.space essay on killing code review proposes five layers of automated verification. The first four are important but incremental — competitive generation, deterministic guardrails, human-defined acceptance criteria, permission architecture. Each one replaces a piece of what human reviewers used to do.
Layer 5 is the one that changes the game: adversarial verification. Separate agents that attack your code at runtime. Architecturally isolated from the agents that wrote it. No knowledge of what checks exist. Their only job is to find ways to break the running application.
This is what Level 5 code review actually looks like. Not a human reading diffs. Not even a smarter linter. An adversary trying to break into your application the way a real attacker would — testing the live system, chaining findings across your actual attack surface, and proving what's exploitable before the code ships.
Static analysis can't do this. A scanner sees source code. It doesn't see what happens when 10 concurrent requests hit a transfer endpoint and all pass the balance check before any of them deduct — a TOCTOU race condition that turns $100 into $1,000, invisible to any scanner because every line of code is correct. The same goes for cross-layer architectural flaws and business logic bugs. Static analysis finds puzzle pieces. It can't assemble the puzzle.
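That race is small enough to sketch. Here the `transfer` function and in-memory `balances` dict are hypothetical stand-ins for a real endpoint and datastore, purely for illustration:

```python
import threading
import time

# Hypothetical in-memory store; a real system would hit a database.
balances = {"alice": 100}

def transfer(account, amount):
    # Time-of-check: every concurrent request sees the original $100...
    if balances[account] >= amount:
        time.sleep(0.05)  # the window between check and use
        # ...time-of-use: each request that passed the check deducts anyway.
        balances[account] -= amount
        return True
    return False

# Ten concurrent $100 transfers against a $100 balance.
threads = [threading.Thread(target=transfer, args=("alice", 100)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balances["alice"])  # well below zero: every transfer cleared the check
```

Every line is individually correct — the balance check exists, the deduction is exact — which is exactly why a scanner sees nothing. Only executing the race exposes it.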
Adversarial verification operates on the running application. It finds the things that only exist at runtime.
Make the Green Check Mean Something
Your CI pipeline should be solid enough that the green check actually means something. Solid enough that it gives you real confidence to merge without reading the diff line by line.
That doesn't mean stop understanding your systems. If a PR fails the smell test — if your spidey senses fire and something feels off — you should absolutely know how to go deep. If you're building software, you should know how that software works. You should know how the computers running it work. That understanding is what lets you build the right verification in the first place, and it's what lets you investigate when the automation isn't enough.
But "I personally read every diff" is not a viable strategy anymore. It's a coping mechanism for a pipeline you don't trust.
A CI pipeline worth trusting at Level 5 has five layers:
- Deterministic guardrails — tests, type checks, linters. The baseline that should already be gating merges.
- AI code review — semantic analysis that catches what linters miss: logic errors, performance regressions, architectural drift, dead code paths. The reviewer that actually reads every line, every time, without fatigue.
- Behavioral validation — end-to-end scenario testing against a running replica. Does the application still do what it's supposed to do? This is StrongDM's "Digital Twin Universe" approach — thousands of user stories validated per hour against an environment that mirrors production.
- Static security analysis — AI-powered vulnerability scanning on every PR. Catches known vulnerability patterns in source before they reach runtime.
- Adversarial verification at runtime — a pentest agent that runs against the live application, attempts real exploitation, and chains findings across the actual attack surface.
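One way to picture how these layers compose into a single status check — the `check_*` functions below are hypothetical stubs standing in for real tooling (your test runner, review agent, scenario harness, scanner, and pentest agent):

```python
# Hypothetical merge gate: each check_* is a stub for real tooling.
def check_guardrails(pr): return True         # tests, type checks, linters
def check_ai_review(pr): return True          # semantic review of the diff
def check_behavior(pr): return True           # scenario tests vs. a replica
def check_static_security(pr): return True    # vuln scan on the source
def check_adversarial(pr): return True        # exploitation attempt at runtime

LAYERS = [
    ("deterministic guardrails", check_guardrails),
    ("ai code review", check_ai_review),
    ("behavioral validation", check_behavior),
    ("static security analysis", check_static_security),
    ("adversarial verification", check_adversarial),
]

def green_check(pr):
    # Fail fast: cheap deterministic layers run first; the slow,
    # expensive adversarial agent runs last.
    for name, check in LAYERS:
        if not check(pr):
            print(f"FAIL: {name}")
            return False
    return True

print(green_check("example-pr"))  # True only when all five layers pass
```

The ordering matters: you don't pay for an adversarial run against a PR that can't pass the linter.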
Layer 1 is table stakes. Layers 2 through 4 get you most of the way. Layer 5 is what makes the green check trustworthy.
At Pensar, this is the problem we're obsessed with — putting an adversarial agent in CI that maps the attack surface from each diff, attempts real exploitation against a sandboxed replica of your running application, and posts findings to the PR before merge. It's Jain's adversarial verification layer, operationalized as a status check. When Stripe's minions are merging 1,300 PRs a week, the reviewer is underwater and the scanner is catching surface-level patterns. The adversarial agent is the one actually trying to break in.
The Transition
You don't have to go full dark factory tomorrow. But if you're building with coding agents — and at this point, most of us are — your verification needs to keep pace with your generation.
The teams that reach Level 5, or even Level 3 with conviction, will be the ones whose green check actually means something. Whose CI pipeline catches the exploitable vulnerability in the agent-written PR, not just the lint error. Whose engineers can trust the automation enough to focus on the work that actually matters now.
Because when building any feature is cheap, the expensive question isn't how to build it — it's whether to build it at all. The highest-leverage thing an engineer does today is enforce taste: deciding what gets built, what gets cut, what architecture serves the problem instead of over-serving it. Gating features behind a high bar of "is this actually useful?" is more valuable than grinding through implementation diffs. A robust CI pipeline — one that verifies correctness, behavior, and security automatically — is what gives engineers the confidence to stay at that altitude instead of descending into line-by-line review.