Method, Not Tax

Measuring AI-native development needs three axes: speed, sufficiency, provenance. The old metrics don't apply.

May 17, 2026

DORA called it the verification tax. The 2026 report introduced the term to describe the work an engineer now does that they did not have to do before: reading agent-generated code that looks remarkably similar to correct code, catching the things that look right and aren’t, doing the labor of verifying what was generated rather than what was written. Tax, they called it. The implication is that something new and onerous has been added to the engineer’s day.

Read “verification” broadly. It is not just reviewing the code; it is the full work of testing, validating, and gating output you did not directly author.

The name accepts a premise that deserves to be rejected: that verification is overhead, a cost imposed on the work rather than the work itself.

The cost was never new — only deferred. Engineers did not write comprehensive tests pre-2025 because writing tests was slower than writing code, and writing code that passed comprehensive tests was harder still, so teams rationalized the gap. They called it pragmatism, they called it moving fast, and they paid the bill anyway, as tech debt, as production incidents, as the six-week refactor nobody could explain to the business. Nobody called that cost a tax. They probably should have.

What agents change is the economics. When the agent writes the test as fast as the code, the threshold is defined before generation begins, and validation runs alongside, the old excuse evaporates. What looks like new overhead is a debt finally being paid at creation rather than years later under fluorescent lights in a war room.

This isn’t a tax. It’s mise en place — the prep a chef always wanted, finally economical at speed. The kitchen doesn’t bolt prep onto cooking; the prep is the cooking, the discipline that makes execution at volume possible without chaos. An instrumented agent loop is the same: tests written first against defined thresholds, validations running alongside generation, gates closing the loop before destructive action, provenance accumulating so that what shipped can be explained, defended, or undone.

Not overhead. Method.

Answering yesterday’s question

DORA is real research, and so is SPACE. The people doing them are serious, and what they built was the right answer to the question they were asking. That question was about software whose unit of production was the human engineer.

That is not the question that matters for AI-native engineering.

The model is wrong. DORA’s metrics describe a delivery pipeline whose primary unit is the human-team deployment, validated by practices like working in small batches and continuous integration. SPACE assumes a human in the loop with feelings, capacity, and a working week. When the agent writes the code, runs the tests, opens the PR, and merges in eight minutes, none of those primitives survive.

The approach is wrong. The 2025 DORA survey found that 61% of respondents had never used agentic workflows at the time of the survey. That is the instrument speaking honestly. The framework was built on observations of a population that had not yet seen the era it is now being read as describing.

The statistics are not wrong; they are calibrated to the wrong distribution. A 10x deploy frequency in a human-paced team was a quality signal about practice maturity. The same number from a system where a Cursor agent merges at 3 AM on a Saturday means something else, and may mean nothing at all.

This is the same problem I named in The End of LGTM about GitHub. Elegant tool, built around the assumption that humans wrote and humans reviewed. The elegance doesn’t translate when the agent is the producer. Agile’s principles survive the transition because they were always about people. GitHub survives in degraded form. DORA is useful as a rear-view mirror. It dies as the primary instrument panel for agent-led development, because it cannot tell you whether the loop that produced the change deserves trust.

DORA’s recent AI work correctly diagnoses that AI amplifies the system it enters: mature platforms get accelerated, and so do manual bottlenecks and review theater. The diagnosis is right. The instruments pointed at it are not. Throughput, instability, change failure rate, and recovery time tell you whether something went wrong, not whether the agent loop was designed to be defensible.

For the world a meaningful subset of teams is moving toward, the measurements need to be rebuilt.

Compilers became trustable because they were deterministic. Agents will become trustable when the loop around them is instrumented: defined thresholds, gating verification, provenance that survives review.

Augmented and Dark Factory

Most enterprises today are operating in augmented mode. Humans set intent, orchestrate work between agents, and attest to outcomes. Agents do the labor of writing, testing, and proposing. The boundary between agent and engineer still has a recognizably human-paced rhythm at the seams.

A meaningful and growing subset is heading somewhere different. Intent in, software out, no human in the inner loop. StrongDM has been the most explicit about naming this; Factory’s marketing copy states directly that humans cannot manually code and that agents review their own work. Garry Tan’s gstack and Anthropic’s EPCC are pointing in the same direction with more human presence at the seams.

For some companies, especially startups built around extreme leverage, the Dark Factory is the aspiration. If the future includes tiny teams building multi-billion-dollar companies, or even the one-person unicorn Silicon Valley has started to imagine, it will not happen through human-paced operating models. It will require dark factories of many kinds: software, support, finance, marketing, operations.

That does not mean every enterprise should race to remove humans from the loop everywhere. It means the capability is becoming strategically important. Leaders need to work through the J-curve: build the muscle, learn where autonomy creates advantage, learn where it creates unacceptable risk, and develop enough confidence in the instrumentation to apply the tool deliberately rather than theatrically.

The diagnostic that distinguishes the two modes is simple. When was the last time a human read the diff before it merged? If the honest answer is “this morning, for every PR,” you are in augmented mode. If the honest answer is “I’m not sure, the system handles it now,” you are in or near Dark Factory mode, and you may not have noticed when you crossed over.

The transition rarely announces itself. A team automates test generation, then review, then remediation, then release notes, then deployment, then rollback decisions. Each step feels local and reasonable. Eventually the human review function that absorbed audit, regulatory, and reputational risk has been replaced by whatever evidence the loop emits as it runs. The three axes that follow are what that emission has to produce.

Speed at the unit of value

The first axis measures speed at the unit an executive or customer actually recognizes, not at the unit of stage. Two metrics carry the weight — one for time, one for cost — and they only mean something together.

Time to good. Wall clock from intent accepted to shipped output meeting the sufficient-outcome threshold, measured at three resolutions: feature, epic, application. The agent can iterate as much as it wants; what matters is when the output crosses the bar. Iteration is part of the cycle, not a failure — for an agent that writes a test as quickly as it writes the code, iteration is the cycle. A cycle that converges in eight minutes against a defined threshold is doing the right work, however many laps it took.

Total task cost to good. The economic measurement underneath the throughput claim. Token cost is one input; task cost is the unit. The total includes compute, tools and APIs, human orchestration time priced at orchestrator rate, verification cost, and rework cost. Reporting task cost rather than token cost is what makes the line item legible to a CFO. It answers what one delivered task actually cost the firm end-to-end. Most enterprises in 2026 still track developer hours, and developer hours have stopped meaning what they used to mean.
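A minimal sketch of how the two speed metrics might be computed from one task record. The `TaskRecord` shape, its field names, and the dollar figures are assumptions made for illustration, not the schema of any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TaskRecord:
    intent_accepted: datetime
    # One (finished_at, passed_threshold) pair per agent iteration.
    attempts: list[tuple[datetime, bool]]
    compute_cost: float          # tokens and inference, in dollars
    tool_cost: float             # tools and APIs, in dollars
    orchestration_hours: float   # human steering time
    orchestrator_rate: float     # dollars per hour
    verification_cost: float
    rework_cost: float


def time_to_good(task: TaskRecord) -> Optional[timedelta]:
    """Wall clock from intent accepted to the first attempt that met the bar."""
    passing = [done for done, passed in task.attempts if passed]
    return min(passing) - task.intent_accepted if passing else None


def total_task_cost(task: TaskRecord) -> float:
    """End-to-end cost of one delivered task, not just its token bill."""
    return (task.compute_cost + task.tool_cost
            + task.orchestration_hours * task.orchestrator_rate
            + task.verification_cost + task.rework_cost)


start = datetime(2026, 5, 1, 9, 0)
task = TaskRecord(
    intent_accepted=start,
    attempts=[(start + timedelta(minutes=3), False),
              (start + timedelta(minutes=8), True)],
    compute_cost=4.0, tool_cost=1.0,
    orchestration_hours=0.25, orchestrator_rate=120.0,
    verification_cost=2.5, rework_cost=0.5,
)
print(time_to_good(task))     # 0:08:00
print(total_task_cost(task))  # 38.0
```

In this made-up example the failed first lap does not count against the task. The clock stops when an attempt crosses the threshold, and the human steering time (30.0 of the 38.0) dominates the token bill.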

In augmented mode, watch the steering-to-fixing pattern: lots of steering and little fixing is healthy; the inverse is autocomplete dressed as an agent. In Dark Factory mode, the question shifts to what one unit of compute is buying per unit of delivered value.

What dies in either mode is the legacy of lines of code, hours saved, and story points completed — metrics that confused activity for value, and that Goodhart predicted in 1975 would corrupt the signal they carried.

Sufficiency, defined and gated

The middle axis is the contract. Sufficiency is not aspirational; it is the operational statement of what counts as good.

The work runs in three steps. The threshold gets defined per phase and per context: what counts as sufficient for this plan, documentation, test, code, or deploy, given the failure mode and the stakes that attach to failure. The definition is context-sensitive and binds before generation begins, not after — and applies to production-grade work, not personal experimentation.

Tests and validations are then built against the threshold. The threshold is the contract; the tests are the evidence the contract was honored. Tests written this way do not measure whether the code compiles or whether the obvious branches return what they should. They measure whether the output meets the bar that was set for the work to be allowed forward.

Production is gated on the pass. Binding, not advisory. When the tests pass, the output moves; when they fail, the output goes back.

Anthropic’s Claude Managed Agents, announced May 13, 2026, ships this pattern as a product feature called Outcomes: a rubric defining success, an agent working toward it, a separate grader evaluating against the criteria. The labs are building exactly what this section describes.

To make this concrete: for a low-risk UI copy change, sufficiency may mean an accessibility check, a visual regression pass, and product-owner approval. For a payment reconciliation change, sufficiency may mean deterministic test coverage, synthetic transaction replay, policy-as-code checks, rollback proof, and finance and control owner attestation. The framework is the same; the threshold is what changes. Defining what counts as sufficient is most of the work of building an agent-native loop.
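The two examples above can be sketched as data plus one binding gate. This is an illustration under stated assumptions: the `Threshold` class, the check names, and the owner labels are hypothetical stand-ins for whatever a real pipeline would record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str
    owner: str                       # explicit owner of the contract
    required_checks: frozenset[str]  # defined before generation begins


# Hypothetical thresholds mirroring the two examples in the text.
UI_COPY = Threshold(
    name="ui-copy-change",
    owner="product-owner",
    required_checks=frozenset(
        {"accessibility", "visual_regression", "product_owner_approval"}
    ),
)
PAYMENT_RECON = Threshold(
    name="payment-reconciliation-change",
    owner="finance-and-control",
    required_checks=frozenset({
        "deterministic_tests", "synthetic_replay", "policy_as_code",
        "rollback_proof", "owner_attestation",
    }),
)


def gate(threshold: Threshold, results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Binding gate: every required check must be present and passing.

    A check that was never run counts the same as a check that failed.
    """
    missing_or_failed = sorted(
        check for check in threshold.required_checks
        if not results.get(check, False)
    )
    return (not missing_or_failed, missing_or_failed)


ok, gaps = gate(PAYMENT_RECON, {
    "deterministic_tests": True,
    "policy_as_code": True,
    "rollback_proof": False,
})
print(ok, gaps)
# False ['owner_attestation', 'rollback_proof', 'synthetic_replay']
```

The same `gate` function serves both changes. Only the threshold differs, which is the point: the framework is constant and the bar is context-sensitive.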

Threshold ownership is itself a design decision: a joint sign-off between product, compliance, and risk leads in regulated environments, and a single product owner where stakes permit. Ownership has to be explicit, not implicit, because a threshold no one owns is a contract no one will defend.

In high-reliability industries, verification is not overhead. It is the production system. DO-178C, the FAA airborne software certification standard, requires verification objectives derived from requirements and gates certification on pass. 21 CFR Part 11, FDA’s electronic records regulation, requires defined acceptance criteria and an audit of pass and fail. IAEA’s claims-arguments-evidence methodology for nuclear safety cases requires the claim be the threshold, the argument be the validation, and the evidence be the pass record. Each is a sufficiency contract institutionalized so long that engineers, physicians, and operators living inside it think of it as the work.

One architectural distinction changes how the metric gets instrumented: a sufficiency threshold is applied after generation, when the output is checked against the bar before it moves. A pre-condition gate is applied before execution, when an action is checked against a constraint before it runs. Most discussion of agent quality collapses the two; they are different instrumentation moves with different failure modes, and the PocketOS database incident in April makes the difference concrete.
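To show what "before execution" means in code, here is a minimal sketch of a pre-condition gate. The verb list, the `prod` naming convention, and the approval parameter are all assumptions chosen for illustration. A real gate would sit at the tool or credential layer, not in application code the agent can route around.

```python
from typing import Callable, Optional


class GateRefused(Exception):
    """Raised when an action is blocked before it runs."""


# Hypothetical constraint: destructive verbs against production targets
# need a named human approver.
DESTRUCTIVE_VERBS = {"drop", "delete", "truncate"}


def precondition_gate(action: str, target: str,
                      approved_by: Optional[str]) -> None:
    """Check the constraint; raise instead of letting the action proceed."""
    destructive = action.lower() in DESTRUCTIVE_VERBS
    if destructive and target.startswith("prod") and not approved_by:
        raise GateRefused(f"{action} on {target} requires human approval")


def run_gated(action: str, target: str, execute: Callable[[], str],
              approved_by: Optional[str] = None) -> str:
    """The gate runs first; `execute` is only called if the gate allows it."""
    precondition_gate(action, target, approved_by)
    return execute()


calls = []
try:
    run_gated("delete", "prod-db", lambda: calls.append("ran") or "done")
except GateRefused as err:
    print(f"blocked: {err}")
print(calls)  # [] because the destructive call never executed
```

A sufficiency threshold would inspect the result after the work exists. Here there is no result to inspect, because the refusal happens before `execute` is ever called.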

The metrics on the sufficiency axis are contract-shaped. Threshold completeness asks whether this output has a defined threshold at all. Threshold compliance asks whether production-bound work passed all required thresholds before ship. Rework rate counts how often threshold failure caused a redo. DORA’s failure-side metrics, change failure rate and defect-escape rate, survive as downstream signals that the sufficiency contract leaked. A high change failure rate in a system with high threshold compliance means the threshold itself was underspecified; the bar was wrong.
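A rough sketch of the three contract-shaped metrics, assuming a hypothetical `Change` record. One interpretation is an assumption of this sketch rather than something the definitions above settle: rework rate is computed here as the share of changes with at least one threshold-driven redo.

```python
from dataclasses import dataclass


@dataclass
class Change:
    has_threshold: bool     # a threshold was defined before generation
    production_bound: bool
    passed_all: bool        # passed every required threshold before ship
    redo_count: int         # redos caused by threshold failure


def share(numerator: int, denominator: int) -> float:
    """Fraction in [0, 1]; 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else 0.0


def sufficiency_metrics(changes: list[Change]) -> dict[str, float]:
    prod = [c for c in changes if c.production_bound]
    return {
        # Does this output have a defined threshold at all?
        "threshold_completeness": share(
            sum(c.has_threshold for c in changes), len(changes)),
        # Did production-bound work pass everything before ship?
        "threshold_compliance": share(
            sum(c.passed_all for c in prod), len(prod)),
        # How often did a threshold failure force a redo?
        "rework_rate": share(
            sum(c.redo_count > 0 for c in changes), len(changes)),
    }


metrics = sufficiency_metrics([
    Change(has_threshold=True, production_bound=True, passed_all=True, redo_count=0),
    Change(has_threshold=True, production_bound=True, passed_all=True, redo_count=2),
    Change(has_threshold=False, production_bound=True, passed_all=False, redo_count=0),
    Change(has_threshold=True, production_bound=False, passed_all=True, redo_count=1),
])
```

In this made-up sample, completeness is 0.75, compliance is two of three production-bound changes, and rework rate is 0.5.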

Verified provenance

Provenance is the trail the work leaves behind: where it came from, how it got built, what happened, who attested to it. The metrics here describe artifacts rather than rates, which is what makes them largely Goodhart-resistant. You cannot game whether the chain is complete.

The point of verification is failure independence. Many enterprise agent-on-agent review pipelines in production today fail at exactly that.

Verification failure-mode diversity. The share of changes where the verifier (human or agent) has demonstrably different failure modes from the generator. When two LLMs trained on similar data, with similar post-training regimes, evaluated against similar benchmarks, share blind spots, “independent verification” becomes a label rather than a property. The MM-JudgeBias benchmark, published in April 2026, documented an analogous failure mode: LLM judges become unreliable when evidence is missing or mismatched, and unstable under irrelevant perturbations.

Low diversity: a Claude-family model generates a change, another Claude-family model reviews it. Higher diversity: deterministic policy-as-code, property-based tests, static analysis, runtime simulation, a different model family, and a human risk owner where the stakes require it. Not every change needs every verifier; every verification path needs to be architected for independence. Same input, same model family, same benchmark incentives, same blind spots is the same opinion delivered twice.
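One crude way to put a number on this, offered as a sketch and not a validated measure: treat a change as diverse when at least one verifier differs from the generator in kind or in model family. Whether two verifiers "demonstrably" fail differently is an empirical question that this proxy does not answer. The `Actor` fields and the labels below are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    kind: str    # e.g. "llm", "static_analysis", "policy_as_code", "human"
    family: str  # model family for LLMs; tool or role name otherwise


def is_diverse(generator: Actor, verifiers: list[Actor]) -> bool:
    """True if at least one verifier differs in kind or model family."""
    return any(
        v.kind != generator.kind or v.family != generator.family
        for v in verifiers
    )


def diversity_share(changes: list[tuple[Actor, list[Actor]]]) -> float:
    """Share of changes reviewed by at least one dissimilar verifier."""
    if not changes:
        return 0.0
    return sum(is_diverse(g, vs) for g, vs in changes) / len(changes)


gen = Actor("llm", "family-a")
changes = [
    (gen, [Actor("llm", "family-a")]),                           # same opinion twice
    (gen, [Actor("llm", "family-a"), Actor("policy_as_code", "rules")]),
    (gen, [Actor("llm", "family-b")]),
    (gen, []),                                                   # unverified
]
print(diversity_share(changes))  # 0.5
```

An unverified change and a same-family review score identically here, which matches the argument above: the second reviewer added a label, not a property.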

Traceability completeness. Share of agent-shipped changes whose record is fully intact: intent, prompt, tools, decisions made, diff, tests, test results, identities of agent and human, before and after state, time stamps. DO-178C’s traceability requirement restated for the agent loop. Either the chain is intact or it isn’t.
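Because the chain is either intact or not, the check is simple to sketch. The field names below are a hypothetical encoding of the list above. A real record would also need integrity guarantees, such as signing or append-only storage, that this sketch leaves out.

```python
# Hypothetical field names for the record described in the text.
REQUIRED_FIELDS = (
    "intent", "prompt", "tools", "decisions", "diff", "tests",
    "test_results", "agent_id", "human_id", "state_before",
    "state_after", "timestamps",
)


def missing_fields(record: dict) -> list[str]:
    """Fields that are absent or empty in one change record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [], {})]


def traceability_completeness(records: list[dict]) -> float:
    """Share of records with every required field present and non-empty."""
    if not records:
        return 0.0
    intact = sum(1 for r in records if not missing_fields(r))
    return intact / len(records)


full = {field: "recorded" for field in REQUIRED_FIELDS}
partial = dict(full, prompt="", state_before=None)

print(missing_fields(partial))                     # ['prompt', 'state_before']
print(traceability_completeness([full, partial]))  # 0.5
```

There is no partial credit. A record missing its before-state counts as broken, however complete the rest of it is.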

Non-destructive compliance. Share of production-touching changes that preserve prior state. 21 CFR Part 11 §11.10(e) restated: record changes shall not obscure previously recorded information. What this metric directly tests is reversibility, the ability to undo an operation and recover prior state, which is what failed at PocketOS. Idempotency, the related property that an operation produces the same result when repeated, is a distinct concern. Senior engineers have held both as principles for years; agents on the other end of the action make them regulator-grade.
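A toy illustration of the reversibility property, assuming an in-memory store invented for this sketch. It is not a database design. It shows one way a write path can preserve prior state, with the undo itself recorded as a new version so that reverting does not obscure what came before.

```python
class VersionedStore:
    """Append-only key/value store: earlier values are never overwritten."""

    def __init__(self) -> None:
        self._history: dict[str, list[object]] = {}

    def put(self, key: str, value: object) -> None:
        # Append rather than overwrite, so prior state stays readable.
        self._history.setdefault(key, []).append(value)

    def get(self, key: str) -> object:
        return self._history[key][-1]

    def history(self, key: str) -> list[object]:
        return list(self._history[key])

    def revert(self, key: str) -> object:
        """Restore the previous value by appending it as a new version."""
        versions = self._history[key]
        if len(versions) < 2:
            raise ValueError(f"no prior state recorded for {key!r}")
        restored = versions[-2]
        versions.append(restored)
        return restored


def non_destructive_compliance(preserved: list[bool]) -> float:
    """Share of production-touching changes that preserved prior state."""
    return sum(preserved) / len(preserved) if preserved else 0.0


store = VersionedStore()
store.put("daily_rate", 40)
store.put("daily_rate", 55)
store.revert("daily_rate")
print(store.get("daily_rate"))      # 40
print(store.history("daily_rate"))  # [40, 55, 40]
```

Note that `revert` only steps back one version, and calling it twice toggles between the last two values. That limitation is acceptable for showing the property, not for production use.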

When no human is in the loop to absorb the verification work, the provenance record is the only check that survives. A category of independent production-trace audit vendors is forming around exactly this gap, and the regulated-industry clients who will need to answer for it are already starting to ask.

What the framework would have caught

On April 25, 2026, a Cursor coding agent running one of the most capable models available deleted the production database of PocketOS, a car-rental platform startup. Then it deleted the backups. Three months of reservations, signups, payment records, and vehicle assignments were gone. The destructive action took nine seconds. The operational recovery unfolded over roughly thirty hours, with public accounts differing on the exact database restoration timeline. The agent apologized.

The agent had the rules and had read them. It evaluated its action against them after the fact, recognized the violation, and produced the apologetic output. What it did not have was the pre-condition gate that would have interrupted the destructive action before execution, the failure-mode-diverse verifier that would have caught the request as it was forming, or the provenance complete enough for the founder to reconstruct what had just been erased. Three architectural absences, one nine-second outcome.

This is what DORA’s tax framing misses. The agent did not skip a tax. It lacked a method. What the method would have done was not new work added on top of the agent’s productivity — it would have made that productivity real instead of dangerous.

In 2018, measurement existed to tell you whether a human engineering organization was healthy. In 2026, it has to tell you whether the agent that just shipped the change can be defended.

Speed at the unit of value. Sufficiency, defined and gated. Verified provenance.

The metrics underneath each axis

Speed — time to good · total task cost to good

Sufficiency — threshold completeness · threshold compliance · rework rate

Provenance — verification failure-mode diversity · traceability completeness · non-destructive compliance

The recognition that produced the apology should have run before the action.

Andrew's Substack
