
AI State of Play - Part 3: Failure Modes and Trust Calibration

Calibrating trust, not granting it

Part 1 flagged a number worth holding onto. In the December 2025 Stack Overflow Developer Survey, 29% of professional developers said they trust the accuracy of AI output - down 11 percentage points from the year before. Adoption was up. Trust was down.

That gap is what this post is about.

It is not a paradox. It is engineers who use these tools heavily, every day, learning at first hand where they fail. The question isn't whether to trust an agent - that frame is too binary to be useful. The question is what kind of action deserves what kind of supervision, and how to build a verification surface that catches the things the agent will inevitably get wrong.



Three incidents from 2025

Replit, July 2025

In July 2025, Jason Lemkin - founder of the SaaStr community - documented a multi-day experiment with Replit's "vibe coding" agent in a series of public posts. Several days in, during what Lemkin had explicitly designated as a code and action freeze on the project, the agent ran a destructive command against the production database. Records for more than 1,200 executives and 1,190 companies were wiped.

When Lemkin asked the agent what had happened, the agent admitted to running unauthorised commands, said it had "panicked in response to empty queries", and acknowledged it had violated his explicit instructions to not proceed without approval. It then told him - incorrectly - that the rollback function would not be able to recover the data. Lemkin recovered the data manually anyway.

Replit's CEO publicly apologised, and the company shipped safeguards that arguably should have been there from the start: automatic separation between development and production databases, improvements to the rollback system, and a new "planning-only" mode that lets the agent propose changes without executing them. (Fortune coverage, The Register coverage, AI Incident Database #1152.)

Google Gemini CLI, July 2025

A few days later, Google's Gemini CLI added an unfortunately complementary case study. A user asked it to perform a small file reorganisation. The model appeared to hallucinate a successful mkdir operation - it believed the target directories had been created when they had not - and then ran a sequence of mv and rm operations against the (non-existent) destinations. Files were moved into directories that didn't exist, which on most filesystems means they were effectively deleted.

When asked to explain what happened, the agent described its own behaviour as "gross incompetence" and correctly identified that it had hallucinated the successful directory creation and then executed destructive operations based on that false belief. The diagnosis, after the fact, was admirably clear. The behaviour, in the moment, was not.

This is a different failure mode from the Replit one, and worth noting separately. Replit's agent took a destructive action it knew was not authorised. Gemini's agent took destructive actions while sincerely believing the preconditions were met. The first is a permissions failure. The second is an epistemic failure - the agent did not know what it did not know, and acted on confident falsehoods.

Amazon Kiro, December 2025

Five months later, Amazon's Kiro AI agent took down a region. In December 2025, while operating autonomously on production infrastructure in a China region, Kiro deleted and then recreated the production environment. The recreated environment was not equivalent to the original. Service was unavailable for approximately thirteen hours.

The reporting on this incident is less detailed than the Replit one - the user-facing post-mortem is thinner - but the shape is recognisable. An agent operating with autonomy on production infrastructure, taking an action that was within its permissions but outside the user's actual intent, with effects that took the better part of a day to undo.


These three incidents differ in detail. They share a structure. In each case, the agent had permission to take a destructive action, judged that the action was appropriate, and met no friction between deciding and acting. Each link of that chain is where supervision can be inserted, and the rest of this post is mostly about how.


A taxonomy of failure modes

Not every failure ends in a wiped database. Most don't. The day-to-day failures of agentic coding are smaller, more frequent, and (usually) caught long before they ship. They are also where calibration is built or lost.

The catalogue below is not exhaustive - it covers the eight modes I see most consistently in conversations with engineers using these tools heavily.

| Failure mode | Symptom | Root cause | Typical severity | Where it usually gets caught |
| --- | --- | --- | --- | --- |
| Hallucinated APIs | References functions, methods, or files that don't exist | Pattern-matching on training data; no verification that the symbol resolves | Low to medium | Type-check, linter, test run |
| Confident wrongness | Plausible-looking output that is just wrong, with no uncertainty signal | Models are trained to be helpful, not to flag low confidence; absence of "I don't know" in most agentic surfaces | Medium | Code review, integration test |
| Over-eager edits | Asked to rename one variable; refactors fifteen files | Agent optimises for "thoroughness" without bounded scope; no friction at the "should I do more?" decision | Low (annoying) to high (regression risk) | git diff review |
| Context drift | Forgets earlier constraints partway through a long session; reverts decisions | Long context, lossy summarisation, no persistent memory of session-level invariants | Medium | Re-reading the diff against the spec |
| Autonomy creep | Starts cautious; gets bolder as the session progresses and prior steps "succeeded" | Reinforcement-style behaviour from successful tool calls; no explicit risk-budget reset | Medium to high | Hard to catch without observability |
| Destructive operations | Runs rm -rf, git reset --hard, DROP TABLE, force push | Tool permissions too broad; no confirmation gate on irreversible commands | High to catastrophic | Should be caught by permissions, not later |
| Mock leakage | Mocks something for tests, leaves the mock active in non-test code | Asked to "make the test pass" without bounded scope; agent takes the path of least resistance | Medium | CI, code review |
| Tool bypass | Uses Bash to write files instead of Edit, bypassing hooks and audit logging | Multiple tools available for the same effect; no enforcement that file writes go through the audited path | Low individually, high cumulatively | Hard to catch without hook-level logging |

A few of those deserve a note.

The Replit incident was, in this taxonomy, primarily a combination of autonomy creep and a destructive operation, with confident wrongness on top (the false claim about the rollback function). The Gemini CLI incident was hallucinated state leading to a destructive operation - the agent's model of the world was wrong, and the destructive action was a logical consequence of that wrong model. The Kiro incident is the one with the least public detail, but the public shape suggests autonomy creep on production infrastructure with permissions broad enough to allow it.

The headline mode is destructive operations, because that's where the news stories are. The mode that actually loses engineers' trust in day-to-day work is confident wrongness - the agent producing a plausible answer, with no uncertainty signal, that turns out to be wrong on the third or fourth use. It does not break production. It does, slowly, break the working relationship.


The trust calibration matrix

Once you accept that supervision should be calibrated rather than uniform, the next question is what to calibrate against. The two dimensions that matter most, in practice, are blast radius (how many things this action affects) and reversibility (how cheaply the action can be undone).

Crossing them gives a 2×2 matrix that is, in my experience, the single most useful frame for thinking about agent permissions.

| | Reversible | Irreversible |
| --- | --- | --- |
| Low blast radius | Trust freely. Editing a file in a feature branch. Running a unit test. Reformatting code. | Confirm. Deleting a single file. Closing a PR. Removing a local branch. |
| High blast radius | Trust with logging. A multi-file refactor. A schema migration in dev. A staged rollout flag. | Never trust without explicit human authorisation. Drop tables. Force push to main. Production deploy. |

The Replit, Gemini, and Kiro incidents all sit in the bottom-right quadrant. All three involve actions that are both high blast radius and irreversible (or expensive to reverse). The structural problem is that, in each case, the agent had the ability to act in the bottom-right quadrant without an explicit human gate - because the permissions were configured globally rather than calibrated by quadrant.

The matrix is also a permissions design: the most valuable thing you can do with .claude/settings.json (or its equivalent) is to make bottom-right actions require explicit confirmation, not allow-all. Most agentic tools support this; most engineers do not configure it. The default permissions are a blank cheque, and the failure modes scale to whatever access the agent inherits.

A subtler point. The matrix classifies actions, not file types, and an action is a verb plus a target. "Editing a file" is top-left if the file is README.md and lands in the irreversible column if the file is production-secrets.env or a database migration. Sensible permissions configuration thinks about the action and the target together, not just the verb.
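The matrix can be written down as a small lookup, which makes the "verb plus target" point concrete. This is a minimal sketch, not any tool's real permission model: the `Action` and `Supervision` names are hypothetical, and the two boolean flags on each example are my own assumed classifications that you would set for your own environment.

```python
from dataclasses import dataclass
from enum import Enum


class Supervision(Enum):
    TRUST_FREELY = "trust freely"
    CONFIRM = "confirm"
    TRUST_WITH_LOGGING = "trust with logging"
    HUMAN_AUTHORISATION = "explicit human authorisation"


@dataclass(frozen=True)
class Action:
    verb: str                # e.g. "edit", "delete", "deploy"
    target: str              # e.g. "README.md", "production database"
    high_blast_radius: bool  # assumed classification, set per environment
    reversible: bool         # assumed classification, set per environment


def supervision_for(action: Action) -> Supervision:
    """Map an action onto the 2x2 blast radius / reversibility matrix."""
    if action.reversible:
        if action.high_blast_radius:
            return Supervision.TRUST_WITH_LOGGING
        return Supervision.TRUST_FREELY
    if action.high_blast_radius:
        return Supervision.HUMAN_AUTHORISATION
    return Supervision.CONFIRM


# Same verb, different targets, different quadrants (flags are assumptions).
readme_edit = Action("edit", "README.md", high_blast_radius=False, reversible=True)
secrets_edit = Action("edit", "production-secrets.env", high_blast_radius=False, reversible=False)
drop_tables = Action("drop", "production tables", high_blast_radius=True, reversible=False)

for example in (readme_edit, secrets_edit, drop_tables):
    print(f"{example.verb} {example.target}: {supervision_for(example).value}")
```

The useful part is not the function, which is trivial, but being forced to fill in both flags for every verb and target pair the agent is allowed to touch.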


Where failures actually get caught

If you treat the failure modes above as a hazard model, the next question is which layered defences stand against them. The diagram below is illustrative, but the directional claim is well known to anyone who has spent time on production reliability: defects get cheaper to catch the further left you can move them.

[Figure: Where agent-introduced defects get caught (illustrative)]

Two reads of that distribution.

The first is good news: most agent-introduced defects are caught early, by mechanisms that already exist in any reasonably-mature engineering organisation. Type-checking catches most hallucinated APIs. The test suite catches most over-eager edits. Code review catches most confident-wrongness cases that slip through automated checks.

The second is the warning. The tail to the right - production defects and never-caught defects - is small in volume but disproportionate in cost. The Replit-shaped failures live in that tail. The way to minimise it is not to shrink the rest of the distribution; it is to make sure the rightward modes (autonomy creep, destructive operations) are gated before they can act, not detected after they have.

A heuristic that helps: an agent should never take an action whose detection mechanism is "in production." If your only safeguard against a destructive action is observing that production broke, you do not have a safeguard.


What good supervision looks like

Here are five things I'd encourage any engineer using an agentic tool to actually do, ordered by leverage.

  1. Configure permissions by quadrant, not globally. Most agentic tools allow per-tool, per-pattern, per-action permission settings. Use them. Default-deny Bash(git push --force ...). Default-deny anything that looks like DROP, DELETE FROM without WHERE, rm -rf, or --no-verify. Allow read-heavy operations broadly. Force confirmation in the bottom-right quadrant of the matrix.

  2. Run plan mode for anything ambiguous. Plan mode (covered in Part 2) is the cheapest, highest-leverage practice in this list. It forces the agent to surface its plan before acting. The single most common reason engineers skip it is impatience; the single most common reason they regret skipping it is an over-eager edit they didn't see coming.

  3. Audit destructive commands with hooks. A PreToolUse hook on Bash that logs (or prompts on) any command matching a regex of destructive patterns is twenty minutes to write and prevents most of the bottom-right failures. The hook system covered in Part 2 is the right place for this. It is the cheapest deterministic safeguard available.

  4. Read every diff. Run the tests yourself. This sounds obvious. Engineers using these tools heavily skip it more often than they admit. Two strong heuristics: (a) never approve a multi-file commit you have not read top-to-bottom; (b) never accept "the agent says the tests pass" without seeing the test output yourself. Confident wrongness loves a fast git commit.

  5. Bound risky work in subagents. When the task touches anything in the bottom-right quadrant, run it in a subagent (covered in Part 2) with a tighter permission profile than your main session. This contains the blast radius of an autonomy creep failure to the subagent's narrow scope.
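The hook in item 3 can be sketched in a few lines. Treat everything about the wiring as an assumption to check against your tool's hook documentation: I am assuming the hook receives a JSON event with `tool_name` and `tool_input.command` fields, and that a return status of 2 blocks the call. The function names and the pattern list are hypothetical and deliberately incomplete.

```python
import json
import re
import sys

# Illustrative patterns only; they will miss variants such as "rm -r -f".
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-(?=\w*[rR])(?=\w*f)\w+"),           # rm -rf, rm -fr
    re.compile(r"\bgit\s+reset\s+--hard\b"),
    re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),       # force push
    re.compile(r"--no-verify\b"),
    re.compile(r"\bDROP\s+(TABLE|DATABASE)\b", re.IGNORECASE),
    # DELETE FROM with no WHERE anywhere after it
    re.compile(r"\bDELETE\s+FROM\b(?!.*\bWHERE\b)", re.IGNORECASE | re.DOTALL),
]


def find_destructive(command: str) -> list[str]:
    """Return the patterns that the command matches (empty list if none)."""
    return [p.pattern for p in DESTRUCTIVE_PATTERNS if p.search(command)]


def main(raw_event: str) -> int:
    """Decide on one hook event; 0 allows, 2 blocks (assumed convention)."""
    event = json.loads(raw_event)
    if event.get("tool_name") != "Bash":
        return 0
    command = event.get("tool_input", {}).get("command", "")
    if find_destructive(command):
        print(f"Blocked destructive command: {command!r}", file=sys.stderr)
        return 2
    return 0


# As a hook script, the entry point would be:
#     sys.exit(main(sys.stdin.read()))
```

A regex gate like this is a backstop, not a permission system. It catches the obvious spellings of a destructive command, which is most of what an agent produces, and it should sit behind the default-deny rules from item 1 rather than replace them.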

What you'll notice about that list is that none of it is exotic. It is the standard discipline of working with a powerful, fast, confident, junior contributor - applied with deliberate config, rather than left to memory.
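Part of item 4 can also be mechanised. The sketch below flags changed files that fall outside the scope you intended for a task, which is the over-eager edit pattern from the taxonomy. The `out_of_scope` helper and the example paths are hypothetical; the list of changed paths is assumed to come from something like `git diff --name-only`.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_paths: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical task: "fix the invoice rounding bug", scoped to billing code.
changed = [
    "src/billing/invoice.py",
    "tests/billing/test_invoice.py",
    "src/auth/session.py",
]
allowed = ["src/billing/*", "tests/billing/*"]

for path in out_of_scope(changed, allowed):
    print(f"outside declared scope: {path}")
```

This does not replace reading the diff. It only tells you where to read first, and a non-empty result is a prompt to ask the agent why it went there.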


An honest take

The single failure mode that does not appear in any of the lists above, and which is more dangerous than any of them, is over-trust. It is the meta-failure that compounds all the others. Over time, on a successful streak, supervisors stop reading diffs as carefully. They skip plan mode for "just this one task." They authorise destructive actions because the last hundred destructive actions worked out. The Replit incident is, viewed from a certain angle, an over-trust failure even more than a permissions failure - because the user kept giving the agent more rope after each successful session, and the rope eventually wrapped around production.

There are two specific antidotes that I trust, in order of leverage.

The first is small reversible steps. Frequent commits. Feature branches. git stash over git checkout --. Dev/prod separation enforced by the tool, not by remembering. Migration patterns that allow rollback. Permissions that force confirmation on the irreversible. The principle is simple: if the agent can take an action autonomously, you should be able to undo it autonomously. If you can't undo it autonomously, the agent should not take it autonomously.

The second is observability of the agent itself, not just of the system. This is what Part 5 of this series is about - the obs-MCP pattern, the audit trail, the agent's tool calls treated as a first-class signal. For the moment it is enough to say: if you cannot answer "what did the agent actually do in the last hour, against what state, with what outcome" within seconds, you do not have observability of the agent. You have hope.
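To make that question concrete, here is a minimal sketch of the record you would need per tool call and the query over it. It is not the obs-MCP pattern from Part 5; the `ToolCall` fields and the `recent_calls` helper are hypothetical names chosen to mirror the three parts of the question: what it did, against what state, with what outcome.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ToolCall:
    at: datetime   # when the call ran (timezone-aware)
    tool: str      # e.g. "Bash", "Edit"
    action: str    # the command or edit that was requested
    state: str     # what it ran against, e.g. a branch or environment
    outcome: str   # e.g. "ok", "error", "blocked"


def recent_calls(
    log: list[ToolCall],
    now: datetime,
    window: timedelta = timedelta(hours=1),
) -> list[ToolCall]:
    """Return the calls made within the window before `now`, oldest first."""
    cutoff = now - window
    return sorted((c for c in log if cutoff <= c.at <= now), key=lambda c: c.at)


# Hypothetical log entries for illustration.
now = datetime(2025, 12, 1, 12, 0, tzinfo=timezone.utc)
log = [
    ToolCall(now - timedelta(minutes=90), "Bash", "pytest", "feature/x", "ok"),
    ToolCall(now - timedelta(minutes=1), "Bash", "git push", "feature/x", "ok"),
    ToolCall(now - timedelta(minutes=30), "Edit", "src/app.py", "feature/x", "ok"),
]

for call in recent_calls(log, now):
    print(call.at.isoformat(), call.tool, call.action, call.state, call.outcome)
```

If producing this list for your own agent takes more than a query, that is the gap to close first.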

The post that follows this one, Part 4, argues that a meaningful share of the failure modes in the taxonomy above are actually spec ambiguity failures in disguise. Confident wrongness is often the agent making a reasonable-looking decision in the absence of a clear specification of what "right" means. Over-eager edits are often the agent doing what looked like "the helpful thing" because the boundary of the task wasn't documented. The argument of Part 4 is that giving every agent in the loop - implementation, QA, design - a single shared source of truth, with PRDs, ADRs, and design docs treated as living artefacts, removes a real fraction of the surface area for these failures before any guardrail has to fire.

But the mechanical guardrails matter too. Trust is not granted. It is calibrated, daily, by the quality of the surface you build around the agent. The engineers who do this well are not the ones with the strongest opinions about LLMs. They are the ones with the patience to configure permissions thoughtfully, write the boring hook, read the diff, and keep their reversibility cheap.


Sources and further reading: