AI State of Play - Part 5: Toward Autonomy

Part 4 ended at the moment of release. The auth feature - public API, Cognito M2M, three roles, ADR-007 - lands in production. This post is what happens after.

The shape that follows is best seen as a single loop, not a sequence. Docs are at the start of every cycle. They are also at the end. CI/CD, observability, incident response, and postmortem are the path between those two points - the path through which production learning becomes the next iteration's spec.

The dotted line is the part that's easy to miss and the part that matters most.

This post traces the right-hand half of that diagram - from merge to postmortem and back. It is deliberately lighter on artefact-level detail than Part 4; the goal is the shape of the loop, not a re-walk of every stage at the same depth. The auth and RBAC feature carries through as the illustrative anchor, but the worked-example weight has moved off it.

Where Part 4 left off
After the merge: CI/CD agents
In production: the observability stack
When something breaks: the incident shape
The back-arrow: postmortem as next iteration's spec
What works, what's still a demo
What's deliberately not in this post
Closing reflection

Where Part 4 left off

The auth feature is built. The PRD enumerates the role matrix. ADR-007 captures the Cognito-M2M-with-scopes decision. The dev agent has implemented JWT verification middleware, route-level role and scope guards, and IaC for the Cognito resource server. The QA subagent has generated end-to-end tests directly from the acceptance criteria - including AC-1, the test that asserts a Member receives 403 on POST /v1/users:bulkInvite and that the bulk-invite control is absent from the UI. The PR is approved. The merge is green.

Everything from here is the part most blog posts about agentic coding don't go near.

After the merge: CI/CD agents

The first agent the work meets after merge is the CI/CD agent - or, more honestly, the constellation of automation that surrounds the deploy. There is no single "CI/CD agent" in the sense that there's a single dev agent or a single QA agent. There are three things to keep in mind.

The QA agent's report is now a CI gate, not just a developer convenience. The same agent that generated tests from the PRD in Part 4 runs in CI on every PR, with a single new responsibility: report which acceptance criteria are covered, which are missing, and which are ambiguous. A PR cannot merge if coverage of any AC has dropped relative to main. This is mechanical; it is also the cheapest forcing function the team has against doc-and-code drift.
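A minimal sketch of that gate, assuming the QA agent's report can be reduced to a mapping from acceptance-criterion ID to one of three states. The `Coverage` enum, its ordering (missing < ambiguous < covered), and the report shape are illustrative assumptions, not the format the series uses.

```python
from enum import IntEnum


class Coverage(IntEnum):
    # Assumed ordering: a lower value means weaker coverage.
    MISSING = 0
    AMBIGUOUS = 1
    COVERED = 2


def coverage_regressions(
    main: dict[str, Coverage], pr: dict[str, Coverage]
) -> list[str]:
    """Return IDs of acceptance criteria whose coverage dropped relative to main."""
    regressions = []
    for ac_id, baseline in main.items():
        # An AC that disappears from the PR report is treated as MISSING.
        current = pr.get(ac_id, Coverage.MISSING)
        if current < baseline:
            regressions.append(ac_id)
    return sorted(regressions)


# Hypothetical reports: AC-2 slips from covered to ambiguous on the PR branch.
main_report = {
    "AC-1": Coverage.COVERED,
    "AC-2": Coverage.COVERED,
    "AC-3": Coverage.AMBIGUOUS,
}
pr_report = {
    "AC-1": Coverage.COVERED,
    "AC-2": Coverage.AMBIGUOUS,
    "AC-3": Coverage.AMBIGUOUS,
    "AC-4": Coverage.MISSING,  # new AC, not yet covered: not a regression
}

failed = coverage_regressions(main_report, pr_report)
print("blocked" if failed else "ok", failed)
```

New criteria that start out uncovered do not block the merge in this sketch; only a drop against main does. Whether a brand-new, uncovered AC should also block is a policy choice the gate would need to make explicit.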

A deploy agent watches the canary. Canary deployments expose a small fraction of traffic to the new version while the old one continues to serve. The deploy agent's job during the canary window is to read metrics from the observability stack (next section) and either advance the rollout or roll it back. Concretely: it reads error rate, latency p95/p99, saturation, and a small set of feature-specific signals defined in the PRD's "monitoring" section, compares them to the previous version's baseline, and applies a pre-agreed promotion or rollback rule. The agent does not decide; the rule was set in advance. The agent executes the rule faster and more reliably than a human watching dashboards.
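The "rule set in advance" can be as plain as a few thresholds. The sketch below is one possible shape, not the rule any particular team uses: the `Metrics` and `Rule` types, the field names, and the threshold values are all assumptions, and saturation and the PRD's feature-specific signals are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    latency_p95_ms: float
    latency_p99_ms: float


@dataclass(frozen=True)
class Rule:
    max_error_rate_delta: float  # allowed absolute increase over baseline
    max_latency_ratio: float     # allowed canary/baseline latency ratio


def decide(baseline: Metrics, canary: Metrics, rule: Rule) -> str:
    """Apply the pre-agreed rule; the function executes it, it does not judge."""
    if canary.error_rate - baseline.error_rate > rule.max_error_rate_delta:
        return "rollback"
    latency_pairs = (
        (baseline.latency_p95_ms, canary.latency_p95_ms),
        (baseline.latency_p99_ms, canary.latency_p99_ms),
    )
    for base_ms, canary_ms in latency_pairs:
        if canary_ms > base_ms * rule.max_latency_ratio:
            return "rollback"
    return "promote"


# Hypothetical numbers for illustration only.
rule = Rule(max_error_rate_delta=0.005, max_latency_ratio=1.2)
baseline = Metrics(error_rate=0.002, latency_p95_ms=120, latency_p99_ms=300)
healthy = Metrics(error_rate=0.003, latency_p95_ms=125, latency_p99_ms=310)
print(decide(baseline, healthy, rule))
```

Keeping the rule as data rather than as agent reasoning is the point: the thresholds can be reviewed and versioned like any other artefact, and the agent's contribution is speed, not discretion.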

A release-notes agent drafts the changelog. The same artefacts that fed dev and QA - the PRD, the linked ADRs, the merged PRs - are now the input to a release-notes draft. The draft is human-edited (the customer-facing tone is a real call), but the structure is the agent's. This is uninteresting plumbing until you realise that having an honest, accurate, automatically-generated changelog is a meaningful piece of the back-arrow when, six months later, the team is trying to reconstruct what shipped when.

None of this is exotic. All three patterns predate agentic coding. What's new is that the agent doing each piece reads the same upstream documents - PRD, ADR, merged code - that every other agent in the loop reads. The CI/CD layer is a consumer of the same source of truth as the dev and QA layers.

In production: the observability stack

Once the canary promotes, the work moves into the observability surface. The shape that has consolidated around agentic operations in 2025 and into 2026 is roughly the following:

Grafana via Grafana MCP - dashboards, time-series metrics, traces. The agent can query "what was the error rate on /v1/users:* over the last 24 hours, broken down by role claim?" and get a real answer back, not a screenshot.
Sentry via Sentry MCP - errors, stack traces, release tagging, user impact. The agent's first stop when something looks off is usually here, because the trace already includes the context the agent would otherwise have to reconstruct.
incident.io via incident.io MCP - the lifecycle of an incident itself: open, assignment, status updates, resolution, the postmortem document. The agent participates in the incident as a first-class actor, not as a tool a human invokes.

There is also a fourth piece, less written about but worth flagging: observability of the agents themselves. Tools like Langfuse and the OpenTelemetry GenAI semantic conventions treat the agent's own tool calls, prompts, and outputs as a first-class signal. An autonomous agent with no observability of its own behaviour is a worse risk than an autonomous engineer with no logs. If you can't answer "what did the agent do, against what state, with what outcome" for the last hour, you do not have a supervisable system. You have hope.
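To make the "what did the agent do, with what outcome" question concrete, here is a standard-library stand-in for agent tracing. It is deliberately not the Langfuse or OpenTelemetry API; the `traced` decorator, the in-memory `AGENT_LOG`, and the `grep_codebase` tool are hypothetical names used only to show the shape of the record.

```python
import functools
import time
from typing import Any, Callable

# In-memory stand-in for a real trace backend.
AGENT_LOG: list[dict[str, Any]] = []


def traced(tool_name: str) -> Callable:
    """Record every call to an agent tool: inputs, outcome, and start time."""
    def wrap(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def inner(*args: Any, **kwargs: Any) -> Any:
            record: dict[str, Any] = {
                "tool": tool_name,
                "args": repr(args),
                "kwargs": repr(kwargs),
                "started_at": time.time(),
            }
            try:
                result = fn(*args, **kwargs)
                record["outcome"] = "ok"
                record["result"] = repr(result)
                return result
            except Exception as exc:
                record["outcome"] = "error"
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure, so no call goes unrecorded.
                AGENT_LOG.append(record)
        return inner
    return wrap


@traced("grep_codebase")
def grep_codebase(pattern: str) -> list[str]:
    # Hypothetical tool body; a real one would search the repository.
    return [f"routes/work_items.py: {pattern}"]


def actions_since(seconds_ago: float) -> list[dict[str, Any]]:
    """Answer 'what did the agent do in the last N seconds?'."""
    cutoff = time.time() - seconds_ago
    return [r for r in AGENT_LOG if r["started_at"] >= cutoff]


grep_codebase("bulkArchive")
print(len(actions_since(3600)), AGENT_LOG[-1]["outcome"])
```

A production setup would export these records to a trace backend rather than a list, and would also capture the state the tool ran against; the sketch only shows that the record has to be written on every call, including failed ones.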

The honest claim is that this stack is good enough today for the agent to be a useful participant in incident response. It is not yet good enough for the agent to be a confident principal across the full lifecycle. The reason for that distinction is in the next two sections.

When something breaks: the incident shape

Three weeks after release, a customer support ticket lands. A user with the role Member in their organisation reports that they were able to perform an admin-only action. Specifically, they archived a colleague's work in bulk via the customer-facing UI - a button they should not have been able to see, attached to an endpoint they should not have been able to call.

What follows is, deliberately, an outline rather than a deep walkthrough. The point is the shape of the loop, not the timestamps of a single incident.

The Sentry MCP surfaces the relevant traces: a token with role: member claim, calling POST /v1/work-items:bulkArchive, returning 200. The Grafana MCP shows a small but unmistakable spike in audit events tagged role-mismatch:archive over the last twenty-four hours. The incident.io MCP opens an incident, assigns the on-call engineer, and adds the agent as a participant.

The incident agent reads the PRD (docs/prd/2026-Q2-public-api-and-rbac.md) and ADR-007 (docs/adr/0007-external-api-authentication.md). It looks up the role matrix and confirms what the matrix says: Member should not be able to bulk-archive. It then greps the codebase for work-items:bulkArchive, finds the endpoint, and notices that the route handler does not carry the role-guard annotation that the rest of the API uses.

Then it does the thing that matters most. It greps the PRD for bulkArchive - and finds nothing. The endpoint is not in the spec. It was added two weeks after release as part of a follow-on feature, and the PRD was not amended.

The cause is now legible. The follow-on feature shipped without the doc-driven discipline this whole series argues for. The frontend team gated the UI control on role; the backend team added the endpoint without a server-side role check; the PRD was not updated to include the new endpoint in the role matrix; the QA agent therefore had no acceptance criterion to compile into a test; the deploy gate had nothing to fail on. The spec was the bug. The implementation was a logical consequence of a spec that was silent on the new endpoint.
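One way to make a silent spec fail loudly is a default-deny check on the server: a route that is absent from the role matrix is refused for everyone. The sketch below is an assumption-laden illustration rather than the guard annotation the codebase in this post uses; the `ROLE_MATRIX` contents, the lowercase role names, and the `Forbidden` exception are hypothetical.

```python
# Hypothetical role matrix, keyed by "METHOD path". In the post's setup this
# would be derived from the PRD rather than written inline.
ROLE_MATRIX: dict[str, set[str]] = {
    "POST /v1/users:bulkInvite": {"admin"},
}


class Forbidden(Exception):
    """Raised when a request must be rejected with a 403."""


def authorise(route: str, role: str) -> None:
    allowed = ROLE_MATRIX.get(route)
    if allowed is None:
        # Default deny: a route the spec is silent on is closed to every role.
        raise Forbidden(f"{route} is not in the role matrix")
    if role not in allowed:
        raise Forbidden(f"role {role!r} may not call {route}")


def is_allowed(route: str, role: str) -> bool:
    try:
        authorise(route, role)
    except Forbidden:
        return False
    return True


print(is_allowed("POST /v1/users:bulkInvite", "admin"))
print(is_allowed("POST /v1/work-items:bulkArchive", "admin"))
```

Under this variant the unlisted bulk-archive endpoint would have been unusable by anyone, including admins, until the matrix was amended. That is a noisier failure than a silent privilege gap, which is the trade being made.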

This is exactly the failure mode Part 4 called out as the most dangerous - docs that lie, or in this case docs that have a hole. The agent does not catch the bug by re-running tests against new behaviour. It catches the bug by noticing that the behaviour in production has no corresponding entry in the spec at all, and that absence is itself the signal.

The fix is mechanical from there. A hotfix that adds the role guard. A PRD amendment that adds the bulk-archive endpoint to the matrix. A QA agent re-run that generates the missing acceptance criteria as new tests. A repository-wide sweep, run as a subagent, that lists every route in the codebase and flags any that don't appear in the PRD's role matrix. Two more endpoints turn up; the matrix is amended; the suite expands. The agent proposes; the engineer reviews and authorises each change; the merges land.
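The repository-wide sweep reduces to a set difference between routes found in code and routes named in the matrix. The sketch below assumes routes are declared with a hypothetical `@route("METHOD", "/path")` decorator and that the matrix is available as a set of "METHOD path" strings; a real codebase would need a parser suited to its own framework.

```python
import re

# Matches the hypothetical decorator form: @route("POST", "/v1/...")
ROUTE_PATTERN = re.compile(
    r'@route\(\s*"(?P<method>[A-Z]+)"\s*,\s*"(?P<path>[^"]+)"\s*\)'
)


def routes_in_source(source: str) -> set[str]:
    """Extract 'METHOD path' strings from source text."""
    return {f"{m['method']} {m['path']}" for m in ROUTE_PATTERN.finditer(source)}


def unlisted_routes(source: str, matrix_routes: set[str]) -> list[str]:
    """Routes that exist in code but have no entry in the role matrix."""
    return sorted(routes_in_source(source) - matrix_routes)


# Illustrative input: one route the matrix knows about, one it does not.
SOURCE = '''
@route("POST", "/v1/users:bulkInvite")
def bulk_invite(): ...

@route("POST", "/v1/work-items:bulkArchive")
def bulk_archive(): ...
'''
MATRIX_ROUTES = {"POST /v1/users:bulkInvite"}

print(unlisted_routes(SOURCE, MATRIX_ROUTES))
```

The same function can serve as a CI check by exiting non-zero whenever the returned list is non-empty.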

The back-arrow: postmortem as next iteration's spec

The incident agent fires /postmortem - a slash command of the kind covered in Part 2. The output is a draft postmortem with timeline, root cause, contributing factors, customer impact, and a list of action items, mapped to owners. The engineer rewrites two paragraphs (the customer-facing summary, and the section on contributing factors - the agent's first pass was too generous to the team), and files it through the incident.io MCP.

The action items are the interesting part of this section, because they are how the loop genuinely closes.

A small subset, lightly fictionalised:

Hotfix the missing role guard on /v1/work-items:bulkArchive. Owner: backend.
Amend the PRD to include bulkArchive, bulkExport, and bulkDelete in the role matrix. Owner: product, with engineering review.
Regenerate the QA suite against the amended PRD. Acceptance criteria for the three new bulk endpoints become tests. Owner: QA agent (CI gate).
Add a CI check that fails if any route in the codebase does not appear in the PRD's role matrix. Owner: platform.
Update the PR template to require a "PRD updated?" checkbox for any PR adding or modifying a route. Owner: platform.
Update the design system so the design agent treats role-affordances as first-class - the bulk-archive control should be conditioned on role at the design layer, not just the implementation layer. Owner: design.
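The third item in that list, regenerating tests from the amended matrix, can be pictured as expanding the matrix into one expected-status case per route and role. This is a sketch of the idea rather than how the QA agent works: the matrix contents and role names are hypothetical, and treating 200 as the success status is an assumption.

```python
def expected_status_cases(
    matrix: dict[str, set[str]], roles: list[str]
) -> list[tuple[str, str, int]]:
    """Expand a role matrix into (route, role, expected HTTP status) cases."""
    cases = []
    for route in sorted(matrix):
        for role in roles:
            # Assumed convention: 200 for permitted calls, 403 otherwise.
            status = 200 if role in matrix[route] else 403
            cases.append((route, role, status))
    return cases


# Hypothetical amended matrix, after the bulk-archive endpoint is added.
AMENDED_MATRIX = {
    "POST /v1/users:bulkInvite": {"admin"},
    "POST /v1/work-items:bulkArchive": {"admin"},
}

for case in expected_status_cases(AMENDED_MATRIX, ["admin", "member"]):
    print(case)
```

Each tuple maps directly onto a parametrised end-to-end test, so adding a row to the matrix adds tests without anyone writing them by hand.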

Look at where those action items land. Three of them are amendments to documents - the PRD, the design system, the PR template. Two are amendments to agent behaviour - the QA agent's coverage report and the platform CI check. One is the immediate fix.

The postmortem does not just close the incident. It updates the documents that the next agent in the loop will read. Six months from now, when an unrelated team adds a users:bulkArchive-like feature, the dev agent reads a PRD that contains the role matrix amendments from this incident; the QA agent generates tests against an acceptance-criteria template that already covers bulk operations; the design agent produces UI that conditions the new bulk control on role. The bug from this incident does not recur, not because the team remembered, but because the docs the agents read have been amended.

This is what "loop closes" actually means. Production learning becomes spec amendment. The next cycle starts from a strictly better spec than the last cycle did. The docs are both the start and the end of every cycle, and each pass through makes them better.

What works, what's still a demo

This is the part of the post that needs to be honest about its own claims. The loop above is real - I have watched versions of it work in different shapes and on different stacks. It is also not a finished system. Three honest qualifications.

The agent works best when the loop is bounded. Detect, hypothesise, propose, draft - all four are mature behaviours in 2026 for the kind of incident described above (well-scoped, well-instrumented, with a clean spec to read against). Authorise, judge, and confirm are still humans' jobs, and should be. The Replit, Gemini, and Kiro incidents discussed in Part 3 are exactly the failure mode of trying to skip those last three. The agent did the loop autonomously and the loop closed against a hallucinated state, an unauthorised action, or a misjudged blast radius. Production-grade autonomy looks like supervision moved upstream, not removed.

The agent is unreliable on novel domains. The auth and RBAC scenario in this post is a textbook case - the failure pattern is well-known, the observability surface is mature, the runbook lives in the agent's training data and the team's docs. Take the same agent into a domain where the failure mode is unfamiliar, the metrics are noisy, and the docs are thin, and the same loop produces confident-looking nonsense. The harder a problem is for a senior engineer, the more carefully you should supervise the agent on it. Easy problems get more autonomy, hard problems get more supervision - which is, unhelpfully, the inverse of what most teams' instincts are.

The agent is not adversarial-aware by default. It assumes good-faith inputs - in the spec, in the metrics, in the codebase, in the artefacts it reads. Where the failure comes from a malicious actor (a customer probing for an RBAC gap, a compromised internal service, a corrupted dependency), the agent will reason through the symptom and propose a structural fix that may or may not address the actual adversary. The agent is a good first responder, not a good detective.

The healthy posture, after a year of experimenting in this shape, is something like: the agent does eighty percent of the legwork, the human does one hundred percent of the judgement. The leverage is real and large. The temptation to round eighty up to ninety up to a hundred is the failure mode of every demo I have ever seen.

What's deliberately not in this post

A loop with this many stages will never be complete with one set of agents. The catalogue of agents that this series did not cover, and which deserves its own treatment, includes at least:

Code review agents - between dev and QA, focused on style, security, and design-system conformance.
Security agents - SAST, threat modelling, dependency vulnerability triage.
Performance and cost agents - regression detection on metrics, cloud spend anomalies, query plans.
Migration sweepers - "remove this once X" cleanups, deprecation drives, dependency upgrades across a monorepo.
Customer-feedback agents - support ticket triage, feature-request clustering, sentiment trend analysis - the other back-arrow into the PRD that runs in parallel to the postmortem one.
Release notes and changelog agents - touched on briefly above, but with much more to say.
Dependency and supply-chain agents - automated bumps, advisory triage, lockfile reconciliation.

These are real and might be explored in a future post. The reason for keeping them out of this one is the same reason this post traded depth for shape: a complete catalogue is a different document from a closed loop, and conflating the two produces neither.

Closing reflection

If you have read this series end-to-end, the argument is roughly the following.

Part 1 argued that the shift from autocomplete to agentic coding is a category change, not a quantitative one - the keyboard is no longer where the work happens. Part 2 walked the surface area of the tool you stand in front of when you do that work - skills, hooks, MCP, plan mode, subagents, slash commands - through three real-shaped workflows rather than a feature tour. Part 3 was the honesty post: where agents go wrong, why, the trust calibration matrix, and the meta-failure mode of over-trust. Part 4 made the central claim of the series - that the same documents should feed every agent in the loop, and that this is what makes the loop coherent. This post traced the loop closing - from merge to postmortem to amended spec - and was honest about where it works and where it does not.

The throughline is unromantic. Agents are powerful, fast, confident, junior contributors with no skin in the game and no innate sense of what's at stake. The discipline that makes them useful is the same discipline that makes any junior contributor useful: a clear spec, a feedback loop, a defined verification surface, a record of decisions, and a supervisor with the patience to read the diff. The differences are scale and cadence. Done well, the leverage is real. Done badly, the failures scale to whatever access the agent inherits.

What changed for me, writing this series, is that I stopped seeing agentic coding as primarily a story about models and started seeing it as primarily a story about documents. The model is interchangeable. The supervision pattern is interchangeable. What is irreducibly the team's responsibility is the corpus of artefacts - PRDs, ADRs, role matrices, runbooks, postmortems, design systems - that every agent in the loop reads as ground truth.

Build that corpus carefully. Keep it honest. The agents will do the rest.

Sources and further reading:

Anthropic - Claude Design (Anthropic Labs) - the design agent that takes a PRD and produces visual designs.
Grafana - "An MCP server for Grafana" - the official Grafana MCP server announcement.
Sentry - Sentry MCP documentation - error monitoring as an agent tool.
incident.io - MCP launch announcement - incident lifecycle as an agent tool.
Langfuse - observability for agent applications: traces, costs, evaluations.
OpenTelemetry GenAI semantic conventions - the standardised schema for instrumenting agent and LLM calls.
AI Incident Database entries referenced in Part 3 - real failures of the autonomy that this post is honest about.
