AI State of Play - Part 4: Docs as the Source of Truth

One spec, multiple consumers

Part 3 closed on a claim that's worth picking up here: a meaningful share of the failure modes in agentic coding are spec ambiguity failures in disguise. Confident wrongness is often the agent making a reasonable-looking decision in the absence of a clear specification of what "right" means. Over-eager edits are often the agent doing what looked like "the helpful thing" because the boundary of the task wasn't documented.

The argument of this post is direct. The cheapest, highest-leverage thing engineering teams can do to work well with agents is to take their own product and architecture documents seriously again. PRDs, ADRs, design docs, and acceptance criteria - artefacts most teams treat as "process tax" - become the substrate that makes agentic engineering work. The same documents feed the implementation agent, the QA agent, and - increasingly - the design agent. Anthropic shipped Claude Design through Anthropic Labs recently as a concrete instance of the third one; the pattern generalises. Single source of truth, multiple consumers. The spec is compiled into code, tests, and UI.

This is the idea that half of the series rests on. It is also where I have changed my mind most over the last year: I used to think PRDs and ADRs were valuable but expensive overhead. Agents change the cost-benefit. Documents that compounded only weakly when humans read them compound much more strongly when agents do.

To make any of this concrete, the rest of this post walks a single worked example end-to-end: a small B2B SaaS adding a public API and role-based access control. The same example carries through into Part 5, where the auth feature will go to production and an incident will hit it.

The thesis
The worked example
Stage 1: The PRD
Stage 2: The ADR
Stage 3: The dev agent reads, plans, builds
Stage 4: The QA agent reads the same docs
Stage 5: The loop closes
Failure modes specific to this approach
An honest take

The thesis

For decades, the "single source of truth" argument for engineering documentation has been about humans: keep the docs accurate so engineers, on-call responders, and new hires can find what they need. The argument is sound. It also did not, in most teams I have seen, generate enough payoff to justify the cost. Docs went stale because their main consumer was a future engineer, sometime, maybe.

What changes with agents is who reads the docs.

When the implementation agent picks up a task, it reads everything in its working set: the prompt, the relevant code, and any docs you have pointed it at - CLAUDE.md, skills, ADRs, design docs, runbooks. When the QA agent (a subagent or a separate session) generates tests, it reads the spec. When the design agent produces UI, it reads the same PRD - the role matrix is a UI constraint as much as it is an API one. When an on-call agent triages an incident in Part 5, it reads the ADR for the affected service to understand the design's intent before proposing a fix. Every artefact you write is now read multiple times, by multiple agents, across the lifecycle of the work.

That changes the math. A doc that was previously read three times by humans across its lifetime might now be read three hundred times across an agent-mediated SDLC. The cost of writing it stays roughly the same; the value compounds.

The mechanical version of this argument is even simpler. The implementation agent and the QA agent should not be fed two different specs. If they are, they will diverge. If the QA agent's tests are based on what the engineer thought the requirements were and the implementation is based on what the agent inferred from the codebase, you are testing one fiction against another. If both agents are reading the same PRD - the same acceptance criteria, the same role matrix, the same non-goals - the system is internally consistent by construction.

This isn't a new idea. Acceptance test-driven development has argued for it since the early 2000s. What's new is that agents make it cheap.

The worked example

The product is a customer-facing web app sold to small and mid-sized teams. Up to now, the application has had a single category of user, authenticated via AWS Cognito User Pools. The new piece of work is to expose a public API to customers, so they can build their own integrations - which introduces three concerns simultaneously:

End-user authN against the API, reusing the existing Cognito User Pool.
Service-to-service authN, for partner systems and internal tools that don't have a human user. Implementation: Cognito Machine-to-Machine (M2M) with custom resource server scopes.
Role-based access control across the existing app and the new API, with three roles: Admin (org management, billing, all data), Manager (manage their team, invite users, all team data), Member (read team data, edit their own work).

That last piece - RBAC - is the one whose details will matter most when Part 5 hits an incident, so it gets disproportionate attention here.

Stage 1: The PRD

The PRD is the document the engineering team and the product team agreed on. It captures what is being built and why, deliberately not how. In an agentic SDLC, the PRD has one new responsibility: its acceptance criteria need to be precise enough for an agent to compile into tests.

A useful PRD for this feature has roughly the structure below. The fragment is short on purpose - the post is about the role of the document, not how to write a fifty-page spec.

docs/prd/2026-Q2-public-api-and-rbac.md

# Public API and RBAC

**Status**: Approved | **Owner**: Product (Pat) | **Engineering lead**: Eng (Kim)
**Target release**: 2026-Q2 | **Linked ADRs**: ADR-007

## Goals

1. Customers can build read/write integrations against tenant data via a
   public API, authenticated as either an end user or a service principal.
2. Three application roles - Admin, Manager, Member - with consistent
   enforcement across the web app and the public API.
3. No regressions to existing end-user flows in the web app.

## Non-goals

- We are *not* exposing org-level or cross-tenant data in this release.
- We are *not* introducing custom roles or per-resource ACLs in this release.
- We are *not* changing the existing pricing or contracts.

## Roles and capabilities (the role matrix)

| Capability                          | Admin | Manager | Member |
| ----------------------------------- | :---: | :-----: | :----: |
| View team members                   |  ✓    |   ✓     |  ✓     |
| Invite a single user                |  ✓    |   ✓     |        |
| **Bulk-invite users (CSV upload)**  |  ✓    |   ✓     |        |
| Remove a user from team             |  ✓    |   ✓     |        |
| Change a user's role                |  ✓    |         |        |
| Manage billing                      |  ✓    |         |        |
| Read tenant data                    |  ✓    |   ✓     |  ✓     |
| Edit own work                       |  ✓    |   ✓     |  ✓     |

## Acceptance criteria (the seed for tests)

- AC-1: An authenticated **Member** receives `403` on `POST /v1/users:bulkInvite`
  and the bulk-invite UI on `/team/manage` is not rendered.
- AC-2: An authenticated **Manager** receives `2xx` on `POST /v1/users:bulkInvite`
  with valid input and `400` with malformed CSV.
- AC-3: A request without a valid Cognito token receives `401` on any
  `/v1/...` endpoint.
- AC-4: A service principal token without the `users:write` scope receives
  `403` on `POST /v1/users:*` regardless of role claims.
- (... AC-5 through AC-19 omitted for brevity ...)

Three things about that fragment are worth pulling out, because they are the things the agent needs.

First, the role matrix is a table. Tables have an unreasonable advantage in agent context: they are unambiguous, easy to parse, and easy to diff when something changes. The contrast - prose like "Members generally cannot perform team management actions, with some exceptions for..." - is what produces confident-wrongness failures. The agent fills in the gap with what seems reasonable. Tables remove the gap.
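As a minimal sketch of why the table form matters, the matrix in the PRD fragment can be read mechanically into a lookup structure. The names below (`parseRoleMatrix`, `RoleMatrix`, `splitRow`) are illustrative assumptions, not part of the example repo, and the parser only handles simple pipe tables of the shape shown in the fragment.

```typescript
// Roles as named in the PRD's role matrix.
type Role = 'Admin' | 'Manager' | 'Member';

// Capability name -> roles that hold it.
type RoleMatrix = Record<string, Role[]>;

// Split one markdown table row into trimmed cell strings.
function splitRow(line: string): string[] {
  return line
    .trim()
    .replace(/^\|/, '')
    .replace(/\|$/, '')
    .split('|')
    .map((cell) => cell.trim());
}

// Parse a pipe table whose first column is the capability and whose
// remaining columns are roles; a "✓" cell grants the capability.
function parseRoleMatrix(markdown: string): RoleMatrix {
  const rows = markdown
    .split('\n')
    .filter((line) => line.trim().indexOf('|') === 0);
  const roles = splitRow(rows[0]).slice(1) as Role[];
  const parsed: RoleMatrix = {};
  // rows[1] is the |---|:---:| separator, so data rows start at index 2.
  rows.slice(2).forEach((line) => {
    const cells = splitRow(line);
    const capability = cells[0].replace(/\*\*/g, ''); // drop bold markers
    parsed[capability] = roles.filter((_role, i) => cells[i + 1] === '✓');
  });
  return parsed;
}

// A few rows copied from the PRD fragment.
const prdFragment = [
  '| Capability                          | Admin | Manager | Member |',
  '| ----------------------------------- | :---: | :-----: | :----: |',
  '| Invite a single user                |  ✓    |   ✓     |        |',
  '| **Bulk-invite users (CSV upload)**  |  ✓    |   ✓     |        |',
  "| Change a user's role                |  ✓    |         |        |",
  '| Edit own work                       |  ✓    |   ✓     |  ✓     |',
].join('\n');

const matrix = parseRoleMatrix(prdFragment);
```

Under these assumptions, `matrix['Bulk-invite users (CSV upload)']` comes back as `['Admin', 'Manager']`. There is no equivalent one-screen parser for the prose version of the same rule.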

Second, the acceptance criteria are written as test specifications. AC-1 is a sentence; it is also a test case. "Authenticate as Member, hit POST /v1/users:bulkInvite, assert 403, render /team/manage, assert no bulk-invite control." The QA agent doesn't need to be creative. It needs to map criteria 1:1 to executable tests.
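To illustrate how little interpretation a well-formed criterion needs, here is a rough sketch that pulls the actor, request, and expected statuses out of an AC's text. The `AcceptanceCheck` shape and `parseCriterion` function are assumptions made for illustration; a real QA agent reads the criterion as language rather than through regular expressions, and a `null` field here only means this naive extractor found nothing, not that the criterion is ambiguous.

```typescript
// Fields a test generator needs from one acceptance criterion.
interface AcceptanceCheck {
  id: string;             // e.g. "AC-1"
  actor: string | null;   // the bolded role, if the criterion names one
  request: string | null; // e.g. "POST /v1/users:bulkInvite"
  statuses: string[];     // expected status codes or classes, in order
}

// Extract the structured fields from a single criterion's text.
// Returns null when the text contains no "AC-N:" label.
function parseCriterion(text: string): AcceptanceCheck | null {
  const id = /(AC-\d+):/.exec(text);
  if (id === null) return null;
  const actor = /\*\*(\w+)\*\*/.exec(text);
  const request = /`((?:GET|POST|PUT|PATCH|DELETE) [^`]+)`/.exec(text);
  // Backticked status codes ("403") or status classes ("2xx").
  const raw: string[] = text.match(/`[1-5](?:\d\d|xx)`/g) || [];
  return {
    id: id[1],
    actor: actor ? actor[1] : null,
    request: request ? request[1] : null,
    statuses: raw.map((s) => s.slice(1, -1)), // strip the backticks
  };
}

// AC-1 and AC-2, copied from the PRD fragment.
const ac1 =
  '- AC-1: An authenticated **Member** receives `403` on `POST /v1/users:bulkInvite` ' +
  'and the bulk-invite UI on `/team/manage` is not rendered.';
const ac2 =
  '- AC-2: An authenticated **Manager** receives `2xx` on `POST /v1/users:bulkInvite` ' +
  'with valid input and `400` with malformed CSV.';

const check1 = parseCriterion(ac1);
const check2 = parseCriterion(ac2);
```

AC-1 yields one actor, one request, and one status; AC-2 yields two statuses because it describes two cases, which is the hint that it maps to two tests rather than one.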

Third, the non-goals exist. Non-goals are how you stop an agent from being over-eager. "We are not introducing per-resource ACLs in this release" rules out an entire category of plausible-but-out-of-scope refactors that the agent might otherwise propose because they are technically a coherent next step. Non-goals are friction, deliberately added.

Worth flagging one consequence of the matrix before moving on. It is a constraint on UI as much as it is on the API: the bulk-invite control on /team/manage should be rendered for Admin and Manager and absent for Member, and that mapping comes from the matrix, not from a designer's intuition or a frontend engineer's recollection of a meeting. The design agent reading this PRD reproduces those affordances the same way the dev agent reproduces the route guards and the QA agent reproduces the test cases. The post focuses on dev and QA from here, but the discipline is identical for design - and the loop generalises to whichever agents the team adds next quarter.
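A small sketch of what "the mapping comes from the matrix" could look like on the UI side. The matrix rows are transcribed from the PRD; the `CONTROLS` map and every test id other than `bulk-invite-button` are hypothetical names invented for this illustration.

```typescript
type Role = 'Admin' | 'Manager' | 'Member';

// Team-management rows transcribed from the PRD's role matrix.
const ROLE_MATRIX: Record<string, Role[]> = {
  'Invite a single user': ['Admin', 'Manager'],
  'Bulk-invite users (CSV upload)': ['Admin', 'Manager'],
  "Change a user's role": ['Admin'],
};

// Hypothetical mapping from controls on /team/manage to the capability
// each one exercises. Only 'bulk-invite-button' appears in the post.
const CONTROLS: Record<string, string> = {
  'bulk-invite-button': 'Bulk-invite users (CSV upload)',
  'invite-button': 'Invite a single user',
  'role-select': "Change a user's role",
};

// Controls a given role should see, derived only from the matrix.
function visibleControls(role: Role): string[] {
  return Object.keys(CONTROLS).filter(
    (id) => ROLE_MATRIX[CONTROLS[id]].indexOf(role) !== -1,
  );
}

const memberControls = visibleControls('Member');   // no controls
const managerControls = visibleControls('Manager'); // both invite controls
```

The design choice worth noticing is that nothing in `visibleControls` names a role. Changing who can bulk-invite is a one-cell edit to the matrix, and the UI follows.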

Stage 2: The ADR

The PRD captures what. The ADR captures the architectural decisions made along the way, in the canonical format Michael Nygard introduced in 2011 - status, context, decision, consequences. ADRs are particularly valuable for agents because they encode why a decision was made, which is the information most likely to be missing from the codebase itself.

For this feature, a single ADR is enough.

docs/adr/0007-external-api-authentication.md

# ADR-007: External API authentication via Cognito M2M and scopes

**Status**: Accepted on 2026-04-15 | **Supersedes**: -
**Linked PRD**: 2026-Q2-public-api-and-rbac

## Context

We need to expose a public API to customers. Two distinct caller types:
end users with browser sessions (already authenticated against our Cognito
User Pool), and service principals representing partner systems and
internal tools (no human user, no browser).

We considered three options:

1. **Cognito User Pools + Cognito M2M (resource server with custom scopes).**
   End users keep using the existing pool. Service principals use Cognito's
   M2M client-credentials flow with custom scopes (`users:read`, `users:write`,
   `data:read`, `data:write`).
2. **Self-issued JWTs**, signed by a service we operate.
3. **Auth0** (or similar) as a parallel identity provider.

## Decision

Option 1: Cognito User Pools for end users, Cognito M2M with a resource
server (`api.example.com`) and four custom scopes for service principals.

JWT verification at the edge of the API service via a single middleware:
verify signature against the JWKS endpoint, validate `iss`, `token_use`,
and the audience (`client_id` on Cognito access tokens, which carry no
`aud` claim), extract role claims (end-user tokens) or scope claims
(service tokens), attach a normalised `Principal` object to the request.

Authorisation enforcement happens in **two places**: a route-level guard
(checks role/scope against an annotation on the route), and the existing
data-access layer (defence in depth - tenant scoping at the query layer
remains in force regardless of the route guard).

## Consequences

**Positive**: One IDP, one verification path, one rotation story.
End-user tokens already work; new code is the M2M flow + scope checks.
Defence in depth at the data layer means a single missed route guard
does not by itself enable cross-tenant access (but **may** enable
in-tenant privilege escalation - see [Part 5](5-toward-autonomy)).

**Negative**: Cognito's M2M flow is less ergonomic than a standalone
client-credentials provider, particularly around scope discovery for
customers. Documentation cost.

**Open questions**: We have not yet decided on a per-customer rate-limit
strategy for the M2M flow. Out of scope for this ADR; tracked in PR-2026-031.

The ADR is doing two pieces of work the PRD cannot. It captures a decision, not a requirement, and it captures what was rejected and why. When an agent six months from now is asked to "switch the API auth to JWTs we issue ourselves," the right response is not to make the change. The right response is to surface ADR-007 and ask whether this is intended to be a supersession - because the original decision was deliberate, with stated alternatives, and the implications for the rotation story and the M2M flow are non-trivial. ADRs are how the agent learns to ask before acting.

The convention I would recommend, and which the rest of this example assumes: ADRs live in docs/adr/ with sequential numbering, and the agent reads them on demand via a skill or via direct reference in CLAUDE.md. The Nygard format is canonical. There is no need to invent a different one.
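To make the Decision section of ADR-007 a little more concrete, here is a rough sketch of the last step of that middleware: turning already-verified claims into a normalised `Principal`. Signature, issuer, and audience checks are assumed to have happened upstream and are not shown. Several details are assumptions of this sketch rather than statements from the ADR: that roles are carried in Cognito's `cognito:groups` claim, that a token whose `sub` equals its `client_id` is a service principal, and that custom scopes arrive prefixed with the resource server identifier.

```typescript
// Decoded payload of a token that has ALREADY passed verification.
interface VerifiedClaims {
  sub: string;
  client_id?: string;
  scope?: string;               // space-separated scope list
  'cognito:groups'?: string[];  // assumed to carry the application role
}

// The normalised object the ADR says is attached to the request.
type Principal =
  | { kind: 'user'; id: string; roles: string[] }
  | { kind: 'service'; id: string; scopes: string[] };

// Resource server identifier named in ADR-007.
const RESOURCE_SERVER = 'api.example.com';

function toPrincipal(claims: VerifiedClaims): Principal {
  // Assumption: client-credentials tokens have sub === client_id.
  const isService =
    claims.client_id !== undefined && claims.sub === claims.client_id;
  if (isService) {
    const prefix = RESOURCE_SERVER + '/';
    const scopes = (claims.scope || '')
      .split(' ')
      .filter((s) => s.indexOf(prefix) === 0) // keep only our custom scopes
      .map((s) => s.slice(prefix.length));    // "api.example.com/x" -> "x"
    return { kind: 'service', id: claims.sub, scopes };
  }
  return { kind: 'user', id: claims.sub, roles: claims['cognito:groups'] || [] };
}

const servicePrincipal = toPrincipal({
  sub: 'partner-client',
  client_id: 'partner-client',
  scope: 'api.example.com/users:read api.example.com/data:read',
});

const userPrincipal = toPrincipal({
  sub: 'user-123',
  client_id: 'web-app',
  'cognito:groups': ['Manager'],
});
```

Downstream code then only ever sees `kind`, `roles`, and `scopes`, which is what lets the route guard and the data-access layer stay ignorant of token formats.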

Stage 3: The dev agent reads, plans, builds

With the PRD and ADR in place, the engineer's prompt to the dev agent is short. The interesting thing about it is what is not in it.

Implement the public API and RBAC feature per docs/prd/2026-Q2-public-api-and-rbac.md and docs/adr/0007-external-api-authentication.md. Use the existing middleware pattern. Plan first.

The dev agent enters plan mode (covered in Part 2) and reads. It pulls both documents into context, scans the existing codebase for the middleware pattern referenced in the ADR, locates the existing Cognito integration, and surfaces a plan. The plan, by the time it lands, is a numbered list of about a dozen steps - middleware additions, route guards, scope-check logic, IaC fragments to register the resource server and scopes in Cognito, a customer-facing key-issuance script, the four route-level annotations, and tests.

Two things matter about how the agent works at this stage.

The first is that the agent does not decide on auth architecture. The ADR did. The agent's job is implementation, not design. If during the work the agent hits a question that the ADR does not answer - "should the scope check happen before or after the role check?" - the right behaviour is to flag it, propose the answer it would default to, and either get a quick decision from the engineer or amend ADR-007 with the new sub-decision. The agent does not silently choose. This is enforced partly through plan mode and partly through a project skill that reminds the agent to consult ADRs before making architectural choices.

The second is that the agent's work is continually anchored to the PRD's role matrix. When the agent writes the route guard for POST /v1/users:bulkInvite, the annotation on the route is a direct transcription of the matrix row: @requires({ roles: ['Admin', 'Manager'], scope: 'users:write' }). There is no judgement call about who can call this endpoint - the matrix decided that, the engineer approved it, the agent transcribes it. The implementation cost of this is approximately zero; the cost of not doing it is the entire post-release category of "the spec said one thing and the code does another."
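A minimal sketch of the check such a route guard might perform, under stated assumptions. The `Requirement` shape, the `authorize` function, and the rule that end users are checked on role while service principals are checked on scope are one possible default, not something ADR-007 specifies; as noted above, the ordering of scope and role checks is exactly the kind of question the agent should surface rather than settle on its own.

```typescript
type Principal =
  | { kind: 'user'; roles: string[] }
  | { kind: 'service'; scopes: string[] };

// What a route demands of its caller.
interface Requirement {
  roles: string[]; // from the role matrix row
  scope: string;   // required of service principals
}

// 'allow', or the HTTP status the guard should respond with.
type Verdict = 'allow' | 401 | 403;

// Bulk invite: roles transcribed from the matrix, scope from AC-4.
const BULK_INVITE: Requirement = {
  roles: ['Admin', 'Manager'],
  scope: 'users:write',
};

function authorize(principal: Principal | null, needed: Requirement): Verdict {
  // No valid token at all (AC-3).
  if (principal === null) return 401;

  // Assumed default: service principals are judged on scope only,
  // regardless of any role claims (AC-4).
  if (principal.kind === 'service') {
    return principal.scopes.indexOf(needed.scope) !== -1 ? 'allow' : 403;
  }

  // End users are judged on role membership (AC-1, AC-2).
  const hasRole = principal.roles.some((r) => needed.roles.indexOf(r) !== -1);
  return hasRole ? 'allow' : 403;
}

const memberVerdict = authorize({ kind: 'user', roles: ['Member'] }, BULK_INVITE);
const managerVerdict = authorize({ kind: 'user', roles: ['Manager'] }, BULK_INVITE);
```

The function has no knowledge of which roles may bulk-invite; that lives entirely in `BULK_INVITE`, which is the part a reviewer can diff against the PRD.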

A short skill at .claude/skills/spec-driven/SKILL.md makes this discipline explicit:

.claude/skills/spec-driven/SKILL.md

---
name: spec-driven
description: Use when implementing or modifying features that have a PRD or ADR. Enforces consultation of the spec, anchoring of role checks to the role matrix, and surfacing of decisions that the spec does not answer rather than guessing.
---

# Spec-driven implementation

When implementing a feature with a PRD or ADR:

1. Read the PRD and any linked ADRs *before* planning. Surface the goals,
   non-goals, and role matrix in the plan.
2. Anchor every role check, scope check, or permission decision to the
   role matrix. The matrix is authoritative; transcribe, do not interpret.
3. If the spec is silent on a question that arises during implementation,
   stop and ask. Do not guess. Propose what you would default to and why.
4. Acceptance criteria in the PRD are not just for QA. Confirm each
   criterion has a corresponding code path before claiming completeness.
5. If a change is needed to a published ADR's decision, propose a
   *new* ADR that supersedes it, rather than silently changing behaviour.

That skill lives in the repo and is auto-loaded by the agent whenever the spec-driven discipline is relevant. It is not long. It does not need to be. Most of its value is in step 3.

Stage 4: The QA agent reads the same docs

This is the central claim of the post.

The QA agent is a separate subagent, configured at .claude/agents/qa.md. Its job is not to write tests for the implementation. Its job is to write tests for the spec. When the engineer (or, later, the CI agent in Part 5) invokes it, the QA agent reads the PRD and the ADR - the same artefacts the dev agent read - and produces a test suite that maps acceptance criteria one-to-one onto executable tests.

.claude/agents/qa.md

---
name: qa
description: Generate and maintain end-to-end tests from PRDs and ADRs. The PRD's acceptance criteria are the spec of the test suite. Each AC maps to one (or more) tests; absent ACs become questions, not silent omissions.
tools: [Read, Glob, Grep, Edit, Write, Bash]
---

You are a QA engineer. Your role is to generate and maintain a test suite
that *enforces the spec*, not the implementation.

When invoked:

1. Read every PRD in `docs/prd/` and every linked ADR in `docs/adr/`.
2. For each acceptance criterion (AC-N), identify whether a corresponding
   test exists in the e2e suite. If not, generate one.
3. Tests are named after the AC (`ac01_member_cannot_bulk_invite.spec.ts`)
   and contain a comment block linking back to the PRD line.
4. If an acceptance criterion is ambiguous, *do not invent the missing
   detail*. Surface a question to the human reviewer.
5. Report at the end with: ACs covered, ACs missing, ACs ambiguous.

The output of the first run, against the PRD shown above, is a test suite where AC-1 has a corresponding test:

tests/e2e/ac01_member_cannot_bulk_invite.spec.ts

// Generated from PRD 2026-Q2-public-api-and-rbac, AC-1.
// "An authenticated Member receives 403 on POST /v1/users:bulkInvite
//  and the bulk-invite UI on /team/manage is not rendered."

import { test, expect } from '@playwright/test'
import { authenticateAs } from './helpers/auth'

test.describe('AC-1: Member cannot bulk-invite users', () => {
  test('API returns 403', async ({ request }) => {
    const token = await authenticateAs('member@acme.test')
    const res = await request.post('/v1/users:bulkInvite', {
      headers: { Authorization: `Bearer ${token}` },
      data: { csv: 'email\nfoo@example.com\n' },
    })
    expect(res.status()).toBe(403)
  })

  test('UI does not render bulk-invite control', async ({ page }) => {
    await authenticateAs('member@acme.test', { browser: page })
    await page.goto('/team/manage')
    await expect(page.getByTestId('bulk-invite-button')).toHaveCount(0)
  })
})

The crucial property: this test is generated from the spec, not from the implementation. If the implementation later regresses - say, an over-eager refactor accidentally removes the role guard from the route - this test fails, because the spec says the request must be rejected. A test written to match the implementation's existing behaviour, by contrast, passes by construction and catches nothing. The QA agent's discipline is to read the PRD as the source of truth and to ignore the implementation when constructing assertions.

The same pattern holds for AC-2 through AC-19. By the end of the run, the test suite is structurally a mirror of the PRD's acceptance criteria section - the same shape, in code, executable in CI.
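The "mirror" property can be checked mechanically. The sketch below computes the covered and missing parts of the QA agent's report from AC ids and test file names, relying on the naming rule in the agent's instructions (`ac01_...spec.ts`). The function names and the second file name are hypothetical, and the "ambiguous" bucket is deliberately left out because it needs judgement rather than string matching.

```typescript
interface CoverageSummary {
  covered: string[];
  missing: string[];
}

// "AC-1" -> "ac01", matching the test naming rule in the QA agent's brief.
function filePrefix(acId: string): string {
  const n = acId.replace('AC-', '');
  return 'ac' + (n.length < 2 ? '0' + n : n);
}

// An AC counts as covered when some test file name starts with its prefix.
function summariseCoverage(acIds: string[], testFiles: string[]): CoverageSummary {
  const summary: CoverageSummary = { covered: [], missing: [] };
  acIds.forEach((id) => {
    const prefix = filePrefix(id) + '_';
    const hit = testFiles.some((file) => file.indexOf(prefix) === 0);
    (hit ? summary.covered : summary.missing).push(id);
  });
  return summary;
}

const summary = summariseCoverage(
  ['AC-1', 'AC-2', 'AC-3', 'AC-4'],
  [
    'ac01_member_cannot_bulk_invite.spec.ts',
    'ac02_manager_can_bulk_invite.spec.ts', // hypothetical file name
  ],
);

// A CI gate could simply require that nothing is missing.
const gatePasses = summary.missing.length === 0;
```

With only two test files present, AC-3 and AC-4 land in `missing` and the gate fails, which is the behaviour you want before the first full QA run completes.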

There is a second-order effect that takes a few cycles to feel. When you change the spec, the QA agent regenerates the relevant tests. When the spec is silent, the QA agent asks. The PRD becomes a forcing function for thinking. Every ambiguity in the spec is now a friction surface, not just a footnote that nobody reads.

Stage 5: The loop closes

Once the dev agent and the QA agent are both reading the same artefacts, the loop closes in a useful way.

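| Artefact                      | Dev agent | QA agent | Design agent | Ops/incident agent (Part 5) |
| ----------------------------- | :-------: | :------: | :----------: | :-------------------------: |
| PRD (goals, role matrix, ACs) |     ✓     |    ✓     |      ✓       |              ✓              |
| ADR-007 (auth decisions)      |     ✓     |    ✓     |              |              ✓              |
| Implementation                |     ✓     |    ✓     |              |              ✓              |
| Test suite                    |     ✓     |    ✓     |              |              ✓              |
| CLAUDE.md / repo skills       |     ✓     |    ✓     |      ✓       |              ✓              |
| Runbooks (added at deploy)    |           |          |              |              ✓              |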

That table is not aspirational. Every column is an agent that exists today. The shared row of artefacts is the surface area you are deliberately maintaining.

A natural follow-on is what happens when the spec changes mid-flight. The discipline is mechanical: the engineer (or product) edits the PRD and opens a PR. The dev agent re-reads the PRD diff and surfaces which acceptance criteria changed and which corresponding code paths need updating. The QA agent re-reads the same diff and regenerates or updates the affected tests. The PR ships when the implementation, the spec, and the tests are all internally consistent. None of this is novel as a process - it is just spec-driven development. What is new is that the cost of doing it has dropped sharply.
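The "which acceptance criteria changed" step can be sketched as a plain diff over the AC list. The function names are illustrative, the parser assumes the `- AC-N:` bullet format used in the PRD fragment, and the edited PRD below (a `422` in AC-2 and a placeholder AC-20) is a made-up change for the sake of the example.

```typescript
type Criteria = Record<string, string>; // AC id -> criterion text

// Collect "- AC-N: ..." bullets, joining indented continuation lines.
function extractCriteria(prd: string): Criteria {
  const criteria: Criteria = {};
  let current: string | null = null;
  for (const line of prd.split('\n')) {
    const start = /^- (AC-\d+):\s*(.*)$/.exec(line);
    if (start !== null) {
      current = start[1];
      criteria[current] = start[2].trim();
    } else if (current !== null && /^\s+\S/.test(line)) {
      criteria[current] += ' ' + line.trim();
    } else {
      current = null; // any other line ends the current bullet
    }
  }
  return criteria;
}

interface CriteriaDiff {
  added: string[];
  removed: string[];
  changed: string[];
}

function diffCriteria(before: Criteria, after: Criteria): CriteriaDiff {
  const diff: CriteriaDiff = { added: [], removed: [], changed: [] };
  for (const id of Object.keys(after)) {
    if (!(id in before)) diff.added.push(id);
    else if (before[id] !== after[id]) diff.changed.push(id);
  }
  for (const id of Object.keys(before)) {
    if (!(id in after)) diff.removed.push(id);
  }
  return diff;
}

const beforePrd = [
  '- AC-1: An authenticated **Member** receives `403` on `POST /v1/users:bulkInvite`',
  '  and the bulk-invite UI on `/team/manage` is not rendered.',
  '- AC-2: An authenticated **Manager** receives `2xx` on `POST /v1/users:bulkInvite`',
  '  with valid input and `400` with malformed CSV.',
].join('\n');

// Hypothetical edit: AC-2 now expects 422, and a placeholder AC-20 is added.
const afterPrd =
  beforePrd.replace('`400`', '`422`') + '\n- AC-20: (hypothetical new criterion)';

const prdDiff = diffCriteria(extractCriteria(beforePrd), extractCriteria(afterPrd));
```

Here `prdDiff.changed` is `['AC-2']` and `prdDiff.added` is `['AC-20']`. That list is the shared work queue: the dev agent looks for code paths behind those ids, the QA agent for tests named after them.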

Failure modes specific to this approach

The thesis is not free of failure modes. Three are worth flagging up front.

Docs going stale. The classic objection. If the PRD says one thing and the code does another, the agent has to choose - and a poorly-configured agent will choose whichever it read most recently or whichever it has more confidence in. The mitigation is process, not tooling: spec changes go through PRs alongside code changes, the QA agent's report ("ACs covered / missing / ambiguous") is a CI gate, and the runbook for "the docs and the code disagree" is stop and reconcile, not ship and decide later.

Over-spec. The opposite trap. PRDs that try to dictate implementation. ADRs that mix decision with line-by-line guidance. The dev agent ends up with a spec so prescriptive that it cannot exercise any engineering judgement, and the resulting code has the texture of a spec compiled to source - because that is what it is. The mitigation is a discipline of the document author: the PRD captures what and why, the ADR captures which design choice and why not the alternatives, and everything else - line breaks, naming, parameter ordering - is the agent's job. Specs that are too tight push the failure mode from "confident wrongness" to "boring wrongness" - syntactically correct code with no judgement in it.

Docs that lie. The most dangerous mode. A PRD that says one thing because that is what the team intended, while the code does another because that is what shipped under deadline pressure. With agents reading the docs, the lie now propagates: the QA agent's tests assert the lie; the design agent renders affordances against the lie; the incident agent in Part 5 reasons against the lie when triaging; new work gets built on the lie. The mitigation is partly the QA agent (its acceptance-criteria coverage report flags drift) and partly the discipline that any change to behaviour ships with the same-PR change to the spec. A lying spec is worse than no spec at all.
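The "same-PR change to the spec" discipline can be nudged with a very coarse check over the files a PR touches. This is a heuristic sketch, not a recommendation of a specific tool: the `src/` prefix is an assumed code location, the function name is invented, and a check like this will flag pure refactors that legitimately need no spec change, so it is better suited to a warning than a hard gate.

```typescript
// True when a PR touches code but none of the spec documents.
function codeChangedWithoutSpec(changedFiles: string[]): boolean {
  const touchesCode = changedFiles.some((f) => f.indexOf('src/') === 0); // assumed code root
  const touchesSpec = changedFiles.some(
    (f) => f.indexOf('docs/prd/') === 0 || f.indexOf('docs/adr/') === 0,
  );
  return touchesCode && !touchesSpec;
}

// Hypothetical PR file lists.
const codeOnlyPr = codeChangedWithoutSpec(['src/routes/users.ts']);
const codeAndSpecPr = codeChangedWithoutSpec([
  'src/routes/users.ts',
  'docs/prd/2026-Q2-public-api-and-rbac.md',
]);
```

The first PR is flagged and the second is not. The value is less in the check itself than in forcing the question "did behaviour change?" to be answered in review.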

An honest take

The argument for taking PRDs and ADRs seriously is older than agents. Acceptance test-driven development, behaviour-driven development, the working-backwards memo at Amazon, the Architecture Decision Record post Michael Nygard wrote in 2011 - the engineering tradition has been arguing for this discipline for two decades. The reason it has not stuck on most teams is, unfashionably, economic. Writing a PRD that meets a high bar takes meaningful time. Writing an ADR that captures the alternatives, not just the chosen path, takes more. The compounding payoff - new engineers ramp faster, on-call responders find the right context, refactors don't violate forgotten constraints - was real but discounted, because most teams' biggest constraint was not "good docs would help us" but "we don't have the time to write them and engineers don't read them anyway."

Agents change that calculation in two ways.

The first is that they raise the read-frequency of every doc by an order of magnitude or more. The PRD that was read three times by humans is now read every time the dev agent picks up a related task, every time the QA agent regenerates tests, every time the incident agent triages a problem in the affected service. The compounding starts paying back inside the same quarter.

The second is that they lower the cost of writing the docs. The PRD I would have written by hand in three hours, an agent will draft from a half-page brief in fifteen minutes. The ADR I would have written by hand in an hour, an agent will draft from a back-and-forth conversation about the alternatives. The author's job becomes review and judgement, not transcription. The cost goes down, the value goes up, the math finally works.

There is a temptation, with all of this, to slip into a kind of techno-utopianism. A doc-driven SDLC is not magic. Specs can still be wrong; ADRs can still encode bad decisions; QA agents can still regenerate tests for a spec that itself was poorly thought through. The point of the discipline is not that the documents are infallible. The point is that, when the system is set up this way, the failure modes get pushed out of the implementation phase and back into the design phase, where they are dramatically cheaper to fix. A bad spec can be revised in an afternoon. A production incident caused by a silent inconsistency between code and spec can take a week and a postmortem.

The next post in this series picks up exactly there. The auth feature ships. A few weeks after release, an incident lands - an RBAC misconfiguration on a page and an endpoint, post-release, exactly the category of failure this post argues should not survive a doc-driven SDLC. In Part 5, the incident-response agent reads the same PRD and the same ADR-007 to triage what went wrong, propose a fix, draft the postmortem, and update the documents that made the failure invisible in the first place. The loop closes the rest of the way.

Sources and further reading:

Michael Nygard, "Documenting Architecture Decisions" (2011) - the canonical ADR format (Status, Context, Decision, Consequences). Original blog post: cognitect.com/blog/2011/11/15/documenting-architecture-decisions.
adr.github.io - community catalog of ADR templates and tooling.
Colin Bryar and Bill Carr, Working Backwards: Insights, Stories, and Secrets from Inside Amazon (St. Martin's Press, 2021) - the working-backwards / PR-FAQ memo discipline at Amazon.
Kent Beck, Test-Driven Development: By Example (Addison-Wesley, 2002) - the canonical TDD reference; the acceptance-criteria-as-tests argument extends naturally from it.
AWS - Cognito User Pools and resource servers / scopes - reference for the M2M flow and custom scopes used in ADR-007.
