- Routing · how layers compose and how docs get loaded
- Change rhythm · scratch plans, phase discipline, and ADRs as the only durable artifact
- Review and verify · this post · auto-triggered self-review on every PR, verification gates, and deployment
## Summary
This post defines how a change proves it is correct, safe, and ready to ship.
The review-and-verify layer standardizes:
- three kinds of review (self, AI-assisted, human) and when each applies
- auto-triggered self-review that runs on every PR event, not on demand
- seven review dimensions that structure what the review checks
- explicit review gates that block progression when violated
- verification discipline that proves the change actually works in production-like conditions
- deployment as the final phase of the same discipline, including rollback and incident triage
The core rule is simple. Every PR opens with an automated review. Every push updates it. Hard gates (tests failing, missing docs, security findings) block merge. Verification is not complete until the change is running in production under real load and has survived a smoke test.
The appendix contains the review command skeleton, the full review-dimensions reference, a sanitized GitHub Actions workflow that wires the auto-trigger, and a deployment playbook. Part 1 defined the routing layer. Part 2 defined the change rhythm that produces the draft PR this layer reviews.
## Problem
Implicit review is where changes silently regress. Four failure modes are common.
### Review is treated as a human bottleneck
When review lives entirely in a human reviewer's inbox, throughput is capped by that reviewer's availability. Worse, the review is opaque: a single "LGTM" at the bottom of a PR carries no record of what was actually checked. Reading the review a week later tells you the change was approved, not that any particular dimension was verified.
### AI review is treated as a novelty
Teams adopting Claude Code often run an AI review on demand ("hey, review this PR") and treat the output as an optional advisory. On-demand review misses the changes no one remembered to review. Optional advisories become noise.
### Verify collapses into "tests pass"
Tests passing is necessary, not sufficient. A green test suite says the change does not break anything the tests cover. It says nothing about whether the change works in production, under real load, with real data, against real dependencies. Without a separate verify discipline, "tests pass" becomes "ship it," and the gap between test environment and production absorbs the bugs.
### Deployment is treated as out of scope for review
A PR merging to main is not the end of the change. Rollout, post-deploy smoke, and rollback are part of the same change, and regressions most commonly surface there. Governance that stops at merge leaves the most expensive failures uncovered.
All four failures share one root cause. The discipline ends before it reaches the surface where changes actually break: production, running code, real load. Everything before that surface (tests, code review, CI) is a proxy. The framework needs to cover the proxies and extend to the surface itself.
## Goals
- Make self-review automatic, not on-demand.
- Structure review around explicit dimensions rather than reviewer discretion.
- Block merge on hard-gate violations; allow advisories to be resolved in the PR thread.
- Separate verify from tests. Verify proves the change works, not that it does not crash.
- Treat deployment as part of the same change, with its own rollout, smoke, and rollback steps.
- Feed review findings into Decision Records when they produce durable rationale.
- Keep the command surface small: one review command, invoked automatically by CI.
## Non-Goals
- Replace human review. AI review is the floor, not the ceiling.
- Define a universal deploy pipeline. Rollout mechanisms vary; the discipline is what stays consistent.
- Prescribe a specific test framework or coverage target.
- Turn every PR into a multi-day verification exercise. Trivial changes still ship quickly.
- Replace incident response or on-call process. Those are adjacent disciplines.
## Proposed Design

### 1. Review Model
Three kinds of review exist. Each has a different cost, a different coverage profile, and a different trigger.
| Kind | Trigger, cost, coverage |
|---|---|
| Self-review | Author-initiated or auto-triggered. Cheap, consistent, and high-coverage because it runs on every PR. Catches the mechanical dimensions (tests, docs, conventions, basic security). Does not reason about subtle architectural issues. |
| AI-assisted review | Auto-triggered on PR events. Claude Code runs /agent-review-change against the diff and posts a structured comment. Higher reasoning quality than self-review, still cheap, still consistent. Catches subtle issues (missing error paths, incorrect invariants, unsafe SQL) that a checklist would not. |
| Human review | Requested explicitly. Expensive, inconsistent, but irreplaceable for design decisions, domain judgment, and accountability. Human reviewers spend their time on what the other two cannot cover. |
The three kinds are additive, not alternative. Self-review and AI-assisted review run on every PR; human review is requested when the change warrants it. Human reviewers arrive at a PR that has already been through the first two, and their attention is freed for the parts only they can do.
### 2. Review Triggers
Review runs at four trigger points. The first is the automated default, and the other three are supplementary:
- PR opened, reopened, synchronized, or ready for review. This is the first-class trigger. Every draft PR (see part 2) fires a self-review run. Every subsequent push updates it.
- Phase completion. When a change transitions from implement to close, /agent-close-change can request a final review pass before merge.
- Pre-deploy. For changes that carry deployment risk (database migrations, breaking API changes, infra rewrites), an explicit pre-deploy review run checks deployment-specific dimensions.
- Post-incident. After an incident, the offending change is reviewed in reverse: the team runs /agent-review-change against the PR that caused the incident, captures the findings in a Decision Record, and feeds the lessons back into review dimensions.
The auto-trigger on PR events is the only trigger that must be in place. The other three layer on top.
### 3. Automating Self-Review
The mechanism matters as much as the principle. "Run self-review on every PR" only works if a CI workflow actually invokes /agent-review-change on every PR event. The following sanitized GitHub Actions workflow wires this up:
```yaml
name: agent-review
on:
  pull_request:
    types: [opened, reopened, synchronize, ready_for_review]
permissions:
  contents: read
  pull-requests: write
  checks: write
jobs:
  review:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Run agent review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.AGENT_REVIEW_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: |
          agent-review-change \
            --diff-range "$BASE_SHA..$HEAD_SHA" \
            --pr "$PR_NUMBER" \
            --dimensions correctness,safety,docs,tests,compatibility,security,performance \
            --output structured
      - name: Post review
        uses: actions/github-script@v7
        with:
          script: |
            const review = require('./agent-review-output.json');
            // Update existing review comment in place, or create one.
            // Escalate to a required check for any blocking gate.
```
Five properties make this workflow work as the governance mechanism, not just a convenience:
- Event coverage. opened, reopened, synchronize, and ready_for_review together mean that every state change that adds new code fires a new review. A stale review cannot hide behind a later push.
- Structured output. The review produces JSON that maps one-to-one to the seven review dimensions. A human reading the PR sees a table, not a wall of prose. A later script can query the output.
- Idempotent PR comment. The workflow updates a single review comment in place, rather than appending a new one each push. The PR stays readable.
- Blocking vs. advisory. Most findings post as advisory comments. Hard gates (tests failing, missing required docs, security violations) escalate to a required check that blocks merge.
- Timeout and retries. The job has a strict timeout (10 minutes) and a predictable retry policy, so a flaky review does not block a PR indefinitely.
The workflow is sanitized to satisfy the public-repo security rules: no account IDs, no bucket names, no specific secret names beyond a placeholder. A real deployment substitutes the organization's own secret management and review runner.
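The update-or-create logic behind the idempotent PR comment can be sketched in a few lines. This is an illustrative Python sketch against the GitHub REST API, not the framework's actual runner; the marker string and function names are assumptions.

```python
# Idempotent PR comment sketch: find the review comment by a hidden HTML
# marker and PATCH it in place, or POST it on the first run.
import json
import urllib.request

MARKER = "<!-- agent-review -->"  # hypothetical marker hidden in the comment
API = "https://api.github.com"

def find_marker_comment(comments, marker=MARKER):
    """Return the id of the first comment carrying the marker, or None."""
    for comment in comments:
        if marker in comment.get("body", ""):
            return comment["id"]
    return None

def _call(method, url, token, payload=None):
    """Minimal authenticated GitHub REST call returning parsed JSON."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def upsert_review_comment(repo, pr_number, body, token):
    """Update the agent-review comment in place, or create it on first run."""
    base = f"{API}/repos/{repo}/issues"
    comments = _call("GET", f"{base}/{pr_number}/comments", token)
    full_body = f"{MARKER}\n{body}"
    comment_id = find_marker_comment(comments)
    if comment_id is not None:
        _call("PATCH", f"{API}/repos/{repo}/issues/comments/{comment_id}",
              token, {"body": full_body})
    else:
        _call("POST", f"{base}/{pr_number}/comments", token,
              {"body": full_body})
```

Because the marker travels with the comment body, reruns on every push converge on a single comment rather than a growing thread.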
### 4. Review Dimensions
Review is structured around seven dimensions. Each dimension answers a specific question. The output maps one finding per dimension per file, which keeps the review legible.
| Dimension | Question |
|---|---|
| Correctness | Does the change do what it says? Are invariants preserved? Are error paths handled? |
| Safety | Can this code corrupt data, destroy state, or break running systems? Are destructive operations gated appropriately? |
| Docs | Are evergreen docs updated in this PR? Is the commit message sufficient? Does any durable rationale need a Decision Record? |
| Tests | Does the change ship with tests? Do the tests test behavior, not just absence of panic? Are edge cases covered? |
| Compatibility | Does the change break API consumers, database schemas, serialized formats, or configuration contracts? |
| Security | Does the change introduce injection vectors, auth bypass, exposed secrets, unsafe deserialization, or permissioning issues? |
| Performance | Does the change regress hot paths? Are new N+1 queries, unbounded allocations, or blocking I/O introduced? |
Dimensions are orthogonal. A single finding should fall cleanly into one of them. When a finding spans two dimensions, it usually means the diagnosis is still shallow; splitting it into two findings makes the review sharper.
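The "one finding per dimension per file" shape can be made concrete with a small data model. A hypothetical sketch — field names and severity levels are assumptions, not the framework's actual schema:

```python
# Hypothetical shape for the structured review output: one finding per
# dimension per file, so the PR comment can render as a table.
from dataclasses import dataclass, asdict

DIMENSIONS = ("correctness", "safety", "docs", "tests",
              "compatibility", "security", "performance")

@dataclass(frozen=True)
class Finding:
    dimension: str   # exactly one of DIMENSIONS -- findings do not span two
    file: str        # path the finding applies to
    severity: str    # illustrative levels: "info", "warn", "block"
    message: str

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")

def to_json_rows(findings):
    """Flatten findings into JSON-serializable rows for the PR comment."""
    return [asdict(f) for f in findings]
```

Rejecting unknown dimensions at construction time enforces the orthogonality rule mechanically: a finding that does not fit one dimension cannot be emitted at all.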
### 5. Review Gates
A review gate is a hard stop that blocks merge. The framework defines four default gates, and repos can add their own:
| Gate | Blocks merge when |
|---|---|
| Tests | Any required test suite fails on the PR branch. |
| Docs | Behavior changed but an evergreen doc referenced by the changed area was not updated. |
| Security | A security finding above a configured severity threshold is open. |
| Compatibility | A breaking change is introduced without a Decision Record explaining the migration path. |
All other findings from the seven-dimension review surface as advisory comments. Advisories are visible, addressable, and resolved in the PR thread; they do not block the merge button. This split keeps the gating discipline narrow enough that teams do not learn to ignore it.
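The advisory-vs-gate split reduces to a small partition function. A sketch under assumed severity names and thresholds — the real configuration would live with the repo:

```python
# Gate evaluation sketch: hard gates escalate to a required check, everything
# else stays advisory. Dimension thresholds here are illustrative.
BLOCKING = {
    # dimension -> minimum severity that escalates to a required check
    "tests": "block",
    "docs": "block",
    "security": "warn",        # assumed configured threshold for security
    "compatibility": "block",
}
SEVERITY_ORDER = {"info": 0, "warn": 1, "block": 2}

def split_findings(findings):
    """Partition findings into (blocking, advisory) lists."""
    blocking, advisory = [], []
    for f in findings:
        threshold = BLOCKING.get(f["dimension"])
        if threshold is not None and \
           SEVERITY_ORDER[f["severity"]] >= SEVERITY_ORDER[threshold]:
            blocking.append(f)
        else:
            advisory.append(f)
    return blocking, advisory

def merge_allowed(findings):
    """Merge is allowed only when no finding crosses a gate threshold."""
    return not split_findings(findings)[0]
```

Note that performance is deliberately absent from the blocking map: per the gate table, only the four default gates block, and everything else surfaces in the PR thread.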
### 6. Verify Discipline
Verify is distinct from review. Review reads the code; verify runs it. Four verify layers, in increasing cost and increasing realism:
| Layer | What it proves |
|---|---|
| Unit tests | Isolated logic behaves correctly for documented inputs. |
| Integration tests | Components wire together. Real database, real HTTP client, real external API stubs where feasible. |
| Staging / canary | The change runs under production-like traffic, on production-like data, without affecting all users. |
| Observability check | Metrics, logs, and traces show the change working as expected in the running system. No new errors, no latency regressions, no unexpected resource use. |
The first two layers are covered by standard CI. The third and fourth are where most teams lose discipline: a change passes CI and is declared "verified," but no one watched it run in a realistic environment. The framework's position is that "tests pass" is a gate, not a finish line. Verification is complete when the change has been observed running under real conditions without regression.
For UI changes this includes browser testing: the feature is exercised manually at 375 / 768 / 1200 px, console is checked, network requests are inspected, and the feature is used along a realistic path. Type checks and test suites verify code correctness, not feature correctness.
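The observability check in the fourth layer can be expressed as a comparison of pre- and post-rollout numbers. A minimal sketch, assuming the metrics arrive as plain values from whatever monitoring system the team runs; the thresholds are illustrative:

```python
# Observability check sketch: compare error rate and p99 latency before and
# after rollout. How the numbers are fetched is deployment-specific.
def deploy_regressed(before, after,
                     max_error_increase=0.001,   # +0.1% absolute error rate
                     max_latency_ratio=1.10):    # +10% p99 latency
    """Return a list of regression descriptions; empty means healthy."""
    problems = []
    if after["error_rate"] - before["error_rate"] > max_error_increase:
        problems.append("error rate regressed")
    if after["p99_ms"] > before["p99_ms"] * max_latency_ratio:
        problems.append("p99 latency regressed")
    return problems
```

The value of writing the check down, even this crudely, is that "no one watched it run" becomes impossible: either the check ran and returned empty, or verification is not complete.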
### 7. Deployment
Deployment is part of verify, not a separate stage. A change that merges and breaks in production has not been verified, regardless of how green CI was. Three deployment concerns are first-class:
#### Rollout strategy
The framework does not prescribe a specific rollout mechanism (canary, blue/green, progressive delivery, feature flags). It does require that the change's rollout strategy is explicit somewhere the review can see it. In most repos the right surface is the PR description or a linked runbook. For changes that carry lasting architectural implications (a new canary infrastructure, a new feature flag system), a Decision Record captures the rationale.
#### Post-deploy smoke
Every change that alters behavior requires a post-deploy smoke step: a scripted or manual check that the feature works in production after rollout. A post-deploy smoke that only checks "service is healthy" is not sufficient; it must exercise the specific feature that changed. For trivial changes this can be a one-line curl. For larger changes it is a small suite.
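A smoke step one notch above the one-line curl might look like the following. The endpoint, query shape, and expected response are purely illustrative; the point is that the check exercises the changed feature's behavior, not just liveness:

```python
# Post-deploy smoke sketch: hit the feature that changed, not /health.
# URL scheme and response shape are hypothetical.
import json
import urllib.parse
import urllib.request

def validate_page(status, page):
    """Check the smoke response shape; returns the item count."""
    assert status == 200, f"unexpected status {status}"
    assert "items" in page, "response missing 'items' key"
    return len(page["items"])

def smoke_check(base_url, filter_value="status:active"):
    """Exercise the new filter endpoint in production after rollout."""
    url = f"{base_url}/api/items?filter={urllib.parse.quote(filter_value)}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return validate_page(resp.status, json.load(resp))
```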
#### Rollback discipline
Every deployable change has a rollback path, and the rollback path is documented before merge, not discovered during an incident. Rollback paths are not all symmetrical: a schema migration cannot be rolled back by reverting the deploy, for example. When the rollback is not a simple revert, the Decision Record for the change explains what the actual rollback procedure is.
#### Incident triage as review-in-reverse
An incident is a review that runs too late. The discipline for recovering value from one is to run the review dimensions in reverse against the change that caused the incident: which dimension failed, why did the auto-review not catch it, and what should change in the review dimensions so it catches the next one. Findings that produce durable lessons become Decision Records. The change-rhythm and review-and-verify layers reinforce each other here: the incident's lesson is captured by the same Decision Record mechanism that captures architectural choices.
### 8. Review Findings Feed Decision Records
Most review findings are resolved in the PR and forgotten. A small subset surfaces durable rationale: a constraint the team did not know about, a trade-off that future reviews should respect, a class of bug that should be prevented at the policy level.
When a review finding crosses that threshold, it becomes a Decision Record through the same /agent-record-decision command introduced in part 2. Three common patterns:
- a security finding that reveals an architectural weakness becomes a DR that captures the mitigation approach
- a compatibility finding that forces an API break becomes a DR that records the break and its migration path
- a post-incident review that identifies a missing review dimension becomes a DR that proposes adding the dimension and updates the framework's dimension list accordingly
This is the feedback loop that keeps the framework from calcifying. The review dimensions, the gates, and the policies all get updated by lessons surfaced in actual reviews and actual incidents, not by abstract reasoning.
## Review-and-Verify Capabilities
This layer adds exactly one command to the framework's surface.
### Review Change
/agent-review-change is invoked by CI on PR events (opened, reopened, synchronize, ready_for_review). It can also be called by hand for phase completion, pre-deploy, or post-incident review. It loads the seven review dimensions and the gate definitions, runs them against the diff, and produces structured output suitable for a PR comment or a GitHub check.
The command is deliberately narrow. It does not plan changes, implement changes, or merge changes. Those are owned by the commands in parts 1 and 2. /agent-review-change reviews; everything else is out of scope.
### Usage
The review-and-verify command lives under ~/.claude/commands/ alongside the routing and change-rhythm commands from parts 1 and 2, using the reserved agent-* namespace.
```mermaid
graph LR
    PR["PR opened<br/>or synchronize"]:::routing
    REVIEW["/agent-review-change<br/>auto-triggered"]:::api
    GATES{"Hard gate<br/>violation?"}:::decision
    COMMENT["Structured PR comment<br/>7 review dimensions"]:::data
    BLOCK["Required check fails<br/>merge blocked"]:::onDemand
    MERGE["Merge allowed"]:::data
    PR --> REVIEW
    REVIEW --> GATES
    GATES -- "yes" --> BLOCK
    GATES -- "no" --> COMMENT
    COMMENT --> MERGE
    BLOCK -.-> COMMENT
    classDef routing fill:#5C3E14,stroke:#E8A849,color:#fff,stroke-width:1.5px
    classDef api fill:#1B3D3A,stroke:#4ECDC4,color:#fff,stroke-width:1.5px
    classDef data fill:#2A3054,stroke:#7B93DB,color:#fff,stroke-width:1.5px
    classDef onDemand fill:#3A2054,stroke:#B07CD8,color:#fff,stroke-width:1.5px
    classDef decision fill:#4A1E3A,stroke:#E85D75,color:#fff,stroke-width:1.5px
```
Repos that want a custom review dimension (for example, "accessibility" on a frontend-heavy repo) add it in repo-local docs at docs/agents/reference/review-dimensions.md. Repo gates add to the framework gates rather than replace them.
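A repo-local dimension entry might look like the following. The file location is the framework's; the contents are illustrative, assuming only that a dimension needs a name, a question, and a gate posture:

```markdown
## Accessibility

Question: Do UI changes preserve keyboard navigation, focus order,
color contrast, and ARIA labeling for the affected components?

Gate: advisory only.
```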
### Example: an auto-triggered review that caught a regression
A change adds a new endpoint that accepts a user-supplied filter and composes it into a database query. The PR passes CI. Tests are green. The change is one file, seven lines.
- PR opens. The GitHub Actions workflow fires /agent-review-change.
- The review runs across the seven dimensions and posts a structured comment.
- The security dimension flags the filter as a potential injection vector: the user-supplied value is string-concatenated into the SQL, not parameterized.
- The finding triggers the security gate (severity above the configured threshold). The required check fails. Merge is blocked.
- The author switches to a parameterized query, pushes a new commit.
- The synchronize event fires. The review reruns. The security finding is gone. The required check passes.
- The PR merges.
Nothing about this example required a human reviewer. The auto-trigger, the dimension, and the gate did the work. Human review is still available; it simply is not the first line of defense anymore.
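The before-and-after of that fix is worth seeing side by side. A sketch using sqlite3, since the example does not name the actual stack:

```python
# The injection the security dimension caught, and the parameterized fix.
# Schema and data are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, status TEXT)")
conn.execute("INSERT INTO items VALUES ('widget', 'active')")

user_filter = "active' OR '1'='1"  # hostile user-supplied value

# Before: string concatenation -- the filter becomes part of the SQL text,
# so the OR clause executes and every row leaks.
vulnerable = f"SELECT name FROM items WHERE status = '{user_filter}'"
leaked = conn.execute(vulnerable).fetchall()

# After: parameterized query -- the filter is bound as data, never as SQL,
# so the hostile string matches nothing.
safe = "SELECT name FROM items WHERE status = ?"
filtered = conn.execute(safe, (user_filter,)).fetchall()
```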
## Tradeoffs

### Benefits
- Every PR gets reviewed, not just the ones someone remembered to tag.
- Review output is structured and diffable across runs.
- Gates keep the discipline narrow enough that authors do not learn to ignore it.
- Verify is separate from tests, so "tests pass" stops being confused with "change works."
- Deployment, rollback, and incident triage live in the same discipline, not adjacent to it.
- The feedback loop into Decision Records keeps the framework grounded in real findings.
### Costs
- CI cost per PR increases: every push runs a review.
- Structured reviews can produce false positives that the author has to triage. The advisory vs. gate split mitigates this but does not eliminate it.
- Rollback discipline requires work that is easy to skip. If a team does not enforce it, the benefit evaporates.
- The seven-dimension matrix is a design opinion. A team with different priorities may need to adjust it.
These costs are acceptable because they buy consistent review coverage, explicit gating, and a verification discipline that extends to the surface where changes actually break.
## Recommendation
Adopt this layer once the routing (part 1) and change rhythm (part 2) are in place:
- install /agent-review-change under ~/.claude/commands/
- wire the GitHub Actions workflow (or equivalent CI) to fire the command on PR events
- adopt the seven review dimensions as the default; extend them per repo as needed
- enforce the four default gates; add repo-local gates only when a repeated failure class justifies one
- treat staging, canary, and observability checks as part of verify, not optional
- document rollback before merge for any change that carries deployment risk
- feed durable findings from reviews and incidents back into Decision Records
This completes the three-layer governance framework. Routing decides what gets loaded, rhythm decides how the change moves, and review-and-verify decides when the change is ready to run in front of real users. Each layer is independently useful, and together they form a complete cycle from intent to production.
Appendix: The review command skeleton, the review-dimensions reference, the sanitized GHA workflow, and the deployment playbook are available in the implementation appendix.
Series: Part 1 · Routing · Part 2 · Change rhythm