- Routing · how layers compose and how docs get loaded
- Change rhythm · scratch plans, phase discipline, and ADRs as the only durable artifact
- Review and verify · this post · auto-triggered self-review on every PR, verification gates, and deployment
## Summary
This post defines how a change proves it is correct, safe, and ready to ship.
The review-and-verify layer standardizes:
- three kinds of review (self, AI-assisted, human) and when each applies
- auto-triggered self-review that runs on every PR event, not on demand
- seven review dimensions that structure what the review checks
- explicit review gates that block progression when violated
- verification discipline that proves the change actually works in production-like conditions
- deployment as the final phase of the same discipline, including rollback and incident triage
The core rule is simple. Every PR opens with an automated review. Every push updates it. Hard gates (tests failing, missing docs, security findings) block merge. Verification is not complete until the change is running in production under real load and has survived a smoke test.
The appendix contains the review command skeleton, the full review-dimensions reference, a sanitized GitHub Actions workflow that wires the auto-trigger, and a deployment playbook. Part 1 defined the routing layer. Part 2 defined the change rhythm that produces the draft PR this layer reviews.
## Problem
Implicit review is where changes silently regress. Four failure modes are common.
### Review is treated as a human bottleneck
When review lives entirely in a human reviewer's inbox, throughput is capped by that reviewer's availability. Worse, the review is opaque: a single "LGTM" at the bottom of a PR carries no record of what was actually checked. Reading the review a week later tells you the change was approved, not that any particular dimension was verified.
### AI review is treated as a novelty
Teams adopting Claude Code often run an AI review on demand ("hey, review this PR") and treat the output as an optional advisory. On-demand review misses the changes no one remembered to review. Optional advisories become noise.
### Verify collapses into "tests pass"
Tests passing is necessary, not sufficient. A green test suite says the change does not break anything the tests cover. It says nothing about whether the change works in production, under real load, with real data, against real dependencies. Without a separate verify discipline, "tests pass" becomes "ship it," and the gap between test environment and production absorbs the bugs.
### Deployment is treated as out of scope for review
A PR merging to main is not the end of the change. Rollout, post-deploy smoke, and rollback are part of the same change, and regressions most commonly surface there. Governance that stops at merge leaves the most expensive failures uncovered.
All four failures share one root cause. The discipline ends before it reaches the surface where changes actually break: production, running code, real load. Everything before that surface (tests, code review, CI) is a proxy. The framework needs to cover the proxies and extend to the surface itself.
## Goals
- Make self-review automatic, not on-demand.
- Structure review around explicit dimensions rather than reviewer discretion.
- Block merge on hard-gate violations; allow advisories to be resolved in the PR thread.
- Separate verify from tests. Verify proves the change works, not that it does not crash.
- Treat deployment as part of the same change, with its own rollout, smoke, and rollback steps.
- Feed review findings into Decision Records when they produce durable rationale.
- Keep the command surface small: one review command, invoked automatically by CI.
## Non-Goals
- Replace human review. AI review is the floor, not the ceiling.
- Define a universal deploy pipeline. Rollout mechanisms vary; the discipline is what stays consistent.
- Prescribe a specific test framework or coverage target.
- Turn every PR into a multi-day verification exercise. Trivial changes still ship quickly.
- Replace incident response or on-call process. Those are adjacent disciplines.
## Proposed Design

### 1. Review Model
Three kinds of review exist. Each has a different cost, a different coverage profile, and a different trigger.
| Kind | Trigger, cost, coverage |
|---|---|
| Self-review | Author-initiated or auto-triggered. Cheap, consistent, and high-coverage because it runs on every PR. Catches the mechanical dimensions (tests, docs, conventions, basic security). Does not reason about subtle architectural issues. |
| AI-assisted review | Auto-triggered on PR events. Claude Code runs /agent-review-change against the diff and posts a structured comment. Higher reasoning quality than self-review, still cheap, still consistent. Catches subtle issues (missing error paths, incorrect invariants, unsafe SQL) that a checklist would not. |
| Human review | Requested explicitly. Expensive, inconsistent, but irreplaceable for design decisions, domain judgment, and accountability. Human reviewers spend their time on what the other two cannot cover. |
The three kinds are additive, not alternative. Self-review and AI-assisted review run on every PR; human review is requested when the change warrants it. Human reviewers arrive at a PR that has already been through the first two, and their attention is freed for the parts only they can do.
### 2. Review Triggers
Review runs at four trigger points. The first is the automated default, and the other three are supplementary:
- PR opened, reopened, synchronized, or ready for review. This is the first-class trigger. Every draft PR (see part 2) fires a self-review run. Every subsequent push updates it.
- Phase completion. When a change transitions from implement to close, /agent-close-change can request a final review pass before merge.
- Pre-deploy. For changes that carry deployment risk (database migrations, breaking API changes, infra rewrites), an explicit pre-deploy review run checks deployment-specific dimensions.
- Post-incident. After an incident, the offending change is reviewed in reverse: the team runs /agent-review-change against the PR that caused the incident, captures the findings in a Decision Record, and feeds the lessons back into review dimensions.
The auto-trigger on PR events is the only trigger that must be in place. The other three layer on top.
### 3. Automating Self-Review
The mechanism matters as much as the principle. "Run self-review on every PR" only works if a CI workflow actually invokes /agent-review-change on every PR event. The following sanitized GitHub Actions workflow wires this up:
```yaml
name: agent-review
on:
  pull_request:
    types: [opened, reopened, synchronize, ready_for_review]
permissions:
  contents: read
  pull-requests: write
  checks: write
jobs:
  review:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Run agent review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.AGENT_REVIEW_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: |
          agent-review-change \
            --diff-range "$BASE_SHA..$HEAD_SHA" \
            --pr "$PR_NUMBER" \
            --dimensions correctness,safety,docs,tests,compatibility,security,performance \
            --output structured
      - name: Post review
        uses: actions/github-script@v7
        with:
          script: |
            const review = require('./agent-review-output.json');
            // Update existing review comment in place, or create one.
            // Escalate to a required check for any blocking gate.
```
Five properties make this workflow work as the governance mechanism, not just a convenience:
- Event coverage. opened, reopened, synchronize, and ready_for_review together mean that every state change that adds new code fires a new review. A stale review cannot hide behind a later push.
- Structured output. The review produces JSON that maps one-to-one to the seven review dimensions. A human reading the PR sees a table, not a wall of prose. A later script can query the output.
- Idempotent PR comment. The workflow updates a single review comment in place, rather than appending a new one each push. The PR stays readable.
- Blocking vs. advisory. Most findings post as advisory comments. Hard gates (tests failing, missing required docs, security violations) escalate to a required check that blocks merge.
- Timeout and retries. The job has a strict timeout (10 minutes) and a predictable retry policy, so a flaky review does not block a PR indefinitely.
The workflow is sanitized to satisfy the public-repo security rules: no account IDs, no bucket names, no specific secret names beyond a placeholder. A real deployment substitutes the organization's own secret management and review runner.
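The update-or-create logic behind the idempotent PR comment can be sketched in a few lines. This is an illustrative Python sketch against the GitHub REST API, not the framework's actual runner; the marker string and function names are assumptions.

```python
# Idempotent PR comment sketch: find the review comment by a hidden HTML
# marker and PATCH it in place, or POST it on the first run.
import json
import urllib.request

MARKER = "<!-- agent-review -->"  # hypothetical marker hidden in the comment
API = "https://api.github.com"

def find_marker_comment(comments, marker=MARKER):
    """Return the id of the first comment carrying the marker, or None."""
    for comment in comments:
        if marker in comment.get("body", ""):
            return comment["id"]
    return None

def _call(method, url, token, payload=None):
    """Minimal authenticated GitHub REST call returning parsed JSON."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def upsert_review_comment(repo, pr_number, body, token):
    """Update the agent-review comment in place, or create it on first run."""
    base = f"{API}/repos/{repo}/issues"
    comments = _call("GET", f"{base}/{pr_number}/comments", token)
    full_body = f"{MARKER}\n{body}"
    comment_id = find_marker_comment(comments)
    if comment_id is not None:
        _call("PATCH", f"{API}/repos/{repo}/issues/comments/{comment_id}",
              token, {"body": full_body})
    else:
        _call("POST", f"{base}/{pr_number}/comments", token,
              {"body": full_body})
```

Because the marker travels with the comment body, reruns on every push converge on a single comment rather than a growing thread.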
### 4. Review Dimensions
Review is structured around seven dimensions. Each dimension answers a specific question. The output maps one finding per dimension per file, which keeps the review legible.
| Dimension | Question |
|---|---|
| Correctness | Does the change do what it says? Are invariants preserved? Are error paths handled? |
| Safety | Can this code corrupt data, destroy state, or break running systems? Are destructive operations gated appropriately? |
| Docs | Are evergreen docs updated in this PR? Is the commit message sufficient? Does any durable rationale need a Decision Record? |
| Tests | Does the change ship with tests? Do the tests test behavior, not just absence of panic? Are edge cases covered? |
| Compatibility | Does the change break API consumers, database schemas, serialized formats, or configuration contracts? |
| Security | Does the change introduce injection vectors, auth bypass, exposed secrets, unsafe deserialization, or permissioning issues? |
| Performance | Does the change regress hot paths? Are new N+1 queries, unbounded allocations, or blocking I/O introduced? |
Dimensions are orthogonal. A single finding should fall cleanly into one of them. When a finding spans two dimensions, it usually means the diagnosis is still shallow; splitting it into two findings makes the review sharper.
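The "one finding per dimension per file" shape can be made concrete with a small data model. A hypothetical sketch — field names and severity levels are assumptions, not the framework's actual schema:

```python
# Hypothetical shape for the structured review output: one finding per
# dimension per file, so the PR comment can render as a table.
from dataclasses import dataclass, asdict

DIMENSIONS = ("correctness", "safety", "docs", "tests",
              "compatibility", "security", "performance")

@dataclass(frozen=True)
class Finding:
    dimension: str   # exactly one of DIMENSIONS -- findings do not span two
    file: str        # path the finding applies to
    severity: str    # illustrative levels: "info", "warn", "block"
    message: str

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")

def to_json_rows(findings):
    """Flatten findings into JSON-serializable rows for the PR comment."""
    return [asdict(f) for f in findings]
```

Rejecting unknown dimensions at construction time enforces the orthogonality rule mechanically: a finding that does not fit one dimension cannot be emitted at all.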
### 5. Review Gates
A review gate is a hard stop that blocks merge. The framework defines four default gates, and repos can add their own:
| Gate | Blocks merge when |
|---|---|
| Tests | Any required test suite fails on the PR branch. |
| Docs | Behavior changed but an evergreen doc referenced by the changed area was not updated. |
| Security | A security finding above a configured severity threshold is open. |
| Compatibility | A breaking change is introduced without a Decision Record explaining the migration path. |
All other findings from the seven-dimension review surface as advisory comments. Advisories are visible, addressable, and resolved in the PR thread; they do not block the merge button. This split keeps the gating discipline narrow enough that teams do not learn to ignore it.
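The advisory-vs-gate split reduces to a small partition function. A sketch under assumed severity names and thresholds — the real configuration would live with the repo:

```python
# Gate evaluation sketch: hard gates escalate to a required check, everything
# else stays advisory. Dimension thresholds here are illustrative.
BLOCKING = {
    # dimension -> minimum severity that escalates to a required check
    "tests": "block",
    "docs": "block",
    "security": "warn",        # assumed configured threshold for security
    "compatibility": "block",
}
SEVERITY_ORDER = {"info": 0, "warn": 1, "block": 2}

def split_findings(findings):
    """Partition findings into (blocking, advisory) lists."""
    blocking, advisory = [], []
    for f in findings:
        threshold = BLOCKING.get(f["dimension"])
        if threshold is not None and \
           SEVERITY_ORDER[f["severity"]] >= SEVERITY_ORDER[threshold]:
            blocking.append(f)
        else:
            advisory.append(f)
    return blocking, advisory

def merge_allowed(findings):
    """Merge is allowed only when no finding crosses a gate threshold."""
    return not split_findings(findings)[0]
```

Note that performance is deliberately absent from the blocking map: per the gate table, only the four default gates block, and everything else surfaces in the PR thread.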
### 6. Verify Discipline
Verify is distinct from review. Review reads the code; verify runs it. Four verify layers, in increasing cost and increasing realism:
| Layer | What it proves |
|---|---|
| Unit tests | Isolated logic behaves correctly for documented inputs. |
| Integration tests | Components wire together. Real database, real HTTP client, real external API stubs where feasible. |
| Staging / canary | The change runs under production-like traffic, on production-like data, without affecting all users. |
| Observability check | Metrics, logs, and traces show the change working as expected in the running system. No new errors, no latency regressions, no unexpected resource use. |
The first two layers are covered by standard CI. The third and fourth are where most teams lose discipline: a change passes CI and is declared "verified," but no one watched it run in a realistic environment. The framework's position is that "tests pass" is a gate, not a finish line. Verification is complete when the change has been observed running under real conditions without regression.
For UI changes this includes browser testing: the feature is exercised manually at 375 / 768 / 1200 px, console is checked, network requests are inspected, and the feature is used along a realistic path. Type checks and test suites verify code correctness, not feature correctness.
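The observability check in the fourth layer can be expressed as a comparison of pre- and post-rollout numbers. A minimal sketch, assuming the metrics arrive as plain values from whatever monitoring system the team runs; the thresholds are illustrative:

```python
# Observability check sketch: compare error rate and p99 latency before and
# after rollout. How the numbers are fetched is deployment-specific.
def deploy_regressed(before, after,
                     max_error_increase=0.001,   # +0.1% absolute error rate
                     max_latency_ratio=1.10):    # +10% p99 latency
    """Return a list of regression descriptions; empty means healthy."""
    problems = []
    if after["error_rate"] - before["error_rate"] > max_error_increase:
        problems.append("error rate regressed")
    if after["p99_ms"] > before["p99_ms"] * max_latency_ratio:
        problems.append("p99 latency regressed")
    return problems
```

The value of writing the check down, even this crudely, is that "no one watched it run" becomes impossible: either the check ran and returned empty, or verification is not complete.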
### 7. Deployment
Deployment is part of verify, not a separate stage. A change that merges and breaks in production has not been verified, regardless of how green CI was. Three deployment concerns are first-class:
#### Rollout strategy
The framework does not prescribe a specific rollout mechanism (canary, blue/green, progressive delivery, feature flags). It does require that the change's rollout strategy is explicit somewhere the review can see it. In most repos the right surface is the PR description or a linked runbook. For changes that carry lasting architectural implications (a new canary infrastructure, a new feature flag system), a Decision Record captures the rationale.
#### Post-deploy smoke
Every change that alters behavior requires a post-deploy smoke step: a scripted or manual check that the feature works in production after rollout. A post-deploy smoke that only checks "service is healthy" is not sufficient; it must exercise the specific feature that changed. For trivial changes this can be a one-line curl. For larger changes it is a small suite.
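A smoke step one notch above the one-line curl might look like the following. The endpoint, query shape, and expected response are purely illustrative; the point is that the check exercises the changed feature's behavior, not just liveness:

```python
# Post-deploy smoke sketch: hit the feature that changed, not /health.
# URL scheme and response shape are hypothetical.
import json
import urllib.parse
import urllib.request

def validate_page(status, page):
    """Check the smoke response shape; returns the item count."""
    assert status == 200, f"unexpected status {status}"
    assert "items" in page, "response missing 'items' key"
    return len(page["items"])

def smoke_check(base_url, filter_value="status:active"):
    """Exercise the new filter endpoint in production after rollout."""
    url = f"{base_url}/api/items?filter={urllib.parse.quote(filter_value)}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return validate_page(resp.status, json.load(resp))
```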
#### Rollback discipline
Every deployable change has a rollback path, and the rollback path is documented before merge, not discovered during an incident. Rollback paths are not all symmetrical: a schema migration cannot be rolled back by reverting the deploy, for example. When the rollback is not a simple revert, the Decision Record for the change explains what the actual rollback procedure is.
#### Incident triage as review-in-reverse
An incident is a review that runs too late. The discipline for recovering value from one is to run the review dimensions in reverse against the change that caused the incident: which dimension failed, why did the auto-review not catch it, and what should change in the review dimensions so it catches the next one. Findings that produce durable lessons become Decision Records. The change-rhythm and review-and-verify layers reinforce each other here: the incident's lesson is captured by the same Decision Record mechanism that captures architectural choices.
### 8. Review Findings Feed Decision Records
Most review findings are resolved in the PR and forgotten. A small subset surfaces durable rationale: a constraint the team did not know about, a trade-off that future reviews should respect, a class of bug that should be prevented at the policy level.
When a review finding crosses that threshold, it becomes a Decision Record through the same /agent-record-decision command introduced in part 2. Three common patterns:
- a security finding that reveals an architectural weakness becomes a DR that captures the mitigation approach
- a compatibility finding that forces an API break becomes a DR that records the break and its migration path
- a post-incident review that identifies a missing review dimension becomes a DR that proposes adding the dimension and updates the framework's dimension list accordingly
This is the feedback loop that keeps the framework from calcifying. The review dimensions, the gates, and the policies all get updated by lessons surfaced in actual reviews and actual incidents, not by abstract reasoning.
## Review-and-Verify Capabilities
This layer adds exactly one command to the framework's surface.
### Review Change
/agent-review-change is invoked by CI on PR events (opened, reopened, synchronize, ready_for_review). It can also be called by hand for phase completion, pre-deploy, or post-incident review. It loads the seven review dimensions and the gate definitions, runs them against the diff, and produces structured output suitable for a PR comment or a GitHub check.
The command is deliberately narrow. It does not plan changes, implement changes, or merge changes. Those are owned by the commands in parts 1 and 2. /agent-review-change reviews; everything else is out of scope.
### Usage
The review-and-verify command lives under ~/.claude/commands/ alongside the routing and change-rhythm commands from parts 1 and 2, using the reserved agent-* namespace.
```mermaid
graph LR
    PR["PR opened<br/>or synchronize"]:::routing
    REVIEW["/agent-review-change<br/>auto-triggered"]:::api
    GATES{"Hard gate<br/>violation?"}:::decision
    COMMENT["Structured PR comment<br/>7 review dimensions"]:::data
    BLOCK["Required check fails<br/>merge blocked"]:::onDemand
    MERGE["Merge allowed"]:::data
    PR --> REVIEW
    REVIEW --> GATES
    GATES -- "yes" --> BLOCK
    GATES -- "no" --> COMMENT
    COMMENT --> MERGE
    BLOCK -.-> COMMENT
    classDef routing fill:#5C3E14,stroke:#E8A849,color:#fff,stroke-width:1.5px
    classDef api fill:#1B3D3A,stroke:#4ECDC4,color:#fff,stroke-width:1.5px
    classDef data fill:#2A3054,stroke:#7B93DB,color:#fff,stroke-width:1.5px
    classDef onDemand fill:#3A2054,stroke:#B07CD8,color:#fff,stroke-width:1.5px
    classDef decision fill:#4A1E3A,stroke:#E85D75,color:#fff,stroke-width:1.5px
```
Repos that want a custom review dimension (for example, "accessibility" on a frontend-heavy repo) add it in repo-local docs at docs/agents/reference/review-dimensions.md. Repo gates add to the framework gates rather than replace them.
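A repo-local dimension entry might look like the following. The file location is the framework's; the contents are illustrative, assuming only that a dimension needs a name, a question, and a gate posture:

```markdown
## Accessibility

Question: Do UI changes preserve keyboard navigation, focus order,
color contrast, and ARIA labeling for the affected components?

Gate: advisory only.
```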
### Example: an auto-triggered review that caught a regression
A change adds a new endpoint that accepts a user-supplied filter and composes it into a database query. The PR passes CI. Tests are green. The change is one file, seven lines.
- PR opens. The GitHub Actions workflow fires /agent-review-change.
- The review runs across the seven dimensions and posts a structured comment.
- The security dimension flags the filter as a potential injection vector: the user-supplied value is string-concatenated into the SQL, not parameterized.
- The finding triggers the security gate (severity above the configured threshold). The required check fails. Merge is blocked.
- The author switches to a parameterized query, pushes a new commit.
- The synchronize event fires. The review reruns. The security finding is gone. The required check passes.
- The PR merges.
Nothing about this example required a human reviewer. The auto-trigger, the dimension, and the gate did the work. Human review is still available; it simply is not the first line of defense anymore.
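The before-and-after of that fix is worth seeing side by side. A sketch using sqlite3, since the example does not name the actual stack:

```python
# The injection the security dimension caught, and the parameterized fix.
# Schema and data are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, status TEXT)")
conn.execute("INSERT INTO items VALUES ('widget', 'active')")

user_filter = "active' OR '1'='1"  # hostile user-supplied value

# Before: string concatenation -- the filter becomes part of the SQL text,
# so the OR clause executes and every row leaks.
vulnerable = f"SELECT name FROM items WHERE status = '{user_filter}'"
leaked = conn.execute(vulnerable).fetchall()

# After: parameterized query -- the filter is bound as data, never as SQL,
# so the hostile string matches nothing.
safe = "SELECT name FROM items WHERE status = ?"
filtered = conn.execute(safe, (user_filter,)).fetchall()
```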
## Tradeoffs

### Benefits
- Every PR gets reviewed, not just the ones someone remembered to tag.
- Review output is structured and diffable across runs.
- Gates keep the discipline narrow enough that authors do not learn to ignore it.
- Verify is separate from tests, so "tests pass" stops being confused with "change works."
- Deployment, rollback, and incident triage live in the same discipline, not adjacent to it.
- The feedback loop into Decision Records keeps the framework grounded in real findings.
### Costs
- CI cost per PR increases: every push runs a review.
- Structured reviews can produce false positives that the author has to triage. The advisory vs. gate split mitigates this but does not eliminate it.
- Rollback discipline requires work that is easy to skip. If a team does not enforce it, the benefit evaporates.
- The seven-dimension matrix is a design opinion. A team with different priorities may need to adjust it.
These costs are acceptable because they buy consistent review coverage, explicit gating, and a verification discipline that extends to the surface where changes actually break.
## Recommendation
Adopt this layer once the routing (part 1) and change rhythm (part 2) are in place:
- install /agent-review-change under ~/.claude/commands/
- wire the GitHub Actions workflow (or equivalent CI) to fire the command on PR events
- adopt the seven review dimensions as the default; extend them per repo as needed
- enforce the four default gates; add repo-local gates only when a repeated failure class justifies one
- treat staging, canary, and observability checks as part of verify, not optional
- document rollback before merge for any change that carries deployment risk
- feed durable findings from reviews and incidents back into Decision Records
This completes the three-layer governance framework. Routing decides what gets loaded, rhythm decides how the change moves, and review-and-verify decides when the change is ready to run in front of real users. Each layer is independently useful, and together they form a complete cycle from intent to production.
Appendix: The review command skeleton, the review-dimensions reference, the sanitized GHA workflow, and the deployment playbook are available in the implementation appendix.
Series: Part 1 · Routing · Part 2 · Change rhythm