Review, Verify, Deploy

Closing a change without silent regressions. Auto-triggered self-review runs on every PR open and every push. Verification gates prove the change works. Deployment is part of the same discipline.

Series · Agent Governance for Claude Code
  1. Routing · how layers compose and how docs get loaded
  2. Change rhythm · scratch plans, phase discipline, and ADRs as the only durable artifact
  3. Review and verify · this post · auto-triggered self-review on every PR, verification gates, and deployment

Summary

This post defines how a change proves it is correct, safe, and ready to ship.

The review-and-verify layer standardizes review triggers, review dimensions, merge gates, the verify discipline, and deployment.

The core rule is simple. Every PR opens with an automated review. Every push updates it. Hard gates (tests failing, missing docs, security findings) block merge. Verification is not complete until the change is running in production under real load and has survived a smoke test.

The appendix contains the review command skeleton, the full review-dimensions reference, a sanitized GitHub Actions workflow that wires the auto-trigger, and a deployment playbook. Part 1 defined the routing layer. Part 2 defined the change rhythm that produces the draft PR this layer reviews.

Problem

Implicit review is where changes silently regress. Four failure modes are common.

Review is treated as a human bottleneck

When review lives entirely in a human reviewer's inbox, throughput is capped by that reviewer's availability. Worse, the review is opaque: a single "LGTM" at the bottom of a PR carries no record of what was actually checked. Reading the review a week later tells you the change was approved, not that any particular dimension was verified.

AI review is treated as a novelty

Teams adopting Claude Code often run an AI review on demand ("hey, review this PR") and treat the output as an optional advisory. On-demand review misses the changes no one remembered to review. Optional advisories become noise.

Verify collapses into "tests pass"

Tests passing is necessary, not sufficient. A green test suite says the change does not break anything the tests cover. It says nothing about whether the change works in production, under real load, with real data, against real dependencies. Without a separate verify discipline, "tests pass" becomes "ship it," and the gap between test environment and production absorbs the bugs.

Deployment is treated as out of scope for review

A PR merging to main is not the end of the change. Rollout, post-deploy smoke, and rollback are part of the same change, and regressions most commonly surface there. Governance that stops at merge leaves the most expensive failures uncovered.

All four failures share one root cause. The discipline ends before it reaches the surface where changes actually break: production, running code, real load. Everything before that surface (tests, code review, CI) is a proxy. The framework needs to cover the proxies and extend to the surface itself.

Goals

Non-Goals

Proposed Design

1. Review Model

Three kinds of review exist. Each has a different cost, a different coverage profile, and a different trigger.

| Kind | Trigger, cost, coverage |
| --- | --- |
| Self-review | Author-initiated or auto-triggered. Cheap, consistent, and high-coverage because it runs on every PR. Catches the mechanical dimensions (tests, docs, conventions, basic security). Does not reason about subtle architectural issues. |
| AI-assisted review | Auto-triggered on PR events. Claude Code runs /agent-review-change against the diff and posts a structured comment. Higher reasoning quality than self-review, still cheap, still consistent. Catches subtle issues (missing error paths, incorrect invariants, unsafe SQL) that a checklist would not. |
| Human review | Requested explicitly. Expensive, inconsistent, but irreplaceable for design decisions, domain judgment, and accountability. Human reviewers spend their time on what the other two cannot cover. |

The three kinds are additive, not alternative. Self-review and AI-assisted review run on every PR; human review is requested when the change warrants it. Human reviewers arrive at a PR that has already been through the first two, and their attention is freed for the parts only they can do.

2. Review Triggers

Review runs at four trigger points. The first is the automated default, and the other three are supplementary:

  1. PR events · the automated default; CI invokes /agent-review-change on opened, reopened, synchronize, and ready_for_review.
  2. Phase completion · run by hand when a phase of a larger change lands.
  3. Pre-deploy · run by hand as a final pass before rollout.
  4. Post-incident · run by hand against the change that caused an incident.

The auto-trigger on PR events is the only trigger that must be in place. The other three layer on top.

3. Automating Self-Review

The mechanism matters as much as the principle. "Run self-review on every PR" only works if a CI workflow actually invokes /agent-review-change on every PR event. A sanitized GitHub Actions workflow that wires this:

```yaml
name: agent-review

on:
  pull_request:
    types: [opened, reopened, synchronize, ready_for_review]

permissions:
  contents: read
  pull-requests: write
  checks: write

jobs:
  review:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Run agent review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.AGENT_REVIEW_KEY }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          BASE_SHA: ${{ github.event.pull_request.base.sha }}
          HEAD_SHA: ${{ github.event.pull_request.head.sha }}
        run: |
          agent-review-change \
            --diff-range "$BASE_SHA..$HEAD_SHA" \
            --pr "$PR_NUMBER" \
            --dimensions correctness,safety,docs,tests,compatibility,security,performance \
            --output structured

      - name: Post review
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            // review.summary is whatever markdown the review runner emitted.
            const review = JSON.parse(fs.readFileSync('agent-review-output.json', 'utf8'));
            const marker = '<!-- agent-review -->';
            const body = `${marker}\n${review.summary}`;
            // Update the existing review comment in place, or create one.
            const { data: comments } = await github.rest.issues.listComments({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
            });
            const existing = comments.find((c) => c.body.startsWith(marker));
            if (existing) {
              await github.rest.issues.updateComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                comment_id: existing.id,
                body,
              });
            } else {
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: context.issue.number,
                body,
              });
            }
            // Any blocking gate is escalated separately as a required check.
```

Five properties make this workflow the governance mechanism rather than a convenience: it fires on all four PR events, so no PR escapes review; it reviews the exact base..head diff rather than a guess at what changed; it posts one structured comment and updates it in place, so the PR thread is not flooded; blocking gates escalate to a required check, so the block is enforced by the merge button rather than by etiquette; and the ten-minute timeout bounds the cost of a stuck review.

The workflow is sanitized to satisfy the public-repo security rules: no account IDs, no bucket names, no specific secret names beyond a placeholder. A real deployment substitutes the organization's own secret management and review runner.

4. Review Dimensions

Review is structured around seven dimensions. Each dimension answers a specific question. The output maps one finding per dimension per file, which keeps the review legible.

| Dimension | Question |
| --- | --- |
| Correctness | Does the change do what it says? Are invariants preserved? Are error paths handled? |
| Safety | Can this code corrupt data, destroy state, or break running systems? Are destructive operations gated appropriately? |
| Docs | Are evergreen docs updated in this PR? Is the commit message sufficient? Does any durable rationale need a Decision Record? |
| Tests | Does the change ship with tests? Do the tests test behavior, not just absence of panic? Are edge cases covered? |
| Compatibility | Does the change break API consumers, database schemas, serialized formats, or configuration contracts? |
| Security | Does the change introduce injection vectors, auth bypass, exposed secrets, unsafe deserialization, or permissioning issues? |
| Performance | Does the change regress hot paths? Are new N+1 queries, unbounded allocations, or blocking I/O introduced? |

Dimensions are orthogonal. A single finding should fall cleanly into one of them. When a finding spans two dimensions, it usually means the diagnosis is still shallow; splitting it into two findings makes the review sharper.
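The "one finding per dimension per file" shape can be sketched as a small data model. This is an illustration, not the command's actual schema; the field names and the info/warn/high severity scale are assumptions:

```python
from dataclasses import dataclass
from collections import defaultdict

# The framework's seven review dimensions.
DIMENSIONS = ("correctness", "safety", "docs", "tests",
              "compatibility", "security", "performance")

@dataclass(frozen=True)
class Finding:
    dimension: str   # exactly one of the seven DIMENSIONS
    file: str        # the file the finding is anchored to
    severity: str    # assumed scale: "info", "warn", "high"
    message: str

    def __post_init__(self):
        # A finding that spans two dimensions is a shallow diagnosis; reject it.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")

def by_dimension_and_file(findings):
    """Group findings so the review renders one entry per (dimension, file)."""
    grouped = defaultdict(list)
    for f in findings:
        grouped[(f.dimension, f.file)].append(f)
    return dict(grouped)
```

Keying the output by (dimension, file) is what keeps the review legible: a reader can scan one dimension across the whole diff, or one file across all dimensions.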

5. Review Gates

A review gate is a hard stop that blocks merge. The framework defines four default gates, and repos can add their own:

| Gate | Blocks merge when |
| --- | --- |
| Tests | Any required test suite fails on the PR branch. |
| Docs | Behavior changed but an evergreen doc referenced by the changed area was not updated. |
| Security | A security finding above a configured severity threshold is open. |
| Compatibility | A breaking change is introduced without a Decision Record explaining the migration path. |

All other findings from the seven-dimension review surface as advisory comments. Advisories are visible, addressable, and resolved in the PR thread; they do not block the merge button. This split keeps the gating discipline narrow enough that teams do not learn to ignore it.
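A minimal sketch of how the gate/advisory split might be computed. The gated dimensions come from the table above; the severity scale and threshold handling are assumptions, and the real gates check a specific condition rather than the mere presence of a finding:

```python
# Hard gates block merge; everything else surfaces as an advisory comment.
GATED_DIMENSIONS = {"tests", "docs", "security", "compatibility"}
SEVERITY_RANK = {"info": 0, "warn": 1, "high": 2}
SECURITY_THRESHOLD = "high"  # assumed to be configurable per repo

def split_findings(findings):
    """Split (dimension, severity, message) tuples into (blocking, advisory)."""
    blocking, advisory = [], []
    for dimension, severity, message in findings:
        gated = dimension in GATED_DIMENSIONS
        if dimension == "security":
            # Security blocks only at or above the configured severity threshold.
            gated = SEVERITY_RANK[severity] >= SEVERITY_RANK[SECURITY_THRESHOLD]
        # Simplification: real tests/docs/compatibility gates check a concrete
        # condition (suite failed, doc not updated, no Decision Record).
        (blocking if gated else advisory).append((dimension, severity, message))
    return blocking, advisory
```

The point of keeping the blocking set this small is behavioral: a wide gate teaches teams to override it, a narrow gate stays credible.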

6. Verify Discipline

Verify is distinct from review. Review reads the code; verify runs it. Four verify layers, in increasing cost and increasing realism:

| Layer | What it proves |
| --- | --- |
| Unit tests | Isolated logic behaves correctly for documented inputs. |
| Integration tests | Components wire together. Real database, real HTTP client, real external API stubs where feasible. |
| Staging / canary | The change runs under production-like traffic, on production-like data, without affecting all users. |
| Observability check | Metrics, logs, and traces show the change working as expected in the running system. No new errors, no latency regressions, no unexpected resource use. |

The first two layers are covered by standard CI. The third and fourth are where most teams lose discipline: a change passes CI and is declared "verified," but no one watched it run in a realistic environment. The framework's position is that "tests pass" is a gate, not a finish line. Verification is complete when the change has been observed running under real conditions without regression.
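The fourth layer can be partially mechanized rather than left to vibes. A sketch that compares metric snapshots taken before and after rollout; the metric names and thresholds here are illustrative assumptions, not the framework's:

```python
def observability_check(before, after,
                        max_error_rate_increase=0.001,
                        max_p95_latency_ratio=1.10):
    """Compare metric snapshots taken before and after rollout.

    `before`/`after` are dicts like {"error_rate": 0.002, "p95_latency_ms": 180}.
    Returns a list of regression descriptions; an empty list means the check passes.
    """
    regressions = []
    if after["error_rate"] - before["error_rate"] > max_error_rate_increase:
        regressions.append(
            f"error rate rose {before['error_rate']:.4f} -> {after['error_rate']:.4f}")
    if after["p95_latency_ms"] > before["p95_latency_ms"] * max_p95_latency_ratio:
        regressions.append(
            f"p95 latency rose {before['p95_latency_ms']}ms -> {after['p95_latency_ms']}ms")
    return regressions
```

Even a check this crude converts "someone watched the dashboards" into an artifact that can be attached to the PR or the deploy log.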

For UI changes this includes browser testing: the feature is exercised manually at 375 / 768 / 1200 px, console is checked, network requests are inspected, and the feature is used along a realistic path. Type checks and test suites verify code correctness, not feature correctness.

7. Deployment

Deployment is part of verify, not a separate stage. A change that merges and breaks in production has not been verified, regardless of how green CI was. Three deployment concerns are first-class:

Rollout strategy

The framework does not prescribe a specific rollout mechanism (canary, blue/green, progressive delivery, feature flags). It does require that the change's rollout strategy is explicit somewhere the review can see it. In most repos the right surface is the PR description or a linked runbook. For changes that carry lasting architectural implications (a new canary infrastructure, a new feature flag system), a Decision Record captures the rationale.

Post-deploy smoke

Every change that alters behavior requires a post-deploy smoke step: a scripted or manual check that the feature works in production after rollout. A post-deploy smoke that only checks "service is healthy" is not sufficient; it must exercise the specific feature that changed. For trivial changes this can be a one-line curl. For larger changes it is a small suite.
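A smoke step that exercises the specific feature can still be a few lines. A sketch for the filter-endpoint example used later in this post, with the HTTP call injectable so the same check runs against staging or production; the endpoint path and response shape are hypothetical:

```python
import json
from urllib.request import urlopen

def smoke_filter_endpoint(base_url, fetch=None):
    """Post-deploy smoke for the changed feature, not just service health."""
    fetch = fetch or (lambda url: urlopen(url, timeout=5).read())
    # Exercise the specific behavior that changed: the status filter.
    body = json.loads(fetch(f"{base_url}/items?filter=status:active"))
    assert isinstance(body, list), "expected a JSON list of items"
    assert all(item.get("status") == "active" for item in body), \
        "filter returned non-matching items"
    return len(body)
```

Note what the assertions check: not that the service answered 200, but that the filter actually filtered. That is the difference between "service is healthy" and "the change works."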

Rollback discipline

Every deployable change has a rollback path, and the rollback path is documented before merge, not discovered during an incident. Rollback paths are not all symmetrical: a schema migration cannot be rolled back by reverting the deploy, for example. When the rollback is not a simple revert, the Decision Record for the change explains what the actual rollback procedure is.

Incident triage as review-in-reverse

An incident is a review that runs too late. The discipline for recovering value from one is to run the review dimensions in reverse against the change that caused the incident: which dimension failed, why did the auto-review not catch it, and what should change in the review dimensions so it catches the next one. Findings that produce durable lessons become Decision Records. The change-rhythm and review-and-verify layers reinforce each other here: the incident's lesson is captured by the same Decision Record mechanism that captures architectural choices.

8. Review Findings Feed Decision Records

Most review findings are resolved in the PR and forgotten. A small subset surfaces durable rationale: a constraint the team did not know about, a trade-off that future reviews should respect, a class of bug that should be prevented at the policy level.

When a review finding crosses that threshold, it becomes a Decision Record through the same /agent-record-decision command introduced in part 2. Three common patterns:

  1. A constraint the team did not know about becomes a recorded constraint that future changes must respect.
  2. A trade-off accepted during review becomes recorded rationale, so future reviews do not relitigate it.
  3. A class of bug caught in review becomes a policy-level rule, often a new gate or a sharpened dimension.

This is the feedback loop that keeps the framework from calcifying. The review dimensions, the gates, and the policies all get updated by lessons surfaced in actual reviews and actual incidents, not by abstract reasoning.

Review-and-Verify Capabilities

This layer adds exactly one command to the framework's surface.

Review Change

/agent-review-change is invoked by CI on PR events (opened, reopened, synchronize, ready_for_review). It can also be called by hand for phase completion, pre-deploy, or post-incident review. It loads the seven review dimensions and the gate definitions, runs them against the diff, and produces structured output suitable for a PR comment or a GitHub check.

The command is deliberately narrow. It does not plan changes, implement changes, or merge changes. Those are owned by the commands in parts 1 and 2. /agent-review-change reviews; everything else is out of scope.

Usage

The review-and-verify command lives under ~/.claude/commands/ alongside the routing and change-rhythm commands from parts 1 and 2, using the reserved agent-* namespace.

```mermaid
graph LR
  PR["PR opened<br/>or synchronize"]:::routing
  REVIEW["/agent-review-change<br/>auto-triggered"]:::api
  GATES{"Hard gate<br/>violation?"}:::decision
  COMMENT["Structured PR comment<br/>7 review dimensions"]:::data
  BLOCK["Required check fails<br/>merge blocked"]:::onDemand
  MERGE["Merge allowed"]:::data
  PR --> REVIEW
  REVIEW --> GATES
  GATES -- "yes" --> BLOCK
  GATES -- "no" --> COMMENT
  COMMENT --> MERGE
  BLOCK -.-> COMMENT
  classDef routing fill:#5C3E14,stroke:#E8A849,color:#fff,stroke-width:1.5px
  classDef api fill:#1B3D3A,stroke:#4ECDC4,color:#fff,stroke-width:1.5px
  classDef data fill:#2A3054,stroke:#7B93DB,color:#fff,stroke-width:1.5px
  classDef onDemand fill:#3A2054,stroke:#B07CD8,color:#fff,stroke-width:1.5px
  classDef decision fill:#4A1E3A,stroke:#E85D75,color:#fff,stroke-width:1.5px
```

Repos that want a custom review dimension (for example, "accessibility" on a frontend-heavy repo) add it in repo-local docs at docs/agents/reference/review-dimensions.md. Repo gates add to the framework gates rather than replace them.

Example: an auto-triggered review that caught a regression

A change adds a new endpoint that accepts a user-supplied filter and composes it into a database query. The PR passes CI. Tests are green. The change is one file, seven lines.

  1. PR opens. The GitHub Actions workflow fires /agent-review-change.
  2. The review runs across the seven dimensions and posts a structured comment.
  3. The security dimension flags the filter as a potential injection vector: the user-supplied value is string-concatenated into the SQL, not parameterized.
  4. The finding triggers the security gate (severity above the configured threshold). The required check fails. Merge is blocked.
  5. The author switches to a parameterized query, pushes a new commit.
  6. The synchronize event fires. The review reruns. The security finding is gone. The required check passes.
  7. The PR merges.

Nothing about this example required a human reviewer. The auto-trigger, the dimension, and the gate did the work. Human review is still available; it simply is not the first line of defense anymore.
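The fix in step 5 is the standard parameterization move. A sketch of the before and after, using sqlite3 as a stand-in for the repo's actual database layer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, status TEXT)")
conn.execute("INSERT INTO items VALUES ('a', 'active'), ('b', 'archived')")

user_filter = "active' OR '1'='1"  # attacker-controlled input

# What the security dimension flagged: user input concatenated into SQL.
unsafe = conn.execute(
    "SELECT name FROM items WHERE status = '" + user_filter + "'").fetchall()

# The fix pushed in step 5: the value travels as a bound parameter.
safe = conn.execute(
    "SELECT name FROM items WHERE status = ?", (user_filter,)).fetchall()

# The injected OR clause matches every row in the unsafe query; the bound
# parameter is compared as a literal (nonsensical) status and matches nothing.
```

The seven-line diff passed CI because no test fed it hostile input; the dimension caught it by reading the query construction, not by running it.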

Tradeoffs

Benefits

  1. Consistent review coverage: every PR gets the same seven-dimension review, independent of reviewer availability.
  2. Explicit gating: the merge button enforces the four hard gates instead of etiquette.
  3. Verification extends to production: "tests pass" is a gate, not a finish line.

Costs

  1. Every PR event spends a review run: API tokens and CI minutes, including on changes a human would have waved through.
  2. Hard gates add friction: an urgent fix still waits for the review to pass.
  3. The dimensions, gates, and workflow are one more surface to maintain.

These costs are acceptable because they buy consistent review coverage, explicit gating, and a verification discipline that extends to the surface where changes actually break.

Recommendation

Adopt this layer once the routing (part 1) and change rhythm (part 2) are in place. Wire the auto-trigger on PR events first; it is the only mandatory piece, and the remaining triggers, gates, and deployment playbook layer on top.

This completes the three-layer governance framework. Routing decides what gets loaded, rhythm decides how the change moves, and review-and-verify decides when the change is ready to run in front of real users. Each layer is independently useful, and together they form a complete cycle from intent to production.

Appendix: The review command skeleton, the review-dimensions reference, the sanitized GHA workflow, and the deployment playbook are available in the implementation appendix.

Series: Part 1 · Routing · Part 2 · Change rhythm