Building a Production System in One Week

eat-or-yeet.com


The Experiment

The goal was to see how fast one engineer with AI could stand up production-grade infrastructure. The result: Eat or Yeet, a recipe platform with auth, search, ratings, image processing, and a full recipe lifecycle. SSR frontend, Go API, Postgres, Redis, Typesense. Running on a single $14/month node.

graph TB
    BR["Browser"]:::routing --> TR

    subgraph K3S["k3s on EC2 t4g.small"]
        TR["Traefik Ingress"]:::routing --> FE["Frontend
TanStack Start SSR"]:::api
        TR --> API["API
Go + Connect RPC"]:::api
        TR --> IMG["imgproxy"]:::api
        API --> PG[("Postgres")]:::data
        API --> RD[("Redis")]:::data
        API --> TS[("Typesense")]:::data
        API --> IW["Image Worker
River"]:::inference
    end

    IW --> S3["S3"]:::data
    IMG --> S3

    classDef routing fill:#5C3E14,stroke:#E8A849,color:#fff,stroke-width:1.5px
    classDef api fill:#1B3D3A,stroke:#4ECDC4,color:#fff,stroke-width:1.5px
    classDef inference fill:#4A1C27,stroke:#E85D75,color:#fff,stroke-width:1.5px
    classDef data fill:#2A3054,stroke:#7B93DB,color:#fff,stroke-width:1.5px

The Request Path

A user hits eat-or-yeet.com. The request lands on a single EC2 t4g.small running k3s. Traefik routes / to the frontend, /api to Go, /_img to imgproxy. Images live in S3, accessed via IMDS credentials. No static keys anywhere in the system.

Frontend & API

TanStack Start renders the first paint server-side. SEO gets real HTML, users get instant content. File-based routing, React Query for data fetching, and SSR that degrades gracefully if the API is unreachable (a lesson learned the hard way with ensureQueryData crashing SSR).

The API is Go with Connect RPC. One set of .proto files generates the Go server, TypeScript client, and dev stubs. Browsers speak HTTP/JSON. Services speak gRPC. No API drift. The schema is the contract, and breaking changes fail at compile time.

Auth & Authorization

Auth0 handles identity: Google and Apple OAuth, JWT validation, and RBAC with four roles (admin, editor, author, user). The API runs lenient middleware: invalid tokens proceed without claims rather than hard-failing. This keeps unauthenticated browsing fast and lets authorization live in the business logic where it belongs.

Search & Data

Typesense powers multi-dimensional faceted search with sub-millisecond latency on ~50MB of RAM. A pizza is dinner AND Italian AND baking AND weekend-project, not a single rigid category. The index syncs from Postgres via River, a Postgres-backed job queue that also handles image processing. Edge-triggered for speed, with a level-triggered reconciliation sweep as a consistency backstop.

Postgres stores everything: recipes, users, ratings. pg_dump to S3 every 6 hours. Redis handles session storage and query caching.

Images

Upload goes to S3. When a browser requests an image, imgproxy reads from S3 and resizes on demand via signed URLs. WebP, AVIF, smart crop, whatever the client needs. No pre-generated thumbnails, no image processing pipeline. ~30MB of RAM, instant format negotiation.

Infrastructure

All of this runs on a single ARM64 EC2 instance for $14/month. k3s provides a real Kubernetes API, so the same Helm charts deploy to a $14 node today and a multi-node EKS cluster tomorrow. That's the point: the migration path is a values.yaml change, not a rewrite. Traefik handles ingress, EBS provides persistent storage, and IMDS means zero static AWS credentials anywhere in the system.

Technology Choices

Compute: EKS → k3s

The original plan called for EKS with a multi-account AWS Organization, 3 NAT Gateways, and ArgoCD for GitOps deployments. The AWS bill hit ~$30/day with zero production traffic. EKS control plane alone is $72/month, a hard floor you can't negotiate. NAT Gateways added $97/month. ArgoCD is powerful but heavy for a single node.

k3s on a t4g.small gives you the same Kubernetes API: same Helm charts, same kubectl, same RBAC, for $14/month. Single node, no HA. The migration path to EKS is a values.yaml change, not a rewrite. ARM64 (Graviton) gives better price-performance than x86. ArgoCD replaced by SSH + helm upgrade --wait. Simpler, sufficient, and the deploy fails if pods don't become ready. Total savings: $180+/month.

Build System: Bazel for Go

Alternatives considered: go build with golangci-lint, GoReleaser, or Mage. The payoff of Bazel: nogo runs static analysis at compile time (not as a separate lint step that developers skip), builds are hermetic (same output on every machine, every time), and cross-compilation to ARM64 is a --platforms flag. CI builds Go on an x86 runner and produces ARM64 Docker images for production. No emulation, no multi-arch builds, no QEMU.

Monorepo: Two Build Systems, One Repo

Go services use Bazel. TypeScript frontend uses Nx. The conventional wisdom says pick one. Bazel can build JavaScript. Nx has plugins for everything. But each tool is dominant in its own ecosystem for a reason. Bazel's rules_go and nogo integration is unmatched for Go: hermetic builds, compile-time static analysis, and zero-config cross-compilation. Nx's understanding of TypeScript project references, its computation cache, and its integration with Vite, Storybook, and Playwright is unmatched for frontend. Forcing one tool to do both means fighting the tool instead of shipping.

The conventional knock on this approach is setup complexity. Gazelle for BUILD file generation, gopackagesdriver to bridge gopls and Bazel for IDE support, Starlark macros for custom rules, Nx workspace config, TypeScript project references. With AI, that complexity flattens to near-zero. Claude generated the initial BUILD files, configured Gazelle's resolve directives, wired up gopackagesdriver, scaffolded the Nx workspace config with project references, and built the cross-build-system CI wiring. The setup that would normally take days of reading docs and fighting config took hours.

The hard part of a monorepo with generated code (proto types, mocks, instrumented wrappers) is telling the build system where things live. Gazelle can't see Bazel-generated outputs, so it needs explicit resolve directives. A handful of regex rules in the root BUILD.bazel handles every generated import pattern:

# Root BUILD.bazel — Gazelle resolve directives for generated code

# gazelle:prefix github.com/example/project
# gazelle:go_naming_convention go_default_library

# Proto-generated Connect RPC stubs (explicit, since they have custom rules)
# gazelle:resolve go .../gen/yeet/v1/yeetv1connect //proto/yeet/v1:yeetv1connect
# gazelle:resolve go .../gen/yeet/v1/yeetv1connect/instrumented_yeetv1connect //proto/yeet/v1:instrumented_yeetv1connect

# All other proto-generated Go types (regex catches the rest)
# gazelle:resolve_regexp go \.com/.*/gen/(.+)$ //proto/$1:go_default_library

# Mock libraries generated by go_mock_library macro
# gazelle:resolve_regexp go \.com/.*/(backend/.*)/mock_(\w+) //$1:mock_$2

# Instrumented decorator libraries generated by go_instrumented_library macro
# gazelle:resolve_regexp go \.com/.*/(backend/.*)/instrumented_(\w+) //$1:instrumented_$2

# Proto BUILD files are hand-written (custom rules opaque to Gazelle)
# gazelle:exclude proto

Five directives. That's the entire cost of teaching Gazelle about three categories of generated code across the whole repo. Every new mock, every new instrumented wrapper, every new proto package is automatically resolvable. No per-package BUILD.bazel edits required.

The glue between Bazel and Nx is Protobuf. .proto files live in proto/ at the repo root. Bazel generates Go types via rules_proto. Nx triggers buf generate for TypeScript types. A CI workflow detects proto changes and runs both code generators, failing the PR if generated code is stale. The contract is the boundary: backend and frontend can evolve independently as long as the proto contract holds.

Alternatives considered: Turborepo (JavaScript-only, no Go story), Pants (strong Python/Go support but weaker TypeScript ecosystem), Bazel-for-everything (possible but rules_js is high-friction for a React + Vite + Storybook stack), and Nx-for-everything (no nogo equivalent, weaker hermeticity for Go). The two-tool approach adds the cost of two dependency graphs and two cache systems, but each tool operates in its strength zone.

API Contract: Connect RPC over REST

REST with OpenAPI gives you documentation. gRPC gives you a compiler but locks out browsers. Connect RPC gives you both: one set of .proto files generates the Go server, TypeScript client, and dev stubs. Browsers speak HTTP/JSON; services speak gRPC. A breaking schema change fails at compile time in both languages. The entire frontend was built and tested against generated dev stubs before a single line of backend existed. The trade-off is less ecosystem tooling than OpenAPI: no Swagger UI, fewer code generators. Worth it for a typed contract that can't drift.

Search: Typesense over Elasticsearch

The plan originally used Postgres full-text search. Plan v4 introduced Typesense for what Postgres FTS can't do: typo tolerance, prefix search, search-as-you-type, facet counts, highlighting, and synonyms. The features that define great recipe discovery. Elasticsearch was considered but uses ~500MB+ of RAM for a corpus of thousands of documents. Typesense uses ~50MB, returns results in under a millisecond, and the entire schema is a single JSON definition. Postgres FTS remains as a fallback.

sequenceDiagram
    participant B as Browser
    participant FE as Frontend
    participant API as yeet-api
    participant TS as Typesense
    participant PG as Postgres

    B->>FE: GET /recipes?q=sourdough&tags=baking
    FE->>API: SearchRecipes RPC
    API->>TS: Multi-search (query + facet filters)
    TS-->>API: Hits + facet counts (~0.5ms)
    API->>PG: Hydrate full recipe entities (batch)
    PG-->>API: Recipe rows
    API-->>FE: SearchRecipesResponse (proto)
    FE-->>B: SSR HTML + hydration data

    Note over API,TS: Facets: cuisine, meal type,<br/>difficulty, time, tags
    Note over TS,PG: Typesense = search index<br/>Postgres = source of truth

Job Queue: River over Redis/RabbitMQ

Image processing and search index sync need background jobs. The conventional choices like Sidekiq, Bull, and RabbitMQ all introduce new infrastructure. River uses Postgres as the queue. No new operational surface, and critically: transactional job enqueue. Insert the recipe and enqueue the index job in the same transaction. If the insert fails, the job never enqueues. The trade-off is coupling to Postgres, which is already the single point of failure anyway.
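The invariant is worth spelling out. This toy sketch uses an in-memory transaction rather than River's actual API, but demonstrates the same guarantee: the recipe row and the job row commit together or not at all.

```go
package main

import (
	"errors"
	"fmt"
)

// A toy stand-in for "INSERT recipe + INSERT river_job in one transaction".
// With River, both inserts go through the same *sql.Tx; an in-memory tx is
// enough here to show the atomicity.

type DB struct {
	Recipes []string
	Jobs    []string
}

type Tx struct {
	db      *DB
	recipes []string
	jobs    []string
}

func (db *DB) Begin() *Tx { return &Tx{db: db} }

func (tx *Tx) InsertRecipe(name string) error {
	if name == "" {
		return errors.New("recipe name required") // simulated constraint failure
	}
	tx.recipes = append(tx.recipes, name)
	return nil
}

func (tx *Tx) EnqueueIndexJob(recipe string) {
	tx.jobs = append(tx.jobs, "index:"+recipe)
}

func (tx *Tx) Commit() {
	tx.db.Recipes = append(tx.db.Recipes, tx.recipes...)
	tx.db.Jobs = append(tx.db.Jobs, tx.jobs...)
}

// CreateRecipe inserts the row and enqueues the sync job atomically:
// if the insert fails, the tx is abandoned and the job never exists.
func CreateRecipe(db *DB, name string) error {
	tx := db.Begin()
	if err := tx.InsertRecipe(name); err != nil {
		return err // rollback: nothing reaches the database
	}
	tx.EnqueueIndexJob(name)
	tx.Commit()
	return nil
}

func main() {
	db := &DB{}
	_ = CreateRecipe(db, "sourdough")
	_ = CreateRecipe(db, "") // fails: no orphan job is enqueued
	fmt.Println(len(db.Recipes), len(db.Jobs)) // 1 1
}
```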

sequenceDiagram
    participant API as yeet-api
    participant PG as Postgres
    participant W as Image Worker
    participant R2 as S3
    participant TS as Typesense

    Note over API,PG: Single transaction
    API->>PG: INSERT recipe
    API->>PG: INSERT river_job (image)
    API->>PG: INSERT river_job (index)
    API->>PG: COMMIT

    Note over W,R2: Async — Image Worker picks up jobs
    W->>PG: Poll river_job
    W->>R2: Upload original image
    W->>PG: Mark job complete

    Note over W,TS: Async — Search sync
    W->>PG: Poll river_job
    W->>PG: SELECT recipe
    W->>TS: Upsert document
    W->>PG: Mark job complete

    Note over W,TS: Level-triggered fallback
    W-->>PG: Periodic full reconciliation
    W-->>TS: Bulk re-index (self-healing)
        

Images: imgproxy + S3

Alternatives considered: pre-generated thumbnails (storage explosion, processing pipeline), Cloudinary/Imgix (per-transformation pricing, vendor lock-in), or Sharp in Node (ties image processing to the frontend). imgproxy reads from S3 and resizes on demand via signed URLs. WebP, AVIF, smart crop, ~30MB of RAM. The signed URL scheme means clients can't request arbitrary transformations. Only the API can mint valid URLs. IMDS provides S3 credentials automatically. No static keys. No thumbnails to pre-generate, no storage multiplication, no external service dependency.

sequenceDiagram
    participant B as Browser
    participant FE as Frontend (SSR)
    participant API as yeet-api
    participant IP as imgproxy
    participant R2 as S3

    B->>FE: GET /recipe/sourdough
    FE->>API: GetRecipe RPC
    API-->>FE: Recipe + signed imgproxy URLs
    FE-->>B: HTML with <img src="/_img/...">

    B->>IP: GET /_img/rs:300:300/plain/s3://bucket/photo.jpg@sha256
    IP->>R2: GET photo.jpg (S3 API)
    R2-->>IP: Original image
    IP-->>B: Resized WebP (300×300, smart crop)

    Note over B,IP: Format negotiation via Accept header
    Note over API,IP: Only API can mint valid signed URLs
        

Observability: Decoration over Bespoke

The conventional approach to observability: sprinkle span.Start(), metrics.Increment(), and logger.Info() through your business logic. Every method becomes half observability code, half actual logic. Untestable, inconsistent, and a maintenance nightmare.

This project uses a decorator pattern via Bazel codegen. A custom Starlark macro (go_instrumented_library) reads Go interfaces at build time and generates Instrumented* wrapper types that delegate each method through an Instrumentor: automatic span creation, duration metrics, and structured logging with zero boilerplate in the implementation:

# BUILD.bazel — declares which interfaces get instrumented
load("//tools/bazel:instrumentorgen.bzl", "go_instrumented_library")

go_instrumented_library(
    name = "instrumented_store",
    library = ":go_default_library",
    interfaces = ["RecipeStore", "UserStore", "RatingStore"],
    importpath = ".../instrumented_store",
)

# With per-method auth policies (auth + observability in one decorator):
go_instrumented_library(
    name = "instrumented_controller",
    library = ":go_default_library",
    interfaces = ["RecipeController"],
    auth = {
        "RecipeController.Create": "authenticate",
        "RecipeController.Update": "authorize_owner:recipe.ID",
        "RecipeController.Delete": "authorize_owner:recipe.ID",
    },
)

The generated code only exists in Bazel's output tree, never on disk, never in git. The instrumentorgen tool is pure stdlib Go (go/parser, go/types, text/template). No external dependencies. Gazelle auto-resolves instrumented imports via a single regex directive.

At the wiring site (cmd/*/main.go), each concrete implementation is wrapped with its generated decorator. Business logic packages never import observability. The result: every interface call is traced, metered, and logged consistently across 21K lines of Go, without a single manual span.Start().
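The pattern, shrunk to a hand-written miniature: in the real project the wrapper below is generated by instrumentorgen, and the type names here are illustrative, but the shape of the wiring is the same.

```go
package main

import "fmt"

type Recipe struct{ ID, Title string }

// RecipeStore is the interface the macro would read from the source package.
type RecipeStore interface {
	GetByID(id string) (*Recipe, error)
}

// Concrete implementation: pure business logic, zero observability imports.
type pgRecipeStore struct{}

func (s *pgRecipeStore) GetByID(id string) (*Recipe, error) {
	return &Recipe{ID: id, Title: "Sourdough"}, nil
}

// Instrumentor is the hook the generated wrappers delegate through.
type Instrumentor interface {
	Observe(method string, err error)
}

// InstrumentedRecipeStore mirrors a generated decorator: it delegates each
// method and reports the call to the Instrumentor.
type InstrumentedRecipeStore struct {
	inner RecipeStore
	ins   Instrumentor
}

func (d *InstrumentedRecipeStore) GetByID(id string) (*Recipe, error) {
	r, err := d.inner.GetByID(id) // generated code also wraps this in a span + timer
	d.ins.Observe("store.recipe.GetByID", err)
	return r, err
}

type logInstrumentor struct{ calls int }

func (l *logInstrumentor) Observe(method string, err error) {
	l.calls++
	fmt.Printf("observed %s err=%v\n", method, err)
}

func main() {
	// The wiring site: wrap once in main, pass the interface everywhere else.
	ins := &logInstrumentor{}
	var store RecipeStore = &InstrumentedRecipeStore{inner: &pgRecipeStore{}, ins: ins}
	r, _ := store.GetByID("r1")
	fmt.Println(r.Title, ins.calls)
}
```

Because the wrapper satisfies the same interface, no caller can tell the difference between the decorated and undecorated store.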

Observability as Code: Datadog

Vendor-neutral instrumentation: OTel traces, Prometheus metrics, zap structured logs. DD Agent consumes all three. Zero DD SDK in app code. Dashboards, monitors, and SLOs are Pulumi resources, not clickops:

// infra/pulumi/stacks/observability/dashboards.ts
import * as datadog from "@pulumi/datadog";

// Service Overview — the RED dashboard (Rate, Errors, Duration)
const serviceOverview = new datadog.Dashboard("service-overview", {
  title: "Eat or Yeet — Service Overview",
  layoutType: "ordered",
  widgets: [
    // Request rate per service
    timeseriesWidget("Requests / sec", [
      "sum:trace.connectrpc.request{service:yeet-api}.as_rate()",
      "sum:http.server.request{service:yeet-frontend}.as_rate()",
      "sum:river.job.completed{service:yeet-worker}.as_rate()",
    ]),
    // Error rate (% of total)
    queryValueWidget("API Error Rate",
      "sum:trace.connectrpc.request.errors{service:yeet-api}.as_rate() / " +
      "sum:trace.connectrpc.request{service:yeet-api}.as_rate() * 100"
    ),
    // Latency percentiles — P50, P95, P99
    timeseriesWidget("API Latency", [
      "p50:trace.connectrpc.request.duration{service:yeet-api}",
      "p95:trace.connectrpc.request.duration{service:yeet-api}",
      "p99:trace.connectrpc.request.duration{service:yeet-api}",
    ]),
  ],
});

// API Detail — per-endpoint breakdown from decorator metrics
const apiDetail = new datadog.Dashboard("api-detail", {
  title: "Eat or Yeet — API Endpoints",
  layoutType: "ordered",
  widgets: [
    // Every decorated method is a row — the naming convention makes this automatic
    toplistWidget("Slowest Endpoints (P95)",
      "p95:trace.connectrpc.request.duration{service:yeet-api} by {resource_name}"
    ),
    // Error breakdown by RPC method + status code
    timeseriesWidget("Errors by Endpoint", [
      "sum:trace.connectrpc.request.errors{service:yeet-api} by {resource_name}",
    ]),
    // Store-layer latency — catches query regressions before they hit users
    timeseriesWidget("Store Layer (decorated)", [
      "avg:store.recipe.GetByID.duration{service:yeet-api}",
      "avg:store.recipe.List.duration{service:yeet-api}",
      "avg:store.recipe.Search.duration{service:yeet-api}",
    ]),
  ],
});

// Data Layer — leading indicators
const dataLayer = new datadog.Dashboard("data-layer", {
  title: "Eat or Yeet — Data Layer",
  widgets: [
    queryValueWidget("PG Pool Utilization",
      "avg:db.pool.open_connections{service:yeet-api} / " +
      "avg:db.pool.max_open{service:yeet-api} * 100"
    ),
    timeseriesWidget("Redis Hit Rate", [
      "sum:cache.hits{service:yeet-api}.as_rate()",
      "sum:cache.misses{service:yeet-api}.as_rate()",
    ]),
    timeseriesWidget("Typesense Query Latency", [
      "p95:typesense.search.duration{service:yeet-api}",
    ]),
    queryValueWidget("River Queue Depth",
      "avg:river.queue.depth{service:yeet-worker}"
    ),
  ],
});
// infra/pulumi/stacks/observability/monitors.ts

// P1 — page immediately
const apiErrorRate = new datadog.Monitor("api-error-rate-p1", {
  type: "query alert",
  query: "sum(last_2m):sum:trace.connectrpc.request.errors{service:yeet-api}.as_rate() / " +
         "sum:trace.connectrpc.request{service:yeet-api}.as_rate() * 100 > 5",
  name: "[P1] API error rate > 5%",
  message: `{{#is_alert}}
API error rate is {{value}}% — users are impacted.
Dashboard: https://app.datadoghq.com/dashboard/api-detail
First action: check recent deploys, then trace errors by resource_name.
@pagerduty-yeet
{{/is_alert}}`,
  priority: 1,
});

const pgPoolExhaustion = new datadog.Monitor("pg-pool-p1", {
  type: "query alert",
  query: "avg(last_2m):avg:db.pool.open_connections{service:yeet-api} / " +
         "avg:db.pool.max_open{service:yeet-api} * 100 > 90",
  name: "[P1] Postgres connection pool > 90%",
  message: "Pool exhaustion imminent. Check for long-running queries or connection leaks.",
  priority: 1,
});

// P2 — investigate within the hour
const apiLatencyP99 = new datadog.Monitor("api-latency-p2", {
  type: "query alert",
  query: "avg(last_5m):p99:trace.connectrpc.request.duration{service:yeet-api} > 0.5",
  name: "[P2] API P99 latency > 500ms",
  message: "Latency regression. Check store-layer decorated metrics for the slow method.",
  priority: 2,
});

const redisHitRate = new datadog.Monitor("redis-hitrate-p2", {
  type: "query alert",
  query: "avg(last_5m):sum:cache.hits{service:yeet-api}.as_rate() / " +
         "(sum:cache.hits{service:yeet-api}.as_rate() + " +
         "sum:cache.misses{service:yeet-api}.as_rate()) * 100 < 80",
  name: "[P2] Redis hit rate < 80%",
  message: "Cache effectiveness degraded. Check eviction policy or key expiry.",
  priority: 2,
});

const riverQueueGrowing = new datadog.Monitor("river-queue-p2", {
  type: "query alert",
  query: "avg(last_10m):avg:river.queue.depth{service:yeet-worker} > 100",
  name: "[P2] River queue depth growing",
  message: "Jobs enqueuing faster than processing. Check worker pod status and job error rate.",
  priority: 2,
});
// infra/pulumi/stacks/observability/rum.ts

// RUM — what server metrics can't see
const rumDashboard = new datadog.Dashboard("rum-frontend", {
  title: "Eat or Yeet — Real User Monitoring",
  widgets: [
    // Core Web Vitals — the metrics Google ranks you on
    queryValueWidget("LCP (target < 2.5s)",
      "p75:rum.largest_contentful_paint{service:yeet-frontend}"
    ),
    queryValueWidget("CLS (target < 0.1)",
      "p75:rum.cumulative_layout_shift{service:yeet-frontend}"
    ),
    queryValueWidget("INP (target < 200ms)",
      "p75:rum.interaction_to_next_paint{service:yeet-frontend}"
    ),
    // SSR vs hydration — where does the browser spend time?
    timeseriesWidget("Page Load Breakdown", [
      "avg:rum.time_to_first_byte{service:yeet-frontend}",      // SSR
      "avg:rum.dom_content_loaded{service:yeet-frontend}",       // parse
      "avg:rum.load_event{service:yeet-frontend}",               // hydration complete
    ]),
    // Connect RPC from the browser's perspective
    timeseriesWidget("RPC Latency (client-side)", [
      "p95:rum.resource.duration{service:yeet-frontend,resource_type:xhr}",
    ]),
    // JS errors — hydration mismatches, runtime failures
    timeseriesWidget("JS Errors", [
      "sum:rum.error{service:yeet-frontend}.as_rate()",
    ]),
    // Custom business metrics
    timeseriesWidget("User Signals", [
      "sum:rum.action{service:yeet-frontend,action_name:recipe_search}.as_rate()",
      "sum:rum.action{service:yeet-frontend,action_name:recipe_view}.as_rate()",
      "sum:rum.action{service:yeet-frontend,action_name:rating_submit}.as_rate()",
    ]),
  ],
});

// RUM alert — Core Web Vitals regression
const lcpRegression = new datadog.Monitor("lcp-regression-p2", {
  type: "query alert",
  query: "avg(last_15m):p75:rum.largest_contentful_paint{service:yeet-frontend} > 2500",
  name: "[P2] LCP > 2.5s — Core Web Vitals failing",
  message: "Largest Contentful Paint regressed. Check SSR performance, image loading, and font delivery.",
  priority: 2,
});

Every dashboard, monitor, and SLO is a Pulumi resource. pulumi up provisions them. pulumi preview diffs them. No one edits a dashboard in the UI. Changes go through the same plan/review workflow as application code. When someone adds a new decorated interface method, the metrics exist automatically. When someone adds a new RUM action, it shows up in the user signals widget. The observability layer grows with the codebase.

graph LR
    subgraph App["Application Layer"]
        DEC["Decorated Interfaces
(auto-generated)"]:::api --> MET["Prometheus Metrics"]:::data
        DEC --> TRC["OTel Traces"]:::data
        ZAP["zap Structured Logs"]:::api --> LOG["JSON to stdout"]:::data
    end

    subgraph Infra["Infrastructure Telemetry"]
        PG3["Postgres pool stats"]:::data
        RD3["Redis hit/miss"]:::data
        RV3["River queue depth"]:::data
    end

    subgraph Client["Browser"]
        RUM["RUM SDK"]:::inference --> CWV["Core Web Vitals"]:::inference
        RUM --> XHR["XHR/Fetch timing"]:::inference
    end

    MET --> DD["Datadog Agent"]:::routing
    TRC --> DD
    LOG --> DD
    PG3 --> DD
    RD3 --> DD
    RV3 --> DD
    CWV --> DDC["DD Cloud"]:::routing
    XHR --> DDC
    DD --> DDC
    DDC --> DASH["Dashboards"]:::api
    DDC --> MON["Monitors + Alerts"]:::inference

    classDef routing fill:#5C3E14,stroke:#E8A849,color:#fff,stroke-width:1.5px
    classDef api fill:#1B3D3A,stroke:#4ECDC4,color:#fff,stroke-width:1.5px
    classDef inference fill:#4A1C27,stroke:#E85D75,color:#fff,stroke-width:1.5px
    classDef data fill:#2A3054,stroke:#7B93DB,color:#fff,stroke-width:1.5px

Auth: Policy as Build Configuration

Authentication and authorization are the parts of a codebase that seem simple until they're scattered across every handler. The typical pattern: an if !user.IsAuthenticated() guard at the top of each handler, an ownership check buried in the business logic, and a prayer that nobody forgets one. Inconsistent, untestable, and invisible to code review until it's a vulnerability.

This project pushes auth policy into the same decorator layer used for observability. The go_instrumented_library macro accepts an auth map that declares per-method policies at the build system level:

# BUILD.bazel — auth policies declared alongside interface instrumentation
load("//tools/bazel:instrumentorgen.bzl", "go_instrumented_library")

go_instrumented_library(
    name = "instrumented_controller",
    library = ":go_default_library",
    interfaces = [
        "AuthController",
        "RecipeController",
        "RatingController",
    ],
    auth = {
        # "authenticate" — requires a valid caller, rejects anonymous
        "AuthController.UpdateProfile": "authenticate",
        "RatingController.Rate": "authenticate",
        "RecipeController.Create": "authenticate",

        # "authorize_owner:expr" — caller must own the resource (or be admin)
        # expr is the Go expression for the resource ID in the method args
        "RecipeController.Update": "authorize_owner:recipe.ID",
        "RecipeController.Delete": "authorize_owner:id",
        "RecipeController.SubmitForReview": "authorize_owner:id",

        # Admin approval — policy requires a caller; the admin role check
        # itself lives in the business logic
        "RecipeController.ApproveRecipe": "authenticate",
    },
)

The code generator (instrumentorgen, pure stdlib Go using go/parser and text/template) reads the auth map and generates decorator methods that enforce the policy before delegating to the real implementation:

// Generated by instrumentorgen — never written by hand, never in git

func (d *InstrumentedRecipeController) Update(ctx context.Context, recipe *Recipe) error {
    // Auth policy: authorize_owner:recipe.ID
    caller, authErr := d.authz.Authenticate(ctx)
    if authErr != nil {
        return authErr
    }
    ownerID, resolveErr := d.resolver.ResolveOwner(ctx, recipe.ID)
    if resolveErr != nil {
        return resolveErr
    }
    if authzErr := d.authz.AuthorizeOwner(ctx, caller, ownerID); authzErr != nil {
        return authzErr
    }
    ctx = authz.ContextWithCaller(ctx, caller)

    // Observability (same decorator, same method)
    span, ctx := d.instrumentor.StartSpan(ctx, "controller.recipe.Update")
    defer span.End()
    start := time.Now()

    err := d.inner.Update(ctx, recipe)

    d.instrumentor.Record(span, start, err)
    return err
}

The HTTP middleware layer extracts JWT claims from the Authorization header and attaches them to context, but it never rejects requests. It's permissive by design: unauthenticated requests proceed, and per-method decorators enforce what's required. This means public endpoints (recipe listing, search) need zero auth configuration. Only methods that appear in the auth map require a caller.

Two policies cover every case in the app. "authenticate" requires a valid caller — any logged-in user. "authorize_owner:expr" resolves the resource owner via an OwnerResolver interface and checks that the caller is the owner or an admin. The expr is a Go expression evaluated against the method's arguments, so recipe.ID and id both work depending on the method signature.
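The owner-or-admin check at the heart of `authorize_owner` is small enough to sketch. Illustrative types, not the project's exact ones:

```go
package main

import (
	"errors"
	"fmt"
)

// Caller is the authenticated identity attached by the auth middleware.
type Caller struct {
	Subject string
	Roles   []string
}

var ErrForbidden = errors.New("forbidden")

func (c Caller) hasRole(role string) bool {
	for _, r := range c.Roles {
		if r == role {
			return true
		}
	}
	return false
}

// AuthorizeOwner passes when the caller owns the resource or is an admin.
// In the generated decorators, ownerID comes from the OwnerResolver applied
// to the method's resource-ID expression.
func AuthorizeOwner(caller Caller, ownerID string) error {
	if caller.Subject == ownerID || caller.hasRole("admin") {
		return nil
	}
	return ErrForbidden
}

func main() {
	owner := Caller{Subject: "u1"}
	admin := Caller{Subject: "u9", Roles: []string{"admin"}}
	other := Caller{Subject: "u2"}
	fmt.Println(AuthorizeOwner(owner, "u1"), AuthorizeOwner(admin, "u1"), AuthorizeOwner(other, "u1"))
}
```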

The result: auth policy is visible in BUILD.bazel files, reviewable in PRs, and enforced at compile time. A new controller method with no auth entry is public by default — an explicit, auditable decision. A method with the wrong policy fails at the ownership check, not silently. And the business logic in RecipeController never imports the auth package.

The Testing Harness

Five environments, each catching a different class of bug. The feedback loop tightens as you move up the stack.

Environment 1: Unit Tests

49 Go test files, 12.6K lines. Mocks generated by Bazel (go_mock_library), never checked in. EXPECT() with DoAndReturn() turns mocks into contract assertions. The test fails if the caller passes wrong arguments, not just if it returns wrong results:

// handler/recipe_test.go — gomock verifies the handler→controller contract
func (s *RecipeHandlerSuite) TestListRecipes_with_filters() {
    s.mockController.EXPECT().
        ListRecipes(gomock.Any(), gomock.Any()).
        DoAndReturn(func(ctx context.Context, params store.ListRecipesParams) ([]*store.Recipe, error) {
            // Assert filters are passed correctly through the handler
            assert.Equal(s.T(), store.CategoryTypeDinner, params.Filters.CategoryType)
            assert.Equal(s.T(), []string{"italian", "french"}, params.Filters.Cuisines)
            return seedRecipes, nil
        })

    resp, err := s.handler.ListRecipes(ctx, connect.NewRequest(req))
    require.NoError(s.T(), err)
    assert.Len(s.T(), resp.Msg.Recipes, 2)
}

Pure logic (mappers, utilities) uses table-driven tests. Integration tests use embedded Postgres (v16) with //go:build integration tags. Real SQL, real constraints, real query plans:

// store/recipe_integration_test.go — real Postgres, real queries
func (s *RecipeStoreSuite) TestSearch_filters_by_cuisine() {
    // Embedded Postgres with migrations applied — same schema as production
    recipes, err := s.store.Search(ctx, store.SearchParams{
        Cuisines: []string{"italian"},
    })
    require.NoError(s.T(), err)
    for _, r := range recipes {
        assert.Contains(s.T(), r.Cuisines, "italian")
    }
}
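And the table-driven style for pure logic, applied to a hypothetical mapper (not the project's actual code): a slice of cases, one loop, one assertion shape.

```go
package main

import "fmt"

// SlugFromTitle is a toy mapper: lowercase ASCII letters and digits,
// everything else collapsed to single hyphens, no leading/trailing hyphen.
func SlugFromTitle(title string) string {
	out := make([]rune, 0, len(title))
	lastHyphen := true // suppresses a leading hyphen
	for _, r := range title {
		switch {
		case r >= 'a' && r <= 'z', r >= '0' && r <= '9':
			out = append(out, r)
			lastHyphen = false
		case r >= 'A' && r <= 'Z':
			out = append(out, r+('a'-'A'))
			lastHyphen = false
		default:
			if !lastHyphen {
				out = append(out, '-')
				lastHyphen = true
			}
		}
	}
	if lastHyphen && len(out) > 0 {
		out = out[:len(out)-1] // strip trailing hyphen
	}
	return string(out)
}

func main() {
	cases := []struct{ name, in, want string }{
		{"simple", "Sourdough", "sourdough"},
		{"spaces", "Weekend Pizza Project", "weekend-pizza-project"},
		{"punctuation", "Mom's Lasagna!", "mom-s-lasagna"},
	}
	for _, tc := range cases {
		if got := SlugFromTitle(tc.in); got != tc.want {
			panic(fmt.Sprintf("%s: got %q, want %q", tc.name, got, tc.want))
		}
	}
	fmt.Println("all cases pass")
}
```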

Frontend unit tests via Vitest: React Query hooks, utility functions, pure component logic. Fast, no DOM overhead.

Environment 2: Storybook + Accessibility

97 stories, every one passing axe-core WCAG AA. Components are rendered in isolation across all viewport sizes. The enforcement is one line:

// packages/ui/.storybook/preview.ts
const preview: Preview = {
  parameters: {
    a11y: {
      test: "error",  // axe-core violations FAIL the build. Not warnings. Errors.
    },
  },
};

CI builds Storybook, serves it statically, runs test-storybook against all 97 stories. A post-visit hook catches rendering errors that don't throw to console. Missing ARIA labels, color contrast failures, keyboard traps. All caught before code review.

Environment 3: Playwright E2E (Dev Stubs)

62 specs running against the Vite dev server with in-process Connect RPC stubs. No backend, no database. Pure frontend validation. A generic mockRpc helper intercepts Connect HTTP/JSON calls with compile-time type safety:

// e2e/helpers/mock-rpc.ts — generic over proto service descriptor
export async function mockRpc<S extends DescService>(
  page: Page,
  service: S,
  impl: Partial<ServiceImpl<S>>,
): Promise<void>

// TypeScript checks method names and return types against the proto schema
await mockRpc(page, RecipeService, {
  getRecipe: (req) => ({ recipe: seedRecipes[0] }),
  listRecipes: (req) => ({ recipes: seedRecipes, totalCount: seedRecipes.length }),
});

Dev stubs activate only in Vite dev mode (mode === 'development'). They're stateless: pure functions, deterministic responses, no mutation. Per-test state overrides use page.route() interception. This is what let the entire frontend get built and tested before a single line of backend existed.

Environment 4: Tilt + System E2E

Two modes that share the same full-stack cluster. just up boots a complete Kubernetes environment on your machine via Kind (Kubernetes in Docker): 10 services, real data, production builds:

# What `just up` (Tilt) actually runs — 10 services in a local k8s cluster
#
# Data Layer (Bitnami Helm charts):
#   Postgres (API)      — seeded from seeds/local.sql via init container
#   Postgres (River)    — isolated job queue DB (prevents write amplification on app DB)
#   Redis               — session cache, rate limiting
#   Typesense           — search index
#   MinIO               — S3-compatible object store (replaces S3 in local)
#   imgproxy            — on-demand image resize, reads from MinIO
#
# App Layer (custom Docker builds):
#   yeet-api            — Go backend, waits for all data services
#   yeet-image-worker   — River consumer for image + search jobs
#   yeet-frontend       — TanStack Start SSR (production build, not Vite dev)
#
# Networking:
#   ingress-nginx       — routes /api → yeet-api, / → yeet-frontend
#   Port 80 on localhost

Manual verification (local harness): Tilt is the primary test environment for hands-on verification. The frontend is a Docker image with SSR, talking to a real Go API, against a real Postgres seeded with realistic data. Dev stubs don't exist here. If your seed data is missing a column, Tilt catches it. Dev stubs won't. File watchers trigger rebuilds. Save a Go file, Tilt rebuilds the image and hot-deploys the pod. Same Helm charts that deploy to production, different values-local.yaml.

System E2E (remote harness): Playwright also runs against the Tilt cluster for full system tests: real backend, real database, real search indexing. These are distinct from the Environment 3 functional tests that use dev stubs. System E2E catches the class of bugs that only appear when real services interact: SSR hydration with live API data, search index sync latency, image upload flows through MinIO to imgproxy. Service dependencies are explicit in the Tiltfile. API doesn't start until Postgres, Redis, Typesense, and imgproxy are ready. No boot-order race conditions.

Environment 5: Post-Deploy Smoke

In the deploy workflow itself. curl homepage (expect 200), curl API (expect valid JSON). Deploy marked failed if smoke test fails. No separate infrastructure. The deploy pipeline is the test harness.
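For illustration, the gate's rule (200 status, and a JSON-parseable body for the API check) can be captured in a few lines of Go. This is a sketch of the check's semantics, not the actual workflow step, and the URL in main is just the site's homepage:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// smokeOK applies the deploy gate's rule: HTTP 200 and, for API
// responses, a body that parses as JSON.
func smokeOK(status int, body []byte, wantJSON bool) bool {
	if status != http.StatusOK {
		return false
	}
	if wantJSON && !json.Valid(body) {
		return false
	}
	return true
}

// check fetches a URL and applies smokeOK to the response.
func check(url string, wantJSON bool) bool {
	resp, err := http.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return smokeOK(resp.StatusCode, body, wantJSON)
}

func main() {
	// Offline demo of the rule itself; check("https://eat-or-yeet.com/", false)
	// is the networked version.
	fmt.Println(smokeOK(200, []byte(`{"status":"ok"}`), true)) // true
}
```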

Git Hooks (Lefthook)

Enforcement before code leaves your machine:

# lefthook.yml — all path-filtered, pre-commit runs parallel
pre-commit:
  parallel: true
  commands:
    nogo:     { run: "bazel build //backend/...", glob: "backend/**/*.go" }
    gofmt:    { run: "gofmt -l backend/",         glob: "backend/**/*.go" }
    buf-lint: { run: "buf lint proto/",            glob: "proto/**/*.proto" }
    nx-lint:  { run: "npx nx affected -t lint",    glob: "frontend/**" }

pre-push:
  commands:
    bazel-test: { run: "bazel test //..." }
    nx-test:    { run: "npx nx affected -t test" }
    typecheck:  { run: "npx nx affected -t typecheck" }

commit-msg:
  commands:
    conventional:
      run: 'grep -qE "^(feat|fix|chore|docs|refactor|test|ci|build|perf|style)(\(.+\))?!?: .+" {1}'
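The commit-msg pattern is easy to sanity-check outside the hook. Here it is exercised from Go, whose RE2 engine accepts the same syntax (illustrative only — Lefthook itself runs the check through grep against the commit message file):

```go
package main

import (
	"fmt"
	"regexp"
)

// conventional mirrors the commit-msg regex enforced by the Lefthook
// config above: type, optional (scope), optional ! for breaking
// changes, then ": " and a non-empty subject.
var conventional = regexp.MustCompile(`^(feat|fix|chore|docs|refactor|test|ci|build|perf|style)(\(.+\))?!?: .+`)

func isConventional(msg string) bool {
	return conventional.MatchString(msg)
}

func main() {
	fmt.Println(isConventional("feat(api): add recipe ratings")) // true
	fmt.Println(isConventional("added some stuff"))              // false
}
```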

The Command Interface

Every workflow has a single command. just (a modern make alternative) wraps the entire polyglot build system. New engineers run just setup && just init and have a working environment. No READMEs to parse, no tribal knowledge about which tool needs which flags:

# Setup + development
just setup               # Install all prerequisites (mise for Go/Node/pnpm, Homebrew for infra)
just init                # First-time repo setup (hooks + deps + proto + cluster)
just up                  # Start full stack in Kind (10 services, watches + auto-rebuilds)
just down                # Tear down Tilt resources

# Testing (5 environments, one interface)
just test                # Run all tests (backend + frontend)
just test-unit           # Unit tests only
just test-integration    # Integration tests (embedded Postgres)
just test-functional     # Playwright E2E (Vite dev server + stubs)
just test-system         # Playwright E2E (real backend via Tilt/Kind)
just test-storybook      # Storybook smoke + a11y (97 stories)
just test-pipeline       # Pipeline tests (Go against real infra)
just lint                # All linters (backend + proto + frontend)

# Backend / Frontend / Infra (namespaced subcommands)
just be test             # Go tests via Bazel
just be build            # Docker image (ARM64 cross-compile)
just be helm-template    # Render Helm chart
just fe dev              # Vite dev server
just fe storybook        # Storybook dev server
just fe e2e              # Playwright functional tests
just infra preview       # Pulumi preview
just infra up            # Pulumi apply

# Production operations
just pods                # View production pod status
just logs-api            # Tail API logs
just logs-frontend       # Tail frontend logs
just deploy main-abc1234 # Deploy a specific image tag
just ssh                 # SSH to the k3s node
just prod-psql           # Connect to production Postgres

The feedback loop: write code → unit tests catch logic bugs → Storybook catches visual/a11y bugs → Playwright catches integration bugs → Tilt catches production-build bugs → smoke tests catch deploy bugs. Each layer is fast for what it tests, and the union covers the full surface.

CI/CD Pipeline

9 GitHub Actions workflows, all path-filtered with concurrency groups. A proto change doesn't trigger frontend tests. A Helm chart change doesn't rebuild Docker images. OIDC federation means zero stored AWS credentials. Every CI run gets short-lived tokens.

CI runs on every PR: Backend changes trigger Bazel build + test (nogo static analysis at compile time, gofmt, all Go tests). Frontend changes trigger four parallel workflows: nx lint/typecheck/test, Playwright E2E against dev stubs, and Storybook build with a11y enforcement across all 97 stories. Proto changes run buf lint and regenerate TypeScript types. Helm and IaC changes get their own validation: helm template and pulumi preview respectively.

Deploy is a pipeline, not a script. Resolve the target SHA, then build in parallel: Bazel cross-compiles Go on an x86 runner targeting ARM64 (no emulation, no QEMU), producing three backend images pushed to ECR. Nx builds the frontend image separately. A production environment approval gate blocks the deploy. On approval: SSH to the k3s node, helm upgrade --wait for each service (deploy fails if pods don't become ready), then smoke tests. curl the homepage and API. Rollback is a separate manual workflow that ignores deploy freezes.

graph LR
    SHA["Resolve SHA"]:::routing --> BUILD
    subgraph BUILD["Parallel Builds"]
        BE["Backend:
Bazel cross-compile
x86 → ARM64
3 Docker images → ECR"]:::api
        FEb["Frontend:
nx build
Docker → ECR"]:::api
    end
    BUILD --> APPROVE{"Production
Approval"}:::inference
    APPROVE --> DEPLOY["SSH to k3s
helm upgrade ×3"]:::api
    DEPLOY --> SMOKE["Smoke Test:
curl homepage + API"]:::routing
    classDef routing fill:#5C3E14,stroke:#E8A849,color:#fff,stroke-width:1.5px
    classDef api fill:#1B3D3A,stroke:#4ECDC4,color:#fff,stroke-width:1.5px
    classDef inference fill:#4A1C27,stroke:#E85D75,color:#fff,stroke-width:1.5px
    classDef data fill:#2A3054,stroke:#7B93DB,color:#fff,stroke-width:1.5px

The Documentation Strategy

19K lines of docs, in sync with code. Three CLAUDE.md files encode conventions and judgment. 134 change plans capture every decision before code is written. This is the mechanism that makes AI-assisted development work at scale.

CLAUDE.md as Shared Brain

Three files, 500+ lines total. Root CLAUDE.md owns cross-cutting concerns: commit conventions, doc workflow, environment strategy, the verification runbook. backend/CLAUDE.md owns Go architecture. frontend/CLAUDE.md owns React patterns. The AI reads all three before every session. 153 commits later, the conventions held.

What makes it work isn't the rules. It's the judgment encoded in them. A few examples:

"Extract, don't duplicate." Before writing a new hook or component, check if the pattern exists. Two identical occurrences is enough to extract. This kept the frontend from drifting into copy-paste sprawl across 97 Storybook stories and 29K lines of TypeScript.

"Proactively flag improvements." The AI doesn't silently comply with a bad plan. If a test doesn't actually test, if an abstraction leaks, if an error gets swallowed, it raises it. This makes CLAUDE.md self-correcting: the system pushes back against its own entropy.

"Main is evergreen." Warnings, failing tests, peer dependency mismatches. If you encounter them, you own them. "It was already like that" is explicitly not acceptable. This prevents the slow accumulation of tech debt that kills most projects.

Architecture enforcement is structural, not advisory. The backend follows MVCS: Handler → Controller → Store ← Client → Mapper. Three-layer type system: Proto (API boundary) → Entity (domain logic) → Model (database). Each layer owns its conversion boundary. The frontend mirrors this: platform layers L1 (query/config) → L2 (auth) → L3 (layout), with features as isolated verticals that never import from each other. These aren't suggestions. They're load-bearing walls.
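The three-layer boundary can be sketched in a few lines of Go. Type and field names here are hypothetical, not the project's actual schema; the point is that each conversion lives at exactly one boundary:

```go
package main

import "fmt"

// Illustrative three-layer types mirroring Proto → Entity → Model.
type RecipeProto struct{ Id, Title string } // API boundary (generated code style)
type Recipe struct{ ID, Title string }      // domain entity
type RecipeModel struct{ ID, Title string } // database row

// entityFromProto is owned by the handler layer: protos never leak deeper.
func entityFromProto(p RecipeProto) Recipe {
	return Recipe{ID: p.Id, Title: p.Title}
}

// modelFromEntity is owned by the store/mapper layer: models never leak up.
func modelFromEntity(e Recipe) RecipeModel {
	return RecipeModel{ID: e.ID, Title: e.Title}
}

func main() {
	p := RecipeProto{Id: "r1", Title: "Shakshuka"}
	m := modelFromEntity(entityFromProto(p))
	fmt.Println(m.ID, m.Title) // r1 Shakshuka
}
```

A handler that touched a Model directly would fail review on sight; the conversion functions are the only doors between layers.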

The Change Doc Workflow

Every commit: write plan.md before code, review.md after. Both immutable once committed. If the implementation deviates from the plan, the deviation goes in the review, not back into the plan. 134 plans, 117 reviews, 7 plan reassessments. The result is a searchable decision log: you can trace why anything was built the way it was, months later.

docs/
├── PLAN.md                        # Work queue + roadmap
├── changes/                       # 135 directories, immutable
│   ├── 0000001-project-init/
│   │   ├── plan.md                # Written before code
│   │   └── review.md              # Written after implementation
│   ├── 0000002-architecture-update/
│   │   └── ...
│   └── 0000135-system-pipeline-ci/
├── runbooks/
│   └── aws-auth.md
├── spec/
│   ├── index.md
│   ├── recipes/
│   │   ├── overview.md
│   │   └── search.md
│   └── roadmap.md
└── technical/
    ├── index.md
    ├── api.md
    ├── architecture.md
    ├── conventions.md
    ├── data-model.md
    ├── deployment.md
    ├── design-system.md
    ├── infrastructure-options.md
    ├── local-development.md
    └── observability.md

A Real Example: Change 0000025

Here's what the workflow actually looks like. Change 0000025 creates the Helm chart for the API service. The plan captures decisions before writing any YAML:

## Decisions

### No values-staging.yaml
The environment strategy explicitly states "No permanent staging environment."
All pre-prod validation happens in ephemeral environments.
A staging values file would contradict this.

### ConfigMap + Secret split
Non-sensitive config (ENVIRONMENT, PORT, REGION) goes in a ConfigMap.
Credentials (DATABASE_URL, REDIS_URL, Auth0) are referenced via secretEnv
from an external K8s Secret. Local dev uses ConfigMap only since Kind has
no real secrets.

### Health probes use Connect protocol
Liveness and readiness probes hit POST /yeet.v1.HealthService/Check.
This matches the actual health handler implementation.

Then the review captures what actually happened. Five deviations surfaced during implementation:

## Deviations from Plan

| Planned | Implemented | Rationale |
|---------|------------|-----------|
| httpGet probes on health endpoint | tcpSocket probes | Connect handlers only accept POST. K8s probes only support GET (returns 405). Scratch image has no shell for exec probes. |
| Simple /api path with Prefix pathType | Regex path /api(/|$)(.*) with rewrite-target | Backend registers handlers at /yeet.v1.HealthService/Check, not /api/... Without rewrite, requests through ingress 404. |

## Learnings
- Connect's POST-only constraint makes K8s health probes non-trivial.
  The standard fix is a dedicated /healthz GET endpoint on the backend.
- The secretEnv pattern (env var name → secret key mapping + a secretName
  reference) keeps the chart simple while supporting both local (no secrets)
  and production (external Secret) patterns cleanly.

The plan was wrong about health probes and ingress routing. That's the point. The plan forced the decisions to be explicit, and the review captured what reality looked like. Six months from now, anyone reading this change knows why the chart uses tcpSocket probes instead of httpGet, and why the ingress has a regex rewrite. That reasoning would otherwise live only in someone's head.
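The /healthz fix named in the learnings is a few lines of Go: a plain GET handler mounted on the same mux as the POST-only Connect services. This is a sketch under that learning, not the project's actual handler:

```go
package main

import (
	"fmt"
	"net/http"
)

// healthzStatus isolates the probe logic: kubelet httpGet probes send
// GET; anything else is rejected, matching Connect's 405 behavior.
func healthzStatus(method string) int {
	if method != http.MethodGet {
		return http.StatusMethodNotAllowed
	}
	return http.StatusOK
}

// healthz is the probe endpoint itself.
func healthz(w http.ResponseWriter, r *http.Request) {
	status := healthzStatus(r.Method)
	w.WriteHeader(status)
	if status == http.StatusOK {
		fmt.Fprintln(w, "ok")
	}
}

func main() {
	// In the real server: mux.HandleFunc("/healthz", healthz) next to the
	// Connect service handlers, then http.ListenAndServe(":8080", mux).
	fmt.Println(healthzStatus(http.MethodGet), healthzStatus(http.MethodPost)) // 200 405
}
```

With this in place, the Helm chart's probes can go back to httpGet against /healthz instead of the tcpSocket workaround.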

Plan Reassessments

The project plan was rewritten 7 times. EKS → k3s, multi-account → single account, ArgoCD → SSH + helm upgrade, 3 NAT Gateways → 0. Each reassessment captures what changed and why. The plan is a living document, not scripture. When the plan is wrong, the system is designed to catch it: the deviation goes in review.md, the roadmap gets a new version, and the next commit starts from the corrected reality.

Evergreen docs (technical/, spec/, CLAUDE.md) are updated in the same commit as the code they describe. Stale docs are bugs. After every review.md, a doc sweep checks all evergreen docs for staleness. 19K lines of documentation, all in sync.

Retrospective

CLAUDE.md was the single most impactful artifact in the project, more than any feature, any test, any infrastructure decision. Conventions established early and enforced consistently pay back 10x. The plan/review discipline forces thinking before doing. Writing plan.md before code catches bad ideas before they become bad code. Writing review.md after captures what the plan got wrong. Over 134 iterations, that creates a compounding knowledge base.

AI makes complex tooling trivial. Bazel has a steep learning curve. BUILD files, Gazelle, gopackagesdriver for IDE integration. But AI handles the boilerplate, the custom Starlark rules, and the initial configuration. Same with Connect RPC: the proto definitions, code generation pipeline, and dev stub wiring are exactly the kind of structured, convention-heavy work that AI executes flawlessly. Tools that were previously "too complex to justify for a small project" are now free.

Start cheap, migrate later. k3s → EKS is a values.yaml change, not a rewrite. Storybook a11y caught real bugs: missing ARIA labels, color contrast failures, keyboard traps that "looked fine" visually. Dev stubs are the key to frontend velocity: fast, deterministic, typed against the proto schema. And IMDS everywhere means zero static AWS keys in the entire system.

What's Next

This is a proof of concept. The goal was to see how fast one engineer with AI could stand up production-grade infrastructure, not to launch a product at scale. There's no HPA, no multi-node scheduling, no managed databases yet. But every design decision was made so that turning this into a real production deployment takes less time than the initial build. The same Helm charts, the same Pulumi patterns, the same CI/CD pipeline. The code already exists in git history. Here's what turning the dial looks like.

EKS: Multi-Environment, Multi-Region

The EKS cluster was built, deployed, and torn down during the first week. 64 resources provisioned, validated, then destroyed when the cost analysis showed $468/month for zero traffic. The Pulumi code is preserved in git. pulumi up brings it back.

The key design: one stack program, many configurations. Every resource name threads {environment}-{region} through it. Spin up production-us-east-1 and staging-eu-west-1 from the same code. The config file is the only difference:

// infra/pulumi/lib/config.ts — one interface drives the entire stack
export interface EnvironmentConfig {
  environment: string;          // "production", "staging", "leased-xyz"
  region: string;               // "us-east-1", "eu-west-1"
  vpcCidr: string;              // "10.0.0.0/16"
  azCount: number;              // 3 for prod, 2 for staging
  natHighAvailability: boolean; // NAT per AZ (prod) vs single NAT (staging)
  eksNodeInstanceType: string;  // "t4g.medium" (prod) vs "t4g.small" (staging)
  eksNodeDesiredSize: number;   // 3 (prod) vs 1 (staging)
  rdsInstanceClass: string;     // "db.t4g.medium" (prod) vs "db.t4g.micro" (staging)
  redisNodeType: string;        // "cache.t4g.small" (prod) vs "cache.t4g.micro" (staging)
  // ...
}

// infra/pulumi/lib/tags.ts — every AWS resource gets these
export function standardTags(environment: string, region: string) {
  return { Project: "yeet", Environment: environment, Region: region, ManagedBy: "pulumi" };
}
// infra/pulumi/stacks/environment/index.ts — the orchestrator
const config = loadEnvironmentConfig();
const prefix = `yeet-${config.environment}-${config.region}`;

const vpc = new Vpc(prefix, { cidr: config.vpcCidr, azCount: config.azCount, ... });
const sgs = new SecurityGroups(prefix, { vpcId: vpc.vpcId, ... });

const eks = new EksCluster(prefix, {
  vpcId: vpc.vpcId,
  privateSubnetIds: vpc.privateSubnetIds,
  nodeInstanceType: config.eksNodeInstanceType,
  nodeDesiredSize: config.eksNodeDesiredSize,
  ...
});

const apiDb  = new RdsInstance(`${prefix}-api`,   { identifier: "api",   ... });
const riverDb = new RdsInstance(`${prefix}-river`, { identifier: "river", ... });
const redis  = new ElasticacheCluster(`${prefix}-redis`, { ... });
const argocd = new ArgoCd(`${prefix}-argocd`, { kubeProvider: eks.kubeProvider, ... });

Helm values use the same layering: base chart, cluster overlay, region overlay. Three -f flags compose the full config:

# Helm values layering: base → cluster → region
#   values.yaml                      — base: ports, probes, image config
#   values-production.yaml           — cluster: replicas=2, resources, secrets
#   values-production-us-east-1.yaml — region: ECR URL, domain, RDS endpoint
helm upgrade yeet-api deploy/helm/yeet-api/ \
  -f values.yaml \
  -f values-production.yaml \
  -f values-production-us-east-1.yaml

# values-production.yaml — cluster-level overrides
cluster: production
replicaCount: 2
resources:
  requests: { cpu: 100m, memory: 128Mi }
  limits:   { cpu: 500m, memory: 256Mi }
env:
  ENVIRONMENT: production

# values-production-us-east-1.yaml — region-level overrides
region: us-east-1
# ECR URL, RDS endpoint, domain — everything that differs per region

Adding a region: create Pulumi.staging-eu-west-1.yaml with the config values, create values-staging-eu-west-1.yaml with the Helm overrides, run pulumi up. The stack program, the Helm charts, the CI/CD pipeline, and the container images are all unchanged. That's the migration path from a $14/month single node to a multi-region production deployment. Config files, not rewrites.

Managed Data: RDS + ElastiCache

Postgres and Redis currently run as containers on the k3s node. The managed versions were built and tested during the EKS phase: RDS with automated password rotation via Secrets Manager, multi-AZ failover, encrypted at rest, and Point-in-Time Recovery:

// infra/pulumi/lib/rds-instance.ts
const instance = new aws.rds.Instance(`${name}-instance`, {
  engine: "postgres",
  engineVersion: args.engineVersion,
  instanceClass: args.instanceClass,       // db.t4g.micro → db.t4g.medium as traffic grows
  manageMasterUserPassword: true,          // Auto-rotation in Secrets Manager
  storageEncrypted: true,
  multiAz: isProd,                         // Flip when traffic justifies it
  backupRetentionPeriod: isProd ? 7 : 1,
  deletionProtection: isProd,
  performanceInsightsEnabled: isProd,
});

// infra/pulumi/lib/elasticache-cluster.ts
const redis = new aws.elasticache.ReplicationGroup(`${name}-redis`, {
  engine: "redis",
  engineVersion: args.engineVersion,
  nodeType: args.nodeType,                 // cache.t4g.micro → cache.t4g.small
  numCacheClusters: args.numCacheNodes,    // 1 → 2+ for automatic failover
  atRestEncryptionEnabled: true,
  transitEncryptionEnabled: true,
  automaticFailoverEnabled: isProd && args.numCacheNodes > 1,
  snapshotRetentionLimit: isProd ? 7 : 0,
});

Two separate RDS instances: one for the application, one for River (job queue). The dedicated River database isolates its write amplification from app query performance. That pattern carries over from the current containerized setup.

ArgoCD + Canary Deploys

The current deploy is SSH + helm upgrade. The GitOps version was also built and tested: ArgoCD for continuous sync, Argo Rollouts for canary deployments, and Argo Image Updater for automatic image promotion. The canary analysis templates use Datadog metrics to gate rollouts. Error rate and P95 latency must stay within thresholds or the rollout auto-reverts:

# deploy/argocd/analysis/datadog-error-rate.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: datadog-error-rate
spec:
  metrics:
    - name: error-rate
      interval: 60s
      count: 5
      successCondition: result[0] < 0.01    # Must stay below 1%
      failureLimit: 2                        # Two failures → auto-rollback
      provider:
        datadog:
          query: |
            sum:trace.http.request.errors{service:{{args.service-name}},rollout-type:canary}.as_rate()
            / sum:trace.http.request.hits{service:{{args.service-name}},rollout-type:canary}.as_rate()

# deploy/argocd/analysis/datadog-latency-p95.yaml
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: datadog-latency-p95
spec:
  metrics:
    - name: latency-p95
      interval: 60s
      count: 5
      successCondition: result[0] < 500     # P95 must stay under 500ms
      failureLimit: 2
      provider:
        datadog:
          query: |
            p95:trace.http.request.duration{service:{{args.service-name}},rollout-type:canary}
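Read literally, the two templates implement a simple gate. A minimal Go sketch of that semantics, as the inline comments above describe it (an illustration of the thresholds, not Argo Rollouts' actual evaluation code):

```go
package main

import "fmt"

// canaryVerdict evaluates a series of measurements against a success
// threshold. Per the templates' comments, reaching failureLimit failed
// measurements triggers an auto-rollback; otherwise the canary promotes.
func canaryVerdict(errorRates []float64, threshold float64, failureLimit int) string {
	failures := 0
	for _, r := range errorRates {
		if r >= threshold { // successCondition is result[0] < threshold
			failures++
			if failures >= failureLimit {
				return "rollback"
			}
		}
	}
	return "promote"
}

func main() {
	// Five healthy measurements at 60s intervals → promote.
	fmt.Println(canaryVerdict([]float64{0.002, 0.004, 0.003, 0.002, 0.001}, 0.01, 2))
	// Error rate spikes past 1% twice → rollback.
	fmt.Println(canaryVerdict([]float64{0.002, 0.03, 0.04, 0.002, 0.001}, 0.01, 2))
}
```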

ArgoCD watches the git repo, auto-syncs deployments, and self-heals drift. Image Updater watches ECR for new tags matching main-[a-f0-9]{7} and writes the new tag back to git, triggering ArgoCD sync automatically. The entire GitOps pipeline: push code → CI builds image → Image Updater detects it → ArgoCD canaries it → Datadog gates it. Zero manual steps after merge.
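The tag filter is worth pinning down precisely. Here it is exercised from Go — a sketch, with anchors added for illustration since the filter as quoted is unanchored:

```go
package main

import (
	"fmt"
	"regexp"
)

// imageTag mirrors the Image Updater filter quoted above: a "main-"
// prefix plus a 7-character short commit SHA. ^ and $ are added here so
// partial matches (e.g. "main-abc1234-dirty") are rejected.
var imageTag = regexp.MustCompile(`^main-[a-f0-9]{7}$`)

func isDeployableTag(tag string) bool {
	return imageTag.MatchString(tag)
}

func main() {
	fmt.Println(isDeployableTag("main-abc1234")) // true
	fmt.Println(isDeployableTag("pr-42"))        // false
}
```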

// infra/pulumi/lib/argocd.ts — the full GitOps stack
export class ArgoCd extends pulumi.ComponentResource {
  constructor(name: string, args: ArgoCdArgs, opts?: pulumi.ComponentResourceOptions) {
    super("yeet:gitops:ArgoCd", name, {}, opts);

    // Argo CD — continuous deployment from git
    new k8s.helm.v3.Release(`${name}-argocd`, {
      chart: "argo-cd", version: "7.7.16",
      repositoryOpts: { repo: "https://argoproj.github.io/argo-helm" },
      namespace: "argocd",
      values: args.argocdValues,
    });

    // Argo Rollouts — canary with automated analysis
    new k8s.helm.v3.Release(`${name}-rollouts`, {
      chart: "argo-rollouts", version: "2.38.1",
      repositoryOpts: { repo: "https://argoproj.github.io/argo-helm" },
      namespace: "argo-rollouts",
      values: args.rolloutsValues,
    });

    // Image Updater — auto-promotes new ECR images
    new k8s.helm.v3.Release(`${name}-image-updater`, {
      chart: "argocd-image-updater", version: "0.12.0",
      repositoryOpts: { repo: "https://argoproj.github.io/argo-helm" },
      namespace: "argocd",
      values: args.imageUpdaterValues,
    });
  }
}

All of this was built, tested against a real EKS cluster, and preserved in git before the pivot to k3s. The infrastructure scales from $14/month to $400+/month without rewriting application code. Only Pulumi stack configs and Helm values change. Add HPA definitions to the Helm charts, flip multiAz: true on RDS, bump the node count. The design decisions were made up front so that scaling the system is a configuration exercise, not a rewrite. That's the entire point.

The Punchline

4+ months of traditional output, one week, one person.

AI didn't replace judgment. It removed the mechanical ceiling. Architecture, trade-offs, verification: still human. Volume, consistency, documentation: that's where AI changes the equation.

This isn't the future. It's now. One engineer with AI doing what used to require a team and a quarter.