Question

How should 361’s 3–5 dev team actually write and review code on oxflow using Claude Code alongside other AI coding tools? Specifically: (a) which PR-review agent plugs into the GitHub + Render + Neon stack; (b) should we use Claude and Codex together, and if so how; (c) does BMAD give us something worth the ceremony at 3–5 devs; (d) is the Superpowers plugin in or out?

TL;DR

Run CodeRabbit on every PR as the default reviewer (free tier covers 3–5 devs on private repos) and wire anthropics/claude-code-action@v1 for @claude-mentioned deep reviews on PRs touching migrations or domain rules. Claude Code is the daily driver for writing; Codex is optional as a second-opinion reviewer, not on the critical path. Adopt BMAD partially — use the Analyst + PM + Architect trio as a one-week upfront planning ritual, skip the full SM/Dev/QA orchestration at this team size. Install obra/superpowers on every dev’s Claude Code for TDD and planning discipline.

Reject: Greptile (more noise than we can tolerate), Graphite Agent (stacking is the valuable part, not the AI), Sourcery (Python-only), Cursor BugBot (we’re not on Cursor).

Approach

Delegated a landscape survey to a research agent with web access, briefed with the locked stack (DevOps, DB) and the specific questions above. Cross-checked community sentiment against published benchmarks, official vendor docs, and Anthropic’s 2026 Agentic Coding Trends Report.

Findings

BMAD method

BMAD — “Breakthrough Method for Agile AI-Driven Development” — lives at github.com/bmad-code-org/BMAD-METHOD (renamed from bmadcode/BMAD-METHOD; the old URL redirects). Authored by Brian Madison. Active since April 2025, now at v6.3.0 (April 10 2026) with ~1,400 stars. Free, open-source, framework-agnostic. Installs into VS Code / Cursor / Claude Code via npx bmad-method install.

Splits the SDLC into two phases across six personas:

  • Planning phase (web UI / long context). Analyst → project brief. PM → PRD. Architect → system design.
  • Execution phase (IDE). Scrum Master shards the PRD into hyper-detailed “story files” carrying full architectural context. Dev implements one story at a time. QA reviews.

Agents are YAML-backed markdown personas (.agent.yaml compiled to .md) with id, name, persona.role, communication_style, principles, commands, dependencies. Scale-adaptive — you can skip phases for bug-fix-scope vs enterprise-scope.
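To make the persona schema concrete, here is a hypothetical .agent.yaml in the shape described above; all field values are invented for illustration, not copied from the BMAD repo.

```yaml
# Hypothetical BMAD persona sketch. Field names follow the schema above
# (id, name, persona.role, communication_style, principles, commands,
# dependencies); every value is invented for illustration.
agent:
  id: architect
  name: Arch
  persona:
    role: System architect who turns the PRD into a buildable design
    communication_style: Terse, diagram-first
  principles:
    - Prefer boring technology
    - Every decision traceable to a PRD requirement
  commands:
    - create-architecture
    - review-story
  dependencies:
    - prd.md
```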

Expansion packs extend BMAD beyond web dev: cloud infra, data architecture, API integration, platform engineering, governance, plus community packs for game dev, DevOps, creative writing. v6.3 introduced a marketplace and merged four agents into one (“Amelia”).

BMAD pairs naturally with Claude Code because story files are designed to drop into an agent’s context window whole. See Benny Cheung — Applied BMAD and BMAD + Claude Code setup guide.

Claude Code vs Codex — numbers

Benchmarks as of Q1 2026 (per NxCode and MorphLLM):

| Metric | Claude Code (Opus 4.7 / Sonnet 4.5) | Codex CLI (GPT-5.3-Codex) |
| --- | --- | --- |
| SWE-bench Verified | 80.9% | ~74% |
| Terminal-Bench 2.0 | 65.4% | 77.3% |
| Token cost (same task) | ~$155 (API) | ~$15 |
| First-pass correctness | ~95% | ~88% |

Where each is strong:

  • Claude Code. Deep reasoning on existing codebases, multi-file refactors, architectural thinking, Agent Teams / subagents, MCP integrations (3,000+), hooks, SDK extensibility. Collaborative partner who checks in.
  • Codex CLI. Token-efficient, aggressive full-auto mode with kernel sandboxing, terminal / DevOps / scripts, tight GitHub integration, cheaper.

“Claude writes, Codex reviews” — is this real? Partially. OpenAI shipped an official Codex plugin for Claude Code for exactly this cross-provider pattern; the “remove same-model sycophancy” rationale is documented. MindStudio’s writeup and Homborsund AI’s “Agentic Coding Gets Real” both describe real teams doing it. It is not folklore — it is a published pattern — but it is not a universal best practice. Most teams in Anthropic’s 2026 report use a single agent. The split makes sense when you already pay for both and want a second opinion on high-stakes PRs.

PR review agents — the 2026 landscape

| Tool | Team pricing | Bug-catch | Strengths | Watch-outs |
| --- | --- | --- | --- | --- |
| CodeRabbit | Pro $24/dev/mo; free on private repos w/ 200 files/hr cap; free forever on public | ~44–46% | Lowest false-positive rate; GitHub / GitLab / Bitbucket / Azure DevOps; learns team style | Less deep than Greptile |
| Greptile | ~$30/dev/mo, no free tier | 82% (highest) | Best bug catch; repo-wide context; diagrams | 11 false positives/run vs CR’s 2; GitHub + GitLab only |
| Qodo Merge | Free dev tier (30 PRs/mo); Pro $19/user; Teams $30/user | 60% F1 | Generates unit tests when gaps found; open-source core (PR-Agent) is self-hostable | Less polished UI |
| Sourcery | ~$12/dev/mo | n/a | Deep Python (AST-level) | Python-only — useless for NestJS/React |
| GitHub Copilot Code Review | Bundled in Copilot $10/user | Moderate | Native PR integration, zero setup | 4,000-char custom-instruction cap; cannot block merges; no CI gating; no learning |
| Cursor BugBot | ~$40/user/mo | n/a | Tuned for AI-generated code | Cursor-adjacent; overkill if not on Cursor |
| Graphite Agent (ex-Diamond) | $40/user/mo Team | n/a | Unified Chat + review + stacking + merge queue; Shopify +33% PRs/dev, Asana +21% code/week | Expensive; stacking ergonomics are the real draw |
| Bito | ~$15/user/mo | n/a | 50+ langs, 25 NL, effort estimates | Less differentiated |
| Anthropic claude-code-action | API cost ~$0.20–$2/PR | High (full Claude Code runtime) | Responds to @claude in PRs, runs in Actions runner, can open sub-PRs / fix code | Per-token; spend cap needed |
| OpenAI codex-action | API cost per run | High | codex exec in CI, auto-review, autofix CI failures | Per-token |

Sentiment snapshot from CodeRabbit vs Qodo and Greptile vs CodeRabbit: CodeRabbit is the community default for small teams. Greptile is loved where false-positive tolerance is low. Qodo’s OSS PR-Agent is the hacker favourite. Copilot Code Review is “free-ish” but considered weak head-to-head.

Claude Code in CI

anthropics/claude-code-action@v1 (repo, docs) is the official path. Runs the full Claude Code runtime inside a GitHub Actions runner with repo files, git history, and diff context. Install the Claude GitHub App, store ANTHROPIC_API_KEY as a secret, drop a workflow:

```yaml
name: Claude PR Review
on:
  pull_request:
    types: [opened, synchronize, reopened]
  issue_comment:
    types: [created]
concurrency:
  group: claude-${{ github.ref }}
  cancel-in-progress: true
jobs:
  review:
    runs-on: ubuntu-latest
    permissions: { contents: read, pull-requests: write, issues: write }
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            Review this PR for correctness, security, performance, and NestJS/React conventions.
            Flag any Neon-specific migration concerns. Only comment on real issues.
```

Companion repos: anthropics/claude-code-security-review (security-specialised), anthropics/claude-code-base-action (headless primitive). OpenAI’s equivalent openai/codex-action@v1 works similarly and can auto-review every PR when “Automatic reviews” is flipped in Codex settings. The Codex autofix cookbook has a notable example of Codex auto-fixing failing CI.
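A minimal sketch of the codex-exec-in-CI pattern mentioned above. The job shape, the `@openai/codex` install step, and the `OPENAI_API_KEY` secret name are assumptions for illustration; the cookbook is the reference for the supported setup.

```yaml
# Sketch only: run Codex non-interactively when the test step fails.
# Assumes the Codex CLI npm package and an OPENAI_API_KEY repo secret.
name: Codex autofix sketch
on: workflow_dispatch
jobs:
  autofix:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Test, then ask Codex to diagnose any failure
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npm test || {
            npm install -g @openai/codex
            codex exec "The test suite failed. Diagnose the failure and propose a minimal patch."
          }
```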

Who writes, who reviews, who tests — emerging patterns

Synthesised from Anthropic’s 2026 Agentic Coding Trends Report and Martin Fowler’s “Humans and Agents”:

  • ~60% of work is AI-assisted, but only 0–20% is fully delegated. Humans stay in the “why loop” (goals, constraints, product intent); agents run the “how loop” (typing, scaffolding, refactors, test generation).
  • Code review has become the primary engineering surface. For many engineers it’s the first look at code they didn’t write. Spec-driven development is displacing line-by-line review.
  • Tiered human-in-the-loop is emerging as best practice: low-risk changes (confidence > 0.8) auto-merge; medium-risk (0.6–0.8) get human review with a 4-hour SLA; auth, schema, and dependency changes are always flagged.
  • Agent-reviews-agent is now mainstream. An AI reviewer reads AI-generated code before a human sees it, filtering noise. This is the pattern CodeRabbit / Graphite Agent / claude-code-action all slot into.

Emerging shorthand: “human on the loop, not in the loop” — supervising the working loop rather than approving every diff.
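The tier thresholds above can be sketched as a routing function. This is an illustrative policy, not any vendor's API; the `touchesSensitivePaths` helper and the sensitive-path patterns are assumptions.

```typescript
// Illustrative sketch of the tiered human-in-the-loop policy described above.
// Thresholds come from the text; path patterns and helper are assumptions.

type ReviewTier = "auto-merge" | "human-review-4h" | "always-flag";

interface PrSignal {
  confidence: number;     // reviewer agent's confidence in the change, 0..1
  changedPaths: string[]; // files touched by the PR
}

// Paths that always need a human look: auth, schema, dependencies.
const SENSITIVE = [/auth/, /migration/, /schema/, /package(-lock)?\.json$/];

function touchesSensitivePaths(paths: string[]): boolean {
  return paths.some((p) => SENSITIVE.some((re) => re.test(p)));
}

function routePr(signal: PrSignal): ReviewTier {
  if (touchesSensitivePaths(signal.changedPaths)) return "always-flag";
  if (signal.confidence > 0.8) return "auto-merge";
  if (signal.confidence >= 0.6) return "human-review-4h";
  return "always-flag";
}
```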

Real examples: Vercel publicly runs Vercel Agent on their own PRs; GitHub uses Copilot Code Review on theirs; Shopify (+33% PRs/dev after Graphite) and Asana (+21% code/week) are in Graphite’s case studies; Rakuten / CRED / TELUS / Zapier appear in Anthropic’s report.

Superpowers plugin

github.com/obra/superpowers by Jesse Vincent (@obra, formerly Anthropic’s early Claude team, now at Prime Radiant). Community plugin (not official Anthropic), accepted into the official Claude plugin marketplace in January 2026. ~121K stars (April 2026) — one of the fastest-growing repos of the year.

Install in Claude Code:

```
/plugin marketplace add obra/superpowers-marketplace
/plugin install superpowers@superpowers-marketplace
```

Ships 14 composable skills (markdown workflow instructions Claude reads at runtime) enforcing a 5-phase discipline: clarify → design → plan → code → verify. Skills include true red/green TDD, subagent-based code review, debugging protocols, brainstorming, skill authoring, YAGNI / DRY enforcement. Companion repos: superpowers-skills, superpowers-lab, superpowers-chrome, superpowers-developing-for-claude-code.

Relation to BMAD. Both are methodology-as-code over Claude Code. BMAD is heavier (six agents, full PRD → story pipeline, good for greenfield multi-week projects). Superpowers is lighter (per-task discipline, good for day-to-day). They compose — BMAD for the planning sprint, Superpowers inside each dev’s Claude Code for individual stories. See Benny Cheung — Harmonizing BMAD + Superpowers.

Recommendation for 361

1. PR review agent → CodeRabbit (free) + claude-code-action for flagged PRs.

CodeRabbit is free on private repos within the 200 files/hr cap — 3–5 devs won’t hit that. Wire anthropics/claude-code-action@v1 as a second, deeper reviewer triggered by @claude mentions, and always on PRs that touch apps/api/src/migrations/** or Neon schema files. Cheap default + surgical depth. Budget ~$50/month in Anthropic API spend for claude-code-action, cap with ANTHROPIC_MAX_SPEND_USD.
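One way to make the deep review unconditional on migration PRs is a `paths` filter on the workflow trigger (standard GitHub Actions syntax; the second glob is an assumption about where the Neon schema lives):

```yaml
# Sketch: always run the Claude deep review when migration/schema files
# change, in addition to the @claude-mention trigger.
on:
  pull_request:
    types: [opened, synchronize]
    paths:
      - "apps/api/src/migrations/**"
      - "**/schema.prisma"   # assumption; point at the actual Neon schema files
  issue_comment:
    types: [created]
```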

Runner-up if we outgrow CodeRabbit: Qodo Merge Pro at $19/user — its generate-tests-where-coverage-is-missing feature is genuinely differentiated.

2. Claude vs Codex → Claude-primary, Codex optional second-opinion reviewer.

At 3–5 devs with Phase 1 hard-locked to June 27, paying for two agent ecosystems is friction we don’t need. Use Claude Code as the daily driver. If one dev wants to A/B the Codex-reviews-Claude pattern via the Codex plugin for Claude Code, let them — it’s a zero-cost experiment and the anti-sycophancy argument is real. Revisit post-Phase-1.

3. BMAD → partial. Planning trio only.

One-week upfront: run Analyst → PM → Architect against the current oxFlow concept map, business rules, and Workbench migration scope. Output: a PRD and architecture doc in web-UI Claude. That’s the ritual worth doing for a complex Benchmark replacement. Skip SM / Dev / QA orchestration — at 3–5 devs, story-file ceremony outweighs the benefit, and we already have GitHub Flow + staging gate for execution.

4. Superpowers → in, every dev.

TDD + planning discipline are free wins. Plugin is trivial to uninstall if it gets in the way. Addresses the single biggest failure mode of Claude Code at scale — agents skipping planning and emitting plausible-but-wrong code.

5. Don’t add: Greptile (good but expensive + overlaps CodeRabbit), Graphite Agent (the stacking is the value, not the AI; separate decision), Sourcery (Python-only), Cursor BugBot (not on Cursor).

6. Wire the setup in one week.

  • Day 1. Install Claude GitHub App; add ANTHROPIC_API_KEY secret; create .github/workflows/claude-review.yml with the workflow above.
  • Day 2. Connect CodeRabbit; disable its auto-approvals; configure .coderabbit.yaml with the oxFlow business-rule conventions.
  • Day 3. /plugin install superpowers@superpowers-marketplace in each dev’s Claude Code; commit .claude/commands/ + .claude/agents/ presets.
  • Week 1 (parallel track). BMAD Analyst → PM → Architect sprint on the current specs → produces PRD.md, architecture.md, committed to oxflow-docs/planning/.
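A sketch of what that .coderabbit.yaml could look like. The `reviews.path_instructions` key follows CodeRabbit's published config schema as we understand it; the globs and instruction text are illustrative, not the actual oxFlow conventions.

```yaml
# Sketch of .coderabbit.yaml for oxFlow; key names per CodeRabbit's config
# schema as we understand it, instruction text invented for illustration.
reviews:
  auto_review:
    enabled: true
  path_instructions:
    - path: "apps/api/src/migrations/**"
      instructions: >
        Treat any destructive change (DROP, column type change) as blocking.
        Confirm a Neon branch was used to rehearse the migration.
    - path: "apps/web/**"
      instructions: >
        Enforce the team's React conventions. Flag business rules implemented
        in components instead of the API layer.
```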

Alternatives considered

  • Greptile as primary reviewer. Higher bug catch (82% vs CodeRabbit’s 46%), but 11 false positives/run is painful for a team small enough to read every comment. Revisit if PR noise stops being the limiting factor.
  • Full BMAD adoption. Story-file ceremony is more overhead than value at 3–5 devs. Revisit if the team doubles.
  • Codex as daily driver. Cheaper per task and genuinely good at DevOps / scripts, but the loss of Claude Code’s Agent Teams + MCP + Superpowers ecosystem outweighs the savings.
  • No review agent — human review only. Viable at this size, but leaves review quality variance fully dependent on which human picks up the PR. A consistent first-pass reviewer removes that variance for zero ongoing effort.
  • Graphite for stacked PRs + Graphite Agent as reviewer. Graphite stacking is a real productivity lever, but the AI reviewer is a separate decision; evaluate the stacking ergonomics on its own merits.

Linked from

  • Daily log 2026-04-17 — originating tasks (Claude Managed Agents + Multica, Superpowers, Git research).