Question
How should 361’s 3–5 dev team actually write and review code on oxflow using Claude Code alongside other AI coding tools? Specifically: (a) which PR-review agent plugs into the GitHub + Render + Neon stack; (b) should we use Claude and Codex together, and if so how; (c) does BMAD give us something worth the ceremony at 3–5 devs; (d) is the Superpowers plugin in or out?
TL;DR
Run CodeRabbit on every PR as the default reviewer (free tier covers 3–5 devs on private repos) and wire anthropics/claude-code-action@v1 for @claude-mentioned deep reviews on PRs touching migrations or domain rules. Claude Code is the daily driver for writing; Codex is optional as a second-opinion reviewer, not on the critical path. Adopt BMAD partially — use the Analyst + PM + Architect trio as a one-week upfront planning ritual, skip the full SM/Dev/QA orchestration at this team size. Install obra/superpowers on every dev’s Claude Code for TDD and planning discipline.
Reject: Greptile (more noise than we can tolerate), Graphite Agent (stacking is the valuable part, not the AI), Sourcery (Python-only), Cursor BugBot (we’re not on Cursor).
Approach
Delegated a landscape survey to a research agent with web access, briefed with the locked stack (DevOps, DB) and the specific questions above. Cross-checked community sentiment against published benchmarks, official vendor docs, and Anthropic’s 2026 Agentic Coding Trends Report.
Findings
BMAD method
BMAD — “Breakthrough Method for Agile AI-Driven Development” — lives at github.com/bmad-code-org/BMAD-METHOD (renamed from bmadcode/BMAD-METHOD; old URL redirects). Authored by Brian Madison. Active since April 2025, now on v6.3.0 (April 10 2026), ~1,400+ stars. Free, open-source, framework-agnostic. Installs into VS Code / Cursor / Claude Code via npx bmad-method install.
Splits the SDLC into two phases across six personas:
- Planning phase (web UI / long context). Analyst → project brief. PM → PRD. Architect → system design.
- Execution phase (IDE). Scrum Master shards the PRD into hyper-detailed “story files” carrying full architectural context. Dev implements one story at a time. QA reviews.
Agents are YAML-backed markdown personas (.agent.yaml compiled to .md) with id, name, persona.role, communication_style, principles, commands, dependencies. Scale-adaptive — you can skip phases for bug-fix-scope vs enterprise-scope.
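A minimal sketch of what one of these persona files can look like, using the field names listed above. The values (persona name, commands, dependencies) are illustrative, not copied from the BMAD repo:

```yaml
# Hypothetical BMAD-style agent definition — field values are illustrative
id: architect
name: Winston
persona:
  role: System architect who turns the PRD into a concrete technical design
communication_style: Precise, diagram-first, asks before assuming
principles:
  - Prefer boring technology over novel technology
  - Record the trade-off behind every decision
commands:
  - create-architecture   # emit architecture.md from the PRD
dependencies:
  - pm                    # consumes the PM persona's PRD output
```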
Expansion packs extend BMAD beyond web dev: cloud infra, data architecture, API integration, platform engineering, governance, plus community packs for game dev, DevOps, creative writing. v6.3 introduced a marketplace and merged four agents into one (“Amelia”).
BMAD pairs naturally with Claude Code because story files are designed to drop into an agent’s context window whole. See Benny Cheung — Applied BMAD and BMAD + Claude Code setup guide.
Claude Code vs Codex — numbers
Benchmarks as of Q1 2026 (per NxCode and MorphLLM):
| Metric | Claude Code (Opus 4.7 / Sonnet 4.5) | Codex CLI (GPT-5.3-Codex) |
|---|---|---|
| SWE-bench Verified | 80.9% | ~74% |
| Terminal-Bench 2.0 | 65.4% | 77.3% |
| Token cost (same task) | ~$155 (API) | ~$15 |
| First-pass correctness | ~95% | ~88% |
Where each is strong:
- Claude Code. Deep reasoning on existing codebases, multi-file refactors, architectural thinking, Agent Teams / subagents, MCP integrations (3,000+), hooks, SDK extensibility. Collaborative partner who checks in.
- Codex CLI. Token-efficient, aggressive full-auto mode with kernel sandboxing, terminal / DevOps / scripts, tight GitHub integration, cheaper.
“Claude writes, Codex reviews” — is this real? Partially. OpenAI shipped an official Codex plugin for Claude Code for exactly this cross-provider pattern; the “remove same-model sycophancy” rationale is documented. MindStudio’s writeup and Homborsund AI’s “Agentic Coding Gets Real” both describe real teams doing it. It is not folklore — it is a published pattern — but it is not a universal best practice. Most teams in Anthropic’s 2026 report use a single agent. The split makes sense when you already pay for both and want a second opinion on high-stakes PRs.
PR review agents — the 2026 landscape
| Tool | Team pricing | Bug-catch | Strengths | Watch-outs |
|---|---|---|---|---|
| CodeRabbit | Pro $24/dev/mo; free on private repos w/ 200 files/hr cap; free forever on public | ~44–46% | Lowest false-positive rate; GitHub / GitLab / Bitbucket / Azure DevOps; learns team style | Less deep than Greptile |
| Greptile | ~$30/dev/mo, no free tier | 82% (highest) | Best bug catch; repo-wide context; diagrams | 11 false positives/run vs CR’s 2; GitHub + GitLab only |
| Qodo Merge | Free dev tier (30 PRs/mo); Pro $19/user; Teams $30/user | 60% F1 | Generates unit tests when gaps found; open-source core (PR-Agent) is self-hostable | Less polished UI |
| Sourcery | ~$12/dev/mo | n/a | Deep Python (AST-level) | Python-only — useless for NestJS/React |
| GitHub Copilot Code Review | Bundled in Copilot $10/user | Moderate | Native PR integration, zero setup | 4,000-char custom-instruction cap; cannot block merges; no CI gating; no learning |
| Cursor BugBot | ~$40/user/mo | n/a | Tuned for AI-generated code | Cursor-adjacent; overkill if not on Cursor |
| Graphite Agent (ex-Diamond) | $40/user/mo Team | n/a | Unified Chat + review + stacking + merge queue; Shopify +33% PRs/dev, Asana +21% code/week | Expensive; stacking ergonomics are the real draw |
| Bito | ~$15/user/mo | n/a | 50+ langs, 25 NL, effort estimates | Less differentiated |
| Anthropic claude-code-action | API cost ~$0.20–$2/PR | High (full Claude Code runtime) | Responds to @claude in PRs, runs in Actions runner, can open sub-PRs / fix code | Per-token; spend cap needed |
| OpenAI codex-action | API cost per run | High | codex exec in CI, auto-review, autofix CI failures | Per-token |
Sentiment snapshot from CodeRabbit vs Qodo and Greptile vs CodeRabbit: CodeRabbit is the community default for small teams. Greptile is loved where false-positive tolerance is low. Qodo’s OSS PR-Agent is the hacker favourite. Copilot Code Review is “free-ish” but considered weak head-to-head.
Claude Code in CI
anthropics/claude-code-action@v1 (repo, docs) is the official path. Runs the full Claude Code runtime inside a GitHub Actions runner with repo files, git history, and diff context. Install the Claude GitHub App, store ANTHROPIC_API_KEY as a secret, drop a workflow:
```yaml
name: Claude PR Review
on:
  pull_request:
    types: [opened, synchronize, reopened]
  issue_comment:
    types: [created]
concurrency:
  group: claude-${{ github.ref }}
  cancel-in-progress: true
jobs:
  review:
    runs-on: ubuntu-latest
    permissions: { contents: read, pull-requests: write, issues: write }
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            Review this PR for correctness, security, performance, and NestJS/React conventions.
            Flag any Neon-specific migration concerns. Only comment on real issues.
```

Companion repos: anthropics/claude-code-security-review (security-specialised), anthropics/claude-code-base-action (headless primitive). OpenAI’s equivalent openai/codex-action@v1 works similarly and can auto-review every PR when “Automatic reviews” is flipped in Codex settings. The Codex autofix cookbook has a notable example of Codex auto-fixing failing CI.
Who writes, who reviews, who tests — emerging patterns
Synthesised from Anthropic’s 2026 Agentic Coding Trends Report and Martin Fowler’s “Humans and Agents”:
- ~60% of work is AI-assisted, but only 0–20% is fully delegated. Humans stay in the “why loop” (goals, constraints, product intent); agents run the “how loop” (typing, scaffolding, refactors, test generation).
- Code review has become the primary engineering surface. For many engineers it’s the first look at code they didn’t write. Spec-driven development is displacing line-by-line review.
- Tiered human-in-the-loop is emerging best practice. Low-risk changes (confidence > 0.8) auto-merge; medium-risk ones (0.6–0.8) get human review with a 4-hour SLA; auth / schema / dependency changes always flag for a human.
- Agent-reviews-agent is now mainstream. An AI reviewer reads AI-generated code before a human sees it, filtering noise. This is the pattern CodeRabbit / Graphite Agent / claude-code-action all slot into.
Emerging shorthand: “human on the loop, not in the loop” — supervising the working loop rather than approving every diff.
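The tiered policy can be sketched as a routing function. The thresholds mirror the numbers quoted above; the type names, path patterns, and `route` function are hypothetical, not any vendor's API:

```typescript
// Sketch of tiered human-in-the-loop routing for agent-authored PRs.
// Thresholds follow the cited pattern; all names here are illustrative.
type ReviewDecision = "auto-merge" | "human-review-4h-sla" | "always-flag";

interface AgentPR {
  confidence: number;     // reviewer agent's self-reported confidence, 0..1
  touchedPaths: string[]; // files changed in the PR
}

// Auth, schema, and dependency changes always escalate, per the pattern above.
const SENSITIVE = [/auth/, /migrations?/, /package(-lock)?\.json$/];

function route(pr: AgentPR): ReviewDecision {
  if (pr.touchedPaths.some((p) => SENSITIVE.some((re) => re.test(p)))) {
    return "always-flag";
  }
  if (pr.confidence > 0.8) return "auto-merge";
  if (pr.confidence >= 0.6) return "human-review-4h-sla";
  return "always-flag"; // low confidence: treat like a sensitive change
}
```

Note the ordering: the sensitive-path check runs first, so a high confidence score can never bypass the always-flag rule.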
Real examples: Vercel publicly runs Vercel Agent on their own PRs; GitHub uses Copilot Code Review on theirs; Shopify (+33% PRs/dev after Graphite) and Asana (+21% code/week) are in Graphite’s case studies; Rakuten / CRED / TELUS / Zapier appear in Anthropic’s report.
Superpowers plugin
github.com/obra/superpowers by Jesse Vincent (@obra, formerly Anthropic’s early Claude team, now at Prime Radiant). Community plugin (not official Anthropic), accepted into the official Claude plugin marketplace in January 2026. ~121K stars (April 2026) — one of the fastest-growing repos of the year.
Install in Claude Code:
```
/plugin marketplace add obra/superpowers-marketplace
/plugin install superpowers@superpowers-marketplace
```
Ships 14 composable skills (markdown workflow instructions Claude reads at runtime) enforcing a 5-phase discipline: clarify → design → plan → code → verify. Skills include true red/green TDD, subagent-based code review, debugging protocols, brainstorming, skill authoring, YAGNI / DRY enforcement. Companion repos: superpowers-skills, superpowers-lab, superpowers-chrome, superpowers-developing-for-claude-code.
Relation to BMAD. Both are methodology-as-code over Claude Code. BMAD is heavier (six agents, full PRD → story pipeline, good for greenfield multi-week projects). Superpowers is lighter (per-task discipline, good for day-to-day). They compose — BMAD for the planning sprint, Superpowers inside each dev’s Claude Code for individual stories. See Benny Cheung — Harmonizing BMAD + Superpowers.
Recommendation for 361
1. PR review agent → CodeRabbit (free) + claude-code-action for flagged PRs.
CodeRabbit is free on private repos within the 200 files/hr cap — 3–5 devs won’t hit that. Wire anthropics/claude-code-action@v1 as a second, deeper reviewer triggered by @claude mentions, and always on PRs that touch apps/api/src/migrations/** or Neon schema files. Cheap default + surgical depth. Budget ~$50/month in Anthropic API spend for claude-code-action, cap with ANTHROPIC_MAX_SPEND_USD.
Runner-up if we outgrow CodeRabbit: Qodo Merge Pro at $19/user — the test-generation-on-missing-coverage is genuinely differentiated.
2. Claude vs Codex → Claude-primary, Codex optional second-opinion reviewer.
At 3–5 devs with Phase 1 hard-locked to June 27, paying for two agent ecosystems is friction we don’t need. Use Claude Code as the daily driver. If one dev wants to A/B the Codex-reviews-Claude pattern via the Codex plugin for Claude Code, let them — it’s a zero-cost experiment and the anti-sycophancy argument is real. Revisit post-Phase-1.
3. BMAD → partial. Planning trio only.
One-week upfront: run Analyst → PM → Architect against the current oxFlow concept map, business rules, and Workbench migration scope. Output: a PRD and architecture doc in web-UI Claude. That’s the ritual worth doing for a complex Benchmark replacement. Skip SM / Dev / QA orchestration — at 3–5 devs, story-file ceremony outweighs the benefit, and we already have GitHub Flow + staging gate for execution.
4. Superpowers → in, every dev.
TDD + planning discipline are free wins. Plugin is trivial to uninstall if it gets in the way. Addresses the single biggest failure mode of Claude Code at scale — agents skipping planning and emitting plausible-but-wrong code.
5. Don’t add: Greptile (good but expensive + overlaps CodeRabbit), Graphite Agent (the stacking is the value, not the AI; separate decision), Sourcery (Python-only), Cursor BugBot (not on Cursor).
6. Wire the setup in one week.
- Day 1. Install Claude GitHub App; add `ANTHROPIC_API_KEY` secret; create `.github/workflows/claude-review.yml` with the workflow above.
- Day 2. Connect CodeRabbit; disable its auto-approvals; configure `.coderabbit.yaml` with the oxFlow business-rule conventions.
- Day 3. `/plugin install superpowers@superpowers-marketplace` in each dev’s Claude Code; commit `.claude/commands/` + `.claude/agents/` presets.
- Week 1 (parallel track). BMAD Analyst → PM → Architect sprint on the current specs → produces PRD.md, architecture.md, committed to `oxflow-docs/planning/`.
Alternatives considered
- Greptile as primary reviewer. Higher bug catch (82% vs CodeRabbit’s 46%), but 11 false positives/run is painful for a team small enough to read every comment. Revisit if PR noise stops being the limiting factor.
- Full BMAD adoption. Story-file ceremony is more overhead than value at 3–5 devs. Revisit if the team doubles.
- Codex as daily driver. Cheaper per task and genuinely good at DevOps / scripts, but the loss of Claude Code’s Agent Teams + MCP + Superpowers ecosystem outweighs the savings.
- No review agent — human review only. Viable at this size, but leaves review quality variance fully dependent on which human picks up the PR. A consistent first-pass reviewer removes that variance for zero ongoing effort.
- Graphite for stacked PRs + Graphite Agent as reviewer. Graphite stacking is a real productivity lever, but the AI reviewer is a separate decision; evaluate the stacking ergonomics on its own merits.
Sources
- BMAD-METHOD GitHub repo
- BMAD Method official docs
- BMAD v6.3.0 changelog — VibeSparking
- Applied BMAD — Benny Cheung
- BMAD + Claude Code setup guide — Zichen
- Claude Code vs Codex CLI 2026 — NxCode
- Codex vs Claude Code — Builder.io
- Codex vs Claude Code — DataCamp
- Codex vs Claude Code benchmarks — MorphLLM
- OpenAI Codex plugin for Claude Code — MindStudio
- Anthropic 2026 Agentic Coding Trends Report
- anthropics/claude-code-action
- anthropics/claude-code-security-review
- Claude Code GitHub Actions docs
- openai/codex-action
- Codex GitHub Action docs
- Codex SDK code-review cookbook
- Codex autofix CI failures cookbook
- CodeRabbit pricing
- CodeRabbit vs Qodo comparison
- Greptile vs CodeRabbit
- Qodo pricing
- Qodo Merge / PR-Agent open source
- Graphite Agent launch
- Graphite Agent pricing + case studies
- GitHub Copilot code review docs
- obra/superpowers
- obra/superpowers-marketplace
- Superpowers blog post — Jesse Vincent
- Simon Willison on Superpowers
- Martin Fowler — Humans and Agents
- Harmonizing BMAD + Superpowers — Benny Cheung
- HITL patterns for AI agents 2026
- Claude Code agentic workflow patterns — MindStudio
- Set up Claude Code GitHub Actions — systemprompt.io
See also
- Shared project memory — the memory layer that feeds agent context (CLAUDE.md, session capture, Karpathy wiki).
- Multica and Claude Managed Agents — sibling research on orchestrating agents on top of this review stack.
- Database analysis — Neon + S3 data tier. The “PR flagged if touching migrations” rule above references the Neon schema surface.
- DevOps and infrastructure — GitHub + Render + Neon stack this review workflow plugs into.
- BRANDING.md — visual language used in the accompanying HTML.
Linked from
- Daily log 2026-04-17 — originating tasks (Claude Managed Agents + Multica, Superpowers, Git research).