Question

How should 361’s 3–5 dev team actually write and review code on oxflow using Claude Code alongside other AI coding tools? Specifically: (a) which PR-review agent plugs into the GitHub + Render + Neon stack; (b) should we use Claude and Codex together, and if so how; (c) does BMAD give us something worth the ceremony at 3–5 devs; (d) is the Superpowers plugin in or out?

TL;DR

Run CodeRabbit on every PR as the default reviewer (free tier covers 3–5 devs on private repos) and wire anthropics/claude-code-action@v1 for @claude-mentioned deep reviews on PRs touching migrations or domain rules. Claude Code is the daily driver for writing; Codex is optional as a second-opinion reviewer, not on the critical path. Adopt BMAD partially — use the Analyst + PM + Architect trio as a one-week upfront planning ritual, skip the full SM/Dev/QA orchestration at this team size. Install obra/superpowers on every dev’s Claude Code for TDD and planning discipline.

Reject: Greptile (more noise than we can tolerate), Graphite Agent (stacking is the valuable part, not the AI), Sourcery (Python-only), Cursor BugBot (we’re not on Cursor).

Approach

Delegated a landscape survey to a research agent with web access, briefed with the locked stack (DevOps, DB) and the specific questions above. Cross-checked community sentiment against published benchmarks, official vendor docs, and Anthropic’s 2026 Agentic Coding Trends Report.

Findings

BMAD method

BMAD — “Breakthrough Method for Agile AI-Driven Development” — lives at github.com/bmad-code-org/BMAD-METHOD (renamed from bmadcode/BMAD-METHOD; the old URL redirects). Authored by Brian Madison. Active since April 2025, now at v6.3.0 (April 10 2026) with ~1,400 stars. Free, open-source, framework-agnostic. Installs into VS Code / Cursor / Claude Code via npx bmad-method install.

Splits the SDLC into two phases across six personas:

  • Planning phase (web UI / long context). Analyst → project brief. PM → PRD. Architect → system design.
  • Execution phase (IDE). Scrum Master shards the PRD into hyper-detailed “story files” carrying full architectural context. Dev implements one story at a time. QA reviews.

Agents are YAML-backed markdown personas (.agent.yaml compiled to .md) with id, name, persona.role, communication_style, principles, commands, dependencies. Scale-adaptive — you can skip phases for bug-fix-scope vs enterprise-scope.
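To make the persona schema concrete, here is a hypothetical .agent.yaml in the shape described above; all field values are invented for illustration, not copied from the BMAD repo.

```yaml
# Hypothetical BMAD persona sketch. Field names follow the schema above
# (id, name, persona.role, communication_style, principles, commands,
# dependencies); every value is invented for illustration.
agent:
  id: architect
  name: Arch
  persona:
    role: System architect who turns the PRD into a buildable design
    communication_style: Terse, diagram-first
  principles:
    - Prefer boring technology
    - Every decision traceable to a PRD requirement
  commands:
    - create-architecture
    - review-story
  dependencies:
    - prd.md
```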

Expansion packs extend BMAD beyond web dev: cloud infra, data architecture, API integration, platform engineering, governance, plus community packs for game dev, DevOps, creative writing. v6.3 introduced a marketplace and merged four agents into one (“Amelia”).

BMAD pairs naturally with Claude Code because story files are designed to drop into an agent’s context window whole. See Benny Cheung — Applied BMAD and BMAD + Claude Code setup guide.

Claude Code vs Codex — numbers

Benchmarks as of Q1 2026 (per NxCode and MorphLLM):

| Metric | Claude Code (Opus 4.7 / Sonnet 4.5) | Codex CLI (GPT-5.3-Codex) |
| --- | --- | --- |
| SWE-bench Verified | 80.9% | ~74% |
| Terminal-Bench 2.0 | 65.4% | 77.3% |
| Token cost (same task) | ~$155 (API) | ~$15 |
| First-pass correctness | ~95% | ~88% |

Where each is strong:

  • Claude Code. Deep reasoning on existing codebases, multi-file refactors, architectural thinking, Agent Teams / subagents, MCP integrations (3,000+), hooks, SDK extensibility. Collaborative partner who checks in.
  • Codex CLI. Token-efficient, aggressive full-auto mode with kernel sandboxing, terminal / DevOps / scripts, tight GitHub integration, cheaper.

“Claude writes, Codex reviews” — is this real? Partially. OpenAI shipped an official Codex plugin for Claude Code for exactly this cross-provider pattern; the “remove same-model sycophancy” rationale is documented. MindStudio’s writeup and Homborsund AI’s “Agentic Coding Gets Real” both describe real teams doing it. It is not folklore — it is a published pattern — but it is not a universal best practice. Most teams in Anthropic’s 2026 report use a single agent. The split makes sense when you already pay for both and want a second opinion on high-stakes PRs.

PR review agents — the 2026 landscape

| Tool | Team pricing | Bug-catch | Strengths | Watch-outs |
| --- | --- | --- | --- | --- |
| CodeRabbit | Pro $24/dev/mo; free on private repos w/ 200 files/hr cap; free forever on public | ~44–46% | Lowest false-positive rate; GitHub / GitLab / Bitbucket / Azure DevOps; learns team style | Less deep than Greptile |
| Greptile | ~$30/dev/mo, no free tier | 82% (highest) | Best bug catch; repo-wide context; diagrams | 11 false positives/run vs CR’s 2; GitHub + GitLab only |
| Qodo Merge | Free dev tier (30 PRs/mo); Pro $19/user; Teams $30/user | 60% F1 | Generates unit tests when gaps found; open-source core (PR-Agent) is self-hostable | Less polished UI |
| Sourcery | ~$12/dev/mo | n/a | Deep Python (AST-level) | Python-only — useless for NestJS/React |
| GitHub Copilot Code Review | Bundled in Copilot $10/user | Moderate | Native PR integration, zero setup | 4,000-char custom-instruction cap; cannot block merges; no CI gating; no learning |
| Cursor BugBot | ~$40/user/mo | n/a | Tuned for AI-generated code | Cursor-adjacent; overkill if not on Cursor |
| Graphite Agent (ex-Diamond) | $40/user/mo Team | n/a | Unified Chat + review + stacking + merge queue; Shopify +33% PRs/dev, Asana +21% code/week | Expensive; stacking ergonomics are the real draw |
| Bito | ~$15/user/mo | n/a | 50+ langs, 25 NL, effort estimates | Less differentiated |
| Anthropic claude-code-action | API cost ~$0.20–$2/PR | High (full Claude Code runtime) | Responds to @claude in PRs, runs in Actions runner, can open sub-PRs / fix code | Per-token; spend cap needed |
| OpenAI codex-action | API cost per run | High | codex exec in CI, auto-review, autofix CI failures | Per-token |

Sentiment snapshot from CodeRabbit vs Qodo and Greptile vs CodeRabbit: CodeRabbit is the community default for small teams. Greptile is loved where false-positive tolerance is low. Qodo’s OSS PR-Agent is the hacker favourite. Copilot Code Review is “free-ish” but considered weak head-to-head.

Claude Code in CI

anthropics/claude-code-action@v1 (repo, docs) is the official path. Runs the full Claude Code runtime inside a GitHub Actions runner with repo files, git history, and diff context. Install the Claude GitHub App, store ANTHROPIC_API_KEY as a secret, drop a workflow:

```yaml
name: Claude PR Review
on:
  pull_request:
    types: [opened, synchronize, reopened]
  issue_comment:
    types: [created]
concurrency:
  group: claude-${{ github.ref }}
  cancel-in-progress: true
jobs:
  review:
    runs-on: ubuntu-latest
    permissions: { contents: read, pull-requests: write, issues: write }
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          prompt: |
            Review this PR for correctness, security, performance, and NestJS/React conventions.
            Flag any Neon-specific migration concerns. Only comment on real issues.
```

Companion repos: anthropics/claude-code-security-review (security-specialised), anthropics/claude-code-base-action (headless primitive). OpenAI’s equivalent openai/codex-action@v1 works similarly and can auto-review every PR when “Automatic reviews” is flipped in Codex settings. The Codex autofix cookbook has a notable example of Codex auto-fixing failing CI.
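A minimal sketch of the codex-exec-in-CI pattern mentioned above. The job shape, the `@openai/codex` install step, and the `OPENAI_API_KEY` secret name are assumptions for illustration; the cookbook is the reference for the supported setup.

```yaml
# Sketch only: run Codex non-interactively when the test step fails.
# Assumes the Codex CLI npm package and an OPENAI_API_KEY repo secret.
name: Codex autofix sketch
on: workflow_dispatch
jobs:
  autofix:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Test, then ask Codex to diagnose any failure
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          npm test || {
            npm install -g @openai/codex
            codex exec "The test suite failed. Diagnose the failure and propose a minimal patch."
          }
```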

Who writes, who reviews, who tests — emerging patterns

Synthesised from Anthropic’s 2026 Agentic Coding Trends Report and Martin Fowler’s “Humans and Agents”:

  • ~60% of work is AI-assisted, but only 0–20% is fully delegated. Humans stay in the “why loop” (goals, constraints, product intent); agents run the “how loop” (typing, scaffolding, refactors, test generation).
  • Code review has become the primary engineering surface. For many engineers it’s the first look at code they didn’t write. Spec-driven development is displacing line-by-line review.
  • Tiered human-in-the-loop is emerging as best practice: low-risk changes (confidence > 0.8) auto-merge; medium-risk (0.6–0.8) get human review with a 4-hour SLA; auth, schema, and dependency changes are always flagged.
  • Agent-reviews-agent is now mainstream. An AI reviewer reads AI-generated code before a human sees it, filtering noise. This is the pattern CodeRabbit / Graphite Agent / claude-code-action all slot into.

Emerging shorthand: “human on the loop, not in the loop” — supervising the working loop rather than approving every diff.
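The tier thresholds above can be sketched as a routing function. This is an illustrative policy, not any vendor's API; the `touchesSensitivePaths` helper and the sensitive-path patterns are assumptions.

```typescript
// Illustrative sketch of the tiered human-in-the-loop policy described above.
// Thresholds come from the text; path patterns and helper are assumptions.

type ReviewTier = "auto-merge" | "human-review-4h" | "always-flag";

interface PrSignal {
  confidence: number;     // reviewer agent's confidence in the change, 0..1
  changedPaths: string[]; // files touched by the PR
}

// Paths that always need a human look: auth, schema, dependencies.
const SENSITIVE = [/auth/, /migration/, /schema/, /package(-lock)?\.json$/];

function touchesSensitivePaths(paths: string[]): boolean {
  return paths.some((p) => SENSITIVE.some((re) => re.test(p)));
}

function routePr(signal: PrSignal): ReviewTier {
  if (touchesSensitivePaths(signal.changedPaths)) return "always-flag";
  if (signal.confidence > 0.8) return "auto-merge";
  if (signal.confidence >= 0.6) return "human-review-4h";
  return "always-flag";
}
```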

Real examples: Vercel publicly runs Vercel Agent on their own PRs; GitHub uses Copilot Code Review on theirs; Shopify (+33% PRs/dev after Graphite) and Asana (+21% code/week) are in Graphite’s case studies; Rakuten / CRED / TELUS / Zapier appear in Anthropic’s report.

Superpowers plugin

github.com/obra/superpowers by Jesse Vincent (@obra, formerly Anthropic’s early Claude team, now at Prime Radiant). Community plugin (not official Anthropic), accepted into the official Claude plugin marketplace in January 2026. ~121K stars (April 2026) — one of the fastest-growing repos of the year.

Install in Claude Code:

```
/plugin marketplace add obra/superpowers-marketplace
/plugin install superpowers@superpowers-marketplace
```

Ships 14 composable skills (markdown workflow instructions Claude reads at runtime) enforcing a 5-phase discipline: clarify → design → plan → code → verify. Skills include true red/green TDD, subagent-based code review, debugging protocols, brainstorming, skill authoring, YAGNI / DRY enforcement. Companion repos: superpowers-skills, superpowers-lab, superpowers-chrome, superpowers-developing-for-claude-code.

Relation to BMAD. Both are methodology-as-code over Claude Code. BMAD is heavier (six agents, full PRD → story pipeline, good for greenfield multi-week projects). Superpowers is lighter (per-task discipline, good for day-to-day). They compose — BMAD for the planning sprint, Superpowers inside each dev’s Claude Code for individual stories. See Benny Cheung — Harmonizing BMAD + Superpowers.

Recommendation for 361

1. PR review agent → CodeRabbit (free) + claude-code-action for flagged PRs.

CodeRabbit is free on private repos within the 200 files/hr cap — 3–5 devs won’t hit that. Wire anthropics/claude-code-action@v1 as a second, deeper reviewer triggered by @claude mentions, and always on PRs that touch apps/api/src/migrations/** or Neon schema files. Cheap default + surgical depth. Budget ~$50/month in Anthropic API spend for claude-code-action, cap with ANTHROPIC_MAX_SPEND_USD.
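One way to make the deep review unconditional on migration PRs is a `paths` filter on the workflow trigger (standard GitHub Actions syntax; the second glob is an assumption about where the Neon schema lives):

```yaml
# Sketch: always run the Claude deep review when migration/schema files
# change, in addition to the @claude-mention trigger.
on:
  pull_request:
    types: [opened, synchronize]
    paths:
      - "apps/api/src/migrations/**"
      - "**/schema.prisma"   # assumption; point at the actual Neon schema files
  issue_comment:
    types: [created]
```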

Runner-up if we outgrow CodeRabbit: Qodo Merge Pro at $19/user — its generate-tests-where-coverage-is-missing feature is genuinely differentiated.

2. Claude vs Codex → Claude-primary, Codex optional second-opinion reviewer.

At 3–5 devs with Phase 1 hard-locked to June 27, paying for two agent ecosystems is friction we don’t need. Use Claude Code as the daily driver. If one dev wants to A/B the Codex-reviews-Claude pattern via the Codex plugin for Claude Code, let them — it’s a zero-cost experiment and the anti-sycophancy argument is real. Revisit post-Phase-1.

3. BMAD → partial. Planning trio only.

One-week upfront: run Analyst → PM → Architect against the current oxFlow concept map, business rules, and Workbench migration scope. Output: a PRD and architecture doc in web-UI Claude. That’s the ritual worth doing for a complex Benchmark replacement. Skip SM / Dev / QA orchestration — at 3–5 devs, story-file ceremony outweighs the benefit, and we already have GitHub Flow + staging gate for execution.

4. Superpowers → in, every dev.

TDD + planning discipline are free wins. Plugin is trivial to uninstall if it gets in the way. Addresses the single biggest failure mode of Claude Code at scale — agents skipping planning and emitting plausible-but-wrong code.

5. Don’t add: Greptile (good but expensive + overlaps CodeRabbit), Graphite Agent (the stacking is the value, not the AI; separate decision), Sourcery (Python-only), Cursor BugBot (not on Cursor).

6. Wire the setup in one week.

  • Day 1. Install Claude GitHub App; add ANTHROPIC_API_KEY secret; create .github/workflows/claude-review.yml with the workflow above.
  • Day 2. Connect CodeRabbit; disable its auto-approvals; configure .coderabbit.yaml with the oxFlow business-rule conventions.
  • Day 3. /plugin install superpowers@superpowers-marketplace in each dev’s Claude Code; commit .claude/commands/ + .claude/agents/ presets.
  • Week 1 (parallel track). BMAD Analyst → PM → Architect sprint on the current specs → produces PRD.md, architecture.md, committed to oxflow-docs/planning/.
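A sketch of what that .coderabbit.yaml could look like. The `reviews.path_instructions` key follows CodeRabbit's published config schema as we understand it; the globs and instruction text are illustrative, not the actual oxFlow conventions.

```yaml
# Sketch of .coderabbit.yaml for oxFlow; key names per CodeRabbit's config
# schema as we understand it, instruction text invented for illustration.
reviews:
  auto_review:
    enabled: true
  path_instructions:
    - path: "apps/api/src/migrations/**"
      instructions: >
        Treat any destructive change (DROP, column type change) as blocking.
        Confirm a Neon branch was used to rehearse the migration.
    - path: "apps/web/**"
      instructions: >
        Enforce the team's React conventions. Flag business rules implemented
        in components instead of the API layer.
```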

Alternatives considered

  • Greptile as primary reviewer. Higher bug catch (82% vs CodeRabbit’s 46%), but 11 false positives/run is painful for a team small enough to read every comment. Revisit if PR noise stops being the limiting factor.
  • Full BMAD adoption. Story-file ceremony is more overhead than value at 3–5 devs. Revisit if the team doubles.
  • Codex as daily driver. Cheaper per task and genuinely good at DevOps / scripts, but the loss of Claude Code’s Agent Teams + MCP + Superpowers ecosystem outweighs the savings.
  • No review agent — human review only. Viable at this size, but leaves review quality variance fully dependent on which human picks up the PR. A consistent first-pass reviewer removes that variance for zero ongoing effort.
  • Graphite for stacked PRs + Graphite Agent as reviewer. Graphite stacking is a real productivity lever, but the AI reviewer is a separate decision; evaluate the stacking ergonomics on its own merits.

Linked from

  • Daily log 2026-04-17 — originating tasks (Claude Managed Agents + Multica, Superpowers, Git research).