Research — DevOps & Infrastructure Design for oxFlow

Note: pending team review.

Plain-English Summary

oxFlow needs a way for the dev team to write code, test it, show it to Oxcon for approval, and then put it live — without stepping on each other’s toes or exposing real client data during testing.

Here’s what we’re proposing:

Three platforms, that’s it. GitHub (where the code lives and gets checked automatically), Render (runs the app), and Neon (stores the data). No complex AWS/Azure setup. The whole team manages everything from three dashboards.
A simple code flow. A developer builds a feature on their own branch. When it’s ready, it goes to a “staging” version of the app that Oxcon can log into and review. Once they approve, it goes live. If something breaks in production, we can push an emergency fix directly.
Real data for testing, but safe. When we deploy to staging, we take an instant snapshot of the live database and automatically swap out sensitive info (supplier names, dollar values, margins) with fake data. This means Oxcon reviews against realistic data without anyone seeing confidential pricing.
Automatic quality checks. Every time a developer submits code, GitHub automatically runs tests, checks for bugs, and scans for security issues. If anything fails, the code can’t go live.
Slack notifications. The #oxflow channel gets pinged when staging is updated, when something deploys to production, or when a test fails.

Estimated cost: ~NZ$205–370/mo at early stage. Scales with usage — no big upfront spend.

Question

How should the oxFlow team collaborate on code, deploy to staging and production, handle database snapshots for staging, and what hosting/infrastructure should we use — given enterprise requirements, integration points (Xero, M365, MS Project, Workbench), and a small team (3-5 devs)?

TL;DR

Recommended: GitHub Flow with a staging gate, hosted on Render, database on Neon (for copy-on-write branching), CI/CD via GitHub Actions, notifications to #oxflow Slack channel. Three platforms total. Estimated ~NZ$205–370/mo at early stage.

Tech stack: NestJS + React SPA (Vite + TanStack Router) + TypeScript + PostgreSQL (Neon) + Drizzle ORM. Chosen because TypeScript has first-class SDKs for every integration point — Xero, MSAL (M365 SSO), Claude AI. PHP has no official MSAL library from Microsoft.

The Kitchen Analogy

Throughout this doc we use a restaurant analogy to explain the architecture. If you only read one section, read this one.

Restaurant	oxFlow	What it does
Dining room	React SPA (frontend)	What users see and interact with — screens, buttons, forms
Kitchen	NestJS API (backend)	Where the real work happens — calculates costs, enforces rules, processes data
Chefs	Services (business logic)	Follow recipes to prepare each order — one chef for estimates, one for adjudications, one for commercials
Head chef	API router	Decides which chef handles which order
Recipes	Business rules (123 of them)	The instructions chefs follow — “you can’t submit with unpriced items”, “commercial rules apply in sequence”
Pantry	Neon PostgreSQL (database)	Where all ingredients are stored — tenders, estimates, resources, price books
Prep board	Redis + BullMQ (job queue)	Long orders that can’t be served immediately — PDF exports, Excel imports, batch calculations
Reservation book	Auth.js + M365 SSO	Checks who’s allowed in and what role they have — Admin, Lead Estimator, Estimator
Test kitchen	Staging environment	An exact replica of the real restaurant where new dishes are tasted before going on the real menu

1. Tech Stack

TL;DR: TypeScript end-to-end (NestJS backend + React frontend) because every system oxFlow integrates with (Xero, M365, AI) has its best SDK in TypeScript. PHP doesn’t even have an official Microsoft login library.

Layer	Technology	Kitchen term	Why this choice
Frontend	React + TypeScript + Vite + TanStack Router + Tailwind CSS	The dining room	Interactive dashboards, tree editors, real-time collaboration. SPA gives full control over client state
Backend	NestJS + TypeScript	The kitchen	123 business rules need a structured home. NestJS provides modules (stations), services (chefs), guards (door policy), and interceptors (audit logging) out of the box
Database	PostgreSQL on Neon	The pantry	Recursive CTEs for 5-level cost hierarchies, materialized views for roll-ups, JSONB for flexible metadata. Neon adds instant database branching for staging
ORM	Drizzle ORM	Pantry labels	Full SQL control for recursive queries. Prisma’s query API has no recursive CTE support (open issue since 2020), so those queries would have to drop to raw SQL
Job queue	BullMQ + Redis	The prep board	PDF generation, Excel import/export, batch cost calculations — all run in the background without blocking users
Auth	Auth.js + Microsoft Entra ID (Azure AD)	The reservation book	Oxcon uses M365. Auth.js has a built-in Microsoft provider — SSO in minimal config
Real-time	WebSocket gateway (NestJS built-in)	Kitchen bell	Per-item locking, presence indicators, live updates when another estimator edits

Why This Stack (Integration Reality Check)

oxFlow integrates with Xero, Microsoft 365, MS Project, and Workbench. SDK availability drove the stack decision:

Integration	TypeScript SDK	PHP SDK	Winner
Xero	`xero-node` (250 stars, updated daily)	`xero-php-oauth2` (105 stars)	TypeScript
M365 SSO (MSAL)	`MSAL.js` (4,044 stars, updated daily)	No official MSAL for PHP	TypeScript
MS Project (.MPP)	Via Python microservice (MPXJ)	No option	Neither — separate service
Claude AI	`anthropic-sdk-typescript` (1,862 stars)	No official SDK	TypeScript

TypeScript has first-class, officially maintained SDKs for Xero, M365 SSO, and Claude AI. MS Project (.MPP) parsing needs a separate service whichever stack is chosen. PHP does not have an official MSAL library from Microsoft, making M365 SSO a second-class experience.

Stacks Considered and Rejected

Stack	Why not
Next.js (App Router)	No service layer conventions for 123 business rules. Server-first paradigm is a poor fit for highly interactive dashboards (Railway, Documenso, Northflank all moved away from Next.js for this reason). Prisma (common pairing) cannot do recursive queries
Laravel + Inertia.js	No official MSAL for PHP (M365 SSO is second-class). Xero and AI SDKs are less maintained in PHP. Strong framework but wrong ecosystem for these integration points
NestJS + Next.js (separate)	Two services to deploy and maintain. Added complexity without proportional benefit at this team size
Remix	Small ecosystem, documentation gaps, React Router v7 transition uncertainty

2. Architecture

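TL;DR: One app that does everything. The user-facing screens and the backend logic are packaged together and deployed as a single unit. No microservices in the core app and no separate frontend server; the only planned exception is the separate MS Project parsing service noted in Section 1.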

One deployable application. The kitchen and dining room are in the same building.

┌─────────────────────────────────────────────┐
│                   Render                     │
│                                              │
│  ┌──────────────────────────────────────┐   │
│  │           NestJS Application          │   │
│  │                                       │   │
│  │  ┌─────────┐  ┌──────────────────┐   │   │
│  │  │  Static  │  │   API Routes     │   │   │
│  │  │  React   │  │   /api/v1/*      │   │   │
│  │  │  SPA     │  │                  │   │   │
│  │  │ (dist/)  │  │  ┌────────────┐  │   │   │
│  │  │          │  │  │  Services   │  │   │   │
│  │  │  Dining  │  │  │  (chefs)   │  │   │   │
│  │  │  room    │  │  │            │  │   │   │
│  │  │          │  │  │  Kitchen   │  │   │   │
│  │  └─────────┘  │  └────────────┘  │   │   │
│  │               │                   │   │   │
│  │               │  ┌────────────┐   │   │   │
│  │               │  │  BullMQ    │   │   │   │
│  │               │  │  Workers   │   │   │   │
│  │               │  │ (prep      │   │   │   │
│  │               │  │  board)    │   │   │   │
│  │               │  └────────────┘   │   │   │
│  └──────────────────────────────────────┘   │
│                                              │
│  ┌──────────┐                                │
│  │  Redis   │ ← job queue storage            │
│  └──────────┘                                │
└─────────────────────────────────────────────┘
          │
          │ SQL queries
          ▼
┌──────────────────┐
│  Neon PostgreSQL  │ ← the pantry
│                   │
│  main branch      │ ← production ingredients
│  staging branch   │ ← test kitchen ingredients
└──────────────────┘

┌──────────────────┐
│  Microsoft Entra  │ ← the reservation book
│  (Azure AD SSO)   │   (Microsoft's servers — we
│                   │    just redirect to them)
└──────────────────┘

How a Request Flows (Ordering a Dish)

1. User opens oxflow.3sixtyone.co — Render serves the React SPA (dining room opens)
2. User clicks “Open Estimate #42” — SPA sends GET /api/v1/estimates/42 to NestJS
3. NestJS auth middleware checks the Microsoft token (reservation book confirms they’re allowed)
4. NestJS route calls the Estimate Service (chef picks up the order)
5. Service queries Neon for the estimate + items + worksheet (chef goes to the pantry)
6. Service computes cost roll-ups following business rules (chef follows the recipe)
7. Response sent back to SPA — user sees the estimate (dish served to the table)
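
The request flow above can be sketched without any framework. This is only an illustration under stated assumptions: the in-memory map stands in for Neon, the token check stands in for real Microsoft token validation, and every name (`getEstimate`, `rollUp`, `pantry`) is hypothetical rather than taken from the oxFlow codebase.

```typescript
// Framework-free sketch of the request flow. In the real app, NestJS guards,
// controllers, and services would play these roles. All names are hypothetical.

interface EstimateItem {
  id: number;
  parentId: number | null; // null marks a top-level heading
  cost: number;            // direct cost entered on this item
}

interface Estimate {
  id: number;
  items: EstimateItem[];
}

// Stand-in for Neon: an in-memory "pantry" keyed by estimate id.
const pantry = new Map<number, Estimate>([
  [42, {
    id: 42,
    items: [
      { id: 1, parentId: null, cost: 0 },
      { id: 2, parentId: 1, cost: 100 },
      { id: 3, parentId: 1, cost: 50 },
      { id: 4, parentId: 3, cost: 25 },
    ],
  }],
]);

// Stand-in for the auth step: a real check would validate the Microsoft token.
function isAuthorised(token: string | undefined): boolean {
  return token === "valid-demo-token";
}

// Service step: add each item's cost to itself and to every ancestor.
// Assumes the parent links contain no cycles.
function rollUp(items: EstimateItem[]): Map<number, number> {
  const byId = new Map<number, EstimateItem>(
    items.map((i): [number, EstimateItem] => [i.id, i]),
  );
  const totals = new Map<number, number>();
  for (const item of items) {
    let current: EstimateItem | undefined = item;
    while (current) {
      totals.set(current.id, (totals.get(current.id) ?? 0) + item.cost);
      current = current.parentId === null ? undefined : byId.get(current.parentId);
    }
  }
  return totals;
}

interface ApiResponse {
  status: number;
  body: unknown;
}

// Route step: mirrors GET /api/v1/estimates/:id.
function getEstimate(id: number, token?: string): ApiResponse {
  if (!isAuthorised(token)) return { status: 401, body: { error: "unauthorised" } };
  const estimate = pantry.get(id);
  if (!estimate) return { status: 404, body: { error: "not found" } };
  const totals = rollUp(estimate.items);
  return { status: 200, body: { id, totals: Object.fromEntries(totals) } };
}
```

Calling `getEstimate(42, "valid-demo-token")` walks steps 3 to 7 in order. In production the roll-up would more likely run in SQL (the recursive CTEs mentioned in Section 1) than in application memory.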

How a Background Job Flows (A Slow-Cook Order)

1. User clicks “Export PDF” — SPA sends POST /api/v1/estimates/42/export
2. NestJS adds a job to BullMQ (order pinned to the prep board)
3. API responds immediately: “we’re preparing your export” (waiter tells customer it’s coming)
4. A BullMQ worker picks up the job, generates the PDF (kitchen hand works on it)
5. When done, user gets notified and can download it (dish brought to the table)
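
The pattern behind these steps is "record the job, answer straight away, do the slow work later". The sketch below shows that pattern with an in-memory array standing in for BullMQ and Redis, so it is not the BullMQ API. The 202 status code and all names (`requestExport`, `runWorkerOnce`, `jobStatus`) are assumptions made for illustration.

```typescript
// In-memory stand-in for the prep board. In the real app BullMQ and Redis
// would hold the jobs; every name here is hypothetical.

type JobStatus = "queued" | "done";

interface ExportJob {
  id: number;
  estimateId: number;
  status: JobStatus;
  fileName?: string;
}

const jobs: ExportJob[] = [];
let nextJobId = 1;

// API step: mirrors POST /api/v1/estimates/:id/export. It only records the
// job and returns at once, so the user is never blocked on PDF generation.
function requestExport(estimateId: number): { status: number; jobId: number } {
  const job: ExportJob = { id: nextJobId++, estimateId, status: "queued" };
  jobs.push(job);
  return { status: 202, jobId: job.id }; // 202 Accepted is an assumed choice
}

// Worker step: picks up every queued job and does the slow work.
// Returns how many jobs it processed on this pass.
function runWorkerOnce(): number {
  let processed = 0;
  for (const job of jobs) {
    if (job.status !== "queued") continue;
    job.fileName = `estimate-${job.estimateId}.pdf`; // placeholder for PDF rendering
    job.status = "done";
    processed++;
  }
  return processed;
}

// Lookup used to tell the user when the export is ready to download.
function jobStatus(jobId: number): ExportJob | undefined {
  return jobs.find((j) => j.id === jobId);
}
```

The API call and the worker never call each other directly; they only share the queue. That separation is what lets the worker run in a different process, which is the role Redis plays in the proposed setup.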

3. Platforms & Hosting

TL;DR: Three accounts to manage — GitHub (code), Render (runs the app), Neon (database). We control everything, no dependency on Oxcon’s IT. Login with Oxcon’s Microsoft accounts still works from anywhere. Disaster recovery exceeds the spec targets (RPO ~zero, RTO under 1 hour). Code is organised as a Turborepo monorepo with shared TypeScript types.

Three platforms to manage. That’s it.

Platform	What it hosts	Kitchen term	Estimated cost
GitHub	Code, CI/CD, pull requests	The recipe book + the kitchen inspector	Free (private repos, 2000 CI minutes/mo)
Render	NestJS app + React SPA + Redis	The restaurant building	~NZ$85–170/mo
Neon	PostgreSQL database + branches	The pantry (with instant cloning)	~NZ$120–200/mo

Why Render?

361 Coders manages multiple client projects. Deploying on a client’s Azure tenant would mean depending on their IT team for permissions, deployments, and debugging. Render gives the 361 team full control from a single dashboard.

SSO with Oxcon’s M365 works regardless of where the app is hosted — it’s just OAuth2 redirects and tokens. The app can be anywhere.
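
To make the "just OAuth2 redirects" point concrete, here is a rough sketch of the redirect URL a sign-in would start with. Auth.js is expected to build this internally, so the function is illustrative only. The tenant id, client id, and callback path are placeholders, not real oxFlow values.

```typescript
// Builds the redirect URL for Microsoft's OAuth2 authorization endpoint.
// Tenant, client id, and callback values passed in are placeholders.

interface SsoConfig {
  tenantId: string;
  clientId: string;
  redirectUri: string; // where Microsoft sends the user back to
}

function buildAuthorizeUrl(config: SsoConfig, state: string): string {
  const url = new URL(
    `https://login.microsoftonline.com/${encodeURIComponent(config.tenantId)}/oauth2/v2.0/authorize`,
  );
  url.search = new URLSearchParams({
    client_id: config.clientId,
    response_type: "code",          // authorization code flow
    redirect_uri: config.redirectUri,
    response_mode: "query",
    scope: "openid profile email",
    state,                          // opaque value echoed back to the app
  }).toString();
  return url.toString();
}
```

The only host-specific value is `redirectUri`. Moving the app from Render to another host would mean registering a new redirect URI with Microsoft, not changing how sign-in works.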

Why Not Azure/AWS?

Concern	Answer
“But Oxcon uses M365”	SSO works from any host. Auth.js redirects to Microsoft’s login page — doesn’t matter where the app runs
“Azure is more enterprise”	Oxcon cares that data is secure and the app works, not which logo is on the server
“We might need Azure later”	NestJS runs anywhere Node.js runs. Migration is straightforward if data residency requirements emerge
Complexity	Azure/AWS requires VPCs, security groups, IAM policies, container registries. Weeks of DevOps setup vs hours on Render

Why Neon Specifically?

Neon’s key feature is database branching: instant, copy-on-write clones of the production database. This directly solves the staging data requirement (see Section 6). None of the other managed PostgreSQL services compared below offers branching with data at this level.

Database platform	Branching	Staging snapshot story
Neon	Yes (instant, milliseconds)	One command, instant clone, pay only for diffs
Supabase	Schema only (no data)	Must seed data manually or script pg_dump/restore
AWS RDS	No	Script pg_dump → anonymize → pg_restore (minutes to hours)
Azure PostgreSQL	No	Same manual dump/restore workflow

See also: 2026-04-17-database-analysis.md for the full database comparison.

Disaster Recovery

The NFR specifies RPO <24h and RTO <4h. The selected platforms exceed both targets:

Target	Requirement	Platform capability
RPO (data loss window)	< 24 hours	Neon provides continuous point-in-time recovery (PITR) — RPO is effectively zero. Any committed transaction can be recovered
RTO (time to restore)	< 4 hours	Render redeploys from git in under 10 minutes. Neon database branches restore in seconds. Combined RTO is well under 1 hour

Specific Plans

Platform	Plan	Why this tier
Neon	Scale ($69 USD base)	Includes autoscaling compute, 10 branches, connection pooling, PITR. Free tier has connection limits that would break production
Render	Standard ($25 USD/service)	Includes zero-downtime deploys, health checks, persistent disk, managed Redis. Starter tier lacks production reliability features

Monorepo Structure

NestJS (backend) and React (frontend) share TypeScript types and validation schemas. The repo uses Turborepo + pnpm workspaces to manage this:

oxflow/
├── apps/
│   ├── api/          ← NestJS backend
│   └── web/          ← React SPA
├── packages/
│   └── shared/       ← TypeScript types, validation schemas, constants
├── turbo.json
├── pnpm-workspace.yaml
└── package.json

Turborepo gives selective builds (only rebuild what changed), shared type checking across apps, and parallel task execution in CI.
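
As a rough idea of what `packages/shared` might hold, the sketch below defines one input type and one validator that both `apps/api` and `apps/web` could import. The type name, fields, and validator are hypothetical; only the three role names come from this document.

```typescript
// Hypothetical contents of packages/shared: one definition used by both
// apps/api (to validate requests) and apps/web (to type forms).

const ROLES = ["Admin", "Lead Estimator", "Estimator"] as const;
type Role = (typeof ROLES)[number];

interface CreateEstimateInput {
  tenderId: number;
  title: string;
  ownerRole: Role;
}

// Type guard: narrows an unknown value to one of the known roles.
function isRole(value: unknown): value is Role {
  return typeof value === "string" && (ROLES as readonly string[]).includes(value);
}

// Validates untrusted input and returns a typed, trimmed copy.
function parseCreateEstimateInput(raw: unknown): CreateEstimateInput {
  if (typeof raw !== "object" || raw === null) throw new Error("expected an object");
  const { tenderId, title, ownerRole } = raw as Record<string, unknown>;
  if (typeof tenderId !== "number" || !Number.isInteger(tenderId)) {
    throw new Error("tenderId must be an integer");
  }
  if (typeof title !== "string" || title.trim() === "") {
    throw new Error("title is required");
  }
  if (!isRole(ownerRole)) throw new Error("ownerRole is not a known role");
  return { tenderId, title: title.trim(), ownerRole };
}
```

Because both apps import the same definition, adding a field or a role in one place causes a type error wherever the other app has not caught up. A schema library could replace the hand-written checks; that choice is left open here.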

4. Git Branching Strategy

TL;DR: Developers build features on their own branch, merge to staging for Oxcon to review, then merge to main to go live. Emergency fixes skip staging and go straight to production.

Two permanent branches: main (the real restaurant) and staging (the test kitchen).

main (production — the real restaurant, live customers)
│
├── staging (test kitchen — Oxcon reviews new dishes here)
│
├── feature/OXF-42-adjudication-workflow
├── feature/OXF-58-recipe-builder
├── bugfix/OXF-71-rate-rollup-rounding
└── hotfix/OXF-89-login-crash

The Flow

Think of it as: a chef develops a new dish → tests it in the test kitchen → client tastes it → if approved, it goes on the real menu.

Developer branches from main (learns from the real menu)
        │
        ▼
feature/OXF-42 ── PR ──▶ staging (code review + auto-deploy to test kitchen)
                            │
                      Oxcon reviews staging.oxflow.3sixtyone.co
                            │
                      client approves the new dish
                            │
                            ▼
                    staging ── merge ──▶ main (goes on the real menu)
                                          │
                                    auto-deploy to oxflow.3sixtyone.co

Rules

Rule	Kitchen analogy	Why
Always branch from `main`	Learn from the real menu, not the experimental one	Prevents inheriting half-finished work from other developers
PRs target `staging`	New dishes go to the test kitchen first	Team code review + client review before production
CI must pass before merge	Kitchen inspector checks the dish	Lint, types, tests — catches problems before they reach staging
Short-lived branches (1-5 days)	Don’t hog a station for weeks	Reduces merge conflicts, keeps work small and reviewable
Staging → main after client approval	Approved dish goes on the real menu	Client has seen and signed off on what’s being shipped
Hotfixes go directly to main	Fire in the real kitchen — fix it now	Branch from main, PR to main, then merge main → staging to sync
If a feature is rejected, revert it on staging	Remove the failed dish from the test kitchen	Keep staging clean before merging to main

Branch Naming

feature/OXF-{ticket}-{short-description}    e.g. feature/OXF-42-adjudication-workflow
bugfix/OXF-{ticket}-{short-description}     e.g. bugfix/OXF-71-rate-rollup-rounding
hotfix/OXF-{ticket}-{short-description}     e.g. hotfix/OXF-89-login-crash
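
The naming convention could be enforced in CI or a pre-push hook with a small check like the one below. The pattern assumes lowercase, hyphen-separated descriptions, which is inferred from the three examples rather than stated as a rule, and the function name is hypothetical.

```typescript
// Matches feature/OXF-42-adjudication-workflow and similar names.
// Lowercase kebab-case descriptions are an assumption based on the examples.
const BRANCH_PATTERN = /^(feature|bugfix|hotfix)\/OXF-(\d+)-([a-z0-9]+(?:-[a-z0-9]+)*)$/;

interface ParsedBranch {
  kind: "feature" | "bugfix" | "hotfix";
  ticket: number;
  description: string;
}

// Returns the parsed parts of a valid branch name, or null if it does not
// follow the convention.
function parseBranchName(branch: string): ParsedBranch | null {
  const match = BRANCH_PATTERN.exec(branch);
  if (!match) return null;
  return {
    kind: match[1] as ParsedBranch["kind"],
    ticket: Number(match[2]),
    description: match[3] ?? "",
  };
}
```

Returning the parsed ticket number, rather than a plain boolean, lets the same function feed the Slack messages that mention tickets such as OXF-42.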

Why Not GitFlow?

GitFlow uses five branch types (main, develop, release/*, hotfix/*, feature/*) and requires dual merges on every release and hotfix. It was designed for quarterly releases — not continuous deployment. For a 3-5 person team shipping a web app, the ceremony adds merge conflicts without proportional value.

5. CI/CD Pipeline

TL;DR: Every time a developer submits code, GitHub automatically checks it for bugs, security issues, and style problems. If anything fails, the code can’t go live. Deployments and test results get posted to #oxflow in Slack.

All CI/CD runs on GitHub Actions. Notifications go to the #oxflow Slack channel.

On Every PR (Targeting Staging)

The kitchen inspector checks the dish before it enters the test kitchen.

Step	What it checks	Runs in
Lint + Format	Code style (ESLint + Prettier)	~30s
Type check	TypeScript compiler (`tsc --noEmit`)	~30s
Unit tests	Business logic (Vitest)	~1-2min
Integration tests	API + database (Vitest + Neon dev branch)	~2-3min
Build check	Both API and SPA compile successfully	~1-2min
Security scan	Known vulnerabilities (CodeQL or Trivy)	~1-2min

All steps run in parallel where possible. PR cannot merge unless all pass.

On Merge to Staging

New dish enters the test kitchen.

1. Build API + SPA
2. Create fresh Neon branch from production (instant clone of the pantry)
3. Run anonymization on the new branch (swap real labels for fake ones)
4. Run pending database migrations
5. Deploy to Render staging environment
6. Post to #oxflow Slack: "Staging updated — OXF-42 adjudication workflow ready for review"
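
One property worth making explicit is that the order matters: if anonymization fails, the deploy step should not run against the raw clone. The sketch below models that as a stop-on-first-failure runner. The halt behaviour is an assumption rather than something the pipeline description states, and the step names and stubs are hypothetical stand-ins for GitHub Actions steps.

```typescript
// Sketch of the staging deploy order. Each step is a hypothetical stub; in CI
// these would be workflow steps calling neonctl, the migration tool, and Render.

interface Step {
  name: string;
  run: () => void; // throws on failure
}

interface PipelineResult {
  ok: boolean;
  completed: string[];
  failedStep?: string;
}

// Runs steps in order and stops at the first failure, so a later step
// (such as deploy) never runs after an earlier step has failed.
function runPipeline(steps: Step[]): PipelineResult {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step.name);
    } catch {
      return { ok: false, completed, failedStep: step.name };
    }
  }
  return { ok: true, completed };
}

// The six staging steps in the documented order. Only the anonymize step is
// injectable here, to show what happens when it fails.
function stagingSteps(anonymize: () => void): Step[] {
  const noop = (): void => {};
  return [
    { name: "build", run: noop },
    { name: "create-neon-branch", run: noop },
    { name: "anonymize", run: anonymize },
    { name: "migrate", run: noop },
    { name: "deploy-staging", run: noop },
    { name: "notify-slack", run: noop },
  ];
}
```

GitHub Actions already stops a job when a step fails by default, so in practice this is mostly about keeping the steps in one job and in this order.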

On Merge Staging → Main (Production Deploy)

Approved dish goes on the real menu.

1. Build API + SPA
2. Run migrations on Neon main branch
3. Deploy to Render production environment
4. Run smoke tests against production
5. Post to #oxflow Slack: "Production deployed — OXF-42 adjudication workflow is live"

Hotfix Flow

Fire in the real kitchen — fix it now, don’t go through the test kitchen.

1. Branch from main
2. PR targets main directly
3. Same CI checks run
4. On merge: deploy to production immediately
5. Post to #oxflow Slack: "HOTFIX deployed — OXF-89 login crash fixed"
6. Merge main → staging to keep test kitchen current

6. Database Snapshots & Staging Data

TL;DR: When we deploy to staging, we take an instant copy of the live database and automatically replace sensitive info (supplier names, dollar values, margins) with fake data. Audit logs are excluded from masking to preserve immutability. Oxcon reviews against realistic data without anyone seeing confidential pricing. Database migrations are forward-only — if we need to undo a change, we write a new migration rather than rolling back.

This is the “test kitchen pantry” — how staging gets realistic data without exposing real commercial information.

The Problem

Staging needs realistic data to test against. But production data contains sensitive information — supplier rates, tender values, commercial margins. We can’t just copy it raw.

The Solution: Neon Branch + Anonymization

Think of it as: clone the real pantry, then swap all the labels so nobody can tell which supplier provided which ingredients.

Production DB (Neon main) — real pantry
        │
        │ neonctl branches create --name staging-20260420
        │ (instant copy-on-write — milliseconds)
        │
        ▼
Raw staging branch (exact copy of prod data)
        │
        │ Anonymization script runs:
        │ ── Supplier names → "Supplier A", "Supplier B", "Supplier C"
        │ ── Tender values → randomized within ±15%
        │ ── Commercial margins → randomized
        │ ── User emails → user-{hash}@test.oxflow.3sixtyone.co
        │ ── Referential integrity preserved (same masked ID everywhere)
        │
        ▼
Anonymized staging branch
        │
        │ Pending migrations applied (if any new schema changes)
        │
        ▼
Staging database ready — API connects to this branch

What Gets Anonymized

Data type	Treatment	Why
Company names (suppliers, subcontractors)	Deterministic pseudonyms	Commercial sensitivity
Tender dollar values	Randomized within ±15%	Pricing confidentiality
Commercial rules (margins, markups)	Randomized	Competitive advantage
Resource rates (from Price Book)	Randomized within ±15%	Supplier pricing
User emails	Masked to `user-{hash}@test.oxflow.3sixtyone.co`	Privacy
Estimate items, quantities, units	Kept as-is	Needed for realistic testing
Headings, hierarchy structure	Kept as-is	Needed for realistic testing
Codes, categories, units of measure	Kept as-is	Reference data, not sensitive

Tooling

PostgreSQL Anonymizer 2.0 — a Postgres extension that runs inside the database. Uses deterministic hash-based masking with a secret salt, meaning the same input always produces the same masked output. This preserves joins and relationships across tables.
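
The idea of deterministic, salted masking can be shown in a few lines of application code. This is not the PostgreSQL Anonymizer API, which runs inside the database; it is a sketch of the same principle using Node's built-in HMAC. The function names are hypothetical, and deriving the ±15% shift from the hash (so it is repeatable) is an assumption, since the text above only says values are randomized.

```typescript
import { createHmac } from "node:crypto";

// Deterministic digest: the same salt and input always give the same output.
function digest(salt: string, value: string): Buffer {
  return createHmac("sha256", salt).update(value).digest();
}

// Replaces a supplier name with a stable pseudonym, so joins still line up.
function pseudonymiseSupplier(salt: string, supplierName: string): string {
  return `Supplier ${digest(salt, supplierName).toString("hex").slice(0, 8)}`;
}

// Masks an email into the user-{hash}@test.oxflow.3sixtyone.co form.
function maskEmail(salt: string, email: string): string {
  const hash = digest(salt, email.toLowerCase()).toString("hex").slice(0, 12);
  return `user-${hash}@test.oxflow.3sixtyone.co`;
}

// Shifts an amount by a factor in [0.85, 1.15]. The factor is derived from
// the hash of the key, so the result is repeatable for a given salt and key.
function jitterAmount(salt: string, key: string, amount: number): number {
  const fraction = digest(salt, key).readUInt32BE(0) / 0xffffffff; // 0..1
  const factor = 0.85 + fraction * 0.3;
  return Math.round(amount * factor * 100) / 100; // round to cents
}
```

Because the output depends on the salt, the salt has to stay secret and stable. Anyone who knows it can test guesses against the masked values, and changing it changes every pseudonym.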

Audit Log Policy

Audit tables are excluded from anonymization on staging branches. The 7-year immutable audit log requirement means these records must never be mutated. On staging, audit data either retains its original values (acceptable only while access is limited to the 361 team) or is truncated entirely whenever the branch is client-facing, which includes the Oxcon staging reviews listed in Section 7 as well as demos.

Migration Strategy

Database migrations are forward-only with compensating migrations for rollback. For a financial system where cost calculations, snapshots, and audit logs depend on schema integrity, reversible migrations risk data corruption. If a migration needs to be undone, a new forward migration is written that compensates for the change.
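
A toy model may help show what "compensating migration" means in practice. Here the schema is just a set of column names and the migration ids and descriptions are invented. With Drizzle the real migrations would be SQL files, so this is a sketch of the rule, not of the tooling.

```typescript
// Toy model of forward-only migrations. The schema is a set of column names;
// migration ids and contents are invented for illustration.

interface Migration {
  id: string;
  description: string;
  apply: (schema: Set<string>) => void;
}

const migrations: Migration[] = [
  {
    id: "0001",
    description: "add estimates.discount column",
    apply: (schema) => { schema.add("estimates.discount"); },
  },
  {
    // Compensating migration: undoes 0001 by moving forward, not by rolling back.
    id: "0002",
    description: "drop estimates.discount column",
    apply: (schema) => { schema.delete("estimates.discount"); },
  },
];

// Applies every migration not yet recorded, in order, and returns the new
// applied-id log. There is deliberately no "down" path.
function migrate(schema: Set<string>, applied: string[], all: Migration[]): string[] {
  const done = new Set(applied);
  const log = [...applied];
  for (const migration of all) {
    if (done.has(migration.id)) continue;
    migration.apply(schema);
    log.push(migration.id);
  }
  return log;
}
```

The applied log only ever grows. After the undo, history reads `0001, 0002` rather than looking as if `0001` never happened, which fits the audit requirements described above.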

Lifecycle

Fresh branch created on every merge to staging
Previous staging branch is automatically deleted by CI
Branches are cheap — copy-on-write, you only pay for the data that changes after branching
Each developer can also create personal Neon branches for local development
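
The lifecycle above implies two small decisions in CI: what to call the new branch and which old ones to delete. A sketch, assuming the `staging-YYYYMMDD` naming seen in the earlier `neonctl` example and UTC dates (both function names are hypothetical):

```typescript
// Builds a name like staging-20260420 from a date, using UTC so the result
// does not depend on the CI runner's time zone (an assumed choice).
function stagingBranchName(date: Date): string {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return `staging-${year}${month}${day}`;
}

// Picks the older staging branches to delete. Anything that does not match
// the staging-YYYYMMDD pattern (main, personal dev branches) is left alone.
function branchesToDelete(existing: string[], current: string): string[] {
  return existing.filter(
    (branch) => /^staging-\d{8}$/.test(branch) && branch !== current,
  );
}
```

A date-only name collides if staging is refreshed twice in one day. Adding a time or a short commit hash would avoid that; the convention is left open here.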

7. Environments Summary

TL;DR: Three versions of the app — production (real users), staging (client review with fake data), and local dev (each developer’s machine).

Three environments, like three versions of the restaurant.

Environment	URL	Kitchen analogy	Deploys from	Database	Who uses it
Production	`oxflow.3sixtyone.co`	The real restaurant	`main` branch	Neon main	Oxcon estimators (real work)
Staging	`staging.oxflow.3sixtyone.co`	The test kitchen	`staging` branch	Neon branch (snapshot + anonymized)	Oxcon reviewers + 361 team
Local dev	`localhost:3000`	A chef’s home kitchen	Feature branch	Neon dev branch or local Docker PG	Individual developers

8. Notifications (Slack #oxflow)

TL;DR: A dedicated #oxflow Slack channel gets automatic updates whenever code is deployed, tests fail, or staging is ready for client review.

All automated notifications go to a dedicated #oxflow Slack channel.

Event	Message
PR opened	“PR #42 opened: Adjudication workflow — ready for code review”
CI failed	“CI failed on PR #42: unit tests — 3 failures in adjudication.service.spec.ts”
Staging deployed	“Staging updated with PR #42 — review at staging.oxflow.3sixtyone.co”
Production deployed	“Production deployed — OXF-42 adjudication workflow is live”
Hotfix deployed	“HOTFIX deployed — OXF-89 login crash fixed”
Staging review requested	“OXF-42 ready for client review @ryan @greg @matt”
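
A CI step could post these messages through a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below is one possible shape. The event types and function names are hypothetical, the wording is simplified from the table above, and the webhook URL would come from CI secrets.

```typescript
// Hypothetical event shapes; the message wording is illustrative.
type DeployEvent =
  | { kind: "staging-deployed"; pr: number }
  | { kind: "production-deployed"; ticket: string; title: string }
  | { kind: "hotfix-deployed"; ticket: string; title: string }
  | { kind: "ci-failed"; pr: number; stage: string };

// Turns an event into the one-line message shown in the channel.
function slackText(event: DeployEvent): string {
  switch (event.kind) {
    case "staging-deployed":
      return `Staging updated with PR #${event.pr}: review at staging.oxflow.3sixtyone.co`;
    case "production-deployed":
      return `Production deployed: ${event.ticket} ${event.title} is live`;
    case "hotfix-deployed":
      return `HOTFIX deployed: ${event.ticket} ${event.title} fixed`;
    case "ci-failed":
      return `CI failed on PR #${event.pr}: ${event.stage}`;
  }
}

// Slack incoming webhooks accept a JSON body with a "text" field.
function slackPayload(event: DeployEvent): { text: string } {
  return { text: slackText(event) };
}

// Posts the message to a webhook URL supplied by the caller.
async function notify(webhookUrl: string, event: DeployEvent): Promise<void> {
  const response = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(slackPayload(event)),
  });
  if (!response.ok) throw new Error(`Slack webhook returned ${response.status}`);
}
```

Keeping message building separate from sending means the wording can be unit tested without touching the network.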

9. Open Questions

TL;DR: Five things need team/client input before we can finalise — domain, data residency, ticket tracking tool, staging review timeline, and budget approval.

Items that need team or client input before finalising.

Question	Who decides	Impact
Subdomain confirmation (oxflow.3sixtyone.co)	361 team	URLs for production and staging environments
Oxcon data residency requirements	Oxcon	May require hosting migration to Azure if NZ data residency is mandated
Ticket tracking tool (Asana / Linear / other)	361 team	Branch naming convention (OXF-{ticket}) depends on this
Staging review SLA	Oxcon + 361	How long does client have to review before staging is refreshed
Budget approval for Render + Neon	361 / Oxcon	Estimated ~NZ$205–370/mo for hosting + database

10. Cost Estimate

TL;DR: ~NZ$205–370/mo to start on production-grade plans (Neon Scale + Render Standard). Scales with usage, no big upfront spend. All pay-as-you-grow.

Estimated monthly costs at early stage (small team, moderate data).

Service	Plan	Estimated monthly	Notes
GitHub	Free (organisation plan)	$0	2000 CI minutes/mo included
Render	Standard (web service + managed Redis)	~NZ$85–170	Zero-downtime deploys, health checks, persistent disk
Neon	Scale ($69 USD base)	~NZ$120–200	Autoscaling compute, 10 branches, PITR, connection pooling
Domain	—	~NZ$25/yr	Only if a new domain is purchased; a subdomain of an existing domain adds no registration cost
Total		~NZ$205–370/mo	Grows with usage, not upfront

All prices converted from USD at ~1.70 NZD/USD. Actual exchange rate will vary.
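
The conversion and the total can be checked with a few lines of arithmetic. The sketch uses the ~1.70 rate and the plan prices quoted in this document; the helper names are hypothetical.

```typescript
// Exchange rate stated in this document (approximate).
const NZD_PER_USD = 1.7;

// Converts a USD price to NZD, rounded to cents.
function usdToNzd(usd: number): number {
  return Math.round(usd * NZD_PER_USD * 100) / 100;
}

interface CostRange {
  low: number;
  high: number;
}

// Adds monthly cost ranges together, low with low and high with high.
function addRanges(...ranges: CostRange[]): CostRange {
  return ranges.reduce(
    (sum, range) => ({ low: sum.low + range.low, high: sum.high + range.high }),
    { low: 0, high: 0 },
  );
}

const neonBaseNzd = usdToNzd(69);          // Neon Scale base: 117.3
const renderPerServiceNzd = usdToNzd(25);  // Render Standard, per service: 42.5

const renderMonthly: CostRange = { low: 85, high: 170 };
const neonMonthly: CostRange = { low: 120, high: 200 };
const monthlyTotal = addRanges(renderMonthly, neonMonthly); // { low: 205, high: 370 }
```

Two Render services at NZ$42.50 each give the NZ$85 lower bound, and the Neon base of NZ$117.30 rounds up to the NZ$120 lower bound. The domain is billed yearly and is not part of the monthly total.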

These are conservative (high-side) estimates. Actual costs will likely come in lower, especially in early stages with low traffic and small data volumes. All platforms are pay-as-you-grow — no large upfront commitments.
