Last updated June 2026

Case Study: Architecting an E-commerce Platform Before the First Commit — Cost Engineering, Risk Analysis, and an AUD $50/Month Launch Stack

Results at a Glance

  • A 330-feature platform scoped bottom-up to ~230 engineer-days — roughly 4 months for a 3-person team, with an 18% contingency built in rather than bolted on
  • Launch infrastructure costed at AUD 45–55/month including managed PostgreSQL with automated backups, point-in-time recovery, and failover
  • Every deferred component — Redis, search engine, microservices — has a written, measurable upgrade trigger instead of a vague "later"
  • 90 risks and 164 requirement gaps identified and documented before the first production commit, including two that would have corrupted money: webhook replays and concurrent oversell
  • A ~3,300-line working prototype validated the discount engine design before committing 14 engineer-days to the production version

The Project

Pick N Collect is an Australian click-and-collect and delivery platform with an unusual basket: weekly groceries, a case of wine, and a restaurant meal in a single checkout. Woolworths does groceries, Dan Murphy's does alcohol, Uber Eats does restaurants — nobody does all three in one order. The plan: three Next.js apps (storefront, admin, restaurant portal), a Go API, PostgreSQL, Firebase Auth, Stripe, and Cloudflare R2 for images.

Most case studies on this site are about systems already in production. This one is about the work that happens before — and it's work I think gets skipped far too often. Before the first production commit, this project had a full architecture decision record, a costed hosting plan with scale triggers, a bottom-up effort estimate, a compliance register, a risk analysis, and a working prototype. That up-front engineering is the case study.

Sizing the Build Honestly

The feature tracker lists 330 features across the storefront, admin dashboard, restaurant portal, and Go API. Instead of a finger-in-the-air "about six months", the estimate was built bottom-up: each module sized from its open feature count, assuming mid/senior engineers and reuse from the prototype.

Component                        Effort (engineer-days)
Go API (13 modules)              68
3 Next.js apps + shared pkgs     97
Infra, QA, E2E, launch           29
Contingency (~18%)               35
Total                            ~230 (range 200–270)

That translates to about 4 months for a 3-person squad, 5.5–6 months for two engineers. The two largest single line items told us where the risk lived before any code did: the promotions engine (31 features, 14 days) and Stripe webhook orchestration (14 features, 7 days). Both got designed first, behind tests, rather than discovered late.
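The arithmetic behind those figures is easy to check. The sketch below sums the four line items and converts engineer-days into elapsed months; the figure of 21 working days per month and the perfectly parallel team are my assumptions for illustration, not numbers from the plan.

```go
package main

import "fmt"

// lineItem is one row of the bottom-up estimate, in engineer-days.
type lineItem struct {
	name string
	days int
}

// estimate holds the four line items quoted above.
var estimate = []lineItem{
	{"Go API (13 modules)", 68},
	{"3 Next.js apps + shared pkgs", 97},
	{"Infra, QA, E2E, launch", 29},
	{"Contingency (~18%)", 35},
}

// workingDaysPerMonth is an assumption, not a figure from the plan.
const workingDaysPerMonth = 21.0

// totalDays sums the engineer-days across all line items.
func totalDays(items []lineItem) int {
	sum := 0
	for _, it := range items {
		sum += it.days
	}
	return sum
}

// calendarMonths converts engineer-days into elapsed months for a team,
// assuming the work parallelises perfectly (it never quite does).
func calendarMonths(days, engineers int) float64 {
	return float64(days) / float64(engineers) / workingDaysPerMonth
}

func main() {
	total := totalDays(estimate)
	fmt.Printf("total: %d engineer-days\n", total)
	fmt.Printf("3 engineers: %.1f months\n", calendarMonths(total, 3))
	fmt.Printf("2 engineers: %.1f months\n", calendarMonths(total, 2))
}
```

Under those assumptions the raw division gives roughly 3.6 months for three engineers and 5.5 for two, which is why the quoted figures round up: coordination overhead is not free.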

The AUD $50/Month Launch Stack

A pre-revenue platform shouldn't carry a post-revenue bill. The launch stack was costed line by line in AUD:

Line item                                        Cost (AUD)
DigitalOcean Droplet (SYD1, Go API + Caddy)      ~$18/mo
DO Managed PostgreSQL 15 (SYD1)                  ~$23/mo
Cloudflare R2 (images, no egress fees)           ~$2/mo
Domain                                           ~$3.50/mo
Vercel, Cloudflare Pages, Firebase Auth,
GitHub Actions, Sentry (free tiers)              $0
Total                                            ~AUD 45–55/mo

The obvious challenge: Oracle Cloud's Always Free tier offers 4 ARM cores and 24 GB of RAM for $0 — why pay anything? Three reasons, all documented in the decision record. Oracle frequently can't provision that ARM capacity in Sydney and reclaims idle free-tier instances. The database would be self-managed on a single VM with a daily pg_dump — up to 24 hours of data loss for a payments business. And there's no failover: VM crash means downtime. Free is the right price for dev and staging, and that's where it got used.

The $23/month managed PostgreSQL is the single highest-ROI line in the budget: automated backups, point-in-time recovery, and failover in under a minute — capabilities that cost real engineering time to self-host badly. The rest of the region decision was simple process of elimination: Hetzner, Render, and Railway have no Australian region, and Fly.io's "Postgres" is operate-it-yourself. Australian data residency plus single-digit-millisecond latency from the Sydney edge settled it.

The plan also costs the next two phases so growth doesn't arrive as a surprise: roughly AUD 200–350/month once GMV passes $100K (bigger droplet, Postgres standby and read replica, Redis), and AUD 2,000–8,000/month at the $1M–$10M GMV stage, where the platform migrates to AWS ap-southeast-2 — Aurora multi-AZ, ECS Fargate, CloudFront. Each tier is a decision made in advance with its trigger written down, not a 3 a.m. scramble.

Every Deferral Gets a Written Trigger

The architecture says no to a lot of fashionable components — but every "no" is really a "not until", with a measurable threshold:

  • No Redis at launch. Catalog endpoints ship Cache-Control: public, s-maxage=60, stale-while-revalidate=600 plus Next.js ISR. Redis arrives when PostgreSQL connection waits exceed 10ms at p95, or one endpoint sustains 100 req/s.
  • No Algolia or Typesense. PostgreSQL full-text search with a tsvector column and GIN index handles ~50K products comfortably; the launch catalog is a fraction of that. The trigger is 50K+ SKUs or typo-tolerance becoming a real competitive issue. One non-obvious detail: the FTS uses the simple dictionary rather than english, because stemming mangles Australian brand names.
  • No WebSockets. Order notifications only need server-to-browser push, so Server-Sent Events do the job with plain HTTP. WebSockets get reconsidered when live driver GPS tracking — genuinely bidirectional, high-frequency — lands on the roadmap.
  • No microservices. A Go monolith on a 2 GB droplet handles thousands of requests per second for this workload. The split order is even pre-decided: notification service first (it's I/O-heavy and stateless), then dispatch, then catalog — triggered at 10K+ orders/day or when separate team ownership emerges.
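The cache policy in the first bullet can be applied in one place rather than per handler. This is a minimal sketch using only the Go standard library; the middleware name, the /products path, and the choice to skip non-GET methods are illustrative assumptions, not the project's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// catalogCacheControl is the header value quoted in the text.
const catalogCacheControl = "public, s-maxage=60, stale-while-revalidate=600"

// withCatalogCache sets the shared-cache policy on read requests only,
// so writes are never cached at the edge. (Hypothetical helper name.)
func withCatalogCache(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodGet || r.Method == http.MethodHead {
			w.Header().Set("Cache-Control", catalogCacheControl)
		}
		next.ServeHTTP(w, r)
	})
}

// cacheHeaderFor exercises the middleware in memory and returns the
// Cache-Control header produced for the given HTTP method.
func cacheHeaderFor(method string) string {
	h := withCatalogCache(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "[]") // stand-in for a catalog response body
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, "/products", nil))
	return rec.Header().Get("Cache-Control")
}

func main() {
	fmt.Printf("GET:  %q\n", cacheHeaderFor(http.MethodGet))
	fmt.Printf("POST: %q\n", cacheHeaderFor(http.MethodPost))
}
```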

Writing the trigger down matters more than the deferral itself. "We'll add Redis when we need it" becomes an argument six months later; "when p95 connection wait exceeds 10ms" becomes a dashboard alert.
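One way to keep the triggers honest is to hold them as data next to the metrics they watch. The sketch below encodes only the numeric thresholds from the list above; the metrics struct, its field names, and the sample values are hypothetical, and the qualitative triggers (typo tolerance, team ownership, driver tracking) are deliberately left out because they need a human decision.

```go
package main

import "fmt"

// metrics is a hypothetical snapshot of the numbers the triggers watch.
type metrics struct {
	pgConnWaitP95Ms float64 // p95 PostgreSQL connection wait, in ms
	endpointRPS     float64 // highest sustained req/s on a single endpoint
	skuCount        int
	ordersPerDay    int
}

// trigger pairs a deferred component with its written upgrade condition.
type trigger struct {
	component string
	condition string
	fired     func(metrics) bool
}

// triggers mirrors the numeric thresholds stated in the text.
var triggers = []trigger{
	{"Redis", "p95 connection wait exceeds 10ms", func(m metrics) bool { return m.pgConnWaitP95Ms > 10 }},
	{"Redis", "one endpoint sustains 100 req/s", func(m metrics) bool { return m.endpointRPS >= 100 }},
	{"Search engine", "catalog reaches 50K SKUs", func(m metrics) bool { return m.skuCount >= 50000 }},
	{"Microservices", "10K+ orders/day", func(m metrics) bool { return m.ordersPerDay >= 10000 }},
}

// firedTriggers returns a description of every trigger whose threshold is met.
func firedTriggers(m metrics) []string {
	var out []string
	for _, t := range triggers {
		if t.fired(m) {
			out = append(out, t.component+": "+t.condition)
		}
	}
	return out
}

func main() {
	launch := metrics{pgConnWaitP95Ms: 2.5, endpointRPS: 12, skuCount: 4000, ordersPerDay: 150}
	fmt.Println("at launch:", firedTriggers(launch))

	busy := launch
	busy.pgConnWaitP95Ms = 14
	fmt.Println("under load:", firedTriggers(busy))
}
```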

Multi-Tenancy with Postgres Row-Level Security

Restaurant partners log into their own portal and must only ever see their own orders. Rather than trusting every developer to remember a WHERE restaurant_id = ? clause forever, tenant isolation is enforced in the database itself with row-level security:

-- Migration: enable RLS on restaurant-scoped tables
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
  USING (restaurant_id = current_setting('app.tenant_id')::uuid);

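-- Go side: every restaurant-scoped query runs inside a
-- transaction that pins the tenant from the verified JWT.
-- SET LOCAL cannot take bind parameters, so use set_config();
-- its third argument (true) scopes the value to this transaction.
if _, err := tx.Exec(ctx,
    "SELECT set_config('app.tenant_id', $1, true)", claims.TenantID); err != nil {
    return err
}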

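A forgotten WHERE clause now returns only the calling restaurant's rows instead of another restaurant's orders, and a query that runs without a pinned tenant errors out rather than leaking rows. This holds only if the API connects as a role the policy applies to: table owners bypass row-level security unless FORCE ROW LEVEL SECURITY is set, and superusers always bypass it. The acceptance test is written into the plan verbatim: a restaurant A token must not be able to read restaurant B's orders. That is asserted in CI, not assumed.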

Designing for the Failures That Corrupt Money

The risk analysis surfaced 90 items, but two stood out because they silently corrupt financial state rather than throwing errors.

Webhook replays. Stripe delivers webhooks at least once. Without an idempotency guard, every replay of checkout.session.completed double-decrements stock and double-credits loyalty points. The guard is a single atomic state transition:

-- Atomic claim: only one webhook delivery wins
UPDATE orders SET status = 'paid'
WHERE id = $1 AND status = 'pending_payment';

-- RowsAffected == 0 → already processed, skip all side effects.
-- A processed_stripe_events table adds an audit trail.

-- Integration test: replay the same event 3 times,
-- assert stock decrements exactly once.
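On the Go side the guard could look roughly like the sketch below. The claimer interface, the in-memory stand-in, and the order ID are hypothetical names for illustration; only the conditional UPDATE comes from the design above. The in-memory version exists so the replay behaviour can be demonstrated without a database.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"sync"
)

// claimer abstracts the atomic pending_payment -> paid transition.
type claimer interface {
	ClaimPaid(ctx context.Context, orderID string) (bool, error)
}

// sqlClaimer runs the conditional UPDATE from the text against PostgreSQL.
type sqlClaimer struct{ db *sql.DB }

func (c sqlClaimer) ClaimPaid(ctx context.Context, orderID string) (bool, error) {
	res, err := c.db.ExecContext(ctx,
		`UPDATE orders SET status = 'paid'
		 WHERE id = $1 AND status = 'pending_payment'`, orderID)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err // zero rows: another delivery already won
}

// memClaimer is an in-memory stand-in with the same semantics, for demos.
type memClaimer struct {
	mu     sync.Mutex
	status map[string]string
}

func (c *memClaimer) ClaimPaid(_ context.Context, orderID string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.status[orderID] != "pending_payment" {
		return false, nil
	}
	c.status[orderID] = "paid"
	return true, nil
}

// handlePaid runs side effects only when this delivery won the claim.
func handlePaid(ctx context.Context, c claimer, orderID string, sideEffects func() error) error {
	won, err := c.ClaimPaid(ctx, orderID)
	if err != nil || !won {
		return err // replay or failure: skip all side effects
	}
	return sideEffects()
}

// replay delivers the same event n times and returns the remaining stock.
func replay(n, stock int) int {
	c := &memClaimer{status: map[string]string{"order-1": "pending_payment"}}
	for i := 0; i < n; i++ {
		_ = handlePaid(context.Background(), c, "order-1", func() error {
			stock-- // stand-in for stock deduction and loyalty credit
			return nil
		})
	}
	return stock
}

func main() {
	fmt.Println("stock after 3 deliveries:", replay(3, 10)) // prints 9
}
```

In a real handler the claim and its side effects would presumably share one database transaction, so a crash between the two cannot leave an order marked paid with its stock untouched.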

Concurrent oversell. Between order creation (stock check) and the payment webhook (stock deduction), two buyers can both pass the check on a one-unit product. The design reserves stock at order time with SELECT FOR UPDATE on the product row, and if deduction still fails at webhook time, the order is automatically refunded through Stripe — not parked for a staff member to notice. The load-test exit criterion is specific: 20 concurrent checkouts on a one-unit product produce exactly one success and nineteen HTTP 409s.
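The exit criterion can be rehearsed in miniature before any load-testing tool is involved. In this sketch a mutex plays the role that the row lock from SELECT ... FOR UPDATE plays in PostgreSQL; the inventory type, the SKU name, and the error value are hypothetical, and a real test would go through the HTTP API and expect 409 responses.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errOutOfStock is what the API layer would translate into HTTP 409.
var errOutOfStock = errors.New("insufficient stock")

// inventory is an in-memory stand-in for the product table. The mutex
// serialises check-and-decrement the way a row lock does in the database.
type inventory struct {
	mu    sync.Mutex
	stock map[string]int
}

// reserve checks and decrements stock as one atomic step.
func (inv *inventory) reserve(sku string, qty int) error {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if inv.stock[sku] < qty {
		return errOutOfStock
	}
	inv.stock[sku] -= qty
	return nil
}

// raceCheckouts launches one goroutine per buyer against a product with
// the given number of units and counts successes and conflicts.
func raceCheckouts(buyers, units int) (ok, conflicts int) {
	inv := &inventory{stock: map[string]int{"sku-1": units}}
	var wg sync.WaitGroup
	var mu sync.Mutex // guards the two counters
	for i := 0; i < buyers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := inv.reserve("sku-1", 1)
			mu.Lock()
			defer mu.Unlock()
			if err == nil {
				ok++
			} else {
				conflicts++
			}
		}()
	}
	wg.Wait()
	return ok, conflicts
}

func main() {
	ok, conflicts := raceCheckouts(20, 1)
	fmt.Printf("successes=%d conflicts=%d\n", ok, conflicts) // successes=1 conflicts=19
}
```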

Smaller but the same spirit: Cloudflare R2 doesn't enforce a max size on presigned PUT uploads by default, so the upload design pins ContentLength in the presign call and backs it with a bucket-level transform rule — otherwise anyone with an upload URL can store arbitrarily large files on your bill.

Compliance as an Architecture Input

Selling alcohol online in Australia without a state liquor licence is a criminal offence — NSW penalties exceed $11,000 per incident. That's not a legal footnote; it shapes the schema and the API. Products carry an alcohol_pct column, and order creation rejects any cart containing alcohol unless age verification is recorded — enforced server-side as a blocking check, because a front-end modal is trivially bypassed and legally insufficient on its own.
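The blocking check itself is small. A sketch of how it might sit in the order-creation path, with hypothetical type and field names (only alcohol_pct comes from the schema described above):

```go
package main

import (
	"errors"
	"fmt"
)

// cartLine carries the one product attribute this check needs.
// alcoholPct mirrors the alcohol_pct column; 0 means non-alcoholic.
type cartLine struct {
	sku        string
	alcoholPct float64
}

// errAgeVerificationRequired would be returned to the client as a
// rejected order, never silently downgraded to a warning.
var errAgeVerificationRequired = errors.New("cart contains alcohol but age verification is not recorded")

// validateCart rejects any cart that contains alcohol unless the
// customer's age verification has been recorded server-side.
func validateCart(lines []cartLine, ageVerified bool) error {
	if ageVerified {
		return nil
	}
	for _, l := range lines {
		if l.alcoholPct > 0 {
			return errAgeVerificationRequired
		}
	}
	return nil
}

func main() {
	cart := []cartLine{
		{sku: "bread-01", alcoholPct: 0},
		{sku: "shiraz-750", alcoholPct: 14.5},
	}
	fmt.Println("unverified:", validateCart(cart, false))
	fmt.Println("verified:  ", validateCart(cart, true))
}
```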

GST works the same way: every price is stored as GST-inclusive integer cents with a computed gst_cents per line item, and fresh groceries are flagged GST-free while alcohol and restaurant meals aren't. Even the loyalty program got a compliance review — uncapped redeemable points can legally become a stored-value facility requiring a financial services licence, so the design caps redemption at 20% of order value with 12-month expiry. Licensing lead times (4–8 weeks) went on the critical path next to the engineering milestones, because a finished platform that can't legally sell its highest-margin category isn't finished.
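Both rules reduce to integer arithmetic. Australian GST is 10%, so the GST component of a GST-inclusive price is one eleventh of it. The sketch below assumes round-to-nearest on that division and integer truncation on the loyalty cap; both rounding choices and the function names are my assumptions, not something the plan specifies.

```go
package main

import "fmt"

// gstCents returns the GST component of a GST-inclusive price in cents.
// With 10% GST the component is price/11. Adding 5 before the integer
// division rounds to the nearest cent (an odd divisor has no exact halves).
func gstCents(inclusiveCents int64, gstFree bool) int64 {
	if gstFree {
		return 0 // e.g. fresh groceries
	}
	return (inclusiveCents + 5) / 11
}

// maxRedeemableCents caps loyalty redemption at 20% of the order value,
// truncating any fraction of a cent.
func maxRedeemableCents(orderCents int64) int64 {
	return orderCents * 20 / 100
}

func main() {
	fmt.Println(gstCents(1100, false))    // 100: $11.00 inclusive carries $1.00 GST
	fmt.Println(gstCents(1100, true))     // 0: GST-free line
	fmt.Println(maxRedeemableCents(5000)) // 1000: at most $10.00 off a $50.00 order
}
```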

Prototype First, Estimate Second

Before the production plan was finalised, I built a ~3,300-line throwaway prototype: a Go API with SQLite, a Next.js storefront, a seeded catalog of 26 Australian products, and — the actual point — a working discount engine. It evaluates priority-sorted rules across three types (buy 5+ wines for 5% off, buy 6+ beers for 20% off snacks, spend $100+ for 10% off the cart) against the live cart with a real-time preview endpoint.
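To make the three rule types concrete, here is a compressed sketch of such an engine. It is not the prototype's code: the struct layout, the category names, and above all the stacking behaviour (every matching rule applies, each computed on pre-discount subtotals) are assumptions chosen to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

type item struct {
	category  string
	unitCents int64
	qty       int
}

// rule covers all three types: the trigger is a minimum quantity in a
// category or a minimum cart spend; the target is a category or the cart.
type rule struct {
	name            string
	priority        int    // lower evaluates first
	triggerCategory string // "" means the whole cart
	minQty          int
	minSpendCents   int64
	targetCategory  string // "" means the whole cart
	percentOff      int64
}

// demoRules encodes the three examples from the text.
var demoRules = []rule{
	{name: "10% off cart over $100", priority: 3, minSpendCents: 10000, percentOff: 10},
	{name: "5% off 5+ wines", priority: 1, triggerCategory: "wine", minQty: 5, targetCategory: "wine", percentOff: 5},
	{name: "20% off snacks with 6+ beers", priority: 2, triggerCategory: "beer", minQty: 6, targetCategory: "snacks", percentOff: 20},
}

// subtotal returns pre-discount cents and unit count for a category
// ("" selects every line in the cart).
func subtotal(cart []item, category string) (cents int64, qty int) {
	for _, it := range cart {
		if category == "" || it.category == category {
			cents += it.unitCents * int64(it.qty)
			qty += it.qty
		}
	}
	return cents, qty
}

// preview evaluates rules in priority order and returns the total
// discount plus the names of the rules that applied.
func preview(cart []item, rules []rule) (discountCents int64, applied []string) {
	sorted := append([]rule(nil), rules...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].priority < sorted[j].priority })
	for _, r := range sorted {
		spend, qty := subtotal(cart, r.triggerCategory)
		if qty < r.minQty || spend < r.minSpendCents {
			continue // trigger not met
		}
		base, _ := subtotal(cart, r.targetCategory)
		if d := base * r.percentOff / 100; d > 0 {
			discountCents += d
			applied = append(applied, r.name)
		}
	}
	return discountCents, applied
}

func main() {
	cart := []item{
		{category: "wine", unitCents: 2000, qty: 6},
		{category: "beer", unitCents: 500, qty: 6},
		{category: "snacks", unitCents: 400, qty: 2},
	}
	total, applied := preview(cart, demoRules)
	fmt.Println(total, applied) // 2340 cents: 600 + 160 + 1580
}
```

With additive stacking, priority only affects the order in which applied rules are reported. An engine where rules compound or exclude one another would make priority load-bearing, which is exactly the kind of decision a prototype forces into the open.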

The prototype paid for itself twice. It made the effort estimate credible — 14 engineer-days for the production promotions engine is a much safer number once you've built one. And it surfaced the most dangerous open decision in the project: the prototype's flat discount_rules table conflicts with the normalised seven-table promotions schema the production spec needs. Caught now, that's a design discussion. Caught after migrations ship, it's a data migration with money attached.

Where the Project Stands

To be clear about status — because a case study that oversells is worthless: the architecture, cost plan, compliance register, risk analysis, and prototype are done; the production build is phased (foundation → transactable MVP → full operations → growth) and in its early stages. The monorepo, Go service skeleton with RBAC middleware, and CI/CD layout exist; the bulk of the 312 open features are ahead, which is exactly what the 230-day estimate says. I'll update this page as the phases land.

What I'd Do Differently

  • Settle the promotions schema before writing the prototype, not after. The flat-table shortcut was right for a demo, but it leaked into early spec documents and became a conflict that had to be formally resolved — prototypes are disposable, the assumptions they plant are not.
  • Engage the liquor-licensing solicitor in week one. The 4–8 week licence lead time is longer than the entire infrastructure setup, and it gates revenue on the highest-margin category.
  • Treat the compliance register as a living budget item. Privacy policy, GST registration, and terms of service each carry real effort that early estimates tend to file under "misc".
