Production Deployment — Cloudflare + Railway

What this doc answers

After reading this you can take Zapvol to production from scratch and understand why each choice was made. This is not a comparison of hosting options; see Architecture Overview for that.

Key parameters

Item	Value
Frontend hosting	Cloudflare Pages
Backend hosting	Railway
Database	Railway Postgres plugin
Queue / Cache	Railway Redis plugin
File storage	Cloudflare R2
Log backend	Grafana Cloud Loki
Server image	Single Dockerfile, shared by server / worker
Starting monthly cost	$5–15

Why Cloudflare + Railway

After elimination:

Vercel doesn’t fit the server: @hono/node-ws needs persistent WebSocket connections, BullMQ workers need long-lived processes. Vercel Functions is a serverless model — neither runs there.
Cloudflare Workers doesn’t fit the server: @hono/node-server / ioredis / postgres aren’t compatible with the Workers runtime. Rewriting all of them isn’t free.
Fly.io recommends Upstash Redis — Upstash has connection-count and per-command pricing constraints that fight BullMQ’s blocking-read + persistent-TCP workload. Not a great match.
Railway gives real Postgres + real Redis containers, with server / worker / DB / cache all sharing a private network *.railway.internal. Zero egress charges, low latency, and monorepo multi-service support.

Why not Railway for the frontend too? Cloudflare Pages has a far denser CDN edge and serves static assets for free.

Topology

Four Railway services + two CF Pages projects:

Service	Platform	Role	Public ingress
`marketing`	Cloudflare Pages	Marketing site (Astro)	`zapvol.com`
`web`	Cloudflare Pages	Web app (Vite SPA)	`app.zapvol.com`
`server`	Railway	API + WebSocket (Hono)	`api.zapvol.com`
`worker`	Railway	BullMQ queue consumer	none
`postgres`	Railway plugin	Primary database	private only
`redis`	Railway plugin	BullMQ + cache	private only

External dependencies: Anthropic / OpenAI (agent inference), Cloudflare R2 (file storage), Grafana Cloud Loki (logs).

All deployment artifacts live under docker/:

docker/
  ├─ Dockerfile                     multi-stage build
  ├─ docker-compose.yaml            self-hosted deploy + local prod-mirror (pulls ghcr image)
  ├─ docker-compose.build.yaml      override — build the image locally instead of pulling
  ├─ railway.server.json            Railway server service config
  └─ railway.worker.json            Railway worker service config

docker/Dockerfile produces one image; the two services reuse it, each with its own startCommand:

docker/Dockerfile  (build context = repo root)
  ├─ Stage 1-3: install + build (pnpm workspaces, tsup)
  ├─ Stage 4: prod-only deps
  └─ Stage 5: production image
       └─ CMD ["node", "apps/server/dist/index.mjs"]   ← default: server

Railway service "server"  →  startCommand: node apps/server/dist/index.mjs
Railway service "worker"  →  startCommand: node apps/server/dist/worker.mjs

Why not run both processes in one CMD: a worker crash takes the server down with it; scaling granularity is also coarser. Two services with independent restart policies and scaling tiers is the better shape.

Two Railway configs:

docker/railway.server.json    ← server service
docker/railway.worker.json    ← worker service

Server config — key fields:

{
  "build": { "builder": "DOCKERFILE", "dockerfilePath": "docker/Dockerfile" },
  "deploy": {
    "startCommand": "node apps/server/dist/index.mjs",
    "preDeployCommand": "node apps/server/dist/db-migrate.mjs",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 30,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 5
  }
}

preDeployCommand runs once before the new release starts, dedicated to Drizzle migrations. The worker config does not have one — running DDL from two processes simultaneously creates lock contention; migrations belong on the server side only.
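The "migrations on the server side only" rule can be checked by hand before a push. The sketch below is only an illustration: the RailwayConfig type and the checkRailwayConfigs function are hypothetical names, and the check looks only at the fields shown in the server config above.

```typescript
// Hypothetical shape: only the fields this doc shows in railway.server.json.
interface RailwayConfig {
  build?: { builder?: string; dockerfilePath?: string };
  deploy?: {
    startCommand?: string;
    preDeployCommand?: string;
    healthcheckPath?: string;
  };
}

// Returns a list of problems; an empty list means the two configs are wired as described.
function checkRailwayConfigs(server: RailwayConfig, worker: RailwayConfig): string[] {
  const problems: string[] = [];
  if (!server.deploy?.preDeployCommand) {
    problems.push("server config is missing preDeployCommand (migrations would never run)");
  }
  if (worker.deploy?.preDeployCommand) {
    problems.push("worker config declares preDeployCommand (migrations belong on the server only)");
  }
  if (server.build?.dockerfilePath !== worker.build?.dockerfilePath) {
    problems.push("server and worker build from different Dockerfiles");
  }
  if (server.deploy?.startCommand === worker.deploy?.startCommand) {
    problems.push("server and worker share the same startCommand");
  }
  return problems;
}
```

In practice the two arguments would come from JSON.parse on docker/railway.server.json and docker/railway.worker.json; a non-empty result means the files were cross-wired.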

Local prod-mirror (run this before pushing to Railway)

docker/docker-compose.yaml replicates Railway’s service topology (postgres + redis + migrate + server + worker) and is the same compose used for self-hosted deploys — not a separate “test-only” config.

By default it pulls ghcr.io/zapvol/zapvol-server:${VERSION:-latest}. To test code that isn’t published yet, overlay docker-compose.build.yaml so Compose builds the image from docker/Dockerfile on the spot:

# From the repo root — local-build (the usual dev flow)
docker compose -f docker/docker-compose.yaml -f docker/docker-compose.build.yaml up -d --build

# Or pull from ghcr (verifying a published version)
docker compose -f docker/docker-compose.yaml up -d

# Healthcheck
curl http://localhost:8001/health   # → {"ok":true}

Startup ordering is enforced by depends_on: postgres → migrate (one-shot, runs Drizzle migrations and exits) → server / worker. These are the same semantics as Railway’s preDeployCommand → server start.

Credentials (AI / R2 / OAuth) live in the repo-root .env (gitignored); Compose loads them automatically. At minimum set BETTER_AUTH_SECRET and one AI provider key — otherwise the server boots fine but agent calls return 401. The full variable list is in .env.example.
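Because a missing key only surfaces later as a 401 on agent calls, it can help to fail fast at boot instead. The following is a minimal sketch under that assumption, not the server's actual startup code: checkMinimumEnv is a hypothetical helper, and the two provider key names are taken from the variable list in this doc.

```typescript
// Provider key names as they appear in this doc's variable list.
const AI_PROVIDER_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"] as const;

// Returns human-readable problems; an empty list means the minimum set is present.
function checkMinimumEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  if (!env["BETTER_AUTH_SECRET"]?.trim()) {
    problems.push("BETTER_AUTH_SECRET is not set");
  }
  // One provider key is enough; blank values count as missing.
  const hasProvider = AI_PROVIDER_KEYS.some((key) => Boolean(env[key]?.trim()));
  if (!hasProvider) {
    problems.push(`set at least one of: ${AI_PROVIDER_KEYS.join(", ")}`);
  }
  return problems;
}
```

Calling checkMinimumEnv(process.env) at the top of the entrypoint and exiting on a non-empty result would turn the silent 401 into an immediate, readable startup error.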

Deploy timeline

A single git push triggers the full pipeline. The CF Pages and Railway builds start in parallel (the same push webhook reaches both); the diagram serializes them for readability.

sequenceDiagram
    actor Dev as Developer
    participant GH as GitHub
    participant CF as Cloudflare Pages
    participant RW as Railway Build
    participant Mig as preDeploy (migrate)
    participant Srv as server service
    participant Wrk as worker service
    Dev->>GH: git push origin main
    GH-->>CF: webhook (marketing / web)
    GH-->>RW: webhook (server / worker)
    rect rgba(96, 165, 250, 0.18)
        Note over CF,CF: ① CF Pages build
        CF->>CF: pnpm build (Vite / Astro)
        CF-->>Dev: edge deployed
    end
    rect rgba(251, 191, 36, 0.18)
        Note over RW,Wrk: ② Railway pipeline
        RW->>RW: docker build (one image)
        RW->>Mig: preDeployCommand (server only)
        Mig->>Mig: drizzle migrate
        Mig-->>RW: exit 0
        RW->>Srv: start (rolling)
        Srv-->>RW: /health 200
        RW->>Srv: cut traffic to new version
        RW->>Wrk: start (replace)
    end
    Note over Dev,Wrk: any step failure → Railway auto-rolls back to previous image

Robustness comes from three things: (1) migration runs in exactly one place and any failure rolls back the whole release; (2) server only takes traffic after /health passes; (3) worker only restarts after server has succeeded.
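The health gate in point (2) boils down to "poll until healthy or give up". The sketch below illustrates that idea only; it is not how Railway implements it. waitForHealthy and the Probe type are hypothetical names, and the probe is injected so the same loop could wrap a fetch against the local prod-mirror.

```typescript
// A probe resolves to true when the service reports healthy
// (for example, GET /health returned 200).
type Probe = () => Promise<boolean>;

// Polls `probe` every `intervalMs` until it succeeds or `timeoutMs` elapses.
async function waitForHealthy(
  probe: Probe,
  timeoutMs: number,
  intervalMs: number,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    // A probe that throws (connection refused during boot) counts as "not healthy yet".
    const healthy = await probe().catch(() => false);
    if (healthy) return true;
    // Give up if the next attempt would start after the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Against the local compose stack, a probe could be as small as an async function that fetches http://localhost:8001/health and returns whether the response status is 200.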

Step-by-step deploy

First-time onboarding

1. Create the Railway project

# In Railway dashboard:
# 1. New Project → Deploy from GitHub repo → pick zapvol
# 2. Inside project → Add Service → Database → PostgreSQL (plugin)
# 3. Add Service → Database → Redis (plugin)

2. Configure the server service

The repo is already linked, so Railway auto-detects the Dockerfile. Manual tweaks:

Settings → Source → Config-as-code Path: docker/railway.server.json
Settings → Source → Watch Paths: apps/server/**, packages/backend/**, packages/common/**, docker/Dockerfile, pnpm-lock.yaml
Settings → Networking → Public Networking: enable, bind api.zapvol.com

3. Duplicate as the worker service

Railway lets you Duplicate an existing service inside the same project so you don’t redo the GitHub link:

Add Service → From existing service → server
Change Config-as-code Path to docker/railway.worker.json
Leave Networking off — the worker has no public ingress

4. Environment variables

Each service has its own Variables tab. Railway uses ${{ServiceName.VAR}} for cross-service references:

server — required:

NODE_ENV=production
PORT=8001

# Private connection strings — use the PRIVATE variant to avoid egress charges
DATABASE_URL=${{Postgres.DATABASE_PRIVATE_URL}}
REDIS_URL=${{Redis.REDIS_PRIVATE_URL}}

# better-auth
BETTER_AUTH_URL=https://api.zapvol.com
BETTER_AUTH_SECRET=<openssl rand -hex 32>
BETTER_AUTH_COOKIE_DOMAIN=.zapvol.com   # so the web subdomain shares sessions

# CORS — every public origin that hits api.zapvol.com
CORS_ORIGINS=https://app.zapvol.com,https://zapvol.com

# Model providers
ANTHROPIC_API_KEY=<...>
OPENAI_API_KEY=<...>

# File storage — R2
R2_ACCESS_KEY_ID=<...>
R2_SECRET_ACCESS_KEY=<...>
R2_BUCKET=zapvol-prod
R2_ENDPOINT=https://<account>.r2.cloudflarestorage.com

# Logs — Grafana Cloud
LOKI_URL=https://logs-prod-xxx.grafana.net
LOKI_USERNAME=<...>
LOKI_PASSWORD=<...>

worker is the same minus PORT / BETTER_AUTH_* / CORS_ORIGINS (no HTTP surface). Everything else — DB, Redis, AI keys, R2, Loki — is required.
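The "same minus the HTTP-only variables" rule can be expressed as a filter over the server's variable set. This is a sketch for illustration: isServerOnly and workerEnvFrom are hypothetical helpers, not part of the repo, and the exclusion list is exactly the three groups named above.

```typescript
// Variables the worker does not need because it has no HTTP surface.
function isServerOnly(name: string): boolean {
  return name === "PORT" || name === "CORS_ORIGINS" || name.startsWith("BETTER_AUTH_");
}

// Derives the worker's variable set from the server's by dropping HTTP-only entries.
function workerEnvFrom(serverEnv: Record<string, string>): Record<string, string> {
  const workerEnv: Record<string, string> = {};
  for (const [name, value] of Object.entries(serverEnv)) {
    if (!isServerOnly(name)) workerEnv[name] = value;
  }
  return workerEnv;
}
```

Comparing the output against the worker's Variables tab is a quick way to spot a key that was added to the server and forgotten on the worker.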

5. CF Pages

Two projects, both linked to the same GitHub repo:

CF Pages project	Repo path	Build command	Output dir	Custom domain
`zapvol-marketing`	`apps/marketing`	`pnpm install && pnpm --filter=marketing build`	`apps/marketing/dist`	`zapvol.com`
`zapvol-web`	`apps/web`	`pnpm install && pnpm --filter=web build`	`apps/web/dist`	`app.zapvol.com`

The web SPA needs an apps/web/public/_redirects for client-side routing fallback:

/*    /index.html   200

Without this, refreshing any non-root path 404s.

Environment variables (web only):

VITE_API_BASE_URL=https://api.zapvol.com
VITE_WS_URL=wss://api.zapvol.com/ws

6. DNS

In Cloudflare DNS:

zapvol.com         → CNAME zapvol-marketing.pages.dev   (proxied)
app.zapvol.com     → CNAME zapvol-web.pages.dev         (proxied)
api.zapvol.com     → CNAME <railway-server>.up.railway.app   (DNS only — important)

api.zapvol.com must be set to “DNS only”, not proxied. With the proxy on, every WebSocket passes through Cloudflare’s edge, which on the free tier can cut long-lived or idle connections, and the server depends heavily on WS. Going direct to Railway’s edge gives more stable long-lived connections.

Subsequent pushes

git push origin main triggers the full pipeline. Both CF Pages and Railway auto-deploy — no manual steps.

Rollback: Railway → service → Deployments → previous version → Redeploy. One click. Same on CF Pages.

What this setup does NOT do

No hand-rolled CI/CD: no GitHub Actions, ArgoCD, Jenkins. CF + Railway handle it.
No container orchestration: no Kubernetes, and no docker-compose on the Railway path (the compose file is only for self-hosted deploys and the local prod-mirror). Until traffic hits hundreds of thousands of QPS, Railway’s service model is enough.
No multi-region: the server is single-region. The frontend already gets global reach via CF’s edge; the complexity of cross-region replication (consistency, write routing) outweighs the win at this stage.
No serverless: see “Why Cloudflare + Railway” above for why Vercel and CF Workers were ruled out.
No self-hosted Postgres / Redis in the same container as the app. Use the Railway plugins. Backups, version upgrades, disk growth, and HA are someone else’s problem.

Pitfall checklist

Each item below corresponds to a real failure mode:

CF proxying WebSocket — covered above. api.zapvol.com must be DNS only. If WS connections drop ~30s in or wss:// handshakes 200 then close, check CF proxy first.
Postgres public vs private URL — DATABASE_URL defaults to the public hostname, routes through the public internet, and counts against egress. Use DATABASE_PRIVATE_URL for *.railway.internal.
preDeployCommand on the wrong service — workers must not run migrations. Don’t cross-wire the two railway.*.json files.
/health goes through the full middleware stack — currently it traverses pino-logger + cors + requestContext, so every healthcheck writes a log line. Loki cardinality is fine but it’s noisy. If that bothers you, register /health before those middlewares.
Drizzle migrations folder missing from the image — Stage 5 of Dockerfile must explicitly COPY --from=build /app/apps/server/drizzle/ apps/server/drizzle/. Without it, db-migrate fails to find migration files, the pre-deploy hook errors, and the entire release rolls back.
better-auth cookie domain — if BETTER_AUTH_COOKIE_DOMAIN isn’t .zapvol.com, the web app (app.zapvol.com) can’t read session cookies set by the server (api.zapvol.com).
CORS_ORIGINS missing marketing — if marketing has a “Try now” button calling api.zapvol.com directly, https://zapvol.com must also be in the allowlist.
Watch Paths too broad: Railway watches the whole repo by default. A marketing tweak then triggers a server rebuild and burns build minutes. Narrow Watch Paths to the entries listed in step 2 of the first-time onboarding.
Healthcheck timeout: railway.server.json sets healthcheckTimeout to 30 s. If cold starts load large skill packages via initToolRegistry() and exceed that, bump healthcheckTimeout to 60 s.
db-migrate is idempotent — it’s a stateless process that scans drizzle/ and compares against __drizzle_migrations every run, skipping applied ones. Safe to re-trigger.
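The cookie-domain and CORS items in this checklist are both string-matching rules, so they can be sanity-checked before a deploy. The helpers below are a hedged sketch with hypothetical names (parseCorsOrigins, cookieDomainCovers); the domain check follows the usual cookie domain-matching rule (exact host or a dot-separated suffix), and it does not reflect how better-auth or the server's CORS middleware are actually implemented.

```typescript
// Splits a comma-separated CORS_ORIGINS value into a set of exact origins.
// Whitespace and trailing slashes are dropped, since an Origin header never carries a path.
function parseCorsOrigins(raw: string): Set<string> {
  return new Set(
    raw
      .split(",")
      .map((origin) => origin.trim().replace(/\/+$/, ""))
      .filter((origin) => origin.length > 0),
  );
}

// True when a cookie with Domain=`cookieDomain` applies to `host`:
// either the host equals the domain, or it is a subdomain of it.
function cookieDomainCovers(cookieDomain: string, host: string): boolean {
  const domain = cookieDomain.replace(/^\./, "").toLowerCase();
  const normalizedHost = host.toLowerCase();
  return normalizedHost === domain || normalizedHost.endsWith("." + domain);
}
```

With the values from this doc, .zapvol.com covers both app.zapvol.com and api.zapvol.com, while a cookie scoped to api.zapvol.com alone would not reach the web app.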

Cost estimate

For current scale (< 100 DAU, single server / worker instance, single Postgres / Redis):

Item	Monthly
Railway Hobby plan (includes $5 usage credit)	$5
Railway actual usage (server + worker + DB + Redis at 256MB–1GB each)	$0–10
Cloudflare Pages (marketing + web)	$0
Cloudflare R2 (< 10 GB storage, < 1M ops)	$0
Grafana Cloud (free tier)	$0
Total	$5–15

Scaling up:

Each extra server replica ≈ +$3–5/month
Postgres at 4 GB RAM ≈ +$15/month
Once traffic really takes off, plan for ECS / Kubernetes — by which time the monthly bill starts in the hundreds.
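The arithmetic behind these numbers is simple enough to write down. The sketch below just adds up the ranges quoted in this section (USD per month); Range and estimateMonthly are hypothetical names, and the figures are the doc's estimates, not Railway's price list.

```typescript
interface Range {
  low: number;
  high: number;
}

const add = (a: Range, b: Range): Range => ({ low: a.low + b.low, high: a.high + b.high });

// Figures copied from the tables in this section (USD per month).
const HOBBY_PLAN: Range = { low: 5, high: 5 };
const BASE_USAGE: Range = { low: 0, high: 10 };
const PER_EXTRA_REPLICA: Range = { low: 3, high: 5 };
const POSTGRES_4GB: Range = { low: 15, high: 15 };

// Sums the baseline plus optional scale-ups into a low/high monthly estimate.
function estimateMonthly(extraReplicas: number, postgres4gb: boolean): Range {
  let total = add(HOBBY_PLAN, BASE_USAGE);
  for (let i = 0; i < extraReplicas; i++) {
    total = add(total, PER_EXTRA_REPLICA);
  }
  if (postgres4gb) {
    total = add(total, POSTGRES_4GB);
  }
  return total;
}
```

For example, estimateMonthly(0, false) gives the $5–15 baseline from the table, and two extra server replicas plus the 4 GB Postgres come to roughly $26–40.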