Architecture

Two pieces that exchange HMAC-signed requests:

   ┌───────────────────────────┐         ┌──────────────────────────────┐
   │   Control plane (Vercel)  │         │   Per-user VM (Fly.io)       │
   │                           │         │                              │
   │   apps/web (Next.js 16)   │ ──HMAC─►│   shell-mux  (:8080 public)  │
   │   @askrobin/db (Neon)     │         │   integrations  (loopback)   │
   │   @askrobin/billing       │ ◄─HMAC──│   dispatcher  (per-pane mux) │
   │   @askrobin/catalog       │         │   notifier  (outbox watcher) │
   │   @askrobin/core          │         │   sshd  (:443)               │
   └───────────────────────────┘         └──────────────────────────────┘
            ▲                                          ▲
            │                                          │
            └────── you (browser, SSH, chat) ──────────┘

Control plane

apps/web is a Next.js 16 App Router app on Vercel. Surfaces:

Marketing (/, /pricing, /privacy, /terms)
Signup + onboarding wizard
Cloud-shell SPA at /(app) — xterm.js connects to your VM's shell-mux via wss
Admin (/admin) gated by ADMIN_EMAILS
API routes for signup, secrets pass-through, ssh-keys, billing portal, OAuth broker, OAuth refresh, Stripe + Postmark webhooks
OAuth broker at /auth/start/[provider] and /auth/cb/[provider]

Auth.js v5 with Google sign-in and a JWT session. Drizzle on Neon Postgres for users, machines, subscriptions, oauth_sessions, audit_log, etc. The control plane never persists OAuth refresh tokens — just relays and forgets.
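The /admin gate listed above can be pictured as a simple allow-list check against the signed-in user's email. This is a minimal sketch, not the project's code: it assumes ADMIN_EMAILS is a comma-separated list and that matching is case-insensitive, and both function names are hypothetical.

```typescript
// Hypothetical sketch of an ADMIN_EMAILS allow-list check.
// Assumption: ADMIN_EMAILS looks like "a@example.com, b@example.com".
function parseAdminEmails(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((entry) => entry.trim().toLowerCase())
      .filter((entry) => entry.length > 0),
  );
}

// Returns true only when the session email appears in the allow-list.
// A missing email or an unset variable always yields false.
function isAdmin(
  email: string | null | undefined,
  rawAdminEmails: string | undefined,
): boolean {
  if (!email) return false;
  return parseAdminEmails(rawAdminEmails).has(email.trim().toLowerCase());
}
```

Failing closed when the variable is unset means a misconfigured deploy exposes no admin surface rather than an open one.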

Per-user VM

Built from infra/vm-image/. Multi-stage Dockerfile, runs Ubuntu 24.04 + Node 22 + Tailscale + tmux + ttyd + sshd + Claude Code + robin-assistant. Image is ~1.4 GB and lives on Fly.

Inside, a supervisor entrypoint (scripts/entrypoint.sh) runs in place of systemd-as-PID-1:

| Service | Port | Profile |
|---|---|---|
| shell-mux | :8080 (public) | always |
| integrations | 127.0.0.1:8081 | always |
| sshd | :443 | always |
| dispatcher | — | claimed only |
| notifier | — | claimed only |
| three scheduler loops | — | claimed only |

Two profiles, branched on INBOUND_KEY:

warm-pool — pre-spawned Fly machines waiting to be claimed. Skip the inner services to save RAM.
claimed — full stack. Triggered when the control plane pushes INBOUND_KEY and restarts the machine at signup.
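The profile decision can be sketched as follows. The real branch lives in scripts/entrypoint.sh; this TypeScript version only illustrates the logic. It assumes the profile is chosen by whether INBOUND_KEY is non-empty, and the type and function names are hypothetical.

```typescript
// Hypothetical sketch of the boot-time profile decision.
// Assumption: a non-empty INBOUND_KEY means the machine has been claimed.
type Profile = "warm-pool" | "claimed";

// Services from the table above, grouped by the profile that starts them.
const ALWAYS_ON = ["shell-mux", "integrations", "sshd"];
const CLAIMED_ONLY = ["dispatcher", "notifier", "scheduler-loops"];

function selectProfile(env: Record<string, string | undefined>): Profile {
  const key = env.INBOUND_KEY?.trim();
  return key ? "claimed" : "warm-pool";
}

// Lists the services the supervisor would start for a given profile.
function servicesFor(profile: Profile): string[] {
  return profile === "claimed"
    ? [...ALWAYS_ON, ...CLAIMED_ONLY]
    : [...ALWAYS_ON];
}
```

For example, `servicesFor(selectProfile(process.env))` would yield only the three always-on services on a warm-pool machine.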

The scheduler loops replace the classic systemd timers (Fly doesn't run systemd as PID 1):

robin run --due every 5 min
refresh-tokens.js every 30 min
update.sh daily, randomized 0–2h offset
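A sleep-and-run loop is one way to get timer behavior without systemd. The sketch below is illustrative only: the intervals come from the list above, but the LoopSpec shape, the function names, and the way the random offset is added to the daily interval are assumptions rather than the image's actual scripts.

```typescript
// Hypothetical sketch: compute the delay before each run of a scheduler loop.
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

interface LoopSpec {
  name: string;
  intervalMs: number; // fixed period between runs
  maxJitterMs: number; // upper bound of the random offset (0 = none)
}

// Intervals taken from the list above; the names are labels only.
const LOOPS: LoopSpec[] = [
  { name: "robin run --due", intervalMs: 5 * MINUTE_MS, maxJitterMs: 0 },
  { name: "refresh-tokens.js", intervalMs: 30 * MINUTE_MS, maxJitterMs: 0 },
  { name: "update.sh", intervalMs: 24 * HOUR_MS, maxJitterMs: 2 * HOUR_MS },
];

// rand is injectable so the jitter can be tested deterministically.
function nextDelayMs(spec: LoopSpec, rand: () => number = Math.random): number {
  return spec.intervalMs + Math.floor(rand() * spec.maxJitterMs);
}

// Runs a task forever, sleeping for the computed delay between runs.
// A failed run is logged and does not stop the loop.
async function runLoop(spec: LoopSpec, task: () => Promise<void>): Promise<void> {
  for (;;) {
    await new Promise((resolve) => setTimeout(resolve, nextDelayMs(spec)));
    await task().catch((err) => console.error(`${spec.name} failed:`, err));
  }
}
```

The random offset on update.sh spreads updates out so that a fleet of VMs does not hit the same upstream at the same moment.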

Anthropic auth modes

You pick one at signup; both work, the difference is who Anthropic bills.

Paste (default). Bring your own Anthropic API key. We never see it. Stored at user-data/secrets/anthropic.json on your VM. Claude Code reads it on launch. We have zero say in your usage.
Broker. Anthropic billing pass-through. We hold the key on the control plane and proxy your traffic. Per spec §16.1, this mode ships only after the ToS spike resolves.

Token-refresh dataflow

The most interesting cross-cutting piece. See Integrations → Token refresh for the user-facing version.

VM scheduler loop (every 30 min)
  └─ refresh-tokens.js scans user-data/secrets/*.json
     └─ for each near-expiring token:
        └─ POST CONTROL_PLANE_URL/api/oauth/refresh
           Headers: x-machine-id, x-robin-signature
           Body: { provider, refreshToken }
              ↓
           Control plane:
             1. look up machines.inbound_key by fly_machine_id
             2. verify HMAC over body
             3. catalog.getProvider(id) → broker client_id/secret env
             4. POST to provider's tokenEndpoint, grant_type=refresh_token
             5. return { tokens } — never persisted
              ↓
        VM writes tokens back to user-data/secrets/<provider>.json
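The signing and verification steps in the flow above can be sketched with Node's built-in crypto module. The source does not state the hash or encoding, so HMAC-SHA256 over the raw body with a hex-encoded signature is an assumption here, as are all three function names.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: HMAC-SHA256 over the exact request body, hex-encoded.
function signBody(inboundKey: string, rawBody: string): string {
  return createHmac("sha256", inboundKey).update(rawBody, "utf8").digest("hex");
}

// VM side (hypothetical helper): build the body and the two headers
// named in the diagram above.
function buildRefreshRequest(
  machineId: string,
  inboundKey: string,
  provider: string,
  refreshToken: string,
) {
  const body = JSON.stringify({ provider, refreshToken });
  return {
    headers: {
      "x-machine-id": machineId,
      "x-robin-signature": signBody(inboundKey, body),
    },
    body,
  };
}

// Control-plane side: recompute the signature and compare in constant time.
function verifySignature(
  inboundKey: string,
  rawBody: string,
  signatureHex: string,
): boolean {
  const expected = Buffer.from(signBody(inboundKey, rawBody), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

Verification has to run over the raw bytes as received, before any JSON parsing, because re-serializing the parsed body can change key order or whitespace and break the signature.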

Tested in two halves: the control-plane endpoint has 11 vitest cases covering HMAC gates, provider lookup, missing creds, success, refresh-token reuse, and provider failures. The VM-side script is exercised by the docker-build smoke test.

Why per-user VMs

A VM per user is the simplest model that gives you (a) real isolation, (b) a writable filesystem for user-data/, (c) a long-running tmux session that survives between turns, and (d) a place for incoming webhooks to land. We considered a single multi-tenant VM with a workspace per user, but the isolation story was weak enough that we backed out.

Cost stays manageable because Fly auto-suspends idle machines. The warm pool absorbs the cold-start hit, so first-time signup takes under 5 s.

Source

Code: github.com/kevinkiklee/askrobin.io

Spec: docs/spec.md, audit: docs/AUDIT.md, runbook: docs/RUNBOOK.md.