docs: add multi-machine support architecture design

Full WebSocket architecture spec for remote agent/terminal management: bterminal-relay binary, RemoteManager, NDJSON protocol, pre-shared token + TLS auth, 4-phase implementation plan (A-D).
2026-03-06 18:45:56 +01:00
commit 04a7a4bb94
parent 86fbe3e762
3 changed files with 304 additions and 2 deletions
--- a/TODO.md
+++ b/TODO.md
@@ -4,7 +4,7 @@

 - [ ] **Deno sidecar real-world testing** -- Integrated into sidecar.rs (Deno-first + Node.js fallback). Needs testing with real claude CLI and startup time benchmark vs Node.js.
 - [ ] **E2E testing (Playwright/WebDriver)** -- Scaffold at v2/tests/e2e/README.md. Needs display server to run. Test: open terminal, run command, open agent, verify output.
-- [ ] **Multi-machine support** -- Remote agents via WebSocket (Phase 7+ feature).
+- [ ] **Multi-machine support** -- Architecture designed in [docs/multi-machine.md](docs/multi-machine.md). Next: Phase A (extract bterminal-core crate), then Phase B (bterminal-relay binary).
 - [ ] **Agent Teams real-world testing** -- Frontend routing implemented (Phase 7). Needs testing with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 and real subagent spawning.

 ## Completed
--- a/docs/multi-machine.md
+++ b/docs/multi-machine.md
@@ -0,0 +1,302 @@
+# Multi-Machine Support — Architecture Design
+
+## Overview
+
+Extend BTerminal to manage Claude agent sessions and terminal panes running on **remote machines** over WebSocket, while keeping the local sidecar path unchanged.
+
+## Problem
+
+Current architecture is local-only:
+
+```
+WebView ←→ Rust (Tauri IPC) ←→ Local Sidecar (stdio NDJSON)
+                              ←→ Local PTY (portable-pty)
+```
+
+Target state: BTerminal acts as a **mission control** that observes agents and terminals running on multiple machines (dev servers, cloud VMs, CI runners).
+
+## Design Constraints
+
+1. **Zero changes to local path** — local sidecar/PTY must work identically
+2. **Same NDJSON protocol** — remote and local agents speak the same message format
+3. **No new runtime dependencies** — use the `tokio-tungstenite` crate, which runs on the `tokio` runtime that Tauri already uses
+4. **Graceful degradation** — remote machine goes offline → pane shows disconnected state, reconnects automatically
+5. **Security** — all remote connections authenticated and encrypted (TLS + token)
+
+## Architecture
+
+### Three-Layer Model
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│  BTerminal (Controller)                                          │
+│                                                                  │
+│  ┌──────────┐    Tauri IPC    ┌──────────────────────────────┐  │
+│  │ WebView  │ ←────────────→  │ Rust Backend                 │  │
+│  │ (Svelte) │                 │                              │  │
+│  └──────────┘                 │  ├── PtyManager (local)      │  │
+│                               │  ├── SidecarManager (local)  │  │
+│                               │  └── RemoteManager ──────────┼──┤
+│                               └──────────────────────────────┘  │
+│                                                                  │
+└──────────────────────────────────────────────────────────────────┘
+        │                                      │
+        │ (local stdio)                        │ (WebSocket wss://)
+        ▼                                      ▼
+  ┌───────────┐                    ┌──────────────────────┐
+  │ Local     │                    │ Remote Machine       │
+  │ Sidecar   │                    │                      │
+  │ (Deno/    │                    │  ┌────────────────┐  │
+  │  Node.js) │                    │  │ bterminal-relay│  │
+  │           │                    │  │ (Rust binary)  │  │
+  └───────────┘                    │  │                │  │
+                                   │  │ ├── PTY mgr   │  │
+                                   │  │ ├── Sidecar mgr│  │
+                                   │  │ └── WS server  │  │
+                                   │  └────────────────┘  │
+                                   └──────────────────────┘
+```
+
+### Components
+
+#### 1. `bterminal-relay` — Remote Agent (Rust binary)
+
+A standalone Rust binary that runs on each remote machine. It:
+
+- Listens on a WebSocket port (default: 9750)
+- Manages local PTYs and claude sidecar processes
+- Forwards NDJSON events to the controller over WebSocket
+- Receives commands (query, stop, resize, write) from the controller
+
+**Why a Rust binary?** It reuses the existing `PtyManager` and `SidecarManager` code from `v2/src-tauri/src/`, which is extracted into a shared `bterminal-core` crate.
+
+```
+bterminal-relay/
+├── Cargo.toml        # depends on bterminal-core
+├── src/
+│   └── main.rs       # WebSocket server + auth
+│
+bterminal-core/       # shared crate (extracted from src-tauri)
+├── Cargo.toml
+├── src/
+│   ├── pty.rs        # PtyManager (from v2/src-tauri/src/pty.rs)
+│   ├── sidecar.rs    # SidecarManager (from v2/src-tauri/src/sidecar.rs)
+│   └── lib.rs
+```
+
+#### 2. `RemoteManager` — Controller-Side (in Rust backend)
+
+New module in `v2/src-tauri/src/remote.rs`. Manages WebSocket connections to multiple relays.
+
+```rust
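+use std::collections::HashMap;
+use std::sync::{Arc, Mutex};
+
+/// One configured relay endpoint.
+pub struct RemoteMachine {
+    pub id: String,
+    pub label: String,
+    pub url: String,          // wss://host:9750
+    pub token: String,        // pre-shared auth token
+    pub status: RemoteStatus, // connected | connecting | disconnected | error
+}
+
+/// Connection state of a remote machine.
+pub enum RemoteStatus {
+    Connected,
+    Connecting,
+    Disconnected,
+    Error(String),
+}
+
+/// Placeholder for the WebSocket client handle; the concrete type is not specified in this design.
+pub struct WsConnection;
+
+/// Tracks configured machines and their live connections (map assumed to be keyed by machine id).
+pub struct RemoteManager {
+    machines: Arc<Mutex<Vec<RemoteMachine>>>,
+    connections: Arc<Mutex<HashMap<String, WsConnection>>>,
+}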
+```
+
+#### 3. Frontend Adapters — Unified Interface
+
+The frontend doesn't care whether a pane is local or remote. The bridge layer abstracts this:
+
+```typescript
+// adapters/agent-bridge.ts — extended
+export async function queryAgent(options: AgentQueryOptions): Promise<void> {
+  if (options.remote_machine_id) {
+    return invoke('remote_agent_query', { machineId: options.remote_machine_id, options });
+  }
+  return invoke('agent_query', { options });
+}
+```
+
+Same pattern for `pty-bridge.ts` — add optional `remote_machine_id` to all operations.
+
+## Protocol
+
+### WebSocket Wire Format
+
+Same NDJSON as local sidecar, wrapped in an envelope for multiplexing:
+
+```typescript
+// Controller → Relay (commands)
+interface RelayCommand {
+  id: string;                      // request correlation ID
+  type: 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
+      | 'agent_query' | 'agent_stop' | 'sidecar_restart'
+      | 'ping';
+  payload: Record<string, unknown>;
+}
+
+// Relay → Controller (events)
+interface RelayEvent {
+  type: 'pty_data' | 'pty_exit'
+      | 'sidecar_message' | 'sidecar_exited'
+      | 'state_sync'               // sent on reconnect (see Session Persistence)
+      | 'error' | 'pong' | 'ready';
+  sessionId?: string;
+  payload: unknown;
+}
+```
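
The envelope above could be serialized and parsed roughly as follows. This is a minimal sketch, not the project's implementation: `encodeCommand`, `decodeEvent`, and `EVENT_TYPES` are hypothetical names, and `state_sync` is included because the reconnect section uses it.

```typescript
// Hypothetical helpers illustrating the envelope; names are assumptions.
type RelayCommandType =
  | 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
  | 'agent_query' | 'agent_stop' | 'sidecar_restart' | 'ping';

interface RelayCommand {
  id: string;
  type: RelayCommandType;
  payload: Record<string, unknown>;
}

interface RelayEvent {
  type: string;
  sessionId?: string;
  payload: unknown;
}

// Event types listed in the design doc, plus state_sync from the reconnect section.
const EVENT_TYPES = new Set<string>([
  'pty_data', 'pty_exit', 'sidecar_message', 'sidecar_exited',
  'state_sync', 'error', 'pong', 'ready',
]);

// Serialize one command as a single NDJSON line.
// JSON.stringify escapes newlines, so the only raw "\n" is the terminator.
function encodeCommand(cmd: RelayCommand): string {
  return JSON.stringify(cmd) + '\n';
}

// Parse one line into an event; returns null for malformed JSON,
// non-object values, unknown event types, or a non-string sessionId.
function decodeEvent(line: string): RelayEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof parsed !== 'object' || parsed === null) return null;
  const obj = parsed as Record<string, unknown>;
  if (typeof obj.type !== 'string' || !EVENT_TYPES.has(obj.type)) return null;
  if (obj.sessionId !== undefined && typeof obj.sessionId !== 'string') return null;
  const event: RelayEvent = { type: obj.type, payload: obj.payload };
  if (typeof obj.sessionId === 'string') event.sessionId = obj.sessionId;
  return event;
}
```

Returning `null` instead of throwing lets the receive loop skip a bad frame without tearing down the connection; whether that is the desired policy is a design decision this document leaves open.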
+
+### Authentication
+
+1. **Pre-shared token** — relay starts with `--token <secret>`. Controller sends token in WebSocket upgrade headers (`Authorization: Bearer <token>`).
+2. **TLS required** — relay rejects non-TLS connections in production mode. Dev mode allows `ws://` with `--insecure` flag.
+3. **Token rotation** — future: relay exposes endpoint to rotate token. Controller stores tokens in SQLite settings table.
+
+### Connection Lifecycle
+
+```
+Controller                          Relay
+    │                                 │
+    │── WSS connect ─────────────────→│
+    │── Authorization: Bearer token ──→│
+    │                                 │
+    │←── { type: "ready", ...} ───────│
+    │                                 │
+    │── { type: "ping" } ────────────→│
+    │←── { type: "pong" } ────────────│  (every 15s)
+    │                                 │
+    │── { type: "agent_query", ... }──→│
+    │←── { type: "sidecar_message" }──│  (streaming)
+    │←── { type: "sidecar_message" }──│
+    │                                 │
+    │     (disconnect)                │
+    │── reconnect (exp backoff) ─────→│  (1s, 2s, 4s, 8s, max 30s)
+```
+
+### Reconnection
+
+- Controller reconnects with exponential backoff (1s → 30s cap)
+- On reconnect, relay sends current state snapshot (active sessions, PTY list)
+- Controller reconciles: updates pane states, re-subscribes to streams
+- Active agent sessions continue on relay regardless of controller connection
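
The backoff schedule described above (1s doubling to a 30s cap) can be expressed as a small pure function. This is a sketch under assumptions: `reconnectDelayMs` is a hypothetical name, and the design does not say whether jitter is added, so none is applied here.

```typescript
// Illustrative helper; the name and signature are assumptions.
// attempt is 0-based: 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at 30s.
function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// First seven delays: 1000, 2000, 4000, 8000, 16000, 30000, 30000
const schedule = Array.from({ length: 7 }, (_, i) => reconnectDelayMs(i));
```

The attempt counter would reset to 0 after a successful `ready` event, so a brief outage does not leave the controller stuck at the 30s cap.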
+
+## Session Persistence Across Reconnects
+
+Key insight: **remote agents keep running even when the controller disconnects**. The relay is autonomous — it doesn't need the controller to operate.
+
+On reconnect:
+1. Relay sends `{ type: "state_sync", payload: { activeSessions: [...], activePtys: [...] } }`
+2. Controller matches against known panes, updates status
+3. Missed messages are NOT replayed (too complex, marginal value). Agent panes show "reconnected — some messages may be missing" notice
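
Step 2 of the list above could look like the sketch below. The shapes are assumptions: the design elides the contents of `activeSessions` and `activePtys`, so this treats both as arrays of id strings, and it assumes each remote pane remembers the relay-side id (`sessionId`) it was created with. `reconcile`, `RemotePane`, and `PaneStatus` are hypothetical names.

```typescript
// Assumed shape of the state_sync data: lists of relay-side ids.
interface StateSync {
  activeSessions: string[];
  activePtys: string[];
}

// Assumed controller-side record of a pane bound to a relay session or PTY.
interface RemotePane {
  id: string;
  type: 'terminal' | 'agent';
  sessionId: string;
}

type PaneStatus = 'live' | 'gone';

// Match known panes against the relay's snapshot.
// Agent panes are looked up in activeSessions, terminal panes in activePtys.
function reconcile(panes: RemotePane[], sync: StateSync): Map<string, PaneStatus> {
  const sessions = new Set(sync.activeSessions);
  const ptys = new Set(sync.activePtys);
  const result = new Map<string, PaneStatus>();
  for (const pane of panes) {
    const alive = pane.type === 'agent'
      ? sessions.has(pane.sessionId)
      : ptys.has(pane.sessionId);
    result.set(pane.id, alive ? 'live' : 'gone');
  }
  return result;
}
```

Panes marked `live` would get the "reconnected" notice; how `gone` panes are presented (closed, greyed out, kept for scrollback) is not specified here.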
+
+## Frontend Integration
+
+### Pane Model Changes
+
+```typescript
+// stores/layout.svelte.ts
+export interface Pane {
+  id: string;
+  type: 'terminal' | 'agent';
+  title: string;
+  group?: string;
+  remoteMachineId?: string;  // NEW: undefined = local
+}
+```
+
+### Sidebar — Machine Groups
+
+Remote panes auto-group by machine label in the sidebar:
+
+```
+▾ Local
+  ├── Terminal 1
+  └── Agent: fix bug
+
+▾ devbox (192.168.1.50)      ← remote machine
+  ├── SSH session
+  └── Agent: deploy
+
+▾ ci-runner (10.0.0.5)       ← remote machine (disconnected)
+  └── Agent: test suite ⚠️
+```
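
The grouping shown above might be derived from the pane list roughly like this. It is a sketch only: `groupPanesByMachine` and `MachineInfo` are hypothetical, and the fallback label for an unknown machine id is an assumption.

```typescript
// Pane shape as given in the Pane Model Changes section.
interface Pane {
  id: string;
  type: 'terminal' | 'agent';
  title: string;
  group?: string;
  remoteMachineId?: string; // undefined = local
}

// Hypothetical minimal machine record used for the label lookup.
interface MachineInfo {
  id: string;
  label: string;
}

// Bucket panes by machine label; local panes go under "Local",
// which is always present and always listed first.
function groupPanesByMachine(panes: Pane[], machines: MachineInfo[]): Map<string, Pane[]> {
  const labels = new Map<string, string>();
  for (const m of machines) labels.set(m.id, m.label);

  const groups = new Map<string, Pane[]>([['Local', []]]);
  for (const pane of panes) {
    const key = pane.remoteMachineId === undefined
      ? 'Local'
      : labels.get(pane.remoteMachineId) ?? `Unknown (${pane.remoteMachineId})`;
    const bucket = groups.get(key) ?? [];
    bucket.push(pane);
    groups.set(key, bucket);
  }
  return groups;
}
```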
+
+### Settings Panel
+
+New "Machines" section in settings:
+
+| Field | Type | Notes |
+|-------|------|-------|
+| Label | string | Human-readable name |
+| URL | string | `wss://host:9750` |
+| Token | password | Pre-shared auth token |
+| Auto-connect | boolean | Connect on app launch |
+
+Stored as JSON in the SQLite `settings` table under the `remote_machines` key.
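
One way the stored value could be loaded and checked is sketched below. The JSON field names (`label`, `url`, `token`, `autoConnect`) are assumptions derived from the table above, not a confirmed schema, and `parseMachines` is a hypothetical helper. The URL check mirrors the TLS rule from the Authentication section: `wss://` only, unless insecure mode is explicitly allowed.

```typescript
// Hypothetical shape of one entry under the `remote_machines` key.
interface MachineSetting {
  label: string;
  url: string;
  token: string;
  autoConnect: boolean;
}

// Parse and validate the stored JSON array.
// Throws on malformed input or on ws:// URLs when allowInsecure is false.
function parseMachines(json: string, allowInsecure = false): MachineSetting[] {
  const raw: unknown = JSON.parse(json);
  if (!Array.isArray(raw)) throw new Error('remote_machines must be a JSON array');
  return raw.map((entry: unknown, i: number): MachineSetting => {
    if (typeof entry !== 'object' || entry === null) {
      throw new Error(`entry ${i} is not an object`);
    }
    const e = entry as Record<string, unknown>;
    if (typeof e.label !== 'string' || typeof e.url !== 'string' ||
        typeof e.token !== 'string' || typeof e.autoConnect !== 'boolean') {
      throw new Error(`entry ${i} has missing or mistyped fields`);
    }
    const secure = e.url.startsWith('wss://');
    const insecureOk = allowInsecure && e.url.startsWith('ws://');
    if (!secure && !insecureOk) {
      throw new Error(`entry ${i}: URL must use wss://`);
    }
    return { label: e.label, url: e.url, token: e.token, autoConnect: e.autoConnect };
  });
}
```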
+
+## Implementation Plan
+
+### Phase A: Extract `bterminal-core` crate
+
+- Extract `PtyManager` and `SidecarManager` into a shared crate
+- `src-tauri` depends on `bterminal-core` instead of owning the code
+- Zero behavior change — purely structural refactor
+- **Estimate:** ~2h of mechanical refactoring
+
+### Phase B: Build `bterminal-relay` binary
+
+- WebSocket server using `tokio-tungstenite`
+- Token auth on upgrade
+- Routes commands to `bterminal-core` managers
+- Forwards events back over WebSocket
+- Includes `--port`, `--token`, `--insecure` CLI flags
+- **Ships as:** single static Rust binary (~5MB), `cargo install bterminal-relay`
+
+### Phase C: Add `RemoteManager` to controller
+
+- New `remote.rs` module in `src-tauri`
+- Manages WebSocket client connections
+- Tauri commands: `remote_add`, `remote_remove`, `remote_connect`, `remote_disconnect`
+- Forwards remote events as Tauri events (same `sidecar-message` / `pty-data` events, tagged with machine ID)
+
+### Phase D: Frontend integration
+
+- Extend bridge adapters with `remote_machine_id` routing
+- Add machine management UI in settings
+- Add machine status indicators in sidebar
+- Add reconnection banner in pane chrome
+- Test with 2 machines (local + 1 remote)
+
+## Security Considerations
+
+| Threat | Mitigation |
+|--------|-----------|
+| Token interception | TLS required (reject `ws://` without `--insecure`) |
+| Token brute-force | Rate limit auth attempts (5/min), lockout after 10 failures |
+| Relay impersonation | Pin relay certificate fingerprint (future: mTLS) |
+| Command injection | Relay validates all command payloads against schema |
+| Lateral movement | Relay runs as unprivileged user, no shell access beyond PTY/sidecar |
+| Data exfiltration | Agent output streams to controller only, no relay-to-relay traffic |
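
The brute-force mitigation in the table (5 attempts per minute, lockout after 10 failures) could follow logic like the sketch below. It is shown in TypeScript for brevity; the relay itself is a Rust binary. `AuthLimiter` is a hypothetical name, clients are keyed by an opaque string such as a remote address, and because the design does not state a lockout duration, this version keeps a locked-out client blocked until the limiter is recreated.

```typescript
// Hypothetical limiter; time is injected (ms) so the logic is deterministic.
class AuthLimiter {
  private attempts = new Map<string, number[]>(); // client -> recent attempt timestamps
  private failures = new Map<string, number>();   // client -> consecutive failed auths
  private maxPerMinute: number;
  private lockoutAfter: number;

  constructor(maxPerMinute = 5, lockoutAfter = 10) {
    this.maxPerMinute = maxPerMinute;
    this.lockoutAfter = lockoutAfter;
  }

  // Returns true if the client may attempt auth at time `now`, and records the attempt.
  allow(client: string, now: number): boolean {
    if ((this.failures.get(client) ?? 0) >= this.lockoutAfter) return false;
    // Keep only attempts inside the trailing 60-second window.
    const recent = (this.attempts.get(client) ?? []).filter((t) => now - t < 60000);
    if (recent.length >= this.maxPerMinute) {
      this.attempts.set(client, recent);
      return false;
    }
    recent.push(now);
    this.attempts.set(client, recent);
    return true;
  }

  // Record the outcome of an allowed attempt; a success clears the failure count.
  record(client: string, success: boolean): void {
    if (success) {
      this.failures.delete(client);
    } else {
      this.failures.set(client, (this.failures.get(client) ?? 0) + 1);
    }
  }
}
```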
+
+## Performance Considerations
+
+| Concern | Mitigation |
+|---------|-----------|
+| WebSocket latency | Typical LAN: <1ms. WAN: 20-100ms. Acceptable for agent output (text, not video) |
+| Bandwidth | Agent NDJSON: ~50KB/s peak. Terminal: ~200KB/s peak. Trivial even on slow links |
+| Connection count | Max 10 machines initially (UI constraint, not technical) |
+| Message ordering | A single WebSocket connection per machine delivers messages in order. Ordering does not extend across reconnects, where missed messages are not replayed |
+
+## What This Does NOT Cover (Future)
+
+- **Multi-controller** — multiple BTerminal instances observing the same relay (needs pub/sub)
+- **Relay discovery** — automatic detection of relays on LAN (mDNS/Bonjour)
+- **Agent migration** — moving a running agent from one machine to another
+- **Relay-to-relay** — direct communication between remote machines
+- **mTLS** — mutual TLS for enterprise environments (Phase B+ enhancement)
--- a/docs/task_plan.md
+++ b/docs/task_plan.md
@@ -145,7 +145,7 @@ See [phases.md](phases.md) for the full phased implementation plan (Phases 1-6).
 ## Open Questions

 1. **Node.js or Deno for sidecar?** Resolved: Deno-first with Node.js fallback. SidecarCommand struct in sidecar.rs abstracts the choice. Deno preferred (runs TS directly, compiles to single binary). Falls back to Node.js if Deno not in PATH.
-2. **Multi-machine support?** Remote agents via WebSocket. Phase 7+ feature.
+2. **Multi-machine support?** Designed. See [multi-machine.md](multi-machine.md) for full architecture (bterminal-relay binary, WebSocket NDJSON, RemoteManager). Implementation in 4 phases (A-D).
 3. **Agent Teams integration?** Phase 7 — frontend routing implemented (subagent pane spawning, parent/child navigation). Needs real-world testing with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
 4. **Electron escape hatch threshold?** If Canvas xterm.js proves >50ms latency on target system with 4 panes, switch to Electron. Benchmark in Phase 2.