From 04a7a4bb94e41082e8852f823c9039cbd857ca21 Mon Sep 17 00:00:00 2001
From: Hibryda
Date: Fri, 6 Mar 2026 18:45:56 +0100
Subject: [PATCH] docs: add multi-machine support architecture design

Full WebSocket architecture spec for remote agent/terminal management:
bterminal-relay binary, RemoteManager, NDJSON protocol, pre-shared token + TLS
auth, 4-phase implementation plan (A-D).
---
 TODO.md               |   2 +-
 docs/multi-machine.md | 302 ++++++++++++++++++++++++++++++++++++++++++
 docs/task_plan.md     |   2 +-
 3 files changed, 304 insertions(+), 2 deletions(-)
 create mode 100644 docs/multi-machine.md

diff --git a/TODO.md b/TODO.md
index f329e58..ff7e23d 100644
--- a/TODO.md
+++ b/TODO.md
@@ -4,7 +4,7 @@
 
 - [ ] **Deno sidecar real-world testing** -- Integrated into sidecar.rs (Deno-first + Node.js fallback). Needs testing with real claude CLI and startup time benchmark vs Node.js.
 - [ ] **E2E testing (Playwright/WebDriver)** -- Scaffold at v2/tests/e2e/README.md. Needs display server to run. Test: open terminal, run command, open agent, verify output.
-- [ ] **Multi-machine support** -- Remote agents via WebSocket (Phase 7+ feature).
+- [ ] **Multi-machine support** -- Architecture designed in [docs/multi-machine.md](docs/multi-machine.md). Next: Phase A (extract bterminal-core crate), then Phase B (bterminal-relay binary).
 - [ ] **Agent Teams real-world testing** -- Frontend routing implemented (Phase 7). Needs testing with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 and real subagent spawning.
 
 ## Completed
diff --git a/docs/multi-machine.md b/docs/multi-machine.md
new file mode 100644
index 0000000..e44b38c
--- /dev/null
+++ b/docs/multi-machine.md
@@ -0,0 +1,302 @@
+# Multi-Machine Support — Architecture Design
+
+## Overview
+
+Extend BTerminal to manage Claude agent sessions and terminal panes running on **remote machines** over WebSocket, while keeping the local sidecar path unchanged.
+
+## Problem
+
+Current architecture is local-only:
+
+```
+WebView ←→ Rust (Tauri IPC) ←→ Local Sidecar (stdio NDJSON)
+                            ←→ Local PTY (portable-pty)
+```
+
+Target state: BTerminal acts as a **mission control** that observes agents and terminals running on multiple machines (dev servers, cloud VMs, CI runners).
+
+## Design Constraints
+
+1. **Zero changes to local path** — local sidecar/PTY must work identically
+2. **Same NDJSON protocol** — remote and local agents speak the same message format
+3. **No new runtime dependencies** — use Rust's `tokio-tungstenite` (already available via Tauri)
+4. **Graceful degradation** — remote machine goes offline → pane shows disconnected state, reconnects automatically
+5. **Security** — all remote connections authenticated and encrypted (TLS + token)
+
+## Architecture
+
+### Three-Layer Model
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│                      BTerminal (Controller)                      │
+│                                                                  │
+│  ┌──────────┐    Tauri IPC     ┌──────────────────────────────┐  │
+│  │ WebView  │  ←────────────→  │ Rust Backend                 │  │
+│  │ (Svelte) │                  │                              │  │
+│  └──────────┘                  │ ├── PtyManager (local)       │  │
+│                                │ ├── SidecarManager (local)   │  │
+│                                │ └── RemoteManager ───────────┼──┤
+│                                └──────────────────────────────┘  │
+│                                                                  │
+└──────────────────────────────────────────────────────────────────┘
+        │                                                          │
+        │ (local stdio)                                            │ (WebSocket wss://)
+        ▼                                                          ▼
+  ┌───────────┐                                         ┌──────────────────────┐
+  │ Local     │                                         │ Remote Machine       │
+  │ Sidecar   │                                         │                      │
+  │ (Deno/    │                                         │  ┌────────────────┐  │
+  │ Node.js)  │                                         │  │ bterminal-relay│  │
+  │           │                                         │  │ (Rust binary)  │  │
+  └───────────┘                                         │  │                │  │
+                                                        │  │ ├── PTY mgr    │  │
+                                                        │  │ ├── Sidecar mgr│  │
+                                                        │  │ └── WS server  │  │
+                                                        │  └────────────────┘  │
+                                                        └──────────────────────┘
+```
+
+### Components
+
+#### 1. `bterminal-relay` — Remote Agent (Rust binary)
+
+A standalone Rust binary that runs on each remote machine. It:
+
+- Listens on a WebSocket port (default: 9750)
+- Manages local PTYs and claude sidecar processes
+- Forwards NDJSON events to the controller over WebSocket
+- Receives commands (query, stop, resize, write) from the controller
+
+**Why a Rust binary?** It reuses the existing `PtyManager` and `SidecarManager` code from `src-tauri/src/`, extracted into a shared crate.
+
+```
+bterminal-relay/
+├── Cargo.toml              # depends on bterminal-core
+├── src/
+│   └── main.rs             # WebSocket server + auth
+│
+bterminal-core/             # shared crate (extracted from src-tauri)
+├── Cargo.toml
+├── src/
+│   ├── pty.rs              # PtyManager (from v2/src-tauri/src/pty.rs)
+│   ├── sidecar.rs          # SidecarManager (from v2/src-tauri/src/sidecar.rs)
+│   └── lib.rs
+```
+
+#### 2. `RemoteManager` — Controller-Side (in Rust backend)
+
+New module in `v2/src-tauri/src/remote.rs`. Manages WebSocket connections to multiple relays.
+
+```rust
+pub struct RemoteMachine {
+    pub id: String,
+    pub label: String,
+    pub url: String,          // wss://host:9750
+    pub token: String,        // auth token
+    pub status: RemoteStatus, // connected | connecting | disconnected | error
+}
+
+pub enum RemoteStatus {
+    Connected,
+    Connecting,
+    Disconnected,
+    Error(String),
+}
+
+pub struct RemoteManager {
+    // Assumed: both maps are keyed by machine id.
+    machines: Arc<Mutex<HashMap<String, RemoteMachine>>>,
+    // Assumed: `RelayConnection` is a placeholder name for the WebSocket handle type.
+    connections: Arc<Mutex<HashMap<String, RelayConnection>>>,
+}
+```
+
+#### 3. Frontend Adapters — Unified Interface
+
+The frontend doesn't care whether a pane is local or remote. The bridge layer abstracts this:
+
+```typescript
+// adapters/agent-bridge.ts — extended
+export async function queryAgent(options: AgentQueryOptions): Promise<void> {
+  if (options.remote_machine_id) {
+    return invoke('remote_agent_query', { machineId: options.remote_machine_id, options });
+  }
+  return invoke('agent_query', { options });
+}
+```
+
+Same pattern for `pty-bridge.ts` — add optional `remote_machine_id` to all operations.
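A minimal sketch of how that routing pattern might carry over to `pty-bridge.ts`, written as a pure helper so it runs without Tauri. The `routeCall` helper, the `RoutedCall` type, and the `remote_` command prefix for PTY operations are assumptions for illustration; the design only names `agent_query` and `remote_agent_query`.

```typescript
// Hypothetical routing helper: everything here except the idea of an optional
// machine id is an assumption, not part of the design document.
interface RoutedCall {
  command: string;
  args: Record<string, unknown>;
}

// Pick the Tauri command name and arguments for a bridge operation.
// With no machine id the call is returned unchanged, so the local path
// stays exactly as it is today.
function routeCall(
  localCommand: string,
  args: Record<string, unknown>,
  remoteMachineId?: string,
): RoutedCall {
  if (remoteMachineId) {
    return {
      command: `remote_${localCommand}`, // assumed naming convention
      args: { machineId: remoteMachineId, ...args },
    };
  }
  return { command: localCommand, args };
}

const local = routeCall("pty_write", { id: "pty-1", data: "ls\n" });
const remote = routeCall("pty_write", { id: "pty-1", data: "ls\n" }, "devbox");
console.log(local.command, remote.command);
```

Keeping the decision in one helper means each bridge function would only need to pass its command name and arguments through, and the result can be handed to `invoke(command, args)`.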
+
+## Protocol
+
+### WebSocket Wire Format
+
+Same NDJSON as local sidecar, wrapped in an envelope for multiplexing:
+
+```typescript
+// Controller → Relay (commands)
+interface RelayCommand {
+  id: string; // request correlation ID
+  type: 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
+      | 'agent_query' | 'agent_stop' | 'sidecar_restart'
+      | 'ping';
+  payload: Record<string, unknown>;
+}
+
+// Relay → Controller (events)
+interface RelayEvent {
+  type: 'pty_data' | 'pty_exit'
+      | 'sidecar_message' | 'sidecar_exited'
+      | 'error' | 'pong' | 'ready';
+  sessionId?: string;
+  payload: unknown;
+}
+```
+
+### Authentication
+
+1. **Pre-shared token** — relay starts with `--token <token>`. Controller sends token in WebSocket upgrade headers (`Authorization: Bearer <token>`).
+2. **TLS required** — relay rejects non-TLS connections in production mode. Dev mode allows `ws://` with `--insecure` flag.
+3. **Token rotation** — future: relay exposes endpoint to rotate token. Controller stores tokens in SQLite settings table.
+
+### Connection Lifecycle
+
+```
+Controller                          Relay
+    │                                 │
+    │── WSS connect ─────────────────→│
+    │── Authorization: Bearer token ──→│
+    │                                 │
+    │←── { type: "ready", ...} ───────│
+    │                                 │
+    │── { type: "ping" } ────────────→│
+    │←── { type: "pong" } ────────────│  (every 15s)
+    │                                 │
+    │── { type: "agent_query", ... }──→│
+    │←── { type: "sidecar_message" }──│  (streaming)
+    │←── { type: "sidecar_message" }──│
+    │                                 │
+    │          (disconnect)           │
+    │── reconnect (exp backoff) ─────→│  (1s, 2s, 4s, 8s, max 30s)
+```
+
+### Reconnection
+
+- Controller reconnects with exponential backoff (1s → 30s cap)
+- On reconnect, relay sends current state snapshot (active sessions, PTY list)
+- Controller reconciles: updates pane states, re-subscribes to streams
+- Active agent sessions continue on relay regardless of controller connection
+
+## Session Persistence Across Reconnects
+
+Key insight: **remote agents keep running even when the controller disconnects**. The relay is autonomous — it doesn't need the controller to operate.
+
+On reconnect:
+1. Relay sends `{ type: "state_sync", activeSessions: [...], activePtys: [...] }`
+2. Controller matches against known panes, updates status
+3. Missed messages are NOT replayed (too complex, marginal value). Agent panes show "reconnected — some messages may be missing" notice
+
+## Frontend Integration
+
+### Pane Model Changes
+
+```typescript
+// stores/layout.svelte.ts
+export interface Pane {
+  id: string;
+  type: 'terminal' | 'agent';
+  title: string;
+  group?: string;
+  remoteMachineId?: string; // NEW: undefined = local
+}
+```
+
+### Sidebar — Machine Groups
+
+Remote panes auto-group by machine label in the sidebar:
+
+```
+▾ Local
+  ├── Terminal 1
+  └── Agent: fix bug
+
+▾ devbox (192.168.1.50)        ← remote machine
+  ├── SSH session
+  └── Agent: deploy
+
+▾ ci-runner (10.0.0.5)         ← remote machine (disconnected)
+  └── Agent: test suite  ⚠️
+```
+
+### Settings Panel
+
+New "Machines" section in settings:
+
+| Field | Type | Notes |
+|-------|------|-------|
+| Label | string | Human-readable name |
+| URL | string | `wss://host:9750` |
+| Token | password | Pre-shared auth token |
+| Auto-connect | boolean | Connect on app launch |
+
+Stored in SQLite `settings` table as JSON: `remote_machines` key.
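As a rough illustration of what the value under the `remote_machines` key could look like, the sketch below parses and validates it. The array-of-objects shape, the `MachineConfig` field names (including `autoConnect`), the `parseRemoteMachines` helper, and the controller-side `allowInsecure` switch are all assumptions; the design only lists the four settings fields and the `wss://` requirement.

```typescript
// Assumed shape of one entry under the `remote_machines` settings key.
// Field names mirror the settings table; the exact casing is a guess.
interface MachineConfig {
  label: string;
  url: string;
  token: string;
  autoConnect: boolean;
}

// Parse the stored JSON and reject entries the controller could not safely use.
function parseRemoteMachines(json: string, allowInsecure = false): MachineConfig[] {
  const raw: unknown = JSON.parse(json);
  if (!Array.isArray(raw)) {
    throw new Error("remote_machines must be a JSON array");
  }
  return raw.map((entry: unknown, index: number): MachineConfig => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`machine ${index}: expected an object`);
    }
    const { label, url, token, autoConnect } = entry as Record<string, unknown>;
    if (typeof label !== "string" || typeof url !== "string" || typeof token !== "string") {
      throw new Error(`machine ${index}: label, url and token must be strings`);
    }
    // TLS is required unless insecure mode is explicitly enabled (dev only).
    const secure = url.startsWith("wss://");
    const plain = url.startsWith("ws://");
    if (!secure && !(plain && allowInsecure)) {
      throw new Error(`machine ${index}: URL must use wss://`);
    }
    // A missing or non-boolean flag is treated as "do not auto-connect".
    return { label, url, token, autoConnect: autoConnect === true };
  });
}

const stored = JSON.stringify([
  { label: "devbox", url: "wss://192.168.1.50:9750", token: "example-token", autoConnect: true },
]);
const machines = parseRemoteMachines(stored);
console.log(machines[0].label, machines[0].autoConnect);
```

Validating at load time keeps a mistyped `ws://` URL from silently sending the token over an unencrypted connection.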
+
+## Implementation Plan
+
+### Phase A: Extract `bterminal-core` crate
+
+- Extract `PtyManager` and `SidecarManager` into a shared crate
+- `src-tauri` depends on `bterminal-core` instead of owning the code
+- Zero behavior change — purely structural refactor
+- **Estimate:** ~2h of mechanical refactoring
+
+### Phase B: Build `bterminal-relay` binary
+
+- WebSocket server using `tokio-tungstenite`
+- Token auth on upgrade
+- Routes commands to `bterminal-core` managers
+- Forwards events back over WebSocket
+- Includes `--port`, `--token`, `--insecure` CLI flags
+- **Ships as:** single static Rust binary (~5MB), `cargo install bterminal-relay`
+
+### Phase C: Add `RemoteManager` to controller
+
+- New `remote.rs` module in `src-tauri`
+- Manages WebSocket client connections
+- Tauri commands: `remote_add`, `remote_remove`, `remote_connect`, `remote_disconnect`
+- Forwards remote events as Tauri events (same `sidecar-message` / `pty-data` events, tagged with machine ID)
+
+### Phase D: Frontend integration
+
+- Extend bridge adapters with `remoteMachineId` routing
+- Add machine management UI in settings
+- Add machine status indicators in sidebar
+- Add reconnection banner in pane chrome
+- Test with 2 machines (local + 1 remote)
+
+## Security Considerations
+
+| Threat | Mitigation |
+|--------|-----------|
+| Token interception | TLS required (reject `ws://` without `--insecure`) |
+| Token brute-force | Rate limit auth attempts (5/min), lockout after 10 failures |
+| Relay impersonation | Pin relay certificate fingerprint (future: mTLS) |
+| Command injection | Relay validates all command payloads against schema |
+| Lateral movement | Relay runs as unprivileged user, no shell access beyond PTY/sidecar |
+| Data exfiltration | Agent output streams to controller only, no relay-to-relay traffic |
+
+## Performance Considerations
+
+| Concern | Mitigation |
+|---------|-----------|
+| WebSocket latency | Typical LAN: <1ms. WAN: 20-100ms. Acceptable for agent output (text, not video) |
+| Bandwidth | Agent NDJSON: ~50KB/s peak. Terminal: ~200KB/s peak. Trivial even on slow links |
+| Connection count | Max 10 machines initially (UI constraint, not technical) |
+| Message ordering | Single WebSocket per machine = ordered delivery guaranteed |
+
+## What This Does NOT Cover (Future)
+
+- **Multi-controller** — multiple BTerminal instances observing the same relay (needs pub/sub)
+- **Relay discovery** — automatic detection of relays on LAN (mDNS/Bonjour)
+- **Agent migration** — moving a running agent from one machine to another
+- **Relay-to-relay** — direct communication between remote machines
+- **mTLS** — mutual TLS for enterprise environments (Phase B+ enhancement)
diff --git a/docs/task_plan.md b/docs/task_plan.md
index b2068d3..aed9621 100644
--- a/docs/task_plan.md
+++ b/docs/task_plan.md
@@ -145,7 +145,7 @@ See [phases.md](phases.md) for the full phased implementation plan (Phases 1-6).
 ## Open Questions
 
 1. **Node.js or Deno for sidecar?** Resolved: Deno-first with Node.js fallback. SidecarCommand struct in sidecar.rs abstracts the choice. Deno preferred (runs TS directly, compiles to single binary). Falls back to Node.js if Deno not in PATH.
-2. **Multi-machine support?** Remote agents via WebSocket. Phase 7+ feature.
+2. **Multi-machine support?** Designed. See [multi-machine.md](multi-machine.md) for full architecture (bterminal-relay binary, WebSocket NDJSON, RemoteManager). Implementation in 4 phases (A-D).
 3. **Agent Teams integration?** Phase 7 — frontend routing implemented (subagent pane spawning, parent/child navigation). Needs real-world testing with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
 4. **Electron escape hatch threshold?** If Canvas xterm.js proves >50ms latency on target system with 4 panes, switch to Electron. Benchmark in Phase 2.