docs: add multi-machine support architecture design
Full WebSocket architecture spec for remote agent/terminal management: bterminal-relay binary, RemoteManager, NDJSON protocol, pre-shared token + TLS auth, 4-phase implementation plan (A-D).
This commit is contained in:
parent 86fbe3e762
commit 04a7a4bb94
3 changed files with 304 additions and 2 deletions
2
TODO.md
@@ -4,7 +4,7 @@

 - [ ] **Deno sidecar real-world testing** -- Integrated into sidecar.rs (Deno-first + Node.js fallback). Needs testing with real claude CLI and startup time benchmark vs Node.js.
 - [ ] **E2E testing (Playwright/WebDriver)** -- Scaffold at v2/tests/e2e/README.md. Needs display server to run. Test: open terminal, run command, open agent, verify output.
-- [ ] **Multi-machine support** -- Remote agents via WebSocket (Phase 7+ feature).
+- [ ] **Multi-machine support** -- Architecture designed in [docs/multi-machine.md](docs/multi-machine.md). Next: Phase A (extract bterminal-core crate), then Phase B (bterminal-relay binary).
 - [ ] **Agent Teams real-world testing** -- Frontend routing implemented (Phase 7). Needs testing with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 and real subagent spawning.

 ## Completed

302
docs/multi-machine.md
Normal file
@@ -0,0 +1,302 @@
# Multi-Machine Support — Architecture Design

## Overview

Extend BTerminal to manage Claude agent sessions and terminal panes running on **remote machines** over WebSocket, while keeping the local sidecar path unchanged.

## Problem

Current architecture is local-only:

```
WebView ←→ Rust (Tauri IPC) ←→ Local Sidecar (stdio NDJSON)
                            ←→ Local PTY (portable-pty)
```

Target state: BTerminal acts as a **mission control** that observes agents and terminals running on multiple machines (dev servers, cloud VMs, CI runners).

## Design Constraints

1. **Zero changes to local path** — local sidecar/PTY must work identically
2. **Same NDJSON protocol** — remote and local agents speak the same message format
3. **No new runtime dependencies** — use Rust's `tokio-tungstenite` (already available via Tauri)
4. **Graceful degradation** — remote machine goes offline → pane shows disconnected state, reconnects automatically
5. **Security** — all remote connections authenticated and encrypted (TLS + token)

## Architecture

### Three-Layer Model

```
┌──────────────────────────────────────────────────────────────────┐
│  BTerminal (Controller)                                          │
│                                                                  │
│  ┌──────────┐    Tauri IPC     ┌──────────────────────────────┐  │
│  │ WebView  │  ←────────────→  │  Rust Backend                │  │
│  │ (Svelte) │                  │                              │  │
│  └──────────┘                  │  ├── PtyManager (local)      │  │
│                                │  ├── SidecarManager (local)  │  │
│                                │  └── RemoteManager ──────────┼──┤
│                                └──────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
        │                                    │
        │ (local stdio)                      │ (WebSocket wss://)
        ▼                                    ▼
  ┌───────────┐                   ┌──────────────────────┐
  │  Local    │                   │  Remote Machine      │
  │  Sidecar  │                   │                      │
  │  (Deno/   │                   │  ┌────────────────┐  │
  │  Node.js) │                   │  │ bterminal-relay│  │
  │           │                   │  │ (Rust binary)  │  │
  └───────────┘                   │  │                │  │
                                  │  │ ├── PTY mgr    │  │
                                  │  │ ├── Sidecar mgr│  │
                                  │  │ └── WS server  │  │
                                  │  └────────────────┘  │
                                  └──────────────────────┘
```

### Components

#### 1. `bterminal-relay` — Remote Agent (Rust binary)

A standalone Rust binary that runs on each remote machine. It:

- Listens on a WebSocket port (default: 9750)
- Manages local PTYs and claude sidecar processes
- Forwards NDJSON events to the controller over WebSocket
- Receives commands (query, stop, resize, write) from the controller

**Why a Rust binary?** It reuses the existing `PtyManager` and `SidecarManager` code from `src-tauri/src/`, which is extracted into a shared crate.

```
bterminal-relay/
├── Cargo.toml          # depends on bterminal-core
├── src/
│   └── main.rs         # WebSocket server + auth
│
bterminal-core/         # shared crate (extracted from src-tauri)
├── Cargo.toml
├── src/
│   ├── pty.rs          # PtyManager (from v2/src-tauri/src/pty.rs)
│   ├── sidecar.rs      # SidecarManager (from v2/src-tauri/src/sidecar.rs)
│   └── lib.rs
```

#### 2. `RemoteManager` — Controller-Side (in Rust backend)

New module in `v2/src-tauri/src/remote.rs`. Manages WebSocket connections to multiple relays.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

pub struct RemoteMachine {
    pub id: String,
    pub label: String,
    pub url: String,          // wss://host:9750
    pub token: String,        // auth token
    pub status: RemoteStatus, // connected | connecting | disconnected | error
}

pub enum RemoteStatus {
    Connected,
    Connecting,
    Disconnected,
    Error(String),
}

// `WsConnection` (the per-relay socket handle) is not defined in this sketch.
pub struct RemoteManager {
    machines: Arc<Mutex<Vec<RemoteMachine>>>,
    connections: Arc<Mutex<HashMap<String, WsConnection>>>, // keyed by machine id
}
```

#### 3. Frontend Adapters — Unified Interface

The frontend doesn't care whether a pane is local or remote. The bridge layer abstracts this:

```typescript
// adapters/agent-bridge.ts — extended
export async function queryAgent(options: AgentQueryOptions): Promise<void> {
  if (options.remote_machine_id) {
    return invoke('remote_agent_query', { machineId: options.remote_machine_id, options });
  }
  return invoke('agent_query', { options });
}
```

Same pattern for `pty-bridge.ts` — add optional `remote_machine_id` to all operations.
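
As a rough illustration of that pattern applied to a PTY write, the sketch below routes on `remote_machine_id` the same way `queryAgent` does. Everything named here is an assumption made for the example: `PtyWriteOptions`, `routePtyWrite`, `writePty`, and the command names `pty_write` / `remote_pty_write` do not appear in the design. `invoke` is passed in as a parameter so the snippet stands alone without the Tauri API.

```typescript
// Hypothetical sketch: type, function, and command names are illustrative only.
export type Invoke = (cmd: string, args: Record<string, unknown>) => Promise<unknown>;

export interface PtyWriteOptions {
  id: string;                 // PTY identifier
  data: string;               // text to write to the PTY
  remote_machine_id?: string; // undefined = local
}

// Decide which backend command a PTY write should go to.
export function routePtyWrite(
  options: PtyWriteOptions,
): { cmd: string; args: Record<string, unknown> } {
  const { remote_machine_id, ...rest } = options;
  if (remote_machine_id) {
    // Remote path: pass the machine ID alongside the unchanged options.
    return { cmd: 'remote_pty_write', args: { machineId: remote_machine_id, ...rest } };
  }
  // Local path: identical to what a local-only bridge would send.
  return { cmd: 'pty_write', args: { ...rest } };
}

export async function writePty(invoke: Invoke, options: PtyWriteOptions): Promise<void> {
  const { cmd, args } = routePtyWrite(options);
  await invoke(cmd, args);
}
```

Keeping the routing decision in a pure function makes it testable without a running Tauri backend.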

## Protocol

### WebSocket Wire Format

Same NDJSON as local sidecar, wrapped in an envelope for multiplexing:

```typescript
// Controller → Relay (commands)
interface RelayCommand {
  id: string; // request correlation ID
  type: 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
      | 'agent_query' | 'agent_stop' | 'sidecar_restart'
      | 'ping';
  payload: Record<string, unknown>;
}

// Relay → Controller (events)
interface RelayEvent {
  type: 'pty_data' | 'pty_exit'
      | 'sidecar_message' | 'sidecar_exited'
      | 'error' | 'pong' | 'ready';
  sessionId?: string;
  payload: unknown;
}
```
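
To make the envelope concrete, here is a minimal sketch of encoding a command and validating an incoming event, assuming one JSON envelope per line. The helper names `encodeCommand` and `decodeEvent` are invented for this example; the type unions are copied from the interfaces above.

```typescript
// Sketch only: encodeCommand/decodeEvent are illustrative helper names.
export type RelayCommandType =
  | 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
  | 'agent_query' | 'agent_stop' | 'sidecar_restart' | 'ping';

export interface RelayCommand {
  id: string;
  type: RelayCommandType;
  payload: Record<string, unknown>;
}

const EVENT_TYPES = [
  'pty_data', 'pty_exit', 'sidecar_message', 'sidecar_exited', 'error', 'pong', 'ready',
] as const;
export type RelayEventType = typeof EVENT_TYPES[number];

export interface RelayEvent {
  type: RelayEventType;
  sessionId?: string;
  payload: unknown;
}

// JSON.stringify (without an indent argument) emits no raw newlines,
// so "\n" is a safe record delimiter.
export function encodeCommand(cmd: RelayCommand): string {
  return JSON.stringify(cmd) + '\n';
}

// Parse one line into a RelayEvent; return null for malformed input or unknown types.
export function decodeEvent(line: string): RelayEvent | null {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof raw !== 'object' || raw === null) return null;
  const obj = raw as Record<string, unknown>;
  const type = EVENT_TYPES.find((t) => t === obj.type);
  if (type === undefined) return null;
  const sessionId = obj.sessionId;
  if (sessionId !== undefined && typeof sessionId !== 'string') return null;
  const event: RelayEvent = { type, payload: obj.payload };
  if (typeof sessionId === 'string') event.sessionId = sessionId;
  return event;
}
```

Rejecting unknown `type` values at the boundary keeps the rest of the controller free of defensive checks.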

### Authentication

1. **Pre-shared token** — relay starts with `--token <secret>`. Controller sends token in WebSocket upgrade headers (`Authorization: Bearer <token>`).
2. **TLS required** — relay rejects non-TLS connections in production mode. Dev mode allows `ws://` with `--insecure` flag.
3. **Token rotation** — future: relay exposes endpoint to rotate token. Controller stores tokens in SQLite settings table.
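
The token check on upgrade could look roughly like the sketch below. The relay itself is planned as a Rust binary, so TypeScript is used here only to show the logic. The function names are made up for the example, and the length-independent comparison is a common precaution that the design does not itself specify.

```typescript
// Sketch: extract the token from an `Authorization: Bearer <token>` header value.
export function parseBearer(header: string | undefined): string | null {
  if (!header) return null;
  const match = /^Bearer (.+)$/.exec(header);
  return match ? (match[1] ?? null) : null;
}

// Best-effort comparison whose running time does not depend on the position
// of the first mismatching character (assumption: not required by the design).
export function tokensEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// True only if the header carries exactly the expected pre-shared token.
export function isAuthorized(header: string | undefined, expected: string): boolean {
  const token = parseBearer(header);
  return token !== null && tokensEqual(token, expected);
}
```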

### Connection Lifecycle

```
Controller                          Relay
    │                                 │
    │── WSS connect ─────────────────→│
    │── Authorization: Bearer token ──→│
    │                                 │
    │←── { type: "ready", ...} ───────│
    │                                 │
    │── { type: "ping" } ────────────→│
    │←── { type: "pong" } ────────────│  (every 15s)
    │                                 │
    │── { type: "agent_query", ... }──→│
    │←── { type: "sidecar_message" }──│  (streaming)
    │←── { type: "sidecar_message" }──│
    │                                 │
    │          (disconnect)           │
    │── reconnect (exp backoff) ─────→│  (1s, 2s, 4s, 8s, max 30s)
```

### Reconnection

- Controller reconnects with exponential backoff (1s → 30s cap)
- On reconnect, relay sends current state snapshot (active sessions, PTY list)
- Controller reconciles: updates pane states, re-subscribes to streams
- Active agent sessions continue on relay regardless of controller connection
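
The backoff schedule can be expressed as a small pure function. This sketch assumes plain doubling from 1s (so 16s follows 8s before the 30s cap applies) and no jitter; `reconnectDelayMs` and its parameter names are illustrative.

```typescript
// Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, 8s, 16s, then capped at 30s.
export function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  const n = Math.max(0, Math.floor(attempt));
  // Clamp the exponent so the intermediate value stays finite for large attempt counts.
  return Math.min(capMs, baseMs * Math.pow(2, Math.min(n, 30)));
}
```

A caller would reset the attempt counter to 0 once a connection succeeds, so the next outage starts again from the 1s delay.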

## Session Persistence Across Reconnects

Key insight: **remote agents keep running even when the controller disconnects**. The relay is autonomous — it doesn't need the controller to operate.

On reconnect:
1. Relay sends `{ type: "state_sync", activeSessions: [...], activePtys: [...] }`
2. Controller matches against known panes, updates status
3. Missed messages are NOT replayed (too complex, marginal value). Agent panes show "reconnected — some messages may be missing" notice
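
Step 2 might be sketched as below. The design only names the `state_sync` message and its two lists (and `state_sync` is not part of the `RelayEvent` type union shown earlier), so the element types, the `RemotePane` shape, the `PaneStatus` values, and the idea that a pane ID equals the relay-side session or PTY ID are all assumptions for illustration.

```typescript
// Hypothetical shapes: only the message name and the two list fields come from the design.
export interface StateSync {
  type: 'state_sync';
  activeSessions: string[]; // assumed: agent session IDs still running on the relay
  activePtys: string[];     // assumed: PTY IDs still open on the relay
}

export interface RemotePane {
  id: string;               // assumed to match the session or PTY ID on the relay
  type: 'terminal' | 'agent';
}

export type PaneStatus = 'reconnected' | 'ended';

// Match known panes against the relay's snapshot and report each pane's new status.
export function reconcile(panes: RemotePane[], sync: StateSync): Map<string, PaneStatus> {
  const sessions = new Set(sync.activeSessions);
  const ptys = new Set(sync.activePtys);
  const result = new Map<string, PaneStatus>();
  for (const pane of panes) {
    const alive = pane.type === 'agent' ? sessions.has(pane.id) : ptys.has(pane.id);
    result.set(pane.id, alive ? 'reconnected' : 'ended');
  }
  return result;
}
```

Panes marked `reconnected` are the ones that would show the "some messages may be missing" notice.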

## Frontend Integration

### Pane Model Changes

```typescript
// stores/layout.svelte.ts
export interface Pane {
  id: string;
  type: 'terminal' | 'agent';
  title: string;
  group?: string;
  remoteMachineId?: string; // NEW: undefined = local
}
```

### Sidebar — Machine Groups

Remote panes auto-group by machine label in the sidebar:

```
▾ Local
  ├── Terminal 1
  └── Agent: fix bug

▾ devbox (192.168.1.50)        ← remote machine
  ├── SSH session
  └── Agent: deploy

▾ ci-runner (10.0.0.5)         ← remote machine (disconnected)
  └── Agent: test suite ⚠️
```
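
One way to derive those groups from the pane list is sketched here. The `labels` lookup (machine ID to display label) and the function name `groupByMachine` are assumptions; the `Pane` shape is the one from the pane model above.

```typescript
export interface Pane {
  id: string;
  type: 'terminal' | 'agent';
  title: string;
  group?: string;
  remoteMachineId?: string; // undefined = local
}

// Group panes by machine. `labels` maps a machine ID to its display label
// (hypothetical input); unknown IDs fall back to the raw ID.
export function groupByMachine(
  panes: Pane[],
  labels: Record<string, string>,
): Map<string, Pane[]> {
  const groups = new Map<string, Pane[]>();
  groups.set('Local', []); // Map keeps insertion order, so "Local" is listed first
  for (const pane of panes) {
    const id = pane.remoteMachineId;
    const key = id === undefined ? 'Local' : (labels[id] ?? id);
    const bucket = groups.get(key) ?? [];
    bucket.push(pane);
    groups.set(key, bucket);
  }
  return groups;
}
```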

### Settings Panel

New "Machines" section in settings:

| Field | Type | Notes |
|-------|------|-------|
| Label | string | Human-readable name |
| URL | string | `wss://host:9750` |
| Token | password | Pre-shared auth token |
| Auto-connect | boolean | Connect on app launch |

Stored as JSON in the SQLite `settings` table under the `remote_machines` key.
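
A sketch of reading that setting back defensively follows. The JSON property names (`label`, `url`, `token`, `autoConnect`) and the array layout are assumptions; the design lists the fields but not their serialized form.

```typescript
// Assumed serialized shape of one entry under the `remote_machines` key.
export interface MachineConfig {
  label: string;        // human-readable name
  url: string;          // e.g. wss://host:9750
  token: string;        // pre-shared auth token
  autoConnect: boolean; // connect on app launch
}

function isMachineConfig(value: unknown): value is MachineConfig {
  if (typeof value !== 'object' || value === null) return false;
  const { label, url, token, autoConnect } = value as Record<string, unknown>;
  return (
    typeof label === 'string' &&
    typeof url === 'string' &&
    /^wss?:\/\/\S+$/.test(url) && // ws:// is only valid together with --insecure
    typeof token === 'string' &&
    typeof autoConnect === 'boolean'
  );
}

// Parse the stored JSON; invalid JSON yields [], invalid entries are skipped.
export function parseRemoteMachines(json: string): MachineConfig[] {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch {
    return [];
  }
  return Array.isArray(raw) ? raw.filter(isMachineConfig) : [];
}
```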

## Implementation Plan

### Phase A: Extract `bterminal-core` crate

- Extract `PtyManager` and `SidecarManager` into a shared crate
- `src-tauri` depends on `bterminal-core` instead of owning the code
- Zero behavior change — purely structural refactor
- **Estimate:** ~2h of mechanical refactoring

### Phase B: Build `bterminal-relay` binary

- WebSocket server using `tokio-tungstenite`
- Token auth on upgrade
- Routes commands to `bterminal-core` managers
- Forwards events back over WebSocket
- Includes `--port`, `--token`, `--insecure` CLI flags
- **Ships as:** single static Rust binary (~5MB), `cargo install bterminal-relay`

### Phase C: Add `RemoteManager` to controller

- New `remote.rs` module in `src-tauri`
- Manages WebSocket client connections
- Tauri commands: `remote_add`, `remote_remove`, `remote_connect`, `remote_disconnect`
- Forwards remote events as Tauri events (same `sidecar-message` / `pty-data` events, tagged with machine ID)
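
The tagging in the last bullet could be modelled as follows. The real forwarding would live in the Rust `remote.rs` module; this TypeScript sketch only shows the mapping. The `TaggedEvent` shape and its `machineId` field name are assumptions, and only the two event names the design mentions are mapped.

```typescript
// Minimal local copy of the relay event envelope, for a self-contained example.
export interface IncomingRelayEvent {
  type: string;
  sessionId?: string;
  payload: unknown;
}

// Hypothetical shape of what the frontend would receive.
export interface TaggedEvent {
  event: 'sidecar-message' | 'pty-data'; // existing Tauri event names
  machineId: string;                      // assumed tag field
  sessionId?: string;
  payload: unknown;
}

// Map a relay event to the matching Tauri event, tagged with its source machine.
// Event types the design does not mention here (pong, ready, ...) return null.
export function toTauriEvent(machineId: string, ev: IncomingRelayEvent): TaggedEvent | null {
  const name =
    ev.type === 'sidecar_message' ? 'sidecar-message'
    : ev.type === 'pty_data' ? 'pty-data'
    : null;
  if (name === null) return null;
  const tagged: TaggedEvent = { event: name, machineId, payload: ev.payload };
  if (ev.sessionId !== undefined) tagged.sessionId = ev.sessionId;
  return tagged;
}
```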

### Phase D: Frontend integration

- Extend bridge adapters with `remoteMachineId` routing
- Add machine management UI in settings
- Add machine status indicators in sidebar
- Add reconnection banner in pane chrome
- Test with 2 machines (local + 1 remote)

## Security Considerations

| Threat | Mitigation |
|--------|-----------|
| Token interception | TLS required (reject `ws://` without `--insecure`) |
| Token brute-force | Rate limit auth attempts (5/min), lockout after 10 failures |
| Relay impersonation | Pin relay certificate fingerprint (future: mTLS) |
| Command injection | Relay validates all command payloads against schema |
| Lateral movement | Relay runs as unprivileged user, no shell access beyond PTY/sidecar |
| Data exfiltration | Agent output streams to controller only, no relay-to-relay traffic |
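
The brute-force mitigation (5 attempts per minute, lockout after 10 failures) could be sketched as below. Several details are assumptions because the table does not fix them: the limiter is per client, the rate window is a sliding 60 seconds, failures are counted consecutively and reset on success, and the lockout has no expiry. The class name `AuthLimiter` is illustrative.

```typescript
// Sketch of a per-client auth limiter; the policy details are assumptions (see lead-in).
export class AuthLimiter {
  private attempts: number[] = []; // timestamps (ms) of attempts inside the window
  private failures = 0;            // consecutive failed attempts
  private readonly maxPerMinute: number;
  private readonly lockoutAfter: number;

  constructor(maxPerMinute = 5, lockoutAfter = 10) {
    this.maxPerMinute = maxPerMinute;
    this.lockoutAfter = lockoutAfter;
  }

  // Returns true if an auth attempt at time `nowMs` may proceed.
  allow(nowMs: number): boolean {
    if (this.failures >= this.lockoutAfter) return false; // locked out
    this.attempts = this.attempts.filter((t) => nowMs - t < 60000);
    if (this.attempts.length >= this.maxPerMinute) return false; // rate limited
    this.attempts.push(nowMs);
    return true;
  }

  // Record the outcome of an attempt that was allowed.
  record(success: boolean): void {
    this.failures = success ? 0 : this.failures + 1;
  }
}
```

Passing the clock in as `nowMs` keeps the limiter deterministic under test.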

## Performance Considerations

| Concern | Mitigation |
|---------|-----------|
| WebSocket latency | Typical LAN: <1ms. WAN: 20-100ms. Acceptable for agent output (text, not video) |
| Bandwidth | Agent NDJSON: ~50KB/s peak. Terminal: ~200KB/s peak. Trivial even on slow links |
| Connection count | Max 10 machines initially (UI constraint, not technical) |
| Message ordering | Single WebSocket per machine = ordered delivery guaranteed |

## What This Does NOT Cover (Future)

- **Multi-controller** — multiple BTerminal instances observing the same relay (needs pub/sub)
- **Relay discovery** — automatic detection of relays on LAN (mDNS/Bonjour)
- **Agent migration** — moving a running agent from one machine to another
- **Relay-to-relay** — direct communication between remote machines
- **mTLS** — mutual TLS for enterprise environments (Phase B+ enhancement)

@@ -145,7 +145,7 @@ See [phases.md](phases.md) for the full phased implementation plan (Phases 1-6).
 ## Open Questions

 1. **Node.js or Deno for sidecar?** Resolved: Deno-first with Node.js fallback. SidecarCommand struct in sidecar.rs abstracts the choice. Deno preferred (runs TS directly, compiles to single binary). Falls back to Node.js if Deno not in PATH.
-2. **Multi-machine support?** Remote agents via WebSocket. Phase 7+ feature.
+2. **Multi-machine support?** Designed. See [multi-machine.md](multi-machine.md) for full architecture (bterminal-relay binary, WebSocket NDJSON, RemoteManager). Implementation in 4 phases (A-D).
 3. **Agent Teams integration?** Phase 7 — frontend routing implemented (subagent pane spawning, parent/child navigation). Needs real-world testing with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
 4. **Electron escape hatch threshold?** If Canvas xterm.js proves >50ms latency on target system with 4 panes, switch to Electron. Benchmark in Phase 2.
