# Multi-Machine Support
**Status: Implemented (Phases A-D complete, 2026-03-06)**
## Overview
Multi-machine support extends agor to manage Claude agent sessions and terminal panes running on **remote machines** over WebSocket, while keeping the local sidecar path unchanged.
## Architecture
### Three-Layer Model
```
+----------------------------------------------------------------+
|  Agent Orchestrator (Controller)                               |
|                                                                |
|  +----------+   Tauri IPC    +------------------------------+  |
|  | WebView  | <------------> | Rust Backend                 |  |
|  | (Svelte) |                |                              |  |
|  +----------+                | +-- PtyManager (local)       |  |
|                              | +-- SidecarManager (local)   |  |
|                              | +-- RemoteManager -----------+--+
|                              +------------------------------+  |
+----------------------------------------------------------------+
        |                                                        |
        | (local stdio)                                          | (WebSocket wss://)
        v                                                        v
  +-----------+                                       +----------------------+
  | Local     |                                       | Remote Machine       |
  | Sidecar   |                                       | +---------------+    |
  | (Deno/    |                                       | | agor-relay    |    |
  | Node.js)  |                                       | | (Rust binary) |    |
  +-----------+                                       | |               |    |
                                                      | | +-- PTY mgr   |    |
                                                      | | +-- Sidecar   |    |
                                                      | | +-- WS server |    |
                                                      | +---------------+    |
                                                      +----------------------+
```
### Components
#### 1. `agor-relay` — Remote Agent (Rust binary)
A standalone Rust binary that runs on each remote machine:
- Listens on a WebSocket port (default: 9750)
- Manages local PTYs and sidecar processes
- Forwards NDJSON events to the controller over WebSocket
- Receives commands (query, stop, resize, write) from the controller

The relay reuses `PtyManager` and `SidecarManager` from `agor-core`.
#### 2. `RemoteManager` — Controller-Side
A module in `src-tauri/src/remote.rs` that manages WebSocket connections to multiple relays and exposes 12 Tauri commands for remote operations.
#### 3. Frontend Adapters — Unified Interface
The frontend doesn't care whether a pane is local or remote. Bridge adapters check `remoteMachineId` and route accordingly.
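
The routing decision can be sketched as follows. This is a minimal illustration, not the actual adapter code: only the `remoteMachineId` field comes from this document, while `PtyBackend`, `localBackend`, `remoteBackend`, and `backendFor` are hypothetical names, and the backends return strings instead of calling Tauri commands.

```typescript
// Hypothetical pane shape; only `remoteMachineId` is taken from the document.
interface Pane {
  id: string;
  remoteMachineId?: string; // set when the pane lives on a remote relay
}

// Hypothetical interface shared by the local and remote paths.
interface PtyBackend {
  write(paneId: string, data: string): string;
}

// Stand-in for the local sidecar/PTY path.
const localBackend: PtyBackend = {
  write: (paneId, data) => `local:${paneId}:${data}`,
};

// Stand-in for the path that goes through a relay connection.
function remoteBackend(machineId: string): PtyBackend {
  return {
    write: (paneId, data) => `remote(${machineId}):${paneId}:${data}`,
  };
}

// Pick the backend from the pane's routing field.
function backendFor(pane: Pane): PtyBackend {
  return pane.remoteMachineId === undefined
    ? localBackend
    : remoteBackend(pane.remoteMachineId);
}
```

Callers use `backendFor(pane).write(...)` without branching on location themselves, which is what keeps the rest of the frontend unaware of where a pane runs.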
## Protocol
### WebSocket Wire Format
Messages use the same NDJSON as the local sidecar, wrapped in an envelope for multiplexing:
```typescript
// Controller -> Relay (commands)
interface RelayCommand {
  id: string;
  type: 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
      | 'agent_query' | 'agent_stop' | 'sidecar_restart' | 'ping';
  payload: Record<string, unknown>;
}

// Relay -> Controller (events)
interface RelayEvent {
  type: 'pty_data' | 'pty_exit' | 'pty_created'
      | 'sidecar_message' | 'sidecar_exited'
      | 'error' | 'pong' | 'ready';
  sessionId?: string;
  payload: unknown;
}
```
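
As a rough illustration of how these envelopes might travel as NDJSON, the sketch below serializes a command to one JSON line and splits incoming text back into events. It assumes one envelope per line. The helper names `encodeCommand` and `decodeEvents` and the `data` payload field are hypothetical, and the types are repeated in simplified form so the block stands alone.

```typescript
type RelayCommandType =
  | 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
  | 'agent_query' | 'agent_stop' | 'sidecar_restart' | 'ping';

interface RelayCommand {
  id: string;
  type: RelayCommandType;
  payload: Record<string, unknown>;
}

// Simplified event shape for the sketch (type left as a plain string).
interface RelayEvent {
  type: string;
  sessionId?: string;
  payload: unknown;
}

// Serialize one command as a single NDJSON line (JSON followed by "\n").
// JSON.stringify escapes newlines inside strings, so the line stays intact.
function encodeCommand(cmd: RelayCommand): string {
  return JSON.stringify(cmd) + '\n';
}

// Split NDJSON text into events, skipping blank lines.
// Assumes the chunk ends on a line boundary; a real reader would buffer
// a trailing partial line until the rest arrives.
function decodeEvents(chunk: string): RelayEvent[] {
  return chunk
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as RelayEvent);
}
```

For example, `encodeCommand({ id: 'c1', type: 'ping', payload: {} })` yields a single line ending in a newline that the receiving side can parse independently of its neighbours.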
### Authentication
1. **Pre-shared token**: the relay starts with `--token <secret>`, and the controller sends the token in the WebSocket upgrade headers.
2. **TLS required**: the relay rejects non-TLS connections in production mode. Dev mode allows `ws://` when started with the `--insecure` flag.
3. **Rate limiting**: 10 failed auth attempts trigger a 5-minute lockout.
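
The lockout rule can be sketched as a small counter. This is an illustration only: the relay is a Rust binary, and the document does not say whether failures are counted per client address or globally, nor whether a successful login resets the count. The `AuthLimiter` class and its methods are hypothetical, and timestamps are passed in as milliseconds so the logic is easy to test.

```typescript
const MAX_FAILURES = 10;          // from the text: 10 failed attempts
const LOCKOUT_MS = 5 * 60 * 1000; // from the text: 5-minute lockout

class AuthLimiter {
  private failures = 0;
  private lockedUntil = 0;

  // True while the lockout window is still running.
  isLocked(now: number): boolean {
    return now < this.lockedUntil;
  }

  // Record one failed attempt; the 10th starts a lockout and resets the count.
  recordFailure(now: number): void {
    if (this.isLocked(now)) return; // attempts during lockout are ignored
    this.failures += 1;
    if (this.failures >= MAX_FAILURES) {
      this.lockedUntil = now + LOCKOUT_MS;
      this.failures = 0;
    }
  }

  // Assumption: a successful authentication clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
  }
}
```

A server would check `isLocked(Date.now())` before comparing tokens, so that a locked-out client is rejected without evaluating the token at all.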
### Reconnection
- Exponential backoff: 1s, 2s, 4s, 8s, 16s, then capped at 30s
- Uses `attempt_tcp_probe()`: a TCP-only connect with a 5s timeout, which avoids allocating resources on the relay during probes
- Emits `remote-machine-reconnecting` and `remote-machine-reconnect-ready` events
- Active agent sessions continue on relay regardless of controller connection
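
The delay schedule above follows from doubling a 1s base and clamping at 30s. The helper below is a hypothetical sketch of that arithmetic (`backoffDelayMs` is not a name from the codebase), with attempts counted from zero.

```typescript
const BASE_DELAY_MS = 1000;  // first retry after 1s
const MAX_DELAY_MS = 30000;  // cap from the schedule above

// Delay before reconnect attempt `attempt` (0-based):
// 1s, 2s, 4s, 8s, 16s, then 30s for every later attempt.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}
```

The sixth attempt would be 32s uncapped, which is why the sequence jumps from 16s straight to the 30s ceiling.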
### Session Persistence Across Reconnects
Remote agents keep running even when the controller disconnects. On reconnect:
1. Relay sends state sync with active sessions and PTYs
2. Controller reconciles and updates pane states
3. Missed messages are NOT replayed; agent panes show a "reconnected" notice instead
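
The reconcile step might look roughly like the sketch below. The shape of the state sync message is not specified in this document, so `StateSync`, `PaneStatus`, and `reconcile` are all hypothetical. The idea is only that panes whose sessions survived are flagged as reconnected and the rest are flagged as lost.

```typescript
// Hypothetical payload: ids of sessions and PTYs still alive on the relay.
interface StateSync {
  sessionIds: string[];
}

type PaneStatus = 'reconnected' | 'lost';

// Compare the controller's known sessions against the relay's report.
function reconcile(
  knownSessionIds: string[],
  sync: StateSync,
): Map<string, PaneStatus> {
  const alive = new Set(sync.sessionIds);
  const result = new Map<string, PaneStatus>();
  for (const id of knownSessionIds) {
    // Survivors get the "reconnected" notice; missed output is not replayed.
    result.set(id, alive.has(id) ? 'reconnected' : 'lost');
  }
  return result;
}
```

Sessions reported by the relay that the controller does not know about are ignored here; how those are handled in practice is left open.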
## Implementation Summary
### Phase A: Extract `agor-core` crate
Cargo workspace with `PtyManager`, `SidecarManager`, and the `EventSink` trait extracted to a shared crate.
### Phase B: Build `agor-relay` binary
WebSocket server with token auth, per-connection isolated managers, and structured command responses with `commandId` correlation.
### Phase C: Add `RemoteManager` to controller
12 Tauri commands, heartbeat ping every 15s, exponential backoff reconnection.
### Phase D: Frontend integration
`remote-bridge.ts` adapter, `machines.svelte.ts` store, `Pane.remoteMachineId` routing field.
### Remaining Work
- [ ] Real-world relay testing (2 machines)
- [ ] TLS/certificate pinning
## Security
| Threat | Mitigation |
|--------|-----------|
| Token interception | TLS required |
| Token brute-force | Rate limit + lockout |
| Relay impersonation | Certificate pinning (still listed under Remaining Work; mTLS is a future option) |
| Command injection | Payload schema validation |
| Lateral movement | Unprivileged user, no shell beyond PTY/sidecar |
| Data exfiltration | Agent output streams to controller only |
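
To illustrate the "payload schema validation" row, the sketch below checks an incoming command envelope against the command types defined in the wire format. It is written in TypeScript for consistency with the protocol types, whereas the relay itself is a Rust binary, and it validates only the envelope, not per-command payload fields. `validateCommand` and `ValidCommand` are hypothetical names.

```typescript
const COMMAND_TYPES = [
  'pty_create', 'pty_write', 'pty_resize', 'pty_close',
  'agent_query', 'agent_stop', 'sidecar_restart', 'ping',
] as const;

type CommandType = (typeof COMMAND_TYPES)[number];

interface ValidCommand {
  id: string;
  type: CommandType;
  payload: Record<string, unknown>;
}

// Return the command if its envelope is well-formed, otherwise null.
function validateCommand(raw: unknown): ValidCommand | null {
  if (typeof raw !== 'object' || raw === null) return null;
  const { id, type, payload } = raw as Record<string, unknown>;
  if (typeof id !== 'string' || id.length === 0) return null;
  if (typeof type !== 'string') return null;
  // Reject any command type outside the known set.
  if ((COMMAND_TYPES as readonly string[]).indexOf(type) === -1) return null;
  // Payload must be a plain object, not null and not an array.
  if (typeof payload !== 'object' || payload === null || Array.isArray(payload)) {
    return null;
  }
  return {
    id,
    type: type as CommandType,
    payload: payload as Record<string, unknown>,
  };
}
```

Rejecting unknown `type` values up front means only the eight listed commands can ever reach the PTY and sidecar managers.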
## Performance
| Concern | Expected Behavior / Mitigation |
|---------|-----------|
| WebSocket latency | LAN: <1ms, WAN: 20-100ms (acceptable for text) |
| Bandwidth | Agent NDJSON: ~50KB/s peak, Terminal: ~200KB/s peak |
| Connection count | Max 10 machines (UI constraint) |
| Message ordering | Single WebSocket per machine = ordered delivery |
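
The figures in the table give a rough upper bound on controller-side traffic. The calculation below assumes, purely for illustration, that each machine contributes one agent stream and one terminal stream and that all of them peak at the same moment. The table does not state whether the peaks are per session or per machine, and `worstCaseKBPerSec` is a hypothetical helper.

```typescript
const MAX_MACHINES = 10;                 // UI constraint from the table
const AGENT_PEAK_KB_PER_SEC = 50;        // agent NDJSON peak
const TERMINAL_PEAK_KB_PER_SEC = 200;    // terminal peak

// Upper bound in KB/s if `machines` relays all peak simultaneously,
// with one agent stream and one terminal stream each (an assumption).
function worstCaseKBPerSec(machines: number): number {
  return machines * (AGENT_PEAK_KB_PER_SEC + TERMINAL_PEAK_KB_PER_SEC);
}

const peakAtLimit = worstCaseKBPerSec(MAX_MACHINES); // 10 * 250 = 2500 KB/s
```

Under these assumptions the ceiling is about 2.5 MB/s across all connections.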
## Future (Not Covered)
- Multi-controller (multiple agor instances observing same relay)
- Relay discovery (mDNS/Bonjour)
- Agent migration between machines
- Relay-to-relay communication
- mTLS for enterprise environments