agent-orchestrator/docs/multi-machine.md

14 KiB

Multi-Machine Support — Architecture & Implementation

Status: Implemented (Phases A-D complete, 2026-03-06)

Overview

Extend Agents Orchestrator to manage Claude agent sessions and terminal panes running on remote machines over WebSocket, while keeping the local sidecar path unchanged.

Problem

Current architecture is local-only:

WebView ←→ Rust (Tauri IPC) ←→ Local Sidecar (stdio NDJSON)
                              ←→ Local PTY (portable-pty)

Target state: Agents Orchestrator acts as a mission control that observes agents and terminals running on multiple machines (dev servers, cloud VMs, CI runners).

Design Constraints

  1. Zero changes to local path — local sidecar/PTY must work identically
  2. Same NDJSON protocol — remote and local agents speak the same message format
  3. No new runtime dependencies — use Rust's tokio-tungstenite (already available via Tauri)
  4. Graceful degradation — remote machine goes offline → pane shows disconnected state, reconnects automatically
  5. Security — all remote connections authenticated and encrypted (TLS + token)

Architecture

Three-Layer Model

┌──────────────────────────────────────────────────────────────────┐
│  Agents Orchestrator (Controller)                                          │
│                                                                  │
│  ┌──────────┐    Tauri IPC    ┌──────────────────────────────┐  │
│  │ WebView  │ ←────────────→  │ Rust Backend                 │  │
│  │ (Svelte) │                 │                              │  │
│  └──────────┘                 │  ├── PtyManager (local)      │  │
│                               │  ├── SidecarManager (local)  │  │
│                               │  └── RemoteManager ──────────┼──┤
│                               └──────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
        │                                      │
        │ (local stdio)                        │ (WebSocket wss://)
        ▼                                      ▼
  ┌───────────┐                    ┌──────────────────────┐
  │ Local     │                    │ Remote Machine       │
  │ Sidecar   │                    │                      │
  │ (Deno/    │                    │  ┌────────────────┐  │
  │  Node.js) │                    │  │ agor-relay│  │
  │           │                    │  │ (Rust binary)  │  │
  └───────────┘                    │  │                │  │
                                   │  │ ├── PTY mgr   │  │
                                   │  │ ├── Sidecar mgr│  │
                                   │  │ └── WS server  │  │
                                   │  └────────────────┘  │
                                   └──────────────────────┘

Components

1. agor-relay — Remote Agent (Rust binary)

A standalone Rust binary that runs on each remote machine. It:

  • Listens on a WebSocket port (default: 9750)
  • Manages local PTYs and claude sidecar processes
  • Forwards NDJSON events to the controller over WebSocket
  • Receives commands (query, stop, resize, write) from the controller

Why a Rust binary? Reuses existing PtyManager and SidecarManager code from src-tauri/src/. Extracted into a shared crate.

agor-relay/
├── Cargo.toml        # depends on agor-core
├── src/
│   └── main.rs       # WebSocket server + auth
│
agor-core/       # shared crate (extracted from src-tauri)
├── Cargo.toml
├── src/
│   ├── pty.rs        # PtyManager (from src-tauri/src/pty.rs)
│   ├── sidecar.rs    # SidecarManager (from src-tauri/src/sidecar.rs)
│   └── lib.rs

2. RemoteManager — Controller-Side (in Rust backend)

New module in src-tauri/src/remote.rs. Manages WebSocket connections to multiple relays.

pub struct RemoteMachine {
    pub id: String,
    pub label: String,
    pub url: String,          // wss://host:9750
    pub token: String,        // auth token
    pub status: RemoteStatus, // connected | connecting | disconnected | error
}

pub enum RemoteStatus {
    Connected,
    Connecting,
    Disconnected,
    Error(String),
}

pub struct RemoteManager {
    machines: Arc<Mutex<Vec<RemoteMachine>>>,
    connections: Arc<Mutex<HashMap<String, WsConnection>>>,
}

3. Frontend Adapters — Unified Interface

The frontend doesn't care whether a pane is local or remote. The bridge layer abstracts this:

// adapters/agent-bridge.ts — extended
export async function queryAgent(options: AgentQueryOptions): Promise<void> {
  if (options.remote_machine_id) {
    return invoke('remote_agent_query', { machineId: options.remote_machine_id, options });
  }
  return invoke('agent_query', { options });
}

Same pattern for pty-bridge.ts — add optional remote_machine_id to all operations.

Protocol

WebSocket Wire Format

Same NDJSON as local sidecar, wrapped in an envelope for multiplexing:

// Controller → Relay (commands)
interface RelayCommand {
  id: string;                      // request correlation ID
  type: 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
      | 'agent_query' | 'agent_stop' | 'sidecar_restart'
      | 'ping';
  payload: Record<string, unknown>;
}

// Relay → Controller (events)
interface RelayEvent {
  type: 'pty_data' | 'pty_exit' | 'pty_created'
      | 'sidecar_message' | 'sidecar_exited'
      | 'error' | 'pong' | 'ready';
  sessionId?: string;
  payload: unknown;
}

Authentication

  1. Pre-shared token — relay starts with --token <secret>. Controller sends token in WebSocket upgrade headers (Authorization: Bearer <token>).
  2. TLS required — relay rejects non-TLS connections in production mode. Dev mode allows ws:// with --insecure flag.
  3. Token rotation — future: relay exposes endpoint to rotate token. Controller stores tokens in SQLite settings table.

Connection Lifecycle

Controller                          Relay
    │                                 │
    │── WSS connect ─────────────────→│
    │── Authorization: Bearer token ──→│
    │                                 │
    │←── { type: "ready", ...} ───────│
    │                                 │
    │── { type: "ping" } ────────────→│
    │←── { type: "pong" } ────────────│  (every 15s)
    │                                 │
    │── { type: "agent_query", ... }──→│
    │←── { type: "sidecar_message" }──│  (streaming)
    │←── { type: "sidecar_message" }──│
    │                                 │
    │     (disconnect)                │
    │── reconnect (exp backoff) ─────→│  (1s, 2s, 4s, 8s, max 30s)

Reconnection (Implemented)

  • Controller reconnects with exponential backoff (1s, 2s, 4s, 8s, 16s, 30s cap)
  • Reconnection runs as an async tokio task spawned on disconnect
  • Uses attempt_tcp_probe(): TCP connect only (no WS upgrade), 5s timeout, default port 9750. Avoids allocating per-connection resources (PtyManager, SidecarManager) on the relay during probes.
  • Emits remote-machine-reconnecting event (with backoff duration) and remote-machine-reconnect-ready when probe succeeds
  • Frontend listens via onRemoteMachineReconnecting and onRemoteMachineReconnectReady in remote-bridge.ts; machines store sets status to 'reconnecting' and auto-calls connectMachine() on ready
  • Cancels if machine is removed or manually reconnected (checks status == "disconnected" && connection == None)
  • On reconnect, relay sends current state snapshot (active sessions, PTY list)
  • Controller reconciles: updates pane states, re-subscribes to streams
  • Active agent sessions continue on relay regardless of controller connection

Session Persistence Across Reconnects

Key insight: remote agents keep running even when the controller disconnects. The relay is autonomous — it doesn't need the controller to operate.

On reconnect:

  1. Relay sends { type: "state_sync", activeSessions: [...], activePtys: [...] }
  2. Controller matches against known panes, updates status
  3. Missed messages are NOT replayed (too complex, marginal value). Agent panes show "reconnected — some messages may be missing" notice

Frontend Integration

Pane Model Changes

// stores/layout.svelte.ts
export interface Pane {
  id: string;
  type: 'terminal' | 'agent';
  title: string;
  group?: string;
  remoteMachineId?: string;  // NEW: undefined = local
}

Sidebar — Machine Groups

Remote panes auto-group by machine label in the sidebar:

▾ Local
  ├── Terminal 1
  └── Agent: fix bug

▾ devbox (192.168.1.50)      ← remote machine
  ├── SSH session
  └── Agent: deploy

▾ ci-runner (10.0.0.5)       ← remote machine (disconnected)
  └── Agent: test suite ⚠️

Settings Panel

New "Machines" section in settings:

Field Type Notes
Label string Human-readable name
URL string wss://host:9750
Token password Pre-shared auth token
Auto-connect boolean Connect on app launch

Stored in SQLite settings table as JSON: remote_machines key.

Implementation (All Phases Complete)

Phase A: Extract agor-core crate [DONE]

  • Cargo workspace at level (Cargo.toml with members: src-tauri, agor-core, agor-relay)
  • PtyManager and SidecarManager extracted to agor-core/
  • EventSink trait (agor-core/src/event.rs) abstracts event emission
  • TauriEventSink (src-tauri/src/event_sink.rs) implements EventSink for AppHandle
  • src-tauri pty.rs and sidecar.rs are thin re-export wrappers

Phase B: Build agor-relay binary [DONE]

  • agor-relay/src/main.rs — WebSocket server (tokio-tungstenite)
  • Token auth on WebSocket upgrade (Authorization: Bearer header)
  • CLI: --port (default 9750), --token (required), --insecure (allow ws://)
  • Routes RelayCommand to agor-core managers, forwards RelayEvent over WebSocket
  • Rate limiting: 10 failed auth attempts triggers 5-minute lockout
  • Per-connection isolated PtyManager + SidecarManager instances
  • Command response propagation: structured responses (pty_created, pong, error) sent back via shared event channel
  • send_error() helper: all command failures emit RelayEvent with commandId + error message
  • PTY creation confirmation: pty_create command returns pty_created event with session ID and commandId for correlation

Phase C: Add RemoteManager to controller [DONE]

  • src-tauri/src/remote.rs — RemoteManager struct with WebSocket client connections
  • 12 Tauri commands: remote_add_machine, remote_remove_machine, remote_connect, remote_disconnect, remote_list_machines, remote_pty_spawn/write/resize/kill, remote_agent_query/stop, remote_sidecar_restart
  • Heartbeat ping every 15s
  • PTY creation event: emits remote-pty-created Tauri event with machineId, ptyId, commandId
  • Exponential backoff reconnection on disconnect (1s/2s/4s/8s/16s/30s cap) via attempt_tcp_probe() (TCP-only, no WS upgrade)
  • Reconnection events: remote-machine-reconnecting, remote-machine-reconnect-ready

Phase D: Frontend integration [DONE]

  • src/lib/adapters/remote-bridge.ts — machine management IPC adapter
  • src/lib/stores/machines.svelte.ts — remote machine state store
  • Pane.remoteMachineId field in layout store
  • agent-bridge.ts and pty-bridge.ts route to remote commands when remoteMachineId is set
  • SettingsDialog "Remote Machines" section
  • Sidebar auto-groups remote panes by machine label

Remaining Work

  • Reconnection logic with exponential backoff (1s-30s cap) — implemented in remote.rs
  • Relay command response propagation (pty_created, pong, error events) — implemented in main.rs
  • Real-world relay testing (2 machines)
  • TLS/certificate pinning

Security Considerations

Threat Mitigation
Token interception TLS required (reject ws:// without --insecure)
Token brute-force Rate limit auth attempts (5/min), lockout after 10 failures
Relay impersonation Pin relay certificate fingerprint (future: mTLS)
Command injection Relay validates all command payloads against schema
Lateral movement Relay runs as unprivileged user, no shell access beyond PTY/sidecar
Data exfiltration Agent output streams to controller only, no relay-to-relay traffic

Performance Considerations

Concern Mitigation
WebSocket latency Typical LAN: <1ms. WAN: 20-100ms. Acceptable for agent output (text, not video)
Bandwidth Agent NDJSON: ~50KB/s peak. Terminal: ~200KB/s peak. Trivial even on slow links
Connection count Max 10 machines initially (UI constraint, not technical)
Message ordering Single WebSocket per machine = ordered delivery guaranteed

What This Does NOT Cover (Future)

  • Multi-controller — multiple Agents Orchestrator instances observing the same relay (needs pub/sub)
  • Relay discovery — automatic detection of relays on LAN (mDNS/Bonjour)
  • Agent migration — moving a running agent from one machine to another
  • Relay-to-relay — direct communication between remote machines
  • mTLS — mutual TLS for enterprise environments (Phase B+ enhancement)