agent-orchestrator/docs/multi-machine/relay.md

Multi-Machine Support

Status: Implemented (Phases A-D complete, 2026-03-06)

Overview

Extends agor to manage Claude agent sessions and terminal panes running on remote machines over WebSocket, while keeping the local sidecar path unchanged.

Architecture

Three-Layer Model

+----------------------------------------------------------------+
|  Agent Orchestrator (Controller)                               |
|                                                                |
|  +----------+    Tauri IPC    +------------------------------+ |
|  | WebView  | <------------> | Rust Backend                 | |
|  | (Svelte) |                |                              | |
|  +----------+                |  +-- PtyManager (local)      | |
|                              |  +-- SidecarManager (local)  | |
|                              |  +-- RemoteManager ----------+-+
|                              +------------------------------+ |
+----------------------------------------------------------------+
        |                                      |
        | (local stdio)                        | (WebSocket wss://)
        v                                      v
  +-----------+                    +----------------------+
  | Local     |                    | Remote Machine       |
  | Sidecar   |                    |  +--------------+    |
  | (Deno/    |                    |  | agor-relay   |    |
  |  Node.js) |                    |  | (Rust binary)|    |
  +-----------+                    |  |              |    |
                                   |  | +-- PTY mgr  |    |
                                   |  | +-- Sidecar  |    |
                                   |  | +-- WS server|    |
                                   |  +--------------+    |
                                   +----------------------+

Components

1. agor-relay — Remote Agent (Rust binary)

A standalone Rust binary that runs on each remote machine:

  • Listens on a WebSocket port (default: 9750)
  • Manages local PTYs and sidecar processes
  • Forwards NDJSON events to the controller over WebSocket
  • Receives commands (query, stop, resize, write) from the controller

Reuses PtyManager and SidecarManager from agor-core.

2. RemoteManager — Controller-Side

A module in src-tauri/src/remote.rs that manages WebSocket connections to multiple relays and exposes 12 Tauri commands for remote operations.

3. Frontend Adapters — Unified Interface

The frontend doesn't care whether a pane is local or remote. Bridge adapters check remoteMachineId and route accordingly.
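
The routing described above can be sketched as follows. This is a minimal illustration, not the actual adapter code: only the remoteMachineId field comes from this document, while PaneRef, PtyBridge, localBridge, makeRemoteBridge, and writeToPane are hypothetical names, and the bridges return strings instead of calling Tauri IPC or a WebSocket.

```typescript
// Hypothetical pane shape; only `remoteMachineId` is taken from the document.
interface PaneRef {
  id: string;
  remoteMachineId?: string; // set only for panes that live on a relay
}

// Hypothetical common interface shared by the local and remote paths.
interface PtyBridge {
  write(paneId: string, data: string): string;
}

// Stand-in for the local path (Tauri IPC in the real application).
const localBridge: PtyBridge = {
  write: (paneId, data) => `local:${paneId}:${data}`,
};

// Stand-in for the remote path (WebSocket to a relay in the real application).
function makeRemoteBridge(machineId: string): PtyBridge {
  return {
    write: (paneId, data) => `remote:${machineId}:${paneId}:${data}`,
  };
}

// Choose the bridge in one place so callers never branch on pane location.
function bridgeFor(pane: PaneRef): PtyBridge {
  return pane.remoteMachineId
    ? makeRemoteBridge(pane.remoteMachineId)
    : localBridge;
}

function writeToPane(pane: PaneRef, data: string): string {
  return bridgeFor(pane).write(pane.id, data);
}
```

Keeping the check inside bridgeFor means UI components call writeToPane the same way for both kinds of pane.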

Protocol

WebSocket Wire Format

Same NDJSON as local sidecar, wrapped in an envelope for multiplexing:

// Controller -> Relay (commands)
interface RelayCommand {
  id: string;
  type: 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
      | 'agent_query' | 'agent_stop' | 'sidecar_restart' | 'ping';
  payload: Record<string, unknown>;
}

// Relay -> Controller (events)
interface RelayEvent {
  type: 'pty_data' | 'pty_exit' | 'pty_created'
      | 'sidecar_message' | 'sidecar_exited'
      | 'error' | 'pong' | 'ready';
  sessionId?: string;
  payload: unknown;
}
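
A minimal sketch of how a controller might serialize commands and validate incoming events against these interfaces. It assumes each WebSocket text frame carries one or more newline-delimited JSON envelopes, which this document does not specify. encodeCommand, isRelayEvent, parseEvents, and groupBySession are hypothetical helpers, and the empty-string key for events without a sessionId is an illustration choice.

```typescript
type RelayCommandType =
  | 'pty_create' | 'pty_write' | 'pty_resize' | 'pty_close'
  | 'agent_query' | 'agent_stop' | 'sidecar_restart' | 'ping';

type RelayEventType =
  | 'pty_data' | 'pty_exit' | 'pty_created'
  | 'sidecar_message' | 'sidecar_exited'
  | 'error' | 'pong' | 'ready';

interface RelayCommand {
  id: string;
  type: RelayCommandType;
  payload: Record<string, unknown>;
}

interface RelayEvent {
  type: RelayEventType;
  sessionId?: string;
  payload: unknown;
}

const EVENT_TYPES = new Set<string>([
  'pty_data', 'pty_exit', 'pty_created',
  'sidecar_message', 'sidecar_exited',
  'error', 'pong', 'ready',
]);

// One JSON object per line, terminated by a newline (NDJSON).
function encodeCommand(cmd: RelayCommand): string {
  return JSON.stringify(cmd) + "\n";
}

// Runtime check, since JSON.parse returns untyped data.
function isRelayEvent(value: unknown): value is RelayEvent {
  if (typeof value !== "object" || value === null) return false;
  const o = value as Record<string, unknown>;
  return typeof o.type === "string"
    && EVENT_TYPES.has(o.type)
    && (o.sessionId === undefined || typeof o.sessionId === "string")
    && "payload" in o;
}

// Split a frame into lines, skip blank lines, and validate each envelope.
function parseEvents(frame: string): RelayEvent[] {
  return frame
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const value: unknown = JSON.parse(line);
      if (!isRelayEvent(value)) throw new Error(`invalid relay event: ${line}`);
      return value;
    });
}

// Demultiplex: events without a sessionId (e.g. pong) go under the "" key.
function groupBySession(events: RelayEvent[]): Map<string, RelayEvent[]> {
  const bySession = new Map<string, RelayEvent[]>();
  for (const ev of events) {
    const key = ev.sessionId ?? "";
    const bucket = bySession.get(key) ?? [];
    bucket.push(ev);
    bySession.set(key, bucket);
  }
  return bySession;
}
```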

Authentication

  1. Pre-shared token: the relay starts with --token <secret>, and the controller sends the token in the WebSocket upgrade headers.
  2. TLS required: the relay rejects non-TLS connections in production mode. Dev mode allows ws:// with the --insecure flag.
  3. Rate limiting: 10 failed auth attempts trigger a 5-minute lockout.
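
The lockout rule in item 3 can be sketched as a small counter. This is an illustration under assumptions, not the relay's implementation (the relay is a Rust binary): AuthLimiter is a hypothetical name, and the document does not say whether failures are counted per client or globally, or whether a successful attempt resets the count. The sketch uses one counter and resets on success. The clock is passed in so the behaviour is easy to test.

```typescript
class AuthLimiter {
  private failures = 0;
  private lockedUntil = 0; // timestamp in ms; 0 means "not locked"
  private readonly maxFailures: number;
  private readonly lockoutMs: number;

  // Defaults follow the documented rule: 10 failures, 5-minute lockout.
  constructor(maxFailures: number = 10, lockoutMs: number = 5 * 60_000) {
    this.maxFailures = maxFailures;
    this.lockoutMs = lockoutMs;
  }

  isLocked(nowMs: number): boolean {
    return nowMs < this.lockedUntil;
  }

  // Returns true when the attempt is accepted.
  // `tokenOk` is the result of comparing the presented token with the secret.
  attempt(tokenOk: boolean, nowMs: number): boolean {
    if (this.isLocked(nowMs)) return false; // rejected even with a valid token
    if (tokenOk) {
      this.failures = 0; // assumption: success clears the failure count
      return true;
    }
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      this.lockedUntil = nowMs + this.lockoutMs;
      this.failures = 0; // start fresh once the lockout expires
    }
    return false;
  }
}
```

During a lockout even a correct token is rejected, so an attacker cannot tell from the response whether a guess was valid.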

Reconnection

  • Exponential backoff: 1s, 2s, 4s, 8s, 16s, then capped at 30s
  • Probes reachability with attempt_tcp_probe(): a TCP-only connect with a 5s timeout, which avoids allocating resources on the relay during probes
  • Emits remote-machine-reconnecting and remote-machine-reconnect-ready events
  • Active agent sessions continue on relay regardless of controller connection
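
The backoff schedule above amounts to doubling a 1s base delay and capping it at 30s. A minimal sketch follows. backoffDelayMs, backoffSchedule, and reconnectWithBackoff are hypothetical names, the probe callback is a stand-in for attempt_tcp_probe() (which lives in the Rust backend), and maxAttempts is an illustration parameter that the document does not define.

```typescript
// Delay before reconnect attempt `attempt` (0-based): 1s doubled each time, capped at 30s.
function backoffDelayMs(attempt: number, baseMs: number = 1000, capMs: number = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// First `count` delays of the schedule.
// backoffSchedule(7) -> [1000, 2000, 4000, 8000, 16000, 30000, 30000]
function backoffSchedule(count: number): number[] {
  return Array.from({ length: count }, (_, attempt) => backoffDelayMs(attempt));
}

// Wait, probe, and stop at the first successful probe.
// Resolves to the 0-based attempt that succeeded, or null if none did.
async function reconnectWithBackoff(
  probe: () => Promise<boolean>,          // stand-in for a cheap reachability check
  sleep: (ms: number) => Promise<void>,   // injected so tests need no real timers
  maxAttempts: number,                    // hypothetical limit, not from the document
): Promise<number | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    await sleep(backoffDelayMs(attempt));
    if (await probe()) return attempt;
  }
  return null;
}
```

Probing before opening the full WebSocket keeps failed attempts cheap for the relay, which matches the stated purpose of the TCP-only probe.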

Session Persistence Across Reconnects

Remote agents keep running even when the controller disconnects. On reconnect:

  1. The relay sends a state sync listing its active sessions and PTYs
  2. The controller reconciles this with its own view and updates pane states
  3. Missed messages are NOT replayed (agent panes show a "reconnected" notice)
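
The reconciliation in step 2 can be sketched as a pure function over the controller's panes. The shapes here are assumptions made for illustration: the document does not define the state sync payload, so StateSync, RemotePane, the status values, and reconcilePanes are all hypothetical.

```typescript
// Hypothetical payload of the relay's state sync message.
interface StateSync {
  sessions: string[]; // ids of agent sessions still running on the relay
  ptys: string[];     // ids of PTYs still open on the relay
}

// Hypothetical controller-side view of a remote pane.
interface RemotePane {
  id: string;
  kind: 'agent' | 'pty';
  sessionId: string;
  status: 'running' | 'disconnected' | 'exited';
  notice?: string;
}

// Mark panes whose session survived as running and the rest as exited.
// Agent panes get a notice because missed messages are not replayed.
function reconcilePanes(panes: RemotePane[], sync: StateSync): RemotePane[] {
  const liveSessions = new Set(sync.sessions);
  const livePtys = new Set(sync.ptys);

  return panes.map((pane): RemotePane => {
    const alive = pane.kind === 'agent'
      ? liveSessions.has(pane.sessionId)
      : livePtys.has(pane.sessionId);

    if (!alive) return { ...pane, status: 'exited' };
    if (pane.kind === 'agent') {
      return { ...pane, status: 'running', notice: 'reconnected' };
    }
    return { ...pane, status: 'running' };
  });
}
```

Returning new pane objects instead of mutating them suits a reactive store, where a changed reference is what triggers a UI update.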

Implementation Summary

Phase A: Extract agor-core crate

Cargo workspace with PtyManager, SidecarManager, EventSink trait extracted to shared crate.

Phase B: Build agor-relay binary

WebSocket server with token auth, per-connection isolated managers, structured command responses with commandId correlation.

Phase C: Add RemoteManager to controller

12 Tauri commands, heartbeat ping every 15s, exponential backoff reconnection.

Phase D: Frontend integration

remote-bridge.ts adapter, machines.svelte.ts store, Pane.remoteMachineId routing field.

Remaining Work

  • Real-world relay testing (2 machines)
  • TLS/certificate pinning

Security

| Threat | Mitigation |
|---|---|
| Token interception | TLS required |
| Token brute-force | Rate limit + lockout |
| Relay impersonation | Certificate pinning (future: mTLS) |
| Command injection | Payload schema validation |
| Lateral movement | Unprivileged user, no shell beyond PTY/sidecar |
| Data exfiltration | Agent output streams to controller only |

Performance

| Concern | Mitigation |
|---|---|
| WebSocket latency | LAN: <1ms, WAN: 20-100ms (acceptable for text) |
| Bandwidth | Agent NDJSON: ~50KB/s peak, Terminal: ~200KB/s peak |
| Connection count | Max 10 machines (UI constraint) |
| Message ordering | Single WebSocket per machine = ordered delivery |

Future (Not Covered)

  • Multi-controller (multiple agor instances observing same relay)
  • Relay discovery (mDNS/Bonjour)
  • Agent migration between machines
  • Relay-to-relay communication
  • mTLS for enterprise environments