diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 18201c2..ac1f4df 100644
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -49,7 +49,7 @@
 - remote-bridge.ts adapter wraps remote machine management IPC. machines.svelte.ts store tracks remote machine state.
 - Pane.remoteMachineId?: string routes operations through RemoteManager instead of local managers. Bridge adapters (pty-bridge, agent-bridge) check this field.
 - bterminal-relay binary (v2/bterminal-relay/) is a standalone WebSocket server with token auth, rate limiting, and per-connection isolated managers. Commands return structured responses (pty_created, pong, error) with commandId for correlation via send_error() helper.
-- RemoteManager reconnection: exponential backoff (1s-30s cap) on disconnect, attempt_ws_connect() probe, emits remote-machine-reconnecting and remote-machine-reconnect-ready events.
+- RemoteManager reconnection: exponential backoff (1s-30s cap) on disconnect, attempt_tcp_probe() (TCP-only, no WS upgrade), emits remote-machine-reconnecting and remote-machine-reconnect-ready events. Frontend listeners in remote-bridge.ts; machines store auto-reconnects on ready.

 ## Memora Tags

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 469b88d..21afc51 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,7 +8,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]

 ### Added
-- Exponential backoff reconnection in RemoteManager: on disconnect, spawns async task with 1s/2s/4s/8s/16s/30s-cap backoff, uses attempt_ws_connect() probe (5s timeout), emits remote-machine-reconnecting and remote-machine-reconnect-ready events
+- Exponential backoff reconnection in RemoteManager: on disconnect, spawns async task with 1s/2s/4s/8s/16s/30s-cap backoff, uses attempt_tcp_probe() (TCP-only, no WS upgrade, 5s timeout, default port 9750), emits remote-machine-reconnecting and remote-machine-reconnect-ready events
+- Frontend reconnection listeners: onRemoteMachineReconnecting and onRemoteMachineReconnectReady in remote-bridge.ts; machines store sets status to 'reconnecting' and auto-calls connectMachine() on ready
 - Relay command response propagation: bterminal-relay now sends structured responses (pty_created, pong, error) back to client via shared event channel with commandId correlation
 - send_error() helper in bterminal-relay for consistent error reporting across all command handlers
 - PTY creation confirmation flow: pty_create command returns pty_created event with session ID and commandId; RemoteManager emits remote-pty-created Tauri event
@@ -48,6 +49,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - tempfile dev dependency for Rust test isolation

 ### Changed
+- RemoteManager reconnection probe refactored from attempt_ws_connect() (full WS handshake + auth) to attempt_tcp_probe() (TCP-only connect, no resource allocation on relay)
 - bterminal-relay command handlers refactored: all error paths now use send_error() helper instead of log::error!() only; pong response sent via event channel instead of no-op
 - RemoteManager disconnect handler: scoped mutex release before event emission to prevent deadlocks; spawns reconnection task
 - PtyManager and SidecarManager extracted from src-tauri to bterminal-core shared crate (src-tauri now has thin re-export wrappers)
diff --git a/TODO.md b/TODO.md
index 421d443..02ef8f1 100644
--- a/TODO.md
+++ b/TODO.md
@@ -10,7 +10,7 @@

 ## Completed

-- [x] **Multi-machine reconnection** -- Exponential backoff reconnection (1s-30s cap) in RemoteManager, attempt_ws_connect() probe, reconnection events. | Done: 2026-03-06
+- [x] **Multi-machine reconnection** -- Exponential backoff reconnection (1s-30s cap) in RemoteManager, attempt_tcp_probe() (TCP-only), frontend reconnection listeners + auto-reconnect. | Done: 2026-03-06
 - [x] **Relay command response propagation** -- Structured responses (pty_created, pong, error) with commandId correlation, send_error() helper. | Done: 2026-03-06
 - [x] **Multi-machine support (Phases A-D)** -- bterminal-core crate extraction, bterminal-relay WebSocket binary, RemoteManager, frontend integration. | Done: 2026-03-06
 - [x] **Agent Teams frontend support** -- Subagent pane spawning, parent/child navigation, message routing by parentId, SUBAGENT_TOOL_NAMES detection in dispatcher. | Done: 2026-03-06
diff --git a/docs/multi-machine.md b/docs/multi-machine.md
index 6f1f77e..b1fb648 100644
--- a/docs/multi-machine.md
+++ b/docs/multi-machine.md
@@ -185,8 +185,9 @@ Controller Relay

 - Controller reconnects with exponential backoff (1s, 2s, 4s, 8s, 16s, 30s cap)
 - Reconnection runs as an async tokio task spawned on disconnect
-- Uses `attempt_ws_connect()` probe: connects with auth header, immediately closes (5s timeout)
+- Uses `attempt_tcp_probe()`: TCP connect only (no WS upgrade), 5s timeout, default port 9750. Avoids allocating per-connection resources (PtyManager, SidecarManager) on the relay during probes.
 - Emits `remote-machine-reconnecting` event (with backoff duration) and `remote-machine-reconnect-ready` when probe succeeds
+- Frontend listens via `onRemoteMachineReconnecting` and `onRemoteMachineReconnectReady` in remote-bridge.ts; machines store sets status to 'reconnecting' and auto-calls `connectMachine()` on ready
 - Cancels if machine is removed or manually reconnected (checks status == "disconnected" && connection == None)
 - On reconnect, relay sends current state snapshot (active sessions, PTY list)
 - Controller reconciles: updates pane states, re-subscribes to streams
@@ -274,7 +275,7 @@ Stored in SQLite `settings` table as JSON: `remote_machines` key.
 - 12 Tauri commands: remote_add_machine, remote_remove_machine, remote_connect, remote_disconnect, remote_list_machines, remote_pty_spawn/write/resize/kill, remote_agent_query/stop, remote_sidecar_restart
 - Heartbeat ping every 15s
 - PTY creation event: emits `remote-pty-created` Tauri event with machineId, ptyId, commandId
-- Exponential backoff reconnection on disconnect (1s/2s/4s/8s/16s/30s cap) via `attempt_ws_connect()` probe
+- Exponential backoff reconnection on disconnect (1s/2s/4s/8s/16s/30s cap) via `attempt_tcp_probe()` (TCP-only, no WS upgrade)
 - Reconnection events: `remote-machine-reconnecting`, `remote-machine-reconnect-ready`

 ### Phase D: Frontend integration [DONE]
diff --git a/docs/phases.md b/docs/phases.md
index a7054fc..6e610c0 100644
--- a/docs/phases.md
+++ b/docs/phases.md
@@ -282,7 +282,7 @@ Architecture designed in [multi-machine.md](multi-machine.md). Implementation ex
 - [x] Heartbeat ping every 15s
 - [x] PTY creation event: emits remote-pty-created Tauri event with machineId, ptyId, commandId
 - [x] Exponential backoff reconnection on disconnect (1s/2s/4s/8s/16s/30s cap)
-- [x] attempt_ws_connect() probe function (5s timeout, auth header, immediate close)
+- [x] attempt_tcp_probe() function: TCP-only probe (5s timeout, default port 9750) — avoids allocating per-connection resources on relay during probes
 - [x] Reconnection events: remote-machine-reconnecting, remote-machine-reconnect-ready

 ### Phase D: Frontend integration [status: complete]
diff --git a/docs/progress.md b/docs/progress.md
index 2a30bda..183daa6 100644
--- a/docs/progress.md
+++ b/docs/progress.md
@@ -323,7 +323,7 @@ Design: No separate sidecar process per subagent. Parent's sidecar handles all;
 #### RemoteManager Reconnection
 - [x] Exponential backoff reconnection in remote.rs: spawns async tokio task on disconnect
 - [x] Backoff schedule: 1s, 2s, 4s, 8s, 16s, 30s (capped)
-- [x] attempt_ws_connect() probe function: connects with proper WebSocket upgrade + auth header, 5s timeout, immediate close
+- [x] attempt_tcp_probe() function: TCP-only connect probe (5s timeout, default port 9750) — avoids allocating per-connection resources on relay
 - [x] Emits remote-machine-reconnecting (with backoffSecs) and remote-machine-reconnect-ready Tauri events
 - [x] Cancellation: stops if machine removed (not in HashMap) or manually reconnected (status != disconnected)
 - [x] Fixed scoping: disconnection cleanup uses inner block to release mutex before emitting event
@@ -331,6 +331,20 @@ Design: No separate sidecar process per subagent. Parent's sidecar handles all;
 #### RemoteManager PTY Creation Confirmation
 - [x] Handles pty_created event type from relay: emits remote-pty-created Tauri event with machineId, ptyId, commandId

+### Session: 2026-03-06 (continued) — Reconnection Hardening
+
+#### TCP Probe Refactor
+- [x] Replaced attempt_ws_connect() with attempt_tcp_probe() in remote.rs: TCP-only connect (no WS upgrade), 5s timeout, default port 9750
+- [x] Avoids allocating per-connection resources (PtyManager, SidecarManager) on the relay during reconnection probes
+- [x] Probe no longer needs auth token — only checks TCP reachability
+
+#### Frontend Reconnection Listeners
+- [x] Added onRemoteMachineReconnecting() listener in remote-bridge.ts: receives machineId + backoffSecs
+- [x] Added onRemoteMachineReconnectReady() listener in remote-bridge.ts: receives machineId when probe succeeds
+- [x] machines.svelte.ts: reconnecting handler sets machine status to 'reconnecting', shows toast with backoff duration
+- [x] machines.svelte.ts: reconnect-ready handler auto-calls connectMachine() to re-establish full WebSocket connection
+- [x] Updated docs/multi-machine.md to reflect TCP probe and frontend listener changes
+
 ### Next Steps
 - [ ] Real-world relay testing (2 machines)
 - [ ] TLS/certificate pinning for relay connections
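
The reconnection behavior documented in this diff (backoff of 1s, 2s, 4s, 8s, 16s, then a 30s cap, with a TCP-only probe on a 5s timeout) could be sketched roughly as follows. This is a minimal synchronous illustration using only the Rust standard library. The real `attempt_tcp_probe()` in remote.rs runs inside an async tokio task and is not shown in this diff, so the names `backoff_secs`, `tcp_probe`, and `wait_until_reachable` are hypothetical, and the order of probing and sleeping is an assumption.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Backoff delay in seconds for a zero-based attempt index:
/// 1, 2, 4, 8, 16, then capped at 30 (the schedule listed in the docs).
fn backoff_secs(attempt: u32) -> u64 {
    // checked_shl avoids a shift overflow for attempt >= 64.
    1u64.checked_shl(attempt).unwrap_or(u64::MAX).min(30)
}

/// TCP-only reachability probe: open a socket and drop it immediately.
/// No WebSocket upgrade and no auth token are sent, so a relay would not
/// need to allocate per-connection managers for it.
fn tcp_probe(addr: &SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(addr, timeout).is_ok()
}

/// Hypothetical blocking loop: probe, and on failure sleep for the next
/// backoff interval. Returns the attempt index at which the probe succeeded,
/// or None once `max_attempts` probes have failed.
fn wait_until_reachable(addr: &SocketAddr, max_attempts: u32) -> Option<u32> {
    for attempt in 0..max_attempts {
        // 5s probe timeout, as stated in the changelog entry.
        if tcp_probe(addr, Duration::from_secs(5)) {
            return Some(attempt);
        }
        std::thread::sleep(Duration::from_secs(backoff_secs(attempt)));
    }
    None
}
```

In the documented design, a successful probe only signals readiness (`remote-machine-reconnect-ready`); the full authenticated WebSocket connection is then re-established separately via `connectMachine()`.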