docs: update docs for relay hardening, reconnection, and session wrap

Update multi-machine docs with reconnection implementation details,
command response propagation, and pty_created confirmation flow.
Mark reconnection as complete in phases.md, progress.md, TODO.md.
Update CLAUDE.md files with reconnection and relay response info.
Add CHANGELOG entries for new features.
Hibryda 2026-03-06 19:49:28 +01:00
parent b0cce7ae4f
commit 218570ac35
10 changed files with 84 additions and 35 deletions


@ -17,7 +17,7 @@ Project documentation lives here.
| Document | Description |
|----------|-------------|
| [task_plan.md](task_plan.md) | v2 architecture decisions, error handling, testing strategy |
| [phases.md](phases.md) | v2 implementation phases (1-7) with checklists |
| [phases.md](phases.md) | v2 implementation phases (1-7 + multi-machine A-D) with checklists |
| [findings.md](findings.md) | Research findings (Agent SDK, Tauri, xterm.js, performance) |
| [progress.md](progress.md) | Session-by-session progress log |
| [multi-machine.md](multi-machine.md) | Multi-machine support architecture (WebSocket, relay binary) |
| [multi-machine.md](multi-machine.md) | Multi-machine support architecture (implemented, WebSocket relay, reconnection) |


@ -146,7 +146,7 @@ interface RelayCommand {
// Relay → Controller (events)
interface RelayEvent {
type: 'pty_data' | 'pty_exit'
type: 'pty_data' | 'pty_exit' | 'pty_created'
| 'sidecar_message' | 'sidecar_exited'
| 'error' | 'pong' | 'ready';
sessionId?: string;
@ -181,9 +181,13 @@ Controller Relay
│── reconnect (exp backoff) ─────→│ (1s, 2s, 4s, 8s, max 30s)
```
### Reconnection
### Reconnection (Implemented)
- Controller reconnects with exponential backoff (1s → 30s cap)
- Controller reconnects with exponential backoff (1s, 2s, 4s, 8s, 16s, 30s cap)
- Reconnection runs as an async tokio task spawned on disconnect
- Uses `attempt_ws_connect()` probe: connects with auth header, immediately closes (5s timeout)
- Emits `remote-machine-reconnecting` event (with backoff duration) and `remote-machine-reconnect-ready` when probe succeeds
- Cancels if the machine is removed or manually reconnected (the loop continues only while status == "disconnected" && connection == None)
- On reconnect, relay sends current state snapshot (active sessions, PTY list)
- Controller reconciles: updates pane states, re-subscribes to streams
- Active agent sessions continue on relay regardless of controller connection
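
The backoff schedule listed above (1s, 2s, 4s, 8s, 16s, then a 30s cap) can be sketched as a small pure function. This is a minimal illustration only: `backoff_delay` and `MAX_BACKOFF_SECS` are hypothetical names, and the actual code in remote.rs may compute the delay differently.

```rust
use std::time::Duration;

/// Cap on the reconnect delay, in seconds (taken from the documented schedule).
const MAX_BACKOFF_SECS: u64 = 30;

/// Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, 8s, 16s, then 30s.
/// Hypothetical helper, not the function used in remote.rs.
pub fn backoff_delay(attempt: u32) -> Duration {
    // checked_shl returns None once the shift would overflow, so very large
    // attempt counts fall through to the cap instead of panicking.
    let secs = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Duration::from_secs(secs.min(MAX_BACKOFF_SECS))
}
```

Doubling would give 32s on the sixth attempt, so the cap takes effect exactly there, which matches the 16s to 30s step in the documented schedule.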
@ -260,12 +264,18 @@ Stored in SQLite `settings` table as JSON: `remote_machines` key.
- Routes RelayCommand to bterminal-core managers, forwards RelayEvent over WebSocket
- Rate limiting: 10 failed auth attempts triggers 5-minute lockout
- Per-connection isolated PtyManager + SidecarManager instances
- Command response propagation: structured responses (pty_created, pong, error) sent back via shared event channel
- send_error() helper: all command failures emit RelayEvent with commandId + error message
- PTY creation confirmation: pty_create command returns pty_created event with session ID and commandId for correlation
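
The command response flow described in the bullets above might look roughly like the following. It is a sketch under stated assumptions: the `RelayEvent` struct is a simplified stand-in with guessed field names, `send_pty_created` is a hypothetical helper, and a std `mpsc` channel replaces whatever async channel the relay actually uses. Only the `send_error` name and the commandId correlation come from the docs.

```rust
use std::sync::mpsc::Sender;

/// Simplified stand-in for the relay's RelayEvent (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct RelayEvent {
    pub event_type: String,
    pub command_id: Option<String>,
    pub session_id: Option<String>,
    pub message: Option<String>,
}

/// Report a failed command on the shared event channel instead of only logging it.
pub fn send_error(tx: &Sender<RelayEvent>, command_id: &str, message: &str) {
    let event = RelayEvent {
        event_type: "error".to_string(),
        command_id: Some(command_id.to_string()),
        session_id: None,
        message: Some(message.to_string()),
    };
    // A failed send means the receiver is gone (connection closed),
    // so there is nobody left to notify.
    let _ = tx.send(event);
}

/// Confirm a pty_create command (hypothetical helper). Echoing command_id
/// lets the controller match the new session to its request.
pub fn send_pty_created(tx: &Sender<RelayEvent>, command_id: &str, session_id: &str) {
    let event = RelayEvent {
        event_type: "pty_created".to_string(),
        command_id: Some(command_id.to_string()),
        session_id: Some(session_id.to_string()),
        message: None,
    };
    let _ = tx.send(event);
}
```

Both helpers write to the same sender, which mirrors the "shared event channel" idea: responses travel the same path as ordinary PTY and sidecar events.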
### Phase C: Add `RemoteManager` to controller [DONE]
- v2/src-tauri/src/remote.rs — RemoteManager struct with WebSocket client connections
- 12 Tauri commands: remote_add_machine, remote_remove_machine, remote_connect, remote_disconnect, remote_list_machines, remote_pty_spawn/write/resize/kill, remote_agent_query/stop, remote_sidecar_restart
- Heartbeat ping every 15s
- PTY creation event: emits `remote-pty-created` Tauri event with machineId, ptyId, commandId
- Exponential backoff reconnection on disconnect (1s/2s/4s/8s/16s/30s cap) via `attempt_ws_connect()` probe
- Reconnection events: `remote-machine-reconnecting`, `remote-machine-reconnect-ready`
### Phase D: Frontend integration [DONE]
@ -278,7 +288,8 @@ Stored in SQLite `settings` table as JSON: `remote_machines` key.
### Remaining Work
- [ ] Reconnection logic with exponential backoff (1s-30s cap)
- [x] Reconnection logic with exponential backoff (1s-30s cap) — implemented in remote.rs
- [x] Relay command response propagation (pty_created, pong, error events) — implemented in main.rs
- [ ] Real-world relay testing (2 machines)
- [ ] TLS/certificate pinning


@ -271,12 +271,19 @@ Architecture designed in [multi-machine.md](multi-machine.md). Implementation ex
- [x] Routes RelayCommand to PtyManager/SidecarManager, forwards RelayEvent over WebSocket
- [x] Rate limiting on auth failures (10 attempts, 5min lockout)
- [x] Per-connection isolated PTY + sidecar managers
- [x] Command response propagation: structured responses (pty_created, pong, error) via shared event channel
- [x] send_error() helper for consistent error reporting with commandId correlation
- [x] PTY creation confirmation: pty_created event with session ID and commandId
### Phase C: Add `RemoteManager` to controller [status: complete]
- [x] New remote.rs module in src-tauri — WebSocket client connections to relay instances
- [x] Machine lifecycle: add/remove/connect/disconnect
- [x] 12 new Tauri commands for remote operations
- [x] Heartbeat ping every 15s
- [x] PTY creation event: emits remote-pty-created Tauri event with machineId, ptyId, commandId
- [x] Exponential backoff reconnection on disconnect (1s/2s/4s/8s/16s/30s cap)
- [x] attempt_ws_connect() probe function (5s timeout, auth header, immediate close)
- [x] Reconnection events: remote-machine-reconnecting, remote-machine-reconnect-ready
### Phase D: Frontend integration [status: complete]
- [x] remote-bridge.ts adapter for machine management + remote events
@ -287,6 +294,7 @@ Architecture designed in [multi-machine.md](multi-machine.md). Implementation ex
- [x] Sidebar auto-groups remote panes by machine label
### Remaining Work
- [ ] Reconnection logic with exponential backoff
- [x] Reconnection logic with exponential backoff — implemented in remote.rs
- [x] Relay command response propagation — implemented in bterminal-relay main.rs
- [ ] Real-world relay testing (2 machines)
- [ ] TLS/certificate pinning


@ -311,8 +311,27 @@ Design: No separate sidecar process per subagent. Parent's sidecar handles all;
- bterminal-relay: tokio, tokio-tungstenite, clap, env_logger, futures-util
- src-tauri: tokio-tungstenite, tokio, futures-util, uuid (added for RemoteManager)
### Session: 2026-03-06 (continued) — Relay Hardening & Reconnection
#### Relay Command Response Propagation
- [x] Shared event channel between EventSink and command response sender (sink_tx clone in bterminal-relay)
- [x] send_error() helper function: all command failures now emit RelayEvent with commandId + error message instead of just logging
- [x] ping command: now sends pong response via event channel (was a no-op)
- [x] pty_create: returns pty_created event with session ID and commandId for correlation
- [x] All error paths (pty_write, pty_resize, pty_close, agent_query, agent_stop, sidecar_restart) use send_error()
#### RemoteManager Reconnection
- [x] Exponential backoff reconnection in remote.rs: spawns async tokio task on disconnect
- [x] Backoff schedule: 1s, 2s, 4s, 8s, 16s, 30s (capped)
- [x] attempt_ws_connect() probe function: connects with proper WebSocket upgrade + auth header, 5s timeout, immediate close
- [x] Emits remote-machine-reconnecting (with backoffSecs) and remote-machine-reconnect-ready Tauri events
- [x] Cancellation: stops if machine removed (not in HashMap) or manually reconnected (status != disconnected)
- [x] Fixed scoping: disconnection cleanup uses inner block to release mutex before emitting event
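
The cancellation rule above can be expressed as a single predicate evaluated before each attempt. This is an illustrative sketch: the `Machine` struct, its `connected` flag (standing in for an optional connection handle), and `should_keep_reconnecting` are assumptions, not the types used in remote.rs.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a tracked remote machine (shape is an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Machine {
    pub status: String,
    /// Stands in for an `Option<connection>` being `Some`.
    pub connected: bool,
}

/// Returns true while the reconnect loop should keep trying.
/// Stops if the machine was removed from the map, or if it was manually
/// reconnected (status is no longer "disconnected" or a connection exists).
pub fn should_keep_reconnecting(machines: &HashMap<String, Machine>, machine_id: &str) -> bool {
    match machines.get(machine_id) {
        None => false,
        Some(m) => m.status == "disconnected" && !m.connected,
    }
}
```

In a real task this check would run while holding the manager's lock only briefly, in line with the scoping fix noted above that releases the mutex before emitting events.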
#### RemoteManager PTY Creation Confirmation
- [x] Handles pty_created event type from relay: emits remote-pty-created Tauri event with machineId, ptyId, commandId
### Next Steps
- [ ] Reconnection logic with exponential backoff
- [ ] Real-world relay testing (2 machines)
- [ ] TLS/certificate pinning for relay connections
- [ ] Deno sidecar: test with real claude CLI, benchmark startup time vs Node.js


@ -151,7 +151,7 @@ See [phases.md](phases.md) for the full phased implementation plan.
## Open Questions
1. **Node.js or Deno for sidecar?** Resolved: Deno-first with Node.js fallback. SidecarCommand struct in sidecar.rs abstracts the choice. Deno preferred (runs TS directly, compiles to single binary). Falls back to Node.js if Deno not in PATH.
2. **Multi-machine support?** Resolved: Implemented (Phases A-D complete). See [multi-machine.md](multi-machine.md) for architecture. bterminal-core crate extracted, bterminal-relay binary built, RemoteManager + frontend integration done. Remaining: reconnection logic, real-world testing, TLS.
2. **Multi-machine support?** Resolved: Implemented (Phases A-D complete). See [multi-machine.md](multi-machine.md) for architecture. bterminal-core crate extracted, bterminal-relay binary built, RemoteManager + frontend integration done. Reconnection with exponential backoff implemented. Remaining: real-world testing, TLS.
3. **Agent Teams integration?** Phase 7 — frontend routing implemented (subagent pane spawning, parent/child navigation). Needs real-world testing with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
4. **Electron escape hatch threshold?** If Canvas xterm.js proves >50ms latency on target system with 4 panes, switch to Electron. Benchmark in Phase 2.