# Production Hardening

Reliability, security, and observability features that ensure agor runs safely in daily use.

---

## Sidecar Supervisor (Crash Recovery)

The `SidecarSupervisor` in `agor-core/src/supervisor.rs` automatically restarts crashed sidecar processes.

### Behavior

When the sidecar child process exits unexpectedly:

1. The supervisor detects the exit via process monitoring
2. It waits with exponential backoff before restarting (the delay is capped at 30s):
   - Attempt 1: wait 1 second
   - Attempt 2: wait 2 seconds
   - Attempt 3: wait 4 seconds
   - Attempt 4: wait 8 seconds
   - Attempt 5: wait 16 seconds
3. After 5 failed attempts, the supervisor gives up and reports `SidecarHealth::Failed`

### Health States

```rust
use std::time::Duration;

pub enum SidecarHealth {
    Healthy,
    Restarting { attempt: u32, next_retry: Duration },
    Failed { attempts: u32, last_error: String },
}
```

The frontend can query the health state and offer a manual restart button when auto-recovery fails.

17 unit tests cover all recovery scenarios, including edge cases such as rapid successive crashes.

---

## Landlock Sandbox

Landlock is a Linux kernel (6.2+) security module that restricts filesystem access for processes. Agor uses it to sandbox sidecar processes, limiting which files they can read and write.

### Configuration

```rust
use std::path::PathBuf;

pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>, // Full access (project dir, temp)
    pub read_only_paths: Vec<PathBuf>,  // Read-only (system libs, SDK)
}
```

The sandbox is applied via `pre_exec()` on the child process command, before the sidecar starts executing.
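As a rough illustration of the `pre_exec()` hook, the sketch below uses only the standard library. `apply_sandbox` and `sandboxed_command` are hypothetical names, the `PathBuf` element type is an assumption, and the actual Landlock ruleset construction is deliberately left out; this is not agor's implementation.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::path::PathBuf;
use std::process::Command;

/// Mirrors the struct described above (element type assumed to be `PathBuf`).
pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,
    pub read_only_paths: Vec<PathBuf>,
}

/// Hypothetical placeholder: a real implementation would build a Landlock
/// ruleset from `config` and restrict the calling process here.
fn apply_sandbox(_config: &SandboxConfig) -> io::Result<()> {
    Ok(())
}

/// Builds a command whose child process runs `apply_sandbox` after fork
/// and before exec, so the restrictions are in place before the program starts.
pub fn sandboxed_command(program: &str, config: SandboxConfig) -> Command {
    let mut cmd = Command::new(program);
    // SAFETY: the closure runs in the forked child before exec. It only calls
    // `apply_sandbox`, which in this sketch does nothing and returns Ok.
    unsafe {
        cmd.pre_exec(move || apply_sandbox(&config));
    }
    cmd
}
```

If the closure returns an error, spawning the child fails with that error, so a caller can decide whether to abort or fall back to an unsandboxed launch.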
### Path Rules

| Path | Access | Reason |
|------|--------|--------|
| Project CWD | Read/Write | Agent needs to read and modify project files |
| `/tmp` | Read/Write | Temporary files during operation |
| `~/.local/share/agor/` | Read/Write | SQLite databases (btmsg, sessions) |
| System library paths | Read-only | Node.js/Deno runtime dependencies |
| `~/.claude/` or config dir | Read-only | Claude configuration and credentials |

### Graceful Fallback

If the kernel doesn't support Landlock (< 6.2) or the kernel module isn't loaded, the sandbox degrades gracefully: the sidecar runs without filesystem restrictions. This is logged as a warning but doesn't prevent operation.

---

## WAL Checkpoint

Both SQLite databases (`sessions.db` and `btmsg.db`) use WAL (Write-Ahead Logging) mode for concurrent read/write access. Without periodic checkpoints, the WAL file can grow without bound.

A background tokio task runs `PRAGMA wal_checkpoint(TRUNCATE)` every 5 minutes on both databases. This moves WAL data into the main database file and resets the WAL.

---

## TLS Relay Support

The `agor-relay` binary supports TLS for encrypted WebSocket connections:

```bash
agor-relay \
  --port 9750 \
  --token <token> \
  --tls-cert /path/to/cert.pem \
  --tls-key /path/to/key.pem
```

Without `--tls-cert` and `--tls-key`, the relay accepts plain WebSocket (`ws://`) connections only when it is started with the `--insecure` flag. TLS is mandatory in production: the relay rejects `ws://` connections unless `--insecure` is explicitly set.

Certificate pinning (comparing relay certificate fingerprints) is planned for v3.1.

---

## OpenTelemetry Observability

The Rust backend supports optional OTLP trace export via the `AGOR_OTLP_ENDPOINT` environment variable.
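One way to treat the variable as an opt-in switch is sketched below. The helper names are hypothetical, and treating an empty or whitespace-only value as "export disabled" is an assumption made for this example, not documented behavior.

```rust
use std::env;

/// Variable name taken from the text above.
const OTLP_ENDPOINT_VAR: &str = "AGOR_OTLP_ENDPOINT";

/// Trims the raw value and maps empty strings to `None`
/// (assumption: an empty value means export stays disabled).
pub fn normalize_endpoint(raw: Option<String>) -> Option<String> {
    raw.map(|v| v.trim().to_string()).filter(|v| !v.is_empty())
}

/// Returns the configured endpoint, or `None` when the variable is unset,
/// not valid Unicode, or empty.
pub fn otlp_endpoint() -> Option<String> {
    normalize_endpoint(env::var(OTLP_ENDPOINT_VAR).ok())
}
```

A caller could then set up the OTLP exporter only when `otlp_endpoint()` returns `Some(url)` and otherwise keep local logging only.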
### Backend (`telemetry.rs`)

- `TelemetryGuard` initializes the tracing + OTLP export pipeline
- Uses `tracing` + `tracing-subscriber` + `opentelemetry` 0.28 + `tracing-opentelemetry` 0.29
- OTLP/HTTP export to the configured endpoint
- `Drop`-based shutdown ensures spans are flushed

### Frontend (`telemetry-bridge.ts`)

The frontend cannot use the browser OTEL SDK because it is incompatible with WebKit2GTK. Instead, it routes events through a `frontend_log` Tauri command that pipes into Rust's tracing system:

```typescript
tel.info('agent-started', { sessionId, provider });
tel.warn('context-pressure', { projectId, usage: 0.85 });
tel.error('sidecar-crash', { error: msg });
```

### Docker Stack

A pre-configured Tempo + Grafana stack lives in `docker/tempo/`:

```bash
cd docker/tempo && docker compose up -d
# Grafana at http://localhost:9715
# Set AGOR_OTLP_ENDPOINT=http://localhost:4318 to enable export
```

---

## Agent Health Monitoring

### Heartbeats

Tier 1 agents send periodic heartbeats via the `btmsg heartbeat` CLI command. The heartbeats table tracks the last heartbeat timestamp and status for each agent.

### Stale Detection

The health store detects stalled agents via the `stallThresholdMin` setting (default 15 minutes). If an agent hasn't produced output within the threshold, its activity state transitions to `Stalled` and its attention score jumps to 100 (highest priority).

### Dead Letter Queue

Messages sent to agents that are offline or have crashed are moved to the dead letter queue in `btmsg.db`. This prevents silent message loss and makes delivery failures debuggable.
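The routing rule behind the dead letter queue can be sketched in memory as follows. This is only an illustration: the real queue lives in `btmsg.db`, and every type, field, and reason string here (`Router`, `AgentStatus`, `DeadLetter`, and so on) is hypothetical. Treating an unknown recipient like an offline one is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical recipient states for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentStatus {
    Online,
    Offline,
    Crashed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Message {
    pub to: String,
    pub body: String,
}

/// An undeliverable message together with the reason it was parked.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeadLetter {
    pub message: Message,
    pub reason: String,
}

#[derive(Default)]
pub struct Router {
    pub statuses: HashMap<String, AgentStatus>,
    pub inboxes: HashMap<String, Vec<Message>>,
    pub dead_letters: Vec<DeadLetter>,
}

impl Router {
    /// Delivers to the recipient's inbox when it is online; otherwise
    /// records the message as a dead letter. Returns `true` on delivery.
    pub fn deliver(&mut self, message: Message) -> bool {
        match self.statuses.get(&message.to).copied() {
            Some(AgentStatus::Online) => {
                self.inboxes
                    .entry(message.to.clone())
                    .or_default()
                    .push(message);
                true
            }
            other => {
                let reason = match other {
                    Some(AgentStatus::Crashed) => "recipient crashed",
                    Some(AgentStatus::Offline) => "recipient offline",
                    // Assumption: unknown recipients are parked as well.
                    _ => "recipient unknown",
                };
                self.dead_letters.push(DeadLetter {
                    message,
                    reason: reason.to_string(),
                });
                false
            }
        }
    }
}
```

Keeping the reason next to each parked message is what makes a delivery failure easy to inspect later, rather than the message simply disappearing.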