Production Hardening
Reliability, security, and observability features that ensure agor runs safely in daily use.
Sidecar Supervisor (Crash Recovery)
The SidecarSupervisor in agor-core/src/supervisor.rs automatically restarts crashed sidecar processes.
Behavior
When the sidecar child process exits unexpectedly:
- The supervisor detects the exit via process monitoring
- Waits with exponential backoff before restarting (delays are capped at 30 seconds):
  - Attempt 1: wait 1 second
  - Attempt 2: wait 2 seconds
  - Attempt 3: wait 4 seconds
  - Attempt 4: wait 8 seconds
  - Attempt 5: wait 16 seconds
- After 5 failed attempts, the supervisor gives up and reports SidecarHealth::Failed
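The schedule above can be sketched as a small pure function. This is a minimal illustration, not the code in supervisor.rs: the name backoff_delay and the three constants are hypothetical, chosen to match the numbers listed.

```rust
use std::time::Duration;

// Values taken from the schedule described above.
const MAX_ATTEMPTS: u32 = 5;
const BASE_DELAY: Duration = Duration::from_secs(1);
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Delay before restart attempt `attempt` (1-based), or `None` once the
/// retry budget is exhausted. Hypothetical helper, not the real API.
pub fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt == 0 || attempt > MAX_ATTEMPTS {
        return None;
    }
    // The base delay doubles with every attempt: 1s, 2s, 4s, 8s, 16s.
    let delay = BASE_DELAY * 2u32.pow(attempt - 1);
    // The cap is never reached with five attempts, but it guards against
    // a larger MAX_ATTEMPTS.
    Some(delay.min(MAX_DELAY))
}
```

Returning None for attempt 6 is one way to signal that the supervisor should stop retrying and report the failed state.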
Health States
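pub enum SidecarHealth {
    // Sidecar is running normally
    Healthy,
    // A restart is pending; `attempt` is the restart number being waited on
    Restarting { attempt: u32, next_retry: Duration },
    // Auto-recovery gave up; `last_error` holds the most recent failure
    Failed { attempts: u32, last_error: String },
}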
The frontend can query health state and offer a manual restart button when auto-recovery fails. 17 unit tests cover all recovery scenarios including edge cases like rapid successive crashes.
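To make the state flow concrete, the following sketch shows one plausible way the three states could follow one another as crashes accumulate. The enum is copied from above (with derives added for the example); on_crash and needs_manual_restart are hypothetical helpers, and keeping a Failed supervisor in Failed until a manual restart is an assumption.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub enum SidecarHealth {
    Healthy,
    Restarting { attempt: u32, next_retry: Duration },
    Failed { attempts: u32, last_error: String },
}

const MAX_ATTEMPTS: u32 = 5;

/// Hypothetical transition applied each time the sidecar exits unexpectedly.
pub fn on_crash(current: &SidecarHealth, error: &str) -> SidecarHealth {
    // Work out which restart attempt this crash leads to.
    let attempt = match current {
        SidecarHealth::Healthy => 1,
        SidecarHealth::Restarting { attempt, .. } => attempt + 1,
        // Assumption: a failed supervisor stays failed until a manual restart.
        SidecarHealth::Failed { .. } => return current.clone(),
    };
    if attempt > MAX_ATTEMPTS {
        SidecarHealth::Failed {
            attempts: MAX_ATTEMPTS,
            last_error: error.to_string(),
        }
    } else {
        // Exponential backoff: 1s, 2s, 4s, 8s, 16s, capped at 30s.
        let secs = (1u64 << (attempt - 1)).min(30);
        SidecarHealth::Restarting {
            attempt,
            next_retry: Duration::from_secs(secs),
        }
    }
}

/// A UI would offer the manual restart button only in the failed state.
pub fn needs_manual_restart(health: &SidecarHealth) -> bool {
    matches!(health, SidecarHealth::Failed { .. })
}
```

Rapid successive crashes simply walk through Restarting with an increasing attempt counter until the sixth crash lands in Failed.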
Landlock Sandbox
Landlock is a Linux security module that lets a process restrict its own filesystem access. It was merged in kernel 5.13; Agor targets kernels 6.2 and newer. Agor uses it to sandbox sidecar processes, limiting which files they can read and write.
Configuration
pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>, // Full access (project dir, temp)
    pub read_only_paths: Vec<PathBuf>,  // Read-only (system libs, SDK)
}
The sandbox is applied via pre_exec() on the child process command, before the sidecar starts executing.
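As an illustration of the pre_exec() hook, the sketch below installs a closure that runs in the forked child just before the program is executed. It uses only the standard library; apply_sandbox is a hypothetical placeholder for the actual Landlock ruleset setup, and sandboxed_command is likewise an assumed name.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::path::PathBuf;
use std::process::Command;

pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,
    pub read_only_paths: Vec<PathBuf>,
}

/// Placeholder for the real Landlock ruleset setup (hypothetical name).
/// It runs in the forked child, so a real version must stay fork-safe.
fn apply_sandbox(_config: &SandboxConfig) -> io::Result<()> {
    Ok(())
}

/// Builds a command whose child process applies the sandbox before exec.
pub fn sandboxed_command(program: &str, config: SandboxConfig) -> Command {
    let mut cmd = Command::new(program);
    // SAFETY: the hook only calls `apply_sandbox`, which in this sketch
    // neither allocates nor takes locks.
    unsafe {
        cmd.pre_exec(move || apply_sandbox(&config));
    }
    cmd
}
```

Because the hook runs after fork and before exec, restrictions applied there bind the sidecar from its first instruction while leaving the parent process unrestricted. An error returned from the hook makes the spawn fail.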
Path Rules
| Path | Access | Reason |
|---|---|---|
| Project CWD | Read/Write | Agent needs to read and modify project files |
| /tmp | Read/Write | Temporary files during operation |
| ~/.local/share/agor/ | Read/Write | SQLite databases (btmsg, sessions) |
| System library paths | Read-only | Node.js/Deno runtime dependencies |
| ~/.claude/ or config dir | Read-only | Claude configuration and credentials |
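The table could translate into a SandboxConfig roughly as follows. This is a sketch under assumptions: default_sandbox_config is a hypothetical helper, and /usr/lib and /lib are illustrative stand-ins for "system library paths", which the table does not enumerate.

```rust
use std::path::{Path, PathBuf};

pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,
    pub read_only_paths: Vec<PathBuf>,
}

/// Hypothetical helper that maps the path rules onto a config value.
pub fn default_sandbox_config(project_dir: &Path, home: &Path) -> SandboxConfig {
    SandboxConfig {
        read_write_paths: vec![
            // Project CWD: the agent reads and modifies project files.
            project_dir.to_path_buf(),
            // Temporary files during operation.
            PathBuf::from("/tmp"),
            // SQLite databases (btmsg, sessions).
            home.join(".local/share/agor"),
        ],
        read_only_paths: vec![
            // Assumed system library locations; not listed in the table.
            PathBuf::from("/usr/lib"),
            PathBuf::from("/lib"),
            // Claude configuration and credentials.
            home.join(".claude"),
        ],
    }
}
```

Keeping credentials in the read-only list means a compromised sidecar can read them but cannot overwrite them.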
Graceful Fallback
If the kernel doesn't support Landlock (< 6.2) or the kernel module isn't loaded, the sandbox degrades gracefully: the sidecar runs without filesystem restrictions. The fallback is logged as a warning but doesn't prevent operation.
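The fallback decision might look like the sketch below. It is a simplification: it compares only the kernel release string against 6.2 and does not cover the "module not loaded" case, which a real implementation would more likely detect by probing the Landlock ABI. The names SandboxStatus and sandbox_status are hypothetical, and eprintln! stands in for the real logging call.

```rust
#[derive(Debug, PartialEq)]
pub enum SandboxStatus {
    Enforced,
    Unsandboxed,
}

/// Extracts (major, minor) from a release string such as "6.2.0-39-generic".
fn parse_major_minor(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major = parts.next()?.parse::<u32>().ok()?;
    // The minor field may carry a suffix such as "2-rc1"; keep leading digits.
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse::<u32>().ok()?;
    Some((major, minor))
}

/// Decides whether to enforce the sandbox; never returns an error, so an
/// unsupported kernel cannot block the sidecar from starting.
pub fn sandbox_status(kernel_release: &str) -> SandboxStatus {
    match parse_major_minor(kernel_release) {
        // Tuples compare lexicographically, so (6, 10) >= (6, 2) holds.
        Some(version) if version >= (6, 2) => SandboxStatus::Enforced,
        _ => {
            eprintln!(
                "warning: Landlock unavailable on kernel {}; sidecar runs unsandboxed",
                kernel_release
            );
            SandboxStatus::Unsandboxed
        }
    }
}
```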
WAL Checkpoint
Both SQLite databases (sessions.db and btmsg.db) use WAL (Write-Ahead Logging) mode for concurrent read/write access. Without periodic checkpoints, the WAL file can grow without bound.
A background tokio task runs PRAGMA wal_checkpoint(TRUNCATE) every 5 minutes on both databases. This moves WAL data into the main database file and resets the WAL.
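The periodic checkpoint can be sketched with the standard library alone. Note the substitutions: a std::thread stands in for the tokio task, and the exec closure is a hypothetical hook standing in for the real SQLite call, so the example stays self-contained. The function names are assumptions as well.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

pub const CHECKPOINT_SQL: &str = "PRAGMA wal_checkpoint(TRUNCATE);";

/// Runs the checkpoint statement once against every database.
/// `exec(db, sql)` is a hypothetical hook for the real SQLite call.
pub fn checkpoint_all<F: FnMut(&str, &str)>(dbs: &[&str], exec: &mut F) {
    for &db in dbs {
        exec(db, CHECKPOINT_SQL);
    }
}

/// Spawns a background loop that checkpoints every `interval`.
/// Sending on (or dropping) the returned sender stops the loop.
pub fn spawn_checkpoint_loop<F>(
    interval: Duration,
    dbs: Vec<String>,
    mut exec: F,
) -> (Sender<()>, JoinHandle<()>)
where
    F: FnMut(&str, &str) + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || loop {
        match stop_rx.recv_timeout(interval) {
            // No stop signal within the interval: time to checkpoint.
            Err(RecvTimeoutError::Timeout) => {
                let names: Vec<&str> = dbs.iter().map(String::as_str).collect();
                checkpoint_all(&names, &mut exec);
            }
            // Stop requested or sender dropped: exit the loop.
            _ => break,
        }
    });
    (stop_tx, handle)
}
```

With the values from the text, the loop would be started with Duration::from_secs(300) and the two database names sessions.db and btmsg.db.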
TLS Relay Support
The agor-relay binary supports TLS for encrypted WebSocket connections:
agor-relay \
--port 9750 \
--token <secret> \
--tls-cert /path/to/cert.pem \
--tls-key /path/to/key.pem
Without --tls-cert and --tls-key, the relay accepts connections only when --insecure is passed, in which case it serves plain WebSocket (ws://). TLS is mandatory in production: the relay rejects ws:// connections unless --insecure is explicitly set.
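The flag rules can be expressed as a small decision function. This is a sketch, not the agor-relay source: RelayMode and resolve_mode are hypothetical names, the error messages are invented, and treating a cert/key pair as taking precedence over --insecure is an assumption.

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
pub enum RelayMode {
    Tls { cert: PathBuf, key: PathBuf },
    Plain,
}

/// Maps the parsed flags to a transport mode, refusing to fall back to
/// plain WebSocket unless `--insecure` was given explicitly.
pub fn resolve_mode(
    cert: Option<PathBuf>,
    key: Option<PathBuf>,
    insecure: bool,
) -> Result<RelayMode, String> {
    match (cert, key, insecure) {
        // Assumption: a complete cert/key pair always selects TLS.
        (Some(cert), Some(key), _) => Ok(RelayMode::Tls { cert, key }),
        (None, None, true) => Ok(RelayMode::Plain),
        (None, None, false) => Err(
            "TLS is required: pass --tls-cert and --tls-key, or --insecure".to_string(),
        ),
        // Only one of the two TLS flags was supplied.
        _ => Err("--tls-cert and --tls-key must be given together".to_string()),
    }
}
```

Making the insecure path opt-in means a missing certificate shows up as a startup error rather than as silently unencrypted traffic.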
Certificate pinning (comparing relay certificate fingerprints) is planned for v3.1.
OpenTelemetry Observability
The Rust backend supports optional OTLP trace export via the AGOR_OTLP_ENDPOINT environment variable.
Backend (telemetry.rs)
- TelemetryGuard initializes tracing + OTLP export pipeline
- Uses tracing + tracing-subscriber + opentelemetry 0.28 + tracing-opentelemetry 0.29
- OTLP/HTTP export to configured endpoint
- Drop-based shutdown ensures spans are flushed
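The Drop-based shutdown pattern can be shown without the OpenTelemetry crates. In this sketch, FlushGuard is a stand-in for TelemetryGuard, init_telemetry is a hypothetical constructor, and the AtomicBool exists only so the example can observe that the flush step ran.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the exporter pipeline; `flushed` records that shutdown ran.
pub struct FlushGuard {
    endpoint: String,
    flushed: Arc<AtomicBool>,
}

/// Returns a guard only when an endpoint is configured, mirroring the
/// optional AGOR_OTLP_ENDPOINT behaviour described above.
pub fn init_telemetry(
    endpoint: Option<String>,
    flushed: Arc<AtomicBool>,
) -> Option<FlushGuard> {
    let endpoint = endpoint.filter(|e| !e.trim().is_empty())?;
    Some(FlushGuard { endpoint, flushed })
}

impl Drop for FlushGuard {
    fn drop(&mut self) {
        // A real guard would flush pending spans and shut the exporter down.
        eprintln!("flushing spans to {}", self.endpoint);
        self.flushed.store(true, Ordering::SeqCst);
    }
}
```

Holding the guard in main for the lifetime of the application means the flush runs on any normal exit path, with no explicit shutdown call to forget.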
Frontend (telemetry-bridge.ts)
The frontend cannot use the browser OpenTelemetry SDK because it is incompatible with WebKit2GTK. Instead, it routes events through a frontend_log Tauri command that pipes into Rust's tracing system:
tel.info('agent-started', { sessionId, provider });
tel.warn('context-pressure', { projectId, usage: 0.85 });
tel.error('sidecar-crash', { error: msg });
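On the Rust side, the receiving command has to map the level and fields of each event onto a log record. The sketch below shows that mapping with the standard library only; the payload shape, the function names, and the output format are all assumptions, and a real handler would be a Tauri command that emits through the tracing macros rather than returning a string.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Level {
    Info,
    Warn,
    Error,
}

impl Level {
    fn label(self) -> &'static str {
        match self {
            Level::Info => "INFO",
            Level::Warn => "WARN",
            Level::Error => "ERROR",
        }
    }
}

/// Maps the level string sent by the frontend to a typed level.
pub fn parse_level(level: &str) -> Option<Level> {
    match level {
        "info" => Some(Level::Info),
        "warn" => Some(Level::Warn),
        "error" => Some(Level::Error),
        _ => None,
    }
}

/// Hypothetical rendering of one frontend event as a single log line.
/// Unknown levels are rejected instead of being logged at a guessed level.
pub fn format_frontend_event(
    level: &str,
    event: &str,
    fields: &BTreeMap<String, String>,
) -> Option<String> {
    let level = parse_level(level)?;
    let mut line = format!("{} frontend event={}", level.label(), event);
    // BTreeMap iterates in key order, which keeps the output stable.
    for (key, value) in fields {
        line.push_str(&format!(" {}={}", key, value));
    }
    Some(line)
}
```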
Docker Stack
A pre-configured Tempo + Grafana stack lives in docker/tempo/:
cd docker/tempo && docker compose up -d
# Grafana at http://localhost:9715
# Set AGOR_OTLP_ENDPOINT=http://localhost:4318 to enable export
Agent Health Monitoring
Heartbeats
Tier 1 agents send periodic heartbeats via the btmsg heartbeat CLI command. The heartbeats table tracks the last heartbeat timestamp and status per agent.
Stale Detection
The health store detects stalled agents via the stallThresholdMin setting (default 15 minutes). If an agent hasn't produced output within the threshold, its activity state transitions to Stalled and the attention score jumps to 100 (highest priority).
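The stall rule reduces to a threshold comparison. The health store itself lives in the frontend; this Rust sketch only illustrates the logic, with hypothetical names throughout. Treating an idle time exactly equal to the threshold as stalled is an assumption.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum Activity {
    Active,
    Stalled,
}

/// Default for the stallThresholdMin setting.
pub const DEFAULT_STALL_THRESHOLD_MIN: u64 = 15;
/// Attention score assigned to stalled agents (highest priority).
pub const STALLED_ATTENTION_SCORE: u8 = 100;

/// Classifies an agent by how long ago it last produced output.
pub fn classify(since_last_output: Duration, stall_threshold_min: u64) -> Activity {
    let threshold = Duration::from_secs(stall_threshold_min * 60);
    // Assumption: reaching the threshold exactly already counts as stalled.
    if since_last_output >= threshold {
        Activity::Stalled
    } else {
        Activity::Active
    }
}

/// Stalled agents jump to the maximum score; others keep their base score.
pub fn attention_score(activity: &Activity, base_score: u8) -> u8 {
    match activity {
        Activity::Stalled => STALLED_ATTENTION_SCORE,
        Activity::Active => base_score.min(STALLED_ATTENTION_SCORE),
    }
}
```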
Dead Letter Queue
Messages sent to agents that are offline or have crashed are moved to the dead letter queue in btmsg.db. This prevents silent message loss and allows debugging delivery failures.
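The routing decision can be sketched with in-memory vectors standing in for the tables in btmsg.db. All type and field names here are hypothetical, as are the reason strings; the actual schema is not shown in this document.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AgentStatus {
    Online,
    Offline,
    Crashed,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Message {
    pub to: String,
    pub body: String,
}

/// In-memory stand-in for the message tables in btmsg.db.
#[derive(Default)]
pub struct MessageStore {
    pub delivered: Vec<Message>,
    /// Undeliverable messages paired with the reason delivery failed.
    pub dead_letters: Vec<(Message, String)>,
}

impl MessageStore {
    /// Delivers the message, or parks it in the dead letter queue when the
    /// recipient cannot receive it. Returns `true` on delivery.
    pub fn route(&mut self, msg: Message, status: AgentStatus) -> bool {
        let reason = match status {
            AgentStatus::Online => {
                self.delivered.push(msg);
                return true;
            }
            AgentStatus::Offline => "recipient offline",
            AgentStatus::Crashed => "recipient crashed",
        };
        // Keep the message and the failure reason for later debugging.
        self.dead_letters.push((msg, reason.to_string()));
        false
    }
}
```

Storing the reason next to the message is what makes the queue useful for debugging: the failure cause survives even after the agent's status has changed.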