agent-orchestrator/docs/production/hardening.md
Hibryda b6c1d4b6af docs: add 11 new documentation files across all categories
2026-03-17 04:18:05 +01:00


Production Hardening

Reliability, security, and observability features that ensure agor runs safely in daily use.


Sidecar Supervisor (Crash Recovery)

The SidecarSupervisor in agor-core/src/supervisor.rs automatically restarts crashed sidecar processes.

Behavior

When the sidecar child process exits unexpectedly:

  1. The supervisor detects the exit via process monitoring
  2. Waits with exponential backoff before restarting:
    • Attempt 1: wait 1 second
    • Attempt 2: wait 2 seconds
    • Attempt 3: wait 4 seconds
    • Attempt 4: wait 8 seconds
    • Attempt 5: wait 16 seconds (any delay is capped at 30 seconds)
  3. After 5 failed attempts, the supervisor gives up and reports SidecarHealth::Failed
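The schedule above can be sketched as a small pure function. This is a minimal illustration rather than the code in supervisor.rs; the names `backoff_delay`, `MAX_ATTEMPTS`, and `MAX_DELAY` are assumptions made for the example.

```rust
use std::time::Duration;

/// Maximum number of automatic restart attempts before giving up.
const MAX_ATTEMPTS: u32 = 5;
/// Upper bound on any single backoff delay.
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Returns the delay before restart attempt `attempt` (1-based),
/// or `None` once the attempt budget is exhausted.
fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt == 0 || attempt > MAX_ATTEMPTS {
        return None;
    }
    // 1s, 2s, 4s, ... doubling on every attempt.
    let secs = 1u64 << (attempt - 1);
    Some(Duration::from_secs(secs).min(MAX_DELAY))
}
```

With five attempts the longest computed delay is 16 seconds, so the 30-second cap only matters if the attempt budget is ever raised.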

Health States

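use std::time::Duration;

pub enum SidecarHealth {
    // Sidecar is running normally
    Healthy,
    // Sidecar crashed; a restart is scheduled after `next_retry`
    Restarting { attempt: u32, next_retry: Duration },
    // Auto-recovery gave up after `attempts` restarts
    Failed { attempts: u32, last_error: String },
}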

The frontend can query health state and offer a manual restart button when auto-recovery fails. 17 unit tests cover all recovery scenarios including edge cases like rapid successive crashes.
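To make the state transitions concrete, here is a hedged sketch of how a crash could move the supervisor from one health state to the next. The functions `on_crash` and `needs_manual_restart` are hypothetical helpers written for this example, and the derives on the enum are added only so the states can be compared.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub enum SidecarHealth {
    Healthy,
    Restarting { attempt: u32, next_retry: Duration },
    Failed { attempts: u32, last_error: String },
}

const MAX_ATTEMPTS: u32 = 5;

/// Computes the next health state after the sidecar exits unexpectedly.
fn on_crash(current: &SidecarHealth, error: &str) -> SidecarHealth {
    let previous = match current {
        SidecarHealth::Healthy => 0,
        SidecarHealth::Restarting { attempt, .. } => *attempt,
        // Already failed: stay failed until a manual restart.
        SidecarHealth::Failed { .. } => return current.clone(),
    };
    if previous >= MAX_ATTEMPTS {
        return SidecarHealth::Failed {
            attempts: previous,
            last_error: error.to_string(),
        };
    }
    let attempt = previous + 1;
    // Exponential backoff: 1s, 2s, 4s, ... capped at 30s.
    let next_retry = Duration::from_secs((1u64 << (attempt - 1)).min(30));
    SidecarHealth::Restarting { attempt, next_retry }
}

/// Whether a UI should offer the manual restart button.
fn needs_manual_restart(health: &SidecarHealth) -> bool {
    matches!(health, SidecarHealth::Failed { .. })
}
```

Five consecutive crashes walk through `Restarting` attempts 1 to 5; the sixth crash yields `Failed`, which is the point where a frontend would show the manual restart button.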


Landlock Sandbox

Landlock is a Linux security module that restricts filesystem access for processes; agor's sandbox expects kernel 6.2 or newer. Agor uses it to sandbox sidecar processes, limiting which files they can read and write.

Configuration

use std::path::PathBuf;

pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,  // Full access (project dir, temp)
    pub read_only_paths: Vec<PathBuf>,   // Read-only (system libs, SDK)
}

The sandbox is applied via pre_exec() on the child process command, before the sidecar starts executing.
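The following sketch shows where such a hook sits, using the standard library's `CommandExt::pre_exec`. It assumes a Unix target; `apply_sandbox` is a hypothetical placeholder that does nothing here, where real code would build and enforce a Landlock ruleset from the config.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::path::PathBuf;
use std::process::Command;

pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,
    pub read_only_paths: Vec<PathBuf>,
}

/// Hypothetical placeholder: real code would enforce a Landlock ruleset
/// built from `config`. It runs in the forked child, before exec.
fn apply_sandbox(_config: &SandboxConfig) -> io::Result<()> {
    Ok(())
}

/// Builds a command whose child process is restricted before it execs.
fn sandboxed_command(program: &str, config: SandboxConfig) -> Command {
    let mut cmd = Command::new(program);
    // SAFETY: the hook runs between fork and exec, so it must restrict
    // itself to async-signal-safe operations; the placeholder does nothing.
    unsafe {
        cmd.pre_exec(move || apply_sandbox(&config));
    }
    cmd
}
```

Because the hook runs in the child after fork, the restrictions apply to the sidecar only and never to the parent process.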

Path Rules

| Path | Access | Reason |
|------|--------|--------|
| Project CWD | Read/Write | Agent needs to read and modify project files |
| /tmp | Read/Write | Temporary files during operation |
| ~/.local/share/agor/ | Read/Write | SQLite databases (btmsg, sessions) |
| System library paths | Read-only | Node.js/Deno runtime dependencies |
| ~/.claude/ or config dir | Read-only | Claude configuration and credentials |
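As an illustration, the path rules could be turned into a `SandboxConfig` along these lines. The helper `default_sandbox` is hypothetical, and the system library locations (`/usr/lib`, `/lib`) are assumptions that vary by distribution.

```rust
use std::path::{Path, PathBuf};

pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,
    pub read_only_paths: Vec<PathBuf>,
}

/// Builds a config that mirrors the path rules for one project.
fn default_sandbox(project_cwd: &Path, home: &Path) -> SandboxConfig {
    SandboxConfig {
        read_write_paths: vec![
            project_cwd.to_path_buf(),
            PathBuf::from("/tmp"),
            // SQLite databases (btmsg, sessions)
            home.join(".local/share/agor"),
        ],
        read_only_paths: vec![
            // Assumed system library locations; adjust per distribution.
            PathBuf::from("/usr/lib"),
            PathBuf::from("/lib"),
            // Claude configuration and credentials stay read-only.
            home.join(".claude"),
        ],
    }
}
```

Keeping credentials in the read-only list means a misbehaving sidecar can read them to authenticate but cannot overwrite them.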

Graceful Fallback

If the kernel doesn't support Landlock (< 6.2) or Landlock isn't enabled in the running kernel, the sandbox degrades gracefully: the sidecar runs without filesystem restrictions. A warning is logged, but operation continues.
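The fallback amounts to treating a sandbox setup error as a warning rather than a fatal failure. A minimal sketch, assuming a hypothetical `SandboxStatus` enum and using `eprintln!` in place of whatever logging the real code uses:

```rust
use std::io;

#[derive(Debug, PartialEq)]
enum SandboxStatus {
    Enforced,
    Unavailable,
}

/// Maps the result of sandbox setup to a status, logging instead of failing.
fn enforce_or_degrade(result: io::Result<()>) -> SandboxStatus {
    match result {
        Ok(()) => SandboxStatus::Enforced,
        Err(err) => {
            // Assumption: a plain stderr warning stands in for real logging.
            eprintln!("warning: Landlock unavailable, running unsandboxed: {}", err);
            SandboxStatus::Unavailable
        }
    }
}
```

Returning a status instead of `()` lets callers surface the unsandboxed state elsewhere, for example in a health view.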


WAL Checkpoint

Both SQLite databases (sessions.db and btmsg.db) use WAL (Write-Ahead Logging) mode for concurrent read/write access. Without periodic checkpoints, the WAL file can grow without bound.

A background tokio task runs PRAGMA wal_checkpoint(TRUNCATE) every 5 minutes on both databases. This copies the WAL contents into the main database file and truncates the WAL file to zero bytes.
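The shape of such a periodic job can be sketched without external crates. This example swaps the tokio task for a plain `std::thread` and takes the checkpoint as a closure, so it is not the actual implementation; `spawn_periodic` is a hypothetical name.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Runs `job` every `interval` on a background thread until the returned
/// sender is dropped or sent a message.
fn spawn_periodic<F>(interval: Duration, mut job: F) -> (Sender<()>, JoinHandle<()>)
where
    F: FnMut() + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || loop {
        match stop_rx.recv_timeout(interval) {
            // No stop signal within the interval: time for a checkpoint.
            Err(RecvTimeoutError::Timeout) => job(),
            // Explicit stop or sender dropped: exit the loop.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    });
    (stop_tx, handle)
}
```

In the real task the closure would execute PRAGMA wal_checkpoint(TRUNCATE) against each database, with an interval of `Duration::from_secs(300)`.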


TLS Relay Support

The agor-relay binary supports TLS for encrypted WebSocket connections:

agor-relay \
  --port 9750 \
  --token <secret> \
  --tls-cert /path/to/cert.pem \
  --tls-key /path/to/key.pem

Without --tls-cert and --tls-key, the relay accepts connections only when the --insecure flag is passed, in which case it serves plain WebSocket (ws://). In production, TLS is mandatory: the relay rejects ws:// connections unless --insecure is explicitly set.
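The flag rules can be expressed as a small decision function. This is a sketch under assumptions: `Transport` and `select_transport` are hypothetical names, and rejecting a lone --tls-cert or --tls-key is an assumed behavior that the text above does not specify.

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum Transport {
    Tls { cert: PathBuf, key: PathBuf },
    Plain,
}

/// Decides how the relay should listen, based on the parsed CLI flags.
fn select_transport(
    cert: Option<PathBuf>,
    key: Option<PathBuf>,
    insecure: bool,
) -> Result<Transport, String> {
    match (cert, key) {
        (Some(cert), Some(key)) => Ok(Transport::Tls { cert, key }),
        // Plain WebSocket only with an explicit opt-in.
        (None, None) if insecure => Ok(Transport::Plain),
        (None, None) => {
            Err("TLS is required: pass --tls-cert and --tls-key, or --insecure".to_string())
        }
        // Assumption: a certificate without a key (or vice versa) is an error.
        _ => Err("--tls-cert and --tls-key must be given together".to_string()),
    }
}
```

Making plain WebSocket an explicit opt-in means a missing certificate path fails at startup instead of silently downgrading the connection.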

Certificate pinning (comparing relay certificate fingerprints) is planned for v3.1.


OpenTelemetry Observability

The Rust backend supports optional OTLP trace export via the AGOR_OTLP_ENDPOINT environment variable.

Backend (telemetry.rs)

  • TelemetryGuard initializes tracing + OTLP export pipeline
  • Uses tracing + tracing-subscriber + opentelemetry 0.28 + tracing-opentelemetry 0.29
  • OTLP/HTTP export to configured endpoint
  • Drop-based shutdown ensures spans are flushed
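The drop-based flush pattern can be illustrated with the standard library alone. This sketch is not the code in telemetry.rs: the shared vector stands in for an OTLP exporter, and the `init` and `record` signatures are assumptions made for the example.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for an OTLP exporter: spans are "exported" by appending here.
type Exported = Arc<Mutex<Vec<String>>>;

struct TelemetryGuard {
    pending: Vec<String>,
    exported: Exported,
}

impl TelemetryGuard {
    /// Returns `None` when no endpoint is configured, so export stays opt-in.
    fn init(endpoint: Option<String>, exported: Exported) -> Option<Self> {
        endpoint.filter(|e| !e.is_empty())?;
        Some(TelemetryGuard { pending: Vec::new(), exported })
    }

    /// Buffers a span until the guard is dropped.
    fn record(&mut self, span: &str) {
        self.pending.push(span.to_string());
    }
}

impl Drop for TelemetryGuard {
    fn drop(&mut self) {
        // Flush everything still buffered before the guard goes away.
        self.exported.lock().unwrap().append(&mut self.pending);
    }
}
```

A caller would typically pass `std::env::var("AGOR_OTLP_ENDPOINT").ok()` as the endpoint and keep the guard alive for the lifetime of the application, so the flush happens on shutdown.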

Frontend (telemetry-bridge.ts)

The frontend cannot use the browser OTEL SDK because it is incompatible with WebKit2GTK. Instead, it routes events through a frontend_log Tauri command that pipes into Rust's tracing system:

tel.info('agent-started', { sessionId, provider });
tel.warn('context-pressure', { projectId, usage: 0.85 });
tel.error('sidecar-crash', { error: msg });
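On the Rust side, the receiving command has to map a level string and a context object onto a log event. The sketch below shows only that mapping, under assumptions: the signature is hypothetical, the `#[tauri::command]` attribute is omitted to keep the example dependency-free, and it returns the formatted line instead of emitting it through tracing macros.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Level {
    Info,
    Warn,
    Error,
}

/// Maps the level string sent by the frontend; unknown values fall back to Info.
fn parse_level(level: &str) -> Level {
    match level {
        "warn" => Level::Warn,
        "error" => Level::Error,
        _ => Level::Info,
    }
}

/// Hypothetical core of a `frontend_log` handler: picks the level and
/// flattens the context into `key=value` pairs after the message.
fn frontend_log(level: &str, message: &str, context: &[(&str, &str)]) -> (Level, String) {
    let fields: Vec<String> = context
        .iter()
        .map(|(k, v)| format!("{}={}", k, v))
        .collect();
    let line = if fields.is_empty() {
        message.to_string()
    } else {
        format!("{} {}", message, fields.join(" "))
    };
    (parse_level(level), line)
}
```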

Docker Stack

A pre-configured Tempo + Grafana stack lives in docker/tempo/:

cd docker/tempo && docker compose up -d
# Grafana at http://localhost:9715
# Set AGOR_OTLP_ENDPOINT=http://localhost:4318 to enable export

Agent Health Monitoring

Heartbeats

Tier 1 agents send periodic heartbeats via the btmsg heartbeat CLI command. The heartbeats table tracks the last heartbeat timestamp and status for each agent.

Stall Detection

The health store detects stalled agents via the stallThresholdMin setting (default 15 minutes). If an agent hasn't produced output within the threshold, its activity state transitions to Stalled and the attention score jumps to 100 (highest priority).
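The rule reduces to a threshold comparison plus a score override. The sketch below is written in Rust for consistency with the rest of this page and is not the health store's code; the type and function names are hypothetical, and treating exactly 15 minutes of silence as stalled is an assumption.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum ActivityState {
    Active,
    Stalled,
}

/// Default for the stallThresholdMin setting.
const DEFAULT_STALL_THRESHOLD_MIN: u64 = 15;

/// Classifies an agent by how long ago it last produced output.
fn activity_state(since_last_output: Duration, stall_threshold_min: u64) -> ActivityState {
    // Assumption: the boundary is inclusive.
    if since_last_output >= Duration::from_secs(stall_threshold_min * 60) {
        ActivityState::Stalled
    } else {
        ActivityState::Active
    }
}

/// Stalled agents always get the maximum attention score.
fn attention_score(state: &ActivityState, base_score: u8) -> u8 {
    match state {
        ActivityState::Stalled => 100,
        ActivityState::Active => base_score,
    }
}
```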

Dead Letter Queue

Messages sent to agents that are offline or have crashed are moved to the dead letter queue in btmsg.db. This prevents silent message loss and allows debugging delivery failures.
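The routing decision can be sketched with in-memory collections. This is an illustration only: the real queue lives in btmsg.db, and `Router`, `Message`, and the stored reason string are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, PartialEq)]
struct Message {
    to: String,
    body: String,
}

#[derive(Default)]
struct Router {
    online: HashSet<String>,
    inboxes: HashMap<String, Vec<Message>>,
    /// Undeliverable messages, kept together with the failure reason.
    dead_letters: Vec<(Message, String)>,
}

impl Router {
    /// Delivers `msg` if the recipient is online; otherwise parks it in the
    /// dead letter queue. Returns `true` on delivery.
    fn send(&mut self, msg: Message) -> bool {
        if self.online.contains(&msg.to) {
            self.inboxes.entry(msg.to.clone()).or_default().push(msg);
            true
        } else {
            let reason = format!("recipient {} is offline", msg.to);
            self.dead_letters.push((msg, reason));
            false
        }
    }
}
```

Storing the reason alongside the message is what makes the queue useful for debugging delivery failures, not just for preventing loss.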