docs: add 11 new documentation files across all categories

New reference docs:
- agents/ref-btmsg.md: inter-agent messaging schema and CLI
- agents/ref-bttask.md: kanban task board operations
- providers/ref-providers.md: Claude/Codex/Ollama/Aider comparison
- config/ref-settings.md: (already committed)

New guides:
- contributing/dual-repo-workflow.md: community vs commercial repos
- plugins/guide-developing.md: Web Worker sandbox API and publishing

New pro docs:
- pro/features/knowledge-base.md: persistent memory + symbol graph
- pro/features/git-integration.md: context injection + branch policy
- pro/marketplace/README.md: 13 plugins catalog

Split files:
- architecture/data-model.md: from architecture.md (schemas, layout)
- production/hardening.md: from production.md (supervisor, sandbox, WAL)
- production/features.md: from production.md (FTS5, plugins, secrets, audit)
Hibryda 2026-03-17 04:18:05 +01:00
parent 8251321dac
commit b6c1d4b6af
11 changed files with 2198 additions and 0 deletions


@@ -0,0 +1,140 @@
# Production Hardening
Reliability, security, and observability features that ensure agor runs safely in daily use.
---
## Sidecar Supervisor (Crash Recovery)
The `SidecarSupervisor` in `agor-core/src/supervisor.rs` automatically restarts crashed sidecar processes.
### Behavior
When the sidecar child process exits unexpectedly:
1. The supervisor detects the exit via process monitoring
2. Waits with exponential backoff before restarting:
   - Attempt 1: wait 1 second
   - Attempt 2: wait 2 seconds
   - Attempt 3: wait 4 seconds
   - Attempt 4: wait 8 seconds
   - Attempt 5: wait 16 seconds (any single delay is capped at 30 seconds)
3. After 5 failed attempts, the supervisor gives up and reports `SidecarHealth::Failed`
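The schedule above can be sketched as a small pure function. This is a minimal illustration, not the code in `supervisor.rs`: the helper name `backoff_delay` and the two constants are assumptions made for this example.

```rust
use std::time::Duration;

/// Restart attempts allowed before giving up (from the list above).
const MAX_ATTEMPTS: u32 = 5;
/// Upper bound on any single delay (the 30s cap mentioned above).
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Returns the wait before restart attempt `attempt` (1-based),
/// or `None` once the attempt budget is exhausted.
/// Hypothetical helper: name and signature are illustrative only.
fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt == 0 || attempt > MAX_ATTEMPTS {
        return None;
    }
    // 1s, 2s, 4s, 8s, 16s: the delay doubles on every attempt.
    let secs = 1u64 << (attempt - 1);
    Some(Duration::from_secs(secs).min(MAX_DELAY))
}
```

With five attempts the cap is never reached (the longest wait is 16 seconds); it only matters if the attempt limit is raised.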
### Health States
```rust
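pub enum SidecarHealth {
    /// The sidecar is running normally.
    Healthy,
    /// The sidecar crashed; the supervisor is waiting out the backoff delay.
    Restarting { attempt: u32, next_retry: Duration },
    /// Auto-recovery gave up after repeated failed attempts.
    Failed { attempts: u32, last_error: String },
}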
```
The frontend can query health state and offer a manual restart button when auto-recovery fails. 17 unit tests cover all recovery scenarios including edge cases like rapid successive crashes.
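To make the state transitions concrete, here is a hedged sketch of how a crash could move the supervisor between these states. The enum is redeclared so the example is self-contained; `on_crash`, `needs_manual_restart`, and `MAX_ATTEMPTS` are hypothetical names, and the real supervisor may track attempts differently.

```rust
use std::time::Duration;

// Mirrors the enum shown above; redeclared so the sketch compiles on its own.
#[derive(Debug, Clone, PartialEq)]
pub enum SidecarHealth {
    Healthy,
    Restarting { attempt: u32, next_retry: Duration },
    Failed { attempts: u32, last_error: String },
}

/// Assumed attempt limit, taken from the Behavior section.
const MAX_ATTEMPTS: u32 = 5;

/// Hypothetical transition function: computes the state after a crash.
fn on_crash(current: &SidecarHealth, error: &str) -> SidecarHealth {
    // Number of restart attempts already made.
    let made = match current {
        SidecarHealth::Healthy => 0,
        SidecarHealth::Restarting { attempt, .. } => *attempt,
        SidecarHealth::Failed { attempts, .. } => *attempts,
    };
    if made >= MAX_ATTEMPTS {
        return SidecarHealth::Failed {
            attempts: made,
            last_error: error.to_string(),
        };
    }
    let attempt = made + 1;
    // 1s, 2s, 4s, 8s, 16s, never more than 30s.
    let secs = (1u64 << (attempt - 1)).min(30);
    SidecarHealth::Restarting {
        attempt,
        next_retry: Duration::from_secs(secs),
    }
}

/// A frontend would only need to show a manual restart button in `Failed`.
fn needs_manual_restart(health: &SidecarHealth) -> bool {
    matches!(health, SidecarHealth::Failed { .. })
}
```

In this sketch a sixth consecutive crash is what produces `Failed`, which matches "after 5 failed attempts" in the list above.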
---
## Landlock Sandbox
Landlock is a Linux security module that restricts filesystem access for processes. It first shipped in kernel 5.13; agor expects kernel 6.2 or newer. Agor uses it to sandbox sidecar processes, limiting which files they can read and write.
### Configuration
```rust
pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>, // Full access (project dir, temp)
    pub read_only_paths: Vec<PathBuf>,  // Read-only (system libs, SDK)
}
```
The sandbox is applied via `pre_exec()` on the child process command, before the sidecar starts executing.
### Path Rules
| Path | Access | Reason |
|------|--------|--------|
| Project CWD | Read/Write | Agent needs to read and modify project files |
| `/tmp` | Read/Write | Temporary files during operation |
| `~/.local/share/agor/` | Read/Write | SQLite databases (btmsg, sessions) |
| System library paths | Read-only | Node.js/Deno runtime dependencies |
| `~/.claude/` or config dir | Read-only | Claude configuration and credentials |
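As an illustration, the table rows could be turned into a `SandboxConfig` roughly as follows. The constructor name `default_sandbox_config` is hypothetical, and the concrete system library paths (`/usr/lib`, `/lib`) are assumptions, since the table does not list them.

```rust
use std::path::{Path, PathBuf};

// Mirrors the struct shown above; redeclared so the sketch compiles on its own.
pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,
    pub read_only_paths: Vec<PathBuf>,
}

/// Hypothetical constructor mapping the table rows to a `SandboxConfig`.
fn default_sandbox_config(project_cwd: &Path, home: &Path) -> SandboxConfig {
    SandboxConfig {
        read_write_paths: vec![
            project_cwd.to_path_buf(),      // project files
            PathBuf::from("/tmp"),          // temporary files
            home.join(".local/share/agor"), // SQLite databases
        ],
        read_only_paths: vec![
            // Assumed system library locations; not specified in the table.
            PathBuf::from("/usr/lib"),
            PathBuf::from("/lib"),
            home.join(".claude"),           // Claude configuration and credentials
        ],
    }
}
```

Keeping credentials on the read-only list means a sandboxed sidecar can authenticate but cannot rewrite its own configuration.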
### Graceful Fallback
If the kernel doesn't support Landlock (< 6.2) or the kernel module isn't loaded, the sandbox degrades gracefully: the sidecar runs without filesystem restrictions. This is logged as a warning but doesn't prevent operation.
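The fallback decision can be sketched as below. This is a simplified heuristic for illustration only: it parses a kernel release string, whereas production code would more typically probe the Landlock ABI directly. `SandboxStatus`, `parse_kernel_version`, and `sandbox_status` are hypothetical names.

```rust
/// Outcome of trying to sandbox the sidecar (hypothetical type).
#[derive(Debug, PartialEq)]
enum SandboxStatus {
    Enforced,
    Unrestricted,
}

/// Parses (major, minor) from a release string such as "6.5.0-21-generic".
fn parse_kernel_version(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    // The minor component may carry a suffix, e.g. "2-rc1".
    let minor_digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor: u32 = minor_digits.parse().ok()?;
    Some((major, minor))
}

/// Decides whether the sandbox is enforced; warns and continues otherwise.
fn sandbox_status(kernel_release: &str, module_loaded: bool) -> SandboxStatus {
    let supported = matches!(parse_kernel_version(kernel_release), Some(v) if v >= (6, 2));
    if supported && module_loaded {
        SandboxStatus::Enforced
    } else {
        // Degrade instead of failing: log a warning and run unrestricted.
        eprintln!("warning: Landlock unavailable, sidecar runs without filesystem restrictions");
        SandboxStatus::Unrestricted
    }
}
```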
---
## WAL Checkpoint
Both SQLite databases (`sessions.db` and `btmsg.db`) use WAL (Write-Ahead Logging) mode for concurrent read/write access. Without periodic checkpoints, the WAL file grows without bound.
A background tokio task runs `PRAGMA wal_checkpoint(TRUNCATE)` every 5 minutes on both databases. This moves WAL data into the main database file and resets the WAL.
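The shape of such a periodic task can be sketched with the standard library alone. Note the substitution: this example uses a plain `std::thread` and a channel instead of a tokio task, and it hands the SQL string to a caller-supplied closure rather than talking to SQLite. `spawn_checkpointer` is a hypothetical helper.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// The statement the background task runs, as described above.
const CHECKPOINT_SQL: &str = "PRAGMA wal_checkpoint(TRUNCATE)";

/// Calls `job(CHECKPOINT_SQL)` once per `interval` until the returned sender
/// is used or dropped. A std-thread stand-in for the tokio task.
fn spawn_checkpointer<F>(interval: Duration, mut job: F) -> (Sender<()>, JoinHandle<()>)
where
    F: FnMut(&str) + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || loop {
        match stop_rx.recv_timeout(interval) {
            // No stop signal within the interval: run a checkpoint.
            Err(RecvTimeoutError::Timeout) => job(CHECKPOINT_SQL),
            // Stop requested, or the sender was dropped: exit the loop.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    });
    (stop_tx, handle)
}
```

For the 5-minute cadence described above, the interval would be `Duration::from_secs(5 * 60)`, and the closure would execute the statement against each database connection.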
---
## TLS Relay Support
The `agor-relay` binary supports TLS for encrypted WebSocket connections:
```bash
agor-relay \
--port 9750 \
--token <secret> \
--tls-cert /path/to/cert.pem \
--tls-key /path/to/key.pem
```
Without `--tls-cert`/`--tls-key`, the relay serves plain WebSocket (`ws://`) connections only when started with the `--insecure` flag. In production, TLS is mandatory: unless `--insecure` is explicitly set, the relay rejects `ws://` connections.
Certificate pinning (comparing relay certificate fingerprints) is planned for v3.1.
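The flag rules for `--tls-cert`, `--tls-key`, and `--insecure` can be summarized as a small decision function. This is a sketch under assumptions, not the relay's actual argument handling: `Transport` and `resolve_transport` are hypothetical names, and letting TLS win when `--insecure` is also passed is an assumption.

```rust
use std::path::PathBuf;

/// How the relay would serve connections (hypothetical type).
#[derive(Debug, PartialEq)]
enum Transport {
    Tls { cert: PathBuf, key: PathBuf },
    Plain,
}

/// Maps the three flags to a transport, or an error message.
fn resolve_transport(
    tls_cert: Option<PathBuf>,
    tls_key: Option<PathBuf>,
    insecure: bool,
) -> Result<Transport, String> {
    match (tls_cert, tls_key, insecure) {
        // Both TLS files given: serve wss:// (assumed to win over --insecure).
        (Some(cert), Some(key), _) => Ok(Transport::Tls { cert, key }),
        // No TLS material, explicit opt-in: serve plain ws://.
        (None, None, true) => Ok(Transport::Plain),
        // No TLS material and no opt-in: refuse.
        (None, None, false) => {
            Err("TLS is required: pass --tls-cert and --tls-key, or --insecure".to_string())
        }
        // Only one of the two TLS flags was provided.
        _ => Err("--tls-cert and --tls-key must be provided together".to_string()),
    }
}
```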
---
## OpenTelemetry Observability
The Rust backend supports optional OTLP trace export via the `AGOR_OTLP_ENDPOINT` environment variable.
### Backend (`telemetry.rs`)
- `TelemetryGuard` initializes tracing + OTLP export pipeline
- Uses `tracing` + `tracing-subscriber` + `opentelemetry` 0.28 + `tracing-opentelemetry` 0.29
- OTLP/HTTP export to configured endpoint
- `Drop`-based shutdown ensures spans are flushed
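The `Drop`-based shutdown pattern can be illustrated without any tracing crates. In this sketch the guard only records that a flush happened; the struct fields, the `init_telemetry` signature, and the `flushed` flag are assumptions for demonstration and do not reflect the real `TelemetryGuard` internals.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the real guard: remembers the endpoint and flushes on drop.
struct TelemetryGuard {
    endpoint: String,
    flushed: Arc<AtomicBool>,
}

impl Drop for TelemetryGuard {
    fn drop(&mut self) {
        // A real guard would flush pending spans to `self.endpoint` here.
        let _ = &self.endpoint;
        self.flushed.store(true, Ordering::SeqCst);
    }
}

/// Returns a guard only when an endpoint is configured, mirroring the
/// opt-in behaviour of `AGOR_OTLP_ENDPOINT`. Hypothetical signature.
fn init_telemetry(endpoint: Option<String>, flushed: Arc<AtomicBool>) -> Option<TelemetryGuard> {
    // Treat an unset or blank variable as "export disabled".
    let endpoint = endpoint.filter(|e| !e.trim().is_empty())?;
    Some(TelemetryGuard { endpoint, flushed })
}
```

A caller would pass `std::env::var("AGOR_OTLP_ENDPOINT").ok()` as the first argument and keep the returned guard alive for the lifetime of the process, so the flush runs when it goes out of scope at shutdown.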
### Frontend (`telemetry-bridge.ts`)
The frontend cannot use the browser OTEL SDK because it is incompatible with WebKit2GTK. Instead, it routes events through a `frontend_log` Tauri command that pipes into Rust's tracing system:
```typescript
tel.info('agent-started', { sessionId, provider });
tel.warn('context-pressure', { projectId, usage: 0.85 });
tel.error('sidecar-crash', { error: msg });
```
### Docker Stack
A pre-configured Tempo + Grafana stack lives in `docker/tempo/`:
```bash
cd docker/tempo && docker compose up -d
# Grafana at http://localhost:9715
# Set AGOR_OTLP_ENDPOINT=http://localhost:4318 to enable export
```
---
## Agent Health Monitoring
### Heartbeats
Tier 1 agents send periodic heartbeats via the `btmsg heartbeat` CLI command. The heartbeats table tracks the last heartbeat timestamp and status for each agent.
### Stale Detection
The health store detects stalled agents via the `stallThresholdMin` setting (default 15 minutes). If an agent hasn't produced output within the threshold, its activity state transitions to `Stalled` and the attention score jumps to 100 (highest priority).
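The stall rule can be sketched as two pure functions. The health store itself lives in the frontend, so this Rust version only illustrates the logic: the `Active` variant, the `base_score` parameter, and treating "exactly at the threshold" as stalled are assumptions.

```rust
use std::time::Duration;

/// Simplified activity state; only `Stalled` is named in the text above.
#[derive(Debug, PartialEq)]
enum ActivityState {
    Active,
    Stalled,
}

/// Default for `stallThresholdMin`, in minutes.
const DEFAULT_STALL_THRESHOLD_MIN: u64 = 15;

/// Classifies an agent by the time elapsed since its last output.
fn activity_state(since_last_output: Duration, stall_threshold_min: u64) -> ActivityState {
    if since_last_output >= Duration::from_secs(stall_threshold_min * 60) {
        ActivityState::Stalled
    } else {
        ActivityState::Active
    }
}

/// Stalled agents jump to the maximum attention score of 100.
fn attention_score(state: &ActivityState, base_score: u8) -> u8 {
    match state {
        ActivityState::Stalled => 100,
        ActivityState::Active => base_score,
    }
}
```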
### Dead Letter Queue
Messages sent to agents that are offline or have crashed are moved to the dead letter queue in `btmsg.db`. This prevents silent message loss and allows debugging delivery failures.
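The routing idea can be shown with an in-memory sketch. The real queue is stored in `btmsg.db`, whose schema is not described here, so the `Message` and `Router` types, their fields, and the recorded reason string are all hypothetical.

```rust
use std::collections::HashSet;

/// Minimal message shape for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    to: String,
    body: String,
}

/// In-memory stand-in for the delivery path and the dead letter queue.
#[derive(Default)]
struct Router {
    online: HashSet<String>,
    delivered: Vec<Message>,
    dead_letters: Vec<(Message, String)>,
}

impl Router {
    /// Delivers to online agents; parks everything else with a reason,
    /// so that no message is dropped silently. Returns `true` on delivery.
    fn send(&mut self, msg: Message) -> bool {
        if self.online.contains(&msg.to) {
            self.delivered.push(msg);
            true
        } else {
            let reason = format!("recipient '{}' is not online", msg.to);
            self.dead_letters.push((msg, reason));
            false
        }
    }
}
```

Storing the reason next to each parked message is what makes delivery failures debuggable later.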