docs: add 11 new documentation files across all categories

New reference docs:
- agents/ref-btmsg.md: inter-agent messaging schema and CLI
- agents/ref-bttask.md: kanban task board operations
- providers/ref-providers.md: Claude/Codex/Ollama/Aider comparison
- config/ref-settings.md: (already committed)

New guides:
- contributing/dual-repo-workflow.md: community vs commercial repos
- plugins/guide-developing.md: Web Worker sandbox API and publishing

New pro docs:
- pro/features/knowledge-base.md: persistent memory + symbol graph
- pro/features/git-integration.md: context injection + branch policy
- pro/marketplace/README.md: 13 plugins catalog

Split files:
- architecture/data-model.md: from architecture.md (schemas, layout)
- production/hardening.md: from production.md (supervisor, sandbox, WAL)
- production/features.md: from production.md (FTS5, plugins, secrets, audit)
parent 8251321dac
commit b6c1d4b6af
11 changed files with 2198 additions and 0 deletions
docs/production/hardening.md (Normal file, 140 lines)
@@ -0,0 +1,140 @@

# Production Hardening

Reliability, security, and observability features that ensure agor runs safely in daily use.

---

## Sidecar Supervisor (Crash Recovery)

The `SidecarSupervisor` in `agor-core/src/supervisor.rs` automatically restarts crashed sidecar processes.

### Behavior

When the sidecar child process exits unexpectedly:

1. The supervisor detects the exit via process monitoring.
2. It waits with exponential backoff before each restart attempt (the delay doubles each time and is capped at 30 seconds):
   - Attempt 1: wait 1 second
   - Attempt 2: wait 2 seconds
   - Attempt 3: wait 4 seconds
   - Attempt 4: wait 8 seconds
   - Attempt 5: wait 16 seconds
3. After 5 failed attempts, the supervisor gives up and reports `SidecarHealth::Failed`.
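
The schedule above can be sketched as a small pure function. This is a minimal illustration, not the code in `supervisor.rs`: the names `backoff_delay`, `MAX_ATTEMPTS`, `BASE_DELAY`, and `MAX_DELAY` are assumptions, and only the delays and limits are taken from the list.

```rust
use std::time::Duration;

// Values taken from the documented schedule; the constant names are hypothetical.
const MAX_ATTEMPTS: u32 = 5;
const BASE_DELAY: Duration = Duration::from_secs(1);
const MAX_DELAY: Duration = Duration::from_secs(30);

/// Delay to wait before restart attempt `attempt` (1-based).
/// Returns `None` once the retry budget is exhausted.
pub fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt == 0 || attempt > MAX_ATTEMPTS {
        return None;
    }
    // 1s, 2s, 4s, ... doubling per attempt; saturating math avoids overflow.
    let factor = 2u32.saturating_pow(attempt - 1);
    Some(BASE_DELAY.saturating_mul(factor).min(MAX_DELAY))
}
```

With five attempts the largest delay is 16 seconds, so the 30-second cap only matters if the attempt limit is ever raised.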

### Health States

```rust
use std::time::Duration;

pub enum SidecarHealth {
    // Sidecar is running normally
    Healthy,
    // Waiting out the backoff delay before the next restart attempt
    Restarting { attempt: u32, next_retry: Duration },
    // Supervisor gave up after the retry budget was exhausted
    Failed { attempts: u32, last_error: String },
}
```

The frontend can query the health state and offer a manual restart button when auto-recovery fails. 17 unit tests cover the recovery scenarios, including edge cases such as rapid successive crashes.
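
As a rough sketch of how a caller might consume these states, the helpers below decide when to surface the manual restart button and produce a status string. The enum is repeated so the example compiles on its own; the `derive` attributes, `needs_manual_restart`, `status_line`, and the message wording are all assumptions for illustration.

```rust
use std::time::Duration;

// Same shape as the enum above; the derives are added for this sketch only.
#[derive(Debug, Clone, PartialEq)]
pub enum SidecarHealth {
    Healthy,
    Restarting { attempt: u32, next_retry: Duration },
    Failed { attempts: u32, last_error: String },
}

/// True when automatic recovery has given up (hypothetical helper).
pub fn needs_manual_restart(health: &SidecarHealth) -> bool {
    matches!(health, SidecarHealth::Failed { .. })
}

/// Status text a UI could display; the wording is illustrative.
pub fn status_line(health: &SidecarHealth) -> String {
    match health {
        SidecarHealth::Healthy => "Sidecar running".to_string(),
        SidecarHealth::Restarting { attempt, next_retry } => {
            format!("Restarting (attempt {attempt}, retry in {}s)", next_retry.as_secs())
        }
        SidecarHealth::Failed { attempts, last_error } => {
            format!("Failed after {attempts} attempts: {last_error}")
        }
    }
}
```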

---

## Landlock Sandbox

Landlock is a Linux kernel security module that restricts filesystem access for processes (mainline since 5.13; agor requires kernel 6.2+). Agor uses it to sandbox sidecar processes, limiting which files they can read and write.

### Configuration

```rust
use std::path::PathBuf;

pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>, // Full access (project dir, temp)
    pub read_only_paths: Vec<PathBuf>,  // Read-only (system libs, SDK)
}
```

The sandbox is applied via `pre_exec()` on the child process command, before the sidecar starts executing.

### Path Rules

| Path | Access | Reason |
|------|--------|--------|
| Project CWD | Read/Write | Agent needs to read and modify project files |
| `/tmp` | Read/Write | Temporary files during operation |
| `~/.local/share/agor/` | Read/Write | SQLite databases (btmsg, sessions) |
| System library paths | Read-only | Node.js/Deno runtime dependencies |
| `~/.claude/` or config dir | Read-only | Claude configuration and credentials |
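
The table can be read as a recipe for filling in `SandboxConfig`. The sketch below is one possible way to do that, not agor's actual code: `config_for_project` is a hypothetical helper, and the system library directories are passed in by the caller because this document does not list them.

```rust
use std::path::{Path, PathBuf};

// Repeated from the Configuration section so the sketch is self-contained.
pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,
    pub read_only_paths: Vec<PathBuf>,
}

/// Builds a config that mirrors the path rules table (hypothetical helper).
/// `system_lib_dirs` is supplied by the caller; the concrete paths are not
/// specified in this document.
pub fn config_for_project(
    project_dir: &Path,
    home: &Path,
    system_lib_dirs: &[PathBuf],
) -> SandboxConfig {
    let mut read_only_paths = system_lib_dirs.to_vec();
    read_only_paths.push(home.join(".claude")); // Claude configuration and credentials
    SandboxConfig {
        read_write_paths: vec![
            project_dir.to_path_buf(),      // project files
            PathBuf::from("/tmp"),          // temporary files
            home.join(".local/share/agor"), // SQLite databases
        ],
        read_only_paths,
    }
}
```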

### Graceful Fallback

If the kernel is older than 6.2 or Landlock is not enabled, the sandbox degrades gracefully: the sidecar runs without filesystem restrictions. A warning is logged, but operation continues.
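
The fallback policy amounts to treating a sandbox setup failure as a warning rather than a fatal error. A minimal sketch of that decision follows, assuming the setup step reports failure as a `Result`; `SandboxOutcome` and `resolve_sandbox` are hypothetical names, and `eprintln!` stands in for whatever logging the backend uses.

```rust
/// Result of trying to sandbox the sidecar (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum SandboxOutcome {
    Enforced,
    Unrestricted { warning: String },
}

/// Maps the result of applying the sandbox to a non-fatal outcome.
/// A failure is logged and the sidecar is allowed to start anyway.
pub fn resolve_sandbox(apply_result: Result<(), String>) -> SandboxOutcome {
    match apply_result {
        Ok(()) => SandboxOutcome::Enforced,
        Err(reason) => {
            let warning = format!(
                "Landlock unavailable ({reason}); sidecar runs without filesystem restrictions"
            );
            eprintln!("warning: {warning}"); // stand-in for the real logger
            SandboxOutcome::Unrestricted { warning }
        }
    }
}
```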

---

## WAL Checkpoint

Both SQLite databases (`sessions.db` and `btmsg.db`) use WAL (Write-Ahead Logging) mode for concurrent read/write access. Without periodic checkpoints, the WAL file grows without bound.

A background tokio task runs `PRAGMA wal_checkpoint(TRUNCATE)` every 5 minutes on both databases. This moves WAL data into the main database file and resets the WAL.
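
The periodic loop can be sketched with the standard library alone. Note the substitutions: this uses a plain thread instead of a tokio task, and a closure stands in for the SQLite connection, so `CheckpointTask` and `run_sql` are hypothetical and show only the timing and shutdown logic.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

pub const CHECKPOINT_SQL: &str = "PRAGMA wal_checkpoint(TRUNCATE);";

/// Handle to a background checkpoint loop (hypothetical type).
pub struct CheckpointTask {
    stop_tx: Sender<()>,
    handle: JoinHandle<()>,
}

impl CheckpointTask {
    /// Calls `run_sql` with the checkpoint pragma once per `interval`.
    /// In a real backend the closure would execute the statement on each database.
    pub fn spawn<F>(interval: Duration, mut run_sql: F) -> Self
    where
        F: FnMut(&str) + Send + 'static,
    {
        let (stop_tx, stop_rx) = mpsc::channel::<()>();
        let handle = thread::spawn(move || loop {
            match stop_rx.recv_timeout(interval) {
                // No stop signal within the interval: time to checkpoint.
                Err(RecvTimeoutError::Timeout) => run_sql(CHECKPOINT_SQL),
                // Stop requested, or the handle was dropped.
                _ => break,
            }
        });
        Self { stop_tx, handle }
    }

    /// Signals the loop to exit and waits for the thread to finish.
    pub fn stop(self) {
        let _ = self.stop_tx.send(());
        let _ = self.handle.join();
    }
}
```

For the documented schedule the interval would be `Duration::from_secs(5 * 60)`.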

---

## TLS Relay Support

The `agor-relay` binary supports TLS for encrypted WebSocket connections:

```bash
agor-relay \
  --port 9750 \
  --token <secret> \
  --tls-cert /path/to/cert.pem \
  --tls-key /path/to/key.pem
```

Without `--tls-cert` and `--tls-key`, the relay accepts plain WebSocket (`ws://`) connections only when `--insecure` is set explicitly; otherwise it rejects them. In production, TLS is mandatory.

Certificate pinning (comparing relay certificate fingerprints) is planned for v3.1.
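
The flag rules can be summarized as a small decision function. This is a sketch under assumptions rather than the relay's real argument handling: `Transport` and `resolve_transport` are hypothetical, the error messages are invented, and the choice to let a complete certificate/key pair take precedence over `--insecure` is a guess.

```rust
use std::path::PathBuf;

/// How the relay would listen (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Transport {
    Tls { cert: PathBuf, key: PathBuf },
    Plain,
}

/// Decides the transport from the three documented flags.
pub fn resolve_transport(
    cert: Option<PathBuf>,
    key: Option<PathBuf>,
    insecure: bool,
) -> Result<Transport, String> {
    match (cert, key, insecure) {
        // Assumption: a full cert/key pair wins even if --insecure is also set.
        (Some(cert), Some(key), _) => Ok(Transport::Tls { cert, key }),
        (Some(_), None, _) | (None, Some(_), _) => {
            Err("--tls-cert and --tls-key must be given together".to_string())
        }
        (None, None, true) => Ok(Transport::Plain),
        (None, None, false) => Err(
            "TLS is required: pass --tls-cert and --tls-key, or opt out with --insecure"
                .to_string(),
        ),
    }
}
```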

---

## OpenTelemetry Observability

The Rust backend supports optional OTLP trace export via the `AGOR_OTLP_ENDPOINT` environment variable.

### Backend (`telemetry.rs`)

- `TelemetryGuard` initializes tracing + OTLP export pipeline
- Uses `tracing` + `tracing-subscriber` + `opentelemetry` 0.28 + `tracing-opentelemetry` 0.29
- OTLP/HTTP export to configured endpoint
- `Drop`-based shutdown ensures spans are flushed
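
Two of the ideas in this list, opt-in export via an environment variable and flushing on `Drop`, can be shown without the OpenTelemetry crates. The sketch below is not `telemetry.rs`: `endpoint_from`, `configured_endpoint`, and `ExportGuard` are hypothetical, and a closure stands in for the real exporter shutdown.

```rust
use std::env;

pub const ENDPOINT_VAR: &str = "AGOR_OTLP_ENDPOINT";

/// Normalizes a raw setting: unset or blank means export stays disabled.
pub fn endpoint_from(raw: Option<String>) -> Option<String> {
    raw.map(|v| v.trim().to_string()).filter(|v| !v.is_empty())
}

/// Reads the endpoint from the environment, if configured.
pub fn configured_endpoint() -> Option<String> {
    endpoint_from(env::var(ENDPOINT_VAR).ok())
}

/// Runs `flush` when dropped, so pending spans are exported even on early
/// returns. The closure is a stand-in for the exporter's shutdown call.
pub struct ExportGuard<F: FnMut()> {
    flush: F,
}

impl<F: FnMut()> ExportGuard<F> {
    pub fn new(flush: F) -> Self {
        Self { flush }
    }
}

impl<F: FnMut()> Drop for ExportGuard<F> {
    fn drop(&mut self) {
        (self.flush)();
    }
}
```

Keeping the guard alive for the lifetime of the application means the flush runs once, when the guard goes out of scope at shutdown.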

### Frontend (`telemetry-bridge.ts`)

The frontend cannot use the browser OTEL SDK because it is incompatible with WebKit2GTK. Instead, it routes events through a `frontend_log` Tauri command that pipes into Rust's tracing system:

```typescript
tel.info('agent-started', { sessionId, provider });
tel.warn('context-pressure', { projectId, usage: 0.85 });
tel.error('sidecar-crash', { error: msg });
```

### Docker Stack

A pre-configured Tempo + Grafana stack lives in `docker/tempo/`:

```bash
cd docker/tempo && docker compose up -d
# Grafana at http://localhost:9715
# Set AGOR_OTLP_ENDPOINT=http://localhost:4318 to enable export
```

---

## Agent Health Monitoring

### Heartbeats

Tier 1 agents send periodic heartbeats via the `btmsg heartbeat` CLI command. The heartbeats table tracks the last heartbeat timestamp and status per agent.

### Stale Detection

The health store detects stalled agents via the `stallThresholdMin` setting (default 15 minutes). If an agent hasn't produced output within the threshold, its activity state transitions to `Stalled` and its attention score jumps to 100 (highest priority).
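
The stall rule reduces to a threshold comparison plus a score override. The following sketch uses Rust only to match the other examples and does not reflect where or how the health store is implemented: `ActivityState::Active`, `activity_state`, `attention_score`, and `base_score` are assumptions, as is treating an idle time exactly equal to the threshold as stalled.

```rust
use std::time::Duration;

pub const DEFAULT_STALL_THRESHOLD_MIN: u64 = 15;
pub const STALLED_ATTENTION_SCORE: u32 = 100;

/// Only `Stalled` is named in the text; `Active` is a placeholder for
/// whatever other states exist.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ActivityState {
    Active,
    Stalled,
}

/// Classifies an agent by the time elapsed since its last output.
/// Assumption: reaching the threshold exactly already counts as stalled.
pub fn activity_state(since_last_output: Duration, stall_threshold_min: u64) -> ActivityState {
    let threshold = Duration::from_secs(stall_threshold_min.saturating_mul(60));
    if since_last_output >= threshold {
        ActivityState::Stalled
    } else {
        ActivityState::Active
    }
}

/// Stalled agents jump to the maximum score; otherwise the score computed
/// elsewhere (`base_score`, hypothetical) is kept, clamped to the maximum.
pub fn attention_score(state: ActivityState, base_score: u32) -> u32 {
    match state {
        ActivityState::Stalled => STALLED_ATTENTION_SCORE,
        ActivityState::Active => base_score.min(STALLED_ATTENTION_SCORE),
    }
}
```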

### Dead Letter Queue

Messages sent to agents that are offline or have crashed are moved to the dead letter queue in `btmsg.db`. This prevents silent message loss and allows debugging delivery failures.
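
The routing decision described here can be sketched as a single match on the recipient's status. Everything in this example is hypothetical (`AgentStatus`, `Delivery`, `route`, and the reason strings); the real queue lives in `btmsg.db`, whose schema is not shown in this document.

```rust
/// Recipient status as seen by the sender (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AgentStatus {
    Online,
    Offline,
    Crashed,
}

/// Where a message ends up (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Delivery {
    Inbox,
    DeadLetter { reason: &'static str },
}

/// Undeliverable messages are kept with a reason instead of being dropped,
/// so delivery failures can be inspected later.
pub fn route(recipient: AgentStatus) -> Delivery {
    match recipient {
        AgentStatus::Online => Delivery::Inbox,
        AgentStatus::Offline => Delivery::DeadLetter { reason: "recipient offline" },
        AgentStatus::Crashed => Delivery::DeadLetter { reason: "recipient crashed" },
    }
}
```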