docs: add architecture, sidecar, orchestration, and production guides

New documentation covering end-to-end system architecture, multi-provider sidecar lifecycle, btmsg/bttask multi-agent orchestration, and production hardening features (supervisor, sandbox, search, plugins, secrets, audit).
2026-03-14 02:33:59 +01:00 · 2026-03-14 02:33:59 +01:00 · de8dd04f4b
commit de8dd04f4b
parent f6b3a3e080
4 changed files with 1294 additions and 0 deletions
--- a/docs/production.md
+++ b/docs/production.md
@ -0,0 +1,364 @@
+# Production Hardening
+
+Agent Orchestrator includes several production-readiness features that ensure reliability, security, and observability. This document covers each subsystem in detail.
+
+---
+
+## Sidecar Supervisor (Crash Recovery)
+
+The `SidecarSupervisor` in `bterminal-core/src/supervisor.rs` automatically restarts crashed sidecar processes.
+
+### Behavior
+
+When the sidecar child process exits unexpectedly:
+
+1. The supervisor detects the exit via process monitoring
+2. Waits with exponential backoff before restarting:
+   - Attempt 1: wait 1 second
+   - Attempt 2: wait 2 seconds
+   - Attempt 3: wait 4 seconds
+   - Attempt 4: wait 8 seconds
+   - Attempt 5: wait 16 seconds (capped at 30s)
+3. After 5 failed attempts, the supervisor gives up and reports `SidecarHealth::Failed`
+
+### Health States
+
+```rust
+pub enum SidecarHealth {
+    Healthy,
+    Restarting { attempt: u32, next_retry: Duration },
+    Failed { attempts: u32, last_error: String },
+}
+```
+
+The frontend can query health state and offer a manual restart button when auto-recovery fails. 17 unit tests cover all recovery scenarios including edge cases like rapid successive crashes.
+
+---
+
+## Landlock Sandbox
+
+Landlock is a Linux kernel (6.2+) security module that restricts filesystem access for processes. Agent Orchestrator uses it to sandbox sidecar processes, limiting what files they can read and write.
+
+### Configuration
+
+```rust
+pub struct SandboxConfig {
+    pub read_write_paths: Vec<PathBuf>,  // Full access (project dir, temp)
+    pub read_only_paths: Vec<PathBuf>,   // Read-only (system libs, SDK)
+}
+```
+
+The sandbox is applied via `pre_exec()` on the child process command, before the sidecar starts executing.
+
+### Path Rules
+
+| Path | Access | Reason |
+|------|--------|--------|
+| Project CWD | Read/Write | Agent needs to read and modify project files |
+| `/tmp` | Read/Write | Temporary files during operation |
+| `~/.local/share/bterminal/` | Read/Write | SQLite databases (btmsg, sessions) |
+| System library paths | Read-only | Node.js/Deno runtime dependencies |
+| `~/.claude/` or config dir | Read-only | Claude configuration and credentials |
+
+### Graceful Fallback
+
+If the kernel doesn't support Landlock (< 6.2) or the kernel module isn't loaded, the sandbox silently degrades — the sidecar runs without filesystem restrictions. This is logged as a warning but doesn't prevent operation.
+
+---
+
+## FTS5 Full-Text Search
+
+The search system uses SQLite's FTS5 extension for full-text search across three data types. Accessed via a Spotlight-style overlay (Ctrl+Shift+F).
+
+### Architecture
+
+```
+SearchOverlay.svelte (Ctrl+Shift+F)
+    │
+    └── search-bridge.ts → Tauri commands
+         │
+         └── search.rs → SearchDb (separate FTS5 tables)
+              │
+              ├── search_messages  — agent session messages
+              ├── search_tasks     — bttask task content
+              └── search_btmsg     — btmsg inter-agent messages
+```
+
+### Virtual Tables
+
+The `SearchDb` struct in `search.rs` manages three FTS5 virtual tables:
+
+| Table | Source | Indexed Columns |
+|-------|--------|----------------|
+| `search_messages` | Agent session messages | content, session_id, project_id |
+| `search_tasks` | bttask tasks | title, description, assignee, status |
+| `search_btmsg` | btmsg messages | content, sender, recipient, channel |
+
+### Operations
+
+| Tauri Command | Purpose |
+|---------------|---------|
+| `search_init` | Creates FTS5 virtual tables if not exist |
+| `search_all` | Queries all 3 tables, returns ranked results |
+| `search_rebuild` | Drops and rebuilds all indices (maintenance) |
+| `search_index_message` | Indexes a single new message (real-time) |
+
+### Frontend (SearchOverlay.svelte)
+
+- Triggered by Ctrl+Shift+F
+- Spotlight-style floating overlay centered on screen
+- 300ms debounce on input to avoid excessive queries
+- Results grouped by type (Messages, Tasks, Communications)
+- Click result to navigate to source (focus project, switch tab)
+
+---
+
+## Plugin System
+
+The plugin system allows extending Agent Orchestrator with custom commands and event handlers. Plugins are sandboxed JavaScript executing in a restricted environment.
+
+### Plugin Discovery
+
+Plugins live in `~/.config/bterminal/plugins/`. Each plugin is a directory containing a `plugin.json` manifest:
+
+```json
+{
+  "name": "my-plugin",
+  "version": "1.0.0",
+  "description": "A custom plugin",
+  "main": "index.js",
+  "permissions": ["notifications", "settings"]
+}
+```
+
+The Rust `plugins.rs` module scans for `plugin.json` files with path-traversal protection (rejects `..` in paths).
+
+### Sandboxed Runtime (plugin-host.ts)
+
+Plugins execute via `new Function()` in a restricted scope:
+
+**Shadowed globals (13):**
+`fetch`, `XMLHttpRequest`, `WebSocket`, `Worker`, `eval`, `Function`, `importScripts`, `require`, `process`, `globalThis`, `window`, `document`, `localStorage`
+
+**Provided API (permission-gated):**
+
+| API | Permission | Purpose |
+|-----|-----------|---------|
+| `bt.notify(msg)` | `notifications` | Show toast notification |
+| `bt.getSetting(key)` | `settings` | Read app setting |
+| `bt.setSetting(key, val)` | `settings` | Write app setting |
+| `bt.registerCommand(name, fn)` | — (always allowed) | Add command to palette |
+| `bt.on(event, fn)` | — (always allowed) | Subscribe to app events |
+
+The API object is frozen (`Object.freeze`) to prevent tampering. Strict mode is enforced.
+
+### Plugin Store (`plugins.svelte.ts`)
+
+The store manages plugin lifecycle:
+- `loadAllPlugins()` — discover, validate permissions, execute in sandbox
+- `unloadAllPlugins()` — cleanup event listeners, remove commands
+- Command registry integrates with CommandPalette
+- Event bus distributes app events to subscribed plugins
+
+### Security Notes
+
+The `new Function()` sandbox is best-effort — it is not a security boundary. A determined attacker could escape it. Landlock provides the actual filesystem restriction. The plugin sandbox primarily prevents accidental damage from buggy plugins.
+
+35 tests cover the plugin system including permission validation, sandbox escape attempts, and lifecycle management.
+
+---
+
+## Secrets Management
+
+Secrets (API keys, tokens) are stored in the system keyring rather than in plaintext files or SQLite.
+
+### Backend (`secrets.rs`)
+
+Uses the `keyring` crate with the `linux-native` feature (libsecret/DBUS):
+
+```rust
+pub struct SecretsManager;
+
+impl SecretsManager {
+    pub fn store(key: &str, value: &str) -> Result<()>;
+    pub fn get(key: &str) -> Result<Option<String>>;
+    pub fn delete(key: &str) -> Result<()>;
+    pub fn list() -> Result<Vec<SecretMetadata>>;
+    pub fn has_keyring() -> bool;
+}
+```
+
+Metadata (key names, last modified timestamps) is stored in SQLite settings. The actual secret values never touch disk — they live only in the system keyring (gnome-keyring, KWallet, or equivalent).
+
+### Frontend (`secrets-bridge.ts`)
+
+| Function | Purpose |
+|----------|---------|
+| `storeSecret(key, value)` | Store a secret in keyring |
+| `getSecret(key)` | Retrieve a secret |
+| `deleteSecret(key)` | Remove a secret |
+| `listSecrets()` | List all secret metadata |
+| `hasKeyring()` | Check if system keyring is available |
+
+### No Fallback
+
+If no keyring daemon is available (no DBUS session, no gnome-keyring), secret operations fail with a clear error message. There is no plaintext fallback — this is intentional to prevent accidental credential leakage.
+
+---
+
+## Notifications
+
+Agent Orchestrator has two notification systems: in-app toasts and OS-level desktop notifications.
+
+### In-App Toasts (`notifications.svelte.ts`)
+
+- 6 notification types: `success`, `error`, `warning`, `info`, `agent_complete`, `agent_error`
+- Maximum 5 visible toasts, 4-second auto-dismiss
+- Toast history (up to 100 entries) with unread badge in NotificationCenter
+- Agent dispatcher emits toasts on: agent completion, agent error, sidecar crash
+
+### Desktop Notifications (`notifications.rs`)
+
+Uses `notify-rust` crate for native Linux notifications. Graceful fallback if notification daemon is unavailable (e.g., no D-Bus session).
+
+Frontend triggers via `sendDesktopNotification()` in `notifications-bridge.ts`. Used for events that should be visible even when the app is not focused.
+
+### Notification Center (`NotificationCenter.svelte`)
+
+Bell icon in the top-right with unread badge. Dropdown panel shows notification history with timestamps, type icons, and clear/mark-read actions.
+
+---
+
+## Agent Health Monitoring
+
+### Heartbeats
+
+Tier 1 agents send periodic heartbeats via `btmsg heartbeat` CLI command. The heartbeats table tracks last heartbeat timestamp and status per agent.
+
+### Stale Detection
+
+The health store detects stalled agents via the `stallThresholdMin` setting (default 15 minutes). If an agent hasn't produced output within the threshold, its activity state transitions to `Stalled` and the attention score jumps to 100 (highest priority).
+
+### Dead Letter Queue
+
+Messages sent to agents that are offline or have crashed are moved to the dead letter queue in `btmsg.db`. This prevents silent message loss and allows debugging delivery failures.
+
+### Audit Logging
+
+All significant events are logged to the `audit_log` table:
+
+| Event Type | Logged When |
+|-----------|-------------|
+| `message_sent` | Agent sends a btmsg message |
+| `message_read` | Agent reads messages |
+| `channel_created` | New btmsg channel created |
+| `agent_registered` | Agent registers with btmsg |
+| `heartbeat` | Agent sends heartbeat |
+| `task_created` | New bttask task |
+| `task_status_changed` | Task status update |
+| `wake_event` | Wake scheduler triggers |
+| `prompt_injection_detected` | Suspicious content in agent messages |
+
+The AuditLogTab component in the workspace UI displays audit entries with filtering by event type and agent, with 5-second auto-refresh and max 200 entries.
+
+---
+
+## Error Classification
+
+The error classifier (`utils/error-classifier.ts`) categorizes API errors into 6 types with appropriate retry behavior:
+
+| Type | Examples | Retry? | User Message |
+|------|----------|--------|--------------|
+| `rate_limit` | HTTP 429, "rate limit exceeded" | Yes (with backoff) | "Rate limited — retrying in Xs" |
+| `auth` | HTTP 401/403, "invalid API key" | No | "Authentication failed — check API key" |
+| `quota` | "quota exceeded", "billing" | No | "Usage quota exceeded" |
+| `overloaded` | HTTP 529, "overloaded" | Yes (longer backoff) | "Service overloaded — retrying" |
+| `network` | ECONNREFUSED, timeout, DNS failure | Yes | "Network error — check connection" |
+| `unknown` | Anything else | No | "Unexpected error" |
+
+20 unit tests cover classification accuracy across various error message formats.
+
+---
+
+## WAL Checkpoint
+
+Both SQLite databases (`sessions.db` and `btmsg.db`) use WAL (Write-Ahead Logging) mode for concurrent read/write access. Without periodic checkpoints, the WAL file grows unboundedly.
+
+A background tokio task runs `PRAGMA wal_checkpoint(TRUNCATE)` every 5 minutes on both databases. This moves WAL data into the main database file and resets the WAL.
+
+---
+
+## TLS Relay Support
+
+The `bterminal-relay` binary supports TLS for encrypted WebSocket connections:
+
+```bash
+bterminal-relay \
+  --port 9750 \
+  --token <secret> \
+  --tls-cert /path/to/cert.pem \
+  --tls-key /path/to/key.pem
+```
+
+Without `--tls-cert`/`--tls-key`, the relay only accepts connections with the `--insecure` flag (plain WebSocket). In production, TLS is mandatory — the relay rejects `ws://` connections unless `--insecure` is explicitly set.
+
+Certificate pinning (comparing relay certificate fingerprints) is planned for v3.1.
+
+---
+
+## OpenTelemetry Observability
+
+The Rust backend supports optional OTLP trace export via the `BTERMINAL_OTLP_ENDPOINT` environment variable.
+
+### Backend (`telemetry.rs`)
+
+- `TelemetryGuard` initializes tracing + OTLP export pipeline
+- Uses `tracing` + `tracing-subscriber` + `opentelemetry` 0.28 + `tracing-opentelemetry` 0.29
+- OTLP/HTTP export to configured endpoint
+- `Drop`-based shutdown ensures spans are flushed
+
+### Frontend (`telemetry-bridge.ts`)
+
+The frontend cannot use the browser OTEL SDK (WebKit2GTK incompatible). Instead, it routes events through a `frontend_log` Tauri command that pipes into Rust's tracing system:
+
+```typescript
+tel.info('agent-started', { sessionId, provider });
+tel.warn('context-pressure', { projectId, usage: 0.85 });
+tel.error('sidecar-crash', { error: msg });
+```
+
+### Docker Stack
+
+A pre-configured Tempo + Grafana stack lives in `docker/tempo/`:
+
+```bash
+cd docker/tempo && docker compose up -d
+# Grafana at http://localhost:9715
+# Set BTERMINAL_OTLP_ENDPOINT=http://localhost:4318 to enable export
+```
+
+---
+
+## Session Metrics
+
+Per-project historical session data is stored in the `session_metrics` table:
+
+| Column | Type | Purpose |
+|--------|------|---------|
+| `project_id` | TEXT | Which project |
+| `session_id` | TEXT | Agent session ID |
+| `start_time` | INTEGER | Session start timestamp |
+| `end_time` | INTEGER | Session end timestamp |
+| `peak_tokens` | INTEGER | Maximum context tokens used |
+| `turn_count` | INTEGER | Total conversation turns |
+| `tool_call_count` | INTEGER | Total tool calls made |
+| `cost_usd` | REAL | Total cost in USD |
+| `model` | TEXT | Model used |
+| `status` | TEXT | Final status (success/error/stopped) |
+| `error_message` | TEXT | Error details if failed |
+
+100-row retention per project (oldest pruned on insert). Metrics are persisted on agent completion via the agent dispatcher.
+
+The MetricsPanel component displays this data as:
+- **Live view** — fleet aggregates, project health grid, task board summary, attention queue
+- **History view** — SVG sparklines for cost/tokens/turns/tools/duration, stats row, session table