Hibryda de8dd04f4b docs: add architecture, sidecar, orchestration, and production guides

New documentation covering end-to-end system architecture, multi-provider
sidecar lifecycle, btmsg/bttask multi-agent orchestration, and production
hardening features (supervisor, sandbox, search, plugins, secrets, audit).

2026-03-14 02:33:59 +01:00

Production Hardening

Agent Orchestrator includes several production-readiness features aimed at reliability, security, and observability. This document covers each subsystem in detail.

Sidecar Supervisor (Crash Recovery)

The SidecarSupervisor in bterminal-core/src/supervisor.rs automatically restarts crashed sidecar processes.

Behavior

When the sidecar child process exits unexpectedly:

1. The supervisor detects the exit via process monitoring.
2. It waits with exponential backoff before restarting (the delay doubles on each attempt and is capped at 30 seconds):
   - Attempt 1: wait 1 second
   - Attempt 2: wait 2 seconds
   - Attempt 3: wait 4 seconds
   - Attempt 4: wait 8 seconds
   - Attempt 5: wait 16 seconds
3. After 5 failed attempts, the supervisor gives up and reports SidecarHealth::Failed.
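The schedule above can be sketched as a small pure function. This is a minimal illustration, not the code in supervisor.rs: the names `backoff_delay`, `MAX_ATTEMPTS`, and `MAX_DELAY` are hypothetical, and only the numbers (1s doubling, 30s cap, 5 attempts) come from the text.

```rust
use std::time::Duration;

/// Automatic restart attempts allowed before giving up (from the text above).
pub const MAX_ATTEMPTS: u32 = 5;
/// Upper bound on any single wait (from the text above).
pub const MAX_DELAY: Duration = Duration::from_secs(30);

/// Delay before restart attempt `attempt` (1-based).
/// Returns `None` when the attempt number is outside the 1..=5 budget.
/// Hypothetical helper; the real supervisor may structure this differently.
pub fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt == 0 || attempt > MAX_ATTEMPTS {
        return None;
    }
    // 1s, 2s, 4s, 8s, 16s: double per attempt, then clamp to the cap.
    let secs = 1u64 << (attempt - 1);
    Some(Duration::from_secs(secs).min(MAX_DELAY))
}
```

With five attempts the cap is never reached (the longest wait is 16 seconds); it only matters if the attempt budget is raised.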

Health States

pub enum SidecarHealth {
    Healthy,
    Restarting { attempt: u32, next_retry: Duration },
    Failed { attempts: u32, last_error: String },
}

The frontend can query the health state and offer a manual restart button when auto-recovery fails. 17 unit tests cover the recovery scenarios, including edge cases such as rapid successive crashes.
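As an illustration of how a consumer might act on these states, the sketch below maps each variant to a status line and decides when a manual restart button makes sense. The enum is repeated so the block is self-contained; `status_line` and `offer_manual_restart` are hypothetical helpers, and the message wording is an assumption.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub enum SidecarHealth {
    Healthy,
    Restarting { attempt: u32, next_retry: Duration },
    Failed { attempts: u32, last_error: String },
}

/// Hypothetical helper: a one-line status a UI could display.
pub fn status_line(health: &SidecarHealth) -> String {
    match health {
        SidecarHealth::Healthy => "Sidecar running".to_string(),
        SidecarHealth::Restarting { attempt, next_retry } => format!(
            "Restarting (attempt {attempt}), next retry in {}s",
            next_retry.as_secs()
        ),
        SidecarHealth::Failed { attempts, last_error } => {
            format!("Failed after {attempts} attempts: {last_error}")
        }
    }
}

/// Hypothetical helper: only a terminal failure warrants a manual restart.
pub fn offer_manual_restart(health: &SidecarHealth) -> bool {
    matches!(health, SidecarHealth::Failed { .. })
}
```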

Landlock Sandbox

Landlock is a Linux security module that restricts filesystem access for processes. It first shipped in kernel 5.13; Agent Orchestrator enables it on kernel 6.2 and later, using it to sandbox sidecar processes and limit which files they can read and write.

Configuration

pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>,  // Full access (project dir, temp)
    pub read_only_paths: Vec<PathBuf>,   // Read-only (system libs, SDK)
}

The sandbox is applied via pre_exec() on the child process command, before the sidecar starts executing.

Path Rules

| Path | Access | Reason |
|------|--------|--------|
| Project CWD | Read/Write | Agent needs to read and modify project files |
| `/tmp` | Read/Write | Temporary files during operation |
| `~/.local/share/bterminal/` | Read/Write | SQLite databases (btmsg, sessions) |
| System library paths | Read-only | Node.js/Deno runtime dependencies |
| `~/.claude/` or config dir | Read-only | Claude configuration and credentials |
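The path rules can be expressed as a `SandboxConfig` value. The sketch below only builds the configuration and does not apply Landlock. `default_sandbox_config` is a hypothetical constructor, and `/usr/lib` and `/lib` are assumed stand-ins for "system library paths", which the table does not enumerate.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, PartialEq)]
pub struct SandboxConfig {
    pub read_write_paths: Vec<PathBuf>, // Full access (project dir, temp)
    pub read_only_paths: Vec<PathBuf>,  // Read-only (system libs, SDK)
}

/// Hypothetical constructor mirroring the path rules table.
/// `home` is the user's home directory, passed in to keep the function pure.
pub fn default_sandbox_config(project_cwd: &Path, home: &Path) -> SandboxConfig {
    SandboxConfig {
        read_write_paths: vec![
            project_cwd.to_path_buf(),
            PathBuf::from("/tmp"),
            home.join(".local/share/bterminal"),
        ],
        read_only_paths: vec![
            // Assumed locations for runtime libraries; adjust per distribution.
            PathBuf::from("/usr/lib"),
            PathBuf::from("/lib"),
            home.join(".claude"),
        ],
    }
}
```

Keeping the credentials directory in the read-only list means a compromised sidecar can read but not rewrite them.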

Graceful Fallback

If the kernel doesn't support Landlock (< 6.2) or the kernel module isn't loaded, the sandbox degrades gracefully: the sidecar runs without filesystem restrictions. The fallback is logged as a warning but doesn't prevent operation.

FTS5 Full-Text Search

The search system uses SQLite's FTS5 extension for full-text search across three data types. It is accessed via a Spotlight-style overlay (Ctrl+Shift+F).

Architecture

SearchOverlay.svelte (Ctrl+Shift+F)
    │
    └── search-bridge.ts → Tauri commands
         │
         └── search.rs → SearchDb (separate FTS5 tables)
              │
              ├── search_messages  — agent session messages
              ├── search_tasks     — bttask task content
              └── search_btmsg     — btmsg inter-agent messages

Virtual Tables

The SearchDb struct in search.rs manages three FTS5 virtual tables:

| Table | Source | Indexed Columns |
|-------|--------|-----------------|
| `search_messages` | Agent session messages | content, session_id, project_id |
| `search_tasks` | bttask tasks | title, description, assignee, status |
| `search_btmsg` | btmsg messages | content, sender, recipient, channel |
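For orientation, the sketch below generates the kind of SQL these tables imply. The table and column names come from the table above; the helper names and the exact statements are assumptions, and the real search.rs may differ (for example in tokenizer options or selected columns).

```rust
/// Table name and indexed columns, as listed in the table above.
pub const SEARCH_TABLES: [(&str, &[&str]); 3] = [
    ("search_messages", &["content", "session_id", "project_id"]),
    ("search_tasks", &["title", "description", "assignee", "status"]),
    ("search_btmsg", &["content", "sender", "recipient", "channel"]),
];

/// Hypothetical helper: DDL for one FTS5 virtual table.
/// `table` and `columns` must be trusted identifiers, never user input.
pub fn create_fts5_sql(table: &str, columns: &[&str]) -> String {
    format!(
        "CREATE VIRTUAL TABLE IF NOT EXISTS {table} USING fts5({});",
        columns.join(", ")
    )
}

/// Hypothetical helper: ranked full-text query against one table.
/// `?1` is the bound search term; FTS5 exposes `rank` for ordering.
pub fn search_sql(table: &str) -> String {
    format!("SELECT rowid, rank FROM {table} WHERE {table} MATCH ?1 ORDER BY rank;")
}
```

Binding the search term as a parameter rather than interpolating it keeps user input out of the SQL text.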

Operations

| Tauri Command | Purpose |
|---------------|---------|
| `search_init` | Creates the FTS5 virtual tables if they do not exist |
| `search_all` | Queries all 3 tables, returns ranked results |
| `search_rebuild` | Drops and rebuilds all indices (maintenance) |
| `search_index_message` | Indexes a single new message (real-time) |

Frontend (SearchOverlay.svelte)

- Triggered by Ctrl+Shift+F
- Spotlight-style floating overlay centered on screen
- 300ms debounce on input to avoid excessive queries
- Results grouped by type (Messages, Tasks, Communications)
- Click result to navigate to source (focus project, switch tab)

Plugin System

The plugin system allows extending Agent Orchestrator with custom commands and event handlers. Plugins are sandboxed JavaScript executing in a restricted environment.

Plugin Discovery

Plugins live in ~/.config/bterminal/plugins/. Each plugin is a directory containing a plugin.json manifest:

{
  "name": "my-plugin",
  "version": "1.0.0",
  "description": "A custom plugin",
  "main": "index.js",
  "permissions": ["notifications", "settings"]
}

The Rust plugins.rs module scans for plugin.json files with path-traversal protection (rejects .. in paths).
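A minimal sketch of such a check is shown below. It is not the code from plugins.rs: `is_safe_relative_path` is a hypothetical name, and rejecting absolute and empty paths in addition to `..` is an assumption that goes slightly beyond what the text states.

```rust
use std::path::{Component, Path};

/// Hypothetical check for a path taken from a plugin manifest or directory scan.
/// Accepts only non-empty relative paths made of normal components, so any
/// `..` segment (and, by assumption, any absolute path) is rejected.
pub fn is_safe_relative_path(candidate: &Path) -> bool {
    let mut components = candidate.components().peekable();
    if components.peek().is_none() {
        return false; // Empty path: nothing to load.
    }
    components.all(|c| matches!(c, Component::Normal(_) | Component::CurDir))
}
```

Inspecting parsed components rather than searching the string for ".." avoids false positives on names such as `my..plugin`.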

Sandboxed Runtime (plugin-host.ts)

Plugins execute via new Function() in a restricted scope:

Shadowed globals (13): fetch, XMLHttpRequest, WebSocket, Worker, eval, Function, importScripts, require, process, globalThis, window, document, localStorage

Provided API (permission-gated):

| API | Permission | Purpose |
|-----|------------|---------|
| `bt.notify(msg)` | `notifications` | Show toast notification |
| `bt.getSetting(key)` | `settings` | Read app setting |
| `bt.setSetting(key, val)` | `settings` | Write app setting |
| `bt.registerCommand(name, fn)` | None (always allowed) | Add command to palette |
| `bt.on(event, fn)` | None (always allowed) | Subscribe to app events |

The API object is frozen (Object.freeze) to prevent tampering. Strict mode is enforced.

Plugin Store (`plugins.svelte.ts`)

The store manages plugin lifecycle:

- loadAllPlugins(): discover, validate permissions, execute in sandbox
- unloadAllPlugins(): cleanup event listeners, remove commands
- Command registry integrates with CommandPalette
- Event bus distributes app events to subscribed plugins

Security Notes

The new Function() sandbox is best-effort — it is not a security boundary. A determined attacker could escape it. Landlock provides the actual filesystem restriction. The plugin sandbox primarily prevents accidental damage from buggy plugins.

35 tests cover the plugin system including permission validation, sandbox escape attempts, and lifecycle management.

Secrets Management

Secrets (API keys, tokens) are stored in the system keyring rather than in plaintext files or SQLite.

Backend (`secrets.rs`)

Uses the keyring crate with the linux-native feature (libsecret/DBUS):

// API surface only: signatures are shown without method bodies.
pub struct SecretsManager;

impl SecretsManager {
    pub fn store(key: &str, value: &str) -> Result<()>;
    pub fn get(key: &str) -> Result<Option<String>>;
    pub fn delete(key: &str) -> Result<()>;
    pub fn list() -> Result<Vec<SecretMetadata>>;
    pub fn has_keyring() -> bool;
}

Metadata (key names, last modified timestamps) is stored in SQLite settings. The secret values themselves are never written to the application's own files or databases; they live only in the system keyring (gnome-keyring, KWallet, or equivalent).

Frontend (`secrets-bridge.ts`)

| Function | Purpose |
|----------|---------|
| `storeSecret(key, value)` | Store a secret in keyring |
| `getSecret(key)` | Retrieve a secret |
| `deleteSecret(key)` | Remove a secret |
| `listSecrets()` | List all secret metadata |
| `hasKeyring()` | Check if system keyring is available |

No Fallback

If no keyring daemon is available (no D-Bus session, no gnome-keyring), secret operations fail with a clear error message. There is deliberately no plaintext fallback, which prevents accidental credential leakage.

Notifications

Agent Orchestrator has two notification systems: in-app toasts and OS-level desktop notifications.

In-App Toasts (`notifications.svelte.ts`)

- 6 notification types: success, error, warning, info, agent_complete, agent_error
- Maximum 5 visible toasts, 4-second auto-dismiss
- Toast history (up to 100 entries) with unread badge in NotificationCenter
- Agent dispatcher emits toasts on: agent completion, agent error, sidecar crash

Desktop Notifications (`notifications.rs`)

Desktop notifications use the notify-rust crate for native Linux notifications, with a graceful fallback if the notification daemon is unavailable (e.g., no D-Bus session).

Frontend triggers via sendDesktopNotification() in notifications-bridge.ts. Used for events that should be visible even when the app is not focused.

Notification Center (`NotificationCenter.svelte`)

Bell icon in the top-right with unread badge. Dropdown panel shows notification history with timestamps, type icons, and clear/mark-read actions.

Agent Health Monitoring

Heartbeats

Tier 1 agents send periodic heartbeats via the btmsg heartbeat CLI command. The heartbeats table tracks the last heartbeat timestamp and status for each agent.

Stale Detection

The health store detects stalled agents via the stallThresholdMin setting (default 15 minutes). If an agent hasn't produced output within the threshold, its activity state transitions to Stalled and the attention score jumps to 100 (highest priority).
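The stall rule can be sketched as two pure functions. This is illustrative only: the health store itself is frontend code, the names here are hypothetical, and treating "exactly at the threshold" as stalled is an assumption the text does not settle.

```rust
use std::time::Duration;

/// Default for the stallThresholdMin setting (from the text above).
pub const DEFAULT_STALL_THRESHOLD_MIN: u64 = 15;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Activity {
    Active,
    Stalled,
}

/// Hypothetical classifier: stalled once the time since the last output
/// reaches the threshold (inclusive comparison is an assumption).
pub fn classify_activity(since_last_output: Duration, stall_threshold_min: u64) -> Activity {
    if since_last_output >= Duration::from_secs(stall_threshold_min * 60) {
        Activity::Stalled
    } else {
        Activity::Active
    }
}

/// A stalled agent gets the maximum attention score of 100; otherwise the
/// caller-supplied score is kept unchanged.
pub fn attention_score(activity: Activity, current_score: u8) -> u8 {
    match activity {
        Activity::Stalled => 100,
        Activity::Active => current_score,
    }
}
```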

Dead Letter Queue

Messages sent to agents that are offline or have crashed are moved to the dead letter queue in btmsg.db. This prevents silent message loss and allows debugging delivery failures.

Audit Logging

All significant events are logged to the audit_log table:

| Event Type | Logged When |
|------------|-------------|
| `message_sent` | Agent sends a btmsg message |
| `message_read` | Agent reads messages |
| `channel_created` | New btmsg channel created |
| `agent_registered` | Agent registers with btmsg |
| `heartbeat` | Agent sends heartbeat |
| `task_created` | New bttask task |
| `task_status_changed` | Task status update |
| `wake_event` | Wake scheduler triggers |
| `prompt_injection_detected` | Suspicious content in agent messages |

The AuditLogTab component in the workspace UI displays audit entries and supports filtering by event type and agent. It auto-refreshes every 5 seconds and shows at most 200 entries.

Error Classification

The error classifier (utils/error-classifier.ts) categorizes API errors into 6 types with appropriate retry behavior:

| Type | Examples | Retry? | User Message |
|------|----------|--------|--------------|
| `rate_limit` | HTTP 429, "rate limit exceeded" | Yes (with backoff) | "Rate limited — retrying in Xs" |
| `auth` | HTTP 401/403, "invalid API key" | No | "Authentication failed — check API key" |
| `quota` | "quota exceeded", "billing" | No | "Usage quota exceeded" |
| `overloaded` | HTTP 529, "overloaded" | Yes (longer backoff) | "Service overloaded — retrying" |
| `network` | ECONNREFUSED, timeout, DNS failure | Yes | "Network error — check connection" |
| `unknown` | Anything else | No | "Unexpected error" |

20 unit tests cover classification accuracy across various error message formats.
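The classification table can be read as a small decision function. The sketch below is written in Rust to match the other examples in this document, although the real classifier lives in utils/error-classifier.ts. The matching order and the exact substrings are assumptions drawn from the Examples column.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorKind {
    RateLimit,
    Auth,
    Quota,
    Overloaded,
    Network,
    Unknown,
}

/// Hypothetical classifier: HTTP status first, then case-insensitive
/// substring checks on the error message.
pub fn classify(status: Option<u16>, message: &str) -> ErrorKind {
    match status {
        Some(429) => return ErrorKind::RateLimit,
        Some(401) | Some(403) => return ErrorKind::Auth,
        Some(529) => return ErrorKind::Overloaded,
        _ => {}
    }
    let msg = message.to_lowercase();
    let has = |needle: &str| msg.contains(needle);
    if has("rate limit") {
        ErrorKind::RateLimit
    } else if has("invalid api key") {
        ErrorKind::Auth
    } else if has("quota") || has("billing") {
        ErrorKind::Quota
    } else if has("overloaded") {
        ErrorKind::Overloaded
    } else if has("econnrefused") || has("timeout") || has("dns") {
        ErrorKind::Network
    } else {
        ErrorKind::Unknown
    }
}

/// Retry policy from the "Retry?" column of the table.
pub fn is_retryable(kind: ErrorKind) -> bool {
    matches!(
        kind,
        ErrorKind::RateLimit | ErrorKind::Overloaded | ErrorKind::Network
    )
}
```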

WAL Checkpoint

Both SQLite databases (sessions.db and btmsg.db) use WAL (Write-Ahead Logging) mode for concurrent read/write access. Without periodic checkpoints, the WAL file can grow without bound.

A background tokio task runs PRAGMA wal_checkpoint(TRUNCATE) every 5 minutes on both databases. This moves WAL data into the main database file and resets the WAL.
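The shape of that background job is sketched below. To stay self-contained it uses a std thread and a caller-supplied closure in place of tokio and a real SQLite connection, so it is a stand-in for the technique, not the project's implementation. `spawn_checkpoint_loop` and its parameters are hypothetical; the PRAGMA text and database names come from this section.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Statement issued on every tick (from the text above).
pub const CHECKPOINT_SQL: &str = "PRAGMA wal_checkpoint(TRUNCATE);";
/// Databases that run in WAL mode (from the text above).
pub const DATABASES: [&str; 2] = ["sessions.db", "btmsg.db"];

/// Hypothetical periodic runner. `run_sql(db, sql)` is supplied by the caller
/// and would execute the statement on the named database. The loop stops once
/// `stop` is set; it only checks the flag between sleeps.
pub fn spawn_checkpoint_loop<F>(
    interval: Duration,
    stop: Arc<AtomicBool>,
    mut run_sql: F,
) -> JoinHandle<()>
where
    F: FnMut(&str, &str) + Send + 'static,
{
    thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            thread::sleep(interval);
            if stop.load(Ordering::Relaxed) {
                break;
            }
            for db in DATABASES {
                run_sql(db, CHECKPOINT_SQL);
            }
        }
    })
}
```

In production the interval would be `Duration::from_secs(5 * 60)`. An async runtime with a cancellable timer avoids the delayed shutdown that a plain sleep implies.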

TLS Relay Support

The bterminal-relay binary supports TLS for encrypted WebSocket connections:

bterminal-relay \
  --port 9750 \
  --token <secret> \
  --tls-cert /path/to/cert.pem \
  --tls-key /path/to/key.pem

Without --tls-cert and --tls-key, the relay accepts connections only when started with the --insecure flag, in which case it serves plain WebSocket (ws://). In production, TLS is mandatory: ws:// connections are rejected unless --insecure is set explicitly.
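The flag rules can be summarized as a small decision function. This is a sketch under assumptions, not the relay's argument parser: `resolve_transport` and `Transport` are hypothetical names, and two behaviors are guesses (TLS wins when both TLS files and --insecure are given, and supplying only one of the two TLS files is treated as an error).

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Transport {
    /// Encrypted WebSocket (wss://) using the given certificate and key paths.
    Tls { cert: String, key: String },
    /// Plain WebSocket (ws://), only with explicit opt-in via --insecure.
    PlainInsecure,
}

/// Hypothetical resolution of --tls-cert, --tls-key, and --insecure.
pub fn resolve_transport(
    tls_cert: Option<&str>,
    tls_key: Option<&str>,
    insecure: bool,
) -> Result<Transport, String> {
    match (tls_cert, tls_key) {
        (Some(cert), Some(key)) => Ok(Transport::Tls {
            cert: cert.to_string(),
            key: key.to_string(),
        }),
        (None, None) if insecure => Ok(Transport::PlainInsecure),
        (None, None) => {
            Err("TLS required: pass --tls-cert and --tls-key, or --insecure".to_string())
        }
        // Assumption: a lone cert or key is a configuration mistake.
        _ => Err("--tls-cert and --tls-key must be provided together".to_string()),
    }
}
```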

Certificate pinning (comparing relay certificate fingerprints) is planned for v3.1.

OpenTelemetry Observability

The Rust backend supports optional OTLP trace export via the BTERMINAL_OTLP_ENDPOINT environment variable.

Backend (`telemetry.rs`)

- TelemetryGuard initializes tracing + OTLP export pipeline
- Uses tracing + tracing-subscriber + opentelemetry 0.28 + tracing-opentelemetry 0.29
- OTLP/HTTP export to configured endpoint
- Drop-based shutdown ensures spans are flushed
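Because export is optional, startup code has to decide whether an endpoint is configured at all. The sketch below shows one way to read the variable. The helper names are hypothetical, and treating an empty or whitespace-only value as "disabled" is an assumption.

```rust
/// Environment variable named in this section.
pub const OTLP_ENV_VAR: &str = "BTERMINAL_OTLP_ENDPOINT";

/// Hypothetical normalizer: returns the trimmed endpoint, or `None` when the
/// value is missing or blank (blank-means-disabled is an assumption).
pub fn otlp_endpoint(raw: Option<&str>) -> Option<String> {
    raw.map(str::trim)
        .filter(|value| !value.is_empty())
        .map(str::to_owned)
}

/// Reads the endpoint from the process environment.
pub fn otlp_endpoint_from_env() -> Option<String> {
    otlp_endpoint(std::env::var(OTLP_ENV_VAR).ok().as_deref())
}
```

When this returns `None`, the tracing pipeline can be set up with local logging only and no exporter.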

Frontend (`telemetry-bridge.ts`)

The frontend cannot use the browser OTEL SDK because it is incompatible with WebKit2GTK. Instead, it routes events through a frontend_log Tauri command that pipes into Rust's tracing system:

tel.info('agent-started', { sessionId, provider });
tel.warn('context-pressure', { projectId, usage: 0.85 });
tel.error('sidecar-crash', { error: msg });

Docker Stack

A pre-configured Tempo + Grafana stack lives in docker/tempo/:

cd docker/tempo && docker compose up -d
# Grafana at http://localhost:9715
# Set BTERMINAL_OTLP_ENDPOINT=http://localhost:4318 to enable export

Session Metrics

Per-project historical session data is stored in the session_metrics table:

| Column | Type | Purpose |
|--------|------|---------|
| `project_id` | TEXT | Which project |
| `session_id` | TEXT | Agent session ID |
| `start_time` | INTEGER | Session start timestamp |
| `end_time` | INTEGER | Session end timestamp |
| `peak_tokens` | INTEGER | Maximum context tokens used |
| `turn_count` | INTEGER | Total conversation turns |
| `tool_call_count` | INTEGER | Total tool calls made |
| `cost_usd` | REAL | Total cost in USD |
| `model` | TEXT | Model used |
| `status` | TEXT | Final status (success/error/stopped) |
| `error_message` | TEXT | Error details if failed |

Retention is limited to 100 rows per project; the oldest rows are pruned on insert. Metrics are persisted on agent completion via the agent dispatcher.
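The prune-on-insert rule can be modeled in memory as follows. This is an illustration of the retention policy only: the real data lives in the session_metrics SQLite table, and `MetricsStore`, `SessionMetric` (reduced here to two fields), and the method names are hypothetical.

```rust
use std::collections::{HashMap, VecDeque};

/// Rows kept per project (from the text above).
pub const MAX_ROWS_PER_PROJECT: usize = 100;

/// Reduced stand-in for one session_metrics row.
#[derive(Debug, Clone, PartialEq)]
pub struct SessionMetric {
    pub session_id: String,
    pub cost_usd: f64,
}

/// Hypothetical in-memory model: one insertion-ordered queue per project.
#[derive(Default)]
pub struct MetricsStore {
    by_project: HashMap<String, VecDeque<SessionMetric>>,
}

impl MetricsStore {
    /// Appends a row, then drops the oldest rows beyond the cap.
    pub fn insert(&mut self, project_id: &str, metric: SessionMetric) {
        let rows = self.by_project.entry(project_id.to_string()).or_default();
        rows.push_back(metric);
        while rows.len() > MAX_ROWS_PER_PROJECT {
            rows.pop_front();
        }
    }

    /// Number of rows currently retained for a project.
    pub fn len(&self, project_id: &str) -> usize {
        self.by_project.get(project_id).map_or(0, VecDeque::len)
    }

    /// Oldest retained row for a project, if any.
    pub fn oldest(&self, project_id: &str) -> Option<&SessionMetric> {
        self.by_project.get(project_id).and_then(|rows| rows.front())
    }
}
```

Pruning inside the insert path keeps the cap enforced without a separate cleanup job, and each project's history is trimmed independently.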

The MetricsPanel component displays this data as:

- Live view: fleet aggregates, project health grid, task board summary, attention queue
- History view: SVG sparklines for cost/tokens/turns/tools/duration, stats row, session table
