docs: complete reorganization — move remaining docs into subdirectories

2026-03-17 04:22:38 +01:00
commit 493b436eef
parent 8641f260f7
5 changed files with 669 additions and 0 deletions
--- a/docs/architecture/decisions.md
+++ b/docs/architecture/decisions.md
@@ -0,0 +1,51 @@
+# Architecture Decisions Log
+
+This document records significant architecture decisions made during the development of Agent Orchestrator (agor). Each entry captures the decision, its rationale, and the date it was made. Decisions are listed chronologically within each category.
+
+---
+
+## Data & Configuration
+
+| Decision | Rationale | Date |
+|----------|-----------|------|
+| JSON for groups config, SQLite for session state | JSON is human-editable, shareable, version-controllable. SQLite for ephemeral runtime state. Load at startup only — no hot-reload, no split-brain risk. | 2026-03-07 |
+| btmsg/bttask shared SQLite DB | Both CLI tools share `~/.local/share/agor/btmsg.db`. Single DB simplifies deployment — agents already have the path. Read-only for non-Manager roles via CLI permissions. | 2026-03-11 |
+
+## Layout & UI
+
+| Decision | Rationale | Date |
+|----------|-----------|------|
+| Adaptive project count from viewport width | `Math.min(projects.length, Math.max(1, Math.floor(containerWidth / 520)))` — 5 at 5120px, 3 at 1920px, scroll-snap for overflow. min-width 480px. Better than forcing 5 at all sizes. | 2026-03-07 |
+| Flexbox + scroll-snap over CSS Grid | Allows horizontal scroll on narrow screens. Scroll-snap gives clean project-to-project scrolling. | 2026-03-07 |
+| Team panel: inline >2560px, overlay <2560px | Adapts to available space. Collapsed when no subagents are running. Saves ~240px on smaller screens. | 2026-03-07 |
+| Project accent colors from Catppuccin palette | Visual distinction: blue/green/mauve/peach/pink per slot 1-5. Applied to border + header tint via `var(--accent)`. | 2026-03-07 |
+| VSCode-style left sidebar (replaces top tab bar) | Vertical icon rail (2.75rem) + expandable drawer (max 50%) + always-visible workspace. Settings is a regular tab, not a special drawer. ProjectGrid always visible. Ctrl+B toggles. | 2026-03-08 |
+| CSS relative units (rule 18) | rem/em for all layout CSS. Pixels only for icon sizes, borders, box shadows. Exception: `--ui-font-size`/`--term-font-size` store px values because xterm.js takes font sizes in pixels. | 2026-03-08 |
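+
+The adaptive-count formula in the first row can be wrapped in a small function. This is a sketch: `visibleProjectCount` and `PROJECT_COLUMN_WIDTH_PX` are illustrative names, and only the formula and its 520px divisor come from this log.
+
+```typescript
+// Illustrative constant: per-project column width (px) used by the formula.
+const PROJECT_COLUMN_WIDTH_PX = 520;
+
+// Hypothetical wrapper around the adaptive-count formula.
+function visibleProjectCount(projectCount: number, containerWidth: number): number {
+  // At least one project always fits; never show more projects than exist.
+  const fit = Math.max(1, Math.floor(containerWidth / PROJECT_COLUMN_WIDTH_PX));
+  return Math.min(projectCount, fit);
+}
+```
+
+With five projects this yields 5 columns at 5120px (floor(5120 / 520) = 9, capped at 5) and 3 at 1920px (floor(1920 / 520) = 3); the remaining projects are reached via scroll-snap.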
+
+## Agent Architecture
+
+| Decision | Rationale | Date |
+|----------|-----------|------|
+| Single shared sidecar (v3.0) | Existing multiplexed protocol handles concurrent sessions. Per-project pool deferred to v3.1 if crash isolation is needed. Saves ~200MB RAM. | 2026-03-07 |
+| xterm budget: 4 active, unlimited suspended | WebKit2GTK OOM at ~5 instances. Serialize scrollback to text buffer, destroy xterm, recreate on focus. PTY stays alive. Suspend/resume < 50ms. | 2026-03-07 |
+| AgentPane splits into AgentSession + TeamAgentsPanel | Team agents shown inline in right panel, not as separate panes. Saves xterm/pane slots. | 2026-03-07 |
+| Tier 1 agents as ProjectBoxes via `agentToProject()` | Agents render as full ProjectBoxes (not separate UI). `getAllWorkItems()` merges agents + projects. Unified rendering = less code, same capabilities. | 2026-03-11 |
+| `extra_env` 5-layer passthrough for BTMSG_AGENT_ID | TS -> Rust AgentQueryOptions -> NDJSON -> JS runner -> SDK env. Minimal surface — only agent projects get env injection. | 2026-03-11 |
+| Periodic system prompt re-injection (1 hour) | LLM context degrades over long sessions. 1-hour timer re-sends role/tools reminder when agent is idle. `autoPrompt`/`onautopromptconsumed` callback pattern. | 2026-03-11 |
+| Role-specific tabs via conditional rendering | Manager=Tasks, Architect=Arch, Tester=Selenium+Tests, Reviewer=Tasks. PERSISTED-LAZY pattern (mount on first activation). Conditional on `isAgent && agentRole`. | 2026-03-11 |
+| PlantUML via plantuml.com server (~h hex encoding) | Avoids Java dependency. Hex encoding simpler than deflate+base64. Works with free tier. Trade-off: requires internet. | 2026-03-11 |
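+
+For reference, the `~h` scheme appends the hex-encoded UTF-8 bytes of the diagram source to the server URL. A minimal sketch, assuming the public `https://www.plantuml.com/plantuml/<format>/` endpoint; `plantUmlUrl` is a hypothetical helper, not necessarily how agor builds the URL.
+
+```typescript
+// Hypothetical helper: build a plantuml.com URL using "~h" hex encoding.
+function plantUmlUrl(source: string, format: "svg" | "png" = "svg"): string {
+  // Each UTF-8 byte of the source becomes two lowercase hex digits.
+  const bytes = new TextEncoder().encode(source);
+  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
+  return `https://www.plantuml.com/plantuml/${format}/~h${hex}`;
+}
+```
+
+Hex encoding doubles the size of the source in the URL, which is the cost of skipping deflate+base64; very large diagrams may run into URL length limits.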
+
+## Themes & Typography
+
+| Decision | Rationale | Date |
+|----------|-----------|------|
+| All 17 themes map to `--ctp-*` CSS vars | 4 Catppuccin + 7 Editor + 6 Deep Dark themes. All map to same 26 CSS custom properties — zero component changes when adding themes. Pure data operation. | 2026-03-07 |
+| Typography via CSS custom properties | `--ui-font-family`/`--ui-font-size` + `--term-font-family`/`--term-font-size` in `:root`. Restored by `initTheme()` on startup. Persisted as SQLite settings. | 2026-03-07 |
+
+## System Design
+
+| Decision | Rationale | Date |
+|----------|-----------|------|
+| Keyboard shortcut layers: App > Workspace > Terminal | Prevents conflicts. Terminal captures raw keys only when focused. App layer uses Ctrl+K/G/B. | 2026-03-07 |
+| Unmount/remount on group switch | Serialize xterm scrollbacks, destroy, remount new group. <100ms perceived. Frees ~80MB per switch. | 2026-03-07 |
+| Remote machines deferred to v3.1 | Elevate to project level (`project.remote_machine_id`) but don't implement in MVP. Focus on local orchestration first. | 2026-03-07 |
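+
+The layering rule can be sketched as a router that checks layers from highest to lowest priority and hands raw keys to the terminal only while it is focused. `routeKey`, `KeyInput`, and `Bindings` are hypothetical names; only the Ctrl+K/G/B app bindings come from the table.
+
+```typescript
+// Hypothetical sketch of App > Workspace > Terminal shortcut routing.
+type Layer = "app" | "workspace" | "terminal";
+
+interface KeyInput {
+  key: string; // KeyboardEvent.key, e.g. "b"
+  ctrl: boolean;
+}
+
+interface Bindings {
+  app: Set<string>; // Ctrl+<key> combos owned by the app layer
+  workspace: Set<string>; // Ctrl+<key> combos owned by the workspace layer
+}
+
+// Returns the layer that should handle the key, or null to let it fall through.
+function routeKey(input: KeyInput, bindings: Bindings, terminalFocused: boolean): Layer | null {
+  const key = input.key.toLowerCase();
+  // Higher layers are checked first, so a focused terminal cannot swallow them.
+  if (input.ctrl && bindings.app.has(key)) return "app";
+  if (input.ctrl && bindings.workspace.has(key)) return "workspace";
+  // Everything else goes to the terminal, but only while it has focus.
+  return terminalFocused ? "terminal" : null;
+}
+
+// App layer owns Ctrl+K/G/B; workspace bindings are left empty in this sketch.
+const bindings: Bindings = { app: new Set(["k", "g", "b"]), workspace: new Set<string>() };
+```
+
+In this sketch, checking the app layer first is what keeps Ctrl+B working even while a terminal has focus.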
--- a/docs/architecture/findings.md
+++ b/docs/architecture/findings.md
@@ -0,0 +1,160 @@
+# Research Findings
+
+Research conducted during development — technology evaluations, architecture reviews, performance measurements, and design analysis. Each finding informed implementation decisions recorded in [decisions.md](decisions.md).
+
+---
+
+## 1. Claude Agent SDK
+
+**Source:** https://platform.claude.com/docs/en/agent-sdk/overview
+
+The Claude Agent SDK provides structured streaming, subagent detection, hooks, and telemetry — everything needed for a rich agent UI without terminal emulation.
+
+**Key Insight:** The SDK gives structured data — we render it as rich UI (markdown, diff views, file cards, agent trees) instead of raw terminal text. Terminal emulation (xterm.js) is only needed for SSH, local shell, and legacy CLI sessions.
+
+---
+
+## 2. Tauri + xterm.js Integration
+
+Integration pattern: `Frontend (xterm.js) <-> Tauri IPC <-> Rust PTY (portable-pty) <-> Shell/SSH/Claude`
+
+Existing projects (tauri-terminal, Terminon, tauri-plugin-pty) validated the approach.
+
+---
+
+## 3. Terminal Performance Benchmarks
+
+| Terminal | Latency | Notes |
+|----------|---------|-------|
+| xterm (native) | ~10ms | Gold standard |
+| Alacritty | ~12ms | GPU-rendered Rust |
+| VTE (GNOME Terminal) | ~50ms | GTK3/4 |
+| Hyper (Electron+xterm.js) | ~40ms | Web-based worst case |
+
+xterm.js in Tauri: ~20-30ms latency, ~20MB per instance. That is acceptable for streaming AI output; by comparison, VTE in the v1 GTK3 app was slower at ~50ms.
+
+---
+
+## 4. Frontend Framework Choice
+
+**Why Svelte 5:** Fine-grained reactivity (`$state`/`$derived` runes), no VDOM (critical for 4-8 panes streaming simultaneously), ~5KB runtime vs React's ~40KB. Larger ecosystem than Solid.js.
+
+---
+
+## 5. Adversarial Architecture Review (v3)
+
+Three specialized agents reviewed the v3 Mission Control architecture before implementation. The review caught 12 issues (4 critical) that would have required expensive rework if discovered later.
+
+### Critical Issues Found
+
+| # | Issue | Resolution |
+|---|-------|------------|
+| 1 | xterm.js 4-instance ceiling (WebKit2GTK OOM) | Budget system with suspend/resume |
+| 2 | Single sidecar = SPOF | Supervisor with crash recovery, per-project pool deferred |
+| 3 | Layout store has no workspace concept | Full rewrite to workspace.svelte.ts |
+| 4 | 384px per project at 1920px with 5 fixed columns (too narrow) | Adaptive count from viewport width |
+
+The remaining 8 issues (major/minor) were also resolved before implementation.
+
+---
+
+## 6. Provider Adapter Coupling Analysis (v3)
+
+Before implementing multi-provider support, every Claude-specific dependency was mapped: 13+ files, classified into 4 severity levels.
+
+### Key Insights
+
+1. **Sidecar is the natural abstraction boundary.** Each provider needs its own runner.
+2. **Message format is the main divergence point.** Per-provider adapters normalize to `AgentMessage`.
+3. **Capability flags eliminate provider switches.** UI checks `capabilities.hasProfiles` instead of `provider === 'claude'`.
+4. **Env var stripping is provider-specific.**
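+
+Insights 2 and 3 can be illustrated together: each provider supplies an adapter that normalizes its raw events into a shared `AgentMessage` shape and declares its capabilities, so UI code branches on flags rather than provider names. The message variants, raw event shape, and `exampleAdapter` below are hypothetical stand-ins; only `AgentMessage` and `capabilities.hasProfiles` are named in this document.
+
+```typescript
+// Hypothetical, simplified message shape; the real AgentMessage has more variants.
+type AgentMessage =
+  | { type: "text"; content: string }
+  | { type: "cost"; usd: number };
+
+interface ProviderCapabilities {
+  hasProfiles: boolean; // e.g. account/profile switching
+}
+
+interface ProviderAdapter {
+  id: string;
+  capabilities: ProviderCapabilities;
+  // Normalize one raw runner event into zero or more AgentMessages.
+  normalize(raw: unknown): AgentMessage[];
+}
+
+// Illustrative adapter for a provider whose runner emits { text } objects.
+const exampleAdapter: ProviderAdapter = {
+  id: "example",
+  capabilities: { hasProfiles: false },
+  normalize(raw) {
+    const r = (raw ?? {}) as { text?: unknown };
+    return typeof r.text === "string" ? [{ type: "text", content: r.text }] : [];
+  },
+};
+
+// UI code checks the capability flag instead of `provider === 'claude'`.
+function showProfileSelector(adapter: ProviderAdapter): boolean {
+  return adapter.capabilities.hasProfiles;
+}
+```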
+
+---
+
+## 7. Codebase Reuse Analysis: v2 to v3
+
+### Survived (with modifications)
+
+| Component | Modifications |
+|-----------|---------------|
+| TerminalPane.svelte | Added suspend/resume lifecycle |
+| MarkdownPane.svelte | Unchanged |
+| AgentTree.svelte | Reused inside AgentSession |
+| agents.svelte.ts | Added projectId field |
+| theme.svelte.ts | Unchanged |
+| notifications.svelte.ts | Unchanged |
+| All adapters | Minor updates for provider routing |
+| Rust backend (all modules) | Added new modules (btmsg, bttask, search, secrets, plugins) |
+
+### Replaced
+
+| v2 Component | v3 Replacement | Reason |
+|-------------|---------------|--------|
+| layout.svelte.ts | workspace.svelte.ts | Pane-based -> project-group model |
+| TilingGrid.svelte | ProjectGrid.svelte | Free-form grid -> fixed project boxes |
+| PaneContainer.svelte | ProjectBox.svelte | Generic pane -> 11-tab container |
+| SettingsDialog.svelte | SettingsTab.svelte | Modal -> sidebar drawer |
+| AgentPane.svelte | AgentSession + TeamAgentsPanel | Monolithic -> split for teams |
+| App.svelte | Full rewrite | VSCode-style sidebar layout |
+
+---
+
+## 8. Session Anchor Design (v3)
+
+### Problem
+
+When Claude's context window fills (~80% of the model limit), the SDK automatically compacts older turns. Important early decisions and debugging breakthroughs can be permanently lost.
+
+### Design Decisions
+
+1. **Auto-anchor on first compaction** — Captures first 3 turns automatically.
+2. **Observation masking** — Tool outputs compacted, reasoning preserved in full.
+3. **Budget system** — Fixed scales (2K/6K/12K/20K tokens) instead of percentage-based.
+4. **Re-injection via system prompt** — Simplest SDK integration.
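+
+Decisions 2 and 3 combine naturally: anchored turns are included oldest-first until a fixed token budget is exhausted, with tool output masked so that reasoning survives. A minimal sketch, assuming a rough 4-characters-per-token estimate and a hypothetical `Turn` shape; the real token counting and masking rules are not specified here.
+
+```typescript
+// Fixed budget scales from the design (in tokens).
+const ANCHOR_BUDGETS = [2000, 6000, 12000, 20000] as const;
+type AnchorBudget = (typeof ANCHOR_BUDGETS)[number];
+
+// Hypothetical turn shape for illustration.
+interface Turn {
+  reasoning: string; // assistant text, preserved in full
+  toolOutput?: string; // masked before re-injection
+}
+
+// Crude estimate (~4 characters per token); an assumption, not a real tokenizer.
+const estimateTokens = (s: string): number => Math.ceil(s.length / 4);
+
+function buildAnchorText(turns: Turn[], budget: AnchorBudget): string {
+  const parts: string[] = [];
+  let used = 0;
+  for (const turn of turns) {
+    // Observation masking: keep a placeholder instead of the tool output.
+    const masked =
+      turn.toolOutput !== undefined ? `${turn.reasoning}\n[tool output omitted]` : turn.reasoning;
+    const cost = estimateTokens(masked);
+    if (used + cost > budget) break; // stop before exceeding the budget
+    parts.push(masked);
+    used += cost;
+  }
+  return parts.join("\n\n");
+}
+```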
+
+---
+
+## 9. Multi-Agent Orchestration Design (v3)
+
+| Approach | Decision |
+|----------|----------|
+| Claude Agent Teams (native) | Supported but not primary (experimental, resume broken) |
+| Message bus (Redis/NATS) | Rejected (runtime dependency) |
+| Shared SQLite + CLI tools | **Selected** (zero deps, agents use shell) |
+| MCP server for agent comm | Rejected (overhead, complexity) |
+
+**Why SQLite + CLI:** Agents have full shell access, so Python CLI tools that read and write SQLite are the lowest-friction option. Zero configuration, no runtime services, and WAL mode handles concurrency.
+
+---
+
+## 10. Theme System Evolution
+
+All 17 themes (4 Catppuccin + 7 Editor + 6 Deep Dark) map to the same 26 `--ctp-*` CSS custom properties. No component ever needs to know which theme is active. Adding new themes is a pure data operation.
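+
+Because every theme is just data, adding one amounts to supplying a palette object. A minimal sketch, assuming the 26 `--ctp-*` properties correspond to Catppuccin's 26 named colors (rosewater through crust); `themeToCssVars` is a hypothetical helper.
+
+```typescript
+// Catppuccin's 26 color names; assumed to match the 26 --ctp-* custom properties.
+const CTP_COLORS = [
+  "rosewater", "flamingo", "pink", "mauve", "red", "maroon", "peach", "yellow",
+  "green", "teal", "sky", "sapphire", "blue", "lavender", "text", "subtext1",
+  "subtext0", "overlay2", "overlay1", "overlay0", "surface2", "surface1",
+  "surface0", "base", "mantle", "crust",
+] as const;
+
+type CtpColor = (typeof CTP_COLORS)[number];
+// A theme is a complete mapping from every color name to a CSS color value.
+type Palette = Record<CtpColor, string>;
+
+// Convert a palette into custom-property assignments, e.g. { "--ctp-base": "#1e1e2e" }.
+// In the browser these would be applied with document.documentElement.style.setProperty.
+function themeToCssVars(palette: Palette): Record<string, string> {
+  const vars: Record<string, string> = {};
+  for (const name of CTP_COLORS) {
+    vars[`--ctp-${name}`] = palette[name];
+  }
+  return vars;
+}
+```
+
+Typing `Palette` as a full `Record` makes the compiler reject a new theme that forgets any of the 26 colors.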
+
+---
+
+## 11. Performance Measurements (v3)
+
+### xterm.js Canvas Performance (WebKit2GTK, no WebGL)
+
+- Latency: ~20-30ms per keystroke
+- Memory: ~20MB per active instance
+- OOM threshold: ~5 simultaneous instances
+- Mitigation: 4-instance budget with suspend/resume
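+
+The 4-instance budget behaves like a least-recently-used cache: focusing a terminal makes it active, and if that exceeds the budget, the least recently focused active terminal is suspended (scrollback serialized, xterm destroyed, PTY left running). A minimal sketch; `TerminalBudget` and its methods are hypothetical, and the actual suspend/resume work is left to the caller.
+
+```typescript
+// Hypothetical LRU-style budget for live xterm.js instances.
+class TerminalBudget {
+  // Active terminal ids, least recently focused first.
+  private active: string[] = [];
+  private readonly maxActive: number;
+
+  constructor(maxActive = 4) {
+    this.maxActive = maxActive;
+  }
+
+  // Mark `id` as focused. Returns the ids that must be suspended
+  // (serialize scrollback, dispose xterm; the PTY stays alive).
+  focus(id: string): string[] {
+    this.active = this.active.filter((a) => a !== id);
+    this.active.push(id); // most recently focused goes last
+    const overflow = this.active.length - this.maxActive;
+    return overflow > 0 ? this.active.splice(0, overflow) : [];
+  }
+
+  // Forget a terminal entirely (e.g. its pane was closed).
+  remove(id: string): void {
+    this.active = this.active.filter((a) => a !== id);
+  }
+
+  isActive(id: string): boolean {
+    return this.active.includes(id);
+  }
+}
+```
+
+A suspended terminal is recreated from its serialized scrollback the next time it is focused, which in this sketch simply re-enters it into the active list.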
+
+### Tauri IPC Latency
+
+- Linux: ~5ms for typical payloads
+- Terminal keystroke echo: 10-15ms total
+- Agent message forwarding: negligible
+
+### SQLite WAL Concurrent Access
+
+WAL mode with a 5s busy_timeout handles concurrent access reliably. A checkpoint every 5 minutes prevents unbounded WAL growth.
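+
+The WAL setup amounts to two pragmas per connection plus a periodic checkpoint. In this project the database is accessed from Rust and from the Python CLI tools; the TypeScript sketch below only illustrates the sequence against a hypothetical `SqlExecutor` interface (anything with an `exec(sql)` method), and the `TRUNCATE` checkpoint mode is an assumption.
+
+```typescript
+// Minimal hypothetical interface: anything that can run a SQL string.
+interface SqlExecutor {
+  exec(sql: string): void;
+}
+
+const BUSY_TIMEOUT_MS = 5000; // 5 s busy_timeout
+const CHECKPOINT_INTERVAL_MS = 5 * 60 * 1000; // 5-minute checkpoint
+
+function configureWal(db: SqlExecutor): void {
+  db.exec("PRAGMA journal_mode = WAL;");
+  // Wait up to 5 s for a competing writer instead of failing with SQLITE_BUSY.
+  db.exec(`PRAGMA busy_timeout = ${BUSY_TIMEOUT_MS};`);
+}
+
+// Periodically fold the WAL back into the main database file.
+// Returns a function that stops the timer.
+function startCheckpointTimer(db: SqlExecutor, intervalMs = CHECKPOINT_INTERVAL_MS): () => void {
+  const timer = setInterval(() => {
+    db.exec("PRAGMA wal_checkpoint(TRUNCATE);");
+  }, intervalMs);
+  return () => clearInterval(timer);
+}
+```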
+
+### Workspace Switch Latency
+
+- Serialize 4 xterm scrollbacks: ~30ms
+- Destroy + unmount: ~15ms
+- Mount new group + create xterm: ~55ms
+- **Total perceived: ~100ms**
--- a/docs/architecture/phases.md
+++ b/docs/architecture/phases.md
@@ -0,0 +1,125 @@
+# Implementation Phases
+
+See [overview.md](overview.md) for system architecture and [decisions.md](decisions.md) for design decisions.
+
+---
+
+## Phase 1: Project Scaffolding [complete]
+
+- Tauri 2.x + Svelte 5 frontend initialized
+- Catppuccin Mocha CSS variables, dev scripts
+- Chose portable-pty (used by WezTerm) over tauri-plugin-pty for reliability
+
+---
+
+## Phase 2: Terminal Pane + Layout [complete]
+
+- CSS Grid layout with responsive breakpoints (ultrawide / standard / narrow)
+- Pane resize via drag handles, layout presets (1-col, 2-col, 3-col, 2x2, master+stack)
+- xterm.js with Canvas addon (no WebGL on WebKit2GTK), Catppuccin theme
+- PTY spawn from Rust (portable-pty), stream via Tauri events
+- Copy/paste (Ctrl+Shift+C/V), SSH via PTY shell args
+
+---
+
+## Phase 3: Agent SDK Integration [complete]
+
+- Node.js/Deno sidecar using the `query()` function from `@anthropic-ai/claude-agent-sdk`
+- Sidecar communication: Rust spawns the sidecar process and exchanges NDJSON over stdio
+- SDK message adapter: 9 typed `AgentMessage` types
+- Agent store with session state, message history, cost tracking
+- AgentPane: text, tool calls/results, thinking, init, cost, errors, subagent spawn
+- Session resume (`resume_session_id` passed to the SDK)
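+
+Because stdout arrives in arbitrary chunks, an NDJSON channel needs a decoder that buffers partial lines and parses only complete ones. A minimal sketch of that framing; `NdjsonDecoder` and `encodeNdjson` are hypothetical names, the Rust side is not shown, and the caller is assumed to have already decoded bytes to text (e.g. with a streaming `TextDecoder`).
+
+```typescript
+// Hypothetical decoder for newline-delimited JSON arriving in arbitrary chunks.
+class NdjsonDecoder {
+  private buffer = "";
+
+  // Feed a chunk; returns every JSON value completed by it.
+  push(chunk: string): unknown[] {
+    this.buffer += chunk;
+    const lines = this.buffer.split("\n");
+    // The last element is an incomplete line ("" if the chunk ended with \n).
+    this.buffer = lines.pop() ?? "";
+    return lines
+      .map((line) => line.trim()) // also strips a trailing \r
+      .filter((line) => line.length > 0) // tolerate blank lines
+      .map((line) => JSON.parse(line) as unknown);
+  }
+}
+
+// Encoding is the inverse: one JSON document per line.
+const encodeNdjson = (msg: unknown): string => JSON.stringify(msg) + "\n";
+```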
+
+---
+
+## Phase 4: Session Management + Markdown Viewer [complete]
+
+- SQLite persistence (rusqlite), session groups with collapsible headers
+- Auto-restore layout on startup
+- Markdown viewer with Shiki highlighting and live reload via file watcher
+
+---
+
+## Phase 5: Agent Tree + Polish [complete]
+
+- SVG agent tree visualization with click-to-scroll and subtree cost
+- Terminal theme hot-swap, pane drag-resize handles
+- StatusBar with counts, notifications (toast system)
+- Settings dialog, ctx integration, SSH session management
+- 4 Catppuccin themes, detached pane mode, Shiki syntax highlighting
+
+---
+
+## Phase 6: Packaging + Distribution [complete]
+
+- install-v2.sh build-from-source installer (Node.js 20+, Rust 1.77+, system libs)
+- Tauri bundle: .deb (4.3 MB) + AppImage (103 MB)
+- GitHub Actions release workflow on `v*` tags
+- Auto-updater with signing key
+
+---
+
+## Phase 7: Agent Teams / Subagent Support [complete]
+
+- Agent store parent/child hierarchy
+- Dispatcher subagent detection and message routing
+- AgentPane parent navigation + children bar
+- Subagent cost aggregation
+- 28 dispatcher tests, including 10 for subagent routing
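+
+Subtree cost (Phase 5) and subagent cost aggregation amount to summing an agent's own cost with that of every descendant. A minimal sketch, assuming a hypothetical flat list of agents in which each subagent records its `parentId`; the real agent store's fields may differ.
+
+```typescript
+// Hypothetical flat agent record: each subagent points at its parent.
+interface AgentNode {
+  id: string;
+  parentId?: string;
+  costUsd: number; // cost reported for this agent alone
+}
+
+// Total cost of an agent plus all of its descendants.
+function subtreeCost(agents: AgentNode[], rootId: string): number {
+  // Index children by parent once, so the walk is linear in the number of agents.
+  const children = new Map<string, AgentNode[]>();
+  for (const a of agents) {
+    if (a.parentId === undefined) continue;
+    const list = children.get(a.parentId) ?? [];
+    list.push(a);
+    children.set(a.parentId, list);
+  }
+  const root = agents.find((a) => a.id === rootId);
+  if (!root) return 0;
+  // Iterative depth-first walk avoids recursion limits on deep agent trees.
+  let total = 0;
+  const stack: AgentNode[] = [root];
+  while (stack.length > 0) {
+    const node = stack.pop()!;
+    total += node.costUsd;
+    stack.push(...(children.get(node.id) ?? []));
+  }
+  return total;
+}
+```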
+
+---
+
+## Multi-Machine Support (Phases A-D) [complete]
+
+Architecture in [../multi-machine/relay.md](../multi-machine/relay.md).
+
+### Phase A: Extract `agor-core` crate
+
+Cargo workspace with `PtyManager`, `SidecarManager`, and the `EventSink` trait in a shared crate.
+
+### Phase B: Build `agor-relay` binary
+
+WebSocket server with token auth, rate limiting, per-connection isolation, structured command responses.
+
+### Phase C: Add `RemoteManager` to controller
+
+12 Tauri commands, heartbeat ping, exponential backoff reconnection with TCP probing.
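+
+The reconnection schedule can be sketched as a capped doubling of the delay after each failed attempt. The controller side is Rust; this TypeScript version only illustrates the schedule, and the 1 s base, 30 s cap, and `reconnectDelayMs` name are assumptions rather than `RemoteManager`'s actual values. The TCP probe before each attempt is omitted.
+
+```typescript
+// Hypothetical backoff schedule: base * 2^attempt, capped at capMs.
+function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
+  // attempt 0 -> 1 s, 1 -> 2 s, 2 -> 4 s, ... never above capMs.
+  const exp = baseMs * 2 ** Math.max(0, attempt);
+  return Math.min(exp, capMs);
+}
+```
+
+With these assumed values the delays run 1 s, 2 s, 4 s, 8 s, 16 s, then stay at 30 s, so a relay that stays down is retried at a steady rate instead of ever more rarely.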
+
+### Phase D: Frontend integration
+
+remote-bridge.ts adapter, machines.svelte.ts store, routing via Pane.remoteMachineId.
+
+### Remaining
+
+- [ ] Real-world relay testing (2 machines)
+- [ ] TLS/certificate pinning
+
+---
+
+## Extras: Claude Profiles & Skill Discovery [complete]
+
+### Claude Profile / Account Switching
+
+- Reads `~/.config/switcher/profiles/` with `profile.toml` metadata
+- Profile selector dropdown; `config_dir` passed as a `CLAUDE_CONFIG_DIR` env override
+
+### Skill Discovery & Autocomplete
+
+- Reads `~/.claude/skills/` (dirs with `SKILL.md` or `.md` files)
+- `/` prefix triggers autocomplete menu in AgentPane
+- `expandSkillPrompt()` injects skill content as the prompt
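+
+The `/` trigger involves two small pieces: filtering discovered skills by the typed prefix, and replacing the command with the skill's content before sending. A minimal sketch; the `Skill` shape and both function names are hypothetical (the real entry point is `expandSkillPrompt()`, whose signature is not documented here), and appending trailing text after the skill body is an assumption.
+
+```typescript
+// Hypothetical skill record discovered from ~/.claude/skills/.
+interface Skill {
+  name: string; // e.g. directory or file name without extension
+  content: string; // body of SKILL.md (or the .md file)
+}
+
+// Suggestions for the autocomplete menu when input starts with "/".
+function skillSuggestions(input: string, skills: Skill[]): Skill[] {
+  if (!input.startsWith("/")) return [];
+  const prefix = (input.slice(1).split(/\s/, 1)[0] ?? "").toLowerCase();
+  return skills.filter((s) => s.name.toLowerCase().startsWith(prefix));
+}
+
+// Replace "/name rest..." with the skill content, keeping any trailing text.
+function expandSkillSketch(input: string, skills: Skill[]): string {
+  const match = /^\/(\S+)\s*([\s\S]*)$/.exec(input);
+  if (!match) return input;
+  const [, name, rest] = match;
+  const skill = skills.find((s) => s.name === name);
+  if (!skill) return input; // unknown skill: send the text unchanged
+  return rest ? `${skill.content}\n\n${rest}` : skill.content;
+}
+```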
+
+### Extended AgentQueryOptions
+
+- `setting_sources`, `system_prompt`, `model`, `claude_config_dir`, `additional_directories`
+- `CLAUDE_CONFIG_DIR` env injection for multi-account support
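+
+Multi-account support comes down to how the sidecar's environment is assembled from these options. A minimal sketch, assuming the snake_case field names listed above plus the `extra_env` map from the architecture decisions; the field types, the merge order (base, then `extra_env`, then `CLAUDE_CONFIG_DIR`), and the `buildSidecarEnv` name are assumptions.
+
+```typescript
+// Subset of the query options named in this document; types are assumptions.
+interface AgentQueryOptions {
+  model?: string;
+  system_prompt?: string;
+  setting_sources?: string[];
+  additional_directories?: string[];
+  claude_config_dir?: string; // selected profile's config directory
+  extra_env?: Record<string, string>; // e.g. BTMSG_AGENT_ID for agent projects
+}
+
+// Hypothetical: derive the environment passed to the SDK for one query.
+function buildSidecarEnv(
+  base: Record<string, string>,
+  opts: AgentQueryOptions,
+): Record<string, string> {
+  const env: Record<string, string> = { ...base, ...(opts.extra_env ?? {}) };
+  if (opts.claude_config_dir) {
+    // Point the SDK at the chosen account's config instead of the default.
+    env.CLAUDE_CONFIG_DIR = opts.claude_config_dir;
+  }
+  return env;
+}
+```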
+
+---
+
+## System Requirements
+
+- Node.js 20+ (for Agent SDK sidecar)
+- Rust 1.77+ (for building from source)
+- WebKit2GTK 4.1+ (Tauri runtime)
+- Linux x86_64 (primary target)