From 47f932294894922a1460b74258bdc12ac1fa889c Mon Sep 17 00:00:00 2001
From: Hibryda <hibryda@protonmail.com>
Date: Thu, 12 Mar 2026 03:07:38 +0100
Subject: [PATCH] docs: update meta files for E2E testing engine Phase B+

---
 .claude/CLAUDE.md   |  4 ++--
 CHANGELOG.md        |  3 +++
 CLAUDE.md           |  3 +++
 TODO.md             |  2 +-
 docs/v3-progress.md | 35 +++++++++++++++++++++++++++++++++++
 5 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 5aa261b..5f45b9e 100644
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -5,7 +5,7 @@
 - v1 is a single-file Python app (`bterminal.py`). Changes are localized.
 - v2 docs are in `docs/`. Architecture decisions are in `docs/task_plan.md`.
 - v2 Phases 1-7 + multi-machine (A-D) + profiles/skills complete. Extras: SSH, ctx, themes, detached mode, auto-updater, shiki, copy/paste, session resume, drag-resize, session groups, Deno sidecar, Claude profiles, skill discovery.
-- v3 Mission Control (All Phases 1-10 Complete + S-1 Phase 1/1.5/2/3 + S-2 Session Anchors + Provider Adapter Pattern + Provider Runners + Memora Adapter + SOLID Phase 3 + Multi-Agent Orchestration): project groups, workspace store, 15 Workspace components, session continuity, workspace teardown, file overlap conflict detection, inotify-based external write detection, multi-provider adapter pattern (3 phases + Codex/Ollama runners), worktree isolation, session anchors, Memora adapter (read-only SQLite), SOLID refactoring (agent-dispatcher split → 4 utils, session.rs split → 7 sub-modules, branded types), multi-agent orchestration (btmsg inter-agent messaging, bttask kanban task board, agent prompt generator, BTMSG_AGENT_ID env passthrough, periodic re-injection, role-specific tabs: Manager=Tasks, Architect=Arch, Tester=Selenium+Tests, Reviewer=Tasks), dead v2 component cleanup, dashboard metrics panel (MetricsPanel.svelte — live health + task counts + SVG sparkline history), auto-wake Manager scheduler (3 strategies: persistent/on-demand/smart, 6 signal types, configurable threshold), reviewer agent role (workflow prompt, #review-queue/#review-log auto-channels, reviewQueueDepth attention scoring 10pts/task cap 50, Tasks tab). 345 vitest + 68 cargo tests + 22 E2E scenarios (Phase A).
+- v3 Mission Control (All Phases 1-10 Complete + S-1 Phase 1/1.5/2/3 + S-2 Session Anchors + Provider Adapter Pattern + Provider Runners + Memora Adapter + SOLID Phase 3 + Multi-Agent Orchestration): project groups, workspace store, 15 Workspace components, session continuity, workspace teardown, file overlap conflict detection, inotify-based external write detection, multi-provider adapter pattern (3 phases + Codex/Ollama runners), worktree isolation, session anchors, Memora adapter (read-only SQLite), SOLID refactoring (agent-dispatcher split → 4 utils, session.rs split → 7 sub-modules, branded types), multi-agent orchestration (btmsg inter-agent messaging, bttask kanban task board, agent prompt generator, BTMSG_AGENT_ID env passthrough, periodic re-injection, role-specific tabs: Manager=Tasks, Architect=Arch, Tester=Selenium+Tests, Reviewer=Tasks), dead v2 component cleanup, dashboard metrics panel (MetricsPanel.svelte — live health + task counts + SVG sparkline history), auto-wake Manager scheduler (3 strategies: persistent/on-demand/smart, 6 signal types, configurable threshold), reviewer agent role (workflow prompt, #review-queue/#review-log auto-channels, reviewQueueDepth attention scoring 10pts/task cap 50, Tasks tab). 388 vitest + 68 cargo tests + 22 Phase A E2E + 6 Phase B E2E (multi-project + LLM-judged).
 - v3 docs: `docs/v3-task_plan.md`, `docs/v3-findings.md`, `docs/v3-progress.md`.
 - Consult Memora (tag: `bterminal`) before making architectural changes.
 
@@ -82,7 +82,7 @@
 - v3 workspace store (`workspace.svelte.ts`) replaces layout store for v3. Groups loaded from `~/.config/bterminal/groups.json` via `groups-bridge.ts`. State: groups, activeGroupId, activeTab, focusedProjectId. Derived: activeGroup, activeProjects.
 - v3 groups backend (`groups.rs`): load_groups(), save_groups(), default_groups(). Tauri commands: groups_load, groups_save.
 - Telemetry (`telemetry.rs`): tracing + optional OTLP export to Tempo. `BTERMINAL_OTLP_ENDPOINT` env var controls (absent = console-only). TelemetryGuard in AppState with Drop-based shutdown. Frontend events route through `frontend_log` Tauri command → Rust tracing (no browser OTEL SDK — WebKit2GTK incompatible). `telemetry-bridge.ts` provides `tel.info/warn/error()` convenience API. Docker stack at `docker/tempo/` (Grafana port 9715).
-- E2E test mode (`BTERMINAL_TEST=1`): watcher.rs and fs_watcher.rs skip file watchers, wake-scheduler disabled via `disableWakeScheduler()`, `is_test_mode` Tauri command bridges to frontend. Data/config dirs overridable via `BTERMINAL_TEST_DATA_DIR`/`BTERMINAL_TEST_CONFIG_DIR`. E2E uses WebDriverIO + tauri-driver, single session, TCP readiness probe. 7 data-testid-based scenarios in `agent-scenarios.test.ts`. Test fixtures in `fixtures.ts` create isolated temp environments. Results tracked via JSON store in `results-db.ts`.
+- E2E test mode (`BTERMINAL_TEST=1`): watcher.rs and fs_watcher.rs skip file watchers, wake-scheduler disabled via `disableWakeScheduler()`, `is_test_mode` Tauri command bridges to frontend. Data/config dirs overridable via `BTERMINAL_TEST_DATA_DIR`/`BTERMINAL_TEST_CONFIG_DIR`. E2E uses WebDriverIO + tauri-driver, single session, TCP readiness probe. Phase A: 7 data-testid-based scenarios in `agent-scenarios.test.ts` (deterministic assertions). Phase B: 6 scenarios in `phase-b.test.ts` (multi-project grid, independent tab switching, status bar fleet state, LLM-judged agent responses/code generation, context tab verification). LLM judge (`llm-judge.ts`): raw fetch to Anthropic API using claude-haiku-4-5, structured verdict (pass/fail + reasoning + confidence), `assertWithJudge()` with configurable threshold, skips when `ANTHROPIC_API_KEY` absent. CI workflow (`.github/workflows/e2e.yml`): unit + cargo + e2e jobs, xvfb-run, path-filtered triggers, LLM tests gated on secret. Test fixtures in `fixtures.ts` create isolated temp environments. Results tracked via JSON store in `results-db.ts`.
 - v3 SQLite additions: agent_messages table (per-project message persistence), project_agent_state table (sdkSessionId, cost, status per project), sessions.project_id column.
 - v3 App.svelte: VSCode-style sidebar layout. Horizontal: left icon rail (GlobalTabBar, 2.75rem, single Settings gear icon) + expandable drawer panel (Settings only, content-driven width, max 50%) + main workspace (ProjectGrid always visible) + StatusBar. Sidebar has Settings only — Sessions/Docs/Context are project-specific (in ProjectBox tabs). Keyboard: Ctrl+B (toggle sidebar), Ctrl+, (settings), Escape (close).
 - v3 component tree: App -> GlobalTabBar (settings icon) + sidebar-panel? (SettingsTab) + workspace (ProjectGrid) + StatusBar. See `docs/v3-task_plan.md` for full tree.
diff --git a/CHANGELOG.md b/CHANGELOG.md
index af9d4e2..4605a60 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,6 +13,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **E2E Phase A scenarios** — 7 human-authored test scenarios (22 tests) in `agent-scenarios.test.ts`: app structural integrity, settings panel, agent pane initial state, terminal tab management, command palette, project focus/tab switching, agent prompt submission (graceful Claude CLI skip)
 - **E2E test fixtures** — `tests/e2e/fixtures.ts`: creates isolated temp environments with data/config dirs, git repos, and groups.json. `createTestFixture()`, `createMultiProjectFixture()`, `destroyTestFixture()`
 - **E2E results store** — `tests/e2e/results-db.ts`: JSON-based test run/step tracking (pivoted from better-sqlite3 due to Node 25 native compile failure)
+- **E2E Phase B scenarios** — 6 multi-project + LLM-judged test scenarios in `phase-b.test.ts`: multi-project grid rendering, independent tab switching, status bar fleet state, LLM-judged agent response quality, LLM-judged code generation, context tab verification
+- **LLM judge helper** — `tests/e2e/llm-judge.ts`: raw fetch to Anthropic API (claude-haiku-4-5), structured verdicts (pass/fail + reasoning + confidence), `assertWithJudge()` with configurable min confidence threshold, graceful skip when `ANTHROPIC_API_KEY` absent
+- **E2E CI workflow** — `.github/workflows/e2e.yml`: 3 jobs (vitest, cargo, e2e), xvfb-run for headless WebKit2GTK, path-filtered triggers on v2 source changes, LLM-judged tests gated on `ANTHROPIC_API_KEY` secret availability
 
 ### Changed
 - **WebDriverIO config** — TCP readiness probe replaces blind 2s sleep for tauri-driver startup (200ms interval, 10s deadline). Added BTERMINAL_TEST=1 passthrough in capabilities
diff --git a/CLAUDE.md b/CLAUDE.md
index 6a9d8b9..bb0171b 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -92,6 +92,9 @@ Terminal emulator with SSH and Claude Code session management. v1 (GTK3+VTE Pyth
 | `v2/tests/e2e/results-db.ts` | JSON test results store (run/step tracking, no native deps) |
 | `v2/tests/e2e/specs/bterminal.test.ts` | E2E smoke tests (CSS class selectors, 50+ tests) |
 | `v2/tests/e2e/specs/agent-scenarios.test.ts` | Phase A E2E scenarios (data-testid selectors, 7 scenarios, 22 tests) |
+| `v2/tests/e2e/specs/phase-b.test.ts` | Phase B E2E scenarios (multi-project, LLM-judged assertions, 6 scenarios) |
+| `v2/tests/e2e/llm-judge.ts` | LLM judge helper (Claude API assertions, confidence thresholds) |
+| `.github/workflows/e2e.yml` | CI: unit + cargo + E2E tests (xvfb-run, path-filtered, LLM tests gated on secret) |
 | `v2/src/lib/stores/machines.svelte.ts` | Remote machine state store (Svelte 5 runes) |
 | `v2/src/lib/utils/attention-scorer.ts` | Pure attention scoring function (extracted from health store, 14 tests) |
 | `v2/src/lib/utils/wake-scorer.ts` | Pure wake signal evaluation (6 signals, 24 tests) |
diff --git a/TODO.md b/TODO.md
index f3e303c..4a324b1 100644
--- a/TODO.md
+++ b/TODO.md
@@ -3,7 +3,7 @@
 ## Active
 
 ### v2/v3 Remaining
-- [ ] **E2E testing — Phase B+** -- Phase A complete: 72 tests across 2 spec files (smoke + 7 agent scenarios). Next: LLM-judged assertions, multi-project scenarios, CI integration (xvfb-run).
+- [x] **E2E testing — Phase B+** -- Phase B complete: LLM judge helper (llm-judge.ts, raw Anthropic API fetch, claude-haiku-4-5), 6 multi-project scenarios (phase-b.test.ts: grid rendering, independent tabs, status bar, LLM-judged agent responses + code generation, context tab), CI workflow (e2e.yml: 3 jobs, xvfb-run, path-filtered, LLM tests gated on secret). 388 vitest + 68 cargo + 22 Phase A + 6 Phase B E2E. | Done: 2026-03-12
 - [ ] **Multi-machine real-world testing** -- Test bterminal-relay with 2 machines.
 - [ ] **Multi-machine TLS/certificate pinning** -- TLS support for bterminal-relay + certificate pinning in RemoteManager.
 - [ ] **Agent Teams real-world testing** -- Env var whitelist fix done. 3 test sessions ran ($1.10, $0.69, $1.70) but model didn't spawn subagents — needs complex multi-part prompts to trigger delegation. Test with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
diff --git a/docs/v3-progress.md b/docs/v3-progress.md
index dff5491..c68a39b 100644
--- a/docs/v3-progress.md
+++ b/docs/v3-progress.md
@@ -971,3 +971,38 @@ Reviewed and integrated Dexter's multi-agent orchestration branch (dexter_change
 - Vitest: 345 passed (was 327, +18 — new wake-scorer + metrics tests from prior session)
 - Cargo src-tauri: 68 passed (was 64, +4)
 - E2E scenarios: 22 new test cases across 7 scenarios
+
+### Session: 2026-03-12 — E2E Testing Engine Phase B+
+
+#### LLM Judge Helper
+- [x] Created `v2/tests/e2e/llm-judge.ts` — Claude API-based test assertion judge
+  - Raw fetch to Anthropic API (zero new deps), uses claude-haiku-4-5 for speed/cost
+  - `judge()` evaluates actual output against criteria, returns structured verdict (pass/fail, reasoning, confidence)
+  - `assertWithJudge()` convenience with minimum confidence threshold (default 0.7)
+  - `isJudgeAvailable()` check — tests skip gracefully when ANTHROPIC_API_KEY absent
+
+#### Phase B Scenarios (6 scenarios, ~15 tests)
+- [x] Created `v2/tests/e2e/specs/phase-b.test.ts`
+  - **B1: Multi-Project Grid** — renders multiple project boxes, unique IDs, independent agent panes, CWD paths, focus/active styling
+  - **B2: Independent Tab Switching** — different tabs active in different project boxes simultaneously
+  - **B3: Status Bar Fleet State** — agent count display, burn rate $0.00 when all idle
+  - **B4: LLM-Judged Agent Response** — sends file listing prompt, evaluates response quality + tool usage via LLM judge (requires ANTHROPIC_API_KEY)
+  - **B5: LLM-Judged Code Generation** — sends code explanation prompt, evaluates correctness via LLM judge
+  - **B6: Context Tab After Activity** — verifies context tab shows token usage data after agent activity
+- [x] Per-project helper functions: focusProject(), getAgentStatus(), sendPromptInProject(), waitForProjectAgentStatus(), getAgentMessages(), switchProjectTab()
+- [x] All LLM-judged tests skip gracefully when ANTHROPIC_API_KEY not set
+- [x] Added phase-b.test.ts to wdio.conf.js specs array
+
+#### CI Workflow
+- [x] Created `.github/workflows/e2e.yml`
+  - Triggers: push to v2-mission-control, PRs to master/v2-mission-control, manual dispatch
+  - Path filters: v2/src/**, v2/src-tauri/**, v2/tests/e2e/**
+  - 3 jobs: unit-tests (vitest), cargo-tests, e2e-tests (needs both)
+  - E2E job: installs xvfb + tauri-driver, builds debug binary, runs Phase A + Phase B specs
+  - LLM-judged tests gated on ANTHROPIC_API_KEY secret availability
+  - Uploads test-results/ artifact on all outcomes
+
+#### Verification
+- [x] Vitest: 388 passed, 0 failed (was 345, +43 from prior sessions)
+- [x] Cargo: 68 passed, 0 failed
+- [x] No regressions