From 47f932294894922a1460b74258bdc12ac1fa889c Mon Sep 17 00:00:00 2001
From: Hibryda
Date: Thu, 12 Mar 2026 03:07:38 +0100
Subject: [PATCH] docs: update meta files for E2E testing engine Phase B+

---
 .claude/CLAUDE.md   |  4 ++--
 CHANGELOG.md        |  3 +++
 CLAUDE.md           |  3 +++
 TODO.md             |  2 +-
 docs/v3-progress.md | 35 +++++++++++++++++++++++++++++++++++
 5 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index 5aa261b..5f45b9e 100644
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -5,7 +5,7 @@
 - v1 is a single-file Python app (`bterminal.py`). Changes are localized.
 - v2 docs are in `docs/`. Architecture decisions are in `docs/task_plan.md`.
 - v2 Phases 1-7 + multi-machine (A-D) + profiles/skills complete. Extras: SSH, ctx, themes, detached mode, auto-updater, shiki, copy/paste, session resume, drag-resize, session groups, Deno sidecar, Claude profiles, skill discovery.
-- v3 Mission Control (All Phases 1-10 Complete + S-1 Phase 1/1.5/2/3 + S-2 Session Anchors + Provider Adapter Pattern + Provider Runners + Memora Adapter + SOLID Phase 3 + Multi-Agent Orchestration): project groups, workspace store, 15 Workspace components, session continuity, workspace teardown, file overlap conflict detection, inotify-based external write detection, multi-provider adapter pattern (3 phases + Codex/Ollama runners), worktree isolation, session anchors, Memora adapter (read-only SQLite), SOLID refactoring (agent-dispatcher split → 4 utils, session.rs split → 7 sub-modules, branded types), multi-agent orchestration (btmsg inter-agent messaging, bttask kanban task board, agent prompt generator, BTMSG_AGENT_ID env passthrough, periodic re-injection, role-specific tabs: Manager=Tasks, Architect=Arch, Tester=Selenium+Tests, Reviewer=Tasks), dead v2 component cleanup, dashboard metrics panel (MetricsPanel.svelte — live health + task counts + SVG sparkline history), auto-wake Manager scheduler (3 strategies: persistent/on-demand/smart, 6 signal types, configurable threshold), reviewer agent role (workflow prompt, #review-queue/#review-log auto-channels, reviewQueueDepth attention scoring 10pts/task cap 50, Tasks tab). 345 vitest + 68 cargo tests + 22 E2E scenarios (Phase A).
+- v3 Mission Control (All Phases 1-10 Complete + S-1 Phase 1/1.5/2/3 + S-2 Session Anchors + Provider Adapter Pattern + Provider Runners + Memora Adapter + SOLID Phase 3 + Multi-Agent Orchestration): project groups, workspace store, 15 Workspace components, session continuity, workspace teardown, file overlap conflict detection, inotify-based external write detection, multi-provider adapter pattern (3 phases + Codex/Ollama runners), worktree isolation, session anchors, Memora adapter (read-only SQLite), SOLID refactoring (agent-dispatcher split → 4 utils, session.rs split → 7 sub-modules, branded types), multi-agent orchestration (btmsg inter-agent messaging, bttask kanban task board, agent prompt generator, BTMSG_AGENT_ID env passthrough, periodic re-injection, role-specific tabs: Manager=Tasks, Architect=Arch, Tester=Selenium+Tests, Reviewer=Tasks), dead v2 component cleanup, dashboard metrics panel (MetricsPanel.svelte — live health + task counts + SVG sparkline history), auto-wake Manager scheduler (3 strategies: persistent/on-demand/smart, 6 signal types, configurable threshold), reviewer agent role (workflow prompt, #review-queue/#review-log auto-channels, reviewQueueDepth attention scoring 10pts/task cap 50, Tasks tab). 388 vitest + 68 cargo tests + 22 Phase A E2E + 6 Phase B E2E (multi-project + LLM-judged).
 - v3 docs: `docs/v3-task_plan.md`, `docs/v3-findings.md`, `docs/v3-progress.md`.
 - Consult Memora (tag: `bterminal`) before making architectural changes.
 
@@ -82,7 +82,7 @@
 - v3 workspace store (`workspace.svelte.ts`) replaces layout store for v3. Groups loaded from `~/.config/bterminal/groups.json` via `groups-bridge.ts`. State: groups, activeGroupId, activeTab, focusedProjectId. Derived: activeGroup, activeProjects.
 - v3 groups backend (`groups.rs`): load_groups(), save_groups(), default_groups(). Tauri commands: groups_load, groups_save.
 - Telemetry (`telemetry.rs`): tracing + optional OTLP export to Tempo. `BTERMINAL_OTLP_ENDPOINT` env var controls (absent = console-only). TelemetryGuard in AppState with Drop-based shutdown. Frontend events route through `frontend_log` Tauri command → Rust tracing (no browser OTEL SDK — WebKit2GTK incompatible). `telemetry-bridge.ts` provides `tel.info/warn/error()` convenience API. Docker stack at `docker/tempo/` (Grafana port 9715).
-- E2E test mode (`BTERMINAL_TEST=1`): watcher.rs and fs_watcher.rs skip file watchers, wake-scheduler disabled via `disableWakeScheduler()`, `is_test_mode` Tauri command bridges to frontend. Data/config dirs overridable via `BTERMINAL_TEST_DATA_DIR`/`BTERMINAL_TEST_CONFIG_DIR`. E2E uses WebDriverIO + tauri-driver, single session, TCP readiness probe. 7 data-testid-based scenarios in `agent-scenarios.test.ts`. Test fixtures in `fixtures.ts` create isolated temp environments. Results tracked via JSON store in `results-db.ts`.
+- E2E test mode (`BTERMINAL_TEST=1`): watcher.rs and fs_watcher.rs skip file watchers, wake-scheduler disabled via `disableWakeScheduler()`, `is_test_mode` Tauri command bridges to frontend. Data/config dirs overridable via `BTERMINAL_TEST_DATA_DIR`/`BTERMINAL_TEST_CONFIG_DIR`. E2E uses WebDriverIO + tauri-driver, single session, TCP readiness probe. Phase A: 7 data-testid-based scenarios in `agent-scenarios.test.ts` (deterministic assertions). Phase B: 6 scenarios in `phase-b.test.ts` (multi-project grid, independent tab switching, status bar fleet state, LLM-judged agent responses/code generation, context tab verification). LLM judge (`llm-judge.ts`): raw fetch to Anthropic API using claude-haiku-4-5, structured verdict (pass/fail + reasoning + confidence), `assertWithJudge()` with configurable threshold, skips when `ANTHROPIC_API_KEY` absent. CI workflow (`.github/workflows/e2e.yml`): unit + cargo + e2e jobs, xvfb-run, path-filtered triggers, LLM tests gated on secret. Test fixtures in `fixtures.ts` create isolated temp environments. Results tracked via JSON store in `results-db.ts`.
 - v3 SQLite additions: agent_messages table (per-project message persistence), project_agent_state table (sdkSessionId, cost, status per project), sessions.project_id column.
 - v3 App.svelte: VSCode-style sidebar layout. Horizontal: left icon rail (GlobalTabBar, 2.75rem, single Settings gear icon) + expandable drawer panel (Settings only, content-driven width, max 50%) + main workspace (ProjectGrid always visible) + StatusBar. Sidebar has Settings only — Sessions/Docs/Context are project-specific (in ProjectBox tabs). Keyboard: Ctrl+B (toggle sidebar), Ctrl+, (settings), Escape (close).
 - v3 component tree: App -> GlobalTabBar (settings icon) + sidebar-panel? (SettingsTab) + workspace (ProjectGrid) + StatusBar. See `docs/v3-task_plan.md` for full tree.
diff --git a/CHANGELOG.md b/CHANGELOG.md
index af9d4e2..4605a60 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,6 +13,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **E2E Phase A scenarios** — 7 human-authored test scenarios (22 tests) in `agent-scenarios.test.ts`: app structural integrity, settings panel, agent pane initial state, terminal tab management, command palette, project focus/tab switching, agent prompt submission (graceful Claude CLI skip)
 - **E2E test fixtures** — `tests/e2e/fixtures.ts`: creates isolated temp environments with data/config dirs, git repos, and groups.json. `createTestFixture()`, `createMultiProjectFixture()`, `destroyTestFixture()`
 - **E2E results store** — `tests/e2e/results-db.ts`: JSON-based test run/step tracking (pivoted from better-sqlite3 due to Node 25 native compile failure)
+- **E2E Phase B scenarios** — 6 multi-project + LLM-judged test scenarios in `phase-b.test.ts`: multi-project grid rendering, independent tab switching, status bar fleet state, LLM-judged agent response quality, LLM-judged code generation, context tab verification
+- **LLM judge helper** — `tests/e2e/llm-judge.ts`: raw fetch to Anthropic API (claude-haiku-4-5), structured verdicts (pass/fail + reasoning + confidence), `assertWithJudge()` with configurable min confidence threshold, graceful skip when `ANTHROPIC_API_KEY` absent
+- **E2E CI workflow** — `.github/workflows/e2e.yml`: 3 jobs (vitest, cargo, e2e), xvfb-run for headless WebKit2GTK, path-filtered triggers on v2 source changes, LLM-judged tests gated on `ANTHROPIC_API_KEY` secret availability
 
 ### Changed
 - **WebDriverIO config** — TCP readiness probe replaces blind 2s sleep for tauri-driver startup (200ms interval, 10s deadline). Added BTERMINAL_TEST=1 passthrough in capabilities
diff --git a/CLAUDE.md b/CLAUDE.md
index 6a9d8b9..bb0171b 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -92,6 +92,9 @@ Terminal emulator with SSH and Claude Code session management. v1 (GTK3+VTE Pyth
 | `v2/tests/e2e/results-db.ts` | JSON test results store (run/step tracking, no native deps) |
 | `v2/tests/e2e/specs/bterminal.test.ts` | E2E smoke tests (CSS class selectors, 50+ tests) |
 | `v2/tests/e2e/specs/agent-scenarios.test.ts` | Phase A E2E scenarios (data-testid selectors, 7 scenarios, 22 tests) |
+| `v2/tests/e2e/specs/phase-b.test.ts` | Phase B E2E scenarios (multi-project, LLM-judged assertions, 6 scenarios) |
+| `v2/tests/e2e/llm-judge.ts` | LLM judge helper (Claude API assertions, confidence thresholds) |
+| `.github/workflows/e2e.yml` | CI: unit + cargo + E2E tests (xvfb-run, path-filtered, LLM tests gated on secret) |
 | `v2/src/lib/stores/machines.svelte.ts` | Remote machine state store (Svelte 5 runes) |
 | `v2/src/lib/utils/attention-scorer.ts` | Pure attention scoring function (extracted from health store, 14 tests) |
 | `v2/src/lib/utils/wake-scorer.ts` | Pure wake signal evaluation (6 signals, 24 tests) |
diff --git a/TODO.md b/TODO.md
index f3e303c..4a324b1 100644
--- a/TODO.md
+++ b/TODO.md
@@ -3,7 +3,7 @@
 ## Active
 
 ### v2/v3 Remaining
-- [ ] **E2E testing — Phase B+** -- Phase A complete: 72 tests across 2 spec files (smoke + 7 agent scenarios). Next: LLM-judged assertions, multi-project scenarios, CI integration (xvfb-run).
+- [x] **E2E testing — Phase B+** -- Phase B complete: LLM judge helper (llm-judge.ts, raw Anthropic API fetch, claude-haiku-4-5), 6 multi-project scenarios (phase-b.test.ts: grid rendering, independent tabs, status bar, LLM-judged agent responses + code generation, context tab), CI workflow (e2e.yml: 3 jobs, xvfb-run, path-filtered, LLM tests gated on secret). 388 vitest + 68 cargo + 22 Phase A + 6 Phase B E2E. | Done: 2026-03-12
 - [ ] **Multi-machine real-world testing** -- Test bterminal-relay with 2 machines.
 - [ ] **Multi-machine TLS/certificate pinning** -- TLS support for bterminal-relay + certificate pinning in RemoteManager.
 - [ ] **Agent Teams real-world testing** -- Env var whitelist fix done. 3 test sessions ran ($1.10, $0.69, $1.70) but model didn't spawn subagents — needs complex multi-part prompts to trigger delegation. Test with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1.
diff --git a/docs/v3-progress.md b/docs/v3-progress.md
index dff5491..c68a39b 100644
--- a/docs/v3-progress.md
+++ b/docs/v3-progress.md
@@ -971,3 +971,38 @@ Reviewed and integrated Dexter's multi-agent orchestration branch (dexter_change
 - Vitest: 345 passed (was 327, +18 — new wake-scorer + metrics tests from prior session)
 - Cargo src-tauri: 68 passed (was 64, +4)
 - E2E scenarios: 22 new test cases across 7 scenarios
+
+### Session: 2026-03-12 — E2E Testing Engine Phase B+
+
+#### LLM Judge Helper
+- [x] Created `v2/tests/e2e/llm-judge.ts` — Claude API-based test assertion judge
+  - Raw fetch to Anthropic API (zero new deps), uses claude-haiku-4-5 for speed/cost
+  - `judge()` evaluates actual output against criteria, returns structured verdict (pass/fail, reasoning, confidence)
+  - `assertWithJudge()` convenience with minimum confidence threshold (default 0.7)
+  - `isJudgeAvailable()` check — tests skip gracefully when ANTHROPIC_API_KEY absent
+
+#### Phase B Scenarios (6 scenarios, ~15 tests)
+- [x] Created `v2/tests/e2e/specs/phase-b.test.ts`
+  - **B1: Multi-Project Grid** — renders multiple project boxes, unique IDs, independent agent panes, CWD paths, focus/active styling
+  - **B2: Independent Tab Switching** — different tabs active in different project boxes simultaneously
+  - **B3: Status Bar Fleet State** — agent count display, burn rate $0.00 when all idle
+  - **B4: LLM-Judged Agent Response** — sends file listing prompt, evaluates response quality + tool usage via LLM judge (requires ANTHROPIC_API_KEY)
+  - **B5: LLM-Judged Code Generation** — sends code explanation prompt, evaluates correctness via LLM judge
+  - **B6: Context Tab After Activity** — verifies context tab shows token usage data after agent activity
+- [x] Per-project helper functions: focusProject(), getAgentStatus(), sendPromptInProject(), waitForProjectAgentStatus(), getAgentMessages(), switchProjectTab()
+- [x] All LLM-judged tests skip gracefully when ANTHROPIC_API_KEY not set
+- [x] Added phase-b.test.ts to wdio.conf.js specs array
+
+#### CI Workflow
+- [x] Created `.github/workflows/e2e.yml`
+  - Triggers: push to v2-mission-control, PRs to master/v2-mission-control, manual dispatch
+  - Path filters: v2/src/**, v2/src-tauri/**, v2/tests/e2e/**
+  - 3 jobs: unit-tests (vitest), cargo-tests, e2e-tests (needs both)
+  - E2E job: installs xvfb + tauri-driver, builds debug binary, runs Phase A + Phase B specs
+  - LLM-judged tests gated on ANTHROPIC_API_KEY secret availability
+  - Uploads test-results/ artifact on all outcomes
+
+#### Verification
+- [x] Vitest: 388 passed, 0 failed (was 345, +43 from prior sessions)
+- [x] Cargo: 68 passed, 0 failed
+- [x] No regressions
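
Reviewer note: `llm-judge.ts` itself is not part of this patch, so the following is only a minimal sketch of the verdict handling the progress entry describes (structured verdict, 0.7 default confidence threshold, skip when `ANTHROPIC_API_KEY` is absent). The `Verdict` shape, `parseVerdict()`, `assertVerdict()`, and the `env` parameter on `isJudgeAvailable()` are assumptions made for illustration and may not match the real helper. The network call to the Anthropic API is deliberately left out.

```typescript
// Hypothetical verdict shape; the real llm-judge.ts may differ.
interface Verdict {
  pass: boolean;
  reasoning: string;
  confidence: number; // assumed to be in the range 0..1
}

// Parse a model reply that is assumed to be a JSON object with
// pass, reasoning, and confidence fields. Throws on anything else.
function parseVerdict(raw: string): Verdict {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("verdict is not a JSON object");
  }
  const v = data as Record<string, unknown>;
  if (
    typeof v.pass !== "boolean" ||
    typeof v.reasoning !== "string" ||
    typeof v.confidence !== "number"
  ) {
    throw new Error("verdict is missing pass, reasoning, or confidence");
  }
  return { pass: v.pass, reasoning: v.reasoning, confidence: v.confidence };
}

// Throw unless the verdict passes with at least minConfidence.
// The 0.7 default follows the threshold mentioned in the progress notes.
function assertVerdict(verdict: Verdict, minConfidence: number = 0.7): void {
  if (!verdict.pass) {
    throw new Error(`judge rejected output: ${verdict.reasoning}`);
  }
  if (verdict.confidence < minConfidence) {
    throw new Error(
      `judge confidence ${verdict.confidence} is below ${minConfidence}`
    );
  }
}

// Gate for skipping LLM-judged tests. The environment is passed in
// explicitly here (an assumption) so the check is easy to test.
function isJudgeAvailable(env: Record<string, string | undefined>): boolean {
  return Boolean(env.ANTHROPIC_API_KEY);
}
```

In a spec, a test would presumably check `isJudgeAvailable(process.env)` first and skip when it returns false, then pass the parsed verdict to `assertVerdict()`. Treating a passing but low-confidence verdict as a failure keeps borderline judgments from silently counting as green.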