Hibryda 91a3b56dba test(e2e): split + expand phase-b into grid + LLM specs

- phase-b-grid.test.ts (227 lines): multi-project grid, tab switching,
  status bar, accent colors, project icons, scroll, tab bar completeness
- phase-b-llm.test.ts (211 lines): LLM-judged agent response, code gen,
  context tab, tool calls, cost display, session persistence
- Original phase-b.test.ts (377 lines) deleted
- New exhaustive tests added for grid layout and agent interaction

2026-03-18 03:47:16 +01:00


E2E Testing Facility

Agor's end-to-end testing uses WebDriverIO + tauri-driver to drive the real Tauri application through WebKit2GTK's inspector protocol. The facility has three pillars:

Test Fixtures — isolated fake environments with dummy projects
Test Mode — app-level env vars that disable watchers and redirect data/config paths
LLM Judge — Claude-powered semantic assertions for evaluating agent behavior

Quick Start

# Run all tests (vitest + cargo + E2E)
npm run test:all:e2e

# Run E2E only (requires pre-built debug binary)
SKIP_BUILD=1 npm run test:e2e

# Build debug binary separately (faster iteration)
cargo tauri build --debug --no-bundle

# Run with LLM judge via CLI (default, auto-detected)
npm run test:e2e

# Force LLM judge to use API instead of CLI
LLM_JUDGE_BACKEND=api ANTHROPIC_API_KEY=sk-... npm run test:e2e

Prerequisites

| Dependency | Purpose | Install |
|---|---|---|
| Rust + Cargo | Build Tauri backend | rustup.rs |
| Node.js 20+ | Frontend + test runner | `mise install node` |
| tauri-driver | WebDriver bridge to WebKit2GTK | `cargo install tauri-driver` |
| X11 display | WebKit2GTK needs a display | Real X, or `xvfb-run` in CI |
| Claude CLI | LLM judge (optional) | claude.ai/download |

Architecture

+-----------------------------------------------------+
| WebDriverIO (mocha runner)                          |
|   specs/*.test.ts                                   |
|     +- browser.execute() -> DOM queries + assertions |
|     +- assertWithJudge() -> LLM semantic evaluation |
+-----------------------------------------------------+
| tauri-driver (port 4444)                            |
|   WebDriver protocol <-> WebKit2GTK inspector        |
+-----------------------------------------------------+
| Agor debug binary                                   |
|   AGOR_TEST=1 (disables watchers, wake scheduler)   |
|   AGOR_TEST_DATA_DIR -> isolated SQLite DBs          |
|   AGOR_TEST_CONFIG_DIR -> test groups.json           |
+-----------------------------------------------------+

Pillar 1: Test Fixtures (`fixtures.ts`)

The fixture generator creates isolated temporary environments so tests never touch real user data. Each fixture includes:

Temp root dir under /tmp/agor-e2e-{timestamp}/
Data dir — empty, SQLite databases created at runtime
Config dir — contains a generated groups.json with test projects
Project dir — a real git repo with README.md and hello.py (for agent testing)
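The layout above can be sketched in a few lines of Node code. This is a minimal illustration, not the actual `fixtures.ts`: the names `createSketchFixture` and `destroySketchFixture` are hypothetical, the `groups.json` shape is a placeholder (its real schema is not shown in this document), and the git initialization step is left out.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical shape; the real fixture object may expose more fields.
interface SketchFixture {
  rootDir: string;
  dataDir: string;
  configDir: string;
  projectDir: string;
  env: Record<string, string>;
}

function createSketchFixture(name = 'agor-e2e'): SketchFixture {
  // Timestamped root so parallel or repeated runs do not share state.
  const rootDir = path.join(os.tmpdir(), `${name}-${Date.now()}`);
  const dataDir = path.join(rootDir, 'data');
  const configDir = path.join(rootDir, 'config');
  const projectDir = path.join(rootDir, 'test-project');
  for (const dir of [dataDir, configDir, projectDir]) {
    fs.mkdirSync(dir, { recursive: true });
  }

  // Seed files for agent tests. The fixture described above is also a git
  // repository; initializing it is omitted from this sketch.
  fs.writeFileSync(path.join(projectDir, 'README.md'), '# Test project\n');
  fs.writeFileSync(path.join(projectDir, 'hello.py'), 'print("hello")\n');

  // Placeholder config: the actual groups.json schema is an assumption here.
  const groups = { groups: [{ id: 'test-group', projects: [{ name: 'test-project', cwd: projectDir }] }] };
  fs.writeFileSync(path.join(configDir, 'groups.json'), JSON.stringify(groups, null, 2));

  return {
    rootDir,
    dataDir,
    configDir,
    projectDir,
    env: { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: dataDir, AGOR_TEST_CONFIG_DIR: configDir },
  };
}

function destroySketchFixture(fixture: SketchFixture): void {
  // force: true makes cleanup safe to call even if the directory is already gone.
  fs.rmSync(fixture.rootDir, { recursive: true, force: true });
}
```

The data dir is deliberately left empty, matching the description above that the SQLite databases are created at runtime.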

Single-Project Fixture

import { createTestFixture, destroyTestFixture } from '../fixtures';

const fixture = createTestFixture('my-test');
// fixture.rootDir    -> /tmp/my-test-1710234567890/
// fixture.dataDir    -> /tmp/my-test-1710234567890/data/
// fixture.configDir  -> /tmp/my-test-1710234567890/config/
// fixture.projectDir -> /tmp/my-test-1710234567890/test-project/
// fixture.env        -> { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: '...', ... }

destroyTestFixture(fixture);

Multi-Project Fixture

import { createMultiProjectFixture } from '../fixtures';
const fixture = createMultiProjectFixture(3); // 3 separate git repos

Fixture Environment Variables

| Variable | Effect |
|---|---|
| `AGOR_TEST=1` | Disables file watchers, wake scheduler, enables `is_test_mode` |
| `AGOR_TEST_DATA_DIR` | Redirects `sessions.db` and `btmsg.db` storage |
| `AGOR_TEST_CONFIG_DIR` | Redirects `groups.json` config loading |
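The redirect behavior amounts to "use the override when present, otherwise the default path". The sketch below illustrates that rule in TypeScript only; the actual resolution happens in the Rust backend, and the helper names, the default path, and the treatment of an empty value as "unset" are all assumptions.

```typescript
type Env = Record<string, string | undefined>;

// True when the app should behave as in test mode (assumed: exact value '1').
function isTestMode(env: Env): boolean {
  return env.AGOR_TEST === '1';
}

// Hypothetical helper: the override variable wins when set and non-empty.
function resolveDir(env: Env, overrideVar: string, defaultDir: string): string {
  const override = env[overrideVar];
  return override !== undefined && override !== '' ? override : defaultDir;
}

// Example: the default path here is illustrative only.
const dataDir = resolveDir(
  { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: '/tmp/agor-e2e-example/data' },
  'AGOR_TEST_DATA_DIR',
  '/home/user/.local/share/agor',
);
```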

Pillar 2: Test Mode

When AGOR_TEST=1 is set:

Rust backend: watcher.rs and fs_watcher.rs skip starting their file watchers
Frontend: the is_test_mode Tauri command returns true, and the wake scheduler is disabled via disableWakeScheduler()
Data isolation: AGOR_TEST_DATA_DIR and AGOR_TEST_CONFIG_DIR override the default data and config paths

The WebDriverIO config (wdio.conf.js) passes these env vars via tauri:options.env in capabilities.
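As a rough sketch, the capabilities entry might look like the following. The binary path and the literal directory values are placeholders; in the real config the env object would presumably come from the fixture (`fixture.env`).

```typescript
// Placeholder values; a real run would take these from the generated fixture.
const fixtureEnv: Record<string, string> = {
  AGOR_TEST: '1',
  AGOR_TEST_DATA_DIR: '/tmp/agor-e2e-example/data',
  AGOR_TEST_CONFIG_DIR: '/tmp/agor-e2e-example/config',
};

const capabilities = [
  {
    // One session at a time, matching the single-session constraint of tauri-driver.
    maxInstances: 1,
    'tauri:options': {
      application: './target/debug/agor', // hypothetical path to the debug binary
      env: fixtureEnv,
    },
  },
];
```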

Pillar 3: LLM Judge (`llm-judge.ts`)

The LLM judge enables semantic assertions: it evaluates whether agent output "looks right" instead of matching exact strings.

Dual Backend

| Backend | How it works | Requires |
|---|---|---|
| `cli` (default) | Spawns `claude` CLI with `--output-format text` | Claude CLI installed |
| `api` | Raw `fetch` to `https://api.anthropic.com/v1/messages` | `ANTHROPIC_API_KEY` env var |

Auto-detection order: CLI first -> API fallback -> skip test.
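The detection order can be expressed as a small pure function. This is a sketch under assumptions, not the code in `llm-judge.ts`: `selectBackend` is a hypothetical name, CLI availability is passed in as a boolean instead of being probed, and a forced backend that is unavailable is assumed to result in a skip.

```typescript
type JudgeBackend = 'cli' | 'api';
type Env = Record<string, string | undefined>;

// Returns the backend to use, or null when the test should be skipped.
function selectBackend(env: Env, cliInstalled: boolean): JudgeBackend | null {
  const hasApiKey = Boolean(env.ANTHROPIC_API_KEY);

  // An explicit LLM_JUDGE_BACKEND overrides auto-detection.
  if (env.LLM_JUDGE_BACKEND === 'cli') return cliInstalled ? 'cli' : null;
  if (env.LLM_JUDGE_BACKEND === 'api') return hasApiKey ? 'api' : null;

  // Auto-detection: CLI first, then API, otherwise skip.
  if (cliInstalled) return 'cli';
  if (hasApiKey) return 'api';
  return null;
}
```

Keeping the decision separate from the probing makes each branch easy to unit test without a Claude CLI or API key on the machine.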

API

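import { isJudgeAvailable, judge, assertWithJudge } from '../llm-judge';

// Use a `function` callback (not an arrow) so mocha's `this.skip()` is available.
it('lists files in the project directory', async function () {
  if (!isJudgeAvailable()) { this.skip(); return; }

  // actualOutput: text captured from the agent pane earlier in the test.
  const verdict = await judge(
    'The output should contain a file listing with at least one filename',
    actualOutput,
    'Agent was asked to list files in a directory containing README.md',
  );
  // verdict: { pass: boolean, reasoning: string, confidence: number }
});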
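The signature of `assertWithJudge()` is not shown here, so the following is only a guess at how such a wrapper could turn a verdict into a test failure. The judge function is injected to keep the sketch self-contained, and both the `minConfidence` parameter and the 0 to 1 confidence scale are assumptions.

```typescript
interface Verdict {
  pass: boolean;
  reasoning: string;
  confidence: number; // assumed to be in the range 0..1
}

type JudgeFn = (criteria: string, actual: string, context?: string) => Promise<Verdict>;

// Hypothetical wrapper: throws when the judge rejects the output or is unsure.
async function assertWithJudgeSketch(
  judgeFn: JudgeFn,
  criteria: string,
  actual: string,
  context?: string,
  minConfidence = 0.5,
): Promise<Verdict> {
  const verdict = await judgeFn(criteria, actual, context);
  if (!verdict.pass) {
    throw new Error(`LLM judge rejected output: ${verdict.reasoning}`);
  }
  if (verdict.confidence < minConfidence) {
    throw new Error(`LLM judge confidence ${verdict.confidence} is below ${minConfidence}: ${verdict.reasoning}`);
  }
  return verdict;
}
```

Including the judge's reasoning in the error message keeps failures readable in the mocha report.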

Test Spec Files

| File | Phase | Tests | Focus |
|---|---|---|---|
| `agor.test.ts` | Smoke | ~50 | Basic UI rendering, CSS class selectors |
| `phase-a-structure.test.ts` | A | 12 | Structural integrity + settings (Scenarios 1-2) |
| `phase-a-agent.test.ts` | A | 15 | Agent pane + prompt submission (Scenarios 3+7) |
| `phase-a-navigation.test.ts` | A | 15 | Terminal tabs + palette + focus (Scenarios 4-6) |
| `phase-b.test.ts` | B | ~15 | Multi-project grid, LLM-judged agent responses |
| `phase-c.test.ts` | C | 27 | Hardening features (palette, search, notifications, keyboard, settings, health, metrics, context, files) |

Test Results Tracking (`results-db.ts`)

A lightweight JSON store for tracking test runs and individual step results. Writes to test-results/results.json.
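A store of this kind can be as simple as "read the array, append, write it back". The sketch below shows that pattern; the record fields and function names are hypothetical and do not reflect the actual schema used by `results-db.ts`.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical record shapes for illustration.
interface StepResult {
  name: string;
  status: 'passed' | 'failed' | 'skipped';
  durationMs: number;
}

interface RunRecord {
  runId: string;
  startedAt: string; // ISO 8601 timestamp
  steps: StepResult[];
}

// Returns all recorded runs, or an empty list when the file does not exist yet.
function loadRuns(file: string): RunRecord[] {
  if (!fs.existsSync(file)) return [];
  return JSON.parse(fs.readFileSync(file, 'utf8')) as RunRecord[];
}

// Appends one run and rewrites the whole file.
function recordRun(file: string, run: RunRecord): void {
  const runs = loadRuns(file);
  runs.push(run);
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, JSON.stringify(runs, null, 2));
}
```

Rewriting the full file on every append is fine for a single-session test runner, but it would not be safe with concurrent writers.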

CI Integration (`.github/workflows/e2e.yml`)

Unit tests — npm run test (vitest)
Cargo tests — cargo test (with env -u AGOR_TEST to prevent env leakage)
E2E tests — xvfb-run npm run test:e2e (virtual framebuffer for headless WebKit2GTK)

LLM-judged tests are gated on the ANTHROPIC_API_KEY secret — they skip gracefully in forks.

Writing New Tests

Pick the appropriate spec file (or create a new phase file)
Use data-testid selectors where possible
For DOM queries, use browser.execute() to run JS in the app context
For semantic assertions, use assertWithJudge() with clear criteria
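The selector and DOM-query guidelines above can be combined as in the sketch below. `BrowserLike` is a minimal structural stand-in for the WebDriverIO `browser` object, and `project-box` is a made-up test id. WebDriverIO serializes the function given to `browser.execute()` and runs it in the app's webview, so values are passed as arguments instead of captured from the enclosing scope.

```typescript
// Minimal structural type covering the one WebDriverIO method used here.
interface BrowserLike {
  execute<T, A extends unknown[]>(fn: (...args: A) => T, ...args: A): Promise<T>;
}

// Builds an attribute selector for a data-testid value.
const byTestId = (id: string): string => `[data-testid="${id}"]`;

// Counts elements with the given test id inside the app.
async function countByTestId(browser: BrowserLike, id: string): Promise<number> {
  return browser.execute((selector: string) => {
    // In a real run this executes in the webview, where `document` is the app's DOM.
    const doc = (globalThis as any).document;
    return doc.querySelectorAll(selector).length as number;
  }, byTestId(id));
}
```

In a spec this might be called as `await countByTestId(browser, 'project-box')` and compared against the number of projects in the fixture.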

WebDriverIO Config (`wdio.conf.js`)

Single session: maxInstances: 1 — tauri-driver can't handle parallel sessions
Lifecycle: onPrepare builds the debug binary; beforeSession spawns tauri-driver and waits for it with a TCP readiness probe
Timeouts: 60s per test, 10s for waitFor commands, 30s for connection retries
Skip build: Set SKIP_BUILD=1 to reuse existing binary
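A TCP readiness probe of the kind mentioned above can be written with Node's `net` module: keep trying to connect until the port accepts a connection or a deadline passes. This is a generic sketch, not the code in `wdio.conf.js`; the function name, defaults, and retry interval are assumptions.

```typescript
import * as net from 'node:net';

// Resolves once something accepts TCP connections on host:port,
// rejects if that does not happen within timeoutMs.
function waitForPort(
  port: number,
  host = '127.0.0.1',
  timeoutMs = 10_000,
  intervalMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve, reject) => {
    const attempt = (): void => {
      const socket = net.createConnection({ port, host });
      // Treat a connection attempt that stalls as a failed attempt.
      socket.setTimeout(1_000, () => socket.destroy(new Error('connect timeout')));
      socket.once('connect', () => {
        socket.destroy();
        resolve();
      });
      socket.once('error', () => {
        socket.destroy();
        if (Date.now() >= deadline) {
          reject(new Error(`Port ${port} not reachable within ${timeoutMs} ms`));
        } else {
          setTimeout(attempt, intervalMs);
        }
      });
    };
    attempt();
  });
}
```

In a `beforeSession` hook, one would spawn tauri-driver and then `await waitForPort(4444)` before letting the session start.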

Troubleshooting

| Problem | Solution |
|---|---|
| "Callback was not called before unload" | Stale binary; rebuild with `cargo tauri build --debug --no-bundle` |
| Tests hang on startup | Kill stale `tauri-driver` processes: `pkill -f tauri-driver` |
| All tests skip LLM judge | Install Claude CLI or set `ANTHROPIC_API_KEY` |
| SIGUSR2 / exit code 144 | Stale tauri-driver on port 4444; kill and retry |
| `AGOR_TEST` leaking to cargo | Run cargo tests with `env -u AGOR_TEST cargo test` |
| No display available | Use `xvfb-run` or ensure X11/Wayland display is set |
