# E2E Testing Facility
Agor's end-to-end testing uses WebDriverIO + tauri-driver to drive the real Tauri application through WebKit2GTK's inspector protocol. The facility has three pillars:
- Test Fixtures — isolated fake environments with dummy projects
- Test Mode — app-level env vars that disable watchers and redirect data/config paths
- LLM Judge — Claude-powered semantic assertions for evaluating agent behavior
## Quick Start

```bash
# Run all tests (vitest + cargo + E2E)
npm run test:all:e2e

# Run E2E only (requires a pre-built debug binary)
SKIP_BUILD=1 npm run test:e2e

# Build the debug binary separately (faster iteration)
cargo tauri build --debug --no-bundle

# Run with the LLM judge (backend auto-detected; CLI is the default)
npm run test:e2e

# Force the LLM judge to use the API instead of the CLI
LLM_JUDGE_BACKEND=api ANTHROPIC_API_KEY=sk-... npm run test:e2e
```
## Prerequisites
| Dependency | Purpose | Install |
|---|---|---|
| Rust + Cargo | Build Tauri backend | rustup.rs |
| Node.js 20+ | Frontend + test runner | mise install node |
| tauri-driver | WebDriver bridge to WebKit2GTK | cargo install tauri-driver |
| X11 display | WebKit2GTK needs a display | Real X, or xvfb-run in CI |
| Claude CLI | LLM judge (optional) | claude.ai/download |
## Architecture

```text
+-----------------------------------------------------+
|  WebDriverIO (mocha runner)                         |
|    specs/*.test.ts                                  |
|    +- browser.execute()   -> DOM queries + assertions|
|    +- assertWithJudge()   -> LLM semantic evaluation |
+-----------------------------------------------------+
|  tauri-driver (port 4444)                           |
|    WebDriver protocol <-> WebKit2GTK inspector      |
+-----------------------------------------------------+
|  Agor debug binary                                  |
|    AGOR_TEST=1 (disables watchers, wake scheduler)  |
|    AGOR_TEST_DATA_DIR   -> isolated SQLite DBs      |
|    AGOR_TEST_CONFIG_DIR -> test groups.json         |
+-----------------------------------------------------+
```
## Pillar 1: Test Fixtures (`fixtures.ts`)

The fixture generator creates isolated temporary environments so tests never touch real user data. Each fixture includes:

- Temp root dir under `/tmp/agor-e2e-{timestamp}/`
- Data dir: empty; SQLite databases are created at runtime
- Config dir: contains a generated `groups.json` with test projects
- Project dir: a real git repo with `README.md` and `hello.py` (for agent testing)
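The layout above can be assembled with a handful of Node `fs` calls. This is a minimal sketch, not the code in `fixtures.ts`. The names `createFixtureSketch` and `destroyFixtureSketch` are hypothetical, and so is the `groups.json` shape: the real schema is whatever Agor's config loader expects. When `initGit` is true, `git` must be on `PATH`.

```typescript
import { execFileSync } from 'node:child_process';
import { mkdirSync, rmSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical fixture shape, loosely mirroring the fields listed above.
interface FixtureSketch {
  rootDir: string;
  dataDir: string;
  configDir: string;
  projectDirs: string[];
  env: Record<string, string>;
}

function createFixtureSketch(name: string, projectCount = 1, initGit = true): FixtureSketch {
  // Timestamped root, e.g. /tmp/agor-e2e-1710234567890/ on Linux.
  const rootDir = join(tmpdir(), `${name}-${Date.now()}`);
  const dataDir = join(rootDir, 'data'); // left empty: the app creates its SQLite DBs here
  const configDir = join(rootDir, 'config');
  mkdirSync(dataDir, { recursive: true });
  mkdirSync(configDir, { recursive: true });

  const projectDirs: string[] = [];
  for (let i = 1; i <= projectCount; i++) {
    const dir = join(rootDir, projectCount === 1 ? 'test-project' : `test-project-${i}`);
    mkdirSync(dir);
    writeFileSync(join(dir, 'README.md'), `# Test project ${i}\n`);
    writeFileSync(join(dir, 'hello.py'), 'print("hello")\n');
    if (initGit) {
      // Turn the directory into a real git repo with a single commit.
      const git = (...args: string[]) => execFileSync('git', args, { cwd: dir, stdio: 'ignore' });
      git('init', '-q');
      git('add', '.');
      git('-c', 'user.name=e2e', '-c', 'user.email=e2e@example.invalid', 'commit', '-q', '-m', 'init');
    }
    projectDirs.push(dir);
  }

  // PLACEHOLDER groups.json schema; Agor's real config format may differ.
  const groups = {
    groups: [{ name: 'e2e', projects: projectDirs.map((path, i) => ({ name: `project-${i + 1}`, path })) }],
  };
  writeFileSync(join(configDir, 'groups.json'), JSON.stringify(groups, null, 2));

  return {
    rootDir,
    dataDir,
    configDir,
    projectDirs,
    env: { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: dataDir, AGOR_TEST_CONFIG_DIR: configDir },
  };
}

function destroyFixtureSketch(fixture: FixtureSketch): void {
  rmSync(fixture.rootDir, { recursive: true, force: true });
}
```

The `initGit` switch exists only so the sketch can run where `git` is absent. The fixtures described above always create a real repository.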
### Single-Project Fixture

```typescript
import { createTestFixture, destroyTestFixture } from '../fixtures';

const fixture = createTestFixture('my-test');
// fixture.rootDir    -> /tmp/my-test-1710234567890/
// fixture.dataDir    -> /tmp/my-test-1710234567890/data/
// fixture.configDir  -> /tmp/my-test-1710234567890/config/
// fixture.projectDir -> /tmp/my-test-1710234567890/test-project/
// fixture.env        -> { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: '...', ... }
try {
  // ... run the app against fixture.env and make assertions ...
} finally {
  destroyTestFixture(fixture); // remove the temp tree even if a test throws
}
```
### Multi-Project Fixture

```typescript
import { createMultiProjectFixture } from '../fixtures';

const fixture = createMultiProjectFixture(3); // 3 separate git repos
```
### Fixture Environment Variables

| Variable | Effect |
|---|---|
| `AGOR_TEST=1` | Disables file watchers and the wake scheduler; enables `is_test_mode` |
| `AGOR_TEST_DATA_DIR` | Redirects `sessions.db` and `btmsg.db` storage |
| `AGOR_TEST_CONFIG_DIR` | Redirects `groups.json` config loading |
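The redirect rule in this table amounts to "use the override when it is set, otherwise the default". The sketch below expresses that rule in TypeScript purely for illustration; the real logic lives in the Rust backend. `resolveAgorPaths` is a hypothetical name. Two choices are assumptions: an empty value counts as unset, and only the exact value `"1"` enables test mode.

```typescript
// Hypothetical mirror of the backend's path-override rule (the real code is Rust).
interface AgorPaths {
  testMode: boolean;
  dataDir: string; // where sessions.db and btmsg.db live
  configDir: string; // where groups.json is read from
}

type Env = Record<string, string | undefined>;

function resolveAgorPaths(env: Env, defaults: { dataDir: string; configDir: string }): AgorPaths {
  return {
    // Assumption: only the exact value "1" enables test mode.
    testMode: env.AGOR_TEST === '1',
    // Assumption: an unset or empty override falls back to the default path.
    dataDir: env.AGOR_TEST_DATA_DIR || defaults.dataDir,
    configDir: env.AGOR_TEST_CONFIG_DIR || defaults.configDir,
  };
}
```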
## Pillar 2: Test Mode

When `AGOR_TEST=1` is set:

- Rust backend: `watcher.rs` and `fs_watcher.rs` skip starting file watchers
- Frontend: the `is_test_mode` Tauri command returns `true`, and the wake scheduler is disabled via `disableWakeScheduler()`
- Data isolation: `AGOR_TEST_DATA_DIR` / `AGOR_TEST_CONFIG_DIR` override the default paths
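On the frontend, this gating comes down to asking the backend once at startup. The sketch below injects Tauri's `invoke` so the logic can run outside the app. In the real frontend you would pass `invoke` from `@tauri-apps/api/core` (Tauri 2) or `@tauri-apps/api/tauri` (Tauri 1). `applyTestMode` is a hypothetical name. Falling back to normal mode when the command fails is also an assumption.

```typescript
// Minimal slice of Tauri's `invoke` signature that this sketch relies on.
type Invoke = <T>(cmd: string) => Promise<T>;

// Hypothetical startup hook: queries `is_test_mode` and disables the wake scheduler if set.
async function applyTestMode(invoke: Invoke, disableWakeScheduler: () => void): Promise<boolean> {
  let testMode = false;
  try {
    testMode = await invoke<boolean>('is_test_mode');
  } catch {
    // Assumption: if the command is unavailable, run as a normal (non-test) session.
    testMode = false;
  }
  if (testMode) disableWakeScheduler();
  return testMode;
}
```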
The WebDriverIO config (`wdio.conf.js`) passes these env vars via `tauri:options.env` in its capabilities.
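The sketch below shows the parts of a WebDriverIO config that matter here, written as a TypeScript function so it can be checked in isolation. The option names `hostname`, `port`, `maxInstances`, `capabilities`, `mochaOpts.timeout`, `waitforTimeout` and `connectionRetryTimeout` are standard WebDriverIO options. The `tauri:options.env` entry follows the description above. `buildWdioConfigSketch` is a hypothetical name, and the binary path is whatever the caller passes in. The timeout values mirror those listed under WebDriverIO Config.

```typescript
// Hypothetical helper; Agor's wdio.conf.js holds the real values.
function buildWdioConfigSketch(binaryPath: string, fixtureEnv: Record<string, string>) {
  return {
    hostname: '127.0.0.1',
    port: 4444, // tauri-driver's WebDriver endpoint
    maxInstances: 1, // tauri-driver can't handle parallel sessions
    capabilities: [
      {
        'tauri:options': {
          application: binaryPath,
          env: { ...fixtureEnv }, // AGOR_TEST, AGOR_TEST_DATA_DIR, AGOR_TEST_CONFIG_DIR
        },
      },
    ],
    framework: 'mocha',
    mochaOpts: { timeout: 60_000 }, // 60 s per test
    waitforTimeout: 10_000, // 10 s default for waitFor* commands
    connectionRetryTimeout: 30_000, // 30 s connection retry timeout
  };
}
```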
## Pillar 3: LLM Judge (`llm-judge.ts`)
The LLM judge enables semantic assertions — evaluating whether agent output "looks right" rather than exact string matching.
### Dual Backend

| Backend | How it works | Requires |
|---|---|---|
| `cli` (default) | Spawns the `claude` CLI with `--output-format text` | Claude CLI installed |
| `api` | Raw `fetch` to `https://api.anthropic.com/v1/messages` | `ANTHROPIC_API_KEY` env var |

Auto-detection order: CLI first, then the API as a fallback; if neither is available, the test is skipped. Setting `LLM_JUDGE_BACKEND=api` forces the API backend.
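The detection order above is a small decision function. The sketch below is hypothetical throughout: `selectJudgeBackend`, `claudeCliAvailable` and the `JudgeProbe` shape are illustrative names. It assumes `claude --version` exits with status 0 when the CLI is installed. It also assumes that forcing an unusable backend means "skip" rather than "fall back".

```typescript
import { spawnSync } from 'node:child_process';

type JudgeBackend = 'cli' | 'api';

interface JudgeProbe {
  forced?: string; // LLM_JUDGE_BACKEND, if set
  cliAvailable: boolean; // whether the `claude` CLI can be run
  apiKey?: string; // ANTHROPIC_API_KEY, if set
}

// Assumption: `claude --version` exits with status 0 when the CLI is installed.
// spawnSync reports a missing binary through `error` (ENOENT) instead of throwing.
function claudeCliAvailable(): boolean {
  const result = spawnSync('claude', ['--version'], { stdio: 'ignore' });
  return !result.error && result.status === 0;
}

// Returns the backend to use, or null when the judged test should be skipped.
function selectJudgeBackend(p: JudgeProbe): JudgeBackend | null {
  // Assumption: a forced backend that isn't usable means "skip", not "fall back".
  if (p.forced === 'api') return p.apiKey ? 'api' : null;
  if (p.forced === 'cli') return p.cliAvailable ? 'cli' : null;
  if (p.cliAvailable) return 'cli'; // CLI first
  if (p.apiKey) return 'api'; // then the API fallback
  return null; // neither available: skip
}
```

A caller would combine the pieces as `selectJudgeBackend({ forced: process.env.LLM_JUDGE_BACKEND, cliAvailable: claudeCliAvailable(), apiKey: process.env.ANTHROPIC_API_KEY })`.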
### API

```typescript
import { isJudgeAvailable, judge, assertWithJudge } from '../llm-judge';

// Inside a mocha `it('...', async function () { ... })`: use a regular function,
// not an arrow function, so `this.skip()` is bound to the test context.
if (!isJudgeAvailable()) { this.skip(); return; }

const verdict = await judge(
  'The output should contain a file listing with at least one filename',
  actualOutput,
  'Agent was asked to list files in a directory containing README.md',
);
// verdict: { pass: boolean, reasoning: string, confidence: number }
```
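For the `api` backend, a judge call comes down to one POST to the Messages endpoint plus parsing a JSON verdict out of the reply. Several parts of the sketch below are hypothetical: the prompt wording, the JSON reply contract, the 0 to 1 confidence scale and all function names. The model id is a parameter because the document does not say which model the judge uses. The endpoint, the `x-api-key` / `anthropic-version` headers and the `content[].text` response shape are those of Anthropic's Messages API.

```typescript
interface Verdict {
  pass: boolean;
  reasoning: string;
  confidence: number;
}

interface MessagesResponse {
  content?: Array<{ type: string; text?: string }>;
}

// Hypothetical prompt: asks the model to answer with a bare JSON verdict.
function buildJudgePrompt(criteria: string, actual: string, context?: string): string {
  return [
    'You are a strict test judge. Decide whether ACTUAL satisfies CRITERIA.',
    context ? `CONTEXT: ${context}` : '',
    `CRITERIA: ${criteria}`,
    `ACTUAL:\n${actual}`,
    'Reply with JSON only: {"pass": boolean, "reasoning": string, "confidence": number from 0 to 1}',
  ]
    .filter((part) => part.length > 0)
    .join('\n\n');
}

function buildJudgeRequest(prompt: string, apiKey: string, model: string) {
  return {
    url: 'https://api.anthropic.com/v1/messages',
    init: {
      method: 'POST',
      headers: {
        'x-api-key': apiKey,
        'anthropic-version': '2023-06-01',
        'content-type': 'application/json',
      },
      body: JSON.stringify({ model, max_tokens: 512, messages: [{ role: 'user', content: prompt }] }),
    },
  };
}

// Concatenates the text blocks of a Messages API response.
function extractText(body: MessagesResponse): string {
  return (body.content ?? [])
    .filter((block) => block.type === 'text')
    .map((block) => block.text ?? '')
    .join('');
}

// Parses the span from the first '{' to the last '}' and validates it.
function parseVerdict(text: string): Verdict {
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end < start) throw new Error(`judge reply contained no JSON: ${text.slice(0, 200)}`);
  const raw = JSON.parse(text.slice(start, end + 1)) as Partial<Verdict>;
  if (typeof raw.pass !== 'boolean') throw new Error('judge verdict is missing a boolean "pass"');
  const confidence = Number(raw.confidence);
  return {
    pass: raw.pass,
    reasoning: String(raw.reasoning ?? ''),
    // Assumption: confidence is on a 0..1 scale; out-of-range or missing values are clamped/zeroed.
    confidence: Number.isFinite(confidence) ? Math.min(1, Math.max(0, confidence)) : 0,
  };
}

async function judgeViaApi(criteria: string, actual: string, context: string, apiKey: string, model: string): Promise<Verdict> {
  const { url, init } = buildJudgeRequest(buildJudgePrompt(criteria, actual, context), apiKey, model);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Messages API returned ${res.status}: ${await res.text()}`);
  return parseVerdict(extractText((await res.json()) as MessagesResponse));
}
```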
## Test Spec Files

| File | Phase | Tests | Focus |
|---|---|---|---|
| `agor.test.ts` | Smoke | ~50 | Basic UI rendering, CSS class selectors |
| `phase-a-structure.test.ts` | A | 12 | Structural integrity + settings (Scenarios 1-2) |
| `phase-a-agent.test.ts` | A | 15 | Agent pane + prompt submission (Scenarios 3 and 7) |
| `phase-a-navigation.test.ts` | A | 15 | Terminal tabs + palette + focus (Scenarios 4-6) |
| `phase-b.test.ts` | B | ~15 | Multi-project grid, LLM-judged agent responses |
| `phase-c.test.ts` | C | 27 | Hardening features (palette, search, notifications, keyboard, settings, health, metrics, context, files) |
## Test Results Tracking (`results-db.ts`)

A lightweight JSON store for tracking test runs and individual step results. Writes to `test-results/results.json`.
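A JSON store like this can be a few read-modify-write helpers. The sketch below is not `results-db.ts`. The run/step schema and all function names are hypothetical. The whole-file read-modify-write approach assumes a single writer, which holds when only one WebDriver session runs at a time.

```typescript
import { existsSync, mkdirSync, readFileSync, rmSync, writeFileSync } from 'node:fs';
import { dirname } from 'node:path';

// HYPOTHETICAL schema; results-db.ts defines the real one.
interface StepResult {
  name: string;
  status: 'pass' | 'fail' | 'skip';
  durationMs: number;
  error?: string;
}

interface TestRun {
  id: string;
  startedAt: string;
  finishedAt?: string;
  steps: StepResult[];
}

interface ResultsFile {
  runs: TestRun[];
}

function loadResults(file: string): ResultsFile {
  return existsSync(file) ? (JSON.parse(readFileSync(file, 'utf8')) as ResultsFile) : { runs: [] };
}

function saveResults(file: string, data: ResultsFile): void {
  mkdirSync(dirname(file), { recursive: true }); // e.g. creates test-results/ on first write
  writeFileSync(file, JSON.stringify(data, null, 2));
}

// Read-modify-write is acceptable here because only one session (one writer) runs at a time.
function startRun(file: string, id: string): void {
  const data = loadResults(file);
  data.runs.push({ id, startedAt: new Date().toISOString(), steps: [] });
  saveResults(file, data);
}

function recordStep(file: string, runId: string, step: StepResult): void {
  const data = loadResults(file);
  const run = data.runs.find((r) => r.id === runId);
  if (!run) throw new Error(`unknown run: ${runId}`);
  run.steps.push(step);
  saveResults(file, data);
}

function finishRun(file: string, runId: string): void {
  const data = loadResults(file);
  const run = data.runs.find((r) => r.id === runId);
  if (!run) throw new Error(`unknown run: ${runId}`);
  run.finishedAt = new Date().toISOString();
  saveResults(file, data);
}

// Removes a results file and its directory (used for cleanup in throwaway runs).
function removeResultsDir(file: string): void {
  rmSync(dirname(file), { recursive: true, force: true });
}
```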
## CI Integration (`.github/workflows/e2e.yml`)

- Unit tests: `npm run test` (vitest)
- Cargo tests: `cargo test` (run under `env -u AGOR_TEST` to prevent env leakage)
- E2E tests: `xvfb-run npm run test:e2e` (virtual framebuffer for headless WebKit2GTK)

LLM-judged tests are gated on the `ANTHROPIC_API_KEY` secret, so they skip gracefully in forks.
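The `env -u AGOR_TEST` guard on the cargo step keeps a stray `AGOR_TEST=1` from an E2E run from changing backend behavior under `cargo test`. If cargo is launched from a Node script rather than a shell, the same effect comes from passing a filtered environment. In the sketch below, `envWithout` and `runCargoTests` are hypothetical helpers.

```typescript
import { spawnSync } from 'node:child_process';

type Env = Record<string, string | undefined>;

// Returns a copy of `env` without the given keys (the Node analogue of `env -u KEY`).
function envWithout(env: Env, ...keys: string[]): Env {
  const copy: Env = { ...env };
  for (const key of keys) delete copy[key];
  return copy;
}

// Hypothetical wrapper: runs `cargo test` with AGOR_TEST stripped from the environment.
function runCargoTests(cwd: string): number {
  const result = spawnSync('cargo', ['test'], {
    cwd,
    env: envWithout(process.env, 'AGOR_TEST'),
    stdio: 'inherit',
  });
  return result.status ?? 1; // null status (killed by a signal) counts as failure
}
```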
## Writing New Tests

- Pick the appropriate spec file (or create a new phase file)
- Use `data-testid` selectors where possible
- For DOM queries, use `browser.execute()` to run JS in the app context
- For semantic assertions, use `assertWithJudge()` with clear criteria
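The selector and `browser.execute()` advice above can be combined into a small helper. The names `byTestId` and `countByTestId` are hypothetical. `ExecutesScripts` is a hand-written subset of WebdriverIO's `browser.execute(fn, ...args)` signature, so the helper can be exercised without a live session. The function passed to `execute` runs inside the app's WebView. It reaches `document` through `globalThis` only so the snippet type-checks without the DOM type lib.

```typescript
// Hand-written subset of WebdriverIO's `browser.execute(fn, ...args)`.
interface ExecutesScripts {
  execute<R, A extends unknown[]>(fn: (...args: A) => R, ...args: A): Promise<R>;
}

// Builds a CSS attribute selector, escaping characters that would break the quoted value.
function byTestId(id: string): string {
  const escaped = id.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `[data-testid="${escaped}"]`;
}

// Counts matching elements inside the app's WebView.
async function countByTestId(browser: ExecutesScripts, id: string): Promise<number> {
  return browser.execute(
    // Runs in the page context; `globalThis` is used only to avoid needing the DOM type lib.
    (selector: string) => (globalThis as any).document.querySelectorAll(selector).length as number,
    byTestId(id),
  );
}
```

In a spec, this becomes something like `await countByTestId(browser, 'project-tab')`, where `'project-tab'` stands in for whatever `data-testid` the component actually uses.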
## WebDriverIO Config (`wdio.conf.js`)

- Single session: `maxInstances: 1`, because tauri-driver can't handle parallel sessions
- Lifecycle: `onPrepare` builds the debug binary; `beforeSession` spawns tauri-driver and waits on a TCP readiness probe
- Timeouts: 60 s per test, 10 s `waitfor`, 30 s connection retry
- Skip build: set `SKIP_BUILD=1` to reuse the existing binary
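A TCP readiness probe like the one `beforeSession` uses can be as simple as retrying a connect until tauri-driver accepts on its port. The sketch below uses Node's `net` module. `waitForPort` and `tryConnect` are hypothetical names, and the poll interval and per-attempt timeout are arbitrary choices.

```typescript
import * as net from 'node:net';

// One connection attempt: resolves true if something accepts on host:port.
function tryConnect(host: string, port: number, attemptTimeoutMs = 1_000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.createConnection({ host, port });
    const finish = (ok: boolean) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(attemptTimeoutMs, () => finish(false));
    socket.once('connect', () => finish(true));
    socket.once('error', () => finish(false));
  });
}

// Polls until the port accepts connections or the deadline passes.
async function waitForPort(host: string, port: number, timeoutMs = 30_000, intervalMs = 200): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await tryConnect(host, port)) return;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`nothing accepted connections on ${host}:${port} within ${timeoutMs} ms`);
}
```

In a `beforeSession` hook, this would run right after spawning `tauri-driver` (for example with `child_process.spawn`) and await `waitForPort('127.0.0.1', 4444)`. A successful probe before spawning also suggests that a stale tauri-driver is still holding the port.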
## Troubleshooting

| Problem | Solution |
|---|---|
| "Callback was not called before unload" | Stale binary: rebuild with `cargo tauri build --debug --no-bundle` |
| Tests hang on startup | Kill stale tauri-driver processes: `pkill -f tauri-driver` |
| All LLM-judged tests skip | Install the Claude CLI or set `ANTHROPIC_API_KEY` |
| SIGUSR2 / exit code 144 | Stale tauri-driver on port 4444: kill it and retry |
| `AGOR_TEST` leaking into cargo tests | Run cargo tests with `env -u AGOR_TEST cargo test` |
| No display available | Use `xvfb-run`, or ensure an X11/Wayland display is set |