E2E Testing Facility

Agor's end-to-end testing uses WebDriverIO + tauri-driver to drive the real Tauri application through WebKit2GTK's inspector protocol. The facility has three pillars:

  1. Test Fixtures — isolated fake environments with dummy projects
  2. Test Mode — app-level env vars that disable watchers and redirect data/config paths
  3. LLM Judge — Claude-powered semantic assertions for evaluating agent behavior

Quick Start

# Run all tests (vitest + cargo + E2E)
npm run test:all:e2e

# Run E2E only (requires pre-built debug binary)
SKIP_BUILD=1 npm run test:e2e

# Build debug binary separately (faster iteration)
cargo tauri build --debug --no-bundle

# Run with LLM judge via CLI (default, auto-detected)
npm run test:e2e

# Force LLM judge to use API instead of CLI
LLM_JUDGE_BACKEND=api ANTHROPIC_API_KEY=sk-... npm run test:e2e

Prerequisites

| Dependency | Purpose | Install |
|---|---|---|
| Rust + Cargo | Build Tauri backend | rustup.rs |
| Node.js 20+ | Frontend + test runner | mise install node |
| tauri-driver | WebDriver bridge to WebKit2GTK | cargo install tauri-driver |
| X11 display | WebKit2GTK needs a display | Real X, or xvfb-run in CI |
| Claude CLI | LLM judge (optional) | claude.ai/download |

Architecture

+------------------------------------------------------+
| WebDriverIO (mocha runner)                           |
|   specs/*.test.ts                                    |
|     +- browser.execute() -> DOM queries + assertions |
|     +- assertWithJudge() -> LLM semantic evaluation  |
+------------------------------------------------------+
| tauri-driver (port 4444)                             |
|   WebDriver protocol <-> WebKit2GTK inspector        |
+------------------------------------------------------+
| Agor debug binary                                    |
|   AGOR_TEST=1 (disables watchers, wake scheduler)    |
|   AGOR_TEST_DATA_DIR -> isolated SQLite DBs          |
|   AGOR_TEST_CONFIG_DIR -> test groups.json           |
+------------------------------------------------------+

Pillar 1: Test Fixtures (fixtures.ts)

The fixture generator creates isolated temporary environments so tests never touch real user data. Each fixture includes:

  • Temp root dir: /tmp/{name}-{timestamp}/, for example /tmp/agor-e2e-{timestamp}/
  • Data dir — empty, SQLite databases created at runtime
  • Config dir — contains a generated groups.json with test projects
  • Project dir — a real git repo with README.md and hello.py (for agent testing)

Single-Project Fixture

import { createTestFixture, destroyTestFixture } from '../fixtures';

const fixture = createTestFixture('my-test');
// fixture.rootDir    -> /tmp/my-test-1710234567890/
// fixture.dataDir    -> /tmp/my-test-1710234567890/data/
// fixture.configDir  -> /tmp/my-test-1710234567890/config/
// fixture.projectDir -> /tmp/my-test-1710234567890/test-project/
// fixture.env        -> { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: '...', ... }

destroyTestFixture(fixture);

Multi-Project Fixture

import { createMultiProjectFixture } from '../fixtures';
const fixture = createMultiProjectFixture(3); // 3 separate git repos

Fixture Environment Variables

| Variable | Effect |
|---|---|
| AGOR_TEST=1 | Disables file watchers and the wake scheduler; enables is_test_mode |
| AGOR_TEST_DATA_DIR | Redirects sessions.db and btmsg.db storage |
| AGOR_TEST_CONFIG_DIR | Redirects groups.json config loading |
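
As a rough illustration of how the fixture directories and the variables above fit together, the sketch below derives both from a fixture name and a timestamp. The function sketchFixtureLayout and its return type are hypothetical, chosen to mirror the fields shown in the single-project example. The real fixtures.ts may build them differently, and it also creates the directories, groups.json, and the git repo, which this sketch leaves out.

```typescript
// Hypothetical helper, not part of fixtures.ts: derives the directory layout
// and env map for a fixture without touching the filesystem.
interface FixtureLayout {
  rootDir: string;
  dataDir: string;
  configDir: string;
  projectDir: string;
  env: Record<string, string>;
}

function sketchFixtureLayout(name: string, timestamp: number): FixtureLayout {
  const rootDir = `/tmp/${name}-${timestamp}/`;
  const dataDir = `${rootDir}data/`;
  const configDir = `${rootDir}config/`;
  const projectDir = `${rootDir}test-project/`;
  return {
    rootDir,
    dataDir,
    configDir,
    projectDir,
    env: {
      AGOR_TEST: '1',
      // Assumption: the app receives the directory paths exactly as generated.
      AGOR_TEST_DATA_DIR: dataDir,
      AGOR_TEST_CONFIG_DIR: configDir,
    },
  };
}

const layout = sketchFixtureLayout('my-test', 1710234567890);
console.log(layout.env);
```

Keeping the env map next to the paths means a spec can hand fixture.env to the app in one step instead of rebuilding the variables by hand.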

Pillar 2: Test Mode

When AGOR_TEST=1 is set:

  • Rust backend: watcher.rs and fs_watcher.rs skip file watchers
  • Frontend: is_test_mode Tauri command returns true, wake scheduler disabled via disableWakeScheduler()
  • Data isolation: AGOR_TEST_DATA_DIR / AGOR_TEST_CONFIG_DIR override default paths

The WebDriverIO config (wdio.conf.js) passes these env vars via tauri:options.env in capabilities.
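
To make the env hand-off concrete, here is a minimal sketch of a capabilities entry, assuming the fixture's env map is forwarded unchanged under tauri:options.env. The helper name buildCapabilities, the binary path, and the example directories are placeholders, not values taken from the real wdio.conf.js.

```typescript
// Shape of one capabilities entry as described above. Only the two fields
// used in this sketch are modelled.
interface TauriCapability {
  'tauri:options': {
    application: string;          // path to the debug binary (placeholder below)
    env: Record<string, string>;  // env vars passed to the launched app
  };
}

// Hypothetical helper: builds the capabilities array from a binary path and
// a fixture env map.
function buildCapabilities(
  binaryPath: string,
  fixtureEnv: Record<string, string>,
): TauriCapability[] {
  return [
    {
      'tauri:options': {
        application: binaryPath,
        // Copy so later changes to the fixture object do not alter the config.
        env: { ...fixtureEnv },
      },
    },
  ];
}

const capabilities = buildCapabilities('./path/to/debug/binary', {
  AGOR_TEST: '1',
  AGOR_TEST_DATA_DIR: '/tmp/agor-e2e-example/data/',
  AGOR_TEST_CONFIG_DIR: '/tmp/agor-e2e-example/config/',
});
console.log(JSON.stringify(capabilities, null, 2));
```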

Pillar 3: LLM Judge (llm-judge.ts)

The LLM judge enables semantic assertions — evaluating whether agent output "looks right" rather than exact string matching.

Dual Backend

| Backend | How it works | Requires |
|---|---|---|
| cli (default) | Spawns claude CLI with --output-format text | Claude CLI installed |
| api | Raw fetch to https://api.anthropic.com/v1/messages | ANTHROPIC_API_KEY env var |

Auto-detection order: CLI first -> API fallback -> skip test.
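
The detection order can be sketched as a small pure function. This is an illustration only: the names JudgeEnvironment and resolveJudgeBackend are hypothetical, how the CLI is located is not shown, and the behaviour when a forced backend is unavailable (returning null so the test skips) is an assumption, as is accepting LLM_JUDGE_BACKEND=cli.

```typescript
type JudgeBackend = 'cli' | 'api';

// Inputs the decision depends on, gathered by the caller.
interface JudgeEnvironment {
  forcedBackend?: string; // value of LLM_JUDGE_BACKEND, if set
  cliInstalled: boolean;  // whether the claude CLI was found
  apiKey?: string;        // value of ANTHROPIC_API_KEY, if set
}

// Returns the backend to use, or null when the test should be skipped.
function resolveJudgeBackend(env: JudgeEnvironment): JudgeBackend | null {
  const apiUsable = Boolean(env.apiKey);

  // An explicit choice wins over auto-detection (assumed behaviour).
  if (env.forcedBackend === 'api') return apiUsable ? 'api' : null;
  if (env.forcedBackend === 'cli') return env.cliInstalled ? 'cli' : null;

  // Auto-detection: CLI first, then API, otherwise skip.
  if (env.cliInstalled) return 'cli';
  if (apiUsable) return 'api';
  return null;
}

console.log(resolveJudgeBackend({ cliInstalled: true, apiKey: 'sk-example' })); // cli
console.log(resolveJudgeBackend({ cliInstalled: false, apiKey: 'sk-example' })); // api
console.log(resolveJudgeBackend({ cliInstalled: false })); // null
```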

API

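import { isJudgeAvailable, judge, assertWithJudge } from '../llm-judge';

// Inside a mocha test. Use function () rather than an arrow so this.skip() works.
it('agent lists files', async function () {
  if (!isJudgeAvailable()) { this.skip(); return; }

  // actualOutput is the agent response text obtained by the test.
  const verdict = await judge(
    'The output should contain a file listing with at least one filename', // criteria
    actualOutput,                                                          // text to evaluate
    'Agent was asked to list files in a directory containing README.md',   // context
  );
  // verdict: { pass: boolean, reasoning: string, confidence: number }
});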
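
A verdict of this shape can be turned into a hard assertion with a few lines. The sketch below is not the real assertWithJudge, whose signature is not shown here. The helper assertVerdict, the 0 to 1 confidence scale, and the 0.7 default threshold are all assumptions made for illustration.

```typescript
interface Verdict {
  pass: boolean;
  reasoning: string;
  confidence: number; // assumed to lie between 0 and 1
}

// Hypothetical helper: throws when the judge rejects the output or passes it
// with less confidence than the caller is willing to accept.
function assertVerdict(verdict: Verdict, minConfidence = 0.7): void {
  if (!verdict.pass) {
    throw new Error(`LLM judge rejected the output: ${verdict.reasoning}`);
  }
  if (verdict.confidence < minConfidence) {
    throw new Error(
      `LLM judge passed with low confidence ` +
        `(${verdict.confidence} < ${minConfidence}): ${verdict.reasoning}`,
    );
  }
}

// A confident pass returns normally.
assertVerdict({ pass: true, reasoning: 'Lists README.md', confidence: 0.9 });
```

Including the judge's reasoning in the error message keeps a failing run debuggable without re-querying the model.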

Test Spec Files

| File | Phase | Tests | Focus |
|---|---|---|---|
| agor.test.ts | Smoke | ~50 | Basic UI rendering, CSS class selectors |
| phase-a-structure.test.ts | A | 12 | Structural integrity + settings (Scenarios 1-2) |
| phase-a-agent.test.ts | A | 15 | Agent pane + prompt submission (Scenarios 3+7) |
| phase-a-navigation.test.ts | A | 15 | Terminal tabs + palette + focus (Scenarios 4-6) |
| phase-b.test.ts | B | ~15 | Multi-project grid, LLM-judged agent responses |
| phase-c.test.ts | C | 27 | Hardening features (palette, search, notifications, keyboard, settings, health, metrics, context, files) |

Test Results Tracking (results-db.ts)

A lightweight JSON store for tracking test runs and individual step results. Writes to test-results/results.json.
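
For orientation, a store of this kind might be organised as below. Every type and function name here is hypothetical and the actual schema in results-db.ts is not shown, so treat this as one possible layout. Writing the serialized string to test-results/results.json is left out to keep the sketch free of filesystem access.

```typescript
type StepStatus = 'passed' | 'failed' | 'skipped';

interface StepResult {
  name: string;
  status: StepStatus;
  durationMs: number;
}

interface TestRun {
  id: string;
  startedAt: string; // ISO 8601 timestamp
  steps: StepResult[];
}

interface ResultsFile {
  runs: TestRun[];
}

// Registers a new run in the store and returns it for step recording.
function startRun(store: ResultsFile, id: string, startedAt: Date): TestRun {
  const run: TestRun = { id, startedAt: startedAt.toISOString(), steps: [] };
  store.runs.push(run);
  return run;
}

function recordStep(run: TestRun, step: StepResult): void {
  run.steps.push(step);
}

// Counts steps per status for a quick pass/fail overview.
function summarize(run: TestRun): Record<StepStatus, number> {
  const counts: Record<StepStatus, number> = { passed: 0, failed: 0, skipped: 0 };
  for (const step of run.steps) counts[step.status] += 1;
  return counts;
}

// The string a real store would write to disk.
function serialize(store: ResultsFile): string {
  return JSON.stringify(store, null, 2);
}

const store: ResultsFile = { runs: [] };
const run = startRun(store, 'run-1', new Date(0));
recordStep(run, { name: 'grid renders', status: 'passed', durationMs: 120 });
recordStep(run, { name: 'judge verdict', status: 'skipped', durationMs: 0 });
console.log(summarize(run));
console.log(serialize(store));
```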

CI Integration (.github/workflows/e2e.yml)

  1. Unit tests: npm run test (vitest)
  2. Cargo tests: cargo test (with env -u AGOR_TEST to prevent env leakage)
  3. E2E tests: xvfb-run npm run test:e2e (virtual framebuffer for headless WebKit2GTK)

LLM-judged tests are gated on the ANTHROPIC_API_KEY secret — they skip gracefully in forks.

Writing New Tests

  1. Pick the appropriate spec file (or create a new phase file)
  2. Use data-testid selectors where possible
  3. For DOM queries, use browser.execute() to run JS in the app context
  4. For semantic assertions, use assertWithJudge() with clear criteria
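
Steps 2 and 3 combine naturally: build a data-testid selector once, then pass it into browser.execute() as an argument. The helper byTestId and the test id project-box are hypothetical examples. The commented call shows how the selector might be used inside a spec, where WebDriverIO's browser global is available.

```typescript
// Builds a CSS attribute selector for a data-testid value.
function byTestId(id: string): string {
  // Escape backslashes and double quotes so the value stays inside the
  // quoted attribute string.
  const escaped = id.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `[data-testid="${escaped}"]`;
}

// Inside a spec (not executed here, since it needs a live session):
//
//   const count = await browser.execute(
//     (selector) => document.querySelectorAll(selector).length,
//     byTestId('project-box'),
//   );
//   expect(count).toBeGreaterThan(0);

console.log(byTestId('project-box'));
```

Passing the selector as an argument, rather than closing over it, matters because the function given to browser.execute() runs in the app's webview and cannot see variables from the spec file.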

WebDriverIO Config (wdio.conf.js)

  • Single session: maxInstances: 1 — tauri-driver can't handle parallel sessions
  • Lifecycle: onPrepare builds debug binary, beforeSession spawns tauri-driver with TCP readiness probe
  • Timeouts: 60s per test, 10s waitfor, 30s connection retry
  • Skip build: Set SKIP_BUILD=1 to reuse existing binary
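
The settings above correspond roughly to the following WebDriverIO options. This is a sketch of the relevant fragment only, not the project's actual wdio.conf.js. Mapping "30s connection retry" to connectionRetryTimeout is an assumption, and shouldBuildBinary is a hypothetical helper illustrating the SKIP_BUILD check.

```typescript
const SECOND = 1000;

// Fragment of a WebDriverIO config carrying the values listed above.
const runnerSettings = {
  maxInstances: 1,                      // single session at a time
  waitforTimeout: 10 * SECOND,          // default timeout for waitFor* commands
  connectionRetryTimeout: 30 * SECOND,  // assumed mapping for "connection retry"
  framework: 'mocha',
  mochaOpts: {
    timeout: 60 * SECOND,               // per-test timeout
  },
};

// Hypothetical helper: onPrepare would build the binary unless SKIP_BUILD=1.
function shouldBuildBinary(env: Record<string, string | undefined>): boolean {
  return env.SKIP_BUILD !== '1';
}

console.log(runnerSettings);
console.log(shouldBuildBinary({ SKIP_BUILD: '1' })); // false
console.log(shouldBuildBinary({}));                  // true
```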

Troubleshooting

| Problem | Solution |
|---|---|
| "Callback was not called before unload" | Stale binary: rebuild with cargo tauri build --debug --no-bundle |
| Tests hang on startup | Kill stale tauri-driver processes: pkill -f tauri-driver |
| All LLM-judged tests skip | Install Claude CLI or set ANTHROPIC_API_KEY |
| SIGUSR2 / exit code 144 | Stale tauri-driver on port 4444: kill it and retry |
| AGOR_TEST leaking to cargo | Run cargo tests with env -u AGOR_TEST cargo test |
| No display available | Use xvfb-run or ensure X11/Wayland display is set |