agent-orchestrator/docs/contributing/testing.md
Hibryda 91a3b56dba test(e2e): split + expand phase-b into grid + LLM specs
- phase-b-grid.test.ts (227 lines): multi-project grid, tab switching,
  status bar, accent colors, project icons, scroll, tab bar completeness
- phase-b-llm.test.ts (211 lines): LLM-judged agent response, code gen,
  context tab, tool calls, cost display, session persistence
- Original phase-b.test.ts (377 lines) deleted
- New exhaustive tests added for grid layout and agent interaction
2026-03-18 03:47:16 +01:00

# E2E Testing Facility
Agor's end-to-end testing uses **WebDriverIO + tauri-driver** to drive the real Tauri application through WebKit2GTK's inspector protocol. The facility has three pillars:
1. **Test Fixtures** — isolated fake environments with dummy projects
2. **Test Mode** — app-level env vars that disable watchers and redirect data/config paths
3. **LLM Judge** — Claude-powered semantic assertions for evaluating agent behavior
## Quick Start
```bash
# Run all tests (vitest + cargo + E2E)
npm run test:all:e2e
# Run E2E only (requires pre-built debug binary)
SKIP_BUILD=1 npm run test:e2e
# Build debug binary separately (faster iteration)
cargo tauri build --debug --no-bundle
# Run with LLM judge via CLI (default, auto-detected)
npm run test:e2e
# Force LLM judge to use API instead of CLI
LLM_JUDGE_BACKEND=api ANTHROPIC_API_KEY=sk-... npm run test:e2e
```
## Prerequisites
| Dependency | Purpose | Install |
|-----------|---------|---------|
| Rust + Cargo | Build Tauri backend | [rustup.rs](https://rustup.rs) |
| Node.js 20+ | Frontend + test runner | `mise install node` |
| tauri-driver | WebDriver bridge to WebKit2GTK | `cargo install tauri-driver` |
| X11 display | WebKit2GTK needs a display | Real X, or `xvfb-run` in CI |
| Claude CLI | LLM judge (optional) | [claude.ai/download](https://claude.ai/download) |
## Architecture
```
+----------------------------------------------------+
| WebDriverIO (mocha runner)                         |
| specs/*.test.ts                                    |
|   +- browser.execute() -> DOM queries + assertions |
|   +- assertWithJudge() -> LLM semantic evaluation  |
+----------------------------------------------------+
| tauri-driver (port 4444)                           |
| WebDriver protocol <-> WebKit2GTK inspector        |
+----------------------------------------------------+
| Agor debug binary                                  |
| AGOR_TEST=1 (disables watchers, wake scheduler)    |
| AGOR_TEST_DATA_DIR -> isolated SQLite DBs          |
| AGOR_TEST_CONFIG_DIR -> test groups.json           |
+----------------------------------------------------+
```
## Pillar 1: Test Fixtures (`fixtures.ts`)
The fixture generator creates isolated temporary environments so tests never touch real user data. Each fixture includes:
- **Temp root dir** under `/tmp/agor-e2e-{timestamp}/`
- **Data dir** — empty, SQLite databases created at runtime
- **Config dir** — contains a generated `groups.json` with test projects
- **Project dir** — a real git repo with `README.md` and `hello.py` (for agent testing)
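The layout above can be sketched with Node's standard library. This is a minimal illustration, not the code in `fixtures.ts`: `createSketchFixture`, `destroySketchFixture`, and the `groups.json` shape are hypothetical, and the `git init` step is left out so the sketch runs without a git binary.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

interface SketchFixture {
  rootDir: string;
  dataDir: string;
  configDir: string;
  projectDir: string;
  env: Record<string, string>;
}

// Hypothetical stand-in for createTestFixture().
function createSketchFixture(name = 'agor-e2e'): SketchFixture {
  const rootDir = path.join(os.tmpdir(), `${name}-${Date.now()}`);
  const dataDir = path.join(rootDir, 'data');
  const configDir = path.join(rootDir, 'config');
  const projectDir = path.join(rootDir, 'test-project');
  for (const dir of [dataDir, configDir, projectDir]) {
    fs.mkdirSync(dir, { recursive: true });
  }
  // Placeholder project files; the real fixture also makes this a git repo.
  fs.writeFileSync(path.join(projectDir, 'README.md'), '# Test project\n');
  fs.writeFileSync(path.join(projectDir, 'hello.py'), 'print("hello")\n');
  // Assumed config shape: one group that points at the project directory.
  const groups = { groups: [{ name: 'test', projects: [{ name: 'test-project', cwd: projectDir }] }] };
  fs.writeFileSync(path.join(configDir, 'groups.json'), JSON.stringify(groups, null, 2));
  return {
    rootDir,
    dataDir,
    configDir,
    projectDir,
    env: { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: dataDir, AGOR_TEST_CONFIG_DIR: configDir },
  };
}

// Removes the whole temp tree, mirroring destroyTestFixture().
function destroySketchFixture(fixture: SketchFixture): void {
  fs.rmSync(fixture.rootDir, { recursive: true, force: true });
}
```

Keeping everything under one root directory means teardown is a single recursive delete.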
### Single-Project Fixture
```typescript
import { createTestFixture, destroyTestFixture } from '../fixtures';
const fixture = createTestFixture('my-test');
// fixture.rootDir -> /tmp/my-test-1710234567890/
// fixture.dataDir -> /tmp/my-test-1710234567890/data/
// fixture.configDir -> /tmp/my-test-1710234567890/config/
// fixture.projectDir -> /tmp/my-test-1710234567890/test-project/
// fixture.env -> { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: '...', ... }
destroyTestFixture(fixture);
```
### Multi-Project Fixture
```typescript
import { createMultiProjectFixture } from '../fixtures';
const fixture = createMultiProjectFixture(3); // 3 separate git repos
```
### Fixture Environment Variables
| Variable | Effect |
|----------|--------|
| `AGOR_TEST=1` | Disables file watchers and the wake scheduler; makes `is_test_mode` return true |
| `AGOR_TEST_DATA_DIR` | Redirects `sessions.db` and `btmsg.db` storage |
| `AGOR_TEST_CONFIG_DIR` | Redirects `groups.json` config loading |
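As a rough illustration of how these variables could be interpreted, the sketch below resolves paths from an environment map. The real logic lives in the Rust backend; `resolvePaths` is a hypothetical helper, and it assumes an override applies whenever its variable is set and non-empty, which this document does not confirm.

```typescript
type Env = Record<string, string | undefined>;

interface ResolvedPaths {
  testMode: boolean;
  dataDir: string;
  configDir: string;
}

// Hypothetical helper: callers supply the defaults, because the app's real
// default locations are not stated in this document.
function resolvePaths(env: Env, defaults: { dataDir: string; configDir: string }): ResolvedPaths {
  return {
    testMode: env.AGOR_TEST === '1',
    // Assumption: an empty or unset variable falls back to the default.
    dataDir: env.AGOR_TEST_DATA_DIR || defaults.dataDir,
    configDir: env.AGOR_TEST_CONFIG_DIR || defaults.configDir,
  };
}
```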
## Pillar 2: Test Mode
When `AGOR_TEST=1` is set:
- **Rust backend**: `watcher.rs` and `fs_watcher.rs` skip file watchers
- **Frontend**: `is_test_mode` Tauri command returns true, wake scheduler disabled via `disableWakeScheduler()`
- **Data isolation**: `AGOR_TEST_DATA_DIR` / `AGOR_TEST_CONFIG_DIR` override default paths
The WebDriverIO config (`wdio.conf.js`) passes these env vars via `tauri:options.env` in capabilities.
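A capabilities entry along these lines would carry the fixture environment into the app. This is a sketch rather than the contents of `wdio.conf.js`: `buildCapabilities`, the binary path, and the example directories are assumptions.

```typescript
type Env = Record<string, string>;

interface TauriCapability {
  'tauri:options': {
    application: string; // path to the debug binary
    env: Env;            // forwarded to the app process, per the text above
  };
}

// Hypothetical helper that wraps a fixture's env in a single capability.
function buildCapabilities(application: string, env: Env): TauriCapability[] {
  return [{ 'tauri:options': { application, env: { ...env } } }];
}

// Hypothetical binary path and directories, for illustration only.
const capabilities = buildCapabilities('./target/debug/agor', {
  AGOR_TEST: '1',
  AGOR_TEST_DATA_DIR: '/tmp/agor-e2e-example/data',
  AGOR_TEST_CONFIG_DIR: '/tmp/agor-e2e-example/config',
});
```

Copying the env object keeps later changes to the fixture from altering capabilities that have already been built.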
## Pillar 3: LLM Judge (`llm-judge.ts`)
The LLM judge enables semantic assertions: it evaluates whether agent output "looks right" instead of matching exact strings.
### Dual Backend
| Backend | How it works | Requires |
|---------|-------------|----------|
| `cli` (default) | Spawns `claude` CLI with `--output-format text` | Claude CLI installed |
| `api` | Raw `fetch` to `https://api.anthropic.com/v1/messages` | `ANTHROPIC_API_KEY` env var |
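For the `api` row, a request to the Messages endpoint could be assembled as below. This is a sketch under assumptions: `buildJudgeRequest` is a hypothetical helper, the `max_tokens` value is arbitrary, and the model name is left to the caller because this document does not say which model the judge uses.

```typescript
interface JudgeRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the arguments for fetch() without sending anything.
function buildJudgeRequest(apiKey: string, model: string, prompt: string): JudgeRequest {
  return {
    url: 'https://api.anthropic.com/v1/messages',
    init: {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-api-key': apiKey,
        'anthropic-version': '2023-06-01',
      },
      body: JSON.stringify({
        model,
        max_tokens: 512, // arbitrary cap chosen for this sketch
        messages: [{ role: 'user', content: prompt }],
      }),
    },
  };
}

// Usage (not executed here):
//   const { url, init } = buildJudgeRequest(apiKey, model, prompt);
//   const response = await fetch(url, init);
```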
**Auto-detection order**: CLI first -> API fallback -> skip test.
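The detection order can be sketched as a pure function. `selectBackend` is hypothetical, and returning `null` (skip) when a forced backend is unavailable is an assumption about behavior this document does not spell out.

```typescript
type Backend = 'cli' | 'api';

// Returns the backend to use, or null when LLM-judged tests should be skipped.
function selectBackend(
  env: Record<string, string | undefined>,
  cliInstalled: boolean,
): Backend | null {
  const forced = env.LLM_JUDGE_BACKEND;
  // Assumption: a forced backend that is unavailable leads to a skip.
  if (forced === 'cli') return cliInstalled ? 'cli' : null;
  if (forced === 'api') return env.ANTHROPIC_API_KEY ? 'api' : null;
  // Auto-detection: CLI first, then API, otherwise skip.
  if (cliInstalled) return 'cli';
  if (env.ANTHROPIC_API_KEY) return 'api';
  return null;
}
```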
### API
```typescript
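import { isJudgeAvailable, judge, assertWithJudge } from '../llm-judge';

// Use `function`, not an arrow, so Mocha's `this.skip()` is available.
it('lists files in the project directory', async function () {
  // Skip when neither judge backend is available
  if (!isJudgeAvailable()) { this.skip(); return; }
  // actualOutput: agent response text captured earlier in the spec
  const verdict = await judge(
    'The output should contain a file listing with at least one filename',
    actualOutput,
    'Agent was asked to list files in a directory containing README.md',
  );
  // verdict: { pass: boolean, reasoning: string, confidence: number }
});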
```
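One way to turn a raw model reply into the verdict shape shown above, and then into a test failure, is sketched here. `parseVerdict`, `assertVerdict`, and the 0.7 confidence threshold are hypothetical; the real `llm-judge.ts` may parse and assert differently.

```typescript
interface Verdict {
  pass: boolean;
  reasoning: string;
  confidence: number;
}

// Extracts the first-to-last brace span from the reply and validates its fields.
function parseVerdict(raw: string): Verdict {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) throw new Error('No JSON object in judge reply');
  const v = JSON.parse(raw.slice(start, end + 1)) as Partial<Verdict>;
  if (typeof v.pass !== 'boolean' || typeof v.reasoning !== 'string' || typeof v.confidence !== 'number') {
    throw new Error('Judge reply is missing pass/reasoning/confidence');
  }
  return { pass: v.pass, reasoning: v.reasoning, confidence: v.confidence };
}

// Throws when the verdict fails or the judge is not confident enough.
function assertVerdict(verdict: Verdict, minConfidence = 0.7): void {
  if (!verdict.pass) throw new Error(`Judge rejected output: ${verdict.reasoning}`);
  if (verdict.confidence < minConfidence) {
    throw new Error(`Judge confidence ${verdict.confidence} is below ${minConfidence}`);
  }
}
```

Including the judge's reasoning in the error message makes a failed semantic assertion easier to diagnose from the test log.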
## Test Spec Files
| File | Phase | Tests | Focus |
|------|-------|-------|-------|
| `agor.test.ts` | Smoke | ~50 | Basic UI rendering, CSS class selectors |
| `phase-a-structure.test.ts` | A | 12 | Structural integrity + settings (Scenarios 1-2) |
| `phase-a-agent.test.ts` | A | 15 | Agent pane + prompt submission (Scenarios 3+7) |
| `phase-a-navigation.test.ts` | A | 15 | Terminal tabs + palette + focus (Scenarios 4-6) |
| `phase-b.test.ts` | B | ~15 | Multi-project grid, LLM-judged agent responses |
| `phase-c.test.ts` | C | 27 | Hardening features (palette, search, notifications, keyboard, settings, health, metrics, context, files) |
## Test Results Tracking (`results-db.ts`)
A lightweight JSON store for tracking test runs and individual step results. Writes to `test-results/results.json`.
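A store of this kind can be as small as the sketch below. The record shapes (`RunRecord`, `StepResult`) and function names are assumptions made for illustration; the schema used by `results-db.ts` is not given here. The usage lines write to a temp directory instead of `test-results/results.json`.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical record shapes.
interface StepResult {
  name: string;
  status: 'passed' | 'failed' | 'skipped';
  durationMs: number;
}

interface RunRecord {
  runId: string;
  startedAt: string;
  steps: StepResult[];
}

// Reads all stored runs; a missing file counts as an empty history.
function loadRuns(file: string): RunRecord[] {
  if (!fs.existsSync(file)) return [];
  return JSON.parse(fs.readFileSync(file, 'utf8')) as RunRecord[];
}

// Appends one run and rewrites the file.
function appendRun(file: string, run: RunRecord): RunRecord[] {
  const runs = [...loadRuns(file), run];
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, JSON.stringify(runs, null, 2));
  return runs;
}

// Usage against a throwaway directory.
const demoFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'results-')), 'results.json');
appendRun(demoFile, {
  runId: 'demo-1',
  startedAt: new Date().toISOString(),
  steps: [{ name: 'renders grid', status: 'passed', durationMs: 120 }],
});
```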
## CI Integration (`.github/workflows/e2e.yml`)
1. **Unit tests**: `npm run test` (vitest)
2. **Cargo tests**: `cargo test` (with `env -u AGOR_TEST` to prevent env leakage)
3. **E2E tests**: `xvfb-run npm run test:e2e` (virtual framebuffer for headless WebKit2GTK)
LLM-judged tests are gated on the `ANTHROPIC_API_KEY` secret — they skip gracefully in forks.
## Writing New Tests
1. Pick the appropriate spec file (or create a new phase file)
2. Use `data-testid` selectors where possible
3. For DOM queries, use `browser.execute()` to run JS in the app context
4. For semantic assertions, use `assertWithJudge()` with clear criteria
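Steps 2 and 3 combine naturally in a small selector helper. The sketch below is illustrative: `byTestId`, `countByTestId`, and the `project-box` test id are hypothetical, and the helper takes a minimal structural type so it can be exercised without a browser. It does no escaping, so it assumes test ids contain no quote characters.

```typescript
// Minimal structural type standing in for `document`.
interface QueryRoot {
  querySelectorAll(selector: string): ArrayLike<unknown>;
}

// Builds an attribute selector for a data-testid value.
function byTestId(id: string): string {
  return `[data-testid="${id}"]`;
}

// Counts matching elements. In a spec the same query would run inside the
// webview, for example:
//   const count = await browser.execute(
//     (selector) => document.querySelectorAll(selector).length,
//     byTestId('project-box'),
//   );
function countByTestId(root: QueryRoot, id: string): number {
  return root.querySelectorAll(byTestId(id)).length;
}
```

Passing the selector as an argument to `browser.execute()` matters because the callback is run in the app context and cannot close over variables from the spec file.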
### WebDriverIO Config (`wdio.conf.js`)
- **Single session**: `maxInstances: 1` — tauri-driver can't handle parallel sessions
- **Lifecycle**: `onPrepare` builds debug binary, `beforeSession` spawns tauri-driver with TCP readiness probe
- **Timeouts**: 60s per test, 10s `waitforTimeout`, 30s connection retry
- **Skip build**: Set `SKIP_BUILD=1` to reuse existing binary
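The TCP readiness probe mentioned under **Lifecycle** could look roughly like this. It is a sketch, not the code in `wdio.conf.js`: `waitForPort` and its default timings are assumptions, and a connection attempt that neither succeeds nor errors is not handled.

```typescript
import * as net from 'node:net';

// Resolves true once a TCP connection succeeds, or false once timeoutMs has passed.
function waitForPort(
  port: number,
  host = '127.0.0.1',
  timeoutMs = 10_000,
  intervalMs = 200,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve) => {
    const attempt = (): void => {
      const socket = net.createConnection({ port, host });
      socket.once('connect', () => {
        socket.destroy();
        resolve(true);
      });
      socket.once('error', () => {
        socket.destroy();
        if (Date.now() >= deadline) resolve(false);
        else setTimeout(attempt, intervalMs);
      });
    };
    attempt();
  });
}

// Usage in a hook (not executed here), with tauri-driver on port 4444:
//   if (!(await waitForPort(4444))) throw new Error('tauri-driver did not start');
```

Probing the port avoids a fixed sleep after spawning tauri-driver, so the session starts as soon as the driver accepts connections.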
## Troubleshooting
| Problem | Solution |
|---------|----------|
| "Callback was not called before unload" | Stale binary — rebuild with `cargo tauri build --debug --no-bundle` |
| Tests hang on startup | Kill stale `tauri-driver` processes: `pkill -f tauri-driver` |
| All tests skip LLM judge | Install Claude CLI or set `ANTHROPIC_API_KEY` |
| SIGUSR2 / exit code 144 | Stale tauri-driver on port 4444 — kill and retry |
| `AGOR_TEST` leaking to cargo | Run cargo tests with `env -u AGOR_TEST cargo test` |
| No display available | Use `xvfb-run` or ensure X11/Wayland display is set |