
# E2E Testing Facility

Agor's end-to-end testing uses **WebDriverIO + tauri-driver** to drive the real Tauri application through WebKit2GTK's inspector protocol. The facility has three pillars:

1. **Test Fixtures** — isolated fake environments with dummy projects
2. **Test Mode** — app-level env vars that disable watchers and redirect data/config paths
3. **LLM Judge** — Claude-powered semantic assertions for evaluating agent behavior

## Quick Start

```bash
# Run all tests (vitest + cargo + E2E)
npm run test:all:e2e

# Run E2E only (requires pre-built debug binary)
SKIP_BUILD=1 npm run test:e2e

# Build debug binary separately (faster iteration)
cargo tauri build --debug --no-bundle

# Run with LLM judge via CLI (default, auto-detected)
npm run test:e2e

# Force LLM judge to use API instead of CLI
LLM_JUDGE_BACKEND=api ANTHROPIC_API_KEY=sk-... npm run test:e2e
```

## Prerequisites

| Dependency | Purpose | Install |
|-----------|---------|---------|
| Rust + Cargo | Build Tauri backend | [rustup.rs](https://rustup.rs) |
| Node.js 20+ | Frontend + test runner | `mise install node` |
| tauri-driver | WebDriver bridge to WebKit2GTK | `cargo install tauri-driver` |
| X11 display | WebKit2GTK needs a display | Real X, or `xvfb-run` in CI |
| Claude CLI | LLM judge (optional) | [claude.ai/download](https://claude.ai/download) |

## Architecture

```
+-----------------------------------------------------+
| WebDriverIO (mocha runner)                          |
|   specs/*.test.ts                                   |
|     +- browser.execute() -> DOM queries + assertions |
|     +- assertWithJudge() -> LLM semantic evaluation |
+-----------------------------------------------------+
| tauri-driver (port 4444)                            |
|   WebDriver protocol <-> WebKit2GTK inspector        |
+-----------------------------------------------------+
| Agor debug binary                                   |
|   AGOR_TEST=1 (disables watchers, wake scheduler)   |
|   AGOR_TEST_DATA_DIR -> isolated SQLite DBs          |
|   AGOR_TEST_CONFIG_DIR -> test groups.json           |
+-----------------------------------------------------+
```

## Pillar 1: Test Fixtures (`fixtures.ts`)

The fixture generator creates isolated temporary environments so tests never touch real user data. Each fixture includes:

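- **Temp root dir** under `/tmp/{name}-{timestamp}/` (for example `/tmp/agor-e2e-{timestamp}/`), where `{name}` is the name passed to the fixture function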
- **Data dir** — empty, SQLite databases created at runtime
- **Config dir** — contains a generated `groups.json` with test projects
- **Project dir** — a real git repo with `README.md` and `hello.py` (for agent testing)
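
The layout above can be sketched with Node's standard library. This is a minimal sketch, not the actual `fixtures.ts`: the names `sketchFixture` and `removeFixture` and the shape written to `groups.json` are assumptions made for illustration, and the `git init` step is omitted.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

interface SketchFixture {
  rootDir: string;
  dataDir: string;
  configDir: string;
  projectDir: string;
  env: Record<string, string>;
}

// Hypothetical helper: builds the directory layout described above.
function sketchFixture(name: string): SketchFixture {
  const rootDir = path.join(os.tmpdir(), `${name}-${Date.now()}`);
  const dataDir = path.join(rootDir, 'data'); // left empty; DBs appear at runtime
  const configDir = path.join(rootDir, 'config');
  const projectDir = path.join(rootDir, 'test-project');
  for (const dir of [dataDir, configDir, projectDir]) {
    fs.mkdirSync(dir, { recursive: true });
  }

  // Project files named in the docs; the contents here are placeholders.
  fs.writeFileSync(path.join(projectDir, 'README.md'), '# Test Project\n');
  fs.writeFileSync(path.join(projectDir, 'hello.py'), 'print("hello")\n');

  // Assumed groups.json shape; the real schema may differ.
  const groups = {
    groups: [{ id: 'test', projects: [{ name: 'test-project', cwd: projectDir }] }],
  };
  fs.writeFileSync(path.join(configDir, 'groups.json'), JSON.stringify(groups, null, 2));

  return {
    rootDir,
    dataDir,
    configDir,
    projectDir,
    env: { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: dataDir, AGOR_TEST_CONFIG_DIR: configDir },
  };
}

// Hypothetical cleanup counterpart: removes the whole temp tree.
function removeFixture(fixture: SketchFixture): void {
  fs.rmSync(fixture.rootDir, { recursive: true, force: true });
}
```

Keeping everything under one root directory means cleanup is a single recursive delete, which is why the sketch returns `rootDir` alongside the individual paths.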

### Single-Project Fixture

```typescript
import { createTestFixture, destroyTestFixture } from '../fixtures';

const fixture = createTestFixture('my-test');
// fixture.rootDir    -> /tmp/my-test-1710234567890/
// fixture.dataDir    -> /tmp/my-test-1710234567890/data/
// fixture.configDir  -> /tmp/my-test-1710234567890/config/
// fixture.projectDir -> /tmp/my-test-1710234567890/test-project/
// fixture.env        -> { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: '...', ... }

destroyTestFixture(fixture);
```

### Multi-Project Fixture

```typescript
import { createMultiProjectFixture } from '../fixtures';
const fixture = createMultiProjectFixture(3); // 3 separate git repos
```

### Fixture Environment Variables

| Variable | Effect |
|----------|--------|
| `AGOR_TEST=1` | Disables file watchers and the wake scheduler; makes `is_test_mode` return true |
| `AGOR_TEST_DATA_DIR` | Redirects `sessions.db` and `btmsg.db` storage |
| `AGOR_TEST_CONFIG_DIR` | Redirects `groups.json` config loading |
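
The precedence these variables imply can be illustrated as follows. The real resolution happens in the Rust backend; this TypeScript sketch only mirrors the idea. `resolvePaths` and its `defaults` argument are hypothetical, and it assumes an override applies whenever the variable is set to a non-empty value.

```typescript
type Env = Record<string, string | undefined>;

interface ResolvedPaths {
  testMode: boolean;
  dataDir: string;
  configDir: string;
}

// Hypothetical: picks the override when present, otherwise the default path.
function resolvePaths(
  env: Env,
  defaults: { dataDir: string; configDir: string },
): ResolvedPaths {
  return {
    testMode: env.AGOR_TEST === '1',
    dataDir: env.AGOR_TEST_DATA_DIR || defaults.dataDir,
    configDir: env.AGOR_TEST_CONFIG_DIR || defaults.configDir,
  };
}
```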

## Pillar 2: Test Mode

When `AGOR_TEST=1` is set:

- **Rust backend**: `watcher.rs` and `fs_watcher.rs` skip file watchers
- **Frontend**: the `is_test_mode` Tauri command returns `true`, and the wake scheduler is disabled via `disableWakeScheduler()`
- **Data isolation**: `AGOR_TEST_DATA_DIR` / `AGOR_TEST_CONFIG_DIR` override default paths

The WebDriverIO config (`wdio.conf.js`) passes these env vars via `tauri:options.env` in capabilities.
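
As a rough illustration of that wiring, a capabilities entry could be assembled from a fixture's `env` like this. `buildCapabilities` is a hypothetical helper and the binary path in the usage note is a placeholder; the actual `wdio.conf.js` may structure this differently.

```typescript
interface TauriCapability {
  'tauri:options': {
    application: string; // path to the debug binary
    env: Record<string, string>; // test-mode variables from the fixture
  };
}

// Hypothetical helper: wraps the binary path and fixture env in a capabilities array.
function buildCapabilities(
  application: string,
  env: Record<string, string>,
): TauriCapability[] {
  // Copy env so later changes to the fixture object do not alter the capabilities.
  return [{ 'tauri:options': { application, env: { ...env } } }];
}
```

For example, `buildCapabilities('<path-to-debug-binary>', fixture.env)` would yield a single entry carrying `AGOR_TEST`, `AGOR_TEST_DATA_DIR`, and `AGOR_TEST_CONFIG_DIR`.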

## Pillar 3: LLM Judge (`llm-judge.ts`)

The LLM judge enables semantic assertions — evaluating whether agent output "looks right" rather than exact string matching.

### Dual Backend

| Backend | How it works | Requires |
|---------|-------------|----------|
| `cli` (default) | Spawns `claude` CLI with `--output-format text` | Claude CLI installed |
| `api` | Raw `fetch` to `https://api.anthropic.com/v1/messages` | `ANTHROPIC_API_KEY` env var |

**Auto-detection order**: CLI first -> API fallback -> skip test.
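
The detection order can be expressed as a small pure function. This is a sketch under stated assumptions, not the code in `llm-judge.ts`: `detectBackend` is a hypothetical name, the caller is assumed to have already checked whether the `claude` binary is installed, and returning `null` when a forced backend is unavailable is a guess at the behavior.

```typescript
type JudgeBackend = 'cli' | 'api';

// Hypothetical: returns the backend to use, or null when the test should skip.
function detectBackend(
  env: Record<string, string | undefined>,
  cliInstalled: boolean,
): JudgeBackend | null {
  const forced = env.LLM_JUDGE_BACKEND;
  // An explicit override wins, but only if its requirement is met (assumption).
  if (forced === 'cli') return cliInstalled ? 'cli' : null;
  if (forced === 'api') return env.ANTHROPIC_API_KEY ? 'api' : null;

  // Auto-detection: CLI first, then API, otherwise skip.
  if (cliInstalled) return 'cli';
  if (env.ANTHROPIC_API_KEY) return 'api';
  return null;
}
```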

### API

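```typescript
import { isJudgeAvailable, judge, assertWithJudge } from '../llm-judge';

// Use a regular function (not an arrow) so Mocha's `this.skip()` is available.
it('lists files in the project directory', async function () {
  if (!isJudgeAvailable()) { this.skip(); return; }

  // `actualOutput` is the agent output text captured earlier in the test.
  const verdict = await judge(
    'The output should contain a file listing with at least one filename',
    actualOutput,
    'Agent was asked to list files in a directory containing README.md',
  );
  // verdict: { pass: boolean, reasoning: string, confidence: number }
});
```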
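
One way to turn a verdict into a hard assertion is sketched below. The signature of the real `assertWithJudge()` is not documented here, so `checkVerdict` and `assertVerdict` are hypothetical stand-ins, and the 0 to 1 confidence scale and the `0.7` default threshold are assumptions.

```typescript
interface JudgeVerdict {
  pass: boolean;
  reasoning: string;
  confidence: number; // assumed to be in the range 0..1
}

type JudgeFn = (criteria: string, actual: string, context?: string) => Promise<JudgeVerdict>;

// Hypothetical: throws when the verdict fails or is below the confidence threshold.
function checkVerdict(verdict: JudgeVerdict, minConfidence = 0.7): void {
  if (!verdict.pass) {
    throw new Error(`LLM judge failed: ${verdict.reasoning}`);
  }
  if (verdict.confidence < minConfidence) {
    throw new Error(
      `LLM judge passed with low confidence (${verdict.confidence} < ${minConfidence}): ${verdict.reasoning}`,
    );
  }
}

// Hypothetical wrapper: runs a judge function and asserts on its verdict.
async function assertVerdict(
  judgeFn: JudgeFn,
  criteria: string,
  actual: string,
  context?: string,
  minConfidence = 0.7,
): Promise<JudgeVerdict> {
  const verdict = await judgeFn(criteria, actual, context);
  checkVerdict(verdict, minConfidence);
  return verdict;
}
```

Including the judge's `reasoning` in the error message keeps failures diagnosable from the test report alone.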

## Test Spec Files

| File | Phase | Tests | Focus |
|------|-------|-------|-------|
| `agor.test.ts` | Smoke | ~50 | Basic UI rendering, CSS class selectors |
| `phase-a-structure.test.ts` | A | 12 | Structural integrity + settings (Scenarios 1-2) |
| `phase-a-agent.test.ts` | A | 15 | Agent pane + prompt submission (Scenarios 3+7) |
| `phase-a-navigation.test.ts` | A | 15 | Terminal tabs + palette + focus (Scenarios 4-6) |
| `phase-b.test.ts` | B | ~15 | Multi-project grid, LLM-judged agent responses |
| `phase-c.test.ts` | C | 27 | Hardening features (palette, search, notifications, keyboard, settings, health, metrics, context, files) |

## Test Results Tracking (`results-db.ts`)

A lightweight JSON store for tracking test runs and individual step results. Writes to `test-results/results.json`.
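
A store of this kind might look like the following sketch. It does not reproduce `results-db.ts`: the `RunRecord` and `StepResult` shapes and the `loadRuns` and `recordStep` functions are hypothetical, chosen only to show the read, modify, write pattern on a single JSON file.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

interface StepResult {
  name: string;
  status: 'passed' | 'failed' | 'skipped';
  durationMs: number;
}

interface RunRecord {
  runId: string;
  startedAt: string; // ISO 8601 timestamp
  steps: StepResult[];
}

// Hypothetical: returns all recorded runs, or an empty list if the file is absent.
function loadRuns(file: string): RunRecord[] {
  if (!fs.existsSync(file)) return [];
  return JSON.parse(fs.readFileSync(file, 'utf8')) as RunRecord[];
}

// Hypothetical: appends a step to the given run, creating the run on first use.
function recordStep(file: string, runId: string, step: StepResult): void {
  const runs = loadRuns(file);
  let run = runs.find((r) => r.runId === runId);
  if (!run) {
    run = { runId, startedAt: new Date().toISOString(), steps: [] };
    runs.push(run);
  }
  run.steps.push(step);
  fs.mkdirSync(path.dirname(file), { recursive: true });
  fs.writeFileSync(file, JSON.stringify(runs, null, 2));
}
```

Rewriting the whole file on every step is simple and adequate for a single-session runner, though it would not be safe with concurrent writers.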

## CI Integration (`.github/workflows/e2e.yml`)

1. **Unit tests** — `npm run test` (vitest)
2. **Cargo tests** — `cargo test` (with `env -u AGOR_TEST` to prevent env leakage)
3. **E2E tests** — `xvfb-run npm run test:e2e` (virtual framebuffer for headless WebKit2GTK)

LLM-judged tests are gated on the `ANTHROPIC_API_KEY` secret — they skip gracefully in forks.

## Writing New Tests

1. Pick the appropriate spec file (or create a new phase file)
2. Use `data-testid` selectors where possible
3. For DOM queries, use `browser.execute()` to run JS in the app context
4. For semantic assertions, use `assertWithJudge()` with clear criteria
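
For steps 2 and 3, a small selector helper keeps `data-testid` lookups consistent. `byTestId` is a hypothetical helper, not part of the existing suite, and the test id in the usage comment is a placeholder.

```typescript
// Hypothetical helper: builds a CSS attribute selector for a data-testid value,
// escaping quotes and backslashes so the selector stays valid.
function byTestId(id: string): string {
  const escaped = id.replace(/["\\]/g, '\\$&');
  return `[data-testid="${escaped}"]`;
}

// Usage inside a spec (`browser` is the WebdriverIO global; 'agent-pane' is a
// placeholder id):
//
//   const count = await browser.execute(
//     (sel) => document.querySelectorAll(sel).length,
//     byTestId('agent-pane'),
//   );
//   expect(count).toBeGreaterThan(0);
```

Passing the selector as an argument to `browser.execute()` matters because the callback is serialized and run in the app's webview, so it cannot close over variables from the spec file.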

### WebDriverIO Config (`wdio.conf.js`)

- **Single session**: `maxInstances: 1` — tauri-driver can't handle parallel sessions
- **Lifecycle**: `onPrepare` builds debug binary, `beforeSession` spawns tauri-driver with TCP readiness probe
- **Timeouts**: 60s per test (Mocha timeout), 10s `waitforTimeout`, 30s `connectionRetryTimeout`
- **Skip build**: Set `SKIP_BUILD=1` to reuse existing binary
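
A TCP readiness probe of the kind mentioned under **Lifecycle** can be sketched as below. This is an illustration, not the code in `wdio.conf.js`: `waitForPort` and its default timings are assumptions, and it assumes a local port where a refused connection fails quickly.

```typescript
import * as net from 'node:net';

// Hypothetical probe: resolves true once a TCP connection to host:port succeeds,
// or false if the deadline passes without a successful connection.
function waitForPort(
  port: number,
  host = '127.0.0.1',
  timeoutMs = 10000,
  intervalMs = 200,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve) => {
    const attempt = (): void => {
      const socket = net.connect({ port, host });
      socket.once('connect', () => {
        socket.destroy();
        resolve(true);
      });
      socket.once('error', () => {
        socket.destroy();
        if (Date.now() >= deadline) {
          resolve(false);
        } else {
          setTimeout(attempt, intervalMs); // retry until the deadline
        }
      });
    };
    attempt();
  });
}
```

In a `beforeSession` hook this would be awaited right after spawning `tauri-driver`, for example `await waitForPort(4444)`, so that the first WebDriver request is not sent before the driver is listening.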

## Troubleshooting

| Problem | Solution |
|---------|----------|
| "Callback was not called before unload" | Stale binary — rebuild with `cargo tauri build --debug --no-bundle` |
| Tests hang on startup | Kill stale `tauri-driver` processes: `pkill -f tauri-driver` |
| All tests skip LLM judge | Install Claude CLI or set `ANTHROPIC_API_KEY` |
| SIGUSR2 / exit code 144 | Stale tauri-driver on port 4444 — kill and retry |
| `AGOR_TEST` leaking to cargo | Run cargo tests with `env -u AGOR_TEST cargo test` |
| No display available | Use `xvfb-run`, or ensure `DISPLAY` (X11) or `WAYLAND_DISPLAY` (Wayland) is set |