# E2E Testing Facility

Agor's end-to-end testing uses **WebDriverIO + tauri-driver** to drive the real Tauri application through WebKit2GTK's inspector protocol. The facility has three pillars:

1. **Test Fixtures** — isolated fake environments with dummy projects
2. **Test Mode** — app-level env vars that disable watchers and redirect data/config paths
3. **LLM Judge** — Claude-powered semantic assertions for evaluating agent behavior

## Quick Start

```bash
# Run all tests (vitest + cargo + E2E)
npm run test:all:e2e

# Run E2E only (requires a pre-built debug binary)
SKIP_BUILD=1 npm run test:e2e

# Build the debug binary separately (faster iteration)
cargo tauri build --debug --no-bundle

# Run with the LLM judge via CLI (default, auto-detected)
npm run test:e2e

# Force the LLM judge to use the API instead of the CLI
LLM_JUDGE_BACKEND=api ANTHROPIC_API_KEY=sk-... npm run test:e2e
```

## Prerequisites

| Dependency | Purpose | Install |
|------------|---------|---------|
| Rust + Cargo | Build the Tauri backend | [rustup.rs](https://rustup.rs) |
| Node.js 20+ | Frontend + test runner | `mise install node` |
| tauri-driver | WebDriver bridge to WebKit2GTK | `cargo install tauri-driver` |
| X11 display | WebKit2GTK needs a display | Real X, or `xvfb-run` in CI |
| Claude CLI | LLM judge (optional) | [claude.ai/download](https://claude.ai/download) |

## Architecture

```
+----------------------------------------------------+
| WebDriverIO (mocha runner)                         |
| specs/*.test.ts                                    |
|   +- browser.execute() -> DOM queries + assertions |
|   +- assertWithJudge() -> LLM semantic evaluation  |
+----------------------------------------------------+
| tauri-driver (port 4444)                           |
| WebDriver protocol <-> WebKit2GTK inspector        |
+----------------------------------------------------+
| Agor debug binary                                  |
| AGOR_TEST=1 (disables watchers, wake scheduler)    |
| AGOR_TEST_DATA_DIR   -> isolated SQLite DBs        |
| AGOR_TEST_CONFIG_DIR -> test groups.json           |
+----------------------------------------------------+
```

## Pillar 1: Test Fixtures (`fixtures.ts`)

The fixture generator creates isolated temporary environments so tests never touch real user data. Each fixture includes:

- **Temp root dir** under `/tmp/agor-e2e-{timestamp}/`
- **Data dir** — empty, SQLite databases created at runtime
- **Config dir** — contains a generated `groups.json` with test projects
- **Project dir** — a real git repo with `README.md` and `hello.py` (for agent testing)

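A factory that produces this layout can be sketched as follows. This is an illustrative sketch, not the actual `fixtures.ts` code: the `groups.json` shape and the helper names `createFixtureSketch` / `destroyFixtureSketch` are assumptions made for the example.

```typescript
import { mkdirSync, writeFileSync, rmSync } from 'node:fs';
import { execSync } from 'node:child_process';
import { join } from 'node:path';
import { tmpdir } from 'node:os';

interface TestFixture {
  rootDir: string;
  dataDir: string;
  configDir: string;
  projectDir: string;
  env: Record<string, string>;
}

function createFixtureSketch(name: string): TestFixture {
  // timestamped temp root so parallel runs never collide
  const rootDir = join(tmpdir(), `${name}-${Date.now()}`);
  const dataDir = join(rootDir, 'data');
  const configDir = join(rootDir, 'config');
  const projectDir = join(rootDir, 'test-project');
  for (const dir of [dataDir, configDir, projectDir]) {
    mkdirSync(dir, { recursive: true });
  }

  // groups.json pointing the app at the test project
  // (the shape of this file is assumed for illustration)
  writeFileSync(
    join(configDir, 'groups.json'),
    JSON.stringify({ groups: [{ name, projects: [projectDir] }] }, null, 2),
  );

  // a real git repo with files for the agent to inspect
  writeFileSync(join(projectDir, 'README.md'), `# ${name}\n`);
  writeFileSync(join(projectDir, 'hello.py'), 'print("hello")\n');
  execSync('git init -q', { cwd: projectDir });

  return {
    rootDir,
    dataDir,
    configDir,
    projectDir,
    env: {
      AGOR_TEST: '1',
      AGOR_TEST_DATA_DIR: dataDir,
      AGOR_TEST_CONFIG_DIR: configDir,
    },
  };
}

function destroyFixtureSketch(fixture: TestFixture): void {
  rmSync(fixture.rootDir, { recursive: true, force: true });
}
```

Everything lives under one disposable root, so teardown is a single recursive delete.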
### Single-Project Fixture

```typescript
import { createTestFixture, destroyTestFixture } from '../fixtures';

const fixture = createTestFixture('my-test');
// fixture.rootDir    -> /tmp/my-test-1710234567890/
// fixture.dataDir    -> /tmp/my-test-1710234567890/data/
// fixture.configDir  -> /tmp/my-test-1710234567890/config/
// fixture.projectDir -> /tmp/my-test-1710234567890/test-project/
// fixture.env        -> { AGOR_TEST: '1', AGOR_TEST_DATA_DIR: '...', ... }

destroyTestFixture(fixture);
```

### Multi-Project Fixture

```typescript
import { createMultiProjectFixture } from '../fixtures';

const fixture = createMultiProjectFixture(3); // 3 separate git repos
```

### Fixture Environment Variables

| Variable | Effect |
|----------|--------|
| `AGOR_TEST=1` | Disables file watchers and the wake scheduler; makes `is_test_mode` report true |
| `AGOR_TEST_DATA_DIR` | Redirects `sessions.db` and `btmsg.db` storage |
| `AGOR_TEST_CONFIG_DIR` | Redirects `groups.json` config loading |

## Pillar 2: Test Mode

When `AGOR_TEST=1` is set:

- **Rust backend**: `watcher.rs` and `fs_watcher.rs` skip starting file watchers
- **Frontend**: the `is_test_mode` Tauri command returns true, and the wake scheduler is disabled via `disableWakeScheduler()`
- **Data isolation**: `AGOR_TEST_DATA_DIR` / `AGOR_TEST_CONFIG_DIR` override the default paths

The WebDriverIO config (`wdio.conf.js`) passes these env vars to the app via `tauri:options.env` in the session capabilities.

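The relevant capability fragment could look roughly like this. It is a sketch under stated assumptions: the binary path is hypothetical and the real `wdio.conf.js` may differ in shape.

```typescript
// Fixture env vars are forwarded to the app process through the
// tauri:options capability that tauri-driver reads.
const fixtureEnv = {
  AGOR_TEST: '1',
  AGOR_TEST_DATA_DIR: '/tmp/agor-e2e-1710234567890/data',
  AGOR_TEST_CONFIG_DIR: '/tmp/agor-e2e-1710234567890/config',
};

const capabilities = [
  {
    maxInstances: 1, // tauri-driver supports a single session
    'tauri:options': {
      application: './src-tauri/target/debug/app-under-test', // hypothetical path
      env: fixtureEnv,
    },
  },
];
```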
## Pillar 3: LLM Judge (`llm-judge.ts`)

The LLM judge enables semantic assertions — evaluating whether agent output "looks right" rather than requiring exact string matches.

### Dual Backend

| Backend | How it works | Requires |
|---------|--------------|----------|
| `cli` (default) | Spawns the `claude` CLI with `--output-format text` | Claude CLI installed |
| `api` | Raw `fetch` to `https://api.anthropic.com/v1/messages` | `ANTHROPIC_API_KEY` env var |

**Auto-detection order**: CLI first -> API fallback -> skip the test if neither is available.

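That detection order might be implemented along these lines. This is a sketch of the idea only; the actual logic in `llm-judge.ts` may differ.

```typescript
import { spawnSync } from 'node:child_process';

type JudgeBackend = 'cli' | 'api' | null;

function detectBackend(env: NodeJS.ProcessEnv = process.env): JudgeBackend {
  // an explicit LLM_JUDGE_BACKEND override wins
  if (env.LLM_JUDGE_BACKEND === 'api') {
    return env.ANTHROPIC_API_KEY ? 'api' : null;
  }
  if (env.LLM_JUDGE_BACKEND === 'cli') return 'cli';

  // otherwise: CLI first, API key fallback, else unavailable (tests skip)
  const probe = spawnSync('claude', ['--version'], { stdio: 'ignore' });
  if (probe.status === 0) return 'cli';
  if (env.ANTHROPIC_API_KEY) return 'api';
  return null;
}
```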
### API

```typescript
import { isJudgeAvailable, judge, assertWithJudge } from '../llm-judge';

if (!isJudgeAvailable()) { this.skip(); return; }

const verdict = await judge(
  'The output should contain a file listing with at least one filename',
  actualOutput,
  'Agent was asked to list files in a directory containing README.md',
);
// verdict: { pass: boolean, reasoning: string, confidence: number }
```

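`assertWithJudge()` builds on the same verdict shape: it fails the test when the judge says no. Here is a plausible sketch, using a stub judge so it runs standalone; the real helper's signature and behavior may differ.

```typescript
interface Verdict {
  pass: boolean;
  reasoning: string;
  confidence: number;
}

// stand-in for the real judge() call so this sketch needs no backend
async function stubJudge(criteria: string, actual: string): Promise<Verdict> {
  return { pass: actual.length > 0, reasoning: 'stub verdict', confidence: 1 };
}

async function assertWithJudgeSketch(criteria: string, actual: string): Promise<void> {
  const verdict = await stubJudge(criteria, actual);
  if (!verdict.pass) {
    throw new Error(`LLM judge rejected output: ${verdict.reasoning}`);
  }
}
```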
## Test Spec Files

| File | Phase | Tests | Focus |
|------|-------|-------|-------|
| `agor.test.ts` | Smoke | ~50 | Basic UI rendering, CSS class selectors |
| `phase-a-structure.test.ts` | A | 12 | Structural integrity + settings (Scenarios 1-2) |
| `phase-a-agent.test.ts` | A | 15 | Agent pane + prompt submission (Scenarios 3+7) |
| `phase-a-navigation.test.ts` | A | 15 | Terminal tabs + palette + focus (Scenarios 4-6) |
| `phase-b.test.ts` | B | ~15 | Multi-project grid, LLM-judged agent responses |
| `phase-c.test.ts` | C | 27 | Hardening features (palette, search, notifications, keyboard, settings, health, metrics, context, files) |

## Test Results Tracking (`results-db.ts`)

A lightweight JSON store for tracking test runs and individual step results. Writes to `test-results/results.json`.

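A store like that can be as small as an append-to-JSON helper. The following sketch illustrates the idea; the record shapes and function name are assumptions, not the actual `results-db.ts` schema.

```typescript
import { readFileSync, writeFileSync, existsSync, mkdirSync } from 'node:fs';
import { dirname } from 'node:path';

interface StepResult {
  name: string;
  pass: boolean;
  durationMs: number;
}

interface TestRun {
  startedAt: string;
  steps: StepResult[];
}

function recordRun(run: TestRun, path = 'test-results/results.json'): TestRun[] {
  // load existing runs (if any), append the new one, write back
  const runs: TestRun[] = existsSync(path)
    ? JSON.parse(readFileSync(path, 'utf8'))
    : [];
  runs.push(run);
  mkdirSync(dirname(path), { recursive: true });
  writeFileSync(path, JSON.stringify(runs, null, 2));
  return runs;
}
```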
## CI Integration (`.github/workflows/e2e.yml`)

1. **Unit tests** — `npm run test` (vitest)
2. **Cargo tests** — `cargo test` (run with `env -u AGOR_TEST` to prevent env leakage)
3. **E2E tests** — `xvfb-run npm run test:e2e` (virtual framebuffer for headless WebKit2GTK)

LLM-judged tests are gated on the `ANTHROPIC_API_KEY` secret — they skip gracefully in forks.

## Writing New Tests

1. Pick the appropriate spec file (or create a new phase file)
2. Use `data-testid` selectors where possible
3. For DOM queries, use `browser.execute()` to run JS in the app context
4. For semantic assertions, use `assertWithJudge()` with clear criteria

### WebDriverIO Config (`wdio.conf.js`)

- **Single session**: `maxInstances: 1` — tauri-driver can't handle parallel sessions
- **Lifecycle**: `onPrepare` builds the debug binary; `beforeSession` spawns tauri-driver with a TCP readiness probe
- **Timeouts**: 60s per test, 10s waitfor timeout, 30s connection retry
- **Skip build**: set `SKIP_BUILD=1` to reuse an existing binary

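The TCP readiness probe could look like this. It is a sketch of the idea only; the real `beforeSession` hook may poll differently.

```typescript
import { connect } from 'node:net';

// Poll a TCP port until something accepts the connection or we time out.
function waitForPort(port: number, timeoutMs = 10_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve, reject) => {
    const attempt = () => {
      const socket = connect({ port, host: '127.0.0.1' });
      socket.once('connect', () => {
        socket.end();
        resolve();
      });
      socket.once('error', () => {
        socket.destroy();
        if (Date.now() > deadline) {
          reject(new Error(`port ${port} not ready after ${timeoutMs}ms`));
        } else {
          setTimeout(attempt, 200); // retry shortly
        }
      });
    };
    attempt();
  });
}
```

In a config like this, `beforeSession` would spawn tauri-driver and then `await waitForPort(4444)` before letting the session start.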
## Troubleshooting

| Problem | Solution |
|---------|----------|
| "Callback was not called before unload" | Stale binary — rebuild with `cargo tauri build --debug --no-bundle` |
| Tests hang on startup | Kill stale `tauri-driver` processes: `pkill -f tauri-driver` |
| All tests skip the LLM judge | Install the Claude CLI or set `ANTHROPIC_API_KEY` |
| SIGUSR2 / exit code 144 | Stale tauri-driver on port 4444 — kill and retry |
| `AGOR_TEST` leaking into cargo tests | Run them with `env -u AGOR_TEST cargo test` |
| No display available | Use `xvfb-run` or ensure an X11/Wayland display is set |