feat(parallel): TIER 6 parallel-flow dispatcher — S7 spike drop-in #3
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "feat/s7-parallel-flow-dispatch"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Drops in the S7 spike (meta-repo
docs/spikes/S7-results.md— CONDITIONAL GO 2026-05-15) parallel-flow dispatcher into im2be-qa-rig asim2be_qa_rig.paralleland exposes it as the newim2be-qa-rig run-setsubcommand.Architecture: subprocess-per-flow under asyncio.gather + asyncio.Semaphore(N=4 default). Each subprocess runs a normal
im2be-qa-rig run --spec ...with its own Python interpreter,sync_playwright()instance, Browser, and BrowserContext — Playwright's documented isolation primitive.Why subprocess + asyncio
run_browser_flowuses the sync API. Subprocess isolation sidesteps thread-safety AND keeps the single-flow runner unchanged.asyncio.create_subprocess_execgives live stderr piping for progress observability + clean SIGINT propagation, whichmultiprocessing.Pooldoes not.Contamination sources mitigated
The S7 spike enumerated 7 contamination sources (run-id isolation, per-flow run dir, persona collision detection, sidecar port allocation, etc.) — all mitigated in the drop-in.
Files
src/im2be_qa_rig/parallel.py(~340 LOC) — the dispatcher.src/im2be_qa_rig/cli.py— wires the newrun-setsubcommand.tests/test_parallel_dispatch.py— 12 offline unit tests covering aggregator shape, semaphore cap, failure isolation, persona-collision warnings, per-flow run-dir creation, sidecar port allocation.tests/_stub_worker.py— stub subprocess used by the test suite (no real Playwright invocation).run-set.example.toml— documented [run_set] + flows schema at repo root.Local-fix vs spike prototype
parallelism_efficiencynow floors wall_clock at 1ms so the ratio is always computable. The prototype'sint(...)cast truncated to 0 for sub-1ms stub-test windows, producing None. Real flows run 30-300s; the floor is operationally invisible.Test plan
pytest tests/test_parallel_dispatch.pygreen offline.docs/spikes/S7-prototype/live-run-procedure.md— separate follow-up once L0 substrate activation lands.References
docs/spikes/S7-results.md— full S7 verdict.docs/spikes/S7-prototype/qa-rig-parallel-dispatch.md— design rationale + falsifications.docs/spikes/S7-prototype/live-run-procedure.md— 2/5/10-flow validation procedure for after merge.Show previous round
Four actionable findings. One blocking correctness bug (PEP 479
StopIterationpromotion), one major reliability gap (asyncio.gatherwithoutreturn_exceptions=True), one major operational gap (no subprocess timeout), and one minor docs inconsistency in the TOML key name.Findings (4)
[BLOCKING] StopIteration inside async def is promoted to RuntimeError (PEP 479) — except clause never fires
src/im2be_qa_rig/parallel.py:259
In
_run_one, the lineraises
StopIterationwhen noreport.jsonexists. Since Python 3.7 (PEP 479), anyStopIterationthat escapes a generator expression inside anasync defis automatically re-raised asRuntimeError. The surroundingexcept (StopIteration, json.JSONDecodeError) as exc:clause therefore never matches it — theRuntimeErrorpropagates out of_run_oneunhandled.Because
asyncio.gather(see finding #2) is called withoutreturn_exceptions=True, this unhandled exception immediately cancels every other still-running task, making the run-set fail entirely whenever any subprocess finishes without areport.json.Fix: Replace the
next(…)+except StopIterationpattern with the two-argument form that returns a sentinel on exhaustion:This eliminates the generator/StopIteration interaction entirely and makes the
json.JSONDecodeErrorhandler still reachable.[MAJOR] asyncio.gather without return_exceptions=True — any unexpected exception cancels all running tasks
src/im2be_qa_rig/parallel.py:313
With
return_exceptions=False(the default),asyncio.gatherre-raises the first exception it receives and cancels all sibling tasks. The subprocess processes themselves keep running, but their coroutines are cancelled — theirFlowDispatchResultobjects are never produced, and subprocesses'stdout/stderrlogs are never written.Examples that reach this path even after fixing finding #1:
FileNotFoundErrorifsys.executableis wrong in a CI container, anyOSErrorfromasyncio.create_subprocess_exec, or an unexpected exception inside the directory-creation block.Fix:
[MAJOR] proc.communicate() has no timeout — a hung subprocess blocks the run-set forever
src/im2be_qa_rig/parallel.py:236
If the child
im2be-qa-rig runprocess hangs (e.g., Playwright waiting for a navigation that never completes, or a browser zombie),_run_onenever returns. It holds its semaphore slot for the entire lifetime of the run andasyncio.gathernever completes. The only escape is an external SIGKILL.Fix: wrap
proc.communicate()withasyncio.wait_forand a configurable timeout. A reasonable default for a real flow run is 10–15 minutes; expose it as an optionalflow_timeout_secondsparameter ondispatch_run_set/run_set_mainand pass it through:[MINOR] --concurrency help text names the wrong TOML key
src/im2be_qa_rig/cli.py:349
The referenced key
[run_set].default_concurrencydoes not exist in the config format.load_run_setinparallel.pyreadsraw.get("meta", {}).get("concurrency", 4), so the actual TOML path is[meta].concurrency— matching therun-set.example.tomlthat ships with this PR.Fix:
round 1 • 2026-05-15T22:05:55.740Z → 2026-05-15T22:09:25.366Z
UU [BLOCKING]: _run_one used `next(generator)` inside async def to find the per-flow report.json. PEP 479 (Python 3.7+) promotes StopIteration escaping a generator inside async def to RuntimeError, bypassing the `except StopIteration` arm entirely. The unhandled exception then escaped _run_one and (because of VV) cancelled the entire run-set. Replaced with `next(..., None)` sentinel form so the missing-report case is handled without triggering PEP 479 promotion. The `json.JSONDecodeError` arm is now reachable on malformed report.json. VV [MAJOR]: `asyncio.gather(*tasks)` ran with `return_exceptions=False` (default), so any unexpected exception in one task (e.g. FileNotFoundError from create_subprocess_exec, OSError from create_subprocess_exec) would re-raise immediately AND cancel every sibling task, leaving subprocesses running but their result coroutines cancelled. Switched to `return_exceptions=True` + per-task BaseException → FlowDispatchResult fallback so the run-set always returns one result per spec. WW [MAJOR]: `proc.communicate()` had no timeout — a Playwright-hung subprocess would hold its semaphore slot forever, blocking the run-set indefinitely. Added `asyncio.wait_for(proc.communicate(), timeout=...)` with default 900s (15min, well above 30-300s typical flow runtime). On timeout: kill the child, drain its pipes, mark the flow failed with typed `timeout after Ns` error. New `flow_timeout_seconds` parameter on `dispatch_run_set` + `run_set_main` + CLI `--flow-timeout-seconds`. Validation: must be > 0 or None. XX [MINOR]: CLI `--concurrency` help text said `[run_set].default_concurrency` but the actual TOML path is `[meta].concurrency` (load_run_set reads raw.get("meta", {}).get("concurrency", 4)). Help string corrected to match the example run-set.toml. New tests: - test_dispatch_flow_timeout_validated: rejects flow_timeout_seconds <= 0. - test_dispatch_flow_timeout_kills_hung_subprocess: hung sibling timed out + killed; fast siblings still complete (proves WW + VV interact correctly). - test_dispatch_return_exceptions_isolates_per_task_failure: simulated FileNotFoundError on one task does NOT cancel siblings (proves VV). 77/77 pytest + tsc clean.Round 2: all four round-1 findings remain unaddressed in this diff. One new minor finding added (misleading error message that conflates two distinct failure modes). No issues were introduced that weren't already flagged, but the three unresolved correctness/reliability issues in
parallel.py(StopIteration→RuntimeError, bare gather, no timeout) interact badly: finding #1 produces a RuntimeError that finding #2 turns into a total run-set cancellation.Findings (5)
[BLOCKING] [Round 1 still open] StopIteration inside async def promoted to RuntimeError by PEP 479 — except clause is dead code
src/im2be_qa_rig/parallel.py:259
Under PEP 479 (mandatory since Python 3.7), a
StopIterationthat escapes a generator inside a coroutine is re-raised asRuntimeError. The generator expression(p for p in flow_dir.glob("*/report.json"))passed tonext()raisesStopIterationinternally when no file matches; Python converts it toRuntimeErrorbefore it can reach theexcept (StopIteration, json.JSONDecodeError)handler. The except clause is therefore dead for the no-report case, and theRuntimeErrorpropagates uncaught out of_run_one, then (withoutreturn_exceptions=True) cancels all in-flight tasks.Fix — use the two-argument form of
next()to avoid the generator edge case entirely:[MAJOR] [Round 1 still open] asyncio.gather without return_exceptions=True cancels all in-flight flows on first exception
src/im2be_qa_rig/parallel.py:313
await asyncio.gather(*tasks)withoutreturn_exceptions=True: if any_run_onecoroutine raises (including theRuntimeErrorfrom finding #1 above),asyncio.gatherpropagates that exception immediately and implicitly callsTask.cancel()on every other running task. The run-set silently terminates mid-flight and the top-levelreport.jsonis never written.Fix:
[MAJOR] [Round 1 still open] proc.communicate() has no timeout — a hung subprocess holds its semaphore slot forever
src/im2be_qa_rig/parallel.py:236
stdout, stderr = await proc.communicate()waits indefinitely if the child stalls (e.g., Playwright hanging on an unresponsive app server, or a zombie that never closes its pipes). Theasyncio.Semaphoreslot is held for the entire blocked duration; withconcurrency=4and one hung flow, only 3 slots remain available for the rest of the run-set.Fix — wrap with
asyncio.wait_forand kill on expiry:A reasonable default is
flow_timeout_secs: float = 600.0(10 min), matching the longest observed real-flow duration.[MINOR] [Round 1 still open] --concurrency help text references non-existent TOML key [run_set].default_concurrency
src/im2be_qa_rig/cli.py:349
The help string reads
"override [run_set].default_concurrency (default 4 per S7 spike)"butload_run_setreadsraw["meta"]["concurrency"]. There is no[run_set]table anywhere in the TOML schema —run-set.example.tomlconfirms the key lives under[meta]. A user consulting--helpwill search for the wrong table name.Fix:
[MINOR] Error message 'no report.json produced' is incorrect when JSONDecodeError fires
src/im2be_qa_rig/parallel.py:261
Line 261:
err = f"no report.json produced: {type(exc).__name__}"is emitted for bothStopIteration(file absent) andjson.JSONDecodeError(file present but malformed). When the subprocess did writereport.jsonbut with invalid content, the message misleads: operators and retry heuristics that readFlowDispatchResult.errorcannot distinguish "subprocess never wrote output" from "subprocess wrote garbage".Fix (also addressed structurally by the line-259 finding above, which splits the two branches):
round 2 • 2026-05-15T22:11:25.741Z → 2026-05-15T22:14:53.021Z
Show previous round
Three actionable findings: one blocking correctness regression (
asyncio.gatherwithoutreturn_exceptions=Truewill abort the entire run-set on any single-flow subprocess error), one minor misleading CLI help text (wrong TOML key name), and one minor slug-deduplication bug that can produce two flows sharing the same output directory.Findings (3)
[BLOCKING]
asyncio.gather(*tasks)withoutreturn_exceptions=Trueaborts the whole run-set on any single-flow errorsrc/im2be_qa_rig/parallel.py:324
If any
_run_onecoroutine raises an unhandled exception—OSErrorfromflow_dir.mkdir,FileNotFoundErrorfromcreate_subprocess_exec, or anOSErrorfromwrite_byteson a full disk—asyncio.gatherre-raises the first exception immediately and cancels every other in-flight task. Thefinallyblock releases ports, but the assignment toresultsnever completes, so the code after it (summary computation,report.jsonwrite) is skipped entirely. The exception propagates throughrun_set_mainand through_cmd_run_set(which only catchesRunSetError), crashing the CLI process with a raw traceback.For the 5-flow S7 demo on a shared k3d cluster, any transient infra hiccup in one flow kills the other four mid-run with no evidence written.
Fix:
This keeps all results, writes the run-set
report.jsoneven when some flows error, and lets the exit-code logic downstream decide whether the run-set passed or failed.[MINOR]
--concurrencyhelp text references a TOML key that does not exist in the schemasrc/im2be_qa_rig/cli.py:350
The run-set TOML schema uses
[meta].concurrency(seerun-set.example.tomllines 17-18 andload_run_setwhich readsmeta.get("concurrency", 4)). The section[run_set]and keydefault_concurrencydo not exist anywhere in the spec.A user reading
im2be-qa-rig run-set --helpwill look for[run_set].default_concurrencyin their TOML and not find it.Fix:
[MINOR] Deduplicated slug is not checked against
seen_slugs, allowing two flows to share the same output directorysrc/im2be_qa_rig/parallel.py:193
If another flow earlier in the list already resolved to
"{original_slug}-{i}", the renamed slug silently collides. Concrete example with three TOML entries:Both flows 0 and 2 land in
run_set_dir/foo-2/. The second flow'sstdout.log/stderr.logoverwrites the first's; theglob("*/report.json")call in_run_onefor flow 0 may pick up flow 2's report, corrupting results silently.Fix: loop until a free slug is found:
round 1 • 2026-05-15T22:08:25.741Z → 2026-05-15T22:17:10.489Z
slug-{i}could collide (round-2 AAA)Reviewer round-2 AAA correctly flagged that the dedup rename `slug = f"{slug}-{i}"` could itself collide if an earlier flow's spec name happened to resolve to that suffixed form (e.g. two flows pointing at "foo.toml" + a third at "foo-2.toml" — the third's original slug "foo-2" matches the first's renamed "foo-2", silently sharing an output directory). Replaced the single-shot rename with a monotonic-suffix loop that keeps incrementing until an unused slug is found. Run-set sizes are small (≤100s), so the loop is cheap. Note on rejected findings YY + ZZ (rule 61 verification): - YY (BLOCKING) claimed `asyncio.gather(*tasks)` lacked `return_exceptions=True`. Verified at HEAD parallel.py:369 — `return_exceptions=True` IS present (round-1 VV fix). Reviewer reviewed a stale snapshot. - ZZ (MINOR) claimed CLI help text referenced `[run_set].default_concurrency`. Verified at HEAD cli.py:354 — text is "override [meta].concurrency from run-set.toml" (round-1 XX fix). Reviewer reviewed a stale snapshot. Both findings rejected as factually unsupported per Memora #3137 deflection-ladder + rule 61. 77/77 pytest + pyright clean.Show previous round
Round 2 delta vs. round 1
✅ parallel.py:259 (StopIteration / PEP 479) — FIXED.
next((p for p in …), None)sentinel form with an explanatory comment; theexcept json.JSONDecodeErrorarm is now reachable.✅ parallel.py:236 (proc.communicate() no timeout) — FIXED.
asyncio.wait_for(proc.communicate(), timeout=flow_timeout_seconds)with aTimeoutErrorhandler that kills the child and drains its pipes.✅ cli.py:349 (wrong TOML key in --concurrency help) — FIXED. Help text now correctly names
[meta].concurrency.⚠️ parallel.py:313 (asyncio.gather without return_exceptions=True) — CANNOT VERIFY. The diff is truncated before this section. If the
gathercall was not updated, any single unexpected exception inside a flow coroutine will propagate throughgatherand cancel all in-flight siblings, turning one transient error into a total run-set loss.New actionable finding — one
majorissue introduced in this round (detailed below).Findings (1)
[MAJOR] flow_timeout_seconds=None passed explicitly disables the 900 s guard in dispatch_run_set
src/im2be_qa_rig/cli.py:127
_cmd_run_set unconditionally passes
flow_timeout_seconds=args.flow_timeout_secondstorun_set_main. When the user does not supply--flow-timeout-seconds(the common case),args.flow_timeout_secondsisNone(argparsedefault=None). ThatNoneis then forwarded explicitly todispatch_run_set.Python keyword-argument defaults only apply when a parameter is absent from the call, not when it is explicitly
None.dispatch_run_sethasflow_timeout_seconds: float | None = 900.0— its own docstring says "Pass None to disable" — sodispatch_run_set(..., flow_timeout_seconds=None)disables the timeout entirely, silently dropping the 900 s safety guard on every invocation that omits the flag.This directly contradicts the CLI help text: "None defers to dispatch default (900s = 15min)".
Impact: a hung Playwright subprocess (Chromium zombie, unresolved navigation, dropped backend connection) will hold a semaphore slot indefinitely; the run-set never completes.
Fix — guard the kwarg at the call site so omitting the flag lets
dispatch_run_setpick up its own default:Alternatively, give
run_set_maina distinct sentinel (e.g._UNSET = object()) so thatNonecan be threaded through when the caller genuinely wants to disable the timeout, while omission uses the 900 s default.round 2 • 2026-05-15T22:14:45.538Z → 2026-05-15T22:18:56.630Z
Round 2 review of PR #3 (TIER 6 parallel-flow dispatcher).
Round 1 finding #2 (cli.py:350 help text) — FIXED. The
--concurrencyhelp string now correctly reads"override [meta].concurrency from run-set.toml (default: 4)"andrun-set.example.tomlconfirms[meta].concurrencyexists.Round 1 finding #3 (slug dedup, parallel.py:193) — STILL OPEN. The regenerated slug
f"{slug}-{i}"is written directly toseen_slugswithout a membership check, so a collision against the new value is possible.Round 1 finding #1 (asyncio.gather, parallel.py:324) — UNVERIFIABLE. The unified diff is truncated before the
asyncio.gathercall site; the fix cannot be confirmed or ruled out from the supplied context. The prior finding remains open until verified.One new minor finding:
--skip-setupis unconditionally injected into every childim2be-qa-rig runinvocation with no per-flow override inFlowDispatchSpecor the TOML schema.Findings (3)
[MAJOR] Deduplicated slug still not checked against seen_slugs before insertion (round 1 open)
src/im2be_qa_rig/parallel.py:193
The deduplication branch writes
slug = f"{slug}-{i}"and immediately doesseen_slugs.add(slug)without first testingif slug in seen_slugs. This means two different flow-spec filenames can produce the same final slug: e.g. a spec namedflow-2.toml(slugflow-2, index 0) and a spec namedflow.tomlat index 2 (base slugflow, deduplicated toflow-2) both land onflow-2. They sharerun_set_dir/flow-2/as their output directory; the second flow'sstdout.log,stderr.log, andreport.jsonoverwrite the first's, silently corrupting the evidence archive.Fix: replace the single-shot
ifwith a loop that keeps incrementing until the candidate is unique:[MAJOR] asyncio.gather without return_exceptions=True — cannot confirm fix (round 1 open, diff truncated)
src/im2be_qa_rig/parallel.py:324
The unified diff supplied to this review is truncated before the
asyncio.gather(*tasks)call identified in round 1. The fix (addingreturn_exceptions=True) cannot be confirmed or ruled out. If it is still absent, any single-flow exception (e.g. an unhandledOSErrorfromasyncio.create_subprocess_exec) will cancel all in-flight flows and surface as an unhandled exception rather than a per-flowerrorentry in the run-set report, breaking the demo run-set entirely.Please verify that the call reads
asyncio.gather(*tasks, return_exceptions=True)and that the result list is iterated to populateFlowDispatchResult.errorfor any entries that areBaseExceptioninstances.[MINOR] --skip-setup unconditionally injected into every child subprocess with no per-flow override
src/im2be_qa_rig/parallel.py:245
_run_onehard-codes"--skip-setup"in theargvlist:FlowDispatchSpechas noneeds_setup/skip_setupfield, and the TOML schema (run-set.example.toml) exposes no corresponding key. Any flow spec that provisions fixtures or resets shared state during setup will silently skip that phase in every run-set execution, producing failures that differ from single-flow runs and are hard to attribute. At minimumFlowDispatchSpecshould gain an optionalskip_setup: bool = Truefield (defaulting toTruepreserves current behaviour) and the TOML loader should surface it as a per-flow key.round 2 • 2026-05-15T22:17:15.545Z → 2026-05-15T22:22:25.272Z
Show previous round
Round 3 review —
feat(parallel): TIER 6 parallel-flow dispatcherRound 2 open finding is NOT fixed. The new
_cmd_run_setimplementation still passesflow_timeout_seconds=args.flow_timeout_secondsunconditionally; when the flag is omitted,Noneis forwarded and the 900 s guard indispatch_run_setis silently disabled.One additional actionable finding in the new
_run_onetimeout handler:proc.kill()is called without aProcessLookupErrorguard, which creates a narrow but real race condition.All other round-1 findings (StopIteration/PEP-479 at ~line 259,
proc.communicate()timeout,asyncio.gatherreturn_exceptions) could not be confirmed as fixed or unfixed from the truncated diff; they are not re-reported here.Findings (2)
[MAJOR] Round 2 still open:
flow_timeout_seconds=Noneunconditionally forwarded, disabling the 900 s safety guardsrc/im2be_qa_rig/cli.py:130
_cmd_run_setpassesflow_timeout_seconds=args.flow_timeout_secondstorun_set_mainregardless of whether the user supplied the flag. When--flow-timeout-secondsis omitted,args.flow_timeout_secondsisNone(argparse default). ThatNoneis passed explicitly todispatch_run_set, which seesif flow_timeout_seconds is not Noneas False and falls into the unboundedproc.communicate()branch — a hung Playwright session blocks the entire run-set forever.The CLI help text explicitly promises "None defers to dispatch default (900s = 15min)" — but the implementation contradicts this: an explicit
Nonekeyword argument overrides the900.0default indispatch_run_set's signature.Fix — only forward the kwarg when the user actually supplied a value:
This lets
dispatch_run_setapply its own900.0default when the flag is absent, and uses the user-supplied value when present.[MINOR] New:
proc.kill()inexcept asyncio.TimeoutErrorunprotected againstProcessLookupErrorsrc/im2be_qa_rig/parallel.py:260
In
_run_one, theexcept asyncio.TimeoutErrorhandler callsproc.kill()bare.asyncio.subprocess.Process.kill()delegates toos.kill(pid, SIGKILL), which raisesProcessLookupErrorif the subprocess has already exited in the window betweenasyncio.wait_forraisingTimeoutErrorandproc.kill()executing. That unhandled exception escapes_run_oneand, absentreturn_exceptions=Trueon the enclosingasyncio.gather, would cancel all sibling flow coroutines.Fix:
round 3 • 2026-05-15T22:19:15.545Z → 2026-05-15T22:23:08.207Z
Round 3 fixes three round-1 items: StopIteration/PEP-479 (
next(…, None)sentinel),proc.communicate()timeout (asyncio.wait_for), and--concurrencyhelp text ([meta].concurrency). Two prior findings remain open: the round-2flow_timeout_seconds=None-overrides-default bug and the round-1asyncio.gather-missing-return_exceptions=Truebug. One new major finding:create_subprocess_execlacksstart_new_session=True, soproc.kill()in the timeout path kills only the direct Python child and leaves Playwright/Chromium grandchildren as orphans.Findings (3)
[BLOCKING] asyncio.gather still missing return_exceptions=True (round-1, unresolved)
src/im2be_qa_rig/parallel.py:313
The diff is truncated past the coordinator body but the entire file is new (+470 lines) and this round's diff shows no change to the gather call flagged in round 1.
asyncio.gather(*tasks)withoutreturn_exceptions=Truemeans any unexpected exception raised by one coroutine immediately cancels every other in-flight task; the caller receives the first exception rather than a complete set of per-flow results. A single transient subprocess error on one flow aborts all remaining flows.Fix:
Then iterate
results, testisinstance(r, BaseException), and convert exceptions to failedFlowDispatchResultentries so the run-set report remains complete regardless of individual flow errors.[MAJOR] flow_timeout_seconds=None explicitly passed still disables the 900 s guard (round-2, unresolved)
src/im2be_qa_rig/cli.py:127
The
--flow-timeout-secondsargument was added this round withdefault=None, but_cmd_run_setstill forwards the value unconditionally:dispatch_run_setis declaredflow_timeout_seconds: float | None = 900.0. PassingNoneexplicitly overrides the default toNone— it does not defer to it, despite the help string's claim. Any invocation without--flow-timeout-secondshas no per-flow watchdog.Fix — conditionally omit the kwarg so the function default takes effect:
Alternatively change the argparse default to
900.0and update the help text.[MAJOR] create_subprocess_exec without start_new_session=True — proc.kill() on timeout orphans Playwright/Chromium grandchildren
src/im2be_qa_rig/parallel.py:247
asyncio.create_subprocess_execis invoked withoutstart_new_session=True:Playwright spawns Chromium as a child of the Python worker (grandchild of the coordinator). When the timeout fires (line ~262):
SIGKILL is sent only to the direct child PID. Chromium is reparented to init and keeps running. On a 5-flow parallel run where all five time out, five Chromium instances accumulate as orphans, consuming RAM and GPU handles. On a constrained k3d developer machine this is sufficient to OOM subsequent flows and break the Stage A.3 demo.
Fix:
Then in the timeout handler:
Requires adding
import os, signalto the existing import block.round 3 • 2026-05-15T22:21:15.552Z → 2026-05-15T22:25:24.967Z
Show previous round
Round 4 review of PR #3 (TIER 6 parallel-flow dispatcher). Two prior-round findings remain unaddressed in the diff:
flow_timeout_seconds=Noneis still forwarded explicitly from the CLI (disabling the 900 s safety guard), andproc.kill()in the timeout branch is still unprotected againstProcessLookupError. Both are visible in this round's diff. Theasyncio.gather/return_exceptionsissue (round 1, line 313) falls in the truncated portion and cannot be verified; it remains tracked in the prior-round list. The PEP-479StopIterationfix (round 1) and the--concurrencyhelp-text fix (round 1) are confirmed resolved in the visible diff.Findings (2)
[MAJOR] flow_timeout_seconds=None forwarded explicitly, disabling the 900 s safety guard (rounds 2 & 3 still open)
src/im2be_qa_rig/cli.py:130
argparsedefaults--flow-timeout-secondstoNone. Python resolves keyword arguments at call time, soflow_timeout_seconds=Noneoverridesdispatch_run_set's default of900.0— it does not "defer" to it. Any run that omits the flag has no per-flow wall-clock cap, contradicting the help text and the docstring's stated guarantee.Fix: only forward the argument when the user supplied it:
Alternatively, change the
argparsedefault to900.0and update the help text to reflect that0or a negative value means "no cap" (requires a matching guard indispatch_run_set).[MAJOR] proc.kill() in TimeoutError handler unprotected against ProcessLookupError — exception escapes _run_one (round 3 still open)
src/im2be_qa_rig/parallel.py:260
If the subprocess exits in the narrow window between
asyncio.wait_fortiming out andproc.kill()being called,proc.kill()raisesProcessLookupError. The exception propagates out of_run_oneentirely —stdout/stderrare never assigned,write_bytesis never reached, and the coroutine raises instead of returning aFlowDispatchResult. Combined with the still-openasyncio.gather/return_exceptions=Trueissue (round 1, line 313), this can cancel the remaining in-flight flows.Fix:
round 4 • 2026-05-15T22:23:15.557Z → 2026-05-15T22:25:44.173Z
Round 5 review. Fixed this round:
cli.py— sparse-kwargs approach (if args.flow_timeout_seconds is not None: kwargs["flow_timeout_seconds"] = …) correctly preventsNonefrom being forwarded todispatch_run_set, closing the rounds 2/3/4 finding at line 130.--concurrencyhelp text now correctly references[meta].concurrency, closing the round 1 finding at cli.py:349. Still open:proc.kill()atparallel.py:260unprotected againstProcessLookupError(rounds 3 & 4), andasyncio.gatherwithoutreturn_exceptions=Trueatparallel.py:313(round 1). These two interact lethally: aProcessLookupErrorfromproc.kill()escapes_run_oneuncaught, and a baregathertreats the first such exception as fatal, cancelling every other in-flight flow.Findings (2)
[BLOCKING] proc.kill() still unprotected against ProcessLookupError — rounds 3 & 4 still open
src/im2be_qa_rig/parallel.py:260
If the subprocess exits naturally in the narrow window between
asyncio.wait_forraisingTimeoutErrorand the coordinator'sproc.kill()call, the OS raisesProcessLookupError. This exception propagates uncaught out of_run_one. Combined with the bareasyncio.gatherat line 313 (noreturn_exceptions=True), the first such race kills every other in-flight flow and leaves no evidence for any of them.Required fix:
[MAJOR] asyncio.gather without return_exceptions=True — round 1 still open
src/im2be_qa_rig/parallel.py:313
Any unhandled exception from any
_run_onecoroutine (most likelyProcessLookupErrorfromproc.kill(), but also unexpectedOSErroron log writes) is re-raised directly byasyncio.gather, which then cancels every sibling task. All in-progress flows are aborted and their evidence is lost.Required fix:
This makes each flow's failure self-contained rather than contagious to all peers.
round 5 • 2026-05-15T22:27:15.575Z → 2026-05-15T22:29:30.295Z
Show previous round
Round 4 review. The
flow_timeout_seconds=Noneforwarding issue (rounds 2/3) is fixed via the sparse-kwargs guard in_cmd_run_set— well done. The round-1proc.communicate()timeout and thenext(generator)/ StopIteration issues are also resolved. Three actionable issues remain: the round-3proc.kill()carry-over is still present; a newconcurrency=0→Semaphore(0)deadlock path is introduced by the unguarded CLI argument; and theint()coercions inload_run_setcan escape as rawValueErrorrather than user-friendlyRunSetError.Findings (3)
[MAJOR] proc.kill() still unprotected against ProcessLookupError (round 3 carry-over)
src/im2be_qa_rig/parallel.py:260
The round-3 fix was not applied. If the child process exits in the narrow window between
asyncio.wait_forraisingTimeoutErrorandproc.kill()executing, the OS has already reaped the PID andproc.kill()raisesProcessLookupError. This unwinds_run_onewithout ever assigningstdout,stderr, orduration_ms, so the coroutine surfaces a raw exception instead of aFlowDispatchResult. Downstream handling of that result slot (whether viareturn_exceptions=Truegather or otherwise) loses the flow entry from the run-set report entirely. Fix:[MAJOR]
--concurrency 0(or negative) creates asyncio.Semaphore(0), deadlocking all flows indefinitelysrc/im2be_qa_rig/cli.py:354
type=int, default=Noneaccepts any integer, including 0 and negatives.asyncio.Semaphore(0)means_run_one'sasync with semaphore:can never acquire — every flow coroutine blocks at the semaphore entry before a subprocess is ever spawned, so the per-flowflow_timeout_secondsguard never fires either. The coordinator hangs until the operator kills it. The fix belongs indispatch_run_set(earliest place where the value is consumed) and optionally also at the CLI parse site:Alternatively add
type=lambda v: (int(v) if int(v) >= 1 else argparse.ArgumentTypeError(...))to the argparse argument, but thedispatch_run_setguard is essential regardless becauseload_run_setcan also supply the value from TOML.[MINOR] Non-integer TOML values for
concurrency/ sidecar port bounds escape as raw ValueError, bypassing RunSetErrorsrc/im2be_qa_rig/parallel.py:168
int(meta.get("concurrency", 4))and the two analogous sidecar-port coercions raiseValueError(notRunSetError) when the TOML author writes, e.g.,concurrency = "fast"._cmd_run_setonly catchesRunSetError, so the user sees a raw Python traceback instead of the expected[im2be-qa-rig] run-set error: …message. Wrap each conversion:Apply the same pattern to the
low/highport conversions.round 4 • 2026-05-15T22:25:15.568Z → 2026-05-15T22:30:41.812Z
Show previous round
Round 5 fixes the two longest-open items:
flow_timeout_seconds=Noneforwarding (rounds 2-4, now guarded by the sparse-kwargs pattern in_cmd_run_set) and the StopIteration/PEP-479 issue (round 1, now using the sentinel form ofnext()). Three prior-round findings remain unaddressed, and one new minor is identified inload_run_set.Findings (4)
[MAJOR] proc.kill() unprotected against ProcessLookupError — rounds 3 & 4 still open
src/im2be_qa_rig/parallel.py:260
When
asyncio.wait_forfires, the subprocess may have already exited between the timeout and theproc.kill()call.Process.kill()callsos.kill(pid, signal.SIGKILL), which raisesProcessLookupError(a subclass ofOSError) if the PID is gone. That exception escapes_run_one, and — combined with the still-openasyncio.gather/return_exceptionsfinding — will cancel every sibling coroutine still holding a semaphore slot.Concrete fix:
[MAJOR] asyncio.gather called without return_exceptions=True — round 1 still open
src/im2be_qa_rig/parallel.py:313
parallel.pyis a net-new 470-line file; line 313 is in the diff. Any unhandled exception from a_run_onecoroutine (e.g., theProcessLookupErrorabove, or an unexpectedOSErrorfromasyncio.create_subprocess_exec) propagates out ofasyncio.gather, which by default cancels all remaining gathered tasks. A single bad flow therefore kills the whole run-set and produces no report for the flows that had already completed or were still running.Concrete fix:
Then iterate
resultsand convert anyBaseExceptioninstances into failedFlowDispatchResultentries before building theRunSetReport.[MINOR] proc.communicate() in the else-branch has no timeout — round 1 still open
src/im2be_qa_rig/parallel.py:236
When
dispatch_run_setis called withflow_timeout_seconds=None(the documented opt-out), the else-branchawait proc.communicate()has no deadline. A subprocess that hangs (Playwright navigation that never resolves, zombie browser) will hold a semaphore slot forever, stalling the entire run-set.The CLI-side sparse-kwargs fix (rounds 2-4) correctly prevents the CLI from forwarding
None, butdispatch_run_setis a public async API and nothing stops a direct caller or test harness from passingNone. A hard-cap fallback removes the infinite-hang risk:Alternatively, document the
Nonepath explicitly as "caller accepts unbounded hang" in the function signature, and updateflow_timeout_seconds's type tofloat(removingNoneif unlimited hangs are never desired).[MINOR] load_run_set uses read_text() without explicit UTF-8 encoding
src/im2be_qa_rig/parallel.py:202
tomllib.loads(path.read_text())relies on the platform's default encoding. TOML is specified to be UTF-8; on Windows with a non-UTF-8 ANSI codepage (cp1252,cp932, etc.) any non-ASCII character in a flowspecpath or[meta].namewill be decoded incorrectly, causing a silent data-corruption orUnicodeDecodeErrorrather than a clear config error.Concrete fix — use the binary
tomllib.loadoverload, which is idiomatic and avoids the encoding question entirely:Or if staying with
loads:path.read_text(encoding="utf-8")round 5 • 2026-05-15T22:29:15.578Z → 2026-05-15T22:32:04.703Z
Round 5 review of
feat(parallel): TIER 6 parallel-flow dispatcher — S7 spike drop-in.Confirmed fixed this round:
next(generator)/ StopIteration-in-async (round 1) — sentinel formnext(..., None)used correctly.proc.kill()unprotected againstProcessLookupError(rounds 3–4) —try/except ProcessLookupError: passadded.flow_timeout_seconds=Noneforwarded unconditionally (rounds 2–3) — sparse-kwargs pattern in_cmd_run_setnow guards this correctly.--concurrencyhelp text naming wrong TOML key (round 1) — fixed.Cannot verify:
asyncio.gatherwithoutreturn_exceptions=True(round 1, line ~313 in original) —dispatch_run_setbody falls in the truncated portion of the diff; still listed as open from round 4 and no evidence of a fix in the visible hunk.Confirmed still open — two round-4 findings + one new finding below.
Findings (3)
[MAJOR] round-4 carry-over: bare
int()on TOML values raises ValueError instead of RunSetErrorsrc/im2be_qa_rig/parallel.py:173
Lines 173 and 175 in
load_run_set:If the TOML author writes
concurrency = "four"orlow = 98.5,int()raisesValueError. That exception is not aRunSetError, so_cmd_run_set'sexcept RunSetErrorblock incli.pymisses it and the operator sees a raw traceback instead of the clean[im2be-qa-rig] run-set error: …message.Fix: wrap both conversion sites:
and the same for
sp.get("low")/sp.get("high"). Alternatively, add a single helper:[MAJOR] round-4 carry-over:
--concurrency 0(or negative) deadlocks the entire run-set via Semaphore(0)src/im2be_qa_rig/cli.py:354
The
--concurrencyargument at line 354–359 accepts anyintwith no lower-bound guard:_cmd_run_setforwards valid non-None values directly torun_set_main, which eventually callsasyncio.Semaphore(concurrency).asyncio.Semaphore(0)causes everyasync with semaphore:in_run_oneto block forever — the run-set hangs indefinitely with no diagnostic output.load_run_sethas the same gap:int(meta.get("concurrency", 4))on a TOMLconcurrency = 0also reaches the Semaphore.Fix — two-line guard in
_cmd_run_set(before the sparse-kwargs block):And a matching check at the top of
load_run_setafter theint()conversion:[MINOR] new:
--flow-timeout-secondsaccepts 0 or negative values — immediately kills every subprocesssrc/im2be_qa_rig/cli.py:360
The argument defined at lines 360–365 accepts any
floatwith no lower-bound check:In CPython 3.11+
asyncio.wait_for(coro, timeout=0)(or any non-positive value) cancels the wrapped coroutine immediately on the next event-loop tick, before the subprocess even starts. The result: every flow is killed at birth and markedtimeout after 0s. A negative float raisesValueErrorinsideasyncio.wait_for, which propagates out of_run_oneas an unhandled exception (compounding the still-openasyncio.gather/return_exceptions=Truegap).Fix — add validation in
_cmd_run_setalongside the existing None guard:round 5 • 2026-05-15T22:31:15.582Z → 2026-05-15T22:36:05.963Z
Show previous round
Round 6 review of
feat(parallel): TIER 6 parallel-flow dispatcher. Confirmed fixes since round 5:proc.kill()now guarded byProcessLookupError(rounds 3-5 closed),flow_timeout_seconds=Noneno longer forwarded by CLI (rounds 2-4 closed),StopIterationrisk fromnext()eliminated,--concurrencyhelp text corrected. Four findings remain or are new.Findings (4)
[MAJOR] tomllib.TOMLDecodeError not caught — raw traceback instead of RunSetError on bad TOML
src/im2be_qa_rig/parallel.py:202
NEW.
tomllib.loads(path.read_text())can raisetomllib.TOMLDecodeError(aValueErrorsubclass). This is NOT aRunSetError, so it propagates straight past theexcept RunSetErrorhandler in_cmd_run_setand surfaces as a raw Python traceback — a poor user experience for what is a routine operator error (typo in run-set.toml).Fix:
This also resolves the round-5
read_text()encoding finding at this line in one change.[MINOR] path.read_text() without encoding= — round 5 still open
src/im2be_qa_rig/parallel.py:202
Round 5 finding not addressed. On any host whose
LC_ALLorLANGdefault is not UTF-8 (e.g.LC_ALL=Cin Docker, some CI runners),path.read_text()uses the platform codec and can raiseUnicodeDecodeErrorrather thanRunSetError, bypassing the clean error message.Fix (combine with the
TOMLDecodeErrorfix above):[MINOR] else-branch proc.communicate() still has no timeout — round 1 still open
src/im2be_qa_rig/parallel.py:236
Round 1 finding confirmed open again in round 5. The
elsebranch of theif flow_timeout_seconds is not Noneguard callsawait proc.communicate()with no timeout. Any programmatic caller that passesflow_timeout_seconds=Nonetodispatch_run_set(overriding the900.0default) gets an unbounded hang. The CLI fix (sparse kwargs) is partial mitigation, not a fix.Minimal fix — replace the else-branch with an assertion so a caller cannot accidentally opt out:
Or collapse both branches and always use
asyncio.wait_for, treatingNoneasmath.inf.[MAJOR] asyncio.gather called without return_exceptions=True — rounds 1 & 5 still open
src/im2be_qa_rig/parallel.py:313
Rounds 1 and 5 both confirmed this open; the truncated diff contains no evidence of a fix. If any
_run_onecoroutine raises an unexpected exception (e.g.,OSErrorfrommkdir,PermissionErrorfrom log writes),asyncio.gatherre-raises it immediately, cancelling every other in-flight task and discarding their results. The run-set report is never written and the coordinator exits non-zero with a raw traceback.Fix:
round 6 • 2026-05-15T22:33:15.589Z → 2026-05-15T22:37:15.039Z
Show previous round
Round 6 resolves four prior findings: the
StopIteration/PEP 479 bug is fixed vianext(..., None);proc.kill()is now guarded againstProcessLookupError; theflow_timeout_seconds=Noneforwarding issue is fixed by the sparse-kwargs pattern in_cmd_run_set; and the--concurrencyhelp text now names the correct TOML key ([meta].concurrency). Two findings from earlier rounds remain open and are confirmed in the visible diff. The asyncio.gather finding (round 1 / round 5, line ~313) cannot be verified because the diff is truncated before that region — if that fix has not landed it remains a correctness regression that should block merge.Findings (2)
[MINOR] read_text() still has no explicit encoding (round 5 open)
src/im2be_qa_rig/parallel.py:202
Path.read_text()uses the platform locale encoding (e.g. CP1252 on Windows). A run-set TOML with non-ASCII characters in flow names or paths will raiseUnicodeDecodeErroron non-UTF-8 locales. Fix:[MAJOR] proc.communicate() in else-branch still has no timeout (rounds 1 & 5 open)
src/im2be_qa_rig/parallel.py:236
When
flow_timeout_seconds is None(callers can still passNoneexplicitly todispatch_run_set, which hasflow_timeout_seconds: float | None = 900.0), this branch blocks indefinitely. A hung Playwright subprocess will hold the semaphore slot forever, stalling the entire run-set. Fix:or raise
ValueErrorindispatch_run_setifflow_timeout_seconds is Noneto eliminate the ambiguous code path entirely.round 6 • 2026-05-15T22:35:15.593Z → 2026-05-15T22:37:28.778Z
Show previous round
Round 7 review — parallel-flow dispatcher (PR #3)
Fixes confirmed in this round's diff:
cli.py— sparse-kwargs pattern now preventsflow_timeout_seconds=Nonefrom disabling the 900 s guard (rounds 2–4 ✅)cli.py—--concurrencyhelp text now correctly cites[meta].concurrency(round 1 ✅)parallel.py—proc.kill()is now wrapped intry/except ProcessLookupError(rounds 3–5 ✅)parallel.py—next(..., None)sentinel form prevents the PEP 479 StopIteration→RuntimeError promotion (round 1 ✅)4 prior-round findings remain unaddressed.
Findings (4)
[MINOR] path.read_text() still missing encoding="utf-8"
src/im2be_qa_rig/parallel.py:202
Rounds 5 and 6 finding, still not fixed.
tomllibrequires UTF-8 by specification; on Windows (or any locale wherelocale.getpreferredencoding()returns CP1252 or Latin-1)path.read_text()without an explicit encoding will mis-decode multibyte characters in TOML string values and raise a crypticUnicodeDecodeErrorrather than aRunSetError.Preferred fix — use the binary
loadAPI which is unambiguous and avoids a redundant decode/encode round-trip:Or at minimum:
path.read_text(encoding="utf-8").[MAJOR] tomllib.TOMLDecodeError not caught — raw traceback on bad TOML input
src/im2be_qa_rig/parallel.py:202
Round 6 finding, still not fixed. When the run-set file contains syntactically invalid TOML,
tomllib.loads()raisestomllib.TOMLDecodeError, which is not a subclass ofRunSetError. Theexcept RunSetErrorhandler in_cmd_run_set(cli.py line ~137) does not catch it, so the operator sees a full Python traceback instead of the friendly[im2be-qa-rig] run-set error: …message.Fix — wrap the TOML parse in
load_run_set:[MAJOR] else-branch proc.communicate() still has no timeout — hung subprocess blocks run-set forever
src/im2be_qa_rig/parallel.py:236
Rounds 1, 5, and 6 finding, still not fixed. When
flow_timeout_seconds is Nonethe code falls into the bareelsebranch:with no
asyncio.wait_forguard. A Playwright browser navigating to a page that never resolves (e.g. an OAuth stub that never responds) will hold the semaphore slot and block the entire run-set indefinitely.Fix — apply the timeout unconditionally using a hard-cap sentinel to collapse the if/else:
This eliminates the unguarded branch entirely.
[MAJOR] asyncio.gather called without return_exceptions=True — any unexpected exception cancels all in-flight flows
src/im2be_qa_rig/parallel.py:313
Rounds 1, 5, and 6 finding. This is a new file (all 478 lines are in this diff) and this call has not been fixed across six rounds. Without
return_exceptions=True, if any single_run_onecoroutine raises an unhandled exception (e.g. anOSErrorfromflow_dir.mkdir, a network error setting up a sidecar port, or any unexpected path),asyncio.gatherpropagates that exception immediately and cancels every other running task. The run-set report is never written and all in-flight flow results are lost.Fix:
round 7 • 2026-05-15T22:37:15.593Z → 2026-05-15T22:40:34.821Z
Show previous round
Round 8 — fixes verified, four issues remain open.
Fixed this round:
cli.pyflow_timeout_seconds=Noneforwarding (rounds 2–7): ✅ sparse-kwargs construction now preventsNonefrom reachingdispatch_run_set.parallel.pyproc.kill()unguardedProcessLookupError(rounds 3–5): ✅try/except ProcessLookupError: passis now in place.parallel.py:259StopIteration/ PEP 479 (round 1): ✅ code usesnext(..., None)two-arg form.cli.py:349wrong TOML key in--concurrencyhelp text (round 1): ✅ now correctly reads[meta].concurrency.Still open (four findings below):
read_text()missing UTF-8 encoding,TOMLDecodeErroruncaught, else-branchcommunicate()without timeout, andasyncio.gatherwithoutreturn_exceptions=True.Findings (4)
[MINOR] path.read_text() called without encoding="utf-8"
src/im2be_qa_rig/parallel.py:202
Path.read_text()uses the locale default encoding. On CI images with a non-UTF-8 locale (e.g.LANG=CorLANG=POSIX), a TOML file that contains any non-ASCII character (comments, persona names) raisesUnicodeDecodeErrorinstead ofRunSetError, producing a raw traceback. Fix:Open since round 5; not addressed in round 8 push.
[MAJOR] tomllib.TOMLDecodeError not caught — raw traceback on malformed run-set config
src/im2be_qa_rig/parallel.py:202
tomllib.TOMLDecodeError(subclass ofValueError) is not caught here. A single syntax error inrun-set.tomlcauses an unhandled exception that escapesload_run_set, bypasses theRunSetErrorhandler in_cmd_run_set, and surfaces as a raw Python traceback instead of a clean[im2be-qa-rig] run-set error: …message. Fix:Open since round 6; not addressed in round 8 push.
[MAJOR] else-branch proc.communicate() has no timeout — hung subprocess blocks run-set forever
src/im2be_qa_rig/parallel.py:236
When
flow_timeout_seconds is Nonetheelsebranch runs an unboundedawait proc.communicate(). Any callers that explicitly passflow_timeout_seconds=Nonetodispatch_run_set(valid per the type signaturefloat | None) will be stuck if a subprocess hangs. The CLI fix in round 8 preventsNonefrom being forwarded from the CLI path, but the API contract still allows it and the library function has no internal guard. Fix — remove theif/elseand always useasyncio.wait_forwhen a non-Nonetimeout is present, then document thatNoneis explicitly unsupported:Open since round 1; not addressed in round 8 push.
[MAJOR] asyncio.gather called without return_exceptions=True — unexpected exception cancels all in-flight flows
src/im2be_qa_rig/parallel.py:313
Without
return_exceptions=True, if any coroutine intasksraises an unexpected exception (not caught inside_run_one) thegatherimmediately propagates it, cancelling all other still-running tasks. Each cancelled task holds a semaphore slot and a live subprocess, leaving zombie subprocesses and an incomplete run-set report. Fix:Open since round 1; not addressed in round 8 push. Note: the diff display is truncated before this line, but the line was confirmed open through round 7 and no deletion appears in the
+478/-0diff header.round 8 • 2026-05-15T22:41:15.611Z → 2026-05-15T22:44:10.792Z
Round 7 — three prior findings fixed: StopIteration PEP 479 (now uses
next(..., None)sentinel),proc.kill()race withProcessLookupError(now guarded), and CLINone-forwarding offlow_timeout_seconds(now sparse-kwargs). Two confirmed-open issues remain visible in the diff (lines 202 and 236), plus one new encoding-missing call-site at thecandidate.read_text()in_run_one. Theasyncio.gather/return_exceptions=Truefinding (rounds 1+5, ~line 313) sits in the truncated portion of the diff and cannot be verified this round.Findings (3)
[MINOR] path.read_text() still missing encoding="utf-8" (rounds 5+6 open)
src/im2be_qa_rig/parallel.py:202
tomllib.loads(path.read_text())delegates encoding to the platform default. On Windows CI or any locale whose default is not UTF-8, a run-set file containing Unicode in flow names, comments, or base-url values raisesUnicodeDecodeErrorinstead of a typedRunSetError.One-line fix:
[MAJOR] proc.communicate() else-branch still has no timeout when flow_timeout_seconds=None (rounds 1+5+6 open)
src/im2be_qa_rig/parallel.py:236
The
elsebranch executesawait proc.communicate()with no deadline wheneverflow_timeout_seconds is None. Because thedispatch_run_setsignature declaresflow_timeout_seconds: float | None = 900.0, any caller that passesNoneexplicitly (e.g. in a test harness or a wrapper script) will block the semaphore slot forever if the child stalls. The CLI-side fix prevents the CLI from forwardingNone, but the public API still accepts and acts on it.Simplest fix — remove the
Nonepath and always applyasyncio.wait_for, treatingNoneas "use a very large sentinel":Alternatively narrow the type annotation to
float(drop| None) so the API never offers the footgun.[MINOR] candidate.read_text() missing encoding — new call site, same class of bug as line 202
src/im2be_qa_rig/parallel.py:272
json.loads(candidate.read_text())readsreport.jsonusing the platform default encoding. Child processes could emit Unicode persona names or task labels in their report; on non-UTF-8 systems this raisesUnicodeDecodeError, marking the flow failed with a confusing internal error rather than the actual subprocess outcome.Fix:
round 7 • 2026-05-15T22:39:15.600Z → 2026-05-15T22:44:27.243Z
Show previous round
Round 9 brings one genuine fix:
cli.pynow builds kwargs sparsely, soflow_timeout_seconds=Noneis no longer forwarded todispatch_run_set— the 900 s safety guard is restored (rounds 2-8 finding closed). The four long-runningparallel.pyissues remain unaddressed in this diff.Findings (4)
[MAJOR]
path.read_text()still missingencoding="utf-8"src/im2be_qa_rig/parallel.py:202
On Windows (or any locale where
PYTHONUTF8is not set)read_text()uses the system default encoding (e.g.cp1252). A TOML file with any non-ASCII character (even a right-quote in a comment) silently misdecodes, producing aUnicodeDecodeErroror silent corruption. Fix:This finding has been open since round 5 and is unchanged in this diff.
[MAJOR]
tomllib.TOMLDecodeErrornot caught inload_run_setsrc/im2be_qa_rig/parallel.py:202
A malformed
run-set.toml(missing closing bracket, invalid key, etc.) raisestomllib.TOMLDecodeErrorwhich propagates as a raw traceback instead of a typedRunSetError. The CLI catch-block in_cmd_run_setonly catchesRunSetError, so the user sees an unhandled exception. Fix:Open since round 6, unchanged.
[MAJOR] else-branch
proc.communicate()has no timeout — hung subprocess blocks the run-set foreversrc/im2be_qa_rig/parallel.py:236
When
flow_timeout_seconds is None(i.e.,dispatch_run_setis called with the explicitNonevalue — valid per its type annotationfloat | None) the subprocess can hang indefinitely, stalling the semaphore slot and the entire run-set. Even though the CLI now avoids forwardingNone, the public API ofdispatch_run_setstill accepts it. Fix: always apply a ceiling, or raise at the top ofdispatch_run_setifflow_timeout_seconds is None:Alternatively default the else-branch to a large finite timeout (e.g. 3600 s). Open since round 1.
[MAJOR]
asyncio.gathercalled withoutreturn_exceptions=True— unexpected exception cancels all in-flight flowssrc/im2be_qa_rig/parallel.py:313
The gather call inside
dispatch_run_set(line 313 per all prior rounds; truncated from this diff view but unchanged per the +478/-0 net) does not passreturn_exceptions=True. If any coroutine raises an unexpected exception (e.g. an unhandledOSErrorfromcreate_subprocess_exec),gatherpropagates it immediately, cancelling all still-running flow tasks and producing an incomplete run-set report. Fix:Open since round 1.
round 9 • 2026-05-15T22:44:15.611Z → 2026-05-15T22:45:59.943Z
Show previous round
Round 10 review. The diff in this round resolves the majority of the long-standing findings:
path.read_text(encoding="utf-8")is now present,tomllib.TOMLDecodeErroris caught and wrapped inRunSetError,proc.kill()is guarded againstProcessLookupError, the StopIteration/PEP-479 risk is eliminated with thenext(..., None)sentinel form,flow_timeout_seconds=Noneis no longer forwarded explicitly (sparse-kwargs approach), the--concurrencyhelp text now names the correct[meta].concurrencyTOML key, and the "else-branch"communicate()is now correctly wrapped inasyncio.wait_for. One prior finding survives:asyncio.gatherwithoutreturn_exceptions=Trueinsidedispatch_run_set. That call body falls inside the truncated portion of the diff (the diff cuts off at thedispatch_run_setsignature, line ~303), but the finding has been flagged in every round from 1 through 9 and was not resolved in any prior round. It is carried forward.Findings (1)
[MAJOR] asyncio.gather called without return_exceptions=True — unexpected exception cancels all in-flight flows
src/im2be_qa_rig/parallel.py:313
The
dispatch_run_setbody (which starts just below the visible portion of the diff) contains anasyncio.gather(...)call withoutreturn_exceptions=True. This has been flagged every round since round 1 and was not addressed in rounds 1–9. Withoutreturn_exceptions=True, any unhandled exception raised by a single_run_onecoroutine (e.g., anOSErrorfromcreate_subprocess_execwhen the system is out of file descriptors, or an unexpectedasyncio.CancelledError) causesgatherto immediately cancel every other in-flight task, losing all partial results and making the run-set report incomplete.Required fix:
Without this, any OS-level error during subprocess creation (even for a single flow) silently aborts the entire parallel run.
round 10 • 2026-05-15T22:59:09.702Z → 2026-05-15T23:02:00.673Z
Round 10 review. Most long-standing findings are fixed in this revision:
path.read_text(encoding="utf-8")now explicit (rounds 5-9)tomllib.TOMLDecodeErrornow caught and re-raised asRunSetError(rounds 6-9)proc.communicate()wrapped inasyncio.wait_for(round 1, line 236)proc.kill()now protected bytry/except ProcessLookupError(rounds 3-5)next(generator, None)default preventsStopIteration → RuntimeError(round 1, line 259)flow_timeout_seconds=Noneno longer forwarded explicitly via sparse-kwargs pattern (rounds 2-4)--concurrencyhelp text now names[meta].concurrencycorrectly (round 1, cli.py:349)One finding persists. The diff is truncated before
dispatch_run_set's body (visible content ends mid-docstring), preventing verification of theasyncio.gathercall — but this issue has been confirmed open in every round from 1 through 9 with no intervening fix visible anywhere in the diff.Findings (1)
[MAJOR]
asyncio.gathercalled withoutreturn_exceptions=True— unexpected exception cancels all in-flight flowssrc/im2be_qa_rig/parallel.py:313
This has been open since round 1 and unaddressed through round 9. The diff includes all 481 new lines of
parallel.pybut is truncated before line 313 in the context provided to this reviewer; thedispatch_run_setbody is not visible.The risk: if any one
_run_onecoroutine raises an unexpected exception (e.g., anOSErrorfromasyncio.create_subprocess_execwhen the system is out of file descriptors, or aPermissionErroronflow_dir.mkdir),asyncio.gatherwithoutreturn_exceptions=Truewill cancel every other still-running coroutine immediately, leaving their semaphore slots acquired and their subprocesses running but unmonitored — and noFlowDispatchResultis produced for any of them.Required fix:
This makes every coroutine run to completion (or failure) independently, which is the intended semantics of a parallel run-set dispatcher.
round 10 • 2026-05-15T23:01:39.711Z → 2026-05-15T23:06:31.582Z
Round 11: The visible portion of the diff resolves the majority of long-standing findings —
encoding="utf-8"onpath.read_text(),tomllib.TOMLDecodeErrorcatch,asyncio.wait_forwrappingproc.communicate(),ProcessLookupErrorguard onproc.kill(),next(..., None)sentinel for the PEP-479 StopIteration hazard, theflow_timeout_seconds=Noneforwarding bug, and the--concurrencyhelp-text key name are all confirmed fixed in the diff. One finding carries forward:asyncio.gatherwithoutreturn_exceptions=True(flagged in every round since round 1, last confirmed open in round 10) sits in the truncated tail ofparallel.py; no change in the pushed commit addresses it based on the visible diff.Findings (1)
[MAJOR] asyncio.gather called without return_exceptions=True — unexpected exception still cancels all in-flight flows
src/im2be_qa_rig/parallel.py:313
Open since round 1, unresolved through round 10. Inside
dispatch_run_setthe gather call has noreturn_exceptions=True. Any exception that escapes_run_one— for example an unhandledOSErroronflow_dir.mkdir, an unexpectedasyncio.CancelledErrorpropagated from a signal handler, or an OOM during subprocess creation — immediately cancels every other in-flight task and prevents the run-set report from being written, discarding evidence from already-completed flows._run_onealready converts all expected subprocess errors (timeout, non-zero exit, missing report.json) intoFlowDispatchResultwithpassed=False, so this fires only on programmer-level surprises — but that is precisely when partial-result preservation matters most.Required fix:
round 11 • 2026-05-15T23:04:39.732Z → 2026-05-15T23:07:31.030Z