feat(rmp): RMP_DISTRIBUTED — distributed quorum as a production review mode #7
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "feat/rmp-distributed-production"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Promotes the distributed-quorum topology (which won the live reliability comparison) to a flag-gated production review path.
What
RMP_DISTRIBUTED=1(+REVIEWER_QUORUM_ENABLED=1) → the engine reviews PRs via distributed quorum: the two reviewer legs run in SEPARATE module containers (cross-process, Redis-Streams transport), reconciled by an in-engine arbiter. Per-leg failure isolation — a dead/hung leg degrades to the surviving leg rather than losing the review (the reliability win demonstrated live).Quorumruns unchanged. Full worker suite passes flag-off → byte-identical.RMP_DISTRIBUTED_LEG_A_MODEL/_LEG_B_MODEL.How
in-process.ts: generalizedrunReviewViaProducer(channel↔driver wiring);runReviewInProcessdelegates to it.worker-bridge.ts:runDistributedReviewViaDriver(tested) wires a LivenessWatchdog + shared DriverSink to the worker callbacks + drivesrunDistributedQuorumover the transport leg-runner.worker.ts: thereviewer instanceof Quorum && RMP_DISTRIBUTED_ENABLEDbranch; mirrors the RMP_DRIVER wiring (no onRawMessage — raw SDK messages don't cross the wire; the leg containers run their own IrregularityDetector + tool log).sdk-event-adapter.ts: emits token/cost telemetry from the SDK result message (closes the compare-harness cost-axis gap).Tests
Requires the
rmp-modulescompose profile up. NOT flipped on the engine — needs a deliberate flag + the leg containers.Show previous round
hib-pr-reviewer review — PR #7 (hib/hib-pr-reviewer)
Round 1 — head
66bee224fc69, basemain, triggeropenedTL;DR: CONDITIONAL_APPROVE — kept 2 unique-to-one findings (both verified at cited lines), no agreed findings, no blocking issues.
Summary
Reconciliation
Two unique-to-one findings, both independently verified against the checked-out HEAD.
Finding A (line 2014 — startup log silent on
RMP_DISTRIBUTED=1+ quorum off): Confirmed. Lines 2006–2020 show that the only log written in theelseblock is the singleconsole.logat line 2011. The ternaryquorumEnabled && RMP_DISTRIBUTED_ENABLEDsilently falls through to theRMP_DRIVER_ENABLED/ default branches whenRMP_DISTRIBUTED_ENABLED=truebutquorumEnabled=false. Noconsole.warnexists for this misconfiguration path anywhere in the block. Kept.Finding B (line 3067 —
maxIdleReadsbounds dead-leg detection at full job timeout): Confirmed.readOpts: { blockMs: 2000, maxIdleReads: Math.ceil(cfg.jobTimeoutMs / 2000) }at line 3067 sets the idle-read bound to exactly the full job timeout (e.g. 900 × 2 s = 1800 s for a 30-min job).transport.tsline 111 confirms the default isNumber.POSITIVE_INFINITY, so without this bound the loop runs forever — but the chosen bound means a silently dead leg idles for the entire job window. Thesignalat line 3066 comes fromreviewController.abort()which only fires when theLivenessWatchdogtrips — and per the topology the watchdog stays alive as long as leg A feeds it events.maxIdleReadsis therefore the sole dead-leg escape valve and it fires at the last possible moment. Kept.No agreed findings (different lines, different concerns). No dropped findings.
Blast Radius
The diff spans 12 files across
src/rmp/,src/worker.ts,tests/,docs/, andbin/. The new distributed-quorum path is fully flag-gated (RMP_DISTRIBUTED=1 && REVIEWER_QUORUM_ENABLED=1), so the existing in-process quorum and default reviewer paths are unaffected in production. The exportedModuleRunReportinterface gains a new field (findingLocations) that is additive and tested.BLAST_SCORE: 5/10
Risk Indicators
runDistributedQuorum,makeTransportLegRunner,readReviewEvents,runReviewViaProducer,runDistributedReviewViaDriverCI status (head
66bee224fc69)Overall: ✓ success
3 checks: 3 pending
Related PRs
Findings (2)
[MINOR] Startup log silently swallows
RMP_DISTRIBUTED=1when quorum is disabledsrc/worker.ts:2014
The ternary at line 2014 (
quorumEnabled && RMP_DISTRIBUTED_ENABLED) evaluates false whenRMP_DISTRIBUTED_ENABLED=truebutquorumEnabled=false, so the startup log prints either theRMP_DRIVERpath orreviewer.run (default)with no indication thatRMP_DISTRIBUTEDis set but inert. An operator who sets the flag but forgetsREVIEWER_QUORUM_ENABLED=1gets zero runtime feedback.Proposed fix — add a warn before the existing
console.logat line 2011:This follows the rule-14 pattern the PR already uses for the liveness-path log.
[MINOR]
maxIdleReadsequals the full job timeout — dead-leg isolation only activates at deadlinesrc/worker.ts:3067
With a 30-min job timeout (
jobTimeoutMs = 1_800_000),maxIdleReads = 900and900 × 2 s = 1800 s— the full timeout. If leg A completes in 5 minutes and leg B's container dies immediately after startup (emitting zero events), thereadReviewEventsloop for leg B silently idles for the full 30 minutes before declaring the leg dead.The
LivenessWatchdogdoes not help here: it stays alive because leg A keeps feeding it events, soreviewController.abort()is never called and the signal (line 3066) does not cut leg B's loop short.maxIdleReadsis the only dead-leg detection mechanism and it fires at the last possible moment.The test at line 87 (
maxIdleReads: 1) shows the knob is deliberately tunable. Consider a shorter fixed bound for fast silent-crash detection:Alternatively, document the intentional trade-off ("degrade at deadline, not proactively") explicitly in a code comment so operators are not surprised by 30-min single-leg-dead reviews.
Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 1 • 2 findings (2m) • 2026-06-03T07:40:17.502Z → 2026-06-03T07:41:28.637Z • posted-as: pr-reviewer-bot • model: auto
hib-pr-reviewer review — PR #7 (hib/hib-pr-reviewer)
Round 2 — head
9933cfbea1e7, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — 0 agreed findings (A had 0); both unique-to-B findings verified at HEAD and kept; no blocking or major issues.
Summary
Arbitration — Round 2
Reviewer A found 0 issues; Reviewer B found 2 minor issues, both unique-to-B. Both verified via
ReadandGrepat HEAD.Finding B-1 — verified ✅
src/worker.tslines 2021-2030: the distributed-quorum branch of the startup log hard-codes the string"RMP distributed quorum (RMP_DISTRIBUTED=1, leg containers)"and does not interpolateRMP_DISTRIBUTED_LEG_A_MODULE(resolved at line 173),RMP_DISTRIBUTED_LEG_B_MODULE(line 174), orRMP_DISTRIBUTED_LEG_SILENCE_MS(lines 183-186, with a silent fallback to 300000 for non-integer inputs). An operator overriding any of these has no log evidence of the effective values. Rule 14 explicit intent: "Log effective config (secrets masked)." Kept.Finding B-2 — verified ✅
src/worker.tslines 2033-2037: the only guard in the distributed-quorum startup block is the!quorumEnabledinert-mode warning. There is no assertion thatRMP_DISTRIBUTED_LEG_SILENCE_MS < cfg.jobTimeoutMs. The.env.example(line 26) anddocs/configuration.mdboth document this as a hard requirement; the parsing at lines 183-185 accepts any positive integer, so a value ≥ job timeout silently passes. This is exactly the misconfiguration class that round 1 fixed in the old hard-codedcfg.jobTimeoutMs / 2000path — the fix is now re-openable via explicit config. Rule 14: "Validate required config at startup — fail fast." Kept.No agreed findings (A had 0). Round-1 conditions both confirmed resolved by both reviewers.
Blast Radius
The diff touches 12 files across the RMP subsystem (src/rmp/, src/worker.ts, bin/, docs/, tests/). The core change (RMP_DISTRIBUTED flag-gate) is additive and flag-off is documented as byte-identical to the prior path. The two kept findings are both scoped to the startup-log / startup-validation block in worker.ts and have no runtime data-path consequence under correct configuration.
BLAST_SCORE: 4/10
Risk Indicators
runDistributedReviewViaDriver,runReviewViaProducer,makeWatchdogSinkCI status (head
9933cfbea1e7)Overall: ✓ success
3 checks: 3 pending
Related PRs
Findings (2)
[MINOR] Startup log omits effective
RMP_DISTRIBUTED_LEG_SILENCE_MSand leg module IDssrc/worker.ts:2024
The startup log (lines 2021-2030) announces the distributed-quorum path as active but prints only the literal string
"RMP distributed quorum (RMP_DISTRIBUTED=1, leg containers)"— it does not interpolate the resolved values ofRMP_DISTRIBUTED_LEG_A_MODULE,RMP_DISTRIBUTED_LEG_B_MODULE, orRMP_DISTRIBUTED_LEG_SILENCE_MS. When any of these are overridden (or whenRMP_DISTRIBUTED_LEG_SILENCE_MSsilently reverts to 300000 because the supplied value is not a positive integer — see parsing at lines 183-185), there is no log evidence of what the engine is actually using.Concrete fix — extend the distributed branch of the log template:
This satisfies the rule-14 intent already cited in the surrounding comment: "Log effective config (secrets masked)."
[MINOR] No startup assertion that
RMP_DISTRIBUTED_LEG_SILENCE_MS < cfg.jobTimeoutMssrc/worker.ts:2033
.env.example(line 26) anddocs/configuration.mdboth document thatRMP_DISTRIBUTED_LEG_SILENCE_MS"MUST be < the job timeout", but there is no runtime enforcement. The parsing at lines 183-185 accepts any positive integer, so an operator setting e.g.RMP_DISTRIBUTED_LEG_SILENCE_MS=1800000(equal toJOB_TIMEOUT_MS) passes validation silently — the per-leg watchdog fires at or after job expiry, nullifying the failure-isolation guarantee that is the primary value of this feature. This is precisely the regression class fixed in round 1 (oldcfg.jobTimeoutMs / 2000), now re-openable via explicit misconfiguration.Concrete fix — add a fail-fast assertion after the inert-mode warning block (around line 2037), gated on the active mode:
Rule 14: "Validate required config at startup — fail fast."
Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 2 • 2 findings (2m) • 2026-06-03T07:48:58.873Z → 2026-06-03T07:50:00.611Z • posted-as: pr-reviewer-bot • model: auto