fix(health): cancel/restart 202 (not 503) when only the publish ack times out #13

Merged
hibryda merged 2 commits from fix/cancel-restart-202-on-deadline into main 2026-06-03 18:58:48 +02:00
Owner

Fixes a false-negative I hit live: POST /cancel/:jobId returned 503 (cancel publish deadline exceeded) even though the engine subscriber received the message and aborted the job.

Cause

Promise.race(redis.publish, deadline) doesn't cancel the losing redis.publish — so on a deadline the publish is already in flight and (cancel/restart are idempotent best-effort) may well have delivered. Reporting 503 failed was misleading.

Fix

  • New PublishDeadlineError distinguishes a deadline from a real Redis rejection: deadline → 202 {status:"dispatched-unconfirmed"} (caller verifies via the queue), a genuine publish error stays 503. Applied symmetrically to /cancel + /restart.
  • Publish deadlines injectable (cancelPublishDeadlineMs/restartPublishDeadlineMs) — fast-testable + an ops tuning knob; the two existing deadline tests now assert 202 (no 2s waits), +1 restart real-rejection-503 test.

Also backfills the CHANGELOG entry for the merged multi-origin OAuth (PR #11). full suite 2169.

Fixes a false-negative I hit live: `POST /cancel/:jobId` returned **503** (`cancel publish deadline exceeded`) even though the engine subscriber received the message and aborted the job. ## Cause `Promise.race(redis.publish, deadline)` doesn't cancel the losing `redis.publish` — so on a deadline the publish is already in flight and (cancel/restart are idempotent best-effort) may well have delivered. Reporting `503 failed` was misleading. ## Fix - New `PublishDeadlineError` distinguishes a deadline from a real Redis rejection: deadline → **`202 {status:"dispatched-unconfirmed"}`** (caller verifies via the queue), a genuine publish error stays **503**. Applied symmetrically to `/cancel` + `/restart`. - Publish deadlines injectable (`cancelPublishDeadlineMs`/`restartPublishDeadlineMs`) — fast-testable + an ops tuning knob; the two existing deadline tests now assert 202 (no 2s waits), +1 restart real-rejection-503 test. Also backfills the CHANGELOG entry for the merged multi-origin OAuth (PR #11). full suite 2169.
fix(health): cancel/restart return 202 (not 503) when only the publish ack times out
All checks were successful
pr-reviewer CI / tsc --noEmit (pull_request) Successful in 23s
pr-reviewer CI / vitest (pull_request) Successful in 59s
pr-reviewer CI / tsc build (pull_request) Successful in 36s
434be224fa
Observed live cancelling a moot review: POST /cancel returned 503
("cancel publish deadline exceeded") even though the engine subscriber received
it and aborted the job. Promise.race(publish, deadline) doesn't cancel the
losing redis.publish, so on a deadline the publish is in flight and (cancel/
restart being idempotent best-effort) may well have delivered — 503 was wrong.

- New PublishDeadlineError distinguishes a deadline from a real Redis rejection:
  deadline → 202 {status:"dispatched-unconfirmed"} (caller verifies via queue);
  a genuine publish error stays 503.
- Applied symmetrically to /cancel and /restart.
- Publish deadlines made injectable (cancelPublishDeadlineMs /
  restartPublishDeadlineMs) — fast-testable + an ops tuning knob. The two
  existing deadline tests now assert 202 (fast, no 2s waits); +1 restart
  real-rejection-503 test.

Also backfills the CHANGELOG entry for the merged multi-origin OAuth (PR #11).
full suite 2169.

Superseded by round 2.

Show previous round

hib-pr-reviewer review — PR #13 (hib/hib-pr-reviewer)

Round 1 — head 434be224faf3, base main, trigger opened

TL;DR: CONDITIONAL_APPROVE — kept 1 agreed CHANGELOG finding (A+B) and 1 verified unique-to-B stale-comment finding; no blocking issues.

Summary

Reconciliation Summary

Verified both findings sets against HEAD.

Finding 1 (AGREED — A + B): CHANGELOG.md line 18 — confirmed. The line ends …(+1 restart real-rejection-503 test). and then immediately continues with handleCallback never wrote the bootstrapAdmin(provider,userId) key… — the full bootstrapAdmin bullet text folded in without any separator, erasing the standalone bootstrap-admin entry. Kept.

Finding 2 (unique to B): src/health.ts lines 396–398 — confirmed. The inline comment still reads // returns 503 instead of hanging the request forever despite the behavior having been changed to 202 dispatched-unconfirmed via PublishDeadlineError. The comment immediately precedes the Promise.race call that now throws PublishDeadlineError on timeout; it was not updated. Kept.

No findings dropped. Both issues are minor/documentation-class.

Blast Radius

The diff is tightly scoped to two HTTP endpoints (POST /cancel/:jobId and POST /restart) in src/health.ts and their test coverage. No shared state or exported interfaces are affected beyond the injectable deadline options on HealthServerOpts. Both issues found are documentation-class (CHANGELOG prose and an inline comment) with no runtime impact.

BLAST_SCORE: 2/10

Risk Indicators

Indicator Value
Sensitive functions
Migration touched
Test delta +36 / -11 lines in test files
Dependency changes

CI status (head 434be224faf3)

Overall: ✓ success

3 checks: 3 pending

Check State Link
pr-reviewer CI / tsc --noEmit (pull_request) pending details
pr-reviewer CI / vitest (pull_request) pending details
pr-reviewer CI / tsc build (pull_request) pending details

Findings (1)

[MINOR] CHANGELOG Fixed bullet corrupted — bootstrapAdmin fix text folded into cancel/restart entry

CHANGELOG.md:18

Line 18 is a single mega-bullet that begins with the cancel/restart 202 fix description and continues — without any separator — with the full text of the previously-standalone First-login admin self-bootstrap (audit #2, 2026-06-02) entry: handleCallback never wrote the bootstrapAdmin(provider,userId) key, so isCallerAdmin always returned false → /admin/promote + /admin/demote were permanently unreachable…. That bootstrap text (which covers src/auth/{routes,redis-keys}.ts + tests/auth-routes.test.ts) is now invisible as a discrete changelog entry and is incorrectly attributed to this PR.

Fix: split line 18 at the end of (+1 restart real-rejection-503 test). and restore the bootstrap text as its own - **First-login admin self-bootstrap…** bullet under ### Fixed.

Verdict

CONDITIONAL_APPROVE


hib-pr-reviewer • round 1 • 1 finding (1m) • 2026-06-03T16:47:01.369Z → 2026-06-03T16:52:19.311Z • posted-as: pr-reviewer-bot • model: auto

<!-- hib-pr-reviewer collapsed --> > _Superseded by round 2._ <details> <summary>Show previous round</summary> <!-- hib-pr-reviewer round:1 --> ## hib-pr-reviewer review — PR #13 (hib/hib-pr-reviewer) **Round 1** — head `434be224faf3`, base `main`, trigger `opened` **TL;DR:** CONDITIONAL_APPROVE — kept 1 agreed CHANGELOG finding (A+B) and 1 verified unique-to-B stale-comment finding; no blocking issues. ### Summary ## Reconciliation Summary Verified both findings sets against HEAD. **Finding 1 (AGREED — A + B):** CHANGELOG.md line 18 — confirmed. The line ends `…(+1 restart real-rejection-503 test).` and then immediately continues with `handleCallback never wrote the bootstrapAdmin(provider,userId) key…` — the full bootstrapAdmin bullet text folded in without any separator, erasing the standalone bootstrap-admin entry. Kept. **Finding 2 (unique to B):** src/health.ts lines 396–398 — confirmed. The inline comment still reads `// returns 503 instead of hanging the request forever` despite the behavior having been changed to `202 dispatched-unconfirmed` via `PublishDeadlineError`. The comment immediately precedes the `Promise.race` call that now throws `PublishDeadlineError` on timeout; it was not updated. Kept. No findings dropped. Both issues are minor/documentation-class. ### Blast Radius The diff is tightly scoped to two HTTP endpoints (`POST /cancel/:jobId` and `POST /restart`) in `src/health.ts` and their test coverage. No shared state or exported interfaces are affected beyond the injectable deadline options on `HealthServerOpts`. Both issues found are documentation-class (CHANGELOG prose and an inline comment) with no runtime impact. **BLAST_SCORE: 2/10** ### Risk Indicators | Indicator | Value | |---|---| | Sensitive functions | — | | Migration touched | — | | Test delta | +36 / -11 lines in test files | | Dependency changes | — | ### CI status (head `434be224faf3`) **Overall: ✓ success** 3 checks: 3 pending | Check | State | Link | |---|---|---| | pr-reviewer CI / tsc --noEmit (pull_request) | ⏳ pending | [details](/hib/hib-pr-reviewer/actions/runs/88/jobs/0) | | pr-reviewer CI / vitest (pull_request) | ⏳ pending | [details](/hib/hib-pr-reviewer/actions/runs/88/jobs/1) | | pr-reviewer CI / tsc build (pull_request) | ⏳ pending | [details](/hib/hib-pr-reviewer/actions/runs/88/jobs/2) | ### Findings (1) #### **[MINOR]** CHANGELOG Fixed bullet corrupted — bootstrapAdmin fix text folded into cancel/restart entry _CHANGELOG.md:18_ Line 18 is a single mega-bullet that begins with the cancel/restart 202 fix description and continues — without any separator — with the full text of the previously-standalone **First-login admin self-bootstrap (audit #2, 2026-06-02)** entry: `handleCallback never wrote the bootstrapAdmin(provider,userId) key, so isCallerAdmin always returned false → /admin/promote + /admin/demote were permanently unreachable…`. That bootstrap text (which covers `src/auth/{routes,redis-keys}.ts` + `tests/auth-routes.test.ts`) is now invisible as a discrete changelog entry and is incorrectly attributed to this PR. Fix: split line 18 at the end of `(+1 restart real-rejection-503 test).` and restore the bootstrap text as its own `- **First-login admin self-bootstrap…**` bullet under `### Fixed`. ### Verdict **CONDITIONAL_APPROVE** --- <sub>hib-pr-reviewer • round 1 • 1 finding (1m) • 2026-06-03T16:47:01.369Z → 2026-06-03T16:52:19.311Z • posted-as: pr-reviewer-bot • model: auto</sub> </details>
fix: address PR #13 review — CHANGELOG split + stale cancel comment
All checks were successful
pr-reviewer CI / tsc --noEmit (pull_request) Successful in 23s
pr-reviewer CI / vitest (pull_request) Successful in 1m1s
pr-reviewer CI / tsc build (pull_request) Successful in 36s
7119f1694f
- CHANGELOG: my prior edit consumed the bootstrapAdmin bullet's title and folded
  its body into the cancel/restart entry; restore it as a separate bullet.
- health.ts: the cancel handler's inline comment still said "returns 503"; update
  it to the new 202 dispatched-unconfirmed behavior (PublishDeadlineError).

hib-pr-reviewer review — PR #13 (hib/hib-pr-reviewer)

Round 2 — head 7119f1694f56, base main, trigger synchronize

TL;DR: APPROVE — prior CHANGELOG finding fixed, implementation and test coverage are complete and correct.

Summary

[quorum-converged] A=0 = B=0. ## Round 2 review — prior finding resolved, no new issues

The round-1 finding (CHANGELOG.md:18 — bootstrapAdmin bullet folded into cancel/restart entry) is fixed: both bullets are now separate entries in ### Fixed (lines 18 and 19).

The health.ts implementation is correct:

  • PublishDeadlineError cleanly discriminates deadline-timeout from real Redis rejection.
  • Both cancel and restart paths return 202 dispatched-unconfirmed on deadline and 503 on genuine error.
  • Both cancelDeadlineTimer / restartDeadlineTimer are cleared in .finally() — no timer leak.
  • Injectable deadlines (cancelPublishDeadlineMs / restartPublishDeadlineMs) avoid 2 s waits in tests and serve as an ops knob.

Test coverage is now symmetric:

  • cancel: real-rejection → 503 (pre-existing, untouched)
  • cancel: deadline → 202 (updated)
  • restart: real-rejection → 503 (newly added)
  • restart: deadline → 202 (newly added)

No actionable findings this round.

Blast Radius

Changes are confined to the POST /cancel/:jobId and POST /restart handlers inside src/health.ts, an injectable options extension, and corresponding tests. No shared state, middleware, or external protocol contract is altered; the new 202 path is purely additive on top of the existing deadline branch.

BLAST_SCORE: 2/10

Risk Indicators

Indicator Value
Sensitive functions
Migration touched
Test delta +36 / -11 lines in test files
Dependency changes

CI status (head 7119f1694f56)

Overall: ✓ success

3 checks: 3 pending

Check State Link
pr-reviewer CI / tsc --noEmit (pull_request) pending details
pr-reviewer CI / vitest (pull_request) pending details
pr-reviewer CI / tsc build (pull_request) pending details

Findings

No new findings this round.

Quorum converged on empty findings (A + B both returned 0).

Verdict

APPROVE


hib-pr-reviewer • round 2 • 0 findings • 2026-06-03T16:56:34.926Z → 2026-06-03T16:58:28.609Z • posted-as: pr-reviewer-bot • model: auto

<!-- hib-pr-reviewer round:2 --> ## hib-pr-reviewer review — PR #13 (hib/hib-pr-reviewer) **Round 2** — head `7119f1694f56`, base `main`, trigger `synchronize` **TL;DR:** APPROVE — prior CHANGELOG finding fixed, implementation and test coverage are complete and correct. ### Summary [quorum-converged] A=0 = B=0. ## Round 2 review — prior finding resolved, no new issues The round-1 finding (CHANGELOG.md:18 — bootstrapAdmin bullet folded into cancel/restart entry) is **fixed**: both bullets are now separate entries in `### Fixed` (lines 18 and 19). The `health.ts` implementation is correct: - `PublishDeadlineError` cleanly discriminates deadline-timeout from real Redis rejection. - Both cancel and restart paths return 202 `dispatched-unconfirmed` on deadline and 503 on genuine error. - Both `cancelDeadlineTimer` / `restartDeadlineTimer` are cleared in `.finally()` — no timer leak. - Injectable deadlines (`cancelPublishDeadlineMs` / `restartPublishDeadlineMs`) avoid 2 s waits in tests and serve as an ops knob. Test coverage is now symmetric: - cancel: real-rejection → 503 (pre-existing, untouched) ✅ - cancel: deadline → 202 (updated) ✅ - restart: real-rejection → 503 (newly added) ✅ - restart: deadline → 202 (newly added) ✅ No actionable findings this round. ### Blast Radius Changes are confined to the `POST /cancel/:jobId` and `POST /restart` handlers inside `src/health.ts`, an injectable options extension, and corresponding tests. No shared state, middleware, or external protocol contract is altered; the new 202 path is purely additive on top of the existing deadline branch. **BLAST_SCORE: 2/10** ### Risk Indicators | Indicator | Value | |---|---| | Sensitive functions | — | | Migration touched | — | | Test delta | +36 / -11 lines in test files | | Dependency changes | — | ### CI status (head `7119f1694f56`) **Overall: ✓ success** 3 checks: 3 pending | Check | State | Link | |---|---|---| | pr-reviewer CI / tsc --noEmit (pull_request) | ⏳ pending | [details](/hib/hib-pr-reviewer/actions/runs/89/jobs/0) | | pr-reviewer CI / vitest (pull_request) | ⏳ pending | [details](/hib/hib-pr-reviewer/actions/runs/89/jobs/1) | | pr-reviewer CI / tsc build (pull_request) | ⏳ pending | [details](/hib/hib-pr-reviewer/actions/runs/89/jobs/2) | ### Findings **No new findings this round.** _Quorum converged on empty findings (A + B both returned 0)._ ### Verdict **APPROVE** --- <sub>hib-pr-reviewer • round 2 • 0 findings • 2026-06-03T16:56:34.926Z → 2026-06-03T16:58:28.609Z • posted-as: pr-reviewer-bot • model: auto</sub>
hibryda deleted branch fix/cancel-restart-202-on-deadline 2026-06-03 18:58:48 +02:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
hib/hib-pr-reviewer!13
No description provided.