feat(redis-outbox): Redis Lua-EVAL OutboxBackend module (Wave A.2 Phase 0b) #8
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "feat/redis-outbox-backend"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Wave A.2 Phase 0b — Redis Lua-EVAL
OutboxBackendmoduleNew
redis-outbox-backendMaven module: a Redis Lua-EVAL implementation of the Phase 0aOutboxBackendSPI, so identity-service keeps ticket mint + outbox event atomic (the mint is itself a Redis op — ADR-0014 D-3 /docs/decisions/0014-supporting/identity-redis-outbox-backend.md). Separate module so the 7 PG consumers never pull in spring-data-redis. Design + the 12 required refinements:.planning/32-wave-a2-redis-outbox-design.md.Storage model
One Hash
{identity:outbox}:entries(field=event_id, value=JSON entry) + companion sorted-sets:pending(score=nextAttemptAt) +:expiry-due(score=expiresAt), all hash-tagged{identity:outbox}for single-slotEVAL(#8).requiresActiveTransaction()=false; the backend owns its own inline relay (RedisOutboxRelay) + cold poller (RedisOutboxPollerWorker) — it does not route through the PGOutboxPublisher's unconditional AFTER_COMMIT relay (resolves the Phase 0a R3@apiNote).12 refinements → artifacts
outbox-enqueue): TYPE precheck before any write; HSET entry before ZADD index so a mid-script death leaves only a repairable orphan-Hash-entry (never an unpersisted index member); HEXISTS idempotent guard.recordExpiryDue/findDueExpiries/decideExpiry+expiry-decide.lua+TicketExpiryEventId(UUIDv3 collapse). Primitive only — identity wiring is Phase 1.mark-failurereturns[applied,parked]; FAILED retained in Hash, ZREM'd from:pending, never auto-deleted; parked metric.SET NX PXowner-token +lock-renew/lock-releasecompare-and-* Lua; batch bounded< lock TTL; lost-lock aborts batch;renewMargin < ttlvalidated at startup.nextAttemptAtscoring:RedisOutboxBackoff(exp, capped, overflow-safe); failure re-scores:pending;findPendingBatch=ZRANGEBYSCORE -inf now.RedisDurabilityValidator(startup +@Scheduled) asserts appendonly/appendfsync/noeviction; FAIL_FAST/WARN modes.@ConditionalOnProperty backend=REDIS(NOT classpath auto-select); suppliesOutboxBackendso PG@ConditionalOnMissingBeanbacks off.markSent/markFailureAttempthonour the SPI1/0contract (themark-failure2-element[applied,parked]Lua return is unwrapped toapplied;parkeddrives the metric). 16 main classes + 8 Lua scripts.Rule-61 falsification (flagged for review)
Verified against
redis.io/commands/wait: RedisWAITis a replication barrier — in standalone mode (identity's audited Valkey,cluster_enabled:0)WAIT 0 …returns immediately and does not flush AOF. Soappendfsyncis the only single-node durability lever.fsyncBarrierEnabledis kept as a documented no-op toggle (default false) for a future replicated topology;RedisDurabilityValidatoris the actual enforcement. The design doc §7 floated a "WAIT-after-EVAL fsync barrier" — this corrects it.Other judgment calls
ARCHIVED(PG-only) has no Redis analogue; retention HDELs SENT instead.@JsonInclude(NON_NULL)is load-bearing (Luacjsondecodes JSONnullto a truthy sentinel) — documented + unit-tested.markFailureAttemptreads current retries via one GET before the EVAL to size the backoff (best-effort hint; the script is the atomic mutator; duplicates are consumer-deduped).OutboxAutoConfiguration's eager JPA wiring (its own Javadoc's Phase-0b/1 follow-up) is a FROZEN Phase 0a file — left untouched; a Redis-only consumer simply omits spring-data-jpa from its classpath.Verification
CONFIG SET maxmemory 1mb); lock-expiry-mid-batch + single-winner; Kafka-outage-beyond-budget→parked; expire/revoke race + UUIDv3 collapse; durabilityCONFIG SETdrift; REDIS-vs-PG autoconfig selection.mvn -B install -DskipTests=false→ BUILD SUCCESS (6 modules), 0[WARNING].outbox-publisher(Phase 0a) untouched.Note
The Forgejo
mvn installCI check is red on this repo for all commits (pre-existing — runner defaultnode:22-bookwormhas no JDK/Maven,mvn -version→exit 127; PR-PLATFORM-CI-1). Verified locally green.New redis-outbox-backend Maven module: a Redis Lua-EVAL implementation of the Phase 0a OutboxBackend SPI, so identity-service can keep ticket mint + outbox event atomic (the mint is itself a Redis op — ADR-0014 D-3 / 0014-supporting/identity-redis-outbox-backend.md). Separate module so the 7 PG consumers never pull in spring-data-redis. Design + the 12 required refinements: .planning/32-wave-a2-redis-outbox-design.md. Storage: one Hash {identity:outbox}:entries (field=event_id, JSON entry) + companion sorted-sets :pending (score=nextAttemptAt) + :expiry-due (score=expiresAt), all hash-tagged for single-slot EVAL (#8). requiresActive Transaction()=false; the backend owns its OWN inline relay (RedisOutboxRelay) + cold poller (RedisOutboxPollerWorker) — it does NOT route through the PG OutboxPublisher's unconditional AFTER_COMMIT relay (Phase 0a R3 @apiNote). 12 refinements → artifacts: - #1 fail-closed Lua (outbox-enqueue): TYPE precheck before any write; HSET entry BEFORE ZADD index so a mid-script death leaves only a repairable orphan-Hash-entry, never an unpersisted index member; HEXISTS idempotent guard. - #2 durable expiry intent: recordExpiryDue/findDueExpiries/decideExpiry + expiry-decide.lua + TicketExpiryEventId (UUIDv3) — primitive only, identity wiring is Phase 1. - #3 parked FAILED: mark-failure returns [applied,parked]; FAILED retained in Hash, ZREM'd from :pending, never auto-deleted; parked metric. - #4 lock: SET NX PX owner-token + lock-renew/lock-release compare-and-* Lua, batch bounded < lock TTL, lost-lock aborts batch; renewMargin<ttl validated. - #5 nextAttemptAt scoring: RedisOutboxBackoff (exp, capped, overflow-safe); failure re-scores :pending; findPendingBatch = ZRANGEBYSCORE -inf now. - #7 durability: RedisDurabilityValidator (startup + @Scheduled) asserts appendonly/appendfsync/noeviction; FAIL_FAST/WARN modes. - #8 hash-tags on all 4 keys. - #9 retention: HSCAN + status-guarded delete of SENT past a window; FAILED never reaped. - #10 repair: reconciles both orphan classes (zset-member-no-hash, hash-PENDING-no-zset). - #11 explicit selection: @ConditionalOnProperty backend=REDIS (NOT classpath auto-select); supplies OutboxBackend so PG @ConditionalOnMissingBean backs off. - #12 client-retry idempotency via the enqueue HEXISTS guard. markSent / markFailureAttempt honour the SPI 1/0 contract (mark-failure's 2-element [applied,parked] Lua return is unwrapped to applied; parked drives the metric). 16 main classes + 8 Lua scripts. Rule-61 falsification (verified against redis.io/commands/wait): Redis WAIT is a REPLICATION barrier — in standalone (identity's audited Valkey) WAIT 0 returns immediately and does NOT flush AOF. So appendfsync is the only single-node durability lever; fsyncBarrierEnabled is kept as a documented no-op toggle (default false) for a future replicated topology, and RedisDurabilityValidator is the actual enforcement. Tests: 66 unit + 31 Testcontainers-Valkey IT = 97, all green. IT scenarios: crash-after-EVAL→repair, crash-after-ack-before-markSent→stays PENDING, OOM-under-noeviction→enqueue fail-closed (real CONFIG SET maxmemory), lock-expiry-mid-batch + single-winner, Kafka-outage-beyond-budget→parked, expire/revoke race + UUIDv3 collapse, durability CONFIG-SET drift, REDIS-vs-PG autoconfig selection. Full-reactor mvn install BUILD SUCCESS (6 modules), 0 warnings. outbox-publisher (Phase 0a) untouched; parent pom +module +3 shared version props.Show previous round
hib-pr-reviewer review — PR #8 (affinity-intelligence-rework/im2be-platform-libs)
Round 1 — head
f3f96df505e9, basemain, triggeropenedTL;DR: NEEDS_WORK — kept 2 agreed+verified major findings (corrupted-JSON poison-pill in cold poller; unbounded .get() in hot relay) plus 2 minors and 1 info; no findings dropped.
Summary
Arbitration — Redis Outbox Backend (PR #8)
Reconciled 7 candidate findings (4 from A, 3 from B). Used
Readto verify all five unique-to-one findings at their cited lines.Agreed finding kept (1): Both A and B independently flag
RedisOutboxBackend.java:154for theLinkedHashSetordering assumption — confirmed at that line (rangeByScore()→Set<String>→new ArrayList<>(dueEventIds)).Unique-to-A kept (3):
decode(json).toRecord()throws uncheckedRedisOutboxStorageExceptionon malformed JSON with no catch in the for-loop;decode()at lines 410-416 confirmed. Permanently blocks cold poller. KEPT major.RedisOutboxBackendValidationwhich grep confirms exists nowhere in the codebase; actual validation is inRedisOutboxAutoConfigurationconstructor lines 96-103. KEPT minor.@EnableSchedulingon the@AutoConfigurationclass confirmed. KEPT info.Unique-to-B kept (2):
RedisOutboxRelay.java:112—.get()with no timeout args confirmed; contrast withtryColdPublishat PollerWorker line 194 which uses.get(sendTimeoutMs, TimeUnit.MILLISECONDS). Blocking indefinitely on Kafka stall is a real major defect. KEPT major.RedisOutboxPollerWorker.java:147—elapsedMsis computed fromtickStartNanosand only grows; once threshold is crossed there is nolastRenewNanosreset, sorenewLockis called on every remaining record in the batch. Confirmed by reading lines 143-167. KEPT minor.No findings dropped: all six verified positively.
Blast Radius
This PR adds a single new opt-in module (
redis-outbox-backend) gated behindim2be.outbox.backend=REDIS. The blast radius is contained to consumers that explicitly enable the Redis backend. However, the@EnableSchedulingplacement activates the Spring task scheduler globally within those consumer contexts, and the hot-relay threading issue affects any request path that callsrelay()under Kafka degradation.BLAST_SCORE: 5/10
Risk Indicators
RedisOutboxRelay.relay,RedisOutboxRelay.relayAsync,RedisOutboxPollerWorker.tryColdPublish,RedisOutboxPollerWorker.pollAndPublishPending,RedisOutboxBackend.findPendingBatch,RedisOutboxBackend.decode,RedisDurabilityValidator.validateCI status (head
f3f96df505e9)Overall: ✗ failure
2 checks: 2 pending
Findings (6)
[MAJOR] Corrupted Hash entry propagates unhandled RedisOutboxStorageException from findPendingBatch, permanently blocking the cold poller
redis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxBackend.java:179
Line 179:
decode()(lines 410-416) throwsRedisOutboxStorageException(unchecked) onJsonProcessingException. This exception propagates out offindPendingBatch()and out ofsweep()— it is never caught before thefinally { releaseLock(ownerToken); }inpollAndPublishPending(). The@Scheduledframework swallows it and reschedules. On the next tick the same corrupted entry is still at the front of:pending(its score is unchanged), and the cycle repeats indefinitely, silencing all cold-path relaying.The
null-check at line 172 handles orphan members, but there is no catch for malformed JSON. The repair worker's owndecode()path correctlycontinues on parse error but does not detect this condition as needing repair.Fix: Add a try-catch inside the for-loop body:
Also add a unit test that injects a malformed JSON value and asserts
findPendingBatchreturns the other valid entries rather than throwing.[MAJOR]
relay()blocks on.get()with no timeout — can exhaust calling threads when Kafka degradesredis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxRelay.java:112
Lines 110-112:
This blocks indefinitely if the Kafka producer stalls. The cold-path equivalent in
RedisOutboxPollerWorker.tryColdPublish(line 194) correctly uses.get(sendTimeoutMs, TimeUnit.MILLISECONDS). When the broker becomes unreachable and nodelivery.timeout.msis set in the producer config, every thread that callsrelay()hangs until interrupted. Crucially, a hung.get()never raises an error to the circuit breaker — so the breaker never opens, and thread-pool exhaustion occurs before any self-healing kicks in.Fix: Apply the same bounded-timeout pattern as the poller:
Either reuse
RedisOutboxProperties.Poller.sendTimeoutMs(already wired) or add a dedicatedrelay.send-timeout-msproperty. PropagateTimeoutExceptiontocircuitBreaker.onError(...)the same wayExecutionExceptionis handled at line 121.[MINOR] Javadoc cites non-existent class RedisOutboxBackendValidation for the lockRenewMargin cross-field check
redis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxProperties.java:189
Lines 187-188 of the
lockRenewMarginMsfield Javadoc:RedisOutboxBackendValidationdoes not exist anywhere in this module (confirmed via repo-wide grep — the only occurrence is this comment). The actual validation lives inRedisOutboxAutoConfigurationconstructor lines 96-103. Future maintainers following the@linkwill hit a dead end.Fix:
[MINOR] Renew condition fires on every remaining record after the threshold is crossed — O(N) redundant Redis round-trips
redis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxPollerWorker.java:147
Line 144 computes
elapsedMsfromtickStartNanos; it only grows. Line 147:Once
elapsedMscrossesrenewAtElapsedMs, the condition stays true for every subsequent record in the batch. There is nolastRenewNanosreset after a successful renewal, sorenewLockis called (and makes a Redis EVAL round-trip) for every record from the threshold-crossing position to the end of the batch. For a 500-item batch with a 30s TTL and 10s margin this is potentially ~100 extra Redis calls per tick.Correctness is intact (TTL reset is idempotent), but the unnecessary round-trips add latency under load.
Fix: Track a
lastRenewNanosand gate on both elapsed and since-last-renew:Alternatively a simple
boolean renewedflag reset per tick achieves the same goal.[MINOR]
findPendingBatchordering claim relies on LinkedHashSet implementation detail of Spring Data Redisredis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxBackend.java:154
Lines 154-163:
The method Javadoc states results are in "ascending-score (oldest-due-first) order", but
opsForZSet().rangeByScore()is declared to returnSet<V>, which carries no ordering contract. The implementation works because Spring Data Redis / Lettuce returns aLinkedHashSetthat preservesZRANGEBYSCOREinsertion order, andArrayList's copy constructor iterates in that order — but this is an undeclared implementation detail.A Spring Data upgrade switching to a non-insertion-ordered set would silently break relay ordering without any compile error or test failure.
Preferred fix: Use
rangeByScoreWithScores()(as already done infindDueExpiries()at line 336) and collect values from theTypedTuplestream, which is explicitly sorted. Minimal fix: Add a code comment documenting theLinkedHashSetdependency so a future library upgrade triggers a conscious review.[INFO] @EnableScheduling on @AutoConfiguration activates the Spring task scheduler globally for the consumer context
redis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxAutoConfiguration.java:81
Line 81:
@EnableSchedulingis placed directly on the@AutoConfigurationclass. This registersScheduledAnnotationBeanPostProcessorfor the entire application context, not just for this module's beans. Any@Scheduledbeans in the consumer application that were previously dormant (because the consumer had not opted in to Spring's task scheduler) will be activated as a side effect of settingim2be.outbox.backend=REDIS.The annotation is required for the poller/durability/retention/repair ticks, and the blast is scoped to consumers who explicitly opt in, so the risk is low. However this side effect is not documented anywhere in the module's README or CLAUDE.md.
Suggested mitigation: Add a sentence to the module README noting that enabling
im2be.outbox.backend=REDISalso enables the Spring task scheduler in the application context, so consumer teams are not surprised.Verdict
NEEDS_WORK
hib-pr-reviewer • round 1 • 6 findings (2M/3m/1i) • 2026-05-28T11:35:02.016Z → 2026-05-28T11:37:02.457Z • posted-as: pr-reviewer-bot
R1 verdict NEEDS_WORK, findings (kept=6: 2 MAJOR / 3 MINOR / 1 INFO): (1) MAJOR RedisOutboxBackend.findPendingBatch:179 — corrupted-JSON poison-pill. decode() throws unchecked RedisOutboxStorageException uncaught in the for-loop → propagates out of the @Scheduled sweep every tick while the entry's :pending score is unchanged → permanently stalls ALL cold relaying. (2) MAJOR RedisOutboxRelay.relay:112 — unbounded .get(). A stalled broker blocks the calling thread forever; a hung .get() never raises to the breaker so it never opens → thread exhaustion before self-heal. (3) MINOR RedisOutboxProperties:189 — Javadoc @link to non-existent RedisOutboxBackendValidation. (4) MINOR RedisOutboxPollerWorker:147 — renew fired on every remaining row once the elapsed threshold was crossed (O(batch) redundant Redis EVALs). (5) MINOR RedisOutboxBackend:154 — findPendingBatch ordering relied on the undeclared LinkedHashSet impl detail of rangeByScore's Set<V> return. (6) INFO RedisOutboxAutoConfiguration:81 — @EnableScheduling activates the Spring scheduler context-wide (consumer side effect, undocumented). Fix: - (1) try-catch the per-entry decode in findPendingBatch: skip + WARN + a new im2be_outbox_redis_undecodable_total metric (observable, not a silent forever-skip; operator inspects/repairs off the metric). Other due entries still relay. New unit test injects malformed JSON and asserts the valid entry survives + the metric fires. - (2) bound relay's send on a new sendTimeoutMs ctor arg (wired from poller.send-timeout-ms in the autoconfig); add TimeoutException to the onError+markFailureAttempt catch so the breaker can open. New unit test proves a never-completing future returns promptly + records a failure attempt. - (3) Javadoc @link → RedisOutboxAutoConfiguration (the real validation site). - (4) gate renewal on time-SINCE-LAST-RENEW (lastRenewNanos), so it fires at most once per (ttl - margin) window. - (5) findPendingBatch uses rangeByScoreWithScores + an explicit Comparator.comparingDouble(score) sort, so oldest-due-first ordering is OUR guarantee, not a library impl detail (consistent with findDueExpiries). - (6) document the context-wide @EnableScheduling side effect in the autoconfig class Javadoc. Ripple (rule 63): the relay ctor change (5→6 args) broke RedisOutboxFailureLifecycleIT's construction — updated. The findPendingBatch switch to rangeByScoreWithScores broke 2 RedisOutboxBackendTest stubs (rangeByScore → rangeByScoreWithScores returning Set<DefaultTypedTuple>) — updated. Verification: - mvn -pl redis-outbox-backend -am clean verify → BUILD SUCCESS, 0 [WARNING] - 68 unit + 31 IT green (+2 new: malformed-JSON skip + relay send-timeout).Show previous round
hib-pr-reviewer review — PR #8 (affinity-intelligence-rework/im2be-platform-libs)
Round 2 — head
3a502993001b, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — kept all 3 unique-to-one findings after direct Read verification; all minor; no blocking issues.
Summary
Arbitration — Round 2
All three findings are unique-to-one reviewer and required verification. All three were confirmed against HEAD.
A-1 (unique-to-A) —
@EnableSchedulingat line 88 ofRedisOutboxAutoConfiguration.javaconfirmed. Javadoc at lines 70–79 documents the side-effect but the annotation and its context-wide behaviour are unchanged. Kept.A-2 (unique-to-A) —
scanHashEntries()at lines 157–170 ofRedisOutboxRepairWorker.javaconfirmed. The method accumulates both keys and full JSON values (field.getValue().toString()) into aHashMapbefore any filtering. Kept.B-1 (unique-to-B) —
findDueExpiries()at lines 372–378 ofRedisOutboxBackend.javaconfirmed. The method iterates aSet<TypedTuple<String>>without calling.sorted(Comparator.comparingDouble(...)), despite the Javadoc at line 354 guaranteeing"(oldest-due-first)". The analogous fix forfindPendingBatch()is visible at line 171 of the same file, with a comment explicitly noting the reviewer-R1-MINOR rationale — making the omission infindDueExpiries()a clear oversight. Kept.Result: 3 unique-to-one findings kept after verification, 0 dropped. All minor.
Blast Radius
All changes are confined to the new
redis-outbox-backendmodule. The auto-configuration registers viaAutoConfiguration.importsand activates only whenim2be.outbox.backend=REDISis set, so existing consumers using the default PostgreSQL backend are unaffected. The@EnableSchedulingside-effect (finding 1) is the only cross-cutting concern — it reaches into the consumer's full application context when the backend is selected.BLAST_SCORE: 4/10
Risk Indicators
RedisOutboxAutoConfiguration (EnableScheduling context-wide side-effect),RedisDurabilityValidator (AOF + noeviction enforcement gate),RedisOutboxRelay (hot-path Kafka send with circuit-breaker),RedisOutboxBackend.findDueExpiries (expiry ordering contract)CI status (head
3a502993001b)Overall: ✗ failure
2 checks: 2 pending
Findings (3)
[MINOR] @EnableScheduling side-effect still present — documentation improved but not fixed (R1-6 carry-over)
redis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxAutoConfiguration.java:88
The
@EnableSchedulingannotation on the auto-configuration class (line 88) registersScheduledAnnotationBeanPostProcessorfor the entire application context. The Javadoc at lines 70–79 now explicitly documents this, which is a meaningful improvement, but the behaviour is unchanged: any consumer@Scheduledbean that was dormant because the consumer had not opted into scheduling will silently start running the momentim2be.outbox.backend=REDISis set.A safe, minimal fix that preserves the documented semantics without removing scheduling support:
This scopes the scheduling opt-in to consumers that have the backend workers active.
[MINOR] scanHashEntries() materialises every Hash field value (full JSON) into JVM heap with no upper bound
redis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxRepairWorker.java:157
scanHashEntries()(lines 157–170) accumulates both the key and full JSON value of every entry in the:entriesHash into ajava.util.HashMapbefore the cross-set comparison begins. Under the default 24-hour retention TTL, a moderately active system can accumulate tens of thousands of entries; at ~500 bytes/entry JSON this becomes tens-to-hundreds of MB materialised per repair-worker tick.The Class A check (orphan zset members) requires only field names, not values. The Class B check (orphan Hash entries) needs only the status field for candidate IDs not found in the pending set. Both can be served with a two-pass approach:
For Class B, decode on demand via a targeted
HMGETfor only those candidate IDs that are in Hash but NOT in pending, rather than pre-loading all values. This keeps the repair worker heap-neutral at the cost of one additional targeted fetch for the candidate set — a worthwhile trade given unbounded growth undernoeviction.[MINOR] findDueExpiries() iterates Set without sorting — breaks the "oldest-due-first" Javadoc contract
redis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisOutboxBackend.java:372
findDueExpiries()at lines 372–378 iterates theSet<ZSetOperations.TypedTuple<String>>returned byrangeByScoreWithScores()without sorting by score, yet the Javadoc at line 354 guarantees"(oldest-due-first)".Set<V>carries no ordering guarantee; the contract relies on the undocumented insertion-order behaviour of whateverLinkedHashSetSpring Data Redis happens to return.The identical issue was fixed for
findPendingBatch()at line 171 (same file), where a comment explicitly calls out the reviewer-R1-MINOR rationale:Apply the same pattern to
findDueExpiries():The Phase-1 due-time worker will drive expiry ordering off this list; incorrect ordering (e.g. newest-first under a different
Setimpl) would process recent expirations before older ones, violating the promised fairness contract.Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 2 • 3 findings (3m) • 2026-05-28T11:52:53.172Z → 2026-05-28T11:54:04.320Z • posted-as: pr-reviewer-bot
R2 verdict CONDITIONAL_APPROVE, findings (kept=3, all MINOR): (1) RedisOutboxBackend.findDueExpiries:372 — iterated the rangeByScoreWithScores Set without sorting, contradicting its "oldest-due-first" Javadoc. Direct ripple of the R1 findPendingBatch fix that wasn't applied to its sibling. (2) RedisOutboxRepairWorker.scanHashEntries:157 — materialised every Hash field VALUE (full JSON) into heap before filtering; unbounded under noeviction as :entries accumulates SENT/FAILED entries within the retention window. (3) RedisOutboxAutoConfiguration:88 — @EnableScheduling activates the scheduler context-wide (R1 INFO escalated to MINOR with a suggested nested-activator). Fix: - (1) findDueExpiries now sorts by score (Comparator.comparingDouble) + filters null value/score, identical to findPendingBatch — ordering is ours, not a Set-impl detail. - (2) repair worker is now heap-neutral: scanHashKeys() retains only field NAMES (HSCAN streams values but each goes out of scope immediately), Class A compares against that key set, and Class B fetches only the non-pending candidates' values in bounded HMGET chunks (CLASS_B_FETCH_BATCH=256), decoding on demand. Peak heap = chunk×entry, never Hash-size×entry. - (3) NOT applied — the reviewer's nested-activator gated on poller.enabled is flawed: durability (unconditional) + retention + repair are ALSO @Scheduled, so a poller-only gate would silently stop them; and Spring's @EnableScheduling registers the BPP context-wide wherever declared, so a nested activator does not scope the leak. Documented as an intrinsic, accepted trade-off in the autoconfig Javadoc (true isolation would need a module-private programmatic TaskScheduler — a deliberate non-goal). Reasoned won't-fix. Verification: - mvn -pl redis-outbox-backend -am clean verify → BUILD SUCCESS, 0 [WARNING] - 68 unit + 31 IT green; RedisOutboxCrashRecoveryIT (real Valkey) exercises the refactored repair worker.Show previous round
hib-pr-reviewer review — PR #8 (affinity-intelligence-rework/im2be-platform-libs)
Round 3 — head
71f9e4525a81, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — dropped A-1 (decode() already returns null on corrupt JSON; no exception propagates); kept B-1 (minor: @Scheduled swallows FAIL_FAST throw, verified no custom ErrorHandler in AutoConfiguration); kept B-2 (info/deferred: TOCTOU on alreadyRevoked ARGV[6], Phase 1 concern).
Summary
Arbitration — Round 3
Methodology
All three findings (A-1, B-1, B-2) are unique-to-one, so each required independent verification via
Read.Finding A-1 — DROPPED (disproved by verification)
Reviewer A claimed
decode(json)atRedisOutboxRepairWorker.java:148throws an uncheckedRedisOutboxStorageExceptionwhen a Hash value is corrupt, with no try-catch in the Class B loop.Actual code (lines 202-209):
decode()already catchesJsonProcessingExceptioninternally and returnsnull. Then at lines 149-151,if (decoded == null || ...) { continue; }handles the null case — the corrupted entry is skipped and the loop continues. The method never throwsRedisOutboxStorageException. The premise of A-1 is factually wrong; the fix the reviewer requested is already in place.Finding B-1 — KEPT (verified)
RedisDurabilityValidator.periodicCheck()(line 109) callsvalidate("periodic"), which at lines 128-135 throwsRedisOutboxDurabilityException extends RuntimeExceptionin FAIL_FAST mode. Spring's@Scheduleddispatcher usesTaskUtils.LOG_AND_SUPPRESS_ERROR_HANDLERby default, which logs and re-schedules on anyRuntimeException. Verified thatRedisOutboxAutoConfiguration(all 389 lines) registers NO customTaskScheduler,SchedulingConfigurer, orErrorHandlerbean. The class-level Javadoc (line 22) and the@throwstag (lines 105-106) both imply the periodic check halts the application on FAIL_FAST violation — but this is mechanically impossible under@Scheduledwithout an overridingErrorHandler. The relay continues uninterrupted after a post-boot durability regression. B-1 is confirmed.Finding B-2 — KEPT (verified, info/deferred)
expiry-decide.lua:35documentsARGV[6] = revokedas a Java-side pre-computed value. The script header confirms Phase 0b exposes the primitive but the due-time worker caller is not wired until Phase 1. The TOCTOU gap (concurrent revocation between Java check and EVAL execution) is architecturally real; the script-level mitigation comment (UUIDv3 dedup) addresses the duplicate-event collapse but not the incorrect event-type selection. Correctly deferred to Phase 1.Blast Radius
The diff introduces a net-new module (redis-outbox-backend) with its own auto-configuration, four scheduled workers, Lua scripts, and a Spring Boot AutoConfiguration.imports registration. Blast radius is self-contained for consumers that do not opt in to im2be.outbox.backend=REDIS. For those that do, the module activates the global task scheduler context-wide (@EnableScheduling side-effect, previously documented and accepted). The durability-enforcement gap (B-1) affects all FAIL_FAST adopters post-boot; the TOCTOU gap (B-2) is Phase-1-gated.
BLAST_SCORE: 5/10
Risk Indicators
RedisDurabilityValidator.validate,RedisDurabilityValidator.afterPropertiesSet,RedisDurabilityValidator.periodicCheck,RedisOutboxBackend.decideExpiry,RedisOutboxRelay.relay,RedisOutboxPollerWorker.pollOnce,expiry-decide.lua (EVAL boundary)CI status (head
71f9e4525a81)Overall: ✗ failure
2 checks: 2 pending
Findings (2)
[MINOR] periodicCheck() throws RedisOutboxDurabilityException in FAIL_FAST mode but @Scheduled silently swallows it — relay continues against a non-durable store post-boot
redis-outbox-backend/src/main/java/com/aim2be/platform/outbox/redis/RedisDurabilityValidator.java:109
periodicCheck()(line 109) delegates tovalidate("periodic"), which at lines 128-135 throwsRedisOutboxDurabilityException extends RuntimeExceptionin FAIL_FAST mode. Spring's@Scheduleddispatcher usesTaskUtils.LOG_AND_SUPPRESS_ERROR_HANDLERas its defaultErrorHandler, which logs the exception and re-schedules the task — it does NOT propagate the exception or halt the application.RedisOutboxAutoConfigurationregisters no customTaskScheduler,SchedulingConfigurer, orErrorHandlerbean that would override this behaviour.Consequence: the FAIL_FAST halting guarantee holds only at startup (via
afterPropertiesSet()). A liveCONFIG SET appendonly nodetected by the periodic check fires theim2be_outbox_redis_durability_violations_totalmetric counter but leavesRedisOutboxRelayandRedisOutboxPollerWorkerrunning uninterrupted against the now-non-durable store. The class-level Javadoc (line 22: "so a live CONFIG SET that weakens durability after boot is caught") and the@throwsdeclaration onperiodicCheck()(lines 105-106) both imply a stronger halting semantics than the dispatch model provides.Fix options (pick one):
SchedulingConfigurerorTaskSchedulerbean with a customErrorHandlerthat callscontext.close()onRedisOutboxDurabilityException— matches the startup-gate semantics for periodic violations.@throwsonperiodicCheck(), and update the class-level Javadoc to state explicitly that the periodic check is metric-only in FAIL_FAST mode and does NOT halt the application; operators respond to the alert.Option 2 is the minimal/safe fix; option 1 is correct for audit-grade enforcement.
[INFO] Caller-supplied
revokedARGV[6] creates a TOCTOU gap — concurrent ticket revocation between Java check and EVAL can produce both TicketExpired and TicketRevoked eventsredis-outbox-backend/src/main/resources/META-INF/scripts/expiry-decide.lua:35
Line 35 documents
ARGV[6] = revoked ('1' when the ticket was already revoked, else '0'). The Java caller (RedisOutboxBackend.decideExpiry) receivesalreadyRevokedas a pre-computed boolean determined before theEVALis issued. A concurrentTicketRevokedprocessing between when the caller checks revocation state and when the Lua script executes atomically would cause the script to enqueue aTicketExpiredevent even though the ticket was revoked in that window, yielding both aTicketRevokedand aTicketExpiredevent for the same ticket.The design mitigates this via ADR-0014 D-4 consumer-side dedup (the deterministic UUIDv3
expiredEventIdcollapses double-fires), and the security-audit pipeline (ADR-0002 §3b) must be idempotent with respect to this race. However, the revocation-suppression path (script lines 63-66) is bypassed in the race window, and consumers receive two distinct event types rather than one. Depending on downstream processing, both events being simultaneously present could cause an incorrect state transition (ticket showing as both expired AND revoked rather than only revoked).A robust fix would move the revocation check inside the Lua script itself against a Redis-resident revocation marker (e.g.
EXISTS :ticket:{id}or a dedicated revoked-set membership check), eliminating the Java-side TOCTOU window entirely. Deferred to Phase 1 — the due-time worker that callsdecideExpiryis not yet wired in Phase 0b; the gap becomes exploitable only when Phase 1 lands.Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 3 • 2 findings (1m/1i) • 2026-05-28T12:07:27.767Z → 2026-05-28T12:09:06.769Z • posted-as: pr-reviewer-bot
R3 verdict CONDITIONAL_APPROVE. A-1 (findPendingBatch poison-pill) correctly DROPPED by the arbiter (already handled in R1). Findings (kept=2: 1 MINOR / 1 INFO-deferred): (1) MINOR RedisDurabilityValidator.periodicCheck:109 — overstated halting semantics. periodicCheck() throws RedisOutboxDurabilityException in FAIL_FAST, but Spring's @Scheduled default ErrorHandler (LOG_AND_SUPPRESS) logs + reschedules — it does NOT halt. So FAIL_FAST only halts at STARTUP; a live CONFIG SET weakening durability post-boot fired the metric but left the relay/poller running, while the Javadoc + @throws implied a hard stop. (2) INFO expiry-decide.lua:35 — TOCTOU on the caller-supplied `revoked` ARGV. A concurrent TicketRevoked in the Java-check->EVAL window can yield both TicketExpired + TicketRevoked. Deferred to Phase 1 (the due-time worker that calls decideExpiry is unreachable in Phase 0b). Fix: - (1) periodicCheck() now CATCHES the FAIL_FAST throw and converts it into an explicit, actionable ERROR alert (paired with the existing durability-violations metric) instead of letting Spring swallow it — honest (a @Scheduled method cannot abort a running app) + loud. The startup gate (afterPropertiesSet) keeps the hard fail-closed throw. Class Javadoc + periodicCheck @throws corrected to state the two-phase enforcement. New unit test asserts periodicCheck() alerts-but-does-not-throw on a FAIL_FAST violation. - (2) documented the TOCTOU + the planned in-script revocation-check fix (against a Redis-resident revocation marker) as TODO(Phase 1) on both decideExpiry's Javadoc and the expiry-decide.lua ARGV[6] comment — to land with the Phase-1 due-time worker that first exercises this path. Not fixed in Phase 0b (no caller; the gap is unreachable until Phase 1). Ripple (rule 63): the periodicCheck semantics change broke 2 RedisOutboxDurabilityIT tests that asserted periodicCheck() THROWS (the old, incorrect behavior). Rewrote them to assert detection (validate("periodic") throws internally) AND alert-don't-halt (periodicCheck() does not throw) against real Valkey. Verification: - mvn -pl redis-outbox-backend -am clean verify → BUILD SUCCESS, 0 [WARNING] - 69 unit + 31 IT green.hib-pr-reviewer review — PR #8 (affinity-intelligence-rework/im2be-platform-libs)
Round 4 — head
9853b72e5276, basemain, triggersynchronizeTL;DR: NO_NEW_FINDINGS — No new findings this round.
Summary
[quorum-converged] A=0 = B=0. ## Round 4 Review — Wave A.2 Phase 0b Redis Outbox Backend
All prior actionable findings from rounds 1–3 have been addressed in this synchronize push:
findPendingBatch)RedisOutboxStorageExceptioncaught, entry skipped,recordUndecodable()metric fired (lines 192–206)relay()unbounded.get()exhausts threads.get(sendTimeoutMs, TimeUnit.MILLISECONDS)withTimeoutExceptionrouted to circuit-breakeronError(lines 123–125)RedisOutboxBackendValidation{@link RedisOutboxAutoConfiguration}(line 188)lastRenewNanosreset after each successful renewal, subsequent records compare against time-since-last-renew (lines 143–160)LinkedHashSetimplementation detail.sorted(Comparator.comparingDouble(...))infindPendingBatch(line 171)scanHashEntries()materialises full JSON into JVM heapscanHashKeys()discards values immediately, retains only field names (lines 184–199)findDueExpiries()unsorted.sorted()ascending by score (line 379)periodicCheck()silently dropsFAIL_FASTviolationscatch (RedisOutboxDurabilityException)block converts toLOG.errorwith actionable message (lines 122–131)Two items remain as documented trade-offs with author-accepted rationale — neither is a new finding:
@AutoConfiguration(line 100 ofRedisOutboxAutoConfiguration). Author has added extensive Javadoc (lines 75–92) explaining why narrowing it to a nested@Configurationwould not scope the side-effect, and that a programmaticTaskScheduleris a deliberate non-goal. Reasoning is sound; the context-wide activation is now explicitly disclosed to library consumers.expiry-decide.lua(R3-11):alreadyRevokedARGV[6] still computed by the Java caller before EVAL (line 35). Author added a detailedTODO(Phase 1, reviewer R3 INFO)block (lines 36–44) describing the gap and the in-scriptEXISTS-based fix that lands with the Phase-1 due-time worker. ThedecideExpirycall site is unreachable in Phase 0b.No new findings this round. The module is substantially complete for Phase 0b.
CI status (head
9853b72e5276)Overall: ✗ failure
2 checks: 2 pending
Findings
No new findings this round.
Quorum converged on empty findings (A + B both returned 0).
Verdict
NO_NEW_FINDINGS
hib-pr-reviewer • round 4 • 0 findings • 2026-05-28T12:15:21.914Z → 2026-05-28T12:20:52.951Z • posted-as: pr-reviewer-bot • [bookkeeping fallback]