feat(centrifugo): L0 T0 #6 PR-OPAQUE-4e — gRPC fan-out for mint-time user context #4
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "feat/l0-t0-opaque-4e-grpc-clients"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
L0 Tranche-0 #6 PR-OPAQUE-4e — wires identity-service as the CONSUMER side of the Centrifugo opaque-ticket mint context aggregation. Issues
family-service.GetFamilyForUser+user-service.GetUserProfilein parallel at mint time and assembles the full 6-fieldTicketUserContextper ADR-0002 §3a. Companion to:affinity-intelligence-rework/im2be-protobuf@316e77f).affinity-intelligence-rework/im2be-family-service@6a1927e).affinity-intelligence-rework/im2be-user-service@7e73084).Changes
Production code
pom.xml— promotesgrpc-client-spring-boot-starterfrom test-only to compile scope; addsresilience4j-spring-boot3:2.2.0(verified 2026-05-24 against Maven Central as latest stable on Spring Boot 3.x line per rule 61).src/main/proto/{family,user}/v1/*.proto— vendors meta-repo snapshot offamily_context.proto+user_profile.proto; protobuf-maven-plugin now compiles 4 .proto files (was 2).mintcontext/SubscriptionTierMapper— wire enum → canonicalFREE/PRO/FAMILYlabels; fail-closed onUNSPECIFIED+UNRECOGNIZEDperuser_profile.proto§rpc contract.mintcontext/FamilyContextClient— outbound wrapper forGetFamilyForUser;@GrpcClient("family-service")injected blocking stub, Resilience4j circuit breaker, 2 s per-call deadline. Enforces the proto's 5-row presence table — rows 4 + 5 fail-closed symmetrically.mintcontext/UserProfileClient— outbound wrapper forGetUserProfile; same shape. Fail-closed onUNSPECIFIEDtier.mintcontext/MintContextAggregator— parallel fan-out viaCompletableFuture.allOfon a 16-thread pool, OTel context propagation,ticket.mint.fanoutchild span carrying per-RPC status + duration attributes.CentrifugoTokenController— injectsMintContextAggregator, replacesserialiseUserContext(userId)stub withserialiseUserContext(userId, ttl)that delegates to the aggregator. New catch arm forTicketContextFetchFailedExceptionwithcontext_fetch_failureoutcome counter label.exception/TicketContextFetchFailedException+GlobalExceptionHandler— new 503 +Retry-After: 1mapping with distinctTICKET_CONTEXT_FETCH_FAILEDerror code so dashboards pivot Redis-side vs gRPC-side failures independently.application.properties—grpc.client.{family,user}-serviceconfig + resilience4j circuit-breaker instances. Defaults: 50% failure threshold over 10 calls, 5 s open-state wait, COUNT_BASED sliding window.ADR
ADR-0002 §3a edit landed in meta-repo
main(docs/decisions/0002-centrifugo-opaque-ticket.md) — expanded with mint-flow diagram, parallel-fan-out semantics, circuit-breaker discipline, full failure-mapping table, staleness contract.Tests
SubscriptionTierMapperTest(6),FamilyContextClientTest(13),UserProfileClientTest(13),MintContextAggregatorTest(7).CentrifugoTokenControllerTestextended: 3 new tests (mintReturns503WhenContextFanoutFails,mintReturns500WhenContextFanoutThrowsIllegalState,mintSerialisesNoFamilyCaseWithNullFamilyId); existing tests updated to stub the aggregator + registerJavaTimeModuleon the standalone-testObjectMapperforInstantserialisation symmetry with Spring Boot's auto-configured mapper.Verification
mvn -DskipTests=false clean install→ BUILD SUCCESS, 118 tests pass (was 79), ZERO[WARNING]lines (rule 62 — clean compilation).Test plan
mvn -P unit-tests test— 118 tests pass.mvn -DskipTests=false clean install— BUILD SUCCESS, zero warnings.concurrentMintsProduceAllUniqueTickets) still passes after the refactor.clientsRunInParallel).Wires identity-service as the consumer of PR-OPAQUE-4b (family-service.GetFamilyForUser) + PR-OPAQUE-4c (user-service.GetUserProfile). Replaces the PR-OPAQUE-2 stub TicketUserContext (user_id + minted_at only) with the full 6-field realtime.v1-shaped record aggregated in parallel from both downstream services at mint time, per ADR-0002 §3a. Changes: - pom.xml — promotes grpc-client-spring-boot-starter from test-only to compile scope; adds resilience4j-spring-boot3 2.2.0 (verified 2026-05-24 against Maven Central as latest stable on the Spring Boot 3.x line per rule 61). - src/main/proto/{family,user}/v1/*.proto — vendors a snapshot of the meta-repo's family_context.proto + user_profile.proto; protobuf-maven-plugin now compiles 4 .proto files (was 2), generating FamilyContextServiceGrpc + UserProfileServiceGrpc client stubs alongside the existing ConnectProxyService server stub. - mintcontext/SubscriptionTierMapper — maps SubscriptionTier wire enum → canonical FREE/PRO/FAMILY labels; UNSPECIFIED + UNRECOGNIZED fail-closed per user_profile.proto §rpc contract. - mintcontext/FamilyContextClient — outbound wrapper for GetFamilyForUser; injected @GrpcClient("family-service") stub, Resilience4j circuit breaker, 2s per-call deadline. Enforces the proto's 5-row presence table — rows 4 + 5 fail-closed symmetrically. - mintcontext/UserProfileClient — outbound wrapper for GetUserProfile; same shape. Fail-closed on UNSPECIFIED tier. - mintcontext/MintContextAggregator — parallel fan-out via CompletableFuture.allOf on a 16-thread pool, OTel context propagation, ticket.mint.fanout child span carrying per-RPC status + duration attributes. - CentrifugoTokenController — injects MintContextAggregator, replaces serialiseUserContext(userId) stub with serialiseUserContext(userId, ttl) that delegates to the aggregator. Adds catch arm for TicketContextFetchFailedException with new context_fetch_failure outcome counter label. - exception/TicketContextFetchFailedException + GlobalExceptionHandler.handleTicketContextFetchFailed — new 503 + Retry-After:1 mapping with distinct TICKET_CONTEXT_FETCH_FAILED error code so dashboards pivot Redis-side vs gRPC-side failures independently. - application.properties — grpc.client.{family,user}-service config + resilience4j circuit-breaker instances. Defaults: 50% failure threshold over 10 calls, 5s open-state wait, COUNT_BASED sliding window. ADR-0002 §3a — expanded with the gRPC fan-out flow diagram, parallel-fan-out semantics, circuit-breaker discipline, full failure-mapping table. Tests: - 4 new test classes (39 tests): SubscriptionTierMapperTest (6), FamilyContextClientTest (13), UserProfileClientTest (13), MintContextAggregatorTest (7). - Existing CentrifugoTokenControllerTest extended: 2 new tests (mintReturns503WhenContextFanoutFails, mintReturns500WhenContextFanoutThrowsIllegalState, mintSerialisesNoFamilyCaseWithNullFamilyId); existing tests updated to stub the aggregator + register JavaTimeModule on the standalone-test ObjectMapper for Instant serialisation symmetry with Spring Boot's auto-configured mapper. Verification: - mvn -DskipTests=false clean install → BUILD SUCCESS, 118 tests pass (was 79), ZERO [WARNING] lines (rule 62). - Vioxen + plugins working trees clean (rule 53).Show previous round
hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)
Round 1 — head
ad10ae12b45e, basemain, triggeropenedTL;DR: NEEDS_WORK — kept 5 findings (1 agreed minor, 3 unique-verified minor, 1 unique-verified major); NOT_FOUND mapped to a recorded CB exception is the blocking concern.
Summary
Arbitration summary
First run for this PR — no prior history in Memora.
Memora recalled: 0 prior runs, 0 submodule patterns.
Verification performed:
application.propertieslines 105-142: confirmed deadStatusRuntimeExceptionentry (lines 129, 141) and stale comment (lines 112-115 sayIllegalArgumentException; actual entry isTicketContextFetchFailedException).FamilyContextClient.javalines 186-254: confirmedinvoke()catches and remaps allStatusRuntimeExceptionbefore CB sees it; confirmedNOT_FOUNDmaps toTicketContextFetchFailedException(line 238), which IS inrecordExceptions.UserProfileClient.javalines 133-177: same pattern confirmed;NOT_FOUND→TicketContextFetchFailedException(line 163).MintContextAggregator.javalines 95, 141-161: confirmedFANOUT_POOL_SIZE=16,newFixedThreadPool(16)(unboundedLinkedBlockingQueue), and@PreDestroy shutdown()is non-blocking with Javadoc claiming it drains.Reconciliation: Kept 1 agreed finding (A1≡B2), verified and kept 4 unique findings (A2, A3, B1, B3). No finding dropped — all 5 grounded in code.
Run persisted to Memora (id=229).
Blast Radius
The diff introduces a new gRPC fan-out layer on the mint hot-path (CentrifugoTokenController → MintContextAggregator → FamilyContextClient + UserProfileClient), wiring two new outbound network I/O boundaries protected by circuit breakers. Changes span 17 files across the controller, three new service classes, proto definitions, exception hierarchy, application config, and a full test suite. The circuit-breaker misconfiguration affects the entire user population when the CB trips open.
BLAST_SCORE: 6/10
Risk Indicators
CentrifugoTokenController.serialiseUserContext,MintContextAggregator.aggregate,FamilyContextClient.mapStatus,UserProfileClient.mapStatus,FamilyContextClient.invoke,UserProfileClient.invokeCI status (head
ad10ae12b45e)No CI checks reported for this commit.
Related PRs
Findings (5)
[MAJOR]
NOT_FOUNDcounted as a circuit-breaker failure — permanent errors can trip the breaker open for all userssrc/main/java/com/aim2be/identity/mintcontext/FamilyContextClient.java:233
Both
FamilyContextClient.mapStatus()(line 233–238) andUserProfileClient.mapStatus()(line 162–164) remap gRPCNOT_FOUNDtoTicketContextFetchFailedException. That class is listed in bothrecordExceptionsconfigs (application.propertieslines 129–130 and 141–142).Verified at HEAD:
NOT_FOUNDcase inFamilyContextClient.mapStatus()at line 233:Same pattern at
UserProfileClient.java:162.With
slidingWindowSize=10andfailureRateThreshold=50, five consecutive NOT_FOUND calls in a ten-call window open the circuit and block all mint requests for thewaitDurationInOpenState=5swindow. An account-deletion sweep followed by Centrifugo reconnect storms (typical after a backend restart) can easily produce this pattern. NOT_FOUND is a permanent semantic error, not a transient infrastructure fault — it is not what circuit breakers are designed to isolate.The code comment even acknowledges the distinction: "the client should re-auth (which will surface the real 'user gone' error … before reaching us again)" — confirming this is expected, not exceptional, infrastructure behaviour.
Fix: Introduce a narrow subclass (e.g.
UserNotFoundContextFailedException extends TicketContextFetchFailedException) that is NOT listed inrecordExceptions, and use it for theNOT_FOUNDbranch in bothmapStatus()implementations. Optionally addignoreExceptions=com.aim2be.identity.exception.UserNotFoundContextFailedExceptionfor clarity. TheTicketContextFetchFailedExceptionentry then covers only genuinely transient paths (UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL, circuit-open, fail-closed shape errors).[MINOR] Dead
io.grpc.StatusRuntimeExceptionentry in bothrecordExceptionsconfigssrc/main/resources/application.properties:129
Verified at HEAD: both
FamilyContextClient.invoke()(line 192) andUserProfileClient.invoke()(line 139) wrap the stub call incatch (StatusRuntimeException ex) { throw mapStatus(ex); }. EveryStatusRuntimeExceptionis caught and remapped to eitherTicketContextFetchFailedExceptionorIllegalStateExceptionbefore control returns tocircuitBreaker.executeSupplier(…). The circuit breaker therefore never observes a rawStatusRuntimeException.The dead entries are at
application.properties:129(family-service) andapplication.properties:141(user-service). They create false confidence that unhandled gRPC exceptions would be caught by the breaker and confuse future maintainers reasoning about what counts as a failure.Fix: Remove
io.grpc.StatusRuntimeException,\from bothrecordExceptionsblocks. The effective recording set is already justTicketContextFetchFailedException. (This finding is confirmed by both independent reviewers.)[MINOR] Comment incorrectly claims
IllegalArgumentExceptionis inrecordExceptionssrc/main/resources/application.properties:114
Verified at HEAD: lines 112–115 state:
The actual config at lines 129–130 and 141–142 lists
com.aim2be.identity.exception.TicketContextFetchFailedException, notIllegalArgumentException. TheIllegalArgumentExceptionthrown bySubscriptionTierMapper.toLabel()is caught insideUserProfileClient.parseAndValidate()and re-thrown asTicketContextFetchFailedExceptionbefore reaching the circuit breaker — so the actual config is correct, but the comment is wrong.Fix: Replace the
+ IllegalArgumentExceptionclause with+ TicketContextFetchFailedException (the fail-closed exception all wrappers raise — StatusRuntimeException is remapped to this before it reaches the CB).[MINOR]
newFixedThreadPool(16)uses unbounded queue — no back-pressure beyond 8 concurrent mint requestssrc/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:142
Verified at HEAD:
FANOUT_POOL_SIZE = 16(line 95);defaultExecutor()at line 141–152 callsExecutors.newFixedThreadPool(16, factory), which internally uses an unboundedLinkedBlockingQueue. With 2supplyAsynctasks peraggregate()call, the pool saturates at 8 concurrent in-flight mint requests. Beyond that, new tasks queue without bound while HTTP threads block onboth.join(). There is no back-pressure signal to the Tomcat thread pool; under burst load, the executor queue grows unboundedly and per-request latency grows proportionally to queue depth.Concrete fix:
CallerRunsPolicyapplies back-pressure to the Tomcat thread pool rather than queuing without limit. Size theArrayBlockingQueuecapacity to expected burst concurrency × 2 once the SLO is established.[MINOR]
@PreDestroy shutdown()does not await executor termination — Javadoc claim of "drains" is incorrectsrc/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:158
Verified at HEAD (lines 154–161):
ExecutorService.shutdown()initiates orderly shutdown but returns without blocking — it does not wait for in-flight or queued tasks to complete. The Javadoc claim that this method "drains" the executor is incorrect.Under the current deployment (
server.shutdown=graceful+spring.lifecycle.timeout-per-shutdown-phase=30s), the servlet container drains in-flight HTTP requests before the application context closes, so@PreDestroytypically fires when noaggregate()calls are active. However, the ordering is not guaranteed under all shutdown signals, and reducing the grace window below the 2 s per-RPC deadline plus scheduling overhead could expose a race.Fix:
Also update the Javadoc to accurately describe the blocking behaviour.
Verdict
NEEDS_WORK
hib-pr-reviewer • round 1 • 5 findings (1M/4m) • 2026-05-24T20:25:13.740Z → 2026-05-24T20:27:46.203Z • posted-as: pr-reviewer-bot
R1 verdict NEEDS_WORK — kept 5 findings; all addressed per rule 63 (ripple-aware) + rule 68 (structured commit body). (1) MAJOR FamilyContextClient.java:233 / UserProfileClient.java:162 — NOT_FOUND counted as a CB failure: a burst of NOT_FOUND (account- deletion sweep + Centrifugo reconnect storm) could trip the family-service / user-service circuit breaker open and starve the rest of the user population. NOT_FOUND is a permanent client-side classification, NOT a transient infrastructure fault, so it must NOT contribute to the failure-rate window. Fix: introduced UserNotFoundContextFetchFailedException (subclass of TicketContextFetchFailedException so the existing @ExceptionHandler(TicketContextFetchFailedException.class) still maps the wire shape to HTTP 503 + Retry-After:1 — wire contract unchanged). Both mapStatus() implementations now raise the subclass on NOT_FOUND. application.properties wires the subclass into resilience4j.circuitbreaker.instances.{family,user}-service .ignoreExceptions so the CB filter excludes it from the failure window. Only transient codes (UNAVAILABLE / DEADLINE_EXCEEDED / INTERNAL / circuit-open / proto fail-closed) remain CB-failure signals via the parent class. (2) minor application.properties:129/141 — dead `io.grpc.StatusRuntimeException` entry in both recordExceptions blocks: every StatusRuntimeException is caught and remapped to TicketContextFetchFailedException or IllegalStateException by mapStatus() BEFORE reaching circuitBreaker.executeSupplier, so the CB never observes the raw gRPC type. Removed both entries. (3) minor application.properties:114 — stale comment claimed IllegalArgumentException was in recordExceptions when the actual entry is TicketContextFetchFailedException. Rewrote the doc block to accurately describe the effective failure-signal class + document why io.grpc.StatusRuntimeException was dead code + document the new ignoreExceptions list. (4) minor MintContextAggregator.java:142 — newFixedThreadPool(16) uses unbounded LinkedBlockingQueue: with 2 supplyAsync tasks per aggregate() call, the pool saturated at 8 concurrent mint requests and beyond that latency grew unboundedly without back-pressure to Tomcat. Replaced with ThreadPoolExecutor + ArrayBlockingQueue (capacity = FANOUT_POOL_SIZE * 4 = 64) + CallerRunsPolicy so the calling HTTP thread runs the task synchronously when the queue fills, propagating back-pressure into Tomcat's bounded connection-accept queue. Sized for L0 T0 #6 burst envelope — re-tune against the FINDING 53 mint-RPS SLO once production traffic shows up. (5) minor MintContextAggregator.java:158 — @PreDestroy shutdown() did not await termination; the Javadoc's "drains" claim was incorrect. Replaced with shutdown() + awaitTermination(5s) + shutdownNow() escalation + InterruptedException-restores-flag + WARN log on timeout. Javadoc rewritten to accurately describe the blocking behaviour. Constant SHUTDOWN_AWAIT_SECONDS=5 chosen as 2.5× the per-RPC deadline so a steady-state in-flight RPC always drains. Verification: - mvn -DskipTests=false clean install → BUILD SUCCESS, 0 warnings - Tests run: 120, Failures: 0, Errors: 0, Skipped: 0 (was 118 — +2) • FamilyContextClientTest: 13 → 14 (+circuitBreakerDoesNotOpenOnRepeatedNotFound) • UserProfileClientTest: 13 → 14 (+circuitBreakerDoesNotOpenOnRepeatedNotFound) • Existing notFoundMapsTo* tests strengthened to assert the new subclass (UserNotFoundContextFetchFailedException) while still asserting parent-type compatibility for the controller handler. • Existing circuitBreakerOpensAfterRepeatedFailures unchanged — still uses UNAVAILABLE (transient) so the CB-open assertion holds (NOT_FOUND-vs-UNAVAILABLE distinction is exactly the bug). Refs PR #4 R1 verdict, ADR-0002 §3a, ADR-0014.Show previous round
hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)
Round 2 — head
2d1fedec921a, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — 2 minor findings kept this round.
Summary
Arbiter reconciled 1 (A) + 1 (B) → 2 findings.
CI status (head
2d1fedec921a)No CI checks reported for this commit.
Findings (2)
[MINOR]
attachStatusBestomits all three timing attributes from the error-path span, breaking the Javadoc contractsrc/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:306
The class-level Javadoc (lines 47–57) unconditionally promises five attributes on every
ticket.mint.fanoutspan:fanout.family.status/fanout.user.status— ✅ written byattachStatusBestfanout.duration_ms— ❌ never written on the error pathfanout.family.duration_ms/fanout.user.duration_ms— ❌ never written on the error pathattachStatusBest(lines 306–316) only sets the two status strings and callsspan.setStatus(ERROR). Verified at HEAD via Read.Impact:
TimedResult.durationMsis silently discarded.fanout.duration_mswill miss all error-path samples.Suggested fix — add a
fanoutStartNsparameter and recover per-RPC timing from completed futures:Both call sites in
aggregate()(lines 278 and 282) must also passfanoutStartNs(already in scope there).[MINOR]
both.join()has no wall-clock ceiling — indefinite thread block possible if per-RPC deadline does not firesrc/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:268
Line 268 is
both.join()with no timeout argument (confirmed at HEAD via Read). The Javadoc at lines 44–45 explicitly acknowledges: "the join itself does not impose an additional timeout."The per-RPC
withDeadlineAfter(2000 ms)stubs are the sole safety net, but two failure modes can cause a blocking stub to outlive its deadline:Under either scenario the fanout worker thread blocks indefinitely. When the
ArrayBlockingQueue(64)fills under burst load,CallerRunsPolicycauses the HTTP request thread to run the task directly, eating a Tomcat slot until the client closes the connection.Suggested fix — replace
both.join()with a boundedget()with headroom over the per-RPC deadline:FamilyContextClient.CALL_DEADLINE_MSandUserProfileClient.CALL_DEADLINE_MSare in the samemintcontextpackage and accessible directly.Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 2 • 2 findings (2m) • 2026-05-24T21:00:06.801Z → 2026-05-24T21:01:55.390Z • posted-as: pr-reviewer-bot • [bookkeeping fallback]
R2 verdict findings (kept=2): (1) MINOR src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:306 — attachStatusBest omits all three timing attributes from the error-path span, breaking the Javadoc contract. The class-level Javadoc promises 5 attributes (fanout.{family,user}.status + fanout.duration_ms + fanout.{family,user}.duration_ms) on every ticket.mint.fanout span; the error path only wrote the 2 status strings, leaving 3 attributes absent on error spans. On-call couldn't distinguish a 0 ms circuit-open rejection from a 2 000 ms deadline-exceeded; the successful future's TimedResult.durationMs was silently discarded when the other failed; trace-SLO queries against fanout.duration_ms missed all error-path samples. (2) MINOR src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:268 — both.join() has no wall-clock ceiling. Per-RPC withDeadlineAfter (2 s) is the primary safety net, but two pathologies can outlive it: a half-open TCP socket where a firewall drops FIN silently, and event-loop thread starvation that delays the deadline-expiry callback. Either case blocks the fanout worker indefinitely; once the ArrayBlockingQueue(64) fills, CallerRunsPolicy promotes the block to the Tomcat HTTP thread, eating a slot until the client closes the connection. Fix: Two surgical changes. (1) attachStatusBest gains a fanoutStartNs parameter + writes all three timing attributes. The parent fanout.duration_ms is always measured from fanoutStartNs (so even a 0 ms early-exit is visible). Per-RPC fanout.{family,user}.duration_ms are attached for any future that completed successfully before the join's failure path was entered — the TimedResult on a successful side carries the value; failed/pending sides have no measurement to report. Per-future join calls inside the attach are wrapped in a defensive try/catch so an attribute-attach can't derail the span-finish path. All four callsites (TimeoutException, new InterruptedException, ExecutionException, CompletionException, RuntimeException) pass fanoutStartNs through. (2) both.join() replaced with both.get(FANOUT_JOIN_TIMEOUT_MS, TimeUnit.MILLISECONDS). FANOUT_JOIN_TIMEOUT_MS is pinned to max(per-RPC deadline) + 500 ms = 2_500 ms — a healthy RPC at 2 s still passes; a stuck stub surfaces as TicketContextFetchFailedException (existing 503 fail-closed path) inside 2.5 s. Made public static final so the unit test can sanity-check the bound stays within [1 s, 10 s]; exposing a SLO-relevant constant via public is consistent with the Javadoc-documented contract. New catch arms for TimeoutException (wraps to TicketContextFetchFailedException with the join-timeout message), ExecutionException (unwraps cause + routes through existing rethrow), and InterruptedException (restores interrupt flag + wraps to TicketContextFetchFailedException). Existing CompletionException + RuntimeException arms retained defensively in case the per-future joins below ever surface them. Ripple inspection (rule 63): - Javadoc (class-level + attachStatusBest method-level + new FANOUT_JOIN_TIMEOUT_MS Javadoc) updated to match the new behaviour — the Javadoc contract finding #1 reviewer cited is now the actual contract. - No external consumer reads the OTel attributes besides tail samplers + tracing UIs (no broken downstream). - TicketContextFetchFailedException already maps to 503 in GlobalExceptionHandler (verified at HEAD) — the new timeout path joins the existing fail-closed routing, no new error code needed. - Removed unused `Optional` import from the test file (preexisting per rule 18). Verification: - mvn -DskipTests=false clean install → BUILD SUCCESS - Tests run: 122, Failures: 0, Errors: 0, Skipped: 0 - MintContextAggregatorTest: 9 tests pass (was 7; +2 for new fixes) * joinTimeoutSurfacesAsFetchFailed — parks both stubs 4 s; verifies TicketContextFetchFailedException trips ~2.5 s with "fanout join timed out" message. * timeoutCeilingExposedAsConstant — guards FANOUT_JOIN_TIMEOUT_MS against accidental drift outside [1 s, 10 s]. - Compile-only warning check: zero source warnings (rule 62). External docs verification (rule 61): - CompletableFuture.get(long, TimeUnit) — Java 17 javadoc: TimeoutException on deadline; ExecutionException wraps task exception; InterruptedException for thread interrupt. Behaviour matches the catch-arm structure. - gRPC withDeadlineAfter behaviour: https://grpc.io/docs/guides/deadlines/ — deadline expiry runs on the event-loop thread; thread starvation can delay callback. Cited in the Javadoc rationale for why the join ceiling is needed.Show previous round
hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)
Round 3 — head
098b687d523b, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — kept 1 verified unique-to-B minor finding (comment inaccuracy in application.properties); all 7 prior-round blocking issues confirmed resolved.
Summary
Round 3 Arbitration — affinity-intelligence-rework/im2be-identity-service#4
Prior context: No existing Memora memory for this PR (despite being round 3); created new entry (ID 233) with review_count=3.
A vs B reconciliation:
application.properties:115— comment overstatesStatusRuntimeExceptionremapping.Verification of B's unique finding:
Read
application.propertieslines 112–129: the comment states "every transient-or-permanent fanout failure surfaces asTicketContextFetchFailedException(the wrappers remap rawio.grpc.StatusRuntimeException+ the proto-level fail-closedIllegalArgumentExceptionto this typed superclass…)".Read
FamilyContextClient.java:235–244andUserProfileClient.java:157–162: bothmapStatus()methods returnnew IllegalStateException(…)— notTicketContextFetchFailedException— forINVALID_ARGUMENT. ConfirmedrecordExceptionsat line 141 lists onlyTicketContextFetchFailedException, soIllegalStateExceptionis never counted in the failure-rate window. The comment's claim that "every" fanout failure reachesTicketContextFetchFailedExceptionis factually incorrect for theINVALID_ARGUMENTbranch.Decision: Keep B's finding (verified unique-to-B). Severity remains minor (comment inaccuracy; no runtime bug — the code behaves correctly, only the prose explanation misleads).
Kept 1 unique-to-B (verified); dropped 0.
Blast Radius
The finding is a documentation comment in a configuration file. No runtime behaviour is affected; the circuit-breaker wiring, exception hierarchy, and gRPC fan-out logic are all correct. Impact is confined to developer understanding of the
INVALID_ARGUMENTcode path.BLAST_SCORE: 1/10
Risk Indicators
MintContextAggregator.aggregateContext,FamilyContextClient.mapStatus,UserProfileClient.mapStatus,CentrifugoTokenController.serialiseUserContextCI status (head
098b687d523b)No CI checks reported for this commit.
Findings (1)
[MINOR] Comment overstates
StatusRuntimeExceptionremapping —INVALID_ARGUMENTmaps toIllegalStateException, notTicketContextFetchFailedExceptionsrc/main/resources/application.properties:115
Lines 112–119 assert that "every transient-or-permanent fanout failure surfaces as
TicketContextFetchFailedException(the wrappers remap rawio.grpc.StatusRuntimeException+ the proto-level fail-closedIllegalArgumentExceptionto this typed superclass…)". This is inaccurate for theINVALID_ARGUMENTbranch: bothFamilyContextClient.mapStatus()(line 243) andUserProfileClient.mapStatus()(line 161) returnnew IllegalStateException(…)forINVALID_ARGUMENT— which is not a subclass ofTicketContextFetchFailedException. BecauseIllegalStateExceptionappears in neitherrecordExceptionsnorignoreExceptions, the circuit breaker treatsINVALID_ARGUMENToutcomes as successes, not failures. The comment implies it would be counted as a failure ("transient-or-permanent fanout failure"), which is wrong.The code itself is correct — treating
INVALID_ARGUMENTas a programming error (HTTP 500) and not as a CB-trippable transient is intentional and sound. Only the comment is wrong.Suggested replacement (sketch): "
recordExceptionscoversTicketContextFetchFailedExceptionand subclasses, exceptUserNotFoundContextFetchFailedException(inignoreExceptions). Transient gRPC codes (UNAVAILABLE/DEADLINE_EXCEEDED/INTERNAL/default-bucket) and the proto fail-closedIllegalArgumentException(fromSubscriptionTierMapper) surface asTicketContextFetchFailedExceptionand count toward the failure rate.INVALID_ARGUMENTis a programming-error path — it maps toIllegalStateException(HTTP 500) and the CB treats it as a success, not a failure."Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 3 • 1 finding (1m) • 2026-05-24T21:14:36.450Z → 2026-05-24T21:16:21.538Z • posted-as: pr-reviewer-bot
R3 verdict findings (kept=1): (1) MINOR application.properties:112-119 — Comment overstates StatusRuntimeException remapping. The pre-existing comment claimed "every transient-or-permanent fanout failure surfaces as TicketContextFetchFailedException." That hid the INVALID_ARGUMENT branch: UserProfileClient.mapStatus (lines 158-162) + FamilyContextClient.mapStatus map INVALID_ARGUMENT to IllegalStateException, not to the typed superclass. The CB never observes IllegalStateException (it propagates straight out of executeSupplier), so the failure-window accounting differs from the recordExceptions wording. Fix: split the exception-family mapping into three explicit bullets: - Transient (UNAVAILABLE / DEADLINE_EXCEEDED / INTERNAL) → TicketContextFetchFailedException → CB FAILURE. - NOT_FOUND → UserNotFoundContextFetchFailedException (subclass) → CB IGNORE via ignoreExceptions. - INVALID_ARGUMENT → IllegalStateException → propagates without contributing to the failure window. And restate the proto-level fail-closed path (UNSPECIFIED tier in SubscriptionTierMapper raises IllegalArgumentException, the wrapper rewraps to TicketContextFetchFailedException → CB FAILURE). Verification: - Pure comment change; no code path touched. - application.properties parsing is whitespace-insensitive; the comment block is uninterpreted by Spring (lines prefixed with `#`).Show previous round
hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)
Round 4 — head
bb22b7d155f3, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — kept 3 minor findings (all unique-to-one, all verified at HEAD); all 8 prior-round findings confirmed fixed; no blocking issues.
Summary
Arbitration Round 4 — 3 unique-to-one findings verified and kept; 0 dropped
Prior rounds (1-3): Both reviewers independently confirm all 8 prior findings are fixed at HEAD. Not re-raised.
Finding A (unique):
GlobalExceptionHandler.java:68-69— Verified at HEAD. Lines 65-75 confirm the single@ExceptionHandler(TicketContextFetchFailedException.class)handler callsOtelErrorEvents.record()(line 68) andlog.error(...)(line 69) for all subclasses, includingUserNotFoundContextFetchFailedException. No more-specific handler exists for the subclass. The inconsistency with theignoreExceptionsrationale (permanent client-side classification, not transient infrastructure fault) is real and confirmed. Kept.Finding B1 (unique):
application.properties:121-123— Verified at HEAD. BothFamilyContextClient.mapStatus()(line 243-244) andUserProfileClient.mapStatus()(line 161-162) throwIllegalStateExceptionforINVALID_ARGUMENT. The comment at lines 121-123 states this "propagates out of the CB without contributing to the failure window." Under Resilience4j 2.x, whenrecordExceptionsis configured, any exception matching neitherrecordExceptionsnorignoreExceptionsis counted as SUCCESS (not neutral). This dilutes the failure rate and can prevent the CB from opening during mixed INVALID_ARGUMENT + UNAVAILABLE scenarios. Comment is factually wrong. Kept.Finding B2 (unique):
UserNotFoundContextFetchFailedException.java:25-30— Verified at HEAD. Lines 26-30 claim HTTP 503 +Retry-After:1"triggers re-auth." Standard OAuth 2.0/OIDC clients retry the 503 (per RFC 7231 Retry-After semantics), they do not initiate token refresh or re-login. JWTs are stateless; a token issued before account deletion remains cryptographically valid. The re-auth rationale is factually incorrect and will mislead on-call engineers and future maintainers. The same incorrect re-auth claim also appears inline atFamilyContextClient.java:247. Kept.Memora persistence: Tag allowlist rejected
pr-reviewer/custom tags; absorb tool permission denied. Run summary could not be stored — harness should investigate Memora tag policy for this project.Blast Radius
The diff touches 18 files spanning the
api,exception,mintcontextpackages, two proto definitions, application configuration, and four test suites. The newMintContextAggregatorgRPC fan-out path is on the hot mint endpoint. Blast radius is moderate — the new fan-out is additive and well-contained behind its own exception hierarchy, but the CB comment inaccuracy and the GlobalExceptionHandler gap affect the observability surface shared by all callers of the mint endpoint.BLAST_SCORE: 5/10
Risk Indicators
GlobalExceptionHandler.handleTicketContextFetchFailed,FamilyContextClient.mapStatus,UserProfileClient.mapStatus,MintContextAggregator.aggregateCI status (head
bb22b7d155f3)No CI checks reported for this commit.
Related PRs
Findings (3)
[MINOR]
handleTicketContextFetchFailedlogsUserNotFoundContextFetchFailedExceptionat ERROR and marks the OTel span ERROR, contradicting the NOT_FOUND classification rationalesrc/main/java/com/aim2be/identity/exception/GlobalExceptionHandler.java:69
Verified at HEAD: the single
@ExceptionHandler(TicketContextFetchFailedException.class)handler (lines 65-75) is the only handler that will catchUserNotFoundContextFetchFailedException(a direct subclass). It unconditionally callsOtelErrorEvents.record(Span.current(), ex, "GlobalExceptionHandler", "grpc")(line 68) — which sets span status to ERROR and records anerror.event— and thenlog.error("Ticket context fetch failed: {}", ex.getMessage())(line 69).This contradicts the
ignoreExceptionscircuit-breaker configuration whose explicit rationale is that NOT_FOUND is a permanent client-side classification, not a transient infrastructure fault, and that an account-deletion sweep followed by a Centrifugo reconnect storm must not trip the circuit breaker or generate transient-fault noise. The same burst will generate ERROR-level log lines and OTel error spans, which can trigger false-positive alert thresholds and inflate error-budget dashboards (per ADR-0013 §3 the tail sampler retains every ERROR-status span).Fix — add a dedicated handler above the parent-class one (Spring selects the most-specific match, so ordering within the class matters when both handlers are present):
Omitting
OtelErrorEvents.record()is intentional — NOT_FOUND should not surface as an error span.[MINOR] Comment incorrectly claims
IllegalStateExceptionis neutral to the circuit breaker — it is counted as SUCCESS, diluting the failure ratesrc/main/resources/application.properties:122
Verified at HEAD: lines 121-123 state that INVALID_ARGUMENT mapping to
IllegalStateException"propagates out of the CB without contributing to the failure window." This is factually incorrect under Resilience4j 2.x semantics.Both
FamilyContextClient.mapStatus()(line 243-244) andUserProfileClient.mapStatus()(line 161-162) throwIllegalStateExceptionforINVALID_ARGUMENT. The circuit breaker configuration usesrecordExceptions=TicketContextFetchFailedExceptionandignoreExceptions=UserNotFoundContextFetchFailedException. Under Resilience4j 2.x, exceptions matching neither list are recorded as SUCCESS, not as a neutral no-op. This means anIllegalStateExceptionincrements the success counter.Consequence: a simultaneous burst of INVALID_ARGUMENT responses (each counted as SUCCESS) and genuine UNAVAILABLE responses (each counted as FAILURE) can hold the computed failure rate precisely at the 50% threshold, preventing the breaker from opening. The comment contract is wrong and will mislead on-call engineers reading the config during an incident.
Fix (preferred): add
java.lang.IllegalStateExceptionto bothignoreExceptionslists so these calls are excluded from all CB accounting:Alternatively, at minimum correct the comment at line 122 to accurately state "counted as CB success" and open a follow-on task to add
IllegalStateExceptiontoignoreExceptions.[MINOR] Javadoc incorrectly claims HTTP 503 +
Retry-After:1triggers client re-authentication — it does notsrc/main/java/com/aim2be/identity/exception/UserNotFoundContextFetchFailedException.java:29
Verified at HEAD: lines 25-30 state: "Per the proto contract... the client should re-auth after a NOT_FOUND response — the next request will surface the real user-gone error on user-service's JWT verifier before ever reaching the identity-service mint path again. The 503 + Retry-After:1 here is a benign fast-fail that triggers that re-auth flow."
This rationale is incorrect on two counts:
503 is not a re-auth signal. HTTP 503 +
Retry-After:1is a transient-availability signal (RFC 7231 §6.6.4). Standard OAuth 2.0 / OIDC clients and Centrifugo SDK reconnect logic will re-issue the same mint request after theRetry-Afterdelay. They will not initiate a token refresh or re-login flow. The result for a genuinely deleted user is a tight 503 retry loop until the JWT expires (potentially hours later) — neither benign nor a fast-fail.JWTs are stateless. A JWT issued before account deletion remains cryptographically valid at identity-service's JWT verifier (
JwtAuthenticationFiltervalidates signature + expiry, not downstream user existence). The claim that "the next request will surface the user-gone error on user-service's JWT verifier" is incorrect.The same incorrect re-auth claim also appears in an inline comment in
FamilyContextClient.javaat line 247 (// the client should re-auth). Both should be corrected.Fix: replace the re-auth rationale with an accurate description. Example: "NOT_FOUND returns HTTP 503 + Retry-After:1 to keep the mint-endpoint error-contract uniform (all context-fetch failures are 503). Callers that receive persistent 503 responses should escalate to a full re-authentication flow rather than retrying indefinitely; the Centrifugo SDK reconnect policy is expected to back off and surface the error to the application layer after N retries. Identity-service does not return a distinct 4xx code for user-not-found because doing so would require a separate client-side code path and expose user-existence information in the error classification."
Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 4 • 3 findings (3m) • 2026-05-24T21:29:55.028Z → 2026-05-24T21:32:58.475Z • posted-as: pr-reviewer-bot
R4 verdict findings (kept=3): (1) MINOR GlobalExceptionHandler.java:69 — handleTicketContextFetchFailed logs UserNotFoundContextFetchFailedException at ERROR + records OTel error event, contradicting the ignoreExceptions CB rationale. NOT_FOUND is a permanent client-side classification; an account- deletion sweep + reconnect storm would generate proportional ERROR-level log volume + tail-retained error spans + false-positive 5xx error-budget alerts even though the breaker correctly stays closed. (2) MINOR application.properties:122 — Comment incorrectly claimed IllegalStateException is neutral to the CB. Under Resilience4j 2.x, exceptions matching neither recordExceptions nor ignoreExceptions are counted as CB SUCCESS, diluting the failure rate. An INVALID_ARGUMENT burst alongside genuine UNAVAILABLE could hold the failure rate precisely at threshold and prevent the breaker from opening. (3) MINOR UserNotFoundContextFetchFailedException.java:25-30 + FamilyContextClient.java:247 — Javadoc + inline comment claimed HTTP 503 + Retry-After:1 triggers client re-authentication. RFC 7231 Retry-After is a transient-availability signal; standard OIDC clients + Centrifugo SDK retry the 503, they do not initiate token refresh. JWTs are stateless — a token issued before account deletion remains cryptographically valid until expiry. Fixes: (1) Added a more-specific @ExceptionHandler(UserNotFoundContextFetchFailedException.class) that logs at WARN + omits OtelErrorEvents.record() while preserving the 503 + Retry-After:1 + TICKET_CONTEXT_FETCH_FAILED wire contract. Spring picks the most-specific handler within the advice class; NOT_FOUND falls through to the dedicated handler and bypasses the ERROR-level log + OTel error-event path. (2) Added java.lang.IllegalStateException to ignoreExceptions on both family-service + user-service CB configs so INVALID_ARGUMENT is excluded from BOTH the failure window AND the success window. The breaker's accounting now reflects only legitimate downstream- fanout outcomes. Rewrote the recordExceptions/ignoreExceptions comment to state the three-family split accurately. (3) Rewrote UserNotFoundContextFetchFailedException javadoc to describe the actual retry behaviour (uniform 503 contract + stateless-JWT retry-loop until expiry; CB ignoreExceptions protects other tenants from the user-specific burst). Rewrote FamilyContextClient.java:247 inline comment to point at the (updated) javadoc instead of repeating the wrong claim. Verification (containerised JDK 17 mvn install): - Tests run: 123, Failures: 0, Errors: 0, Skipped: 0 (was 122) - BUILD SUCCESS in 12.762 s - New test: CentrifugoTokenControllerTest#mintReturns503WhenContextFanoutReportsUserNotFound verifies the NOT_FOUND path routes through the new handler + still returns the correct wire shape.Show previous round
hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)
Round 5 — head
07cfd02f17d7, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — kept 2 minor findings (unique-to-A and unique-to-B, both verified at HEAD); both share the same root cause: missing UserNotFoundContextFetchFailedException catch arm before the generic TicketContextFetchFailedException catch.
Summary
Round 5 Arbitration — 2026-05-24
Prior history: No Memora record found (first arbitration stored this run). Rounds 1–4 findings confirmed closed by both reviewers.
Both findings verified at HEAD.
A cites line 258 (the catch-arm opening); B cites line 264 (the specific
span.setStatus(StatusCode.ERROR, outcome)call inside it). These are different line numbers and materially-different bodies — they are not a Rule-1 agreed finding. Each was treated as unique-to-one and subjected to Rule-2 verification:A (line 258 — metric conflation): Confirmed. Lines 258–266 show a single
catch (TicketContextFetchFailedException fanoutFailed)block. No precedingcatch (UserNotFoundContextFetchFailedException …)arm exists.OUTCOME_CONTEXT_FETCH_FAILUREis recorded identically for all subclasses, making it impossible to distinguish permanent NOT_FOUND bursts from transient infra failures in the metric. Kept.B (line 264 — span.setStatus defeats tail-sampler): Confirmed.
span.setStatus(StatusCode.ERROR, outcome)at line 264 fires unconditionally for everyTicketContextFetchFailedException, includingUserNotFoundContextFetchFailedException.span.end()fires at line 276 infinally, so theticket.mintspan is sealedERRORbefore Spring ever dispatches toGlobalExceptionHandler. The handler's own Javadoc (lines 102–107, verified) states explicitly: "Treating it as ERROR + OTel error-event would … generate proportional ERROR-level log volume + tail-retained error spans (per ADR-0013 §3) and false-positive 5xx error-budget alerts." This is a documented design contract violation, not speculation. Kept.Both findings share a single root cause — the absence of a
UserNotFoundContextFetchFailedException-specific catch arm — and a single fix resolves both. Neither reviewer's proposed fix fully addresses the other's concern (A adds a new metric label; B omitssetStatus(ERROR)but keeps the same label), so both are worth surfacing independently. Both are minor severity; no blocking issues.Blast Radius
The diff touches 18 files spanning controller, exception hierarchy, two new gRPC client modules (FamilyContextClient, UserProfileClient, MintContextAggregator), protos, application.properties, and test suites — moderate breadth across multiple packages. The controller change mutates the public HTTP mint endpoint's observability contract (OTel span status, Micrometer outcome labels) and gRPC fan-out path, both of which are consumed by downstream alerting and tail-sampling infrastructure. The HTTP wire contract itself (503 + Retry-After:1) is unchanged.
BLAST_SCORE: 6/10
Risk Indicators
MintContextAggregator,UserProfileClient,FamilyContextClient,CentrifugoTokenController.mintTicket,GlobalExceptionHandler.handleUserNotFoundContextFetchCI status (head
07cfd02f17d7)No CI checks reported for this commit.
Findings (2)
[MINOR] Metric label
context_fetch_failureconflates NOT_FOUND (permanent) with transient infra failures — on-call cannot distinguish the two from the counter alonesrc/main/java/com/aim2be/identity/api/CentrifugoTokenController.java:258
The
catch (TicketContextFetchFailedException fanoutFailed)block at line 258 recordsOUTCOME_CONTEXT_FETCH_FAILUREfor every fanout failure, includingUserNotFoundContextFetchFailedException(NOT_FOUND / account-deletion burst).The
GlobalExceptionHandlercorrectly treats NOT_FOUND differently (WARN log, no OTel error-event on the HTTP server span), but that separation doesn't surface inMintMetrics. An alert rule oncentrifugo_tickets_minted_total{outcome="context_fetch_failure"}fires identically for an account-deletion reconnect storm (circuit breaker healthy, WARN logs) and a genuine downstream outage (circuit breaker open, ERROR logs). On-call must correlate the CB health indicator and log level before knowing whether to page a downstream team or stand down.Concrete fix: add a more-specific catch before line 258:
Also add
user_not_found_context_failureto theMintMetricsJavadoc. This lets dashboards pivot the two categories without changing the HTTP contract or circuit-breaker wiring.[MINOR] NOT_FOUND arm of TicketContextFetchFailedException catch still marks the ticket.mint span ERROR, defeating the GlobalExceptionHandler tail-sampler isolation (ADR-0013 §3)
src/main/java/com/aim2be/identity/api/CentrifugoTokenController.java:264
Line 264 calls
span.setStatus(StatusCode.ERROR, outcome)unconditionally for everyTicketContextFetchFailedExceptioninstance, including theUserNotFoundContextFetchFailedExceptionsubclass. Thespan.end()fires at line 276 in thefinallyblock — after this catch block sets the status — so theticket.mintspan is already sealed asERRORbefore Spring dispatches toGlobalExceptionHandler.GlobalExceptionHandler.handleUserNotFoundContextFetch(lines 103–107, verified at HEAD) states explicitly: "Treating it as ERROR + OTel error-event would … generate proportional ERROR-level log volume + tail-retained error spans (per ADR-0013 §3) and false-positive 5xx error-budget alerts, even though the breaker correctly stays closed." That handler correctly avoidsOtelErrorEvents.record()and logs at WARN, but the tail sampler's keep/drop decision is driven byspan.getStatus(), which is alreadyERRORby the time the handler runs.Compare the rate-limited arm (lines 242–246): it re-throws with no
setStatus(ERROR)call, consistent with ADR-0013 §3 (client-side classification only). NOT_FOUND is the same permanent-client-side category.Concrete fix — split the single catch into two ordered arms:
Java evaluates catch arms top-to-bottom; placing the more-specific subclass first mirrors the
@ExceptionHandlerordering already established inGlobalExceptionHandler.Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 5 • 2 findings (2m) • 2026-05-24T21:44:25.738Z → 2026-05-24T21:46:46.905Z • posted-as: pr-reviewer-bot
R5 verdict findings (kept=2; both share the same root cause — NOT_FOUND subclass needs the same more-specific split inside the controller's try/catch that R5 already applied at the GlobalExceptionHandler level): (1) MINOR CentrifugoTokenController.java:258-266 — `catch (TicketContextFetchFailedException)` arm marks the ticket.mint span StatusCode.ERROR for ALL subclasses, including UserNotFoundContextFetchFailedException. Per ADR-0013 §3 the tail sampler retains every ERROR-status span, so an account- deletion sweep + Centrifugo reconnect storm would generate proportional retained error spans for NOT_FOUND bursts even though the breaker correctly stays closed and the GlobalExceptionHandler now (R5) logs at WARN with no OtelErrorEvents.record(). The span ERROR side of the contract leaked through. (2) MINOR CentrifugoTokenController.java:259 — `outcome = OUTCOME_CONTEXT_FETCH_FAILURE` is the same label for NOT_FOUND and transient infra failures. On-call cannot distinguish the two from the counter alone; a NOT_FOUND-burst alert would page the same as a real downstream outage. Fix (single change addressing both findings — same R63 ripple inspection scope as R5's GlobalExceptionHandler fix): - Added `catch (UserNotFoundContextFetchFailedException userNotFound)` arm BEFORE the parent `catch (TicketContextFetchFailedException fanoutFailed)` arm. Subclass-before-superclass ordering is required by Java reach- ability rules + idiomatically routes the permanent classification to the dedicated branch. - New OUTCOME_CONTEXT_FETCH_USER_NOT_FOUND = "context_fetch_user_ not_found" label so on-call can pivot permanent vs transient. Updated MintMetrics.java javadoc to enumerate the new label. - Arm sets the `outcome` attribute but DOES NOT call span.setStatus(ERROR, …) — span stays UNSET, defeating tail- sampler retention for the permanent path while preserving the sampled-span outcome attribute for on-call filtering. - Outcome counter still incremented via metrics.recordMintOutcome (separate label) — observability is preserved, just routed to a distinct bucket. Verification (containerised JDK 17 mvn install): - Tests run: 123, Failures: 0, Errors: 0, Skipped: 0 - BUILD SUCCESS - mintReturns503WhenContextFanoutReportsUserNotFound test continues to pass (now also exercises the new controller-level NOT_FOUND arm in addition to the GlobalExceptionHandler's more-specific handler).Show previous round
hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)
Round 6 — head
000bc0842c1c, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — kept 1 minor unique-to-A finding (verified against HEAD); Reviewer B's empty set not adopted as confirmation of absence since A's grounded finding was confirmed by direct code read.
Summary
Arbitration — Round 6
Prior-round findings: Both reviewers confirmed all 13 findings from rounds 1–5 are resolved. No re-raises needed.
New finding reconciliation:
ExecutionExceptionunwrap race inMintContextAggregator.aggregate()at line 321.Verification of unique-to-A finding (Rule 2):
Read
MintContextAggregator.javalines 297–340 confirmed the described code pattern exactly:CompletableFuture.allOf(familyFuture, userFuture)at line 297 — completes on first failing future.both.get(FANOUT_JOIN_TIMEOUT_MS, ...)at line 305.catch (ExecutionException ee)→attachStatusBest(...)→rethrow(ee.getCause())at line 321 — no inspection of the other future's completion cause.Read
CentrifugoTokenController.javalines 268–291 confirmed theUserNotFoundContextFetchFailedExceptioncatch arm does not callspan.setStatus(ERROR)— intentional per ADR-0013 §3.attachStatusBest(lines 379–411) setsStatusCode.ERRORon the fanout span only, not on the outerticket.mintspan.Verdict: The race is real and grounded. Scenario: userFuture completes fast with NOT_FOUND while familyFuture has a slower transient failure (DEADLINE_EXCEEDED / UNAVAILABLE).
allOf().get()fires on NOT_FOUND;ee.getCause()=UserNotFoundContextFetchFailedException; controller setsoutcome=context_fetch_user_not_foundwithout ERROR onticket.mintspan. On-call sees normal-churn metric and may dismiss a genuine infrastructure outage. The fanout span correctly shows ERROR one level down. Finding kept at minor — pure observability masking, response correctness unaffected.Memora persistence: Attempted to create run-summary memory; Memora server auto-tag assignment hit an allowlist restriction (tag
memora/issuesnot permitted). Memory could not be persisted due to server-side configuration issue.Blast Radius
The diff is largely contained within the new
mintcontextpackage (MintContextAggregator, FamilyContextClient, UserProfileClient) plus plumbing changes to CentrifugoTokenController and GlobalExceptionHandler. The fan-out touches two downstream gRPC services (family-service, user-service) and the circuit-breaker registry, so any misbehaviour propagates to the token-mint critical path. Score 4: exported surface within one service boundary, not a public protocol change.BLAST_SCORE: 4/10
Risk Indicators
MintContextAggregator.aggregate,CentrifugoTokenController.mintToken,FamilyContextClient.getFamilyForUser,UserProfileClient.getUserProfileCI status (head
000bc0842c1c)No CI checks reported for this commit.
Findings (1)
[MINOR]
ExecutionExceptionunwrap is race-sensitive — a fast NOT_FOUND can mask a concurrent transient infrastructure failure at the controller levelsrc/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:321
Problem
CompletableFuture.allOf(familyFuture, userFuture).get()(line 305) completes exceptionally as soon as the first failing future is done — it does not wait for the second. At line 321,rethrow(ee.getCause())therefore propagates whichever future lost the race, regardless of severity.Verified scenario:
userFuturecompletes at ~50 ms withUserNotFoundContextFetchFailedException(fast NOT_FOUND — record absent).familyFutureis still in-flight (e.g., DEADLINE_EXCEEDED at 2 000 ms or UNAVAILABLE from a slow-refuse).allOf().get()fires immediately at step 1;ee.getCause()=UserNotFoundContextFetchFailedException.rethrowpropagatesUserNotFoundContextFetchFailedException.outcome = context_fetch_user_not_found, nospan.setStatus(ERROR)onticket.mint.Meanwhile
attachStatusBest(lines 383–410, verified) correctly marks the fanout span ERROR (familyFuture.isCompletedExceptionally() || !familyFuture.isDone()→ STATUS_ERROR;span.setStatus(ERROR, "fanout-failure")). But the outerticket.mintspan carries no ERROR status andoutcome=context_fetch_user_not_found. On-call sees normal-churn metric and may dismiss a genuine infrastructure outage while the fanout span tells the truth one level down.Recommended fix
After
attachStatusBest, inspect both futures for their completion cause and prefer the transient classification when one raised a non-NOT_FOUNDTicketContextFetchFailedException:This preserves existing behaviour for all-success and same-classification-failure cases, and correctly promotes the transient signal so the controller sets
span.setStatus(ERROR)and pages on-call.Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 6 • 1 finding (1m) • 2026-05-24T21:55:49.967Z → 2026-05-24T21:58:45.886Z • posted-as: pr-reviewer-bot
R6 verdict finding (kept=1): (1) MINOR MintContextAggregator.java:321 — ExecutionException unwrap is race-sensitive. `allOf(familyFuture, userFuture).get()` completes as soon as the FIRST failing future is done; it does NOT wait for the second. A fast NOT_FOUND (~50 ms — record absent) can therefore race a slow transient (e.g. DEADLINE_EXCEEDED at 2 s, or UNAVAILABLE slow-refuse) and the NOT_FOUND wins. ee.getCause() then propagates the permanent classification; the controller's NOT_FOUND arm leaves the outer ticket.mint span StatusCode.UNSET + uses the context_fetch_user_not_found outcome label, so on-call sees normal account-deletion churn instead of a real downstream outage. The fanout CHILD span (one level down) correctly says ERROR via attachStatusBest, but the two signals decouple — the outer-span truth is wrong. Fix: - New `completionCauseIfTransient(CompletableFuture<?>)` helper inspects a terminally-completed future for a non-NOT_FOUND TicketContextFetchFailedException cause; returns null otherwise. - ExecutionException catch arm now: if winner is UserNotFoundContextFetchFailedException, scan BOTH futures for a transient cause via the helper; if found, promote it to `winner` and rethrow. Preserves existing behaviour for all-success + same-classification-failure cases. This is the third symmetric edit in the NOT_FOUND-as-permanent- classification contract (see Memora #3454 + #3455): - R5 (07cfd02): GlobalExceptionHandler dedicated handler. - R6 (000bc08): Controller-level catch arm + new outcome label. - R7 (this commit): Aggregator-level race-correction. All three layers must split for the contract to be honored end-to-end. Verification (containerised JDK 17 mvn install): - Tests run: 124, Failures: 0, Errors: 0, Skipped: 0 (was 123) - BUILD SUCCESS - New test: MintContextAggregatorTest#notFoundRacingTransientPromotesTransient arranges family-fast-NOT_FOUND + user-slow-transient and verifies the propagated exception is the transient (not NOT_FOUND).Show previous round
hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)
Round 7 — head
58efc2ab3dc9, basemain, triggersynchronizeTL;DR: CONDITIONAL_APPROVE — 1 agreed minor finding kept (both A and B independently flagged the same false Javadoc in
completionCauseIfTransient); no blocking or major issues; all 14 R1–R6 findings confirmed resolved.Summary
Arbitration — Round 7
Memora: No prior run history found (first-time arbitration despite round 7); create-path attempted but tag-validation errors from the Memora server prevented persistence.
Verification performed: Read
MintContextAggregator.javalines 460–494 at HEAD.Reconciliation
1 agreed finding (A + B): Both reviewers independently flag the same Javadoc paragraph in
completionCauseIfTransient. Verified at HEAD:allOf().get()throws — this is false per theCompletableFuture.allOf()contract (completes exceptionally when the first component fails; the second may still be in flight).@paramtag callsf"a fanout future already known to be terminally completed" — also false for the slow-transient case.isCompletedExceptionally()at line 481 short-circuits beforejoin()whenfis still pending. But the false Javadoc creates a trap for future maintainers who might remove the guard as "redundant".Line citation: A cites 471 (the sentence containing the false claim); B cites 469 (start of the paragraph). The problematic block spans 469–475. I cite 469 (block start) and document the exact false phrase.
Severity:
minor— agreed by both; the runtime behaviour is safe, only docs/observability gap.Blast Radius
The diff introduces a new gRPC fan-out subsystem (
MintContextAggregator,FamilyContextClient,UserProfileClient) wired into the hotCentrifugoTokenControllermint path, with two new protos, Resilience4j circuit-breaker wiring, and new exception/handler classes. The surface is consumed by a single controller action but spans five packages and introduces an outbound network I/O boundary with circuit-breaker configuration, raising the blast radius to moderate.BLAST_SCORE: 6/10
Risk Indicators
completionCauseIfTransient,serialiseUserContext,MintContextAggregator.aggregate,FamilyContextClient.getFamilyForUser,UserProfileClient.getUserProfileCI status (head
58efc2ab3dc9)No CI checks reported for this commit.
Findings (1)
[MINOR]
completionCauseIfTransientJavadoc falsely claims "both futures are guaranteed completed" — slow-transient functional gap persistssrc/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:469
Lines 469–473 read:
Line 475:
Why the claim is false: Per the
CompletableFuture.allOf()specification,allOf()completes exceptionally as soon as any one of its components completes exceptionally. WhenallOf().get()throwsExecutionException, only the first-failing future is guaranteed done; the other future may still be in flight.Why the code is currently safe: The
isCompletedExceptionally()guard at line 481 evaluates tofalsefor any future that is still pending (or that completed normally), causing an immediatereturn nullbeforejoin()is reached.join()is only called afterisCompletedExceptionally()confirms the future is terminally done exceptionally — that is the actual safety invariant, not theallOfcontract.The trap for future maintainers: Someone reading "both futures are guaranteed completed" may conclude the
isCompletedExceptionally()check is merely distinguishing exceptional vs. normal completion and remove it as redundant — that would re-introduce a blockingjoin()on an in-flight future, stalling amint-fanout-*thread until the per-RPC deadline fires.Functional gap (from R6, partially closed): The transient-cause promotion only succeeds when the transient future has already failed by the time the NOT_FOUND catch arm fires. For slow-transient futures (e.g. a 2 s
DEADLINE_EXCEEDEDfrom a downstream overload) that are still pending at catch time,isCompletedExceptionally()returnsfalse→completionCauseIfTransientreturnsnull→ NOT_FOUND wins the race → the outerticket.mintspan staysUNSETand thecontext_fetch_user_not_foundoutcome label fires, masking a real infra outage as routine account-deletion churn.Required fixes:
CompletableFuture.allOf()completes exceptionally when the first component fails; the returned future'sget()throwingExecutionExceptiondoes not guarantee the second future is done. TheisCompletedExceptionally()guard at line 481 is the actual safety net: iffis still in flight or completed normally the early-return fires beforejoin()is reached."@paramtag to: "fa fanout future that may or may not be done; returnsnullwhen still pending or completed normally".ExecutionExceptioncatch arm (~line 340) explicitly marking the transient-promotion as best-effort: if the other future is still pending at catch time its eventual failure is not surfaced at controller level —FANOUT_JOIN_TIMEOUT_MSprevents an indefinite block, but the outcome label and outer span status are already set.Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 7 • 1 finding (1m) • 2026-05-24T22:06:18.391Z → 2026-05-24T22:08:30.200Z • posted-as: pr-reviewer-bot
R7 verdict finding (kept=1): (1) MINOR MintContextAggregator.java:469-475 — completionCauseIfTransient Javadoc falsely claimed "both futures are guaranteed completed" by the time this helper is called from the ExecutionException arm. Per CompletableFuture.allOf() spec: allOf completes exceptionally as soon as the FIRST component fails; the other future may still be in-flight. Code remains safe because the isCompletedExceptionally() guard at line 481 short-circuits on pending futures BEFORE join() is reached — that guard IS the safety invariant, not the (false) allOf-completion claim. A maintainer trusting the Javadoc could remove the guard as "redundant", reintroducing a blocking join() on an in-flight future. Reviewer also called out the functional gap this leaves: a slow transient (e.g. 2 s DEADLINE_EXCEEDED still pending) is not promoted; the NOT_FOUND wins the race; outer ticket.mint span stays UNSET + outcome label stays permanent. FANOUT_JOIN_TIMEOUT_MS bounds the worst-case + attachStatusBest still marks the fanout CHILD span ERROR, so the SLO surface is bounded — but the outer- span/outcome-label resolution is best-effort. Fix (Javadoc-only): - Rewrote the safety-invariant paragraph to state the actual rule: isCompletedExceptionally() is the safety net, not the allOf claim. - Added a "best-effort caveat" paragraph documenting the slow-transient functional gap + why closing it fully would require defeating bounded fanout OR post-hoc reconciliation. - @param + @return tags reworded to reflect that f may be pending. - Added a BEST-EFFORT comment block in the ExecutionException arm pointing at the same caveat for future readers. Verification (containerised JDK 17 mvn install): - Tests run: 124, Failures: 0, Errors: 0, Skipped: 0 - BUILD SUCCESS - Javadoc-only delta; behaviour unchanged.hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)
Round 8 — head
90673882f8fc, basemain, triggersynchronizeTL;DR: NO_NEW_FINDINGS — No new findings this round.
Summary
[quorum-converged] A=0 = B=0. Round 8 — all 15 open findings are now resolved; no new issues found.
I read the following files at HEAD to verify closure:
MintContextAggregator.java(shutdown, executor, fanout join, attachStatusBest, completionCauseIfTransient)FamilyContextClient.javaandUserProfileClient.java(mapStatus, CB wiring)GlobalExceptionHandler.java(handler ordering, logging severity, OTel event calls)UserNotFoundContextFetchFailedException.javaandTicketContextFetchFailedException.java(Javadoc + error codes)SubscriptionTierMapper.java(IllegalArgumentException → TicketContextFetchFailedException rewrap confirmed)application.properties(recordExceptions / ignoreExceptions entries and comments)CentrifugoTokenController.javacatch-block ordering (from diff)Closure audit:
UserNotFoundContextFetchFailedExceptioninignoreExceptionsStatusRuntimeExceptioninrecordExceptionsIllegalArgumentExceptioninrecordExceptionsIllegalArgumentException → TicketContextFetchFailedExceptionrewrap viaparseAndValidate()newFixedThreadPool(16)unbounded queueArrayBlockingQueue(64)+CallerRunsPolicy@PreDestroynoawaitTerminationawaitTermination(SHUTDOWN_AWAIT_SECONDS, SECONDS)withshutdownNow()escalationattachStatusBestomits timing attributes on error pathboth.join()no wall-clock ceilingboth.get(FANOUT_JOIN_TIMEOUT_MS, MILLISECONDS)withTimeoutExceptionarmStatusRuntimeExceptionremappinghandleTicketContextFetchFailedlogs NOT_FOUND at ERROR@ExceptionHandler(UserNotFoundContextFetchFailedException.class)logs at WARN, no OTel error eventIllegalStateExceptiondiluting CB success rateIllegalStateExceptioninignoreExceptions(both instances)UserNotFoundContextFetchFailedExceptionJavadoc claims re-authRetry-After, not initiate token refreshOUTCOME_CONTEXT_FETCH_USER_NOT_FOUNDlabel;MintMetricsJavadoc updatedticket.mintspan ERRORspan.setStatus(StatusCode.ERROR, …)entirelyExecutionExceptionunwrap race-sensitivecompletionCauseIfTransientpromotes transient cause over NOT_FOUND when already done; best-effort semantics documentedcompletionCauseIfTransientJavadoc false invariantisCompletedExceptionally()guard and the slow-transient gapCI status (head
90673882f8fc)No CI checks reported for this commit.
Findings
No new findings this round.
Quorum converged on empty findings (A + B both returned 0).
Verdict
NO_NEW_FINDINGS
hib-pr-reviewer • round 8 • 0 findings • 2026-05-24T22:10:19.406Z → 2026-05-24T22:19:10.574Z • posted-as: pr-reviewer-bot • [bookkeeping fallback]