affinity-intelligence-rework/im2be-identity-service

Fork 0

feat(centrifugo): L0 T0 #6 PR-OPAQUE-4e — gRPC fan-out for mint-time user context #4

Merged

hibryda merged 8 commits from feat/l0-t0-opaque-4e-grpc-clients into main

2026-05-25 00:19:49 +02:00

hibryda commented

2026-05-24 22:21:44 +02:00

Owner

Summary

L0 Tranche-0 #6 PR-OPAQUE-4e — wires identity-service as the CONSUMER side of the Centrifugo opaque-ticket mint context aggregation. Issues family-service.GetFamilyForUser + user-service.GetUserProfile in parallel at mint time and assembles the full 6-field TicketUserContext per ADR-0002 §3a. Companion to:

PR-OPAQUE-4a — proto contracts (affinity-intelligence-rework/im2be-protobuf@316e77f).
PR-OPAQUE-4b — family-service server impl (affinity-intelligence-rework/im2be-family-service@6a1927e).
PR-OPAQUE-4c — user-service server impl (affinity-intelligence-rework/im2be-user-service@7e73084).

Changes

Production code

pom.xml — promotes grpc-client-spring-boot-starter from test-only to compile scope; adds resilience4j-spring-boot3:2.2.0 (verified 2026-05-24 against Maven Central as latest stable on Spring Boot 3.x line per rule 61).
src/main/proto/{family,user}/v1/*.proto — vendors meta-repo snapshot of family_context.proto + user_profile.proto; protobuf-maven-plugin now compiles 4 .proto files (was 2).
mintcontext/SubscriptionTierMapper — wire enum → canonical FREE/PRO/FAMILY labels; fail-closed on UNSPECIFIED + UNRECOGNIZED per user_profile.proto §rpc contract.
mintcontext/FamilyContextClient — outbound wrapper for GetFamilyForUser; @GrpcClient("family-service") injected blocking stub, Resilience4j circuit breaker, 2 s per-call deadline. Enforces the proto's 5-row presence table — rows 4 + 5 fail-closed symmetrically.
mintcontext/UserProfileClient — outbound wrapper for GetUserProfile; same shape. Fail-closed on UNSPECIFIED tier.
mintcontext/MintContextAggregator — parallel fan-out via CompletableFuture.allOf on a 16-thread pool, OTel context propagation, ticket.mint.fanout child span carrying per-RPC status + duration attributes.
CentrifugoTokenController — injects MintContextAggregator, replaces serialiseUserContext(userId) stub with serialiseUserContext(userId, ttl) that delegates to the aggregator. New catch arm for TicketContextFetchFailedException with context_fetch_failure outcome counter label.
exception/TicketContextFetchFailedException + GlobalExceptionHandler — new 503 + Retry-After: 1 mapping with distinct TICKET_CONTEXT_FETCH_FAILED error code so dashboards pivot Redis-side vs gRPC-side failures independently.
application.properties — grpc.client.{family,user}-service config + resilience4j circuit-breaker instances. Defaults: 50% failure threshold over 10 calls, 5 s open-state wait, COUNT_BASED sliding window.

ADR

ADR-0002 §3a edit landed in meta-repo main (docs/decisions/0002-centrifugo-opaque-ticket.md) — expanded with mint-flow diagram, parallel-fan-out semantics, circuit-breaker discipline, full failure-mapping table, staleness contract.

Tests

4 new test classes (39 tests): SubscriptionTierMapperTest (6), FamilyContextClientTest (13), UserProfileClientTest (13), MintContextAggregatorTest (7).
CentrifugoTokenControllerTest extended: 3 new tests (mintReturns503WhenContextFanoutFails, mintReturns500WhenContextFanoutThrowsIllegalState, mintSerialisesNoFamilyCaseWithNullFamilyId); existing tests updated to stub the aggregator + register JavaTimeModule on the standalone-test ObjectMapper for Instant serialisation symmetry with Spring Boot's auto-configured mapper.

Verification

mvn -DskipTests=false clean install → BUILD SUCCESS, 118 tests pass (was 79), ZERO [WARNING] lines (rule 62 — clean compilation).
Vioxen + plugins working trees clean (rule 53).

Test plan

mvn -P unit-tests test — 118 tests pass.
mvn -DskipTests=false clean install — BUILD SUCCESS, zero warnings.
Concurrent-mint smoke test (concurrentMintsProduceAllUniqueTickets) still passes after the refactor.
Aggregator parallelism proven via latch-based test (clientsRunInParallel).
Circuit-breaker open-state path covered for both clients.
Fail-closed paths covered for: row 4 / row 5 family-service shape, UNSPECIFIED user-service tier.
Integration smoke against live family-service + user-service — deferred to PR-OPAQUE-6 E2E qa-rig flow.

## Summary L0 Tranche-0 #6 **PR-OPAQUE-4e** — wires identity-service as the CONSUMER side of the Centrifugo opaque-ticket mint context aggregation. Issues `family-service.GetFamilyForUser` + `user-service.GetUserProfile` in parallel at mint time and assembles the full 6-field `TicketUserContext` per ADR-0002 §3a. Companion to: - **PR-OPAQUE-4a** — proto contracts (`affinity-intelligence-rework/im2be-protobuf@316e77f`). - **PR-OPAQUE-4b** — family-service server impl (`affinity-intelligence-rework/im2be-family-service@6a1927e`). - **PR-OPAQUE-4c** — user-service server impl (`affinity-intelligence-rework/im2be-user-service@7e73084`). ## Changes ### Production code - **`pom.xml`** — promotes `grpc-client-spring-boot-starter` from test-only to compile scope; adds `resilience4j-spring-boot3:2.2.0` (verified 2026-05-24 against Maven Central as latest stable on Spring Boot 3.x line per rule 61). - **`src/main/proto/{family,user}/v1/*.proto`** — vendors meta-repo snapshot of `family_context.proto` + `user_profile.proto`; protobuf-maven-plugin now compiles 4 .proto files (was 2). - **`mintcontext/SubscriptionTierMapper`** — wire enum → canonical `FREE`/`PRO`/`FAMILY` labels; fail-closed on `UNSPECIFIED` + `UNRECOGNIZED` per `user_profile.proto` §rpc contract. - **`mintcontext/FamilyContextClient`** — outbound wrapper for `GetFamilyForUser`; `@GrpcClient("family-service")` injected blocking stub, Resilience4j circuit breaker, 2 s per-call deadline. Enforces the proto's 5-row presence table — rows 4 + 5 fail-closed symmetrically. - **`mintcontext/UserProfileClient`** — outbound wrapper for `GetUserProfile`; same shape. Fail-closed on `UNSPECIFIED` tier. - **`mintcontext/MintContextAggregator`** — parallel fan-out via `CompletableFuture.allOf` on a 16-thread pool, OTel context propagation, `ticket.mint.fanout` child span carrying per-RPC status + duration attributes. - **`CentrifugoTokenController`** — injects `MintContextAggregator`, replaces `serialiseUserContext(userId)` stub with `serialiseUserContext(userId, ttl)` that delegates to the aggregator. New catch arm for `TicketContextFetchFailedException` with `context_fetch_failure` outcome counter label. - **`exception/TicketContextFetchFailedException` + `GlobalExceptionHandler`** — new 503 + `Retry-After: 1` mapping with distinct `TICKET_CONTEXT_FETCH_FAILED` error code so dashboards pivot Redis-side vs gRPC-side failures independently. - **`application.properties`** — `grpc.client.{family,user}-service` config + resilience4j circuit-breaker instances. Defaults: 50% failure threshold over 10 calls, 5 s open-state wait, COUNT_BASED sliding window. ### ADR ADR-0002 §3a edit landed in meta-repo `main` (`docs/decisions/0002-centrifugo-opaque-ticket.md`) — expanded with mint-flow diagram, parallel-fan-out semantics, circuit-breaker discipline, full failure-mapping table, staleness contract. ### Tests - **4 new test classes (39 tests)**: `SubscriptionTierMapperTest` (6), `FamilyContextClientTest` (13), `UserProfileClientTest` (13), `MintContextAggregatorTest` (7). - **`CentrifugoTokenControllerTest`** extended: 3 new tests (`mintReturns503WhenContextFanoutFails`, `mintReturns500WhenContextFanoutThrowsIllegalState`, `mintSerialisesNoFamilyCaseWithNullFamilyId`); existing tests updated to stub the aggregator + register `JavaTimeModule` on the standalone-test `ObjectMapper` for `Instant` serialisation symmetry with Spring Boot's auto-configured mapper. ## Verification - `mvn -DskipTests=false clean install` → **BUILD SUCCESS**, **118 tests pass** (was 79), **ZERO `[WARNING]` lines** (rule 62 — clean compilation). - Vioxen + plugins working trees clean (rule 53). ## Test plan - [x] `mvn -P unit-tests test` — 118 tests pass. - [x] `mvn -DskipTests=false clean install` — BUILD SUCCESS, zero warnings. - [x] Concurrent-mint smoke test (`concurrentMintsProduceAllUniqueTickets`) still passes after the refactor. - [x] Aggregator parallelism proven via latch-based test (`clientsRunInParallel`). - [x] Circuit-breaker open-state path covered for both clients. - [x] Fail-closed paths covered for: row 4 / row 5 family-service shape, UNSPECIFIED user-service tier. - [ ] Integration smoke against live family-service + user-service — deferred to PR-OPAQUE-6 E2E qa-rig flow.

hibryda added 1 commit

2026-05-24 22:21:44 +02:00

feat(centrifugo): L0 T0 #6 PR-OPAQUE-4e — gRPC fan-out for mint-time user context ad10ae12b4

Wires identity-service as the consumer of PR-OPAQUE-4b
(family-service.GetFamilyForUser) + PR-OPAQUE-4c
(user-service.GetUserProfile). Replaces the PR-OPAQUE-2 stub
TicketUserContext (user_id + minted_at only) with the full
6-field realtime.v1-shaped record aggregated in parallel from
both downstream services at mint time, per ADR-0002 §3a.

Changes:

- pom.xml — promotes grpc-client-spring-boot-starter from
  test-only to compile scope; adds resilience4j-spring-boot3
  2.2.0 (verified 2026-05-24 against Maven Central as latest
  stable on the Spring Boot 3.x line per rule 61).
- src/main/proto/{family,user}/v1/*.proto — vendors a snapshot
  of the meta-repo's family_context.proto + user_profile.proto;
  protobuf-maven-plugin now compiles 4 .proto files (was 2),
  generating FamilyContextServiceGrpc + UserProfileServiceGrpc
  client stubs alongside the existing ConnectProxyService server
  stub.
- mintcontext/SubscriptionTierMapper — maps SubscriptionTier
  wire enum → canonical FREE/PRO/FAMILY labels; UNSPECIFIED +
  UNRECOGNIZED fail-closed per user_profile.proto §rpc contract.
- mintcontext/FamilyContextClient — outbound wrapper for
  GetFamilyForUser; injected @GrpcClient("family-service") stub,
  Resilience4j circuit breaker, 2s per-call deadline. Enforces
  the proto's 5-row presence table — rows 4 + 5 fail-closed
  symmetrically.
- mintcontext/UserProfileClient — outbound wrapper for
  GetUserProfile; same shape. Fail-closed on UNSPECIFIED tier.
- mintcontext/MintContextAggregator — parallel fan-out via
  CompletableFuture.allOf on a 16-thread pool, OTel context
  propagation, ticket.mint.fanout child span carrying per-RPC
  status + duration attributes.
- CentrifugoTokenController — injects MintContextAggregator,
  replaces serialiseUserContext(userId) stub with
  serialiseUserContext(userId, ttl) that delegates to the
  aggregator. Adds catch arm for TicketContextFetchFailedException
  with new context_fetch_failure outcome counter label.
- exception/TicketContextFetchFailedException +
  GlobalExceptionHandler.handleTicketContextFetchFailed — new
  503 + Retry-After:1 mapping with distinct
  TICKET_CONTEXT_FETCH_FAILED error code so dashboards pivot
  Redis-side vs gRPC-side failures independently.
- application.properties — grpc.client.{family,user}-service
  config + resilience4j circuit-breaker instances. Defaults:
  50% failure threshold over 10 calls, 5s open-state wait,
  COUNT_BASED sliding window.

ADR-0002 §3a — expanded with the gRPC fan-out flow diagram,
parallel-fan-out semantics, circuit-breaker discipline, full
failure-mapping table.

Tests:

- 4 new test classes (39 tests): SubscriptionTierMapperTest (6),
  FamilyContextClientTest (13), UserProfileClientTest (13),
  MintContextAggregatorTest (7).
- Existing CentrifugoTokenControllerTest extended: 2 new tests
  (mintReturns503WhenContextFanoutFails,
  mintReturns500WhenContextFanoutThrowsIllegalState,
  mintSerialisesNoFamilyCaseWithNullFamilyId); existing tests
  updated to stub the aggregator + register JavaTimeModule on
  the standalone-test ObjectMapper for Instant serialisation
  symmetry with Spring Boot's auto-configured mapper.

Verification:

- mvn -DskipTests=false clean install → BUILD SUCCESS, 118
  tests pass (was 79), ZERO [WARNING] lines (rule 62).
- Vioxen + plugins working trees clean (rule 53).

pr-reviewer-bot commented

2026-05-24 22:27:46 +02:00

Member

Superseded by round 2.

Show previous round

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Round 1 — head ad10ae12b45e, base main, trigger opened

TL;DR: NEEDS_WORK — kept 5 findings (1 agreed minor, 3 unique-verified minor, 1 unique-verified major); NOT_FOUND mapped to a recorded CB exception is the blocking concern.

Summary

Arbitration summary

First run for this PR — no prior history in Memora.

Memora recalled: 0 prior runs, 0 submodule patterns.

Verification performed:

application.properties lines 105-142: confirmed dead StatusRuntimeException entry (lines 129, 141) and stale comment (lines 112-115 say IllegalArgumentException; actual entry is TicketContextFetchFailedException).
FamilyContextClient.java lines 186-254: confirmed invoke() catches and remaps all StatusRuntimeException before CB sees it; confirmed NOT_FOUND maps to TicketContextFetchFailedException (line 238), which IS in recordExceptions.
UserProfileClient.java lines 133-177: same pattern confirmed; NOT_FOUND → TicketContextFetchFailedException (line 163).
MintContextAggregator.java lines 95, 141-161: confirmed FANOUT_POOL_SIZE=16, newFixedThreadPool(16) (unbounded LinkedBlockingQueue), and @PreDestroy shutdown() is non-blocking with Javadoc claiming it drains.

Reconciliation: Kept 1 agreed finding (A1≡B2), verified and kept 4 unique findings (A2, A3, B1, B3). No finding dropped — all 5 grounded in code.

Run persisted to Memora (id=229).

Blast Radius

The diff introduces a new gRPC fan-out layer on the mint hot-path (CentrifugoTokenController → MintContextAggregator → FamilyContextClient + UserProfileClient), wiring two new outbound network I/O boundaries protected by circuit breakers. Changes span 17 files across the controller, three new service classes, proto definitions, exception hierarchy, application config, and a full test suite. The circuit-breaker misconfiguration affects the entire user population when the CB trips open.

BLAST_SCORE: 6/10

Risk Indicators

Indicator	Value
Sensitive functions	`CentrifugoTokenController.serialiseUserContext`, `MintContextAggregator.aggregate`, `FamilyContextClient.mapStatus`, `UserProfileClient.mapStatus`, `FamilyContextClient.invoke`, `UserProfileClient.invoke`
Migration touched	—
Test delta	—
Dependency changes	—

CI status (head `ad10ae12b45e`)

No CI checks reported for this commit.

PR-OPAQUE-1 (grpc-client-spring-boot-starter promoted from test scope)
PR-OPAQUE-2 (user context stub — user_id + minted_at only)
PR-OPAQUE-3 (Validate path — ConnectValidateResponse consumer)

Findings (5)

[MAJOR] `NOT_FOUND` counted as a circuit-breaker failure — permanent errors can trip the breaker open for all users

src/main/java/com/aim2be/identity/mintcontext/FamilyContextClient.java:233

Both FamilyContextClient.mapStatus() (line 233–238) and UserProfileClient.mapStatus() (line 162–164) remap gRPC NOT_FOUND to TicketContextFetchFailedException. That class is listed in both recordExceptions configs (application.properties lines 129–130 and 141–142).

Verified at HEAD: NOT_FOUND case in FamilyContextClient.mapStatus() at line 233:

case NOT_FOUND:
    return new TicketContextFetchFailedException("family-service: user not found");

Same pattern at UserProfileClient.java:162.

With slidingWindowSize=10 and failureRateThreshold=50, five consecutive NOT_FOUND calls in a ten-call window open the circuit and block all mint requests for the waitDurationInOpenState=5s window. An account-deletion sweep followed by Centrifugo reconnect storms (typical after a backend restart) can easily produce this pattern. NOT_FOUND is a permanent semantic error, not a transient infrastructure fault — it is not what circuit breakers are designed to isolate.

The code comment even acknowledges the distinction: "the client should re-auth (which will surface the real 'user gone' error … before reaching us again)" — confirming this is expected, not exceptional, infrastructure behaviour.

Fix: Introduce a narrow subclass (e.g. UserNotFoundContextFailedException extends TicketContextFetchFailedException) that is NOT listed in recordExceptions, and use it for the NOT_FOUND branch in both mapStatus() implementations. Optionally add ignoreExceptions=com.aim2be.identity.exception.UserNotFoundContextFailedException for clarity. The TicketContextFetchFailedException entry then covers only genuinely transient paths (UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL, circuit-open, fail-closed shape errors).

[MINOR] Dead `io.grpc.StatusRuntimeException` entry in both `recordExceptions` configs

src/main/resources/application.properties:129

Verified at HEAD: both FamilyContextClient.invoke() (line 192) and UserProfileClient.invoke() (line 139) wrap the stub call in catch (StatusRuntimeException ex) { throw mapStatus(ex); }. Every StatusRuntimeException is caught and remapped to either TicketContextFetchFailedException or IllegalStateException before control returns to circuitBreaker.executeSupplier(…). The circuit breaker therefore never observes a raw StatusRuntimeException.

The dead entries are at application.properties:129 (family-service) and application.properties:141 (user-service). They create false confidence that unhandled gRPC exceptions would be caught by the breaker and confuse future maintainers reasoning about what counts as a failure.

Fix: Remove io.grpc.StatusRuntimeException,\ from both recordExceptions blocks. The effective recording set is already just TicketContextFetchFailedException. (This finding is confirmed by both independent reviewers.)

[MINOR] Comment incorrectly claims `IllegalArgumentException` is in `recordExceptions`

src/main/resources/application.properties:114

Verified at HEAD: lines 112–115 state:

# `recordExceptions` includes io.grpc.StatusRuntimeException
# (every transient-or-permanent gRPC failure surfaces as one
# of its subclasses) + IllegalArgumentException (the proto-
# level fail-closed path the wrappers raise on bad shapes).

The actual config at lines 129–130 and 141–142 lists com.aim2be.identity.exception.TicketContextFetchFailedException, not IllegalArgumentException. The IllegalArgumentException thrown by SubscriptionTierMapper.toLabel() is caught inside UserProfileClient.parseAndValidate() and re-thrown as TicketContextFetchFailedException before reaching the circuit breaker — so the actual config is correct, but the comment is wrong.

Fix: Replace the + IllegalArgumentException clause with + TicketContextFetchFailedException (the fail-closed exception all wrappers raise — StatusRuntimeException is remapped to this before it reaches the CB).

[MINOR] `newFixedThreadPool(16)` uses unbounded queue — no back-pressure beyond 8 concurrent mint requests

src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:142

Verified at HEAD: FANOUT_POOL_SIZE = 16 (line 95); defaultExecutor() at line 141–152 calls Executors.newFixedThreadPool(16, factory), which internally uses an unbounded LinkedBlockingQueue. With 2 supplyAsync tasks per aggregate() call, the pool saturates at 8 concurrent in-flight mint requests. Beyond that, new tasks queue without bound while HTTP threads block on both.join(). There is no back-pressure signal to the Tomcat thread pool; under burst load, the executor queue grows unboundedly and per-request latency grows proportionally to queue depth.

Concrete fix:

new ThreadPoolExecutor(
    FANOUT_POOL_SIZE,
    FANOUT_POOL_SIZE * 2,
    60L, TimeUnit.SECONDS,
    new ArrayBlockingQueue<>(FANOUT_POOL_SIZE * 4),
    factory,
    new ThreadPoolExecutor.CallerRunsPolicy() // back-pressure to HTTP thread
)

CallerRunsPolicy applies back-pressure to the Tomcat thread pool rather than queuing without limit. Size the ArrayBlockingQueue capacity to expected burst concurrency × 2 once the SLO is established.

[MINOR] `@PreDestroy shutdown()` does not await executor termination — Javadoc claim of "drains" is incorrect

src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:158

Verified at HEAD (lines 154–161):

/**
 * Drains the executor on Spring context shutdown so the JVM can
 * exit cleanly even under graceful shutdown.
 */
@PreDestroy
public void shutdown() {
    fanoutExecutor.shutdown();   // non-blocking — returns immediately
}

ExecutorService.shutdown() initiates orderly shutdown but returns without blocking — it does not wait for in-flight or queued tasks to complete. The Javadoc claim that this method "drains" the executor is incorrect.

Under the current deployment (server.shutdown=graceful + spring.lifecycle.timeout-per-shutdown-phase=30s), the servlet container drains in-flight HTTP requests before the application context closes, so @PreDestroy typically fires when no aggregate() calls are active. However, the ordering is not guaranteed under all shutdown signals, and reducing the grace window below the 2 s per-RPC deadline plus scheduling overhead could expose a race.

Fix:

@PreDestroy
public void shutdown() {
    fanoutExecutor.shutdown();
    try {
        if (!fanoutExecutor.awaitTermination(5, TimeUnit.SECONDS)) {
            fanoutExecutor.shutdownNow();
        }
    } catch (InterruptedException e) {
        fanoutExecutor.shutdownNow();
        Thread.currentThread().interrupt();
    }
}

Also update the Javadoc to accurately describe the blocking behaviour.

Verdict

NEEDS_WORK

_{hib-pr-reviewer • round 1 • 5 findings (1M/4m) • 2026-05-24T20:25:13.740Z → 2026-05-24T20:27:46.203Z • posted-as: pr-reviewer-bot}

> _Superseded by round 2._ <details> <summary>Show previous round</summary> ## hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service) **Round 1** — head `ad10ae12b45e`, base `main`, trigger `opened` **TL;DR:** NEEDS_WORK — kept 5 findings (1 agreed minor, 3 unique-verified minor, 1 unique-verified major); NOT_FOUND mapped to a recorded CB exception is the blocking concern. ### Summary ## Arbitration summary First run for this PR — no prior history in Memora. **Memora recalled:** 0 prior runs, 0 submodule patterns. **Verification performed:** - `application.properties` lines 105-142: confirmed dead `StatusRuntimeException` entry (lines 129, 141) and stale comment (lines 112-115 say `IllegalArgumentException`; actual entry is `TicketContextFetchFailedException`). - `FamilyContextClient.java` lines 186-254: confirmed `invoke()` catches and remaps all `StatusRuntimeException` before CB sees it; confirmed `NOT_FOUND` maps to `TicketContextFetchFailedException` (line 238), which IS in `recordExceptions`. - `UserProfileClient.java` lines 133-177: same pattern confirmed; `NOT_FOUND` → `TicketContextFetchFailedException` (line 163). - `MintContextAggregator.java` lines 95, 141-161: confirmed `FANOUT_POOL_SIZE=16`, `newFixedThreadPool(16)` (unbounded `LinkedBlockingQueue`), and `@PreDestroy shutdown()` is non-blocking with Javadoc claiming it drains. **Reconciliation:** Kept 1 agreed finding (A1≡B2), verified and kept 4 unique findings (A2, A3, B1, B3). No finding dropped — all 5 grounded in code. **Run persisted** to Memora (id=229). ### Blast Radius The diff introduces a new gRPC fan-out layer on the mint hot-path (CentrifugoTokenController → MintContextAggregator → FamilyContextClient + UserProfileClient), wiring two new outbound network I/O boundaries protected by circuit breakers. Changes span 17 files across the controller, three new service classes, proto definitions, exception hierarchy, application config, and a full test suite. The circuit-breaker misconfiguration affects the entire user population when the CB trips open. **BLAST_SCORE: 6/10** ### Risk Indicators | Indicator | Value | |---|---| | Sensitive functions | `CentrifugoTokenController.serialiseUserContext`, `MintContextAggregator.aggregate`, `FamilyContextClient.mapStatus`, `UserProfileClient.mapStatus`, `FamilyContextClient.invoke`, `UserProfileClient.invoke` | | Migration touched | — | | Test delta | — | | Dependency changes | — | ### CI status (head `ad10ae12b45e`) _No CI checks reported for this commit._ ### Related PRs - PR-OPAQUE-1 (grpc-client-spring-boot-starter promoted from test scope) - PR-OPAQUE-2 (user context stub — user_id + minted_at only) - PR-OPAQUE-3 (Validate path — ConnectValidateResponse consumer) ### Findings (5) #### **[MAJOR]** `NOT_FOUND` counted as a circuit-breaker failure — permanent errors can trip the breaker open for all users _src/main/java/com/aim2be/identity/mintcontext/FamilyContextClient.java:233_ Both `FamilyContextClient.mapStatus()` (line 233–238) and `UserProfileClient.mapStatus()` (line 162–164) remap gRPC `NOT_FOUND` to `TicketContextFetchFailedException`. That class is listed in both `recordExceptions` configs (`application.properties` lines 129–130 and 141–142). Verified at HEAD: `NOT_FOUND` case in `FamilyContextClient.mapStatus()` at line 233: ```java case NOT_FOUND: return new TicketContextFetchFailedException("family-service: user not found"); ``` Same pattern at `UserProfileClient.java:162`. With `slidingWindowSize=10` and `failureRateThreshold=50`, **five consecutive NOT_FOUND calls in a ten-call window** open the circuit and block all mint requests for the `waitDurationInOpenState=5s` window. An account-deletion sweep followed by Centrifugo reconnect storms (typical after a backend restart) can easily produce this pattern. NOT_FOUND is a **permanent semantic error**, not a transient infrastructure fault — it is not what circuit breakers are designed to isolate. The code comment even acknowledges the distinction: *"the client should re-auth (which will surface the real 'user gone' error … before reaching us again)"* — confirming this is expected, not exceptional, infrastructure behaviour. **Fix:** Introduce a narrow subclass (e.g. `UserNotFoundContextFailedException extends TicketContextFetchFailedException`) that is NOT listed in `recordExceptions`, and use it for the `NOT_FOUND` branch in both `mapStatus()` implementations. Optionally add `ignoreExceptions=com.aim2be.identity.exception.UserNotFoundContextFailedException` for clarity. The `TicketContextFetchFailedException` entry then covers only genuinely transient paths (UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL, circuit-open, fail-closed shape errors). #### **[MINOR]** Dead `io.grpc.StatusRuntimeException` entry in both `recordExceptions` configs _src/main/resources/application.properties:129_ Verified at HEAD: both `FamilyContextClient.invoke()` (line 192) and `UserProfileClient.invoke()` (line 139) wrap the stub call in `catch (StatusRuntimeException ex) { throw mapStatus(ex); }`. Every `StatusRuntimeException` is caught and remapped to either `TicketContextFetchFailedException` or `IllegalStateException` before control returns to `circuitBreaker.executeSupplier(…)`. The circuit breaker therefore never observes a raw `StatusRuntimeException`. The dead entries are at `application.properties:129` (family-service) and `application.properties:141` (user-service). They create false confidence that unhandled gRPC exceptions would be caught by the breaker and confuse future maintainers reasoning about what counts as a failure. **Fix:** Remove `io.grpc.StatusRuntimeException,\` from both `recordExceptions` blocks. The effective recording set is already just `TicketContextFetchFailedException`. *(This finding is confirmed by both independent reviewers.)* #### **[MINOR]** Comment incorrectly claims `IllegalArgumentException` is in `recordExceptions` _src/main/resources/application.properties:114_ Verified at HEAD: lines 112–115 state: ``` # `recordExceptions` includes io.grpc.StatusRuntimeException # (every transient-or-permanent gRPC failure surfaces as one # of its subclasses) + IllegalArgumentException (the proto- # level fail-closed path the wrappers raise on bad shapes). ``` The actual config at lines 129–130 and 141–142 lists `com.aim2be.identity.exception.TicketContextFetchFailedException`, not `IllegalArgumentException`. The `IllegalArgumentException` thrown by `SubscriptionTierMapper.toLabel()` is caught inside `UserProfileClient.parseAndValidate()` and re-thrown as `TicketContextFetchFailedException` before reaching the circuit breaker — so the actual config is correct, but the comment is wrong. **Fix:** Replace the `+ IllegalArgumentException` clause with `+ TicketContextFetchFailedException (the fail-closed exception all wrappers raise — StatusRuntimeException is remapped to this before it reaches the CB).` #### **[MINOR]** `newFixedThreadPool(16)` uses unbounded queue — no back-pressure beyond 8 concurrent mint requests _src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:142_ Verified at HEAD: `FANOUT_POOL_SIZE = 16` (line 95); `defaultExecutor()` at line 141–152 calls `Executors.newFixedThreadPool(16, factory)`, which internally uses an unbounded `LinkedBlockingQueue`. With 2 `supplyAsync` tasks per `aggregate()` call, the pool saturates at 8 concurrent in-flight mint requests. Beyond that, new tasks queue without bound while HTTP threads block on `both.join()`. There is no back-pressure signal to the Tomcat thread pool; under burst load, the executor queue grows unboundedly and per-request latency grows proportionally to queue depth. **Concrete fix:** ```java new ThreadPoolExecutor( FANOUT_POOL_SIZE, FANOUT_POOL_SIZE * 2, 60L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(FANOUT_POOL_SIZE * 4), factory, new ThreadPoolExecutor.CallerRunsPolicy() // back-pressure to HTTP thread ) ``` `CallerRunsPolicy` applies back-pressure to the Tomcat thread pool rather than queuing without limit. Size the `ArrayBlockingQueue` capacity to expected burst concurrency × 2 once the SLO is established. #### **[MINOR]** `@PreDestroy shutdown()` does not await executor termination — Javadoc claim of "drains" is incorrect _src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:158_ Verified at HEAD (lines 154–161): ```java /** * Drains the executor on Spring context shutdown so the JVM can * exit cleanly even under graceful shutdown. */ @PreDestroy public void shutdown() { fanoutExecutor.shutdown(); // non-blocking — returns immediately } ``` `ExecutorService.shutdown()` initiates orderly shutdown but returns without blocking — it does not wait for in-flight or queued tasks to complete. The Javadoc claim that this method "drains" the executor is incorrect. Under the current deployment (`server.shutdown=graceful` + `spring.lifecycle.timeout-per-shutdown-phase=30s`), the servlet container drains in-flight HTTP requests before the application context closes, so `@PreDestroy` typically fires when no `aggregate()` calls are active. However, the ordering is not guaranteed under all shutdown signals, and reducing the grace window below the 2 s per-RPC deadline plus scheduling overhead could expose a race. **Fix:** ```java @PreDestroy public void shutdown() { fanoutExecutor.shutdown(); try { if (!fanoutExecutor.awaitTermination(5, TimeUnit.SECONDS)) { fanoutExecutor.shutdownNow(); } } catch (InterruptedException e) { fanoutExecutor.shutdownNow(); Thread.currentThread().interrupt(); } } ``` Also update the Javadoc to accurately describe the blocking behaviour. ### Verdict **NEEDS_WORK** --- hib-pr-reviewer • round 1 • 5 findings (1M/4m) • 2026-05-24T20:25:13.740Z → 2026-05-24T20:27:46.203Z • posted-as: pr-reviewer-bot </details>

hibryda referenced this pull request from a commit

2026-05-24 22:55:17 +02:00

fix(mintcontext): apply PR #4 R1 reviewer findings (1 MAJOR + 4 minor)

hibryda added 1 commit

2026-05-24 22:55:18 +02:00

fix(mintcontext): apply PR #4 R1 reviewer findings (1 MAJOR + 4 minor) 2d1fedec92

R1 verdict NEEDS_WORK — kept 5 findings; all addressed per rule 63
(ripple-aware) + rule 68 (structured commit body).

(1) MAJOR FamilyContextClient.java:233 / UserProfileClient.java:162 —
    NOT_FOUND counted as a CB failure: a burst of NOT_FOUND (account-
    deletion sweep + Centrifugo reconnect storm) could trip the
    family-service / user-service circuit breaker open and starve the
    rest of the user population. NOT_FOUND is a permanent client-side
    classification, NOT a transient infrastructure fault, so it must
    NOT contribute to the failure-rate window.

    Fix: introduced UserNotFoundContextFetchFailedException (subclass
    of TicketContextFetchFailedException so the existing
    @ExceptionHandler(TicketContextFetchFailedException.class) still
    maps the wire shape to HTTP 503 + Retry-After:1 — wire contract
    unchanged). Both mapStatus() implementations now raise the
    subclass on NOT_FOUND. application.properties wires the subclass
    into resilience4j.circuitbreaker.instances.{family,user}-service
    .ignoreExceptions so the CB filter excludes it from the failure
    window. Only transient codes (UNAVAILABLE / DEADLINE_EXCEEDED /
    INTERNAL / circuit-open / proto fail-closed) remain CB-failure
    signals via the parent class.

(2) minor application.properties:129/141 — dead `io.grpc.StatusRuntimeException`
    entry in both recordExceptions blocks: every StatusRuntimeException
    is caught and remapped to TicketContextFetchFailedException or
    IllegalStateException by mapStatus() BEFORE reaching
    circuitBreaker.executeSupplier, so the CB never observes the raw
    gRPC type. Removed both entries.

(3) minor application.properties:114 — stale comment claimed
    IllegalArgumentException was in recordExceptions when the actual
    entry is TicketContextFetchFailedException. Rewrote the doc block
    to accurately describe the effective failure-signal class +
    document why io.grpc.StatusRuntimeException was dead code +
    document the new ignoreExceptions list.

(4) minor MintContextAggregator.java:142 — newFixedThreadPool(16) uses
    unbounded LinkedBlockingQueue: with 2 supplyAsync tasks per
    aggregate() call, the pool saturated at 8 concurrent mint requests
    and beyond that latency grew unboundedly without back-pressure to
    Tomcat. Replaced with ThreadPoolExecutor + ArrayBlockingQueue
    (capacity = FANOUT_POOL_SIZE * 4 = 64) + CallerRunsPolicy so the
    calling HTTP thread runs the task synchronously when the queue
    fills, propagating back-pressure into Tomcat's bounded
    connection-accept queue. Sized for L0 T0 #6 burst envelope —
    re-tune against the FINDING 53 mint-RPS SLO once production
    traffic shows up.

(5) minor MintContextAggregator.java:158 — @PreDestroy shutdown() did
    not await termination; the Javadoc's "drains" claim was incorrect.
    Replaced with shutdown() + awaitTermination(5s) + shutdownNow()
    escalation + InterruptedException-restores-flag + WARN log on
    timeout. Javadoc rewritten to accurately describe the blocking
    behaviour. Constant SHUTDOWN_AWAIT_SECONDS=5 chosen as 2.5× the
    per-RPC deadline so a steady-state in-flight RPC always drains.

Verification:
- mvn -DskipTests=false clean install → BUILD SUCCESS, 0 warnings
- Tests run: 120, Failures: 0, Errors: 0, Skipped: 0 (was 118 — +2)
  • FamilyContextClientTest: 13 → 14 (+circuitBreakerDoesNotOpenOnRepeatedNotFound)
  • UserProfileClientTest:   13 → 14 (+circuitBreakerDoesNotOpenOnRepeatedNotFound)
  • Existing notFoundMapsTo* tests strengthened to assert the new
    subclass (UserNotFoundContextFetchFailedException) while still
    asserting parent-type compatibility for the controller handler.
  • Existing circuitBreakerOpensAfterRepeatedFailures unchanged —
    still uses UNAVAILABLE (transient) so the CB-open assertion
    holds (NOT_FOUND-vs-UNAVAILABLE distinction is exactly the bug).

Refs PR #4 R1 verdict, ADR-0002 §3a, ADR-0014.

pr-reviewer-bot commented

2026-05-24 23:01:55 +02:00

Member

Superseded by round 3.

Show previous round

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Round 2 — head 2d1fedec921a, base main, trigger synchronize

TL;DR: CONDITIONAL_APPROVE — 2 minor findings kept this round.

Summary

Arbiter reconciled 1 (A) + 1 (B) → 2 findings.

CI status (head `2d1fedec921a`)

No CI checks reported for this commit.

Findings (2)

[MINOR] `attachStatusBest` omits all three timing attributes from the error-path span, breaking the Javadoc contract

src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:306

The class-level Javadoc (lines 47–57) unconditionally promises five attributes on every ticket.mint.fanout span:

fanout.family.status / fanout.user.status — ✅ written by attachStatusBest
fanout.duration_ms — ❌ never written on the error path
fanout.family.duration_ms / fanout.user.duration_ms — ❌ never written on the error path

attachStatusBest (lines 306–316) only sets the two status strings and calls span.setStatus(ERROR). Verified at HEAD via Read.

Impact:

On-call cannot distinguish a 0 ms circuit-open rejection from a 2 000 ms deadline-exceeded from the span alone.
When one future succeeds before the other fails, the successful future's TimedResult.durationMs is silently discarded.
Any trace-based SLO query against fanout.duration_ms will miss all error-path samples.

Suggested fix — add a fanoutStartNs parameter and recover per-RPC timing from completed futures:

private void attachStatusBest(Span span,
        CompletableFuture<TimedResult<FamilyContextClient.FamilyContextResult>> familyFuture,
        CompletableFuture<TimedResult<UserProfileClient.UserProfileResult>> userFuture,
        long fanoutStartNs) {
    span.setAttribute(ATTR_DURATION_MS, nsToMs(System.nanoTime() - fanoutStartNs));
    span.setAttribute(ATTR_FAMILY_STATUS,
            familyFuture.isCompletedExceptionally() || !familyFuture.isDone()
                    ? STATUS_ERROR : STATUS_SUCCESS);
    if (familyFuture.isDone() && !familyFuture.isCompletedExceptionally()) {
        try { span.setAttribute(ATTR_FAMILY_DURATION_MS, familyFuture.join().durationMs()); }
        catch (RuntimeException ignored) { }
    }
    // mirror for userFuture / ATTR_USER_DURATION_MS
    span.setStatus(StatusCode.ERROR, "fanout-failure");
}

Both call sites in aggregate() (lines 278 and 282) must also pass fanoutStartNs (already in scope there).

[MINOR] `both.join()` has no wall-clock ceiling — indefinite thread block possible if per-RPC deadline does not fire

src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:268

Line 268 is both.join() with no timeout argument (confirmed at HEAD via Read). The Javadoc at lines 44–45 explicitly acknowledges: "the join itself does not impose an additional timeout."

The per-RPC withDeadlineAfter(2000 ms) stubs are the sole safety net, but two failure modes can cause a blocking stub to outlive its deadline:

Half-open TCP socket — a firewall silently drops FIN; the blocking read never returns even after the deadline timer fires on the gRPC event-loop thread.
Event-loop thread starvation — the deadline expiry callback is delayed, leaving the blocking call live past 2 s.

Under either scenario the fanout worker thread blocks indefinitely. When the ArrayBlockingQueue(64) fills under burst load, CallerRunsPolicy causes the HTTP request thread to run the task directly, eating a Tomcat slot until the client closes the connection.

Suggested fix — replace both.join() with a bounded get() with headroom over the per-RPC deadline:

static final long FANOUT_JOIN_TIMEOUT_MS =
        Math.max(FamilyContextClient.CALL_DEADLINE_MS,
                 UserProfileClient.CALL_DEADLINE_MS) + 500L;

try {
    both.get(FANOUT_JOIN_TIMEOUT_MS, TimeUnit.MILLISECONDS);
} catch (TimeoutException te) {
    attachStatusBest(span, familyFuture, userFuture);
    throw new TicketContextFetchFailedException(
            "fanout join timed out — per-RPC deadline may not have fired", te);
} catch (ExecutionException ee) {
    attachStatusBest(span, familyFuture, userFuture);
    rethrow(ee.getCause());
    return null;
}

FamilyContextClient.CALL_DEADLINE_MS and UserProfileClient.CALL_DEADLINE_MS are in the same mintcontext package and accessible directly.

Verdict

CONDITIONAL_APPROVE

_{hib-pr-reviewer • round 2 • 2 findings (2m) • 2026-05-24T21:00:06.801Z → 2026-05-24T21:01:55.390Z • posted-as: pr-reviewer-bot • [bookkeeping fallback]}

> _Superseded by round 3._ <details> <summary>Show previous round</summary> ## hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service) **Round 2** — head `2d1fedec921a`, base `main`, trigger `synchronize` **TL;DR:** CONDITIONAL_APPROVE — 2 minor findings kept this round. ### Summary Arbiter reconciled 1 (A) + 1 (B) → 2 findings. ### CI status (head `2d1fedec921a`) _No CI checks reported for this commit._ ### Findings (2) #### **[MINOR]** `attachStatusBest` omits all three timing attributes from the error-path span, breaking the Javadoc contract _src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:306_ The class-level Javadoc (lines 47–57) unconditionally promises five attributes on every `ticket.mint.fanout` span: - `fanout.family.status` / `fanout.user.status` — ✅ written by `attachStatusBest` - `fanout.duration_ms` — ❌ **never written** on the error path - `fanout.family.duration_ms` / `fanout.user.duration_ms` — ❌ **never written** on the error path `attachStatusBest` (lines 306–316) only sets the two status strings and calls `span.setStatus(ERROR)`. Verified at HEAD via Read. **Impact:** 1. On-call cannot distinguish a 0 ms circuit-open rejection from a 2 000 ms deadline-exceeded from the span alone. 2. When one future succeeds before the other fails, the successful future's `TimedResult.durationMs` is silently discarded. 3. Any trace-based SLO query against `fanout.duration_ms` will miss all error-path samples. **Suggested fix** — add a `fanoutStartNs` parameter and recover per-RPC timing from completed futures: ```java private void attachStatusBest(Span span, CompletableFuture<TimedResult<FamilyContextClient.FamilyContextResult>> familyFuture, CompletableFuture<TimedResult<UserProfileClient.UserProfileResult>> userFuture, long fanoutStartNs) { span.setAttribute(ATTR_DURATION_MS, nsToMs(System.nanoTime() - fanoutStartNs)); span.setAttribute(ATTR_FAMILY_STATUS, familyFuture.isCompletedExceptionally() || !familyFuture.isDone() ? STATUS_ERROR : STATUS_SUCCESS); if (familyFuture.isDone() && !familyFuture.isCompletedExceptionally()) { try { span.setAttribute(ATTR_FAMILY_DURATION_MS, familyFuture.join().durationMs()); } catch (RuntimeException ignored) { } } // mirror for userFuture / ATTR_USER_DURATION_MS span.setStatus(StatusCode.ERROR, "fanout-failure"); } ``` Both call sites in `aggregate()` (lines 278 and 282) must also pass `fanoutStartNs` (already in scope there). #### **[MINOR]** `both.join()` has no wall-clock ceiling — indefinite thread block possible if per-RPC deadline does not fire _src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:268_ Line 268 is `both.join()` with no timeout argument (confirmed at HEAD via Read). The Javadoc at lines 44–45 explicitly acknowledges: *"the join itself does not impose an additional timeout."* The per-RPC `withDeadlineAfter(2000 ms)` stubs are the sole safety net, but two failure modes can cause a blocking stub to outlive its deadline: 1. **Half-open TCP socket** — a firewall silently drops FIN; the blocking read never returns even after the deadline timer fires on the gRPC event-loop thread. 2. **Event-loop thread starvation** — the deadline expiry callback is delayed, leaving the blocking call live past 2 s. Under either scenario the fanout worker thread blocks indefinitely. When the `ArrayBlockingQueue(64)` fills under burst load, `CallerRunsPolicy` causes the HTTP request thread to run the task directly, eating a Tomcat slot until the client closes the connection. **Suggested fix** — replace `both.join()` with a bounded `get()` with headroom over the per-RPC deadline: ```java static final long FANOUT_JOIN_TIMEOUT_MS = Math.max(FamilyContextClient.CALL_DEADLINE_MS, UserProfileClient.CALL_DEADLINE_MS) + 500L; try { both.get(FANOUT_JOIN_TIMEOUT_MS, TimeUnit.MILLISECONDS); } catch (TimeoutException te) { attachStatusBest(span, familyFuture, userFuture); throw new TicketContextFetchFailedException( "fanout join timed out — per-RPC deadline may not have fired", te); } catch (ExecutionException ee) { attachStatusBest(span, familyFuture, userFuture); rethrow(ee.getCause()); return null; } ``` `FamilyContextClient.CALL_DEADLINE_MS` and `UserProfileClient.CALL_DEADLINE_MS` are in the same `mintcontext` package and accessible directly. ### Verdict **CONDITIONAL_APPROVE** --- hib-pr-reviewer • round 2 • 2 findings (2m) • 2026-05-24T21:00:06.801Z → 2026-05-24T21:01:55.390Z • posted-as: pr-reviewer-bot • [bookkeeping fallback] </details>

hibryda referenced this pull request from a commit

2026-05-24 23:08:37 +02:00

fix(mintcontext): apply PR #4 R2 reviewer findings (2 MINOR)

hibryda added 1 commit

2026-05-24 23:08:38 +02:00

fix(mintcontext): apply PR #4 R2 reviewer findings (2 MINOR) 098b687d52

R2 verdict findings (kept=2):

(1) MINOR src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:306
    — attachStatusBest omits all three timing attributes from the
    error-path span, breaking the Javadoc contract. The class-level
    Javadoc promises 5 attributes (fanout.{family,user}.status +
    fanout.duration_ms + fanout.{family,user}.duration_ms) on every
    ticket.mint.fanout span; the error path only wrote the 2 status
    strings, leaving 3 attributes absent on error spans. On-call
    couldn't distinguish a 0 ms circuit-open rejection from a 2 000 ms
    deadline-exceeded; the successful future's TimedResult.durationMs
    was silently discarded when the other failed; trace-SLO queries
    against fanout.duration_ms missed all error-path samples.

(2) MINOR src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:268
    — both.join() has no wall-clock ceiling. Per-RPC withDeadlineAfter
    (2 s) is the primary safety net, but two pathologies can outlive
    it: a half-open TCP socket where a firewall drops FIN silently,
    and event-loop thread starvation that delays the deadline-expiry
    callback. Either case blocks the fanout worker indefinitely; once
    the ArrayBlockingQueue(64) fills, CallerRunsPolicy promotes the
    block to the Tomcat HTTP thread, eating a slot until the client
    closes the connection.

Fix: Two surgical changes.

(1) attachStatusBest gains a fanoutStartNs parameter + writes all
    three timing attributes. The parent fanout.duration_ms is always
    measured from fanoutStartNs (so even a 0 ms early-exit is
    visible). Per-RPC fanout.{family,user}.duration_ms are attached
    for any future that completed successfully before the join's
    failure path was entered — the TimedResult on a successful side
    carries the value; failed/pending sides have no measurement to
    report. Per-future join calls inside the attach are wrapped in a
    defensive try/catch so an attribute-attach can't derail the
    span-finish path. All four callsites (TimeoutException, new
    InterruptedException, ExecutionException, CompletionException,
    RuntimeException) pass fanoutStartNs through.

(2) both.join() replaced with both.get(FANOUT_JOIN_TIMEOUT_MS,
    TimeUnit.MILLISECONDS). FANOUT_JOIN_TIMEOUT_MS is pinned to
    max(per-RPC deadline) + 500 ms = 2_500 ms — a healthy RPC at
    2 s still passes; a stuck stub surfaces as
    TicketContextFetchFailedException (existing 503 fail-closed path)
    inside 2.5 s. Made public static final so the unit test can
    sanity-check the bound stays within [1 s, 10 s]; exposing a
    SLO-relevant constant via public is consistent with the
    Javadoc-documented contract. New catch arms for TimeoutException
    (wraps to TicketContextFetchFailedException with the join-timeout
    message), ExecutionException (unwraps cause + routes through
    existing rethrow), and InterruptedException (restores interrupt
    flag + wraps to TicketContextFetchFailedException). Existing
    CompletionException + RuntimeException arms retained defensively
    in case the per-future joins below ever surface them.

Ripple inspection (rule 63):
- Javadoc (class-level + attachStatusBest method-level + new
  FANOUT_JOIN_TIMEOUT_MS Javadoc) updated to match the new behaviour
  — the Javadoc contract finding #1 reviewer cited is now the actual
  contract.
- No external consumer reads the OTel attributes besides tail samplers
  + tracing UIs (no broken downstream).
- TicketContextFetchFailedException already maps to 503 in
  GlobalExceptionHandler (verified at HEAD) — the new timeout path
  joins the existing fail-closed routing, no new error code needed.
- Removed unused `Optional` import from the test file (preexisting
  per rule 18).

Verification:
- mvn -DskipTests=false clean install → BUILD SUCCESS
- Tests run: 122, Failures: 0, Errors: 0, Skipped: 0
- MintContextAggregatorTest: 9 tests pass (was 7; +2 for new fixes)
  * joinTimeoutSurfacesAsFetchFailed — parks both stubs 4 s; verifies
    TicketContextFetchFailedException trips ~2.5 s with "fanout join
    timed out" message.
  * timeoutCeilingExposedAsConstant — guards FANOUT_JOIN_TIMEOUT_MS
    against accidental drift outside [1 s, 10 s].
- Compile-only warning check: zero source warnings (rule 62).

External docs verification (rule 61):
- CompletableFuture.get(long, TimeUnit) — Java 17 javadoc:
  TimeoutException on deadline; ExecutionException wraps task
  exception; InterruptedException for thread interrupt. Behaviour
  matches the catch-arm structure.
- gRPC withDeadlineAfter behaviour:
  https://grpc.io/docs/guides/deadlines/ — deadline expiry runs on
  the event-loop thread; thread starvation can delay callback. Cited
  in the Javadoc rationale for why the join ceiling is needed.

pr-reviewer-bot commented

2026-05-24 23:16:21 +02:00

Member

Superseded by round 4.

Show previous round

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Round 3 — head 098b687d523b, base main, trigger synchronize

TL;DR: CONDITIONAL_APPROVE — kept 1 verified unique-to-B minor finding (comment inaccuracy in application.properties); all 7 prior-round blocking issues confirmed resolved.

Summary

Round 3 Arbitration — affinity-intelligence-rework/im2be-identity-service#4

Prior context: No existing Memora memory for this PR (despite being round 3); created new entry (ID 233) with review_count=3.

A vs B reconciliation:

A: 0 findings — confirmed all 7 prior-round issues fully resolved.
B: 1 unique finding at application.properties:115 — comment overstates StatusRuntimeException remapping.

Verification of B's unique finding:

Read application.properties lines 112–129: the comment states "every transient-or-permanent fanout failure surfaces as TicketContextFetchFailedException (the wrappers remap raw io.grpc.StatusRuntimeException + the proto-level fail-closed IllegalArgumentException to this typed superclass…)".

Read FamilyContextClient.java:235–244 and UserProfileClient.java:157–162: both mapStatus() methods return new IllegalStateException(…) — not TicketContextFetchFailedException — for INVALID_ARGUMENT. Confirmed recordExceptions at line 141 lists only TicketContextFetchFailedException, so IllegalStateException is never counted in the failure-rate window. The comment's claim that "every" fanout failure reaches TicketContextFetchFailedException is factually incorrect for the INVALID_ARGUMENT branch.

Decision: Keep B's finding (verified unique-to-B). Severity remains minor (comment inaccuracy; no runtime bug — the code behaves correctly, only the prose explanation misleads).

Kept 1 unique-to-B (verified); dropped 0.

Blast Radius

The finding is a documentation comment in a configuration file. No runtime behaviour is affected; the circuit-breaker wiring, exception hierarchy, and gRPC fan-out logic are all correct. Impact is confined to developer understanding of the INVALID_ARGUMENT code path.

BLAST_SCORE: 1/10

Risk Indicators

Indicator	Value
Sensitive functions	`MintContextAggregator.aggregateContext`, `FamilyContextClient.mapStatus`, `UserProfileClient.mapStatus`, `CentrifugoTokenController.serialiseUserContext`
Migration touched	—
Test delta	—
Dependency changes	—

CI status (head `098b687d523b`)

No CI checks reported for this commit.

Findings (1)

[MINOR] Comment overstates `StatusRuntimeException` remapping — `INVALID_ARGUMENT` maps to `IllegalStateException`, not `TicketContextFetchFailedException`

src/main/resources/application.properties:115

Lines 112–119 assert that "every transient-or-permanent fanout failure surfaces as TicketContextFetchFailedException (the wrappers remap raw io.grpc.StatusRuntimeException + the proto-level fail-closed IllegalArgumentException to this typed superclass…)". This is inaccurate for the INVALID_ARGUMENT branch: both FamilyContextClient.mapStatus() (line 243) and UserProfileClient.mapStatus() (line 161) return new IllegalStateException(…) for INVALID_ARGUMENT — which is not a subclass of TicketContextFetchFailedException. Because IllegalStateException appears in neither recordExceptions nor ignoreExceptions, the circuit breaker treats INVALID_ARGUMENT outcomes as successes, not failures. The comment implies it would be counted as a failure ("transient-or-permanent fanout failure"), which is wrong.

The code itself is correct — treating INVALID_ARGUMENT as a programming error (HTTP 500) and not as a CB-trippable transient is intentional and sound. Only the comment is wrong.

Suggested replacement (sketch): "recordExceptions covers TicketContextFetchFailedException and subclasses, except UserNotFoundContextFetchFailedException (in ignoreExceptions). Transient gRPC codes (UNAVAILABLE/DEADLINE_EXCEEDED/INTERNAL/default-bucket) and the proto fail-closed IllegalArgumentException (from SubscriptionTierMapper) surface as TicketContextFetchFailedException and count toward the failure rate. INVALID_ARGUMENT is a programming-error path — it maps to IllegalStateException (HTTP 500) and the CB treats it as a success, not a failure."

Verdict

CONDITIONAL_APPROVE

_{hib-pr-reviewer • round 3 • 1 finding (1m) • 2026-05-24T21:14:36.450Z → 2026-05-24T21:16:21.538Z • posted-as: pr-reviewer-bot}

> _Superseded by round 4._ <details> <summary>Show previous round</summary> ## hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service) **Round 3** — head `098b687d523b`, base `main`, trigger `synchronize` **TL;DR:** CONDITIONAL_APPROVE — kept 1 verified unique-to-B minor finding (comment inaccuracy in application.properties); all 7 prior-round blocking issues confirmed resolved. ### Summary ## Round 3 Arbitration — affinity-intelligence-rework/im2be-identity-service#4 **Prior context:** No existing Memora memory for this PR (despite being round 3); created new entry (ID 233) with review_count=3. **A vs B reconciliation:** - A: 0 findings — confirmed all 7 prior-round issues fully resolved. - B: 1 unique finding at `application.properties:115` — comment overstates `StatusRuntimeException` remapping. **Verification of B's unique finding:** Read `application.properties` lines 112–129: the comment states *"every transient-or-permanent fanout failure surfaces as `TicketContextFetchFailedException` (the wrappers remap raw `io.grpc.StatusRuntimeException` + the proto-level fail-closed `IllegalArgumentException` to this typed superclass…)"*. Read `FamilyContextClient.java:235–244` and `UserProfileClient.java:157–162`: both `mapStatus()` methods return `new IllegalStateException(…)` — **not** `TicketContextFetchFailedException` — for `INVALID_ARGUMENT`. Confirmed `recordExceptions` at line 141 lists only `TicketContextFetchFailedException`, so `IllegalStateException` is never counted in the failure-rate window. The comment's claim that "every" fanout failure reaches `TicketContextFetchFailedException` is factually incorrect for the `INVALID_ARGUMENT` branch. **Decision:** Keep B's finding (verified unique-to-B). Severity remains minor (comment inaccuracy; no runtime bug — the code behaves correctly, only the prose explanation misleads). Kept 1 unique-to-B (verified); dropped 0. ### Blast Radius The finding is a documentation comment in a configuration file. No runtime behaviour is affected; the circuit-breaker wiring, exception hierarchy, and gRPC fan-out logic are all correct. Impact is confined to developer understanding of the `INVALID_ARGUMENT` code path. **BLAST_SCORE: 1/10** ### Risk Indicators | Indicator | Value | |---|---| | Sensitive functions | `MintContextAggregator.aggregateContext`, `FamilyContextClient.mapStatus`, `UserProfileClient.mapStatus`, `CentrifugoTokenController.serialiseUserContext` | | Migration touched | — | | Test delta | — | | Dependency changes | — | ### CI status (head `098b687d523b`) _No CI checks reported for this commit._ ### Findings (1) #### **[MINOR]** Comment overstates `StatusRuntimeException` remapping — `INVALID_ARGUMENT` maps to `IllegalStateException`, not `TicketContextFetchFailedException` _src/main/resources/application.properties:115_ Lines 112–119 assert that *"every transient-or-permanent fanout failure surfaces as `TicketContextFetchFailedException` (the wrappers remap raw `io.grpc.StatusRuntimeException` + the proto-level fail-closed `IllegalArgumentException` to this typed superclass…)"*. This is inaccurate for the `INVALID_ARGUMENT` branch: both `FamilyContextClient.mapStatus()` (line 243) and `UserProfileClient.mapStatus()` (line 161) return `new IllegalStateException(…)` for `INVALID_ARGUMENT` — which is **not** a subclass of `TicketContextFetchFailedException`. Because `IllegalStateException` appears in neither `recordExceptions` nor `ignoreExceptions`, the circuit breaker treats `INVALID_ARGUMENT` outcomes as **successes**, not failures. The comment implies it would be counted as a failure ("transient-or-permanent fanout failure"), which is wrong. The code itself is correct — treating `INVALID_ARGUMENT` as a programming error (HTTP 500) and not as a CB-trippable transient is intentional and sound. Only the comment is wrong. Suggested replacement (sketch): *"`recordExceptions` covers `TicketContextFetchFailedException` and subclasses, except `UserNotFoundContextFetchFailedException` (in `ignoreExceptions`). Transient gRPC codes (`UNAVAILABLE`/`DEADLINE_EXCEEDED`/`INTERNAL`/default-bucket) and the proto fail-closed `IllegalArgumentException` (from `SubscriptionTierMapper`) surface as `TicketContextFetchFailedException` and count toward the failure rate. `INVALID_ARGUMENT` is a programming-error path — it maps to `IllegalStateException` (HTTP 500) and the CB treats it as a success, not a failure.*" ### Verdict **CONDITIONAL_APPROVE** --- hib-pr-reviewer • round 3 • 1 finding (1m) • 2026-05-24T21:14:36.450Z → 2026-05-24T21:16:21.538Z • posted-as: pr-reviewer-bot </details>

hibryda added 1 commit

2026-05-24 23:22:19 +02:00

docs(mintcontext): clarify recordExceptions split — R3 review (1 MINOR) bb22b7d155

R3 verdict findings (kept=1):

(1) MINOR application.properties:112-119 — Comment overstates StatusRuntimeException remapping.

The pre-existing comment claimed "every transient-or-permanent fanout
failure surfaces as TicketContextFetchFailedException." That hid the
INVALID_ARGUMENT branch: UserProfileClient.mapStatus (lines 158-162) +
FamilyContextClient.mapStatus map INVALID_ARGUMENT to IllegalStateException,
not to the typed superclass. The CB never observes IllegalStateException
(it propagates straight out of executeSupplier), so the failure-window
accounting differs from the recordExceptions wording.

Fix: split the exception-family mapping into three explicit bullets:
  - Transient (UNAVAILABLE / DEADLINE_EXCEEDED / INTERNAL)
    → TicketContextFetchFailedException → CB FAILURE.
  - NOT_FOUND → UserNotFoundContextFetchFailedException (subclass)
    → CB IGNORE via ignoreExceptions.
  - INVALID_ARGUMENT → IllegalStateException → propagates without
    contributing to the failure window.
And restate the proto-level fail-closed path (UNSPECIFIED tier in
SubscriptionTierMapper raises IllegalArgumentException, the wrapper
rewraps to TicketContextFetchFailedException → CB FAILURE).

Verification:
- Pure comment change; no code path touched.
- application.properties parsing is whitespace-insensitive; the comment
  block is uninterpreted by Spring (lines prefixed with `#`).

pr-reviewer-bot commented

2026-05-24 23:32:58 +02:00

Member

Superseded by round 5.

Show previous round

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Round 4 — head bb22b7d155f3, base main, trigger synchronize

TL;DR: CONDITIONAL_APPROVE — kept 3 minor findings (all unique-to-one, all verified at HEAD); all 8 prior-round findings confirmed fixed; no blocking issues.

Summary

Arbitration Round 4 — 3 unique-to-one findings verified and kept; 0 dropped

Prior rounds (1-3): Both reviewers independently confirm all 8 prior findings are fixed at HEAD. Not re-raised.

Finding A (unique): GlobalExceptionHandler.java:68-69 — Verified at HEAD. Lines 65-75 confirm the single @ExceptionHandler(TicketContextFetchFailedException.class) handler calls OtelErrorEvents.record() (line 68) and log.error(...) (line 69) for all subclasses, including UserNotFoundContextFetchFailedException. No more-specific handler exists for the subclass. The inconsistency with the ignoreExceptions rationale (permanent client-side classification, not transient infrastructure fault) is real and confirmed. Kept.

Finding B1 (unique): application.properties:121-123 — Verified at HEAD. Both FamilyContextClient.mapStatus() (line 243-244) and UserProfileClient.mapStatus() (line 161-162) throw IllegalStateException for INVALID_ARGUMENT. The comment at lines 121-123 states this "propagates out of the CB without contributing to the failure window." Under Resilience4j 2.x, when recordExceptions is configured, any exception matching neither recordExceptions nor ignoreExceptions is counted as SUCCESS (not neutral). This dilutes the failure rate and can prevent the CB from opening during mixed INVALID_ARGUMENT + UNAVAILABLE scenarios. Comment is factually wrong. Kept.

Finding B2 (unique): UserNotFoundContextFetchFailedException.java:25-30 — Verified at HEAD. Lines 26-30 claim HTTP 503 + Retry-After:1 "triggers re-auth." Standard OAuth 2.0/OIDC clients retry the 503 (per RFC 7231 Retry-After semantics), they do not initiate token refresh or re-login. JWTs are stateless; a token issued before account deletion remains cryptographically valid. The re-auth rationale is factually incorrect and will mislead on-call engineers and future maintainers. The same incorrect re-auth claim also appears inline at FamilyContextClient.java:247. Kept.

Memora persistence: Tag allowlist rejected pr-reviewer/custom tags; absorb tool permission denied. Run summary could not be stored — harness should investigate Memora tag policy for this project.

Blast Radius

The diff touches 18 files spanning the api, exception, mintcontext packages, two proto definitions, application configuration, and four test suites. The new MintContextAggregator gRPC fan-out path is on the hot mint endpoint. Blast radius is moderate — the new fan-out is additive and well-contained behind its own exception hierarchy, but the CB comment inaccuracy and the GlobalExceptionHandler gap affect the observability surface shared by all callers of the mint endpoint.

BLAST_SCORE: 5/10

Risk Indicators

Indicator	Value
Sensitive functions	`GlobalExceptionHandler.handleTicketContextFetchFailed`, `FamilyContextClient.mapStatus`, `UserProfileClient.mapStatus`, `MintContextAggregator.aggregate`
Migration touched	—
Test delta	—
Dependency changes	—

CI status (head `bb22b7d155f3`)

No CI checks reported for this commit.

im2be-identity-service#1
im2be-identity-service#2
im2be-identity-service#3

Findings (3)

[MINOR] `handleTicketContextFetchFailed` logs `UserNotFoundContextFetchFailedException` at ERROR and marks the OTel span ERROR, contradicting the NOT_FOUND classification rationale

src/main/java/com/aim2be/identity/exception/GlobalExceptionHandler.java:69

Verified at HEAD: the single @ExceptionHandler(TicketContextFetchFailedException.class) handler (lines 65-75) is the only handler that will catch UserNotFoundContextFetchFailedException (a direct subclass). It unconditionally calls OtelErrorEvents.record(Span.current(), ex, "GlobalExceptionHandler", "grpc") (line 68) — which sets span status to ERROR and records an error.event — and then log.error("Ticket context fetch failed: {}", ex.getMessage()) (line 69).

This contradicts the ignoreExceptions circuit-breaker configuration whose explicit rationale is that NOT_FOUND is a permanent client-side classification, not a transient infrastructure fault, and that an account-deletion sweep followed by a Centrifugo reconnect storm must not trip the circuit breaker or generate transient-fault noise. The same burst will generate ERROR-level log lines and OTel error spans, which can trigger false-positive alert thresholds and inflate error-budget dashboards (per ADR-0013 §3 the tail sampler retains every ERROR-status span).

Fix — add a dedicated handler above the parent-class one (Spring selects the most-specific match, so ordering within the class matters when both handlers are present):

@ExceptionHandler(UserNotFoundContextFetchFailedException.class)
public ResponseEntity<Map<String, Object>> handleUserNotFoundContextFetch(
        UserNotFoundContextFetchFailedException ex) {
    // NOT_FOUND is a permanent client-side classification; log at WARN
    // so account-deletion sweeps don't generate ERROR noise / OTel error
    // events. Same reasoning as the 429 handler's omission of
    // OtelErrorEvents.record() (lines 95-116 of this file).
    log.warn("Ticket context fetch: user not found downstream: {}", ex.getMessage());
    return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE)
            .contentType(MediaType.APPLICATION_PROBLEM_JSON)
            .header("Retry-After", "1")
            .body(problemBody(HttpStatus.SERVICE_UNAVAILABLE, ex.getErrorCode(),
                    "Ticket context fetch temporarily unavailable; retry"));
}

Omitting OtelErrorEvents.record() is intentional — NOT_FOUND should not surface as an error span.

[MINOR] Comment incorrectly claims `IllegalStateException` is neutral to the circuit breaker — it is counted as SUCCESS, diluting the failure rate

src/main/resources/application.properties:122

Verified at HEAD: lines 121-123 state that INVALID_ARGUMENT mapping to IllegalStateException "propagates out of the CB without contributing to the failure window." This is factually incorrect under Resilience4j 2.x semantics.

Both FamilyContextClient.mapStatus() (line 243-244) and UserProfileClient.mapStatus() (line 161-162) throw IllegalStateException for INVALID_ARGUMENT. The circuit breaker configuration uses recordExceptions=TicketContextFetchFailedException and ignoreExceptions=UserNotFoundContextFetchFailedException. Under Resilience4j 2.x, exceptions matching neither list are recorded as SUCCESS, not as a neutral no-op. This means an IllegalStateException increments the success counter.

Consequence: a simultaneous burst of INVALID_ARGUMENT responses (each counted as SUCCESS) and genuine UNAVAILABLE responses (each counted as FAILURE) can hold the computed failure rate precisely at the 50% threshold, preventing the breaker from opening. The comment contract is wrong and will mislead on-call engineers reading the config during an incident.

Fix (preferred): add java.lang.IllegalStateException to both ignoreExceptions lists so these calls are excluded from all CB accounting:

resilience4j.circuitbreaker.instances.family-service.ignoreExceptions=\
  com.aim2be.identity.exception.UserNotFoundContextFetchFailedException,\
  java.lang.IllegalStateException

resilience4j.circuitbreaker.instances.user-service.ignoreExceptions=\
  com.aim2be.identity.exception.UserNotFoundContextFetchFailedException,\
  java.lang.IllegalStateException

Alternatively, at minimum correct the comment at line 122 to accurately state "counted as CB success" and open a follow-on task to add IllegalStateException to ignoreExceptions.

[MINOR] Javadoc incorrectly claims HTTP 503 + `Retry-After:1` triggers client re-authentication — it does not

src/main/java/com/aim2be/identity/exception/UserNotFoundContextFetchFailedException.java:29

Verified at HEAD: lines 25-30 state: "Per the proto contract... the client should re-auth after a NOT_FOUND response — the next request will surface the real user-gone error on user-service's JWT verifier before ever reaching the identity-service mint path again. The 503 + Retry-After:1 here is a benign fast-fail that triggers that re-auth flow."

This rationale is incorrect on two counts:

503 is not a re-auth signal. HTTP 503 + Retry-After:1 is a transient-availability signal (RFC 7231 §6.6.4). Standard OAuth 2.0 / OIDC clients and Centrifugo SDK reconnect logic will re-issue the same mint request after the Retry-After delay. They will not initiate a token refresh or re-login flow. The result for a genuinely deleted user is a tight 503 retry loop until the JWT expires (potentially hours later) — neither benign nor a fast-fail.
JWTs are stateless. A JWT issued before account deletion remains cryptographically valid at identity-service's JWT verifier (JwtAuthenticationFilter validates signature + expiry, not downstream user existence). The claim that "the next request will surface the user-gone error on user-service's JWT verifier" is incorrect.

The same incorrect re-auth claim also appears in an inline comment in FamilyContextClient.java at line 247 (// the client should re-auth). Both should be corrected.

Fix: replace the re-auth rationale with an accurate description. Example: "NOT_FOUND returns HTTP 503 + Retry-After:1 to keep the mint-endpoint error-contract uniform (all context-fetch failures are 503). Callers that receive persistent 503 responses should escalate to a full re-authentication flow rather than retrying indefinitely; the Centrifugo SDK reconnect policy is expected to back off and surface the error to the application layer after N retries. Identity-service does not return a distinct 4xx code for user-not-found because doing so would require a separate client-side code path and expose user-existence information in the error classification."

Verdict

CONDITIONAL_APPROVE

_{hib-pr-reviewer • round 4 • 3 findings (3m) • 2026-05-24T21:29:55.028Z → 2026-05-24T21:32:58.475Z • posted-as: pr-reviewer-bot}

> _Superseded by round 5._ <details> <summary>Show previous round</summary> ## hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service) **Round 4** — head `bb22b7d155f3`, base `main`, trigger `synchronize` **TL;DR:** CONDITIONAL_APPROVE — kept 3 minor findings (all unique-to-one, all verified at HEAD); all 8 prior-round findings confirmed fixed; no blocking issues. ### Summary ## Arbitration Round 4 — 3 unique-to-one findings verified and kept; 0 dropped **Prior rounds (1-3):** Both reviewers independently confirm all 8 prior findings are fixed at HEAD. Not re-raised. **Finding A (unique):** `GlobalExceptionHandler.java:68-69` — Verified at HEAD. Lines 65-75 confirm the single `@ExceptionHandler(TicketContextFetchFailedException.class)` handler calls `OtelErrorEvents.record()` (line 68) and `log.error(...)` (line 69) for **all** subclasses, including `UserNotFoundContextFetchFailedException`. No more-specific handler exists for the subclass. The inconsistency with the `ignoreExceptions` rationale (permanent client-side classification, not transient infrastructure fault) is real and confirmed. **Kept.** **Finding B1 (unique):** `application.properties:121-123` — Verified at HEAD. Both `FamilyContextClient.mapStatus()` (line 243-244) and `UserProfileClient.mapStatus()` (line 161-162) throw `IllegalStateException` for `INVALID_ARGUMENT`. The comment at lines 121-123 states this "propagates out of the CB without contributing to the failure window." Under Resilience4j 2.x, when `recordExceptions` is configured, any exception matching neither `recordExceptions` nor `ignoreExceptions` is counted as **SUCCESS** (not neutral). This dilutes the failure rate and can prevent the CB from opening during mixed INVALID_ARGUMENT + UNAVAILABLE scenarios. Comment is factually wrong. **Kept.** **Finding B2 (unique):** `UserNotFoundContextFetchFailedException.java:25-30` — Verified at HEAD. Lines 26-30 claim HTTP 503 + `Retry-After:1` "triggers re-auth." Standard OAuth 2.0/OIDC clients retry the 503 (per RFC 7231 Retry-After semantics), they do not initiate token refresh or re-login. JWTs are stateless; a token issued before account deletion remains cryptographically valid. The re-auth rationale is factually incorrect and will mislead on-call engineers and future maintainers. The same incorrect re-auth claim also appears inline at `FamilyContextClient.java:247`. **Kept.** **Memora persistence:** Tag allowlist rejected `pr-reviewer`/custom tags; absorb tool permission denied. Run summary could not be stored — harness should investigate Memora tag policy for this project. ### Blast Radius The diff touches 18 files spanning the `api`, `exception`, `mintcontext` packages, two proto definitions, application configuration, and four test suites. The new `MintContextAggregator` gRPC fan-out path is on the hot mint endpoint. Blast radius is moderate — the new fan-out is additive and well-contained behind its own exception hierarchy, but the CB comment inaccuracy and the GlobalExceptionHandler gap affect the observability surface shared by all callers of the mint endpoint. **BLAST_SCORE: 5/10** ### Risk Indicators | Indicator | Value | |---|---| | Sensitive functions | `GlobalExceptionHandler.handleTicketContextFetchFailed`, `FamilyContextClient.mapStatus`, `UserProfileClient.mapStatus`, `MintContextAggregator.aggregate` | | Migration touched | — | | Test delta | — | | Dependency changes | — | ### CI status (head `bb22b7d155f3`) _No CI checks reported for this commit._ ### Related PRs - im2be-identity-service#1 - im2be-identity-service#2 - im2be-identity-service#3 ### Findings (3) #### **[MINOR]** `handleTicketContextFetchFailed` logs `UserNotFoundContextFetchFailedException` at ERROR and marks the OTel span ERROR, contradicting the NOT_FOUND classification rationale _src/main/java/com/aim2be/identity/exception/GlobalExceptionHandler.java:69_ Verified at HEAD: the single `@ExceptionHandler(TicketContextFetchFailedException.class)` handler (lines 65-75) is the only handler that will catch `UserNotFoundContextFetchFailedException` (a direct subclass). It unconditionally calls `OtelErrorEvents.record(Span.current(), ex, "GlobalExceptionHandler", "grpc")` (line 68) — which sets span status to ERROR and records an `error.event` — and then `log.error("Ticket context fetch failed: {}", ex.getMessage())` (line 69). This contradicts the `ignoreExceptions` circuit-breaker configuration whose explicit rationale is that NOT_FOUND is a *permanent client-side classification, not a transient infrastructure fault*, and that an account-deletion sweep followed by a Centrifugo reconnect storm must not trip the circuit breaker or generate transient-fault noise. The same burst will generate ERROR-level log lines and OTel error spans, which can trigger false-positive alert thresholds and inflate error-budget dashboards (per ADR-0013 §3 the tail sampler retains every ERROR-status span). Fix — add a dedicated handler *above* the parent-class one (Spring selects the most-specific match, so ordering within the class matters when both handlers are present): ```java @ExceptionHandler(UserNotFoundContextFetchFailedException.class) public ResponseEntity<Map<String, Object>> handleUserNotFoundContextFetch( UserNotFoundContextFetchFailedException ex) { // NOT_FOUND is a permanent client-side classification; log at WARN // so account-deletion sweeps don't generate ERROR noise / OTel error // events. Same reasoning as the 429 handler's omission of // OtelErrorEvents.record() (lines 95-116 of this file). log.warn("Ticket context fetch: user not found downstream: {}", ex.getMessage()); return ResponseEntity.status(HttpStatus.SERVICE_UNAVAILABLE) .contentType(MediaType.APPLICATION_PROBLEM_JSON) .header("Retry-After", "1") .body(problemBody(HttpStatus.SERVICE_UNAVAILABLE, ex.getErrorCode(), "Ticket context fetch temporarily unavailable; retry")); } ``` Omitting `OtelErrorEvents.record()` is intentional — NOT_FOUND should not surface as an error span. #### **[MINOR]** Comment incorrectly claims `IllegalStateException` is neutral to the circuit breaker — it is counted as SUCCESS, diluting the failure rate _src/main/resources/application.properties:122_ Verified at HEAD: lines 121-123 state that INVALID_ARGUMENT mapping to `IllegalStateException` "propagates out of the CB without contributing to the failure window." This is factually incorrect under Resilience4j 2.x semantics. Both `FamilyContextClient.mapStatus()` (line 243-244) and `UserProfileClient.mapStatus()` (line 161-162) throw `IllegalStateException` for `INVALID_ARGUMENT`. The circuit breaker configuration uses `recordExceptions=TicketContextFetchFailedException` and `ignoreExceptions=UserNotFoundContextFetchFailedException`. Under Resilience4j 2.x, exceptions matching neither list are recorded as **SUCCESS**, not as a neutral no-op. This means an `IllegalStateException` increments the success counter. Consequence: a simultaneous burst of INVALID_ARGUMENT responses (each counted as SUCCESS) and genuine UNAVAILABLE responses (each counted as FAILURE) can hold the computed failure rate precisely at the 50% threshold, preventing the breaker from opening. The comment contract is wrong and will mislead on-call engineers reading the config during an incident. Fix (preferred): add `java.lang.IllegalStateException` to both `ignoreExceptions` lists so these calls are excluded from all CB accounting: ```properties resilience4j.circuitbreaker.instances.family-service.ignoreExceptions=\ com.aim2be.identity.exception.UserNotFoundContextFetchFailedException,\ java.lang.IllegalStateException resilience4j.circuitbreaker.instances.user-service.ignoreExceptions=\ com.aim2be.identity.exception.UserNotFoundContextFetchFailedException,\ java.lang.IllegalStateException ``` Alternatively, at minimum correct the comment at line 122 to accurately state "counted as CB success" and open a follow-on task to add `IllegalStateException` to `ignoreExceptions`. #### **[MINOR]** Javadoc incorrectly claims HTTP 503 + `Retry-After:1` triggers client re-authentication — it does not _src/main/java/com/aim2be/identity/exception/UserNotFoundContextFetchFailedException.java:29_ Verified at HEAD: lines 25-30 state: "Per the proto contract... the client should re-auth after a NOT_FOUND response — the next request will surface the real user-gone error on user-service's JWT verifier before ever reaching the identity-service mint path again. The 503 + Retry-After:1 here is a benign fast-fail that triggers that re-auth flow." This rationale is incorrect on two counts: 1. **503 is not a re-auth signal.** HTTP 503 + `Retry-After:1` is a transient-availability signal (RFC 7231 §6.6.4). Standard OAuth 2.0 / OIDC clients and Centrifugo SDK reconnect logic will re-issue the **same** mint request after the `Retry-After` delay. They will not initiate a token refresh or re-login flow. The result for a genuinely deleted user is a tight 503 retry loop until the JWT expires (potentially hours later) — neither benign nor a fast-fail. 2. **JWTs are stateless.** A JWT issued before account deletion remains cryptographically valid at identity-service's JWT verifier (`JwtAuthenticationFilter` validates signature + expiry, not downstream user existence). The claim that "the next request will surface the user-gone error on user-service's JWT verifier" is incorrect. The same incorrect re-auth claim also appears in an inline comment in `FamilyContextClient.java` at line 247 (`// the client should re-auth`). Both should be corrected. Fix: replace the re-auth rationale with an accurate description. Example: "NOT_FOUND returns HTTP 503 + Retry-After:1 to keep the mint-endpoint error-contract uniform (all context-fetch failures are 503). Callers that receive persistent 503 responses should escalate to a full re-authentication flow rather than retrying indefinitely; the Centrifugo SDK reconnect policy is expected to back off and surface the error to the application layer after N retries. Identity-service does not return a distinct 4xx code for user-not-found because doing so would require a separate client-side code path and expose user-existence information in the error classification." ### Verdict **CONDITIONAL_APPROVE** --- hib-pr-reviewer • round 4 • 3 findings (3m) • 2026-05-24T21:29:55.028Z → 2026-05-24T21:32:58.475Z • posted-as: pr-reviewer-bot </details>

hibryda referenced this pull request from a commit

2026-05-24 23:39:20 +02:00

fix(mintcontext): apply PR #4 R4 reviewer findings (3 MINOR)

hibryda added 1 commit

2026-05-24 23:39:21 +02:00

fix(mintcontext): apply PR #4 R4 reviewer findings (3 MINOR) 07cfd02f17

R4 verdict findings (kept=3):

(1) MINOR GlobalExceptionHandler.java:69 — handleTicketContextFetchFailed
    logs UserNotFoundContextFetchFailedException at ERROR + records OTel
    error event, contradicting the ignoreExceptions CB rationale.
    NOT_FOUND is a permanent client-side classification; an account-
    deletion sweep + reconnect storm would generate proportional
    ERROR-level log volume + tail-retained error spans + false-positive
    5xx error-budget alerts even though the breaker correctly stays
    closed.

(2) MINOR application.properties:122 — Comment incorrectly claimed
    IllegalStateException is neutral to the CB. Under Resilience4j 2.x,
    exceptions matching neither recordExceptions nor ignoreExceptions
    are counted as CB SUCCESS, diluting the failure rate. An
    INVALID_ARGUMENT burst alongside genuine UNAVAILABLE could hold
    the failure rate precisely at threshold and prevent the breaker
    from opening.

(3) MINOR UserNotFoundContextFetchFailedException.java:25-30 +
    FamilyContextClient.java:247 — Javadoc + inline comment claimed
    HTTP 503 + Retry-After:1 triggers client re-authentication. RFC
    7231 Retry-After is a transient-availability signal; standard
    OIDC clients + Centrifugo SDK retry the 503, they do not
    initiate token refresh. JWTs are stateless — a token issued
    before account deletion remains cryptographically valid until
    expiry.

Fixes:

(1) Added a more-specific
    @ExceptionHandler(UserNotFoundContextFetchFailedException.class)
    that logs at WARN + omits OtelErrorEvents.record() while
    preserving the 503 + Retry-After:1 + TICKET_CONTEXT_FETCH_FAILED
    wire contract. Spring picks the most-specific handler within the
    advice class; NOT_FOUND falls through to the dedicated handler
    and bypasses the ERROR-level log + OTel error-event path.

(2) Added java.lang.IllegalStateException to ignoreExceptions on
    both family-service + user-service CB configs so INVALID_ARGUMENT
    is excluded from BOTH the failure window AND the success window.
    The breaker's accounting now reflects only legitimate downstream-
    fanout outcomes. Rewrote the recordExceptions/ignoreExceptions
    comment to state the three-family split accurately.

(3) Rewrote UserNotFoundContextFetchFailedException javadoc to
    describe the actual retry behaviour (uniform 503 contract +
    stateless-JWT retry-loop until expiry; CB ignoreExceptions
    protects other tenants from the user-specific burst).
    Rewrote FamilyContextClient.java:247 inline comment to point at
    the (updated) javadoc instead of repeating the wrong claim.

Verification (containerised JDK 17 mvn install):
- Tests run: 123, Failures: 0, Errors: 0, Skipped: 0 (was 122)
- BUILD SUCCESS in 12.762 s
- New test:
  CentrifugoTokenControllerTest#mintReturns503WhenContextFanoutReportsUserNotFound
  verifies the NOT_FOUND path routes through the new handler + still
  returns the correct wire shape.

pr-reviewer-bot commented

2026-05-24 23:46:47 +02:00

Member

Superseded by round 6.

Show previous round

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Round 5 — head 07cfd02f17d7, base main, trigger synchronize

TL;DR: CONDITIONAL_APPROVE — kept 2 minor findings (unique-to-A and unique-to-B, both verified at HEAD); both share the same root cause: missing UserNotFoundContextFetchFailedException catch arm before the generic TicketContextFetchFailedException catch.

Summary

Round 5 Arbitration — 2026-05-24

Prior history: No Memora record found (first arbitration stored this run). Rounds 1–4 findings confirmed closed by both reviewers.

Both findings verified at HEAD.

A cites line 258 (the catch-arm opening); B cites line 264 (the specific span.setStatus(StatusCode.ERROR, outcome) call inside it). These are different line numbers and materially-different bodies — they are not a Rule-1 agreed finding. Each was treated as unique-to-one and subjected to Rule-2 verification:

A (line 258 — metric conflation): Confirmed. Lines 258–266 show a single catch (TicketContextFetchFailedException fanoutFailed) block. No preceding catch (UserNotFoundContextFetchFailedException …) arm exists. OUTCOME_CONTEXT_FETCH_FAILURE is recorded identically for all subclasses, making it impossible to distinguish permanent NOT_FOUND bursts from transient infra failures in the metric. Kept.
B (line 264 — span.setStatus defeats tail-sampler): Confirmed. span.setStatus(StatusCode.ERROR, outcome) at line 264 fires unconditionally for every TicketContextFetchFailedException, including UserNotFoundContextFetchFailedException. span.end() fires at line 276 in finally, so the ticket.mint span is sealed ERROR before Spring ever dispatches to GlobalExceptionHandler. The handler's own Javadoc (lines 102–107, verified) states explicitly: "Treating it as ERROR + OTel error-event would … generate proportional ERROR-level log volume + tail-retained error spans (per ADR-0013 §3) and false-positive 5xx error-budget alerts." This is a documented design contract violation, not speculation. Kept.

Both findings share a single root cause — the absence of a UserNotFoundContextFetchFailedException-specific catch arm — and a single fix resolves both. Neither reviewer's proposed fix fully addresses the other's concern (A adds a new metric label; B omits setStatus(ERROR) but keeps the same label), so both are worth surfacing independently. Both are minor severity; no blocking issues.

Blast Radius

The diff touches 18 files spanning controller, exception hierarchy, two new gRPC client modules (FamilyContextClient, UserProfileClient, MintContextAggregator), protos, application.properties, and test suites — moderate breadth across multiple packages. The controller change mutates the public HTTP mint endpoint's observability contract (OTel span status, Micrometer outcome labels) and gRPC fan-out path, both of which are consumed by downstream alerting and tail-sampling infrastructure. The HTTP wire contract itself (503 + Retry-After:1) is unchanged.

BLAST_SCORE: 6/10

Risk Indicators

Indicator	Value
Sensitive functions	`MintContextAggregator`, `UserProfileClient`, `FamilyContextClient`, `CentrifugoTokenController.mintTicket`, `GlobalExceptionHandler.handleUserNotFoundContextFetch`
Migration touched	—
Test delta	—
Dependency changes	—

CI status (head `07cfd02f17d7`)

No CI checks reported for this commit.

Findings (2)

[MINOR] Metric label `context_fetch_failure` conflates NOT_FOUND (permanent) with transient infra failures — on-call cannot distinguish the two from the counter alone

src/main/java/com/aim2be/identity/api/CentrifugoTokenController.java:258

The catch (TicketContextFetchFailedException fanoutFailed) block at line 258 records OUTCOME_CONTEXT_FETCH_FAILURE for every fanout failure, including UserNotFoundContextFetchFailedException (NOT_FOUND / account-deletion burst).

The GlobalExceptionHandler correctly treats NOT_FOUND differently (WARN log, no OTel error-event on the HTTP server span), but that separation doesn't surface in MintMetrics. An alert rule on centrifugo_tickets_minted_total{outcome="context_fetch_failure"} fires identically for an account-deletion reconnect storm (circuit breaker healthy, WARN logs) and a genuine downstream outage (circuit breaker open, ERROR logs). On-call must correlate the CB health indicator and log level before knowing whether to page a downstream team or stand down.

Concrete fix: add a more-specific catch before line 258:

} catch (UserNotFoundContextFetchFailedException notFound) {
    outcome = OUTCOME_USER_NOT_FOUND_CONTEXT_FAILURE; // new constant
    span.setAttribute(ATTR_OUTCOME, outcome);
    // Span status intentionally not set to ERROR — see B's companion
    // finding and ADR-0013 §3 / GlobalExceptionHandler Javadoc.
    metrics.recordMintOutcome(outcome);
    throw notFound;
} catch (TicketContextFetchFailedException fanoutFailed) { ...

Also add user_not_found_context_failure to the MintMetrics Javadoc. This lets dashboards pivot the two categories without changing the HTTP contract or circuit-breaker wiring.

[MINOR] NOT_FOUND arm of TicketContextFetchFailedException catch still marks the ticket.mint span ERROR, defeating the GlobalExceptionHandler tail-sampler isolation (ADR-0013 §3)

src/main/java/com/aim2be/identity/api/CentrifugoTokenController.java:264

Line 264 calls span.setStatus(StatusCode.ERROR, outcome) unconditionally for every TicketContextFetchFailedException instance, including the UserNotFoundContextFetchFailedException subclass. The span.end() fires at line 276 in the finally block — after this catch block sets the status — so the ticket.mint span is already sealed as ERROR before Spring dispatches to GlobalExceptionHandler.

GlobalExceptionHandler.handleUserNotFoundContextFetch (lines 103–107, verified at HEAD) states explicitly: "Treating it as ERROR + OTel error-event would … generate proportional ERROR-level log volume + tail-retained error spans (per ADR-0013 §3) and false-positive 5xx error-budget alerts, even though the breaker correctly stays closed." That handler correctly avoids OtelErrorEvents.record() and logs at WARN, but the tail sampler's keep/drop decision is driven by span.getStatus(), which is already ERROR by the time the handler runs.

Compare the rate-limited arm (lines 242–246): it re-throws with no setStatus(ERROR) call, consistent with ADR-0013 §3 (client-side classification only). NOT_FOUND is the same permanent-client-side category.

Concrete fix — split the single catch into two ordered arms:

} catch (UserNotFoundContextFetchFailedException notFound) {
    outcome = OUTCOME_USER_NOT_FOUND_CONTEXT_FAILURE; // or keep same label per A's companion finding
    span.setAttribute(ATTR_OUTCOME, outcome);
    // NOT_FOUND is permanent client-side semantics — leave span status
    // unset so tail samplers do not retain these as error traces per
    // ADR-0013 §3. GlobalExceptionHandler#handleUserNotFoundContextFetch
    // logs WARN and returns 503 + Retry-After:1 as before.
    metrics.recordMintOutcome(outcome);
    throw notFound;
} catch (TicketContextFetchFailedException fanoutFailed) {
    outcome = OUTCOME_CONTEXT_FETCH_FAILURE;
    span.setAttribute(ATTR_OUTCOME, outcome);
    span.setStatus(StatusCode.ERROR, outcome); // genuine infra fault — keep ERROR
    metrics.recordMintOutcome(outcome);
    throw fanoutFailed;
}

Java evaluates catch arms top-to-bottom; placing the more-specific subclass first mirrors the @ExceptionHandler ordering already established in GlobalExceptionHandler.

Verdict

CONDITIONAL_APPROVE

_{hib-pr-reviewer • round 5 • 2 findings (2m) • 2026-05-24T21:44:25.738Z → 2026-05-24T21:46:46.905Z • posted-as: pr-reviewer-bot}

> _Superseded by round 6._ <details> <summary>Show previous round</summary> ## hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service) **Round 5** — head `07cfd02f17d7`, base `main`, trigger `synchronize` **TL;DR:** CONDITIONAL_APPROVE — kept 2 minor findings (unique-to-A and unique-to-B, both verified at HEAD); both share the same root cause: missing UserNotFoundContextFetchFailedException catch arm before the generic TicketContextFetchFailedException catch. ### Summary ## Round 5 Arbitration — 2026-05-24 **Prior history:** No Memora record found (first arbitration stored this run). Rounds 1–4 findings confirmed closed by both reviewers. **Both findings verified at HEAD.** A cites line 258 (the catch-arm opening); B cites line 264 (the specific `span.setStatus(StatusCode.ERROR, outcome)` call inside it). These are different line numbers and materially-different bodies — they are not a Rule-1 agreed finding. Each was treated as unique-to-one and subjected to Rule-2 verification: - **A (line 258 — metric conflation):** Confirmed. Lines 258–266 show a single `catch (TicketContextFetchFailedException fanoutFailed)` block. No preceding `catch (UserNotFoundContextFetchFailedException …)` arm exists. `OUTCOME_CONTEXT_FETCH_FAILURE` is recorded identically for all subclasses, making it impossible to distinguish permanent NOT_FOUND bursts from transient infra failures in the metric. **Kept.** - **B (line 264 — span.setStatus defeats tail-sampler):** Confirmed. `span.setStatus(StatusCode.ERROR, outcome)` at line 264 fires unconditionally for every `TicketContextFetchFailedException`, including `UserNotFoundContextFetchFailedException`. `span.end()` fires at line 276 in `finally`, so the `ticket.mint` span is sealed `ERROR` before Spring ever dispatches to `GlobalExceptionHandler`. The handler's own Javadoc (lines 102–107, verified) states explicitly: *"Treating it as ERROR + OTel error-event would … generate proportional ERROR-level log volume + tail-retained error spans (per ADR-0013 §3) and false-positive 5xx error-budget alerts."* This is a documented design contract violation, not speculation. **Kept.** Both findings share a single root cause — the absence of a `UserNotFoundContextFetchFailedException`-specific catch arm — and a single fix resolves both. Neither reviewer's proposed fix fully addresses the other's concern (A adds a new metric label; B omits `setStatus(ERROR)` but keeps the same label), so both are worth surfacing independently. Both are minor severity; no blocking issues. ### Blast Radius The diff touches 18 files spanning controller, exception hierarchy, two new gRPC client modules (FamilyContextClient, UserProfileClient, MintContextAggregator), protos, application.properties, and test suites — moderate breadth across multiple packages. The controller change mutates the public HTTP mint endpoint's observability contract (OTel span status, Micrometer outcome labels) and gRPC fan-out path, both of which are consumed by downstream alerting and tail-sampling infrastructure. The HTTP wire contract itself (503 + Retry-After:1) is unchanged. **BLAST_SCORE: 6/10** ### Risk Indicators | Indicator | Value | |---|---| | Sensitive functions | `MintContextAggregator`, `UserProfileClient`, `FamilyContextClient`, `CentrifugoTokenController.mintTicket`, `GlobalExceptionHandler.handleUserNotFoundContextFetch` | | Migration touched | — | | Test delta | — | | Dependency changes | — | ### CI status (head `07cfd02f17d7`) _No CI checks reported for this commit._ ### Findings (2) #### **[MINOR]** Metric label `context_fetch_failure` conflates NOT_FOUND (permanent) with transient infra failures — on-call cannot distinguish the two from the counter alone _src/main/java/com/aim2be/identity/api/CentrifugoTokenController.java:258_ The `catch (TicketContextFetchFailedException fanoutFailed)` block at line 258 records `OUTCOME_CONTEXT_FETCH_FAILURE` for every fanout failure, including `UserNotFoundContextFetchFailedException` (NOT_FOUND / account-deletion burst). The `GlobalExceptionHandler` correctly treats NOT_FOUND differently (WARN log, no OTel error-event on the HTTP server span), but that separation doesn't surface in `MintMetrics`. An alert rule on `centrifugo_tickets_minted_total{outcome="context_fetch_failure"}` fires identically for an account-deletion reconnect storm (circuit breaker *healthy*, WARN logs) and a genuine downstream outage (circuit breaker *open*, ERROR logs). On-call must correlate the CB health indicator and log level before knowing whether to page a downstream team or stand down. **Concrete fix:** add a more-specific catch *before* line 258: ```java } catch (UserNotFoundContextFetchFailedException notFound) { outcome = OUTCOME_USER_NOT_FOUND_CONTEXT_FAILURE; // new constant span.setAttribute(ATTR_OUTCOME, outcome); // Span status intentionally not set to ERROR — see B's companion // finding and ADR-0013 §3 / GlobalExceptionHandler Javadoc. metrics.recordMintOutcome(outcome); throw notFound; } catch (TicketContextFetchFailedException fanoutFailed) { ... ``` Also add `user_not_found_context_failure` to the `MintMetrics` Javadoc. This lets dashboards pivot the two categories without changing the HTTP contract or circuit-breaker wiring. #### **[MINOR]** NOT_FOUND arm of TicketContextFetchFailedException catch still marks the ticket.mint span ERROR, defeating the GlobalExceptionHandler tail-sampler isolation (ADR-0013 §3) _src/main/java/com/aim2be/identity/api/CentrifugoTokenController.java:264_ Line 264 calls `span.setStatus(StatusCode.ERROR, outcome)` unconditionally for every `TicketContextFetchFailedException` instance, including the `UserNotFoundContextFetchFailedException` subclass. The `span.end()` fires at line 276 in the `finally` block — **after** this catch block sets the status — so the `ticket.mint` span is already sealed as `ERROR` before Spring dispatches to `GlobalExceptionHandler`. `GlobalExceptionHandler.handleUserNotFoundContextFetch` (lines 103–107, verified at HEAD) states explicitly: *"Treating it as ERROR + OTel error-event would … generate proportional ERROR-level log volume + tail-retained error spans (per ADR-0013 §3) and false-positive 5xx error-budget alerts, even though the breaker correctly stays closed."* That handler correctly avoids `OtelErrorEvents.record()` and logs at WARN, but the tail sampler's keep/drop decision is driven by `span.getStatus()`, which is already `ERROR` by the time the handler runs. Compare the rate-limited arm (lines 242–246): it re-throws with **no** `setStatus(ERROR)` call, consistent with ADR-0013 §3 (client-side classification only). NOT_FOUND is the same permanent-client-side category. **Concrete fix** — split the single catch into two ordered arms: ```java } catch (UserNotFoundContextFetchFailedException notFound) { outcome = OUTCOME_USER_NOT_FOUND_CONTEXT_FAILURE; // or keep same label per A's companion finding span.setAttribute(ATTR_OUTCOME, outcome); // NOT_FOUND is permanent client-side semantics — leave span status // unset so tail samplers do not retain these as error traces per // ADR-0013 §3. GlobalExceptionHandler#handleUserNotFoundContextFetch // logs WARN and returns 503 + Retry-After:1 as before. metrics.recordMintOutcome(outcome); throw notFound; } catch (TicketContextFetchFailedException fanoutFailed) { outcome = OUTCOME_CONTEXT_FETCH_FAILURE; span.setAttribute(ATTR_OUTCOME, outcome); span.setStatus(StatusCode.ERROR, outcome); // genuine infra fault — keep ERROR metrics.recordMintOutcome(outcome); throw fanoutFailed; } ``` Java evaluates catch arms top-to-bottom; placing the more-specific subclass first mirrors the `@ExceptionHandler` ordering already established in `GlobalExceptionHandler`. ### Verdict **CONDITIONAL_APPROVE** --- hib-pr-reviewer • round 5 • 2 findings (2m) • 2026-05-24T21:44:25.738Z → 2026-05-24T21:46:46.905Z • posted-as: pr-reviewer-bot </details>

hibryda referenced this pull request from a commit

2026-05-24 23:49:33 +02:00

fix(mintcontext): apply PR #4 R5 reviewer findings (2 MINOR)

hibryda added 1 commit

2026-05-24 23:49:33 +02:00

fix(mintcontext): apply PR #4 R5 reviewer findings (2 MINOR) 000bc0842c

R5 verdict findings (kept=2; both share the same root cause —
NOT_FOUND subclass needs the same more-specific split inside the
controller's try/catch that R5 already applied at the
GlobalExceptionHandler level):

(1) MINOR CentrifugoTokenController.java:258-266 — `catch
    (TicketContextFetchFailedException)` arm marks the ticket.mint
    span StatusCode.ERROR for ALL subclasses, including
    UserNotFoundContextFetchFailedException. Per ADR-0013 §3 the
    tail sampler retains every ERROR-status span, so an account-
    deletion sweep + Centrifugo reconnect storm would generate
    proportional retained error spans for NOT_FOUND bursts even
    though the breaker correctly stays closed and the
    GlobalExceptionHandler now (R5) logs at WARN with no
    OtelErrorEvents.record(). The span ERROR side of the contract
    leaked through.

(2) MINOR CentrifugoTokenController.java:259 — `outcome =
    OUTCOME_CONTEXT_FETCH_FAILURE` is the same label for NOT_FOUND
    and transient infra failures. On-call cannot distinguish the
    two from the counter alone; a NOT_FOUND-burst alert would page
    the same as a real downstream outage.

Fix (single change addressing both findings — same R63 ripple
inspection scope as R5's GlobalExceptionHandler fix):

- Added `catch (UserNotFoundContextFetchFailedException
  userNotFound)` arm BEFORE the parent
  `catch (TicketContextFetchFailedException fanoutFailed)` arm.
  Subclass-before-superclass ordering is required by Java reach-
  ability rules + idiomatically routes the permanent classification
  to the dedicated branch.
- New OUTCOME_CONTEXT_FETCH_USER_NOT_FOUND = "context_fetch_user_
  not_found" label so on-call can pivot permanent vs transient.
  Updated MintMetrics.java javadoc to enumerate the new label.
- Arm sets the `outcome` attribute but DOES NOT call
  span.setStatus(ERROR, …) — span stays UNSET, defeating tail-
  sampler retention for the permanent path while preserving the
  sampled-span outcome attribute for on-call filtering.
- Outcome counter still incremented via metrics.recordMintOutcome
  (separate label) — observability is preserved, just routed to
  a distinct bucket.

Verification (containerised JDK 17 mvn install):
- Tests run: 123, Failures: 0, Errors: 0, Skipped: 0
- BUILD SUCCESS
- mintReturns503WhenContextFanoutReportsUserNotFound test continues
  to pass (now also exercises the new controller-level NOT_FOUND
  arm in addition to the GlobalExceptionHandler's more-specific
  handler).

pr-reviewer-bot commented

2026-05-24 23:58:46 +02:00

Member

Superseded by round 7.

Show previous round

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Round 6 — head 000bc0842c1c, base main, trigger synchronize

TL;DR: CONDITIONAL_APPROVE — kept 1 minor unique-to-A finding (verified against HEAD); Reviewer B's empty set not adopted as confirmation of absence since A's grounded finding was confirmed by direct code read.

Summary

Arbitration — Round 6

Prior-round findings: Both reviewers confirmed all 13 findings from rounds 1–5 are resolved. No re-raises needed.

New finding reconciliation:

Reviewer A: 1 finding (minor, unique-to-A) — ExecutionException unwrap race in MintContextAggregator.aggregate() at line 321.
Reviewer B: 0 findings.

Verification of unique-to-A finding (Rule 2):

Read MintContextAggregator.java lines 297–340 confirmed the described code pattern exactly:

CompletableFuture.allOf(familyFuture, userFuture) at line 297 — completes on first failing future.
both.get(FANOUT_JOIN_TIMEOUT_MS, ...) at line 305.
catch (ExecutionException ee) → attachStatusBest(...) → rethrow(ee.getCause()) at line 321 — no inspection of the other future's completion cause.

Read CentrifugoTokenController.java lines 268–291 confirmed the UserNotFoundContextFetchFailedException catch arm does not call span.setStatus(ERROR) — intentional per ADR-0013 §3.

attachStatusBest (lines 379–411) sets StatusCode.ERROR on the fanout span only, not on the outer ticket.mint span.

Verdict: The race is real and grounded. Scenario: userFuture completes fast with NOT_FOUND while familyFuture has a slower transient failure (DEADLINE_EXCEEDED / UNAVAILABLE). allOf().get() fires on NOT_FOUND; ee.getCause() = UserNotFoundContextFetchFailedException; controller sets outcome=context_fetch_user_not_found without ERROR on ticket.mint span. On-call sees normal-churn metric and may dismiss a genuine infrastructure outage. The fanout span correctly shows ERROR one level down. Finding kept at minor — pure observability masking, response correctness unaffected.

Memora persistence: Attempted to create run-summary memory; Memora server auto-tag assignment hit an allowlist restriction (tag memora/issues not permitted). Memory could not be persisted due to server-side configuration issue.

Blast Radius

The diff is largely contained within the new mintcontext package (MintContextAggregator, FamilyContextClient, UserProfileClient) plus plumbing changes to CentrifugoTokenController and GlobalExceptionHandler. The fan-out touches two downstream gRPC services (family-service, user-service) and the circuit-breaker registry, so any misbehaviour propagates to the token-mint critical path. Score 4: exported surface within one service boundary, not a public protocol change.

BLAST_SCORE: 4/10

Risk Indicators

Indicator	Value
Sensitive functions	`MintContextAggregator.aggregate`, `CentrifugoTokenController.mintToken`, `FamilyContextClient.getFamilyForUser`, `UserProfileClient.getUserProfile`
Migration touched	—
Test delta	—
Dependency changes	—

CI status (head `000bc0842c1c`)

No CI checks reported for this commit.

Findings (1)

[MINOR] `ExecutionException` unwrap is race-sensitive — a fast NOT_FOUND can mask a concurrent transient infrastructure failure at the controller level

src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:321

Problem

CompletableFuture.allOf(familyFuture, userFuture).get() (line 305) completes exceptionally as soon as the first failing future is done — it does not wait for the second. At line 321, rethrow(ee.getCause()) therefore propagates whichever future lost the race, regardless of severity.

Verified scenario:

userFuture completes at ~50 ms with UserNotFoundContextFetchFailedException (fast NOT_FOUND — record absent).
familyFuture is still in-flight (e.g., DEADLINE_EXCEEDED at 2 000 ms or UNAVAILABLE from a slow-refuse).
allOf().get() fires immediately at step 1; ee.getCause() = UserNotFoundContextFetchFailedException.
rethrow propagates UserNotFoundContextFetchFailedException.
Controller catch arm (line 268–291, verified): outcome = context_fetch_user_not_found, no span.setStatus(ERROR) on ticket.mint.

Meanwhile attachStatusBest (lines 383–410, verified) correctly marks the fanout span ERROR (familyFuture.isCompletedExceptionally() || !familyFuture.isDone() → STATUS_ERROR; span.setStatus(ERROR, "fanout-failure")). But the outer ticket.mint span carries no ERROR status and outcome=context_fetch_user_not_found. On-call sees normal-churn metric and may dismiss a genuine infrastructure outage while the fanout span tells the truth one level down.

Recommended fix

After attachStatusBest, inspect both futures for their completion cause and prefer the transient classification when one raised a non-NOT_FOUND TicketContextFetchFailedException:

} catch (ExecutionException ee) {
    attachStatusBest(span, familyFuture, userFuture, fanoutStartNs);
    Throwable winner = ee.getCause();
    if (winner instanceof UserNotFoundContextFetchFailedException) {
        Throwable altCause = completionCauseIfTransient(familyFuture);
        if (altCause == null) altCause = completionCauseIfTransient(userFuture);
        if (altCause != null) winner = altCause;
    }
    rethrow(winner);
    return null; // unreachable
}

private static Throwable completionCauseIfTransient(CompletableFuture<?> f) {
    if (!f.isCompletedExceptionally()) return null;
    try { f.join(); } catch (java.util.concurrent.CompletionException ce) {
        Throwable cause = ce.getCause();
        if (cause instanceof TicketContextFetchFailedException
                && !(cause instanceof UserNotFoundContextFetchFailedException)) {
            return cause;
        }
    }
    return null;
}

This preserves existing behaviour for all-success and same-classification-failure cases, and correctly promotes the transient signal so the controller sets span.setStatus(ERROR) and pages on-call.

Verdict

CONDITIONAL_APPROVE

_{hib-pr-reviewer • round 6 • 1 finding (1m) • 2026-05-24T21:55:49.967Z → 2026-05-24T21:58:45.886Z • posted-as: pr-reviewer-bot}

> _Superseded by round 7._ <details> <summary>Show previous round</summary> ## hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service) **Round 6** — head `000bc0842c1c`, base `main`, trigger `synchronize` **TL;DR:** CONDITIONAL_APPROVE — kept 1 minor unique-to-A finding (verified against HEAD); Reviewer B's empty set not adopted as confirmation of absence since A's grounded finding was confirmed by direct code read. ### Summary ## Arbitration — Round 6 **Prior-round findings:** Both reviewers confirmed all 13 findings from rounds 1–5 are resolved. No re-raises needed. **New finding reconciliation:** - Reviewer A: 1 finding (minor, unique-to-A) — `ExecutionException` unwrap race in `MintContextAggregator.aggregate()` at line 321. - Reviewer B: 0 findings. **Verification of unique-to-A finding (Rule 2):** Read `MintContextAggregator.java` lines 297–340 confirmed the described code pattern exactly: 1. `CompletableFuture.allOf(familyFuture, userFuture)` at line 297 — completes on first failing future. 2. `both.get(FANOUT_JOIN_TIMEOUT_MS, ...)` at line 305. 3. `catch (ExecutionException ee)` → `attachStatusBest(...)` → `rethrow(ee.getCause())` at line 321 — no inspection of the other future's completion cause. Read `CentrifugoTokenController.java` lines 268–291 confirmed the `UserNotFoundContextFetchFailedException` catch arm does **not** call `span.setStatus(ERROR)` — intentional per ADR-0013 §3. `attachStatusBest` (lines 379–411) sets `StatusCode.ERROR` on the fanout span only, not on the outer `ticket.mint` span. **Verdict:** The race is real and grounded. Scenario: userFuture completes fast with NOT_FOUND while familyFuture has a slower transient failure (DEADLINE_EXCEEDED / UNAVAILABLE). `allOf().get()` fires on NOT_FOUND; `ee.getCause()` = `UserNotFoundContextFetchFailedException`; controller sets `outcome=context_fetch_user_not_found` without ERROR on `ticket.mint` span. On-call sees normal-churn metric and may dismiss a genuine infrastructure outage. The fanout span correctly shows ERROR one level down. **Finding kept at minor** — pure observability masking, response correctness unaffected. **Memora persistence:** Attempted to create run-summary memory; Memora server auto-tag assignment hit an allowlist restriction (tag `memora/issues` not permitted). Memory could not be persisted due to server-side configuration issue. ### Blast Radius The diff is largely contained within the new `mintcontext` package (MintContextAggregator, FamilyContextClient, UserProfileClient) plus plumbing changes to CentrifugoTokenController and GlobalExceptionHandler. The fan-out touches two downstream gRPC services (family-service, user-service) and the circuit-breaker registry, so any misbehaviour propagates to the token-mint critical path. Score 4: exported surface within one service boundary, not a public protocol change. **BLAST_SCORE: 4/10** ### Risk Indicators | Indicator | Value | |---|---| | Sensitive functions | `MintContextAggregator.aggregate`, `CentrifugoTokenController.mintToken`, `FamilyContextClient.getFamilyForUser`, `UserProfileClient.getUserProfile` | | Migration touched | — | | Test delta | — | | Dependency changes | — | ### CI status (head `000bc0842c1c`) _No CI checks reported for this commit._ ### Findings (1) #### **[MINOR]** `ExecutionException` unwrap is race-sensitive — a fast NOT_FOUND can mask a concurrent transient infrastructure failure at the controller level _src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:321_ ## Problem `CompletableFuture.allOf(familyFuture, userFuture).get()` (line 305) completes exceptionally as soon as the **first** failing future is done — it does not wait for the second. At line 321, `rethrow(ee.getCause())` therefore propagates whichever future lost the race, regardless of severity. **Verified scenario:** 1. `userFuture` completes at ~50 ms with `UserNotFoundContextFetchFailedException` (fast NOT_FOUND — record absent). 2. `familyFuture` is still in-flight (e.g., DEADLINE_EXCEEDED at 2 000 ms or UNAVAILABLE from a slow-refuse). 3. `allOf().get()` fires immediately at step 1; `ee.getCause()` = `UserNotFoundContextFetchFailedException`. 4. `rethrow` propagates `UserNotFoundContextFetchFailedException`. 5. Controller catch arm (line 268–291, verified): `outcome = context_fetch_user_not_found`, **no `span.setStatus(ERROR)`** on `ticket.mint`. Meanwhile `attachStatusBest` (lines 383–410, verified) correctly marks the fanout span ERROR (`familyFuture.isCompletedExceptionally() || !familyFuture.isDone()` → STATUS_ERROR; `span.setStatus(ERROR, "fanout-failure")`). But the **outer `ticket.mint` span** carries no ERROR status and `outcome=context_fetch_user_not_found`. On-call sees normal-churn metric and may dismiss a genuine infrastructure outage while the fanout span tells the truth one level down. ## Recommended fix After `attachStatusBest`, inspect both futures for their completion cause and prefer the transient classification when one raised a non-NOT_FOUND `TicketContextFetchFailedException`: ```java } catch (ExecutionException ee) { attachStatusBest(span, familyFuture, userFuture, fanoutStartNs); Throwable winner = ee.getCause(); if (winner instanceof UserNotFoundContextFetchFailedException) { Throwable altCause = completionCauseIfTransient(familyFuture); if (altCause == null) altCause = completionCauseIfTransient(userFuture); if (altCause != null) winner = altCause; } rethrow(winner); return null; // unreachable } private static Throwable completionCauseIfTransient(CompletableFuture<?> f) { if (!f.isCompletedExceptionally()) return null; try { f.join(); } catch (java.util.concurrent.CompletionException ce) { Throwable cause = ce.getCause(); if (cause instanceof TicketContextFetchFailedException && !(cause instanceof UserNotFoundContextFetchFailedException)) { return cause; } } return null; } ``` This preserves existing behaviour for all-success and same-classification-failure cases, and correctly promotes the transient signal so the controller sets `span.setStatus(ERROR)` and pages on-call. ### Verdict **CONDITIONAL_APPROVE** --- hib-pr-reviewer • round 6 • 1 finding (1m) • 2026-05-24T21:55:49.967Z → 2026-05-24T21:58:45.886Z • posted-as: pr-reviewer-bot </details>

hibryda referenced this pull request from a commit

2026-05-25 00:01:37 +02:00

fix(mintcontext): apply PR #4 R6 reviewer finding (1 MINOR — race)

hibryda added 1 commit

2026-05-25 00:01:37 +02:00

fix(mintcontext): apply PR #4 R6 reviewer finding (1 MINOR — race) 58efc2ab3d

R6 verdict finding (kept=1):

(1) MINOR MintContextAggregator.java:321 — ExecutionException unwrap is
    race-sensitive. `allOf(familyFuture, userFuture).get()` completes as
    soon as the FIRST failing future is done; it does NOT wait for the
    second. A fast NOT_FOUND (~50 ms — record absent) can therefore race
    a slow transient (e.g. DEADLINE_EXCEEDED at 2 s, or UNAVAILABLE
    slow-refuse) and the NOT_FOUND wins. ee.getCause() then propagates
    the permanent classification; the controller's NOT_FOUND arm leaves
    the outer ticket.mint span StatusCode.UNSET + uses the
    context_fetch_user_not_found outcome label, so on-call sees normal
    account-deletion churn instead of a real downstream outage. The
    fanout CHILD span (one level down) correctly says ERROR via
    attachStatusBest, but the two signals decouple — the outer-span
    truth is wrong.

Fix:

- New `completionCauseIfTransient(CompletableFuture<?>)` helper inspects
  a terminally-completed future for a non-NOT_FOUND
  TicketContextFetchFailedException cause; returns null otherwise.
- ExecutionException catch arm now: if winner is
  UserNotFoundContextFetchFailedException, scan BOTH futures for a
  transient cause via the helper; if found, promote it to `winner` and
  rethrow. Preserves existing behaviour for all-success +
  same-classification-failure cases.

This is the third symmetric edit in the NOT_FOUND-as-permanent-
classification contract (see Memora #3454 + #3455):
  - R5 (07cfd02): GlobalExceptionHandler dedicated handler.
  - R6 (000bc08): Controller-level catch arm + new outcome label.
  - R7 (this commit): Aggregator-level race-correction.
All three layers must split for the contract to be honored end-to-end.

Verification (containerised JDK 17 mvn install):
- Tests run: 124, Failures: 0, Errors: 0, Skipped: 0 (was 123)
- BUILD SUCCESS
- New test:
  MintContextAggregatorTest#notFoundRacingTransientPromotesTransient
  arranges family-fast-NOT_FOUND + user-slow-transient and verifies
  the propagated exception is the transient (not NOT_FOUND).

pr-reviewer-bot commented

2026-05-25 00:08:30 +02:00

Member

Superseded by round 8.

Show previous round

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Round 7 — head 58efc2ab3dc9, base main, trigger synchronize

TL;DR: CONDITIONAL_APPROVE — 1 agreed minor finding kept (both A and B independently flagged the same false Javadoc in completionCauseIfTransient); no blocking or major issues; all 14 R1–R6 findings confirmed resolved.

Summary

Arbitration — Round 7

Memora: No prior run history found (first-time arbitration despite round 7); create-path attempted but tag-validation errors from the Memora server prevented persistence.

Verification performed: Read MintContextAggregator.java lines 460–494 at HEAD.

Reconciliation

1 agreed finding (A + B): Both reviewers independently flag the same Javadoc paragraph in completionCauseIfTransient. Verified at HEAD:

Lines 469–473 assert "both futures are guaranteed completed" after allOf().get() throws — this is false per the CompletableFuture.allOf() contract (completes exceptionally when the first component fails; the second may still be in flight).
Line 475 @param tag calls f "a fanout future already known to be terminally completed" — also false for the slow-transient case.
The code is safe: isCompletedExceptionally() at line 481 short-circuits before join() when f is still pending. But the false Javadoc creates a trap for future maintainers who might remove the guard as "redundant".
Reviewer B additionally identifies a functional gap: the transient-promotion is best-effort only — a slow-transient future (e.g. 2 s DEADLINE_EXCEEDED) will still be pending when the catch arm fires, so NOT_FOUND wins the race and masks a real outage behind account-deletion-churn signal.

Line citation: A cites 471 (the sentence containing the false claim); B cites 469 (start of the paragraph). The problematic block spans 469–475. I cite 469 (block start) and document the exact false phrase.

Severity: minor — agreed by both; the runtime behaviour is safe, only docs/observability gap.

Blast Radius

The diff introduces a new gRPC fan-out subsystem (MintContextAggregator, FamilyContextClient, UserProfileClient) wired into the hot CentrifugoTokenController mint path, with two new protos, Resilience4j circuit-breaker wiring, and new exception/handler classes. The surface is consumed by a single controller action but spans five packages and introduces an outbound network I/O boundary with circuit-breaker configuration, raising the blast radius to moderate.

BLAST_SCORE: 6/10

Risk Indicators

Indicator	Value
Sensitive functions	`completionCauseIfTransient`, `serialiseUserContext`, `MintContextAggregator.aggregate`, `FamilyContextClient.getFamilyForUser`, `UserProfileClient.getUserProfile`
Migration touched	—
Test delta	—
Dependency changes	—

CI status (head `58efc2ab3dc9`)

No CI checks reported for this commit.

Findings (1)

[MINOR] `completionCauseIfTransient` Javadoc falsely claims "both futures are guaranteed completed" — slow-transient functional gap persists

src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:469

Lines 469–473 read:

 * <p>{@code f.join()} is safe here because we only call this from
 * the {@code allOf().get()} catch arm after {@code allOf()} already
 * fired — both futures are guaranteed completed (the lost-race one
 * may have completed exceptionally OR successfully; we only
 * inspect the exceptional case via {@code isCompletedExceptionally}).</p>

Line 475:

 * @param f a fanout future already known to be terminally completed.

Why the claim is false: Per the CompletableFuture.allOf() specification, allOf() completes exceptionally as soon as any one of its components completes exceptionally. When allOf().get() throws ExecutionException, only the first-failing future is guaranteed done; the other future may still be in flight.

Why the code is currently safe: The isCompletedExceptionally() guard at line 481 evaluates to false for any future that is still pending (or that completed normally), causing an immediate return null before join() is reached. join() is only called after isCompletedExceptionally() confirms the future is terminally done exceptionally — that is the actual safety invariant, not the allOf contract.

The trap for future maintainers: Someone reading "both futures are guaranteed completed" may conclude the isCompletedExceptionally() check is merely distinguishing exceptional vs. normal completion and remove it as redundant — that would re-introduce a blocking join() on an in-flight future, stalling a mint-fanout-* thread until the per-RPC deadline fires.

Functional gap (from R6, partially closed): The transient-cause promotion only succeeds when the transient future has already failed by the time the NOT_FOUND catch arm fires. For slow-transient futures (e.g. a 2 s DEADLINE_EXCEEDED from a downstream overload) that are still pending at catch time, isCompletedExceptionally() returns false → completionCauseIfTransient returns null → NOT_FOUND wins the race → the outer ticket.mint span stays UNSET and the context_fetch_user_not_found outcome label fires, masking a real infra outage as routine account-deletion churn.

Required fixes:

Replace the false Javadoc paragraph with: "CompletableFuture.allOf() completes exceptionally when the first component fails; the returned future's get() throwing ExecutionException does not guarantee the second future is done. The isCompletedExceptionally() guard at line 481 is the actual safety net: if f is still in flight or completed normally the early-return fires before join() is reached."
Fix the @param tag to: "f a fanout future that may or may not be done; returns null when still pending or completed normally".
Consider adding a comment in the ExecutionException catch arm (~line 340) explicitly marking the transient-promotion as best-effort: if the other future is still pending at catch time its eventual failure is not surfaced at controller level — FANOUT_JOIN_TIMEOUT_MS prevents an indefinite block, but the outcome label and outer span status are already set.

Verdict

CONDITIONAL_APPROVE

_{hib-pr-reviewer • round 7 • 1 finding (1m) • 2026-05-24T22:06:18.391Z → 2026-05-24T22:08:30.200Z • posted-as: pr-reviewer-bot}

> _Superseded by round 8._ <details> <summary>Show previous round</summary> ## hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service) **Round 7** — head `58efc2ab3dc9`, base `main`, trigger `synchronize` **TL;DR:** CONDITIONAL_APPROVE — 1 agreed minor finding kept (both A and B independently flagged the same false Javadoc in `completionCauseIfTransient`); no blocking or major issues; all 14 R1–R6 findings confirmed resolved. ### Summary ## Arbitration — Round 7 **Memora:** No prior run history found (first-time arbitration despite round 7); create-path attempted but tag-validation errors from the Memora server prevented persistence. **Verification performed:** Read `MintContextAggregator.java` lines 460–494 at HEAD. ### Reconciliation **1 agreed finding (A + B):** Both reviewers independently flag the same Javadoc paragraph in `completionCauseIfTransient`. Verified at HEAD: - Lines 469–473 assert *"both futures are guaranteed completed"* after `allOf().get()` throws — this is false per the `CompletableFuture.allOf()` contract (completes exceptionally when the *first* component fails; the second may still be in flight). - Line 475 `@param` tag calls `f` *"a fanout future already known to be terminally completed"* — also false for the slow-transient case. - The **code is safe**: `isCompletedExceptionally()` at line 481 short-circuits before `join()` when `f` is still pending. But the false Javadoc creates a trap for future maintainers who might remove the guard as "redundant". - Reviewer B additionally identifies a **functional gap**: the transient-promotion is best-effort only — a slow-transient future (e.g. 2 s DEADLINE_EXCEEDED) will still be pending when the catch arm fires, so NOT_FOUND wins the race and masks a real outage behind account-deletion-churn signal. **Line citation:** A cites 471 (the sentence containing the false claim); B cites 469 (start of the paragraph). The problematic block spans 469–475. I cite 469 (block start) and document the exact false phrase. **Severity:** `minor` — agreed by both; the runtime behaviour is safe, only docs/observability gap. ### Blast Radius The diff introduces a new gRPC fan-out subsystem (`MintContextAggregator`, `FamilyContextClient`, `UserProfileClient`) wired into the hot `CentrifugoTokenController` mint path, with two new protos, Resilience4j circuit-breaker wiring, and new exception/handler classes. The surface is consumed by a single controller action but spans five packages and introduces an outbound network I/O boundary with circuit-breaker configuration, raising the blast radius to moderate. **BLAST_SCORE: 6/10** ### Risk Indicators | Indicator | Value | |---|---| | Sensitive functions | `completionCauseIfTransient`, `serialiseUserContext`, `MintContextAggregator.aggregate`, `FamilyContextClient.getFamilyForUser`, `UserProfileClient.getUserProfile` | | Migration touched | — | | Test delta | — | | Dependency changes | — | ### CI status (head `58efc2ab3dc9`) _No CI checks reported for this commit._ ### Findings (1) #### **[MINOR]** `completionCauseIfTransient` Javadoc falsely claims "both futures are guaranteed completed" — slow-transient functional gap persists _src/main/java/com/aim2be/identity/mintcontext/MintContextAggregator.java:469_ Lines 469–473 read: ```java * {@code f.join()} is safe here because we only call this from * the {@code allOf().get()} catch arm after {@code allOf()} already * fired — both futures are guaranteed completed (the lost-race one * may have completed exceptionally OR successfully; we only * inspect the exceptional case via {@code isCompletedExceptionally}). ``` Line 475: ```java * @param f a fanout future already known to be terminally completed. ``` **Why the claim is false:** Per the `CompletableFuture.allOf()` specification, `allOf()` completes exceptionally as soon as *any one* of its components completes exceptionally. When `allOf().get()` throws `ExecutionException`, only the *first-failing* future is guaranteed done; the *other* future may still be in flight. **Why the code is currently safe:** The `isCompletedExceptionally()` guard at line 481 evaluates to `false` for any future that is still pending (or that completed normally), causing an immediate `return null` before `join()` is reached. `join()` is only called after `isCompletedExceptionally()` confirms the future is terminally done exceptionally — that is the actual safety invariant, not the `allOf` contract. **The trap for future maintainers:** Someone reading "both futures are guaranteed completed" may conclude the `isCompletedExceptionally()` check is merely distinguishing exceptional vs. normal completion and remove it as redundant — that would re-introduce a blocking `join()` on an in-flight future, stalling a `mint-fanout-*` thread until the per-RPC deadline fires. **Functional gap (from R6, partially closed):** The transient-cause promotion only succeeds when the transient future has *already failed* by the time the NOT_FOUND catch arm fires. For slow-transient futures (e.g. a 2 s `DEADLINE_EXCEEDED` from a downstream overload) that are still pending at catch time, `isCompletedExceptionally()` returns `false` → `completionCauseIfTransient` returns `null` → NOT_FOUND wins the race → the outer `ticket.mint` span stays `UNSET` and the `context_fetch_user_not_found` outcome label fires, masking a real infra outage as routine account-deletion churn. **Required fixes:** 1. Replace the false Javadoc paragraph with: *"`CompletableFuture.allOf()` completes exceptionally when the **first** component fails; the returned future's `get()` throwing `ExecutionException` does **not** guarantee the second future is done. The `isCompletedExceptionally()` guard at line 481 is the actual safety net: if `f` is still in flight or completed normally the early-return fires before `join()` is reached."* 2. Fix the `@param` tag to: *"`f` a fanout future that may or may not be done; returns `null` when still pending or completed normally"*. 3. Consider adding a comment in the `ExecutionException` catch arm (~line 340) explicitly marking the transient-promotion as **best-effort**: if the other future is still pending at catch time its eventual failure is not surfaced at controller level — `FANOUT_JOIN_TIMEOUT_MS` prevents an indefinite block, but the outcome label and outer span status are already set. ### Verdict **CONDITIONAL_APPROVE** --- hib-pr-reviewer • round 7 • 1 finding (1m) • 2026-05-24T22:06:18.391Z → 2026-05-24T22:08:30.200Z • posted-as: pr-reviewer-bot </details>

hibryda added 1 commit

2026-05-25 00:10:18 +02:00

docs(mintcontext): correct R7 race-fix Javadoc — R7 review (1 MINOR) 90673882f8

R7 verdict finding (kept=1):

(1) MINOR MintContextAggregator.java:469-475 — completionCauseIfTransient
    Javadoc falsely claimed "both futures are guaranteed completed" by
    the time this helper is called from the ExecutionException arm.
    Per CompletableFuture.allOf() spec: allOf completes exceptionally as
    soon as the FIRST component fails; the other future may still be
    in-flight. Code remains safe because the isCompletedExceptionally()
    guard at line 481 short-circuits on pending futures BEFORE join()
    is reached — that guard IS the safety invariant, not the (false)
    allOf-completion claim. A maintainer trusting the Javadoc could
    remove the guard as "redundant", reintroducing a blocking join()
    on an in-flight future.

    Reviewer also called out the functional gap this leaves: a slow
    transient (e.g. 2 s DEADLINE_EXCEEDED still pending) is not
    promoted; the NOT_FOUND wins the race; outer ticket.mint span
    stays UNSET + outcome label stays permanent. FANOUT_JOIN_TIMEOUT_MS
    bounds the worst-case + attachStatusBest still marks the fanout
    CHILD span ERROR, so the SLO surface is bounded — but the outer-
    span/outcome-label resolution is best-effort.

Fix (Javadoc-only):

- Rewrote the safety-invariant paragraph to state the actual rule:
  isCompletedExceptionally() is the safety net, not the allOf claim.
- Added a "best-effort caveat" paragraph documenting the slow-transient
  functional gap + why closing it fully would require defeating
  bounded fanout OR post-hoc reconciliation.
- @param + @return tags reworded to reflect that f may be pending.
- Added a BEST-EFFORT comment block in the ExecutionException arm
  pointing at the same caveat for future readers.

Verification (containerised JDK 17 mvn install):
- Tests run: 124, Failures: 0, Errors: 0, Skipped: 0
- BUILD SUCCESS
- Javadoc-only delta; behaviour unchanged.

pr-reviewer-bot commented

2026-05-25 00:19:10 +02:00

Member

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Round 8 — head 90673882f8fc, base main, trigger synchronize

TL;DR: NO_NEW_FINDINGS — No new findings this round.

Summary

[quorum-converged] A=0 = B=0. Round 8 — all 15 open findings are now resolved; no new issues found.

I read the following files at HEAD to verify closure:

MintContextAggregator.java (shutdown, executor, fanout join, attachStatusBest, completionCauseIfTransient)
FamilyContextClient.java and UserProfileClient.java (mapStatus, CB wiring)
GlobalExceptionHandler.java (handler ordering, logging severity, OTel event calls)
UserNotFoundContextFetchFailedException.java and TicketContextFetchFailedException.java (Javadoc + error codes)
SubscriptionTierMapper.java (IllegalArgumentException → TicketContextFetchFailedException rewrap confirmed)
application.properties (recordExceptions / ignoreExceptions entries and comments)
CentrifugoTokenController.java catch-block ordering (from diff)

Closure audit:

Finding	Status
R1 NOT_FOUND as CB failure	✅ `UserNotFoundContextFetchFailedException` in `ignoreExceptions`
R1 Dead `StatusRuntimeException` in `recordExceptions`	✅ No longer present
R1 Comment claiming `IllegalArgumentException` in `recordExceptions`	✅ Comment correctly documents `IllegalArgumentException → TicketContextFetchFailedException` rewrap via `parseAndValidate()`
R1 `newFixedThreadPool(16)` unbounded queue	✅ `ArrayBlockingQueue(64)` + `CallerRunsPolicy`
R1 `@PreDestroy` no `awaitTermination`	✅ `awaitTermination(SHUTDOWN_AWAIT_SECONDS, SECONDS)` with `shutdownNow()` escalation
R2 `attachStatusBest` omits timing attributes on error path	✅ Attaches all five attributes on every code path
R2 `both.join()` no wall-clock ceiling	✅ `both.get(FANOUT_JOIN_TIMEOUT_MS, MILLISECONDS)` with `TimeoutException` arm
R3 Comment overstates `StatusRuntimeException` remapping	✅ Comment correctly describes three distinct exception families
R4 `handleTicketContextFetchFailed` logs NOT_FOUND at ERROR	✅ Separate `@ExceptionHandler(UserNotFoundContextFetchFailedException.class)` logs at WARN, no OTel error event
R4 `IllegalStateException` diluting CB success rate	✅ `IllegalStateException` in `ignoreExceptions` (both instances)
R4 `UserNotFoundContextFetchFailedException` Javadoc claims re-auth	✅ Javadoc now correctly states clients retry per `Retry-After`, not initiate token refresh
R5 Metric label conflation	✅ New `OUTCOME_CONTEXT_FETCH_USER_NOT_FOUND` label; `MintMetrics` Javadoc updated
R5 NOT_FOUND catch marks `ticket.mint` span ERROR	✅ NOT_FOUND catch block omits `span.setStatus(StatusCode.ERROR, …)` entirely
R6 `ExecutionException` unwrap race-sensitive	✅ `completionCauseIfTransient` promotes transient cause over NOT_FOUND when already done; best-effort semantics documented
R7 `completionCauseIfTransient` Javadoc false invariant	✅ Safety invariant + best-effort caveat sections accurately describe the `isCompletedExceptionally()` guard and the slow-transient gap

CI status (head `90673882f8fc`)

No CI checks reported for this commit.

Findings

No new findings this round.

Quorum converged on empty findings (A + B both returned 0).

Verdict

NO_NEW_FINDINGS

_{hib-pr-reviewer • round 8 • 0 findings • 2026-05-24T22:10:19.406Z → 2026-05-24T22:19:10.574Z • posted-as: pr-reviewer-bot • [bookkeeping fallback]}

## hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service) **Round 8** — head `90673882f8fc`, base `main`, trigger `synchronize` **TL;DR:** NO_NEW_FINDINGS — No new findings this round. ### Summary [quorum-converged] A=0 = B=0. **Round 8 — all 15 open findings are now resolved; no new issues found.** I read the following files at HEAD to verify closure: - `MintContextAggregator.java` (shutdown, executor, fanout join, attachStatusBest, completionCauseIfTransient) - `FamilyContextClient.java` and `UserProfileClient.java` (mapStatus, CB wiring) - `GlobalExceptionHandler.java` (handler ordering, logging severity, OTel event calls) - `UserNotFoundContextFetchFailedException.java` and `TicketContextFetchFailedException.java` (Javadoc + error codes) - `SubscriptionTierMapper.java` (IllegalArgumentException → TicketContextFetchFailedException rewrap confirmed) - `application.properties` (recordExceptions / ignoreExceptions entries and comments) - `CentrifugoTokenController.java` catch-block ordering (from diff) **Closure audit:** | Finding | Status | |---|---| | R1 NOT_FOUND as CB failure | ✅ `UserNotFoundContextFetchFailedException` in `ignoreExceptions` | | R1 Dead `StatusRuntimeException` in `recordExceptions` | ✅ No longer present | | R1 Comment claiming `IllegalArgumentException` in `recordExceptions` | ✅ Comment correctly documents `IllegalArgumentException → TicketContextFetchFailedException` rewrap via `parseAndValidate()` | | R1 `newFixedThreadPool(16)` unbounded queue | ✅ `ArrayBlockingQueue(64)` + `CallerRunsPolicy` | | R1 `@PreDestroy` no `awaitTermination` | ✅ `awaitTermination(SHUTDOWN_AWAIT_SECONDS, SECONDS)` with `shutdownNow()` escalation | | R2 `attachStatusBest` omits timing attributes on error path | ✅ Attaches all five attributes on every code path | | R2 `both.join()` no wall-clock ceiling | ✅ `both.get(FANOUT_JOIN_TIMEOUT_MS, MILLISECONDS)` with `TimeoutException` arm | | R3 Comment overstates `StatusRuntimeException` remapping | ✅ Comment correctly describes three distinct exception families | | R4 `handleTicketContextFetchFailed` logs NOT_FOUND at ERROR | ✅ Separate `@ExceptionHandler(UserNotFoundContextFetchFailedException.class)` logs at WARN, no OTel error event | | R4 `IllegalStateException` diluting CB success rate | ✅ `IllegalStateException` in `ignoreExceptions` (both instances) | | R4 `UserNotFoundContextFetchFailedException` Javadoc claims re-auth | ✅ Javadoc now correctly states clients retry per `Retry-After`, not initiate token refresh | | R5 Metric label conflation | ✅ New `OUTCOME_CONTEXT_FETCH_USER_NOT_FOUND` label; `MintMetrics` Javadoc updated | | R5 NOT_FOUND catch marks `ticket.mint` span ERROR | ✅ NOT_FOUND catch block omits `span.setStatus(StatusCode.ERROR, …)` entirely | | R6 `ExecutionException` unwrap race-sensitive | ✅ `completionCauseIfTransient` promotes transient cause over NOT_FOUND when already done; best-effort semantics documented | | R7 `completionCauseIfTransient` Javadoc false invariant | ✅ Safety invariant + best-effort caveat sections accurately describe the `isCompletedExceptionally()` guard and the slow-transient gap | ### CI status (head `90673882f8fc`) _No CI checks reported for this commit._ ### Findings **No new findings this round.** _Quorum converged on empty findings (A + B both returned 0)._ ### Verdict **NO_NEW_FINDINGS** --- hib-pr-reviewer • round 8 • 0 findings • 2026-05-24T22:10:19.406Z → 2026-05-24T22:19:10.574Z • posted-as: pr-reviewer-bot • [bookkeeping fallback]

hibryda merged commit 2a7c2a1e22 into main

2026-05-25 00:19:49 +02:00

hibryda referenced this pull request from a commit

2026-05-25 00:19:49 +02:00

feat(centrifugo): L0 T0 #6 PR-OPAQUE-4e — gRPC fan-out for mint-time user context (#4)

hibryda referenced this pull request from a commit

2026-05-25 00:20:05 +02:00

chore(submodule): bump identity-service to 2a7c2a1e — PR-OPAQUE-4e MERGED

No reviewers

No labels

No milestone

No project

No assignees

2 participants

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference

affinity-intelligence-rework/im2be-identity-service!4

No description provided.

Rows
Columns

feat(centrifugo): L0 T0 #6 PR-OPAQUE-4e — gRPC fan-out for mint-time user context #4

Summary

Changes

Production code

ADR

Tests

Verification

Test plan

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Summary

Arbitration summary

Blast Radius

Risk Indicators

CI status (head ad10ae12b45e)

Related PRs

Findings (5)

[MAJOR] NOT_FOUND counted as a circuit-breaker failure — permanent errors can trip the breaker open for all users

[MINOR] Dead io.grpc.StatusRuntimeException entry in both recordExceptions configs

[MINOR] Comment incorrectly claims IllegalArgumentException is in recordExceptions

[MINOR] newFixedThreadPool(16) uses unbounded queue — no back-pressure beyond 8 concurrent mint requests

[MINOR] @PreDestroy shutdown() does not await executor termination — Javadoc claim of "drains" is incorrect

Verdict

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Summary

CI status (head 2d1fedec921a)

Findings (2)

[MINOR] attachStatusBest omits all three timing attributes from the error-path span, breaking the Javadoc contract

[MINOR] both.join() has no wall-clock ceiling — indefinite thread block possible if per-RPC deadline does not fire

Verdict

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Summary

Round 3 Arbitration — affinity-intelligence-rework/im2be-identity-service#4

Blast Radius

Risk Indicators

CI status (head 098b687d523b)

Findings (1)

[MINOR] Comment overstates StatusRuntimeException remapping — INVALID_ARGUMENT maps to IllegalStateException, not TicketContextFetchFailedException

Verdict

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Summary

Arbitration Round 4 — 3 unique-to-one findings verified and kept; 0 dropped

Blast Radius

Risk Indicators

CI status (head bb22b7d155f3)

Related PRs

Findings (3)

[MINOR] handleTicketContextFetchFailed logs UserNotFoundContextFetchFailedException at ERROR and marks the OTel span ERROR, contradicting the NOT_FOUND classification rationale

[MINOR] Comment incorrectly claims IllegalStateException is neutral to the circuit breaker — it is counted as SUCCESS, diluting the failure rate

[MINOR] Javadoc incorrectly claims HTTP 503 + Retry-After:1 triggers client re-authentication — it does not

Verdict

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Summary

Round 5 Arbitration — 2026-05-24

Blast Radius

Risk Indicators

CI status (head 07cfd02f17d7)

Findings (2)

[MINOR] Metric label context_fetch_failure conflates NOT_FOUND (permanent) with transient infra failures — on-call cannot distinguish the two from the counter alone

[MINOR] NOT_FOUND arm of TicketContextFetchFailedException catch still marks the ticket.mint span ERROR, defeating the GlobalExceptionHandler tail-sampler isolation (ADR-0013 §3)

Verdict

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Summary

Arbitration — Round 6

Blast Radius

Risk Indicators

CI status (head 000bc0842c1c)

Findings (1)

[MINOR] ExecutionException unwrap is race-sensitive — a fast NOT_FOUND can mask a concurrent transient infrastructure failure at the controller level

Problem

Recommended fix

Verdict

hib-pr-reviewer review — PR #4 (affinity-intelligence-rework/im2be-identity-service)

Summary

Arbitration — Round 7

Reconciliation

Blast Radius

Risk Indicators

CI status (head 58efc2ab3dc9)

Findings (1)

[MINOR] completionCauseIfTransient Javadoc falsely claims "both futures are guaranteed completed" — slow-transient functional gap persists

CI status (head `ad10ae12b45e`)

[MAJOR] `NOT_FOUND` counted as a circuit-breaker failure — permanent errors can trip the breaker open for all users

[MINOR] Dead `io.grpc.StatusRuntimeException` entry in both `recordExceptions` configs

[MINOR] Comment incorrectly claims `IllegalArgumentException` is in `recordExceptions`

[MINOR] `newFixedThreadPool(16)` uses unbounded queue — no back-pressure beyond 8 concurrent mint requests

[MINOR] `@PreDestroy shutdown()` does not await executor termination — Javadoc claim of "drains" is incorrect

CI status (head `2d1fedec921a`)

[MINOR] `attachStatusBest` omits all three timing attributes from the error-path span, breaking the Javadoc contract

[MINOR] `both.join()` has no wall-clock ceiling — indefinite thread block possible if per-RPC deadline does not fire

CI status (head `098b687d523b`)

[MINOR] Comment overstates `StatusRuntimeException` remapping — `INVALID_ARGUMENT` maps to `IllegalStateException`, not `TicketContextFetchFailedException`

CI status (head `bb22b7d155f3`)

[MINOR] `handleTicketContextFetchFailed` logs `UserNotFoundContextFetchFailedException` at ERROR and marks the OTel span ERROR, contradicting the NOT_FOUND classification rationale

[MINOR] Comment incorrectly claims `IllegalStateException` is neutral to the circuit breaker — it is counted as SUCCESS, diluting the failure rate

[MINOR] Javadoc incorrectly claims HTTP 503 + `Retry-After:1` triggers client re-authentication — it does not

CI status (head `07cfd02f17d7`)

[MINOR] Metric label `context_fetch_failure` conflates NOT_FOUND (permanent) with transient infra failures — on-call cannot distinguish the two from the counter alone

CI status (head `000bc0842c1c`)

[MINOR] `ExecutionException` unwrap is race-sensitive — a fast NOT_FOUND can mask a concurrent transient infrastructure failure at the controller level

CI status (head `58efc2ab3dc9`)

[MINOR] `completionCauseIfTransient` Javadoc falsely claims "both futures are guaranteed completed" — slow-transient functional gap persists

CI status (head `90673882f8fc`)