feat(observability): wire ADR-0015 durable error-event stream (#332) #6
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "feat/identity-service-error-event-stream-332"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Header
identity-service · PR (R0/open) · ADR-0015 durable error-event stream (#332) · base
mainTL;DR
NEEDS_REVIEW— wiresobservability.v1.ErrorEvent→im2be.error.eventvia the platform-libserror-event-publisher(REQUIRES_NEWoutbox write) on top of the existing span event + counter, default OFF behindim2be.error-event.enabled. Load-bearing per-service deviation: a JPA exclusion on the new dep keeps Redis-only identity-service bootable.Summary
Replicates the review-approved user-service reference (merged to
dev) for the SB-3.3.13 / Redis-outbox variant.OtelErrorEventsgains explicit-RecoveryAction(D-7) overloads ofrecord/recordPropagated/recordSanitizedEvent(String)/recordSanitizedEvent(Throwable)plus therecordSanitizedEventInternal(spanErrorType, durableErrorClass, …)FQN-split core (spanerror.type+ counter stay simple-name; durableerror_classis the FQN). A newOtelErrorEventsBridge(@ConditionalOnProperty, not@ConditionalOnBean— ordering trap) wires the publisher in when the flag is on. Commitd4834ae. Verified at HEAD:mvn -B compile test-compile0 warnings;mvn -B test149/149 green (8 new).Findings
No findings from the author — net-new wiring. Self-flagged review points:
pom.xmlerror-event-publishercarries aspring-boot-starter-data-jpaexclusion — identity is Redis-only; the publisher's transitiveoutbox-publisherdeclares JPA at compile scope, which would activate DataSource/JPA auto-config and fail startup. Mirrors the identical exclusion already onredis-outbox-backend.OtelErrorEvents.javaemitDurableErrorEventhas no try/catch by design —publish()is the best-effort boundary andbuildErrorEventis fully null-safe; a blanket catch would feed the NoBlanketCatch ratchet.OtelErrorEvents.javamessageis the sanitized string or the stable error code — never raw exception text (Rule 01).OtelErrorEvents.javarecordEventOnly(identity-specific W3 extra, no Throwable) is intentionally NOT wired to the durable stream — outside the reference pattern's three families.Verdict
NEEDS_REVIEW. Default-OFF flag → zero behaviour change until per-service rollout (post topic-CRkubectl applygate). Conditions: none carried forward. DO NOT MERGE until reviewed.Footer
identity-service • #332 ADR-0015 • R0/open • new=0 carried=0 • SB 3.3.13 / Redis-outbox variant • 2026-05-31
In addition to the existing OtelErrorEvents span event + identity_errors_total counter, OtelErrorEvents now also serializes observability.v1.ErrorEvent and publishes it to im2be.error.event via the platform-libs error-event-publisher (OutboxPublisher write in a REQUIRES_NEW tx, ADR-0015 D-3) so a domain error survives the rolling-back business transaction it was caught in. Default OFF behind im2be.error-event.enabled (ADR-0015 migration order); span event + counter behaviour is byte-for-byte unchanged when the flag is off. Wiring: - OtelErrorEvents: static volatile ErrorEventPublisher + setErrorEventPublisher; cleared in resetForTesting(). Explicit-RecoveryAction overloads (D-7) of record / recordPropagated / recordSanitizedEvent(String) / recordSanitizedEvent(Throwable). New private recordSanitizedEventInternal (spanErrorType, durableErrorClass, ...) FQN-split core: the span error.type + counter stay the SIMPLE name (uniform with the other record* paths), while the durable error_class carries the FQN so the error-event stream buckets the same class consistently regardless of which path emitted it. emitDurableErrorEvent is a no-op when no publisher is wired; otherwise publishes buildErrorEvent — NO try/catch (publish() is the best-effort boundary and buildErrorEvent is fully null-safe, so a blanket catch would only feed the NoBlanketCatch ratchet). - buildErrorEvent: service_name=INSTRUMENTATION_NAME, error_class=getName() FQN, PII-safe message = sanitized string ?: stable error code (NEVER the raw exception text — Rule 01), operation=component, trace/span ids from the span (or Span.current()) only when the context is valid, request_id="", occurred_at=Instant.now(), explicit recovery_action, all strings null-coalesced. - OtelErrorEventsBridge: @Component @ConditionalOnProperty(im2be.error-event, enabled=true) → OtelErrorEvents.setErrorEventPublisher; @PreDestroy clears. @ConditionalOnProperty NOT @ConditionalOnBean — a component-scanned bean's conditions evaluate before the platform-libs autoconfig registers the publisher, so @ConditionalOnBean would silently never wire (ordering trap); @ConditionalOnProperty reads the Environment, which is order-safe. Dependencies (pom.xml): - error-event-publisher (${platform-libs.version}=1.1.0-SNAPSHOT) WITH a spring-boot-starter-data-jpa exclusion — identity-service is Redis-ONLY and the publisher's transitive outbox-publisher declares the JPA starter at compile scope; without the exclusion JPA returns to the classpath and Spring Boot's DataSource/JPA auto-configs activate and FAIL at startup (no datasource). This mirrors the identical exclusion already on the redis-outbox-backend dep. - proto-observability (new property proto-observability.version=1.0.0, fixed release per D-5) for the ErrorEvent / RecoveryAction generated types. - The error-event autoconfig is @ConditionalOnBean(OutboxPublisher.class): identity-service supplies OutboxPublisher via the Redis backend, so the same backend-agnostic facade backs the REQUIRES_NEW error-event write. Config (application.properties): im2be.error-event.enabled (default false) + im2be.error-event.topic=im2be.error.event. im2be.outbox.enabled is already true. Test: OtelErrorEventsDurableStreamTest (8 tests) — no-publisher no-op; wired field + trace mapping; PII guard (durable message == error code, not raw text); explicit RecoveryAction; recordPropagated FQN + code fallback; recordSanitizedEvent(Throwable) FQN durable class while span stays simple; recordSanitizedEvent(Throwable, RecoveryAction) (D-7); resetForTesting clears. Uses TicketMintFailedException as the concrete IdentityError and the in-tree @BeforeAll installSdk() harness pattern (no shared OtelTestHarness exists). Verification: - JAVA_HOME=temurin-17; mvn -B compile test-compile → 0 [WARNING] (rule 62) - mvn -B test → 149/149 green (8 new); NoBlanketCatch path untouched.hib-pr-reviewer review — PR #6 (affinity-intelligence-rework/im2be-identity-service)
Round 1 — head
d4834ae95861, basemain, triggeropenedTL;DR: CONDITIONAL_APPROVE — kept 2 minor findings (recordEventOnly durable gap verified in code; proto-observability Enforcer gap confirmed by grep); no blocking issues.
Summary
Arbitration Round 1 — PR #6 (feat: ADR-0015 durable error-event stream)
Prior run history: None (first arbitration for this PR).
Reviewer A produced 1 JSON finding (pom.xml:111, minor) and mentioned a second gap (
recordEventOnly) in prose only. Reviewer B produced 0 JSON findings but also prose-mentioned therecordEventOnlygap.Verification performed
recordEventOnlydurable emission gap — ReadOtelErrorEvents.javain full (727 lines) and Grep for allemitDurableErrorEventcall-sites. Confirmed:emitDurableErrorEventis called from therecordfull-fidelity overload (line 258),recordPropagatedfull-fidelity overload (line 368), andrecordSanitizedEventInternal(line 561). TherecordEventOnlymethod (lines 580–608) ends after the counter increment with noemitDurableErrorEventcall. The class-level Javadoc at line 103 explicitly listsrecordEventOnlyin the durable-stream family. Both reviewers independently flagged this in prose; code confirms it is real → kept (Rule 2 / Rule 4 — own verification succeeds).No Enforcer guard on proto-observability version — Grep for
enforcer|requireUpperBoundDeps|dependencyConvergenceinpom.xmlreturned zero matches, confirming A's finding that no CI guard exists. Comment at lines 107–110 documents the risk but provides no build-time enforcement → kept (Rule 2 — unique-to-A, verified confirms).Outcome: 2 findings kept (both minor), no blockers. CONDITIONAL_APPROVE.
Blast Radius
The diff is concentrated in the observability layer (
OtelErrorEvents,OtelErrorEventsBridge,application.properties) pluspom.xml.OtelErrorEventsis a static utility called from every catch block in the service, so therecordEventOnlygap affects all validation-rejection paths. The feature flag (im2be.error-event.enabled, default off) constrains blast radius: no behaviour changes until the flag is turned on.BLAST_SCORE: 4/10
Risk Indicators
emitDurableErrorEvent,setErrorEventPublisher,buildErrorEvent,OtelErrorEventsBridgeCI status (head
d4834ae95861)No CI checks reported for this commit.
Findings (1)
[MINOR] No CI guard against
proto-observability/error-event-publisherversion driftpom.xml:111
Line 111 pins
<proto-observability.version>1.0.0</proto-observability.version>as a bare<properties>entry used directly in<dependencies>, making it the Maven "nearest definition" forproto-observability. The inline comment (lines 107–110) correctly documents the downgrade risk:However, no build-time check enforces alignment. Grep for
enforcer,requireUpperBoundDeps, anddependencyConvergenceinpom.xmlreturns zero matches. Whenplatform-libs.versionis bumped in a future PR anderror-event-publisherpulls a higherproto-observabilitytransitively, the mismatch is invisible at compile time and surfaces as aNoSuchMethodErrorat runtime.Concrete fix (either approach):
requireUpperBoundDepsexecution:proto-observabilityinto<dependencyManagement>sourced fromerror-event-publisher's BOM (if one exists), so the two versions are mechanically tied rather than manually synchronized. If no BOM exists, the Enforcer rule is the pragmatic path.Verdict
CONDITIONAL_APPROVE
hib-pr-reviewer • round 1 • 1 finding (1m) • 2026-05-30T23:35:59.569Z → 2026-05-30T23:38:03.260Z • posted-as: pr-reviewer-bot • model: auto