CI: mvn install runs Testcontainers IT but the runner job-container has no Docker → chronic red CI #20

Closed
opened 2026-05-31 18:34:09 +02:00 by hibryda · 2 comments
Owner

Problem

im2be-platform-libs CI is red on every PR. The mvn install job
(.forgejo/workflows/ci.yml → job maven-install, runner label java17) runs
the full Maven lifecycle, which includes the Failsafe integration-test phase.
OutboxPublisherIT is a Testcontainers IT (it spins up a postgres:15-alpine
container and uses @EmbeddedKafka). The Forgejo Actions runner executes the job
inside the maven-node-ci:3.9-temurin17 container, which has no access to
a Docker daemon, so Testcontainers cannot start the PG container → the IT
errors → mvn install fails → PR CI red.

Evidence: PR #19 (and historically #15/#16/#18) all show CI red while the
identical mvn install (incl. the IT) is green locally (Docker present).
This is environmental (no Docker-in-job), not a code failure.

This blocks "green CI everywhere" and erodes the value of required-status
checks (a real failure would be indistinguishable from this chronic red).

Affected

  • Repo: affinity-intelligence-rework/im2be-platform-libs
  • Workflow: .forgejo/workflows/ci.yml jobs maven-install (PR) and
    maven-verify (main push — same Testcontainers dependency).
  • Runner: self-hosted forgejo-runner.service (label java17
    docker://…/maven-node-ci:3.9-temurin17), managed by im2be-mono
    scripts/forgejo-runner-setup.sh + ci/forgejo-runner-maven-node/.

Resolution options

A — give the runner job-container Docker access (preferred; keeps IT
coverage in CI).
Mount the host Docker socket into the job container so
Testcontainers can launch sibling containers: add
/var/run/docker.sock:/var/run/docker.sock to the runner's
container.valid_volumes (and have the workflow request it / set
options), plus TESTCONTAINERS_HOST_OVERRIDE / TESTCONTAINERS_RYUK_DISABLED
as needed. Runner-config change on the operator's laptop — operator-owned
(persistent infra; do not edit unilaterally — see im2be-mono session notes on
not mutating the working runner).

B — split ITs out of the PR job. Change maven-install to
mvn install -DskipITs; run ITs only in a Docker-capable stage. Loses CI-side
IT coverage on PRs (the IT still runs in the local dev loop). No runner change.

C — guard the IT to skip (not fail) when Docker is absent (fast "green now",
keeps coverage where Docker exists).
Add a Testcontainers/JUnit assumption
(assumeTrue(DockerClientFactory.instance().isDockerAvailable()) or
@EnabledIf…) so the IT is skipped on a Docker-less runner and runs
locally + on a Docker-enabled runner. Code-only PR (no operator/runner change).
Best combined with A later for full CI coverage.
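
A minimal sketch of the option C guard, assuming JUnit 5 with the Testcontainers junit-jupiter and postgresql modules on the test classpath. The class name and the test body are hypothetical and only illustrate the shape; the real OutboxPublisherIT is not reproduced here. The annotation attribute is the declarative counterpart of the assumeTrue(DockerClientFactory.instance().isDockerAvailable()) check mentioned above.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

// disabledWithoutDocker = true: when no Docker daemon is reachable, the whole
// class is reported as skipped instead of erroring during container startup.
@Testcontainers(disabledWithoutDocker = true)
class OutboxPublisherGuardSketchIT { // hypothetical name, for illustration only

    // Static @Container field: started once for the class, stopped afterwards.
    // Same image as the one named in this issue.
    @Container
    static final PostgreSQLContainer<?> POSTGRES =
            new PostgreSQLContainer<>("postgres:15-alpine");

    @Test
    void postgresIsRunningWhenDockerIsPresent() {
        // Placeholder assertion; a real IT would exercise the outbox publisher.
        assertTrue(POSTGRES.isRunning());
    }
}
```

With this in place, a Docker-less job container should report the class as skipped, while a machine with Docker runs it unchanged.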

Recommendation

Ship C now to get CI green without losing local coverage, then do A so
the IT actually runs in CI. Owner of A = operator (runner config).

Acceptance

  • im2be-platform-libs PR CI (maven-install) and main-push (maven-verify)
    report success (not red) for a clean build.
  • Either the IT runs in CI (option A) or is explicitly + visibly skipped with a
    recorded reason (option C) — never a silent red.
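
One way to make the skip reason visible in the CI log is a small preflight step that prints why Docker looks unreachable. The following is a plain-JDK sketch under stated assumptions: DockerPreflight and skipReason are hypothetical names, and the check is only a heuristic (DOCKER_HOST set, or a socket file present). It does not ping the daemon and is not how Testcontainers itself decides availability.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: reports why Docker looks unreachable from this process. */
public final class DockerPreflight {

    private DockerPreflight() {}

    /**
     * Returns a human-readable skip reason, or empty when Docker looks reachable.
     * Heuristic only: a present socket or DOCKER_HOST does not prove the daemon answers.
     */
    public static Optional<String> skipReason(Map<String, String> env, Path socket) {
        String dockerHost = env.get("DOCKER_HOST");
        if (dockerHost != null && !dockerHost.isBlank()) {
            return Optional.empty(); // a remote or custom daemon endpoint is configured
        }
        if (Files.exists(socket)) {
            return Optional.empty(); // a local socket is mounted into this environment
        }
        return Optional.of("Docker unreachable: DOCKER_HOST is unset and "
                + socket + " does not exist");
    }

    public static void main(String[] args) {
        Optional<String> reason =
                skipReason(System.getenv(), Path.of("/var/run/docker.sock"));
        System.out.println(reason
                .map(r -> "Testcontainers ITs will be SKIPPED: " + r)
                .orElse("Docker looks reachable; Testcontainers ITs should run"));
    }
}
```

Run before the Maven step, this prints a single line that records the reason next to the skipped-test count in the job log.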

Refs: PR #19 (outbox byte[]-producer fix) merged green-locally but CI-red on
this exact condition. Memora #3640 (the fix) notes the CI signature.

Author
Owner

Option C merged — CI is now GREEN (both jobs).

PR #21 (squash → main c524f89) shipped option C plus a fix for the dependency-resolution failure that was hidden behind it:

  1. Guarded all 11 reactor Testcontainers ITs (outbox-publisher, apicurio-client, processed-kafka-events, redis-outbox-backend ×8) with @Testcontainers(disabledWithoutDocker = true) → they SKIP (not error) on the Docker-less runner job-container.
  2. Added forgejo-air as a credential-free <repositories> entry in the parent pom so error-event-publisher resolves proto-observability:1.0.0 (published only to forgejo-air; anonymous read = HTTP 200) on a clean runner without ~/.m2/settings.xml.

Runner result on main:

  • mvn install (push) Successful in 1m36s
  • mvn verify (main only) (push) Successful in 1m39s

Proven beforehand via a faithful bare-container repro (maven-node-ci:3.9-temurin17, no Docker socket, no ~/.m2) → BUILD SUCCESS, 8 modules, 44 IT methods skipped.

Residual (this issue stays open for it): option A — mount the host Docker socket into the runner job-container (container.valid_volumes += /var/run/docker.sock, TESTCONTAINERS_HOST_OVERRIDE/RYUK) so the ITs actually run in CI instead of skipping. That's a persistent runner-config change on the operator's laptop → operator-owned. Until then, IT coverage is via the local dev loop (where Docker is present).

Author
Owner

Resolved — option C shipped; option A declined as net-negative. Closing.

CI is green (PR #21): the 11 Testcontainers ITs skip cleanly when no Docker is reachable and run in full locally. After inspecting the runner, option A (mount the Docker socket so the ITs also run in CI) is not worth doing on this infra:

  • Security: it gives every CI job on the runner root-equivalent host Docker access (the runner serves all 18 org webhooks).
  • Performance: capacity-1 runner — running 11 Testcontainers ITs (PG + Redis + Kafka) per platform-libs PR would slow CI for the whole org.
  • Reward: marginal — the ITs already run in full in the local dev loop; option A only adds also running them in CI.
  • Fragility: docker-out-of-docker in a containerized job needs networking tuning and risks breaking the shared runner.

Steady state: skip-in-CI (green) + local IT coverage. If a Docker-capable CI runner ever exists (e.g. a dedicated host-exec label or a separate runner), disabledWithoutDocker means the ITs will automatically start running there with no code change — so reopening is cheap if the calculus changes.

Reference
affinity-intelligence-rework/im2be-platform-libs#20