13 Commits

Author SHA1 Message Date
Juliusz Sosinowicz 044a477378 parallel-make-check.py: only require bwrap for an actual netns run
netns needs bwrap; without it commands silently share the host network
namespace and parallel network tests collide on ports. Skip the check for
--list (it inspects configs, runs nothing), hard-fail on CI so a missing
bubblewrap can't silently degrade the run, and locally just warn and fall
back to the shared namespace.
2026-06-25 13:05:35 +00:00
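
The decision described in 044a477378 can be sketched as a small pure function. This is only an illustration, not the script's code: the function names, the return labels, and the use of the GITHUB_ACTIONS environment variable to detect CI are assumptions.

```python
import os
import shutil


def bwrap_action(netns_wanted: bool, list_only: bool,
                 on_ci: bool, bwrap_found: bool) -> str:
    """Return what to do about bubblewrap before starting (hypothetical)."""
    if list_only or not netns_wanted:
        return "skip"    # --list runs nothing; without netns bwrap is unused
    if bwrap_found:
        return "netns"   # every command gets its own network namespace
    if on_ci:
        return "error"   # on CI a missing bwrap must fail loudly
    return "warn"        # locally: warn and share the host namespace


def probe_environment() -> tuple:
    # Assumption: CI is detected via GITHUB_ACTIONS, which GitHub sets to "true".
    on_ci = os.environ.get("GITHUB_ACTIONS") == "true"
    return on_ci, shutil.which("bwrap") is not None
```

Keeping the policy separate from the environment probe makes each of the four outcomes easy to exercise without installing or removing bwrap.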
Juliusz Sosinowicz c9d71d52f8 parallel-make-check.py: add generic pool extensions for arbitrary commands
Let any command ride the build/check pool, not just wolfSSL builds:
  build  false skips configure/make/check (config is just prepare+run)
  netns  true runs each command under 'bwrap --unshare-net --cap-add
         CAP_NET_ADMIN' (its own network namespace) so parallel network
         tests can't collide on ports and can configure that namespace
  shards fan a config out into N instances, each with $SHARD (1..N) and
         $SHARDS=N in its env and its own build-<name>-<k> dir, so a
         command can split its work N ways (the pool load-balances them)

Error out, rather than silently degrade, on two misconfigurations that
otherwise surface as confusing test failures: netns requested but bwrap
missing (commands would share the host namespace and collide on ports),
and config-name collisions after shard fan-out (two jobs would share a
build dir and race).
2026-06-25 09:35:13 +00:00
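
A minimal sketch of the shard fan-out and the collision check from c9d71d52f8. The helper name, the dict layout of a job, and the choice to leave unsharded configs without $SHARD/$SHARDS are assumptions made for illustration; only the build-<name>-<k> naming and the 1..N numbering come from the commit message.

```python
def fan_out(configs: list) -> list:
    """Expand each config into one job per shard (hypothetical helper)."""
    jobs = []
    seen = set()
    for cfg in configs:
        shards = int(cfg.get("shards", 1))
        for k in range(1, shards + 1):
            # Assumption: unsharded configs keep their plain name.
            name = f"{cfg['name']}-{k}" if shards > 1 else cfg["name"]
            if name in seen:
                # Two jobs with one name would share a build dir and race.
                raise ValueError(f"config name collision after fan-out: {name}")
            seen.add(name)
            env = {"SHARD": str(k), "SHARDS": str(shards)} if shards > 1 else {}
            jobs.append({"name": name, "build_dir": f"build-{name}", "env": env})
    return jobs
```

A config named "tls" with shards 2 and a separate config literally named "tls-1" would collide here, which is the kind of case the commit turns into a hard error.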
Juliusz Sosinowicz dd861af06f .github: link parallel-make-check.py annotations to the workflow file
A ::warning::/::error:: emitted with no file= property is pinned by GitHub
to the .github directory, whose blob URL is a directory listing - so the
stale-"minutes" annotations rendered with a dead source link and a line
number that points at nothing.

Derive the workflow file path from GITHUB_WORKFLOW_REF (owner/repo/path@ref)
and pass it as file= so the annotations link to the real workflow that
embeds the config list. Falls back to the previous fileless form off-CI or
when the ref is unavailable.
2026-06-16 14:53:35 +00:00
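
The path derivation in dd861af06f might look roughly like the following. Both function names are hypothetical, and the sketch skips the escaping that a property value containing ':' or ',' would need.

```python
from typing import Optional


def workflow_path(workflow_ref: Optional[str]) -> Optional[str]:
    """Extract the repo-relative file path from owner/repo/path@ref."""
    if not workflow_ref:
        return None                       # off-CI: variable is unset
    before_ref = workflow_ref.split("@", 1)[0]
    parts = before_ref.split("/", 2)      # owner, repo, path
    if len(parts) != 3 or not parts[2]:
        return None                       # unexpected shape: stay fileless
    return parts[2]


def annotation(level: str, message: str, file: Optional[str] = None) -> str:
    """Format a ::warning::/::error:: command, with file= when known."""
    if file:
        return f"::{level} file={file}::{message}"
    return f"::{level}::{message}"
```

Returning None for anything that does not parse keeps the fallback to the fileless form in one place.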
Juliusz Sosinowicz dd2f9d3ab8 CI: offload ccache/apt/buildx caches off the GitHub Actions cache
The 10 GB, LRU-evicted, PR-scoped Actions cache was being thrashed by the
docker simulator buildx layers (~6 GiB) plus per-PR ccache and apt-archive
writes whose keys never hit, which kept evicting the shared ccache. The apt
mirror also timed out often enough to break PR CI. Move the heavy caches to
ghcr (free, separate pool) and make PR runs read-only against the Actions
cache.

apt dependencies from prebuilt ghcr .deb bundles
  - ci-deps-image.yml resolves each package list under .github/ci-deps/ into
    its .deb closure and publishes ghcr.io/<owner>/wolfssl-ci-debs:<tag> in
    two tiers: <ver>-minimal (make-check family) and <ver>-full (interop
    superset), for ubuntu-22.04 and 24.04.
  - install-apt-deps gains a ghcr-debs-tag input: pull the bundle and install
    offline (--no-download) so the apt mirror is never on the PR critical
    path. Any failure (bundle missing/not public/incomplete) falls through to
    the existing apt path, so it is always safe to set.

sim-test buildx layers to a shared ghcr registry cache
  - the 7 docker simulator workflows switch from cache-to: type=gha to
    ghcr.io/wolfssl/wolfssl-sim-cache:<scope>. cache-from reads on every run
    (anonymous); cache-to writes only on the weekend cron and manual
    workflow_dispatch. Per-distinct-image tags and de-duplicated writers keep
    parallel matrix jobs from racing on one ref.

ccache: PRs read, the schedule writes
  - ccache-setup gains read-only: PR runs restore the shared master-scoped
    cache but never upload; schedule/push runs refresh it. Wired across
    os-check (linux + macOS), pq-all, smoke-test and the 12 small make-check
    workflows.
  - parallel-make-check.py gains --build-only (compile every config, skip the
    test phase) so weekday-morning seed crons warm the cache PR runs consume.

artifact retention capped at 7 days on the failure-log/result uploads that
previously defaulted to 90 days.

ONE-TIME SETUP: after their first publish, make the ghcr packages
wolfssl-ci-debs and wolfssl-sim-cache PUBLIC so anonymous pulls work from PR
(including fork) runs; until then everything falls back cleanly.
2026-06-15 22:36:35 +00:00
Juliusz Sosinowicz ea3bd56e97 parallel-make-check: fix warning-doc wording and escape every workflow command
Addresses review feedback:
- The "minutes" header comment described the check backwards (the
  estimate drifting from the measured time). Reword it to match the
  code, which warns when the measured time lands more than +/-50% away
  from the estimate.
- Centralize the GitHub workflow-command escaping in gh_escape() and
  apply it to the ::group:: title in dump() and the ::error:: summary in
  main(), not just warn(), so a config name or step carrying %, CR or LF
  cannot corrupt those commands either.
2026-06-15 11:30:00 +00:00
Juliusz Sosinowicz d9079978ed parallel-make-check: percent-encode warn() workflow-command data
A config name comes from JSON and is only checked for emptiness and a
'/', so it can carry %, CR or LF. Passed straight into the ::warning::
workflow command those would truncate the annotation or be parsed as a
second command, so escape them in the GitHub branch of warn() per
GitHub's documented command-data encoding (% first). Local output is
unchanged.
2026-06-15 11:14:06 +00:00
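
For reference, the command-data encoding that d9079978ed refers to comes down to three replacements. The function name below is hypothetical and the body is a sketch, not the script's own helper.

```python
def escape_command_data(text: str) -> str:
    """Percent-encode the data part of a GitHub workflow command."""
    # '%' must go first, otherwise the '%' of '%0D'/'%0A' would be re-encoded.
    return (text.replace("%", "%25")
                .replace("\r", "%0D")
                .replace("\n", "%0A"))
```

With this, a config name such as "a\n::error::b" stays inside one annotation instead of being parsed as a second command.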
Juliusz Sosinowicz 7b2d19ca86 parallel-make-check: warn when a job runtime drifts >50% from "minutes"
The "minutes" field is only a scheduling estimate; when it goes stale it
just packs the schedule a little worse, and there was no signal that a
value needed updating. Emit a non-fatal warning when a config that
explicitly sets "minutes" finishes more than 50% above or below it (a
GitHub ::warning:: annotation in CI, a plain line locally) and flag the
row in the step-summary table with the value to copy over.

Configs that omit "minutes" keep riding the 1.0 default and are left
alone. The warning never touches the exit status, so it cannot fail the
job.
2026-06-15 08:36:04 +00:00
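
The drift test from 7b2d19ca86 is small enough to sketch directly. The function name and the tolerance parameter are assumptions; the 50% threshold and the rule that configs without "minutes" are left alone come from the commit message.

```python
from typing import Optional


def minutes_drifted(estimate: Optional[float], measured: float,
                    tolerance: float = 0.5) -> bool:
    """True when the measured runtime is more than 50% off the estimate."""
    if estimate is None:
        return False    # "minutes" omitted: nothing to compare against
    return abs(measured - estimate) > tolerance * estimate
```

Because the result only feeds a warning, a caller would print or annotate on True and leave the exit status untouched.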
Juliusz Sosinowicz 6d1d750ad3 parallel-make-check: reserved names, type hints, readability
- Reject the config names "aux" and "test": build-aux/ is autotools'
  aux-script dir and build-test/ a legacy build dir, neither the
  script's to wipe and rebuild over.
- Add type hints throughout.
- Reword the shard-partition comment (the LPT bound was unparseable)
  and replace the zip-over-pool.map result pairing with a run_one()
  helper so the pool returns complete result rows.
2026-06-12 13:39:28 +02:00
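
A sketch of the name checks mentioned in 6d1d750ad3 and d9079978ed (non-empty, no '/', and not one of the reserved names). The function name, the constant, and the error wording are hypothetical.

```python
RESERVED_NAMES = {"aux", "test"}   # build-aux/ and build-test/ are not ours


def check_config_name(name: str) -> str:
    """Validate a config name before it becomes a build-<name> directory."""
    if not name:
        raise ValueError("config name must not be empty")
    if "/" in name:
        raise ValueError(f"config name must not contain '/': {name!r}")
    if name in RESERVED_NAMES:
        raise ValueError(f"config name is reserved: {name!r}")
    return name
```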
Juliusz Sosinowicz 85d3bc2380 parallel-make-check: drop the --jobs option
wolfSSL's configure enables make's jobserver by default
(AX_AM_JOBSERVER([yes]) -> AM_MAKEFLAGS += -j<nproc+1> in aminclude.am),
and automake passes that explicit -j to every recursive sub-make, where
it overrides the invoking make's job limit. The script's -j therefore
only ever scheduled the outermost recursion hop: --jobs was inert.

Measured on a 4-CPU host with 10 build-only configs oversaturating the
worker pool, the jobserver default is also the better policy: capping
sub-makes via --disable-jobserver and -j2 dropped CPU utilization from
96% to 89% and lengthened the wall time, because configs' serial
phases (configure, link) stopped being backfilled by other configs'
compile jobs. So make is now invoked with no -j at all - parallelism
within a config comes from the configure-default jobserver - and the
misleading knob is gone, including the macOS job's --jobs 3.
2026-06-12 09:47:14 +00:00
Juliusz Sosinowicz 1f6abed28e CI hardening: stamp under set -e, SIGKILL escalation for fail-fast
Third Copilot review round:

- Makefile.am: run the test-data stamp recipe body under set -e. A
  failed symlink mid-loop previously did not fail the compound command
  (only the last command's status counted), so a partially-populated
  build tree could be stamped complete. Now any failed setup command
  aborts the recipe and the stamp is not created.

- parallel-make-check.py: fail-fast sent SIGTERM only, so a test that
  traps or ignores SIGTERM could keep the job alive until the workflow
  timeout. abort_others() now polls the swept processes and SIGKILLs
  whatever is still alive after a 10 s grace period, and the
  post-registration race-window kill escalates the same way (bounded
  wait, then SIGKILL). Verified with a config running
  "trap '' TERM; sleep 300": the run completes in ~10 s with the
  stubborn config reported as aborted and no surviving processes.
2026-06-12 09:47:13 +00:00
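
The SIGTERM-then-SIGKILL escalation in 1f6abed28e can be illustrated with plain subprocess calls. This is a simplified sketch for POSIX systems: it signals individual processes rather than process groups, and the function name is hypothetical.

```python
import subprocess
import time


def stop_all(procs: list, grace: float = 10.0) -> None:
    """Ask every live process to exit, then kill whatever ignores the request."""
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()                  # SIGTERM: polite request
    deadline = time.monotonic() + grace       # one shared grace period
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()                       # SIGKILL: cannot be trapped
            proc.wait()                       # reap so no zombie is left
```

Sharing one deadline across all processes bounds the total wait to the grace period, rather than one grace period per stubborn process.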
Juliusz Sosinowicz eb59a12b36 parallel-make-check: close the fail-fast race, contain callable errors
Two fixes from the second Copilot review round:

A process spawned between abort_others()' live_procs snapshot and its
registration escaped the kill sweep, leaving that build/check running
to completion after fail-fast had begun. Re-check stop_event right
after registering the process and SIGTERM its process group if the
abort already started: either the registration happened before the
sweep's snapshot (the sweep kills it) or it happened after stop_event
was set (the re-check sees it), so the window is closed.

Exceptions from callable steps (user_settings staging, private-dir
copies) used to escape the worker thread and crash the whole script
with no summary. They are now recorded as that config's step failure
with the exception written to its make-check.log, e.g. a bad
"user_settings" path reports FAIL (stage <path>) while the other
configs keep running; the fail-fast bookkeeping is shared with the
nonzero-exit path via record_failure().
2026-06-12 09:47:13 +00:00
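
The ordering argument in eb59a12b36 (register first, then re-check the stop event) can be shown with a small model. The class and method names are hypothetical, and the sketch returns the processes to kill instead of signalling them.

```python
import threading


class LiveProcs:
    """Toy model of the registration/abort handshake (not the script's code)."""

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.live = set()
        self.stop_event = threading.Event()

    def register(self, proc) -> bool:
        """Add proc; return True if the caller must kill it itself."""
        with self.lock:
            self.live.add(proc)
        # Re-check AFTER registering: if the abort began, the sweep may have
        # taken its snapshot before we were added.
        return self.stop_event.is_set()

    def abort_others(self) -> list:
        """Set the stop flag BEFORE snapshotting, then return the sweep list."""
        self.stop_event.set()
        with self.lock:
            return list(self.live)
```

Either the registration precedes the snapshot, so the sweep sees the process, or it follows the snapshot and therefore also follows stop_event.set(), so register() reports that the caller has to kill it.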
Juliusz Sosinowicz a62884599b CI review fixes: JSON validation, log volume, rm -rf, flag spelling
Address the Copilot review:
- parallel-make-check.py: validate "configure" (list of strings) and
  cflags/ldflags (strings) so a malformed entry fails the load instead
  of exploding a string into per-character configure arguments; print
  a single line for passing configs instead of dumping their full
  make-check.log into the CI log (failure dumps unchanged; the logs
  remain in build-<name>/ for the failure artifacts).
- Makefile.am: use rm -rf for the certs/input/quit setup and distclean
  cleanup. A --private-dir run replaces the certs symlink with a
  private directory copy that rm -f cannot remove (verified: make
  distclean in a build dir with a privatized certs/ now succeeds and
  removes it).
- psk.yml, disable-pk-algs.yml: normalize the single-dash tokens
  (-disable-rsa, -disable-ecc, -disable-aescbc, -enable-cryptonly)
  carried verbatim from the old matrices to the canonical double-dash
  form. No coverage change: configure honors single-dash spellings
  (verified -disable-rsa sets NO_RSA with no unrecognized-option
  warning), so these were always in effect; both touched configs
  re-validated end-to-end.

The --cc default stays "ccache gcc": ccache resolves the compiler
through its own masquerade symlinks (verified: no recursion and normal
cache hits with /usr/lib/ccache prepended to PATH), and the explicit
CC= also covers jobs that use ccache without the PATH masquerade.
2026-06-12 09:47:13 +00:00
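
The type validation from a62884599b guards against a JSON string being treated as a sequence of arguments. A minimal sketch, with a hypothetical function name and error wording:

```python
def validate_entry(entry: dict) -> None:
    """Reject malformed "configure"/cflags/ldflags values at load time."""
    configure = entry.get("configure", [])
    if not isinstance(configure, list) or not all(
            isinstance(arg, str) for arg in configure):
        raise ValueError('"configure" must be a list of strings')
    for key in ("cflags", "ldflags"):
        if key in entry and not isinstance(entry[key], str):
            raise ValueError(f'"{key}" must be a string')


# Why it matters: extending an argv list with a bare string adds one
# argument per character.
argv = ["./configure"]
argv.extend("--enable-all")
```

After the last two lines argv holds 13 entries, "./configure" followed by "-", "-", "e", and so on, which is the per-character explosion the commit message describes.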
Juliusz Sosinowicz 3a6c31a51e CI: pool the per-config runner matrices into parallel make-check jobs
Replace the one-runner-per-configuration matrices across the
make-check workflow family with a generic pooled runner,
.github/scripts/parallel-make-check.py. Each workflow keeps its
configuration list as JSON next to the invocation; one runner (or a
small fixed set of shards, balanced by measured per-config minutes)
builds every config in its own out-of-tree (VPATH) build directory off
a single checkout/autogen, on a pool of one-per-CPU worker threads,
longest first. Concurrent checks are isolated with bubblewrap network
namespaces, compilations are cached with ccache, the first failure
aborts the rest (fail-fast, with --no-fail-fast to run everything),
and per-config timings plus pool efficiency land in the step summary.
Failure logs upload as artifacts. smoke-test.yml is likewise reworked
into a single pooled job that runs its nine configs on one runner.

Converted workflows (runner jobs per full pass):
  os-check.yml             101 -> 8  (92 Ubuntu configs -> 4 shards;
                           the macOS matrix, the user-settings jobs and
                           the standalone
                           macos-apple-native-cert-validation.yml fold
                           into one macOS runner; Windows unchanged)
  pq-all.yml                21 -> 2 shards
  disable-pk-algs.yml       15 -> 1
  wolfCrypt-Wconversion.yml 11 -> 1
  trackmemory.yml            7 -> 1
  cryptocb-only.yml          8 -> 1  (incl. the two new SHA512 entries)
  multi-compiler.yml         6 -> 1
  smallStackSize.yml         6 -> 1
  multi-arch.yml             6 -> 1
  async.yml                  5 -> 1
  psk.yml                    5 -> 1
  no-malloc.yml              3 -> 1
  wolfsm.yml                 3 -> 1
  opensslcoexist.yml         2 -> 1

Measured against current upstream passing runs (job execution time,
queue excluded): ~200 runner jobs / ~374 runner-minutes per full pass
become 23 jobs / ~168 runner-minutes, with more coverage than before.
multi-arch's old matrix combined an "include" list of four
architectures with an "opts" axis; GitHub's include-merge rules made
each arch entry overwrite the previous one, so only the armel
combinations actually ran. The pooled list restores the intended
aarch64/armhf/riscv64 coverage (23 combinations; riscv64 x sp-math is
omitted as invalid - configure rejects sp-math without SP, and
--enable-riscv-asm, unlike --enable-sp-asm, does not bring SP in).

Out-of-tree build fixes this depends on:
- Makefile.am: symlink the read-only test data (certs/, tests/ config
  files, sniffer captures and helpers, examples/crypto_policies,
  input, quit) into the build tree via a BUILT_SOURCES stamp, removed
  again in distclean-local. ChangeToWolfRoot() and the script tests
  resolve everything relative to the working directory, so out-of-tree
  make check and make distcheck now pass.
- scripts/multi-msg-record.py: locate the client binary from the build
  tree working directory rather than the script's source directory.
- configure.ac + wolfssl/include.am: run
  support/gen-debug-trace-error-codes.sh from $srcdir; it reads the
  error-code headers from the source tree and generates into the build
  tree.
- tests/swdev: a WOLFBUILD variable points the sub-make at the build
  tree for the configure-generated headers (wolfssl/options.h,
  wolfssl/version.h); the in-tree-only guards are dropped.

Portions of PR #10649 are incorporated: the cross-platform
ccache-setup composite action, repository_owner gates on check-headers
and check-source-text, the docs-only paths-ignore on os-check, and the
libspdm timeout bumps.
2026-06-12 09:47:13 +00:00
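
The "balanced by measured per-config minutes" and "longest first" wording in 3a6c31a51e suggests a longest-processing-time (LPT) style greedy split. The following is a sketch of that general technique, not the script's actual partitioning code; the function name and the tie-break by name are assumptions.

```python
import heapq


def partition(minutes: dict, shards: int) -> list:
    """Greedy LPT: give each config, longest first, to the lightest shard."""
    heap = [(0.0, index) for index in range(shards)]   # (load, shard index)
    heapq.heapify(heap)
    result = [[] for _ in range(shards)]
    # Longest first; ties broken by name so the split is deterministic.
    for name in sorted(minutes, key=lambda n: (-minutes[n], n)):
        load, index = heapq.heappop(heap)
        result[index].append(name)
        heapq.heappush(heap, (load + minutes[name], index))
    return result
```

For estimates of 5, 4, 3, 3 and 1 minutes split two ways, this places the 5 and one 3 on the first shard and the rest on the second, so both shards carry 8 minutes.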