CI: pool the per-config runner matrices into parallel make-check jobs

Replace the one-runner-per-configuration matrices across the
make-check workflow family with a generic pooled runner,
.github/scripts/parallel-make-check.py. Each workflow keeps its
configuration list as JSON next to the invocation; one runner (or a
small fixed set of shards, balanced by measured per-config minutes)
builds every config in its own out-of-tree (VPATH) build directory off
a single checkout/autogen, on a pool of one-per-CPU worker threads,
longest first. Concurrent checks are isolated with bubblewrap network
namespaces, compilations are cached with ccache, the first failure
aborts the rest (fail-fast, with --no-fail-fast to run everything),
and per-config timings plus pool efficiency land in the step summary.
Failure logs upload as artifacts. smoke-test.yml is likewise reworked
into a single pooled job that runs its nine configs on one runner.

Converted workflows (runner jobs per full pass):
  os-check.yml             101 -> 8  (92 Ubuntu configs -> 4 shards;
                           the macOS matrix, the user-settings jobs and
                           the standalone
                           macos-apple-native-cert-validation.yml fold
                           into one macOS runner; Windows unchanged)
  pq-all.yml                21 -> 2 shards
  disable-pk-algs.yml       15 -> 1
  wolfCrypt-Wconversion.yml 11 -> 1
  trackmemory.yml            7 -> 1
  cryptocb-only.yml          8 -> 1  (incl. the two new SHA512 entries)
  multi-compiler.yml         6 -> 1
  smallStackSize.yml         6 -> 1
  multi-arch.yml             6 -> 1
  async.yml                  5 -> 1
  psk.yml                    5 -> 1
  no-malloc.yml              3 -> 1
  wolfsm.yml                 3 -> 1
  opensslcoexist.yml         2 -> 1

Measured against current upstream passing runs (job execution time,
queue excluded): ~200 runner jobs / ~374 runner-minutes per full pass
become 23 jobs / ~168 runner-minutes, with more coverage than before.
multi-arch's old matrix combined an "include" list of four
architectures with an "opts" axis; GitHub's include-merge rules made
each arch entry overwrite the previous one, so only the armel
combinations actually ran. The pooled list restores the intended
aarch64/armhf/riscv64 coverage (23 combinations; riscv64 x sp-math is
omitted as invalid - configure rejects sp-math without SP, and
--enable-riscv-asm, unlike --enable-sp-asm, does not bring SP in).

Out-of-tree build fixes this depends on:
- Makefile.am: symlink the read-only test data (certs/, tests/ config
  files, sniffer captures and helpers, examples/crypto_policies,
  input, quit) into the build tree via a BUILT_SOURCES stamp, removed
  again in distclean-local. ChangeToWolfRoot() and the script tests
  resolve everything relative to the working directory, so out-of-tree
  make check and make distcheck now pass.
- scripts/multi-msg-record.py: locate the client binary from the build
  tree working directory rather than the script's source directory.
- configure.ac + wolfssl/include.am: run
  support/gen-debug-trace-error-codes.sh from $srcdir; it reads the
  error-code headers from the source tree and generates into the build
  tree.
- tests/swdev: a WOLFBUILD variable points the sub-make at the build
  tree for the configure-generated headers (wolfssl/options.h,
  wolfssl/version.h); the in-tree-only guards are dropped.

Portions of PR #10649 are incorporated: the cross-platform
ccache-setup composite action, repository_owner gates on check-headers
and check-source-text, the docs-only paths-ignore on os-check, and the
libspdm timeout bumps.
This commit is contained in:
Juliusz Sosinowicz
2026-06-11 17:13:58 +00:00
parent 351b775fd4
commit 3a6c31a51e
31 changed files with 2641 additions and 850 deletions
+432
View File
@@ -0,0 +1,432 @@
#!/usr/bin/env python3
# Build and "make check" a set of configurations, each in its own out-of-tree
# (VPATH) build directory, on a pool of worker threads (default: one per
# CPU); each thread takes the next pending config as soon as it is free.
# The final summary reports how efficiently the pool used the machine
# (thread occupancy and CPU utilization).
#
# The configurations come from a JSON file ("-" for stdin): a list of
# objects, one per configuration. Recognized keys, all optional except
# "name" (unknown keys are an error, so typos do not pass silently):
#
# name unique identifier; the config builds in build-<name>/
# configure list of extra ./configure arguments
# cc compiler passed to configure as CC=, overriding --cc
# ("" leaves CC entirely to configure / the environment)
# cflags CFLAGS for make, overriding --cflags
# ldflags LDFLAGS for make, overriding --ldflags
# minutes expected duration, from the Minutes column of a previous
# run's summary (default 1.0). Schedule weight only - configs
# run longest-first and --shard balances shards by it; a stale
# value just packs the schedule a little worse.
# user_settings header staged as <builddir>/user_settings.h before
# configure (path relative to the source root); pair it with
# --enable-usersettings in "configure"
# check false skips the make-check phase entirely (default true)
# prepare list of argv lists run in the build dir before configure
# run list of argv lists run in the build dir after the build and
# checks, e.g. [["wolfcrypt/test/testwolfcrypt"]]
# comment ignored; JSON has no comment syntax, so notes go here
#
# For example:
#
# [
# {"name": "default"},
# {"name": "all-asan", "configure": ["--enable-all"],
# "cflags": "-fsanitize=address", "ldflags": "-fsanitize=address"}
# ]
#
# Driven by CI workflows, which keep their config lists next to the
# invocation (see .github/workflows/smoke-test.yml), but also runnable
# locally - copy the JSON block out of the workflow into a file:
#
# .github/scripts/parallel-make-check.py configs.json # all configs
# .github/scripts/parallel-make-check.py configs.json default all-asan
# .github/scripts/parallel-make-check.py --list configs.json
#
# Concurrent "make check" runs are safe because the test scripts re-exec
# themselves under "bwrap --unshare-net" when bubblewrap is installed (one
# network namespace each) and the remaining test outputs land in the build
# directory; see --private-dir for the exception.
#
# The first failing config aborts the others (pending configs are skipped,
# in-flight ones are killed) so CI fails fast; pass --no-fail-fast to run
# everything and report every failure.
import argparse
import json
import os
import shutil
import signal
import subprocess
import sys
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from pathlib import Path
# cflags/ldflags are applied at make time only (never to ./configure) so
# autoconf feature detection is not poisoned by benign warnings in
# conftest probes. They are omitted entirely when empty so a plain config
# keeps the configure-chosen defaults.
@dataclass
class Config:
name: str
configure: list = field(default_factory=list)
cc: str = ""
cflags: str = ""
ldflags: str = ""
minutes: float = 1.0
user_settings: str = ""
check: bool = True
prepare: list = field(default_factory=list)
run: list = field(default_factory=list)
SRCDIR = Path(__file__).resolve().parents[2]
ON_GITHUB = os.environ.get("GITHUB_ACTIONS") == "true"
print_lock = threading.Lock()
# Fail-fast state: the first failure sets stop_event (under fail_lock, so
# exactly one config is reported as the origin) and kills the other
# workers' in-flight process groups.
stop_event = threading.Event()
fail_lock = threading.Lock()
live_procs = set()
procs_lock = threading.Lock()
def abort_others():
# Every subprocess starts its own session, so killing the process
# group takes down the whole make/test tree under it.
with procs_lock:
procs = list(live_procs)
for p in procs:
try:
os.killpg(p.pid, signal.SIGTERM)
except (ProcessLookupError, PermissionError):
pass
def nproc():
# Like nproc(1): CPUs usable by this process, falling back to all online.
try:
return len(os.sched_getaffinity(0))
except AttributeError:
return os.cpu_count() or 1
def load_configs(opts, error):
try:
if opts.json == "-":
entries = json.load(sys.stdin)
else:
entries = json.loads(Path(opts.json).read_text())
except (OSError, ValueError) as e:
error(f"{opts.json}: {e}")
if not isinstance(entries, list):
error(f"{opts.json}: expected a JSON list of config objects")
configs = []
for entry in entries:
if not isinstance(entry, dict):
error(f"{opts.json}: config entries must be objects: {entry!r}")
unknown = set(entry) - {"name", "configure", "cc", "cflags",
"ldflags", "minutes", "user_settings",
"check", "prepare", "run", "comment"}
if unknown:
error(f"{opts.json}: unknown key(s) in {entry.get('name', entry)!r}: "
f"{' '.join(sorted(unknown))}")
name = entry.get("name")
if not isinstance(name, str) or not name or "/" in name:
error(f"{opts.json}: every config needs a \"name\" usable as a "
f"directory suffix: {entry!r}")
if any(cfg.name == name for cfg in configs):
error(f"{opts.json}: duplicate config name {name!r}")
minutes = entry.get("minutes", 1.0)
if isinstance(minutes, bool) or not isinstance(minutes, (int, float)) \
or minutes < 0:
error(f"{opts.json}: \"minutes\" must be a non-negative number "
f"in {name!r}")
user_settings = entry.get("user_settings", "")
if not isinstance(user_settings, str):
error(f"{opts.json}: \"user_settings\" must be a path string "
f"in {name!r}")
check = entry.get("check", True)
if not isinstance(check, bool):
error(f"{opts.json}: \"check\" must be a boolean in {name!r}")
cc = entry.get("cc", opts.cc or "")
if not isinstance(cc, str):
error(f"{opts.json}: \"cc\" must be a string in {name!r}")
for key in ("prepare", "run"):
cmds = entry.get(key, [])
if not (isinstance(cmds, list)
and all(isinstance(cmd, list) and cmd
and all(isinstance(a, str) for a in cmd)
for cmd in cmds)):
error(f"{opts.json}: \"{key}\" must be a list of argv lists "
f"in {name!r}")
configs.append(Config(name, list(entry.get("configure", [])), cc,
entry.get("cflags", opts.cflags),
entry.get("ldflags", opts.ldflags),
float(minutes), user_settings, check,
list(entry.get("prepare", [])),
list(entry.get("run", []))))
if not configs:
error(f"{opts.json}: no configs")
return configs
def privatize_dirs(bdir, dirs):
# Replace build-tree symlinks into the source tree with private
# per-build-dir copies: tests that write into these directories would
# otherwise write through the symlink into the shared source tree and
# race with the other parallel checks. Runs after the build steps so
# that build rules which (re)create the symlinks have already run.
for name in dirs:
d = bdir / name
if d.is_symlink():
d.unlink()
shutil.copytree(SRCDIR / name, d, symlinks=True)
def dump(title, path):
print(f"::group::{title}" if ON_GITHUB else f"==== {title} ====")
try:
sys.stdout.write(path.read_text(errors="replace"))
except OSError as e:
print(e)
if ON_GITHUB:
print("::endgroup::")
sys.stdout.flush()
def run_config(cfg, opts):
if opts.fail_fast and stop_event.is_set():
return "aborted", 0.0
bdir = SRCDIR / f"build-{cfg.name}"
if bdir.exists():
shutil.rmtree(bdir)
bdir.mkdir()
configure = [str(SRCDIR / "configure")] + cfg.configure
if cfg.cc:
configure.append(f"CC={cfg.cc}")
flags = [f"CFLAGS={cfg.cflags}"] if cfg.cflags else []
flags += [f"LDFLAGS={cfg.ldflags}"] if cfg.ldflags else []
make = ["make", f"-j{opts.jobs}"] + flags
steps = []
if cfg.user_settings:
# Staged before configure; --enable-usersettings builds pick it up
# from the build dir via the default include path.
steps.append((f"stage {cfg.user_settings}",
lambda: shutil.copy(SRCDIR / cfg.user_settings,
bdir / "user_settings.h")))
steps += [(" ".join(cmd), cmd) for cmd in cfg.prepare]
steps += [("configure", configure), ("make", make)]
if cfg.check:
steps += [
# Prebuild the check programs without running any tests so
# "make check" below is pure test execution.
("make check TESTS=", make + ["check", "TESTS="]),
("private dirs", lambda: privatize_dirs(bdir, opts.private_dir)),
("make check", ["make"] + flags + ["check"]),
]
steps += [(" ".join(cmd), cmd) for cmd in cfg.run]
failed = None
start = time.monotonic()
log = bdir / "make-check.log"
with open(log, "w") as logf:
for step, cmd in steps:
if opts.fail_fast and stop_event.is_set():
failed = "aborted"
break
if callable(cmd):
cmd()
continue
print(f"+ {' '.join(cmd)}", file=logf, flush=True)
# stdin=DEVNULL so a test that reads stdin sees EOF (as in CI)
# instead of blocking forever on an interactive/socket stdin.
proc = subprocess.Popen(cmd, cwd=bdir, stdout=logf,
stderr=subprocess.STDOUT,
stdin=subprocess.DEVNULL,
start_new_session=True)
with procs_lock:
live_procs.add(proc)
try:
rc = proc.wait()
finally:
with procs_lock:
live_procs.discard(proc)
if rc != 0:
if opts.fail_fast:
# The first failure wins; any nonzero exit after the
# abort began was most likely our SIGTERM.
with fail_lock:
failed = "aborted" if stop_event.is_set() else step
stop_event.set()
if failed != "aborted":
abort_others()
else:
failed = step
break
minutes = (time.monotonic() - start) / 60
with print_lock:
if failed == "aborted":
print(f"{cfg.name}: aborted (fail-fast) [{minutes:.1f} min]")
sys.stdout.flush()
else:
verdict = f"FAIL ({failed})" if failed else "pass"
dump(f"{cfg.name}: {verdict} [{minutes:.1f} min]", log)
if failed == "configure":
dump(f"{cfg.name}: config.log", bdir / "config.log")
elif failed == "make check":
dump(f"{cfg.name}: test-suite.log", bdir / "test-suite.log")
return failed, minutes
def summarize(results, wall_min, cpu_min, nthreads):
lines = ["| Config | Result | Minutes |", "|---|---|---|"]
for cfg, failed, minutes in results:
if failed == "aborted":
ok = ":heavy_minus_sign: aborted (fail-fast)"
elif failed:
ok = f":x: FAIL ({failed})"
else:
ok = ":white_check_mark: pass"
lines.append(f"| {cfg.name} | {ok} | {minutes:.1f} |")
# Two views of how efficiently the pool used the machine: thread
# occupancy is the time the workers spent running configs out of the
# thread-minutes available (a long config left for last idles the other
# workers and drags it down); CPU utilization is the CPU time the build
# and test children actually consumed out of the CPU-minutes available
# (too-shallow make -j and serial test phases show up here).
busy_min = sum(minutes for _, _, minutes in results)
ncpu = nproc()
lines += [
"",
f"{len(results)} configs in {wall_min:.1f} min on {nthreads} "
f"threads / {ncpu} CPUs: "
f"thread occupancy {100 * busy_min / (wall_min * nthreads):.0f}% "
f"({busy_min:.1f} of {wall_min * nthreads:.1f} thread-min), "
f"CPU utilization {100 * cpu_min / (wall_min * ncpu):.0f}% "
f"({cpu_min:.1f} of {wall_min * ncpu:.1f} CPU-min)",
]
table = "\n".join(lines)
print(table)
summary = os.environ.get("GITHUB_STEP_SUMMARY")
if summary:
with open(summary, "a") as f:
print(f"### make check\n\n{table}", file=f)
def main():
p = argparse.ArgumentParser(
description="Build and make check every configuration from a JSON "
"file in its own out-of-tree build directory, in "
"parallel.")
p.add_argument("json", metavar="CONFIGS.json",
help="JSON list of configs (see the script header for "
"the format), or - for stdin")
p.add_argument("configs", nargs="*", metavar="NAME",
help="configs to run (default: all)")
p.add_argument("--list", action="store_true", help="list configs")
p.add_argument("--jobs", type=int, default=2,
help="make -j per config (default: 2)")
p.add_argument("--threads", type=int, default=nproc(),
help="worker threads; each takes the next pending config "
"when it is free (default: nproc)")
p.add_argument("--shard", metavar="K/N",
help="run only the K-th (1-based) of N shards; configs "
"are dealt to shards greedily by descending "
"\"minutes\" so the shards' totals come out even")
p.add_argument("--fail-fast", action=argparse.BooleanOptionalAction,
default=True,
help="abort everything after the first failing config: "
"pending configs are skipped and in-flight ones "
"killed (--no-fail-fast runs everything and "
"reports every failure)")
p.add_argument("--cc", default="ccache gcc" if shutil.which("ccache")
else None,
help="compiler passed to configure as CC= for configs "
"that do not set their own \"cc\"")
p.add_argument("--cflags", default="",
help="CFLAGS for configs that do not set their own")
p.add_argument("--ldflags", default="",
help="LDFLAGS for configs that do not set their own")
p.add_argument("--private-dir", action="append", default=[],
metavar="DIR",
help="give each build dir a private copy of this "
"symlinked source directory before make check, for "
"tests that write into it (repeatable)")
opts = p.parse_args()
all_configs = load_configs(opts, p.error)
selected = all_configs
if opts.configs:
by_name = {cfg.name: cfg for cfg in all_configs}
unknown = [n for n in opts.configs if n not in by_name]
if unknown:
p.error(f"unknown config(s): {' '.join(unknown)}")
selected = [by_name[n] for n in opts.configs]
# Longest first, so the heavyweights never straggle on an otherwise
# idle machine. Stable: configs without "minutes" keep list order.
selected = sorted(selected, key=lambda cfg: -cfg.minutes)
if opts.shard:
try:
k, n = map(int, opts.shard.split("/"))
except ValueError:
k = n = 0
if not 1 <= k <= n:
p.error(f"--shard: expected K/N with 1 <= K <= N, "
f"got {opts.shard!r}")
# Greedy multiway partition: longest first into the least-loaded
# shard. Deterministic, and with honest "minutes" within ~the
# longest config of optimal.
shards, loads = [[] for _ in range(n)], [0.0] * n
for cfg in selected:
i = loads.index(min(loads))
shards[i].append(cfg)
loads[i] += cfg.minutes
selected = shards[k - 1]
if opts.list:
for cfg in selected:
print(f"{cfg.name} [{cfg.minutes:g} min]: "
f"{' '.join(cfg.configure)}")
return 0
if not selected:
print(f"shard {opts.shard}: no configs to run")
return 0
if not (SRCDIR / "configure").exists():
subprocess.run(["./autogen.sh"], cwd=SRCDIR, check=True)
nthreads = max(1, min(opts.threads, len(selected)))
wall_start = time.monotonic()
cpu_start = os.times()
with ThreadPoolExecutor(max_workers=nthreads) as pool:
results = [(cfg, failed, minutes) for cfg, (failed, minutes)
in zip(selected, pool.map(
lambda cfg: run_config(cfg, opts), selected))]
wall_min = (time.monotonic() - wall_start) / 60
cpu_end = os.times()
# os.times() child counters cover the waited-for configure/make
# subprocesses of every worker thread.
cpu_min = (cpu_end.children_user - cpu_start.children_user
+ cpu_end.children_system - cpu_start.children_system) / 60
summarize(results, wall_min, cpu_min, nthreads)
failed = [cfg.name for cfg, failure, _ in results
if failure and failure != "aborted"]
aborted = sum(1 for _, failure, _ in results if failure == "aborted")
if failed or aborted:
msg = f"make check failed for: {' '.join(failed)}" if failed \
else "aborted without a recorded failure"
if aborted:
msg += f" ({aborted} config(s) aborted by fail-fast)"
print(f"::error::{msg}" if ON_GITHUB else msg)
return 1
return 0
if __name__ == "__main__":
sys.exit(main())