CI: pool the per-config runner matrices into parallel make-check jobs

Replace the one-runner-per-configuration matrices across the make-check workflow family with a generic pooled runner, .github/scripts/parallel-make-check.py. Each workflow keeps its configuration list as JSON next to the invocation; one runner (or a small fixed set of shards, balanced by measured per-config minutes) builds every config in its own out-of-tree (VPATH) build directory off a single checkout/autogen, on a pool of one-per-CPU worker threads, longest first. Concurrent checks are isolated with bubblewrap network namespaces, compilations are cached with ccache, the first failure aborts the rest (fail-fast, with --no-fail-fast to run everything), and per-config timings plus pool efficiency land in the step summary. Failure logs upload as artifacts. smoke-test.yml is likewise reworked into a single pooled job that runs its nine configs on one runner. Converted workflows (runner jobs per full pass): os-check.yml 101 -> 8 (92 Ubuntu configs -> 4 shards; the macOS matrix, the user-settings jobs and the standalone macos-apple-native-cert-validation.yml fold into one macOS runner; Windows unchanged) pq-all.yml 21 -> 2 shards disable-pk-algs.yml 15 -> 1 wolfCrypt-Wconversion.yml 11 -> 1 trackmemory.yml 7 -> 1 cryptocb-only.yml 8 -> 1 (incl. the two new SHA512 entries) multi-compiler.yml 6 -> 1 smallStackSize.yml 6 -> 1 multi-arch.yml 6 -> 1 async.yml 5 -> 1 psk.yml 5 -> 1 no-malloc.yml 3 -> 1 wolfsm.yml 3 -> 1 opensslcoexist.yml 2 -> 1 Measured against current upstream passing runs (job execution time, queue excluded): ~200 runner jobs / ~374 runner-minutes per full pass become 23 jobs / ~168 runner-minutes, with more coverage than before. multi-arch's old matrix combined an "include" list of four architectures with an "opts" axis; GitHub's include-merge rules made each arch entry overwrite the previous one, so only the armel combinations actually ran. The pooled list restores the intended aarch64/armhf/riscv64 coverage (23 combinations; riscv64 x sp-math is omitted as invalid - configure rejects sp-math without SP, and --enable-riscv-asm, unlike --enable-sp-asm, does not bring SP in). Out-of-tree build fixes this depends on: - Makefile.am: symlink the read-only test data (certs/, tests/ config files, sniffer captures and helpers, examples/crypto_policies, input, quit) into the build tree via a BUILT_SOURCES stamp, removed again in distclean-local. ChangeToWolfRoot() and the script tests resolve everything relative to the working directory, so out-of-tree make check and make distcheck now pass. - scripts/multi-msg-record.py: locate the client binary from the build tree working directory rather than the script's source directory. - configure.ac + wolfssl/include.am: run support/gen-debug-trace-error-codes.sh from $srcdir; it reads the error-code headers from the source tree and generates into the build tree. - tests/swdev: a WOLFBUILD variable points the sub-make at the build tree for the configure-generated headers (wolfssl/options.h, wolfssl/version.h); the in-tree-only guards are dropped. Portions of PR #10649 are incorporated: the cross-platform ccache-setup composite action, repository_owner gates on check-headers and check-source-text, the docs-only paths-ignore on os-check, and the libspdm timeout bumps.
2026-07-05 14:10:51 +02:00 · 2026-06-11 17:13:58 +00:00
parent 351b775fd4
commit 3a6c31a51e
31 changed files with 2641 additions and 850 deletions
@@ -0,0 +1,432 @@
+#!/usr/bin/env python3
+# Build and "make check" a set of configurations, each in its own out-of-tree
+# (VPATH) build directory, on a pool of worker threads (default: one per
+# CPU); each thread takes the next pending config as soon as it is free.
+# The final summary reports how efficiently the pool used the machine
+# (thread occupancy and CPU utilization).
+#
+# The configurations come from a JSON file ("-" for stdin): a list of
+# objects, one per configuration. Recognized keys, all optional except
+# "name" (unknown keys are an error, so typos do not pass silently):
+#
+#   name       unique identifier; the config builds in build-<name>/
+#   configure  list of extra ./configure arguments
+#   cc         compiler passed to configure as CC=, overriding --cc
+#              ("" leaves CC entirely to configure / the environment)
+#   cflags     CFLAGS for make, overriding --cflags
+#   ldflags    LDFLAGS for make, overriding --ldflags
+#   minutes    expected duration, from the Minutes column of a previous
+#              run's summary (default 1.0). Schedule weight only - configs
+#              run longest-first and --shard balances shards by it; a stale
+#              value just packs the schedule a little worse.
+#   user_settings  header staged as <builddir>/user_settings.h before
+#              configure (path relative to the source root); pair it with
+#              --enable-usersettings in "configure"
+#   check      false skips the make-check phase entirely (default true)
+#   prepare    list of argv lists run in the build dir before configure
+#   run        list of argv lists run in the build dir after the build and
+#              checks, e.g. [["wolfcrypt/test/testwolfcrypt"]]
+#   comment    ignored; JSON has no comment syntax, so notes go here
+#
+# For example:
+#
+#   [
+#     {"name": "default"},
+#     {"name": "all-asan", "configure": ["--enable-all"],
+#      "cflags": "-fsanitize=address", "ldflags": "-fsanitize=address"}
+#   ]
+#
+# Driven by CI workflows, which keep their config lists next to the
+# invocation (see .github/workflows/smoke-test.yml), but also runnable
+# locally - copy the JSON block out of the workflow into a file:
+#
+#   .github/scripts/parallel-make-check.py configs.json     # all configs
+#   .github/scripts/parallel-make-check.py configs.json default all-asan
+#   .github/scripts/parallel-make-check.py --list configs.json
+#
+# Concurrent "make check" runs are safe because the test scripts re-exec
+# themselves under "bwrap --unshare-net" when bubblewrap is installed (one
+# network namespace each) and the remaining test outputs land in the build
+# directory; see --private-dir for the exception.
+#
+# The first failing config aborts the others (pending configs are skipped,
+# in-flight ones are killed) so CI fails fast; pass --no-fail-fast to run
+# everything and report every failure.
+
+import argparse
+import json
+import os
+import shutil
+import signal
+import subprocess
+import sys
+import threading
+import time
+from concurrent.futures import ThreadPoolExecutor
+from dataclasses import dataclass, field
+from pathlib import Path
+
+# cflags/ldflags are applied at make time only (never to ./configure) so
+# autoconf feature detection is not poisoned by benign warnings in
+# conftest probes. They are omitted entirely when empty so a plain config
+# keeps the configure-chosen defaults.
+@dataclass
+class Config:
+    name: str
+    configure: list = field(default_factory=list)
+    cc: str = ""
+    cflags: str = ""
+    ldflags: str = ""
+    minutes: float = 1.0
+    user_settings: str = ""
+    check: bool = True
+    prepare: list = field(default_factory=list)
+    run: list = field(default_factory=list)
+
+SRCDIR = Path(__file__).resolve().parents[2]
+ON_GITHUB = os.environ.get("GITHUB_ACTIONS") == "true"
+print_lock = threading.Lock()
+
+# Fail-fast state: the first failure sets stop_event (under fail_lock, so
+# exactly one config is reported as the origin) and kills the other
+# workers' in-flight process groups.
+stop_event = threading.Event()
+fail_lock = threading.Lock()
+live_procs = set()
+procs_lock = threading.Lock()
+
+
+def abort_others():
+    # Every subprocess starts its own session, so killing the process
+    # group takes down the whole make/test tree under it.
+    with procs_lock:
+        procs = list(live_procs)
+    for p in procs:
+        try:
+            os.killpg(p.pid, signal.SIGTERM)
+        except (ProcessLookupError, PermissionError):
+            pass
+
+
+def nproc():
+    # Like nproc(1): CPUs usable by this process, falling back to all online.
+    try:
+        return len(os.sched_getaffinity(0))
+    except AttributeError:
+        return os.cpu_count() or 1
+
+
+def load_configs(opts, error):
+    try:
+        if opts.json == "-":
+            entries = json.load(sys.stdin)
+        else:
+            entries = json.loads(Path(opts.json).read_text())
+    except (OSError, ValueError) as e:
+        error(f"{opts.json}: {e}")
+    if not isinstance(entries, list):
+        error(f"{opts.json}: expected a JSON list of config objects")
+    configs = []
+    for entry in entries:
+        if not isinstance(entry, dict):
+            error(f"{opts.json}: config entries must be objects: {entry!r}")
+        unknown = set(entry) - {"name", "configure", "cc", "cflags",
+                                "ldflags", "minutes", "user_settings",
+                                "check", "prepare", "run", "comment"}
+        if unknown:
+            error(f"{opts.json}: unknown key(s) in {entry.get('name', entry)!r}: "
+                  f"{' '.join(sorted(unknown))}")
+        name = entry.get("name")
+        if not isinstance(name, str) or not name or "/" in name:
+            error(f"{opts.json}: every config needs a \"name\" usable as a "
+                  f"directory suffix: {entry!r}")
+        if any(cfg.name == name for cfg in configs):
+            error(f"{opts.json}: duplicate config name {name!r}")
+        minutes = entry.get("minutes", 1.0)
+        if isinstance(minutes, bool) or not isinstance(minutes, (int, float)) \
+                or minutes < 0:
+            error(f"{opts.json}: \"minutes\" must be a non-negative number "
+                  f"in {name!r}")
+        user_settings = entry.get("user_settings", "")
+        if not isinstance(user_settings, str):
+            error(f"{opts.json}: \"user_settings\" must be a path string "
+                  f"in {name!r}")
+        check = entry.get("check", True)
+        if not isinstance(check, bool):
+            error(f"{opts.json}: \"check\" must be a boolean in {name!r}")
+        cc = entry.get("cc", opts.cc or "")
+        if not isinstance(cc, str):
+            error(f"{opts.json}: \"cc\" must be a string in {name!r}")
+        for key in ("prepare", "run"):
+            cmds = entry.get(key, [])
+            if not (isinstance(cmds, list)
+                    and all(isinstance(cmd, list) and cmd
+                            and all(isinstance(a, str) for a in cmd)
+                            for cmd in cmds)):
+                error(f"{opts.json}: \"{key}\" must be a list of argv lists "
+                      f"in {name!r}")
+        configs.append(Config(name, list(entry.get("configure", [])), cc,
+                              entry.get("cflags", opts.cflags),
+                              entry.get("ldflags", opts.ldflags),
+                              float(minutes), user_settings, check,
+                              list(entry.get("prepare", [])),
+                              list(entry.get("run", []))))
+    if not configs:
+        error(f"{opts.json}: no configs")
+    return configs
+
+
+def privatize_dirs(bdir, dirs):
+    # Replace build-tree symlinks into the source tree with private
+    # per-build-dir copies: tests that write into these directories would
+    # otherwise write through the symlink into the shared source tree and
+    # race with the other parallel checks. Runs after the build steps so
+    # that build rules which (re)create the symlinks have already run.
+    for name in dirs:
+        d = bdir / name
+        if d.is_symlink():
+            d.unlink()
+            shutil.copytree(SRCDIR / name, d, symlinks=True)
+
+
+def dump(title, path):
+    print(f"::group::{title}" if ON_GITHUB else f"==== {title} ====")
+    try:
+        sys.stdout.write(path.read_text(errors="replace"))
+    except OSError as e:
+        print(e)
+    if ON_GITHUB:
+        print("::endgroup::")
+    sys.stdout.flush()
+
+
+def run_config(cfg, opts):
+    if opts.fail_fast and stop_event.is_set():
+        return "aborted", 0.0
+    bdir = SRCDIR / f"build-{cfg.name}"
+    if bdir.exists():
+        shutil.rmtree(bdir)
+    bdir.mkdir()
+    configure = [str(SRCDIR / "configure")] + cfg.configure
+    if cfg.cc:
+        configure.append(f"CC={cfg.cc}")
+    flags = [f"CFLAGS={cfg.cflags}"] if cfg.cflags else []
+    flags += [f"LDFLAGS={cfg.ldflags}"] if cfg.ldflags else []
+    make = ["make", f"-j{opts.jobs}"] + flags
+    steps = []
+    if cfg.user_settings:
+        # Staged before configure; --enable-usersettings builds pick it up
+        # from the build dir via the default include path.
+        steps.append((f"stage {cfg.user_settings}",
+                      lambda: shutil.copy(SRCDIR / cfg.user_settings,
+                                          bdir / "user_settings.h")))
+    steps += [(" ".join(cmd), cmd) for cmd in cfg.prepare]
+    steps += [("configure", configure), ("make", make)]
+    if cfg.check:
+        steps += [
+            # Prebuild the check programs without running any tests so
+            # "make check" below is pure test execution.
+            ("make check TESTS=", make + ["check", "TESTS="]),
+            ("private dirs", lambda: privatize_dirs(bdir, opts.private_dir)),
+            ("make check", ["make"] + flags + ["check"]),
+        ]
+    steps += [(" ".join(cmd), cmd) for cmd in cfg.run]
+    failed = None
+    start = time.monotonic()
+    log = bdir / "make-check.log"
+    with open(log, "w") as logf:
+        for step, cmd in steps:
+            if opts.fail_fast and stop_event.is_set():
+                failed = "aborted"
+                break
+            if callable(cmd):
+                cmd()
+                continue
+            print(f"+ {' '.join(cmd)}", file=logf, flush=True)
+            # stdin=DEVNULL so a test that reads stdin sees EOF (as in CI)
+            # instead of blocking forever on an interactive/socket stdin.
+            proc = subprocess.Popen(cmd, cwd=bdir, stdout=logf,
+                                    stderr=subprocess.STDOUT,
+                                    stdin=subprocess.DEVNULL,
+                                    start_new_session=True)
+            with procs_lock:
+                live_procs.add(proc)
+            try:
+                rc = proc.wait()
+            finally:
+                with procs_lock:
+                    live_procs.discard(proc)
+            if rc != 0:
+                if opts.fail_fast:
+                    # The first failure wins; any nonzero exit after the
+                    # abort began was most likely our SIGTERM.
+                    with fail_lock:
+                        failed = "aborted" if stop_event.is_set() else step
+                        stop_event.set()
+                    if failed != "aborted":
+                        abort_others()
+                else:
+                    failed = step
+                break
+    minutes = (time.monotonic() - start) / 60
+    with print_lock:
+        if failed == "aborted":
+            print(f"{cfg.name}: aborted (fail-fast) [{minutes:.1f} min]")
+            sys.stdout.flush()
+        else:
+            verdict = f"FAIL ({failed})" if failed else "pass"
+            dump(f"{cfg.name}: {verdict} [{minutes:.1f} min]", log)
+            if failed == "configure":
+                dump(f"{cfg.name}: config.log", bdir / "config.log")
+            elif failed == "make check":
+                dump(f"{cfg.name}: test-suite.log", bdir / "test-suite.log")
+    return failed, minutes
+
+
+def summarize(results, wall_min, cpu_min, nthreads):
+    lines = ["| Config | Result | Minutes |", "|---|---|---|"]
+    for cfg, failed, minutes in results:
+        if failed == "aborted":
+            ok = ":heavy_minus_sign: aborted (fail-fast)"
+        elif failed:
+            ok = f":x: FAIL ({failed})"
+        else:
+            ok = ":white_check_mark: pass"
+        lines.append(f"| {cfg.name} | {ok} | {minutes:.1f} |")
+    # Two views of how efficiently the pool used the machine: thread
+    # occupancy is the time the workers spent running configs out of the
+    # thread-minutes available (a long config left for last idles the other
+    # workers and drags it down); CPU utilization is the CPU time the build
+    # and test children actually consumed out of the CPU-minutes available
+    # (too-shallow make -j and serial test phases show up here).
+    busy_min = sum(minutes for _, _, minutes in results)
+    ncpu = nproc()
+    lines += [
+        "",
+        f"{len(results)} configs in {wall_min:.1f} min on {nthreads} "
+        f"threads / {ncpu} CPUs: "
+        f"thread occupancy {100 * busy_min / (wall_min * nthreads):.0f}% "
+        f"({busy_min:.1f} of {wall_min * nthreads:.1f} thread-min), "
+        f"CPU utilization {100 * cpu_min / (wall_min * ncpu):.0f}% "
+        f"({cpu_min:.1f} of {wall_min * ncpu:.1f} CPU-min)",
+    ]
+    table = "\n".join(lines)
+    print(table)
+    summary = os.environ.get("GITHUB_STEP_SUMMARY")
+    if summary:
+        with open(summary, "a") as f:
+            print(f"### make check\n\n{table}", file=f)
+
+
+def main():
+    p = argparse.ArgumentParser(
+        description="Build and make check every configuration from a JSON "
+                    "file in its own out-of-tree build directory, in "
+                    "parallel.")
+    p.add_argument("json", metavar="CONFIGS.json",
+                   help="JSON list of configs (see the script header for "
+                        "the format), or - for stdin")
+    p.add_argument("configs", nargs="*", metavar="NAME",
+                   help="configs to run (default: all)")
+    p.add_argument("--list", action="store_true", help="list configs")
+    p.add_argument("--jobs", type=int, default=2,
+                   help="make -j per config (default: 2)")
+    p.add_argument("--threads", type=int, default=nproc(),
+                   help="worker threads; each takes the next pending config "
+                        "when it is free (default: nproc)")
+    p.add_argument("--shard", metavar="K/N",
+                   help="run only the K-th (1-based) of N shards; configs "
+                        "are dealt to shards greedily by descending "
+                        "\"minutes\" so the shards' totals come out even")
+    p.add_argument("--fail-fast", action=argparse.BooleanOptionalAction,
+                   default=True,
+                   help="abort everything after the first failing config: "
+                        "pending configs are skipped and in-flight ones "
+                        "killed (--no-fail-fast runs everything and "
+                        "reports every failure)")
+    p.add_argument("--cc", default="ccache gcc" if shutil.which("ccache")
+                   else None,
+                   help="compiler passed to configure as CC= for configs "
+                        "that do not set their own \"cc\"")
+    p.add_argument("--cflags", default="",
+                   help="CFLAGS for configs that do not set their own")
+    p.add_argument("--ldflags", default="",
+                   help="LDFLAGS for configs that do not set their own")
+    p.add_argument("--private-dir", action="append", default=[],
+                   metavar="DIR",
+                   help="give each build dir a private copy of this "
+                        "symlinked source directory before make check, for "
+                        "tests that write into it (repeatable)")
+    opts = p.parse_args()
+
+    all_configs = load_configs(opts, p.error)
+    selected = all_configs
+    if opts.configs:
+        by_name = {cfg.name: cfg for cfg in all_configs}
+        unknown = [n for n in opts.configs if n not in by_name]
+        if unknown:
+            p.error(f"unknown config(s): {' '.join(unknown)}")
+        selected = [by_name[n] for n in opts.configs]
+
+    # Longest first, so the heavyweights never straggle on an otherwise
+    # idle machine. Stable: configs without "minutes" keep list order.
+    selected = sorted(selected, key=lambda cfg: -cfg.minutes)
+    if opts.shard:
+        try:
+            k, n = map(int, opts.shard.split("/"))
+        except ValueError:
+            k = n = 0
+        if not 1 <= k <= n:
+            p.error(f"--shard: expected K/N with 1 <= K <= N, "
+                    f"got {opts.shard!r}")
+        # Greedy multiway partition: longest first into the least-loaded
+        # shard. Deterministic, and with honest "minutes" within ~the
+        # longest config of optimal.
+        shards, loads = [[] for _ in range(n)], [0.0] * n
+        for cfg in selected:
+            i = loads.index(min(loads))
+            shards[i].append(cfg)
+            loads[i] += cfg.minutes
+        selected = shards[k - 1]
+
+    if opts.list:
+        for cfg in selected:
+            print(f"{cfg.name} [{cfg.minutes:g} min]: "
+                  f"{' '.join(cfg.configure)}")
+        return 0
+    if not selected:
+        print(f"shard {opts.shard}: no configs to run")
+        return 0
+
+    if not (SRCDIR / "configure").exists():
+        subprocess.run(["./autogen.sh"], cwd=SRCDIR, check=True)
+
+    nthreads = max(1, min(opts.threads, len(selected)))
+    wall_start = time.monotonic()
+    cpu_start = os.times()
+    with ThreadPoolExecutor(max_workers=nthreads) as pool:
+        results = [(cfg, failed, minutes) for cfg, (failed, minutes)
+                   in zip(selected, pool.map(
+                       lambda cfg: run_config(cfg, opts), selected))]
+    wall_min = (time.monotonic() - wall_start) / 60
+    cpu_end = os.times()
+    # os.times() child counters cover the waited-for configure/make
+    # subprocesses of every worker thread.
+    cpu_min = (cpu_end.children_user - cpu_start.children_user
+               + cpu_end.children_system - cpu_start.children_system) / 60
+    summarize(results, wall_min, cpu_min, nthreads)
+    failed = [cfg.name for cfg, failure, _ in results
+              if failure and failure != "aborted"]
+    aborted = sum(1 for _, failure, _ in results if failure == "aborted")
+    if failed or aborted:
+        msg = f"make check failed for: {' '.join(failed)}" if failed \
+            else "aborted without a recorded failure"
+        if aborted:
+            msg += f" ({aborted} config(s) aborted by fail-fast)"
+        print(f"::error::{msg}" if ON_GITHUB else msg)
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())