
Improve build performance without CI fan-out #1832

Open

cpunion wants to merge 45 commits into xgo-dev:main from cpunion:improve/build-perf-pure

Conversation


@cpunion cpunion commented Apr 27, 2026

Summary

This PR is the pure build-performance subset of the previous build-perf work. It intentionally removes the CI workflow/job split and sharding changes so the branch focuses only on compiler/build hot-path improvements.

What Changed

  • Optimized LLGo build-cache/fingerprint/manifest hot paths while preserving the YAML manifest format and fallback parsing paths.
  • Reduced cgo/build overhead:
    • faster cgo preamble and pragma scanning,
    • package metadata fast paths for cgo C-file discovery,
    • skip cgo extern declaration generation when no extern symbols are used.
  • Improved crosscompile compile-group performance by parallelizing independent external compiler/archive work with bounded worker counts and deterministic output ordering.
  • Simplified large crosscompile file-list construction to reduce compile/runtime overhead.
  • Kept focused tests for the new fast paths and fallbacks.

Intentionally Not Included

  • No CI workflow fan-out/splitting/sharding changes.
  • No CI-only retry/cache script changes.
  • No autoresearch metadata files.
  • No local root llgo binary.

Validation

Ran locally on this branch:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo

Latest Local Follow-up: Async Native Object Emission

Added source-only commit e94d62a to overlap native host object emission with later package work. The main compiler still serializes LLGo/LLVM IR generation, then sends serialized LLVM IR to bounded external clang workers for native object emission. This avoids same-process LLVM concurrency while overlapping object generation. LLGO_PARALLEL_OBJECT_EMIT=0 is available as an opt-out, and debug/-genll/command-tracing paths remain synchronous.

Local evidence before pushing:

| Workload | Baseline | Patched | Change |
| --- | --- | --- | --- |
| go test ./internal/build -run '^TestExtest$' -count=3 | 25.061s / 24.390s | 24.501s / 24.139s | -2.2% / -1.0% |
| go test ./internal/build -count=1 | 22.464s / 22.786s | 22.240s / 21.707s | -1.0% / -4.7% |
| bounded-worker repeat go test ./internal/build -count=1 | 23.560s | 22.606s | -4.0% |
| go clean -cache && go build -a -tags=dev ./cmd/llgo | 11.078s | 11.058s | neutral |

Additional local validation passed: targeted object-emission gating tests, go test -race subset for internal/build, clean go build -a -tags=dev -o <tmp> ./cmd/llgo, test ! -e llgo, and git diff --check.
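The bounded-worker emission described above can be sketched as follows. This is a stdlib-only illustration with hypothetical names, not the PR's actual code: a semaphore channel caps how many external emissions run at once while the caller keeps producing work, and everything is drained before linking.

```go
package main

import (
	"fmt"
	"sync"
)

// emitObject stands in for invoking an external clang worker on serialized
// LLVM IR. In the real build this would shell out; here it just records a
// fake object name.
func emitObject(pkg string, results *sync.Map) {
	results.Store(pkg, pkg+".o")
}

// emitAll overlaps object emission with later package work using a bounded
// number of workers (cap 2, matching the tuned worker cap described above).
func emitAll(pkgs []string, maxWorkers int) map[string]string {
	sem := make(chan struct{}, maxWorkers) // bounded-worker semaphore
	var wg sync.WaitGroup
	var results sync.Map
	for _, pkg := range pkgs {
		wg.Add(1)
		sem <- struct{}{} // block when maxWorkers emissions are in flight
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			emitObject(p, &results)
		}(pkg)
	}
	wg.Wait() // drain all workers before linking
	out := make(map[string]string)
	results.Range(func(k, v any) bool {
		out[k.(string)] = v.(string)
		return true
	})
	return out
}

func main() {
	objs := emitAll([]string{"a", "b", "c"}, 2)
	fmt.Println(len(objs))
}
```

The opt-out described above (LLGO_PARALLEL_OBJECT_EMIT=0) would correspond to calling the same code with a worker cap of 1 or bypassing the goroutines entirely.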

Follow-up commit 1bb466b lowers the bounded object-emission worker cap from 4 to 2 to reduce external clang contention. Local full-suite A/B over e94d62a:

| Workload | Cap 4 | Cap 2 | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 25.987s | 22.489s | -13.5% |
| repeat go test ./internal/build -count=1 | 21.440s | 21.212s | -1.1% |

Cap 1 and cap 3 both regressed locally, so cap 2 is the current local best. Race subset and clean go build -a -tags=dev guards passed after the cap change.

Follow-up commit 740b15c avoids copying the serialized LLVM IR string into a []byte before writing the temporary .ll file. Local full-suite A/B over cap-2 async emission:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 23.222s | 22.443s | -3.4% |
| repeat go test ./internal/build -count=1 | 21.806s | 21.678s | -0.6% |

A targeted TestExtest / object-emission gating test run passed before pushing.

Follow-up commit 48cde59 uses the C clang driver instead of clang++ for native async IR object emission while preserving the configured compiler for cross/-genll paths. Local full-suite A/B over 740b15c:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 22.705s | 22.211s | -2.2% |
| repeat go test ./internal/build -count=1 | 22.109s | 21.958s | -0.7% |

Targeted TestExtest / object-emission gating tests passed before pushing; full go test ./internal/build -count=1, clean go build -a -tags=dev -o <tmp> ./cmd/llgo, test ! -e llgo, and git diff --check also passed after pushing.

Follow-up commit 10cc3f4 broadens async object emission to external-clang/cross builds too, while explicitly keeping -genll, IR checking, and command-tracing paths synchronous. Local full-suite A/B over 48cde59:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 22.035s | 21.644s | -1.8% |
| repeat go test ./internal/build -count=1 | 22.970s | 22.320s | -2.8% |

Targeted TestParallelObjectEmitEnabled / TestExtest, race subset, and clean go build -a -tags=dev guards passed before pushing. Follow-up a2e2505 keeps external/cross async emission on the target-specific compiler (instead of the native clang driver) after Targets CI exposed xtensa builds using the wrong compiler; local build.sh empty esp32 passed with the fix.

Follow-up commit 61b1b1d pipes async LLVM IR to clang via stdin (clang -x ir -c -) instead of writing temporary .ll files when GenLL/IR-checking are disabled. Debug/check paths still materialize .ll files. Local full-suite A/B over a2e2505:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 23.336s | 22.250s | -4.7% |

Validation also passed: targeted async/object-emission tests plus TestExtest, build.sh empty esp32, race subset, clean go build -a -tags=dev, and git diff --check.

Follow-up commit 1c53158 leaves go/types.Info.Scopes nil in LLGo package loads because LLGo and x/tools/go/ssa do not consume lexical scope records during compilation. This avoids extra type-checker scope recording. Local full-suite A/B over 61b1b1d:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 24.530s | 22.511s | -8.2% |
| repeat go test ./internal/build -count=1 | 23.085s | 22.013s | -4.6% |

Validation passed: go test ./internal/packages ./internal/build ./ssa ./cl -count=1, race subset, clean go build -a -tags=dev, and git diff --check.
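The Scopes-nil idea can be demonstrated with stdlib go/types directly: leaving the Scopes map nil in types.Info tells the type checker to skip recording lexical scopes, while the other result maps are still populated. This is a minimal standalone illustration, not LLGo's loader code.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

// typeCheckNoScopes type-checks src while leaving Info.Scopes nil, so the
// type checker allocates no scope records (the saving behind commit
// 1c53158); Types/Defs/Uses are still filled in for downstream use.
func typeCheckNoScopes(src string) (*types.Info, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "p.go", src, 0)
	if err != nil {
		return nil, err
	}
	info := &types.Info{
		Types: map[ast.Expr]types.TypeAndValue{},
		Defs:  map[*ast.Ident]types.Object{},
		Uses:  map[*ast.Ident]types.Object{},
		// Scopes intentionally nil: no lexical-scope records are built.
	}
	_, err = (&types.Config{}).Check("p", fset, []*ast.File{f}, info)
	return info, err
}

func main() {
	info, err := typeCheckNoScopes("package p\nfunc add(a, b int) int { return a + b }")
	fmt.Println(err == nil, len(info.Defs) > 0, len(info.Scopes))
}
```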

Follow-up commit 2ecb5b6 avoids copying yaml.Marshal manifest bytes into a second string by using the existing read-only unsafe byte-slice-to-string helper. Local full-suite A/B over 1c53158:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 23.225s | 22.598s | -2.7% |
| repeat go test ./internal/build -count=1 | 22.562s | 22.457s | -0.5% |

Validation passed: manifest/fingerprint/cache targeted tests, full build-cache script, clean go build -a -tags=dev, and git diff --check.
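The read-only byte-slice-to-string conversion mentioned above typically looks like the following sketch (the real helper's name and location are not shown in this PR text). The contract is the usual one for unsafe.String: the bytes must not be mutated after the conversion, which holds for freshly marshaled manifest bytes that are written once and discarded.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying. Safe only if b
// is never mutated afterwards, as with write-once marshaled manifest bytes.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(&b[0], len(b))
}

func main() {
	manifest := []byte("package: fmt\nhash: abc123\n")
	fmt.Println(len(bytesToString(manifest)))
}
```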

Follow-up commit 22d440a creates all x/tools/go/ssa packages first, then calls Program.Build() once so the upstream SSA builder can use its documented parallel package build path. LLGo/LLVM codegen still runs sequentially after this phase. Local full-suite A/B over 2ecb5b6:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 24.158s | 22.587s | -6.5% |
| repeat go test ./internal/build -count=1 | 23.902s | 22.334s | -6.6% |

Validation passed: go test ./internal/build ./ssa ./cl -count=1, race subset, clean go build -a -tags=dev, and git diff --check.

Follow-up commit c0ff171 avoids constructing full go/types method sets in the local SSA order fixup. The fixup now visits explicit named method functions directly via Program.FuncValue, avoiding MethodSet allocation for every type. Local full-suite A/B over 22d440a:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 24.295s | 22.676s | -6.7% |
| repeat go test ./internal/build -count=1 | 22.907s | 22.633s | -1.2% |

Validation passed: go test ./internal/build ./ssa ./cl -count=1, targeted SSA-order tests, race subset, clean go build -a -tags=dev, and git diff --check.

Follow-up commit 60e0404 tracks whether buildSSAPkgs actually created new SSA packages and skips a redundant Program.Build() traversal when the call only wraps packages built by earlier setup; local SSA fixups still run for returned packages. Local full-suite A/B over c0ff171:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 22.722s | 22.549s | -0.8% |
| repeat go test ./internal/build -count=1 | 23.577s | 21.986s | -6.7% |

Validation passed: targeted TestExtest/SSA-order/object-emission tests, race subset, and git diff --check.

Follow-up commit cc6d908 writes LLGo build manifests with a deterministic specialized YAML emitter instead of using generic yaml.Marshal reflection for the hot package-manifest path. The cache manifest remains YAML and existing YAML decoding/legacy fallback remain in place. Local full-suite A/B over 60e0404:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 24.378s | 23.615s | -3.1% |
| repeat go test ./internal/build -count=1 | 23.339s | 23.050s | -1.2% |

Validation passed: manifest/fingerprint/cache targeted tests, full build-cache script, clean go build -a -tags=dev, and git diff --check.

Follow-up commit e213989 avoids strconv.Quote for manifest strings that are safe plain YAML scalars, reducing allocations and manifest size in the specialized emitter while still quoting ambiguous/special values. Local full-suite A/B over cc6d908:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 23.173s | 22.916s | -1.1% |
| repeat go test ./internal/build -count=1 | 23.534s | 23.107s | -1.8% |

Validation passed: manifest/fingerprint/cache targeted tests, full build-cache script, clean go build -a -tags=dev, and git diff --check.
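The "quote only ambiguous values" rule above can be sketched with a deliberately conservative check. This is an illustration of the shape of such an emitter, not the PR's actual predicate: anything empty, whitespace-padded, bool/number/null-like, or containing YAML indicator characters falls back to strconv.Quote.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isPlainYAMLScalar reports whether s can be emitted unquoted as a YAML
// scalar. Conservative: when in doubt, quote.
func isPlainYAMLScalar(s string) bool {
	if s == "" || strings.TrimSpace(s) != s {
		return false
	}
	switch strings.ToLower(s) {
	case "true", "false", "null", "yes", "no", "~":
		return false // would be read back as bool/null
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return false // would be read back as a number
	}
	if strings.ContainsAny(s, ":#{}[],&*!|>'\"%@`\\\n\t") {
		return false // YAML indicators and escapes need quoting
	}
	switch s[0] {
	case '-', '?':
		return false
	}
	return true
}

// emitScalar quotes only when needed, as commit e213989 describes.
func emitScalar(s string) string {
	if isPlainYAMLScalar(s) {
		return s
	}
	return strconv.Quote(s)
}

func main() {
	fmt.Println(emitScalar("internal/build"), emitScalar("true"))
}
```

Plain package paths and hex digests pass the check and skip the Quote allocation; values like "true" or "a: b" still round-trip safely through quoting.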

Follow-up commit 86d5af4 trims dead build helper code and reuses scratch state in the SSA-order fixup:

  • removes the now-unreachable generic yaml.Marshal fallback for build-manifest emission,
  • removes production-only helpers that were no longer used outside tests (manifestBuilder.Fingerprint, digestFile),
  • reuses a dependency scratch map while checking stores in fixSSAOrderBlock.

Local paired A/B evidence:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go build -a -tags=dev -o <tmp> ./cmd/llgo (yaml.Marshal fallback removal) | 10.853s | 10.572s | -2.6% |
| repeat go build -a -tags=dev -o <tmp> ./cmd/llgo | 11.115s | 10.649s | -4.2% |
| remove unused digestFile helper | 10.749s | 10.522s | -2.1% |
| repeat remove unused digestFile helper | 10.526s | 10.515s | -0.1% |
| go test ./internal/build -count=1 (SSA-order scratch reuse) | 25.591s | 24.503s | -4.3% |
| repeat SSA-order scratch reuse | 24.367s | 24.203s | -0.7% |

A third SSA-order repeat was slightly negative (+0.2%), so that part is treated as a small low-risk allocation cleanup rather than a large claimed speedup. Validation before push passed targeted manifest/fingerprint/digest/metadata/SSA-order/TestExtest tests, clean go build -a -tags=dev, test ! -e llgo, and git diff --check.

Follow-up commit 8549fc4 inlines the only remaining production digestBytes call site in the overlay file-digest path and keeps the hash helper test-local. This trims a dead production helper after the earlier digestFile cleanup while preserving the same sha256 + hex encoding logic.

Local clean-build A/B over 86d5af4:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go build -a -tags=dev -o <tmp> ./cmd/llgo | 10.894s | 10.679s | -2.0% |
| repeat | 10.692s | 10.630s | -0.6% |
| third run | 11.067s | 10.619s | -4.1% |

Validation before push passed targeted digest/manifest/metadata/TestExtest tests, full go test ./internal/build -count=1, clean go build -a -tags=dev -o <tmp> ./cmd/llgo, test ! -e llgo, and git diff --check.

CI Result for Latest Head (8549fc4)

Latest pushed head is CI-clean:

  • Checks: 57 success, 1 skipped, 0 failed
  • Merge state: CLEAN
  • Non-skipped jobs counted for timing: 55
  • Total runner time: 5h59m51s
  • LLGo workflow total: 4h16m06s
  • Wall time: 51m30s
  • Longest job: 27m07s (test (macos-latest, 19) in Go workflow)

Compared with the previous clean head 86d5af4, this run improves total runner time by 6m40s, LLGo workflow total by 13m10s, and end-to-end wall time by 5m34s, while the longest job is 1m28s longer due to the Go workflow macOS test job. Compared with the coverage-equivalent main baseline b4d9167, it is lower in total runner time (-39m42s), LLGo workflow total (-40m41s), wall time (vs that main sample), and longest-job time (-0m05s). As before, hosted-runner variance remains significant; local paired A/B is the primary source-level evidence.

CI Result for Latest Head (86d5af4)

Latest pushed head is CI-clean:

  • Checks: 57 success, 1 skipped, 0 failed
  • Merge state: CLEAN
  • Non-skipped jobs counted for timing: 55
  • Total runner time: 6h06m31s
  • LLGo workflow total: 4h29m16s
  • Wall time: 57m04s
  • Longest job: 25m39s (llgo (macos-15-intel, 19, 1.24.2) in LLGo workflow)

Compared with the previous clean head e213989, this run improves total runner time by 14m37s and LLGo workflow total by 7m31s, with a similar longest job (-19s) and slightly higher end-to-end wall time (+47s). Compared with the coverage-equivalent main baseline b4d9167, it is lower in total runner time (-33m02s), LLGo workflow total (-27m31s), and longest job (-1m33s). As before, local paired A/B remains the primary evidence for source-level changes because hosted-runner timing is noisy.

CI Result for Latest Head (e213989)

Latest pushed head is CI-clean:

  • Checks: 57 success, 1 skipped, 0 failed
  • Merge state: CLEAN
  • Non-skipped jobs counted for timing: 55
  • Total runner time: 6h21m08s
  • LLGo workflow total: 4h36m47s
  • Wall time: 56m17s
  • Longest job: 25m58s (test (macos-latest, 19) in Go workflow)

Compared with the previous clean head 60e0404, this hosted-runner sample is mixed/noisier: total runner time is +13m16s and LLGo workflow total is +5m17s, while the longest job improves by 2m59s. Compared with the coverage-equivalent main baseline b4d9167, it remains faster in total runner time (-18m25s), LLGo workflow total (-20m00s), and longest job (-1m14s). The manifest-emitter commits are therefore justified primarily by local paired A/B and build-cache validation rather than by claiming a whole-CI timing win from this single run.

CI Result for Latest Head (60e0404)

Latest CI completed clean: 57 successful checks and 1 skipped check; merge state is CLEAN.

Compared with the previous clean async-tuning sample (48cde59), CI is mixed: total runner time and LLGo workflow total are higher on this run, while wall time is essentially unchanged and several individual jobs still improve. This reinforces that the later SSA/cache hot-path commits are justified primarily by local paired A/B evidence, not by a single hosted-runner timing sample.

| Metric | 48cde59 | 60e0404 | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Skipped jobs | 1 | 1 | +0 |
| Total runner time | 5h59m35s | 6h07m52s | +8m17s |
| LLGo workflow total | 4h19m13s | 4h31m30s | +12m17s |
| End-to-end wall time | 55m49s | 55m41s | -0m08s |
| Longest single job | 24m46s | 28m57s | +4m11s |

Compared with the coverage-equivalent main baseline (b4d9167), the latest head remains lower in total runner time and LLGo workflow total, though the longest single job is higher in this sample.

| Metric | main b4d9167 | 60e0404 | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Total runner time | 6h39m33s | 6h07m52s | -31m41s |
| LLGo workflow total | 4h56m47s | 4h31m30s | -25m17s |
| Longest single job | 27m12s | 28m57s | +1m45s |

CI Result for Latest Async Object Emission Tuning (48cde59)

Latest CI completed clean: 57 successful checks and 1 skipped check. Compared with the first async object-emission CI sample (e94d62a), the follow-up cap/IR-copy/clang-driver tuning improves total runner time, LLGo workflow total, and end-to-end wall time, though the single longest job is longer on this sample.

| Metric | Async object emission (e94d62a) | Tuned async object emission (48cde59) | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Skipped jobs | 1 | 1 | +0 |
| Longest single job | 21m58s | 24m46s | +2m48s |
| Total runner time | 6h13m10s | 5h59m35s | -13m35s |
| LLGo workflow total | 4h35m13s | 4h19m13s | -16m00s |
| End-to-end wall time | 1h15m33s | 55m49s | -19m44s |

Compared with the pre-async PR head (7004afe), the latest head is also lower by total runner time (5h59m35s vs 6h05m05s) and LLGo workflow total (4h19m13s vs 4h32m44s), with the same workflow topology/job coverage.

CI Result for Async Object Emission (e94d62a)

The first CI attempt hit a transient GitHub 502 Bad Gateway while downloading the ESP newlib tarball in hello (macos-latest, 19, 1.26.0). Rerunning the failed job succeeded; final PR status is clean: 57 successful checks and 1 skipped check.

Compared with the previous PR head 7004afe, the latest run improves the longest single job but does not show a total-runner-time win on this one CI sample:

| Metric | Previous PR head (7004afe) | Async object emission (e94d62a) | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Skipped jobs | 1 | 1 | +0 |
| Longest single job | 25m07s | 21m58s | -3m09s |
| Total runner time | 6h05m05s | 6h13m10s | +8m05s |
| LLGo workflow total | 4h32m44s | 4h35m13s | +2m29s |
| End-to-end wall time | 3h49m46s | 1h15m33s | -2h34m13s |

Against the fastest coverage-equivalent goplus/main baseline (b4d9167), the latest PR head remains faster overall (6h13m10s vs 6h39m33s total runner time), but the async object-emission commit itself needs more CI samples before claiming a whole-CI improvement.

CI Runtime Snapshot (latest head after tool environment caching)

Measured from GitHub Actions job startedAt / completedAt timestamps. Skipped jobs are excluded from runtime totals. Codecov checks are not included because they are external status checks rather than Actions runtime jobs. This source-only branch keeps the same workflow topology as main; end-to-end wall time can still vary significantly with hosted-runner queueing, so total runner time is the less noisy cost proxy.

Data sources:

  • Latest successful CI runs for PR #1832 (Improve build performance without CI fan-out) at 43d48c75f5c284805930291cdbfd38f4f9c9bc7d: 25047647064, 25047647084, 25047647105, 25047647095, 25047647059, 25047647044, 25047647061, 25047647056
  • Best comparable completed goplus/main CI run set by total runner time, at b4d9167e460d91a4a0f09a0f8616670a8fbd23fa: 24972314382, 24972314373, 24972314376, 24972314381, 24972314377, 24972314387, 24972314374, 24972314386

Baseline selection note for future snapshots: compare against the fastest completed goplus/main run set that has the same workflow topology / job coverage (same non-skipped and skipped job count where possible). Older completed main runs with fewer jobs are not used as the main baseline because they are not coverage-equivalent. In the currently queried recent completed main runs, there are 3 completed successful main run sets with the same 55 non-skipped / 1 skipped job topology; b4d9167 remains the fastest by total runner time, while 7ea3148 is the fastest by wall time.

| Metric | PR #1832 latest head (43d48c7) | Best comparable completed goplus/main (b4d9167) | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Skipped jobs | 1 | 1 | +0 |
| Longest single job | 28m36s | 27m12s | +1m24s |
| Total runner time | 6h17m26s | 6h39m33s | -22m07s |
| End-to-end wall time | 2h01m48s | 3h01m33s | -59m45s |

Latest longest PR job: Go / test (macos-latest, 19).

Fastest comparable main total runner-time baseline: b4d9167e460d91a4a0f09a0f8616670a8fbd23fa. Fastest comparable main wall-time baseline: 7ea31484337c1d3b560fea9f07bbca1dcf75150a at 2h09m59s.

| Workflow | PR #1832 (jobs / total / wall / longest) | Best comparable goplus/main (jobs / total / wall / longest) |
| --- | --- | --- |
| Build Cache | 2 / 5m40s / 3m02s / 3m02s | 2 / 6m49s / 9m12s / 3m38s |
| Docs | 6 / 9m57s / 49m43s / 2m29s | 6 / 9m20s / 12m24s / 2m51s |
| Format Check | 1 / 0m07s / 0m07s / 0m07s | 1 / 0m07s / 0m07s / 0m07s |
| Go | 2 / 48m46s / 1h10m07s / 28m36s | 2 / 42m17s / 24m34s / 23m03s |
| LLGo | 33 / 4h32m00s / 1h55m41s / 24m24s | 33 / 4h56m47s / 3h01m24s / 27m12s |
| Release Build | 7 / 21m03s / 39m16s / 7m58s | 7 / 22m00s / 33m06s / 8m29s |
| Stdlib Coverage | 2 / 2m14s / 17m26s / 1m13s | 2 / 2m00s / 5m34s / 1m01s |
| Targets | 2 / 17m39s / 10m40s / 9m38s | 2 / 20m13s / 46m55s / 11m13s |

Latest Pure Build Hot-Path Follow-up

After the rebase, added one source-only follow-up commit 85d1523 focused on cgo build metadata and pragma hot paths, without changing CI workflow topology.

Local focused benchmarks used during the follow-up:

| Hot path | Before | After | Change |
| --- | --- | --- | --- |
| splitDirectiveArgs mostly-unquoted args | 84.67 ns/op, 128 B/op, 2 allocs/op | 55.75 ns/op, 64 B/op, 1 alloc/op | -34.2% |
| Darwin go:cgo_* build-flow pragma collection | 463.4 ns/op, 288 B/op, 10 allocs/op | 231.9 ns/op, 144 B/op, 5 allocs/op | -50.0% |
| no-cgo buildCgo with complete metadata | 39.4 ns/op, 256 B/op, 1 alloc/op | 9.8 ns/op, 0 B/op, 0 allocs/op | -75.1% |
| header-heavy cgo OtherFiles metadata extraction | 2060 ns/op, 2688 B/op, 1 alloc/op | 263.5 ns/op, 48 B/op, 1 alloc/op | -87.2% |

Additional local validation after the follow-up:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
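The single-allocation direction behind the splitDirectiveArgs row above can be sketched like this. Assumptions are labeled: the real function's exact semantics are not shown in this PR text, so this sketch assumes space-separated directive arguments with occasional simple double-quoted values, and it does not handle escaped quotes inside quoted arguments.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitArgs splits a //go:cgo_* directive's arguments. The fast path
// handles the common mostly-unquoted case with a single slice allocation
// via strings.Fields; only input containing a quote takes the slower
// unquoting path. (Sketch of the optimization shape, not the PR's code.)
func splitArgs(s string) []string {
	if !strings.ContainsRune(s, '"') {
		return strings.Fields(s) // one allocation for the backing array
	}
	var args []string
	for s = strings.TrimSpace(s); s != ""; s = strings.TrimSpace(s) {
		if s[0] == '"' {
			if end := strings.Index(s[1:], `"`); end >= 0 {
				if unq, err := strconv.Unquote(s[:end+2]); err == nil {
					args = append(args, unq)
					s = s[end+2:]
					continue
				}
			}
		}
		i := strings.IndexByte(s, ' ')
		if i < 0 {
			i = len(s)
		}
		args = append(args, s[:i])
		s = s[i:]
	}
	return args
}

func main() {
	fmt.Println(splitArgs(`dynimport _close close "libc.so"`))
}
```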

Additional Pure Build Hot-Path Follow-up

Added source-only commit b427daf with further cgo metadata / pragma scan reductions. CI workflow topology is still unchanged.

Local focused benchmarks from this follow-up:

| Hot path | Before | After | Change |
| --- | --- | --- | --- |
| header-only buildCgo with complete metadata | 472.6 ns/op, 0 B/op, 0 allocs/op | 212.6 ns/op, 0 B/op, 0 allocs/op | -55.0% |
| sorted multi-source cgo OtherFiles metadata | 811 ns/op, 2784 B/op, 4 allocs/op | 646.6 ns/op, 2688 B/op, 1 alloc/op | -20.3% |
| single-source cgo OtherFiles metadata | 16.36 ns/op, 24 B/op, 1 alloc/op | 14.15 ns/op, 24 B/op, 1 alloc/op | -13.5% |
| exact //go:cgo_ line-comment pragma parsing | 120.6 ns/op, 80 B/op, 3 allocs/op | 111.5 ns/op, 80 B/op, 3 allocs/op | -7.6% |

Additional local validation after this follow-up:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo

Additional Pure Build Hot-Path Follow-up 2

Added source-only commit 437d443 to reuse cgo pragma scan results across Darwin Plan9 asm handling and reduce go:cgo_import_dynamic parsing overhead. CI workflow topology is still unchanged.

Local focused benchmark from this follow-up:

| Hot path | Before | After | Change |
| --- | --- | --- | --- |
| Darwin x/sys/unix cgo pragma flow: asm trampoline check + build ldflags/dynimports | 2733 ns/op, 5824 B/op, 60 allocs/op | 667 ns/op, 896 B/op, 1 alloc/op | -75.6% |

This removes a duplicate AST comment scan between compilePkgSFiles' Darwin trampoline skip check and the later cgo alias/link-arg collection, then reduces allocation while parsing repeated exact //go:cgo_import_dynamic line directives.

Additional local validation after this follow-up:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo

Whole Build Pipeline Follow-up

Added source-only commit b236c0a to remove duplicate archive work in the LLGo build cache miss path. Previously an uncached package was archived to a temporary .a, then copied into the build cache. The build now publishes the archive directly at the cache path and uses that archive for the current link, falling back to the temporary archive path only when cache publication is unavailable.

End-to-end local evidence focused on the whole internal/build pipeline rather than microbenchmarks:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -run '^TestExtest$' -count=1 wall | 12.82s | 11.73s / 11.09s (repeat) | -8.5% / -13.4% |
| Go-reported package time for same workload | 12.28s | 10.96s / 10.56s (repeat) | -10.7% / -14.0% |

Rejected during the same whole-process pass: linking uncached main package object files directly instead of archiving them. It passed TestExtest, but did not improve over the cache-archive change and added linker-order complexity, so it was dropped.

Additional local validation after this follow-up:

go test ./internal/build -run '^(TestExtest|TestSaveToCache_Success|TestSaveToCache_WithMetadata|TestTryLoadFromCache_ForceRebuild)$'
go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.

Whole Build Setup Follow-up

Added source-only commit e36a84c to cache successful macOS SDK sysroot discovery within a process. Native macOS builds call xcrun --sdk macosx --show-sdk-path while setting up crosscompile flags; the full internal/build test package invokes the build pipeline repeatedly, so reusing a successful sysroot lookup avoids repeated external setup work without changing generated outputs. Failed lookups are not cached, so transient xcrun failures can still be retried.

End-to-end local evidence used the full internal/build package test pipeline rather than a microbenchmark:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 25.87s | 25.00s / 23.31s / 23.85s | -3.4% to -9.9% |
| Go-reported package time | 25.32s | 24.15s / 22.77s / 23.03s | -4.6% to -10.1% |

Additional local validation after this follow-up:

go test ./internal/crosscompile/...
go test ./internal/build
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.

Whole Build Setup Follow-up 2

Added source-only commit 6b54e5f to keep LLVM's bin directory first in PATH without prepending duplicate entries on every internal/build.Do call. The previous setup mutated process PATH repeatedly during multi-build processes such as go test ./internal/build, growing duplicate LLVM path entries and increasing external tool lookup/setup overhead. Empty LLVM bin dirs are now ignored instead of prepending an empty path component.

End-to-end local evidence again used the full internal/build package test pipeline rather than a microbenchmark:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 24.43s | 23.14s / 21.67s / 22.62s | -5.3% to -11.3% |
| Go-reported package time | 23.75s | 22.33s / 21.15s / 21.85s | -6.0% to -11.0% |

Additional local validation after this follow-up:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.
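The PATH behavior above (keep LLVM's bin first, no duplicates, ignore an empty bin dir) can be sketched as a pure function. Names are hypothetical; the separator is a parameter so the behavior is portable and testable.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureFirst returns path with binDir as the first entry, without adding a
// duplicate when it is already present and ignoring an empty binDir
// (mirroring the behavior described for commit 6b54e5f; sketch only).
func ensureFirst(binDir, path, sep string) string {
	if binDir == "" {
		return path // never prepend an empty path component
	}
	var kept []string
	for _, p := range strings.Split(path, sep) {
		if p != binDir {
			kept = append(kept, p)
		}
	}
	return strings.Join(append([]string{binDir}, kept...), sep)
}

func main() {
	path := "/opt/llvm/bin:/usr/bin:/opt/llvm/bin"
	fmt.Println(ensureFirst("/opt/llvm/bin", path, ":"))
}
```

Because the function is idempotent, repeated internal/build.Do calls in one process no longer grow the environment with duplicate LLVM entries.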

Whole Build Setup Follow-up 3

Added source-only commit 90ea768 to make LLVM target initialization idempotent within the process. internal/build.Do is called repeatedly by the full internal/build test pipeline; each call previously invoked llssa.Initialize(llssa.InitAll). LLVM target initialization is process-global, so already-initialized flag groups can be skipped while still allowing later calls with additional flags to initialize any missing groups.

End-to-end local evidence used the full internal/build package test pipeline:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 23.39s | 22.80s / 22.34s | -2.5% / -4.5% |
| Go-reported package time | 22.86s | 21.70s / 21.82s | -5.1% / -4.6% |

Additional local validation after this follow-up:

go test ./ssa
go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.
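The flag-group idempotence described above can be sketched with a bitmask guarded by a mutex. The flag names below are hypothetical stand-ins for llssa's init flags; the point is that repeated calls skip already-initialized groups while later calls can still add missing ones.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical flag groups modeled on an InitAll-style init API.
const (
	initNative = 1 << iota
	initAllTargets
	initAll = initNative | initAllTargets
)

var (
	initMu   sync.Mutex
	initDone int // process-global: LLVM target init must happen once per group
	initRuns int // counts real initializations, for illustration only
)

// initialize initializes only the flag groups not yet done, so repeated
// internal/build.Do calls in one process skip redundant LLVM target setup
// (as commit 90ea768 describes; sketch, not the PR's code).
func initialize(flags int) {
	initMu.Lock()
	defer initMu.Unlock()
	missing := flags &^ initDone
	if missing == 0 {
		return // everything requested is already initialized
	}
	initRuns++ // stand-in for the real LLVM target-initialization calls
	initDone |= missing
}

func main() {
	initialize(initAll)
	initialize(initAll) // no-op on the second call
	fmt.Println(initRuns)
}
```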

Whole Build Pipeline Follow-up 2

Added source-only commit 1629549 to overlap cache archive/manifest publication with later package builds. A temporary phase trace of the whole TestExtest pipeline showed cache publication as one of the larger traced subphases (saveToCache about 1.48s cumulative in that workload). The build now starts bounded asynchronous cache saves after cache misses and waits before linking, so archive/manifest I/O can overlap with subsequent package codegen while preserving link inputs and falling back to a temporary archive if cache publication does not produce one.

End-to-end local evidence used the full internal/build package test pipeline:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 22.65s | 22.11s / 22.16s / 21.83s / 22.01s | -2.2% to -3.6% |
| Go-reported package time | 22.10s | 21.33s / 21.63s / 21.04s / 21.26s | -2.1% to -4.8% |

Additional local validation after this follow-up:

go test ./internal/build
go test -race ./internal/build -run '^TestExtest$' -count=1
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.

Next Optimization Work

Potential follow-ups after this PR, ordered by expected value and required design work:

  1. Cache-hit minimal work path: record enough link/runtime/ABI/Python state in cache metadata to skip more LLVM/module construction on package cache hits. This has high potential for warm-cache and repeated test workloads, but needs a careful correctness proof around link metadata, runtime init, ABI symbols, and reflect/global behavior.
  2. More detailed phase tracing for representative workloads: keep using temporary, non-committed tracing around package load, SSA build, cache read/write, cl compile, LLVM emit, archive, link, and test-run phases. Use it on TestExtest, go test ./internal/build, and clean/warm go build -tags=dev ./cmd/llgo before making more source changes.
  3. Bounded package/test pipeline parallelism: investigate whether independent package codegen/export/archive or multiple test-binary link/run phases can be overlapped safely. This needs proof around LLVM/SSA/cabi shared state, deterministic link order, output buffering, and failure aggregation.
  4. Cache publication refinements: after the async cache save change, look for remaining archive/manifest costs such as unnecessary temp files, redundant stat/hash work, or opportunities to batch/cache immutable manifest inputs without changing the persistent YAML format.
  5. cmd/llgo dependency graph slimming: clean go build -a -tags=dev ./cmd/llgo still spends much time in transitive stdlib / x-tools / crosscompile dependencies. Larger gains may require CLI/dependency layering changes, which should be evaluated separately from this focused build-hot-path PR.

Directions already measured and not worth revisiting without new phase evidence: hard-coded GC tuning, disabling linker ICF, removing packages.NeedExportFile, disabling build cache, broad crosscompile/env negative caches, native-only LLVM init, ABI global scan gating, and further small cgo parser/string micro-optimizations.

Whole Build Pipeline Follow-up 3

Added source-only commit 4e8a123 to let the existing bounded cache-publication workers also perform the archive fallback for uncached packages. The previous async cache save path overlapped cache archive/manifest publication, but packages that are intentionally not cached (notably main packages) still fell back to normalizeToArchive while waiting for pending cache saves. The worker now:

  • attempts cache publication when cache is enabled, the package is not main, and fingerprint/manifest are available; force rebuild still bypasses cache reads but repopulates cache entries;
  • creates the required archive in the same bounded worker if cache publication is skipped or does not produce an archive;
  • drains all pending workers before returning the first archive error, avoiding background mutation after build errors;
  • preserves the no-cache-manifest behavior for main packages.

End-to-end local evidence used the full internal/build package test pipeline:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 22.89s | 22.87s / 20.74s / 20.80s / 21.82s | -0.1% to -9.4% |
| Go-reported package time | 21.99s | 22.09s / 20.20s / 20.28s / 21.03s | noisy to -8.1% |

A focused attempt to lower the worker cap from 4 to 2 was discarded because it did not improve the whole workload and would overfit a local run.

Additional local validation after this follow-up and the force-rebuild cache refresh fix (e733953):

go test ./internal/build -run '^TestStartCacheSaveNormalizesMainPackage$|^TestTryLoadFromCache_ForceRebuild$' -count=1
(cd test/buildcache && bash ./test.sh)
go test ./internal/build
go test -race ./internal/build -run '^TestExtest$|^TestStartCacheSaveNormalizesMainPackage$|^TestTryLoadFromCache_ForceRebuild$' -count=1
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.

Whole Build Setup Follow-up 4

Added source-only commit 43d48c7 to cache repeated tool-environment lookups inside long-lived build/test processes:

  • internal/env.GoEnvWithEnv now caches successful go env ... results by requested variables and effective environment, while still not caching failures. This avoids repeatedly spawning go env GOROOT GOVERSION across multiple internal/build.Do calls in the same process.
  • xtool/env/llvm.New now caches successful llvm-config --bindir results by effective llvm-config/PATH selection, while still retrying failures. This avoids repeatedly spawning llvm-config --bindir during multi-build workloads.
  • Added focused tests for successful-result caching and failure-retry behavior.

Local evidence used repeated full internal/build package runs after the async cache-publication changes:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 25.17s | 22.30s / 20.01s | -11.4% / -20.5% |
| `go test ./internal/env ./xtool/env/llvm ./internal/build -count=1` wall | 25.17s (baseline context) | 21.91s / 22.46s / 20.99s | improved despite extra package tests |

Rejected during this local-only sweep before pushing:

  • full manual YAML manifest emission: passed tests but regressed full internal/build badly;
  • source patch overlay process-global cache: regressed and had source-staleness risk;
  • ad-hoc cache-hit LLVM/package compile skipping: one prototype crashed, the safer version passed but regressed severely;
  • batching SSA package builds with ssa.Program.Build: no improvement;
  • caching default llvm-config path lookup: no improvement beyond caching successful --bindir.

Additional local validation before pushing this follow-up:

```sh
go test ./internal/env ./xtool/env/llvm ./internal/build
go test -race ./internal/build -run '^TestExtest$|^TestStartCacheSaveNormalizesMainPackage$|^TestTryLoadFromCache_ForceRebuild$' -count=1
go test ./internal/crosscompile/...
(cd test/buildcache && bash ./test.sh)
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check
```

CI workflow topology is still unchanged.

Whole Build Pipeline Follow-up 5

Added source-only commit 3d3a3e3 to move expensive golang.org/x/tools/go/ssa sanity checking out of the default build hot path:

  • Default SSA build mode now keeps InstantiateGenerics but does not run SanityCheckFunctions for every compiled package.
  • LLGO_SSA_SANITY=1 restores the old sanity-check behavior for debugging/validation.
  • LLGO_SSA_SANITY is included in cache fingerprint env inputs, so enabling it forces rebuilds instead of reusing default no-sanity cache entries.
  • The untyped-shift workaround remains active when SSA sanity checking is enabled.
  • Added TestSSABuildModeSanityOptIn to cover the default and opt-in modes.

Local evidence from the representative internal/build workload:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 22.2s | 19.8s / 19.7s | about -10% |
| `go test ./internal/build -count=1` go-reported | 21.438s | 19.017s / 19.156s | about -10% |

This was profile-guided: TestExtest CPU/memory profiles showed ssa.mustSanityCheck / ssa.WriteFunction allocating roughly 287MB in the old default path. The new default removes that validation cost from normal builds while preserving an explicit opt-in path.

Rejected during this sweep:

  • raising async cache-save cap from 4 to 8: regressed the representative workload;
  • go:embed comment pre-scan fast path: not a bottleneck and regressed;
  • ssa type-conversion allocation tweak: regressed full internal/build.

Additional local validation before pushing this follow-up:

```sh
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$' -count=1
go test ./internal/build
(cd test/buildcache && bash ./test.sh)
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check
```

Whole Build Pipeline Follow-up 6

Added source-only commit b79f897 to reduce repeated setup and scheduler overhead in multi-build processes:

  • Development LLGO_ROOT discovery now prints the repeated “Using LLGO root for devel” warning only once per root in a process, preserving the diagnostic while avoiding repeated stderr I/O during internal/build tests and other long-lived build drivers.
  • internal/packages now avoids spawning package-load goroutines for narrow import fanout and for the common single-root load case, while preserving parallel loading for wider import graph nodes.
  • Added TestLLGoROOTWarnsOnceForDevelRoot for the warning-once behavior.

Local representative evidence from this sweep:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 22.6s (baseline) | 20.8s (best local run) / 20.7s (post-cleanup validation) | about -8% |
| `go test ./internal/build -count=1` go-reported | 21.645s (baseline) | 19.978s (best local run) / 20.221s (post-cleanup validation) | about -7% |
| clean `go build -a -tags=dev ./cmd/llgo` validation run | | 11.8s | passed; no root llgo artifact |

Rejected during this local-only sweep before pushing:

  • cache-hit minimal package rebuild using new manifest bits: still crashed because cache-hit packages without LPkg miss hidden ABI/link side effects;
  • direct LLVM object emission through a new internal/build cgo shim: failed include-path portability and would add package-level cgo risk;
  • caching ABI type names in ssa/abi: passed SSA tests but regressed the full internal/build workload;
  • parsing small package file lists sequentially: regressed versus the kept package-load fanout change;
  • package-load fanout thresholds 1, 2, and 8: no improvement over the kept threshold of 4;
  • removing TypesInfo.Scopes collection and small ABI metadata slice preallocation: both regressed representative runs;
  • process-global successful LLGO_ROOT caching: no representative win and higher stale-global-state risk.

Additional local validation before pushing this follow-up:

```sh
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$' -count=1
go test ./internal/env ./internal/build ./internal/crosscompile/... -count=1
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check
```

CI workflow topology is still unchanged.

Whole Build Pipeline Follow-up 7

Added source-only commit fff8fff to reduce go/types map growth during package loading:

  • internal/packages.loadPackageEx now gives the types.Info maps modest initial capacities based on the number of parsed source files.
  • This targets the current profile hotspot in go/types.Checker.recordTypeAndValue / TypesInfo population without changing which type information LLGo collects.
  • Kept all existing TypesInfo maps (Types, Defs, Uses, Implicits, Instances, Scopes, Selections) because omitting any required map is unsafe.

Local representative evidence from this sweep:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 30.3s (noisy baseline) | 20.4s / 20.0s (best repeats) | about -33% |
| `go test ./internal/build -count=1` go-reported | 28.522s | 19.599s / 19.452s | about -31% |
| clean `go build -a -tags=dev ./cmd/llgo` guard run | | 11.5s | passed; no root llgo artifact |

Capacity tuning kept 1024 * len(syntax) for Types, 1/2 of that for Defs/Uses, and small per-file capacities for the smaller maps. Rejected alternatives:

  • removing TypesInfo.Implicits: crashed immediately;
  • removing TypesInfo.Scopes: previously passed but regressed;
  • raising Types capacity to 1536 or 2048 per file: regressed;
  • increasing Defs/Uses to match Types: regressed;
  • increasing small-map capacities to 32 per file: regressed;
  • adding a 16384 cap: did not improve the primary wall-time metric;
  • direct LLVM object emission through an internal/build cgo shim: improved internal/build execution but doubled clean go build -a ./cmd/llgo, so it was rejected and not pushed.

Additional local validation before pushing this follow-up:

```sh
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$' -count=1
go test ./internal/packages ./internal/build ./internal/crosscompile/... -count=1
(cd test/buildcache && bash ./test.sh)
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check
```

CI workflow topology is still unchanged.

Latest CI Runtime Snapshot (fff8fff)

Latest head: fff8fffec897f5a4ccf0d2f52426440962d08adb (pre-size package type info maps). All checks completed successfully; CI workflow topology remains unchanged.

Compared against the fastest completed coverage-equivalent goplus/main baseline by total runner time from the current baseline pool (b4d9167, same 55 non-skipped / 1 skipped topology):

| Metric | PR fff8fff | Main b4d9167 | Change |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | same |
| Skipped jobs | 1 | 1 | same |
| Total runner time | 5h48m58s | 6h39m33s | -50m35s |
| End-to-end wall time | 51m04s | 3h01m44s | -2h10m40s |
| Longest single job | 20m23s (Go / test (ubuntu-latest, 19)) | 27m12s (LLGo / llgo (macos-15-intel, 19, 1.26.0)) | -6m49s |

Baseline pool checked: latest completed successful goplus/main run sets with equivalent 55 non-skipped / 1 skipped coverage. Fastest by total runner time was b4d9167 at 6h39m33s; fastest by end-to-end wall time was 7ea3148 at 2h10m04s. PR fff8fff end-to-end wall time was 51m04s.

Per-workflow comparison against b4d9167:

| Workflow | Jobs | PR runner | Main runner | Runner Δ | PR wall | Main wall | Wall Δ | PR longest | Main longest | Longest Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Build Cache | 2 | 5m42s | 6m49s | -1m07s | 3m13s | 9m12s | -5m59s | 3m12s | 3m38s | -0m26s |
| Docs | 6 | 10m19s | 9m20s | +0m59s | 2m38s | 12m24s | -9m46s | 2m29s | 2m51s | -0m22s |
| Format Check | 1 | 0m07s | 0m07s | 0m00s | 0m07s | 0m07s | 0m00s | 0m07s | 0m07s | 0m00s |
| Go | 2 | 40m31s | 42m17s | -1m46s | 20m23s | 24m34s | -4m11s | 20m23s | 23m03s | -2m40s |
| LLGo | 33 | 4h14m04s | 4h56m47s | -42m43s | 50m59s | 3h01m24s | -2h10m25s | 19m02s | 27m12s | -8m10s |
| Release Build | 7 | 17m36s | 22m00s | -4m24s | 44m10s | 33m06s | +11m04s | 8m05s | 8m29s | -0m24s |
| Stdlib Coverage | 2 | 2m27s | 2m00s | +0m27s | 7m38s | 5m34s | +2m04s | 1m30s | 1m01s | +0m29s |
| Targets | 2 | 18m12s | 20m13s | -2m01s | 21m47s | 46m55s | -25m08s | 10m12s | 11m13s | -1m01s |

Top latest PR jobs by duration:

| Duration | Workflow | Job |
| --- | --- | --- |
| 20m23s | Go | test (ubuntu-latest, 19) |
| 20m08s | Go | test (macos-latest, 19) |
| 19m02s | LLGo | llgo (macos-15-intel, 19, 1.26.0) |
| 16m54s | LLGo | llgo (macos-15-intel, 19, 1.24.2) |
| 16m45s | LLGo | llgo (macos-15-intel, 19, 1.21.13) |

Note: end-to-end wall time is affected by GitHub-hosted runner queueing and scheduling; total runner time is usually the better signal for source-level build-performance changes.

Whole Build Pipeline Follow-up 8

Added source-only commit 66693e8 to skip parser object resolution during LLGo package loads:

  • internal/build now supplies a packages.Config.ParseFile callback that uses parser.SkipObjectResolution while still preserving parser.ParseComments for LLGo directives (go:linkname, llgo:*, go:embed, cgo pragmas, etc.).
  • LLGo relies on go/types and x/tools SSA object information, not parser-populated ast.Ident.Obj / ast.File.Scope links, so this avoids unnecessary parser work without reducing comment/directive coverage.
  • Added TestParseBuildFileSkipsObjectResolutionAndKeepsComments to cover both properties: comments are retained, parser object links are not populated.

Local representative evidence:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 23.1s | 20.9s / 20.4s | about -10% to -12% |
| `go test ./internal/build -count=1` go-reported | 22.064s | 20.065s / 19.879s | about -9% to -10% |
| clean `go build -a -tags=dev ./cmd/llgo` | prior guard ~11.5s | 11.0s / 11.33s | no regression |

Rejected during this local-only sweep before pushing:

  • direct native LLVM object emission via a tiny internal/llvmext cgo package: improved LLGo execution workloads but regressed clean go build -a ./cmd/llgo to 12.8s; still best pursued by adding EmitToFile to the existing github.com/goplus/llvm binding instead;
  • AST-count-based TypesInfo map sizing: improved noisy internal/build runs but regressed clean cmd build guard;
  • retuning TypesInfo capacity from 1024/file to 512/file: small/noisy internal improvement but clean cmd guard regressed slightly;
  • pre-sizing the package loader parse cache: internal runs improved but clean cmd guard regressed and unique-count variant regressed;
  • parsing comments only for directive-bearing files: not clearly better than SkipObjectResolution alone and has higher risk of missing directive forms.

Additional local validation before pushing this follow-up:

```sh
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$|^TestParseBuildFileSkipsObjectResolutionAndKeepsComments$' -count=1
go test ./internal/build -count=1
go test ./internal/build ./internal/packages ./internal/crosscompile/... -count=1
(cd test/buildcache && bash ./test.sh)
go build -o <tmp> -tags=dev ./cmd/llgo
go build -a -tags=dev -o <tmp> ./cmd/llgo
test ! -e llgo
git diff --check
```

CI workflow topology is still unchanged.

Latest CI Runtime Snapshot (66693e8)

Latest head: 66693e87e859e1d912209f1172533bdb8e95ffc4 (skip parser object resolution in build loads). All checks completed successfully; CI workflow topology remains unchanged.

Compared against the fastest completed coverage-equivalent goplus/main baseline by total runner time from the current baseline pool (b4d9167, same 55 non-skipped / 1 skipped topology):

| Metric | PR 66693e8 | Main b4d9167 | Change |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | same |
| Skipped jobs | 1 | 1 | same |
| Total runner time | 6h06m37s | 6h39m33s | -32m56s |
| End-to-end wall time | 59m20s | 3h01m44s | -2h02m24s |
| Longest single job | 29m24s (LLGo / llgo (macos-15-intel, 19, 1.21.13)) | 27m12s (LLGo / llgo (macos-15-intel, 19, 1.26.0)) | +2m12s |

Compared with the previous PR head snapshot (fff8fff), this run was slower in CI (+17m39s total runner, +8m16s wall). The local representative tests for 66693e8 improved, so this CI delta is treated cautiously because the longest latest job shifted to a single macOS Intel matrix job.

Per-workflow comparison against b4d9167:

| Workflow | Jobs | PR runner | Main runner | Runner Δ | PR wall | Main wall | Wall Δ | PR longest | Main longest | Longest Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Build Cache | 2 | 5m48s | 6m49s | -1m01s | 3m14s | 9m12s | -5m58s | 3m14s | 3m38s | -0m24s |
| Docs | 6 | 9m28s | 9m20s | +0m08s | 4m59s | 12m24s | -7m25s | 2m37s | 2m51s | -0m14s |
| Format Check | 1 | 0m10s | 0m07s | +0m03s | 0m10s | 0m07s | +0m03s | 0m10s | 0m07s | +0m03s |
| Go | 2 | 38m20s | 42m17s | -3m57s | 19m24s | 24m34s | -5m10s | 19m24s | 23m03s | -3m39s |
| LLGo | 33 | 4h31m06s | 4h56m47s | -25m41s | 59m15s | 3h01m24s | -2h02m09s | 29m24s | 27m12s | +2m12s |
| Release Build | 7 | 21m03s | 22m00s | -0m57s | 18m22s | 33m06s | -14m44s | 8m26s | 8m29s | -0m03s |
| Stdlib Coverage | 2 | 1m54s | 2m00s | -0m06s | 3m47s | 5m34s | -1m47s | 1m05s | 1m01s | +0m04s |
| Targets | 2 | 18m48s | 20m13s | -1m25s | 10m38s | 46m55s | -36m17s | 10m37s | 11m13s | -0m36s |

Top latest PR jobs by duration:

| Duration | Workflow | Job |
| --- | --- | --- |
| 29m24s | LLGo | llgo (macos-15-intel, 19, 1.21.13) |
| 19m32s | LLGo | llgo (macos-15-intel, 19, 1.26.0) |
| 19m24s | Go | test (ubuntu-latest, 19) |
| 18m56s | Go | test (macos-latest, 19) |
| 17m44s | LLGo | llgo (macos-15-intel, 19, 1.24.2) |

Note: end-to-end wall time is affected by GitHub-hosted runner queueing and scheduling; total runner time is usually the better signal for source-level build-performance changes.

LLVM EmitToFile CI Validation (b549c2d)

Temporary CI-validation commit b549c2d switches native host object emission from EmitToMemoryBuffer + Go-side object-file write to the new TargetMachine.EmitToFile API on a forked LLVM module branch:

  • LLVM fork/branch: github.com/cpunion/llvm, branch feat/emit-to-file
  • LLVM commit: 40fdafa target: emit target machine output to file
  • LLGo temporary dependency override:
    replace github.com/goplus/llvm => github.com/cpunion/llvm v0.8.9-0.20260429084913-40fdafa22ac4

Local A/B before pushing:

| Workload | Current v0.8.8 memory-buffer path | Forked LLVM EmitToFile | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -run '^TestExtest$' -count=3` wall | 32.1s | 23.2s | -8.9s (-27.7%) |
| same, go-reported | 24.254s | 22.378s | -1.876s (-7.7%) |
| `go test ./internal/build -count=1` | validated | 20.489s / 20.866s | passed |
| clean `go build -a -tags=dev ./cmd/llgo` | prior guards ~11.0-11.5s | 11.1s / local validation passed | no LLGo-side cgo regression |

This validates the direction that PR #1823 enabled: native object emission should move from memory-buffer output to direct file output, but the API belongs in the existing github.com/goplus/llvm binding rather than a new LLGo-side cgo shim. Once the LLVM PR is merged/tagged, the temporary replace should be removed and LLGo should depend on the released github.com/goplus/llvm version.

CI result for LLVM EmitToFile replace commit (b549c2d)

All checks completed successfully for b549c2d9e8cdedfad0b4ba919beb72e20d7340cf (57 success, 1 skipped; merge state CLEAN).

Coverage-equivalent comparison (55 non-skipped jobs, 1 skipped job):

| Metric | b549c2d (forked LLVM EmitToFile) | Previous PR head 66693e8 | Δ vs previous PR head | Main baseline b4d9167 | Δ vs main |
| --- | --- | --- | --- | --- | --- |
| Total runner time | 6h23m49s | 6h06m37s | +17m12s (+4.7%) | 6h39m33s | -15m44s (-3.9%) |
| End-to-end wall | 53m51s | 59m16s | -5m25s (-9.1%) | 3h01m33s | -2h07m42s (-70.3%) |
| Longest job | 27m08s | 29m24s | -2m16s (-7.7%) | 27m12s | -0m04s (-0.2%) |
| LLGo workflow total | 4h37m36s | 4h31m06s | +6m30s (+2.4%) | 4h56m47s | -19m11s (-6.5%) |
| LLGo llgo build-job bucket | 1h50m49s | 1h53m55s | -3m06s (-2.7%) | 2h07m32s | -16m43s (-13.1%) |
| Release Build workflow | 20m50s | 21m03s | -0m13s (-1.0%) | 22m00s | -1m10s (-5.3%) |

Per-workflow totals for b549c2d vs previous PR head 66693e8:

| Workflow | b549c2d | 66693e8 | Δ |
| --- | --- | --- | --- |
| Build Cache | 5m41s | 5m48s | -0m07s |
| Docs | 13m11s | 9m28s | +3m43s |
| Format Check | 0m06s | 0m10s | -0m04s |
| Go | 46m27s | 38m20s | +8m07s |
| LLGo | 4h37m36s | 4h31m06s | +6m30s |
| Release Build | 20m50s | 21m03s | -0m13s |
| Stdlib Coverage | 2m13s | 1m54s | +0m19s |
| Targets | 17m45s | 18m48s | -1m03s |

CI is noisy at whole-run granularity: total runner time regressed vs the previous PR head mostly from Go/Docs/test-job variance, while end-to-end wall, longest job, Release Build, Targets, and the LLGo llgo build-job bucket improved. The local targeted A/B remains the clearest signal for the EmitToFile implementation itself (TestExtest -count=3: 32.1s -> 23.2s wall, -27.7%).

EmitToFile validation discarded

The temporary b549c2d validation commit using replace github.com/goplus/llvm => github.com/cpunion/llvm ... has been dropped from this PR and the branch has been reset back to 66693e8.

Reason: while the forked LLVM EmitToFile API was functional and showed a targeted local improvement for TestExtest, the completed CI run did not show a clear whole-pipeline/total-runner-time improvement over the previous PR head. To keep this PR focused on proven build-performance changes, the temporary dependency replace and LLGo code change are discarded. The LLVM API work can still be pursued separately as a cleanup/API PR, but it is not included here as a build-performance change.

Local follow-up after dropping EmitToFile (7004afe)

After discarding the temporary LLVM EmitToFile replace commit, this source-only follow-up adds two low-risk local hot-path reductions:

  • internal/goembed: skip the full go:embed declaration walk when parsed comments contain no go:embed text.
  • ssa/type_cvt.go: lazily allocate converted tuple/interface/struct slices only when a nested type actually changes.

Local paired A/B (origin/improve/build-perf-pure at 66693e8 vs combined patch):

| Workload | Baseline | 7004afe patch | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -run '^TestExtest$' -count=3` wall | 25.777s | 24.997s | -0.780s (-3.0%) |
| same, go-reported | 24.849s | 24.419s | -0.430s (-1.7%) |

Incremental local A/B while developing:

| Change | Wall change | Go-reported change |
| --- | --- | --- |
| go:embed pre-scan | 24.850s -> 23.751s (-4.4%) | 23.863s -> 23.169s (-2.9%) |
| SSA type-conversion lazy allocation | 23.946s -> 23.263s (-2.9%) | 23.111s -> 22.704s (-1.8%) |

Discarded during this round because paired A/B regressed:

  • isGoSSAOpaqueType reflection fast path for public go/types implementations.
  • runtime.Version() parse cache in package loading.
  • cvtNamed zero-method allocation special case.

Local validation before pushing 7004afe:

```sh
go test ./internal/goembed ./ssa ./cl ./internal/build -count=1
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$' -count=1
go build -o <tmp> -tags=dev ./cmd/llgo
go clean -cache && go build -a -tags=dev -o <tmp> ./cmd/llgo
test ! -e llgo
git diff --check
```

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant build performance optimizations and caching mechanisms across the build system. Key improvements include parallelizing C and assembly compilation with a configurable worker pool, caching pkg-config results and LLVM version detection, and optimizing manifest fingerprinting using unsafe string-to-byte conversions to avoid allocations. Additionally, the PR implements a fast-path for manifest metadata parsing to bypass full YAML decoding and optimizes C file scanning by leveraging package metadata. Review feedback suggests further performance gains by caching environment variables used in pkg-config lookups and increasing the default parallel job limit for high-performance build environments.

Comment thread internal/build/cgo.go
Comment on lines +515 to +519
```go
for _, env := range os.Environ() {
	if strings.HasPrefix(env, "PKG_CONFIG") {
		keyParts = append(keyParts, env)
	}
}
```
medium

Calling os.Environ() inside cachedPkgConfig is inefficient because it allocates and copies the entire environment on every call. Since this function is called for every #cgo pkg-config directive across all files in a package, this can add significant overhead in projects with many cgo dependencies. Consider caching the filtered PKG_CONFIG environment variables globally or within the build context to improve performance.

Comment on lines +53 to 55
```go
if jobs > 16 {
	return 16, nil
}
```
medium

The hardcoded limit of 16 parallel jobs may be too low for modern high-performance build environments. Since this PR focuses on build performance, consider increasing this limit or removing it entirely to allow full utilization of available CPU cores, especially as clang processes are independent.

Suggested change:

```diff
-if jobs > 16 {
-	return 16, nil
-}
+if jobs > 64 {
+	return 64, nil
+}
```

@cpunion cpunion force-pushed the improve/build-perf-pure branch from 2c3d623 to 7fbdf87 Compare April 27, 2026 02:49
@codecov

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 94.89164% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.36%. Comparing base (3ac9c14) to head (340e732).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/crosscompile/compile/compile.go | 91.19% | 9 Missing and 8 partials ⚠️ |
| ssa/type_cvt.go | 80.76% | 7 Missing and 3 partials ⚠️ |
| internal/goembed/goembed.go | 66.66% | 2 Missing and 2 partials ⚠️ |
| internal/crosscompile/crosscompile.go | 87.50% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1832      +/-   ##
==========================================
- Coverage   88.37%   88.36%   -0.01%
==========================================
  Files          51       51
  Lines       14498    14756    +258
==========================================
+ Hits        12812    13039    +227
- Misses       1468     1485     +17
- Partials      218      232     +14
```

☔ View full report in Codecov by Sentry.

@cpunion cpunion force-pushed the improve/build-perf-pure branch 2 times, most recently from 8566a2d to cd7cc02 Compare May 2, 2026 16:55
cpunion added 16 commits May 7, 2026 08:45
Result: {"status":"keep","rebased_internal_build_wall":30.925,"go_reported_s":30.325,"baseline_s":32.555,"patched_s":30.925,"delta_s":-1.63,"base_go_reported_s":31.983,"patched_go_reported_s":30.325}
Result: {"status":"keep","rebased_internal_build_wall":31.8,"go_reported_s":31.224,"baseline_s":31.8,"patched_s":31.8,"delta_s":0,"base_go_reported_s":31.224,"patched_go_reported_s":31.224}
Result: {"status":"keep","warm_internal_build_wall":29.279,"go_reported_s":28.415,"baseline_s":31.157,"patched_s":29.279,"delta_s":-1.878,"base_go_reported_s":30.589,"patched_go_reported_s":28.415,"wall_s":29.279}
Result: {"status":"keep","warm_internal_build_wall":29.256,"go_reported_s":28.699,"baseline_s":31.312,"patched_s":29.256,"delta_s":-2.056,"base_go_reported_s":30.727,"patched_go_reported_s":28.699,"wall_s":29.256}
Result: {"status":"keep","warm_internal_build_wall":29.456,"go_reported_s":28.89,"baseline_s":30.32,"patched_s":29.456,"delta_s":-0.864,"base_go_reported_s":29.756,"patched_go_reported_s":28.89,"wall_s":29.456}
Result: {"status":"keep","warm_internal_build_wall":28.574,"go_reported_s":27.996,"baseline_s":30.326,"patched_s":28.574,"delta_s":-1.752,"base_go_reported_s":29.768,"patched_go_reported_s":27.996,"wall_s":28.574}
@cpunion cpunion force-pushed the improve/build-perf-pure branch from 340e732 to fd5680f Compare May 7, 2026 00:46
@codecov-commenter

codecov-commenter commented May 7, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!
