
Improve build performance without CI fan-out #1832

Open

cpunion wants to merge 45 commits into xgo-dev:main from cpunion:improve/build-perf-pure

Conversation


@cpunion cpunion commented Apr 27, 2026

Summary

This PR is the pure build-performance subset of the previous build-perf work. It intentionally removes the CI workflow/job split and sharding changes so the branch focuses only on compiler/build hot-path improvements.

What Changed

  • Optimized LLGo build-cache/fingerprint/manifest hot paths while preserving the YAML manifest format and fallback parsing paths.
  • Reduced cgo/build overhead:
    • faster cgo preamble and pragma scanning,
    • package metadata fast paths for cgo C-file discovery,
    • skip cgo extern declaration generation when no extern symbols are used.
  • Improved crosscompile compile-group performance by parallelizing independent external compiler/archive work with bounded worker counts and deterministic output ordering.
  • Simplified large crosscompile file-list construction to reduce compile/runtime overhead.
  • Kept focused tests for the new fast paths and fallbacks.

Intentionally Not Included

  • No CI workflow fan-out/splitting/sharding changes.
  • No CI-only retry/cache script changes.
  • No autoresearch metadata files.
  • No local root llgo binary.

Validation

Ran locally on this branch:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo

Latest Local Follow-up: Async Native Object Emission

Added source-only commit e94d62a to overlap native host object emission with later package work. The main compiler still serializes LLGo/LLVM IR generation, then sends serialized LLVM IR to bounded external clang workers for native object emission. This avoids same-process LLVM concurrency while overlapping object generation. LLGO_PARALLEL_OBJECT_EMIT=0 is available as an opt-out, and debug/-genll/command-tracing paths remain synchronous.

Local evidence before pushing:

| Workload | Baseline | Patched | Change |
| --- | --- | --- | --- |
| go test ./internal/build -run '^TestExtest$' -count=3 | 25.061s / 24.390s | 24.501s / 24.139s | -2.2% / -1.0% |
| go test ./internal/build -count=1 | 22.464s / 22.786s | 22.240s / 21.707s | -1.0% / -4.7% |
| bounded-worker repeat go test ./internal/build -count=1 | 23.560s | 22.606s | -4.0% |
| go clean -cache && go build -a -tags=dev ./cmd/llgo | 11.078s | 11.058s | neutral |

Additional local validation passed: targeted object-emission gating tests, go test -race subset for internal/build, clean go build -a -tags=dev -o <tmp> ./cmd/llgo, test ! -e llgo, and git diff --check.
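The bounded-worker emission described above can be sketched as follows. This is a stdlib-only illustration with hypothetical names, not the PR's actual code: a semaphore channel caps how many external emissions run at once while the caller keeps producing work, and everything is drained before linking.

```go
package main

import (
	"fmt"
	"sync"
)

// emitObject stands in for invoking an external clang worker on serialized
// LLVM IR. In the real build this would shell out; here it just records a
// fake object name.
func emitObject(pkg string, results *sync.Map) {
	results.Store(pkg, pkg+".o")
}

// emitAll overlaps object emission with later package work using a bounded
// number of workers (cap 2, matching the tuned worker cap described above).
func emitAll(pkgs []string, maxWorkers int) map[string]string {
	sem := make(chan struct{}, maxWorkers) // bounded-worker semaphore
	var wg sync.WaitGroup
	var results sync.Map
	for _, pkg := range pkgs {
		wg.Add(1)
		sem <- struct{}{} // block when maxWorkers emissions are in flight
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }()
			emitObject(p, &results)
		}(pkg)
	}
	wg.Wait() // drain all workers before linking
	out := make(map[string]string)
	results.Range(func(k, v any) bool {
		out[k.(string)] = v.(string)
		return true
	})
	return out
}

func main() {
	objs := emitAll([]string{"a", "b", "c"}, 2)
	fmt.Println(len(objs))
}
```

The opt-out described above (LLGO_PARALLEL_OBJECT_EMIT=0) would correspond to calling the same code with a worker cap of 1 or bypassing the goroutines entirely.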

Follow-up commit 1bb466b lowers the bounded object-emission worker cap from 4 to 2 to reduce external clang contention. Local full-suite A/B over e94d62a:

| Workload | Cap 4 | Cap 2 | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 25.987s | 22.489s | -13.5% |
| repeat go test ./internal/build -count=1 | 21.440s | 21.212s | -1.1% |

Cap 1 and cap 3 both regressed locally, so cap 2 is the current local best. Race subset and clean go build -a -tags=dev guards passed after the cap change.

Follow-up commit 740b15c avoids copying the serialized LLVM IR string into a []byte before writing the temporary .ll file. Local full-suite A/B over cap-2 async emission:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 23.222s | 22.443s | -3.4% |
| repeat go test ./internal/build -count=1 | 21.806s | 21.678s | -0.6% |

A targeted TestExtest / object-emission gating test run passed before pushing.

Follow-up commit 48cde59 uses the C clang driver instead of clang++ for native async IR object emission while preserving the configured compiler for cross/-genll paths. Local full-suite A/B over 740b15c:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 22.705s | 22.211s | -2.2% |
| repeat go test ./internal/build -count=1 | 22.109s | 21.958s | -0.7% |

Targeted TestExtest / object-emission gating tests passed before pushing; full go test ./internal/build -count=1, clean go build -a -tags=dev -o <tmp> ./cmd/llgo, test ! -e llgo, and git diff --check also passed after pushing.

Follow-up commit 10cc3f4 broadens async object emission to external-clang/cross builds too, while explicitly keeping -genll, IR checking, and command-tracing paths synchronous. Local full-suite A/B over 48cde59:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 22.035s | 21.644s | -1.8% |
| repeat go test ./internal/build -count=1 | 22.970s | 22.320s | -2.8% |

Targeted TestParallelObjectEmitEnabled / TestExtest, race subset, and clean go build -a -tags=dev guards passed before pushing. Follow-up a2e2505 keeps external/cross async emission on the target-specific compiler (instead of the native clang driver) after Targets CI exposed xtensa builds using the wrong compiler; local build.sh empty esp32 passed with the fix.

Follow-up commit 61b1b1d pipes async LLVM IR to clang via stdin (clang -x ir -c -) instead of writing temporary .ll files when GenLL/IR-checking are disabled. Debug/check paths still materialize .ll files. Local full-suite A/B over a2e2505:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 23.336s | 22.250s | -4.7% |

Validation also passed: targeted async/object-emission tests plus TestExtest, build.sh empty esp32, race subset, clean go build -a -tags=dev, and git diff --check.

Follow-up commit 1c53158 leaves go/types.Info.Scopes nil in LLGo package loads because LLGo and x/tools/go/ssa do not consume lexical scope records during compilation. This avoids extra type-checker scope recording. Local full-suite A/B over 61b1b1d:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 24.530s | 22.511s | -8.2% |
| repeat go test ./internal/build -count=1 | 23.085s | 22.013s | -4.6% |

Validation passed: go test ./internal/packages ./internal/build ./ssa ./cl -count=1, race subset, clean go build -a -tags=dev, and git diff --check.
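The Scopes-nil idea can be demonstrated with stdlib go/types directly: leaving the Scopes map nil in types.Info tells the type checker to skip recording lexical scopes, while the other result maps are still populated. This is a minimal standalone illustration, not LLGo's loader code.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

// typeCheckNoScopes type-checks src while leaving Info.Scopes nil, so the
// type checker allocates no scope records (the saving behind commit
// 1c53158); Types/Defs/Uses are still filled in for downstream use.
func typeCheckNoScopes(src string) (*types.Info, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "p.go", src, 0)
	if err != nil {
		return nil, err
	}
	info := &types.Info{
		Types: map[ast.Expr]types.TypeAndValue{},
		Defs:  map[*ast.Ident]types.Object{},
		Uses:  map[*ast.Ident]types.Object{},
		// Scopes intentionally nil: no lexical-scope records are built.
	}
	_, err = (&types.Config{}).Check("p", fset, []*ast.File{f}, info)
	return info, err
}

func main() {
	info, err := typeCheckNoScopes("package p\nfunc add(a, b int) int { return a + b }")
	fmt.Println(err == nil, len(info.Defs) > 0, len(info.Scopes))
}
```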

Follow-up commit 2ecb5b6 avoids copying yaml.Marshal manifest bytes into a second string by using the existing read-only unsafe byte-slice-to-string helper. Local full-suite A/B over 1c53158:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 23.225s | 22.598s | -2.7% |
| repeat go test ./internal/build -count=1 | 22.562s | 22.457s | -0.5% |

Validation passed: manifest/fingerprint/cache targeted tests, full build-cache script, clean go build -a -tags=dev, and git diff --check.
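The read-only byte-slice-to-string conversion mentioned above typically looks like the following sketch (the real helper's name and location are not shown in this PR text). The contract is the usual one for unsafe.String: the bytes must not be mutated after the conversion, which holds for freshly marshaled manifest bytes that are written once and discarded.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying. Safe only if b
// is never mutated afterwards, as with write-once marshaled manifest bytes.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(&b[0], len(b))
}

func main() {
	manifest := []byte("package: fmt\nhash: abc123\n")
	fmt.Println(len(bytesToString(manifest)))
}
```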

Follow-up commit 22d440a creates all x/tools/go/ssa packages first, then calls Program.Build() once so the upstream SSA builder can use its documented parallel package build path. LLGo/LLVM codegen still runs sequentially after this phase. Local full-suite A/B over 2ecb5b6:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 24.158s | 22.587s | -6.5% |
| repeat go test ./internal/build -count=1 | 23.902s | 22.334s | -6.6% |

Validation passed: go test ./internal/build ./ssa ./cl -count=1, race subset, clean go build -a -tags=dev, and git diff --check.

Follow-up commit c0ff171 avoids constructing full go/types method sets in the local SSA order fixup. The fixup now visits explicit named method functions directly via Program.FuncValue, avoiding MethodSet allocation for every type. Local full-suite A/B over 22d440a:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 24.295s | 22.676s | -6.7% |
| repeat go test ./internal/build -count=1 | 22.907s | 22.633s | -1.2% |

Validation passed: go test ./internal/build ./ssa ./cl -count=1, targeted SSA-order tests, race subset, clean go build -a -tags=dev, and git diff --check.

Follow-up commit 60e0404 tracks whether buildSSAPkgs actually created new SSA packages and skips a redundant Program.Build() traversal when the call only wraps packages built by earlier setup; local SSA fixups still run for returned packages. Local full-suite A/B over c0ff171:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 22.722s | 22.549s | -0.8% |
| repeat go test ./internal/build -count=1 | 23.577s | 21.986s | -6.7% |

Validation passed: targeted TestExtest/SSA-order/object-emission tests, race subset, and git diff --check.

Follow-up commit cc6d908 writes LLGo build manifests with a deterministic specialized YAML emitter instead of using generic yaml.Marshal reflection for the hot package-manifest path. The cache manifest remains YAML and existing YAML decoding/legacy fallback remain in place. Local full-suite A/B over 60e0404:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 24.378s | 23.615s | -3.1% |
| repeat go test ./internal/build -count=1 | 23.339s | 23.050s | -1.2% |

Validation passed: manifest/fingerprint/cache targeted tests, full build-cache script, clean go build -a -tags=dev, and git diff --check.

Follow-up commit e213989 avoids strconv.Quote for manifest strings that are safe plain YAML scalars, reducing allocations and manifest size in the specialized emitter while still quoting ambiguous/special values. Local full-suite A/B over cc6d908:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 | 23.173s | 22.916s | -1.1% |
| repeat go test ./internal/build -count=1 | 23.534s | 23.107s | -1.8% |

Validation passed: manifest/fingerprint/cache targeted tests, full build-cache script, clean go build -a -tags=dev, and git diff --check.
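The "quote only ambiguous values" rule above can be sketched with a deliberately conservative check. This is an illustration of the shape of such an emitter, not the PR's actual predicate: anything empty, whitespace-padded, bool/number/null-like, or containing YAML indicator characters falls back to strconv.Quote.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isPlainYAMLScalar reports whether s can be emitted unquoted as a YAML
// scalar. Conservative: when in doubt, quote.
func isPlainYAMLScalar(s string) bool {
	if s == "" || strings.TrimSpace(s) != s {
		return false
	}
	switch strings.ToLower(s) {
	case "true", "false", "null", "yes", "no", "~":
		return false // would be read back as bool/null
	}
	if _, err := strconv.ParseFloat(s, 64); err == nil {
		return false // would be read back as a number
	}
	if strings.ContainsAny(s, ":#{}[],&*!|>'\"%@`\\\n\t") {
		return false // YAML indicators and escapes need quoting
	}
	switch s[0] {
	case '-', '?':
		return false
	}
	return true
}

// emitScalar quotes only when needed, as commit e213989 describes.
func emitScalar(s string) string {
	if isPlainYAMLScalar(s) {
		return s
	}
	return strconv.Quote(s)
}

func main() {
	fmt.Println(emitScalar("internal/build"), emitScalar("true"))
}
```

Plain package paths and hex digests pass the check and skip the Quote allocation; values like "true" or "a: b" still round-trip safely through quoting.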

Follow-up commit 86d5af4 trims dead build helper code and reuses scratch state in the SSA-order fixup:

  • removes the now-unreachable generic yaml.Marshal fallback for build-manifest emission,
  • removes production-only helpers that were no longer used outside tests (manifestBuilder.Fingerprint, digestFile),
  • reuses a dependency scratch map while checking stores in fixSSAOrderBlock.

Local paired A/B evidence:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go build -a -tags=dev -o <tmp> ./cmd/llgo (yaml.Marshal fallback removal) | 10.853s | 10.572s | -2.6% |
| repeat go build -a -tags=dev -o <tmp> ./cmd/llgo | 11.115s | 10.649s | -4.2% |
| remove unused digestFile helper | 10.749s | 10.522s | -2.1% |
| repeat remove unused digestFile helper | 10.526s | 10.515s | -0.1% |
| go test ./internal/build -count=1 (SSA-order scratch reuse) | 25.591s | 24.503s | -4.3% |
| repeat SSA-order scratch reuse | 24.367s | 24.203s | -0.7% |

A third SSA-order repeat was slightly negative (+0.2%), so that part is treated as a small low-risk allocation cleanup rather than a large claimed speedup. Validation before push passed targeted manifest/fingerprint/digest/metadata/SSA-order/TestExtest tests, clean go build -a -tags=dev, test ! -e llgo, and git diff --check.

Follow-up commit 8549fc4 inlines the only remaining production digestBytes call site in the overlay file-digest path and keeps the hash helper test-local. This trims a dead production helper after the earlier digestFile cleanup while preserving the same sha256 + hex encoding logic.

Local clean-build A/B over 86d5af4:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go build -a -tags=dev -o <tmp> ./cmd/llgo | 10.894s | 10.679s | -2.0% |
| repeat | 10.692s | 10.630s | -0.6% |
| third run | 11.067s | 10.619s | -4.1% |

Validation before push passed targeted digest/manifest/metadata/TestExtest tests, full go test ./internal/build -count=1, clean go build -a -tags=dev -o <tmp> ./cmd/llgo, test ! -e llgo, and git diff --check.

CI Result for Latest Head (8549fc4)

Latest pushed head is CI-clean:

  • Checks: 57 success, 1 skipped, 0 failed
  • Merge state: CLEAN
  • Non-skipped jobs counted for timing: 55
  • Total runner time: 5h59m51s
  • LLGo workflow total: 4h16m06s
  • Wall time: 51m30s
  • Longest job: 27m07s (test (macos-latest, 19) in Go workflow)

Compared with the previous clean head 86d5af4, this run improves total runner time by 6m40s, LLGo workflow total by 13m10s, and end-to-end wall time by 5m34s, while the longest job is 1m28s longer due to the Go workflow macOS test job. Compared with the coverage-equivalent main baseline b4d9167, it is lower in total runner time (-39m42s), LLGo workflow total (-40m41s), wall time (vs that main sample), and longest-job time (-0m05s). As before, hosted-runner variance remains significant; local paired A/B is the primary source-level evidence.

CI Result for Latest Head (86d5af4)

Latest pushed head is CI-clean:

  • Checks: 57 success, 1 skipped, 0 failed
  • Merge state: CLEAN
  • Non-skipped jobs counted for timing: 55
  • Total runner time: 6h06m31s
  • LLGo workflow total: 4h29m16s
  • Wall time: 57m04s
  • Longest job: 25m39s (llgo (macos-15-intel, 19, 1.24.2) in LLGo workflow)

Compared with the previous clean head e213989, this run improves total runner time by 14m37s and LLGo workflow total by 7m31s, with a similar longest job (-19s) and slightly higher end-to-end wall time (+47s). Compared with the coverage-equivalent main baseline b4d9167, it is lower in total runner time (-33m02s), LLGo workflow total (-27m31s), and longest job (-1m33s). As before, local paired A/B remains the primary evidence for source-level changes because hosted-runner timing is noisy.

CI Result for Latest Head (e213989)

Latest pushed head is CI-clean:

  • Checks: 57 success, 1 skipped, 0 failed
  • Merge state: CLEAN
  • Non-skipped jobs counted for timing: 55
  • Total runner time: 6h21m08s
  • LLGo workflow total: 4h36m47s
  • Wall time: 56m17s
  • Longest job: 25m58s (test (macos-latest, 19) in Go workflow)

Compared with the previous clean head 60e0404, this hosted-runner sample is mixed/noisier: total runner time is +13m16s and LLGo workflow total is +5m17s, while the longest job improves by 2m59s. Compared with the coverage-equivalent main baseline b4d9167, it remains faster in total runner time (-18m25s), LLGo workflow total (-20m00s), and longest job (-1m14s). The manifest-emitter commits are therefore justified primarily by local paired A/B and build-cache validation rather than by claiming a whole-CI timing win from this single run.

CI Result for Latest Head (60e0404)

Latest CI completed clean: 57 successful checks and 1 skipped check; merge state is CLEAN.

Compared with the previous clean async-tuning sample (48cde59), CI is mixed: total runner time and LLGo workflow total are higher on this run, while wall time is essentially unchanged and several individual jobs still improve. This reinforces that the later SSA/cache hot-path commits are justified primarily by local paired A/B evidence, not by a single hosted-runner timing sample.

| Metric | 48cde59 | 60e0404 | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Skipped jobs | 1 | 1 | +0 |
| Total runner time | 5h59m35s | 6h07m52s | +8m17s |
| LLGo workflow total | 4h19m13s | 4h31m30s | +12m17s |
| End-to-end wall time | 55m49s | 55m41s | -0m08s |
| Longest single job | 24m46s | 28m57s | +4m11s |

Compared with the coverage-equivalent main baseline (b4d9167), the latest head remains lower in total runner time and LLGo workflow total, though the longest single job is higher in this sample.

| Metric | main b4d9167 | 60e0404 | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Total runner time | 6h39m33s | 6h07m52s | -31m41s |
| LLGo workflow total | 4h56m47s | 4h31m30s | -25m17s |
| Longest single job | 27m12s | 28m57s | +1m45s |

CI Result for Latest Async Object Emission Tuning (48cde59)

Latest CI completed clean: 57 successful checks and 1 skipped check. Compared with the first async object-emission CI sample (e94d62a), the follow-up cap/IR-copy/clang-driver tuning improves total runner time, LLGo workflow total, and end-to-end wall time, though the single longest job is longer on this sample.

| Metric | Async object emission (e94d62a) | Tuned async object emission (48cde59) | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Skipped jobs | 1 | 1 | +0 |
| Longest single job | 21m58s | 24m46s | +2m48s |
| Total runner time | 6h13m10s | 5h59m35s | -13m35s |
| LLGo workflow total | 4h35m13s | 4h19m13s | -16m00s |
| End-to-end wall time | 1h15m33s | 55m49s | -19m44s |

Compared with the pre-async PR head (7004afe), the latest head is also lower by total runner time (5h59m35s vs 6h05m05s) and LLGo workflow total (4h19m13s vs 4h32m44s), with the same workflow topology/job coverage.

CI Result for Async Object Emission (e94d62a)

The first CI attempt hit a transient GitHub 502 Bad Gateway while downloading the ESP newlib tarball in hello (macos-latest, 19, 1.26.0). Rerunning the failed job succeeded; final PR status is clean: 57 successful checks and 1 skipped check.

Compared with the previous PR head 7004afe, the latest run improves the longest single job but does not show a total-runner-time win on this one CI sample:

| Metric | Previous PR head (7004afe) | Async object emission (e94d62a) | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Skipped jobs | 1 | 1 | +0 |
| Longest single job | 25m07s | 21m58s | -3m09s |
| Total runner time | 6h05m05s | 6h13m10s | +8m05s |
| LLGo workflow total | 4h32m44s | 4h35m13s | +2m29s |
| End-to-end wall time | 3h49m46s | 1h15m33s | -2h34m13s |

Against the fastest coverage-equivalent goplus/main baseline (b4d9167), the latest PR head remains faster overall (6h13m10s vs 6h39m33s total runner time), but the async object-emission commit itself needs more CI samples before claiming a whole-CI improvement.

CI Runtime Snapshot (latest head after tool environment caching)

Measured from GitHub Actions job startedAt / completedAt timestamps. Skipped jobs are excluded from runtime totals. Codecov checks are not included because they are external status checks rather than Actions runtime jobs. This source-only branch keeps the same workflow topology as main; end-to-end wall time can still vary significantly with hosted-runner queueing, so total runner time is the less noisy cost proxy.

Data sources:

  • Latest successful CI runs for PR #1832 (Improve build performance without CI fan-out) at 43d48c75f5c284805930291cdbfd38f4f9c9bc7d: 25047647064, 25047647084, 25047647105, 25047647095, 25047647059, 25047647044, 25047647061, 25047647056
  • Best comparable completed goplus/main CI run set by total runner time, at b4d9167e460d91a4a0f09a0f8616670a8fbd23fa: 24972314382, 24972314373, 24972314376, 24972314381, 24972314377, 24972314387, 24972314374, 24972314386

Baseline selection note for future snapshots: compare against the fastest completed goplus/main run set that has the same workflow topology / job coverage (same non-skipped and skipped job count where possible). Older completed main runs with fewer jobs are not used as the main baseline because they are not coverage-equivalent. In the currently queried recent completed main runs, there are 3 completed successful main run sets with the same 55 non-skipped / 1 skipped job topology; b4d9167 remains the fastest by total runner time, while 7ea3148 is the fastest by wall time.

| Metric | PR #1832 latest head (43d48c7) | Best comparable completed goplus/main (b4d9167) | Difference |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | +0 |
| Skipped jobs | 1 | 1 | +0 |
| Longest single job | 28m36s | 27m12s | +1m24s |
| Total runner time | 6h17m26s | 6h39m33s | -22m07s |
| End-to-end wall time | 2h01m48s | 3h01m33s | -59m45s |

Latest longest PR job: Go / test (macos-latest, 19).

Fastest comparable main total runner-time baseline: b4d9167e460d91a4a0f09a0f8616670a8fbd23fa. Fastest comparable main wall-time baseline: 7ea31484337c1d3b560fea9f07bbca1dcf75150a at 2h09m59s.

| Workflow | PR #1832 (jobs / total / wall / longest) | Best comparable goplus/main (jobs / total / wall / longest) |
| --- | --- | --- |
| Build Cache | 2 / 5m40s / 3m02s / 3m02s | 2 / 6m49s / 9m12s / 3m38s |
| Docs | 6 / 9m57s / 49m43s / 2m29s | 6 / 9m20s / 12m24s / 2m51s |
| Format Check | 1 / 0m07s / 0m07s / 0m07s | 1 / 0m07s / 0m07s / 0m07s |
| Go | 2 / 48m46s / 1h10m07s / 28m36s | 2 / 42m17s / 24m34s / 23m03s |
| LLGo | 33 / 4h32m00s / 1h55m41s / 24m24s | 33 / 4h56m47s / 3h01m24s / 27m12s |
| Release Build | 7 / 21m03s / 39m16s / 7m58s | 7 / 22m00s / 33m06s / 8m29s |
| Stdlib Coverage | 2 / 2m14s / 17m26s / 1m13s | 2 / 2m00s / 5m34s / 1m01s |
| Targets | 2 / 17m39s / 10m40s / 9m38s | 2 / 20m13s / 46m55s / 11m13s |

Latest Pure Build Hot-Path Follow-up

After the rebase, added one source-only follow-up commit 85d1523 focused on cgo build metadata and pragma hot paths, without changing CI workflow topology.

Local focused benchmarks used during the follow-up:

| Hot path | Before | After | Change |
| --- | --- | --- | --- |
| splitDirectiveArgs mostly-unquoted args | 84.67 ns/op, 128 B/op, 2 allocs/op | 55.75 ns/op, 64 B/op, 1 alloc/op | -34.2% |
| Darwin go:cgo_* build-flow pragma collection | 463.4 ns/op, 288 B/op, 10 allocs/op | 231.9 ns/op, 144 B/op, 5 allocs/op | -50.0% |
| no-cgo buildCgo with complete metadata | 39.4 ns/op, 256 B/op, 1 alloc/op | 9.8 ns/op, 0 B/op, 0 allocs/op | -75.1% |
| header-heavy cgo OtherFiles metadata extraction | 2060 ns/op, 2688 B/op, 1 alloc/op | 263.5 ns/op, 48 B/op, 1 alloc/op | -87.2% |

Additional local validation after the follow-up:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
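The single-allocation direction behind the splitDirectiveArgs row above can be sketched like this. Assumptions are labeled: the real function's exact semantics are not shown in this PR text, so this sketch assumes space-separated directive arguments with occasional simple double-quoted values, and it does not handle escaped quotes inside quoted arguments.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// splitArgs splits a //go:cgo_* directive's arguments. The fast path
// handles the common mostly-unquoted case with a single slice allocation
// via strings.Fields; only input containing a quote takes the slower
// unquoting path. (Sketch of the optimization shape, not the PR's code.)
func splitArgs(s string) []string {
	if !strings.ContainsRune(s, '"') {
		return strings.Fields(s) // one allocation for the backing array
	}
	var args []string
	for s = strings.TrimSpace(s); s != ""; s = strings.TrimSpace(s) {
		if s[0] == '"' {
			if end := strings.Index(s[1:], `"`); end >= 0 {
				if unq, err := strconv.Unquote(s[:end+2]); err == nil {
					args = append(args, unq)
					s = s[end+2:]
					continue
				}
			}
		}
		i := strings.IndexByte(s, ' ')
		if i < 0 {
			i = len(s)
		}
		args = append(args, s[:i])
		s = s[i:]
	}
	return args
}

func main() {
	fmt.Println(splitArgs(`dynimport _close close "libc.so"`))
}
```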

Additional Pure Build Hot-Path Follow-up

Added source-only commit b427daf with further cgo metadata / pragma scan reductions. CI workflow topology is still unchanged.

Local focused benchmarks from this follow-up:

| Hot path | Before | After | Change |
| --- | --- | --- | --- |
| header-only buildCgo with complete metadata | 472.6 ns/op, 0 B/op, 0 allocs/op | 212.6 ns/op, 0 B/op, 0 allocs/op | -55.0% |
| sorted multi-source cgo OtherFiles metadata | 811 ns/op, 2784 B/op, 4 allocs/op | 646.6 ns/op, 2688 B/op, 1 alloc/op | -20.3% |
| single-source cgo OtherFiles metadata | 16.36 ns/op, 24 B/op, 1 alloc/op | 14.15 ns/op, 24 B/op, 1 alloc/op | -13.5% |
| exact //go:cgo_ line-comment pragma parsing | 120.6 ns/op, 80 B/op, 3 allocs/op | 111.5 ns/op, 80 B/op, 3 allocs/op | -7.6% |

Additional local validation after this follow-up:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo

Additional Pure Build Hot-Path Follow-up 2

Added source-only commit 437d443 to reuse cgo pragma scan results across Darwin Plan9 asm handling and reduce go:cgo_import_dynamic parsing overhead. CI workflow topology is still unchanged.

Local focused benchmark from this follow-up:

| Hot path | Before | After | Change |
| --- | --- | --- | --- |
| Darwin x/sys/unix cgo pragma flow: asm trampoline check + build ldflags/dynimports | 2733 ns/op, 5824 B/op, 60 allocs/op | 667 ns/op, 896 B/op, 1 alloc/op | -75.6% |

This removes a duplicate AST comment scan between compilePkgSFiles' Darwin trampoline skip check and the later cgo alias/link-arg collection, then reduces allocation while parsing repeated exact //go:cgo_import_dynamic line directives.

Additional local validation after this follow-up:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo

Whole Build Pipeline Follow-up

Added source-only commit b236c0a to remove duplicate archive work in the LLGo build cache miss path. Previously an uncached package was archived to a temporary .a, then copied into the build cache. The build now publishes the archive directly at the cache path and uses that archive for the current link, falling back to the temporary archive path only when cache publication is unavailable.

End-to-end local evidence focused on the whole internal/build pipeline rather than microbenchmarks:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -run '^TestExtest$' -count=1 wall | 12.82s | 11.73s / 11.09s (repeat) | -8.5% / -13.4% |
| Go-reported package time for same workload | 12.28s | 10.96s / 10.56s (repeat) | -10.7% / -14.0% |

Rejected during the same whole-process pass: linking uncached main package object files directly instead of archiving them. It passed TestExtest, but did not improve over the cache-archive change and added linker-order complexity, so it was dropped.

Additional local validation after this follow-up:

go test ./internal/build -run '^(TestExtest|TestSaveToCache_Success|TestSaveToCache_WithMetadata|TestTryLoadFromCache_ForceRebuild)$'
go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.

Whole Build Setup Follow-up

Added source-only commit e36a84c to cache successful macOS SDK sysroot discovery within a process. Native macOS builds call xcrun --sdk macosx --show-sdk-path while setting up crosscompile flags; the full internal/build test package invokes the build pipeline repeatedly, so reusing a successful sysroot lookup avoids repeated external setup work without changing generated outputs. Failed lookups are not cached, so transient xcrun failures can still be retried.

End-to-end local evidence used the full internal/build package test pipeline rather than a microbenchmark:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 25.87s | 25.00s / 23.31s / 23.85s | -3.4% to -9.9% |
| Go-reported package time | 25.32s | 24.15s / 22.77s / 23.03s | -4.6% to -10.1% |

Additional local validation after this follow-up:

go test ./internal/crosscompile/...
go test ./internal/build
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.

Whole Build Setup Follow-up 2

Added source-only commit 6b54e5f to keep LLVM's bin directory first in PATH without prepending duplicate entries on every internal/build.Do call. The previous setup mutated process PATH repeatedly during multi-build processes such as go test ./internal/build, growing duplicate LLVM path entries and increasing external tool lookup/setup overhead. Empty LLVM bin dirs are now ignored instead of prepending an empty path component.

End-to-end local evidence again used the full internal/build package test pipeline rather than a microbenchmark:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 24.43s | 23.14s / 21.67s / 22.62s | -5.3% to -11.3% |
| Go-reported package time | 23.75s | 22.33s / 21.15s / 21.85s | -6.0% to -11.0% |

Additional local validation after this follow-up:

go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.
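The PATH behavior above (keep LLVM's bin first, no duplicates, ignore an empty bin dir) can be sketched as a pure function. Names are hypothetical; the separator is a parameter so the behavior is portable and testable.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureFirst returns path with binDir as the first entry, without adding a
// duplicate when it is already present and ignoring an empty binDir
// (mirroring the behavior described for commit 6b54e5f; sketch only).
func ensureFirst(binDir, path, sep string) string {
	if binDir == "" {
		return path // never prepend an empty path component
	}
	var kept []string
	for _, p := range strings.Split(path, sep) {
		if p != binDir {
			kept = append(kept, p)
		}
	}
	return strings.Join(append([]string{binDir}, kept...), sep)
}

func main() {
	path := "/opt/llvm/bin:/usr/bin:/opt/llvm/bin"
	fmt.Println(ensureFirst("/opt/llvm/bin", path, ":"))
}
```

Because the function is idempotent, repeated internal/build.Do calls in one process no longer grow the environment with duplicate LLVM entries.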

Whole Build Setup Follow-up 3

Added source-only commit 90ea768 to make LLVM target initialization idempotent within the process. internal/build.Do is called repeatedly by the full internal/build test pipeline; each call previously invoked llssa.Initialize(llssa.InitAll). LLVM target initialization is process-global, so already-initialized flag groups can be skipped while still allowing later calls with additional flags to initialize any missing groups.

End-to-end local evidence used the full internal/build package test pipeline:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 23.39s | 22.80s / 22.34s | -2.5% / -4.5% |
| Go-reported package time | 22.86s | 21.70s / 21.82s | -5.1% / -4.6% |

Additional local validation after this follow-up:

go test ./ssa
go test ./internal/build
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.
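The flag-group idempotence described above can be sketched with a bitmask guarded by a mutex. The flag names below are hypothetical stand-ins for llssa's init flags; the point is that repeated calls skip already-initialized groups while later calls can still add missing ones.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical flag groups modeled on an InitAll-style init API.
const (
	initNative = 1 << iota
	initAllTargets
	initAll = initNative | initAllTargets
)

var (
	initMu   sync.Mutex
	initDone int // process-global: LLVM target init must happen once per group
	initRuns int // counts real initializations, for illustration only
)

// initialize initializes only the flag groups not yet done, so repeated
// internal/build.Do calls in one process skip redundant LLVM target setup
// (as commit 90ea768 describes; sketch, not the PR's code).
func initialize(flags int) {
	initMu.Lock()
	defer initMu.Unlock()
	missing := flags &^ initDone
	if missing == 0 {
		return // everything requested is already initialized
	}
	initRuns++ // stand-in for the real LLVM target-initialization calls
	initDone |= missing
}

func main() {
	initialize(initAll)
	initialize(initAll) // no-op on the second call
	fmt.Println(initRuns)
}
```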

Whole Build Pipeline Follow-up 2

Added source-only commit 1629549 to overlap cache archive/manifest publication with later package builds. A temporary phase trace of the whole TestExtest pipeline showed cache publication as one of the larger traced subphases (saveToCache about 1.48s cumulative in that workload). The build now starts bounded asynchronous cache saves after cache misses and waits before linking, so archive/manifest I/O can overlap with subsequent package codegen while preserving link inputs and falling back to a temporary archive if cache publication does not produce one.

End-to-end local evidence used the full internal/build package test pipeline:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 22.65s | 22.11s / 22.16s / 21.83s / 22.01s | -2.2% to -3.6% |
| Go-reported package time | 22.10s | 21.33s / 21.63s / 21.04s / 21.26s | -2.1% to -4.8% |

Additional local validation after this follow-up:

go test ./internal/build
go test -race ./internal/build -run '^TestExtest$' -count=1
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.

Next Optimization Work

Potential follow-ups after this PR, ordered by expected value and required design work:

  1. Cache-hit minimal work path: record enough link/runtime/ABI/Python state in cache metadata to skip more LLVM/module construction on package cache hits. This has high potential for warm-cache and repeated test workloads, but needs a careful correctness proof around link metadata, runtime init, ABI symbols, and reflect/global behavior.
  2. More detailed phase tracing for representative workloads: keep using temporary, non-committed tracing around package load, SSA build, cache read/write, cl compile, LLVM emit, archive, link, and test-run phases. Use it on TestExtest, go test ./internal/build, and clean/warm go build -tags=dev ./cmd/llgo before making more source changes.
  3. Bounded package/test pipeline parallelism: investigate whether independent package codegen/export/archive or multiple test-binary link/run phases can be overlapped safely. This needs proof around LLVM/SSA/cabi shared state, deterministic link order, output buffering, and failure aggregation.
  4. Cache publication refinements: after the async cache save change, look for remaining archive/manifest costs such as unnecessary temp files, redundant stat/hash work, or opportunities to batch/cache immutable manifest inputs without changing the persistent YAML format.
  5. cmd/llgo dependency graph slimming: clean go build -a -tags=dev ./cmd/llgo still spends much time in transitive stdlib / x-tools / crosscompile dependencies. Larger gains may require CLI/dependency layering changes, which should be evaluated separately from this focused build-hot-path PR.

Directions already measured and not worth revisiting without new phase evidence: hard-coded GC tuning, disabling linker ICF, removing packages.NeedExportFile, disabling build cache, broad crosscompile/env negative caches, native-only LLVM init, ABI global scan gating, and further small cgo parser/string micro-optimizations.

Whole Build Pipeline Follow-up 3

Added source-only commit 4e8a123 to let the existing bounded cache-publication workers also perform the archive fallback for uncached packages. The previous async cache save path overlapped cache archive/manifest publication, but packages that are intentionally not cached (notably main packages) still fell back to normalizeToArchive while waiting for pending cache saves. The worker now:

  • attempts cache publication when cache is enabled, the package is not main, and fingerprint/manifest are available; force rebuild still bypasses cache reads but repopulates cache entries;
  • creates the required archive in the same bounded worker if cache publication is skipped or does not produce an archive;
  • drains all pending workers before returning the first archive error, avoiding background mutation after build errors;
  • preserves the no-cache-manifest behavior for main packages.

End-to-end local evidence used the full internal/build package test pipeline:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| go test ./internal/build -count=1 wall | 22.89s | 22.87s / 20.74s / 20.80s / 21.82s | -0.1% to -9.4% |
| Go-reported package time | 21.99s | 22.09s / 20.20s / 20.28s / 21.03s | noisy to -8.1% |

A focused attempt to lower the worker cap from 4 to 2 was discarded because it did not improve the whole workload and would overfit a local run.

Additional local validation after this follow-up and the force-rebuild cache refresh fix (e733953):

go test ./internal/build -run '^TestStartCacheSaveNormalizesMainPackage$|^TestTryLoadFromCache_ForceRebuild$' -count=1
(cd test/buildcache && bash ./test.sh)
go test ./internal/build
go test -race ./internal/build -run '^TestExtest$|^TestStartCacheSaveNormalizesMainPackage$|^TestTryLoadFromCache_ForceRebuild$' -count=1
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check

CI workflow topology is still unchanged.

Whole Build Setup Follow-up 4

Added source-only commit 43d48c7 to cache repeated tool-environment lookups inside long-lived build/test processes:

  • internal/env.GoEnvWithEnv now caches successful go env ... results by requested variables and effective environment, while still not caching failures. This avoids repeatedly spawning go env GOROOT GOVERSION across multiple internal/build.Do calls in the same process.
  • xtool/env/llvm.New now caches successful llvm-config --bindir results by effective llvm-config/PATH selection, while still retrying failures. This avoids repeatedly spawning llvm-config --bindir during multi-build workloads.
  • Added focused tests for successful-result caching and failure-retry behavior.

Local evidence used repeated full internal/build package runs after the async cache-publication changes:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 25.17s | 22.30s / 20.01s | -11.4% / -20.5% |
| `go test ./internal/env ./xtool/env/llvm ./internal/build -count=1` wall | 25.17s (baseline context) | 21.91s / 22.46s / 20.99s | improved despite extra package tests |

Rejected during this local-only sweep before pushing:

  • full manual YAML manifest emission: passed tests but regressed full internal/build badly;
  • source patch overlay process-global cache: regressed and had source-staleness risk;
  • ad-hoc cache-hit LLVM/package compile skipping: one prototype crashed, the safer version passed but regressed severely;
  • batching SSA package builds with ssa.Program.Build: no improvement;
  • caching default llvm-config path lookup: no improvement beyond caching successful --bindir.

Additional local validation before pushing this follow-up:

```sh
go test ./internal/env ./xtool/env/llvm ./internal/build
go test -race ./internal/build -run '^TestExtest$|^TestStartCacheSaveNormalizesMainPackage$|^TestTryLoadFromCache_ForceRebuild$' -count=1
go test ./internal/crosscompile/...
(cd test/buildcache && bash ./test.sh)
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check
```

CI workflow topology is still unchanged.

Whole Build Pipeline Follow-up 5

Added source-only commit 3d3a3e3 to move expensive golang.org/x/tools/go/ssa sanity checking out of the default build hot path:

  • Default SSA build mode now keeps InstantiateGenerics but does not run SanityCheckFunctions for every compiled package.
  • LLGO_SSA_SANITY=1 restores the old sanity-check behavior for debugging/validation.
  • LLGO_SSA_SANITY is included in cache fingerprint env inputs, so enabling it forces rebuilds instead of reusing default no-sanity cache entries.
  • The untyped-shift workaround remains active when SSA sanity checking is enabled.
  • Added TestSSABuildModeSanityOptIn to cover the default and opt-in modes.

Local evidence from the representative internal/build workload:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 22.2s | 19.8s / 19.7s | about -10% |
| `go test ./internal/build -count=1` go-reported | 21.438s | 19.017s / 19.156s | about -10% |

This was profile-guided: TestExtest CPU/memory profiles showed ssa.mustSanityCheck / ssa.WriteFunction allocating roughly 287MB in the old default path. The new default removes that validation cost from normal builds while preserving an explicit opt-in path.

Rejected during this sweep:

  • raising async cache-save cap from 4 to 8: regressed the representative workload;
  • go:embed comment pre-scan fast path: not a bottleneck and regressed;
  • ssa type-conversion allocation tweak: regressed full internal/build.

Additional local validation before pushing this follow-up:

```sh
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$' -count=1
go test ./internal/build
(cd test/buildcache && bash ./test.sh)
go test ./internal/crosscompile/...
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check
```

Whole Build Pipeline Follow-up 6

Added source-only commit b79f897 to reduce repeated setup and scheduler overhead in multi-build processes:

  • Development LLGO_ROOT discovery now prints the repeated “Using LLGO root for devel” warning only once per root in a process, preserving the diagnostic while avoiding repeated stderr I/O during internal/build tests and other long-lived build drivers.
  • internal/packages now avoids spawning package-load goroutines for narrow import fanout and for the common single-root load case, while preserving parallel loading for wider import graph nodes.
  • Added TestLLGoROOTWarnsOnceForDevelRoot for the warning-once behavior.

Local representative evidence from this sweep:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 22.6s (baseline) | 20.8s (best local run) / 20.7s (post-cleanup validation) | about -8% |
| `go test ./internal/build -count=1` go-reported | 21.645s (baseline) | 19.978s (best local run) / 20.221s (post-cleanup validation) | about -7% |
| clean `go build -a -tags=dev ./cmd/llgo` validation run | | 11.8s | passed; no root llgo artifact |

Rejected during this local-only sweep before pushing:

  • cache-hit minimal package rebuild using new manifest bits: still crashed because cache-hit packages without LPkg miss hidden ABI/link side effects;
  • direct LLVM object emission through a new internal/build cgo shim: failed include-path portability and would add package-level cgo risk;
  • caching ABI type names in ssa/abi: passed SSA tests but regressed the full internal/build workload;
  • parsing small package file lists sequentially: regressed versus the kept package-load fanout change;
  • package-load fanout thresholds 1, 2, and 8: no improvement over the kept threshold of 4;
  • removing TypesInfo.Scopes collection and small ABI metadata slice preallocation: both regressed representative runs;
  • process-global successful LLGO_ROOT caching: no representative win and higher stale-global-state risk.

Additional local validation before pushing this follow-up:

```sh
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$' -count=1
go test ./internal/env ./internal/build ./internal/crosscompile/... -count=1
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check
```

CI workflow topology is still unchanged.

Whole Build Pipeline Follow-up 7

Added source-only commit fff8fff to reduce go/types map growth during package loading:

  • internal/packages.loadPackageEx now gives the types.Info maps modest initial capacities based on the number of parsed source files.
  • This targets the current profile hotspot in go/types.Checker.recordTypeAndValue / TypesInfo population without changing which type information LLGo collects.
  • Kept all existing TypesInfo maps (Types, Defs, Uses, Implicits, Instances, Scopes, Selections) because omitting any required map is unsafe.

Local representative evidence from this sweep:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 30.3s (noisy baseline) | 20.4s / 20.0s (best repeats) | about -33% |
| `go test ./internal/build -count=1` go-reported | 28.522s | 19.599s / 19.452s | about -31% |
| clean `go build -a -tags=dev ./cmd/llgo` guard run | | 11.5s | passed; no root llgo artifact |

Capacity tuning kept 1024 * len(syntax) for Types, 1/2 of that for Defs/Uses, and small per-file capacities for the smaller maps. Rejected alternatives:

  • removing TypesInfo.Implicits: crashed immediately;
  • removing TypesInfo.Scopes: previously passed but regressed;
  • raising Types capacity to 1536 or 2048 per file: regressed;
  • increasing Defs/Uses to match Types: regressed;
  • increasing small-map capacities to 32 per file: regressed;
  • adding a 16384 cap: did not improve the primary wall-time metric;
  • direct LLVM object emission through an internal/build cgo shim: improved internal/build execution but doubled clean go build -a ./cmd/llgo, so it was rejected and not pushed.

Additional local validation before pushing this follow-up:

```sh
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$' -count=1
go test ./internal/packages ./internal/build ./internal/crosscompile/... -count=1
(cd test/buildcache && bash ./test.sh)
go build -o <tmp> -tags=dev ./cmd/llgo
test ! -e llgo
git diff --check
```

CI workflow topology is still unchanged.

Latest CI Runtime Snapshot (fff8fff)

Latest head: fff8fffec897f5a4ccf0d2f52426440962d08adb (pre-size package type info maps). All checks completed successfully; CI workflow topology remains unchanged.

Compared against the fastest completed coverage-equivalent goplus/main baseline by total runner time from the current baseline pool (b4d9167, same 55 non-skipped / 1 skipped topology):

| Metric | PR fff8fff | Main b4d9167 | Change |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | same |
| Skipped jobs | 1 | 1 | same |
| Total runner time | 5h48m58s | 6h39m33s | -50m35s |
| End-to-end wall time | 51m04s | 3h01m44s | -2h10m40s |
| Longest single job | 20m23s (Go / test (ubuntu-latest, 19)) | 27m12s (LLGo / llgo (macos-15-intel, 19, 1.26.0)) | -6m49s |

Baseline pool checked: latest completed successful goplus/main run sets with equivalent 55 non-skipped / 1 skipped coverage. Fastest by total runner time was b4d9167 at 6h39m33s; fastest by end-to-end wall time was 7ea3148 at 2h10m04s. PR fff8fff end-to-end wall time was 51m04s.

Per-workflow comparison against b4d9167:

| Workflow | Jobs | PR runner | Main runner | Runner Δ | PR wall | Main wall | Wall Δ | PR longest | Main longest | Longest Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Build Cache | 2 | 5m42s | 6m49s | -1m07s | 3m13s | 9m12s | -5m59s | 3m12s | 3m38s | -0m26s |
| Docs | 6 | 10m19s | 9m20s | +0m59s | 2m38s | 12m24s | -9m46s | 2m29s | 2m51s | -0m22s |
| Format Check | 1 | 0m07s | 0m07s | 0m00s | 0m07s | 0m07s | 0m00s | 0m07s | 0m07s | 0m00s |
| Go | 2 | 40m31s | 42m17s | -1m46s | 20m23s | 24m34s | -4m11s | 20m23s | 23m03s | -2m40s |
| LLGo | 33 | 4h14m04s | 4h56m47s | -42m43s | 50m59s | 3h01m24s | -2h10m25s | 19m02s | 27m12s | -8m10s |
| Release Build | 7 | 17m36s | 22m00s | -4m24s | 44m10s | 33m06s | +11m04s | 8m05s | 8m29s | -0m24s |
| Stdlib Coverage | 2 | 2m27s | 2m00s | +0m27s | 7m38s | 5m34s | +2m04s | 1m30s | 1m01s | +0m29s |
| Targets | 2 | 18m12s | 20m13s | -2m01s | 21m47s | 46m55s | -25m08s | 10m12s | 11m13s | -1m01s |

Top latest PR jobs by duration:

| Duration | Workflow | Job |
| --- | --- | --- |
| 20m23s | Go | test (ubuntu-latest, 19) |
| 20m08s | Go | test (macos-latest, 19) |
| 19m02s | LLGo | llgo (macos-15-intel, 19, 1.26.0) |
| 16m54s | LLGo | llgo (macos-15-intel, 19, 1.24.2) |
| 16m45s | LLGo | llgo (macos-15-intel, 19, 1.21.13) |

Note: end-to-end wall time is affected by GitHub-hosted runner queueing and scheduling; total runner time is usually the better signal for source-level build-performance changes.

Whole Build Pipeline Follow-up 8

Added source-only commit 66693e8 to skip parser object resolution during LLGo package loads:

  • internal/build now supplies a packages.Config.ParseFile callback that uses parser.SkipObjectResolution while still preserving parser.ParseComments for LLGo directives (go:linkname, llgo:*, go:embed, cgo pragmas, etc.).
  • LLGo relies on go/types and x/tools SSA object information, not parser-populated ast.Ident.Obj / ast.File.Scope links, so this avoids unnecessary parser work without reducing comment/directive coverage.
  • Added TestParseBuildFileSkipsObjectResolutionAndKeepsComments to cover both properties: comments are retained, parser object links are not populated.

Local representative evidence:

| Workload | Before | After | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -count=1` wall | 23.1s | 20.9s / 20.4s | about -10% to -12% |
| `go test ./internal/build -count=1` go-reported | 22.064s | 20.065s / 19.879s | about -9% to -10% |
| clean `go build -a -tags=dev ./cmd/llgo` | prior guard ~11.5s | 11.0s / 11.33s | no regression |

Rejected during this local-only sweep before pushing:

  • direct native LLVM object emission via a tiny internal/llvmext cgo package: improved LLGo execution workloads but regressed clean go build -a ./cmd/llgo to 12.8s; still best pursued by adding EmitToFile to the existing github.com/goplus/llvm binding instead;
  • AST-count-based TypesInfo map sizing: improved noisy internal/build runs but regressed clean cmd build guard;
  • retuning TypesInfo capacity from 1024/file to 512/file: small/noisy internal improvement but clean cmd guard regressed slightly;
  • pre-sizing the package loader parse cache: internal runs improved but clean cmd guard regressed and unique-count variant regressed;
  • parsing comments only for directive-bearing files: not clearly better than SkipObjectResolution alone and has higher risk of missing directive forms.

Additional local validation before pushing this follow-up:

```sh
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$|^TestParseBuildFileSkipsObjectResolutionAndKeepsComments$' -count=1
go test ./internal/build -count=1
go test ./internal/build ./internal/packages ./internal/crosscompile/... -count=1
(cd test/buildcache && bash ./test.sh)
go build -o <tmp> -tags=dev ./cmd/llgo
go build -a -tags=dev -o <tmp> ./cmd/llgo
test ! -e llgo
git diff --check
```

CI workflow topology is still unchanged.

Latest CI Runtime Snapshot (66693e8)

Latest head: 66693e87e859e1d912209f1172533bdb8e95ffc4 (skip parser object resolution in build loads). All checks completed successfully; CI workflow topology remains unchanged.

Compared against the fastest completed coverage-equivalent goplus/main baseline by total runner time from the current baseline pool (b4d9167, same 55 non-skipped / 1 skipped topology):

| Metric | PR 66693e8 | Main b4d9167 | Change |
| --- | --- | --- | --- |
| Non-skipped jobs | 55 | 55 | same |
| Skipped jobs | 1 | 1 | same |
| Total runner time | 6h06m37s | 6h39m33s | -32m56s |
| End-to-end wall time | 59m20s | 3h01m44s | -2h02m24s |
| Longest single job | 29m24s (LLGo / llgo (macos-15-intel, 19, 1.21.13)) | 27m12s (LLGo / llgo (macos-15-intel, 19, 1.26.0)) | +2m12s |

Compared with the previous PR head snapshot (fff8fff), this run was slower in CI (+17m39s total runner, +8m16s wall). The local representative tests for 66693e8 improved, so this CI delta is treated cautiously because the longest latest job shifted to a single macOS Intel matrix job.

Per-workflow comparison against b4d9167:

| Workflow | Jobs | PR runner | Main runner | Runner Δ | PR wall | Main wall | Wall Δ | PR longest | Main longest | Longest Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Build Cache | 2 | 5m48s | 6m49s | -1m01s | 3m14s | 9m12s | -5m58s | 3m14s | 3m38s | -0m24s |
| Docs | 6 | 9m28s | 9m20s | +0m08s | 4m59s | 12m24s | -7m25s | 2m37s | 2m51s | -0m14s |
| Format Check | 1 | 0m10s | 0m07s | +0m03s | 0m10s | 0m07s | +0m03s | 0m10s | 0m07s | +0m03s |
| Go | 2 | 38m20s | 42m17s | -3m57s | 19m24s | 24m34s | -5m10s | 19m24s | 23m03s | -3m39s |
| LLGo | 33 | 4h31m06s | 4h56m47s | -25m41s | 59m15s | 3h01m24s | -2h02m09s | 29m24s | 27m12s | +2m12s |
| Release Build | 7 | 21m03s | 22m00s | -0m57s | 18m22s | 33m06s | -14m44s | 8m26s | 8m29s | -0m03s |
| Stdlib Coverage | 2 | 1m54s | 2m00s | -0m06s | 3m47s | 5m34s | -1m47s | 1m05s | 1m01s | +0m04s |
| Targets | 2 | 18m48s | 20m13s | -1m25s | 10m38s | 46m55s | -36m17s | 10m37s | 11m13s | -0m36s |

Top latest PR jobs by duration:

| Duration | Workflow | Job |
| --- | --- | --- |
| 29m24s | LLGo | llgo (macos-15-intel, 19, 1.21.13) |
| 19m32s | LLGo | llgo (macos-15-intel, 19, 1.26.0) |
| 19m24s | Go | test (ubuntu-latest, 19) |
| 18m56s | Go | test (macos-latest, 19) |
| 17m44s | LLGo | llgo (macos-15-intel, 19, 1.24.2) |

Note: end-to-end wall time is affected by GitHub-hosted runner queueing and scheduling; total runner time is usually the better signal for source-level build-performance changes.

LLVM EmitToFile CI Validation (b549c2d)

Temporary CI-validation commit b549c2d switches native host object emission from EmitToMemoryBuffer + Go-side object-file write to the new TargetMachine.EmitToFile API on a forked LLVM module branch:

  • LLVM fork/branch: github.com/cpunion/llvm, branch feat/emit-to-file
  • LLVM commit: 40fdafa target: emit target machine output to file
  • LLGo temporary dependency override:
    replace github.com/goplus/llvm => github.com/cpunion/llvm v0.8.9-0.20260429084913-40fdafa22ac4

Local A/B before pushing:

| Workload | Current v0.8.8 memory-buffer path | Forked LLVM EmitToFile | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -run '^TestExtest$' -count=3` wall | 32.1s | 23.2s | -8.9s (-27.7%) |
| same, go-reported | 24.254s | 22.378s | -1.876s (-7.7%) |
| `go test ./internal/build -count=1` | validated | 20.489s / 20.866s | passed |
| clean `go build -a -tags=dev ./cmd/llgo` | prior guards ~11.0-11.5s | 11.1s / local validation passed | no LLGo-side cgo regression |

This validates the direction that PR #1823 enabled: native object emission should move from memory-buffer output to direct file output, but the API belongs in the existing github.com/goplus/llvm binding rather than a new LLGo-side cgo shim. Once the LLVM PR is merged/tagged, the temporary replace should be removed and LLGo should depend on the released github.com/goplus/llvm version.

CI result for LLVM EmitToFile replace commit (b549c2d)

All checks completed successfully for b549c2d9e8cdedfad0b4ba919beb72e20d7340cf (57 success, 1 skipped; merge state CLEAN).

Coverage-equivalent comparison (55 non-skipped jobs, 1 skipped job):

| Metric | b549c2d (forked LLVM EmitToFile) | Previous PR head 66693e8 | Δ vs previous PR head | Main baseline b4d9167 | Δ vs main |
| --- | --- | --- | --- | --- | --- |
| Total runner time | 6h23m49s | 6h06m37s | +17m12s (+4.7%) | 6h39m33s | -15m44s (-3.9%) |
| End-to-end wall | 53m51s | 59m16s | -5m25s (-9.1%) | 3h01m33s | -2h07m42s (-70.3%) |
| Longest job | 27m08s | 29m24s | -2m16s (-7.7%) | 27m12s | -0m04s (-0.2%) |
| LLGo workflow total | 4h37m36s | 4h31m06s | +6m30s (+2.4%) | 4h56m47s | -19m11s (-6.5%) |
| LLGo llgo build-job bucket | 1h50m49s | 1h53m55s | -3m06s (-2.7%) | 2h07m32s | -16m43s (-13.1%) |
| Release Build workflow | 20m50s | 21m03s | -0m13s (-1.0%) | 22m00s | -1m10s (-5.3%) |

Per-workflow totals for b549c2d vs previous PR head 66693e8:

| Workflow | b549c2d | 66693e8 | Δ |
| --- | --- | --- | --- |
| Build Cache | 5m41s | 5m48s | -0m07s |
| Docs | 13m11s | 9m28s | +3m43s |
| Format Check | 0m06s | 0m10s | -0m04s |
| Go | 46m27s | 38m20s | +8m07s |
| LLGo | 4h37m36s | 4h31m06s | +6m30s |
| Release Build | 20m50s | 21m03s | -0m13s |
| Stdlib Coverage | 2m13s | 1m54s | +0m19s |
| Targets | 17m45s | 18m48s | -1m03s |

CI is noisy at whole-run granularity: total runner time regressed vs the previous PR head mostly from Go/Docs/test-job variance, while end-to-end wall, longest job, Release Build, Targets, and the LLGo llgo build-job bucket improved. The local targeted A/B remains the clearest signal for the EmitToFile implementation itself (TestExtest -count=3: 32.1s -> 23.2s wall, -27.7%).

EmitToFile validation discarded

The temporary b549c2d validation commit using replace github.com/goplus/llvm => github.com/cpunion/llvm ... has been dropped from this PR and the branch has been reset back to 66693e8.

Reason: while the forked LLVM EmitToFile API was functional and showed a targeted local improvement for TestExtest, the completed CI run did not show a clear whole-pipeline/total-runner-time improvement over the previous PR head. To keep this PR focused on proven build-performance changes, the temporary dependency replace and LLGo code change are discarded. The LLVM API work can still be pursued separately as a cleanup/API PR, but it is not included here as a build-performance change.

Local follow-up after dropping EmitToFile (7004afe)

After discarding the temporary LLVM EmitToFile replace commit, this source-only follow-up adds two low-risk local hot-path reductions:

  • internal/goembed: skip the full go:embed declaration walk when parsed comments contain no go:embed text.
  • ssa/type_cvt.go: lazily allocate converted tuple/interface/struct slices only when a nested type actually changes.

Local paired A/B (origin/improve/build-perf-pure at 66693e8 vs combined patch):

| Workload | Baseline | 7004afe patch | Change |
| --- | --- | --- | --- |
| `go test ./internal/build -run '^TestExtest$' -count=3` wall | 25.777s | 24.997s | -0.780s (-3.0%) |
| same, go-reported | 24.849s | 24.419s | -0.430s (-1.7%) |

Incremental local A/B while developing:

| Change | Wall change | Go-reported change |
| --- | --- | --- |
| go:embed pre-scan | 24.850s -> 23.751s (-4.4%) | 23.863s -> 23.169s (-2.9%) |
| SSA type-conversion lazy allocation | 23.946s -> 23.263s (-2.9%) | 23.111s -> 22.704s (-1.8%) |

Discarded during this round because paired A/B regressed:

  • isGoSSAOpaqueType reflection fast path for public go/types implementations.
  • runtime.Version() parse cache in package loading.
  • cvtNamed zero-method allocation special case.

Local validation before pushing 7004afe:

```sh
go test ./internal/goembed ./ssa ./cl ./internal/build -count=1
LLGO_SSA_SANITY=1 go test ./internal/build -run '^TestExtest$|^TestSSABuildModeSanityOptIn$' -count=1
go build -o <tmp> -tags=dev ./cmd/llgo
go clean -cache && go build -a -tags=dev -o <tmp> ./cmd/llgo
test ! -e llgo
git diff --check
```

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces significant build performance optimizations and caching mechanisms across the build system. Key improvements include parallelizing C and assembly compilation with a configurable worker pool, caching pkg-config results and LLVM version detection, and optimizing manifest fingerprinting using unsafe string-to-byte conversions to avoid allocations. Additionally, the PR implements a fast-path for manifest metadata parsing to bypass full YAML decoding and optimizes C file scanning by leveraging package metadata. Review feedback suggests further performance gains by caching environment variables used in pkg-config lookups and increasing the default parallel job limit for high-performance build environments.

Comment thread internal/build/cgo.go
Comment on lines +515 to +519
```go
for _, env := range os.Environ() {
	if strings.HasPrefix(env, "PKG_CONFIG") {
		keyParts = append(keyParts, env)
	}
}
```
medium

Calling os.Environ() inside cachedPkgConfig is inefficient because it allocates and copies the entire environment on every call. Since this function is called for every #cgo pkg-config directive across all files in a package, this can add significant overhead in projects with many cgo dependencies. Consider caching the filtered PKG_CONFIG environment variables globally or within the build context to improve performance.

Comment on lines +53 to 55
```go
if jobs > 16 {
	return 16, nil
}
```
medium

The hardcoded limit of 16 parallel jobs may be too low for modern high-performance build environments. Since this PR focuses on build performance, consider increasing this limit or removing it entirely to allow full utilization of available CPU cores, especially as clang processes are independent.

Suggested change:

```diff
-if jobs > 16 {
-	return 16, nil
-}
+if jobs > 64 {
+	return 64, nil
+}
```

@cpunion cpunion force-pushed the improve/build-perf-pure branch from 2c3d623 to 7fbdf87 Compare April 27, 2026 02:49
@codecov

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 94.89164% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.36%. Comparing base (3ac9c14) to head (340e732).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/crosscompile/compile/compile.go | 91.19% | 9 Missing and 8 partials ⚠️ |
| ssa/type_cvt.go | 80.76% | 7 Missing and 3 partials ⚠️ |
| internal/goembed/goembed.go | 66.66% | 2 Missing and 2 partials ⚠️ |
| internal/crosscompile/crosscompile.go | 87.50% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #1832      +/-   ##
==========================================
- Coverage   88.37%   88.36%   -0.01%
==========================================
  Files          51       51
  Lines       14498    14756    +258
==========================================
+ Hits        12812    13039    +227
- Misses       1468     1485     +17
- Partials      218      232     +14
```

☔ View full report in Codecov by Sentry.

@cpunion cpunion force-pushed the improve/build-perf-pure branch 2 times, most recently from 8566a2d to cd7cc02 Compare May 2, 2026 16:55
cpunion added 16 commits May 7, 2026 08:45
Result: {"status":"keep","rebased_internal_build_wall":30.925,"go_reported_s":30.325,"baseline_s":32.555,"patched_s":30.925,"delta_s":-1.63,"base_go_reported_s":31.983,"patched_go_reported_s":30.325}
Result: {"status":"keep","rebased_internal_build_wall":31.8,"go_reported_s":31.224,"baseline_s":31.8,"patched_s":31.8,"delta_s":0,"base_go_reported_s":31.224,"patched_go_reported_s":31.224}
Result: {"status":"keep","warm_internal_build_wall":29.279,"go_reported_s":28.415,"baseline_s":31.157,"patched_s":29.279,"delta_s":-1.878,"base_go_reported_s":30.589,"patched_go_reported_s":28.415,"wall_s":29.279}
Result: {"status":"keep","warm_internal_build_wall":29.256,"go_reported_s":28.699,"baseline_s":31.312,"patched_s":29.256,"delta_s":-2.056,"base_go_reported_s":30.727,"patched_go_reported_s":28.699,"wall_s":29.256}
Result: {"status":"keep","warm_internal_build_wall":29.456,"go_reported_s":28.89,"baseline_s":30.32,"patched_s":29.456,"delta_s":-0.864,"base_go_reported_s":29.756,"patched_go_reported_s":28.89,"wall_s":29.456}
Result: {"status":"keep","warm_internal_build_wall":28.574,"go_reported_s":27.996,"baseline_s":30.326,"patched_s":28.574,"delta_s":-1.752,"base_go_reported_s":29.768,"patched_go_reported_s":27.996,"wall_s":28.574}
@cpunion cpunion force-pushed the improve/build-perf-pure branch from 340e732 to fd5680f Compare May 7, 2026 00:46
@codecov-commenter

codecov-commenter commented May 7, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!
