[WIP] Reduce compilation overhead with @nospecialize #83
Changed from @nospecialize(x::T) where {T<:Type} pattern to
@nospecialize(x::{<:Type}) to properly prevent type specialization.
This ensures Julia doesn't compile separate versions for each concrete
type, reducing compilation overhead.
Add @assert type checks to prevent merging/freezing different IDS types.
This fixes type safety issues introduced by commit 74ded67, where 'where T'
constraints were removed.
Functions updated:
- merge!(::IDS, ::IDS) - assert same type
- merge!(::IDSvector, ::IDSvector) - assert same eltype
- freeze!(::IDS, ::IDS) - assert same type
- freeze!(::IDSvector, ::IDSvector) - assert same eltype
These assertions prevent runtime errors from field mismatches when operating
on incompatible types.
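A minimal sketch of the run-time check described in this commit, assuming hypothetical stand-in types (`NodeA`, `NodeB`) and a hypothetical `merge_nodes!` rather than the real IMASdd `merge!` methods. Once the `where T` constraint is gone, dispatch no longer guarantees that both arguments share a type, so the check moves into the body:

```julia
# Hypothetical stand-ins for IDS types; not the real IMASdd definitions.
abstract type AbstractNode end

mutable struct NodeA <: AbstractNode
    x::Float64
end

mutable struct NodeB <: AbstractNode
    y::Float64
end

# Both arguments are only constrained to the abstract type, so the
# "same concrete type" requirement is enforced with an assertion.
function merge_nodes!(@nospecialize(target::AbstractNode), @nospecialize(source::AbstractNode))
    @assert typeof(target) === typeof(source) "cannot merge $(typeof(source)) into $(typeof(target))"
    for name in fieldnames(typeof(target))
        setfield!(target, name, getfield(source, name))
    end
    return target
end

a = NodeA(1.0)
merge_nodes!(a, NodeA(2.0))   # copies the field values from source into target
```

Calling `merge_nodes!(NodeA(1.0), NodeB(2.0))` raises an `AssertionError` instead of failing later on a missing field.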
Replace error() with Dict return when comparing different types.
Now diff() returns a dict with 'type_mismatch' key instead of
throwing an error, making it more flexible and non-disruptive.
Example:
diff(dd, dd.equilibrium)
=> Dict('type_mismatch' => 'dd{Float64} != equilibrium{Float64}')
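The behavior described here could look roughly like the following. This is only a sketch: `diff_sketch`, `Equilibrium`, and `CoreProfiles` are hypothetical names, and the real `diff()` may use different key and value types.

```julia
# Hypothetical stand-in types for two different IDS kinds.
struct Equilibrium
    time::Vector{Float64}
end

struct CoreProfiles
    time::Vector{Float64}
end

# Returns a Dict describing differences instead of throwing on a type mismatch.
function diff_sketch(@nospecialize(a), @nospecialize(b))
    differences = Dict{String,String}()
    if typeof(a) !== typeof(b)
        differences["type_mismatch"] = "$(typeof(a)) != $(typeof(b))"
        return differences
    end
    for name in fieldnames(typeof(a))
        if getfield(a, name) != getfield(b, name)
            differences[String(name)] = "values differ"
        end
    end
    return differences
end

mismatch = diff_sketch(Equilibrium([0.0]), CoreProfiles([0.0]))
same = diff_sketch(Equilibrium([0.0]), Equilibrium([0.0]))
```

Callers can then test `haskey(result, "type_mismatch")` rather than wrapping the call in `try`/`catch`.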
Add @nospecialize annotations to info and coordinates to reduce compilation overhead and binary size.
Added @nospecialize annotations to location and conversion functions in f2.jl
to prevent excessive method specialization:
- ulocation(ids, field) and variants
- f2u(ids) - converts IDS to universal location string
- fs2u(ids_type) - converts IDS type to universal location
Note: ulocation specialization is still observed during constructor execution
despite @nospecialize; investigation is ongoing into when and why
specialization occurs in the call chain.
Add dd_nospecialize() helper function that uses Base.invokelatest to
prevent compiler from analyzing dd() internals during type inference.
Replace all dd() default arguments in I/O functions with dd_nospecialize()
to significantly reduce allocation and compilation time.
The invokelatest barrier prevents the compiler from specializing on the
complex 180k-line dd struct generation, while ::dd{Float64} type assertion
ensures proper type propagation without additional inference overhead.
Affected functions:
- json2imas, jstr2imas
- hdf2imas (default arg and internal call)
- h5i2imas
Performance impact: ~20M fewer allocations in hdf2imas calls.
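A small sketch of the barrier pattern this commit describes, assuming a hypothetical `BigContainer` in place of the generated `dd` struct and hypothetical function names. `Base.invokelatest` makes the constructor call opaque to inference, and the type assertion gives callers a concrete return type again:

```julia
# Hypothetical stand-in for the large generated `dd` struct.
mutable struct BigContainer{T}
    values::Vector{T}
end
BigContainer{T}() where {T} = BigContainer{T}(T[])

# The compiler does not look through invokelatest, so callers of
# make_container() do not analyze the constructor's internals.
# The assertion restores a concrete return type for those callers.
make_container() = Base.invokelatest(BigContainer{Float64})::BigContainer{Float64}

# Used as a default argument, analogous to replacing `dd()` in the I/O functions.
function load_values(data::Vector{Float64}, container::BigContainer{Float64}=make_container())
    append!(container.values, data)
    return container
end

loaded = load_values([1.0, 2.0])
```

The trade-off is a dynamic call at each construction, which is usually negligible next to the I/O work that follows.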
Apply @nospecialize to fieldtype, getproperty, parent, name, goto, getindex, and time-related functions to reduce method specialization.
Temporary test file for investigating method specialization behavior.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #83 +/- ##
==========================================
+ Coverage 43.76% 43.96% +0.20%
==========================================
Files 13 15 +2
Lines 31243 31418 +175
==========================================
+ Hits 13672 13813 +141
- Misses 17571 17605 +34
|
Remove type parameters from @nospecialize signatures and extract types inside function body using eltype() to properly prevent specialization.
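As an illustration of the pattern in this commit, here is a sketch with a hypothetical `first_or_default` function. The signature carries no `where {T}` parameter; the element type is recovered in the body with `eltype()`:

```julia
# Pattern the commit moves away from (shown as a comment only):
#   first_or_default(@nospecialize(v::Vector{T}), default::T) where {T}

# Pattern after the change: no static parameter in the signature.
function first_or_default(@nospecialize(v::AbstractVector), @nospecialize(default))
    T = eltype(v)   # element type looked up at run time
    default isa T || throw(ArgumentError("default must be a $T"))
    return isempty(v) ? default : first(v)
end

first_or_default([1, 2], 0)    # returns the first element
first_or_default(Int[], 7)     # falls back to the default
```

The type relationship that the `where {T}` clause used to enforce at dispatch is now checked explicitly in the body.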
Remove type parameters from @nospecialize signatures in ==, isequal, isapprox, and _extract_comparable_fields to properly prevent specialization on Union types. Also update docstrings for resize!, diff, and time_groups functions.
…me bottleneck
Added @nospecializeinfer to 165+ functions across 9 core modules, drastically
reducing type inference overhead and specialization explosion.
Impact:
- Eliminates ~50% of top-level type inference allocations
- Prevents Union type specialization combinatorial explosion
- Dramatically reduces first-time function compilation overhead
- Measured: hdf2imas compilation time drops from 98.71% to minimal levels
Changes:
- Added `using Base: @nospecializeinfer` import
- Applied to functions with @nospecialize parameters in:
cocos (6), data (25), expressions (18), f2 (24), identifiers (15),
io (31), show (16), time (30)
Technical rationale:
@nospecializeinfer prevents both specialization AND type inference propagation.
Critical for Union types like Union{IDS,IDSvector,Vector{IDS}} which trigger
combinatorial method generation and expensive typeinf_ext_toplevel calls.
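For readers unfamiliar with the macro, a minimal sketch of how the two annotations combine. It requires Julia 1.10 or later; the `Node` types and `node_label` are hypothetical and not taken from IMASdd:

```julia
using Base: @nospecializeinfer   # not exported, so it is imported explicitly

abstract type Node end

struct Leaf <: Node
    value::Float64
end

struct Branch <: Node
    children::Vector{Node}
end

# @nospecialize keeps codegen from specializing on the concrete argument type;
# @nospecializeinfer additionally makes inference use the declared type (Node).
@nospecializeinfer function node_label(@nospecialize(n::Node))
    return n isa Leaf ? "leaf" : "branch"
end

labels = [node_label(Leaf(1.0)), node_label(Branch(Node[]))]
```

Because inference only sees `Node`, callers that need a concrete return type derived from the argument type lose that information.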
This partially reverts the previous commit, which added @nospecializeinfer to
almost every function. While @nospecializeinfer improves compilation time, it
prevents type inference for certain key functions that are essential for
nested calls to return concrete types.
Functions affected:
- cocos_out, cocos_transform, transform_cocos_* (src/cocos.jl)
- concrete_fieldtype_typeof, eltype_concrete_fieldtype_typeof, Base.getproperty (src/data.jl)
These functions must infer concrete return types for nested IMASdd function
calls to work correctly, as tested in test/runtests_concrete.jl.
Result: All tests now pass again, including concrete type inference tests.
|
great find with the '@nospecializeinfer' macro !! |
Due to @nospecialize/@nospecializeinfer constraints, type parameters
cannot be used to guarantee matching types at dispatch time.
Solution:
- Add hot path methods for identical types (Int32/64, Float32/64, UInt64, Bool)
- These route to __convert_same_real_type! helper for fast path
- Generic Real->Real method now performs runtime type checking
- Convert to target type when needed (e.g., Float64 → Measurement{Float64})
Changes:
- Check COCOS conversion consistently in all cases (assumed always needed)
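The dispatch layout described above might be sketched as follows. `store_value!` and `_store_same_type!` are hypothetical names (the commit's helper is `__convert_same_real_type!`), and the sketch converts into a plain `Vector` rather than an IDS field:

```julia
# Fast-path helper for the case where value and destination types already match.
_store_same_type!(dest::Vector, value) = (push!(dest, value); dest)

# Hot-path methods: one per common identical-type pair, routed to the helper.
for T in (Int32, Int64, Float32, Float64, UInt64, Bool)
    @eval store_value!(dest::Vector{$T}, value::$T) = _store_same_type!(dest, value)
end

# Generic fallback: without a shared type parameter, the match is checked at run time
# and the value is converted to the target type when needed.
function store_value!(@nospecialize(dest::Vector), @nospecialize(value::Real))
    T = eltype(dest)
    push!(dest, value isa T ? value : convert(T, value))
    return dest
end

v = Float64[]
store_value!(v, 2.5)   # identical types: hot-path method
store_value!(v, 1)     # Int into Vector{Float64}: generic method converts
```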
Separated DD from Union{DD, IDSraw, IDSvectorRawElement} into dedicated method.
Large Union (~130+ concrete subtypes) prevented compiler from inlining hot path.
DD now gets specialized method without @nospecialize for optimal performance.
Added guard condition (user_cocos != to_cocos) before calling cocos_out. Since both default to 11, this avoids unnecessary function calls on hot path.
Added @inline and ::Bool annotations to hasdata for better type inference. Helps compiler optimize getproperty hot path by eliminating runtime type checks.
…eld indices
- Replace fieldnames() iteration with fieldcount() + numeric indices
- Reduces from 14 allocations (4.125 KiB) to ~0 allocations
- Works correctly with @nospecialize by avoiding symbolic field names
- Added @inbounds for bounds-check elimination
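A sketch of the numeric-index iteration pattern, assuming a hypothetical `Sample` struct and `any_field_nonempty` function. It makes no claim about the allocation counts quoted in the commit:

```julia
# Hypothetical struct standing in for an IDS with several array fields.
mutable struct Sample
    a::Vector{Float64}
    b::Vector{Float64}
    c::Vector{Float64}
end

# Iterate fields by position with fieldcount/getfield instead of
# looping over the Symbols returned by fieldnames().
function any_field_nonempty(@nospecialize(obj))
    T = typeof(obj)
    for i in 1:fieldcount(T)
        isempty(getfield(obj, i)) || return true
    end
    return false
end

empty_sample = Sample(Float64[], Float64[], Float64[])
filled_sample = Sample(Float64[], [1.0], Float64[])
```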
- Replace enumerate() with eachindex() + @inbounds
- Replace fieldnames() with hasfield()
- Optimize hasdata() to use numeric field indices
- Use tuple literals instead of arrays for membership tests
Optimize setproperty! by reusing already-computed coords variable instead of
calling coordinates(ids, field) multiple times:
- Line 772: Use inline generator with coords reuse
- Line 774: Reuse coords instead of recalling coordinates()
Eliminates 2 redundant function calls per setproperty! invocation.
Uses idiomatic inline generator with any() for short-circuit benefit.
Optimize name_2_index() by caching inverted Dict per IDS type:
- Add global cache _NAME_2_IDX_CACHE using IdDict
- Implement lazy initialization with get!() for thread-safety
- First call per type: inverts idx_2_name and caches result
- Subsequent calls: returns cached Dict (zero-allocation)
Performance improvement:
- Before: ~22μs, 5 allocations per call
- After: ~2ns, 0 allocations (after first call per type)
Related optimization in fix/nospecialize branch.
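A rough sketch of the per-type cache described in this commit. The names `_INVERTED_CACHE`, `Kind`, `index_to_name`, and `name_to_index` are hypothetical stand-ins for `_NAME_2_IDX_CACHE`, an IDS type, `idx_2_name`, and `name_2_index`:

```julia
# Hypothetical type whose index => name table gets inverted once and cached.
struct Kind end
index_to_name(::Type{Kind}) = Dict(1 => "alpha", 2 => "beta")

# Cache keyed by type identity; values are the inverted lookup tables.
const _INVERTED_CACHE = IdDict{Any,Dict{String,Int}}()

function name_to_index(@nospecialize(T::Type))
    # get! runs the do-block only when T has no entry yet.
    return get!(_INVERTED_CACHE, T) do
        Dict(name => idx for (idx, name) in index_to_name(T))
    end
end

first_lookup = name_to_index(Kind)
second_lookup = name_to_index(Kind)   # returns the cached Dict
```

Note that this sketch takes no lock; if several tasks could hit the first call for a type concurrently, the cache access would need to be guarded.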
Simplify in_expression() by removing redundant key check:
- Remove manual `if t_id ∉ keys(_in_expression)` check
- Use get!() directly for atomic check-and-create operation
- get!() already handles check atomically, making manual check redundant
Performance improvement:
- Eliminates one dict lookup (haskey check)
- Cleaner code with same thread-safety guarantees
Related to fix/nospecialize optimization work.
…debug code
Optimize two functions with numeric field iteration pattern:
- Stack-based fill function: Replace fieldnames() with fieldcount/fieldname
- Base.empty!(): Use numeric indices for field iteration
- Add @inbounds for bounds check elimination
Remove debug statements:
- Clean up Main.@infiltrate calls from resize!() function
Performance improvement:
- Eliminates allocations from fieldnames() vector creation
- Enables bounds check elimination with @inbounds
- Consistent with other @nospecialize optimizations
Related to fix/nospecialize optimization work.
|
@bclyons12 @fredrikekre Additional Performance Optimizations (6 commits)
Summary
Zero-allocation improvements across hot paths in
Key Changes
Loop Optimization (
Field Iteration (
Field Checks (
Caching (
Thread-safe Access (
Misc (
Files Changed
|
Add diagnose_shared_objects() to detect unintended array sharing in IDS trees.
This helps identify cases where `a = b` was used instead of `a .= b`.
Features:
- Stack-based tree traversal following isequal pattern
- SharedObjectReport with indexed access (report[1].id, report[1].paths)
- Cross-IDS sharing detection (e.g., core_profiles ↔ core_sources)
- REPL display with chronological path ordering
…ility
Replace @maybe_nospecializeinfer with @nospecializeinfer since the macro
wrapper is not defined on master branch.
- Add runtests_f2.jl with 81 test cases covering:
  - f2p, f2i, f2u path conversion functions
  - i2p, p2i, i2u string parsing functions
  - location, ulocation path accessors
  - fs2u type-based lookup
  - f2p_name IDS naming
  - Round-trip consistency validation
  - Edge cases (standalone IDS, utime flag, deeply nested)
- Move f2-related tests from runtests_ids.jl to dedicated file
- Include runtests_f2.jl in main test runner
- Add _F2P_SKELETON_CACHE with concrete types for type-stable lookup
- Split _f2p_skeleton into fast path (cache hit) and slow path (@noinline)
- Pre-compute and cache result_size to avoid redundant count() calls
- Use Vector{String} in cache for concrete value type
- Remove String() conversion in loop (already cached as String)
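The fast-path/slow-path split could be sketched like this. `_SKELETON_CACHE`, `_build_skeleton`, `skeleton`, and the `core_profiles__profiles_1d` struct are hypothetical, and the splitting rule is only an assumption about what a "skeleton" contains:

```julia
# Hypothetical type whose name encodes a path, as in generated IDS structs.
struct core_profiles__profiles_1d end

# Concrete key and value types keep the cache lookup type-stable.
const _SKELETON_CACHE = Dict{DataType,Vector{String}}()

# Slow path: kept out of line so the fast path stays small.
@noinline function _build_skeleton(T::DataType)
    parts = String.(split(string(nameof(T)), "__"))
    _SKELETON_CACHE[T] = parts
    return parts
end

# Fast path: a single lookup; fall through to the slow path on a miss.
function skeleton(T::DataType)
    cached = get(_SKELETON_CACHE, T, nothing)
    cached === nothing || return cached
    return _build_skeleton(T)
end

parts_first = skeleton(core_profiles__profiles_1d)
parts_second = skeleton(core_profiles__profiles_1d)   # served from the cache
```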
- Add internal @_typed_cache macro with proper hygiene (gensym, esc)
- Use helper function pattern to solve return-bypass caching bug
- Apply macro to f2.jl: fs2u, _f2p_skeleton, f2p_name(Type)
- Rename cache constants with _TCACHE_ prefix for consistency
- Use Base.get single lookup instead of haskey+getindex pattern
- Remove type-based name computation (~15 lines)
- Reuse cached skeleton from _f2p_skeleton(T)
- Eliminates redundant replace/eachsplit/count calls per f2i invocation
- Add ::String return type to f2p_name(ids) for better type inference
- Refactor f2p_name(ids::IDS, ::IDS) to reuse cached f2p_name(Type)
- Remove redundant typename_str computation (now uses cache)
- Add @nospecialize to entry point f2p_name(ids) for compile time
- i2u fast path: avoid String(loc) allocation when loc is already String
- ulocation/location(IDSvector): use fs2u_base cache instead of SubString
- Add int_to_string() cache for small integers (0-10) used in f2p and f2p_name
- Add fs2u_base typed cache for IDSvector base paths (0 allocs)
- Expand benchmark_f2.jl with comprehensive allocation tests
Results:
- f2p: 10→8 allocs (simple), 14→10 allocs (nested)
- f2p_name(IDSvectorElement): 4→2 allocs
- ulocation/location(IDSvector): 0 allocs (cached)
- i2u(String, no brackets): 0 allocs
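One way the small-integer string cache could work, shown as a sketch with the hypothetical names `_SMALL_INT_STRINGS` and `int_to_string_sketch` (the commit's function is `int_to_string()`, whose implementation is not shown in this thread):

```julia
# Precomputed strings for 0 through 10, built once at load time.
const _SMALL_INT_STRINGS = ntuple(i -> string(i - 1), 11)

# Return the cached string for small values; fall back to string() otherwise.
function int_to_string_sketch(i::Int)
    return 0 <= i <= 10 ? _SMALL_INT_STRINGS[i + 1] : string(i)
end

small = int_to_string_sketch(3)
large = int_to_string_sketch(42)
```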
Changed eltype(ids) to typeof(ids) in ulocation/location(IDSvector) functions. With @nospecialize, eltype(ids) returns Any causing 3 allocations and boxing. Using typeof(ids) and extracting element type inside fs2u_base ensures type stability and 0 allocations. Result: ulocation/location(IDSvector) now ~27ns with 0 allocations (previously ~600ns with 3 allocations/128 bytes)
Replace zeros(Int, N) with zeros!(pool, Int, N) using @with_pool macro. This eliminates the small temporary array allocation (N typically 1-3) that occurred on every f2p/f2i call by reusing pooled memory.
…e allocation
Under @nospecializeinfer, the closure created by `lock() do` can cause boxing
due to captured variables (ids, field, func, throw_on_missing, etc). Using
explicit try/finally eliminates the closure and reduces allocation.
Changes:
- exec_expression_with_ancestor_args (4-arg version): cache lock, use try/finally
- onetime expression path: same pattern for consistency
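The closure-free locking pattern looks like this in isolation. `_STATE_LOCK`, `_STATE`, and `bump!` are hypothetical names used only for the sketch:

```julia
const _STATE_LOCK = ReentrantLock()
const _STATE = Dict{Symbol,Int}()

# Equivalent to `lock(_STATE_LOCK) do ... end`, but without creating a closure,
# so no local variables have to be captured.
function bump!(key::Symbol, amount::Int)
    lock(_STATE_LOCK)
    try
        _STATE[key] = get(_STATE, key, 0) + amount
        return _STATE[key]
    finally
        unlock(_STATE_LOCK)   # runs even if the body throws or returns early
    end
end

first_total = bump!(:a, 2)
second_total = bump!(:a, 3)
```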
|
@fredrikekre @bclyons12
Summary: Introduced type-based caching infrastructure to reduce allocations and improve performance in path/location functions. Also applied various micro-optimizations (array pooling, string caching, closure elimination).
Key Changes
1. Type-based caching (
2. Temp array reuse (
3. + Other minor micro-optimizations |
Add type normalization to prevent cache key conversion errors when
UnionAll types (e.g., SomeType instead of SomeType{Float64}) are
passed to cached functions.
Changes:
- Add _normalize_ids_type() to convert UnionAll → DataType{Float64}
- Separate fs2u/f2p_name into public wrapper + cached implementation
- Public API accepts Type, internal cache uses DataType only
- Use specific type constraints (Type{<:IDS}) to minimize method
table complexity and preserve dispatch performance
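A sketch of the normalization step, assuming a hypothetical single-parameter `Profile{T}` type and hypothetical function names (`normalize_type`, `cached_path`) in place of `_normalize_ids_type()` and the real cached functions:

```julia
# Hypothetical parametric type standing in for an IDS type.
struct Profile{T<:Real}
    values::Vector{T}
end

# Concrete types pass through; a UnionAll such as `Profile` is
# instantiated with Float64 (assumes exactly one type parameter).
normalize_type(T::DataType) = T
normalize_type(T::UnionAll) = T{Float64}

# The cache itself only ever sees DataType keys.
const _PATH_CACHE = Dict{DataType,String}()
_cached_path(T::DataType) = get!(() -> lowercase(string(nameof(T))), _PATH_CACHE, T)

# Public wrapper accepts both forms and normalizes before the lookup.
cached_path(T::Type{<:Profile}) = _cached_path(normalize_type(T))

from_unionall = cached_path(Profile)
from_concrete = cached_path(Profile{Float64})
```

Both calls land on the same cache entry, so passing the bare `Profile` no longer fails on key conversion.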
bclyons12 left a comment
With these changes, I see the TTFX for FUSE.warmup(dd) go from 8.5 to 3.5 minutes. Second execution is about the same, slightly faster now with these changes. Timings attached
master_timing.txt
nospecialize_timing.txt
|
@bclyons12 Thanks for the detailed comparison! |
I removed most code changes, since #83 stands "on its own" when it comes to fixing latency issues. What's here are mostly annotations for things we discussed in the meetings. In particular, some of the `getproperty` and `setproperty` code does more than just get and set properties, so it might be better as separate functions, especially since some methods take additional (keyword) arguments.
Status
🚧 Work in Progress - Testing and feedback welcome
Summary
This PR adds/fixes `@nospecialize` annotations to frequently-called utility functions to prevent excessive method specialization and reduce compilation overhead.
Major Changes
Core Functions Modified
- `ulocation()`, `f2u()` - Added `@nospecialize` for IDS parameters
- `info()`, `coordinates()` - Prevented specialization on IDS types
- `diff()`, `merge!()`, `freeze!()` - Refactored type handling
- `dict2imas()` - Fixed specialization in IO operations
Known Issues
`hdf2imas()` or through certain code paths. Root cause investigation ongoing.
Testing
Test File
See `tmp_test/test_ulocation.jl` for specialization behavior tests.
Required Temporary Modification
To properly test, you need to modify `dd.jl` to use `getfield` instead of `getproperty` or `ids.field`:
macOS:
Linux (without macOS backup extension):
sed -E -i 's/setfield!\(ids\.([^,]*)/setfield!(getfield(ids, :\1)/g' src/dd.jl
This replaces `ids.field` property access with `getfield(ids, :field)` to bypass `getproperty` via `ids.field` implementations during testing.
Next Steps