⚡️ Speed up function like_num by 275%
#7
📄 275% (2.75x) speedup for `like_num` in `spacy/lang/ko/lex_attrs.py`
⏱️ Runtime: 1.13 milliseconds → 302 microseconds (best of 250 runs)
📝 Explanation and details
The optimization achieves a 275% speedup by addressing the most expensive operation in the original code: the Korean number word lookup.
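As a rough illustration of why that lookup can dominate, the sketch below contrasts a per-character membership test against a list with the same test against a set. `NUM_WORDS` is a small hypothetical stand-in rather than spaCy's actual `_num_words`, and both helper names are invented for this example.

```python
# Hypothetical stand-in for the module-level word list; NOT spaCy's real data.
NUM_WORDS = ["영", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구", "십", "백", "천", "만"]

# Built once; membership tests on a frozenset are O(1) on average.
NUM_WORDS_SET = frozenset(NUM_WORDS)


def has_num_char_list(text):
    # Each `in` scans the list linearly, so cost grows with len(NUM_WORDS)
    # for every character in `text`.
    return any(char.lower() in NUM_WORDS for char in text)


def has_num_char_set(text):
    # Same logic, but each `in` is a hash lookup.
    return any(char.lower() in NUM_WORDS_SET for char in text)
```

Both helpers return the same result for any input; only the cost of each `in` check differs.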
Key Performance Bottleneck Eliminated:
The original code's line `if any(char.lower() in _num_words for char in text)` consumed 87.9% of total runtime (2.40ms out of 2.73ms). This was inefficient because:
- It scanned the `_num_words` list for each character
- It called `.lower()` on Korean characters (which don't have case variants)

Primary Optimization - Set-Based Lookup:
The optimized version converts `_num_words` to a set once and caches it as a function attribute, enabling average O(1) character lookups instead of O(n). This reduces the Korean word check from 2.40ms to 1.21ms (50% reduction), while the caching overhead is minimal (45μs total for getattr + set creation on first call).

Secondary Optimization - Split Limit:
Changed `text.split("/")` to `text.split("/", 1)` to avoid unnecessary splitting when validating fractions, though this has minimal impact.

Performance Characteristics:
This optimization is particularly valuable for NLP workloads processing Korean text at scale, where `like_num` would be called frequently during tokenization and linguistic analysis.
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-like_num-mhmik25w` and push.
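For readers who want to see the shape of the change before checking out the branch, here is a minimal sketch of the two techniques named in the explanation: caching a set as a function attribute and limiting the split. It is a simplified assumption of how such a function might look, not the code from the PR; `like_num_sketch`, the `_num_words_set` attribute name, and the shortened `_num_words` list are all hypothetical.

```python
# Hypothetical subset of number words; the real list lives in the spaCy module.
_num_words = ["영", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구", "십", "백", "천", "만"]


def like_num_sketch(text):
    # Build the set on the first call and cache it on the function object,
    # so later calls only pay for a getattr.
    num_set = getattr(like_num_sketch, "_num_words_set", None)
    if num_set is None:
        num_set = frozenset(_num_words)
        like_num_sketch._num_words_set = num_set

    if text.isdigit():
        return True

    if text.count("/") == 1:
        # maxsplit=1 stops after the first separator; with exactly one "/"
        # the result is the same two parts as an unlimited split.
        num, denom = text.split("/", 1)
        if num.isdigit() and denom.isdigit():
            return True

    # Average O(1) set membership per character instead of a list scan.
    return any(char.lower() in num_set for char in text)
```

Caching on the function attribute keeps the set next to its only consumer; a module-level `frozenset` built at import time would be an equally reasonable choice.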