@codeflash-ai codeflash-ai bot commented Nov 5, 2025

📄 275% (2.75x) speedup for like_num in spacy/lang/ko/lex_attrs.py

⏱️ Runtime: 1.13 milliseconds → 302 microseconds (best of 250 runs)
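The headline percentage appears to be the time saved relative to the optimized runtime. The short check below is a sketch under that assumption; the helper name `speedup_percent` is hypothetical and not part of Codeflash or spaCy.

```python
def speedup_percent(original_s: float, optimized_s: float) -> float:
    """Relative gain: time saved divided by the optimized runtime, in percent."""
    return (original_s - optimized_s) / optimized_s * 100.0


original = 1.13e-3  # 1.13 milliseconds, as reported above
optimized = 302e-6  # 302 microseconds, as reported above

gain = speedup_percent(original, optimized)
ratio = original / optimized  # how many times shorter the optimized runtime is

print(f"{gain:.0f}% faster, runtime ratio {ratio:.2f}x")
```

With the rounded runtimes shown in the report this lands within about one percentage point of the stated 275%, and the plain runtime ratio is one unit higher than the gain expressed as a multiple.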

📝 Explanation and details

The optimization achieves a 275% speedup by addressing the most expensive operation in the original code: the Korean number word lookup.

Key Performance Bottleneck Eliminated:
The original line `if any(char.lower() in _num_words for char in text)` consumed 87.9% of the total profiled runtime (2.40ms out of 2.73ms). It was inefficient because:

  • Each `in _num_words` test is a linear scan of the list, so the check costs O(n) per character, where n is the number of entries in `_num_words`
  • It unnecessarily calls `.lower()` on Korean characters (which don't have case variants)
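The behavior of that line can be illustrated in isolation. This is a minimal sketch using an abbreviated copy of the number-word list; the names `num_words` and `has_num_word_char` are illustrative and are not taken from the spaCy source.

```python
# Abbreviated stand-in for the list shown in the generated tests.
num_words = ["영", "공", "하나", "둘", "셋", "일", "이", "삼", "십", "백", "천", "만"]


def has_num_word_char(text: str) -> bool:
    """Per-character membership test, mirroring the quoted line."""
    return any(char.lower() in num_words for char in text)


# Hangul has no case mapping, so lower() and upper() return the input unchanged.
assert all(word.lower() == word and word.upper() == word for word in num_words)

# Worst case (no character matches): every `in` test scans the whole list.
text = "hello"
worst_case_comparisons = len(text) * len(num_words)
print(worst_case_comparisons)

print(has_num_word_char("삼성"))  # True: the single character "삼" is an entry
print(has_num_word_char("하나"))  # False: neither "하" nor "나" alone is an entry
```

Because the test is applied one character at a time, multi-character entries such as "하나" cannot be matched by this line on its own.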

Primary Optimization - Set-Based Lookup:
The optimized version converts `_num_words` to a set once and caches it as a function attribute, so each character lookup is an average O(1) hash probe instead of an O(n) list scan. This reduces the Korean word check from 2.40ms to 1.21ms (a 50% reduction), while the caching overhead is minimal (45μs in total for the `getattr` call plus the one-time set creation on the first call).
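The caching pattern described above can be sketched as follows. This is not the code from the PR: the function name `contains_num_word_char` and the attribute name `_num_words_set` are assumptions made for illustration, and the list is abbreviated.

```python
_num_words = ["영", "공", "하나", "둘", "셋", "일", "이", "삼", "십", "백", "천", "만"]  # abbreviated


def contains_num_word_char(text: str) -> bool:
    """Return True if any single character of text is an entry in _num_words."""
    # Look up the cached set on the function object; build it on the first call only.
    cache = getattr(contains_num_word_char, "_num_words_set", None)
    if cache is None:
        cache = frozenset(_num_words)
        contains_num_word_char._num_words_set = cache
    # Assumption: entries contain no cased characters, so lower() is skipped here.
    return any(char in cache for char in text)


print(contains_num_word_char("삼성"))
print(contains_num_word_char("hello"))
```

A module-level `frozenset` built at import time would avoid the per-call `getattr`; storing the set on the function keeps the change local to one function at the cost of that lookup.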

Secondary Optimization - Split Limit:
Changed `text.split("/")` to `text.split("/", 1)` so that a fraction candidate is split at most once; this has minimal impact on runtime.
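The effect of the split limit is easy to see on a string with more than one slash. The helper `split_fraction` below is hypothetical and only illustrates the `maxsplit` argument of `str.split`; it is not the function from the PR.

```python
from typing import Optional, Tuple


def split_fraction(text: str) -> Optional[Tuple[str, str]]:
    """Return (numerator, denominator) if text looks like digits/digits, else None."""
    parts = text.split("/", 1)  # at most one split, so at most two parts
    if len(parts) != 2:
        return None
    num, denom = parts
    # Any extra "/" stays inside denom, and isdigit() then rejects it.
    if num.isdigit() and denom.isdigit():
        return num, denom
    return None


print("1/2/3".split("/"))     # ['1', '2', '3']
print("1/2/3".split("/", 1))  # ['1', '2/3']
print(split_fraction("1/2"))
print(split_fraction("1/2/3"))
```

Empty parts such as the numerator of "/2" are also rejected, because `"".isdigit()` is False.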

Performance Characteristics:

  • Small inputs: Slight overhead (2-14% slower), attributed to the cache handling
  • Korean text: 15-30% faster due to efficient set lookups
  • Large non-numeric strings: Dramatic improvements (500-650% faster); the O(1) versus O(n) saving applies to every character, so it accumulates with input length
  • Mixed content: 300-600% faster for strings containing Korean characters
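How the gap grows with input length can be measured with a small `timeit` harness. This is a sketch with an abbreviated word list and illustrative names (`check_list`, `check_set`, `best_of`); absolute numbers depend on the machine and will not match the figures in this report.

```python
import timeit

NUM_WORDS = ["영", "공", "하나", "둘", "셋", "일", "이", "삼", "사", "오", "십", "백", "천", "만"]
NUM_WORDS_SET = frozenset(NUM_WORDS)


def check_list(text: str) -> bool:
    """Original-style check: linear scan of the list for every character."""
    return any(char.lower() in NUM_WORDS for char in text)


def check_set(text: str) -> bool:
    """Set-based check: one hash lookup per character."""
    return any(char in NUM_WORDS_SET for char in text)


def best_of(func, text: str, repeat: int = 5, number: int = 200) -> float:
    """Best time in seconds for `number` calls, taken over `repeat` trials."""
    return min(timeit.repeat(lambda: func(text), repeat=repeat, number=number))


timings = {}
for size in (10, 100, 1000):
    text = "a" * size  # no character matches, so every character is checked
    timings[size] = (best_of(check_list, text), best_of(check_set, text))
    list_s, set_s = timings[size]
    print(f"{size:>5} chars  list: {list_s:.5f}s  set: {set_s:.5f}s")
```

Taking the minimum over several trials follows the same "best of N runs" idea used for the runtime figure at the top of the report.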

This optimization is particularly valuable for NLP workloads processing Korean text at scale, where like_num would be called frequently during tokenization and linguistic analysis.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 118 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
from spacy.lang.ko.lex_attrs import like_num

# number-word list referenced by the function under test
_num_words = [
    "영",
    "공",
    # Native Korean number system
    "하나",
    "둘",
    "셋",
    "넷",
    "다섯",
    "여섯",
    "일곱",
    "여덟",
    "아홉",
    "열",
    "스물",
    "서른",
    "마흔",
    "쉰",
    "예순",
    "일흔",
    "여든",
    "아흔",
    # Sino-Korean number system
    "일",
    "이",
    "삼",
    "사",
    "오",
    "육",
    "칠",
    "팔",
    "구",
    "십",
    "백",
    "천",
    "만",
    "십만",
    "백만",
    "천만",
    "일억",
    "십억",
    "백억",
]

# unit tests

# -------------------------------
# Basic Test Cases
# -------------------------------

def test_basic_integer():
    # Positive integer
    codeflash_output = like_num("123") # 750ns -> 813ns (7.75% slower)
    # Zero
    codeflash_output = like_num("0") # 459ns -> 498ns (7.83% slower)
    # Negative integer with minus sign
    codeflash_output = like_num("-456") # 665ns -> 637ns (4.40% faster)
    # Positive integer with plus sign
    codeflash_output = like_num("+789") # 319ns -> 326ns (2.15% slower)
    # Integer with comma separators
    codeflash_output = like_num("1,234,567") # 591ns -> 588ns (0.510% faster)
    # Integer with dot separators (should be stripped)
    codeflash_output = like_num("1.000.000") # 386ns -> 395ns (2.28% slower)

def test_basic_fraction():
    # Simple fraction
    codeflash_output = like_num("1/2") # 1.46μs -> 1.55μs (6.13% slower)
    # Fraction with leading plus
    codeflash_output = like_num("+3/4") # 869ns -> 997ns (12.8% slower)
    # Fraction with leading minus
    codeflash_output = like_num("-5/6") # 530ns -> 542ns (2.21% slower)
    # Fraction with leading tilde
    codeflash_output = like_num("~7/8") # 510ns -> 530ns (3.77% slower)
    # Fraction with leading ±
    codeflash_output = like_num("±9/10") # 911ns -> 941ns (3.19% slower)

def test_basic_korean_words():
    # Native Korean number
    codeflash_output = like_num("하나") # 3.60μs -> 2.82μs (27.6% faster)
    codeflash_output = like_num("둘") # 1.50μs -> 1.69μs (11.3% slower)
    # Sino-Korean number
    codeflash_output = like_num("삼") # 905ns -> 790ns (14.6% faster)
    codeflash_output = like_num("십") # 840ns -> 674ns (24.6% faster)
    # Large Korean number
    codeflash_output = like_num("백만") # 902ns -> 745ns (21.1% faster)

# -------------------------------
# Edge Test Cases
# -------------------------------

def test_empty_and_non_numeric_strings():
    # Empty string
    codeflash_output = like_num("") # 1.43μs -> 1.67μs (14.8% slower)
    # String with only spaces
    codeflash_output = like_num("   ") # 2.11μs -> 1.35μs (56.5% faster)
    # Non-numeric string
    codeflash_output = like_num("hello") # 1.83μs -> 1.11μs (64.4% faster)
    # Mixed alpha-numeric string
    codeflash_output = like_num("abc123") # 1.75μs -> 959ns (82.7% faster)
    # String with only symbols
    codeflash_output = like_num("!@#$") # 1.43μs -> 919ns (55.7% faster)

def test_multiple_fraction_slashes():
    # More than one slash (not a valid fraction)
    codeflash_output = like_num("1/2/3") # 2.69μs -> 1.89μs (42.2% faster)
    # Just a slash
    codeflash_output = like_num("/") # 1.56μs -> 1.47μs (5.92% faster)
    # Slash with non-digit parts
    codeflash_output = like_num("a/b") # 1.53μs -> 1.11μs (38.0% faster)
    codeflash_output = like_num("1/b") # 1.30μs -> 819ns (58.4% faster)
    codeflash_output = like_num("a/2") # 1.18μs -> 707ns (66.5% faster)

def test_leading_signs_and_symbols():
    # Leading tilde
    codeflash_output = like_num("~123") # 1.03μs -> 1.10μs (6.19% slower)
    # Leading ±
    codeflash_output = like_num("±456") # 608ns -> 680ns (10.6% slower)
    # Leading multiple signs (should only strip one)
    codeflash_output = like_num("++789") # 2.20μs -> 1.66μs (32.2% faster)
    codeflash_output = like_num("--123") # 1.45μs -> 936ns (55.0% faster)

def test_decimal_and_comma_handling():
    # Number with both commas and dots
    codeflash_output = like_num("1,234.567") # 1.13μs -> 1.17μs (3.50% slower)
    # Number with only dots
    codeflash_output = like_num("123.456") # 525ns -> 555ns (5.41% slower)
    # Number with only commas
    codeflash_output = like_num("123,456") # 364ns -> 376ns (3.19% slower)
    # Fraction with commas and dots
    codeflash_output = like_num("1,000/2,000") # 1.09μs -> 1.13μs (3.81% slower)

def test_partial_korean_words():
    # Substring of a number word
    codeflash_output = like_num("하") # 2.90μs -> 2.30μs (26.1% faster)
    # Substring not in any number word
    codeflash_output = like_num("zzz") # 1.84μs -> 1.33μs (38.3% faster)
    # Korean word as part of a longer string
    codeflash_output = like_num("hello하나world") # 4.02μs -> 1.81μs (122% faster)
    # Multiple Korean number words in a string
    codeflash_output = like_num("하나둘셋") # 1.81μs -> 1.43μs (26.8% faster)

def test_case_sensitivity():
    # Uppercase Latin letters (should not match)
    codeflash_output = like_num("ONE") # 2.38μs -> 1.79μs (32.8% faster)
    # Hangul has no letter case, so .upper() returns the string unchanged
    codeflash_output = like_num("하나".upper()) # 1.98μs -> 1.45μs (36.5% faster)

def test_numeric_with_spaces():
    # Spaces around number
    codeflash_output = like_num(" 123 ") # 2.63μs -> 1.94μs (36.1% faster)
    # Spaces inside number
    codeflash_output = like_num("1 234") # 1.85μs -> 1.10μs (68.5% faster)

def test_invalid_fraction_format():
    # Fraction with missing numerator
    codeflash_output = like_num("/2") # 2.45μs -> 2.19μs (11.6% faster)
    # Fraction with missing denominator
    codeflash_output = like_num("3/") # 1.60μs -> 1.46μs (10.1% faster)
    # Fraction with non-digit numerator or denominator
    codeflash_output = like_num("a/2") # 1.40μs -> 1.06μs (32.4% faster)
    codeflash_output = like_num("3/b") # 1.26μs -> 824ns (52.5% faster)

# -------------------------------
# Large Scale Test Cases
# -------------------------------

def test_large_numeric_string():
    # Very large integer (999 digits)
    big_num = "9" * 999
    codeflash_output = like_num(big_num) # 2.54μs -> 2.62μs (3.13% slower)
    # Very large integer with commas
    big_num_commas = ",".join(["9"*3 for _ in range(333)])
    codeflash_output = like_num(big_num_commas) # 5.65μs -> 5.69μs (0.615% slower)

def test_large_fraction():
    # Fraction with large numerator and denominator
    num = "9" * 500
    denom = "8" * 499
    large_frac = f"{num}/{denom}"
    codeflash_output = like_num(large_frac) # 4.81μs -> 4.84μs (0.640% slower)

def test_large_repeated_korean_number_words():
    # String with many repeated Korean number words
    korean_num_word = "하나"
    long_korean = korean_num_word * 300  # 600 chars
    codeflash_output = like_num(long_korean) # 145μs -> 24.0μs (507% faster)

def test_large_non_numeric_string():
    # Very long string of non-numeric characters
    long_alpha = "a" * 1000
    codeflash_output = like_num(long_alpha) # 181μs -> 24.1μs (654% faster)

def test_large_mixed_string():
    # Large string with a single Korean number word in the middle
    s = "a" * 499 + "둘" + "b" * 499
    codeflash_output = like_num(s) # 93.2μs -> 14.4μs (545% faster)

# -------------------------------
# Mutation-sensitive/Negative Test Cases
# -------------------------------

def test_false_for_similar_non_numeric():
    # Should not match partial digit
    codeflash_output = like_num("1a2") # 2.51μs -> 1.94μs (29.6% faster)
    # Should not match if only sign
    codeflash_output = like_num("+") # 1.22μs -> 1.30μs (6.60% slower)
    # Should not match if only comma or dot
    codeflash_output = like_num(",") # 608ns -> 664ns (8.43% slower)
    codeflash_output = like_num(".") # 496ns -> 573ns (13.4% slower)
    # Should not match if only slash
    codeflash_output = like_num("/") # 1.37μs -> 1.28μs (6.39% faster)

def test_korean_word_not_in_list():
    # Korean word that's not a number
    codeflash_output = like_num("사랑") # 2.76μs -> 2.68μs (3.21% faster)

def test_fraction_with_letters():
    # Fraction with letters
    codeflash_output = like_num("a/b") # 2.85μs -> 2.10μs (36.0% faster)
    codeflash_output = like_num("1/b") # 1.59μs -> 1.25μs (27.9% faster)
    codeflash_output = like_num("a/2") # 1.29μs -> 874ns (47.7% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from spacy.lang.ko.lex_attrs import like_num

# number-word list referenced by the function under test
_num_words = [
    "영",
    "공",
    # Native Korean number system
    "하나",
    "둘",
    "셋",
    "넷",
    "다섯",
    "여섯",
    "일곱",
    "여덟",
    "아홉",
    "열",
    "스물",
    "서른",
    "마흔",
    "쉰",
    "예순",
    "일흔",
    "여든",
    "아흔",
    # Sino-Korean number system
    "일",
    "이",
    "삼",
    "사",
    "오",
    "육",
    "칠",
    "팔",
    "구",
    "십",
    "백",
    "천",
    "만",
    "십만",
    "백만",
    "천만",
    "일억",
    "십억",
    "백억",
]

# unit tests

# ------------------ Basic Test Cases ------------------

def test_basic_integer():
    # Simple integer
    codeflash_output = like_num("123") # 883ns -> 960ns (8.02% slower)
    # Integer with plus sign
    codeflash_output = like_num("+456") # 684ns -> 725ns (5.66% slower)
    # Integer with minus sign
    codeflash_output = like_num("-789") # 344ns -> 375ns (8.27% slower)
    # Integer with tilde
    codeflash_output = like_num("~101") # 329ns -> 335ns (1.79% slower)
    # Integer with ± sign
    codeflash_output = like_num("±202") # 484ns -> 506ns (4.35% slower)

def test_basic_with_commas_and_dots():
    # Integer with comma (should be accepted)
    codeflash_output = like_num("1,000") # 1.02μs -> 1.11μs (7.91% slower)
    # Integer with dot (should be accepted)
    codeflash_output = like_num("2.000") # 496ns -> 529ns (6.24% slower)
    # Integer with both comma and dot
    codeflash_output = like_num("1,234.567") # 546ns -> 553ns (1.27% slower)

def test_basic_fraction():
    # Simple fraction
    codeflash_output = like_num("3/4") # 1.50μs -> 1.64μs (8.47% slower)
    # Fraction with plus sign
    codeflash_output = like_num("+5/6") # 891ns -> 1.04μs (14.2% slower)
    # Fraction with minus sign
    codeflash_output = like_num("-7/8") # 530ns -> 572ns (7.34% slower)
    # Fraction with tilde
    codeflash_output = like_num("~9/10") # 776ns -> 807ns (3.84% slower)
    # Fraction with ± sign
    codeflash_output = like_num("±11/12") # 967ns -> 849ns (13.9% faster)

def test_basic_korean_numbers():
    # Native Korean number
    codeflash_output = like_num("하나") # 3.98μs -> 3.06μs (29.8% faster)
    # Sino-Korean number
    codeflash_output = like_num("삼") # 1.67μs -> 1.79μs (6.91% slower)
    # Multi-character Korean number
    codeflash_output = like_num("십만") # 957ns -> 834ns (14.7% faster)

# ------------------ Edge Test Cases ------------------

def test_empty_and_non_numeric():
    # Empty string
    codeflash_output = like_num("") # 1.48μs -> 1.63μs (9.02% slower)
    # Only sign, no number
    codeflash_output = like_num("+") # 964ns -> 1.03μs (6.41% slower)
    # Only slash
    codeflash_output = like_num("/") # 1.98μs -> 1.52μs (30.1% faster)
    # Only dot
    codeflash_output = like_num(".") # 626ns -> 676ns (7.40% slower)
    # Only comma
    codeflash_output = like_num(",") # 451ns -> 521ns (13.4% slower)
    # Only non-numeric, non-Korean text
    codeflash_output = like_num("abc") # 1.53μs -> 998ns (53.8% faster)
    # Mixed letters and digits (not pure number)
    codeflash_output = like_num("12a34") # 1.59μs -> 904ns (75.8% faster)
    # Fraction with non-digit numerator
    codeflash_output = like_num("a/5") # 1.38μs -> 924ns (49.2% faster)
    # Fraction with non-digit denominator
    codeflash_output = like_num("5/b") # 1.25μs -> 780ns (59.6% faster)
    # Fraction with both non-digits
    codeflash_output = like_num("a/b") # 1.16μs -> 670ns (72.5% faster)
    # Fraction with more than one slash
    codeflash_output = like_num("1/2/3") # 1.56μs -> 769ns (102% faster)

def test_edge_leading_trailing_spaces():
    # Leading and trailing spaces (should not match)
    codeflash_output = like_num(" 123 ") # 2.73μs -> 1.95μs (39.6% faster)
    # Spaces in fraction
    codeflash_output = like_num("1 /2") # 2.06μs -> 1.54μs (34.3% faster)
    codeflash_output = like_num("1/ 2") # 1.73μs -> 1.09μs (58.4% faster)
    codeflash_output = like_num(" 1/2 ") # 1.74μs -> 963ns (80.8% faster)

def test_edge_zero_and_variants():
    # Zero in digit
    codeflash_output = like_num("0") # 685ns -> 759ns (9.75% slower)
    # Korean zero
    codeflash_output = like_num("영") # 2.12μs -> 2.32μs (8.99% slower)
    # Korean zero variant
    codeflash_output = like_num("공") # 870ns -> 894ns (2.68% slower)

def test_edge_case_sensitive_korean():
    # Hangul has no letter case, so the input is passed as is
    codeflash_output = like_num("하나") # 2.76μs -> 2.11μs (30.9% faster)

def test_edge_multiple_signs():
    # Multiple leading signs (should only strip one, so should fail)
    codeflash_output = like_num("++123") # 3.02μs -> 2.30μs (31.3% faster)
    codeflash_output = like_num("--123") # 1.70μs -> 1.09μs (55.1% faster)

def test_edge_dot_and_comma_only():
    # Only dots and commas
    codeflash_output = like_num(".,.,.") # 1.74μs -> 2.00μs (12.8% slower)

def test_edge_long_korean_word():
    # Korean word containing a number word as substring
    codeflash_output = like_num("하나님") # 3.35μs -> 2.54μs (32.0% faster)
    codeflash_output = like_num("삼성") # 1.47μs -> 1.58μs (6.95% slower)

def test_edge_fraction_with_commas_and_dots():
    # Fraction with comma and dot in numerator/denominator
    codeflash_output = like_num("1,000/2.000") # 1.89μs -> 1.97μs (4.01% slower)

def test_edge_fraction_with_leading_signs_and_commas():
    codeflash_output = like_num("+1,000/2,000") # 1.91μs -> 1.99μs (3.92% slower)
    codeflash_output = like_num("-1,000/2,000") # 828ns -> 957ns (13.5% slower)

# ------------------ Large Scale Test Cases ------------------

def test_large_scale_long_digit_string():
    # Very long digit string (999 digits)
    long_num = "1" * 999
    codeflash_output = like_num(long_num) # 2.59μs -> 2.67μs (2.88% slower)
    # Very long digit string with commas every 3 digits
    long_num_commas = ",".join(["1"*3]*333)
    codeflash_output = like_num(long_num_commas) # 5.59μs -> 5.59μs (0.090% faster)

def test_large_scale_long_fraction():
    # Large numerator and denominator (each 500 digits)
    num = "9" * 500
    denom = "8" * 500
    codeflash_output = like_num(f"{num}/{denom}") # 4.98μs -> 4.99μs (0.160% slower)

def test_large_scale_many_korean_numbers():
    # String with many Korean number words concatenated
    korean_nums = "".join(_num_words)
    codeflash_output = like_num(korean_nums) # 2.96μs -> 3.14μs (5.76% slower)

def test_large_scale_non_numeric_string():
    # Long string of non-numeric, non-Korean characters
    long_str = "a" * 1000
    codeflash_output = like_num(long_str) # 181μs -> 24.2μs (652% faster)

def test_large_scale_mixed_content():
    # Large string of Latin letters with one Korean number word in the middle
    mixed = "abc" * 300 + "하나" + "xyz" * 300
    codeflash_output = like_num(mixed) # 331μs -> 43.2μs (669% faster)

def test_large_scale_fraction_with_large_numbers_and_sign():
    # Large numbers in fraction, with leading sign
    num = "1" * 500
    denom = "2" * 500
    codeflash_output = like_num(f"+{num}/{denom}") # 5.43μs -> 5.46μs (0.550% slower)
    codeflash_output = like_num(f"-{num}/{denom}") # 4.24μs -> 4.11μs (3.02% faster)

def test_large_scale_fraction_with_commas_and_dots():
    # Large numbers with commas and dots in fraction
    num = ",".join(["1"*3]*100)
    denom = ".".join(["2"*3]*100)
    codeflash_output = like_num(f"{num}/{denom}") # 5.94μs -> 5.94μs (0.084% slower)

# ------------------ Mutation Testing Guards ------------------

def test_mutation_guard_wrong_digit_detection():
    # Should not match if not all digits after stripping
    codeflash_output = like_num("12 34") # 3.13μs -> 2.35μs (33.2% faster)
    # Should not match if only partial Korean number word
    codeflash_output = like_num("하") # 2.17μs -> 1.69μs (28.5% faster)
    # "십만"[:-1] is "십", which is itself an entry in _num_words
    codeflash_output = like_num("십만"[:-1]) # 1.34μs -> 1.27μs (5.26% faster)

def test_mutation_guard_fraction_extra_slash():
    # Should not match if more than one slash
    codeflash_output = like_num("1/2/3") # 2.85μs -> 2.05μs (39.3% faster)

def test_mutation_guard_korean_word_case():
    # Should not match a Latin romanization of a Korean number word ("il" for "일")
    codeflash_output = like_num("il") # 2.31μs -> 1.88μs (22.5% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
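
The equivalence check that `codeflash_output` supports can be pictured as running both variants over the same inputs and comparing results. The sketch below does this for the per-character lookup only; the function names and the abbreviated word list are assumptions, and this is not how Codeflash itself is implemented.

```python
NUM_WORDS_LIST = ["영", "공", "하나", "둘", "셋", "일", "이", "삼", "십", "백", "천", "만"]  # abbreviated
NUM_WORDS_SET = frozenset(NUM_WORDS_LIST)


def check_original(text: str) -> bool:
    """List-based lookup, as in the line quoted in the explanation."""
    return any(char.lower() in NUM_WORDS_LIST for char in text)


def check_optimized(text: str) -> bool:
    """Set-based lookup standing in for the optimized version."""
    return any(char in NUM_WORDS_SET for char in text)


samples = ["", "123", "하나", "삼성", "hello하나world", "ONE", "일", "a" * 1000]

# Collect every input on which the two variants disagree.
mismatches = [s for s in samples if check_original(s) != check_optimized(s)]
print(f"{len(samples)} inputs checked, {len(mismatches)} mismatches")
```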

To edit these changes, run `git checkout codeflash/optimize-like_num-mhmik25w` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 5, 2025 21:32
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 5, 2025