@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 44% (0.44x) speedup for like_num in spacy/lang/kmr/lex_attrs.py

⏱️ Runtime : 25.0 milliseconds → 17.3 milliseconds (best of 54 runs)

📝 Explanation and details

The optimized version achieves a 44% speedup through three key performance optimizations:

1. Eliminated function call overhead by inlining is_digit
The original code called is_digit() for every text input, adding function call overhead. The optimized version inlines this logic directly into like_num, removing the function call entirely. This is particularly impactful since is_digit was called 14,443 times in profiling and represented 58.2% of the original runtime.
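
As a rough illustration of the inlining step, the sketch below compares a helper-based check with an inlined one. The names `has_ordinal_ending`, `check_with_helper`, and `check_inlined` are hypothetical, and the endings tuple is taken from the test comments in this PR; this is not the actual `lex_attrs.py` source, only a minimal sketch of the technique.

```python
# Endings as listed in the generated test comments ('em', 'yem', 'emîn', 'yemîn').
ENDINGS = ("em", "yem", "emîn", "yemîn")


def has_ordinal_ending(text):
    """Hypothetical stand-in for a helper like is_digit: digits plus an ordinal ending."""
    for ending in ENDINGS:
        if text.endswith(ending) and text[: -len(ending)].isdigit():
            return True
    return False


def check_with_helper(text):
    # One extra Python-level function call per input.
    return has_ordinal_ending(text.lower())


def check_inlined(text):
    # Same logic, written in place so no second call frame is created.
    text_lower = text.lower()
    for ending in ENDINGS:
        if text_lower.endswith(ending) and text_lower[: -len(ending)].isdigit():
            return True
    return False
```

Both variants should return the same result for any input; the inlined one only avoids the cost of the extra call.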

2. Converted list lookups to O(1) set lookups
The original code performed linear searches through _num_words and _ordinal_words lists for every lookup. The optimized version converts these to sets once and caches them as function attributes, changing O(n) list searches to O(1) set lookups. From profiling, the _num_words lookup took 5.1% of runtime and _ordinal_words lookup took 5.7% - both are now significantly faster.
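
The list-to-set change can be sketched as follows. The word list is a short excerpt of `_num_words` from this PR, while `is_num_word_list`, `is_num_word_set`, and the `_cache` attribute name are assumptions made for illustration; the actual optimized code may name and structure its cache differently.

```python
# Short excerpt of the Kurdish number words shown in this PR.
NUM_WORDS = ["sifir", "yek", "du", "sê", "çar", "pênc"]


def is_num_word_list(text):
    # Membership test on a list scans the elements one by one: O(n).
    return text.lower() in NUM_WORDS


def is_num_word_set(text):
    # Build the set once, on first call, and cache it on the function object.
    cache = getattr(is_num_word_set, "_cache", None)
    if cache is None:
        cache = frozenset(NUM_WORDS)
        is_num_word_set._cache = cache
    # Membership test on a set is a hash lookup: O(1) on average.
    return text.lower() in cache
```

A module-level `frozenset` would achieve the same effect; caching on the function attribute simply mirrors the approach the description mentions.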

3. Optimized string operations

  • Replaced text.startswith(("+", "-", "±", "~")) with text and text[0] in "+-±~" to avoid tuple creation and use faster character-in-string lookup
  • Added text.split("/", 1) to limit splits to just the first occurrence
  • Cached the endings tuple as a function attribute to avoid recreating it on every call
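
The string-level changes listed above can be sketched like this. The function names are hypothetical and the fraction check is a simplified assumption about how the split result is used, not a copy of the optimized source.

```python
SIGNS = "+-±~"


def strip_sign_startswith(text):
    # Original style: startswith with a tuple of prefixes.
    if text.startswith(("+", "-", "±", "~")):
        return text[1:]
    return text


def strip_sign_index(text):
    # Optimized style: the `text and` guard avoids an IndexError on "".
    if text and text[0] in SIGNS:
        return text[1:]
    return text


def is_simple_fraction(text):
    # maxsplit=1 stops after the first "/"; any further "/" stays in the
    # second part, which then fails isdigit().
    parts = text.split("/", 1)
    return len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit()
```

Note that `text[0] in SIGNS` is only safe here because a single character is being tested; a multi-character substring test against `SIGNS` would behave differently from `startswith`.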

Performance characteristics by test case:

  • Kurdish word lookups: 38-58% faster due to set-based lookups
  • Digit+ending forms: 35-78% faster from inlining is_digit and eliminating function calls
  • Invalid strings: 45-62% faster as the optimized logic exits earlier for non-matching cases
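  • Large-scale processing: Bulk loops over word lookups, digit+ending forms, and invalid strings run 12-78% faster; bulk loops over plain and signed integers are roughly unchanged (0.1% faster to 5.7% slower)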

The optimization maintains identical functionality while significantly reducing computational overhead, making it especially beneficial for text processing pipelines that frequently validate numeric-like tokens.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 11 Passed |
| 🌀 Generated Regression Tests | 26755 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| lang/kmr/test_text.py::test_kmr_lex_attrs_capitals | 2.90μs | 2.94μs | -1.43%⚠️ |
| lang/kmr/test_text.py::test_kmr_lex_attrs_like_number_for_ordinal | 30.3μs | 24.3μs | 25.0%✅ |

🌀 Generated Regression Tests and Runtime

import pytest # used for our unit tests
from spacy.lang.kmr.lex_attrs import like_num

# function to test

_num_words = [
"sifir",
"yek",
"du",
"sê",
"çar",
"pênc",
"şeş",
"heft",
"heşt",
"neh",
"deh",
"yazde",
"dazde",
"sêzde",
"çarde",
"pazde",
"şazde",
"hevde",
"hejde",
"nozde",
"bîst",
"sî",
"çil",
"pêncî",
"şêst",
"heftê",
"heştê",
"nod",
"sed",
"hezar",
"milyon",
"milyar",
]

_ordinal_words = [
"yekem",
"yekemîn",
"duyem",
"duyemîn",
"sêyem",
"sêyemîn",
"çarem",
"çaremîn",
"pêncem",
"pêncemîn",
"şeşem",
"şeşemîn",
"heftem",
"heftemîn",
"heştem",
"heştemîn",
"nehem",
"nehemîn",
"dehem",
"dehemîn",
"yazdehem",
"yazdehemîn",
"dazdehem",
"dazdehemîn",
"sêzdehem",
"sêzdehemîn",
"çardehem",
"çardehemîn",
"pazdehem",
"pazdehemîn",
"şanzdehem",
"şanzdehemîn",
"hevdehem",
"hevdehemîn",
"hejdehem",
"hejdehemîn",
"nozdehem",
"nozdehemîn",
"bîstem",
"bîstemîn",
"sîyem",
"sîyemîn",
"çilem",
"çilemîn",
"pêncîyem",
"pênciyemîn",
"şêstem",
"şêstemîn",
"heftêyem",
"heftêyemîn",
"heştêyem",
"heştêyemîn",
"notem",
"notemîn",
"sedem",
"sedemîn",
"hezarem",
"hezaremîn",
"milyonem",
"milyonemîn",
"milyarem",
"milyaremîn",
]
from spacy.lang.kmr.lex_attrs import like_num

# unit tests

# 1. Basic Test Cases

def test_basic_integer():
    # Simple integer string
    codeflash_output = like_num("123") # 1.22μs -> 1.13μs (8.44% faster)
    # Integer with leading plus
    codeflash_output = like_num("+456") # 796ns -> 853ns (6.68% slower)
    # Integer with leading minus
    codeflash_output = like_num("-789") # 440ns -> 417ns (5.52% faster)
    # Integer with leading tilde
    codeflash_output = like_num("~321") # 383ns -> 421ns (9.03% slower)
    # Integer with leading ±
    codeflash_output = like_num("±654") # 481ns -> 648ns (25.8% slower)

def test_basic_with_commas_and_dots():
    # Integer with comma
    codeflash_output = like_num("1,234") # 1.26μs -> 1.30μs (3.23% slower)
    # Integer with dot
    codeflash_output = like_num("5.678") # 645ns -> 637ns (1.26% faster)
    # Integer with both comma and dot
    codeflash_output = like_num("9,876.543") # 670ns -> 723ns (7.33% slower)

def test_basic_fraction():
    # Simple fraction
    codeflash_output = like_num("3/4") # 1.71μs -> 1.95μs (12.2% slower)
    # Fraction with leading plus
    codeflash_output = like_num("+10/11") # 1.42μs -> 1.47μs (3.33% slower)
    # Fraction with leading minus
    codeflash_output = like_num("-100/101") # 993ns -> 1.02μs (2.55% slower)
    # Fraction with comma and dot
    codeflash_output = like_num("1,234/5.678") # 1.16μs -> 1.10μs (4.81% faster)

def test_basic_kurdish_num_words():
    # All Kurdish number words
    for word in _num_words:
        codeflash_output = like_num(word) # 25.1μs -> 21.8μs (15.4% faster)
        # Test with uppercase
        codeflash_output = like_num(word.upper())

def test_basic_kurdish_ordinals():
    # All Kurdish ordinal words
    for word in _ordinal_words:
        codeflash_output = like_num(word) # 62.1μs -> 40.7μs (52.8% faster)
        # Test with uppercase
        codeflash_output = like_num(word.upper())

def test_basic_is_digit_endings():
    # Numbers ending with 'em', 'yem', 'emîn', 'yemîn'
    codeflash_output = like_num("123em") # 3.50μs -> 2.48μs (41.5% faster)
    codeflash_output = like_num("456yem") # 2.22μs -> 1.64μs (35.3% faster)
    codeflash_output = like_num("789emîn") # 2.33μs -> 1.95μs (19.9% faster)
    codeflash_output = like_num("321yemîn") # 2.12μs -> 1.44μs (47.2% faster)

def test_basic_false_cases():
    # Not a number, not a Kurdish word
    codeflash_output = like_num("hello") # 3.26μs -> 2.25μs (45.1% faster)
    codeflash_output = like_num("world123") # 2.02μs -> 1.43μs (42.1% faster)
    codeflash_output = like_num("123abc") # 1.79μs -> 1.10μs (63.3% faster)
    codeflash_output = like_num("abc123") # 1.39μs -> 906ns (53.1% faster)
    codeflash_output = like_num("yekemx") # 1.26μs -> 815ns (55.2% faster)

# 2. Edge Test Cases

def test_edge_empty_string():
    # Empty string should not be considered a number
    codeflash_output = like_num("") # 2.92μs -> 1.98μs (48.1% faster)

def test_edge_only_sign():
    # Only sign, no digits
    codeflash_output = like_num("+") # 3.24μs -> 2.38μs (36.3% faster)
    codeflash_output = like_num("-") # 1.75μs -> 1.26μs (39.0% faster)
    codeflash_output = like_num("±") # 1.47μs -> 965ns (52.0% faster)
    codeflash_output = like_num("~") # 1.40μs -> 877ns (59.6% faster)

def test_edge_fraction_invalid():
    # Fraction with non-digit numerator or denominator
    codeflash_output = like_num("a/4") # 3.54μs -> 2.84μs (24.7% faster)
    codeflash_output = like_num("4/b") # 2.23μs -> 1.61μs (38.6% faster)
    codeflash_output = like_num("a/b") # 1.61μs -> 1.06μs (52.4% faster)
    # Multiple slashes
    codeflash_output = like_num("1/2/3") # 2.25μs -> 1.20μs (87.4% faster)

def test_edge_leading_zeros():
    # Leading zeros should be valid
    codeflash_output = like_num("0000123") # 894ns -> 994ns (10.1% slower)
    codeflash_output = like_num("0000/0001") # 1.31μs -> 1.45μs (9.40% slower)

def test_edge_dot_and_comma_only():
    # Only dots and commas
    codeflash_output = like_num(".,") # 3.43μs -> 2.44μs (40.6% faster)
    codeflash_output = like_num(",.,") # 1.75μs -> 1.26μs (38.8% faster)

def test_edge_non_ascii_digits():
    # Non-ASCII digits should not be considered valid
    codeflash_output = like_num("١٢٣") # 1.37μs -> 1.41μs (2.91% slower)
    codeflash_output = like_num("123") # 629ns -> 673ns (6.54% slower)

def test_edge_mixed_case_kurdish():
    # Mixed case Kurdish number words
    codeflash_output = like_num("YeK") # 1.55μs -> 1.93μs (19.2% slower)
    codeflash_output = like_num("DuYeM") # 1.57μs -> 1.08μs (45.5% faster)

def test_edge_is_digit_with_non_digit_prefix():
    # Non-digit prefix with valid ending
    codeflash_output = like_num("abcem") # 4.27μs -> 2.79μs (53.2% faster)
    codeflash_output = like_num("abcyem") # 2.25μs -> 1.78μs (26.8% faster)

def test_edge_is_digit_with_sign_and_ending():
    # Sign with digit and ending
    codeflash_output = like_num("+123em") # 3.73μs -> 2.53μs (47.4% faster)
    codeflash_output = like_num("-456yem") # 2.36μs -> 1.69μs (39.9% faster)

def test_edge_is_digit_with_comma_dot_and_ending():
    # Comma/dot with digit and ending
    codeflash_output = like_num("1,234em") # 3.47μs -> 2.58μs (34.3% faster)
    codeflash_output = like_num("5.678yem") # 2.40μs -> 1.74μs (38.4% faster)

def test_edge_is_digit_with_uppercase_ending():
    # Uppercase endings should not be valid
    codeflash_output = like_num("123EM") # 3.45μs -> 2.31μs (49.2% faster)
    codeflash_output = like_num("456YEM") # 2.17μs -> 1.59μs (36.3% faster)

def test_edge_fraction_with_sign_and_commas():
    # Fraction with sign and commas
    codeflash_output = like_num("+1,234/5,678") # 2.13μs -> 2.38μs (10.7% slower)
    codeflash_output = like_num("-1.234/5.678") # 1.07μs -> 1.18μs (8.84% slower)

def test_edge_fraction_with_dot_in_numerator_or_denominator():
    # Dot inside numerator or denominator should be stripped
    codeflash_output = like_num("1.234/5678") # 1.78μs -> 1.89μs (5.88% slower)
    codeflash_output = like_num("1234/5.678") # 964ns -> 989ns (2.53% slower)

def test_edge_fraction_with_non_digit_after_strip():
    # Fraction where stripping still leaves non-digit
    codeflash_output = like_num("1,2a/345") # 4.32μs -> 3.15μs (37.4% faster)
    codeflash_output = like_num("123/4b5") # 2.35μs -> 1.74μs (35.4% faster)

# 3. Large Scale Test Cases

def test_large_many_numbers():
    # Test a large list of valid numbers
    for i in range(1, 1000):
        codeflash_output = like_num(str(i)) # 290μs -> 301μs (3.62% slower)
        codeflash_output = like_num(f"+{i}")
        codeflash_output = like_num(f"-{i}") # 337μs -> 354μs (4.60% slower)
        codeflash_output = like_num(f"{i}em")
        codeflash_output = like_num(f"{i}yem") # 338μs -> 351μs (3.76% slower)
        # Fractions
        codeflash_output = like_num(f"{i}/{i+1}")

def test_large_many_invalid():
    # Test a large list of invalid strings
    for i in range(1, 1000):
        codeflash_output = like_num(f"abc{i}") # 1.28ms -> 835μs (53.1% faster)
        codeflash_output = like_num(f"{i}xyz")
        codeflash_output = like_num(f"{i}/abc") # 1.27ms -> 823μs (54.2% faster)
        codeflash_output = like_num(f"abc/{i}")

def test_large_kurdish_words():
    # Test all Kurdish number and ordinal words with random casing
    for word in _num_words + _ordinal_words:
        codeflash_output = like_num(word.lower()) # 82.2μs -> 59.3μs (38.6% faster)
        codeflash_output = like_num(word.upper())
        codeflash_output = like_num(word.capitalize()) # 69.7μs -> 48.6μs (43.3% faster)

def test_large_fraction_with_large_numbers():
    # Large numerator and denominator
    codeflash_output = like_num("123456/789012") # 1.95μs -> 2.10μs (7.32% slower)
    codeflash_output = like_num("+123456/789012") # 1.30μs -> 1.27μs (2.04% faster)
    codeflash_output = like_num("123,456/789,012") # 1.12μs -> 985ns (13.5% faster)

def test_large_is_digit_endings():
    # Large numbers with endings
    for i in range(100, 1000):
        codeflash_output = like_num(f"{i}em") # 1.08ms -> 607μs (78.0% faster)
        codeflash_output = like_num(f"{i}yem")
        codeflash_output = like_num(f"{i}emîn") # 1.15ms -> 735μs (57.0% faster)
        codeflash_output = like_num(f"{i}yemîn")

def test_large_false_cases_with_endings():
    # Large numbers with invalid endings
    for i in range(100, 1000):
        codeflash_output = like_num(f"{i}EM") # 1.07ms -> 601μs (77.7% faster)
        codeflash_output = like_num(f"{i}YEM")
        codeflash_output = like_num(f"{i}EMIN") # 1.14ms -> 732μs (56.0% faster)
        codeflash_output = like_num(f"{i}YEMIN")

#------------------------------------------------
import pytest # used for our unit tests
from spacy.lang.kmr.lex_attrs import like_num

# function to test

_num_words = [
"sifir",
"yek",
"du",
"sê",
"çar",
"pênc",
"şeş",
"heft",
"heşt",
"neh",
"deh",
"yazde",
"dazde",
"sêzde",
"çarde",
"pazde",
"şazde",
"hevde",
"hejde",
"nozde",
"bîst",
"sî",
"çil",
"pêncî",
"şêst",
"heftê",
"heştê",
"nod",
"sed",
"hezar",
"milyon",
"milyar",
]

_ordinal_words = [
"yekem",
"yekemîn",
"duyem",
"duyemîn",
"sêyem",
"sêyemîn",
"çarem",
"çaremîn",
"pêncem",
"pêncemîn",
"şeşem",
"şeşemîn",
"heftem",
"heftemîn",
"heştem",
"heştemîn",
"nehem",
"nehemîn",
"dehem",
"dehemîn",
"yazdehem",
"yazdehemîn",
"dazdehem",
"dazdehemîn",
"sêzdehem",
"sêzdehemîn",
"çardehem",
"çardehemîn",
"pazdehem",
"pazdehemîn",
"şanzdehem",
"şanzdehemîn",
"hevdehem",
"hevdehemîn",
"hejdehem",
"hejdehemîn",
"nozdehem",
"nozdehemîn",
"bîstem",
"bîstemîn",
"sîyem",
"sîyemîn",
"çilem",
"çilemîn",
"pêncîyem",
"pênciyemîn",
"şêstem",
"şêstemîn",
"heftêyem",
"heftêyemîn",
"heştêyem",
"heştêyemîn",
"notem",
"notemîn",
"sedem",
"sedemîn",
"hezarem",
"hezaremîn",
"milyonem",
"milyonemîn",
"milyarem",
"milyaremîn",
]
from spacy.lang.kmr.lex_attrs import like_num

# unit tests

# -------------------
# Basic Test Cases
# -------------------

def test_basic_integer():
    # Should recognize basic integer strings
    codeflash_output = like_num("123") # 1.41μs -> 1.34μs (5.00% faster)
    codeflash_output = like_num("0") # 559ns -> 544ns (2.76% faster)
    codeflash_output = like_num("456789") # 478ns -> 491ns (2.65% slower)

def test_basic_signed_integer():
    # Should recognize signed numbers
    codeflash_output = like_num("+123") # 1.25μs -> 1.26μs (1.19% slower)
    codeflash_output = like_num("-456") # 531ns -> 559ns (5.01% slower)
    codeflash_output = like_num("~789") # 437ns -> 417ns (4.80% faster)
    codeflash_output = like_num("±0") # 655ns -> 761ns (13.9% slower)

def test_basic_with_commas_and_dots():
    # Should ignore commas and dots in numbers
    codeflash_output = like_num("1,234") # 1.26μs -> 1.22μs (3.37% faster)
    codeflash_output = like_num("12.345") # 636ns -> 621ns (2.42% faster)
    codeflash_output = like_num("1,234,567") # 550ns -> 548ns (0.365% faster)
    codeflash_output = like_num("1.234.567") # 461ns -> 465ns (0.860% slower)
    codeflash_output = like_num("1,234.567") # 591ns -> 623ns (5.14% slower)

def test_basic_fraction():
    # Should recognize fractions with digits
    codeflash_output = like_num("1/2") # 1.87μs -> 2.05μs (8.96% slower)
    codeflash_output = like_num("123/456") # 1.14μs -> 1.13μs (0.442% faster)
    codeflash_output = like_num("0001/0002") # 905ns -> 907ns (0.221% slower)

def test_basic_num_words():
    # Should recognize all _num_words
    for word in _num_words:
        codeflash_output = like_num(word) # 25.0μs -> 22.1μs (13.4% faster)
        # Should be case-insensitive
        codeflash_output = like_num(word.upper())
        codeflash_output = like_num(word.capitalize()) # 18.6μs -> 16.1μs (15.0% faster)

def test_basic_ordinals():
    # Should recognize all _ordinal_words
    for word in _ordinal_words:
        codeflash_output = like_num(word) # 61.8μs -> 40.6μs (52.3% faster)
        # Should be case-insensitive
        codeflash_output = like_num(word.upper())
        codeflash_output = like_num(word.capitalize()) # 53.1μs -> 33.5μs (58.6% faster)

def test_basic_is_digit_endings():
    # Should recognize digit+ending forms
    codeflash_output = like_num("123em") # 3.70μs -> 2.58μs (43.1% faster)
    codeflash_output = like_num("456yem") # 2.17μs -> 1.57μs (38.1% faster)
    codeflash_output = like_num("789emîn") # 2.47μs -> 1.91μs (29.7% faster)
    codeflash_output = like_num("101yemîn") # 2.06μs -> 1.53μs (34.5% faster)
    # Should be case-insensitive
    codeflash_output = like_num("123EM") # 1.47μs -> 860ns (71.4% faster)
    codeflash_output = like_num("456YEM") # 1.47μs -> 863ns (69.8% faster)
    codeflash_output = like_num("789EMÎN") # 1.59μs -> 992ns (60.3% faster)
    codeflash_output = like_num("101YEMÎN") # 1.64μs -> 1.11μs (47.2% faster)

# -------------------
# Edge Test Cases
# -------------------

def test_edge_empty_string():
    # Empty string should not be recognized as a number
    codeflash_output = not like_num("") # 2.96μs -> 2.09μs (41.5% faster)

def test_edge_only_sign():
    # Only sign, no digits
    codeflash_output = not like_num("+") # 3.28μs -> 2.34μs (40.1% faster)
    codeflash_output = not like_num("-") # 1.74μs -> 1.27μs (37.2% faster)
    codeflash_output = not like_num("~") # 1.45μs -> 925ns (56.3% faster)
    codeflash_output = not like_num("±") # 1.36μs -> 890ns (53.0% faster)

def test_edge_non_digit_fraction():
    # Fractions with non-digit parts
    codeflash_output = not like_num("a/2") # 3.77μs -> 3.03μs (24.6% faster)
    codeflash_output = not like_num("2/b") # 2.18μs -> 1.56μs (39.7% faster)
    codeflash_output = not like_num("abc/def") # 2.44μs -> 1.69μs (45.1% faster)
    codeflash_output = not like_num("/2") # 1.93μs -> 1.24μs (55.9% faster)
    codeflash_output = not like_num("2/") # 1.68μs -> 1.18μs (42.7% faster)
    codeflash_output = not like_num("/") # 1.60μs -> 1.21μs (32.2% faster)

def test_edge_multiple_slashes():
    # Should not recognize multiple slashes
    codeflash_output = not like_num("1/2/3") # 3.50μs -> 2.33μs (50.1% faster)
    codeflash_output = not like_num("123/456/789") # 2.05μs -> 1.56μs (31.2% faster)

def test_edge_non_num_word():
    # Should not recognize unrelated words
    codeflash_output = not like_num("hello") # 3.31μs -> 2.15μs (53.9% faster)
    codeflash_output = not like_num("number") # 2.18μs -> 1.48μs (48.1% faster)
    codeflash_output = not like_num("sifirr") # 1.46μs -> 975ns (49.3% faster)
    codeflash_output = not like_num("yekemem") # 2.31μs -> 1.41μs (63.6% faster)

def test_edge_dot_and_comma_only():
    # Only punctuation, no digits
    codeflash_output = not like_num(".") # 2.87μs -> 2.15μs (33.5% faster)
    codeflash_output = not like_num(",") # 1.63μs -> 1.22μs (33.5% faster)
    codeflash_output = not like_num("..") # 1.50μs -> 952ns (57.2% faster)
    codeflash_output = not like_num(",,") # 1.33μs -> 793ns (68.1% faster)

def test_edge_leading_zeroes():
    # Should recognize numbers with leading zeroes
    codeflash_output = like_num("00001") # 902ns -> 1.08μs (16.3% slower)
    codeflash_output = like_num("+00002") # 853ns -> 764ns (11.6% faster)
    codeflash_output = like_num("00003em") # 2.84μs -> 1.88μs (50.8% faster)
    codeflash_output = like_num("00004/00005") # 1.12μs -> 1.19μs (6.22% slower)

def test_edge_trailing_and_leading_spaces():
    # Should not recognize numbers with spaces (since no strip)
    codeflash_output = not like_num(" 123") # 3.00μs -> 2.29μs (30.8% faster)
    codeflash_output = not like_num("123 ") # 1.71μs -> 1.34μs (26.8% faster)
    codeflash_output = not like_num(" 456em ") # 1.76μs -> 1.20μs (46.4% faster)
    codeflash_output = not like_num(" 1/2 ") # 2.44μs -> 1.62μs (50.1% faster)

def test_edge_mixed_case_endings():
    # Should recognize mixed case endings
    codeflash_output = like_num("123Em") # 3.45μs -> 2.41μs (43.1% faster)
    codeflash_output = like_num("456YeM") # 2.29μs -> 1.64μs (39.8% faster)
    codeflash_output = like_num("789EmîN") # 2.94μs -> 2.18μs (35.2% faster)
    codeflash_output = like_num("101YeMîN") # 2.38μs -> 1.55μs (53.4% faster)

def test_edge_non_ascii_digits():
    # Should not recognize non-ASCII digits
    codeflash_output = not like_num("١٢٣") # 1.26μs -> 1.26μs (0.159% slower)
    codeflash_output = not like_num("123") # 605ns -> 685ns (11.7% slower)

def test_edge_negative_fraction():
    # Should recognize signed fractions
    codeflash_output = like_num("-1/2") # 2.02μs -> 2.13μs (4.98% slower)
    codeflash_output = like_num("+123/456") # 1.27μs -> 1.32μs (3.64% slower)
    codeflash_output = like_num("~789/101") # 754ns -> 779ns (3.21% slower)
    codeflash_output = like_num("±2/3") # 838ns -> 949ns (11.7% slower)

def test_edge_fraction_with_commas_and_dots():
    # Should recognize fractions with commas and dots
    codeflash_output = like_num("1,234/5,678") # 1.92μs -> 2.07μs (7.19% slower)
    codeflash_output = like_num("12.34/56.78") # 978ns -> 1.07μs (8.43% slower)
    codeflash_output = like_num("1,234.56/7,890.12") # 1.19μs -> 1.22μs (2.45% slower)

def test_edge_num_with_punctuation():
    # Should not recognize numbers with other punctuation
    codeflash_output = not like_num("123!") # 3.66μs -> 2.49μs (46.8% faster)
    codeflash_output = not like_num("456?") # 1.72μs -> 1.28μs (33.7% faster)
    codeflash_output = not like_num("789#") # 1.37μs -> 1.01μs (35.6% faster)
    codeflash_output = not like_num("101$") # 1.27μs -> 938ns (35.1% faster)

def test_edge_num_word_with_digits():
    # Should not recognize num words with digits attached
    codeflash_output = not like_num("sifir1") # 3.15μs -> 2.34μs (34.2% faster)
    codeflash_output = not like_num("yek123") # 1.68μs -> 1.36μs (23.0% faster)
    codeflash_output = not like_num("du456") # 2.10μs -> 1.07μs (95.0% faster)

def test_edge_ord_word_with_digits():
    # Should not recognize ordinal words with digits attached
    codeflash_output = not like_num("yekem1") # 3.08μs -> 2.20μs (40.0% faster)
    codeflash_output = not like_num("duyem123") # 2.19μs -> 1.49μs (47.3% faster)
    codeflash_output = not like_num("sêyem456") # 2.62μs -> 1.65μs (58.6% faster)

def test_edge_digit_with_invalid_ending():
    # Should not recognize digit with invalid ending
    codeflash_output = not like_num("123xyz") # 2.98μs -> 2.23μs (33.6% faster)
    codeflash_output = not like_num("456abc") # 1.68μs -> 1.19μs (40.9% faster)
    codeflash_output = not like_num("789emx") # 1.29μs -> 1.00μs (28.0% faster)
    codeflash_output = not like_num("101yemx") # 1.67μs -> 1.07μs (56.2% faster)

def test_edge_fraction_with_invalid_ending():
    # Should not recognize fraction with ending
    codeflash_output = not like_num("123em/456em") # 3.80μs -> 3.15μs (20.8% faster)
    codeflash_output = not like_num("123yem/456yem") # 2.39μs -> 2.00μs (19.7% faster)

# -------------------
# Large Scale Test Cases
# -------------------

def test_large_scale_many_integers():
    # Should recognize many integers
    for i in range(1, 1000):
        codeflash_output = like_num(str(i)) # 292μs -> 292μs (0.106% faster)
        codeflash_output = like_num("+" + str(i))
        codeflash_output = like_num("-" + str(i)) # 335μs -> 355μs (5.73% slower)

def test_large_scale_many_fractions():
    # Should recognize many fractions
    for i in range(1, 1000, 50):
        for j in range(1, 1000, 50):
            codeflash_output = like_num(f"{i}/{j}")
            codeflash_output = like_num(f"+{i}/{j}")
            codeflash_output = like_num(f"-{i}/{j}")

def test_large_scale_many_em_endings():
    # Should recognize many digit+ending forms
    for i in range(1, 1000, 50):
        codeflash_output = like_num(f"{i}em") # 29.6μs -> 16.9μs (75.3% faster)
        codeflash_output = like_num(f"{i}yem")
        codeflash_output = like_num(f"{i}emîn") # 28.6μs -> 18.8μs (52.6% faster)
        codeflash_output = like_num(f"{i}yemîn")

def test_large_scale_many_non_numbers():
    # Should not recognize random strings of similar length
    for i in range(1, 1000, 50):
        s = "a" * i
        codeflash_output = not like_num(s) # 44.7μs -> 39.9μs (12.0% faster)
        codeflash_output = not like_num(s + "em")
        codeflash_output = not like_num(s + "123") # 47.2μs -> 42.0μs (12.6% faster)
        codeflash_output = not like_num("em" + s)

def test_large_scale_ordinals_and_num_words():
    # Should recognize all num and ordinal words in a large list
    all_words = _num_words + _ordinal_words
    for word in all_words:
        for i in range(10): # repeat to simulate large scale
            codeflash_output = like_num(word)
            codeflash_output = like_num(word.upper())
            codeflash_output = like_num(word.capitalize())

def test_large_scale_mixed_valid_and_invalid():
    # Mix valid and invalid entries in a large list
    valid = [str(i) for i in range(1, 501)] + [f"{i}em" for i in range(1, 501)]
    invalid = ["foo", "bar", "baz", "qux", "quux"] * 100
    for item in valid:
        codeflash_output = like_num(item) # 718μs -> 482μs (48.9% faster)
    for item in invalid:
        codeflash_output = not like_num(item) # 624μs -> 384μs (62.5% faster)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
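
Conceptually, this kind of behavioral-equivalence check can be sketched as below. The helper `find_mismatches` and the two toy implementations are hypothetical and only illustrate the idea; how Codeflash captures and compares `codeflash_output` internally is not shown in this PR.

```python
def find_mismatches(original, optimized, inputs):
    """Return the inputs on which the two callables produce different outputs."""
    mismatches = []
    for value in inputs:
        if original(value) != optimized(value):
            mismatches.append(value)
    return mismatches


def original_impl(text):
    # Toy "before" version: linear list lookup.
    return text.lower() in ["yek", "du"]


_WORDS = frozenset(["yek", "du"])


def optimized_impl(text):
    # Toy "after" version: set lookup with the same observable behavior.
    return text.lower() in _WORDS
```

An empty result from `find_mismatches(original_impl, optimized_impl, samples)` means the two versions agreed on every sampled input, which is evidence of equivalence on those inputs rather than a proof for all inputs.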

To edit these changes, run git checkout codeflash/optimize-like_num-mhwsvjwv, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 02:18
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Nov 13, 2025