@codeflash-ai codeflash-ai bot commented Nov 5, 2025

📄 1,072% (10.72x) speedup for like_num in spacy/lang/yo/lex_attrs.py

⏱️ Runtime: 10.2 milliseconds → 872 microseconds (best of 169 runs)

📝 Explanation and details

The optimized code achieves a 10.7x speedup by no longer rebuilding the accent-stripped number-word list on every call, which was the most expensive operation in the original implementation. The key optimization is lazy initialization with caching of the stripped number words.

What was optimized:

  1. **Eliminated repeated expensive computation**: The original code evaluated `[strip_accents_text(num) for num in _num_words]` on every `like_num()` call that got past the marker check. This was the bottleneck (95% of execution time according to the line profiler).

  2. **Added lazy caching mechanism**: `_get_num_words_stripped()` computes the stripped words once, on first use, and stores them in a function-attribute cache. It also converts the list to a set, so lookups are O(1) on average instead of O(n).

  3. **Simplified lookup logic**: Replaced two membership checks (`text in _num_words_stripped or text.lower() in _num_words_stripped`) with a single check after lowercasing once (`stripped_text.lower() in _get_num_words_stripped()`).

  4. **Minor optimization**: Changed `num_markers` from a list to a tuple. CPython can store a tuple of constants once instead of rebuilding a list on each call. The markers are iterated rather than searched, so the gain is small.

Why this is faster:

  • The original code made one `strip_accents_text()` call per entry in `_num_words` (77 entries in the copy of the list reproduced in the tests below) on every invocation that got past the marker check
  • The optimized version makes at most 1 `strip_accents_text()` call per invocation (on the input text only), plus a one-time pass over `_num_words` on first use
  • Set membership testing is O(1) on average vs O(n) for list membership
  • Unicode normalization and accent stripping are comparatively expensive, so they should not be repeated on data that never changes

Performance impact:
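The tests show the largest improvements (1000-4000% faster for many test cases) for inputs that reach the accent-stripping step: digits, number words, and words that match nothing, as in `test_basic_digits()`. Inputs that contain a marker substring return before that step, so their timings change by only a few percent, as in `test_basic_num_markers()`. For inputs of several hundred to 1,000 characters the gain is smaller (roughly 110-360%), presumably because stripping the input itself then accounts for most of the remaining time.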
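
Which inputs benefit can be predicted from the marker check alone. The helper below is a hypothetical illustration: `hits_marker` is not part of spaCy, and the marker list is copied from the generated tests in this PR.

```python
# Marker substrings as listed in the PR's generated tests; matching is case-sensitive.
NUM_MARKERS = ("dí", "dọ", "lé", "dín", "di", "din", "le", "do")


def hits_marker(text):
    """Return True if `text` would return at the marker check, before accent stripping."""
    text = text.replace(",", "").replace(".", "")
    return any(mark in text for mark in NUM_MARKERS)


for sample in ["apple", "dinosaur", "banana", "123", "DIN"]:
    path = "marker check (early return)" if hits_marker(sample) else "accent stripping + lookup"
    print(f"{sample!r}: {path}")
```

This matches the timings reported in the tests: "apple" (contains "le") and "dinosaur" (contains "di") barely change, while "banana", "123", and the uppercase "DIN" go through the lookup and show the large speedups.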

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 10 Passed |
| 🌀 Generated Regression Tests | 263 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|--------------------------|-------------|--------------|---------|
| `lang/yo/test_text.py::test_yo_lex_attrs_capitals` | 717μs | 28.2μs | 2448%✅ |

🌀 Generated Regression Tests and Runtime
import unicodedata

# imports
import pytest  # used for our unit tests
from spacy.lang.yo.lex_attrs import like_num

_num_words = [
    "ení",
    "oókàn",
    "ọ̀kanlá",
    "ẹ́ẹdọ́gbọ̀n",
    "àádọ́fà",
    "ẹ̀walélúɡba",
    "egbèje",
    "ẹgbàárin",
    "èjì",
    "eéjì",
    "èjìlá",
    "ọgbọ̀n,",
    "ọgọ́fà",
    "ọ̀ọ́dúrún",
    "ẹgbẹ̀jọ",
    "ẹ̀ẹ́dẹ́ɡbàárùn",
    "ẹ̀ta",
    "ẹẹ́ta",
    "ẹ̀talá",
    "aárùndílogójì",
    "àádóje",
    "irinwó",
    "ẹgbẹ̀sàn",
    "ẹgbàárùn",
    "ẹ̀rin",
    "ẹẹ́rin",
    "ẹ̀rinlá",
    "ogójì",
    "ogóje",
    "ẹ̀ẹ́dẹ́gbẹ̀ta",
    "ẹgbàá",
    "ẹgbàájọ",
    "àrún",
    "aárùn",
    "ẹ́ẹdógún",
    "àádọ́ta",
    "àádọ́jọ",
    "ẹgbẹ̀ta",
    "ẹgboókànlá",
    "ẹgbàawǎ",
    "ẹ̀fà",
    "ẹẹ́fà",
    "ẹẹ́rìndílógún",
    "ọgọ́ta",
    "ọgọ́jọ",
    "ọ̀ọ́dẹ́gbẹ̀rin",
    "ẹgbẹ́ẹdógún",
    "ọkẹ́marun",
    "èje",
    "etàdílógún",
    "àádọ́rin",
    "àádọ́sán",
    "ẹgbẹ̀rin",
    "ẹgbàajì",
    "ẹgbẹ̀ẹgbẹ̀rún",
    "ẹ̀jọ",
    "ẹẹ́jọ",
    "eéjìdílógún",
    "ọgọ́rin",
    "ọgọsàn",
    "ẹ̀ẹ́dẹ́gbẹ̀rún",
    "ẹgbẹ́ẹdọ́gbọ̀n",
    "ọgọ́rùn ọkẹ́",
    "ẹ̀sán",
    "ẹẹ́sàn",
    "oókàndílógún",
    "àádọ́rùn",
    "ẹ̀wadilúɡba",
    "ẹgbẹ̀rún",
    "ẹgbàáta",
    "ẹ̀wá",
    "ẹẹ́wàá",
    "ogún",
    "ọgọ́rùn",
    "igba",
    "ẹgbẹ̀fà",
    "ẹ̀ẹ́dẹ́ɡbarin",
]
# strip_accents_text is called in test_large_many_numwords below
from spacy.lang.yo.lex_attrs import like_num, strip_accents_text

# unit tests

# 1. Basic Test Cases

def test_basic_digits():
    # Test single digit
    codeflash_output = like_num("1") # 89.1μs -> 2.83μs (3052% faster)
    # Test multi-digit number
    codeflash_output = like_num("123") # 85.6μs -> 2.03μs (4109% faster)
    # Test number with leading zeros
    codeflash_output = like_num("00123") # 83.6μs -> 1.69μs (4838% faster)


def test_basic_num_markers():
    # Test presence of num_markers in string
    codeflash_output = like_num("dínlogún") # 2.32μs -> 2.10μs (10.5% faster)
    codeflash_output = like_num("dinlogun") # 1.19μs -> 1.20μs (1.17% slower)
    codeflash_output = like_num("dílogun") # 587ns -> 565ns (3.89% faster)
    codeflash_output = like_num("dologun") # 1.02μs -> 1.03μs (0.872% slower)
    codeflash_output = like_num("lelogun") # 664ns -> 644ns (3.11% faster)
    codeflash_output = like_num("lélogun") # 590ns -> 548ns (7.66% faster)

def test_basic_punctuation_removal():
    # Test number with comma
    codeflash_output = like_num("1,000") # 103μs -> 4.46μs (2215% faster)
    # Test number with dot
    codeflash_output = like_num("1.000") # 88.1μs -> 1.86μs (4631% faster)
    # Test word with comma
    codeflash_output = like_num("ọgbọ̀n,") # 87.4μs -> 5.39μs (1520% faster)
    # Test word with dot
    codeflash_output = like_num("ẹ́ẹdọ́gbọ̀n.") # 1.42μs -> 1.17μs (21.2% faster)

# 2. Edge Test Cases

def test_edge_empty_string():
    # Empty string should not be a number
    codeflash_output = like_num("") # 91.7μs -> 3.14μs (2818% faster)

def test_edge_spaces_only():
    # String with only spaces should not be a number
    codeflash_output = like_num("   ") # 91.7μs -> 3.71μs (2370% faster)

def test_edge_non_numeric_non_numword():
    # Completely unrelated word
    codeflash_output = like_num("apple") # 1.93μs -> 1.74μs (10.9% faster)
    # Word with accents but not in _num_words
    codeflash_output = like_num("àbùlè") # 92.4μs -> 4.97μs (1761% faster)


def test_edge_mixed_content():
    # String with digits and letters
    codeflash_output = like_num("abc123") # 105μs -> 6.14μs (1616% faster)
    # String with numword and unrelated suffix
    codeflash_output = like_num("ẹ́ẹdọ́gbọ̀nxyz") # 1.78μs -> 1.57μs (13.0% faster)

def test_edge_case_sensitivity():
    # Lowercase, uppercase, and mixed case
    codeflash_output = like_num("ẹ́ẹdọ́gbọ̀n") # 1.95μs -> 1.81μs (7.92% faster)
    codeflash_output = like_num("Ẹ́ẸDỌ́GBỌ̀N") # 96.8μs -> 6.44μs (1402% faster)
    codeflash_output = like_num("Ẹ́ẹDọ́Gbọ̀N") # 88.5μs -> 3.25μs (2623% faster)

def test_edge_marker_in_non_numword():
    # Marker present but not in a number context
    codeflash_output = like_num("dinosaur") # 1.84μs -> 1.67μs (9.94% faster)
    codeflash_output = like_num("dining") # 836ns -> 863ns (3.13% slower)
    codeflash_output = like_num("lemon") # 835ns -> 822ns (1.58% faster)

def test_edge_marker_at_edges():
    # Marker at start, end, or middle
    codeflash_output = like_num("di123") # 1.62μs -> 1.40μs (15.5% faster)
    codeflash_output = like_num("123di") # 817ns -> 785ns (4.08% faster)
    codeflash_output = like_num("12di3") # 679ns -> 638ns (6.43% faster)

def test_edge_marker_with_punctuation():
    # Marker with punctuation
    codeflash_output = like_num("dí,") # 1.87μs -> 1.74μs (7.00% faster)
    codeflash_output = like_num("dín.") # 780ns -> 711ns (9.70% faster)

def test_edge_digit_with_accents():
    # Digits with combining marks (should be stripped)
    accented_digit = unicodedata.normalize("NFD", "1́")  # '1' + combining acute accent
    codeflash_output = like_num(accented_digit) # 96.9μs -> 3.82μs (2440% faster)

def test_edge_numword_with_extra_spaces():
    # Numword with leading/trailing spaces
    codeflash_output = like_num("  ẹ́ẹdọ́gbọ̀n  ") # 1.85μs -> 1.63μs (13.1% faster)

def test_edge_numword_with_internal_spaces():
    # Numword with internal spaces
    codeflash_output = like_num("ẹ́ẹ dọ́gbọ̀n") # 1.72μs -> 1.57μs (9.41% faster)

def test_edge_marker_with_case_variations():
    # Marker with different cases
    codeflash_output = like_num("DiLogun") # 96.8μs -> 5.29μs (1730% faster)
    codeflash_output = like_num("LELOGUN") # 89.2μs -> 2.55μs (3393% faster)

def test_edge_marker_with_accents():
    # Marker with accents
    codeflash_output = like_num("dílogún") # 1.67μs -> 1.40μs (19.1% faster)
    codeflash_output = like_num("lélogún") # 835ns -> 783ns (6.64% faster)

# 3. Large Scale Test Cases

def test_large_many_digits():
    # Large number (up to 1000 digits)
    large_number = "9" * 1000
    codeflash_output = like_num(large_number) # 165μs -> 73.8μs (125% faster)

def test_large_many_numwords():
    # Test all numwords (with accents)
    for word in _num_words:
        codeflash_output = like_num(word)
    # Test all numwords (without accents)
    for word in _num_words:
        codeflash_output = like_num(strip_accents_text(word))

def test_large_marker_in_long_string():
    # Marker in a long string
    long_string = "a" * 500 + "din" + "b" * 499
    codeflash_output = like_num(long_string) # 3.85μs -> 3.63μs (5.97% faster)

def test_large_non_numword_long_string():
    # Long string without any marker or numword
    long_string = "a" * 1000
    codeflash_output = like_num(long_string) # 172μs -> 73.8μs (134% faster)

def test_large_mixed_content():
    # Large string with digits embedded
    mixed_string = "a" * 500 + "12345" + "b" * 495
    codeflash_output = like_num(mixed_string) # 166μs -> 74.2μs (124% faster)

def test_large_marker_at_edges_long_string():
    # Marker at start
    codeflash_output = like_num("din" + "a" * 997) # 3.08μs -> 2.91μs (5.91% faster)
    # Marker at end
    codeflash_output = like_num("a" * 997 + "din") # 2.31μs -> 2.23μs (3.32% faster)
    # Marker in middle
    codeflash_output = like_num("a" * 499 + "din" + "b" * 498) # 1.97μs -> 1.93μs (2.12% faster)

def test_large_numword_with_punctuation():
    # Numword with punctuation at the end
    for word in _num_words[:10]:  # Test first 10 for brevity
        codeflash_output = like_num(word + ",") # 600μs -> 24.7μs (2328% faster)
        codeflash_output = like_num(word + ".")

def test_large_numword_with_case_variations():
    # Numword with different cases
    for word in _num_words[:10]:  # Test first 10 for brevity
        codeflash_output = like_num(word.upper()) # 841μs -> 27.2μs (2991% faster)
        codeflash_output = like_num(word.lower())
        codeflash_output = like_num(word.capitalize()) # 579μs -> 15.3μs (3697% faster)

def test_large_marker_with_case_and_punctuation():
    # Marker with different cases and punctuation in long string
    for marker in ["di", "din", "le", "do"]:
        codeflash_output = like_num(marker.upper() + "logun") # 340μs -> 9.63μs (3432% faster)
        codeflash_output = like_num(marker.capitalize() + "logun,")
        codeflash_output = like_num(marker + "logun.") # 334μs -> 7.81μs (4179% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import unicodedata

# imports
import pytest  # used for our unit tests
from spacy.lang.yo.lex_attrs import like_num

_num_words = [
    "ení",
    "oókàn",
    "ọ̀kanlá",
    "ẹ́ẹdọ́gbọ̀n",
    "àádọ́fà",
    "ẹ̀walélúɡba",
    "egbèje",
    "ẹgbàárin",
    "èjì",
    "eéjì",
    "èjìlá",
    "ọgbọ̀n,",
    "ọgọ́fà",
    "ọ̀ọ́dúrún",
    "ẹgbẹ̀jọ",
    "ẹ̀ẹ́dẹ́ɡbàárùn",
    "ẹ̀ta",
    "ẹẹ́ta",
    "ẹ̀talá",
    "aárùndílogójì",
    "àádóje",
    "irinwó",
    "ẹgbẹ̀sàn",
    "ẹgbàárùn",
    "ẹ̀rin",
    "ẹẹ́rin",
    "ẹ̀rinlá",
    "ogójì",
    "ogóje",
    "ẹ̀ẹ́dẹ́gbẹ̀ta",
    "ẹgbàá",
    "ẹgbàájọ",
    "àrún",
    "aárùn",
    "ẹ́ẹdógún",
    "àádọ́ta",
    "àádọ́jọ",
    "ẹgbẹ̀ta",
    "ẹgboókànlá",
    "ẹgbàawǎ",
    "ẹ̀fà",
    "ẹẹ́fà",
    "ẹẹ́rìndílógún",
    "ọgọ́ta",
    "ọgọ́jọ",
    "ọ̀ọ́dẹ́gbẹ̀rin",
    "ẹgbẹ́ẹdógún",
    "ọkẹ́marun",
    "èje",
    "etàdílógún",
    "àádọ́rin",
    "àádọ́sán",
    "ẹgbẹ̀rin",
    "ẹgbàajì",
    "ẹgbẹ̀ẹgbẹ̀rún",
    "ẹ̀jọ",
    "ẹẹ́jọ",
    "eéjìdílógún",
    "ọgọ́rin",
    "ọgọsàn",
    "ẹ̀ẹ́dẹ́gbẹ̀rún",
    "ẹgbẹ́ẹdọ́gbọ̀n",
    "ọgọ́rùn ọkẹ́",
    "ẹ̀sán",
    "ẹẹ́sàn",
    "oókàndílógún",
    "àádọ́rùn",
    "ẹ̀wadilúɡba",
    "ẹgbẹ̀rún",
    "ẹgbàáta",
    "ẹ̀wá",
    "ẹẹ́wàá",
    "ogún",
    "ọgọ́rùn",
    "igba",
    "ẹgbẹ̀fà",
    "ẹ̀ẹ́dẹ́ɡbarin",
]
from spacy.lang.yo.lex_attrs import like_num

# unit tests

# 1. Basic Test Cases
def test_basic_digit():
    # Simple digit string
    codeflash_output = like_num("123") # 90.7μs -> 3.90μs (2228% faster)
    # Single digit
    codeflash_output = like_num("7") # 83.9μs -> 1.60μs (5146% faster)
    # Zero
    codeflash_output = like_num("0") # 82.7μs -> 1.04μs (7883% faster)


def test_basic_num_marker():
    # Contains a number marker
    codeflash_output = like_num("dinlogun") # 2.50μs -> 2.31μs (7.95% faster)
    # Contains a number marker with accent
    codeflash_output = like_num("dínlogún") # 979ns -> 915ns (6.99% faster)
    # Marker at the end
    codeflash_output = like_num("ogunle") # 1.06μs -> 1.07μs (0.655% slower)
    # Marker in the middle
    codeflash_output = like_num("ogundile") # 734ns -> 720ns (1.94% faster)

def test_basic_punctuation():
    # Digits with comma
    codeflash_output = like_num("1,234") # 102μs -> 4.60μs (2121% faster)
    # Digits with period
    codeflash_output = like_num("12.34") # 88.2μs -> 1.92μs (4486% faster)
    # Digits with both
    codeflash_output = like_num("1,234.56") # 84.6μs -> 1.99μs (4159% faster)

# 2. Edge Test Cases
def test_edge_empty_string():
    # Empty string should not be considered a number
    codeflash_output = like_num("") # 90.8μs -> 3.42μs (2551% faster)

def test_edge_non_number_word():
    # Completely unrelated word
    codeflash_output = like_num("banana") # 90.6μs -> 4.28μs (2015% faster)
    # Word similar to number but not in list
    codeflash_output = like_num("kanla") # 86.2μs -> 2.24μs (3746% faster)

def test_edge_mixed_alphanumeric():
    # Alphanumeric string
    codeflash_output = like_num("123abc") # 90.5μs -> 3.92μs (2213% faster)
    # Number word with extra characters
    codeflash_output = like_num("ọ̀kanlá!") # 87.4μs -> 4.79μs (1724% faster)

def test_edge_whitespace():
    # Number with leading/trailing whitespace
    codeflash_output = like_num(" 123 ") # 91.0μs -> 3.68μs (2374% faster)
    # Yoruba word with whitespace
    codeflash_output = like_num(" ẹ̀ta ") # 86.8μs -> 3.69μs (2249% faster)

def test_edge_case_sensitive():
    # Yoruba word in uppercase
    codeflash_output = like_num("Ẹ̀TA") # 89.5μs -> 4.68μs (1811% faster)
    # Yoruba word in mixed case
    codeflash_output = like_num("Ẹ̀tA") # 85.2μs -> 2.40μs (3455% faster)

def test_edge_marker_substring():
    # Marker substring but not a number
    codeflash_output = like_num("dinosaur") # 2.01μs -> 1.72μs (17.2% faster)
    # Marker inside a non-number word
    codeflash_output = like_num("ledog") # 1.05μs -> 1.03μs (2.73% faster)

def test_edge_marker_overlap():
    # Marker at start
    codeflash_output = like_num("dinogun") # 1.72μs -> 1.51μs (13.9% faster)
    # Marker at end
    codeflash_output = like_num("ogundin") # 963ns -> 931ns (3.44% faster)

def test_edge_num_word_with_punctuation():
    # Yoruba word with punctuation
    codeflash_output = like_num("ẹ̀ta,") # 94.1μs -> 5.71μs (1547% faster)
    codeflash_output = like_num("ẹ̀ta.") # 87.7μs -> 2.56μs (3329% faster)

def test_edge_num_word_with_spaces():
    # Yoruba word with internal spaces
    codeflash_output = like_num("ẹ̀ ta") # 93.1μs -> 4.70μs (1880% faster)

def test_edge_marker_with_accents():
    # Marker with accent
    codeflash_output = like_num("dínogún") # 1.74μs -> 1.39μs (24.7% faster)

def test_edge_marker_case():
    # Marker in different case
    codeflash_output = like_num("DINlogun") # 93.3μs -> 5.04μs (1752% faster)
    codeflash_output = like_num("Leogun") # 88.5μs -> 2.54μs (3381% faster)

def test_edge_digit_with_letters():
    # Digits with letters
    codeflash_output = like_num("123abc") # 91.3μs -> 4.07μs (2143% faster)

def test_edge_digit_with_spaces():
    # Digits with spaces
    codeflash_output = like_num(" 123 456 ") # 93.2μs -> 4.40μs (2017% faster)

def test_edge_marker_in_middle_of_word():
    # Marker in middle of non-number word
    codeflash_output = like_num("abcdinxyz") # 1.82μs -> 1.62μs (12.4% faster)

def test_edge_marker_in_word_with_punctuation():
    # Marker with punctuation
    codeflash_output = like_num("din,logun") # 2.10μs -> 1.92μs (9.64% faster)

def test_edge_marker_only():
    # Only marker as input
    codeflash_output = like_num("din") # 1.73μs -> 1.59μs (8.71% faster)
    codeflash_output = like_num("le") # 1.02μs -> 1.02μs (0.294% faster)

def test_edge_marker_with_numbers():
    # Marker with numbers
    codeflash_output = like_num("din123") # 1.69μs -> 1.46μs (15.3% faster)
    codeflash_output = like_num("123din") # 897ns -> 850ns (5.53% faster)

def test_edge_marker_with_spaces():
    # Marker with spaces
    codeflash_output = like_num(" din ") # 1.64μs -> 1.47μs (11.9% faster)

def test_edge_marker_with_uppercase():
    # Marker in uppercase
    codeflash_output = like_num("DIN") # 102μs -> 4.33μs (2260% faster)

def test_edge_marker_with_mixed_case():
    # Marker in mixed case
    codeflash_output = like_num("DiN") # 96.1μs -> 3.84μs (2400% faster)

def test_edge_marker_embedded():
    # Marker embedded in a longer word
    codeflash_output = like_num("abcdefdinxyz") # 1.89μs -> 1.57μs (20.3% faster)

# 3. Large Scale Test Cases
def test_large_many_digits():
    # Very long digit string
    big_num = "1" * 1000
    codeflash_output = like_num(big_num) # 165μs -> 73.9μs (125% faster)


def test_large_marker_in_many_words():
    # Marker in many different positions in large strings
    for marker in ["dí", "dọ", "lé", "dín", "di", "din", "le", "do"]:
        codeflash_output = like_num(marker * 100) # 10.6μs -> 10.1μs (4.49% faster)
        codeflash_output = like_num(f"abc{marker}xyz")

def test_large_non_number_words():
    # Large string of non-number words
    long_non_num = "banana" * 200
    codeflash_output = like_num(long_non_num) # 186μs -> 88.4μs (111% faster)

def test_large_mixed_list():
    # Mix of number words, markers, digits, and non-number words
    inputs = [
        "1234567890",
        "ọgọ́rùn",
        "dinlogun",
        "banana",
        "abcdef",
        "ẹ̀ta",
        "dinosaur",
        "leogun",
        "ogunle",
        "ogundin",
        "1,234,567",
        "12.34.56",
        "oókàn",
        "egbèje",
        "abcdefdinxyz",
        "din" * 50,
        "banana" * 50,
    ]
    expected = [
        # "dinosaur" contains the marker "di", so like_num returns True for it
        True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, False
    ]
    for inp, exp in zip(inputs, expected):
        codeflash_output = like_num(inp) # 878μs -> 53.3μs (1550% faster)


def test_large_marker_and_digits():
    # Large string with marker and digits
    s = "din" + "9" * 995
    codeflash_output = like_num(s) # 3.63μs -> 3.46μs (4.82% faster)

def test_large_spaces_and_punctuation():
    # Large digit string with spaces and punctuation
    s = " 1,234.567 " * 50
    codeflash_output = like_num(s) # 137μs -> 38.9μs (254% faster)

def test_large_marker_with_spaces():
    # Marker with spaces in large input
    s = " din " * 50
    codeflash_output = like_num(s) # 2.53μs -> 2.25μs (12.1% faster)

def test_large_marker_uppercase():
    # Uppercase marker repeated
    s = "DIN" * 100
    codeflash_output = like_num(s) # 118μs -> 25.9μs (358% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
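
The comparison that the comment above refers to can be pictured as running both versions on the same inputs and collecting any differences. The sketch below is an assumption about the general idea, not Codeflash's actual harness: `find_mismatches` and the two toy implementations are hypothetical.

```python
def find_mismatches(original, optimized, inputs):
    """Return (input, original_result, optimized_result) for each input where results differ."""
    mismatches = []
    for value in inputs:
        before, after = original(value), optimized(value)
        if before != after:
            mismatches.append((value, before, after))
    return mismatches


# Toy stand-ins for the two versions (assumption: two words only, no accent handling).
_WORDS = ["ogun", "igba"]
_WORD_SET = set(_WORDS)


def like_num_list(text):
    """List-based lookup with two membership checks, like the original."""
    return text.isdigit() or text in _WORDS or text.lower() in _WORDS


def like_num_set(text):
    """Set-based lookup with a single lowercased check, like the optimized version."""
    return text.isdigit() or text.lower() in _WORD_SET


print(find_mismatches(like_num_list, like_num_set, ["123", "Ogun", "IGBA", "banana", ""]))
```

An empty list means the two versions agree on every input tried. It does not prove equivalence for inputs outside that set.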

To edit these changes, run `git checkout codeflash/optimize-like_num-mhmju9dh` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 5, 2025 22:08
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 5, 2025