@codeflash-ai codeflash-ai bot commented Nov 5, 2025

📄 15% (0.15x) speedup for like_num in spacy/lang/bo/lex_attrs.py

⏱️ Runtime : 1.45 milliseconds → 1.26 milliseconds (best of 139 runs)

📝 Explanation and details

The optimized code achieves a 15% speedup through two key optimizations:

1. Efficient Fraction Detection
The original code used text.count("/") == 1, which always scans the entire string to count occurrences. The optimized version first checks if "/" in text: and only then calls split("/") and checks len(splits) == 2. When a slash is present, the in test stops at the first occurrence. When no slash is present (the common case), both forms still have to examine every character, so the saving there most likely comes from replacing a method call with the cheaper in operator rather than from a shorter scan.
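The two styles of fraction check can be sketched side by side. This is a minimal illustration, not spaCy's code: the helper names is_fraction_count and is_fraction_split are hypothetical, and the sketch assumes the sign prefix, commas, and dots have already been stripped from the input.

```python
def is_fraction_count(text: str) -> bool:
    """Original style: count every slash, then split."""
    if text.count("/") == 1:
        num, denom = text.split("/")
        return num.isdigit() and denom.isdigit()
    return False


def is_fraction_split(text: str) -> bool:
    """Optimized style: containment test first, then a single split."""
    if "/" in text:
        parts = text.split("/")
        if len(parts) == 2:
            return parts[0].isdigit() and parts[1].isdigit()
    return False


# Both helpers should agree on every input, including malformed fractions.
samples = ["1/2", "123/456", "1/2/3", "1/", "/2", "/", "hello", ""]
for s in samples:
    assert is_fraction_count(s) == is_fraction_split(s)
```

The loop at the end only checks that the two forms agree; it says nothing about speed, which depends on the input mix.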

2. Set-Based Tibetan Word Lookup
The original code performed membership testing against _num_words (a list), which is an O(n) linear search. The optimized version converts the list to a set on first use and caches it as a function attribute, making subsequent lookups O(1) on average. This is particularly effective since the profiler shows 1,169 hits on the Tibetan word check.
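A minimal sketch of the caching pattern described above, under stated assumptions: is_num_word and its _word_set attribute are hypothetical names chosen for illustration, and the three-entry list stands in for the real _num_words.

```python
# Hypothetical stand-in for the much longer _num_words list.
_num_words = ["གཅིག་", "གཉིས་", "གསུམ་"]


def is_num_word(text: str) -> bool:
    """Look up text in a set that is built once and cached on the function."""
    cache = getattr(is_num_word, "_word_set", None)
    if cache is None:
        # First call: build the set and store it as a function attribute.
        cache = set(_num_words)
        is_num_word._word_set = cache
    return text in cache
```

One consequence of this design is that the set is a snapshot: entries appended to the list after the first call are not seen by later lookups unless the cached attribute is cleared.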

Performance Impact by Test Case:

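  • Invalid strings without "/": 34-72% faster (e.g., "hello", "abc123"), likely because the "/" in text test is cheaper than count and the final word lookup that these strings fall through to hits the set instead of scanning the list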
  • Valid Tibetan numerals: 17-28% faster due to O(1) set lookup vs O(n) list search
  • Large invalid strings: Up to 66% faster when "/" is absent
  • Basic digit strings: Minimal impact (slight variations due to measurement noise)

The optimizations are most beneficial for workloads with many invalid strings or frequent Tibetan numeral checks, while maintaining identical correctness for all test cases.
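To get a feel for the list-versus-set difference on a lookup miss, a small timeit comparison can be run locally. This is only a sketch: the word list is synthetic (38 placeholder strings, matching the number of entries in _num_words), and absolute timings will vary by machine and Python version.

```python
import timeit

# Synthetic placeholder entries; not the real Tibetan number words.
words = [f"word{i}" for i in range(38)]
word_set = set(words)
probe = "notnumber1"  # a miss forces the list to be scanned to the end

list_time = timeit.timeit(lambda: probe in words, number=100_000)
set_time = timeit.timeit(lambda: probe in word_set, number=100_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```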

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 3259 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
from spacy.lang.bo.lex_attrs import like_num

# function to test
# reference 1: https://en.wikipedia.org/wiki/Tibetan_numerals

_num_words = [
    "ཀླད་ཀོར་",
    "གཅིག་",
    "གཉིས་",
    "གསུམ་",
    "བཞི་",
    "ལྔ་",
    "དྲུག་",
    "བདུན་",
    "བརྒྱད་",
    "དགུ་",
    "བཅུ་",
    "བཅུ་གཅིག་",
    "བཅུ་གཉིས་",
    "བཅུ་གསུམ་",
    "བཅུ་བཞི་",
    "བཅུ་ལྔ་",
    "བཅུ་དྲུག་",
    "བཅུ་བདུན་",
    "བཅུ་པརྒྱད",
    "བཅུ་དགུ་",
    "ཉི་ཤུ་",
    "སུམ་ཅུ",
    "བཞི་བཅུ",
    "ལྔ་བཅུ",
    "དྲུག་ཅུ",
    "བདུན་ཅུ",
    "བརྒྱད་ཅུ",
    "དགུ་བཅུ",
    "བརྒྱ་",
    "སྟོང་",
    "ཁྲི་",
    "ས་ཡ་",
    "	བྱེ་བ་",
    "དུང་ཕྱུར་",
    "ཐེར་འབུམ་",
    "ཐེར་འབུམ་ཆེན་པོ་",
    "ཁྲག་ཁྲིག་",
    "ཁྲག་ཁྲིག་ཆེན་པོ་",
]

# unit tests

# BASIC TEST CASES

@pytest.mark.parametrize(
    "input_text, expected",
    [
        # Basic digit strings
        ("123", True),    # simple integer
        ("0", True),      # zero
        ("00123", True),  # leading zeros
        # Strings with allowed prefixes
        ("+123", True),
        ("-456", True),
        ("±789", True),
        ("~101", True),
        # Strings with commas and dots (should be stripped)
        ("1,234", True),
        ("12.345", True),
        ("1,234,567", True),
        ("1.2.3.4", True),  # all dots are simply removed
        ("1,2,3,4", True),  # all commas are simply removed
        # Fractional numbers (only one '/')
        ("1/2", True),
        ("123/456", True),
        ("0001/0002", True),
        # Tibetan numerals from _num_words
        ("གཅིག་", True),
        ("བརྒྱ་", True),
        ("བཅུ་གསུམ་", True),
        # Not a number
        ("hello", False),
        ("abc123", False),
        ("12a34", False),
        ("1/2/3", False),  # multiple slashes
        ("1/", False),     # missing denominator
        ("/2", False),     # missing numerator
        ("", False),       # empty string
    ]
)
def test_like_num_basic(input_text, expected):
    """Basic functionality tests for like_num."""
    codeflash_output = like_num(input_text) # 34.0μs -> 32.0μs (5.97% faster)

# EDGE TEST CASES

@pytest.mark.parametrize(
    "input_text, expected",
    [
        # Edge: Only prefix
        ("+", False),
        ("-", False),
        ("±", False),
        ("~", False),
        # Edge: Only slash
        ("/", False),
        # Edge: Non-digit with prefix
        ("+abc", False),
        ("-གཅིག་", True),  # valid Tibetan numeral with prefix
        # Edge: Spaces
        (" 123", False),  # leading space, should not be considered a number
        ("123 ", False),  # trailing space
        (" 1/2", False),  # leading space with fraction
        ("གཅིག་ ", False),  # trailing space with Tibetan numeral
        # Edge: Mixed valid and invalid
        ("1/abc", False),
        ("abc/1", False),
        ("1/2/3", False),
        # Edge: Only dot/comma
        (".", False),
        (",", False),
        ("..", False),
        (",,", False),
        # Edge: Negative zero
        ("-0", True),
        # Edge: Large numbers with many commas/dots
        ("1,2,3,4,5,6,7,8,9,0", True),
        ("9.8.7.6.5.4.3.2.1.0", True),
        # Edge: Large number with prefix and punctuation
        ("+1,234,567.89", True),
        # Edge: Long Tibetan numeral with prefix
        ("~ཐེར་འབུམ་ཆེན་པོ་", True),
        # Edge: Tibetan numeral with typo
        ("གཅིག", False),  # missing trailing '་'
        # Edge: Fraction with leading zeros
        ("00001/00002", True),
        # Edge: Fraction with prefix
        ("-1/2", True),
        ("+123/456", True),
        ("±0001/0002", True),
        # Edge: Fraction with dots/commas
        ("1,000/2,000", True),
        ("1.000/2.000", True),
        # Edge: Fraction with mixed valid/invalid
        ("1/abc", False),
        ("abc/1", False),
        ("1/2a", False),
        ("a1/2", False),
        # Edge: Tibetan numeral with comma/dot (punctuation is stripped before the lookup, so it matches)
        ("གཅིག་,", True),
        ("གཅིག་.", True),
        # Edge: Numeral with double prefix
        ("++123", False),
        ("--123", False),
        ("~~123", False),
        ("±±123", False),
        # Edge: Numeral with prefix and space
        ("+ 123", False),
        ("- 456", False),
        # Edge: Slash at start or end
        ("/123", False),
        ("123/", False),
        # Edge: Fraction with empty numerator or denominator
        ("/2", False),
        ("3/", False),
        # Edge: Bare tab character (one _num_words entry starts with a tab)
        ("\t", False),
        # Edge: Tibetan numeral with extra space
        ("གཅིག་ ", False),
        (" གཅིག་", False),
    ]
)
def test_like_num_edge(input_text, expected):
    """Edge case tests for like_num."""
    codeflash_output = like_num(input_text) # 87.7μs -> 79.5μs (10.3% faster)

# LARGE SCALE TEST CASES

def test_like_num_large_scale_digits():
    """Test with a large digit string (999 digits)."""
    big_num = "1" * 999
    codeflash_output = like_num(big_num) # 2.97μs -> 2.92μs (1.75% faster)

def test_like_num_large_scale_fraction():
    """Test with a large numerator and denominator (up to 999 digits each)."""
    num = "9" * 999
    denom = "8" * 999
    big_frac = f"{num}/{denom}"
    codeflash_output = like_num(big_frac) # 8.63μs -> 7.79μs (10.7% faster)

def test_like_num_large_scale_commas_dots():
    """Test with a large number containing many commas and dots."""
    # 1000 digits, commas every 3 digits
    num = ",".join(["123"] * 333) + ",1"
    codeflash_output = like_num(num) # 6.25μs -> 6.25μs (0.032% faster)
    # 999 digits, dots every 2 digits
    num2 = ".".join(["12"] * 499) + ".1"
    codeflash_output = like_num(num2) # 6.91μs -> 6.88μs (0.320% faster)

def test_like_num_large_scale_invalid():
    """Test large invalid strings."""
    # 1000 'a's
    codeflash_output = like_num("a" * 1000) # 2.81μs -> 1.69μs (66.6% faster)
    # 500 digits, 500 letters
    codeflash_output = like_num("1" * 500 + "a" * 500) # 2.46μs -> 1.96μs (25.7% faster)
    # Large fraction but with letters
    codeflash_output = like_num("1" * 500 + "/" + "a" * 499) # 4.21μs -> 4.00μs (5.22% faster)

def test_like_num_all_tibetan_words():
    """Test that all _num_words entries are recognized as numbers."""
    for word in _num_words:
        codeflash_output = like_num(word) # 21.1μs -> 16.4μs (28.8% faster)

def test_like_num_large_scale_prefix():
    """Test large number with prefix and punctuation."""
    big_num = "+" + ",".join(["123"] * 333) + ",1"
    codeflash_output = like_num(big_num) # 6.66μs -> 6.60μs (0.925% faster)
    big_num = "-" + ".".join(["12"] * 499) + ".1"
    codeflash_output = like_num(big_num) # 7.09μs -> 7.08μs (0.141% faster)

def test_like_num_large_scale_malformed_fraction():
    """Test large malformed fractions."""
    # Too many slashes
    num = "1" * 333 + "/" + "2" * 333 + "/" + "3" * 333
    codeflash_output = like_num(num) # 3.30μs -> 3.30μs (0.243% faster)
    # Slash at start
    num2 = "/" + "1" * 999
    codeflash_output = like_num(num2) # 2.96μs -> 2.41μs (23.2% faster)
    # Slash at end
    num3 = "1" * 999 + "/"
    codeflash_output = like_num(num3) # 5.23μs -> 4.52μs (15.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
from spacy.lang.bo.lex_attrs import like_num

# function to test
_num_words = [
    "ཀླད་ཀོར་",
    "གཅིག་",
    "གཉིས་",
    "གསུམ་",
    "བཞི་",
    "ལྔ་",
    "དྲུག་",
    "བདུན་",
    "བརྒྱད་",
    "དགུ་",
    "བཅུ་",
    "བཅུ་གཅིག་",
    "བཅུ་གཉིས་",
    "བཅུ་གསུམ་",
    "བཅུ་བཞི་",
    "བཅུ་ལྔ་",
    "བཅུ་དྲུག་",
    "བཅུ་བདུན་",
    "བཅུ་པརྒྱད",
    "བཅུ་དགུ་",
    "ཉི་ཤུ་",
    "སུམ་ཅུ",
    "བཞི་བཅུ",
    "ལྔ་བཅུ",
    "དྲུག་ཅུ",
    "བདུན་ཅུ",
    "བརྒྱད་ཅུ",
    "དགུ་བཅུ",
    "བརྒྱ་",
    "སྟོང་",
    "ཁྲི་",
    "ས་ཡ་",
    "	བྱེ་བ་",
    "དུང་ཕྱུར་",
    "ཐེར་འབུམ་",
    "ཐེར་འབུམ་ཆེན་པོ་",
    "ཁྲག་ཁྲིག་",
    "ཁྲག་ཁྲིག་ཆེན་པོ་",
]

# unit tests

# -----------------------
# Basic Test Cases
# -----------------------

def test_basic_integer():
    # Test simple integer
    codeflash_output = like_num("123") # 913ns -> 860ns (6.16% faster)
    # Test zero
    codeflash_output = like_num("0") # 492ns -> 518ns (5.02% slower)
    # Test single digit
    codeflash_output = like_num("7") # 274ns -> 298ns (8.05% slower)

def test_basic_fraction():
    # Test proper fraction
    codeflash_output = like_num("1/2") # 1.57μs -> 1.66μs (5.47% slower)
    # Test improper fraction
    codeflash_output = like_num("10/3") # 1.01μs -> 993ns (2.11% faster)

def test_basic_signs():
    # Test positive sign
    codeflash_output = like_num("+123") # 1.14μs -> 1.06μs (7.27% faster)
    # Test negative sign
    codeflash_output = like_num("-456") # 516ns -> 533ns (3.19% slower)
    # Test tilde sign
    codeflash_output = like_num("~789") # 392ns -> 386ns (1.55% faster)
    # Test ± sign
    codeflash_output = like_num("±321") # 528ns -> 536ns (1.49% slower)

def test_basic_decimal_and_comma():
    # Should ignore commas and dots
    codeflash_output = like_num("1,234") # 1.07μs -> 1.04μs (2.88% faster)
    codeflash_output = like_num("1.234") # 503ns -> 499ns (0.802% faster)
    codeflash_output = like_num("1,234,567") # 521ns -> 505ns (3.17% faster)
    codeflash_output = like_num("1.234.567") # 394ns -> 400ns (1.50% slower)

def test_basic_tibetan_num_words():
    # Test a few Tibetan number words
    codeflash_output = like_num("གཅིག་") # 1.81μs -> 1.78μs (1.80% faster)
    codeflash_output = like_num("བཅུ་") # 830ns -> 747ns (11.1% faster)
    codeflash_output = like_num("སུམ་ཅུ") # 678ns -> 542ns (25.1% faster)

# -----------------------
# Edge Test Cases
# -----------------------

def test_edge_empty_string():
    # Empty string should not be a number
    codeflash_output = like_num("") # 1.33μs -> 1.10μs (20.9% faster)

def test_edge_only_sign():
    # Only sign, no digits
    codeflash_output = like_num("+") # 1.53μs -> 1.34μs (14.4% faster)
    codeflash_output = like_num("-") # 760ns -> 561ns (35.5% faster)
    codeflash_output = like_num("~") # 621ns -> 444ns (39.9% faster)
    codeflash_output = like_num("±") # 603ns -> 410ns (47.1% faster)

def test_edge_multiple_signs():
    # Multiple signs should only remove the first
    codeflash_output = like_num("++123") # 1.79μs -> 1.25μs (43.9% faster)
    codeflash_output = like_num("--123") # 812ns -> 663ns (22.5% faster)
    codeflash_output = like_num("±±123") # 1.19μs -> 1.14μs (3.94% faster)

def test_edge_non_digit_fraction():
    # Fraction with non-digit numerator or denominator
    codeflash_output = like_num("a/2") # 1.81μs -> 1.72μs (5.11% faster)
    codeflash_output = like_num("2/b") # 1.01μs -> 1.00μs (0.899% faster)
    codeflash_output = like_num("a/b") # 677ns -> 485ns (39.6% faster)
    # Fraction with more than one slash
    codeflash_output = like_num("1/2/3") # 809ns -> 644ns (25.6% faster)

def test_edge_non_number_strings():
    # Random strings should not be numbers
    codeflash_output = like_num("hello") # 1.35μs -> 1.01μs (34.4% faster)
    codeflash_output = like_num("བཅུ") # 1.27μs -> 1.05μs (21.2% faster)
    codeflash_output = like_num("གཅིག") # 813ns -> 473ns (71.9% faster)
    codeflash_output = like_num("123abc") # 854ns -> 594ns (43.8% faster)
    codeflash_output = like_num("abc123") # 536ns -> 472ns (13.6% faster)

def test_edge_only_comma_dot():
    # Only comma or dot should not be a number
    codeflash_output = like_num(",") # 1.26μs -> 1.04μs (20.7% faster)
    codeflash_output = like_num(".") # 754ns -> 553ns (36.3% faster)
    codeflash_output = like_num(".,") # 764ns -> 598ns (27.8% faster)

def test_edge_leading_trailing_spaces():
    # Spaces should prevent recognition
    codeflash_output = like_num(" 123") # 1.34μs -> 989ns (35.5% faster)
    codeflash_output = like_num("123 ") # 713ns -> 602ns (18.4% faster)
    codeflash_output = like_num(" 1/2") # 1.06μs -> 988ns (6.98% faster)
    codeflash_output = like_num("གཅིག་ ") # 1.10μs -> 839ns (30.6% faster)

def test_edge_tibetan_word_exact_match():
    # Similar but not exact Tibetan words
    codeflash_output = like_num("གཅིག") # 1.53μs -> 1.18μs (30.0% faster)
    codeflash_output = like_num("གཅིག་་") # 918ns -> 701ns (31.0% faster)
    codeflash_output = like_num("བཅུ་པརྒྱད་") # 682ns -> 459ns (48.6% faster)

def test_edge_fraction_with_sign():
    # Fraction with sign
    codeflash_output = like_num("+1/2") # 1.81μs -> 1.79μs (1.45% faster)
    codeflash_output = like_num("-10/3") # 1.04μs -> 1.11μs (6.29% slower)
    codeflash_output = like_num("~3/4") # 646ns -> 634ns (1.89% faster)
    codeflash_output = like_num("±7/8") # 840ns -> 764ns (9.95% faster)

def test_edge_fraction_with_comma_dot():
    # Fraction with commas and dots
    codeflash_output = like_num("1,000/2,000") # 1.82μs -> 1.83μs (0.764% slower)
    codeflash_output = like_num("1.000/2.000") # 902ns -> 855ns (5.50% faster)

def test_edge_fraction_with_leading_zero():
    # Fraction with leading zeros
    codeflash_output = like_num("01/02") # 1.41μs -> 1.46μs (3.16% slower)

def test_edge_fraction_with_trailing_slash():
    # A leading or trailing slash is not valid
    codeflash_output = like_num("123/") # 1.89μs -> 1.72μs (9.71% faster)
    codeflash_output = like_num("/123") # 1.09μs -> 987ns (10.0% faster)

def test_edge_fraction_with_non_ascii_digits():
    # Non-ASCII (Arabic-Indic) digits; str.isdigit() accepts these
    codeflash_output = like_num("١٢٣") # 1.13μs -> 1.18μs (4.06% slower)

# -----------------------
# Large Scale Test Cases
# -----------------------

def test_large_scale_many_integers():
    # Test a large number of valid integers
    for i in range(1, 1000):  # 1 to 999
        codeflash_output = like_num(str(i)) # 240μs -> 247μs (2.59% slower)

def test_large_scale_many_fractions():
    # Test a large number of valid fractions
    for i in range(1, 1000, 10):  # step to keep it under 1000
        for j in range(1, 1000, 100):
            codeflash_output = like_num(f"{i}/{j}")

def test_large_scale_many_invalid_strings():
    # Test a large number of invalid strings
    for i in range(1, 1000):
        codeflash_output = like_num(f"notnumber{i}") # 494μs -> 339μs (45.4% faster)

def test_large_scale_tibetan_num_words():
    # All Tibetan number words should be recognized
    for word in _num_words:
        codeflash_output = like_num(word) # 21.7μs -> 17.0μs (27.2% faster)

def test_large_scale_mixed_valid_invalid():
    # Mix of valid and invalid inputs
    valid = ["123", "1/2", "+345", "གསུམ་"]
    invalid = ["abc", "1//2", "12a", "གསུམ", "1/2/3", " "]
    for v in valid:
        codeflash_output = like_num(v) # 3.76μs -> 3.71μs (1.13% faster)
    for iv in invalid:
        codeflash_output = like_num(iv) # 4.21μs -> 3.18μs (32.3% faster)

def test_large_scale_comma_dot_removal():
    # Numbers with many commas and dots
    for i in range(1, 1000, 111):
        s = f"{i:,}"
        codeflash_output = like_num(s) # 2.83μs -> 2.84μs (0.317% slower)
        s_dot = s.replace(",", ".")
        codeflash_output = like_num(s_dot)

def test_large_scale_fraction_with_sign_and_punctuation():
    # Fractions with sign and punctuation
    for i in range(1, 1000, 111):
        s = f"+{i:,}/1,000"
        codeflash_output = like_num(s) # 6.73μs -> 6.44μs (4.58% faster)
        s2 = f"-{i:,}/1.000"
        codeflash_output = like_num(s2)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-like_num-mhmj0jb5` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 5, 2025 21:45
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Nov 5, 2025