@codeflash-ai codeflash-ai bot commented Nov 5, 2025

📄 57% (0.57x) speedup for _consume_ent in spacy/training/iob_utils.py

⏱️ Runtime: 1.42 milliseconds → 905 microseconds (best of 245 runs)

📝 Explanation and details

The optimization achieves a 57% speedup by replacing repeated front-of-list removals and a per-element list comprehension with cheaper equivalents:

Key Optimizations

1. Index-based scanning instead of repeated pop(0) operations

  • The original code uses `while tags and tags[0] in {target_in, target_last}: tags.pop(0)`; every `pop(0)` shifts all remaining elements, so each call is O(n)
  • The optimized code uses `for i in range(n): t = tags[i]` to scan without mutating the list, then removes the matched elements in one operation with `del tags[:length-1]`

2. List multiplication instead of list comprehension

  • Original: `middle = [f"I-{label}" for _ in range(1, length - 1)]` formats the same string once per element
  • Optimized: `middle = ["I-" + label] * (length - 2)` builds the string once and repeats it with list multiplication

3. Early label validation

  • Moves the if not label: check earlier to avoid unnecessary work when tags are invalid

4. Conditional logic optimization

  • Separates the length > 2 case to avoid unnecessary list operations for simple B-L pairs
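
To make the four changes above concrete, here is a minimal sketch of the two shapes of the function, reconstructed only from the fragments quoted in this description. It is not the actual diff: the names `consume_ent_popping` and `consume_ent_scanning` are hypothetical, and a plain `ValueError` message stands in for spaCy's `Errors.E177`. The sketch keeps the empty-label check inside the `length == 1` branch so that both versions behave identically; where exactly the optimized code places that check is an assumption.

```python
from typing import List


def consume_ent_popping(tags: List[str]) -> List[str]:
    """Baseline shape: pop matching tags off the front one at a time."""
    if not tags:
        return []
    tag = tags.pop(0)
    target_in = "I" + tag[1:]
    target_last = "L" + tag[1:]
    length = 1
    while tags and tags[0] in {target_in, target_last}:
        length += 1
        tags.pop(0)  # shifts every remaining element left on each call
    label = tag[2:]
    if length == 1:
        if not label:
            # Stand-in for Errors.E177 (assumption, not the real message)
            raise ValueError(f"Invalid tag: {tag}")
        return ["U-" + label]
    middle = [f"I-{label}" for _ in range(1, length - 1)]
    return ["B-" + label] + middle + ["L-" + label]


def consume_ent_scanning(tags: List[str]) -> List[str]:
    """Optimized shape: scan by index, then delete the matched run once."""
    if not tags:
        return []
    tag = tags.pop(0)
    label = tag[2:]
    target_in = "I" + tag[1:]
    target_last = "L" + tag[1:]
    length = 1
    for i in range(len(tags)):
        t = tags[i]
        if t != target_in and t != target_last:
            break
        length += 1
    if length == 1:
        if not label:
            # Stand-in for Errors.E177 (assumption, not the real message)
            raise ValueError(f"Invalid tag: {tag}")
        return ["U-" + label]
    del tags[: length - 1]  # one slice deletion instead of many pop(0) calls
    if length > 2:
        # List multiplication repeats one prebuilt string
        return ["B-" + label] + ["I-" + label] * (length - 2) + ["L-" + label]
    return ["B-" + label, "L-" + label]
```

Both functions mutate the list they receive, so calling either one repeatedly on the same list consumes one entity per call.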

Performance Impact by Workload

The largest gains are on long entities: the generated tests with 300 to 1,000 tags run 146-207% faster, because the repeated pop(0) calls, which cost O(n²) in total, are replaced by a single O(n) scan and one slice deletion. Short inputs are mixed: single-tag entities are up to about 29% slower because of the extra setup, while entities of two to four tags are about 12-40% faster.
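
The complexity argument can be checked in isolation with a small timing sketch. It does not benchmark `_consume_ent` itself; the helper names `drop_front_with_pop`, `drop_front_with_del`, and `time_drop` are hypothetical, and each timing includes the cost of copying the list before it is drained. Absolute numbers will vary by machine, so none are quoted here.

```python
import timeit
from typing import Callable, List


def drop_front_with_pop(items: List[str], count: int) -> None:
    """Remove `count` leading items one at a time; each pop(0) shifts the rest."""
    for _ in range(count):
        items.pop(0)


def drop_front_with_del(items: List[str], count: int) -> None:
    """Remove `count` leading items with a single slice deletion."""
    del items[:count]


def time_drop(
    drop: Callable[[List[str], int], None], size: int, repeats: int = 50
) -> float:
    """Return the best time in seconds to copy and drain a list of `size` tags."""
    template = ["I-ORG"] * size

    def run() -> None:
        drop(list(template), size)

    return min(timeit.repeat(run, number=1, repeat=repeats))


if __name__ == "__main__":
    for size in (10, 1_000, 100_000):
        slow = time_drop(drop_front_with_pop, size)
        fast = time_drop(drop_front_with_del, size)
        print(f"{size:>7} items: pop(0) loop {slow:.6f}s, del slice {fast:.6f}s")
```

For very short lists the two approaches should be close, which is consistent with the mixed results reported for small entities.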

The change is most valuable in NLP entity processing pipelines where long entity spans are common; there, the gains on long sequences outweigh the small per-call overhead on single-tag entities.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 1040 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
# imports
import pytest  # used for our unit tests
from spacy.training.iob_utils import _consume_ent

# unit tests

# -----------------------------
# 1. Basic Test Cases
# -----------------------------

def test_single_entity_unit_tag():
    # Basic: Single entity with a valid "U-" tag
    tags = ["U-PER"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.44μs -> 2.04μs (29.2% slower)

def test_single_entity_begin_tag():
    # Basic: Single entity with a valid "B-" tag (should convert to "U-")
    tags = ["B-LOC"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.34μs -> 1.71μs (21.4% slower)

def test_multi_token_entity():
    # Basic: Multi-token entity with "B-", "I-", "L-" tags
    tags = ["B-ORG", "I-ORG", "L-ORG"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 4.41μs -> 3.63μs (21.6% faster)

def test_multi_token_entity_with_only_b_and_l():
    # Basic: Multi-token entity with "B-" and "L-" only (no "I-")
    tags = ["B-MISC", "L-MISC"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 3.26μs -> 2.48μs (31.2% faster)

def test_multi_token_entity_with_multiple_i_tags():
    # Basic: Multi-token entity with multiple "I-" tags
    tags = ["B-DATE", "I-DATE", "I-DATE", "L-DATE"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 3.82μs -> 3.28μs (16.3% faster)

def test_entity_with_different_label():
    # Basic: Entity with a different label
    tags = ["B-EVENT", "L-EVENT"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 2.93μs -> 2.35μs (24.9% faster)

# -----------------------------
# 2. Edge Test Cases
# -----------------------------

def test_empty_tags():
    # Edge: Empty input should return empty output
    tags = []
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 438ns -> 367ns (19.3% faster)

def test_invalid_tag_format():
    # Edge: Tag with missing label should raise ValueError
    tags = ["B-"]
    with pytest.raises(ValueError) as excinfo:
        _consume_ent(tags.copy()) # 5.61μs -> 5.69μs (1.41% slower)



def test_single_l_tag():
    # Edge: Single "L-" tag should be treated as "U-"
    tags = ["L-PRODUCT"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.70μs -> 2.21μs (22.9% slower)

def test_single_i_tag():
    # Edge: Single "I-" tag should be treated as "U-"
    tags = ["I-ORG"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.42μs -> 1.77μs (19.7% slower)

def test_entity_with_interleaved_tags():
    # Edge: Entity with interleaved tags (should only consume consecutive valid ones)
    tags = ["B-ORG", "I-ORG", "B-PER", "L-ORG"]
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 3.75μs -> 2.77μs (35.3% faster)
    # Consumption stops at "B-PER", so tags_copy should be left as ["B-PER", "L-ORG"]

def test_entity_with_multiple_l_tags():
    # Edge: Multiple "L-" tags in a row (every consecutive "L-ORG" matches, so all are consumed)
    tags = ["B-ORG", "L-ORG", "L-ORG"]
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 3.76μs -> 3.35μs (12.1% faster)
    # Both "L-ORG" tags are consumed into one three-token entity, leaving tags_copy empty

def test_entity_with_non_matching_i_and_l():
    # Edge: "I-" and "L-" tags with different labels (should not be consumed)
    tags = ["B-ORG", "I-PER", "L-ORG"]
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 1.65μs -> 1.96μs (15.7% slower)

def test_entity_with_extra_dash():
    # Edge: Tag with extra dash (e.g., "B--ORG")
    tags = ["B--ORG"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.26μs -> 1.67μs (24.6% slower)


def test_large_single_entity():
    # Large: A long entity (999 tokens)
    tags = ["B-LARGE"] + ["I-LARGE"] * 997 + ["L-LARGE"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 138μs -> 46.1μs (200% faster)

def test_large_multiple_entities():
    # Large: Multiple entities in one tag list (should only consume the first)
    tags = ["B-ORG"] + ["I-ORG"] * 498 + ["L-ORG"] + ["B-PER"] + ["L-PER"]
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 63.8μs -> 24.0μs (166% faster)

def test_large_entity_with_only_b_and_l():
    # Large: Large entity with only "B-" and "L-" tags
    tags = ["B-ORG"] + ["L-ORG"] * 999
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 135μs -> 55.3μs (146% faster)

def test_large_entity_with_mixed_tags():
    # Large: Large entity with mixed "I-" and "L-" tags
    tags = ["B-ORG"] + ["I-ORG", "L-ORG"] * 499
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 134μs -> 50.8μs (164% faster)

def test_large_list_of_single_unit_entities():
    # Large: Many single-unit entities
    tags = ["U-ORG"] * 999
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 1.77μs -> 2.20μs (19.6% slower)

def test_large_entity_with_non_matching_i():
    # Large: Entity with many "I-" tags, but a non-matching one breaks the sequence
    tags = ["B-ORG"] + ["I-ORG"] * 498 + ["I-PER"] + ["L-ORG"]
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 64.8μs -> 23.9μs (171% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
# imports
import pytest  # used for our unit tests
from spacy.training.iob_utils import _consume_ent

# unit tests

# ----------------------
# BASIC TEST CASES
# ----------------------

def test_single_entity_unit_tag():
    # Test a single-unit entity
    tags = ["U-PER"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.32μs -> 1.70μs (22.0% slower)

def test_single_entity_begin_last():
    # Test a single entity with B- and L- tags (length 2)
    tags = ["B-ORG", "L-ORG"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 3.31μs -> 2.39μs (38.6% faster)

def test_multi_token_entity():
    # Test a multi-token entity with B-, I-, L- tags
    tags = ["B-LOC", "I-LOC", "L-LOC"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 3.63μs -> 3.12μs (16.1% faster)

def test_multi_token_entity_with_multiple_I():
    # Test a multi-token entity with multiple I- tags
    tags = ["B-MISC", "I-MISC", "I-MISC", "L-MISC"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 3.63μs -> 3.00μs (21.0% faster)

def test_consume_only_first_entity():
    # Test that only the first entity is consumed
    tags = ["B-PER", "L-PER", "U-LOC"]
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 2.92μs -> 2.40μs (22.0% faster)

# ----------------------
# EDGE TEST CASES
# ----------------------

def test_empty_tags():
    # Test empty input list
    tags = []
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 355ns -> 343ns (3.50% faster)

def test_invalid_single_tag():
    # Test a tag with missing label (should raise ValueError)
    tags = ["U-"]
    with pytest.raises(ValueError) as excinfo:
        _consume_ent(tags.copy()) # 5.40μs -> 5.32μs (1.60% faster)

def test_entity_with_only_B_tag():
    # Test a tag with only B- (should be treated as length 1)
    tags = ["B-ORG"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.39μs -> 1.90μs (26.7% slower)

def test_entity_with_only_L_tag():
    # Test a tag with only L- (should be treated as length 1)
    tags = ["L-LOC"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.22μs -> 1.56μs (21.6% slower)

def test_entity_with_no_label():
    # Test a tag with no label after prefix (should raise ValueError)
    tags = ["B-"]
    with pytest.raises(ValueError) as excinfo:
        _consume_ent(tags.copy()) # 4.33μs -> 4.34μs (0.391% slower)

def test_entity_with_unexpected_tag_sequence():
    # Test a sequence with unexpected tag (should only consume matching tags)
    tags = ["B-PER", "I-LOC", "L-PER"]
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 1.75μs -> 2.06μs (15.1% slower)

def test_entity_with_interleaved_tags():
    # Test a sequence with interleaved tags
    tags = ["B-ORG", "I-ORG", "B-PER", "L-ORG"]
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 3.47μs -> 2.48μs (40.2% faster)

def test_entity_with_nonstandard_prefix():
    # Test a tag with nonstandard prefix (should be treated as unit entity)
    tags = ["X-FOO"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 1.26μs -> 1.57μs (19.8% slower)

def test_entity_with_lowercase_label():
    # Test tag with lowercase label
    tags = ["B-per", "L-per"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 3.16μs -> 2.36μs (33.8% faster)

# ----------------------
# LARGE SCALE TEST CASES
# ----------------------

def test_large_multi_token_entity():
    # Test a large multi-token entity
    n = 999
    tags = ["B-ORG"] + ["I-ORG"] * (n-2) + ["L-ORG"]
    codeflash_output = _consume_ent(tags.copy()); result = codeflash_output # 140μs -> 46.0μs (205% faster)
    expected = ["B-ORG"] + ["I-ORG"] * (n-2) + ["L-ORG"]

def test_large_list_multiple_entities():
    # Test a list with multiple large entities
    n = 500
    tags = (
        ["B-PER"] + ["I-PER"] * (n-2) + ["L-PER"] +
        ["B-LOC"] + ["I-LOC"] * (n-2) + ["L-LOC"]
    )
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result1 = codeflash_output # 72.3μs -> 23.5μs (207% faster)
    codeflash_output = _consume_ent(tags_copy); result2 = codeflash_output # 62.4μs -> 22.1μs (182% faster)
    expected1 = ["B-PER"] + ["I-PER"] * (n-2) + ["L-PER"]
    expected2 = ["B-LOC"] + ["I-LOC"] * (n-2) + ["L-LOC"]

def test_large_list_with_mixed_entities_and_unit_tags():
    # Test a large list with mixed entities and unit tags
    n = 300
    tags = (
        ["U-MISC"] +
        ["B-ORG"] + ["I-ORG"] * (n-2) + ["L-ORG"] +
        ["U-PER"] +
        ["B-LOC", "L-LOC"]
    )
    tags_copy = tags.copy()
    codeflash_output = _consume_ent(tags_copy); result1 = codeflash_output # 1.66μs -> 1.84μs (9.67% slower)
    codeflash_output = _consume_ent(tags_copy); result2 = codeflash_output # 37.5μs -> 14.6μs (157% faster)
    codeflash_output = _consume_ent(tags_copy); result3 = codeflash_output # 836ns -> 978ns (14.5% slower)
    codeflash_output = _consume_ent(tags_copy); result4 = codeflash_output # 1.68μs -> 1.26μs (32.8% faster)
    expected1 = ["U-MISC"]
    expected2 = ["B-ORG"] + ["I-ORG"] * (n-2) + ["L-ORG"]
    expected3 = ["U-PER"]
    expected4 = ["B-LOC", "L-LOC"]


def test_large_list_with_many_unit_entities():
    # Test a large list of unit entities
    tags = ["U-LOC"] * 999
    tags_copy = tags.copy()
    for _ in range(999):
        codeflash_output = _consume_ent(tags_copy); result = codeflash_output # 486μs -> 522μs (6.78% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_consume_ent-mhljaefj` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 5, 2025 05:05
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Nov 5, 2025