@codeflash-ai codeflash-ai bot commented Nov 5, 2025

📄 10% (0.10x) speedup for _char_indices in spacy/pipeline/span_finder.py

⏱️ Runtime: 47.2 microseconds → 42.8 microseconds (best of 250 runs)

📝 Explanation and details

The optimization eliminates redundant indexing operations by caching the last token of the span. In the original code, `span[-1]` is accessed twice: once to get `idx` and again for `len()`. The optimized version stores `span[-1]` in a local variable `last` and reuses it, reducing the number of span indexing operations from 3 to 2.

**Key changes:**

- Introduced `last = span[-1]` to cache the final token
- Combined the variable assignments into the return statement
- Eliminated the intermediate `start` and `end` variables

**Why this speeds up the code:**
In Python, sequence indexing (especially negative indexing like `span[-1]`) involves method calls and bounds checking. By caching the result, we avoid repeating this overhead. The line profiler shows the most expensive operation was `span[-1].idx + len(span[-1])` (54.2% of total time), which required two indexing operations. The optimization reduces this to a single indexing operation plus reuse of the cached token.

**Performance impact:**
The 10% overall speedup is consistent across test cases, with improvements ranging from 0% (unicode edge case) to 23% (large unicode tokens). The optimization is particularly effective for spans with larger tokens or more complex token objects, where the indexing overhead is more significant. Given that this function calculates character boundaries for spans in NLP pipelines, even small improvements can compound when processing large documents or datasets.
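A micro-benchmark of this kind can be sketched with `timeit`. The `Token` and `Span` classes here are plain-Python stand-ins, not spaCy's real Cython objects, so absolute numbers will differ from the figures reported above:

```python
import timeit

class Token:
    __slots__ = ("text", "idx")
    def __init__(self, text, idx):
        self.text = text
        self.idx = idx
    def __len__(self):
        return len(self.text)

class Span:
    def __init__(self, tokens):
        self.tokens = tokens
    def __getitem__(self, i):
        return self.tokens[i]

# 100 tokens of 10 chars each, spaced 10 chars apart.
span = Span([Token("x" * 10, i * 10) for i in range(100)])

def original(span):
    start = span[0].idx
    end = span[-1].idx + len(span[-1])
    return start, end

def optimized(span):
    last = span[-1]
    return span[0].idx, last.idx + len(last)

for fn in (original, optimized):
    t = timeit.timeit(lambda: fn(span), number=100_000)
    print(f"{fn.__name__}: {t:.3f}s")
```

The difference is small per call, which is why a large `number` of repetitions is needed to see it reliably.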

**Correctness verification report:**

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 39 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 71.4% |
🌀 Generated Regression Tests and Runtime
from typing import Tuple

# imports
import pytest  # used for our unit tests
from spacy.pipeline.span_finder import _char_indices


# Mocks for spaCy's Span and Token objects
class MockToken:
    def __init__(self, text, idx):
        self.text = text
        self.idx = idx

    def __len__(self):
        return len(self.text)

class MockSpan:
    def __init__(self, tokens):
        self.tokens = tokens

    def __getitem__(self, i):
        return self.tokens[i]

    def __len__(self):
        return len(self.tokens)

    def __iter__(self):
        return iter(self.tokens)

# unit tests

# 1. Basic Test Cases

def test_single_token_span():
    # Single token span: start and end should be token boundaries
    token = MockToken("hello", 0)
    span = MockSpan([token])
    codeflash_output = _char_indices(span) # 1.59μs -> 1.47μs (7.74% faster)

def test_multi_token_span():
    # Multi-token span: start at first token, end after last token
    tokens = [MockToken("hello", 0), MockToken("world", 6)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.24μs -> 1.08μs (14.3% faster)

def test_span_with_spaces():
    # Tokens with spaces between them
    tokens = [MockToken("a", 0), MockToken("b", 2), MockToken("c", 4)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.22μs -> 1.03μs (17.9% faster)

def test_span_middle_of_text():
    # Span not starting at index 0
    tokens = [MockToken("foo", 10), MockToken("bar", 14)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.22μs -> 1.08μs (12.6% faster)

# 2. Edge Test Cases


def test_span_with_one_char_tokens():
    # All tokens are single characters
    tokens = [MockToken("a", 0), MockToken("b", 1), MockToken("c", 2)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.82μs -> 1.70μs (7.61% faster)

def test_span_with_non_ascii_tokens():
    # Tokens with unicode characters
    tokens = [MockToken("你好", 0), MockToken("世界", 2)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.13μs -> 1.13μs (0.000% faster)

def test_span_with_token_at_end_of_text():
    # Last token ends at last char of text
    tokens = [MockToken("foo", 0), MockToken("bar", 4), MockToken("baz", 8)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.19μs -> 1.10μs (7.83% faster)

def test_span_with_overlapping_indices():
    # Overlapping indices (should not happen in spaCy, but test anyway)
    tokens = [MockToken("foo", 0), MockToken("oo", 1)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.17μs -> 1.07μs (9.45% faster)

def test_span_with_nonzero_start_and_nonzero_end():
    # Tokens not at start or end of text
    tokens = [MockToken("abc", 5), MockToken("def", 9)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.21μs -> 1.00μs (20.6% faster)

def test_span_with_token_length_zero():
    # Token with zero length (should not happen, but test for robustness)
    tokens = [MockToken("", 0), MockToken("a", 1)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.15μs -> 1.08μs (6.02% faster)

def test_span_with_gap_between_tokens():
    # Tokens with gaps between indices
    tokens = [MockToken("foo", 0), MockToken("bar", 10)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.17μs -> 1.04μs (11.9% faster)

# 3. Large Scale Test Cases

def test_long_span_1000_tokens():
    # Span of 1000 tokens, each 1 char, indices 0..999
    tokens = [MockToken("a", i) for i in range(1000)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.38μs -> 1.23μs (12.0% faster)

def test_long_span_varied_token_lengths():
    # Span of 500 tokens, each token length = i % 5 + 1
    tokens = []
    idx = 0
    for i in range(500):
        length = i % 5 + 1
        tokens.append(MockToken("x" * length, idx))
        idx += length
    span = MockSpan(tokens)
    expected_start = 0
    expected_end = tokens[-1].idx + len(tokens[-1])
    codeflash_output = _char_indices(span) # 1.17μs -> 1.00μs (16.9% faster)
    assert codeflash_output == (expected_start, expected_end)

def test_span_with_large_gaps():
    # Tokens spaced apart by 10 chars each
    tokens = [MockToken("foo", i*10) for i in range(100)]
    span = MockSpan(tokens)
    expected_start = 0
    expected_end = tokens[-1].idx + len(tokens[-1])
    codeflash_output = _char_indices(span) # 1.13μs -> 992ns (14.2% faster)
    assert codeflash_output == (expected_start, expected_end)

def test_span_with_large_unicode_tokens():
    # 100 tokens, each token is 10 unicode chars, indices spaced by 10
    tokens = [MockToken("界"*10, i*10) for i in range(100)]
    span = MockSpan(tokens)
    expected_start = 0
    expected_end = tokens[-1].idx + len(tokens[-1])
    codeflash_output = _char_indices(span) # 1.09μs -> 883ns (23.0% faster)
    assert codeflash_output == (expected_start, expected_end)

def test_span_with_max_length_token():
    # One token, length 1000
    token = MockToken("x"*1000, 0)
    span = MockSpan([token])
    codeflash_output = _char_indices(span) # 1.30μs -> 1.20μs (8.42% faster)

# Additional mutation-resistance tests

def test_span_start_and_end_are_correct_for_mutation():
    # If the function used span[-1].idx without len(span[-1]), this would fail
    tokens = [MockToken("foo", 0), MockToken("bar", 4)]
    span = MockSpan(tokens)
    # "foo"(0-3), "bar"(4-7)
    codeflash_output = _char_indices(span) # 1.16μs -> 1.04μs (11.7% faster)
    assert codeflash_output == (0, 7)

def test_span_with_non_consecutive_indices():
    # If the function used len(span) or similar, this would fail
    tokens = [MockToken("a", 0), MockToken("b", 100)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.15μs -> 1.02μs (11.9% faster)

def test_span_with_token_length_one_and_nonzero_idx():
    # Token at idx=5, length=1
    token = MockToken("x", 5)
    span = MockSpan([token])
    codeflash_output = _char_indices(span) # 1.11μs -> 1.06μs (5.49% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Tuple

# imports
import pytest
from spacy.pipeline.span_finder import _char_indices


# Mocks for spaCy's Span and Token objects for testing purposes
class MockToken:
    def __init__(self, text, idx):
        self.text = text
        self.idx = idx

    def __len__(self):
        return len(self.text)

class MockSpan:
    def __init__(self, tokens):
        self.tokens = tokens

    def __getitem__(self, item):
        return self.tokens[item]

    def __len__(self):
        return len(self.tokens)

# ==========================
# Basic Test Cases
# ==========================

def test_single_token_span():
    # Test a span consisting of a single token
    token = MockToken("hello", 0)
    span = MockSpan([token])
    codeflash_output = _char_indices(span) # 1.61μs -> 1.32μs (22.1% faster)

def test_two_token_span():
    # Test a span consisting of two tokens
    tokens = [MockToken("hello", 0), MockToken("world", 6)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.13μs -> 990ns (14.1% faster)

def test_middle_of_text_span():
    # Span in the middle of the text
    tokens = [MockToken("quick", 4), MockToken("brown", 10)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.09μs -> 1.04μs (4.51% faster)

def test_span_with_punctuation():
    # Span including punctuation
    tokens = [MockToken("hello", 0), MockToken(",", 5), MockToken("world", 7)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.17μs -> 1.01μs (16.4% faster)

# ==========================
# Edge Test Cases
# ==========================


def test_span_with_nonzero_start():
    # Span not starting at 0
    tokens = [MockToken("foo", 10), MockToken("bar", 14)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.25μs -> 1.17μs (6.85% faster)

def test_span_with_multibyte_characters():
    # Span with multibyte (unicode) characters
    tokens = [MockToken("你好", 0), MockToken("世界", 2)]
    span = MockSpan(tokens)
    # Each Chinese character is 1 in Python string length, so '你好' is 2, '世界' is 2
    codeflash_output = _char_indices(span) # 1.16μs -> 1.07μs (8.68% faster)

def test_span_with_zero_length_token():
    # Span containing a zero-length token
    tokens = [MockToken("", 0), MockToken("abc", 0)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.19μs -> 1.07μs (10.3% faster)

def test_span_with_overlapping_indices():
    # Overlapping indices (should not happen in real spaCy, but test for robustness)
    tokens = [MockToken("a", 0), MockToken("b", 0)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.17μs -> 1.03μs (13.9% faster)

def test_span_with_non_ascii_token():
    # Span with non-ASCII (emoji)
    tokens = [MockToken("😊", 0), MockToken("🚀", 1)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.11μs -> 1.09μs (1.46% faster)

def test_span_with_whitespace_tokens():
    # Span with whitespace tokens
    tokens = [MockToken(" ", 0), MockToken("foo", 1), MockToken(" ", 4), MockToken("bar", 5)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.13μs -> 1.05μs (6.92% faster)

def test_span_with_token_at_end_of_text():
    # Span ending at the last character of text
    tokens = [MockToken("end", 97)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.15μs -> 1.03μs (11.5% faster)

# ==========================
# Large Scale Test Cases
# ==========================

def test_long_span_of_1000_tokens():
    # Span of 1000 tokens, each 1 character, sequential indices
    tokens = [MockToken("a", i) for i in range(1000)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.27μs -> 1.14μs (10.8% faster)

def test_large_tokens_with_large_offsets():
    # Large tokens with large offsets
    tokens = [MockToken("x"*10, i*10) for i in range(1000)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.30μs -> 1.14μs (14.1% faster)

def test_span_on_middle_of_large_document():
    # Span in the middle of a large document
    tokens = [MockToken("a", i) for i in range(1000)]
    span = MockSpan(tokens[400:600])
    codeflash_output = _char_indices(span) # 1.21μs -> 1.11μs (8.70% faster)

def test_span_with_varied_token_lengths():
    # Span with tokens of varying lengths
    tokens = []
    idx = 0
    for i in range(100):
        tok = "x" * (i % 10 + 1)
        tokens.append(MockToken(tok, idx))
        idx += len(tok)
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.17μs -> 1.12μs (4.18% faster)

def test_span_with_gaps_in_indices():
    # Tokens with gaps in indices (simulate missing text)
    tokens = [MockToken("a", 0), MockToken("b", 10), MockToken("c", 20)]
    span = MockSpan(tokens)
    # Start at 0, end at 20 + 1 = 21
    codeflash_output = _char_indices(span) # 1.18μs -> 1.08μs (9.33% faster)
    assert codeflash_output == (0, 21)

# ==========================
# Mutation Testing Guards
# ==========================

def test_mutation_wrong_start():
    # If function returns span[-1].idx as start, this will fail
    tokens = [MockToken("start", 3), MockToken("end", 10)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.13μs -> 1.08μs (4.15% faster)

def test_mutation_wrong_end():
    # If function returns span[-1].idx (not adding len), this will fail
    tokens = [MockToken("foo", 5), MockToken("bar", 9)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.15μs -> 1.02μs (12.4% faster)

def test_mutation_wrong_order():
    # If function swaps start and end, this will fail
    tokens = [MockToken("alpha", 2), MockToken("beta", 8)]
    span = MockSpan(tokens)
    codeflash_output = _char_indices(span) # 1.05μs -> 1.03μs (2.73% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-_char_indices-mhmic9uo` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 5, 2025 21:26
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 5, 2025
