@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 25% (0.25x) speedup for validate_attrs in spacy/pipe_analysis.py

⏱️ Runtime: 2.59 milliseconds -> 2.07 milliseconds (best of 181 runs)
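
As a quick check on how the headline figure relates to the two runtimes, the sketch below assumes the speedup is computed as (original - optimized) / optimized. That formula is an assumption inferred from the numbers in this report, not something the report states, and the helper names are hypothetical.

```python
def speedup(original: float, optimized: float) -> float:
    """Relative speedup, assuming (original - optimized) / optimized."""
    return (original - optimized) / optimized


def time_reduction(original: float, optimized: float) -> float:
    """Fraction of the original runtime that was saved."""
    return (original - optimized) / original


# Overall runtimes from this report, in milliseconds.
print(f"speedup:        {speedup(2.59, 2.07):.1%}")
print(f"time reduction: {time_reduction(2.59, 2.07):.1%}")
```

Under that assumption, a 25% speedup corresponds to roughly a 20% reduction in wall-clock time, so the two ways of quoting a percentage are not interchangeable.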

📝 Explanation and details

The optimized code achieves a 25% speedup through two key optimizations:

1. Streamlined dot_to_dict Loop Logic

The original code used enumerate() with a conditional check on every iteration:

for i, item in enumerate(parts):
    is_last = i == len(parts) - 1
    path = path.setdefault(item, value if is_last else {})

The optimized version separates the logic into two cleaner operations:

for item in parts[:-1]:
    path = path.setdefault(item, {})
path.setdefault(parts[-1], value)

This eliminates the overhead of enumerate(), the repeated len(parts) - 1 calculation, and the conditional check on each iteration. The line profiler shows the dot_to_dict runtime falling from 13.6ms to 8.98ms, a 34% reduction in time.
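
To see that the two loop forms build the same nested dictionary, here is a minimal standalone sketch. The wrapper functions `nest_original` and `nest_optimized` are hypothetical names written for illustration. Only the inner loops come from this PR, and the surrounding code is an assumption rather than spaCy's actual `dot_to_dict`.

```python
from typing import Any, Dict


def nest_original(values: Dict[str, Any]) -> Dict[str, Any]:
    """Nest dotted keys using the enumerate-based loop."""
    result: Dict[str, Any] = {}
    for key, value in values.items():
        path = result
        parts = key.split(".")
        for i, item in enumerate(parts):
            is_last = i == len(parts) - 1
            path = path.setdefault(item, value if is_last else {})
    return result


def nest_optimized(values: Dict[str, Any]) -> Dict[str, Any]:
    """Nest dotted keys by walking the prefix, then setting the leaf."""
    result: Dict[str, Any] = {}
    for key, value in values.items():
        path = result
        parts = key.split(".")
        for item in parts[:-1]:
            path = path.setdefault(item, {})
        path.setdefault(parts[-1], value)
    return result


sample = {"doc.text": True, "token._.bar": True, "token.pos": True}
assert nest_original(sample) == nest_optimized(sample)
print(nest_optimized(sample))
```

Both versions rely on `str.split` always returning at least one element, so `parts[-1]` is safe for any string key.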

2. Single-Pass Span Attribute Filtering

In validate_attrs, the original code built the list of span attributes in two steps: one comprehension over the values iterable, then a second over the intermediate list:

span_attrs = [attr for attr in values if attr.startswith("span.")]
span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")]

The optimized version combines this into a single comprehension and converts the iterable to a list once:

values_list = list(values)  # Convert once
span_attrs = [attr for attr in values_list if attr.startswith("span.") and not attr.startswith("span._.")]

In the generated tests, the largest gains (30-35%) appear in cases with many distinct attributes, such as 1,000 custom extension attributes. Lists that repeat the same few attributes change little and in some cases run slightly slower (up to 8.47% slower for 1,000 repeated valid attributes). Within validate_attrs, the single comprehension avoids building an intermediate list.
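
The equivalence of the two filtering styles can be checked in isolation. The sketch below uses hypothetical function names and is not the code from spacy/pipe_analysis.py. It also illustrates one reason to materialize the iterable once, on the assumption that callers may pass a generator.

```python
from typing import Iterable, List


def span_attrs_two_step(values: Iterable[str]) -> List[str]:
    """Filter in two comprehensions, as in the original snippet."""
    span_attrs = [attr for attr in values if attr.startswith("span.")]
    return [attr for attr in span_attrs if not attr.startswith("span._.")]


def span_attrs_single_pass(values: Iterable[str]) -> List[str]:
    """Materialize once, then filter with a single combined condition."""
    values_list = list(values)  # safe to iterate again later if needed
    return [
        attr
        for attr in values_list
        if attr.startswith("span.") and not attr.startswith("span._.")
    ]


attrs = ["doc.text", "span.text", "span._.my_ext", "token.pos"]
assert span_attrs_two_step(attrs) == span_attrs_single_pass(attrs)

# A generator can only be consumed once, which is why converting to a
# list up front matters if the values are iterated again afterwards.
gen = (a for a in attrs)
print(span_attrs_single_pass(gen))
```

The result contains only the span attributes that are not custom extensions, which are the ones the validator would reject.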

These optimizations are especially beneficial for spaCy's attribute validation system, which likely processes many attribute lists during pipeline configuration and component validation.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | ✅ 18 Passed |
| 🌀 Generated Regression Tests | ✅ 46 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| pipeline/test_analysis.py::test_analysis_validate_attrs_invalid | 76.1μs | 70.8μs | 7.43% ✅ |
| pipeline/test_analysis.py::test_analysis_validate_attrs_valid | 34.6μs | 31.7μs | 9.42% ✅ |

🌀 Generated Regression Tests and Runtime

from typing import Any, Dict, Iterable

# imports
import pytest  # used for our unit tests
from spacy.pipe_analysis import validate_attrs


# Simulate spacy.errors.Errors with minimal error messages for testing
class Errors:
    E180 = "Span attributes must be custom extension attributes: {attrs}"
    E181 = "Invalid object '{obj}' for attributes: {attrs}"
    E182 = "Invalid attribute format: {attr}"
    E183 = "Invalid attribute format: {attr}. Did you mean: {solution}?"
    E184 = "Attributes ending with '_' are not allowed: {attr}. Did you mean: {solution}?"
    E185 = "Object '{obj}' does not have attribute '{attr}'"


# Dummy classes with some attributes for testing
class Doc:
    text = True
    is_parsed = True
    cats = True
    # No 'foo', 'bar', etc.


class Token:
    pos = True
    lemma = True
    text = True
    # No 'foo', 'bar', etc.


class Span:
    # Only custom extension attributes allowed
    pass


from spacy.pipe_analysis import validate_attrs

# -------------------------------
# Unit tests for validate_attrs
# -------------------------------

# 1. Basic Test Cases

def test_valid_doc_attrs():
    # Single valid doc attribute
    codeflash_output = validate_attrs(["doc.text"])  # 4.93μs -> 4.51μs (9.37% faster)
    # Multiple valid doc attributes
    codeflash_output = validate_attrs(["doc.text", "doc.is_parsed", "doc.cats"])  # 5.35μs -> 5.21μs (2.78% faster)


def test_valid_token_attrs():
    # Single valid token attribute
    codeflash_output = validate_attrs(["token.pos"])  # 4.64μs -> 4.34μs (6.84% faster)
    # Multiple valid token attributes
    codeflash_output = validate_attrs(["token.text", "token.lemma", "token.pos"])  # 5.45μs -> 5.23μs (4.18% faster)


def test_valid_custom_extension_attrs():
    # Custom extension attributes for doc
    codeflash_output = validate_attrs(["doc._.my_ext"])  # 4.35μs -> 3.98μs (9.19% faster)
    # Custom extension attributes for token
    codeflash_output = validate_attrs(["token._.my_token_ext"])  # 2.37μs -> 2.38μs (0.379% slower)
    # Multiple custom extension attributes
    codeflash_output = validate_attrs(["doc._.foo", "token._.bar"])  # 3.72μs -> 3.52μs (5.77% faster)


def test_valid_mixed_attrs():
    # Mixed valid attributes
    attrs = ["doc.text", "token.lemma", "doc._.foo", "token._.bar"]
    codeflash_output = validate_attrs(attrs)  # 8.61μs -> 8.13μs (5.90% faster)

# 2. Edge Test Cases

def test_invalid_object_name():
    # Object name not in doc/token/span
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["sentence.text"])  # 9.37μs -> 8.76μs (6.97% faster)


def test_invalid_attr_format_missing_attr():
    # Missing attribute after object
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc"])  # 7.75μs -> 7.59μs (2.05% faster)


def test_invalid_attr_format_custom_ext_missing_attr():
    # Missing attribute after custom extension prefix
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc._"])  # 8.09μs -> 7.89μs (2.50% faster)


def test_invalid_attr_format_nested_extension():
    # Nested extension attribute (too deep)
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc._.foo.bar"])  # 9.07μs -> 9.12μs (0.570% slower)


def test_invalid_attr_format_non_extension_nested():
    # Nested attribute (not extension)
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc.text.foo"])  # 8.87μs -> 8.45μs (4.97% faster)


def test_invalid_attr_format_trailing_underscore():
    # Attribute ending with underscore
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["token.pos_"])  # 8.33μs -> 7.85μs (6.14% faster)


def test_invalid_attr_not_in_class():
    # Attribute does not exist in class
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc.foo"])  # 9.40μs -> 8.84μs (6.34% faster)


def test_span_non_extension_attribute():
    # Span attribute not allowed unless custom extension
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["span.text"])  # 9.13μs -> 8.21μs (11.2% faster)


def test_span_extension_attribute_valid():
    # Span custom extension attribute is allowed
    codeflash_output = validate_attrs(["span._.my_span_ext"])  # 5.82μs -> 5.08μs (14.7% faster)


def test_case_insensitivity():
    # Attribute names are case-insensitive
    codeflash_output = validate_attrs(["DOC.TEXT", "Token.Lemma"])  # 7.21μs -> 6.96μs (3.58% faster)


def test_multiple_errors():
    # Multiple errors: only the first error is raised
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc.foo", "token.bar"])  # 10.7μs -> 9.98μs (7.69% faster)

# 3. Large Scale Test Cases

def test_large_number_of_valid_attrs():
    # Large number of valid attributes
    attrs = [f"doc.text"] * 500 + [f"token.pos"] * 500
    codeflash_output = validate_attrs(attrs)  # 21.0μs -> 23.0μs (8.47% slower)


def test_large_number_of_custom_extension_attrs():
    # Large number of valid custom extension attributes
    attrs = [f"doc._.ext{i}" for i in range(1000)]
    codeflash_output = validate_attrs(attrs)  # 549μs -> 407μs (35.0% faster)


def test_large_number_of_invalid_attrs():
    # Large number of invalid attributes, should fail on first invalid
    attrs = [f"doc.text"] * 999 + ["doc.foo"]
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(attrs)  # 26.7μs -> 27.1μs (1.51% slower)


def test_large_number_of_span_non_extension_attrs():
    # Large number of span non-extension attributes, should fail on first
    attrs = [f"span.text"] * 1000
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(attrs)  # 128μs -> 117μs (9.87% faster)


def test_large_number_of_mixed_valid_attrs():
    # Large number of mixed valid attributes
    attrs = (
        [f"doc.text"] * 250 +
        [f"token.lemma"] * 250 +
        [f"doc._.ext{i}" for i in range(250)] +
        [f"token._.ext{i}" for i in range(250)]
    )
    codeflash_output = validate_attrs(attrs)  # 288μs -> 219μs (31.6% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

#------------------------------------------------
from typing import Any, Dict, Iterable

# imports
import pytest
from spacy.pipe_analysis import validate_attrs


# Errors class to simulate spacy.errors.Errors
class Errors:
    E180 = "Invalid Span attributes: {attrs}. Only custom extension attributes (span._.xyz) are allowed."
    E181 = "Invalid object '{obj}' in attribute(s): {attrs}. Only 'doc', 'token', and 'span' are allowed."
    E182 = "Invalid attribute format: {attr}. Attribute must specify a property, not just the object."
    E183 = "Invalid attribute format: {attr}. Did you mean: {solution}?"
    E184 = "Attribute '{attr}' should not end with an underscore. Did you mean: {solution}?"
    E185 = "Object '{obj}' does not have attribute '{attr}'."


# Dummy classes to simulate spacy.tokens.Doc, Span, Token
class Doc:
    text = None
    length = None
    # Simulate some valid attributes


class Token:
    pos = None
    lemma = None
    pos_ = None  # Used to test invalid attributes ending with '_'
    # Simulate some valid attributes


class Span:
    # No valid attributes except extensions
    pass


from spacy.pipe_analysis import validate_attrs

# ------------------ UNIT TESTS ------------------

# 1. Basic Test Cases

def test_valid_custom_extension_attributes():
    # Valid custom extension attributes for doc, token, span
    attrs = ["doc._.xyz", "token._.abc", "span._.myext"]
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 10.5μs -> 9.69μs (8.51% faster)


def test_valid_mixed_attributes():
    # Mix of valid normal and extension attributes
    attrs = ["doc.text", "token.lemma", "span._.myext"]
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 10.0μs -> 9.45μs (5.83% faster)

# 2. Edge Test Cases

def test_invalid_object_name():
    # Attribute with invalid object name
    attrs = ["sentence.text"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 10.4μs -> 9.45μs (9.64% faster)


def test_missing_attribute_after_object():
    # Attribute is just "doc" or "token" or "span"
    attrs = ["doc"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.09μs -> 7.83μs (3.37% faster)


def test_missing_attribute_after_extension():
    # Attribute is just "doc._"
    attrs = ["doc._"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.57μs -> 8.14μs (5.36% faster)


def test_span_non_extension_attribute():
    # Span attribute not using custom extension
    attrs = ["span.text"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.85μs -> 8.00μs (10.6% faster)


def test_attribute_with_trailing_underscore():
    # Attribute ends with an underscore
    attrs = ["token.pos_"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.60μs -> 8.03μs (7.08% faster)


def test_attribute_with_too_many_dots():
    # Attribute is something like doc.text.extra
    attrs = ["doc.text.extra"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.50μs -> 9.16μs (3.63% faster)


def test_extension_attribute_with_too_many_dots():
    # Attribute is something like doc._.x.y
    attrs = ["doc._.x.y"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.18μs -> 8.56μs (7.17% faster)


def test_attribute_not_in_class():
    # Attribute is not present in Doc or Token class
    attrs = ["doc.invalidattr"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.86μs -> 8.93μs (10.4% faster)


def test_multiple_errors_in_list():
    # Multiple invalid attributes, should raise on first error found
    attrs = ["doc.text", "doc.invalidattr", "token.pos_"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 12.7μs -> 11.8μs (7.99% faster)


def test_span_extension_attribute_with_too_many_dots():
    # Extension attribute with too many dots
    attrs = ["span._.myext.sub"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 10.9μs -> 10.1μs (8.45% faster)

# 3. Large Scale Test Cases

def test_large_number_of_custom_extensions():
    # Test with many valid custom extension attributes
    attrs = [f"doc._.ext{i}" for i in range(1000)]
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 544μs -> 405μs (34.4% faster)


def test_large_number_of_invalid_attributes():
    # Test with many invalid attributes (all should fail)
    attrs = [f"doc.invalid{i}" for i in range(1000)]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 458μs -> 350μs (30.8% faster)


def test_large_mixed_valid_and_invalid():
    # Mix of valid and invalid attributes
    attrs = ["doc.text"] * 500 + ["token.pos"] * 499 + ["doc.invalidattr"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 27.1μs -> 27.9μs (3.05% slower)


def test_large_number_of_span_non_extension():
    # Large number of invalid span attributes (non-extension)
    attrs = [f"span.text{i}" for i in range(100)]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 68.3μs -> 56.2μs (21.6% faster)
    for i in range(100):
        pass


def test_large_number_of_extension_attribute_with_too_many_dots():
    # Large number of extension attributes with too many dots
    attrs = [f"doc._.ext{i}.sub" for i in range(100)]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 79.0μs -> 62.1μs (27.3% faster)

# 4. Miscellaneous/Additional Edge Cases

def test_empty_input():
    # Empty input should be valid and return empty list
    attrs = []
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 2.03μs -> 2.04μs (0.489% slower)


def test_case_insensitivity():
    # Attribute keys should be treated case-insensitively
    attrs = ["DOC.TEXT", "Token.Lemma", "SPAN._.MyExt"]
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 10.4μs -> 9.30μs (12.2% faster)


def test_attribute_with_leading_dot():
    # Attribute with leading dot is invalid
    attrs = [".doc.text"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.74μs -> 8.62μs (12.9% faster)


def test_attribute_with_multiple_consecutive_dots():
    # Attribute with multiple consecutive dots is invalid
    attrs = ["doc..text"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.40μs -> 9.04μs (4.10% faster)


def test_attribute_with_trailing_dot():
    # Attribute with trailing dot is invalid
    attrs = ["doc.text."]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.95μs -> 8.85μs (1.05% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-validate_attrs-mhwqezei` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 01:10
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 13, 2025