⚡️ Speed up function validate_attrs by 25%
#14
📄 25% (0.25x) speedup for validate_attrs in spacy/pipe_analysis.py
⏱️ Runtime: 2.59 milliseconds → 2.07 milliseconds (best of 181 runs)
📝 Explanation and details
The optimized code achieves a 25% speedup through two key optimizations:

1. Streamlined dot_to_dict Loop Logic

The original code used enumerate() with a conditional check on every iteration. The optimized version separates the logic into two cleaner operations. This eliminates the overhead of enumerate(), the repeated len(parts) - 1 calculation, and the conditional check on each iteration. The line profiler shows this reduced the dot_to_dict runtime from 13.6ms to 8.98ms (34% faster).

2. Single-Pass Span Attribute Filtering

In validate_attrs, the original code made two separate passes over the values iterable for span validation. The optimized version combines this into a single comprehension and converts the iterable to a list once.

This optimization is particularly effective for test cases with large numbers of custom extension attributes, showing 30-35% improvements in those scenarios. The validate_attrs function benefits from reduced iteration overhead and better memory access patterns.

These optimizations are especially beneficial for spaCy's attribute validation system, which likely processes many attribute lists during pipeline configuration and component validation.
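The code snippets that accompanied this explanation are not reproduced here. A minimal sketch of the two loop shapes it describes might look like the following. The function names are hypothetical, and the bodies are reconstructed from the prose above rather than copied from spaCy or from this PR's diff, so treat them as an illustration only.

```python
from typing import Any, Dict, Iterable, List


def dot_to_dict_enumerate(values: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical 'before' shape: a last-item check on every part."""
    result: Dict[str, Any] = {}
    for key, value in values.items():
        path = result
        parts = key.lower().split(".")
        for i, item in enumerate(parts):
            is_last = i == len(parts) - 1
            path = path.setdefault(item, value if is_last else {})
    return result


def dot_to_dict_split(values: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical 'after' shape: walk the prefix, then set the leaf once."""
    result: Dict[str, Any] = {}
    for key, value in values.items():
        path = result
        parts = key.lower().split(".")
        for item in parts[:-1]:
            path = path.setdefault(item, {})
        path.setdefault(parts[-1], value)
    return result


def span_attrs_two_pass(values: Iterable[str]) -> List[str]:
    """Hypothetical 'before' shape: two comprehensions, one after the other."""
    span_attrs = [attr for attr in values if attr.startswith("span.")]
    return [attr for attr in span_attrs if not attr.startswith("span._.")]


def span_attrs_single_pass(values: Iterable[str]) -> List[str]:
    """Hypothetical 'after' shape: materialize once, filter in one pass."""
    values = list(values)  # assumption: guards against one-shot iterables
    return [
        attr
        for attr in values
        if attr.startswith("span.") and not attr.startswith("span._.")
    ]
```

Under these assumptions, both variants of each pair return the same result for the same input; only the amount of per-iteration work differs.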
✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
pipeline/test_analysis.py::test_analysis_validate_attrs_invalid
pipeline/test_analysis.py::test_analysis_validate_attrs_valid
🌀 Generated Regression Tests and Runtime
from typing import Any, Dict, Iterable

# imports
import pytest  # used for our unit tests
from spacy.pipe_analysis import validate_attrs


# Simulate spacy.errors.Errors with minimal error messages for testing
class Errors:
    E180 = "Span attributes must be custom extension attributes: {attrs}"
    E181 = "Invalid object '{obj}' for attributes: {attrs}"
    E182 = "Invalid attribute format: {attr}"
    E183 = "Invalid attribute format: {attr}. Did you mean: {solution}?"
    E184 = "Attributes ending with '_' are not allowed: {attr}. Did you mean: {solution}?"
    E185 = "Object '{obj}' does not have attribute '{attr}'"


# Dummy classes with some attributes for testing
class Doc:
    text = True
    is_parsed = True
    cats = True
    # No 'foo', 'bar', etc.


class Token:
    pos = True
    lemma = True
    text = True
    # No 'foo', 'bar', etc.


class Span:
    # Only custom extension attributes allowed
    pass


# -------------------------------
# Unit tests for validate_attrs
# -------------------------------
# 1. Basic Test Cases

def test_valid_doc_attrs():
    # Single valid doc attribute
    codeflash_output = validate_attrs(["doc.text"])  # 4.93μs -> 4.51μs (9.37% faster)
    # Multiple valid doc attributes
    codeflash_output = validate_attrs(["doc.text", "doc.is_parsed", "doc.cats"])  # 5.35μs -> 5.21μs (2.78% faster)


def test_valid_token_attrs():
    # Single valid token attribute
    codeflash_output = validate_attrs(["token.pos"])  # 4.64μs -> 4.34μs (6.84% faster)
    # Multiple valid token attributes
    codeflash_output = validate_attrs(["token.text", "token.lemma", "token.pos"])  # 5.45μs -> 5.23μs (4.18% faster)


def test_valid_custom_extension_attrs():
    # Custom extension attributes for doc
    codeflash_output = validate_attrs(["doc._.my_ext"])  # 4.35μs -> 3.98μs (9.19% faster)
    # Custom extension attributes for token
    codeflash_output = validate_attrs(["token._.my_token_ext"])  # 2.37μs -> 2.38μs (0.379% slower)
    # Multiple custom extension attributes
    codeflash_output = validate_attrs(["doc._.foo", "token._.bar"])  # 3.72μs -> 3.52μs (5.77% faster)


def test_valid_mixed_attrs():
    # Mixed valid attributes
    attrs = ["doc.text", "token.lemma", "doc._.foo", "token._.bar"]
    codeflash_output = validate_attrs(attrs)  # 8.61μs -> 8.13μs (5.90% faster)
# 2. Edge Test Cases

def test_invalid_object_name():
    # Object name not in doc/token/span
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["sentence.text"])  # 9.37μs -> 8.76μs (6.97% faster)


def test_invalid_attr_format_missing_attr():
    # Missing attribute after object
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc"])  # 7.75μs -> 7.59μs (2.05% faster)


def test_invalid_attr_format_custom_ext_missing_attr():
    # Missing attribute after custom extension prefix
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc._"])  # 8.09μs -> 7.89μs (2.50% faster)


def test_invalid_attr_format_nested_extension():
    # Nested extension attribute (too deep)
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc._.foo.bar"])  # 9.07μs -> 9.12μs (0.570% slower)


def test_invalid_attr_format_non_extension_nested():
    # Nested attribute (not extension)
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc.text.foo"])  # 8.87μs -> 8.45μs (4.97% faster)


def test_invalid_attr_format_trailing_underscore():
    # Attribute ending with underscore
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["token.pos_"])  # 8.33μs -> 7.85μs (6.14% faster)


def test_invalid_attr_not_in_class():
    # Attribute does not exist in class
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc.foo"])  # 9.40μs -> 8.84μs (6.34% faster)


def test_span_non_extension_attribute():
    # Span attribute not allowed unless custom extension
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["span.text"])  # 9.13μs -> 8.21μs (11.2% faster)


def test_span_extension_attribute_valid():
    # Span custom extension attribute is allowed
    codeflash_output = validate_attrs(["span._.my_span_ext"])  # 5.82μs -> 5.08μs (14.7% faster)


def test_case_insensitivity():
    # Attribute names are case-insensitive
    codeflash_output = validate_attrs(["DOC.TEXT", "Token.Lemma"])  # 7.21μs -> 6.96μs (3.58% faster)


def test_multiple_errors():
    # Multiple errors: only the first error is raised
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(["doc.foo", "token.bar"])  # 10.7μs -> 9.98μs (7.69% faster)
# 3. Large Scale Test Cases

def test_large_number_of_valid_attrs():
    # Large number of valid attributes
    attrs = [f"doc.text"] * 500 + [f"token.pos"] * 500
    codeflash_output = validate_attrs(attrs)  # 21.0μs -> 23.0μs (8.47% slower)


def test_large_number_of_custom_extension_attrs():
    # Large number of valid custom extension attributes
    attrs = [f"doc._.ext{i}" for i in range(1000)]
    codeflash_output = validate_attrs(attrs)  # 549μs -> 407μs (35.0% faster)


def test_large_number_of_invalid_attrs():
    # Large number of invalid attributes, should fail on first invalid
    attrs = [f"doc.text"] * 999 + ["doc.foo"]
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(attrs)  # 26.7μs -> 27.1μs (1.51% slower)


def test_large_number_of_span_non_extension_attrs():
    # Large number of span non-extension attributes, should fail on first
    attrs = [f"span.text"] * 1000
    with pytest.raises(ValueError) as excinfo:
        validate_attrs(attrs)  # 128μs -> 117μs (9.87% faster)


def test_large_number_of_mixed_valid_attrs():
    # Large number of mixed valid attributes
    attrs = (
        [f"doc.text"] * 250 +
        [f"token.lemma"] * 250 +
        [f"doc._.ext{i}" for i in range(250)] +
        [f"token._.ext{i}" for i in range(250)]
    )
    codeflash_output = validate_attrs(attrs)  # 288μs -> 219μs (31.6% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Any, Dict, Iterable

# imports
import pytest
from spacy.pipe_analysis import validate_attrs


# Errors class to simulate spacy.errors.Errors
class Errors:
    E180 = "Invalid Span attributes: {attrs}. Only custom extension attributes (span._.xyz) are allowed."
    E181 = "Invalid object '{obj}' in attribute(s): {attrs}. Only 'doc', 'token', and 'span' are allowed."
    E182 = "Invalid attribute format: {attr}. Attribute must specify a property, not just the object."
    E183 = "Invalid attribute format: {attr}. Did you mean: {solution}?"
    E184 = "Attribute '{attr}' should not end with an underscore. Did you mean: {solution}?"
    E185 = "Object '{obj}' does not have attribute '{attr}'."


# Dummy classes to simulate spacy.tokens.Doc, Span, Token
class Doc:
    text = None
    length = None
    # Simulate some valid attributes


class Token:
    pos = None
    lemma = None
    pos_ = None  # Used to test invalid attributes ending with '_'
    # Simulate some valid attributes


class Span:
    # No valid attributes except extensions
    pass


# ------------------ UNIT TESTS ------------------
# 1. Basic Test Cases

def test_valid_custom_extension_attributes():
    # Valid custom extension attributes for doc, token, span
    attrs = ["doc._.xyz", "token._.abc", "span._.myext"]
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 10.5μs -> 9.69μs (8.51% faster)


def test_valid_mixed_attributes():
    # Mix of valid normal and extension attributes
    attrs = ["doc.text", "token.lemma", "span._.myext"]
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 10.0μs -> 9.45μs (5.83% faster)
# 2. Edge Test Cases

def test_invalid_object_name():
    # Attribute with invalid object name
    attrs = ["sentence.text"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 10.4μs -> 9.45μs (9.64% faster)


def test_missing_attribute_after_object():
    # Attribute is just "doc" or "token" or "span"
    attrs = ["doc"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.09μs -> 7.83μs (3.37% faster)


def test_missing_attribute_after_extension():
    # Attribute is just "doc._"
    attrs = ["doc._"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.57μs -> 8.14μs (5.36% faster)


def test_span_non_extension_attribute():
    # Span attribute not using custom extension
    attrs = ["span.text"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.85μs -> 8.00μs (10.6% faster)


def test_attribute_with_trailing_underscore():
    # Attribute ends with an underscore
    attrs = ["token.pos_"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.60μs -> 8.03μs (7.08% faster)


def test_attribute_with_too_many_dots():
    # Attribute is something like doc.text.extra
    attrs = ["doc.text.extra"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.50μs -> 9.16μs (3.63% faster)


def test_extension_attribute_with_too_many_dots():
    # Attribute is something like doc._.x.y
    attrs = ["doc._.x.y"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.18μs -> 8.56μs (7.17% faster)


def test_attribute_not_in_class():
    # Attribute is not present in Doc or Token class
    attrs = ["doc.invalidattr"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.86μs -> 8.93μs (10.4% faster)


def test_multiple_errors_in_list():
    # Multiple invalid attributes, should raise on first error found
    attrs = ["doc.text", "doc.invalidattr", "token.pos_"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 12.7μs -> 11.8μs (7.99% faster)


def test_span_extension_attribute_with_too_many_dots():
    # Extension attribute with too many dots
    attrs = ["span._.myext.sub"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 10.9μs -> 10.1μs (8.45% faster)
# 3. Large Scale Test Cases

def test_large_number_of_custom_extensions():
    # Test with many valid custom extension attributes
    attrs = [f"doc._.ext{i}" for i in range(1000)]
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 544μs -> 405μs (34.4% faster)


def test_large_number_of_invalid_attributes():
    # Test with many invalid attributes (all should fail)
    attrs = [f"doc.invalid{i}" for i in range(1000)]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 458μs -> 350μs (30.8% faster)


def test_large_mixed_valid_and_invalid():
    # Mix of valid and invalid attributes
    attrs = ["doc.text"] * 500 + ["token.pos"] * 499 + ["doc.invalidattr"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 27.1μs -> 27.9μs (3.05% slower)


def test_large_number_of_span_non_extension():
    # Large number of invalid span attributes (non-extension)
    attrs = [f"span.text{i}" for i in range(100)]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 68.3μs -> 56.2μs (21.6% faster)
    for i in range(100):
        pass


def test_large_number_of_extension_attribute_with_too_many_dots():
    # Large number of extension attributes with too many dots
    attrs = [f"doc._.ext{i}.sub" for i in range(100)]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 79.0μs -> 62.1μs (27.3% faster)
# 4. Miscellaneous/Additional Edge Cases

def test_empty_input():
    # Empty input should be valid and return empty list
    attrs = []
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 2.03μs -> 2.04μs (0.489% slower)


def test_case_insensitivity():
    # Attribute keys should be treated case-insensitively
    attrs = ["DOC.TEXT", "Token.Lemma", "SPAN._.MyExt"]
    codeflash_output = validate_attrs(attrs); result = codeflash_output  # 10.4μs -> 9.30μs (12.2% faster)


def test_attribute_with_leading_dot():
    # Attribute with leading dot is invalid
    attrs = [".doc.text"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.74μs -> 8.62μs (12.9% faster)


def test_attribute_with_multiple_consecutive_dots():
    # Attribute with multiple consecutive dots is invalid
    attrs = ["doc..text"]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 9.40μs -> 9.04μs (4.10% faster)


def test_attribute_with_trailing_dot():
    # Attribute with trailing dot is invalid
    attrs = ["doc.text."]
    with pytest.raises(ValueError) as e:
        validate_attrs(attrs)  # 8.95μs -> 8.85μs (1.05% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
To edit these changes, git checkout codeflash/optimize-validate_attrs-mhwqezei and push.