⚡️ Speed up function calculate_accuracy by 71%
#268
📄 71% (0.71x) speedup for calculate_accuracy in unstructured/metrics/text_extraction.py
⏱️ Runtime: 6.44 milliseconds → 3.76 milliseconds (best of 132 runs)
📝 Explanation and details
The optimized code achieves a 71% speedup through three key optimizations that reduce redundant work in the common case.

What Changed

- Module-level constant for validation (_RETURN_TYPES): the allowed return types now live in a module-level tuple instead of a new list created on every function call.
- Conditional Unicode quote standardization: str.isascii() checks now gate calls to standardize_quotes(). This expensive Unicode replacement (which iterates through ~40 quote mappings) is skipped when a string contains only ASCII characters.
- Early equality check: after string preparation, a fast-path check (if output == source) returns the result immediately without calling the expensive Levenshtein.distance() calculation.
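Putting the three changes together, the optimized function likely has roughly this shape. This is a sketch for illustration only: the helper implementations, the return-type names, and the scoring formula below are assumptions, not the actual unstructured code (which uses the C-backed Levenshtein package and a larger quote map).

```python
# Sketch of the optimized structure. _QUOTE_MAP, the return-type names,
# and the score formula are illustrative assumptions.
_RETURN_TYPES = ("score", "distance")  # module-level tuple, built once

_QUOTE_MAP = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def standardize_quotes(text: str) -> str:
    # Stand-in for the real ~40-entry Unicode quote replacement.
    for fancy, plain in _QUOTE_MAP.items():
        text = text.replace(fancy, plain)
    return text

def _levenshtein(a: str, b: str) -> int:
    # Pure-Python stand-in for Levenshtein.distance().
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def calculate_accuracy(output: str, source: str, return_type: str = "score"):
    if return_type not in _RETURN_TYPES:          # no list allocated per call
        raise ValueError(f"return_type must be one of {_RETURN_TYPES}")
    if not output.isascii():                      # skip quote mapping for ASCII
        output = standardize_quotes(output)
    if not source.isascii():
        source = standardize_quotes(source)
    if output == source:                          # fast path: skip the distance
        return 1.0 if return_type == "score" else 0
    distance = _levenshtein(output, source)
    if return_type == "distance":
        return distance
    return 1.0 - distance / max(len(output), len(source))
```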
Why It's Faster

- ASCII check optimization: the line profiler shows standardize_quotes() consumed ~76% of runtime in the original (12.6 ms of 16.6 ms total). Because str.isascii() is a fast C-level operation, the optimization skips this expensive Unicode processing in most cases: only 3 of 71 function calls (4%) in the test suite actually needed quote standardization.
- Early equality shortcut: when strings are identical after preprocessing (33 of 71 calls, 46% of test cases), the optimized version returns immediately without computing the Levenshtein distance (originally ~21.5% of runtime). The profiler confirms these 33 cases now exit early, avoiding the distance calculation entirely.
- Validation overhead elimination: while small (0.4% of runtime), removing the per-call list allocation adds up, especially since the function_references show this function is called from _process_document(), which processes multiple documents in evaluation workloads.
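The payoff of the two guards can be illustrated with a small micro-benchmark. The quote map and both helpers below are simplified stand-ins for the real code, not the actual unstructured implementation:

```python
import timeit

# Simplified stand-in for the ~40-entry quote mapping the real
# standardize_quotes() iterates over.
_QUOTE_MAP = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}

def standardize_quotes(text: str) -> str:
    for fancy, plain in _QUOTE_MAP.items():
        text = text.replace(fancy, plain)
    return text

def levenshtein(a: str, b: str) -> int:
    # Pure-Python stand-in for the C-backed Levenshtein.distance().
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

text = "plain ASCII text extracted from a document. " * 10

# Guard 1: skip quote standardization when the string is ASCII-only.
always = timeit.timeit(lambda: standardize_quotes(text), number=2_000)
guarded = timeit.timeit(
    lambda: text if text.isascii() else standardize_quotes(text),
    number=2_000,
)

# Guard 2: skip the distance computation when the strings are equal.
full = timeit.timeit(lambda: levenshtein(text, text), number=5)
shortcut = timeit.timeit(
    lambda: 0 if text == text else levenshtein(text, text), number=5
)

print(f"quote guard:    {always:.4f}s -> {guarded:.4f}s")
print(f"equality guard: {full:.4f}s -> {shortcut:.4f}s")
```

On typical ASCII-heavy extraction output, both guarded paths reduce to a cheap C-level check, which matches the report's observation that most calls never needed the expensive work in the first place.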
Impact on Workloads

Based on the function_references, calculate_accuracy() is called from document evaluation pipelines (evaluate.py), where it compares extracted text against source documents, so the optimizations pay off directly in that context. The test results confirm this: identical-string tests show a 10-20x speedup (e.g., test_identical_strings_returns_perfect_score: 46.3 μs → 3.96 μs), while tests requiring an actual Levenshtein computation show smaller but still meaningful gains (6-15%). The document evaluation context in _process_document() indicates this function may be called repeatedly in loops, amplifying the per-call savings.

✅ Correctness verification report:
🌀 Generated Regression Tests
🔎 Concolic Coverage Tests
codeflash_concolic_xdo_puqm/tmpl285jtn6/test_concolic_coverage.py::test_calculate_accuracy

To edit these changes, run git checkout codeflash/optimize-calculate_accuracy-mks1pgs1 and push.