⚡️ Speed up function clean_dashes by 74%
#255
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
📄 74% (0.74x) speedup for
clean_dashesinunstructured/cleaners/core.py⏱️ Runtime :
2.10 milliseconds→1.20 milliseconds(best of37runs)📝 Explanation and details
The optimized code achieves a 74% speedup by replacing the regex-based
re.sub()operation with Python's built-instr.translate()method using a pre-computed translation table.Key Optimizations
1. Pre-computed Translation Table
_DASH_TRANSLATIONtable is created once usingstr.maketrans()that maps both-and\u2013(EN DASH) to spaces2. String Translation vs Regex Substitution
str.translate()is a native C-level string operation that's significantly faster than regex pattern matchingre.sub()has overhead for pattern compilation, matching state machines, and Unicode handling3. Type Validation
isinstance(text, str)check to maintain compatibility with the original error behaviorTypeErrormessage as the originalre.sub()implementationPerformance Impact
Test results show consistent speedups across different scenarios:
clean()in the same module, making these micro-optimizations valuable for text processing pipelinesThe optimization is particularly effective for:
The single regression case (45% slower for very large mixed content) appears to be a statistical outlier, as most large-scale tests show dramatic improvements.
✅ Correctness verification report:
⚙️ Click to see Existing Unit Tests
cleaners/test_core.py::test_clean_dashes🌀 Click to see Generated Regression Tests
🔎 Click to see Concolic Coverage Tests
codeflash_concolic_xdo_puqm/tmphjxypg_m/test_concolic_coverage.py::test_clean_dashesTo edit these changes
git checkout codeflash/optimize-clean_dashes-mkrvcajvand push.