⚡️ Speed up function clean by 14%
#258
Open
📄 14% (0.14x) speedup for clean in unstructured/cleaners/core.py
⏱️ Runtime: 1.98 milliseconds → 1.74 milliseconds (best of 43 runs)
📝 Explanation and details
The optimized code achieves a 14% speedup by pre-compiling regex patterns at module level instead of passing pattern strings to re.sub on every function call. This is a classic Python optimization: it removes the per-call overhead of turning a pattern string into a compiled regex object.
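A minimal sketch of the technique described above, not the actual diff from this PR. The pattern names come from the Key Optimization list; the two function bodies and their names are hypothetical stand-ins that only illustrate the "string pattern per call" and "module-level compiled pattern" styles.

```python
import re

# Compiled once, when the module is imported (names as listed in the PR).
_WHITESPACE_CHARS_RE = re.compile(r"[\xa0\n]")
_MULTIPLE_SPACES_RE = re.compile(r"[ ]{2,}")


def clean_extra_whitespace_per_call(text: str) -> str:
    """Hypothetical 'before' style: pattern strings are resolved on every call."""
    cleaned = re.sub(r"[\xa0\n]", " ", text)
    cleaned = re.sub(r"[ ]{2,}", " ", cleaned)
    return cleaned.strip()


def clean_extra_whitespace_precompiled(text: str) -> str:
    """Hypothetical 'after' style: reuses the module-level pattern objects."""
    cleaned = _WHITESPACE_CHARS_RE.sub(" ", text)
    cleaned = _MULTIPLE_SPACES_RE.sub(" ", cleaned)
    return cleaned.strip()


sample = "ITEM 1.\xa0\xa0 BUSINESS\n\n  overview  "
result = clean_extra_whitespace_precompiled(sample)
print(result)  # prints "ITEM 1. BUSINESS overview"
```

Both styles return the same string, so the change is a pure refactor of how the pattern object is obtained.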
Key Optimization:
Three regex patterns are now compiled once at module initialization:
_WHITESPACE_CHARS_RE = re.compile(r"[\xa0\n]")
_MULTIPLE_SPACES_RE = re.compile(r"[ ]{2,}")
_DASHES_RE = re.compile(r"[-\u2013]")

Why This Works:
When you call re.sub(pattern, repl, string) with a string pattern, Python must first resolve that string to a compiled regex object before it can run the substitution. By pre-compiling, that work happens once at import time instead of on every function call. The line profiler data confirms this:
- clean_extra_whitespace: Dropped from 2.01ms to 1.25ms (38% faster) - the two regex operations now use pre-compiled patterns
- clean_dashes: Dropped from 0.94ms to 0.57ms (40% faster) - single regex now pre-compiled
- clean function: Dropped from 4.26ms to 3.20ms (25% faster in line profiler)

Test Results Show:
The optimization particularly benefits the scenarios exercised by these tests:
- test_extra_whitespace_collapsed_and_nbsp_handled, test_clean_extra_whitespace_double_spaces
- test_dashes_and_endash_replaced_by_space, test_clean_dashes_hyphen
- test_combined_flags_order_and_behavior, test_empty_string_and_only_punctuation_edge_cases
- test_large_scale_performance_and_correctness_under_limits, test_clean_maximum_consecutive_same_char

Impact on Production:
Text cleaning functions are typically called in tight loops during document processing pipelines. Even though individual calls save only microseconds, when processing thousands of documents with millions of text fragments, these savings compound significantly. The optimization is purely internal - no API changes, no behavioral differences - making it a safe performance win.
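To get a rough feel for how the per-call saving adds up in a loop, one could run a micro-benchmark along these lines. This is a sketch under assumptions: the two clean_dashes variants, the sample fragments, and the loop sizes are hypothetical and are not the benchmark that produced the numbers above. Python's re module keeps an internal cache of compiled patterns, so the gap measured here mostly reflects per-call lookup overhead and will vary by machine and Python version.

```python
import re
import timeit

# Module-level pattern, named as in the PR description.
_DASHES_RE = re.compile(r"[-\u2013]")


def clean_dashes_per_call(text: str) -> str:
    """Hypothetical variant that passes a pattern string on each call."""
    return re.sub(r"[-\u2013]", " ", text).strip()


def clean_dashes_precompiled(text: str) -> str:
    """Hypothetical variant that reuses the pre-compiled pattern."""
    return _DASHES_RE.sub(" ", text).strip()


# Made-up text fragments standing in for document elements.
fragments = ["ITEM 1A - RISK FACTORS", "2019\u20132023 results", "no dashes here"] * 1000


def run(cleaner):
    """Apply a cleaner to every fragment, as a pipeline loop would."""
    return [cleaner(fragment) for fragment in fragments]


# Take the best of a few repeats to reduce noise from other processes.
per_call_s = min(timeit.repeat(lambda: run(clean_dashes_per_call), number=5, repeat=3))
precompiled_s = min(timeit.repeat(lambda: run(clean_dashes_precompiled), number=5, repeat=3))

print(f"per-call:    {per_call_s:.4f} s")
print(f"precompiled: {precompiled_s:.4f} s")
```

Taking the minimum over repeats mirrors the "best of N runs" style of reporting used in this PR.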
✅ Correctness verification report:
⚙️ Existing Unit Tests
- cleaners/test_core.py::test_clean
- cleaners/test_core.py::test_clean_bullets
- cleaners/test_core.py::test_clean_dashes
- cleaners/test_core.py::test_clean_extra_whitespace
- cleaners/test_core.py::test_clean_trailing_punctuation

🌀 Generated Regression Tests

🔎 Concolic Coverage Tests
- codeflash_concolic_xdo_puqm/tmppw0yo7eh/test_concolic_coverage.py::test_clean
- codeflash_concolic_xdo_puqm/tmppw0yo7eh/test_concolic_coverage.py::test_clean_2

To edit these changes, git checkout codeflash/optimize-clean-mkrwcfq7 and push.