⚡️ Speed up function replace_mime_encodings by 22%
#256
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
📄 22% (0.22x) speedup for
replace_mime_encodingsinunstructured/cleaners/core.py⏱️ Runtime :
330 microseconds→270 microseconds(best of64runs)📝 Explanation and details
The optimized code achieves a 22% speedup by adding a single decorator:
@lru_cache(maxsize=128)to theformat_encoding_strfunction. This is a pure memoization optimization that caches the results of encoding string formatting.Why this optimization works:
Repeated encoding values: In real-world usage, applications typically use a small set of encoding strings repeatedly (e.g., "utf-8", "UTF-8", "utf_8", "iso-8859-1"). The
format_encoding_strfunction performs string operations (.lower()and.replace()) on every call, even when processing the same encoding value multiple times.Cache eliminates redundant work: With
lru_cache, after the first call with a given encoding string, subsequent calls return the cached result immediately without executing the function body. This eliminates:annotated_encodings(~24% of original function time)Evidence from profiling: The line profiler shows
format_encoding_strcalls dropped from 813,499 ns to 385,714 ns inreplace_mime_encodings(52% faster), accounting for the overall 22% speedup.Test results confirm the optimization pattern:
test_performance_repeated_operations: 33.7% → 33.6% → 26.3% faster across successive calls, with absolute times dropping from 4.49μs → 1.42μs → 1.08μs)utf_8)Impact on workloads:
This optimization is particularly effective when:
replace_mime_encodingsis called repeatedlyThe 128-entry cache size is appropriate since encoding names are a finite, small set in practice, keeping memory overhead minimal while maximizing hit rates.
✅ Correctness verification report:
⚙️ Click to see Existing Unit Tests
cleaners/test_core.py::test_replace_mime_encodingscleaners/test_core.py::test_replace_mime_encodings_works_with_different_encodingscleaners/test_core.py::test_replace_mime_encodings_works_with_right_to_left_encodings🌀 Click to see Generated Regression Tests
🔎 Click to see Concolic Coverage Tests
codeflash_concolic_xdo_puqm/tmpvb4dgkld/test_concolic_coverage.py::test_replace_mime_encodingsTo edit these changes
git checkout codeflash/optimize-replace_mime_encodings-mkrvo33aand push.