⚡️ Speed up function _format_grouping_output by 23%
#273
📄 23% (0.23x) speedup for `_format_grouping_output` in `unstructured/metrics/utils.py`
⏱️ Runtime: 26.1 milliseconds → 21.2 milliseconds (best of 113 runs)
📝 Explanation and details
The optimized code achieves a 23% speedup (26.1 ms → 21.2 ms) by adding a fast path for single DataFrame/Series inputs and by avoiding unnecessary data copies during concatenation.
Key Optimizations
Fast-path for single inputs: When only one DataFrame or Series is passed, the function now calls `reset_index()` directly instead of invoking `pd.concat()`. This avoids the overhead of pandas' concatenation machinery (index alignment, metadata merging, and internal data structure creation), all of which is unnecessary when there is only one object.
Zero-copy concatenation: For multiple DataFrames, the optimization adds `copy=False` to `pd.concat()`, which tells pandas to avoid creating unnecessary copies of the underlying data arrays when possible. This reduces both memory allocation overhead and CPU time spent copying data.
Performance Impact by Test Case
The largest improvements are in single-DataFrame cases (28-85% faster), which are a common usage pattern:
`test_single_dataframe_input`: 73.7% faster
Why This Matters
Looking at `function_references`, this function is called from `get_mean_grouping()` in a metrics evaluation pipeline. In that context:
… for field in agg_fields)
The optimizations are particularly beneficial when processing evaluation metrics repeatedly across different document types or connectors, as the cumulative time savings add up across multiple invocations.
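The single-input fast path described under Key Optimizations can be sketched roughly as follows. This is an illustration only: the PR description does not show the function body, so `format_grouping_sketch`, its variadic signature, and the column-wise `axis=1` concatenation are assumptions, not the actual code in `unstructured/metrics/utils.py`. The `copy=False` argument mentioned above is left out of the sketch because its behavior depends on the installed pandas version.

```python
import pandas as pd


def format_grouping_sketch(*frames):
    """Hypothetical stand-in for the optimized helper (name and signature assumed).

    Combines grouped results column-wise and moves the group index into a column.
    """
    if len(frames) == 1:
        # Fast path: with a single DataFrame/Series there is nothing to align
        # or merge, so reset the index directly and skip pd.concat entirely.
        return frames[0].reset_index()
    # General path: align the inputs on their index and place them side by side.
    # (The PR additionally passes copy=False here; omitted in this sketch.)
    return pd.concat(frames, axis=1).reset_index()


# Example data (made up for illustration): per-doctype aggregates.
mean = pd.DataFrame(
    {"mean": [0.5, 0.9]},
    index=pd.Index(["pdf", "html"], name="doctype"),
)
std = pd.DataFrame({"std": [0.1, 0.2]}, index=mean.index)

single = format_grouping_sketch(mean)      # takes the fast path
multi = format_grouping_sketch(mean, std)  # falls back to pd.concat
```

For a single input, the fast path should produce the same frame as `pd.concat([mean], axis=1).reset_index()`, which is why the shortcut is safe in this sketch.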
✅ Correctness verification report:
🌀 Click to see Generated Regression Tests
🔎 Click to see Concolic Coverage Tests
codeflash_concolic_xdo_puqm/tmpf_a63ivm/test_concolic_coverage.py::test__format_grouping_output
To edit these changes, `git checkout codeflash/optimize-_format_grouping_output-mks486jx` and push.