@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 23% (0.23x) speedup for _format_grouping_output in unstructured/metrics/utils.py

⏱️ Runtime: 26.1 milliseconds → 21.2 milliseconds (best of 113 runs)

📝 Explanation and details

The optimized code achieves a **22% speedup** by adding a fast-path for single DataFrame/Series inputs and avoiding unnecessary data copies during concatenation.

## Key Optimizations

1. **Fast-path for single inputs**: When only one DataFrame or Series is passed, the function now directly calls `reset_index()` instead of invoking `pd.concat()`. This avoids the overhead of pandas' concatenation machinery, which includes index alignment, metadata merging, and internal data structure creation - all unnecessary when there's only one object.

2. **Zero-copy concatenation**: For multiple DataFrames, the optimization adds `copy=False` to `pd.concat()`, which tells pandas to avoid creating unnecessary copies of the underlying data arrays when possible. This reduces both memory allocation overhead and CPU time spent copying data.

## Performance Impact by Test Case

The optimization shows **dramatic improvements for single DataFrame cases** (28-85% faster), which represents a common usage pattern:

- Single DataFrame tests: 58-85% faster (e.g., `test_single_dataframe_input`: 73.7% faster)
- Multiple DataFrame tests: 8-18% faster (more modest but still meaningful)
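The single-input gap can be reproduced locally with an illustrative micro-benchmark (this is not the harness Codeflash uses, and absolute numbers will vary by machine and pandas version):

```python
import timeit

import pandas as pd

df = pd.DataFrame({"val": range(1_000)})

# The original path wraps even a lone DataFrame in pd.concat();
# the fast path skips straight to reset_index().
concat_path = lambda: pd.concat((df,), axis=1).reset_index()
fast_path = lambda: df.reset_index()

# Both paths must agree on the output before timing means anything.
pd.testing.assert_frame_equal(concat_path(), fast_path())

t_concat = timeit.timeit(concat_path, number=2_000)
t_fast = timeit.timeit(fast_path, number=2_000)
print(f"concat path: {t_concat:.3f}s  fast path: {t_fast:.3f}s")
```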

## Why This Matters

Looking at `function_references`, this function is called from `get_mean_grouping()` in a metrics evaluation pipeline. In that context:

- The function is called **once per aggregation field** (see the loop `for field in agg_fields`)
- For the common case of a single aggregation field, the fast-path optimization directly applies
- Even when multiple fields are aggregated, avoiding data copies reduces memory pressure in data-heavy evaluation workflows
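For context, a simplified stand-in for that call pattern. Apart from `agg_fields` and the per-field loop, every name here is illustrative; consult `get_mean_grouping()` in `unstructured/metrics/utils.py` for the real code:

```python
import pandas as pd


def mean_by_group(df: pd.DataFrame, group_by: str, agg_fields: list) -> list:
    """Illustrative sketch: one aggregation per field, index reset for output."""
    results = []
    for field in agg_fields:
        # Each iteration produces a single grouped frame, so the
        # single-input fast path (a plain reset_index) is all that's needed.
        grouped = df.groupby(group_by)[field].agg(["mean", "count"])
        results.append(grouped.reset_index())
    return results
```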

The optimizations are particularly beneficial when processing evaluation metrics repeatedly across different document types or connectors, as the cumulative time savings add up across multiple invocations.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 38 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |
### 🌀 Generated Regression Tests
import pandas as pd

# imports
from unstructured.metrics.utils import _format_grouping_output


def test_basic_two_dataframes_simple():
    # Create two simple DataFrames with the same integer index
    df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
    df2 = pd.DataFrame({"c": [5, 6]}, index=[0, 1])

    # Call the function under test
    codeflash_output = _format_grouping_output(df1, df2)
    out = codeflash_output  # 660μs -> 596μs (10.6% faster)


def test_single_dataframe_returns_reset_index():
    # Create a DataFrame with a string-based index
    df = pd.DataFrame({"x": [10, 20]}, index=["r1", "r2"])

    # Call with a single DataFrame: behavior should match df.reset_index()
    codeflash_output = _format_grouping_output(df)
    out = codeflash_output  # 462μs -> 286μs (61.5% faster)


def test_empty_dataframes_no_rows():
    # Two DataFrames with columns but no rows
    df1 = pd.DataFrame(columns=["a"])
    df2 = pd.DataFrame(columns=["b"])

    # Concatenate empty DataFrames should yield an empty DataFrame with combined columns and reset index
    codeflash_output = _format_grouping_output(df1, df2)
    out = codeflash_output  # 585μs -> 537μs (8.93% faster)


def test_non_overlapping_indices_aligns_and_fills_nans():
    # DataFrames with non-overlapping indices
    df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
    df2 = pd.DataFrame({"b": [3, 4]}, index=[10, 11])

    # Concatenate side-by-side; alignment should produce union of indices with NaNs for missing data
    codeflash_output = _format_grouping_output(df1, df2)
    out = codeflash_output  # 979μs -> 943μs (3.87% faster)


def test_duplicate_column_names_preserved_and_accessible():
    # Two DataFrames with the same column name 'a'
    df1 = pd.DataFrame({"a": [1]})
    df2 = pd.DataFrame({"a": [2]})

    # After concatenation duplicate column labels should be preserved (no automatic suffixing here)
    codeflash_output = _format_grouping_output(df1, df2)
    out = codeflash_output  # 628μs -> 586μs (7.11% faster)


def test_multiindex_columns_preserved():
    # DataFrame with MultiIndex columns
    cols1 = pd.MultiIndex.from_tuples([("g", "x"), ("g", "y")])
    df1 = pd.DataFrame([[1, 2]], columns=cols1)

    # Another DataFrame with a different MultiIndex column
    cols2 = pd.MultiIndex.from_tuples([("h", "z")])
    df2 = pd.DataFrame([[3]], columns=cols2)

    # Concatenate and reset index
    codeflash_output = _format_grouping_output(df1, df2)
    out = codeflash_output  # 1.40ms -> 1.31ms (7.22% faster)


def test_mixed_dtypes_preserved_after_concat_and_reset():
    # DataFrame with integers and strings
    df1 = pd.DataFrame({"int": [1, 2], "str": ["a", "b"]}, index=[0, 1])

    # DataFrame with datetimes
    df2 = pd.DataFrame(
        {"dt": [pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-02")]}, index=[0, 1]
    )

    codeflash_output = _format_grouping_output(df1, df2)
    out = codeflash_output  # 604μs -> 544μs (11.0% faster)


def test_large_scale_three_dataframes_250_rows_each():
    # Create three DataFrames each with 250 rows (total cell count 250 * 3 = 750 < 1000)
    rows = 250
    df1 = pd.DataFrame({"a": list(range(rows))}, index=list(range(rows)))
    df2 = pd.DataFrame({"b": list(range(rows))}, index=list(range(rows)))
    df3 = pd.DataFrame({"c": list(range(rows))}, index=list(range(rows)))

    # Run the function; this verifies scalability for moderate sizes
    codeflash_output = _format_grouping_output(df1, df2, df3)
    out = codeflash_output  # 733μs -> 651μs (12.6% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd

from unstructured.metrics.utils import _format_grouping_output


class TestFormatGroupingOutputBasic:
    """Basic test cases for _format_grouping_output function"""

    def test_single_dataframe_input(self):
        """Test with a single DataFrame - should reset index and return it"""
        # Create a simple DataFrame with a non-default index
        df = pd.DataFrame({"col1": [1, 2, 3]}, index=[10, 20, 30])

        # Call the function with single DataFrame
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 475μs -> 273μs (73.7% faster)

    def test_two_dataframes_concatenation(self):
        """Test concatenating two DataFrames side-by-side"""
        # Create two DataFrames with matching lengths
        df1 = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
        df2 = pd.DataFrame({"B": [4, 5, 6]}, index=[0, 1, 2])

        # Concatenate both DataFrames
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 657μs -> 594μs (10.6% faster)

    def test_three_dataframes_concatenation(self):
        """Test concatenating three DataFrames side-by-side"""
        # Create three DataFrames with identical indices
        df1 = pd.DataFrame({"X": [10, 20]})
        df2 = pd.DataFrame({"Y": [30, 40]})
        df3 = pd.DataFrame({"Z": [50, 60]})

        # Concatenate all three
        codeflash_output = _format_grouping_output(df1, df2, df3)
        result = codeflash_output  # 638μs -> 585μs (8.91% faster)

    def test_dataframes_with_string_columns(self):
        """Test concatenation with string data"""
        # Create DataFrames with string values
        df1 = pd.DataFrame({"name": ["Alice", "Bob"]})
        df2 = pd.DataFrame({"city": ["NYC", "LA"]})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 592μs -> 542μs (9.29% faster)

    def test_dataframes_with_mixed_dtypes(self):
        """Test concatenation with mixed data types"""
        # Create DataFrames with different data types
        df1 = pd.DataFrame({"int_col": [1, 2]})
        df2 = pd.DataFrame({"float_col": [1.5, 2.5]})
        df3 = pd.DataFrame({"str_col": ["a", "b"]})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2, df3)
        result = codeflash_output  # 527μs -> 480μs (9.96% faster)

    def test_dataframes_with_nan_values(self):
        """Test concatenation with NaN values"""
        # Create DataFrames containing NaN values
        df1 = pd.DataFrame({"col1": [1.0, float("nan"), 3.0]})
        df2 = pd.DataFrame({"col2": [4.0, 5.0, float("nan")]})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 599μs -> 552μs (8.55% faster)


class TestFormatGroupingOutputEdgeCases:
    """Edge case tests for _format_grouping_output function"""

    def test_empty_dataframe(self):
        """Test with an empty DataFrame"""
        # Create an empty DataFrame
        df = pd.DataFrame()

        # Concatenate empty DataFrame
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 395μs -> 307μs (28.5% faster)

    def test_dataframe_with_single_row(self):
        """Test with a DataFrame containing only one row"""
        # Create a DataFrame with one row
        df = pd.DataFrame({"col": [42]}, index=[999])

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 462μs -> 272μs (69.8% faster)

    def test_dataframe_with_single_column(self):
        """Test with a DataFrame containing only one column"""
        # Create a single-column DataFrame
        df = pd.DataFrame({"only": [1, 2, 3]})

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 441μs -> 279μs (58.0% faster)

    def test_dataframe_with_negative_index(self):
        """Test with DataFrame having negative index values"""
        # Create DataFrame with negative index
        df = pd.DataFrame({"val": [10, 20]}, index=[-5, -3])

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 471μs -> 273μs (72.7% faster)

    def test_dataframe_with_string_index(self):
        """Test with DataFrame having string index"""
        # Create DataFrame with string index
        df = pd.DataFrame({"data": [100, 200]}, index=["row_a", "row_b"])

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 462μs -> 290μs (59.2% faster)

    def test_dataframe_with_duplicate_index(self):
        """Test with DataFrame having duplicate index values"""
        # Create DataFrame with duplicate index
        df1 = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 0])
        df2 = pd.DataFrame({"B": [4, 5, 6]}, index=[0, 1, 0])

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 646μs -> 598μs (8.02% faster)

    def test_dataframes_with_multiindex_columns(self):
        """Test with DataFrames having MultiIndex columns"""
        # Create DataFrame with MultiIndex columns
        df1 = pd.DataFrame(
            [[1, 2], [3, 4]], columns=pd.MultiIndex.from_tuples([("A", "x"), ("A", "y")])
        )
        df2 = pd.DataFrame(
            [[5, 6], [7, 8]], columns=pd.MultiIndex.from_tuples([("B", "x"), ("B", "y")])
        )

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 1.30ms -> 1.19ms (9.46% faster)

    def test_dataframe_with_large_index_values(self):
        """Test with DataFrame having very large index values"""
        # Create DataFrame with large index values
        df = pd.DataFrame({"val": [1, 2]}, index=[10**15, 10**15 + 1])

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 466μs -> 273μs (70.6% faster)

    def test_dataframe_with_float_index(self):
        """Test with DataFrame having float index"""
        # Create DataFrame with float index
        df = pd.DataFrame({"val": [1, 2]}, index=[1.5, 2.7])

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 504μs -> 272μs (85.3% faster)

    def test_concatenate_many_dataframes(self):
        """Test concatenating many DataFrames (10 DataFrames)"""
        # Create 10 DataFrames
        dfs = [pd.DataFrame({f"col{i}": [i, i + 1]}) for i in range(10)]

        # Concatenate all
        codeflash_output = _format_grouping_output(*dfs)
        result = codeflash_output  # 885μs -> 747μs (18.5% faster)

    def test_dataframe_with_special_float_values(self):
        """Test with special float values (inf, -inf)"""
        # Create DataFrame with special float values
        df1 = pd.DataFrame({"pos_inf": [float("inf"), 1.0]})
        df2 = pd.DataFrame({"neg_inf": [float("-inf"), 2.0]})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 598μs -> 558μs (7.13% faster)

    def test_dataframe_with_boolean_values(self):
        """Test with boolean values in DataFrame"""
        # Create DataFrames with boolean values
        df1 = pd.DataFrame({"bool_col": [True, False, True]})
        df2 = pd.DataFrame({"bool_col2": [False, True, False]})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 588μs -> 543μs (8.21% faster)

    def test_dataframe_with_datetime_index(self):
        """Test with DatetimeIndex"""
        # Create DataFrame with datetime index
        dates = pd.date_range("2023-01-01", periods=3)
        df = pd.DataFrame({"value": [1, 2, 3]}, index=dates)

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 492μs -> 291μs (69.4% faster)


class TestFormatGroupingOutputLargeScale:
    """Large scale test cases for _format_grouping_output function"""

    def test_large_single_dataframe(self):
        """Test with a large DataFrame (500 rows, 50 columns)"""
        # Create a large DataFrame
        df = pd.DataFrame({f"col{i}": range(i, i + 500) for i in range(50)})

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 510μs -> 312μs (63.5% faster)

    def test_multiple_large_dataframes(self):
        """Test concatenating 5 large DataFrames (300 rows each, 20 columns each)"""
        # Create 5 large DataFrames
        dfs = [pd.DataFrame({f"df{i}_col{j}": range(300) for j in range(20)}) for i in range(5)]

        # Concatenate all
        codeflash_output = _format_grouping_output(*dfs)
        result = codeflash_output  # 872μs -> 751μs (16.1% faster)

    def test_large_dataframe_with_many_rows(self):
        """Test with a DataFrame containing 800 rows"""
        # Create a DataFrame with many rows
        df = pd.DataFrame(
            {
                "col1": range(800),
                "col2": range(800, 1600),
                "col3": [float(i) * 1.5 for i in range(800)],
            }
        )

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 483μs -> 297μs (62.5% faster)

    def test_wide_dataframe(self):
        """Test with a very wide DataFrame (1 row, 100 columns)"""
        # Create a very wide DataFrame
        df = pd.DataFrame({f"col{i}": [i] for i in range(100)})

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 476μs -> 302μs (57.4% faster)

    def test_many_small_dataframes(self):
        """Test concatenating 50 small DataFrames (10 rows each, 1 column each)"""
        # Create 50 small DataFrames
        dfs = [pd.DataFrame({f"col{i}": range(10)}) for i in range(50)]

        # Concatenate all
        codeflash_output = _format_grouping_output(*dfs)
        result = codeflash_output  # 2.21ms -> 1.60ms (37.9% faster)

    def test_large_dataframe_all_unique_values(self):
        """Test with a large DataFrame containing all unique values"""
        # Create a large DataFrame with unique values
        import numpy as np

        np.random.seed(42)
        df1 = pd.DataFrame({"col1": np.random.rand(500)})
        df2 = pd.DataFrame({"col2": np.random.rand(500)})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 604μs -> 559μs (7.99% faster)

    def test_large_dataframe_with_repeated_values(self):
        """Test with a large DataFrame containing repeated values"""
        # Create a large DataFrame with repeated values
        df1 = pd.DataFrame({"col1": [1, 2, 3] * 267})  # 801 rows
        df2 = pd.DataFrame({"col2": ["a", "b", "c"] * 267})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 512μs -> 472μs (8.54% faster)

    def test_large_numeric_range_dataframe(self):
        """Test with a large DataFrame containing a wide numeric range"""
        # Create a DataFrame with a wide numeric range
        df1 = pd.DataFrame({"small": [0.00001] * 400})
        df2 = pd.DataFrame({"large": [1000000.0] * 400})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 609μs -> 561μs (8.50% faster)

    def test_large_string_dataframe(self):
        """Test with a large DataFrame containing long strings"""
        # Create a DataFrame with long strings
        long_string = "a" * 1000
        df = pd.DataFrame({"short": ["x"] * 300, "long": [long_string] * 300})

        # Process it
        codeflash_output = _format_grouping_output(df)
        result = codeflash_output  # 455μs -> 285μs (59.2% faster)

    def test_large_categorical_dataframe(self):
        """Test with a large DataFrame containing categorical data"""
        # Create a DataFrame with categorical values
        categories = ["cat_a", "cat_b", "cat_c", "cat_d"]
        df1 = pd.DataFrame({"cat": pd.Categorical([categories[i % 4] for i in range(500)])})
        df2 = pd.DataFrame({"values": range(500)})

        # Concatenate
        codeflash_output = _format_grouping_output(df1, df2)
        result = codeflash_output  # 553μs -> 512μs (8.03% faster)

    def test_performance_with_accumulated_dataframes(self):
        """Test performance with progressively concatenating many DataFrames"""
        # Create multiple DataFrames and concatenate them progressively
        dfs = [pd.DataFrame({"col": range(10)}) for _ in range(30)]

        # Concatenate all at once
        codeflash_output = _format_grouping_output(*dfs)
        result = codeflash_output  # 1.54ms -> 1.19ms (30.2% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest

from unstructured.metrics.utils import _format_grouping_output


def test__format_grouping_output():
    with pytest.raises(ValueError, match="No\\ objects\\ to\\ concatenate"):
        _format_grouping_output()
### 🔎 Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `codeflash_concolic_xdo_puqm/tmpf_a63ivm/test_concolic_coverage.py::test__format_grouping_output` | 12.6μs | 12.4μs | 1.85% ✅ |

To edit these changes, run `git checkout codeflash/optimize-_format_grouping_output-mks486jx` and push.


@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 09:36
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 24, 2026
