@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 138% (1.38x) speedup for _display in unstructured/metrics/utils.py

⏱️ Runtime: 63.7 milliseconds → 26.7 milliseconds (best of 71 runs)
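The headline percentage can be reproduced from the two runtimes. The sketch below assumes the figure is computed as time saved relative to the optimized runtime; that formula is an assumption inferred from the numbers, not something stated in this comment.

```python
# Runtimes reported above, in milliseconds.
original_ms = 63.7
optimized_ms = 26.7

# Assumed definition: time saved divided by the optimized runtime.
relative_gain = (original_ms - optimized_ms) / optimized_ms

# Plain ratio of old runtime to new runtime, for comparison.
runtime_ratio = original_ms / optimized_ms

print(f"relative gain: {relative_gain:.2%}")
print(f"runtime ratio: {runtime_ratio:.2f}x")
```

Under this reading, "138% (1.38x)" describes the relative gain, while the optimized function runs at roughly 2.4 times the original speed.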

📝 Explanation and details

The optimized code achieves a 138% speedup (from 63.7ms to 26.7ms) by eliminating the primary performance bottleneck: pandas' df.iterrows() method, which creates expensive Series objects for each row.

Key Optimizations

1. Eliminated df.iterrows() overhead (53.3% → 0.6% of runtime)

  • Original: df.iterrows() consumed 136ms creating temporary Series objects
  • Optimized: Direct list indexing (col_values[j][row_idx]) reduced this to 0.6ms
  • This single change accounts for most of the speedup
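A minimal sketch of the two access patterns follows. It is not the PR's actual diff; the sample DataFrame is hypothetical, and only `col_values`, `headers`, and `row_idx` mirror names quoted above.

```python
import pandas as pd

# Hypothetical sample data, not taken from the PR.
df = pd.DataFrame({"metric": ["accuracy", "f1"], "value": [0.95, 0.9]})
headers = df.columns.tolist()

# Row-wise access: iterrows() builds a new Series object for every row.
rows_via_iterrows = [[row[h] for h in headers] for _, row in df.iterrows()]

# Column-wise access: convert each column to a plain list once,
# then read cells with ordinary list indexing.
col_values = [df[h].tolist() for h in headers]
rows_via_lists = [
    [col_values[j][row_idx] for j in range(len(headers))]
    for row_idx in range(len(df))
]

print(rows_via_lists)
```

Both loops yield the same cell values; the second avoids per-row Series construction.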

2. Pre-computed string representations

  • Collects all column data once: col_values = [df[header].tolist() for header in headers]
  • Pre-converts to strings: col_strs = [[str(item) for item in col] for col in col_values]
  • For non-float values, reuses cached strings instead of calling str() repeatedly
  • Floats are still formatted on-demand with f"{item:.3f}" to maintain precision
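The caching described above can be sketched as follows. The helper `format_cell` and the sample data are hypothetical, introduced only for illustration; `col_values` and `col_strs` follow the expressions quoted in the list.

```python
import pandas as pd

# Hypothetical sample data, not taken from the PR.
df = pd.DataFrame({"name": ["a", "b"], "count": [1, 22], "score": [0.5, 0.12345]})
headers = df.columns.tolist()

# Collect column data once, then cache a string for every cell.
col_values = [df[h].tolist() for h in headers]
col_strs = [[str(item) for item in col] for col in col_values]


def format_cell(j: int, row_idx: int) -> str:
    """Hypothetical helper: format floats on demand, reuse cached strings otherwise."""
    item = col_values[j][row_idx]
    if isinstance(item, float):
        return f"{item:.3f}"
    return col_strs[j][row_idx]


formatted_rows = [
    [format_cell(j, i) for j in range(len(headers))] for i in range(len(df))
]
print(formatted_rows)
```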

3. Reduced column width calculation overhead

  • Uses pre-computed col_strs instead of calling str() for every item during width calculation
  • Time reduced from 20.8ms to 2.9ms
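As a rough illustration, widths can be derived from the cached strings alone. The width, header, and separator formulas below follow the expectations computed in the generated regression tests; the sample `headers` and `col_strs` values are hypothetical.

```python
# Hypothetical cached strings, one inner list per column.
headers = ["model", "accuracy"]
col_strs = [["model_a", "b"], ["0.95", "0.9"]]

# Each column is as wide as its longest entry, header included.
col_widths = [
    max(len(header), max(len(s) for s in strs))
    for header, strs in zip(headers, col_strs)
]

# Left-justified headers joined by single spaces, with a dash rule of equal length.
header_line = " ".join(h.ljust(w) for h, w in zip(headers, col_widths))
separator_line = "-" * (sum(col_widths) + len(headers) - 1)

print(header_line)
print(separator_line)
```

Because `col_strs` already holds every string, no further `str()` calls are needed while measuring.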

Performance Impact by Workload

Based on function references, _display() is called from calculate() to show aggregated metrics after document processing. The optimization benefits are most significant when:

  • Many rows (200-500+): Test results show 181-354% speedup for large DataFrames, making metric reporting substantially faster
  • Moderate columns (10-20): Overhead reduction scales well with column count
  • Float-heavy data: Cached strings cover the non-float values, while floats are still formatted on demand to 3 decimal places

The optimization preserves exact output formatting (3-decimal float precision, column alignment) while cutting runtime by more than half (63.7ms to 26.7ms). This is particularly valuable when displaying evaluation results for batch document-processing operations.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 48 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests
import pandas as pd  # used to construct real DataFrame objects required by _display

# imports
from unstructured.metrics.utils import _display


def test_empty_dataframe_no_output_and_returns_none(capsys):
    """
    Edge case: DataFrame with zero rows should produce no output and return immediately.
    We construct a DataFrame with columns but no rows. len(df) == 0 triggers the early return.
    """
    df = pd.DataFrame(columns=["A", "B"])  # has columns but zero rows
    codeflash_output = _display(df)
    result = codeflash_output  # should return None and not raise
    captured = capsys.readouterr()  # capture stdout/stderr produced by click.echo


def test_basic_single_row_output_structure_and_values(capsys):
    """
    Basic functionality: small DataFrame with one row of mixed types.
    Validate number of output lines (header, separator, one data row) and that
    float values are rounded to 3 decimal places.
    """
    df = pd.DataFrame([{"Name": "Alice", "Score": 0.5}])  # single-row DF
    _display(df)
    out = capsys.readouterr().out.splitlines()  # split into individual printed lines

    header_line = out[0]
    separator_line = out[1]
    data_line = out[2]


def test_integer_and_float_formatting_and_rounding(capsys):
    """
    Edge: Ensure integers are not formatted as floats (no trailing decimals)
    and floats are rounded to 3 decimal places (including carry).
    """
    df = pd.DataFrame([{"Count": 42, "Fraction": 0.9999}])
    _display(df)
    out = capsys.readouterr().out.splitlines()
    row = out[2]


def test_none_and_nan_values_and_column_widths_exactness(capsys):
    """
    Edge: Test handling of None (prints 'None') and NaN (prints 'nan') and ensure
    column widths are computed based on the stringified values (including None).
    We compute expected header string the same way as the function and compare exact output.
    """
    df = pd.DataFrame(
        [
            {"Col1": "x", "Col2": None},
            {"Col1": "longer", "Col2": float("nan")},
        ]
    )
    # Pre-compute expected header and column widths exactly as the function does
    headers = df.columns.tolist()
    col_widths = [
        max(len(header), max(len(str(item)) for item in df[header])) for header in headers
    ]
    expected_header = " ".join(header.ljust(col_widths[i]) for i, header in enumerate(headers))
    expected_separator = "-" * sum(col_widths) + "-" * (len(headers) - 1)

    _display(df)
    out_lines = capsys.readouterr().out.splitlines()

    # Data rows: ensure 'None' and 'nan' string representations appear somewhere in the output
    flattened_output = "\n".join(out_lines[2:])


def test_alignment_and_separator_length_consistency(capsys):
    """
    Basic/Edge: Test alignment logic when headers are longer than values and vice versa.
    Confirm separator length equals header line length which ensures consistent table formatting.
    """
    df = pd.DataFrame([{"LongHeader": "v", "S": "value"}])
    # Compute expected widths same as implementation
    headers = df.columns.tolist()
    col_widths = [
        max(len(header), max(len(str(item)) for item in df[header])) for header in headers
    ]
    expected_header = " ".join(header.ljust(col_widths[i]) for i, header in enumerate(headers))

    _display(df)
    out = capsys.readouterr().out.splitlines()
    header_line = out[0]
    separator_line = out[1]


def test_large_scale_many_rows_output_line_count_and_content(capsys):
    """
    Large-scale: test with a relatively large DataFrame (300 rows, 3 columns).
    Ensures the function scales to many rows within the requested limits and
    that the number of output lines matches expectation.
    This stays under the 1000-element per-dimension constraint: 300 rows * 3 cols = 900 elements.
    """
    n = 300  # keep under instructed upper bounds for loop/size
    df = pd.DataFrame([{"c1": i, "c2": float(i) / 7.0, "c3": f"s{i}"} for i in range(n)])

    _display(df)
    out_lines = capsys.readouterr().out.splitlines()

    # Spot-check formatting of a float in the middle row (should be rounded to 3 decimals)
    mid_index = 150
    # compute expected formatted float with 3 decimals
    expected_mid_float = f"{(mid_index / 7.0):.3f}"


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unittest.mock import patch

import pandas as pd

from unstructured.metrics.utils import _display


class TestDisplayBasic:
    """Basic test cases for _display function with normal conditions."""

    def test_empty_dataframe_returns_none(self):
        """Test that an empty DataFrame produces no output."""
        # Create an empty DataFrame with columns
        df = pd.DataFrame(columns=["metric", "value"])

        # Capture output and verify nothing is printed
        with patch("click.echo") as mock_echo:
            codeflash_output = _display(df)
            result = codeflash_output  # 3.81μs -> 3.22μs (18.3% faster)
            mock_echo.assert_not_called()

    def test_single_row_single_column(self):
        """Test display of a DataFrame with one row and one column."""
        # Create a simple DataFrame with one metric
        df = pd.DataFrame({"metric": ["accuracy"]})

        # Capture all output calls
        with patch("click.echo") as mock_echo:
            _display(df)  # 457μs -> 347μs (31.7% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_single_row_multiple_columns(self):
        """Test display of a DataFrame with one row and multiple columns."""
        # Create DataFrame with multiple metrics
        df = pd.DataFrame({"metric": ["accuracy"], "value": [0.95]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 534μs -> 390μs (37.0% faster)

    def test_multiple_rows_single_column(self):
        """Test display of a DataFrame with multiple rows and one column."""
        # Create DataFrame with multiple rows
        df = pd.DataFrame({"name": ["alice", "bob", "charlie"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 585μs -> 361μs (62.1% faster)

    def test_multiple_rows_multiple_columns(self):
        """Test display of a DataFrame with multiple rows and columns."""
        # Create a standard metrics table
        df = pd.DataFrame(
            {"model": ["model_a", "model_b"], "accuracy": [0.95, 0.92], "loss": [0.1, 0.15]}
        )

        with patch("click.echo") as mock_echo:
            _display(df)  # 641μs -> 423μs (51.6% faster)

    def test_float_formatting_to_three_decimals(self):
        """Test that float values are formatted to exactly 3 decimal places."""
        # Create DataFrame with float values
        df = pd.DataFrame({"metric": ["f1"], "score": [0.123456789]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 525μs -> 385μs (36.3% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_string_values_preserved(self):
        """Test that string values are preserved exactly as they are."""
        # Create DataFrame with string values
        df = pd.DataFrame({"name": ["test_value"], "type": ["string_type"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 491μs -> 380μs (29.4% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_column_alignment_with_headers(self):
        """Test that columns are properly aligned based on header widths."""
        # Create DataFrame where header is longer than data
        df = pd.DataFrame({"very_long_header": ["short"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 448μs -> 347μs (29.2% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_column_alignment_with_data(self):
        """Test that columns are properly aligned when data is longer than headers."""
        # Create DataFrame where data is longer than header
        df = pd.DataFrame({"x": ["very_long_value"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 455μs -> 347μs (31.1% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]


class TestDisplayEdgeCases:
    """Edge case tests for _display function."""

    def test_zero_values(self):
        """Test handling of zero float values."""
        # Create DataFrame with zero values
        df = pd.DataFrame({"metric": ["zero"], "value": [0.0]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 526μs -> 382μs (37.5% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_negative_float_values(self):
        """Test handling of negative float values."""
        # Create DataFrame with negative floats
        df = pd.DataFrame({"metric": ["negative"], "value": [-0.5]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 540μs -> 387μs (39.3% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_very_small_float_values(self):
        """Test handling of very small float values near zero."""
        # Create DataFrame with very small float
        df = pd.DataFrame({"metric": ["tiny"], "value": [0.00001]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 540μs -> 390μs (38.5% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_very_large_float_values(self):
        """Test handling of very large float values."""
        # Create DataFrame with large float
        df = pd.DataFrame({"metric": ["large"], "value": [1234567.89]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 541μs -> 386μs (40.3% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_scientific_notation_float(self):
        """Test handling of floats in scientific notation."""
        # Create DataFrame with scientific notation value
        df = pd.DataFrame({"metric": ["sci"], "value": [1e-5]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 533μs -> 384μs (38.8% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_nan_values(self):
        """Test handling of NaN values."""
        # Create DataFrame with NaN
        df = pd.DataFrame({"metric": ["nan_test"], "value": [float("nan")]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 530μs -> 383μs (38.3% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_inf_values(self):
        """Test handling of infinity values."""
        # Create DataFrame with positive infinity
        df = pd.DataFrame({"metric": ["inf"], "value": [float("inf")]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 529μs -> 381μs (38.9% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_negative_inf_values(self):
        """Test handling of negative infinity values."""
        # Create DataFrame with negative infinity
        df = pd.DataFrame({"metric": ["neginf"], "value": [float("-inf")]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 532μs -> 381μs (39.8% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_integer_values(self):
        """Test handling of integer values."""
        # Create DataFrame with integers
        df = pd.DataFrame({"metric": ["int"], "count": [42]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 523μs -> 377μs (38.5% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_empty_string_values(self):
        """Test handling of empty string values."""
        # Create DataFrame with empty strings
        df = pd.DataFrame({"col1": [""], "col2": ["data"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 485μs -> 380μs (27.7% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_whitespace_only_strings(self):
        """Test handling of strings with only whitespace."""
        # Create DataFrame with whitespace-only strings
        df = pd.DataFrame({"metric": ["  "], "value": ["test"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 484μs -> 381μs (26.9% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_unicode_characters_in_strings(self):
        """Test handling of unicode characters."""
        # Create DataFrame with unicode characters
        df = pd.DataFrame({"metric": ["\u03b1\u03b2\u03b3"], "value": ["test"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 487μs -> 379μs (28.5% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_special_characters_in_strings(self):
        """Test handling of special characters such as tabs."""
        # Create DataFrame with special characters
        df = pd.DataFrame({"metric": ["test\tvalue"], "col": ["data"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 483μs -> 375μs (28.7% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_single_column_single_value(self):
        """Test minimal DataFrame with one column and one value."""
        # Create minimal DataFrame
        df = pd.DataFrame({"x": [1]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 425μs -> 345μs (23.1% faster)

    def test_many_columns_few_rows(self):
        """Test DataFrame with many columns but few rows."""
        # Create DataFrame with many columns
        df = pd.DataFrame({f"col{i}": [i] for i in range(20)})

        with patch("click.echo") as mock_echo:
            _display(df)  # 907μs -> 766μs (18.4% faster)

    def test_boolean_values(self):
        """Test handling of boolean values."""
        # Create DataFrame with booleans
        df = pd.DataFrame({"flag": [True, False]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 482μs -> 352μs (36.7% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_none_values(self):
        """Test handling of None values."""
        # Create DataFrame with None
        df = pd.DataFrame({"metric": ["none_test"], "value": [None]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 486μs -> 377μs (28.9% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_mixed_types_in_column(self):
        """Test DataFrame with mixed types in the same column."""
        # Create DataFrame with mixed types
        df = pd.DataFrame({"mixed": [1, "string", 0.5]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 597μs -> 367μs (62.7% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]


class TestDisplaySeparatorLine:
    """Tests for the separator line formatting."""

    def test_separator_line_length_single_column(self):
        """Test that separator line has correct length for single column."""
        # Create simple DataFrame
        df = pd.DataFrame({"abc": ["def"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 452μs -> 345μs (31.0% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]
            # Separator should have correct width
            header_line = calls[0]
            separator_line = calls[1]

    def test_separator_line_length_multiple_columns(self):
        """Test that separator line has correct length for multiple columns."""
        # Create DataFrame with multiple columns
        df = pd.DataFrame({"col1": ["a"], "col2": ["bb"], "col3": ["ccc"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 505μs -> 398μs (26.6% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]
            # Separator length should match header length
            header_line = calls[0]
            separator_line = calls[1]

    def test_separator_contains_only_dashes(self):
        """Test that separator line contains only dashes."""
        # Create DataFrame
        df = pd.DataFrame({"x": [1]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 428μs -> 342μs (25.3% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]
            separator = calls[1]


class TestDisplayLargeScale:
    """Large scale tests for _display function."""

    def test_many_rows(self):
        """Test DataFrame with many rows."""
        # Create DataFrame with 500 rows
        df = pd.DataFrame({"id": range(500), "value": [float(i) / 100 for i in range(500)]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 22.8ms -> 5.03ms (354% faster)

    def test_many_columns(self):
        """Test DataFrame with many columns."""
        # Create DataFrame with 50 columns
        df = pd.DataFrame({f"col{i}": [float(i)] for i in range(50)})

        with patch("click.echo") as mock_echo:
            _display(df)  # 1.62ms -> 1.42ms (14.0% faster)

    def test_large_string_values(self):
        """Test DataFrame with large string values."""
        # Create DataFrame with long strings
        long_string = "x" * 500
        df = pd.DataFrame({"text": [long_string]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 456μs -> 342μs (33.3% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_mixed_row_and_column_scale(self):
        """Test DataFrame with moderate rows and columns (balanced)."""
        # Create DataFrame with 200 rows and 10 columns
        df = pd.DataFrame({f"metric_{i}": [float(j) for j in range(200)] for i in range(10)})

        with patch("click.echo") as mock_echo:
            _display(df)  # 11.1ms -> 3.95ms (181% faster)

    def test_many_float_values_formatting(self):
        """Test that many float values are all formatted consistently."""
        # Create DataFrame with many float values
        df = pd.DataFrame({"values": [0.123456789, 1.987654321, 0.111111111, 99.999999999] * 50})

        with patch("click.echo") as mock_echo:
            _display(df)  # 9.16ms -> 2.13ms (331% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]
            # Count formatted floats with 3 decimal places
            float_count = sum(1 for call in calls[2:] if "." in call)


class TestDisplayColumnWidthCalculation:
    """Tests for column width calculation logic."""

    def test_width_based_on_header(self):
        """Test that column width respects header length."""
        # Header is longer than data
        df = pd.DataFrame({"very_long_header": ["x"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 452μs -> 347μs (30.1% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_width_based_on_data(self):
        """Test that column width respects data length when longer than header."""
        # Data is longer than header
        df = pd.DataFrame({"x": ["very_long_value"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 456μs -> 348μs (30.9% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_multiple_column_widths(self):
        """Test that each column has independent width calculation."""
        # Columns with different widths
        df = pd.DataFrame({"a": ["short"], "very_long_header": ["x"], "medium": ["medium_val"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 513μs -> 405μs (26.6% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]

    def test_width_with_floats(self):
        """Test column width calculation with float values."""
        # Float values have a consistent number of decimal places after formatting
        df = pd.DataFrame({"val": [1.1, 999.999, 0.001]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 533μs -> 374μs (42.5% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]


class TestDisplayOutputFormatting:
    """Tests for output formatting details."""

    def test_header_row_format(self):
        """Test that header row is properly formatted."""
        # Create simple DataFrame
        df = pd.DataFrame({"col1": ["a"], "col2": ["b"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 484μs -> 379μs (27.9% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]
            header = calls[0]

    def test_data_row_format(self):
        """Test that data rows are properly formatted."""
        # Create DataFrame with multiple columns
        df = pd.DataFrame({"col1": ["value1"], "col2": ["value2"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 482μs -> 378μs (27.5% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]
            data_row = calls[2]

    def test_spacing_between_columns(self):
        """Test that columns are properly spaced."""
        # Create DataFrame with distinct values
        df = pd.DataFrame({"x": ["a"], "y": ["b"]})

        with patch("click.echo") as mock_echo:
            _display(df)  # 485μs -> 373μs (30.0% faster)
            calls = [call[0][0] for call in mock_echo.call_args_list]
            header = calls[0]


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-_display-mks4gad7, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 09:43
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Jan 24, 2026