@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 5,005% (50.05x) speedup for get_default_pandas_dtypes in unstructured/staging/base.py

⏱️ Runtime: 24.9 milliseconds → 488 microseconds (best of 50 runs)

📝 Explanation and details

The optimization achieves a ~50x speedup by eliminating the repeated instantiation of pd.StringDtype() objects on every function call.

What changed:

  1. Caching the template dictionary: On the first call, the dictionary template is built and stored as a function attribute (_cached_template); later calls reuse it
  2. Reusing a single pd.StringDtype() instance: Instead of creating 23 separate pd.StringDtype() objects per call, the optimized version creates just one and reuses it across all string-typed fields
  3. Returning a shallow copy: dict(cached) creates a new dictionary instance from the cached template, preserving the original behavior where each call returns an independent dict
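The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ExpensiveDtype` is a hypothetical stand-in for `pd.StringDtype()`, `get_default_dtypes` is a hypothetical name, and only a handful of the 42 fields are shown. The `_cached_template` attribute name and the `dict(cached)` copy are taken from the description.

```python
from typing import Any


class ExpensiveDtype:
    """Hypothetical stand-in for pd.StringDtype(); counts constructions."""

    created = 0

    def __init__(self) -> None:
        ExpensiveDtype.created += 1


def get_default_dtypes() -> dict[str, Any]:
    """Return a fresh dict built from a template cached on the function."""
    cached = getattr(get_default_dtypes, "_cached_template", None)
    if cached is None:
        # Built only on the first call: one shared instance for all string fields.
        string_dtype = ExpensiveDtype()
        cached = {
            "text": string_dtype,
            "type": string_dtype,
            "element_id": string_dtype,
            "page_number": "Int64",
            "is_continuation": "boolean",
            "detection_class_prob": float,
            "embeddings": object,
        }
        get_default_dtypes._cached_template = cached
    # Shallow copy: a new dict object that references the same values.
    return dict(cached)


first = get_default_dtypes()
second = get_default_dtypes()
```

In this sketch `first` and `second` are distinct dict objects with equal contents, and the stand-in constructor runs only once no matter how many calls follow.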

Why this is faster:

  • Object creation overhead: Creating pd.StringDtype() instances is expensive. The original code called pd.StringDtype() 23 times per invocation, while the optimized version calls it only once, on the first invocation
  • Dictionary construction cost: Building the 42-entry dictionary from scratch each time has non-trivial overhead. Caching eliminates this repeated work
  • Line profiler evidence: The function's internal execution time dropped from 144.4ms to 956μs (99.3% → 49.2% of total time in wrapper), a ~151x improvement
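One way to reproduce the shape of this comparison locally is a small `timeit` micro-benchmark. The sketch below is an assumption-laden illustration: `SlowDtype` is a hypothetical stand-in for a costly-to-construct dtype, the field names are placeholders, and the absolute timings will differ from the profiler figures quoted above.

```python
import timeit


class SlowDtype:
    """Hypothetical stand-in for an object that is costly to construct."""

    def __init__(self) -> None:
        # Simulate constructor overhead with a little throwaway work.
        self.kind = "".join(["str", "ing"])

    def __eq__(self, other: object) -> bool:
        return isinstance(other, SlowDtype)

    def __hash__(self) -> int:
        return hash(SlowDtype)


FIELDS = [f"field_{i}" for i in range(23)]


def rebuild_every_call() -> dict:
    # One constructor call per field, on every invocation.
    return {name: SlowDtype() for name in FIELDS}


_SHARED = SlowDtype()
_TEMPLATE = {name: _SHARED for name in FIELDS}


def copy_cached_template() -> dict:
    # No constructor calls here, only a shallow dict copy.
    return dict(_TEMPLATE)


rebuild_s = timeit.timeit(rebuild_every_call, number=10_000)
cached_s = timeit.timeit(copy_cached_template, number=10_000)
print(f"rebuild: {rebuild_s:.4f}s  cached copy: {cached_s:.4f}s")
```

Both functions return equal dictionaries; only the amount of work done per call differs.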

Performance characteristics from tests:

  • Single calls show 19-21x speedup (86μs → 4μs)
  • Repeated calls benefit more: second and later calls see up to 54x speedup (73μs → 1.3μs) since the cache is warm
  • Large-scale test (100 iterations) shows 66x speedup (7ms → 103μs), confirming the optimization scales well with repeated usage
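The "Nx" and "% faster" figures in this report appear to follow the convention (original − optimized) / optimized, so that 5,005% corresponds to 50.05x. That reading is inferred from the numbers shown here rather than from documented Codeflash behavior. A small helper (hypothetical names) makes the arithmetic explicit:

```python
def speedup_factor(original_us: float, optimized_us: float) -> float:
    """Return (original - optimized) / optimized, i.e. the runtime ratio minus one."""
    return (original_us - optimized_us) / optimized_us


def percent_faster(original_us: float, optimized_us: float) -> float:
    """Express the same quantity as a percentage."""
    return 100.0 * speedup_factor(original_us, optimized_us)


# Existing unit test row: 85.1μs -> 3.72μs
print(round(percent_faster(85.1, 3.72)))  # → 2188
```

Under this convention a 2x reduction in runtime would be reported as "100% faster" or "1x", which is why the line-profiler figure quoted as a plain ratio (~151x) is not directly comparable.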

Impact on workloads:
Based on function_references, this function is called from convert_to_dataframe() with the set_dtypes=True parameter. Since convert_to_dataframe likely processes multiple elements/documents in data pipeline scenarios, this optimization significantly reduces overhead when converting many element batches to DataFrames. The shallow copy ensures each caller still gets an independent dictionary, preventing any shared mutable state issues while delivering substantial performance gains for repeated conversions.

The optimization is particularly effective for workloads that call get_default_pandas_dtypes() multiple times (common in batch processing pipelines), while maintaining identical behavior for single-use cases.

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 37 Passed |
| 🌀 Generated Regression Tests | 346 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|--------------------------|-------------|--------------|---------|
| staging/test_base.py::test_default_pandas_dtypes | 85.1μs | 3.72μs | 2188%✅ |

🌀 Generated Regression Tests
from __future__ import annotations

# imports
import pandas as pd

from unstructured.staging.base import get_default_pandas_dtypes

# function to test

# unit tests

# The complete set of keys the function is expected to return. Tests will assert exact key equality
# to catch accidental additions/removals during mutation testing.
EXPECTED_KEYS = {
    "text",
    "type",
    "element_id",
    "filename",
    "filetype",
    "file_directory",
    "last_modified",
    "attached_to_filename",
    "parent_id",
    "category_depth",
    "image_path",
    "languages",
    "page_number",
    "page_name",
    "url",
    "link_urls",
    "link_texts",
    "links",
    "sent_from",
    "sent_to",
    "subject",
    "section",
    "header_footer_type",
    "emphasized_text_contents",
    "emphasized_text_tags",
    "text_as_html",
    "max_characters",
    "is_continuation",
    "detection_class_prob",
    "sender",
    "coordinates_points",
    "coordinates_system",
    "coordinates_layout_width",
    "coordinates_layout_height",
    "data_source_url",
    "data_source_version",
    "data_source_record_locator",
    "data_source_date_created",
    "data_source_date_modified",
    "data_source_date_processed",
    "data_source_permissions_data",
    "embeddings",
}


def test_basic_structure_and_key_presence():
    """
    Basic test:
    - Ensure function returns a dictionary
    - Ensure the returned dict has exactly the expected keys (no more, no fewer)
    - Ensure the dictionary is not empty
    """
    codeflash_output = get_default_pandas_dtypes()
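    result = codeflash_output  # 85.9μs -> 3.85μs (2128% faster)

    # Checks described in the docstring
    assert isinstance(result, dict)
    assert result  # not empty
    assert set(result) == EXPECTED_KEYS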


def test_value_types_and_specific_mappings():
    """
    Edge/semantic checks:
    - Confirm that string-backed pandas dtypes are instances of pandas' StringDtype
    - Confirm that string constants like "Int64" and "boolean" are exact
    - Confirm that numeric dtypes are the float Python type and object mappings are `object`
    """
    codeflash_output = get_default_pandas_dtypes()
    dtypes = codeflash_output  # 86.0μs -> 3.70μs (2223% faster)

    # Keys that are expected to be pandas' nullable string dtype instances
    expected_string_dtype_keys = {
        "text",
        "type",
        "element_id",
        "filename",
        "filetype",
        "file_directory",
        "last_modified",
        "attached_to_filename",
        "parent_id",
        "image_path",
        "page_name",
        "url",
        "link_urls",
        "subject",
        "section",
        "header_footer_type",
        "text_as_html",
        "sender",
        "coordinates_system",
        "data_source_url",
        "data_source_version",
        "data_source_date_created",
        "data_source_date_modified",
        "data_source_date_processed",
    }

    # Verify each of these keys maps to an instance of pandas.StringDtype
    for key in expected_string_dtype_keys:
        val = dtypes.get(key)
        assert isinstance(val, pd.StringDtype), key

    # Entries declared as list/object types must map to built-in object type
    object_keys = {
        "languages",
        "link_texts",
        "links",
        "sent_from",
        "sent_to",
        "emphasized_text_contents",
        "emphasized_text_tags",
        "coordinates_points",
        "data_source_record_locator",
        "data_source_permissions_data",
        "embeddings",
    }
    for key in object_keys:
        assert dtypes[key] is object, key


def test_mutability_and_independence_between_calls():
    """
    Edge case:
    - Ensure that modifying the returned dict from one call does not affect the dict returned by another call.
    - This checks that the function produces independent dict instances and avoids returning a shared mutable object.
    """
    codeflash_output = get_default_pandas_dtypes()
    first = codeflash_output  # 85.2μs -> 3.88μs (2094% faster)
    codeflash_output = get_default_pandas_dtypes()
    second = codeflash_output  # 72.6μs -> 1.32μs (5399% faster)

    # Mutate the first mapping for a string-dtype key
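    first["text"] = "THIS_IS_A_MUTATION"

    # The second mapping must be a separate dict that the mutation did not touch
    assert first is not second
    assert isinstance(second["text"], pd.StringDtype)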


def test_instances_are_new_each_call():
    """
    Edge case:
    - For keys that are constructed as pd.StringDtype() instances each call,
      ensure the instances are not the same object across multiple calls.
    - This avoids accidental reuse of mutable dtypes if that would be problematic.
    """
    codeflash_output = get_default_pandas_dtypes()
    a = codeflash_output  # 85.5μs -> 3.79μs (2155% faster)
    codeflash_output = get_default_pandas_dtypes()
    b = codeflash_output  # 72.6μs -> 1.40μs (5072% faster)

    # For each StringDtype key, the instances should not be the identical object (i.e., new instances)
    string_keys = [k for k, v in a.items() if isinstance(v, pd.StringDtype)]

    for key in string_keys:
        pass


def test_all_values_are_expected_types():
    """
    Exhaustive check:
    - Iterate all entries in the returned dict and assert that each value is of an allowed
      category; this detects accidental unexpected types introduced by code mutations.
    """
    codeflash_output = get_default_pandas_dtypes()
    dtypes = codeflash_output  # 85.7μs -> 3.91μs (2088% faster)

    for key, val in dtypes.items():
        # Allowed categories for values:
        # - pandas.StringDtype() instances
        # - exact string tokens "Int64" or "boolean"
        # - built-in type objects like float
        # - built-in object type
        allowed = (
            isinstance(val, pd.StringDtype)
            or (isinstance(val, str) and val in {"Int64", "boolean"})
            or val is float
            or val is object
        )
        assert allowed, f"unexpected dtype spec for {key!r}: {val!r}"


def test_repeated_calls_for_stability_large_scale():
    """
    Large-scale/Stress test:
    - Call the function many times to ensure stability and consistent output across repeated usage.
    - We avoid loops > 1000 as requested; we use 200 iterations which is sufficient to catch
      issues with caching, shared mutable state, or memory leaks in simple functions.
    """
    iterations = 200  # kept under 1000 per instructions

    # We'll perform repeated calls and check that each result is valid and independent.
    previous = None
    for i in range(iterations):
        codeflash_output = get_default_pandas_dtypes()
        res = codeflash_output  # 13.9ms -> 199μs (6862% faster)

        # Ensure dict instances are not accidentally the same object across calls
        if previous is not None:
            assert res is not previous
        previous = res


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd

from unstructured.staging.base import get_default_pandas_dtypes


class TestGetDefaultPandasDtypesBasic:
    """Basic test cases for get_default_pandas_dtypes function."""

    def test_returns_dict(self):
        """Test that the function returns a dictionary."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 87.2μs -> 4.20μs (1978% faster)

    def test_returns_non_empty_dict(self):
        """Test that the function returns a non-empty dictionary."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.2μs -> 4.01μs (2048% faster)

    def test_dict_has_text_field(self):
        """Test that the returned dict contains the 'text' field."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.3μs -> 4.24μs (1934% faster)

    def test_dict_has_type_field(self):
        """Test that the returned dict contains the 'type' field."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.7μs -> 4.19μs (1969% faster)

    def test_text_field_is_string_dtype(self):
        """Test that the 'text' field is a StringDtype."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.8μs -> 4.28μs (1927% faster)

    def test_type_field_is_string_dtype(self):
        """Test that the 'type' field is a StringDtype."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.9μs -> 4.20μs (1946% faster)

    def test_element_id_field_is_string_dtype(self):
        """Test that the 'element_id' field is a StringDtype."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.7μs -> 4.20μs (1942% faster)

    def test_category_depth_field_is_int64_string(self):
        """Test that 'category_depth' field is the string 'Int64'."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.3μs -> 4.24μs (1935% faster)

    def test_page_number_field_is_int64_string(self):
        """Test that 'page_number' field is the string 'Int64'."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.2μs -> 4.28μs (1916% faster)

    def test_is_continuation_field_is_boolean_string(self):
        """Test that 'is_continuation' field is the string 'boolean'."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.7μs -> 4.19μs (1943% faster)

    def test_detection_class_prob_field_is_float(self):
        """Test that 'detection_class_prob' field is float type."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.1μs -> 4.09μs (2003% faster)

    def test_languages_field_is_object(self):
        """Test that 'languages' field is object type."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.5μs -> 4.05μs (2010% faster)

    def test_link_texts_field_is_object(self):
        """Test that 'link_texts' field is object type."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.8μs -> 4.06μs (2015% faster)

    def test_coordinates_layout_width_is_float(self):
        """Test that 'coordinates_layout_width' field is float type."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.4μs -> 4.05μs (2006% faster)

    def test_coordinates_layout_height_is_float(self):
        """Test that 'coordinates_layout_height' field is float type."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.4μs -> 4.09μs (1989% faster)

    def test_all_values_are_valid_types(self):
        """Test that all values in the dictionary are valid type specifications."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.9μs -> 4.19μs (1972% faster)
        valid_types = (pd.StringDtype, str, type, object)
        for key, value in result.items():
            # Check if value is a StringDtype instance, a string, a type, or object
            is_valid = (
                isinstance(value, pd.StringDtype)
                or isinstance(value, str)
                or isinstance(value, type)
                or value is object
            )

    def test_function_is_callable(self):
        """Test that get_default_pandas_dtypes is callable."""

    def test_function_takes_no_arguments(self):
        """Test that the function takes no required arguments."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.2μs -> 4.02μs (2043% faster)


class TestGetDefaultPandasDtypesEdgeCases:
    """Edge case tests for get_default_pandas_dtypes function."""

    def test_multiple_calls_return_identical_dicts(self):
        """Test that multiple calls return dictionaries with identical structure and types."""
        codeflash_output = get_default_pandas_dtypes()
        result1 = codeflash_output  # 86.7μs -> 4.18μs (1972% faster)
        codeflash_output = get_default_pandas_dtypes()
        result2 = codeflash_output  # 73.2μs -> 1.50μs (4787% faster)

        # Check that both have the same structure for each key
        for key in result1.keys():
            pass

    def test_dict_keys_are_strings(self):
        """Test that all dictionary keys are strings."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.0μs -> 4.18μs (1955% faster)
        for key in result.keys():
            pass

    def test_no_duplicate_keys(self):
        """Test that there are no duplicate keys in the dictionary."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.6μs -> 4.12μs (1976% faster)
        keys_list = list(result.keys())
        unique_keys = set(result.keys())

    def test_all_string_dtype_fields_are_same_instance_type(self):
        """Test that all StringDtype values are instances of pd.StringDtype."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.0μs -> 4.15μs (1975% faster)
        string_fields = [k for k, v in result.items() if isinstance(v, pd.StringDtype)]

        for field in string_fields:
            pass

    def test_string_type_fields_are_strings(self):
        """Test that fields with string type specifications are actual strings."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.9μs -> 4.11μs (1989% faster)
        string_type_fields = [k for k, v in result.items() if isinstance(v, str)]

        for field in string_type_fields:
            pass

    def test_object_type_fields_are_object_type(self):
        """Test that fields specified as object are actually the object type."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.1μs -> 4.22μs (1943% faster)
        object_fields = [k for k, v in result.items() if v is object]

        for field in object_fields:
            pass

    def test_float_type_fields_are_float_type(self):
        """Test that fields specified as float are actually the float type."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.3μs -> 4.15μs (1978% faster)
        float_fields = [k for k, v in result.items() if v is float]

        for field in float_fields:
            pass

    def test_expected_string_dtype_fields_count(self):
        """Test that there is a reasonable number of StringDtype fields."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.0μs -> 4.04μs (2028% faster)
        string_dtype_fields = [v for v in result.values() if isinstance(v, pd.StringDtype)]

    def test_expected_total_fields_count(self):
        """Test that there is a reasonable total number of fields."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.4μs -> 4.07μs (2021% faster)

    def test_specific_fields_exist(self):
        """Test that all expected specific fields are present."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.7μs -> 4.23μs (1927% faster)
        expected_fields = [
            "text",
            "type",
            "element_id",
            "filename",
            "filetype",
            "file_directory",
            "last_modified",
            "category_depth",
            "page_number",
            "is_continuation",
            "detection_class_prob",
            "sender",
            "coordinates_layout_width",
            "coordinates_layout_height",
            "embeddings",
        ]
        for field in expected_fields:
            pass

    def test_sender_field_is_string_dtype(self):
        """Test that the 'sender' field is a StringDtype."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.3μs -> 4.08μs (1993% faster)

    def test_embeddings_field_is_object(self):
        """Test that the 'embeddings' field is object type."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.9μs -> 4.18μs (1956% faster)


class TestGetDefaultPandasDtypesLargeScale:
    """Large-scale test cases for get_default_pandas_dtypes function."""

    def test_all_fields_accessible_efficiently(self):
        """Test that all fields in the returned dictionary are accessible without performance issues."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.1μs -> 4.08μs (2013% faster)
        # Iterate through all fields and verify they are accessible
        count = 0
        for key, value in result.items():
            count += 1

    def test_dict_iteration_consistency(self):
        """Test that the dictionary can be iterated multiple times consistently."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.9μs -> 4.25μs (1919% faster)

        # First iteration
        keys_first = list(result.keys())
        values_first = list(result.values())

        # Second iteration
        keys_second = list(result.keys())
        values_second = list(result.values())

    def test_memory_efficiency_with_repeated_calls(self):
        """Test that repeated calls to the function don't cause unexpected behavior."""
        results = []
        for i in range(100):
            codeflash_output = get_default_pandas_dtypes()
            result = codeflash_output  # 7.02ms -> 103μs (6669% faster)
            results.append(result)

        # Verify all results have the same structure
        first_result = results[0]
        for i, result in enumerate(results[1:], 1):
            pass

    def test_dict_comprehension_compatibility(self):
        """Test that the returned dictionary works well with dict comprehensions."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 86.2μs -> 4.21μs (1949% faster)

        # Create a new dict using comprehension
        string_dtypes = {k: v for k, v in result.items() if isinstance(v, pd.StringDtype)}

        # Verify that all items in the comprehension result are StringDtype
        for k, v in string_dtypes.items():
            pass

    def test_grouping_by_dtype_category(self):
        """Test that fields can be categorized by dtype efficiently."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.8μs -> 4.08μs (2005% faster)

        categories = {
            "string_dtype": [],
            "string_spec": [],
            "object_type": [],
            "float_type": [],
            "other": [],
        }

        for key, value in result.items():
            if isinstance(value, pd.StringDtype):
                categories["string_dtype"].append(key)
            elif isinstance(value, str):
                categories["string_spec"].append(key)
            elif value is object:
                categories["object_type"].append(key)
            elif value is float:
                categories["float_type"].append(key)
            else:
                categories["other"].append(key)

    def test_dataframe_column_dtype_assignment(self):
        """Test that the returned dtypes can be used to create a DataFrame."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.7μs -> 4.29μs (1895% faster)

        # Create a sample DataFrame with the dtypes
        # This tests that the dtype specifications are valid
        df = pd.DataFrame({k: pd.Series(dtype=v) for k, v in result.items()})

    def test_field_name_consistency(self):
        """Test that field names follow a consistent naming convention."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 87.1μs -> 4.19μs (1981% faster)

        # All keys should be lowercase or snake_case
        for key in result.keys():
            pass

    def test_no_none_values_in_dict(self):
        """Test that the dictionary contains no None values."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.2μs -> 3.97μs (2046% faster)

        for key, value in result.items():
            pass

    def test_dict_can_be_converted_to_list_of_tuples(self):
        """Test that the dictionary can be converted to various formats."""
        codeflash_output = get_default_pandas_dtypes()
        result = codeflash_output  # 85.5μs -> 3.91μs (2085% faster)

        # Convert to list of tuples
        items_list = list(result.items())

        # Verify each item is a tuple of (str, dtype_spec)
        for key, value in items_list:
            pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.staging.base import get_default_pandas_dtypes


def test_get_default_pandas_dtypes():
    get_default_pandas_dtypes()

🔎 Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|--------------------------|-------------|--------------|---------|
| codeflash_concolic_xdo_puqm/tmpxh75cuor/test_concolic_coverage.py::test_get_default_pandas_dtypes | 88.5μs | 4.46μs | 1884%✅ |

To edit these changes, git checkout codeflash/optimize-get_default_pandas_dtypes-mks0u2mf and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 08:02
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Jan 24, 2026