⚡️ Speed up function `elements_from_dicts` by 44% #261

codeflash-ai · 2026-01-24T06:28:09Z

📄 44% (0.44x) speedup for `elements_from_dicts` in `unstructured/staging/base.py`

⏱️ Runtime : 38.4 milliseconds → 26.6 milliseconds (best of 25 runs)
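
The headline percentage can be reproduced from the two runtimes. The sketch below assumes the speedup is reported as (original - optimized) / optimized, which is consistent with the timings quoted in this PR; the helper name `relative_speedup` is hypothetical and is not part of Codeflash or `unstructured`.

```python
def relative_speedup(original: float, optimized: float) -> float:
    """Relative runtime change, measured against the optimized runtime.

    Positive means the optimized code is faster, negative means it is slower.
    Both arguments must use the same time unit.
    """
    return (original - optimized) / optimized


# Headline runtime from this PR: 38.4 ms -> 26.6 ms.
print(f"{relative_speedup(38.4, 26.6):.0%}")  # prints 44%
```

The same formula gives a negative value when the optimized code is slower, which is how the empty-input cases in this report show up as slowdowns.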

📝 Explanation and details

The optimized code achieves a 44% speedup through three key changes that reduce unnecessary work during element deserialization:

1. Conditional `os.path.split()` Call (7.8% → 0.7% of time)

Original: Always called `os.path.split(filename or "")`, even when `filename` is `None` or empty.
Optimized: Only calls `os.path.split()` when `filename` is truthy, skipping the call (pure string manipulation, not a filesystem operation, but still measurable in a hot path) in the common case where no filename is provided.

This is particularly impactful because profiling shows 3899 of 3938 calls have no filename, so the check skips about 99% of the `os.path.split()` calls.
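
A minimal sketch of the guard described above. The helper `split_filename` is hypothetical and is not the code in `unstructured`; it only illustrates skipping `os.path.split()` when there is nothing to split.

```python
import os
from typing import Optional, Tuple


def split_filename(filename: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (directory, basename), skipping os.path.split() for empty input."""
    if not filename:
        # Common case in the profile: no filename, so there is nothing to split.
        return None, None
    directory, basename = os.path.split(filename)
    # In this sketch an empty directory component is normalized to None.
    return directory or None, basename


print(split_filename(None))
print(split_filename("docs/report.pdf"))
```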

2. Eliminated Defensive Deep-Copy (74.9% → 0% of time)

Original: Performed `copy.deepcopy(meta_dict)` on the entire metadata dictionary in `ElementMetadata.from_dict()`, which was the single most expensive operation (272ms out of 363ms).
Optimized: Removed the blanket deep-copy and deep-copies only the `key_value_pairs` field, which is mutated by `_kvform_rehydrate_internal_elements()`.

This is safe because field assignments via `setattr()` don't mutate the source dictionary; they only bind new references to the existing values. One consequence is that the resulting metadata can share nested mutable values with the input dict, so the change relies on callers not mutating that dict afterwards. All existing and generated tests pass.
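
To illustrate the selective copy, here is a minimal sketch that uses a plain `SimpleNamespace` in place of `ElementMetadata`. The function `metadata_from_dict` is hypothetical and only mirrors the idea described above: deep-copy the one field that is mutated later and bind everything else by reference.

```python
import copy
from types import SimpleNamespace
from typing import Any, Dict


def metadata_from_dict(meta_dict: Dict[str, Any]) -> SimpleNamespace:
    """Build a metadata-like object, deep-copying only the field mutated later."""
    metadata = SimpleNamespace()
    for name, value in meta_dict.items():
        if name == "key_value_pairs":
            # This field is rewritten in place afterwards, so isolate it from the caller.
            value = copy.deepcopy(value)
        # setattr() binds a reference; it never modifies meta_dict itself.
        setattr(metadata, name, value)
    return metadata


source = {"languages": ["en"], "key_value_pairs": [{"key": "a"}]}
meta = metadata_from_dict(source)
meta.key_value_pairs[0]["key"] = "changed"  # stands in for the later in-place rehydration

print(source["key_value_pairs"])              # the caller's data is untouched
print(meta.languages is source["languages"])  # other fields are shared, not copied
```

The shared references are the trade-off: in this sketch, a caller who later mutates a nested list in the source dict would see that change through the metadata object.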

3. Reduced Dictionary Lookups in Hot Loop

Original: Called `item.get()` repeatedly for each element dict, performing 4-5 dict lookups per iteration.
Optimized: Binds `item.get` to a local variable `get` once per element and caches the `TYPE_TO_TEXT_ELEMENT_MAP` lookup.

While this micro-optimization shows smaller gains individually (~0.3-0.5% per lookup), it compounds across large batches: the 500-element test shows 12-14% improvements.
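
The local-binding pattern can be sketched as follows. `TYPE_MAP` and `convert` are hypothetical stand-ins, not the real `TYPE_TO_TEXT_ELEMENT_MAP` or `elements_from_dicts()`; the point is only where the lookups are hoisted.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical stand-in for TYPE_TO_TEXT_ELEMENT_MAP: type name -> constructor.
TYPE_MAP: Dict[str, Callable[[str], str]] = {
    "Title": str.upper,
    "NarrativeText": str.strip,
}


def convert(items: Iterable[Dict[str, Any]]) -> List[Tuple[Any, str]]:
    """Return (element_id, converted_text) pairs, skipping unknown types."""
    lookup_constructor = TYPE_MAP.get  # resolved once, outside the loop
    results: List[Tuple[Any, str]] = []
    for item in items:
        get = item.get  # bound once per element, reused for every field read
        constructor = lookup_constructor(get("type"))
        if constructor is None:
            continue  # unknown or missing type
        results.append((get("element_id"), constructor(item["text"])))
    return results


print(convert([
    {"type": "Title", "text": "intro", "element_id": "t1"},
    {"type": "Unknown", "text": "ignored"},
]))
```

Binding the method before the loop adds a small fixed cost, which is why an empty input can come out slightly slower than the unoptimized version.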

Impact on Workloads

Based on function_references, elements_from_dicts() is called from:

API deserialization (partition_multiple_via_api): Processes batches of documents from JSON responses, so the 44% speedup directly reduces API response processing time.
Key-value form rehydration (_kvform_rehydrate_internal_elements): Deserializes nested elements within form fields, benefiting from both the metadata and lookup optimizations.

Test Case Performance

The optimizations excel at:

Metadata-heavy workloads: 46-286% faster in the generated tests for elements with complex nested metadata (coordinates, data sources)
Large batches: 12-14% faster for 500+ element collections
Minimal metadata: 20-25% faster even for simple text elements with no metadata

The empty-list edge case is 26.7% slower due to local variable binding overhead, but this is negligible (sub-microsecond difference) and doesn't affect real workloads.

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | ✅ 37 Passed |
| 🌀 Generated Regression Tests | ✅ 46 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | ✅ 1 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `staging/test_base.py::test_all_elements_preserved_when_serialized` | 127μs | 93.8μs | 36.3%✅ |
| `staging/test_base.py::test_elements_from_dicts` | 54.6μs | 46.4μs | 17.6%✅ |

🌀 Generated Regression Tests

import base64
import json
import zlib

import pytest  # used for our unit tests

from unstructured.documents.elements import (
    TYPE_TO_TEXT_ELEMENT_MAP,
)
from unstructured.staging.base import elements_from_dicts


def _make_b64_gzipped_json(obj):
    """Helper to create a base64 gzipped JSON string from a Python object."""
    # JSON string -> bytes
    json_bytes = json.dumps(obj).encode("utf-8")
    # gzip (zlib) compress
    compressed = zlib.compress(json_bytes)
    # base64 encode and return str
    return base64.b64encode(compressed).decode("utf-8")


def _get_any_text_type():
    """Return a (type_name, ElementClass) pair from TYPE_TO_TEXT_ELEMENT_MAP for use in tests.

    Using an arbitrary mapping key ensures tests remain aligned with the real mapping.
    """
    # TYPE_TO_TEXT_ELEMENT_MAP is expected to be non-empty in the real codebase.
    for k, v in TYPE_TO_TEXT_ELEMENT_MAP.items():
        return k, v
    raise RuntimeError("TYPE_TO_TEXT_ELEMENT_MAP is empty; cannot run tests.")


def test_single_text_element_basic():
    # Basic: convert a single text-type element dict to an Element instance.
    text_type, ElementCls = _get_any_text_type()

    # Create input dict for a simple text element
    input_dict = {
        "type": text_type,
        "text": "Hello, world!",
        "element_id": "elem-1",
        "metadata": None,  # explicitly None should be handled by elements_from_dicts
    }

    # Call function under test
    codeflash_output = elements_from_dicts([input_dict])
    result = codeflash_output  # 18.7μs -> 15.0μs (24.5% faster)

    element = result[0]


def test_checkbox_with_coordinates_metadata():
    # Basic: CheckBox with coordinates in metadata should create CoordinatesMetadata with correct types.
    # Build metadata dict with coordinates given as list-of-lists and system name that maps to relative system
    metadata_dict = {
        "coordinates": {
            "points": [[0.0, 0.0], [1.0, 1.0]],  # list-of-lists is supported by from_dict
            "system": "RelativeCoordinateSystem",  # known system that should produce RelativeCoordinateSystem()
        }
    }

    input_dict = {
        "type": "CheckBox",
        "checked": True,
        "element_id": "cb-1",
        "metadata": metadata_dict,
    }

    codeflash_output = elements_from_dicts([input_dict])
    result = codeflash_output  # 46.9μs -> 26.5μs (76.9% faster)

    checkbox = result[0]
    # coordinates should be a CoordinatesMetadata instance
    coords = checkbox.metadata.coordinates


def test_unknown_type_is_ignored():
    # Edge: an unrecognized 'type' should not produce any Element instances (function ignores unknown types)
    unknown = {"type": "ThisTypeDoesNotExist", "foo": "bar"}
    codeflash_output = elements_from_dicts([unknown])
    result = codeflash_output  # 14.6μs -> 10.9μs (33.1% faster)


def test_missing_text_key_raises_key_error():
    # Edge: when a text-type element is missing the 'text' field, the implementation accesses item["text"]
    # which should raise a KeyError.
    text_type, _ = _get_any_text_type()

    # omit 'text' on purpose
    broken = {"type": text_type, "element_id": "no-text"}
    with pytest.raises(KeyError):
        # The call is expected to raise because item["text"] is accessed in elements_from_dicts
        elements_from_dicts([broken])  # 15.4μs -> 11.4μs (34.8% faster)


def test_non_str_element_id_raises_value_error():
    # Edge: element_id must be a string or None. Passing a non-str value should raise ValueError from Element.__init__.
    text_type, _ = _get_any_text_type()

    bad_id = {"type": text_type, "text": "content", "element_id": 12345}
    with pytest.raises(ValueError):
        elements_from_dicts([bad_id])  # 19.2μs -> 15.3μs (25.6% faster)


def test_metadata_orig_elements_base64_gzip_deserialization():
    # Edge: ElementMetadata.from_dict handles 'orig_elements' as a base64 gzipped JSON string that is
    # deserialized back into real Element instances via elements_from_base64_gzipped_json -> elements_from_dicts.
    text_type, _ = _get_any_text_type()

    # Build an inner element dict that will be compressed and placed inside metadata.orig_elements
    inner_element = {"type": text_type, "text": "inner text", "element_id": "inner-1"}
    b64 = _make_b64_gzipped_json([inner_element])

    # Outer element metadata includes the encoded orig_elements string
    metadata_with_orig = {"orig_elements": b64}

    outer_element = {
        "type": text_type,
        "text": "outer text",
        "element_id": "outer-1",
        "metadata": metadata_with_orig,
    }

    codeflash_output = elements_from_dicts([outer_element])
    result = codeflash_output  # 65.9μs -> 53.2μs (24.0% faster)

    outer = result[0]
    inner = outer.metadata.orig_elements[0]


def test_coordinates_missing_points_but_with_system_raises_value_error():
    # Edge: CoordinatesMetadata.__init__ forbids having `system` without `points` and vice versa.
    # If metadata contains a coordinates dict with a system but no points, ElementMetadata.from_dict
    # should raise a ValueError via CoordinatesMetadata.from_dict -> CoordinatesMetadata.__init__.
    text_type, _ = _get_any_text_type()

    # coordinates contains a system but no points -> should be invalid
    bad_coords_meta = {"coordinates": {"system": "RelativeCoordinateSystem"}}
    bad_element = {"type": text_type, "text": "x", "metadata": bad_coords_meta}

    with pytest.raises(ValueError):
        elements_from_dicts([bad_element])  # 31.7μs -> 19.3μs (63.9% faster)


def test_checkbox_without_checked_field_raises_key_error():
    # Edge: CheckBox handling expects item["checked"] to exist. Omitting it should raise KeyError.
    broken_checkbox = {"type": "CheckBox", "element_id": "cb-broken", "metadata": None}
    with pytest.raises(KeyError):
        elements_from_dicts([broken_checkbox])  # 15.0μs -> 11.7μs (28.1% faster)


def test_large_scale_many_elements():
    # Large-scale: convert 500 elements to exercise scalability and performance.
    text_type, _ = _get_any_text_type()
    count = 500

    # Generate element dicts with predictable ids and texts
    elements = [
        {"type": text_type, "text": f"text-{i}", "element_id": f"id-{i}", "metadata": None}
        for i in range(count)
    ]

    # Convert them using the function under test
    codeflash_output = elements_from_dicts(elements)
    result = codeflash_output  # 4.13ms -> 3.68ms (12.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

# imports
import pytest

from unstructured.staging.base import elements_from_dicts


class TestBasicFunctionality:
    """Test the fundamental functionality of elements_from_dicts."""

    def test_single_text_element_conversion(self):
        """Test conversion of a single text element from dict."""
        element_dicts = [
            {
                "type": "Title",
                "text": "Test Title",
                "element_id": "test_id_001",
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 19.5μs -> 16.2μs (20.4% faster)

    def test_multiple_text_elements_conversion(self):
        """Test conversion of multiple text elements."""
        element_dicts = [
            {"type": "NarrativeText", "text": "Paragraph 1", "element_id": "p1"},
            {"type": "Title", "text": "Title Text", "element_id": "t1"},
            {"type": "NarrativeText", "text": "Paragraph 2", "element_id": "p2"},
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 38.8μs -> 33.0μs (17.7% faster)

    def test_single_checkbox_element_conversion(self):
        """Test conversion of a checkbox element."""
        element_dicts = [
            {
                "type": "CheckBox",
                "checked": True,
                "element_id": "cb_001",
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 19.6μs -> 15.9μs (23.3% faster)

    def test_mixed_element_types_conversion(self):
        """Test conversion of mixed element types (text and checkbox)."""
        element_dicts = [
            {"type": "NarrativeText", "text": "Some text", "element_id": "txt1"},
            {"type": "CheckBox", "checked": False, "element_id": "cb1"},
            {"type": "Title", "text": "A Title", "element_id": "ttl1"},
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 39.5μs -> 34.1μs (15.9% faster)

    def test_element_without_element_id(self):
        """Test element conversion when element_id is not provided."""
        element_dicts = [{"type": "NarrativeText", "text": "Text without ID"}]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 18.4μs -> 15.2μs (21.0% faster)

    def test_element_with_empty_metadata(self):
        """Test element conversion with empty metadata dict."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": "Text with empty metadata",
                "element_id": "txt_em",
                "metadata": {},
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 28.7μs -> 19.6μs (46.3% faster)

    def test_element_with_basic_metadata(self):
        """Test element conversion with basic metadata fields."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": "Text with metadata",
                "element_id": "txt_md",
                "metadata": {
                    "page_number": 1,
                    "filename": "test.pdf",
                    "filetype": "pdf",
                },
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 36.2μs -> 23.4μs (54.3% faster)

    def test_checkbox_checked_true(self):
        """Test checkbox element with checked=True."""
        element_dicts = [{"type": "CheckBox", "checked": True, "element_id": "cb_t"}]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 19.0μs -> 15.8μs (19.6% faster)

    def test_checkbox_checked_false(self):
        """Test checkbox element with checked=False."""
        element_dicts = [{"type": "CheckBox", "checked": False, "element_id": "cb_f"}]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 19.3μs -> 15.8μs (22.0% faster)


class TestEdgeCases:
    """Test edge cases and unusual conditions."""

    def test_empty_list_input(self):
        """Test conversion of an empty list."""
        codeflash_output = elements_from_dicts([])
        result = codeflash_output  # 863ns -> 1.18μs (26.7% slower)

    def test_element_without_type(self):
        """Test that elements without 'type' field are skipped."""
        element_dicts = [
            {"text": "No type field", "element_id": "no_type"},
            {"type": "NarrativeText", "text": "Has type", "element_id": "has_type"},
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 26.7μs -> 22.4μs (19.2% faster)

    def test_text_element_without_text_field(self):
        """Test text element without 'text' field."""
        element_dicts = [{"type": "NarrativeText", "element_id": "no_text"}]
        # This should raise an error when trying to access 'text' key
        with pytest.raises(KeyError):
            elements_from_dicts(element_dicts)  # 15.8μs -> 11.9μs (32.9% faster)

    def test_checkbox_without_checked_field(self):
        """Test checkbox element without 'checked' field."""
        element_dicts = [{"type": "CheckBox", "element_id": "cb_no_checked"}]
        # This should raise an error when trying to access 'checked' key
        with pytest.raises(KeyError):
            elements_from_dicts(element_dicts)  # 15.5μs -> 11.9μs (30.3% faster)

    def test_unknown_element_type(self):
        """Test that unknown element types are skipped."""
        element_dicts = [
            {"type": "UnknownType", "text": "Unknown", "element_id": "unk"},
            {"type": "NarrativeText", "text": "Known", "element_id": "known"},
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 26.7μs -> 22.4μs (19.0% faster)

    def test_none_metadata(self):
        """Test element with None metadata."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": "Text with None metadata",
                "element_id": "txt_none_md",
                "metadata": None,
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 19.1μs -> 15.4μs (23.9% faster)

    def test_element_with_data_source_metadata(self):
        """Test element with data_source in metadata."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": "Text with data source",
                "element_id": "txt_ds",
                "metadata": {
                    "data_source": {
                        "url": "https://example.com",
                        "version": "1.0",
                    }
                },
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 44.6μs -> 30.5μs (46.2% faster)

    def test_element_with_coordinates_metadata(self):
        """Test element with coordinates in metadata."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": "Text with coordinates",
                "element_id": "txt_coord",
                "metadata": {
                    "coordinates": {
                        "points": [[0.0, 0.0], [1.0, 1.0]],
                        "system": "RelativeCoordinateSystem",
                    }
                },
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 48.5μs -> 26.9μs (80.0% faster)

    def test_element_with_empty_text(self):
        """Test element with empty text string."""
        element_dicts = [{"type": "NarrativeText", "text": "", "element_id": "empty_txt"}]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 18.5μs -> 15.4μs (20.5% faster)

    def test_element_with_special_characters_in_text(self):
        """Test element with special characters in text."""
        special_text = "Text with special chars: !@#$%^&*()_+-=[]{}|;:',.<>?/"
        element_dicts = [{"type": "NarrativeText", "text": special_text, "element_id": "special"}]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 18.8μs -> 15.4μs (22.3% faster)

    def test_element_with_unicode_text(self):
        """Test element with unicode characters."""
        # \U0001f600 is a single emoji code point; a \ud83d\ude00 pair would be two lone surrogates.
        unicode_text = (
            "Unicode text: \u4e2d\u6587 \U0001f600 \u0627\u0644\u0639\u0631\u0628\u064a\u0629"
        )
        element_dicts = [{"type": "NarrativeText", "text": unicode_text, "element_id": "unicode"}]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 19.0μs -> 15.4μs (23.3% faster)

    def test_element_with_very_long_text(self):
        """Test element with very long text string."""
        long_text = "A" * 10000
        element_dicts = [{"type": "NarrativeText", "text": long_text, "element_id": "long"}]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 18.7μs -> 15.6μs (20.1% faster)

    def test_element_with_multiline_text(self):
        """Test element with multiline text."""
        multiline_text = "Line 1\nLine 2\nLine 3"
        element_dicts = [
            {"type": "NarrativeText", "text": multiline_text, "element_id": "multiline"}
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 19.3μs -> 15.2μs (26.9% faster)

    def test_element_with_extra_unknown_fields(self):
        """Test that extra unknown fields are ignored."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": "Text",
                "element_id": "txt",
                "unknown_field": "should be ignored",
                "another_unknown": 123,
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 18.8μs -> 15.4μs (22.3% faster)

    def test_mixed_known_and_unknown_types(self):
        """Test processing a mix of known and unknown element types."""
        element_dicts = [
            {"type": "Unknown1", "text": "Unknown", "element_id": "unk1"},
            {"type": "Title", "text": "Title", "element_id": "ttl"},
            {"type": "Unknown2", "text": "Unknown", "element_id": "unk2"},
            {"type": "CheckBox", "checked": True, "element_id": "cb"},
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 43.9μs -> 38.0μs (15.5% faster)

    def test_element_with_none_optional_metadata_fields(self):
        """Test element with None values for optional metadata fields."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": "Text",
                "element_id": "txt",
                "metadata": {
                    "page_number": None,
                    "filename": None,
                    "filetype": None,
                },
            }
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 33.4μs -> 21.1μs (57.8% faster)


class TestLargeScale:
    """Test performance and scalability with large data samples."""

    def test_large_number_of_text_elements(self):
        """Test processing 500 text elements."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": f"Text paragraph {i}",
                "element_id": f"para_{i}",
            }
            for i in range(500)
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 4.15ms -> 3.64ms (14.1% faster)

    def test_large_number_of_checkbox_elements(self):
        """Test processing 500 checkbox elements."""
        element_dicts = [
            {
                "type": "CheckBox",
                "checked": i % 2 == 0,
                "element_id": f"cb_{i}",
            }
            for i in range(500)
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 4.13ms -> 3.64ms (13.6% faster)

    def test_large_number_of_mixed_elements(self):
        """Test processing 600 elements of mixed types."""
        element_dicts = []
        for i in range(600):
            if i % 3 == 0:
                element_dicts.append(
                    {
                        "type": "Title",
                        "text": f"Title {i}",
                        "element_id": f"ttl_{i}",
                    }
                )
            elif i % 3 == 1:
                element_dicts.append(
                    {
                        "type": "NarrativeText",
                        "text": f"Paragraph {i}",
                        "element_id": f"para_{i}",
                    }
                )
            else:
                element_dicts.append(
                    {
                        "type": "CheckBox",
                        "checked": i % 2 == 0,
                        "element_id": f"cb_{i}",
                    }
                )
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 4.99ms -> 4.40ms (13.5% faster)

    def test_elements_with_large_metadata(self):
        """Test processing elements with large metadata structures."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": f"Text {i}",
                "element_id": f"txt_{i}",
                "metadata": {
                    "page_number": i,
                    "filename": f"file_{i}.pdf",
                    "filetype": "pdf",
                    "languages": ["en", "fr", "de", "es"],
                    "link_texts": [f"link_{j}" for j in range(10)],
                    "link_urls": [f"https://example.com/page/{j}" for j in range(10)],
                    "data_source": {
                        "url": f"https://datasource.com/{i}",
                        "version": f"1.{i}",
                    },
                },
            }
            for i in range(100)
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 4.34ms -> 1.64ms (164% faster)

    def test_large_batch_with_unknown_types_mixed_in(self):
        """Test processing large batch with mix of known and unknown types."""
        element_dicts = [
            {
                "type": "UnknownType" if i % 5 == 0 else "NarrativeText",
                "text": f"Text {i}",
                "element_id": f"txt_{i}",
            }
            for i in range(200)
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 1.61ms -> 1.42ms (13.7% faster)

    def test_elements_with_very_large_text_fields(self):
        """Test processing elements with very large text content."""
        large_text = "A" * 50000  # 50KB of text per element
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": large_text,
                "element_id": f"large_{i}",
            }
            for i in range(50)
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 421μs -> 368μs (14.2% faster)

    def test_elements_with_complex_nested_metadata(self):
        """Test processing elements with deeply nested metadata."""
        element_dicts = [
            {
                "type": "NarrativeText",
                "text": f"Text {i}",
                "element_id": f"complex_{i}",
                "metadata": {
                    "page_number": i,
                    "coordinates": {
                        "points": [[float(j), float(j + 1)] for j in range(20)],
                        "system": "RelativeCoordinateSystem",
                    },
                    "data_source": {
                        "url": f"https://example.com/{i}",
                        "version": f"1.{i}.0",
                        "record_locator": {
                            "sheet": f"sheet_{i}",
                            "row": i,
                            "column": i % 10,
                        },
                    },
                },
            }
            for i in range(100)
        ]
        codeflash_output = elements_from_dicts(element_dicts)
        result = codeflash_output  # 7.90ms -> 2.05ms (286% faster)

import pytest

from unstructured.staging.base import elements_from_dicts


def test_elements_from_dicts():
    with pytest.raises(
        AttributeError, match="'LazyIntSymbolicStr'\\ object\\ has\\ no\\ attribute\\ 'items'"
    ):
        elements_from_dicts(({}, {}, {}, {"element_id": "", "metadata": ""}))


def test_elements_from_dicts_2():
    with pytest.raises(KeyError):
        elements_from_dicts({"element_id": "", "type": "Value"})


def test_elements_from_dicts_3():
    elements_from_dicts(())

🔎 Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `codeflash_concolic_xdo_puqm/tmp586jcdgo/test_concolic_coverage.py::test_elements_from_dicts_3` | 938ns | 1.26μs | -25.6%⚠️ |

To edit these changes, run `git checkout codeflash/optimize-elements_from_dicts-mkrxhemz` and push.


codeflash-ai bot requested a review from aseembits93 January 24, 2026 06:28

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to Codeflash) labels Jan 24, 2026
