@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 42% (0.42x) speedup for elements_to_md in unstructured/staging/base.py

⏱️ Runtime : 6.68 milliseconds → 4.72 milliseconds (best of 35 runs)
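
The headline percentage appears to be computed as original time divided by optimized time, minus one; the per-test figures in this report are consistent with that formula. The sketch below reproduces the arithmetic. The helper names are hypothetical and are not part of Codeflash or `unstructured`.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Speedup as 'how much more work fits in the same time', in percent."""
    return (original / optimized - 1.0) * 100.0


def time_reduction_percent(original: float, optimized: float) -> float:
    """Share of the original runtime that was removed, in percent."""
    return (1.0 - optimized / original) * 100.0


# Figures taken from the runtime line above (milliseconds).
print(round(speedup_percent(6.68, 4.72), 1))         # 41.5
print(round(time_reduction_percent(6.68, 4.72), 1))  # 29.3
```

Note that a speedup of about 41.5% corresponds to roughly 29% less wall-clock time, so the two numbers should not be used interchangeably.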

📝 Explanation and details

The optimization achieves a roughly 41.5% speedup (6.68 ms to 4.72 ms, rounded to 42% in the summary above) by replacing Python's structural pattern matching with direct isinstance() checks and explicit attribute access. Here's why this matters:

Key Performance Improvement

Pattern matching overhead elimination: The original code spent ~65% of its time in case statement evaluation (lines showing 15%, 12.2%, 11.2%, 12%, 14.2% in profiling). Each case statement with attribute unpacking like case Title(text=text): performs:

  1. Type checking via isinstance()
  2. Attribute extraction and binding
  3. Guard condition evaluation (for the if clauses)

The optimized version performs these operations explicitly and only once per element type, avoiding the pattern matching machinery's overhead.

Specific Optimizations

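  1. Early returns reduce unnecessary checks: By restructuring as an if-elif chain with early returns, once an element type matches, no further type checks occur. A match statement also stops at the first case that succeeds, but every case it rejects along the way repeats the class check, attribute lookup, and guard evaluation, so elements that match late pay for each earlier failed case.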

  2. Cached attribute access for Images: The optimized code extracts metadata and text once for Image elements (metadata = element.metadata), then reuses these references across multiple conditions. The original code repeatedly accessed element.metadata through pattern unpacking in each case.

  3. Simplified conditional logic: For Image elements, the nested if-statements in the optimized version more efficiently evaluate conditions in sequence (checking image_base64 once, then mime_type, then exclude flag) versus pattern matching which re-evaluates the entire pattern for each case.
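
The Image handling described in items 2 and 3 could look roughly like the sketch below. The classes are stand-ins for the real element and metadata types, the exact markdown strings are assumptions, and the behavior when both base64 data and a URL are present is a guess rather than something confirmed by this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metadata:  # stand-in for the element metadata object (assumption)
    image_base64: Optional[str] = None
    image_mime_type: Optional[str] = None
    image_url: Optional[str] = None


@dataclass
class Image:  # stand-in for the Image element (assumption)
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def image_to_md(element: Image, exclude_binary_image_data: bool = False) -> str:
    # Read each attribute once and reuse the local references below.
    metadata = element.metadata
    text = element.text

    image_base64 = metadata.image_base64
    if image_base64 is not None and not exclude_binary_image_data:
        # Fall back to a wildcard MIME type when none is recorded.
        mime_type = metadata.image_mime_type or "image/*"
        return f"![{text}](data:{mime_type};base64,{image_base64})"

    image_url = metadata.image_url
    if image_url is not None:
        return f"![{text}]({image_url})"

    # No usable image source: fall back to the element text.
    return text
```

A call such as `image_to_md(Image("alt", Metadata(image_base64="B64")), exclude_binary_image_data=True)` falls through to the plain text, which mirrors the fallback behavior described in the generated tests.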

Test Case Performance

The optimization shows consistent gains across all scenarios:

  • Large-scale performance (500 elements): 44.9% faster - demonstrates the optimization scales well with volume
  • Title conversions: 28-45% faster - benefits from eliminating pattern matching overhead for simple type checks
  • Image conversions: 18-40% faster - particularly strong gains due to reduced repeated metadata access
  • Mixed element workloads: 21-37% faster - shows consistent improvement regardless of element type distribution
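
For readers who want to reproduce "best of N runs" style numbers locally, a minimal sketch using the standard library's `timeit` module follows. The `best_of` helper and the string-joining workload are illustrative assumptions; this is not how Codeflash itself measures runtimes.

```python
import timeit
from typing import Callable


def best_of(func: Callable[[], object], runs: int = 35, number: int = 200) -> float:
    """Return the fastest average per-call time in seconds across `runs` repetitions."""
    timings = timeit.repeat(func, repeat=runs, number=number)
    # The minimum is the least noisy estimate: slower runs mostly reflect
    # interference from other processes, not the code under test.
    return min(timings) / number


# Illustrative workload: joining 500 short lines, similar in size to the
# large-scale test above.
lines = [f"line {i}" for i in range(500)]
per_call = best_of(lambda: "\n".join(lines))
print(f"best of 35 runs: {per_call * 1e6:.2f} μs per call")
```

To compare two implementations, run `best_of` on each with the same inputs and compare the two per-call times.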

Impact on Production Workloads

Based on the function_references, this function is called from json_to_format() in a document conversion pipeline. Since it processes entire documents (potentially hundreds of elements), the 41% speedup should carry over to the markdown rendering step of batch conversion jobs. The optimization is most valuable when format_type == "markdown", as every element in the document flows through element_to_md().

Correctness verification report:

| Test | Status |
|:---|:---|
| ⚙️ Existing Unit Tests | 38 Passed |
| 🌀 Generated Regression Tests | 55 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 1 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Click to see Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| staging/test_base.py::test_elements_to_md_conversion | 794μs | 570μs | 39.3%✅ |
| staging/test_base.py::test_elements_to_md_file_output | 115μs | 97.0μs | 19.5%✅ |

🌀 Click to see Generated Regression Tests
# imports

from unstructured.documents.elements import Image, Table, Text, Title

# Import the real functions and classes from the codebase under test.
from unstructured.staging.base import elements_to_md


def test_title_and_text_basic_combination():
    # Title should be converted to markdown header "# text"
    title = Title(text="Chapter 1")
    # Text (a general paragraph element) should be returned verbatim
    paragraph = Text(text="This is a paragraph.")
    # Combined should join with a single newline in the same order as the iterable
    codeflash_output = elements_to_md([title, paragraph])
    result = codeflash_output  # 7.52μs -> 5.85μs (28.5% faster)


def test_table_prefers_text_as_html_over_raw_text():
    # Table should prefer metadata.text_as_html when present
    table = Table(text="raw table text")
    # Set metadata.text_as_html to simulate a table that has HTML representation
    table.metadata.text_as_html = "<table><tr><td>1</td></tr></table>"
    codeflash_output = elements_to_md([table])
    result = codeflash_output  # 5.51μs -> 4.33μs (27.2% faster)


def test_image_base64_without_mime_returns_data_image_star():
    # Image with image_base64 and no mime type should use image/* in data URI
    img = Image(text="an image")
    img.metadata.image_base64 = "BASE64DATA"
    img.metadata.image_mime_type = None  # explicit None to trigger first image case
    codeflash_output = elements_to_md([img])
    result = codeflash_output  # 12.7μs -> 10.8μs (18.3% faster)


def test_image_base64_with_mime_returns_specific_mime():
    # Image with image_base64 and a mime type should include that mime type in data URI
    img = Image(text="logo")
    img.metadata.image_base64 = "B64"
    img.metadata.image_mime_type = "image/png"
    codeflash_output = elements_to_md([img])
    result = codeflash_output  # 7.03μs -> 5.30μs (32.6% faster)


def test_image_url_used_when_present():
    # If image_url is present, it should be used to form the markdown image link
    img = Image(text="remote")
    img.metadata.image_url = "https://example.com/image.png"
    # Ensure no base64 data is set so the url-branch is matched
    img.metadata.image_base64 = None
    codeflash_output = elements_to_md([img])
    result = codeflash_output  # 14.6μs -> 10.8μs (34.9% faster)


def test_exclude_binary_image_data_true_causes_fallback_to_text_if_no_url():
    # When exclude_binary_image_data is True and only base64 is present (no url),
    # the image cases for base64 shouldn't match and the fallback should be element.text
    img = Image(text="fallback-text")
    img.metadata.image_base64 = "SOMEBASE64"
    img.metadata.image_mime_type = "image/jpeg"
    # No image_url set; exclude_binary_image_data should cause fallback
    codeflash_output = elements_to_md([img], exclude_binary_image_data=True)
    result = codeflash_output  # 13.4μs -> 10.8μs (24.1% faster)


def test_writes_file_and_respects_encoding(tmp_path):
    # Create elements with a non-ascii character to ensure encoding is used
    title = Title(text="Título")  # include accented char
    paragraph = Text(text="Café")
    codeflash_output = elements_to_md([title, paragraph])
    md_content = codeflash_output  # 8.26μs -> 6.47μs (27.5% faster)
    # Write to a temporarily-provided path using the function under test
    file_path = tmp_path / "out.md"
    codeflash_output = elements_to_md([title, paragraph], filename=str(file_path), encoding="utf-8")
    returned = codeflash_output  # 77.3μs -> 72.7μs (6.30% faster)
    # Read the file back using the same encoding and compare
    read_back = file_path.read_text(encoding="utf-8")


def test_empty_iterable_returns_empty_string():
    # Passing an empty list should return an empty string (no trailing newline)
    codeflash_output = elements_to_md([])
    result = codeflash_output  # 2.03μs -> 1.76μs (15.5% faster)


def test_large_scale_many_elements_performance_and_correctness():
    # Create a moderately large number of elements (500) to test scaling behavior
    count = 500
    elements = [Text(text=f"line {i}") for i in range(count)]
    codeflash_output = elements_to_md(elements)
    result = codeflash_output  # 643μs -> 444μs (44.9% faster)
    # The resulting markdown should be the join of each "line i" separated by newlines
    expected = "\n".join(f"line {i}" for i in range(count))


def test_element_to_md_various_single_elements():
    # Table with html
    t = Table(text="raw")
    t.metadata.text_as_html = "<table/>"
    # Image with url
    img = Image(text="alt")
    img.metadata.image_url = "http://img"


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import os
import tempfile
from typing import Optional

# imports
import pytest

# Import real classes from their actual modules
from unstructured.documents.elements import Element, Image, Table, Title
from unstructured.staging.base import elements_to_md


# Helper function to create mock-like instances of real Element subclasses
def create_title(text: str) -> Title:
    """Create a Title element with the given text."""
    title = Title(text=text)
    return title


def create_table(text: str, html: Optional[str] = None) -> Table:
    """Create a Table element with optional HTML metadata."""
    table = Table(text=text)
    if html is not None:
        table.metadata.text_as_html = html
    return table


def create_image(
    text: str,
    base64: Optional[str] = None,
    mime_type: Optional[str] = None,
    url: Optional[str] = None,
) -> Image:
    """Create an Image element with various metadata options."""
    image = Image(text=text)
    if base64 is not None:
        image.metadata.image_base64 = base64
    if mime_type is not None:
        image.metadata.image_mime_type = mime_type
    if url is not None:
        image.metadata.image_url = url
    return image


def create_text_element(text: str) -> Element:
    """Create a generic Text element."""
    from unstructured.documents.elements import Text

    elem = Text(text=text)
    return elem


class TestBasicFunctionality:
    """Test basic functionality of elements_to_md with normal inputs."""

    def test_empty_elements_list(self):
        """Test that an empty list of elements returns an empty string."""
        codeflash_output = elements_to_md([])
        result = codeflash_output  # 1.97μs -> 1.73μs (13.7% faster)

    def test_single_title_element(self):
        """Test conversion of a single Title element."""
        title = create_title("My Title")
        codeflash_output = elements_to_md([title])
        result = codeflash_output  # 4.00μs -> 2.75μs (45.2% faster)

    def test_single_text_element(self):
        """Test conversion of a single generic Text element."""
        text_elem = create_text_element("Hello World")
        codeflash_output = elements_to_md([text_elem])
        result = codeflash_output  # 5.75μs -> 4.91μs (17.2% faster)

    def test_multiple_titles(self):
        """Test conversion of multiple Title elements."""
        titles = [create_title("First"), create_title("Second")]
        codeflash_output = elements_to_md(titles)
        result = codeflash_output  # 4.85μs -> 3.54μs (36.8% faster)

    def test_multiple_text_elements(self):
        """Test conversion of multiple generic text elements."""
        texts = [create_text_element("Line 1"), create_text_element("Line 2")]
        codeflash_output = elements_to_md(texts)
        result = codeflash_output  # 7.52μs -> 6.18μs (21.7% faster)

    def test_mixed_elements(self):
        """Test conversion of mixed element types."""
        elements = [
            create_title("Title"),
            create_text_element("Body text"),
            create_title("Another Title"),
        ]
        codeflash_output = elements_to_md(elements)
        result = codeflash_output  # 8.04μs -> 6.14μs (30.9% faster)

    def test_table_with_html_metadata(self):
        """Test that Table with text_as_html returns the HTML."""
        html_content = "<table><tr><td>Cell</td></tr></table>"
        table = create_table("Table text", html=html_content)
        codeflash_output = elements_to_md([table])
        result = codeflash_output  # 5.50μs -> 4.15μs (32.4% faster)

    def test_table_without_html_metadata(self):
        """Test that Table without HTML metadata returns text content."""
        table = create_table("Plain table text")
        codeflash_output = elements_to_md([table])
        result = codeflash_output  # 12.3μs -> 9.44μs (30.8% faster)

    def test_image_with_url(self):
        """Test Image element with image_url."""
        image = create_image("alt text", url="https://example.com/image.jpg")
        codeflash_output = elements_to_md([image])
        result = codeflash_output  # 14.2μs -> 10.2μs (39.4% faster)

    def test_image_with_base64_and_mime_type(self):
        """Test Image element with base64 data and MIME type."""
        image = create_image("alt", base64="abc123", mime_type="image/png")
        codeflash_output = elements_to_md([image])
        result = codeflash_output  # 6.94μs -> 5.42μs (28.0% faster)

    def test_image_with_base64_no_mime_type(self):
        """Test Image element with base64 data but no MIME type."""
        image = create_image("alt", base64="xyz789")
        codeflash_output = elements_to_md([image])
        result = codeflash_output  # 12.0μs -> 10.4μs (15.9% faster)


class TestFileOutput:
    """Test file output functionality."""

    def test_write_to_file(self):
        """Test writing markdown content to a file."""
        with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".md") as f:
            temp_file = f.name

        try:
            elements = [create_title("Test"), create_text_element("Content")]
            codeflash_output = elements_to_md(elements, filename=temp_file)
            result = codeflash_output

            # Verify file contents
            with open(temp_file, encoding="utf-8") as f:
                file_content = f.read()
        finally:
            if os.path.exists(temp_file):
                os.remove(temp_file)

    def test_write_to_file_with_custom_encoding(self):
        """Test writing to file with custom encoding."""
        with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".md") as f:
            temp_file = f.name

        try:
            elements = [create_text_element("Content with special chars: é à ü")]
            codeflash_output = elements_to_md(elements, filename=temp_file, encoding="utf-8")
            result = codeflash_output

            # Verify file was written with correct encoding
            with open(temp_file, encoding="utf-8") as f:
                file_content = f.read()
        finally:
            if os.path.exists(temp_file):
                os.remove(temp_file)

    def test_file_creation_with_nonexistent_directory(self):
        """Test that file creation fails gracefully when directory doesn't exist."""
        nonexistent_dir = "/nonexistent/dir/that/does/not/exist/file.md"
        elements = [create_text_element("test")]

        # This should raise an error since the directory doesn't exist
        with pytest.raises((FileNotFoundError, OSError)):
            elements_to_md(elements, filename=nonexistent_dir)  # 23.4μs -> 22.2μs (5.14% faster)

    def test_no_file_output_when_filename_is_none(self):
        """Test that no file is created when filename is None."""
        elements = [create_title("Test")]
        codeflash_output = elements_to_md(elements, filename=None)
        result = codeflash_output  # 4.36μs -> 3.35μs (30.2% faster)


class TestExcludeBinaryImageData:
    """Test exclude_binary_image_data parameter."""

    def test_exclude_base64_with_mime_type(self):
        """Test that base64 images are excluded when exclude_binary_image_data=True."""
        image = create_image("alt", base64="abc123", mime_type="image/png")
        codeflash_output = elements_to_md([image], exclude_binary_image_data=True)
        result = codeflash_output  # 13.2μs -> 10.5μs (25.3% faster)

    def test_exclude_base64_without_mime_type(self):
        """Test excluding base64 image without MIME type."""
        image = create_image("alt", base64="xyz789")
        codeflash_output = elements_to_md([image], exclude_binary_image_data=True)
        result = codeflash_output  # 14.3μs -> 11.6μs (22.7% faster)

    def test_keep_url_when_excluding_binary(self):
        """Test that URL-based images are kept when excluding binary data."""
        image = create_image("alt", url="https://example.com/image.jpg")
        codeflash_output = elements_to_md([image], exclude_binary_image_data=True)
        result = codeflash_output  # 14.3μs -> 10.7μs (33.4% faster)

    def test_exclude_binary_false_includes_base64(self):
        """Test that base64 images are included when exclude_binary_image_data=False."""
        image = create_image("alt", base64="abc123", mime_type="image/png")
        codeflash_output = elements_to_md([image], exclude_binary_image_data=False)
        result = codeflash_output  # 7.31μs -> 5.76μs (27.0% faster)

    def test_exclude_binary_with_mixed_images(self):
        """Test mixed images with exclude_binary_image_data=True."""
        images = [
            create_image("base64_img", base64="abc123", mime_type="image/png"),
            create_image("url_img", url="https://example.com/image.jpg"),
        ]
        codeflash_output = elements_to_md(images, exclude_binary_image_data=True)
        result = codeflash_output  # 19.9μs -> 14.6μs (36.2% faster)


class TestEdgeCases:
    """Test edge cases and unusual inputs."""

    def test_empty_string_element(self):
        """Test element with empty string text."""
        elem = create_text_element("")
        codeflash_output = elements_to_md([elem])
        result = codeflash_output  # 5.95μs -> 4.91μs (21.1% faster)

    def test_whitespace_only_element(self):
        """Test element with only whitespace."""
        elem = create_text_element("   \n\t  ")
        codeflash_output = elements_to_md([elem])
        result = codeflash_output  # 5.81μs -> 4.82μs (20.7% faster)

    def test_very_long_text(self):
        """Test element with very long text."""
        long_text = "a" * 10000
        elem = create_text_element(long_text)
        codeflash_output = elements_to_md([elem])
        result = codeflash_output  # 5.74μs -> 4.85μs (18.3% faster)

    def test_special_characters_in_text(self):
        """Test text with special characters and symbols."""
        special_text = "!@#$%^&*()_+-=[]{}|;:',.<>?/~`"
        elem = create_text_element(special_text)
        codeflash_output = elements_to_md([elem])
        result = codeflash_output  # 5.85μs -> 4.84μs (20.8% faster)

    def test_unicode_characters(self):
        """Test text with various Unicode characters."""
        unicode_text = "Hello 世界 مرحبا мир 🌍"
        elem = create_text_element(unicode_text)
        codeflash_output = elements_to_md([elem])
        result = codeflash_output  # 5.91μs -> 4.78μs (23.6% faster)

    def test_newlines_in_element_text(self):
        """Test element with embedded newlines."""
        text_with_newlines = "Line 1\nLine 2\nLine 3"
        elem = create_text_element(text_with_newlines)
        codeflash_output = elements_to_md([elem])
        result = codeflash_output  # 5.74μs -> 5.13μs (12.0% faster)

    def test_title_with_special_characters(self):
        """Test title containing special markdown characters."""
        title = create_title("Title with #hashtag and **bold** syntax")
        codeflash_output = elements_to_md([title])
        result = codeflash_output  # 3.96μs -> 2.87μs (38.3% faster)

    def test_image_with_empty_alt_text(self):
        """Test image element with empty alt text."""
        image = create_image("", url="https://example.com/image.jpg")
        codeflash_output = elements_to_md([image])
        result = codeflash_output  # 14.4μs -> 10.2μs (40.2% faster)

    def test_image_with_empty_base64(self):
        """Test image with empty base64 string."""
        image = create_image("alt", base64="", mime_type="image/png")
        codeflash_output = elements_to_md([image])
        result = codeflash_output  # 6.95μs -> 5.27μs (31.9% faster)

    def test_table_html_with_special_characters(self):
        """Test table HTML content with special characters."""
        html = "<table><tr><td>&lt;tag&gt;</td></tr></table>"
        table = create_table("", html=html)
        codeflash_output = elements_to_md([table])
        result = codeflash_output  # 5.45μs -> 4.18μs (30.2% faster)
from unstructured.staging.base import elements_to_md


def test_elements_to_md():
    elements_to_md((), filename=None, exclude_binary_image_data=True, encoding="")
🔎 Click to see Concolic Coverage Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|:---|:---|:---|:---|
| codeflash_concolic_xdo_puqm/tmp7u6ihkg6/test_concolic_coverage.py::test_elements_to_md | 2.29μs | 2.23μs | 2.55%✅ |

To edit these changes, run git checkout codeflash/optimize-elements_to_md-mkrzl707 and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 07:27
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 24, 2026