@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 2,121% (21.21x) speedup for bytes_string_to_string in unstructured/cleaners/core.py

⏱️ Runtime: 5.04 milliseconds → 227 microseconds (best of 65 runs)

📝 Explanation and details

The optimized code achieves a 21x speedup (2121%) by replacing an inefficient character-by-character byte construction with Python's native encode() method.

Key Optimization

Original approach:

text_bytes = bytes([ord(char) for char in text])
  • Creates a list comprehension iterating over every character
  • Calls ord() for each character individually
  • Constructs an intermediate list in memory
  • Converts the list to bytes
  • Line profiler shows: 28.7ms (92.4% of total time)
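For concreteness, the steps of the original construction can be traced on a tiny input (a minimal sketch; the variable names are illustrative):

```python
text = "hi"

# Step 1: build an intermediate Python list of integer code points,
# calling ord() once per character at the Python level.
ords = [ord(char) for char in text]
assert ords == [104, 105]

# Step 2: convert the list to an immutable bytes object.
text_bytes = bytes(ords)
assert text_bytes == b"hi"
```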

Optimized approach:

text_bytes = text.encode("latin-1")
  • Uses Python's built-in string encoding directly
  • Latin-1 encoding maps characters 0-255 to bytes 1:1 (identical to the original behavior)
  • No intermediate list creation
  • Line profiler shows: 106μs (4.1% of total time)
  • ~270x faster on the critical line

Why This Works

The original function's purpose is to interpret a string where each character represents a byte value (ord 0-255), then decode those bytes using a specified encoding. Latin-1 encoding has the unique property that it directly maps Unicode codepoints 0-255 to bytes 0-255, making text.encode("latin-1") functionally equivalent to bytes([ord(char) for char in text]) but implemented in optimized C code.
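The byte-for-byte equivalence is easy to check directly (a small self-contained sketch; the UTF-8 bytes of "café" are used as an example payload):

```python
# Build a "bytes-as-string": each character's code point is a byte value 0-255.
# Here the payload is the UTF-8 encoding of "café" (0xC3 0xA9 for "é").
raw = b"caf\xc3\xa9"
text = "".join(chr(b) for b in raw)

# Both constructions yield identical bytes...
assert bytes([ord(char) for char in text]) == text.encode("latin-1") == raw

# ...and decoding those bytes as UTF-8 recovers the intended string.
assert text.encode("latin-1").decode("utf-8") == "café"
```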

Error Handling

Added a try-except block to preserve original behavior:

except UnicodeEncodeError:
    raise ValueError("bytes must be in range(0, 256)") from None

The original would raise ValueError if any character had ord > 255; the optimized version catches UnicodeEncodeError from encode() and converts it to the same ValueError.
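Putting the pieces together, the optimized function might look like the following (a sketch based on this PR's description; the real function in unstructured/cleaners/core.py also normalizes the encoding name via format_encoding_str, which is omitted here):

```python
def bytes_string_to_string_sketch(text: str, encoding: str = "utf-8") -> str:
    """Interpret each character of `text` as a byte value and decode the result."""
    try:
        # Latin-1 maps code points 0-255 to bytes 0-255 one-to-one.
        text_bytes = text.encode("latin-1")
    except UnicodeEncodeError:
        # Match the ValueError that bytes([ord(c) for c in text]) would have raised.
        raise ValueError("bytes must be in range(0, 256)") from None
    return text_bytes.decode(encoding)
```

For example, `bytes_string_to_string_sketch(chr(0xC3) + chr(0xA9))` decodes the two byte-valued characters as UTF-8 and returns "é", while `chr(256)` raises the same ValueError as the original.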

Performance Impact by Test Category

  • Small strings (< 20 chars): 30-100% faster (microseconds saved)
  • Large strings (> 1000 chars): 5000-13000% faster (hundreds of microseconds saved)
    • Example: 8000-char string goes from 428μs to 5.19μs (8141% faster)
    • The performance gap grows linearly with string length due to eliminating the Python-level loop

The optimization is particularly impactful for any workload processing moderate-to-large strings, as the speedup scales directly with input size.
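The scaling claim is straightforward to reproduce with timeit (a rough sketch; absolute timings depend on the machine and Python build):

```python
import timeit

text = "a" * 5000  # mirrors the large-string test cases above

t_orig = timeit.timeit(lambda: bytes([ord(char) for char in text]), number=200)
t_opt = timeit.timeit(lambda: text.encode("latin-1"), number=200)

# Both approaches must agree byte-for-byte before comparing speed.
assert bytes([ord(char) for char in text]) == text.encode("latin-1")
print(f"original: {t_orig:.4f}s  optimized: {t_opt:.4f}s  ratio: {t_orig / t_opt:.0f}x")
```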

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 20 Passed
🌀 Generated Regression Tests 50 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 1 Passed
📊 Tests Coverage 100.0%
⚙️ Click to see Existing Unit Tests
Test File::Test Function Original ⏱️ Optimized ⏱️ Speedup
cleaners/test_core.py::test_bytes_string_to_string 6.51μs 3.73μs 74.6%✅
🌀 Click to see Generated Regression Tests
import pytest  # used for our unit tests

from unstructured.cleaners.core import bytes_string_to_string


def test_basic_ascii_default_utf8():
    # Basic ASCII string should be preserved with default utf-8 encoding
    input_text = "Hello, world!"  # ASCII characters only
    # The function builds bytes from ords and decodes with utf-8 by default
    codeflash_output = bytes_string_to_string(input_text)
    result = codeflash_output  # 5.68μs -> 2.90μs (95.7% faster)


def test_multibyte_utf8_sequence_decodes_to_unicode_char():
    # UTF-8 encoded 'é' is two bytes: 0xC3 0xA9. Build string with those byte values.
    input_text = chr(0xC3) + chr(0xA9)  # two chars with ord 195 and 169
    # Decoding these bytes as utf-8 should yield the single character 'é'
    codeflash_output = bytes_string_to_string(input_text, encoding="utf-8")
    result = codeflash_output  # 5.40μs -> 4.06μs (33.1% faster)


def test_encoding_normalization_underscore_and_case():
    # Ensure format_encoding_str normalizes underscores to hyphens and lowercases
    input_text = "AB"  # simple ASCII that decodes the same in many encodings
    # Use a mixed-case, underscore-containing encoding name
    codeflash_output = bytes_string_to_string(input_text, encoding="UTF_8")
    result = codeflash_output  # 4.80μs -> 3.65μs (31.5% faster)


def test_empty_string_returns_empty_string():
    # Empty input string should produce empty bytes and decode to empty string
    input_text = ""
    codeflash_output = bytes_string_to_string(input_text, encoding="utf-8")
    result = codeflash_output  # 3.83μs -> 3.19μs (20.1% faster)


def test_null_byte_handling_preserved():
    # Include a null byte in the middle; should be preserved after decode
    input_text = "A" + chr(0) + "B"
    # Decoding should yield a string containing the NUL character
    codeflash_output = bytes_string_to_string(input_text, encoding="utf-8")
    result = codeflash_output  # 4.66μs -> 3.33μs (39.9% faster)


def test_value_error_when_input_has_ord_over_255():
    # If any character has ord > 255, bytes([ord(char) ...]) must raise ValueError
    # Create a character with ord 256 (outside 0-255 byte range)
    input_text = chr(256)  # '\u0100'
    with pytest.raises(ValueError):
        codeflash_output = bytes_string_to_string(input_text, encoding="utf-8")
        _ = codeflash_output  # 3.63μs -> 4.81μs (24.5% slower)


def test_invalid_encoding_raises_lookup_error():
    # If the provided encoding (after formatting) is unknown, decode() should raise LookupError
    input_text = "A"  # simple byte that any codec could decode, but codec is invalid
    with pytest.raises(LookupError):
        codeflash_output = bytes_string_to_string(
            input_text, encoding="this-encoding-does-not-exist"
        )
        _ = codeflash_output  # 9.10μs -> 8.44μs (7.75% faster)


def test_annotation_removal_with_underscore_triggers_expected_codec():
    # The format_encoding_str removes directional annotations from certain encodings.
    # Use an annotated name with underscores which becomes one of annotated_encodings
    input_text = "AB"  # ASCII bytes that decode identically across many encodings
    # 'ISO_8859_6_I' -> 'iso-8859-6-i' -> trimmed to 'iso-8859-6'
    codeflash_output = bytes_string_to_string(input_text, encoding="ISO_8859_6_I")
    result = codeflash_output  # 9.36μs -> 8.03μs (16.5% faster)


def test_annotation_removal_with_hyphen_triggers_expected_codec():
    # Same as above but provide the hyphenated annotated form directly
    input_text = "AB"
    # 'iso-8859-6-i' should be transformed to 'iso-8859-6' by format_encoding_str logic
    codeflash_output = bytes_string_to_string(input_text, encoding="iso-8859-6-i")
    result = codeflash_output  # 8.87μs -> 7.55μs (17.5% faster)


def test_large_scale_latin1_identity_preserved():
    # Construct a large but bounded (<1000) sequence of byte-values as characters.
    # Using latin-1 (iso-8859-1) as the decoding ensures a one-to-one mapping of byte->codepoint.
    length = 500  # well under 1000 to keep test quick and deterministic
    # Build a string where each character's ord is i % 256
    large_input = "".join(chr(i % 256) for i in range(length))
    # Decoding via iso-8859-1 (latin-1) should produce a string whose codepoints equal the original ords
    codeflash_output = bytes_string_to_string(large_input, encoding="iso_8859_1")
    result = codeflash_output  # 32.7μs -> 3.83μs (753% faster)


def test_large_scale_utf8_roundtrip_for_ascii_subset():
    # Another large-scale test using ASCII-range bytes which are identical in UTF-8
    length = 700  # still under 1000
    # Build a repeating ASCII sequence (values 32..126) which are single-byte in UTF-8
    ascii_vals = [32 + (i % 95) for i in range(length)]  # printable ASCII range
    input_text = "".join(chr(v) for v in ascii_vals)
    # Decoding as utf-8 should reproduce the same characters
    codeflash_output = bytes_string_to_string(input_text, encoding="UTF_8")
    result = codeflash_output  # 43.3μs -> 3.84μs (1027% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.cleaners.core import bytes_string_to_string


def test_basic_utf8_decoding():
    """Test basic UTF-8 decoding with ASCII characters."""
    text = "hello"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 5.02μs -> 3.19μs (57.5% faster)


def test_basic_utf8_with_default_encoding():
    """Test UTF-8 decoding using default encoding parameter."""
    text = "world"
    codeflash_output = bytes_string_to_string(text)
    result = codeflash_output  # 4.70μs -> 2.78μs (69.1% faster)


def test_utf8_with_numbers():
    """Test UTF-8 decoding with numeric characters."""
    text = "test123"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 5.19μs -> 3.19μs (62.7% faster)


def test_utf8_with_special_characters():
    """Test UTF-8 decoding with special characters and punctuation."""
    text = "hello-world_test.txt"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 6.24μs -> 3.08μs (103% faster)


def test_utf8_with_spaces():
    """Test UTF-8 decoding with spaces between words."""
    text = "hello world test"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 5.92μs -> 3.21μs (84.3% faster)


def test_encoding_normalization_uppercase():
    """Test that encoding parameter is normalized from uppercase."""
    text = "test"
    codeflash_output = bytes_string_to_string(text, encoding="UTF-8")
    result = codeflash_output  # 4.83μs -> 3.12μs (54.4% faster)


def test_encoding_normalization_underscore():
    """Test that encoding parameter is normalized from underscore to hyphen."""
    text = "test"
    codeflash_output = bytes_string_to_string(text, encoding="utf_8")
    result = codeflash_output  # 4.82μs -> 3.33μs (44.9% faster)


def test_encoding_normalization_mixed_case_underscore():
    """Test encoding normalization with mixed case and underscores."""
    text = "test"
    codeflash_output = bytes_string_to_string(text, encoding="UTF_8")
    result = codeflash_output  # 5.01μs -> 3.43μs (46.0% faster)


def test_iso_8859_1_encoding():
    """Test ISO-8859-1 (Latin-1) encoding."""
    text = "café"
    codeflash_output = bytes_string_to_string(text, encoding="iso-8859-1")
    result = codeflash_output  # 5.05μs -> 3.39μs (49.0% faster)


def test_iso_8859_1_normalization():
    """Test ISO-8859-1 encoding with underscore normalization."""
    text = "test"
    codeflash_output = bytes_string_to_string(text, encoding="ISO_8859_1")
    result = codeflash_output  # 5.23μs -> 3.63μs (44.1% faster)


def test_empty_string():
    """Test decoding an empty string."""
    text = ""
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 3.94μs -> 3.04μs (29.8% faster)


def test_single_character():
    """Test decoding a single character."""
    text = "a"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 4.45μs -> 3.15μs (41.2% faster)


def test_newline_character():
    """Test decoding a newline character."""
    text = "\n"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 4.29μs -> 3.16μs (35.7% faster)


def test_tab_character():
    """Test decoding a tab character."""
    text = "\t"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 4.42μs -> 3.17μs (39.3% faster)


def test_mixed_whitespace():
    """Test decoding mixed whitespace characters."""
    text = "hello\n\tworld"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 5.64μs -> 3.23μs (74.7% faster)


def test_iso_8859_6_e_normalization():
    """Test ISO-8859-6-E (Arabic with directionality) encoding normalization."""
    text = "test"
    # The function should strip the -e annotation and use iso-8859-6
    codeflash_output = bytes_string_to_string(text, encoding="iso-8859-6-e")
    result = codeflash_output  # 9.30μs -> 7.84μs (18.5% faster)


def test_iso_8859_6_i_normalization():
    """Test ISO-8859-6-I (Arabic with directionality) encoding normalization."""
    text = "test"
    # The function should strip the -i annotation and use iso-8859-6
    codeflash_output = bytes_string_to_string(text, encoding="iso-8859-6-i")
    result = codeflash_output  # 8.98μs -> 7.53μs (19.3% faster)


def test_iso_8859_8_e_normalization():
    """Test ISO-8859-8-E (Hebrew with directionality) encoding normalization."""
    text = "test"
    # The function should strip the -e annotation and use iso-8859-8
    codeflash_output = bytes_string_to_string(text, encoding="iso-8859-8-e")
    result = codeflash_output  # 9.33μs -> 7.59μs (22.9% faster)


def test_iso_8859_8_i_normalization():
    """Test ISO-8859-8-I (Hebrew with directionality) encoding normalization."""
    text = "test"
    # The function should strip the -i annotation and use iso-8859-8
    codeflash_output = bytes_string_to_string(text, encoding="iso-8859-8-i")
    result = codeflash_output  # 9.09μs -> 7.48μs (21.5% faster)


def test_encoding_with_mixed_normalization():
    """Test encoding normalization combining case and underscore."""
    text = "data"
    codeflash_output = bytes_string_to_string(text, encoding="ISO_8859_6_E")
    result = codeflash_output  # 9.29μs -> 7.66μs (21.3% faster)


def test_us_ascii_encoding():
    """Test US-ASCII encoding."""
    text = "ASCII"
    codeflash_output = bytes_string_to_string(text, encoding="us-ascii")
    result = codeflash_output  # 5.27μs -> 3.38μs (56.1% faster)


def test_us_ascii_normalization():
    """Test US-ASCII encoding with underscore normalization."""
    text = "test"
    codeflash_output = bytes_string_to_string(text, encoding="US_ASCII")
    result = codeflash_output  # 5.16μs -> 3.64μs (41.7% faster)


def test_latin1_alias():
    """Test latin-1 alias for iso-8859-1."""
    text = "latin"
    codeflash_output = bytes_string_to_string(text, encoding="latin-1")
    result = codeflash_output  # 5.20μs -> 3.35μs (55.3% faster)


def test_latin1_alias_normalization():
    """Test latin-1 alias with underscore normalization."""
    text = "test"
    codeflash_output = bytes_string_to_string(text, encoding="LATIN_1")
    result = codeflash_output  # 5.01μs -> 3.58μs (39.8% faster)


def test_ascii_only_characters():
    """Test string with only ASCII characters (0-127 range)."""
    text = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 10.1μs -> 3.27μs (209% faster)


def test_repeated_characters():
    """Test string with repeated characters."""
    text = "aaaa"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 4.63μs -> 3.27μs (41.4% faster)


def test_alphanumeric_with_symbols():
    """Test alphanumeric characters mixed with symbols."""
    text = "abc!@#$%^&*()"
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 5.84μs -> 3.27μs (78.6% faster)


def test_large_string_ascii():
    """Test decoding a large string with ASCII characters."""
    # Create a large string with 5000 ASCII characters
    text = "a" * 5000
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 277μs -> 5.12μs (5324% faster)


def test_large_string_repeated_pattern():
    """Test decoding a large string with repeated pattern."""
    # Create a pattern and repeat it 500 times
    pattern = "hello world "
    text = pattern * 500
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 319μs -> 4.92μs (6395% faster)


def test_large_string_with_numbers():
    """Test decoding a large string with numbers."""
    # Create a large string with 10000 numeric characters
    text = "0123456789" * 1000
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 527μs -> 5.57μs (9362% faster)


def test_large_string_with_mixed_content():
    """Test decoding a large string with mixed content."""
    # Create a pattern with mixed characters
    pattern = "abc123XYZ!@#\n\t"
    text = pattern * 300
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 222μs -> 4.45μs (4902% faster)


def test_large_string_all_ascii_symbols():
    """Test decoding a large string with all printable ASCII symbols."""
    # Use all printable ASCII characters (32-126)
    ascii_chars = "".join(chr(i) for i in range(32, 127))
    text = ascii_chars * 100
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 492μs -> 5.43μs (8972% faster)


def test_large_string_default_encoding():
    """Test decoding a large string with default UTF-8 encoding."""
    text = "test_content_" * 500
    codeflash_output = bytes_string_to_string(text)
    result = codeflash_output  # 337μs -> 4.43μs (7525% faster)


def test_very_long_single_word():
    """Test decoding a very long single word."""
    text = "a" * 8000
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 428μs -> 5.19μs (8141% faster)


def test_large_string_with_sentence_structure():
    """Test decoding a large string with sentence structure."""
    sentence = "The quick brown fox jumps over the lazy dog. "
    text = sentence * 200
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 466μs -> 5.39μs (8546% faster)


def test_performance_consistent_across_encodings():
    """Test that performance is consistent for same text across different encoding formats."""
    text = "performance_test_" * 400

    # Test with different encoding formats
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result1 = codeflash_output  # 352μs -> 4.94μs (7034% faster)
    codeflash_output = bytes_string_to_string(text, encoding="UTF-8")
    result2 = codeflash_output  # 347μs -> 2.62μs (13133% faster)
    codeflash_output = bytes_string_to_string(text, encoding="utf_8")
    result3 = codeflash_output  # 349μs -> 2.55μs (13611% faster)


def test_large_multiline_string():
    """Test decoding a large multiline string."""
    line = "This is a line of text.\n"
    text = line * 500
    codeflash_output = bytes_string_to_string(text, encoding="utf-8")
    result = codeflash_output  # 620μs -> 6.12μs (10040% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.cleaners.core import bytes_string_to_string


def test_bytes_string_to_string():
    bytes_string_to_string("", encoding="")
🔎 Click to see Concolic Coverage Tests
Test File::Test Function Original ⏱️ Optimized ⏱️ Speedup
codeflash_concolic_xdo_puqm/tmpl5i6ubkt/test_concolic_coverage.py::test_bytes_string_to_string 3.89μs 3.01μs 29.0%✅

To edit these changes git checkout codeflash/optimize-bytes_string_to_string-mkrwky7e and push.


@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 06:02
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 24, 2026