⚡️ Speed up function `calculate_edit_distance` by 150% #269

codeflash-ai · 2026-01-24T08:36:37Z

📄 150% (1.50x) speedup for `calculate_edit_distance` in `unstructured/metrics/text_extraction.py`

⏱️ Runtime : 5.47 milliseconds → 2.19 milliseconds (best of 60 runs)

📝 Explanation and details

The optimized code runs 150% faster (5.47ms → 2.19ms, about 2.5x the original speed) by eliminating redundant dictionary construction and replacing repeated per-character `str.replace()` calls with a pre-computed translation table.
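For reference, the headline figure follows from the two runtimes reported above. The sketch below assumes the percentage is computed as (original − optimized) / optimized, which matches the per-test figures in this report; the helper name `speedup_percent` is illustrative and is not part of Codeflash or `unstructured`.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Return how much faster the optimized run is, as a percentage."""
    return (original - optimized) / optimized * 100


pct = speedup_percent(5.47, 2.19)   # runtimes in milliseconds, from the report
ratio = 5.47 / 2.19                 # original time divided by optimized time
print(f"{pct:.0f}% faster, {ratio:.2f}x the original speed")
# prints: 150% faster, 2.50x the original speed
```

The same formula reproduces the per-test numbers, for example 45.0μs → 9.28μs gives roughly 385%.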

Key Optimizations

1. Module-level Pre-computation

The original code reconstructed the `double_quotes` and `single_quotes` dictionaries on every call to `standardize_quotes` (217 calls in the profile). This consumed ~23% of runtime just building dictionaries. The optimized version moves these to module-level constants (`_DOUBLE_QUOTES`, `_SINGLE_QUOTES`), computed once at import time.
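The hoisting pattern described above can be sketched as follows. This is a simplified illustration, not the code from `text_extraction.py`: the dictionary contents are a small assumed subset, and the two function names are hypothetical.

```python
# Built once at import time. Illustrative subset only; the real tables
# in text_extraction.py contain many more quote characters.
_DOUBLE_QUOTES = {"left": "\u201c", "right": "\u201d"}
_SINGLE_QUOTES = {"left": "\u2018", "right": "\u2019"}


def standardize_quotes_rebuilding(text: str) -> str:
    """Pattern before the change: both dicts are rebuilt on every call."""
    double_quotes = {"left": "\u201c", "right": "\u201d"}
    single_quotes = {"left": "\u2018", "right": "\u2019"}
    for char in double_quotes.values():
        text = text.replace(char, '"')
    for char in single_quotes.values():
        text = text.replace(char, "'")
    return text


def standardize_quotes_hoisted(text: str) -> str:
    """Pattern after hoisting: the function only reads module constants."""
    for char in _DOUBLE_QUOTES.values():
        text = text.replace(char, '"')
    for char in _SINGLE_QUOTES.values():
        text = text.replace(char, "'")
    return text
```

Both functions return the same string; the second simply avoids paying the dictionary construction cost on each call.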

2. Translation Table (`str.translate()`)

The original code used a loop with `unicode_to_char()` conversions and individual `str.replace()` calls for each quote type (~40 iterations per call). The optimized version pre-computes all unicode characters and builds a single translation table (`_QUOTE_TRANSLATION`) using `str.maketrans()`. This allows `str.translate()` to replace all quote characters in a single pass through the string, which is implemented in C and far more efficient than Python loops with multiple `replace()` calls.

Line profiler shows `standardize_quotes` dropped from 25.9ms total time (with ~65% spent in loops and dictionary construction) to just 0.5ms (single translate call).
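A minimal sketch of the translation-table approach, assuming a small illustrative set of quote characters. The name `_QUOTE_TRANSLATION` comes from the description above; the character lists and the function body here are assumptions, not the PR's actual code.

```python
# Illustrative subsets; the real module covers many more code points.
_DOUBLE = "\u201c\u201d"   # left and right double quotation marks
_SINGLE = "\u2018\u2019"   # left and right single quotation marks

# str.maketrans accepts a dict mapping single characters to replacements.
_QUOTE_TRANSLATION = str.maketrans(
    {**{ch: '"' for ch in _DOUBLE}, **{ch: "'" for ch in _SINGLE}}
)


def standardize_quotes(text: str) -> str:
    """Replace every known unicode quote in one pass over the string."""
    return text.translate(_QUOTE_TRANSLATION)


print(standardize_quotes("\u201chello\u201d and \u2018world\u2019"))
# prints: "hello" and 'world'
```

Because the table is built once at module level, each call reduces to a single `str.translate()` invocation regardless of how many quote variants the table contains.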

3. Faster Validation Check

Changed `return_as not in return_types`, which looked the value up in a list built on every call, to a check against a tuple literal: `return_as not in ("score", "distance")`. CPython stores a tuple of constants as a single compile-time constant, so the common path no longer allocates a container. The list is now only created in the error path (3 out of 105 calls).
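The validation change can be sketched as below. The standalone helper `validate_return_as` is hypothetical (in the PR the check lives inside `calculate_edit_distance`); the error message text is taken from the concolic test in this report.

```python
def validate_return_as(return_as: str) -> None:
    """Raise ValueError unless return_as is 'score' or 'distance'."""
    # The tuple literal is a compile-time constant, so nothing is
    # allocated here on the common (valid) path.
    if return_as not in ("score", "distance"):
        # The list is only built when an error is actually raised.
        return_types = ["score", "distance"]
        raise ValueError(
            f"Invalid return value type. Expected one of: {return_types}"
        )


validate_return_as("score")      # passes silently
validate_return_as("distance")   # passes silently
```

The check stays case-sensitive, so a value such as `"Score"` still raises `ValueError`, matching the behavior exercised by the regression tests.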

Impact on Workloads

The `function_references` show `calculate_edit_distance` is called by `calculate_accuracy`, which appears to be a high-level metric function. Given that the test results show 3-10x speedups on individual calls (e.g., 44μs → 9μs for typical inputs), any workflow processing multiple documents or computing accuracy metrics repeatedly will benefit significantly. The optimization is particularly effective when:

- Text contains many unicode quotes: The translation table eliminates the need to check each quote type individually
- Called in loops: Module-level constants amortize setup costs across all calls
- Large documents: The single-pass `translate()` scales better than multiple `replace()` operations (e.g., 500-char strings show 272% speedup)

Test cases with standard ASCII text show ~380% speedup, while those with unicode quotes show ~330% speedup - demonstrating consistent gains across input types. The optimization maintains correctness while reducing overhead from 94% of runtime to negligible levels.
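To check the single-pass claim on your own inputs, a small `timeit` harness along these lines may help. It is a sketch under assumptions: the four-character quote table and both function names are illustrative, and the sample string mirrors the one used in `test_large_string_with_many_unicode_quotes`. No expected timings are shown because they depend on the machine, the Python version, and how many characters the table covers.

```python
import timeit

# Illustrative table; the real module maps many more quote characters.
QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
TABLE = str.maketrans(QUOTES)


def with_replace(text: str) -> str:
    """One str.replace() pass over the string per quote character."""
    for src, dst in QUOTES.items():
        text = text.replace(src, dst)
    return text


def with_translate(text: str) -> str:
    """A single str.translate() pass using the prebuilt table."""
    return text.translate(TABLE)


sample = ("hello " * 50) + "\u201cworld\u201d"

# Both strategies must agree before timing them is meaningful.
assert with_replace(sample) == with_translate(sample)

for fn in (with_replace, with_translate):
    seconds = timeit.timeit(lambda: fn(sample), number=1000)
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per 1000 calls")
```

With a table this small the gap may be modest or even reversed; the benefit described above comes from replacing roughly 40 separate `replace()` passes and the per-call setup around them.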

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	✅ 42 Passed
🌀 Generated Regression Tests	✅ 66 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	✅ 3 Passed
📊 Tests Coverage	92.3%

⚙️ Click to see Existing Unit Tests

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`metrics/test_text_extraction.py::test_calculate_edit_distance`	293μs	52.0μs	465%✅
`metrics/test_text_extraction.py::test_calculate_edit_distance_with_filename`	259μs	144μs	78.7%✅
`metrics/test_text_extraction.py::test_calculate_edit_distance_with_various_whitespace_1`	1.24ms	654μs	90.0%✅
`metrics/test_text_extraction.py::test_calculate_edit_distance_with_various_whitespace_2`	518μs	463μs	11.7%✅

🌀 Click to see Generated Regression Tests

import pytest

from unstructured.metrics.text_extraction import calculate_edit_distance


class TestBasicFunctionality:
    """Test the fundamental functionality of calculate_edit_distance."""

    def test_identical_strings_return_zero_distance(self):
        """Test that identical strings return zero edit distance."""
        codeflash_output = calculate_edit_distance("hello", "hello", return_as="distance")
        result = codeflash_output  # 45.0μs -> 9.28μs (384% faster)

    def test_identical_strings_return_perfect_score(self):
        """Test that identical strings return score of 1.0."""
        codeflash_output = calculate_edit_distance("hello", "hello", return_as="score")
        result = codeflash_output  # 45.2μs -> 9.49μs (376% faster)

    def test_completely_different_strings_distance(self):
        """Test edit distance for completely different strings."""
        # "abc" vs "def" requires 3 substitutions with default weights (2,1,1)
        codeflash_output = calculate_edit_distance("abc", "def", return_as="distance")
        result = codeflash_output  # 44.6μs -> 9.29μs (380% faster)

    def test_completely_different_strings_score(self):
        """Test similarity score for completely different strings."""
        codeflash_output = calculate_edit_distance("abc", "def", return_as="score")
        result = codeflash_output  # 45.1μs -> 9.46μs (376% faster)

    def test_single_character_strings(self):
        """Test with single character strings."""
        codeflash_output = calculate_edit_distance("a", "a", return_as="distance")
        result = codeflash_output  # 44.7μs -> 8.69μs (414% faster)

    def test_single_character_different(self):
        """Test with different single character strings."""
        codeflash_output = calculate_edit_distance("a", "b", return_as="distance")
        result = codeflash_output  # 44.7μs -> 8.74μs (412% faster)

    def test_one_insertion_required(self):
        """Test when one insertion is needed."""
        # "cat" -> "cart" requires one insertion
        codeflash_output = calculate_edit_distance("cart", "cat", return_as="distance")
        result = codeflash_output  # 44.8μs -> 9.28μs (383% faster)

    def test_one_deletion_required(self):
        """Test when one deletion is needed."""
        # "cart" -> "cat" requires one deletion
        codeflash_output = calculate_edit_distance("cat", "cart", return_as="distance")
        result = codeflash_output  # 44.6μs -> 9.21μs (384% faster)

    def test_one_substitution_required(self):
        """Test when one substitution is needed."""
        # "cat" -> "bat" requires one substitution
        codeflash_output = calculate_edit_distance("bat", "cat", return_as="distance")
        result = codeflash_output  # 44.9μs -> 9.07μs (395% faster)

    def test_custom_weights_insertion(self):
        """Test custom weights affect the distance calculation."""
        # With default weights (2, 1, 1)
        codeflash_output = calculate_edit_distance(
            "test", "est", return_as="distance", weights=(2, 1, 1)
        )
        result1 = codeflash_output  # 44.9μs -> 9.10μs (393% faster)
        # With different weights for insertion/deletion
        codeflash_output = calculate_edit_distance(
            "test", "est", return_as="distance", weights=(5, 1, 1)
        )
        result2 = codeflash_output  # 34.7μs -> 3.30μs (950% faster)

    def test_custom_weights_deletion(self):
        """Test custom weights for deletion operation."""
        codeflash_output = calculate_edit_distance(
            "est", "test", return_as="distance", weights=(1, 2, 1)
        )
        result1 = codeflash_output  # 44.6μs -> 9.13μs (389% faster)
        codeflash_output = calculate_edit_distance(
            "est", "test", return_as="distance", weights=(1, 5, 1)
        )
        result2 = codeflash_output  # 34.8μs -> 3.37μs (932% faster)

    def test_custom_weights_substitution(self):
        """Test custom weights for substitution operation."""
        codeflash_output = calculate_edit_distance(
            "cat", "bat", return_as="distance", weights=(1, 1, 2)
        )
        result1 = codeflash_output  # 45.1μs -> 9.04μs (399% faster)
        codeflash_output = calculate_edit_distance(
            "cat", "bat", return_as="distance", weights=(1, 1, 5)
        )
        result2 = codeflash_output  # 34.4μs -> 3.14μs (997% faster)

    def test_whitespace_standardization_enabled(self):
        """Test that whitespace standardization works correctly."""
        # Multiple spaces should be treated as single space
        codeflash_output = calculate_edit_distance(
            "hello  world", "hello world", standardize_whitespaces=True
        )
        result1 = codeflash_output  # 45.5μs -> 10.5μs (333% faster)

    def test_whitespace_standardization_disabled(self):
        """Test when whitespace standardization is disabled."""
        codeflash_output = calculate_edit_distance(
            "hello  world", "hello world", standardize_whitespaces=False
        )
        result = codeflash_output  # 44.6μs -> 9.54μs (367% faster)

    def test_leading_trailing_whitespace_removed(self):
        """Test that leading/trailing whitespace is removed with standardization."""
        codeflash_output = calculate_edit_distance(
            "  hello  ", "hello", standardize_whitespaces=True
        )
        result = codeflash_output  # 45.1μs -> 9.72μs (364% faster)

    def test_tab_and_newline_standardization(self):
        """Test that tabs and newlines are standardized."""
        codeflash_output = calculate_edit_distance(
            "hello\tworld", "hello world", standardize_whitespaces=True
        )
        result = codeflash_output  # 45.5μs -> 10.6μs (329% faster)


class TestEdgeCases:
    """Test edge cases and unusual conditions."""

    def test_both_strings_none(self):
        """Test when both strings are None."""
        codeflash_output = calculate_edit_distance(None, None, return_as="distance")
        result = codeflash_output  # 43.7μs -> 7.22μs (505% faster)

    def test_both_strings_empty(self):
        """Test when both strings are empty."""
        codeflash_output = calculate_edit_distance("", "", return_as="distance")
        result = codeflash_output  # 43.8μs -> 7.18μs (510% faster)

    def test_output_none_source_present(self):
        """Test when output is None but source is present."""
        codeflash_output = calculate_edit_distance(None, "hello", return_as="distance")
        result = codeflash_output  # 44.4μs -> 8.79μs (405% faster)

    def test_output_present_source_none(self):
        """Test when output is present but source is None."""
        codeflash_output = calculate_edit_distance("hello", None, return_as="distance")
        result = codeflash_output  # 44.6μs -> 8.94μs (399% faster)

    def test_output_empty_string_source_present(self):
        """Test when output is empty string."""
        codeflash_output = calculate_edit_distance("", "hello", return_as="distance")
        result = codeflash_output  # 44.2μs -> 8.86μs (399% faster)

    def test_source_empty_string_output_present(self):
        """Test when source is empty string."""
        codeflash_output = calculate_edit_distance("hello", "", return_as="distance")
        result = codeflash_output  # 44.5μs -> 9.07μs (391% faster)

    def test_invalid_return_type_raises_error(self):
        """Test that invalid return_as parameter raises ValueError."""
        with pytest.raises(ValueError):
            calculate_edit_distance(
                "hello", "world", return_as="invalid"
            )  # 5.05μs -> 5.05μs (0.040% slower)

    def test_return_type_case_sensitive(self):
        """Test that return_as parameter is case-sensitive."""
        with pytest.raises(ValueError):
            calculate_edit_distance(
                "hello", "world", return_as="Score"
            )  # 5.27μs -> 4.91μs (7.35% faster)

    def test_distance_never_negative(self):
        """Test that distance is never negative."""
        codeflash_output = calculate_edit_distance("test", "best", return_as="distance")
        result = codeflash_output  # 45.5μs -> 9.53μs (377% faster)

    def test_score_never_exceeds_one(self):
        """Test that score is never greater than 1.0."""
        codeflash_output = calculate_edit_distance("test", "best", return_as="score")
        result = codeflash_output  # 45.0μs -> 9.44μs (376% faster)

    def test_score_never_below_zero(self):
        """Test that score is never less than 0.0."""
        codeflash_output = calculate_edit_distance("test", "x" * 1000, return_as="score")
        result = codeflash_output  # 55.3μs -> 21.2μs (160% faster)

    def test_very_long_output_vs_short_source(self):
        """Test when output is much longer than source."""
        codeflash_output = calculate_edit_distance("x" * 100, "a", return_as="score")
        result = codeflash_output  # 45.7μs -> 10.0μs (357% faster)

    def test_unicode_characters(self):
        """Test with unicode characters."""
        codeflash_output = calculate_edit_distance("café", "cafe", return_as="distance")
        result = codeflash_output  # 45.4μs -> 10.3μs (342% faster)

    def test_special_characters(self):
        """Test with special characters."""
        codeflash_output = calculate_edit_distance("hello!", "hello", return_as="distance")
        result = codeflash_output  # 44.9μs -> 9.27μs (384% faster)

    def test_numeric_strings(self):
        """Test with numeric strings."""
        codeflash_output = calculate_edit_distance("12345", "12345", return_as="distance")
        result = codeflash_output  # 44.6μs -> 9.27μs (381% faster)

    def test_mixed_alphanumeric(self):
        """Test with mixed alphanumeric strings."""
        codeflash_output = calculate_edit_distance("test123", "test456", return_as="distance")
        result = codeflash_output  # 44.8μs -> 9.65μs (364% faster)

    def test_quote_standardization_double_quotes(self):
        """Test that unicode double quotes are standardized."""
        # Using actual unicode quote characters
        codeflash_output = calculate_edit_distance(
            "\u201chello\u201d", '"hello"', return_as="score"
        )
        result = codeflash_output  # 47.5μs -> 11.1μs (327% faster)

    def test_quote_standardization_single_quotes(self):
        """Test that unicode single quotes are standardized."""
        # Using actual unicode quote characters
        codeflash_output = calculate_edit_distance(
            "\u2018hello\u2019", "'hello'", return_as="score"
        )
        result = codeflash_output  # 47.5μs -> 11.1μs (329% faster)

    def test_zero_weights_raises_no_error(self):
        """Test with zero weights (edge case for Levenshtein)."""
        # Zero weights are valid but may produce unexpected results
        codeflash_output = calculate_edit_distance(
            "hello", "world", return_as="distance", weights=(0, 0, 0)
        )
        result = codeflash_output  # 43.9μs -> 8.51μs (416% faster)

    def test_very_large_weights(self):
        """Test with very large weight values."""
        codeflash_output = calculate_edit_distance(
            "a", "b", return_as="distance", weights=(1000, 1000, 1000)
        )
        result = codeflash_output  # 45.0μs -> 8.83μs (409% faster)


class TestReturnTypes:
    """Test different return type specifications."""

    def test_return_distance_type(self):
        """Test that return_as='distance' returns correct type."""
        codeflash_output = calculate_edit_distance("hello", "hallo", return_as="distance")
        result = codeflash_output  # 44.8μs -> 9.24μs (385% faster)

    def test_return_score_type(self):
        """Test that return_as='score' returns correct type."""
        codeflash_output = calculate_edit_distance("hello", "hallo", return_as="score")
        result = codeflash_output  # 45.1μs -> 9.51μs (374% faster)

    def test_default_return_type_is_distance(self):
        """Test that default return_as is 'distance'."""
        codeflash_output = calculate_edit_distance("hello", "hallo")
        result1 = codeflash_output  # 44.2μs -> 9.12μs (385% faster)
        codeflash_output = calculate_edit_distance("hello", "hallo", return_as="distance")
        result2 = codeflash_output  # 34.7μs -> 3.79μs (816% faster)

    def test_score_and_distance_relationship(self):
        """Test that score and distance are inversely related."""
        codeflash_output = calculate_edit_distance("test", "best", return_as="score")
        score = codeflash_output  # 45.0μs -> 9.45μs (377% faster)
        codeflash_output = calculate_edit_distance("test", "best", return_as="distance")
        distance = codeflash_output  # 34.7μs -> 3.44μs (907% faster)
        # Score = 1 - (distance / len(source))
        expected_score = 1 - (distance / 4.0)  # "best" has 4 characters


class TestConsistency:
    """Test consistency across multiple calls."""

    def test_consistent_results_multiple_calls(self):
        """Test that multiple calls with same input produce same output."""
        codeflash_output = calculate_edit_distance("hello", "world")
        result1 = codeflash_output  # 44.6μs -> 9.22μs (384% faster)
        codeflash_output = calculate_edit_distance("hello", "world")
        result2 = codeflash_output  # 34.4μs -> 3.24μs (961% faster)
        codeflash_output = calculate_edit_distance("hello", "world")
        result3 = codeflash_output  # 33.4μs -> 2.68μs (1145% faster)

    def test_order_matters_for_output_vs_source(self):
        """Test that swapping output and source changes the result."""
        codeflash_output = calculate_edit_distance("cat", "dog", return_as="distance")
        result1 = codeflash_output  # 44.6μs -> 9.07μs (392% faster)
        codeflash_output = calculate_edit_distance("dog", "cat", return_as="distance")
        result2 = codeflash_output  # 34.4μs -> 3.22μs (970% faster)

    def test_symmetric_property_of_distance(self):
        """Test the symmetric property of edit distance."""
        codeflash_output = calculate_edit_distance("abc", "def", return_as="score")
        result1 = codeflash_output  # 44.9μs -> 9.38μs (378% faster)
        codeflash_output = calculate_edit_distance("def", "abc", return_as="score")
        result2 = codeflash_output  # 34.5μs -> 3.16μs (992% faster)


class TestLargeScale:
    """Test the function's performance and scalability with large inputs."""

    def test_large_identical_strings(self):
        """Test with large identical strings."""
        large_string = "a" * 500
        codeflash_output = calculate_edit_distance(large_string, large_string, return_as="distance")
        result = codeflash_output  # 46.6μs -> 12.5μs (272% faster)

    def test_large_completely_different_strings(self):
        """Test with large completely different strings."""
        string1 = "a" * 100
        string2 = "b" * 100
        codeflash_output = calculate_edit_distance(string1, string2, return_as="distance")
        result = codeflash_output  # 79.4μs -> 44.5μs (78.3% faster)

    def test_large_string_with_single_difference(self):
        """Test large strings differing by single character."""
        string1 = "a" * 250 + "b" + "a" * 250
        string2 = "a" * 250 + "c" + "a" * 250
        codeflash_output = calculate_edit_distance(string1, string2, return_as="distance")
        result = codeflash_output  # 46.8μs -> 12.3μs (280% faster)

    def test_large_string_single_insertion(self):
        """Test large string with single insertion needed."""
        string1 = "a" * 300
        string2 = "a" * 299
        codeflash_output = calculate_edit_distance(string1, string2, return_as="distance")
        result = codeflash_output  # 46.2μs -> 11.1μs (318% faster)

    def test_large_strings_whitespace_standardization(self):
        """Test large strings with lots of whitespace to standardize."""
        string1 = " ".join(["word"] * 100)
        string2 = "  ".join(["word"] * 100)
        codeflash_output = calculate_edit_distance(
            string1, string2, standardize_whitespaces=True, return_as="score"
        )
        result = codeflash_output  # 59.8μs -> 25.1μs (138% faster)

    def test_large_string_with_many_unicode_quotes(self):
        """Test large string with many unicode quote characters."""
        string1 = ("hello " * 50) + "\u201cworld\u201d"
        string2 = ("hello " * 50) + '"world"'
        codeflash_output = calculate_edit_distance(string1, string2, return_as="score")
        result = codeflash_output  # 56.9μs -> 36.9μs (54.2% faster)

    def test_performance_reasonable_for_medium_strings(self):
        """Test that function completes in reasonable time for medium strings."""
        # String of length ~500
        string1 = "a" * 250 + "b" * 250
        string2 = "a" * 248 + "b" * 252
        # Should complete without hanging
        codeflash_output = calculate_edit_distance(string1, string2, return_as="distance")
        result = codeflash_output  # 46.5μs -> 12.4μs (276% faster)

    def test_very_different_length_strings(self):
        """Test with strings of very different lengths."""
        short = "hi"
        long = "a" * 300
        codeflash_output = calculate_edit_distance(long, short, return_as="score")
        result = codeflash_output  # 48.5μs -> 13.3μs (265% faster)

    def test_repeated_patterns_large_string(self):
        """Test with large strings containing repeated patterns."""
        pattern = "abc" * 100
        codeflash_output = calculate_edit_distance(pattern, pattern, return_as="distance")
        result = codeflash_output  # 46.0μs -> 11.3μs (306% faster)

    def test_score_bounded_for_very_long_strings(self):
        """Test that score stays bounded even with very long strings."""
        long_string = "x" * 600
        source = "a" * 50
        codeflash_output = calculate_edit_distance(long_string, source, return_as="score")
        result = codeflash_output  # 167μs -> 132μs (26.1% faster)

    def test_distance_scales_with_string_size(self):
        """Test that distance scales appropriately with string size."""
        codeflash_output = calculate_edit_distance("a" * 10, "b" * 10, return_as="distance")
        result1 = codeflash_output  # 44.9μs -> 9.28μs (383% faster)
        codeflash_output = calculate_edit_distance("a" * 100, "b" * 100, return_as="distance")
        result2 = codeflash_output  # 70.0μs -> 38.4μs (82.2% faster)

    def test_large_string_with_whitespace_variations(self):
        """Test large strings with various whitespace patterns."""
        string1 = "word1 word2 word3" * 30
        string2 = "word1  word2  word3" * 30
        codeflash_output = calculate_edit_distance(
            string1, string2, standardize_whitespaces=True, return_as="score"
        )
        result = codeflash_output  # 55.2μs -> 21.4μs (157% faster)

    def test_many_small_edits_in_large_string(self):
        """Test large string with many scattered small edits."""
        base = list("a" * 200)
        modified = base.copy()
        # Make 10 edits
        for i in range(0, 200, 20):
            modified[i] = "b"
        codeflash_output = calculate_edit_distance(
            "".join(modified), "".join(base), return_as="distance"
        )
        result = codeflash_output  # 73.0μs -> 37.0μs (97.1% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pytest

from unstructured.metrics.text_extraction import calculate_edit_distance


def test_calculate_edit_distance():
    calculate_edit_distance(
        "", "", weights=(0, 0, 0), return_as="distance", standardize_whitespaces=False
    )


def test_calculate_edit_distance_2():
    with pytest.raises(
        ValueError,
        match="Invalid\\ return\\ value\\ type\\.\\ Expected\\ one\\ of:\\ \\['score',\\ 'distance'\\]",
    ):
        calculate_edit_distance(
            "", "", weights=(0, 0, 0), return_as="", standardize_whitespaces=True
        )


def test_calculate_edit_distance_3():
    calculate_edit_distance(
        "", "", weights=(0, 0, 0), return_as="score", standardize_whitespaces=False
    )

🔎 Click to see Concolic Coverage Tests

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`codeflash_concolic_xdo_puqm/tmpnhjwc_rx/test_concolic_coverage.py::test_calculate_edit_distance`	43.7μs	6.78μs	544%✅
`codeflash_concolic_xdo_puqm/tmpnhjwc_rx/test_concolic_coverage.py::test_calculate_edit_distance_2`	5.36μs	5.11μs	4.86%✅
`codeflash_concolic_xdo_puqm/tmpnhjwc_rx/test_concolic_coverage.py::test_calculate_edit_distance_3`	43.7μs	7.00μs	524%✅

To edit these changes, `git checkout codeflash/optimize-calculate_edit_distance-mks22n2z` and push.


codeflash-ai bot requested a review from aseembits93 January 24, 2026 08:36

codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Jan 24, 2026
