
Conversation


@codeflash-ai codeflash-ai bot commented Jan 24, 2026

📄 71% (0.71x) speedup for calculate_accuracy in unstructured/metrics/text_extraction.py

⏱️ Runtime : 6.44 milliseconds → 3.76 milliseconds (best of 132 runs)

📝 Explanation and details

The optimized code achieves a 71% speedup through three key optimizations that reduce redundant work in the common case:

What Changed

  1. Module-level constant for validation (_RETURN_TYPES): Moved the allowed return types to a module-level tuple instead of creating a new list on every function call.

  2. Conditional Unicode quote standardization: Added str.isascii() checks before calling standardize_quotes(). This expensive Unicode replacement operation (which iterates through ~40 quote mappings) is now skipped when strings contain only ASCII characters.

  3. Early equality check: After string preparation, added a fast-path check if output == source to immediately return the result without calling the expensive Levenshtein.distance() calculation (see the sketch after this list).
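
A minimal sketch of how the three changes fit together follows. It assumes rapidfuzz's Levenshtein.distance and uses abbreviated stand-ins for the module's prepare_str and standardize_quotes helpers, so it is illustrative rather than the PR's exact code.

# Minimal sketch (illustrative, not the PR's exact diff) of the optimized shape.
# prepare_str and standardize_quotes below are abbreviated stand-ins for the
# module's real helpers; rapidfuzz provides the Levenshtein implementation.
from rapidfuzz.distance import Levenshtein

_RETURN_TYPES = ("score", "distance")  # module-level: no list allocated per call

_QUOTE_MAP = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}  # abbreviated


def prepare_str(string, standardize_whitespaces=False):
    if not string:
        return ""
    return " ".join(string.split()) if standardize_whitespaces else str(string)


def standardize_quotes(text):
    # Stand-in for the full ~40-entry replacement loop described above.
    for unicode_quote, ascii_quote in _QUOTE_MAP.items():
        text = text.replace(unicode_quote, ascii_quote)
    return text


def calculate_edit_distance(output, source, weights=(2, 1, 1),
                            return_as="distance", standardize_whitespaces=True):
    if return_as not in _RETURN_TYPES:  # (1) tuple constant instead of a fresh list
        raise ValueError("Invalid return value type. Expected one of: %s" % (_RETURN_TYPES,))
    output = prepare_str(output, standardize_whitespaces)
    source = prepare_str(source, standardize_whitespaces)
    if not output.isascii():  # (2) skip quote standardization for pure-ASCII text
        output = standardize_quotes(output)
    if not source.isascii():
        source = standardize_quotes(source)
    if output == source:  # (3) identical strings need no Levenshtein call
        return 1.0 if return_as == "score" else 0
    distance = Levenshtein.distance(output, source, weights=weights)
    source_char_len = max(len(source), 1.0)
    bounded_percentage_distance = min(max(distance / source_char_len, 0.0), 1.0)
    return 1 - bounded_percentage_distance if return_as == "score" else distance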

Why It's Faster

ASCII check optimization: The line profiler shows standardize_quotes() consumed ~76% of runtime in the original (12.6ms out of 16.6ms total). Because str.isascii() is a fast C-level check, the optimization skips this expensive Unicode processing in most test cases: only 3 of the 71 function calls (4%) in the test suite actually needed quote standardization.

Early equality shortcut: When strings are identical after preprocessing (33 out of 71 calls = 46% of test cases), the optimized version immediately returns without computing Levenshtein distance (originally ~21.5% of runtime). The profiler confirms these 33 cases now exit early, avoiding the distance calculation entirely.
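
A quick, hypothetical micro-benchmark (not part of the PR's test suite) to observe the fast path locally:

# Hypothetical timing snippet; assumes the unstructured package is installed.
import timeit

setup = "from unstructured.metrics.text_extraction import calculate_accuracy"
identical = timeit.timeit("calculate_accuracy('a' * 500, 'a' * 500)", setup=setup, number=1_000)
different = timeit.timeit("calculate_accuracy('a' * 500, 'b' * 500)", setup=setup, number=1_000)
print(f"identical: {identical:.4f}s  different: {different:.4f}s  (1,000 calls each)")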

Validation overhead elimination: While small (0.4% of runtime), removing the list allocation on every call adds up, especially since the function references show this function is called from _process_document(), which processes multiple documents in evaluation workloads.

Impact on Workloads

Based on the function references, calculate_accuracy() is called from document evaluation pipelines (evaluate.py), where it compares extracted text against source documents. The optimizations are particularly effective for:

  • ASCII-only documents (most English text): Skip all Unicode quote processing
  • Identical text cases (perfect extraction): Return immediately without distance calculation
  • Validation-heavy paths: The module-level constant avoids repeated allocations in batch processing

The test results confirm this: identical string tests show 10-20x speedup (e.g., test_identical_strings_returns_perfect_score: 46.3μs → 3.96μs), while tests requiring actual Levenshtein computation show smaller but still meaningful gains (6-15%). The document evaluation context in _process_document() indicates this function may be called repeatedly in loops, amplifying the per-call savings.
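
For context, a simplified batch loop of the kind described above might look like the sketch below; score_documents is a hypothetical helper for illustration, not the actual _process_document() implementation.

# Hypothetical batch-evaluation loop showing how per-call savings compound.
from unstructured.metrics.text_extraction import calculate_accuracy

def score_documents(pairs):
    # pairs: iterable of (extracted_text, source_text) tuples
    return [calculate_accuracy(extracted, source) for extracted, source in pairs]

print(score_documents([("hello world", "hello world"), ("hallo", "hello")]))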

Correctness verification report:

Test                           Status
⚙️ Existing Unit Tests         🔘 None Found
🌀 Generated Regression Tests  73 Passed
⏪ Replay Tests                🔘 None Found
🔎 Concolic Coverage Tests     1 Passed
📊 Tests Coverage              100.0%
🌀 Click to see Generated Regression Tests
from typing import Optional, Tuple

# imports
import pytest  # used for our unit tests

from unstructured.metrics.text_extraction import calculate_accuracy

# function to test
# NOTE: The original function references rapidfuzz.distance.Levenshtein.
# To make this test self-contained and deterministic, we provide a local
# Levenshtein implementation with the required .distance(...) API.
# The function implementations below otherwise follow the original logic
# and signatures as closely as possible.


class Levenshtein:
    """
    Minimal, deterministic implementation of a weighted Levenshtein distance
    compatible with the original function's expectations:
        Levenshtein.distance(output, source, weights=weights)
    weights is a tuple (insertion_cost, deletion_cost, substitution_cost).
    """

    @staticmethod
    def distance(a: str, b: str, weights: Tuple[int, int, int] = (2, 1, 1)) -> float:
        # Use classic DP to compute weighted edit distance.
        ins_w, del_w, sub_w = weights
        # Convert None to empty string to be safe
        if a is None:
            a = ""
        if b is None:
            b = ""
        n = len(a)
        m = len(b)
        # If one of the strings is empty, cost is sum of insertions/deletions
        # Note: We're computing cost to convert 'a' -> 'b' so:
        # deleting chars from 'a' costs del_w, inserting into 'a' costs ins_w
        if n == 0:
            return float(m * ins_w)
        if m == 0:
            return float(n * del_w)

        # Initialize DP arrays (only keep two rows for memory efficiency)
        prev = [0.0] * (m + 1)
        cur = [0.0] * (m + 1)

        # cost to convert empty a to prefixes of b: need to insert j chars
        for j in range(1, m + 1):
            prev[j] = prev[j - 1] + ins_w

        for i in range(1, n + 1):
            # cost to convert prefix a[:i] to empty b: delete i chars
            cur[0] = prev[0] + del_w
            ai = a[i - 1]
            for j in range(1, m + 1):
                bj = b[j - 1]
                if ai == bj:
                    sub_cost = 0.0
                else:
                    sub_cost = sub_w
                # substitution (or match)
                cost_sub = prev[j - 1] + sub_cost
                # deletion from a (remove ai)
                cost_del = prev[j] + del_w
                # insertion into a (insert bj)
                cost_ins = cur[j - 1] + ins_w
                cur[j] = min(cost_sub, cost_del, cost_ins)
            # swap rows
            prev, cur = cur, prev

        return float(prev[m])


# Helper used by standardize_quotes in original code: convert "U+XXXX" -> char
def unicode_to_char(unicode_desc: str) -> str:
    # Expect format "U+XXXX" or "U+XXXXX", etc.
    if not unicode_desc.startswith("U+"):
        return unicode_desc
    hex_part = unicode_desc[2:]
    try:
        code_point = int(hex_part, 16)
        return chr(code_point)
    except Exception:
        # If parsing fails, return the original descriptor for safety
        return unicode_desc


def calculate_edit_distance(
    output: Optional[str],
    source: Optional[str],
    weights: Tuple[int, int, int] = (2, 1, 1),
    return_as: str = "distance",
    standardize_whitespaces: bool = True,
) -> float:
    """
    Calculates edit distance using Levenshtein distance between two strings.

    NOTE: This implementation mirrors the original function logic that the tests
    are verifying. It uses the local Levenshtein distance implemented above.
    """
    return_types = ["score", "distance"]
    if return_as not in return_types:
        raise ValueError("Invalid return value type. Expected one of: %s" % return_types)
    output = standardize_quotes(prepare_str(output, standardize_whitespaces))
    source = standardize_quotes(prepare_str(source, standardize_whitespaces))
    distance = Levenshtein.distance(output, source, weights=weights)  # type: ignore
    # lower bound the source string's character length at 1.0 to avoid division by zero;
    # when the source string is empty, the normalized distance should be 100%
    source_char_len = max(len(source), 1.0)  # type: ignore
    bounded_percentage_distance = min(max(distance / source_char_len, 0.0), 1.0)
    if return_as == "score":
        return 1 - bounded_percentage_distance
    elif return_as == "distance":
        return distance
    return 0.0


def prepare_str(string: Optional[str], standardize_whitespaces: bool = False) -> str:
    if not string:
        return ""
    if standardize_whitespaces:
        return " ".join(string.split())
    return str(string)  # type: ignore


def standardize_quotes(text: str) -> str:
    """
    Converts all unicode quotes to standard ASCII quotes with comprehensive coverage.

    This function uses unicode_to_char to translate "U+XXXX" descriptors into actual
    unicode characters and replaces them with ASCII quotes where found.
    """
    # Double Quotes Dictionary
    double_quotes = {
        '"': "U+0022",  # noqa 601 # Standard typewriter/programmer's quote
        '"': "U+201C",  # noqa 601 # Left double quotation mark
        '"': "U+201D",  # noqa 601 # Right double quotation mark
        "„": "U+201E",  # Double low-9 quotation mark
        "‟": "U+201F",  # Double high-reversed-9 quotation mark
        "«": "U+00AB",  # Left-pointing double angle quotation mark
        "»": "U+00BB",  # Right-pointing double angle quotation mark
        "❝": "U+275D",  # Heavy double turned comma quotation mark ornament
        "❞": "U+275E",  # Heavy double comma quotation mark ornament
        "⹂": "U+2E42",  # Double low-reversed-9 quotation mark
        "🙶": "U+1F676",  # SANS-SERIF HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
        "🙷": "U+1F677",  # SANS-SERIF HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
        "🙸": "U+1F678",  # SANS-SERIF HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT
        "⠦": "U+2826",  # Braille double closing quotation mark
        "⠴": "U+2834",  # Braille double opening quotation mark
        "〝": "U+301D",  # REVERSED DOUBLE PRIME QUOTATION MARK
        "〞": "U+301E",  # DOUBLE PRIME QUOTATION MARK
        "〟": "U+301F",  # LOW DOUBLE PRIME QUOTATION MARK
        """: "U+FF02",  # FULLWIDTH QUOTATION MARK
        ",,": "U+275E",  # LOW HEAVY DOUBLE COMMA ORNAMENT
    }

    # Single Quotes Dictionary
    single_quotes = {
        "'": "U+0027",  # noqa 601 # Standard typewriter/programmer's quote
        "'": "U+2018",  # noqa 601 # Left single quotation mark
        "'": "U+2019",  # noqa 601 # Right single quotation mark # noqa: W605
        "‚": "U+201A",  # Single low-9 quotation mark
        "‛": "U+201B",  # Single high-reversed-9 quotation mark
        "‹": "U+2039",  # Single left-pointing angle quotation mark
        "›": "U+203A",  # Single right-pointing angle quotation mark
        "❛": "U+275B",  # Heavy single turned comma quotation mark ornament
        "❜": "U+275C",  # Heavy single comma quotation mark ornament
        "「": "U+300C",  # Left corner bracket
        "」": "U+300D",  # Right corner bracket
        "『": "U+300E",  # Left white corner bracket
        "』": "U+300F",  # Right white corner bracket
        "﹁": "U+FE41",  # PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
        "﹂": "U+FE42",  # PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
        "﹃": "U+FE43",  # PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
        "﹄": "U+FE44",  # PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
        "'": "U+FF07",  # FULLWIDTH APOSTROPHE
        "「": "U+FF62",  # HALFWIDTH LEFT CORNER BRACKET
        "」": "U+FF63",  # HALFWIDTH RIGHT CORNER BRACKET
    }

    double_quote_standard = '"'
    single_quote_standard = "'"

    # Apply double quote replacements
    for unicode_val in double_quotes.values():
        unicode_char = unicode_to_char(unicode_val)
        if unicode_char in text:
            text = text.replace(unicode_char, double_quote_standard)

    # Apply single quote replacements
    for unicode_val in single_quotes.values():
        unicode_char = unicode_to_char(unicode_val)
        if unicode_char in text:
            text = text.replace(unicode_char, single_quote_standard)

    return text


def test_identical_strings_accuracy_is_one():
    # Basic scenario: identical strings should yield perfect accuracy 1.0
    out = "hello world"
    src = "hello world"
    # calculate_accuracy returns a float score; identical strings => score 1.0
    codeflash_output = calculate_accuracy(out, src)
    result = codeflash_output  # 46.2μs -> 4.01μs (1052% faster)


def test_completely_different_strings_accuracy_is_zero():
    # Basic scenario: completely different strings of same length should yield 0.0
    out = "aaaa"
    src = "bbbb"
    # Each character will require a substitution; normalized distance = 1.0 -> score 0.0
    codeflash_output = calculate_accuracy(out, src)
    result = codeflash_output  # 45.1μs -> 8.02μs (462% faster)


def test_none_inputs_treated_as_empty_strings():
    # Edge case: None values should be treated as empty strings
    codeflash_output = calculate_accuracy(None, None)
    result_both_none = codeflash_output  # 44.3μs -> 2.22μs (1894% faster)

    # Output None, source empty string -> both become empty -> 1.0
    codeflash_output = calculate_accuracy(None, "")
    result_none_and_empty = codeflash_output  # 34.4μs -> 967ns (3453% faster)

    # Output non-empty, source None -> source becomes empty, non-empty output -> score 0.0
    codeflash_output = calculate_accuracy("x", None)
    result_nonempty_and_none = codeflash_output  # 34.4μs -> 6.53μs (427% faster)


def test_source_empty_and_output_nonempty_caps_to_zero_score():
    # Edge: source is empty string but output is non-empty. Distance normalized over max(len(source),1.0)
    # ensures result is capped and yields score 0.0 when distance is >= 1.0 * 1.0
    src = ""
    out = "nonempty"
    codeflash_output = calculate_accuracy(out, src)
    score = codeflash_output  # 45.1μs -> 7.68μs (487% faster)


def test_whitespace_standardization_defaults_to_true_in_calculate_accuracy():
    # Basic: calculate_accuracy calls calculate_edit_distance with standardize_whitespaces=True by default
    # so strings with multiple internal spaces should be normalized to a single space and match if text is same.
    out = "a   b  c"  # many spaces
    src = "a b c"  # single spaces
    # After standardization both become "a b c" -> perfect match
    codeflash_output = calculate_accuracy(out, src)
    score = codeflash_output  # 46.4μs -> 4.12μs (1024% faster)


def test_quotes_standardization_replaces_various_unicode_quotes():
    # Edge: ensure that various unicode quotation marks are normalized to ASCII quotes
    # Use some common smart quotes
    left_double = "\u201c"  # “
    right_double = "\u201d"  # ”
    left_single = "\u2018"  # ‘
    right_single = "\u2019"  # ’

    out = left_double + "hello" + right_double
    src = '"' + "hello" + '"'  # ASCII double quotes

    # Also test single quotes
    out_single = left_single + "hey" + right_single
    src_single = "'" + "hey" + "'"


def test_calculate_edit_distance_raises_on_invalid_return_as():
    # Edge: calling calculate_edit_distance with invalid return_as should raise ValueError
    with pytest.raises(ValueError):
        calculate_edit_distance("a", "b", return_as="invalid")


def test_weights_affect_distance_and_thus_accuracy():
    # Basic/edge: Changing weights should affect computed distance and thus accuracy.
    # Here we compare single-character strings so the effect is easy to reason about.
    out = "a"
    src = ""
    # With insertion cost high, converting empty source to "a" costs ins_w
    codeflash_output = calculate_accuracy(out, src, weights=(2, 1, 1))
    score_default = codeflash_output  # 45.6μs -> 8.51μs (436% faster)
    # If insertion cost is reduced to 1, distance reduces and score increases
    codeflash_output = calculate_accuracy(out, src, weights=(1, 1, 1))
    score_lower_insertion = codeflash_output  # 35.2μs -> 2.73μs (1186% faster)


def test_large_scale_single_difference_high_score():
    # Large scale: strings of length 500 with only one differing character should yield a very high score.
    # Use repeated characters to keep DP deterministic and within acceptable runtime.
    base = "a" * 500
    # Only one substitution at the end
    modified = "a" * 499 + "b"
    codeflash_output = calculate_accuracy(modified, base)
    score = codeflash_output  # 47.3μs -> 10.0μs (372% faster)


def test_large_scale_many_insertions_but_bounded_normalization():
    # Large scale: many insertions relative to a small source should cap the normalized distance at 1.0
    src = "short"
    # Create output much longer but keep test input sizes under 1000 chars
    out = "x" * 400  # 400 insertions
    codeflash_output = calculate_accuracy(out, src)
    score = codeflash_output  # 54.2μs -> 16.5μs (229% faster)


def test_calculate_edit_distance_distance_return_type_matches_distance():
    # Basic: verify calculate_edit_distance return_as="distance" returns raw distance
    out = "kitten"
    src = "sitting"
    dist = calculate_edit_distance(out, src, return_as="distance")


def test_prepare_str_and_standardize_quotes_integration():
    # Integration: ensure prepare_str with whitespace standardization and quote standardization
    # work together via calculate_edit_distance internal calls.
    out = "  “ spaced ”  "  # includes smart quotes and extra spaces
    src = '"spaced"'
    # Because calculate_accuracy standardizes whitespace and quotes, these should match
    codeflash_output = calculate_accuracy(out, src)
    score = codeflash_output  # 48.9μs -> 32.8μs (49.0% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.metrics.text_extraction import calculate_accuracy


class TestCalculateAccuracyBasic:
    """Basic test cases for calculate_accuracy function under normal conditions."""

    def test_identical_strings_returns_perfect_score(self):
        """Test that identical strings return a perfect accuracy score of 1.0."""
        codeflash_output = calculate_accuracy("hello world", "hello world")
        result = codeflash_output  # 46.3μs -> 3.96μs (1066% faster)

    def test_empty_strings_returns_perfect_score(self):
        """Test that two empty strings return a perfect accuracy score."""
        codeflash_output = calculate_accuracy("", "")
        result = codeflash_output  # 44.1μs -> 2.21μs (1896% faster)

    def test_completely_different_strings_returns_low_score(self):
        """Test that completely different strings return a low accuracy score."""
        codeflash_output = calculate_accuracy("abc", "xyz")
        result = codeflash_output  # 45.4μs -> 8.13μs (458% faster)

    def test_single_character_difference_returns_high_score(self):
        """Test that strings with only one character difference return high accuracy."""
        codeflash_output = calculate_accuracy("hello", "hallo")
        result = codeflash_output  # 45.5μs -> 7.99μs (469% faster)

    def test_none_output_treated_as_empty_string(self):
        """Test that None output is treated as an empty string."""
        codeflash_output = calculate_accuracy(None, "test")
        result = codeflash_output  # 45.4μs -> 7.86μs (477% faster)

    def test_none_source_treated_as_empty_string(self):
        """Test that None source is treated as an empty string."""
        codeflash_output = calculate_accuracy("test", None)
        result = codeflash_output  # 45.0μs -> 7.67μs (487% faster)

    def test_both_none_returns_perfect_score(self):
        """Test that both None values return perfect accuracy."""
        codeflash_output = calculate_accuracy(None, None)
        result = codeflash_output  # 44.4μs -> 2.22μs (1898% faster)

    def test_default_weights_used(self):
        """Test that default weights (2, 1, 1) are used when not specified."""
        codeflash_output = calculate_accuracy("cat", "hat")
        result1 = codeflash_output  # 45.3μs -> 8.06μs (462% faster)
        codeflash_output = calculate_accuracy("cat", "hat", weights=(2, 1, 1))
        result2 = codeflash_output  # 34.9μs -> 3.07μs (1038% faster)

    def test_return_value_is_float(self):
        """Test that the return value is always a float."""
        codeflash_output = calculate_accuracy("test", "test")
        result = codeflash_output  # 45.1μs -> 3.10μs (1355% faster)

    def test_return_value_in_valid_range(self):
        """Test that the return value is always between 0.0 and 1.0."""
        test_cases = [
            ("abc", "xyz"),
            ("hello", "world"),
            ("test", "test"),
            ("", "nonempty"),
            (None, "string"),
        ]
        for output, source in test_cases:
            codeflash_output = calculate_accuracy(output, source)
            result = codeflash_output  # 180μs -> 16.4μs (1001% faster)


class TestCalculateAccuracyEdgeCases:
    """Edge case tests for calculate_accuracy function."""

    def test_output_none_source_empty(self):
        """Test when output is None and source is empty string."""
        codeflash_output = calculate_accuracy(None, "")
        result = codeflash_output  # 43.7μs -> 2.32μs (1782% faster)

    def test_output_empty_source_none(self):
        """Test when output is empty string and source is None."""
        codeflash_output = calculate_accuracy("", None)
        result = codeflash_output  # 44.1μs -> 2.21μs (1902% faster)

    def test_output_none_source_nonempty(self):
        """Test when output is None and source is a non-empty string."""
        codeflash_output = calculate_accuracy(None, "hello")
        result = codeflash_output  # 45.0μs -> 7.80μs (476% faster)

    def test_output_nonempty_source_none(self):
        """Test when output is non-empty and source is None."""
        codeflash_output = calculate_accuracy("hello", None)
        result = codeflash_output  # 45.1μs -> 7.54μs (497% faster)

    def test_output_empty_source_nonempty(self):
        """Test when output is empty and source is non-empty."""
        codeflash_output = calculate_accuracy("", "test")
        result = codeflash_output  # 45.3μs -> 7.80μs (481% faster)

    def test_single_character_strings_identical(self):
        """Test single character strings that are identical."""
        codeflash_output = calculate_accuracy("a", "a")
        result = codeflash_output  # 45.3μs -> 3.11μs (1357% faster)

    def test_single_character_strings_different(self):
        """Test single character strings that are different."""
        codeflash_output = calculate_accuracy("a", "b")
        result = codeflash_output  # 45.3μs -> 8.04μs (464% faster)

    def test_whitespace_normalization(self):
        """Test that multiple whitespaces are normalized to single spaces."""
        codeflash_output = calculate_accuracy("hello   world", "hello world")
        result1 = codeflash_output  # 46.0μs -> 4.15μs (1007% faster)
        codeflash_output = calculate_accuracy("hello world", "hello world")
        result2 = codeflash_output  # 35.1μs -> 1.57μs (2133% faster)

    def test_leading_trailing_whitespace_normalized(self):
        """Test that leading and trailing whitespaces are removed."""
        codeflash_output = calculate_accuracy("  hello world  ", "hello world")
        result1 = codeflash_output  # 46.0μs -> 4.08μs (1027% faster)
        codeflash_output = calculate_accuracy("hello world", "hello world")
        result2 = codeflash_output  # 34.9μs -> 1.64μs (2031% faster)

    def test_unicode_quotes_standardized_single(self):
        """Test that unicode single quotes are standardized to ASCII quotes."""
        codeflash_output = calculate_accuracy("'hello'", "'hello'")
        result1 = codeflash_output  # 45.2μs -> 3.26μs (1285% faster)
        codeflash_output = calculate_accuracy("'hello'", "'hello'")
        result2 = codeflash_output  # 34.5μs -> 1.23μs (2710% faster)

    def test_unicode_quotes_standardized_double(self):
        """Test that unicode double quotes are standardized to ASCII quotes."""
        codeflash_output = calculate_accuracy('"hello"', '"hello"')
        result1 = codeflash_output  # 45.2μs -> 3.17μs (1327% faster)
        codeflash_output = calculate_accuracy('"hello"', '"hello"')
        result2 = codeflash_output  # 34.5μs -> 1.20μs (2777% faster)

    def test_very_long_identical_strings(self):
        """Test very long identical strings."""
        long_string = "a" * 500
        codeflash_output = calculate_accuracy(long_string, long_string)
        result = codeflash_output  # 47.3μs -> 4.27μs (1008% faster)

    def test_very_long_different_strings(self):
        """Test very long completely different strings."""
        string1 = "a" * 250
        string2 = "b" * 250
        codeflash_output = calculate_accuracy(string1, string2)
        result = codeflash_output  # 288μs -> 251μs (15.0% faster)

    def test_special_characters_preserved(self):
        """Test that special characters are compared correctly."""
        codeflash_output = calculate_accuracy("hello@world!", "hello@world!")
        result = codeflash_output  # 45.5μs -> 3.23μs (1310% faster)

    def test_special_characters_difference(self):
        """Test that special character differences are detected."""
        codeflash_output = calculate_accuracy("hello@world", "hello@world")
        result1 = codeflash_output  # 45.6μs -> 3.19μs (1329% faster)
        codeflash_output = calculate_accuracy("hello#world", "hello@world")
        result2 = codeflash_output  # 34.8μs -> 6.27μs (456% faster)

    def test_case_sensitive_comparison(self):
        """Test that comparison is case-sensitive."""
        codeflash_output = calculate_accuracy("Hello", "hello")
        result1 = codeflash_output  # 45.5μs -> 8.03μs (467% faster)
        codeflash_output = calculate_accuracy("Hello", "Hello")
        result2 = codeflash_output  # 34.6μs -> 1.36μs (2438% faster)

    def test_numeric_strings(self):
        """Test that numeric strings are compared correctly."""
        codeflash_output = calculate_accuracy("12345", "12345")
        result = codeflash_output  # 45.4μs -> 3.12μs (1356% faster)

    def test_mixed_alphanumeric_identical(self):
        """Test mixed alphanumeric identical strings."""
        codeflash_output = calculate_accuracy("abc123xyz", "abc123xyz")
        result = codeflash_output  # 45.5μs -> 3.12μs (1355% faster)

    def test_weights_parameter_affects_score(self):
        """Test that different weights produce different scores."""
        codeflash_output = calculate_accuracy("cat", "hat", weights=(2, 1, 1))
        result1 = codeflash_output  # 45.8μs -> 8.20μs (458% faster)
        codeflash_output = calculate_accuracy("cat", "hat", weights=(1, 1, 1))
        result2 = codeflash_output  # 35.4μs -> 3.32μs (966% faster)

    def test_asymmetric_similarity(self):
        """Test that accuracy can be asymmetric (output vs source matters)."""
        # When source is empty, behavior should be consistent
        codeflash_output = calculate_accuracy("test", "")
        result1 = codeflash_output  # 44.8μs -> 7.70μs (482% faster)
        codeflash_output = calculate_accuracy("", "test")
        result2 = codeflash_output  # 34.8μs -> 2.76μs (1159% faster)


class TestCalculateAccuracyLargeScale:
    """Large scale test cases for assessing performance and scalability."""

    def test_large_identical_strings_performance(self):
        """Test performance with large identical strings (500 chars)."""
        large_string = "The quick brown fox jumps over the lazy dog. " * 11  # ~500 chars
        codeflash_output = calculate_accuracy(large_string, large_string)
        result = codeflash_output  # 62.9μs -> 16.8μs (273% faster)

    def test_large_strings_with_small_difference(self):
        """Test large strings with a single character difference."""
        base_string = "a" * 499
        string1 = base_string + "a"
        string2 = base_string + "b"
        codeflash_output = calculate_accuracy(string1, string2)
        result = codeflash_output  # 47.3μs -> 9.89μs (379% faster)

    def test_large_strings_high_similarity(self):
        """Test large strings that are very similar (99% match)."""
        base = "word " * 100  # 500 chars
        modified = base[:490] + "wxyz"
        codeflash_output = calculate_accuracy(modified, base)
        result = codeflash_output  # 58.1μs -> 20.3μs (186% faster)

    def test_large_strings_50_percent_different(self):
        """Test large strings that are 50% different."""
        string1 = "a" * 250 + "b" * 250
        string2 = "b" * 250 + "a" * 250
        codeflash_output = calculate_accuracy(string1, string2)
        result = codeflash_output  # 624μs -> 586μs (6.59% faster)

    def test_large_strings_completely_different(self):
        """Test large completely different strings."""
        string1 = "x" * 500
        string2 = "y" * 500
        codeflash_output = calculate_accuracy(string1, string2)
        result = codeflash_output  # 1.01ms -> 948μs (6.06% faster)

    def test_many_small_differences_in_large_string(self):
        """Test large string with many small differences scattered throughout."""
        original = "test word " * 50  # 500 chars
        modified_list = list(original)
        # Change every 10th character (50 changes)
        for i in range(0, len(modified_list), 10):
            if modified_list[i] != "x":
                modified_list[i] = "x"
        modified = "".join(modified_list)
        codeflash_output = calculate_accuracy(modified, original)
        result = codeflash_output  # 618μs -> 574μs (7.61% faster)

    def test_consecutive_batches_of_changes(self):
        """Test large string with consecutive chunks of changes."""
        base = "a" * 100 + "b" * 100 + "c" * 100 + "d" * 100 + "e" * 100
        modified = "a" * 100 + "x" * 100 + "c" * 100 + "x" * 100 + "e" * 100
        codeflash_output = calculate_accuracy(modified, base)
        result = codeflash_output  # 364μs -> 301μs (21.0% faster)

    def test_whitespace_heavy_large_string(self):
        """Test large strings with heavy whitespace that gets normalized."""
        words = ["word"] * 120
        string1 = "   ".join(words)  # Multiple spaces between words
        string2 = " ".join(words)  # Single space between words
        codeflash_output = calculate_accuracy(string1, string2)
        result = codeflash_output  # 59.9μs -> 17.0μs (253% faster)

    def test_unicode_characters_in_large_string(self):
        """Test large strings containing unicode characters."""
        base = "café naïve " * 45  # ~500 chars with unicode
        codeflash_output = calculate_accuracy(base, base)
        result = codeflash_output  # 62.6μs -> 58.5μs (6.98% faster)

    def test_mixed_quote_types_large_string(self):
        """Test large strings with various quote types that get standardized."""
        base = "\"hello\" and 'world' repeated " * 15
        codeflash_output = calculate_accuracy(base, base)
        result = codeflash_output  # 54.3μs -> 11.3μs (380% faster)

    def test_special_characters_density(self):
        """Test strings with high density of special characters."""
        special = "!@#$%^&*()_+-=[]{}|;:,.<>?" * 20  # ~500 chars
        codeflash_output = calculate_accuracy(special, special)
        result = codeflash_output  # 47.0μs -> 4.23μs (1013% faster)

    def test_repeated_pattern_identical(self):
        """Test large strings with repeated patterns that are identical."""
        pattern = "abc123xyz " * 50
        codeflash_output = calculate_accuracy(pattern, pattern)
        result = codeflash_output  # 52.9μs -> 9.91μs (434% faster)

    def test_repeated_pattern_with_variance(self):
        """Test large strings with repeated patterns where one differs."""
        pattern1 = "abc123xyz " * 50
        pattern2 = "abc124xyz " * 50  # Changed 123 to 124
        codeflash_output = calculate_accuracy(pattern1, pattern2)
        result = codeflash_output  # 665μs -> 634μs (4.84% faster)

    def test_source_longer_than_output(self):
        """Test when source is significantly longer than output."""
        output = "short"
        source = "this is a much longer source string with many more characters"
        codeflash_output = calculate_accuracy(output, source)
        result = codeflash_output  # 47.8μs -> 10.4μs (359% faster)

    def test_output_longer_than_source(self):
        """Test when output is significantly longer than source."""
        output = "this is a much longer output string with many more characters"
        source = "short"
        codeflash_output = calculate_accuracy(output, source)
        result = codeflash_output  # 47.8μs -> 10.3μs (366% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from unstructured.metrics.text_extraction import calculate_accuracy


def test_calculate_accuracy():
    calculate_accuracy("", "", weights=(0, 0, 0))
🔎 Click to see Concolic Coverage Tests
Test File::Test Function                                                                     Original ⏱️  Optimized ⏱️  Speedup
codeflash_concolic_xdo_puqm/tmpl285jtn6/test_concolic_coverage.py::test_calculate_accuracy  44.0μs       2.64μs        1564% ✅

To edit these changes, run git checkout codeflash/optimize-calculate_accuracy-mks1pgs1 and push.


@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 24, 2026 08:26
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 24, 2026