Avoid quadratic time on unclosed quotes in CSV parser #52
Merged
almeida-raphael merged 1 commit into main on Feb 2, 2026
[sc-81091] When a CSV line had an unclosed quote, the parser accumulated the following lines into a buffer and re-scanned the whole buffer after each new line, which made parsing O(n²) in the number of accumulated lines. Files with hundreds of thousands of lines could take hours.

The fix keeps incremental quote state (QuoteState) across line boundaries: only the new line is scanned, starting from the previous state, and prev_char/prev_prev_char are carried over so that escaped "" and delimiter-only cells are handled correctly at line boundaries.

Adds Rust unit, equivalence, and performance tests, plus a Python fixture and integration test for the unclosed-quote case. The tests contain no customer data or product names (OSS-safe).
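The incremental idea described above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: the `in_quotes` field, the `after_line` method, and the `join_records` helper are hypothetical names, and the sketch tracks only quote parity (an escaped `""` toggles twice and cancels out) rather than the prev_char/prev_prev_char bookkeeping mentioned in the description.

```rust
/// Hypothetical, simplified stand-in for the PR's QuoteState.
/// It only remembers whether the scan so far ended inside a quoted field.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct QuoteState {
    pub in_quotes: bool,
}

impl QuoteState {
    /// Scan only `line`, starting from the state left by earlier lines,
    /// and return the state after it. No buffer is re-scanned and no
    /// string is allocated.
    pub fn after_line(self, line: &str, quote: char) -> QuoteState {
        let mut in_quotes = self.in_quotes;
        for ch in line.chars() {
            if ch == quote {
                // An escaped quote ("") toggles twice, so it cancels out.
                in_quotes = !in_quotes;
            }
        }
        QuoteState { in_quotes }
    }
}

/// Hypothetical helper: join physical lines into logical records.
/// A record continues onto the next line while a quote is still open.
/// Each line is scanned exactly once, so the total work is linear.
pub fn join_records<'a, I>(lines: I, quote: char) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut records = Vec::new();
    let mut current = String::new();
    let mut state = QuoteState::default();
    let mut pending = false;

    for line in lines {
        if pending {
            // Restore the line break that sits inside the quoted field.
            current.push('\n');
        }
        current.push_str(line);
        state = state.after_line(line, quote);
        if state.in_quotes {
            pending = true;
        } else {
            records.push(std::mem::take(&mut current));
            pending = false;
        }
    }
    if pending {
        // Unclosed quote at end of input: emit what was accumulated.
        records.push(current);
    }
    records
}
```

With this shape, a file whose first line opens a quote that never closes yields a single record, but each physical line is still scanned only once, which is the property the fix relies on.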
534a0a6 to ba52012
mikicz approved these changes Feb 2, 2026
[sc-81091]
PR Content: Fix quadratic time on unclosed quotes
Fix quadratic time complexity (O(n²)) in CSV parser when processing files with unclosed quotes, reducing processing time from ~4 hours to <1 second for problematic files.
Problem: The has_open_quotes() function was creating O(n) string allocations via .replace() on every call. When processing files with unclosed quotes, this caused O(n²) behavior that turned a 60MB file into a 4+ hour processing job.
Solution: Replaced with O(n) incremental parsing using a QuoteState struct that tracks quote state across line boundaries without string allocations.
Benchmark Results
Correctness Validation
Performance Comparison
Changes
csv_gp/src/parser.rs: Replaced O(n²) has_open_quotes with O(n) incremental quote_state_after
csv_gp_python/tests/fixtures/unclosed_quote.csv: New test fixture
csv_gp_python/tests/test_integration.py: New integration test for unclosed quotes
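
An equivalence check of the kind this PR mentions might look roughly like the sketch below. It is an assumption-laden illustration rather than the repository's test: `has_open_quotes_naive` is a guess at the general shape of the old check (allocate via `.replace()`, then re-scan the whole buffer), and `quote_open_after` and `count_mismatches` are hypothetical names, not the real `quote_state_after`.

```rust
/// Hypothetical reconstruction of the slow check: strip escaped quote
/// pairs (allocating a new String on every call), then count the quotes
/// left in the whole accumulated buffer.
pub fn has_open_quotes_naive(buffer: &str) -> bool {
    let stripped = buffer.replace("\"\"", "");
    stripped.matches('"').count() % 2 == 1
}

/// Hypothetical incremental check: look only at the new line and flip
/// the carried state when that line holds an odd number of quotes.
pub fn quote_open_after(was_open: bool, line: &str) -> bool {
    let odd_quotes = line.matches('"').count() % 2 == 1;
    was_open ^ odd_quotes
}

/// Feed the same lines to both checks and count how often they disagree.
/// The naive side re-scans a growing buffer (quadratic while a quote
/// stays open); the incremental side touches each line once.
pub fn count_mismatches(lines: &[&str]) -> usize {
    let mut buffer = String::new();
    let mut open = false;
    let mut mismatches = 0;

    for line in lines {
        buffer.push_str(line);
        buffer.push('\n');

        let naive = has_open_quotes_naive(&buffer);
        open = quote_open_after(open, line);
        if naive != open {
            mismatches += 1;
        }
        if !open {
            // The record is complete, so the next one starts fresh.
            buffer.clear();
        }
    }
    mismatches
}
```

Both checks agree because removing an escaped `""` pair never changes the parity of the quote count, so the incremental version can skip the allocation and the re-scan without changing the answer.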