
Avoid quadratic time on unclosed quotes in CSV parser #52

Merged
almeida-raphael merged 1 commit into main from sc-81091/us-onboarding-portal-delay-in-processing on Feb 2, 2026
@almeida-raphael almeida-raphael commented Jan 30, 2026

[sc-81091]

PR Content: Fix quadratic time on unclosed quotes

Fix quadratic time complexity (O(n²)) in CSV parser when processing files with unclosed quotes, reducing processing time from ~4 hours to <1 second for problematic files.

Problem: The has_open_quotes() function allocated a new O(n)-sized string via .replace() on every call. When a file contained an unclosed quote, the function ran once per accumulated line against an ever-growing buffer, so total work was O(n²). This turned a 60MB file into a 4+ hour processing job.
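
To make the cost concrete, here is a minimal sketch of the rescanning pattern described above. It is not the code that was removed: `has_open_quotes_naive` and `bytes_scanned_with_rescan` are hypothetical names, and the body only assumes what the description says (a `.replace()` allocation on every call over the whole buffer).

```rust
/// Hypothetical stand-in for the old check (assumption, not the removed code):
/// strips escaped quotes by allocating a new String, then counts the rest.
fn has_open_quotes_naive(buffer: &str) -> bool {
    // `replace` allocates a fresh String proportional to the buffer length.
    let stripped = buffer.replace("\"\"", "");
    stripped.matches('"').count() % 2 == 1
}

/// Returns how many bytes are examined in total when every appended line
/// triggers a check of the full accumulated buffer.
fn bytes_scanned_with_rescan(lines: &[&str]) -> usize {
    let mut buffer = String::new();
    let mut scanned = 0;
    for line in lines {
        buffer.push_str(line);
        buffer.push('\n');
        scanned += buffer.len(); // the whole buffer is looked at again
        if !has_open_quotes_naive(&buffer) {
            buffer.clear(); // record complete, start a new one
        }
    }
    scanned
}
```

With balanced quotes the buffer is cleared after every line, so the total stays linear. After an unclosed quote the buffer never clears and each new line pays for everything before it.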

Solution: Replaced it with O(n) incremental parsing using a QuoteState struct that tracks quote state across line boundaries without string allocations.
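
The idea can be sketched as below. This is a simplified illustration, not the implementation in csv_gp/src/parser.rs: the `QuoteState` and `quote_state_after` names come from this PR, but the single `in_quotes` field and the function signature are assumptions, and the real struct carries more context.

```rust
/// Quote state carried from one physical line to the next.
/// Assumption: a single flag; the struct in the PR tracks more than this.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct QuoteState {
    in_quotes: bool,
}

/// Scans only `line`, starting from `state`, and returns the state after it.
/// No allocation happens, and earlier lines are never revisited.
fn quote_state_after(state: QuoteState, line: &str) -> QuoteState {
    let mut in_quotes = state.in_quotes;
    for byte in line.bytes() {
        if byte == b'"' {
            // An escaped quote ("") toggles twice, so it cancels out.
            in_quotes = !in_quotes;
        }
    }
    QuoteState { in_quotes }
}
```

Each line costs time proportional to its own length, so a file of n bytes is scanned once in total regardless of where a quote is left open.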

Benchmark Results

Correctness Validation

| Dataset | Files Tested | Output Match |
| --- | --- | --- |
| SFTP Production US | 663 files (35+ orgs) | 100% identical |
| Onboarding Portal (US+EU) | 199 files | 100% identical |
| Unit Tests (Rust) | 66 tests | All passing |
| Integration Tests (Python) | 15 tests | All passing |

Performance Comparison

| Metric | Fix Branch | Main Branch | Improvement |
| --- | --- | --- | --- |
| 663 SFTP files (total) | 9.63s | 12.81s | 1.33x faster |
| Problematic file (60MB, unclosed quote) | 0.63s | ~4 hours | ~22,000x faster |

Changes

  • csv_gp/src/parser.rs: Replaced O(n²) has_open_quotes with O(n) incremental quote_state_after
  • csv_gp_python/tests/fixtures/unclosed_quote.csv: New test fixture
  • csv_gp_python/tests/test_integration.py: New integration test for unclosed quotes

[sc-81091]

When a CSV line had an unclosed quote, the parser accumulated
following lines and re-scanned the full buffer after each line,
causing O(n²) time. Files with hundreds of thousands of lines
could take hours.

Use incremental quote state (QuoteState) across line boundaries:
scan only the new line with the previous state, and carry
prev_char/prev_prev_char so escaped "" and delimiter-only cells
are correct at boundaries.
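
As a rough illustration of the line-joining loop this describes, the sketch below groups physical lines into logical records while scanning each new line exactly once. `join_records` is a hypothetical helper, and the state is reduced to one boolean; it deliberately omits the prev_char/prev_prev_char tracking mentioned above.

```rust
/// Groups physical lines into logical CSV records (hypothetical helper).
/// A record continues onto the next line while a quote is still open.
fn join_records(lines: &[&str]) -> Vec<String> {
    let mut records = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false; // carried across line boundaries

    for line in lines {
        // Scan only the new line; `current` is never rescanned.
        for byte in line.bytes() {
            if byte == b'"' {
                in_quotes = !in_quotes;
            }
        }
        current.push_str(line);
        if in_quotes {
            // The newline is part of the quoted cell.
            current.push('\n');
        } else {
            records.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        // Input ended with a quote still open: emit what was accumulated.
        records.push(current);
    }
    records
}
```

For example, the lines `a,"b`, `c",d` and `e,f` yield two records, the first spanning two physical lines. A quote that is never closed yields one large trailing record, built in a single pass.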

Add Rust unit, equivalence and performance tests; Python fixture
and integration test for unclosed quote. No customer data or
product names in tests (OSS-safe).
@almeida-raphael almeida-raphael force-pushed the sc-81091/us-onboarding-portal-delay-in-processing branch from 534a0a6 to ba52012 Compare January 30, 2026 16:47
@almeida-raphael almeida-raphael merged commit aded388 into main Feb 2, 2026
10 checks passed
@almeida-raphael almeida-raphael deleted the sc-81091/us-onboarding-portal-delay-in-processing branch February 2, 2026 15:35