refactor: optimize Arrow memory and propagate CSV read errors #471
Open
shirly121 wants to merge 4 commits into
…tages
Extract kSniffBlockSize (1 MB) into the neug::reader namespace in options.h and apply it in the CSV, JSON, JSONL, and Parquet sniff functions to avoid excessive Arrow memory pool allocation during schema inference. Also set dataset_schema in ScanOptions to skip the redundant Inspect() call in create_scanner, preventing large allocations proportional to block_size.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
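The block-size cap described in this commit could look roughly like the sketch below. Only `kSniffBlockSize`, its 1 MB value, and the `neug::reader` namespace come from the commit message. `ReadOptionsSketch` and `MakeSniffOptions` are hypothetical names, and clamping with `std::min` is an assumption about how the constant is applied; the actual sniff functions may simply set the block size directly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

namespace neug {
namespace reader {

// 1 MB block size used while sniffing a schema (value from the commit message).
constexpr int64_t kSniffBlockSize = 1 << 20;

// Hypothetical stand-in for the reader's real options type.
struct ReadOptionsSketch {
  int64_t block_size;
};

// Assumption: sniffing works on a copy of the user's options whose block size
// is capped, so a large user-configured block size does not drive large
// allocations during schema inference.
inline ReadOptionsSketch MakeSniffOptions(const ReadOptionsSketch& user) {
  ReadOptionsSketch sniff = user;
  sniff.block_size = std::min(user.block_size, kSniffBlockSize);
  return sniff;
}

}  // namespace reader
}  // namespace neug
```

Under this sketch the user's own options are left untouched, so the actual data scan would still run with the configured block size.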
…turning nullptr
GetNextBatch() in CSVStreamRecordBatchSupplier, CSVTableRecordBatchSupplier, and ArrowRecordBatchStreamSupplier previously swallowed Arrow read errors by logging them and returning nullptr. After the dataset_schema optimization skipped Inspect(), data validation moved to lazy batch reading, where errors were silently ignored. Now throw IOException so that errors propagate through the pipeline to the caller.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
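The behavioral change in this commit can be illustrated with a small, Arrow-free sketch. All types here (`Batch`, `ReadResult`, `StreamSupplierSketch`, and this local `IOException`) are hypothetical stand-ins; the real suppliers wrap Arrow readers and, per the PR description, raise through `THROW_IO_EXCEPTION`. The point is only the distinction between end of stream (`nullptr`) and a read failure (exception).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch.
struct Batch {
  int num_rows;
};

// Hypothetical stand-in for the outcome of one underlying read.
struct ReadResult {
  bool ok;
  std::string error;
  std::shared_ptr<Batch> batch;
};

// Local stand-in for the project's IO exception type.
class IOException : public std::runtime_error {
 public:
  using std::runtime_error::runtime_error;
};

class StreamSupplierSketch {
 public:
  explicit StreamSupplierSketch(std::vector<ReadResult> results)
      : results_(std::move(results)) {}

  // nullptr means "no more batches"; a failed read throws instead of being
  // logged and reported as nullptr, which callers would take for a clean end.
  std::shared_ptr<Batch> GetNextBatch() {
    if (next_ >= results_.size()) return nullptr;
    ReadResult r = results_[next_++];
    if (!r.ok) {
      throw IOException("Failed to read next batch: " + r.error);
    }
    return r.batch;
  }

 private:
  std::vector<ReadResult> results_;
  std::size_t next_ = 0;
};
```

With the old log-and-return-`nullptr` behavior, a caller looping until `nullptr` would stop at the bad batch and treat the import as complete, which matches the silent acceptance of invalid CSV data described above.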
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
longbinlai (Collaborator) reviewed on Jun 4, 2026 and left a comment:
If this is a fix, please add test cases.
Summary
- … `Inspect()` calls in the `infer_schema` and `create_scanner` stages
- … `GetNextBatch()` in all `IRecordBatchSupplier` implementations to throw `IOException` on read errors instead of silently returning `nullptr`, which caused invalid CSV data to be silently accepted
Changes
- `src/utils/reader/options.cc`: When the entry schema is available, pass it directly to `factory->Finish()` to avoid a redundant `Inspect()` scan
- `include/neug/utils/reader/options.h`: Add a `dataset_schema` field to `ArrowOptions` for schema reuse
- `include/neug/compiler/function/import/csv_read_function.h`: Pass the entry schema to the options builder
- `include/neug/compiler/function/import/json_read_function.h`: Pass the entry schema to the options builder
- `extension/parquet/include/parquet_read_function.h`: Pass the entry schema to the options builder
- `src/storages/loader/loader_utils.cc`: Throw `THROW_IO_EXCEPTION` on Arrow read errors in `CSVStreamRecordBatchSupplier`, `CSVTableRecordBatchSupplier`, and `ArrowRecordBatchStreamSupplier`
Test plan
- `test_import_bad_csv` now correctly raises `ERR_IO_ERROR` for invalid CSV data
🤖 Generated with Claude Code
Fixes #220
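
The schema-reuse change in `options.cc` can be pictured with the following sketch, which uses hypothetical stand-ins (`SchemaSketch`, `FactorySketch`) rather than Arrow's dataset classes. It assumes only what the description states: when a schema is already known it is passed to `Finish()`, and the expensive `Inspect()` scan is skipped. The call counter exists purely to make that observable.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a dataset schema.
struct SchemaSketch {
  std::vector<std::string> field_names;
};

// Hypothetical stand-in for a dataset factory.
class FactorySketch {
 public:
  int inspect_calls = 0;

  // Stands in for the scan that infers a schema by reading the data.
  std::shared_ptr<SchemaSketch> Inspect() {
    ++inspect_calls;
    auto schema = std::make_shared<SchemaSketch>();
    schema->field_names = {"id", "name"};
    return schema;
  }

  // If the caller already has a schema, use it; otherwise fall back to
  // Inspect(). Returns the schema the resulting dataset would use.
  std::shared_ptr<SchemaSketch> Finish(
      std::shared_ptr<SchemaSketch> known = nullptr) {
    return known ? known : Inspect();
  }
};
```

One consequence, noted in the commit message, is that data is no longer validated up front once `Inspect()` is skipped, so read errors surface later during batch reading and must be propagated there.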