Skip to content

refactor: optimize Arrow memory and propagate CSV read errors#471

Open
shirly121 wants to merge 4 commits into
alibaba:mainfrom
shirly121:mem_optimize
Open

refactor: optimize Arrow memory and propagate CSV read errors#471
shirly121 wants to merge 4 commits into
alibaba:mainfrom
shirly121:mem_optimize

Conversation

@shirly121
Copy link
Copy Markdown
Collaborator

@shirly121 shirly121 commented Jun 3, 2026

Summary

  • Optimize Arrow memory usage by reusing entry schema to skip redundant Inspect() calls in infer_schema and create_scanner stages
  • Fix GetNextBatch() in all IRecordBatchSupplier implementations to throw IOException on read errors instead of silently returning nullptr, which caused invalid CSV data to be silently accepted

Changes

  • src/utils/reader/options.cc: When entry schema is available, pass it directly to factory->Finish() to avoid a redundant Inspect() scan
  • include/neug/utils/reader/options.h: Add dataset_schema field to ArrowOptions for schema reuse
  • include/neug/compiler/function/import/csv_read_function.h: Pass entry schema to options builder
  • include/neug/compiler/function/import/json_read_function.h: Pass entry schema to options builder
  • extension/parquet/include/parquet_read_function.h: Pass entry schema to options builder
  • src/storages/loader/loader_utils.cc: Throw THROW_IO_EXCEPTION on Arrow read errors in CSVStreamRecordBatchSupplier, CSVTableRecordBatchSupplier, and ArrowRecordBatchStreamSupplier

Test plan

  • test_import_bad_csv now correctly raises ERR_IO_ERROR for invalid CSV data
  • All 40 import/export tests pass
  • Existing query and storage tests unaffected

🤖 Generated with Claude Code

Fixes #220

shirly121 and others added 4 commits June 1, 2026 20:02
…tages

Extract kSniffBlockSize (1MB) to neug::reader namespace in options.h and
apply it in CSV, JSON, JSONL, and Parquet sniff functions to avoid
excessive Arrow memory pool allocation during schema inference. Also set
dataset_schema in ScanOptions to skip redundant Inspect() call in
create_scanner, preventing large block_size-proportional allocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…turning nullptr

GetNextBatch() in CSVStreamRecordBatchSupplier, CSVTableRecordBatchSupplier,
and ArrowRecordBatchStreamSupplier previously swallowed Arrow read errors by
logging and returning nullptr. After the dataset_schema optimization skipped
Inspect(), data validation moved to lazy batch reading, where errors were
silently ignored. Now throw IOException so errors propagate through the
pipeline to the caller.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shirly121 shirly121 requested a review from longbinlai June 4, 2026 03:06
Copy link
Copy Markdown
Collaborator

@longbinlai longbinlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a fix, please add test cases

@shirly121 shirly121 changed the title fix: optimize Arrow memory and propagate CSV read errors refactor: optimize Arrow memory and propagate CSV read errors Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Wire Arrow ScanOptions batch_size and clarify CSV read batch_size semantics for external loads

2 participants