refactor: optimize Arrow memory and propagate CSV read errors by shirly121 · Pull Request #471 · alibaba/neug

shirly121 · 2026-06-03T08:54:06Z

Summary

Optimize Arrow memory usage by reusing entry schema to skip redundant Inspect() calls in infer_schema and create_scanner stages
Fix GetNextBatch() in all IRecordBatchSupplier implementations to throw IOException on read errors instead of silently returning nullptr, which caused invalid CSV data to be silently accepted

Changes

src/utils/reader/options.cc: When entry schema is available, pass it directly to factory->Finish() to avoid a redundant Inspect() scan
include/neug/utils/reader/options.h: Add dataset_schema field to ArrowOptions for schema reuse
include/neug/compiler/function/import/csv_read_function.h: Pass entry schema to options builder
include/neug/compiler/function/import/json_read_function.h: Pass entry schema to options builder
extension/parquet/include/parquet_read_function.h: Pass entry schema to options builder
src/storages/loader/loader_utils.cc: Raise an IOException (via THROW_IO_EXCEPTION) on Arrow read errors in CSVStreamRecordBatchSupplier, CSVTableRecordBatchSupplier, and ArrowRecordBatchStreamSupplier
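
The schema-reuse change above can be illustrated with a minimal sketch. It does not use the real Arrow or neug types: `Schema`, `Dataset`, `MockDatasetFactory`, `MockArrowOptions`, and `BuildDataset` are hypothetical stand-ins that only model the idea that inspection is costly and can be skipped when a schema is already known.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the Arrow or
// neug types. They only model "Inspect() is expensive, Finish(schema) is not".
struct Schema {
  std::vector<std::string> field_names;
};

struct Dataset {
  std::shared_ptr<Schema> schema;
};

class MockDatasetFactory {
 public:
  // Models a costly scan of the input to infer a schema.
  std::shared_ptr<Schema> Inspect() {
    ++inspect_calls_;
    auto schema = std::make_shared<Schema>();
    schema->field_names = {"id", "name"};
    return schema;
  }

  // Finish with a caller-supplied schema: no inspection is performed.
  Dataset Finish(std::shared_ptr<Schema> schema) {
    return Dataset{std::move(schema)};
  }

  // Finish without a schema: falls back to inspecting the input.
  Dataset Finish() { return Finish(Inspect()); }

  int inspect_calls() const { return inspect_calls_; }

 private:
  int inspect_calls_ = 0;
};

// Hypothetical options struct carrying an optional, already-known schema.
struct MockArrowOptions {
  std::shared_ptr<Schema> dataset_schema;  // null when no entry schema exists
};

// Reuse the entry schema when present; otherwise let the factory inspect.
Dataset BuildDataset(MockDatasetFactory& factory,
                     const MockArrowOptions& opts) {
  if (opts.dataset_schema) {
    return factory.Finish(opts.dataset_schema);
  }
  return factory.Finish();
}
```

In this sketch, calling `BuildDataset` with a populated `dataset_schema` leaves the factory's inspection counter at zero, while calling it with an empty options struct triggers exactly one inspection.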

Test plan

test_import_bad_csv now correctly raises ERR_IO_ERROR for invalid CSV data
All 40 import/export tests pass
Existing query and storage tests unaffected

🤖 Generated with Claude Code

Fixes #220

…tages

Extract kSniffBlockSize (1MB) to the neug::reader namespace in options.h and apply it in the CSV, JSON, JSONL, and Parquet sniff functions to avoid excessive Arrow memory pool allocation during schema inference. Also set dataset_schema in ScanOptions to skip the redundant Inspect() call in create_scanner, preventing large block_size-proportional allocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
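
A rough sketch of how a fixed sniff block size might bound allocation during schema inference. Only the constant name `kSniffBlockSize` and its 1MB value come from the commit message; the `sketch` namespace, the `SniffBlockSize` helper, and the choice to clamp (rather than unconditionally override) the configured value are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdint>

namespace sketch {  // hypothetical namespace, not neug::reader itself

// 1 MB, matching the value stated in the commit message.
constexpr std::int64_t kSniffBlockSize = 1 << 20;

// Assumption: sniffing never needs a block larger than kSniffBlockSize, and a
// non-positive configured value means "not configured".
inline std::int64_t SniffBlockSize(std::int64_t configured_block_size) {
  if (configured_block_size <= 0) {
    return kSniffBlockSize;
  }
  return std::min(configured_block_size, kSniffBlockSize);
}

}  // namespace sketch
```

Under these assumptions, a user-configured 64MB block size would be reduced to 1MB for the sniff pass only, while a smaller configured value such as 4KB would be left unchanged.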

…turning nullptr

GetNextBatch() in CSVStreamRecordBatchSupplier, CSVTableRecordBatchSupplier, and ArrowRecordBatchStreamSupplier previously swallowed Arrow read errors by logging and returning nullptr. After the dataset_schema optimization skipped Inspect(), data validation moved to lazy batch reading, where errors were silently ignored. Now throw IOException so errors propagate through the pipeline to the caller.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
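
The behavioral change described in this commit message can be sketched as follows. This is not the code from loader_utils.cc: `Batch`, `ReadResult`, `MockSupplier`, and the local `IOException` class are hypothetical stand-ins, used only to show the difference between a clean end of stream (nullptr) and a read failure (exception).

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; not the Arrow or neug types.
struct Batch {
  int num_rows = 0;
};

struct ReadResult {
  bool ok = true;
  std::string error;             // populated only when ok == false
  std::shared_ptr<Batch> batch;  // payload when ok == true
};

// Local exception type standing in for the project's IOException.
class IOException : public std::runtime_error {
 public:
  using std::runtime_error::runtime_error;
};

class MockSupplier {
 public:
  explicit MockSupplier(std::vector<ReadResult> results)
      : results_(std::move(results)) {}

  // Returns nullptr only at a clean end of stream; a failed read throws so
  // the caller cannot mistake an error for "no more data".
  std::shared_ptr<Batch> GetNextBatch() {
    if (pos_ >= results_.size()) {
      return nullptr;
    }
    const ReadResult& r = results_[pos_++];
    if (!r.ok) {
      throw IOException("failed to read next batch: " + r.error);
    }
    return r.batch;
  }

 private:
  std::vector<ReadResult> results_;
  std::size_t pos_ = 0;
};
```

With the earlier log-and-return-nullptr behavior, a caller looping until nullptr would treat a malformed batch as the end of the input. Throwing instead makes the failure visible at the call site, which is consistent with the test plan's expectation that invalid CSV data now raises an error.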

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

longbinlai

If this is a fix, please add test cases

shirly121 and others added 4 commits June 1, 2026 20:02

style: fix clang-format spacing in constructor initializer lists

4dc284c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'main' into mem_optimize

8eb482f

shirly121 requested a review from longbinlai June 4, 2026 03:06

longbinlai reviewed Jun 4, 2026


shirly121 changed the title ~~fix: optimize Arrow memory and propagate CSV read errors~~ refactor: optimize Arrow memory and propagate CSV read errors Jun 4, 2026

shirly121 wants to merge 4 commits into alibaba:main from shirly121:mem_optimize


2 participants
