feat: CSV array type support via string-to-list post-processing #472

Open
shirly121 wants to merge 5 commits into alibaba:main from shirly121:csv_list

shirly121 (Collaborator) commented Jun 3, 2026

Summary

  • Implement CSV array type support by post-processing string columns into Arrow list arrays, enabling COPY FROM with array-typed columns
  • Add Arrow list type support in the execution layer for proper CAST handling of list columns
  • Refactor appendScalarValue to use neug::Date/DateTime and simplify get_elem by reusing get_arrow_scalar_value
  • Guard Database.close() against double-close errors
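
As a rough illustration of the string-to-list idea in the first bullet, here is a minimal Python sketch that parses one CSV cell into a nested list. The cell syntax (square brackets, comma-separated elements, no quoting) and the name `parse_list_cell` are assumptions made for this example; the PR's actual logic lives in the C++ ArrowTypeCaster and may differ.

```python
from typing import Any, Callable, List


def parse_list_cell(text: str, cast: Callable[[str], Any] = int) -> List[Any]:
    """Parse a bracketed cell such as "[1,2,[3]]" into nested Python lists.

    Assumed format (not confirmed by the PR): square brackets, comma-separated
    elements, no quoting or escaping. Empty elements are skipped.
    """
    text = text.strip()
    pos = 0

    def parse_list() -> List[Any]:
        nonlocal pos
        if pos >= len(text) or text[pos] != "[":
            raise ValueError(f"expected '[' at offset {pos} in {text!r}")
        pos += 1  # consume '['
        items: List[Any] = []
        token: List[str] = []

        def flush() -> None:
            # Convert the buffered characters into one leaf element.
            raw = "".join(token).strip()
            token.clear()
            if raw:
                items.append(cast(raw))

        while pos < len(text):
            ch = text[pos]
            if ch == "[":
                # Nested list: the recursive call advances pos past its ']'.
                items.append(parse_list())
                continue
            pos += 1
            if ch == "]":
                flush()
                return items
            if ch == ",":
                flush()
            else:
                token.append(ch)
        raise ValueError(f"unterminated list in {text!r}")

    result = parse_list()
    if pos != len(text):
        raise ValueError(f"trailing characters after list in {text!r}")
    return result
```

Passing `cast=float` (or any other callable) changes the leaf type, which loosely mirrors casting the same string column to different list element types.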

Changes

  • src/utils/reader/arrow_type_cast.cc + include/neug/utils/reader/arrow_type_cast.h: New string-to-list type cast logic for CSV array parsing
  • src/utils/reader/reader.cc: Integrate post-processing step after CSV read to convert string columns to list types
  • src/utils/reader/options.cc + include/neug/utils/reader/options.h: Add list type options for CSV format
  • src/execution/common/columns/arrow_context_column.cc: Add list type handling in Arrow context column
  • tools/python_bind/neug/database.py: Guard against double-close in Database.close()
  • tests/: Add comprehensive C++ and Python tests for array loading and sniffer

Test plan

  • New test_load_array.py with 250+ lines covering various array type scenarios
  • New test_reader.cc with C++ unit tests for arrow type cast
  • Sniffer tests for array type detection
  • Existing tests unaffected

🤖 Generated with Claude Code

Fixes #441

shirly121 and others added 5 commits June 2, 2026 23:25
…sing

- CSV ConvertOptions: override list columns to large_utf8 for Arrow CSV Reader
- ArrowTypeCaster framework: string->list type conversion for full_read and batch_read
- ArrowTypeCaster supports nested lists, dates, timestamps, intervals
- CSV-safe schema for projection (createCsvSafeSchema)
- Add C++ unit tests for CSV array reading (test_reader.cc)
- Add C++ sniffer test for list-like column inference (test_sniffer.cc)
- Add Python end-to-end tests for LOAD FROM CSV with CAST to array types
- Add Python sniffer test for CAST to list type (xfail)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
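
The two-phase flow described in this commit (read list columns as strings, then convert them afterwards) can be modelled in a few lines of standard-library Python. This is only a sketch of the idea: the type names, `csv_safe_schema`, and `load_csv` are hypothetical, and `json.loads` stands in for the real string-to-list cast.

```python
import csv
import io
import json

# Hypothetical declared schema; "[]" marks a list type in this sketch.
DECLARED = {"id": "INT64", "scores": "DOUBLE[]"}


def csv_safe_schema(declared):
    """Replace list types with a plain string type the CSV reader can handle."""
    return {
        name: ("STRING" if type_name.endswith("[]") else type_name)
        for name, type_name in declared.items()
    }


def load_csv(text, declared):
    safe = csv_safe_schema(declared)
    # Phase 1: read with the CSV-safe schema, so list columns stay strings.
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append(
            {n: int(raw[n]) if t == "INT64" else raw[n] for n, t in safe.items()}
        )
    # Phase 2: post-process the string columns that were declared as lists.
    for row in rows:
        for name, type_name in declared.items():
            if type_name.endswith("[]"):
                row[name] = json.loads(row[name])
    return rows
```

For example, `load_csv('id,scores\n1,"[1.5, 2.0]"\n', DECLARED)` reads `scores` as the string `"[1.5, 2.0]"` in phase 1 and turns it into a list in phase 2.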
ArrowArrayContextColumn::get_elem() did not support arrow::Type::LIST,
causing "Unsupported arrow type: list<item: float>" when executing
CAST(col, 'FLOAT[]') on CSV-loaded data. Extract scalar value logic
into get_arrow_scalar_value() helper and add recursive list/fixed-size
list handling for both get_elem() and nested list elements.

Also remove xfail marker from test_csv_cast_list since it now passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
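
The recursion described above, where one helper serves both top-level values and nested list elements, might look roughly like this in Python. It is a simplified model, not the C++ code: type names are plain strings, `CASTS` and `get_scalar_value` are hypothetical, and the real helper dispatches on Arrow types.

```python
# Hypothetical leaf converters keyed by type name.
CASTS = {"INT64": int, "FLOAT": float, "STRING": str}


def get_scalar_value(value, type_name):
    """Convert value to type_name, recursing through list types ("T[]")."""
    if value is None:
        return None  # nulls pass through unchanged
    if type_name.endswith("[]"):
        # List type: convert every element with the same helper.
        element_type = type_name[:-2]
        return [get_scalar_value(v, element_type) for v in value]
    try:
        cast = CASTS[type_name]
    except KeyError:
        raise TypeError(f"Unsupported type: {type_name}") from None
    return cast(value)
```

Because the list branch calls back into the same function, arbitrarily nested types such as `INT64[][]` need no extra code paths.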
…assertions

Add early return in Database.close() when _database is already None,
preventing "I/O operation on closed file" from __del__ during interpreter
shutdown. Also replace string-based assertions in test_load_array with
exact type comparisons (datetime.date, datetime.datetime).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
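
A minimal sketch of the double-close guard, assuming a handle stored in `_database` as the commit message describes. The class below is a stand-in written for this example, not the actual `neug` `Database`; `close_calls` exists only to make the behaviour observable.

```python
class Database:
    """Stand-in class illustrating an idempotent close()."""

    def __init__(self):
        self._database = object()  # placeholder for the native handle
        self.close_calls = 0       # counts real releases, for demonstration

    def close(self):
        # Early return: nothing to release if already closed or never opened.
        if getattr(self, "_database", None) is None:
            return
        self.close_calls += 1
        self._database = None

    def __del__(self):
        # Safe to call after an explicit close() thanks to the guard.
        self.close()
```

With the guard in place, an explicit `close()` followed by garbage collection releases the handle only once.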
…ove unused castBatch

- Remove duplicated LIST/FIXED_SIZE_LIST handling in get_elem, delegate to
  get_arrow_scalar_value which already covers all types
- Remove unused ArrowTypeCaster::castBatch (replaced by LazyTypeCastRecordBatch)
- Fix clang-format 10.0.1 formatting in options.h

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tamp in appendScalarValue

Replace neug::common::Date::fromCString and neug::common::Timestamp::fromCString
with neug::Date and neug::DateTime from utils/property/types.h, aligning with
the project's standard type system. Adjust timestamp unit conversions accordingly
(milli_second base instead of microsecond base).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
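
To make the unit adjustment concrete, the sketch below converts a millisecond-based timestamp into the other common timestamp units. The function names and the unit table are assumptions for illustration, and naive datetimes are treated as UTC; the PR's own conversion is in C++ and is not shown here.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
TICKS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def epoch_millis(text: str) -> int:
    """Parse an ISO-8601 timestamp into whole milliseconds since the epoch."""
    return (datetime.fromisoformat(text) - EPOCH) // timedelta(milliseconds=1)


def millis_to_unit(millis: int, unit: str) -> int:
    """Rescale a millisecond count to the requested unit."""
    ticks = TICKS_PER_SECOND[unit]
    if ticks >= 1_000:
        # Finer or equal resolution: multiply, no precision is lost.
        return millis * (ticks // 1_000)
    # Coarser resolution (seconds): floor division drops the sub-second part.
    return millis // (1_000 // ticks)
```

Starting from a millisecond base means microsecond and nanosecond targets are scaled up by 1,000 and 1,000,000 respectively, while a seconds target is scaled down.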
Development

Successfully merging this pull request may close these issues.

CSV: Array (LIST) type not supported in load_data