feat: implement DataWriter for Iceberg data files #552
shangxinli wants to merge 3 commits into apache:main
Conversation
Force-pushed 8944a75 to a201953
src/iceberg/data/data_writer.cc (Outdated)
```cpp
ICEBERG_ASSIGN_OR_RAISE(writer_,
                        WriterFactoryRegistry::Open(options_.format, writer_options));
return {};
```
It is odd that an empty structure is always returned. Also, since this is initialization, why not do it in the ctor?

Refactored the initialization logic.
```cpp
if (closed_) {
  return InvalidArgument("Writer already closed");
}
```
I could see a case for making close idempotent. Is there any strong reason to return this error instead of, say, a no-op?
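A minimal sketch of what the idempotent variant could look like, reusing the `ICEBERG_RETURN_UNEXPECTED` macro already quoted in this thread (the PR eventually adopted this behavior, per the commit message below):

```cpp
// Sketch only: repeated calls succeed instead of returning an error.
Status Close() {
  if (closed_) {
    return {};  // no-op: already closed, treat as success
  }
  ICEBERG_RETURN_UNEXPECTED(writer_->Close());
  closed_ = true;
  return {};
}
```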
```cpp
  return InvalidArgument("Writer already closed");
}
ICEBERG_RETURN_UNEXPECTED(writer_->Close());
closed_ = true;
```
Should this class address thread safety?

Good question! I've added explicit documentation that this class is not thread-safe.

I don't think a single writer (or reader) should support thread safety, so it is fine not to add a comment like this.

@wgtmac out of curiosity, for my own knowledge: what guarantees that a single writer/reader will be using the class?

These file writers are supposed to be used by a single write task, which can for example be a unit of a table sink operator in a SQL job plan. Usually the writer is responsible for partitioned (and sometimes sorted) data chunks.

Agreed. Removed the thread safety comment from the header.
src/iceberg/test/data_writer_test.cc (Outdated)
```cpp
TEST_F(DataWriterTest, CreateWithParquetFormat) {
  DataWriterOptions options{
      .path = "test_data.parquet",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kParquet,
      .io = file_io_,
      .properties = {{"write.parquet.compression-codec", "uncompressed"}},
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}

TEST_F(DataWriterTest, CreateWithAvroFormat) {
  DataWriterOptions options{
      .path = "test_data.avro",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kAvro,
      .io = file_io_,
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}
```
nit: The two tests are quite similar; it is probably possible to use a helper function to reduce the duplication.

Consolidated the two tests using parameterized testing.
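A hedged sketch of the parameterized form using GoogleTest's `TEST_P` machinery; the `FormatCase` struct and the reuse of the `DataWriterTest` fixture are illustrative, not the PR's exact code:

```cpp
struct FormatCase {
  std::string path;
  FileFormatType format;
};

// Inherit the existing fixture so schema_, partition_spec_, and file_io_
// stay available, and add the parameter interface on top.
class DataWriterFormatTest : public DataWriterTest,
                             public ::testing::WithParamInterface<FormatCase> {};

TEST_P(DataWriterFormatTest, CreateWriter) {
  const auto& param = GetParam();
  DataWriterOptions options{
      .path = param.path,
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = param.format,
      .io = file_io_,
  };
  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  ASSERT_NE(writer_result.value(), nullptr);
}

INSTANTIATE_TEST_SUITE_P(
    Formats, DataWriterFormatTest,
    ::testing::Values(FormatCase{"test_data.parquet", FileFormatType::kParquet},
                      FormatCase{"test_data.avro", FileFormatType::kAvro}));
```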
```cpp
// Check length before close
auto length_result = writer->Length();
ASSERT_THAT(length_result, IsOk());
EXPECT_GT(length_result.value(), 0);
```
nit: check the size of the data passed to the write function?
src/iceberg/data/data_writer.cc (Outdated)
```cpp
if (!writer_) {
  return InvalidArgument("Writer not initialized");
}
```
Suggested change:

```diff
- if (!writer_) {
-   return InvalidArgument("Writer not initialized");
- }
+ ICEBERG_PRECHECK(writer_, "Writer not initialized");
```

nit: this should make the code shorter.
Replaced all manual null checks with ICEBERG_PRECHECK.
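For context, a precheck-style macro along these lines would collapse the three-line guard into one. This is a hedged sketch of the general pattern only, not the actual `ICEBERG_PRECHECK` definition:

```cpp
// Sketch of the pattern; the real macro lives in the Iceberg codebase and
// may differ in details (error type, message formatting, etc.).
#define PRECHECK_SKETCH(condition, message) \
  do {                                      \
    if (!(condition)) {                     \
      return InvalidArgument(message);      \
    }                                       \
  } while (false)
```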
src/iceberg/data/data_writer.cc (Outdated)
```cpp
}

Result<FileWriter::WriteResult> Metadata() {
  if (!closed_) {
```
nit: use ICEBERG_CHECK here
src/iceberg/test/data_writer_test.cc (Outdated)
```cpp
EXPECT_GT(length.value(), 0);
}

}  // namespace
```
nit: move this closing namespace brace before the first TEST_F?
Force-pushed 90d324e to 153d763
Implements DataWriter class for writing Iceberg data files as part of issue apache#441 (task 2).

Implementation:
- Static factory method DataWriter::Make() for creating writer instances
- Support for Parquet and Avro file formats via WriterFactoryRegistry
- Complete DataFile metadata generation including partition info, column statistics, serialized bounds, and sort order ID
- Proper lifecycle management with Write/Close/Metadata methods
- Idempotent Close() - multiple calls succeed (no-op after first)
- PIMPL idiom for ABI stability
- Not thread-safe (documented)

Tests:
- 13 comprehensive unit tests including parameterized format tests
- Coverage: creation, write/close lifecycle, metadata generation, error handling, feature validation, and data size verification
- All tests passing (13/13)

Related to apache#441
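Taken together, the lifecycle described above would look roughly like this from a caller's perspective (a hedged sketch; `options` and `arrow_array` construction are elided):

```cpp
// Make -> Write -> Close -> Metadata, using the macros quoted in this PR.
ICEBERG_ASSIGN_OR_RAISE(auto writer, DataWriter::Make(options));
ICEBERG_RETURN_UNEXPECTED(writer->Write(arrow_array));      // one or more batches
ICEBERG_RETURN_UNEXPECTED(writer->Close());                 // idempotent after this PR
ICEBERG_ASSIGN_OR_RAISE(auto write_result, writer->Metadata());  // valid only post-close
```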
Force-pushed 153d763 to 147f25b
src/iceberg/data/data_writer.cc (Outdated)
```cpp
class DataWriter::Impl {
 public:
  static Result<std::unique_ptr<Impl>> Make(DataWriterOptions options) {
    WriterOptions writer_options;
```
nit: use aggregate initialization for writer_options

Done. Changed to use aggregate initialization for WriterOptions.
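A hedged sketch of the aggregate form; the exact WriterOptions fields are assumptions here, mirrored from the DataWriterOptions fields visible elsewhere in this PR:

```cpp
// Field names are assumed, not verified against the WriterOptions header.
WriterOptions writer_options{
    .path = options.path,
    .schema = options.schema,
    .io = options.io,
    .properties = options.properties,
};
```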
src/iceberg/data/data_writer.cc (Outdated)
```cpp
}

Status Write(ArrowArray* data) {
  ICEBERG_PRECHECK(writer_, "Writer not initialized");
```
Will this check ever fail? If not, should we remove the check or use ICEBERG_DCHECK instead? Same question for below.

Good point. Since writer_ is always initialized in Make() (the constructor is private and only called after successful writer creation), the check can never fail. Changed to ICEBERG_DCHECK for all three usages.
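A hedged sketch of the invariant the reply describes: the constructor is private and only reachable through Make() after the writer was created successfully, so writer_ can never be null and a debug-only DCHECK suffices (names simplified from the PR):

```cpp
class Impl {
 public:
  static Result<std::unique_ptr<Impl>> Make(DataWriterOptions options) {
    // writer_options construction elided; see the aggregate-initialization
    // sketch earlier in this thread. On failure, Make() returns early and
    // the ctor below is never reached.
    ICEBERG_ASSIGN_OR_RAISE(auto writer,
                            WriterFactoryRegistry::Open(options.format, writer_options));
    return std::unique_ptr<Impl>(new Impl(std::move(writer)));
  }

 private:
  explicit Impl(std::unique_ptr<FileWriter> writer) : writer_(std::move(writer)) {}
  std::unique_ptr<FileWriter> writer_;  // invariant: non-null after construction
};
```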
src/iceberg/data/data_writer.cc (Outdated)
```cpp
}

Result<FileWriter::WriteResult> Metadata() {
  ICEBERG_PRECHECK(closed_, "Cannot get metadata before closing the writer");
```
Suggested change:

```diff
- ICEBERG_PRECHECK(closed_, "Cannot get metadata before closing the writer");
+ ICEBERG_CHECK(closed_, "Cannot get metadata before closing the writer");
```

We should return an invalid-state error instead of an invalid argument in this case.
Done. Changed to ICEBERG_CHECK, which returns ValidationFailed instead of InvalidArgument. Updated the test expectation accordingly.
src/iceberg/data/data_writer.cc (Outdated)
```cpp
data_file->file_path = options_.path;
data_file->file_format = options_.format;
data_file->partition = options_.partition;
data_file->record_count = metrics.row_count.value_or(0);
```
Suggested change:

```diff
- data_file->record_count = metrics.row_count.value_or(0);
+ data_file->record_count = metrics.row_count.value_or(-1);
```

The Java impl uses -1 when the row count is unavailable.
Done. Changed to value_or(-1) to match the Java implementation.
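For reference, std::optional::value_or supplies the sentinel only when the optional is empty, so a present row count always wins:

```cpp
#include <cstdint>
#include <optional>

std::optional<int64_t> row_count;               // metrics without a row count
int64_t record_count = row_count.value_or(-1);  // -1 sentinel, matching Java
row_count = 42;
record_count = row_count.value_or(-1);          // 42: value present, sentinel unused
```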
src/iceberg/data/data_writer.cc (Outdated)
```cpp
auto split_offsets = writer_->split_offsets();

auto data_file = std::make_shared<DataFile>();
data_file->content = DataFile::Content::kData;
```
nit: use aggregate initialization

Done. Changed to use aggregate initialization for DataFile. Also used range constructors for the metrics maps (e.g. {metrics.column_sizes.begin(), metrics.column_sizes.end()}) to simplify the conversion. The bounds maps still need explicit loops due to the serialization step.
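A hedged sketch of the combined aggregate form; fields beyond those quoted in this thread (content, file_path, file_format, partition, record_count, column_sizes) and the member declaration order are assumptions:

```cpp
auto data_file = std::make_shared<DataFile>(DataFile{
    .content = DataFile::Content::kData,
    .file_path = options_.path,
    .file_format = options_.format,
    .partition = options_.partition,
    .record_count = metrics.row_count.value_or(-1),
    // Range construction converts the unordered_map metrics into the
    // ordered map that DataFile stores, replacing the explicit for-loop.
    .column_sizes = {metrics.column_sizes.begin(), metrics.column_sizes.end()},
});
```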
src/iceberg/data/data_writer.cc (Outdated)
```cpp
// Convert metrics maps from unordered_map to map
for (const auto& [col_id, size] : metrics.column_sizes) {
  data_file->column_sizes[col_id] = size;
```
Do you think it makes sense to change the DataFile and Metrics classes to use std::map or std::unordered_map consistently so we don't need a for-loop here?
cc @zhjwpku

That would be a nice cleanup. For now I've simplified the conversion by using range constructors instead of explicit for-loops. Changing Metrics and DataFile to use the same map type would be a good follow-up.
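The range-constructor conversion the reply refers to is plain standard C++; a self-contained illustration:

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>

int main() {
  std::unordered_map<int32_t, int64_t> column_sizes{{1, 128}, {2, 256}};
  // One-shot conversion: std::map's iterator-range constructor replaces
  // the element-by-element for-loop.
  std::map<int32_t, int64_t> ordered(column_sizes.begin(), column_sizes.end());
  return ordered.size() == 2 ? 0 : 1;
}
```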
src/iceberg/test/data_writer_test.cc (Outdated)
```cpp
SchemaField::MakeRequired(1, "id", std::make_shared<IntType>()),
SchemaField::MakeOptional(2, "name", std::make_shared<StringType>())});
```
Suggested change:

```diff
- SchemaField::MakeRequired(1, "id", std::make_shared<IntType>()),
- SchemaField::MakeOptional(2, "name", std::make_shared<StringType>())});
+ SchemaField::MakeRequired(1, "id", int32()),
+ SchemaField::MakeOptional(2, "name", string())});
```
Done. Changed to use the int32() and string() factory functions.
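In context, the test schema would then read roughly as follows; the surrounding Schema constructor shape is inferred from the snippet quoted above, not verified against the full header:

```cpp
// Sketch: type factory functions in place of std::make_shared<...Type>().
auto schema = std::make_shared<Schema>(std::vector<SchemaField>{
    SchemaField::MakeRequired(1, "id", int32()),
    SchemaField::MakeOptional(2, "name", string()),
});
```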
```cpp
using ::testing::HasSubstr;

class DataWriterTest : public ::testing::Test {
```
Can we try to consolidate the test cases? Each of them only tests a tiny API with repeated boilerplate for creating a writer and writing data, which may lead to a test-case explosion if more cases like this are added.

Done. Added MakeDefaultOptions() and WriteTestDataToWriter() helpers to reduce boilerplate. Consolidated related tests: merged WriteAndClose+LengthIncreasesAfterWrite, GetMetadataAfterClose+MetadataContainsColumnMetrics, and SortOrderIdPreserved+SortOrderIdNullByDefault. Reduced from 10 test cases to 7.
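A hedged sketch of what such fixture helpers might look like; the exact signatures in the PR may differ, and MakeTestArrowArray() is a hypothetical data helper:

```cpp
// Builds options from the fixture's shared members, varying only the
// per-test path and format.
DataWriterOptions DataWriterTest::MakeDefaultOptions(std::string path,
                                                     FileFormatType format) {
  return DataWriterOptions{
      .path = std::move(path),
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = format,
      .io = file_io_,
  };
}

// Creates a writer and writes one batch of test data, so individual tests
// only assert on the behavior they actually cover.
std::unique_ptr<DataWriter> DataWriterTest::WriteTestDataToWriter(
    const DataWriterOptions& options) {
  auto writer_result = DataWriter::Make(options);
  EXPECT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  EXPECT_THAT(writer->Write(MakeTestArrowArray()), IsOk());  // hypothetical helper
  return writer;
}
```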
- Use aggregate initialization for WriterOptions and DataFile
- Change ICEBERG_PRECHECK(writer_) to ICEBERG_DCHECK (can never fail)
- Use ICEBERG_CHECK for closed state check (returns ValidationFailed)
- Use value_or(-1) for missing row count to match Java impl
- Use range constructors for metrics map conversion
- Remove unnecessary thread safety comment
- Use int32()/string() factory functions in tests
- Consolidate test cases and add helpers to reduce boilerplate
Implements DataWriter class for writing Iceberg data files as part of issue #441 (task 2).
Related to #441