Add fluentbit -compact compressions that preserve CRI time nanoseconds by solsson · Pull Request #2 · Yolean/kubernetes-logs-datalake

solsson · 2026-02-24T14:13:24Z

... without the storage space and parsing cost of iso 8601 varchars. Also use dictionaries for "stream" (stdout/stderr) and "logtag" (logtag) where possible (arrow, not parquet, AFAIK), again to reduce storage space. Use ZSTD for arrow output because that's what the nanoarrow extension (see https://duckdb.org/2025/05/23/arrow-ipc-support-in-duckdb) supports at the time of writing.

Write each log record to S3 twice via two fluent-bit s3 outputs: - .arrow files: CRI time preserved as VARCHAR (nanosecond ISO 8601) with json_date_key=false suppressing the internal timestamp - .parquet files: time_ms as BIGINT (epoch_ms) alongside CRI time y-logcli gains -f flag (arrow/parquet/both) to query either or both formats. UNION ALL combines them with matching TIMESTAMPTZ columns. test.sh polls each format independently and asserts: - Schema: arrow time is VARCHAR, parquet time_ms is BIGINT - Arrow time values match CRI timestamp format (nanosecond precision) - Both formats contain equal row counts - SIGTERM flush works with dual outputs Path layout adds NODE_NAME between date and pod/container components. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Arrow-compact stores Timestamp(ns) without timezone metadata. The previous `time::TIMESTAMPTZ` cast made DuckDB interpret the value as local time, causing timestamps to display 1 hour off (in CET) from the parquet-compact output. Using `AT TIME ZONE 'UTC'` correctly tells DuckDB the timezone-naive nanos are UTC. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The arrow-compact compression was incorrectly calling table_to_parquet_buffer(), producing Parquet files with a .arrow extension. Now uses a new table_to_arrow_ipc_buffer() that writes uncompressed Feather v2 (Arrow IPC) via garrow_table_write_as_feather with GARROW_COMPRESSION_TYPE_UNCOMPRESSED (nanoarrow/DuckDB cannot decode LZ4-compressed Arrow IPC bodies). Also: - Skip dictionary encoding for arrow-compact (breaks nanoarrow reader) - y-logcli: use read_arrow() via nanoarrow extension for .arrow files - test.sh: use read_arrow for schema checks, show distinct metadata for each format (DESCRIBE for Arrow IPC, parquet_schema/metadata for Parquet) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Write stream and logtag as Arrow dictionary type in both the Arrow IPC and Parquet output paths. pyarrow confirms dictionary<values=string, indices=int32> in the written files. DuckDB/nanoarrow cannot read dictionary-encoded Arrow IPC (paleolimbot/duckdb-nanoarrow#25), so test.sh wraps nanoarrow-based arrow assertions with ON_DUCKDB_ARROW_FAILURE (default: continue). The pyarrow probe via arrow-tools image is now the authoritative Arrow IPC format validator. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- inspect_arrow.py: format timestamps as ISO 8601 with ns precision, show first 5 rows (burst shows distinct ns per row), pick earliest file from log-generator container - test.sh: print progress before DuckDB arrow probe and pyarrow validation steps so the script doesn't appear stalled Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stream (stdout/stderr) and logtag (F/P) have very low cardinality, so int8 indices save 3 bytes per row vs the default int32. After dictionary encoding, cast indices from int32 to int8 and rebuild the dictionary array with the compact index type. Unit tests verify int8 via assert_dict_int8() (32 passed, 0 failed). test.sh pyarrow assertion confirms dictionary<values=string, indices=int8>. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Gives us both stream values (stdout/stderr) in the test data, exercising the dictionary-encoded stream column. y-logcli: print column types above table output DuckDB's box mode truncates long type names like TIMESTAMP WITH TIME ZONE. Print a -- name: TYPE header before the table so full types are always visible. y-logcli: add -o columns mode with ISO 8601 nanosecond timestamps Compact output: space-separated time, pod, container, stream, message (truncated to 60 chars). Timestamps use AT TIME ZONE 'UTC' to extract UTC from TIMESTAMPTZ before formatting with epoch_ns nanoseconds. test.sh: print DuckDB default interpretation of persisted formats Shows DESCRIBE read_parquet and read_arrow output so we can see how DuckDB maps the on-disk types without any explicit type handling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ecision DuckDB maps Parquet isAdjustedToUTC=true to TIMESTAMP WITH TIME ZONE which is microsecond precision, silently truncating nanoseconds. Both formats now write Timestamp(ns) without timezone annotation — DuckDB reads as TIMESTAMP_NS preserving full nanosecond precision. Timestamps are UTC by convention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

solsson and others added 10 commits February 23, 2026 22:17

shows sample file metadata during test.sh

bfe2b5f

prints arrow format metadata using apache arrow's lib

2789163

solsson merged commit a1a5972 into main Feb 24, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add fluentbit -compact compressions that preserve CRI time nanoseconds#2

Add fluentbit -compact compressions that preserve CRI time nanoseconds#2
solsson merged 10 commits into
mainfrom
cri-columns-duckdb-parquet-optimization

solsson commented Feb 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

solsson commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

solsson commented Feb 24, 2026 •

edited

Loading