Add fluentbit -compact compressions that preserve CRI time nanoseconds#2
Merged
Conversation
Write each log record to S3 twice via two fluent-bit s3 outputs: - .arrow files: CRI time preserved as VARCHAR (nanosecond ISO 8601) with json_date_key=false suppressing the internal timestamp - .parquet files: time_ms as BIGINT (epoch_ms) alongside CRI time y-logcli gains -f flag (arrow/parquet/both) to query either or both formats. UNION ALL combines them with matching TIMESTAMPTZ columns. test.sh polls each format independently and asserts: - Schema: arrow time is VARCHAR, parquet time_ms is BIGINT - Arrow time values match CRI timestamp format (nanosecond precision) - Both formats contain equal row counts - SIGTERM flush works with dual outputs Path layout adds NODE_NAME between date and pod/container components. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Arrow-compact stores Timestamp(ns) without timezone metadata. The previous `time::TIMESTAMPTZ` cast made DuckDB interpret the value as local time, causing timestamps to display 1 hour off (in CET) from the parquet-compact output. Using `AT TIME ZONE 'UTC'` correctly tells DuckDB the timezone-naive nanos are UTC. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The arrow-compact compression was incorrectly calling table_to_parquet_buffer(), producing Parquet files with a .arrow extension. Now uses a new table_to_arrow_ipc_buffer() that writes uncompressed Feather v2 (Arrow IPC) via garrow_table_write_as_feather with GARROW_COMPRESSION_TYPE_UNCOMPRESSED (nanoarrow/DuckDB cannot decode LZ4-compressed Arrow IPC bodies). Also: - Skip dictionary encoding for arrow-compact (breaks nanoarrow reader) - y-logcli: use read_arrow() via nanoarrow extension for .arrow files - test.sh: use read_arrow for schema checks, show distinct metadata for each format (DESCRIBE for Arrow IPC, parquet_schema/metadata for Parquet) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Write stream and logtag as Arrow dictionary type in both the Arrow IPC and Parquet output paths. pyarrow confirms dictionary<values=string, indices=int32> in the written files. DuckDB/nanoarrow cannot read dictionary-encoded Arrow IPC (paleolimbot/duckdb-nanoarrow#25), so test.sh wraps nanoarrow-based arrow assertions with ON_DUCKDB_ARROW_FAILURE (default: continue). The pyarrow probe via arrow-tools image is now the authoritative Arrow IPC format validator. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- inspect_arrow.py: format timestamps as ISO 8601 with ns precision, show first 5 rows (burst shows distinct ns per row), pick earliest file from log-generator container - test.sh: print progress before DuckDB arrow probe and pyarrow validation steps so the script doesn't appear stalled Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
stream (stdout/stderr) and logtag (F/P) have very low cardinality, so int8 indices save 3 bytes per row vs the default int32. After dictionary encoding, cast indices from int32 to int8 and rebuild the dictionary array with the compact index type. Unit tests verify int8 via assert_dict_int8() (32 passed, 0 failed). test.sh pyarrow assertion confirms dictionary<values=string, indices=int8>. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Gives us both stream values (stdout/stderr) in the test data, exercising the dictionary-encoded stream column. y-logcli: print column types above table output DuckDB's box mode truncates long type names like TIMESTAMP WITH TIME ZONE. Print a -- name: TYPE header before the table so full types are always visible. y-logcli: add -o columns mode with ISO 8601 nanosecond timestamps Compact output: space-separated time, pod, container, stream, message (truncated to 60 chars). Timestamps use AT TIME ZONE 'UTC' to extract UTC from TIMESTAMPTZ before formatting with epoch_ns nanoseconds. test.sh: print DuckDB default interpretation of persisted formats Shows DESCRIBE read_parquet and read_arrow output so we can see how DuckDB maps the on-disk types without any explicit type handling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ecision DuckDB maps Parquet isAdjustedToUTC=true to TIMESTAMP WITH TIME ZONE which is microsecond precision, silently truncating nanoseconds. Both formats now write Timestamp(ns) without timezone annotation — DuckDB reads as TIMESTAMP_NS preserving full nanosecond precision. Timestamps are UTC by convention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
... without the storage space and parsing cost of iso 8601 varchars. Also use dictionaries for "stream" (
stdout/stderr) and "logtag" (logtag) where possible (arrow, not parquet, AFAIK), again to reduce storage space. Use ZSTD for arrow output because that's what the nanoarrow extension (see https://duckdb.org/2025/05/23/arrow-ipc-support-in-duckdb) supports at the time of writing.