Fix JsonlWriter crash on pandas.Timestamp metadata #490

Open
discobot wants to merge 1 commit into huggingface:main from discobot:fix/325-jsonl-writer-timestamp

discobot commented Jun 12, 2026

Fixes #325.

orjson serializes plain datetime/date/time natively but deliberately rejects their subclasses. pandas.Timestamp (what ParquetReader yields for timestamp columns) is a datetime.datetime subclass, so orjson.dumps raises when no default= handler is given. The handler added here emits isoformat() for date/time subclasses, matching orjson's native datetime output exactly, while other unserializable types still fail loudly. The alternative, a blanket default=str, would silently stringify anything; happy to switch to that if preferred.

The new regression test fails on main with the exact TypeError: Type is not JSON serializable: Timestamp from the issue and passes with this change.


Note

Low Risk
Narrow change to JSON serialization in JsonlWriter with explicit errors for unknown types; no auth, security, or data-path changes beyond fixing a metadata edge case.

Overview
Fixes JsonlWriter crashing when document metadata contains pandas.Timestamp (and other date/time subclasses) by passing a default=_json_default hook to orjson.dumps. Subclasses that orjson won’t serialize natively are written as isoformat() strings, aligned with native datetime output; other non-JSON types still raise TypeError.

Adds a regression test that writes a document with timestamp metadata and asserts the ISO string in the JSONL output, plus a require_pandas skip decorator for optional test deps.

Reviewed by Cursor Bugbot for commit a555fec. Bugbot is set up for automated code reviews on this repo.

orjson serializes datetime objects natively but raises TypeError for their subclasses, so documents with pandas.Timestamp metadata (e.g. timestamp columns read by ParquetReader) crashed JsonlWriter. Pass a default handler to orjson.dumps that converts datetime date/time subclasses to their ISO format, matching orjson's native datetime output. Unknown types still raise TypeError. Adds a regression test writing a document with pandas.Timestamp metadata.

Fixes huggingface#325

Development

Successfully merging this pull request may close these issues.

JsonlWriter failed when trying to write Timestamp object