IRI/URL sanitization #60

@thenonameguy

If the DataFrame contains invalid IRI/URL data, it is currently serialized without the problem being detected, at neither mapping time nor serialization time.

This leaves library users having to implement functions like:

 def sanitize_iris(df: pl.DataFrame, *cols: str) -> pl.DataFrame:
     """Sanitize IRI columns using native Polars expressions (columnar).
 
     1. Strips leading/trailing whitespace.
     2. Percent-encodes spaces (``%20``).
     3. Nulls out values that don't start with ``http``, contain non-ASCII
        characters, or contain characters that are illegal in an IRI
        (``" < > \\ ^ ` { | }`` and control chars) — broken source URLs that
        RDFox/rdflib reject.
     """

Having a more ergonomic/strict mode of serialization that actively catches these cases (like rdflib does at read-time) would be useful.
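
As a rough sketch of what such a strict mode could look like from the user's side, the check below raises instead of silently passing bad values through. Everything here is an assumption for illustration: `InvalidIRIError` and `check_iris` are hypothetical names, not part of this library or of rdflib, and the rejected character set is just whitespace, control characters, and the characters ``" < > \ ^ ` { | }``.

```python
import re
from typing import Iterable, List

# Assumed rejection rule: whitespace, control chars, or " < > \ ^ ` { | }
_INVALID_IRI_CHARS = re.compile(r'[\x00-\x20\x7f"<>\\^`{|}]')


class InvalidIRIError(ValueError):
    """Hypothetical error type for a strict serialization mode."""


def check_iris(values: Iterable[str]) -> List[str]:
    """Return the values unchanged, or raise if any of them is invalid."""
    values = list(values)
    bad = [v for v in values if _INVALID_IRI_CHARS.search(v)]
    if bad:
        # Report the count plus one example so the error stays readable.
        raise InvalidIRIError(f"{len(bad)} invalid IRI(s), e.g. {bad[0]!r}")
    return values
```

Running this kind of check before serialization would surface the problem at the point where the data enters the graph, rather than when a downstream store rejects the output.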
