Kenzy-Zero/k1

K1

The foundation format of K-Series. GeoParquet + H3 spatial sorting + K-Series metadata, in one open file format that travels between big-data pipelines and the analyst's laptop.

K1 is an open geo-data file format for storing location intelligence — device pings, GPS tracks, POI events, anything with a latitude and a longitude. Files have the extension .k1 and are valid Parquet, so every tool that reads Parquet reads K1 already.

What K1 adds on top of plain Parquet:

  • H3 spatial index baked into every row.
  • Global sort by H3 cell, plus sorting_columns Parquet metadata so readers can skip whole row groups on spatial filters.
  • K-Series metadata footer carrying version, region, author, source schema, and creation time.
  • GeoParquet 1.0 compatibility — geopandas / DuckDB spatial / Sedona read the file as a GeoDataFrame natively.
  • Streaming writer with external merge sort — works on inputs too large to fit in RAM. Single 20 GB / 200 GB Parquet dumps are processed chunk by chunk; many small files are processed in parallel.
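To illustrate why a global sort combined with per-row-group min/max statistics lets a reader skip whole row groups, here is a minimal pure-Python sketch. It does not touch Parquet or the K1 SDK: `RowGroup`, `build_row_groups`, and `prune` are hypothetical names, and the keys are zero-padded hex strings standing in for H3 cell IDs.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_key: str
    max_key: str
    rows: list


def build_row_groups(keys, group_size):
    """Sort keys globally, then cut them into fixed-size groups with min/max stats."""
    ordered = sorted(keys)
    groups = []
    for start in range(0, len(ordered), group_size):
        chunk = ordered[start:start + group_size]
        groups.append(RowGroup(chunk[0], chunk[-1], chunk))
    return groups


def prune(groups, lo, hi):
    """Keep only the groups whose [min, max] range overlaps the filter range [lo, hi]."""
    return [g for g in groups if g.max_key >= lo and g.min_key <= hi]


# Stand-in keys: zero-padded hex strings, NOT real H3 cell IDs.
keys = [f"{i:04x}" for i in range(1000)]
groups = build_row_groups(keys, group_size=100)
hits = prune(groups, lo="0190", hi="01c0")
print(len(groups), len(hits))  # 10 groups in total, 1 needs to be read
```

Because the keys are sorted before being grouped, each group covers a narrow, non-overlapping key range, so a range filter touches only the groups it actually intersects. On unsorted data every group's min/max would span most of the key space and nothing could be skipped.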

K1 is a file format, not a database, a processing engine, or a visualisation tool. As with PDF, the goal is for every tool to read and write it.



Quick start

Install:

pip install k1

Write a K1 file from a pandas DataFrame:

import pandas as pd
import k1

df = pd.DataFrame([
    {"device_id": "d_8f3a2b91", "lat": 40.7128, "lon": -74.0060,
     "timestamp": "2026-01-15T09:23:11Z", "event_type": "ping",
     "accuracy_m": 12.5, "speed_kmh": 0.0},
    {"device_id": "d_c7e5d104", "lat": 51.5074, "lon": -0.1278,
     "timestamp": "2026-01-15T14:11:03Z", "event_type": "start",
     "accuracy_m": 15.0, "speed_kmh": 0.0},
    {"device_id": "d_e1a9c8b3", "lat": 35.6762, "lon": 139.6503,
     "timestamp": "2026-01-15T18:55:21Z", "event_type": "ping",
     "accuracy_m": 22.0, "speed_kmh": 11.4},
    {"device_id": "d_47b22f01", "lat": 52.5200, "lon": 13.4050,
     "timestamp": "2026-01-15T20:01:09Z", "event_type": "end",
     "accuracy_m": 9.8, "speed_kmh": 0.0},
    {"device_id": "d_fa3e0d6c", "lat": -33.8688, "lon": 151.2093,
     "timestamp": "2026-01-16T03:45:32Z", "event_type": "ping",
     "accuracy_m": 18.0, "speed_kmh": 33.5},
])

k1.write_k1(df, "mobility.k1", h3_resolution=8, author="example")

Read it back:

data = k1.load("mobility.k1")
print(data.info())
# {'path': '.../mobility.k1', 'k1_version': '1.0.0', 'h3_resolution': 8, ...}

df = data.to_dataframe()
gdf = data.to_geodataframe()    # geopandas, geometry decoded from WKB

Or from the command line:

k1 write mobility.csv mobility.k1 --region GLOBAL --author me
k1 info mobility.k1
k1 convert mobility.k1 mobility.k2     # produces a DuckDB file

For inputs that don't fit in RAM (many files, single huge file):

k1.write_k1_streaming(
    sources=["/data/export/part_0.parquet", "/data/export/part_1.parquet", ...],
    output="combined.k1",
)
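The streaming writer is described as using an external merge sort (via DuckDB). As a rough illustration of that general technique only, and not of K1's actual implementation, the sketch below sorts strings with bounded memory using the standard library: it spills sorted runs to temporary files and then merges them lazily. `external_sort` and `_read_run` are hypothetical names.

```python
import heapq
import os
import tempfile


def _read_run(path):
    """Yield the lines of one sorted run file, without trailing newlines."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def external_sort(values, chunk_size):
    """Sort an iterable of strings: sorted runs on disk, then a k-way merge."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and spill it to a temporary run file.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(v + "\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    try:
        # heapq.merge pulls from each run lazily, one pending item per run.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)


data = ["8c", "1a", "ff", "03", "7e", "1a"]
result = list(external_sort(data, chunk_size=2))
print(result)  # ['03', '1a', '1a', '7e', '8c', 'ff']
```

Peak memory is governed by `chunk_size` plus one pending item per run, which is why the approach scales to inputs larger than RAM.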

Why K1?

| Format | Query engine | H3 native | Scales | K-Series metadata | Single file |
|---|---|---|---|---|---|
| GeoJSON | no | no | breaks past a few hundred MB | no | yes |
| Shapefile | no | no | 1990s tech, 4-file bundle | no | no |
| GeoParquet | no | no | billions | no | yes |
| GeoPackage | yes (SQLite) | no | millions | no | yes |
| FlatGeobuf | no | no | fast reads | no | yes |
| PMTiles | no | no | tiles only | no | yes |
| K1 | yes (Parquet) | yes | billions | yes | yes |
| K2 | yes (DuckDB) | yes | any | yes | yes |

The point of K1 is intelligence baked in at every layer. Where other formats are just bytes on disk, a K1 file carries its H3 index, sort order, and K-Series metadata with it, so it is useful the moment it lands on someone's machine.


Features at a glance

File format

  • .k1 extension. Internally a Parquet 1.0 file.
  • Required columns: h3_index (string), h3_resolution (int32), geometry (binary WKB), k1_source_id (string), k1_imported_at (timestamp UTC).
  • Rows globally sorted ascending by h3_index.
  • sorting_columns declared in the Parquet footer so range filters can prune row groups.
  • GeoParquet 1.0 compatible (geo schema metadata key).
  • K-Series metadata footer (k1_version, k1_h3_resolution, k1_region, k1_author, etc.).
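The `geometry` column holds WKB (well-known binary). For a 2D point, little-endian WKB is a 1-byte byte-order flag, a 4-byte geometry type (1 for Point), and two 8-byte doubles, with x as longitude and y as latitude. The sketch below encodes and decodes that layout with the standard library. `point_to_wkb` and `wkb_to_point` are hypothetical helper names, not part of the K1 SDK, and whether K1 writes exactly this variant is an assumption.

```python
import struct


def point_to_wkb(lon, lat):
    """Encode a 2D point as little-endian WKB: byte order, geometry type 1, x, y."""
    return struct.pack("<BIdd", 1, 1, lon, lat)


def wkb_to_point(buf):
    """Decode a little-endian WKB point back to (lon, lat)."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", buf)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian 2D points are handled in this sketch")
    return x, y


wkb = point_to_wkb(-74.0060, 40.7128)
print(len(wkb), wkb_to_point(wkb))  # 21 (-74.006, 40.7128)
```

Note the coordinate order: WKB stores x (longitude) first, which is the reverse of the conversational "lat, lon".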

Python SDK

  • k1.write_k1(source, output, ...) — in-memory writer for inputs up to a few GB.
  • k1.write_k1_streaming(sources, output, ..., workers=None) — multi-source, memory-bounded writer with external merge sort via DuckDB and multi-core shard-pass parallelism.
  • k1.load(path) — read a .k1 file. Returns a K1 object with info(), to_dataframe(), to_geodataframe(), to_k2(), metadata, columns.
  • Auto-detection of lat/lng columns across many common name variants (lat, latitude, y, lon, lng, long, longitude, x).
  • Auto-snake-case for all column names.
  • Auto-reprojection to WGS84.
  • Single-file row-group chunking — feed in a 20 GB / 200 GB Parquet dump and it processes chunk by chunk.
  • CSV chunked reads via pd.read_csv(chunksize=…) when the source is a CSV.
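The column auto-detection and snake-casing listed above could work roughly as follows. This is a sketch under assumptions, not the SDK's code: the name variants are the ones listed in this README, but the precedence order among them, the regex rules, and the function names `to_snake_case` and `detect_lat_lon` are all hypothetical.

```python
import re

# Name variants taken from the feature list; their precedence is an assumption.
LAT_NAMES = ("lat", "latitude", "y")
LON_NAMES = ("lon", "lng", "long", "longitude", "x")


def to_snake_case(name):
    """Convert names like 'DeviceID' or 'Event Type' to 'device_id' / 'event_type'."""
    s = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    return s.strip("_").lower()


def detect_lat_lon(columns):
    """Return the original (lat_col, lon_col) names, or raise if either is missing."""
    normalized = {to_snake_case(c): c for c in columns}
    lat = next((normalized[n] for n in LAT_NAMES if n in normalized), None)
    lon = next((normalized[n] for n in LON_NAMES if n in normalized), None)
    if lat is None or lon is None:
        raise ValueError(f"could not detect lat/lon columns in {list(columns)}")
    return lat, lon


print(to_snake_case("DeviceID"), to_snake_case("Event Type"))  # device_id event_type
print(detect_lat_lon(["DeviceID", "Latitude", "Lng", "Timestamp"]))  # ('Latitude', 'Lng')
```

When detection picks the wrong columns or fails, the CLI's `--lat-col` and `--lng-col` flags exist to override it.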

CLI

  • k1 write SOURCE OUTPUT.k1 — convert any source to K1.
  • k1 info FILE.k1 — print metadata + summary (pretty or --json).
  • k1 convert FILE.k1 FILE.k2 — produce a K2 (DuckDB) file with bundled SQL engine.

JavaScript / Node SDK

  • npm install k1-js.
  • Read-only in v0.1. load(path), K1.info(), K1.rows(), K1.toGeoJSON().
  • Pure JS — Node 18+, modern browsers, no WASM.

Output compression and encoding

  • Default: zstd compression — typically ~50% smaller than snappy on K1 data.
  • BYTE_STREAM_SPLIT encoding on every float column — additional ~30–50% reduction for sequential coordinates.
  • Dictionary encoding on string columns — h3_index repeats heavily within a row group so this is near-free.
  • 1M-row Parquet row groups by default: sensible predicate-pushdown granularity, no pyarrow page-header pitfalls.
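To show what BYTE_STREAM_SPLIT does conceptually, the sketch below transposes the bytes of float64 values so that all first bytes are stored together, then all second bytes, and so on. It is a standard-library illustration of the idea, not the Parquet implementation; the function names and the synthetic latitude series are assumptions, and `zlib` stands in for zstd.

```python
import struct
import zlib


def byte_stream_split(values):
    """Transpose float64 bytes into 8 streams: all byte-0s, then all byte-1s, ..."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return b"".join(raw[i::8] for i in range(8))


def byte_stream_join(data):
    """Inverse of byte_stream_split: re-interleave the 8 streams into float64 values."""
    n = len(data) // 8
    raw = bytes(data[stream * n + i] for i in range(n) for stream in range(8))
    return list(struct.unpack(f"<{n}d", raw))


# Synthetic, slowly varying latitudes (illustrative data, not a benchmark).
latitudes = [40.7128 + k * 1e-5 for k in range(10_000)]
plain = struct.pack(f"<{len(latitudes)}d", *latitudes)
split = byte_stream_split(latitudes)

print("plain compressed bytes:", len(zlib.compress(plain)))
print("split compressed bytes:", len(zlib.compress(split)))
assert byte_stream_join(split) == latitudes  # the transform is lossless
```

The transform changes no values and saves nothing by itself. The gain comes from grouping the near-constant high-order bytes of neighbouring coordinates into long runs that a general-purpose compressor handles well.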

Documentation

Four documentation files cover the project from beginner to expert:

| Doc | Audience | What it covers |
|---|---|---|
| USER_GUIDE.md | Anyone using K1 | Install, write your first file, every CLI command, big-data streaming workflows, single-file chunking, K1→K2 conversion, JavaScript usage, common workflows, troubleshooting/FAQ, glossary. |
| TECHNICAL_DOC.md | Engineers and integrators | Architecture, file format spec, SDK layout, in-memory and streaming pipelines, memory model, performance model, every design decision and rationale, failure modes, extension points, limitations. |
| CHANGELOG.md | Anyone tracking versions | Per-release notes (currently v0.1.0), known issues, semantic-versioning policy. |
| CONTRIBUTING.md | Contributors | Project structure, dev setup, code style, docs discipline, commit/PR conventions, end-to-end walkthroughs for adding source formats / CLI commands / metadata keys / K1 methods, JS SDK work, performance-work guidelines, format-evolution rules. |

The K1 file format specification, the K-Series vision document, and all sibling K-format specs (K2, K3, ...) live in the K-Series standards repo: github.com/Kenzy-Zero/k-series.

  • VISION.md — the K-Series north star.
  • K1.md — the K1 format spec (the contract this repo implements).
  • K2.md — the K2 format spec (K1 already produces conformant .k2 via K1.to_k2(); the standalone K2 SDK is in active development).

The CLI in one page

k1 write SOURCE OUTPUT.k1
    [-r RESOLUTION]            # H3 resolution 0-15, default 8
    [--region REGION]
    [--author AUTHOR]
    [--source-id ID]
    [--lat-col COL]            # override auto-detection
    [--lng-col COL]
    [--compression CODEC]      # zstd (default), snappy, gzip, lz4, brotli, none

k1 info FILE.k1
    [--json]                   # emit JSON instead of pretty layout

k1 convert FILE.k1 FILE.k2

k1 --version
k1 --help
k1 <subcommand> --help
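For readers who think in `argparse`, the flag surface above maps onto a parser roughly like the one below. This is a hypothetical reconstruction from the usage text, not the project's CLI source: the `dest` names and validation choices are assumptions, and `--version` is omitted.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented k1 subcommands and flags."""
    parser = argparse.ArgumentParser(prog="k1")
    sub = parser.add_subparsers(dest="command", required=True)

    write = sub.add_parser("write")
    write.add_argument("source")
    write.add_argument("output")
    # H3 resolutions run from 0 to 15; 8 is the documented default.
    write.add_argument("-r", dest="resolution", type=int, choices=range(16), default=8)
    write.add_argument("--region")
    write.add_argument("--author")
    write.add_argument("--source-id")
    write.add_argument("--lat-col")
    write.add_argument("--lng-col")
    write.add_argument("--compression", default="zstd",
                       choices=["zstd", "snappy", "gzip", "lz4", "brotli", "none"])

    info = sub.add_parser("info")
    info.add_argument("file")
    info.add_argument("--json", action="store_true")

    convert = sub.add_parser("convert")
    convert.add_argument("source")
    convert.add_argument("output")
    return parser


args = build_parser().parse_args(
    ["write", "mobility.csv", "mobility.k1", "-r", "9", "--region", "GLOBAL"]
)
print(args.command, args.resolution, args.region, args.compression)  # write 9 GLOBAL zstd
```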

Example session:

$ k1 write mobility.csv mobility.k1 -r 9 --region GLOBAL
wrote /.../mobility.k1
  rows: 5
  h3_resolution: 9
  size: 8.91 KB

$ k1 info mobility.k1
path             /.../mobility.k1
k1_version       1.0.0
k1_standard      geo
h3_resolution    9
crs              EPSG:4326
region           GLOBAL
created_at       2026-01-15T20:01:09+00:00
row_count        5
columns          (12)
  - device_id
  - lat
  - lon
  - timestamp
  - event_type
  - accuracy_m
  - speed_kmh
  - h3_index
  - h3_resolution
  - k1_source_id
  - k1_imported_at
  - geometry

$ k1 convert mobility.k1 mobility.k2
converted /.../mobility.k1 -> /.../mobility.k2
  rows: 5
  size: 256.0 KB

Full reference: USER_GUIDE.md §6.


JavaScript / Node

npm install k1-js

import { load } from 'k1-js';

const data = await load('mobility.k1');
console.log(await data.info());

const sample = await data.rows({
  columns: ['device_id', 'lat', 'lon'],
  limit: 3,
});
console.log(sample);
// [
//   { device_id: 'd_8f3a2b91', lat: 40.7128, lon: -74.006 },
//   { device_id: 'd_c7e5d104', lat: 51.5074, lon: -0.1278 },
//   { device_id: 'd_e1a9c8b3', lat: 35.6762, lon: 139.6503 }
// ]

await data.toGeoJSON({ output: 'mobility.geojson' });

Write support is on the roadmap. v0.1 is read-only.

Full reference: USER_GUIDE.md §10.


The K-Series family

K1 is part of K-Series, a family of open geo-data standards. Today only K1 ships; the others are on the roadmap.

| Format | Role | Status |
|---|---|---|
| K1 | Foundation. Parquet-based. Big-data pipeline output. | v0.1, active |
| K2 | Intelligence. DuckDB-based. Analyst/dev layer. | Active development |
| K3 | Mobility & trajectory (OD matrices, flows). | 2027 |
| K4 | Audience & identity (segments, tiers). | 2027 |
| K5 | Urban & real estate (buildings, parcels). | 2028 |
| K6 | Retail & POI intelligence. | 2028 |
K1 today can already produce K2 files via K1.to_k2() or k1 convert. The full K2 SDK (its own writer, querier, etc.) is in active development.


Project status

Version: 0.1.0 — first public release. Format spec at k1_version: 1.0.0.

Stability:

  • File format is stable. Files written today will be readable by future versions.
  • Python SDK public API (write_k1, write_k1_streaming, load, K1.*) is stable for the 0.x line. Internal helpers (_*) may change without notice.
  • CLI subcommand surface is stable; flags may be added (never removed in a minor release).
  • JavaScript SDK public surface is stable for the 0.x line.

Versioning: Semantic Versioning. Format-breaking changes require a MAJOR bump; additions are MINOR; fixes are PATCH.
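As a small illustration of the versioning rule, the sketch below classifies the difference between two version strings and flags format-breaking (MAJOR) changes. The helper names are hypothetical, and it assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release suffixes.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def bump_kind(old, new):
    """Name the most significant component that differs between two versions."""
    pairs = zip(("MAJOR", "MINOR", "PATCH"), parse_version(old), parse_version(new))
    for label, a, b in pairs:
        if a != b:
            return label
    return "NONE"


def is_format_breaking(old, new):
    """Under the stated policy, only a MAJOR change may break the format."""
    return bump_kind(old, new) == "MAJOR"


print(bump_kind("1.0.0", "1.1.0"), is_format_breaking("1.0.0", "2.0.0"))  # MINOR True
```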


Benchmarks

Benchmarks run on real-world mobility data. Results may vary based on hardware and data characteristics.


Acknowledgements

K1 stands on the shoulders of giants:

  • Apache Parquet — binary storage foundation.
  • Apache Arrow — in-memory columnar format.
  • GeoParquet — the geometry spec K1 is compatible with.
  • DuckDB — the external sort and the K2 file format.
  • Uber H3 — hexagonal spatial indexing.
  • hyparquet — pure-JS Parquet reader powering k1-js.

License

MIT. See LICENSE.

Built by Kenzy-Zero with K-Series as an open community standard. Contributions welcome — see CONTRIBUTING.md.
