Structural-index substrate for fast JSON queries. Standalone stage-1
scanner + Mison-style key-bitmap layer. Built to slot under
jetro for $..find(k == lit) shapes
without materialising the value tree.
Given a JSON byte buffer, build:
- Stage-1 columns from simd-json byte offset, kind, depth for every structural
character (
{,},[,],",:,,, scalar starts). AVX2 fast path; scalar/memchrfallback elsewhere. - Sidecar —
parent[],close_of[]for tree navigation. - Mison key bitmaps interned key dictionary with a Roaring bitmap per key over token positions. Compound predicates via bitmap-AND.
Built once, reusable. Queries like $..find(x == "test") use the
bitmap to skip every JSON object that doesn't contain key x, and a
SIMD byte compare to validate the value. No Val tree allocation.
Mison (Li et al., VLDB 2017) — speculative key-bitmap indices outperform tree walks on selective queries by 10-100×. jetro-experimental is a self-contained Mison-style substrate that builds the structural columns and key bitmaps directly from raw JSON in a single pass.
┌──────────────────────────────┐
│ JSON bytes (&[u8]) │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────────────────┐
│ stage-1 SIMD/scalar scan │
│ • emit offset/kind/depth columns │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ index build (one fused walk) │
│ • parent / close_of │
│ • role classification │
│ • key interner + Roaring bitmaps │
└──────────────────────────────────────┘
│
▼
StructuralIndex + KeyBitmaps
│
▼
┌─────────┬────────┬──────────┬─────────┐
│ find_eq │ count │ ancestors│ slice() │
│ (Mison)│ (1) │ (parents)│ raw │
└─────────┴────────┴──────────┴─────────┘
use jetro_experimental::{from_bytes, find_eq, count_key};
let bytes = br#"{"a":{"x":"test"},"b":{"x":"nope"},"c":{"x":"test"}}"#;
let idx = from_bytes(bytes).unwrap();
// Mison-style: every object whose `x` key equals "test".
for obj_tok in find_eq(&idx, bytes, "x", b"test") {
let span = idx.byte_span(obj_tok);
let raw = &bytes[span.start as usize..span.end as usize];
println!("{}", std::str::from_utf8(raw).unwrap());
}
// → {"x":"test"}
// → {"x":"test"}
// Popcount only — no value walk.
let n = count_key(&idx, "x");
assert_eq!(n, 3);fn from_bytes(buf: &[u8]) -> Result<StructuralIndex, Error>;
fn from_bytes_with(buf: &[u8], opts: BuildOptions) -> Result<StructuralIndex, Error>;impl StructuralIndex {
fn kind(&self, tok: TokenId) -> TokenKind;
fn depth(&self, tok: TokenId) -> u16;
fn byte_offset(&self, tok: TokenId) -> u32;
fn byte_span(&self, tok: TokenId) -> ByteSpan;
fn parent(&self, tok: TokenId) -> Option<TokenId>;
fn close_of(&self, container: TokenId) -> Option<TokenId>;
fn slice<'a>(&self, bytes: &'a [u8], tok: TokenId) -> &'a [u8];
}fn container_at_byte(&self, pos: u32) -> Option<TokenId>;
fn ancestors(&self, tok: TokenId) -> Ancestors<'_>;fn keys_named(&self, name: &str, depth: Option<u16>) -> KeyHits<'_>;
fn has_key(&self, name: &str) -> bool;
fn keys_seen(&self) -> impl Iterator<Item = &str> + '_;
fn value_for_key(&self, key_tok: TokenId) -> Option<TokenId>;impl KeyHits<'_> {
fn at_depth(self, d: u16) -> Self;
fn and(self, other: Self) -> Self; // bitmap intersect
fn or(self, other: Self) -> Self;
fn count(self) -> u64; // O(1) popcount
fn first(self) -> Option<TokenId>; // short-circuit
fn last(self) -> Option<TokenId>;
}
impl Iterator for KeyHits<'_> { type Item = TokenId; }fn find_eq<'a>(idx, bytes, key, literal) -> impl Iterator<Item = TokenId>;
fn find_eq_compound<'a>(idx, bytes, conds) -> impl Iterator<Item = TokenId>;
fn count_key(idx, key) -> u64;fn json_string_eq(value: &[u8], literal: &[u8]) -> bool; // memchr escape probe
fn json_number_eq(value: &[u8], literal: &[u8]) -> bool; // numeric semantics
fn parse_f64(bytes: &[u8]) -> Option<f64>; // fast-float when "fast-numbers"
fn parse_i64(bytes: &[u8]) -> Option<i64>;| feature | adds |
|---|---|
fast-numbers |
fast-float SIMD f64 parser (~3× over str::parse) |
multi-key |
Aho-Corasick + Teddy multi-pattern matcher for raw byte scans |
validate-utf8 |
simdutf8 (~5-10 GB/s) |
Default: minimal. No external SIMD-JSON dependency — stage-1 of simd-json is fully self-contained.
Pre-1.0. Public API is stable (StructuralIndex, TokenId, fused
helpers). Internals (Stage1, KeyBitmaps, StructIndex) are
#[doc(hidden)] and may change without semver bumps.
- Stage-1 SIMD scanner (AVX2 + scalar/memchr fallback)
- StructuralIndex + Mison KeyBitmaps
- Public API facade with opaque types
- SIMD helpers:
json_string_eq,parse_f64, multi-key, utf8 validation - NEON stage-1 port (currently scalar+memchr fallback on aarch64)
- Streaming
IndexBuilderfor NDJSON / huge inputs
This repository contain direct copy of Stage 1 from simd-json project.
MIT/Apache-2.0.