7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,13 @@
> ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops any additional documents.
> Use `process_one` when you expect only one JSON document (e.g. an API payload): it warns (via `on_warning`) if it finds more than one.

## 1.2.0 (unreleased)

RSpec tests: 1,143

- A leading-zero token now reads as a number when it carries a sign, a decimal point, or an exponent (`+007` → `7`, `-000023.5` → `-23.5`, `00.0` → `0.0`, `007e2` → `700.0`) — previously these were kept as strings. A bare leading-zero integer (`000001`, `02`) still reads as a string, so IDs, zip codes, and account numbers keep their zeros.
- `Null` and `NULL` are now read as `nil` (joining `null` / `None` / `undefined`) in every position where the existing spellings work, which covers SQL / R / PHP / YAML / DB-derived input. Quoted (`"NULL"`) or embedded (`NULL Island`) forms stay strings.
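
The two rules above can be sketched in plain Ruby. This is only an illustration of the described behavior for a single unquoted token, not the gem's implementation: `classify_token`, `NULL_SPELLINGS`, and `NUMBER_SHAPE` are hypothetical names, and the sketch ignores underscores, hex, `NaN`, and `Infinity`.

```ruby
# Hypothetical helper for illustration only; it is not part of SmarterJSON's API.
NULL_SPELLINGS = %w[null Null NULL None undefined].freeze
# Optional sign, digits, optional fraction, optional exponent (simplified shape).
NUMBER_SHAPE = /\A[-+]?[0-9]+(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?\z/

def classify_token(token)
  return nil if NULL_SPELLINGS.include?(token)    # exact match only
  return token unless token.match?(NUMBER_SHAPE)  # e.g. "NULL Island" stays a string
  return token.to_f if token.match?(/[.eE]/)      # a dot or exponent marks numeric intent
  # Bare leading-zero integer (no sign): keep the string so the zeros survive.
  return token if token.start_with?("0") && token.length > 1

  token.to_i                                      # signed or ordinary integer
end

p classify_token("+007")       # => 7
p classify_token("-000023.5")  # => -23.5
p classify_token("007e2")      # => 700.0
p classify_token("000001")     # => "000001"
p classify_token("NULL")       # => nil
```

The ordering matters: the float check runs before the leading-zero check, so `00.0` is a number while `02` is not.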

## 1.1.2 (2026-06-12)

RSpec tests: 1,097
5 changes: 3 additions & 2 deletions README.md
@@ -8,7 +8,7 @@ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSONL,

## Features at a glance

- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python & JavaScript literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python / JavaScript / SQL literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
- **Every document from multi-document input, in one call** — `process` returns an `Array` of all of them; `process_one` returns the single value and warns if there was more than one (never raises; routed to `on_warning`, else `Rails.logger`, else `Kernel.warn`).
- **Streaming in bounded memory** — pass a block, or use `foreach(path_or_io)` for a composable `Enumerator` you can `.select` / `.map` / `.lazy` over.
- **Recovers JSON from LLM / markdown noise** — strips markdown code fences, surrounding prose, and `<json>` tags, and pulls every payload out of one messy blob.
@@ -75,7 +75,8 @@ Three things set it apart:
- Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
- Implicit root object — a config file that starts with `key: value`, no outer `{}`
- `NaN`, `Infinity`, hex (`0xFF`), leading `+` / `.`, underscores in numbers (`1_000_000`)
- UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`
- Leading-zero numbers (which strict JSON rejects): a token with a sign, decimal point, or exponent reads as a number (`-007.5` → `-7.5`, `007e2` → `700.0`), but a bare leading-zero integer is kept as a string (`007`, `02`) so IDs, zip codes, and account numbers don't lose their zeros
- UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`, case-variant null (`Null` / `NULL`, as SQL / R / PHP / YAML emit it)
- Mixed CR / LF / CRLF line endings, and any Ruby-supported input encoding (via `encoding:`)
- Duplicate keys (last value wins by default; configurable)

2 changes: 1 addition & 1 deletion docs/_introduction.md
@@ -29,7 +29,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak

## What it accepts, beyond strict JSON

Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, Python (`True` / `False` / `None`) and JavaScript (`undefined`) literals, smart quotes, a UTF-8 BOM, mixed CR / LF / CRLF line endings, any Ruby-supported input encoding (via `encoding:`), and duplicate keys. The full list — with the human-JSON spec references it's drawn from — is kept in one place: [**What it accepts, beyond strict JSON**](../README.md#what-it-accepts-beyond-strict-json) in the README.
Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, leading-zero numbers (a signed / decimal / exponent token like `-007.5` is a number, a bare `007` is kept as a string so IDs keep their zeros), Python (`True` / `False` / `None`), JavaScript (`undefined`), and SQL / R / PHP / YAML (`Null` / `NULL`) literals, smart quotes, a UTF-8 BOM, mixed CR / LF / CRLF line endings, any Ruby-supported input encoding (via `encoding:`), and duplicate keys. The full list — with the human-JSON spec references it's drawn from — is kept in one place: [**What it accepts, beyond strict JSON**](../README.md#what-it-accepts-beyond-strict-json) in the README.

It raises only on genuinely unreadable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.

24 changes: 20 additions & 4 deletions docs/examples.md
@@ -145,7 +145,23 @@ JSON

A `#`/`//` only starts a comment when preceded by whitespace, so `http://example.com` stays a string rather than being truncated.

### Example 10: Wrapper Noise Around a Payload
### Example 10: Leading-Zero IDs and SQL `NULL`

```ruby
SmarterJSON.process_one(<<~JSON)
{
user_id: 007, # bare leading zero -> kept as a string
zip: 02139, # ditto: zip codes keep their leading zero
balance: -007.50, # a sign / decimal point / exponent makes it a number
deleted_at: NULL # SQL / R / YAML null spelling -> nil
}
JSON
# => {"user_id"=>"007", "zip"=>"02139", "balance"=>-7.5, "deleted_at"=>nil}
```

A bare leading-zero integer is kept as a string so identifiers, zip codes, and account numbers don't lose their zeros; a sign, decimal point, or exponent marks numeric intent (`-007.50` → `-7.5`). `Null` and `NULL` join `null` / `None` / `undefined` as spellings of `nil`; a quoted `"NULL"` stays a string.

### Example 11: Wrapper Noise Around a Payload

#### Fenced payload

@@ -197,22 +213,22 @@ TEXT
# => [{"a"=>1}, {"b"=>2}]
```

### Example 11: Write JSON
### Example 12: Write JSON

```ruby
SmarterJSON.generate({ "a" => 1, "b" => [2, 3] }) # => '{"a":1,"b":[2,3]}'
SmarterJSON.generate([1, 2, 3]) # => '[1,2,3]'
```

### Example 12: Write NDJSON
### Example 13: Write NDJSON

An Array writes one element per line:

```ruby
SmarterJSON.generate([{ "id" => 1 }, { "id" => 2 }], format: :ndjson) # => "{\"id\":1}\n{\"id\":2}\n"
```

### Example 13: Round-Trip Read and Write
### Example 14: Round-Trip Read and Write

```ruby
obj = { "a" => 1, "b" => [2, "three", nil, true] }
61 changes: 53 additions & 8 deletions ext/smarter_json/smarter_json.c
@@ -641,16 +641,33 @@ static FJ_ALWAYS_INLINE VALUE fj_float_from_parts(fj_state *st, uint64_t m10, in
* per-byte '_' test, dropping to a slow step only when an underscore appears. */
static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {
long i = 0;
int is_float = 0, neg = 0, has_digit = 0, overflow = 0;
int is_float = 0, neg = 0, has_digit = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
uint64_t m10 = 0;
int m10digits = 0, frac = 0;
int64_t e10 = 0;

if (i < n && (p[i] == '-' || p[i] == '+')) { neg = (p[i] == '-'); i++; }
if (i < n && (p[i] == '-' || p[i] == '+')) { has_sign = 1; neg = (p[i] == '-'); i++; }

/* Integer part: a single '0', or [1-9] then digits/underscores. */
/* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
* by more digits (a leading-zero token) is consumed too but flagged: a BARE leading-zero
* integer (no sign / dot / exponent) is rejected below and kept as a string, so zip /
* account / check numbers preserve their zeros. */
if (i < n && p[i] == '0') {
has_digit = 1; m10digits = 1; i++;
/* Underscore-separated too (like the [1-9] branch below), so 0_5.0 / 0_0.5 behave
* exactly like 05.0 / 00.5 on both paths. */
if (i < n && ((p[i] >= '0' && p[i] <= '9') || p[i] == '_')) {
for (;;) {
while (i < n && p[i] >= '0' && p[i] <= '9') {
had_leading_zero = 1;
if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(p[i] - '0'); m10digits++; }
else overflow = 1;
i++;
}
if (i < n && p[i] == '_') { i++; continue; }
break;
}
}
} else if (i < n && p[i] >= '1' && p[i] <= '9') {
has_digit = 1;
for (;;) {
@@ -699,6 +716,8 @@ static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {

if (i != n) return 0; /* token not fully consumed -> not a number (string) */
if (!has_digit) return 0; /* e.g. "." or "+" -> not a number (string) */
/* A BARE leading-zero integer (no sign / dot / exponent) is an ID, not a number. */
if (had_leading_zero && !has_sign && !is_float) return 0;

if (!is_float) {
*out = fj_int_from_parts(m10, m10digits, neg, overflow, p, n);
@@ -730,13 +749,13 @@ static VALUE fj_parse_number(fj_state *st) {
const char *p = buf + st->pos; /* buf[len] == '\0' (RSTRING_PTR) is the scan sentinel */
const char *np = p; /* token start, includes a leading sign */
long nlen;
int is_float = 0, neg = 0, overflow = 0;
int is_float = 0, neg = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
uint64_t m10 = 0; /* mantissa: integer + fraction digits */
int m10digits = 0; /* mantissa digit chars (caps the Eisel-Lemire fast path at 18) */
int frac = 0; /* fraction digit chars: e10 -= frac */
int64_t e10 = 0;

if (*p == '-' || *p == '+') { neg = (*p == '-'); p++; }
if (*p == '-' || *p == '+') { has_sign = 1; neg = (*p == '-'); p++; }

/* Cold branches (rare, not perf-critical): sync the cursor, reuse scalar helpers. */
if (*p == 'I') { st->pos = p - buf; fj_consume_keyword(st, "Infinity"); return rb_float_new(neg ? -INFINITY : INFINITY); }
@@ -755,10 +774,27 @@ static VALUE fj_parse_number(fj_state *st) {
return rb_str_to_inum(hx, 16, 0);
}

/* Integer part: a single '0', or [1-9] then digits/underscores. */
/* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
* by more digits is consumed but flagged; a BARE leading-zero integer (no sign / dot /
* exponent) is rejected after the scan — it is an ID, not a number, and has no bare
* top-level quoteless-string form, so it raises (matching `000001`). */
if (*p == '0') {
m10digits = 1; /* one leading zero, counted as a single mantissa digit */
p++;
/* Underscore-separated too (like the [1-9] branch below), so the underscore is just a
* separator (0_0.5 behaves like 00.5). */
if ((*p >= '0' && *p <= '9') || *p == '_') {
for (;;) {
while (*p >= '0' && *p <= '9') {
had_leading_zero = 1;
if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(*p - '0'); m10digits++; }
else overflow = 1;
p++;
}
if (*p == '_') { p++; continue; }
break;
}
}
} else if (*p >= '1' && *p <= '9') {
for (;;) {
while (*p >= '0' && *p <= '9') {
@@ -811,6 +847,12 @@ static VALUE fj_parse_number(fj_state *st) {
st->pos = p - buf;
nlen = p - np;

/* A BARE leading-zero integer is an ID, not a number; at this top-level / strict
* position there is no quoteless-string form, so it raises. */
if (had_leading_zero && !has_sign && !is_float) {
fj_error(st, "invalid number with a leading zero");
}

if (!is_float) {
return fj_int_from_parts(m10, m10digits, neg, overflow, np, nlen);
}
@@ -979,7 +1021,8 @@ static VALUE fj_classify_quoteless(fj_state *st, const char *p0, long n0) {

if (fj_tok_eq(p, n, "true") || fj_tok_eq(p, n, "True")) return Qtrue;
if (fj_tok_eq(p, n, "false") || fj_tok_eq(p, n, "False")) return Qfalse;
if (fj_tok_eq(p, n, "null") || fj_tok_eq(p, n, "None") || fj_tok_eq(p, n, "undefined")) return Qnil;
if (fj_tok_eq(p, n, "null") || fj_tok_eq(p, n, "Null") || fj_tok_eq(p, n, "NULL") ||
fj_tok_eq(p, n, "None") || fj_tok_eq(p, n, "undefined")) return Qnil;
if (fj_tok_eq(p, n, "NaN")) return rb_float_new(NAN);
if (fj_tok_eq(p, n, "Infinity")) return rb_float_new(INFINITY);

@@ -1273,8 +1316,10 @@ static VALUE fj_parse_value(fj_state *st) {
case 'T': return fj_parse_literal(st, "True", Qtrue);
case 'F': return fj_parse_literal(st, "False", Qfalse);
case 'u': return fj_parse_literal(st, "undefined", Qnil);
case 'N': /* NaN (number) vs None (Python null) */
case 'N': /* NaN (number); None / Null / NULL (null) */
if (fj_byte_at(st, 1) == 'a') return fj_parse_number(st);
if (fj_byte_at(st, 1) == 'u') return fj_parse_literal(st, "Null", Qnil);
if (fj_byte_at(st, 1) == 'U') return fj_parse_literal(st, "NULL", Qnil);
return fj_parse_literal(st, "None", Qnil);
default:
if (b == '-' || b == '+' || b == '.' || b == 'I' || (b >= '0' && b <= '9')) {
44 changes: 37 additions & 7 deletions lib/smarter_json/parser.rb
@@ -739,7 +739,7 @@ class Parser
# Mantissa must carry at least one digit (int part, or a leading-dot fraction), so a
# bare exponent like "-e695881" is NOT a number — it falls through to a quoteless
# string, matching the C path. Trailing exponent stays optional.
DEC_RE = /\A[-+]?(?:(?:0|[1-9][0-9_]*)(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
DEC_RE = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
# A decimal BigDecimal() would reject as-is: a leading dot (".5") or a dot not
# followed by a digit ("5.", "5.e3"). Matches iff normalize_for_bigdecimal
# would change the string — so when it doesn't match, we skip normalization.
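
To see what the widened `DEC_RE` admits, the two patterns from this hunk can be compared directly. This is a small standalone check, not part of the PR: the constant names `OLD_DEC_RE` and `NEW_DEC_RE` are illustrative, and only the regular expressions themselves are taken from the diff.

```ruby
# Both patterns are copied from the hunk above; the constant names are illustrative.
OLD_DEC_RE = /\A[-+]?(?:(?:0|[1-9][0-9_]*)(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/
NEW_DEC_RE = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/

# Print, for each sample token, whether the old and new patterns match it.
%w[007 -007.5 007e2 0_5.0 0.5 1_000 -e695881].each do |token|
  puts format("%-10s old=%-5s new=%s", token, OLD_DEC_RE.match?(token), NEW_DEC_RE.match?(token))
end
```

The new pattern also matches a bare `007`, so the regex alone no longer keeps such tokens out of the numeric path. Keeping them as strings appears to depend on the bare leading-zero check added to `numeric_value` later in this file.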
@@ -1210,10 +1210,11 @@ def parse_value

# Disambiguate NaN (number) from the null spellings None / Null / NULL at a strict position.
def parse_upper_n
if byte_at(1) == 0x61 # 'a' → NaN
parse_number
else
parse_literal_keyword("None", nil)
case byte_at(1)
when 0x61 then parse_number # 'a' -> NaN
when 0x75 then parse_literal_keyword("Null", nil) # 'u' -> Null
when 0x55 then parse_literal_keyword("NULL", nil) # 'U' -> NULL
else parse_literal_keyword("None", nil)
end
end

@@ -1378,7 +1379,7 @@ def classify_quoteless(str)
case str
when "true", "True" then return true
when "false", "False" then return false
when "null", "None" then return nil
when "null", "Null", "NULL", "None" then return nil
when "undefined" then return nil
when "NaN" then return Float::NAN
when "Infinity", "+Infinity" then return Float::INFINITY
@@ -1405,7 +1406,15 @@ def numeric_value(str)
# number tokens that is a real per-value allocation. Underscores are rare, so only
# pay it when the token actually contains one (measured +27% on long-token decimals).
body = str.include?("_") ? str.delete("_") : str
body.match?(/[.eE]/) ? decimal_value(body) : body.to_i
return decimal_value(body) if body.match?(/[.eE]/)

# A BARE leading-zero integer (no sign / dot / exponent) is an ID — a zip code,
# account number, phone number — not a number; keep it a string so the zeros survive.
# A sign (+007 / -007) signals numeric intent (IDs never carry a sign), so those parse.
c0 = body.getbyte(0)
return NOT_NUMERIC if c0 == ZERO && body.bytesize > 1

body.to_i
end

# True when the token starts with [+-]?0[xX] — the only shape HEX_RE can match.
@@ -1663,10 +1672,13 @@ def decode_unicode_escape(i)

def parse_number
negative = false
signed = false
if byte == MINUS
negative = true
signed = true
advance(1)
elsif byte == PLUS
signed = true
advance(1)
end

@@ -1680,6 +1692,7 @@ def parse_number
end

int_start = @pos
had_leading_zero = false

if byte == ZERO
advance(1)
@@ -1692,6 +1705,16 @@ def parse_number
value = @input.byteslice(hex_start, @pos - hex_start).delete("_").to_i(16)
return negative ? -value : value
end
# A run of further digits after the single leading '0' (007, 00023, or the
# underscore-separated 0_0): consume it and flag the leading zero; the reject check
# below turns a bare leading-zero integer into an error. The underscore is only a
# separator, so 0_0.5 behaves like 00.5. The loop simply does not run when the next
# byte is neither a digit nor an underscore, so no outer guard is needed.
while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
had_leading_zero = true if b >= ZERO && b <= NINE
advance(1)
end
elsif byte && byte >= 0x31 && byte <= NINE
advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
elsif byte == DOT
@@ -1717,6 +1740,13 @@ def parse_number
advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
end

# A BARE leading-zero integer is an ID, not a number; at this top-level / strict
# position there is no quoteless-string form, so it raises (a sign or a dot/exponent
# signals numeric intent and is allowed: +007 -> 7, -000023.5 -> -23.5, 007e2 -> 700.0).
if had_leading_zero && !signed && !is_float
raise error("invalid number with a leading zero")
end

slice = @input.byteslice(int_start, @pos - int_start).delete("_")
value = is_float ? decimal_value(slice) : slice.to_i
negative ? -value : value
2 changes: 1 addition & 1 deletion lib/smarter_json/version.rb
@@ -1,5 +1,5 @@
# frozen_string_literal: true

module SmarterJSON
VERSION = "1.1.2"
VERSION = "1.2.0"
end