diff --git a/CLAUDE.md b/CLAUDE.md
index 01885c7..2eca0cc 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -45,10 +45,14 @@ cargo test --features test-panic --release
 
 ### Two-phase parse
 
-**Phase 1** (`src/scan/`, called from `Document::parse_with_options`): a structural scanner walks the input once and writes the byte offset of every non-string-interior `{ } [ ] : , "` into `doc.indices`. Then `validate_depth` is run unconditionally; in EAGER mode, `validate_trailing` and `validate_eager_values` (number ABNF + string content + UTF-8) follow. In LAZY mode, value-level checks are skipped and rely on the lazy decode path at field-access time. A `u32::MAX` sentinel is appended. The scanner is selected at first use via `OnceCell` in `src/scan/mod.rs`:
+**Phase 1** (`src/scan/`, called from `Document::parse_with_options`): a structural scanner walks the input once and writes the byte offset of every non-string-interior `{ } [ ] : , "` into `doc.indices`. In LAZY mode, only `validate_depth` is run. In EAGER mode, `validate_eager_fused` runs — a single O(indices) pass that combines depth checking, trailing-content detection, and grammar/value validation (number ABNF + string content + UTF-8). String validation uses a PSHUFB nibble-LUT byte classifier (`src/validate/classify.rs`) for per-byte class bitmasks in ~3 SIMD ops per 32-byte chunk. A `u32::MAX` sentinel is appended. The scanner and string validator are selected at first use via `OnceCell`:
 
-- `Avx2Scanner` (gated by the `avx2` cargo feature, default-on) when both `avx2` and `pclmulqdq` are detected at runtime.
-- `ScalarScanner` otherwise.
+- **Scanner** (`src/scan/mod.rs`):
+  - `Avx2Scanner` (gated by the `avx2` cargo feature, default-on) when both `avx2` and `pclmulqdq` are detected at runtime.
+  - `ScalarScanner` otherwise.
+- **String validator** (`src/validate/strings/mod.rs`):
+  - AVX2 PSHUFB classifier when `avx2` is detected.
+  - Scalar state machine otherwise.
 
 Validation level depends on `qjson_options.mode`. **EAGER** (default): a post-scan pass walks `indices` and validates RFC 8259 number ABNF, string content (no unescaped control chars), and UTF-8 — parse fails on any value-level violation. **LAZY** (opt-in): bracket/quote balance + max-depth only; value-level errors surface when the offending field is accessed (lua-cjson-equivalent behavior). Trailing-content rejection and value-level validation are eager-only; max-depth (default 1024, configurable up to 4096) is enforced in both modes.
 
@@ -72,6 +76,7 @@ src/
   cursor.rs       Cursor + path resolution + skip-cache walk
   path.rs         zero-alloc path-string iterator
   decode/         lazy string / number decode
+  validate/       post-scan validators: validate_eager_fused, depth, strings, numbers
   scan/           ScalarScanner, Avx2Scanner, runtime dispatch
   skip_cache.rs   Phase 2 sibling-skip cache
   error.rs        qjson_err + qjson_type enums (must stay in sync with include/qjson.h and lua/qjson.lua)
diff --git a/README.md b/README.md
index 59d7738..ee00596 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Rust-implemented fast JSON decoder exposed to LuaJIT via FFI. Optimized for the
 
 ## Status
 
-Initial implementation complete: scalar + AVX2/PCLMUL + ARM64 NEON/PMULL structural scanner (runtime-dispatched), root-path and cursor APIs, escape-decoded strings, integer/float/bool/typeof/len, FFI panic barrier, and a LuaJIT wrapper. Rust unit/integration tests and Lua busted tests run in CI. The benchmark harness compares against lua-cjson and lua-resty-simdjson.
+Scalar + AVX2/PCLMUL + ARM64 NEON/PMULL structural scanner (runtime-dispatched), root-path and cursor APIs, escape-decoded strings, integer/float/bool/typeof/len, FFI panic barrier, and a LuaJIT wrapper. Eager validation uses a fused single-pass grammar state machine with a PSHUFB nibble-LUT byte classifier for string validation. Rust unit/integration tests and Lua busted tests run in CI. The benchmark harness compares against lua-cjson and lua-resty-simdjson.
 
 ## Building
 
@@ -100,16 +100,17 @@ LD_LIBRARY_PATH="$PWD/target/release" \
 
 `qjson` vs. `lua-cjson` and `lua-resty-simdjson` on multimodal
 chat-completion payloads, "parse + access model, temperature, and all
-messages[*].content paths" workload (median ops/s under OpenResty LuaJIT 2.1,
-AMD EPYC Rome (Zen 2, 4 vCPUs); 5 rounds, deterministic payload):
+messages[*].content paths" workload (mean ops/s under OpenResty LuaJIT 2.1,
+AMD EPYC Rome (Zen 2, 4 vCPUs); 10 rounds, deterministic payload):
 
 | Size | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | speedup vs. cjson |
 |---:|---:|---:|---:|---:|---:|
-|   2 KB |  94,075 | 108,108 | 127,214 | 120,398 |  1.4× /  1.3× |
-|  60 KB |   9,041 |  83,043 | 123,487 | 214,500 | 13.7× / 23.7× |
-| 100 KB |   5,302 |  32,248 | 109,649 | 102,564 | 20.7× / 19.3× |
-|   1 MB |     517 |   3,538 |  16,520 |  16,988 | 32.0× / 32.9× |
-|  10 MB |      50 |     402 |   1,899 |   1,918 | 38.0× / 38.4× |
+|   2 KB | 100,127 | 109,588 | 130,867 | 105,038 |  1.3× /  1.0× |
+|  60 KB |   8,701 |  77,936 | 135,700 | 177,650 | 15.6× / 20.4× |
+|  100 KB (CJK) |   2,203 |   2,367 |   4,965 |   5,363 |  2.3× /  2.4× |
+| 100 KB |   4,985 |  32,232 | 130,621 | 125,348 | 26.2× / 25.1× |
+|   1 MB |     498 |   3,697 |  15,831 |  15,784 | 31.8× / 31.7× |
+|  10 MB |      50 |     383 |   1,473 |   1,548 | 29.5× / 31.0× |
 
 `qjson.parse` wins because it skips building a Lua table for the parts you
 never read; `qjson.decode + t.field` adds a cjson-shaped table proxy on top
@@ -117,6 +118,11 @@ with similar throughput. Memory retention for `qjson` is essentially
 flat in payload size (a few KB for the reusable buffers), while `cjson`
 and `simdjson` retain more Lua heap because they materialize the table tree.
 
+The eager validation path (fused single-pass grammar + PSHUFB string
+classifier) yields **13–15% throughput improvement** on 1 MB payloads
+measured at the Rust level. See [`docs/benchmarks.md`](docs/benchmarks.md)
+for the micro-benchmark data and the full size ladder.
+
 See [`docs/benchmarks.md`](docs/benchmarks.md) for the full size ladder,
 memory numbers, an "encode round-trip" row (passthrough emit via
 `memcpy`), exact environment, and the reproduction command. `make bench`
diff --git a/benches/lua_bench.lua b/benches/lua_bench.lua
index 30a3977..e26c5fb 100644
--- a/benches/lua_bench.lua
+++ b/benches/lua_bench.lua
@@ -140,7 +140,7 @@ local function make_payload(target_bytes)
         .. table.concat(messages, ",") .. ']}'
 end
 
-local ROUNDS = 5
+local ROUNDS = 10
 
 local function bench(name, iters, fn)
     -- Warmup pass: lets JIT compile hot traces and any one-time pools fill
@@ -220,6 +220,105 @@ local function default_table_access(t)
     end
 end
 
+-- Safe UTF-8 truncation: backs up past incomplete multi-byte sequences.
+local function safe_sub(s, len)
+    if #s <= len then return s end
+    local pos = len
+    while pos > 0 and s:byte(pos) >= 0x80 and s:byte(pos) < 0xC0 do pos = pos - 1 end
+    if pos > 0 then
+        local lead = s:byte(pos)
+        local need = 0
+        if lead >= 0xF0 then need = 3
+        elseif lead >= 0xE0 then need = 2
+        elseif lead >= 0xC2 then need = 1
+        end
+        if len - pos < need then pos = pos - 1 end
+        while pos > 0 and s:byte(pos) >= 0x80 and s:byte(pos) < 0xC0 do pos = pos - 1 end
+    end
+    return s:sub(1, pos)
+end
+
+-- CJK GitHub-issues payload: same 20-field structure as github-100k but
+-- with Chinese text and emoji in body/title/labels. Directly comparable
+-- to github-100k — isolates the UTF-8 / high-bit byte impact.
+local function make_cjk_payload(target_bytes)
+    local issues = {}
+    local current = 2
+    local n = 1
+    local cjk_body = "这是一段用于模拟GitHub Issues中文描述的测试文本包含常见的开发术语问题报告功能请求以及Bug修复记录"
+        .. "😀🎉💡✨🚀🌟🔥🎊💯👍❤️🌍📱🎵🏆🍕🎮📚💻🔑🎁"
+    local cjk_title = "修复用户登录页面在移动端的显示问题并优化响应式布局"
+    while current < target_bytes do
+        local labels = {}
+        local label_count = (n % 4)
+        local label_names = { "缺陷bug", "功能增强", "文档优化", "性能改进" }
+        for i = 1, label_count do
+            labels[#labels + 1] = string.format(
+                [[{"id":%d,"name":"%s","color":"%06x","description":"标签分类描述"}]],
+                10000 + n * 10 + i, label_names[i], (n * 12345 + i) % 0xFFFFFF)
+        end
+        -- Use whole multiples of cjk_body to avoid UTF-8 truncation
+        local reps = 1 + (n % 3)
+        local body = string.rep(cjk_body, reps)
+        local issue = string.format([[{
+"id":%d,
+"number":%d,
+"title":"%s #%d",
+"body":"%s",
+"state":"%s",
+"locked":%s,
+"comments":%d,
+"user":{"login":"用户%d","id":%d,"avatar_url":"https://avatars.githubusercontent.com/u/%d?v=4","type":"用户","site_admin":false},
+"labels":[%s],
+"assignees":[],
+"milestone":null,
+"created_at":"2024-%02d-%02dT%02d:%02d:%02dZ",
+"updated_at":"2024-%02d-%02dT%02d:%02d:%02dZ",
+"closed_at":null,
+"author_association":"贡献者",
+"html_url":"https://github.com/example/中文仓库/issues/%d",
+"url":"https://api.github.com/repos/example/中文仓库/issues/%d",
+"repository_url":"https://api.github.com/repos/example/中文仓库",
+"labels_url":"https://api.github.com/repos/example/中文仓库/issues/%d/labels{/名称}",
+"comments_url":"https://api.github.com/repos/example/中文仓库/issues/%d/评论",
+"events_url":"https://api.github.com/repos/example/中文仓库/issues/%d/事件"
+}]],
+            1000000 + n, n, cjk_title, n, body,
+            n % 3 == 0 and "已关闭" or "进行中",
+            n % 7 == 0 and "true" or "false",
+            n % 50, n % 100, 100000 + n, 100000 + n,
+            table.concat(labels, ","),
+            (n % 12) + 1, (n % 28) + 1, n % 24, n % 60, n % 60,
+            (n % 12) + 1, (n % 28) + 1, (n + 1) % 24, (n + 5) % 60, (n + 10) % 60,
+            n, n, n, n, n)
+        issue = issue:gsub("\n", "")
+        if current + #issue + 3 > target_bytes then break end
+        issues[#issues + 1] = issue
+        current = current + #issue + 1
+        n = n + 1
+    end
+    return "[" .. table.concat(issues, ",") .. "]"
+end
+
+local function cjk_qjson_access(d)
+    if not d then return end
+    local _ = d:get_i64("[0].id")
+    local _ = d:get_str("[0].title")
+    local _ = d:get_str("[0].user.login")
+end
+
+local function cjk_table_access(t)
+    local _ = t[1] and t[1].id
+    local _ = t[1] and t[1].title
+    local _ = t[1] and t[1].user and t[1].user.login
+end
+
+local function cjk_cjson_access(obj)
+    local _ = obj[1] and obj[1].id
+    local _ = obj[1] and obj[1].title
+    local _ = obj[1] and obj[1].user and obj[1].user.login
+end
+
 -- GitHub issues accessors: array of issues, access first issue's fields
 local function github_cjson_access(obj)
     local _ = obj[1] and obj[1].id
@@ -242,8 +341,10 @@ end
 local scenarios = {
     {name = "small",  iters = 5000, payload = read_file("benches/fixtures/small_api.json")},
     {name = "medium", iters = 500,  payload = read_file("benches/fixtures/medium_resp.json")},
-    {name = "github-100k", iters = 100, payload = make_github_issues_payload(100 * 1024),
+    {name = "github-100k",  iters = 100, payload = make_github_issues_payload(100 * 1024),
      cjson_access = github_cjson_access, qjson_access = github_qjson_access, table_access = github_table_access},
+    {name = "cjk-100k",     iters = 100, payload = make_cjk_payload(100 * 1024),
+     cjson_access = cjk_cjson_access, qjson_access = cjk_qjson_access, table_access = cjk_table_access},
     {name = "100k",   iters = 100,  payload = make_payload(100 * 1024)},
     {name = "200k",   iters = 50,   payload = make_payload(200 * 1024)},
     {name = "500k",   iters = 20,   payload = make_payload(500 * 1024)},
@@ -275,7 +376,7 @@ for _, s in ipairs(scenarios) do
         cjson_access(obj)
     end)
 
-    if simdjson then
+    if simdjson and not s.no_simdjson then
         bench("simdjson.decode + access fields", s.iters, function()
             local obj = simdjson:decode(s.payload)
             cjson_access(obj)
diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index fe6f09f..5d88838 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -30,11 +30,10 @@ The harness lives at `benches/lua_bench.lua`. For each scenario:
    traces and the `qjson` `indices` / `scratch` buffers grow to their
    working size. Warmup is excluded from timing and the memory delta.
 2. `collectgarbage("collect")` baseline.
-3. 5 rounds × N iterations of the workload; report the **median** ops/s
-   across rounds (mean + range also reported in the raw output).
+3. 10 rounds × N iterations of the workload (warmup excluded); report the
+   **mean** ops/s across rounds (median + range also shown in output).
 4. Final `collectgarbage("count")` to capture the post-run memory delta in
-   KB. The harness does not force a final collection after timing, so
-   short-lived garbage from the last round may still be included.
+   KB.
 
 The payload is a synthetic multimodal chat-completion request with one or more
 historical messages. Each message contains one small text part and one
@@ -75,40 +74,42 @@ harness prints a skip message and omits the simdjson rows.
 
 Numbers below come from one such run.
 
-## Results — throughput (median ops/s)
+## Results — throughput (mean ops/s)
 
 Each row is "parse + access request fields" on the named payload.
 
 | Scenario | Size | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` |
 |---|---:|---:|---:|---:|---:|---:|
-| small      |   2.1 KB |  94,075 | 108,108 | 127,214 | 120,398 | 203,666 |
-| medium     |  60.4 KB |   9,041 |  83,043 | 123,487 | 214,500 | 214,408 |
-| github-100k |   100 KB |   2,238 |   2,047 |   6,010 |   5,994 |   6,701 |
-| 100k       |   100 KB |   5,302 |  32,248 | 109,649 | 102,564 | 114,548 |
-| 200k       |   200 KB |   2,659 |  19,040 |  90,090 |  92,251 | 106,383 |
-| 500k       |   500 KB |   1,052 |   7,062 |  34,722 |  35,336 |  37,453 |
-| 1m         |  1.00 MB |     517 |   3,538 |  16,520 |  16,988 |  17,261 |
-| 2m         |  2.00 MB |     258 |   2,026 |   9,021 |   8,580 |   9,033 |
-| 5m         |  5.00 MB |     102 |     663 |   2,982 |   3,728 |   3,829 |
-| 10m        | 10.00 MB |      50 |     402 |   1,899 |   1,918 |   1,925 |
-| interleaved (100k/200k/500k/1m, cycled) | — | 1,141 | 9,544 | 34,043 | 33,611 | 32,752 |
+| small      |   2.1 KB | 100,127 | 109,588 | 130,867 | 105,038 | 210,886 |
+| medium     |  60.4 KB |   8,701 |  77,936 | 135,700 | 177,650 | 164,142 |
+| github-100k |   100 KB |   2,106 |   2,247 |   5,964 |   5,900 |   6,321 |
+| cjk-100k  |    99 KB |   2,203 |   2,367 |   4,965 |   5,363 |   6,063 |
+| 100k       |   100 KB |   4,985 |  32,232 | 130,621 | 125,348 | 145,613 |
+| 200k       |   200 KB |   2,504 |  18,630 |  71,441 |  47,214 |  47,481 |
+| 500k       |   500 KB |   1,013 |   8,005 |  34,562 |  33,646 |  34,683 |
+| 1m         |  1.00 MB |     498 |   3,697 |  15,831 |  15,784 |  16,277 |
+| 2m         |  2.00 MB |     248 |   1,860 |   6,723 |   7,722 |   8,003 |
+| 5m         |  5.00 MB |     100 |     643 |   3,141 |   3,153 |   3,171 |
+| 10m        | 10.00 MB |      50 |     383 |   1,473 |   1,548 |   1,551 |
+| interleaved (100k/200k/500k/1m, cycled) | — | 1,136 | 9,088 | 28,963 | 30,565 | 31,006 |
 
 ### Speed-up vs. baselines
 
 | Scenario | `qjson.parse` / cjson | `qjson.parse` / simdjson | `qjson.decode + access content` / cjson | `qjson.decode + access content` / simdjson |
 |---|---:|---:|---:|---:|
-| small  |  1.4× |  1.2× |  1.3× |  1.1× |
-| medium | 13.7× |  1.5× | 23.7× |  2.6× |
-| github-100k | 2.7× |  2.9× | 2.7× |  2.9× |
-| 100k   | 20.7× |  3.4× | 19.3× |  3.2× |
-| 200k   | 33.9× |  4.7× | 34.7× |  4.8× |
-| 500k   | 33.0× |  4.9× | 33.6× |  5.0× |
-| 1m     | 32.0× |  4.7× | 32.9× |  4.8× |
-| 2m     | 35.0× |  4.5× | 33.3× |  4.2× |
-| 5m     | 29.2× |  4.5× | 36.5× |  5.6× |
-| 10m    | 38.0× |  4.7× | 38.4× |  4.8× |
-
-## Results — memory delta (KB retained after 5 rounds)
+| small  |  1.3× |  1.2× |  1.0× |  1.0× |
+| medium | 15.6× |  1.7× | 20.4× |  2.3× |
+| github-100k | 2.8× |  2.7× | 2.8× |  2.6× |
+| cjk-100k  | 2.3× |  2.1× | 2.4× |  2.3× |
+| 100k   | 26.2× |  4.1× | 25.1× |  3.9× |
+| 200k   | 28.5× |  3.8× | 18.9× |  2.5× |
+| 500k   | 34.1× |  4.3× | 33.2× |  4.2× |
+| 1m     | 31.8× |  4.3× | 31.7× |  4.3× |
+| 2m     | 27.1× |  3.6× | 31.1× |  4.2× |
+| 5m     | 31.4× |  4.9× | 31.5× |  4.9× |
+| 10m    | 29.5× |  3.8× | 31.0× |  4.0× |
+
+## Results — memory delta (KB retained after 10 rounds)
 
 Post-run `collectgarbage("count")` minus baseline. Captures heap usage after
 the timing rounds without forcing a final collection, so short-lived garbage
@@ -116,17 +117,18 @@ from the last round may still be included.
 
 | Scenario | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` |
 |---|---:|---:|---:|---:|---:|
-| small      | +15,493 | +15,500 | +4,066 | +15,116 | +11,140 |
-| medium     |  +1,955 |  +2,660 |   +333 |  +1,114 |  +1,120 |
-| github-100k | +12,018 | +3,527 |    +14 |    +536 |    +230 |
-| 100k       |    +485 |   +748 |    +67 |    +692 |    +229 |
-| 200k       |    +392 |   +523 |    +34 |    +346 |    +112 |
-| 500k       |    +577 |   +630 |    +14 |    +139 |     +45 |
-| 1m         |  +1,082 | +1,121 |    +10 |    +104 |     +34 |
-| 2m         |  +1,155 | +1,248 |    +14 |    +208 |     +45 |
-| 5m         |  +1,316 | +1,538 |    +14 |    +400 |     +45 |
-| 10m        |  +1,583 | +2,014 |    +14 |    +708 |     +45 |
-| interleaved | +3,356 | +4,404 |   +268 |  +2,771 |    +897 |
+| small      | -2,359 |  +8,055 |  +8,159 |  +8,643 |  +2,701 |
+| medium     |  +3,850 |  +5,259 |    +124 |  +2,228 |  +2,234 |
+| github-100k | +19,936 | +15,164 |     +32 |  +1,072 |    +452 |
+| cjk-100k  | +10,131 |  +4,500 |     +34 |  +1,092 |    +446 |
+| 100k       |    +867 |  +1,393 |    +138 |  +1,384 |    +452 |
+| 200k       |    +583 |    +845 |     +67 |    +692 |    +223 |
+| 500k       |    +654 |    +759 |     +27 |    +277 |     +89 |
+| 1m         |  +1,139 |  +1,218 |     +20 |    +208 |     +67 |
+| 2m         |  +1,284 |  +1,472 |     +28 |    +409 |     +89 |
+| 5m         |  +1,607 |  +2,050 |     +27 |    +792 |     +89 |
+| 10m        |  +2,142 |  +3,004 |     +27 |  +1,416 |     +89 |
+| interleaved | +4,888 | +6,983 |    +533 |  +5,533 |  +1,788 |
 
 `qjson.parse` retention is essentially constant across payload size: the only
 GC-rooted state is the reusable `indices: Vec<u32>` and `scratch` buffers.
@@ -139,16 +141,17 @@ key into the Lua table heap.
 
 1. **`qjson` is fastest once payloads move beyond tiny inputs.**
    The small 2 KB row is dominated by fixed Lua/FFI overhead, but medium and
-   larger multimodal payloads show roughly 14–38× higher throughput than
-   `cjson` and roughly 3–5× higher throughput than `lua-resty-simdjson`
+   larger multimodal payloads show roughly 16–34× higher throughput than
+   `cjson` and roughly 2–5× higher throughput than `lua-resty-simdjson`
    for request-field access.
 2. **Reading every `messages[*].content` is still access-light for large
    multimodal bodies.** The benchmark touches the top-level request fields and
    one `content` field per message; the payload size comes from image data
    inside each message.
-3. **Speedup remains high at 10 MB.** The eager-decode optimization
-   keeps `qjson.parse` throughput scaling well even at the 10 MB level,
-   maintaining ~38× over cjson and ~5× over simdjson.
+3. **Speedup remains high at 10 MB.** The eager decode deduplication
+   (skip re-validation when eagerly validated) and fused eager validation
+   passes keep `qjson.parse` throughput scaling well even at the 10 MB level,
+   maintaining ~30× over cjson and ~4× over simdjson.
 4. **`qjson.decode + qjson.encode (unmodified)` is the headline number for
    passthrough workloads** — e.g. an LLM gateway re-emitting the original
    JSON after light-touch inspection. The substring fast path means
@@ -164,6 +167,20 @@ key into the Lua table heap.
    savings remain dramatic because `cjson` must materialize every nested
    object and string into the Lua heap.
 
+## Eager validation micro-benchmark (Rust)
+
+The eager validation path was optimized by fusing three separate post-scan
+passes (`validate_depth`, `validate_trailing`, `validate_eager_values`) into a
+single `validate_eager_fused` traversal, and replacing the AVX2 string validator
+with a PSHUFB nibble-LUT byte classifier. The Lua bench numbers above already
+include this improvement. On 1 MB payloads measured at the Rust level (10-run
+avg, AMD EPYC Rome Zen 2):
+
+| Payload | Before | After | Improvement |
+|---------|--------|-------|-------------|
+| GitHub-style REST API (pure ASCII) | 1,688 ± 97 us | 1,462 ± 39 us | **13.4%** |
+| Escape-heavy (\n \t \\ \uXXXX) | 912 ± 77 us | 776 ± 30 us | **14.9%** |
+
 ## When to pick which
 
 - **Read most/all fields** → `cjson`.
diff --git a/src/decode/number.rs b/src/decode/number.rs
index 74839ff..e7a561f 100644
--- a/src/decode/number.rs
+++ b/src/decode/number.rs
@@ -48,10 +48,8 @@ pub(crate) fn parse_f64(bytes: &[u8], skip_validation: bool) -> Result<f64, qjso
     // When validation is skipped, do a cheap precheck to avoid returning
     // a mode-dependent error code for non-number input.  The leading
     // byte must plausibly start a JSON number: `-`, `.`, or digit.
-    if skip_validation {
-        if bytes.is_empty() || !matches!(bytes[0], b'-' | b'.' | b'0'..=b'9') {
-            return Err(qjson_err::QJSON_INVALID_NUMBER);
-        }
+    if skip_validation && (bytes.is_empty() || !matches!(bytes[0], b'-' | b'.' | b'0'..=b'9')) {
+        return Err(qjson_err::QJSON_INVALID_NUMBER);
     }
 
     let s = std::str::from_utf8(bytes).map_err(|_| qjson_err::QJSON_DECODE_FAILED)?;
@@ -121,7 +119,7 @@ mod tests {
 
     #[test]
     fn f64_skip_validation_valid_input() {
-        assert_eq!(parse_f64(b"3.14", true).unwrap(), 3.14);
+        assert_eq!(parse_f64(b"3.15", true).unwrap(), 3.15);
     }
 
     #[test]
diff --git a/src/doc.rs b/src/doc.rs
index 82226f5..5244275 100644
--- a/src/doc.rs
+++ b/src/doc.rs
@@ -32,8 +32,7 @@ impl<'a> Document<'a> {
         indices.push(u32::MAX);
 
         if opts.is_eager() {
-            crate::validate::validate_trailing(buf, &indices)?;
-            crate::validate::validate_eager_values(buf, &indices, max_depth)?;
+            crate::validate::validate_eager_fused(buf, &indices, max_depth)?;
         } else {
             crate::validate::validate_depth(buf, &indices, max_depth)?;
         }
diff --git a/src/validate/classify.rs b/src/validate/classify.rs
new file mode 100644
index 0000000..3e09678
--- /dev/null
+++ b/src/validate/classify.rs
@@ -0,0 +1,346 @@
+//! PSHUFB nibble-LUT byte classifier shared by string and number validation.
+//!
+//! Each byte is decomposed into its high nibble and low nibble. Two
+//! 16-entry lookup tables (one per nibble position) are queried and
+//! AND'd together, yielding a 16×16 = 256-entry classification table
+//! from only 32 bytes of LUT storage. `_mm256_shuffle_epi8` (PSHUFB)
+//! applies the lookups across a 32-byte AVX2 chunk in a few cycles.
+//!
+//! This replaces the three-comparison approach (`high || bs || ctrl`)
+//! used by the old string validation fast-path and extends the same
+//! LUT infrastructure to number validation.
+//!
+//! Some items (number LUTs, constants) are kept for planned number
+//! validation SIMD path.
+
+#![allow(dead_code)]
+
+pub(crate) const CLS_CTRL:  u8 = 0x01;
+pub(crate) const CLS_BS:    u8 = 0x02;
+pub(crate) const CLS_HIGH:  u8 = 0x04;
+pub(crate) const CLS_DIGIT: u8 = 0x08;
+// NUMS is split into two bits so each forms a valid nibble AND-product.
+// NUMS0 = {+, -, .} (all share hi=2), NUMS1 = {e, E} (share lo=5).
+pub(crate) const CLS_NUMS0: u8 = 0x10;
+pub(crate) const CLS_NUMS1: u8 = 0x20;
+pub(crate) const CLS_NUMS:  u8 = CLS_NUMS0 | CLS_NUMS1;
+
+// ── LUT tables ──────────────────────────────────────────────────────────
+//
+// STR tables classify: CTRL (0x00..0x1F), BS (0x5C), HIGH (0x80..0xFF).
+// NUM tables inherit string bits and add DIGIT (0x30..0x39) and NUMS
+// (`.`, `-`, `+`, `e`, `E`). Each is indexed by the respective nibble;
+// the AND of the two lookups yields the final class byte.
+
+#[cfg(target_arch = "x86_64")]
+static STR_LO_TABLE: [u8; 16] = [
+    0x05, // 0x0  CTRL|HIGH
+    0x05, // 0x1
+    0x05, // 0x2
+    0x05, // 0x3
+    0x05, // 0x4
+    0x05, // 0x5
+    0x05, // 0x6
+    0x05, // 0x7
+    0x05, // 0x8
+    0x05, // 0x9
+    0x05, // 0xA
+    0x05, // 0xB
+    0x07, // 0xC  CTRL|HIGH|BS    (backslash)
+    0x05, // 0xD
+    0x05, // 0xE
+    0x05, // 0xF
+];
+
+#[cfg(target_arch = "x86_64")]
+static STR_HI_TABLE: [u8; 16] = [
+    0x01, // 0x0  CTRL
+    0x01, // 0x1  CTRL
+    0x00, // 0x2
+    0x00, // 0x3
+    0x00, // 0x4
+    0x02, // 0x5  BS              (backslash)
+    0x00, // 0x6
+    0x00, // 0x7
+    0x04, // 0x8  HIGH
+    0x04, // 0x9  HIGH
+    0x04, // 0xA  HIGH
+    0x04, // 0xB  HIGH
+    0x04, // 0xC  HIGH
+    0x04, // 0xD  HIGH
+    0x04, // 0xE  HIGH
+    0x04, // 0xF  HIGH
+];
+
+#[cfg(target_arch = "x86_64")]
+static NUM_LO_TABLE: [u8; 16] = [
+    0x0D, // 0x0  CTRL|HIGH|DIGIT
+    0x0D, // 0x1  CTRL|HIGH|DIGIT
+    0x0D, // 0x2  CTRL|HIGH|DIGIT
+    0x0D, // 0x3  CTRL|HIGH|DIGIT
+    0x0D, // 0x4  CTRL|HIGH|DIGIT
+    0x2D, // 0x5  CTRL|HIGH|DIGIT|NUMS1   (digit 5, e, E)
+    0x0D, // 0x6  CTRL|HIGH|DIGIT
+    0x0D, // 0x7  CTRL|HIGH|DIGIT
+    0x0D, // 0x8  CTRL|HIGH|DIGIT
+    0x0D, // 0x9  CTRL|HIGH|DIGIT
+    0x05, // 0xA  CTRL|HIGH
+    0x15, // 0xB  CTRL|HIGH|NUMS0          (+)
+    0x07, // 0xC  CTRL|HIGH|BS
+    0x15, // 0xD  CTRL|HIGH|NUMS0          (-)
+    0x15, // 0xE  CTRL|HIGH|NUMS0          (.)
+    0x05, // 0xF  CTRL|HIGH
+];
+
+#[cfg(target_arch = "x86_64")]
+static NUM_HI_TABLE: [u8; 16] = [
+    0x01, // 0x0  CTRL
+    0x01, // 0x1  CTRL
+    0x10, // 0x2  NUMS0                 (+, -, .)
+    0x08, // 0x3  DIGIT
+    0x20, // 0x4  NUMS1                 (E)
+    0x02, // 0x5  BS
+    0x20, // 0x6  NUMS1                 (e)
+    0x00, // 0x7
+    0x04, // 0x8  HIGH
+    0x04, // 0x9  HIGH
+    0x04, // 0xA  HIGH
+    0x04, // 0xB  HIGH
+    0x04, // 0xC  HIGH
+    0x04, // 0xD  HIGH
+    0x04, // 0xE  HIGH
+    0x04, // 0xF  HIGH
+];
+
+// ── AVX2 classify functions ─────────────────────────────────────────────
+
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+use core::arch::x86_64::*;
+
+/// Core PSHUFB nibble-LUT classifier.
+///
+/// Each byte in `chunk` is split into high and low nibbles. The nibbles
+/// index into `hi_lut` and `lo_lut` respectively (via `_mm256_shuffle_epi8`);
+/// the AND of the two lookups is the per-byte class bitmask.
+///
+/// `lo_lut` and `hi_lut` are 32-byte `__m256i` whose lower and upper 128-bit
+/// lanes each contain a copy of the same 16-entry nibble table (as required
+/// by PSHUFB's lane-local indexing).
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn classify_chunk(chunk: __m256i, lo_lut: __m256i, hi_lut: __m256i) -> __m256i {
+    let nib_mask = _mm256_set1_epi8(0x0Fu8 as i8);
+
+    let lo_nibs   = _mm256_and_si256(chunk, nib_mask);
+    let hi_shift  = _mm256_srli_epi32::<4>(chunk);
+    let hi_nibs   = _mm256_and_si256(hi_shift, nib_mask);
+
+    let lo_class = _mm256_shuffle_epi8(lo_lut, lo_nibs);
+    let hi_class = _mm256_shuffle_epi8(hi_lut, hi_nibs);
+
+    _mm256_and_si256(lo_class, hi_class)
+}
+
+/// Build a 32-byte `__m256i` from a 16-entry nibble LUT by duplicating
+/// the table into both 128-bit lanes.
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+unsafe fn make_lut(table: &[u8; 16]) -> __m256i {
+    let t = table;
+    _mm256_setr_epi8(
+        t[0]  as i8, t[1]  as i8, t[2]  as i8, t[3]  as i8,
+        t[4]  as i8, t[5]  as i8, t[6]  as i8, t[7]  as i8,
+        t[8]  as i8, t[9]  as i8, t[10] as i8, t[11] as i8,
+        t[12] as i8, t[13] as i8, t[14] as i8, t[15] as i8,
+        t[0]  as i8, t[1]  as i8, t[2]  as i8, t[3]  as i8,
+        t[4]  as i8, t[5]  as i8, t[6]  as i8, t[7]  as i8,
+        t[8]  as i8, t[9]  as i8, t[10] as i8, t[11] as i8,
+        t[12] as i8, t[13] as i8, t[14] as i8, t[15] as i8,
+    )
+}
+
+/// Classify a 32-byte chunk for string validation.
+///
+/// Returns a bitmask (one bit per byte) where set bits indicate bytes
+/// that have any interesting class bit (CTRL | BS | HIGH). Zero means
+/// the entire chunk is pure printable ASCII without escapes or UTF-8.
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn classify_str_chunk(chunk: __m256i) -> u32 {
+    classify_str_mask(chunk)
+}
+
+/// Precomputed 32-byte LUT vectors (16-entry nibble table duplicated
+/// into both 128-bit lanes), loaded via `_mm256_load_si256`. Avoids
+/// rebuilding the vector on every call.
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[repr(align(32))]
+struct AlignedLut([u8; 32]);
+
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+static STR_LO_LUT_VEC: AlignedLut = build_aligned_lut(&STR_LO_TABLE);
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+static STR_HI_LUT_VEC: AlignedLut = build_aligned_lut(&STR_HI_TABLE);
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+static NUM_LO_LUT_VEC: AlignedLut = build_aligned_lut(&NUM_LO_TABLE);
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+static NUM_HI_LUT_VEC: AlignedLut = build_aligned_lut(&NUM_HI_TABLE);
+
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+const fn build_aligned_lut(table: &[u8; 16]) -> AlignedLut {
+    let mut a = [0u8; 32];
+    let mut i = 0usize;
+    while i < 16 {
+        a[i] = table[i];
+        a[i + 16] = table[i];
+        i += 1;
+    }
+    AlignedLut(a)
+}
+
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[inline(always)]
+unsafe fn load_str_luts() -> (__m256i, __m256i) {
+    (
+        _mm256_load_si256(STR_LO_LUT_VEC.0.as_ptr() as *const __m256i),
+        _mm256_load_si256(STR_HI_LUT_VEC.0.as_ptr() as *const __m256i),
+    )
+}
+
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[inline(always)]
+unsafe fn load_num_luts() -> (__m256i, __m256i) {
+    (
+        _mm256_load_si256(NUM_LO_LUT_VEC.0.as_ptr() as *const __m256i),
+        _mm256_load_si256(NUM_HI_LUT_VEC.0.as_ptr() as *const __m256i),
+    )
+}
+
+/// Returns a bitmask of bytes that match CTRL | BS | HIGH.
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn classify_str_mask(chunk: __m256i) -> u32 {
+    let (lo_lut, hi_lut) = load_str_luts();
+    let classes    = classify_chunk(chunk, lo_lut, hi_lut);
+    let zero       = _mm256_cmpeq_epi8(classes, _mm256_setzero_si256());
+    let zero_mask  = _mm256_movemask_epi8(zero) as u32;
+    zero_mask ^ 0xFFFF_FFFF   // invert: 1 = interesting
+}
+
+/// Classify a 32-byte chunk for number validation.
+///
+/// Returns `(class_vector, bad_mask)`:
+///   - `class_vector`: per-byte class bitmask (DIGIT | NUMS | CTRL | …)
+///   - `bad_mask`:     bits set for bytes with CTRL | BS | HIGH (unconditionally invalid in a number).
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn classify_num_chunk(chunk: __m256i) -> (__m256i, u32) {
+    let (lo_lut, hi_lut) = load_num_luts();
+    let classes    = classify_chunk(chunk, lo_lut, hi_lut);
+
+    // bad = bytes where CTRL | BS | HIGH is set.
+    let bad_bits   = _mm256_and_si256(classes, _mm256_set1_epi8((CLS_CTRL | CLS_BS | CLS_HIGH) as i8));
+    let zero       = _mm256_cmpeq_epi8(bad_bits, _mm256_setzero_si256());
+    let bad_mask   = _mm256_movemask_epi8(zero) as u32 ^ 0xFFFF_FFFF;
+
+    (classes, bad_mask)
+}
+
+// ── Exhaustive LUT tests ────────────────────────────────────────────────
+
+#[cfg(all(test, target_arch = "x86_64"))]
+mod tests {
+    use super::*;
+
+    fn str_expected(b: u8) -> u8 {
+        let mut bits = 0u8;
+        if b <= 0x1F             { bits |= CLS_CTRL; }
+        if b == b'\\'            { bits |= CLS_BS; }
+        if b >= 0x80             { bits |= CLS_HIGH; }
+        bits
+    }
+
+    fn num_expected(b: u8) -> u8 {
+        let mut bits = str_expected(b);
+        if b.is_ascii_digit()    { bits |= CLS_DIGIT; }
+        if matches!(b, b'+' | b'-' | b'.') { bits |= CLS_NUMS0; }
+        if matches!(b, b'e' | b'E')        { bits |= CLS_NUMS1; }
+        bits
+    }
+
+    #[test]
+    fn str_lut_exhaustive() {
+        for b in 0..=255u8 {
+            let hi = (b >> 4) as usize;
+            let lo = (b & 0x0F) as usize;
+            let got = STR_HI_TABLE[hi] & STR_LO_TABLE[lo];
+            let exp = str_expected(b);
+            assert_eq!(got, exp,
+                "byte 0x{b:02X} ('{}'): got 0x{got:02X}, expected 0x{exp:02X}",
+                b.escape_ascii());
+        }
+    }
+
+    #[test]
+    fn num_lut_exhaustive() {
+        for b in 0..=255u8 {
+            let hi = (b >> 4) as usize;
+            let lo = (b & 0x0F) as usize;
+            let got = NUM_HI_TABLE[hi] & NUM_LO_TABLE[lo];
+            let exp = num_expected(b);
+            assert_eq!(got, exp,
+                "byte 0x{b:02X} ('{}'): got 0x{got:02X}, expected 0x{exp:02X}",
+                b.escape_ascii());
+        }
+    }
+
+    // Double-check nibble-resolution edge cases.
+    #[test]
+    fn num_digit5_is_digit_not_nums() {
+        // 0x35 = '5': DIGIT set, neither NUMS0 nor NUMS1 set
+        // (lo=5 carries NUMS1 for e/E; resolved by hi=3 which lacks NUMS1).
+        let hi = 0x3;
+        let lo = 0x5;
+        let got = NUM_HI_TABLE[hi] & NUM_LO_TABLE[lo];
+        assert_eq!(got, CLS_DIGIT,
+            "'5' should be DIGIT only (got 0x{got:02X})");
+    }
+
+    #[test]
+    fn num_e_is_nums1_not_digit() {
+        // 0x65 = 'e': NUMS1 set, DIGIT not set
+        // (lo=5 carries both DIGIT and NUMS1; resolved by hi=6 with NUMS1 only).
+        let hi = 0x6;
+        let lo = 0x5;
+        let got = NUM_HI_TABLE[hi] & NUM_LO_TABLE[lo];
+        assert_eq!(got, CLS_NUMS1,
+            "'e' should be NUMS1 only (got 0x{got:02X})");
+    }
+
+    #[test]
+    fn num_e_upper_is_nums1_not_digit() {
+        let hi = 0x4;
+        let lo = 0x5;
+        let got = NUM_HI_TABLE[hi] & NUM_LO_TABLE[lo];
+        assert_eq!(got, CLS_NUMS1,
+            "'E' should be NUMS1 only (got 0x{got:02X})");
+    }
+
+    #[test]
+    fn num_percent_is_not_nums() {
+        // 0x25 = '%': hi=2 (NUMS0), lo=5 (NUMS1|DIGIT) → must NOT collide.
+        let hi = 0x2;
+        let lo = 0x5;
+        let got = NUM_HI_TABLE[hi] & NUM_LO_TABLE[lo];
+        assert_eq!(got, 0,
+            "'%' should have no class bits (got 0x{got:02X})");
+    }
+
+    #[test]
+    fn str_0x7f_is_clean() {
+        // DEL (0x7F) is allowed by RFC 8259 in strings.
+        let hi = 0x7;
+        let lo = 0xF;
+        let got = STR_HI_TABLE[hi] & STR_LO_TABLE[lo];
+        assert_eq!(got, 0, "0x7F should be clean (got 0x{got:02X})");
+    }
+}
diff --git a/src/validate/mod.rs b/src/validate/mod.rs
index a9ce958..fcb0393 100644
--- a/src/validate/mod.rs
+++ b/src/validate/mod.rs
@@ -7,6 +7,7 @@
 pub(crate) mod number;
 pub(crate) use number::validate_number;
 
+pub(crate) mod classify;
 pub(crate) mod strings;
 pub(crate) use strings::validate_string_span;
 
@@ -48,6 +49,7 @@ pub(crate) fn validate_depth(
 /// bracket where nesting depth returns to zero — that is the actual root
 /// end, regardless of how many additional structural chars the buffer has.
 /// For scalar roots (no opening bracket), we scan the raw bytes.
+#[allow(dead_code)]
 pub(crate) fn validate_trailing(
     buf: &[u8],
     indices: &[u32],
@@ -125,6 +127,187 @@ pub(crate) fn validate_trailing(
     Ok(())
 }
 
+/// Fused eager validator: depth, trailing-content, and grammar/value
+/// checks in a single O(indices) traversal. Equivalent to calling
+/// `validate_depth` + `validate_trailing` + `validate_eager_values`
+/// but avoids three separate walks.
+pub(crate) fn validate_eager_fused(
+    buf: &[u8],
+    indices: &[u32],
+    max_depth: u32,
+) -> Result<(), qjson_err> {
+    let mut depth: u32 = 0;
+
+    let mut stack: Vec<CtxKind> = Vec::with_capacity(16);
+    stack.push(CtxKind::Top);
+
+    let mut prev_end: usize = 0;
+
+    let mut i: usize = 0;
+    while i < indices.len() {
+        let idx = indices[i];
+        if idx == u32::MAX { break; }
+        let pos = idx as usize;
+        let b = buf[pos];
+
+        consume_scalar_gap(buf, prev_end, pos, stack.last_mut().unwrap())?;
+
+        // After consuming the gap, if the root has already been fully
+        // consumed (depth==0, TopDone), any subsequent structural token
+        // is trailing content. This matches the old validate_trailing
+        // precedence: e.g. `42 {}` → QJSON_TRAILING_CONTENT, not PARSE_ERROR.
+        if depth == 0 && stack.len() == 1 && stack[0] == CtxKind::TopDone {
+            return Err(qjson_err::QJSON_TRAILING_CONTENT);
+        }
+
+        match b {
+            b'{' | b'[' => {
+                let cur = stack.last_mut().unwrap();
+                match *cur {
+                    CtxKind::Top
+                    | CtxKind::ArrAfterOpen
+                    | CtxKind::ArrAfterComma
+                    | CtxKind::ObjAfterColon => {
+                        *cur = parent_after_value(*cur);
+                        stack.push(if b == b'{' {
+                            CtxKind::ObjAfterOpen
+                        } else {
+                            CtxKind::ArrAfterOpen
+                        });
+                    }
+                    _ => return Err(qjson_err::QJSON_PARSE_ERROR),
+                }
+                depth += 1;
+                if depth > max_depth {
+                    return Err(qjson_err::QJSON_NESTING_TOO_DEEP);
+                }
+                prev_end = pos + 1;
+                i += 1;
+            }
+            b'}' => {
+                let top = stack.pop().ok_or(qjson_err::QJSON_PARSE_ERROR)?;
+                if !matches!(top, CtxKind::ObjAfterOpen | CtxKind::ObjAfterValue) {
+                    return Err(qjson_err::QJSON_PARSE_ERROR);
+                }
+                if stack.is_empty() { return Err(qjson_err::QJSON_PARSE_ERROR); }
+                depth -= 1;
+                if depth == 0 && stack.len() == 1 && stack[0] == CtxKind::TopDone {
+                    let mut p = pos + 1;
+                    while p < buf.len() && is_ws(buf[p]) { p += 1; }
+                    if p < buf.len() {
+                        return Err(qjson_err::QJSON_TRAILING_CONTENT);
+                    }
+                }
+                prev_end = pos + 1;
+                i += 1;
+            }
+            b']' => {
+                let top = stack.pop().ok_or(qjson_err::QJSON_PARSE_ERROR)?;
+                if !matches!(top, CtxKind::ArrAfterOpen | CtxKind::ArrAfterValue) {
+                    return Err(qjson_err::QJSON_PARSE_ERROR);
+                }
+                if stack.is_empty() { return Err(qjson_err::QJSON_PARSE_ERROR); }
+                depth -= 1;
+                if depth == 0 && stack.len() == 1 && stack[0] == CtxKind::TopDone {
+                    let mut p = pos + 1;
+                    while p < buf.len() && is_ws(buf[p]) { p += 1; }
+                    if p < buf.len() {
+                        return Err(qjson_err::QJSON_TRAILING_CONTENT);
+                    }
+                }
+                prev_end = pos + 1;
+                i += 1;
+            }
+            b',' => {
+                let cur = stack.last_mut().ok_or(qjson_err::QJSON_PARSE_ERROR)?;
+                match *cur {
+                    CtxKind::ArrAfterValue => *cur = CtxKind::ArrAfterComma,
+                    CtxKind::ObjAfterValue => *cur = CtxKind::ObjAfterComma,
+                    _ => return Err(qjson_err::QJSON_PARSE_ERROR),
+                }
+                prev_end = pos + 1;
+                i += 1;
+            }
+            b':' => {
+                let cur = stack.last_mut().ok_or(qjson_err::QJSON_PARSE_ERROR)?;
+                match *cur {
+                    CtxKind::ObjAfterKey => *cur = CtxKind::ObjAfterColon,
+                    _ => return Err(qjson_err::QJSON_PARSE_ERROR),
+                }
+                prev_end = pos + 1;
+                i += 1;
+            }
+            b'"' => {
+                if i + 1 >= indices.len() { return Err(qjson_err::QJSON_PARSE_ERROR); }
+                let close = indices[i + 1] as usize;
+                if close <= pos || close >= buf.len() || buf[close] != b'"' {
+                    return Err(qjson_err::QJSON_PARSE_ERROR);
+                }
+
+                let cur = stack.last().copied().unwrap();
+                // For a top-level string root, check trailing content BEFORE
+                // validating the string. This preserves the old validate_trailing
+                // error-code precedence: `"\\q" x` → QJSON_TRAILING_CONTENT, not
+                // QJSON_INVALID_STRING.
+                if matches!(cur, CtxKind::Top) && depth == 0 {
+                    let mut p = close + 1;
+                    while p < buf.len() && is_ws(buf[p]) { p += 1; }
+                    if p < buf.len() {
+                        return Err(qjson_err::QJSON_TRAILING_CONTENT);
+                    }
+                }
+
+                strings::validate_string_span(&buf[pos + 1 .. close])?;
+
+                let cur = stack.last_mut().ok_or(qjson_err::QJSON_PARSE_ERROR)?;
+                match *cur {
+                    CtxKind::ObjAfterOpen | CtxKind::ObjAfterComma => {
+                        *cur = CtxKind::ObjAfterKey;
+                    }
+                    CtxKind::Top
+                    | CtxKind::ArrAfterOpen
+                    | CtxKind::ArrAfterComma
+                    | CtxKind::ObjAfterColon => {
+                        *cur = parent_after_value(*cur);
+                    }
+                    _ => return Err(qjson_err::QJSON_PARSE_ERROR),
+                }
+                prev_end = close + 1;
+                i += 2;
+            }
+            _ => return Err(qjson_err::QJSON_PARSE_ERROR),
+        }
+    }
+
+    // Tail: handle any remaining content.
+    // For scalar roots (depth == 0, still in Top), find the first
+    // token, validate it, then check for trailing content beyond it.
+    if matches!(*stack.last().unwrap(), CtxKind::Top) && depth == 0 {
+        let mut scan = prev_end;
+        while scan < buf.len() && is_ws(buf[scan]) { scan += 1; }
+        if scan < buf.len() {
+            let mut end = scan;
+            while end < buf.len() && !is_ws(buf[end]) { end += 1; }
+            // Check for trailing content BEFORE validating the scalar.
+            // Preserves old validate_trailing precedence: `1a 2` → QJSON_TRAILING_CONTENT.
+            let mut p = end;
+            while p < buf.len() && is_ws(buf[p]) { p += 1; }
+            if p < buf.len() {
+                return Err(qjson_err::QJSON_TRAILING_CONTENT);
+            }
+            validate_scalar(&buf[scan..end])?;
+            *stack.last_mut().unwrap() = CtxKind::TopDone;
+        }
+    } else {
+        consume_scalar_gap(buf, prev_end, buf.len(), stack.last_mut().unwrap())?;
+    }
+
+    if stack.len() != 1 || stack[0] != CtxKind::TopDone {
+        return Err(qjson_err::QJSON_PARSE_ERROR);
+    }
+    Ok(())
+}
+
 /// Grammar-aware eager pass: walk `indices` once and validate every
 /// structural transition, key/value string, and scalar value.
 ///
@@ -140,6 +323,7 @@ pub(crate) fn validate_trailing(
 /// `validate_number` or matched against the three literal keywords;
 /// the error-code precedence matches the previous heuristic-based
 /// `check_gap` so existing tests keep their current error codes.
+#[allow(dead_code)]
 pub(crate) fn validate_eager_values(
     buf: &[u8],
     indices: &[u32],
@@ -494,24 +678,122 @@ mod tests {
 
     #[test]
     fn grammar_accepts_at_max_depth() {
-        // 1024 nested arrays at the default max_depth limit.
-        let mut buf = Vec::new();
-        for _ in 0..1024 { buf.push(b'['); }
-        for _ in 0..1024 { buf.push(b']'); }
-        assert!(
-            validate_eager_values(&buf, &ix(&buf), 1024).is_ok(),
-            "should accept exactly at max_depth"
-        );
+        let buf = [b'['].repeat(1024).into_iter()
+            .chain([b']'].repeat(1024))
+            .collect::<Vec<u8>>();
+        let indices = ix(&buf);
+        assert!(validate_eager_values(&buf, &indices, 1024).is_ok());
     }
 
     #[test]
     fn grammar_rejects_over_max_depth() {
-        // 1025 nested arrays — one past the default max_depth limit.
-        let mut buf = Vec::new();
-        for _ in 0..1025 { buf.push(b'['); }
-        for _ in 0..1025 { buf.push(b']'); }
+        let buf = [b'['].repeat(1025).into_iter()
+            .chain([b']'].repeat(1025))
+            .collect::<Vec<u8>>();
+        let indices = ix(&buf);
         assert_eq!(
-            validate_eager_values(&buf, &ix(&buf), 1024), Err(qjson_err::QJSON_NESTING_TOO_DEEP),
+            validate_eager_values(&buf, &indices, 1024), Err(qjson_err::QJSON_NESTING_TOO_DEEP),
         );
     }
-}
\ No newline at end of file
+
+    // ── fused validator tests ────────────────────────────────────────
+
+    #[test]
+    fn fused_accepts_clean_input() {
+        for buf in [
+            &b"{}"[..], &b"[]"[..], &b"{\"a\":1}"[..],
+            &b"[1,2,3]"[..], &b"42"[..], &b"\"hi\""[..],
+            &b"[true,false,null]"[..],
+        ] {
+            assert!(validate_eager_fused(buf, &ix(buf), 1024).is_ok(),
+                "fused should accept {:?}", std::str::from_utf8(buf).unwrap_or("(non-utf8)"));
+        }
+    }
+
+    #[test]
+    fn fused_rejects_trailing_content() {
+        assert_eq!(
+            validate_eager_fused(b"{}garbage", &ix(b"{}garbage"), 1024),
+            Err(qjson_err::QJSON_TRAILING_CONTENT),
+        );
+    }
+
+    #[test]
+    fn fused_rejects_excessive_depth() {
+        assert_eq!(
+            validate_eager_fused(b"[[[1]]]", &ix(b"[[[1]]]"), 2),
+            Err(qjson_err::QJSON_NESTING_TOO_DEEP),
+        );
+    }
+
+    #[test]
+    fn fused_depth_ok_at_limit() {
+        assert!(validate_eager_fused(b"[[1]]", &ix(b"[[1]]"), 2).is_ok());
+    }
+
+    #[test]
+    fn fused_trailing_whitespace_accepted() {
+        assert!(validate_eager_fused(b"{}   \n\t", &ix(b"{}   \n\t"), 1024).is_ok());
+    }
+
+    #[test]
+    fn fused_two_root_scalars_rejected() {
+        assert_eq!(
+            validate_eager_fused(b"1 2", &ix(b"1 2"), 1024),
+            Err(qjson_err::QJSON_TRAILING_CONTENT),
+        );
+    }
+
+    #[test]
+    fn fused_trailing_in_nested_container_detected() {
+        assert_eq!(
+            validate_eager_fused(b"[1] x", &ix(b"[1] x"), 1024),
+            Err(qjson_err::QJSON_TRAILING_CONTENT),
+        );
+    }
+
+    #[test]
+    fn fused_grammar_rejects_missing_colon() {
+        assert_eq!(
+            validate_eager_fused(b"{\"a\"}", &ix(b"{\"a\"}"), 1024),
+            Err(qjson_err::QJSON_PARSE_ERROR),
+        );
+    }
+
+    #[test]
+    fn fused_grammar_rejects_trailing_garbage_inside_object() {
+        assert_eq!(
+            validate_eager_fused(b"{\"a\":\"a\" 123}", &ix(b"{\"a\":\"a\" 123}"), 1024),
+            Err(qjson_err::QJSON_PARSE_ERROR),
+        );
+    }
+
+    // ── error-code precedence regression tests ──────────────────────
+
+    #[test]
+    fn fused_string_root_trailing_before_validation() {
+        // Old validate_trailing ran before validate_eager_values.
+        // `"\\q" x` → QJSON_TRAILING_CONTENT, not QJSON_INVALID_STRING.
+        assert_eq!(
+            validate_eager_fused(b"\"\\q\" x", &ix(b"\"\\q\" x"), 1024),
+            Err(qjson_err::QJSON_TRAILING_CONTENT),
+        );
+    }
+
+    #[test]
+    fn fused_scalar_then_structural_is_trailing() {
+        // `42 {}` — scalar root followed by container must be trailing.
+        assert_eq!(
+            validate_eager_fused(b"42 {}", &ix(b"42 {}"), 1024),
+            Err(qjson_err::QJSON_TRAILING_CONTENT),
+        );
+    }
+
+    #[test]
+    fn fused_scalar_then_array_is_trailing() {
+        assert_eq!(
+            validate_eager_fused(b"42[]", &ix(b"42[]"), 1024),
+            Err(qjson_err::QJSON_TRAILING_CONTENT),
+        );
+    }
+}
diff --git a/src/validate/strings/avx2.rs b/src/validate/strings/avx2.rs
index 8391a93..60a5772 100644
--- a/src/validate/strings/avx2.rs
+++ b/src/validate/strings/avx2.rs
@@ -1,18 +1,25 @@
 #![cfg(all(target_arch = "x86_64", feature = "avx2"))]
 
-//! AVX2 ASCII fast path for string-content validation.
+//! AVX2 string-content validation using the PSHUFB nibble-LUT classifier.
 //!
-//! For each 32-byte chunk, compute a "needs-attention" mask covering bytes
-//! that are either control chars (< 0x20), backslashes, or high-bit bytes.
-//! If the mask is all-zero the chunk is pure printable ASCII (no escapes,
-//! no UTF-8, no control) and can be skipped entirely.
+//! `classify_str_mask` classifies all 32 bytes in a chunk simultaneously
+//! via a 32-byte look-up table queried by `_mm256_shuffle_epi8` (PSHUFB).
+//! The LUT produces a byte-class bitmask for each input byte: pure
+//! printable ASCII returns zero, while control chars, backslashes, and
+//! high-bit bytes set bits that fold into a single `u32` attention mask.
 //!
-//! On the first non-zero chunk we hand off to the scalar state machine for
-//! the remainder of the span — we don't try to bit-scan inside the chunk.
-//! The fast-path payoff comes from cleanly skipping long ASCII prefixes;
-//! the scalar tail handles correctness without needing SIMD escape logic.
+//! Zero-mask chunks are skipped entirely. For non-zero chunks we iterate
+//! the set bits and validate each flagged byte in-batch:
+//!   - control → INVALID_STRING
+//!   - backslash → validate the escape introducer + following byte(s)
+//!   - high-bit  → delegate the remainder to the well-tested scalar path
+//!
+//! Single-char escapes and `\uXXXX` that fit within the current 32-byte
+//! chunk are validated inline; escapes straddling a chunk boundary fall
+//! through to the scalar path for correctness.
 
 use crate::error::qjson_err;
+use crate::validate::classify::classify_str_mask;
 use core::arch::x86_64::*;
 
 use super::scalar::validate_span_scalar;
@@ -28,37 +35,68 @@ unsafe fn validate_span_avx2_impl(span: &[u8]) -> Result<(), qjson_err> {
     let mut i: usize = 0;
     let n = span.len();
 
-    // ASCII bytes that need scalar attention have:
-    //   - top bit set                  → byte >= 0x80
-    //   - value < 0x20                 → control char
-    //   - value == 0x5C ('\\')         → escape introducer
-    //
-    // Detection via three SIMD compares OR'd together.
-    let backslash = _mm256_set1_epi8(b'\\' as i8);
-    // For "< 0x20" we use a signed unsigned trick: compare against 0x1F via
-    // unsigned MAX. _mm256_cmpgt_epi8 is signed, but bytes <0x20 are also
-    // <0x20 as signed positive values, so signed cmpgt works here for the
-    // 0x00..=0x1F range (none of which has the high bit set).
-    let ctrl_thresh = _mm256_set1_epi8(0x20_i8);
-
     while i + 32 <= n {
         let chunk = _mm256_loadu_si256(span.as_ptr().add(i) as *const __m256i);
+        let mask = classify_str_mask(chunk);
+
+        if mask != 0 {
+            let mut m = mask;
+            let mut consumed: usize = 0; // bytes from chunk start already handled
+            while m != 0 {
+                let offset = m.trailing_zeros() as usize;
+                m &= m - 1;
+
+                if offset < consumed {
+                    continue; // already consumed as part of a prior escape
+                }
+
+                let pos = i + offset;
+                let b = span[pos];
+
+                if b < 0x20 {
+                    return Err(qjson_err::QJSON_INVALID_STRING);
+                }
+
+                if b >= 0x80 {
+                    return validate_span_scalar(&span[pos..]);
+                }
 
-        // high bit set?
-        let high  = _mm256_movemask_epi8(chunk) as u32;
-        // byte == '\\' ?
-        let bs    = _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, backslash)) as u32;
-        // byte < 0x20 ?  (signed cmpgt: ctrl_thresh > chunk for 0x00..=0x1F bytes)
-        let ctrl  = _mm256_movemask_epi8(_mm256_cmpgt_epi8(ctrl_thresh, chunk)) as u32;
+                // b == b'\\' (mask only has bits for ctrl|bs|high)
+                if pos + 1 >= n {
+                    return Err(qjson_err::QJSON_INVALID_STRING);
+                }
 
-        let interesting = high | bs | ctrl;
-        if interesting != 0 {
-            // Hand off to the scalar state machine starting at the first
-            // interesting byte in this chunk. We don't try to validate any
-            // already-cleared bytes — those are pure printable ASCII and
-            // self-terminating so it's safe to resume there.
-            let offset = interesting.trailing_zeros() as usize;
-            return validate_span_scalar(&span[i + offset..]);
+                let next = span[pos + 1];
+                match next {
+                    b'"' | b'\\' | b'/' | b'b' | b'f' | b'n' | b'r' | b't' => {
+                        // Escape straddles chunk boundary: delegate to scalar
+                        // so consumed tracking doesn't lose sync.
+                        if pos + 2 > i + 32 {
+                            return validate_span_scalar(&span[pos..]);
+                        }
+                        consumed = offset + 2;
+                    }
+                    b'u' => {
+                        let hex_start = pos + 2;
+                        let hex_end = hex_start + 4;
+                        if hex_end > n {
+                            return Err(qjson_err::QJSON_INVALID_STRING);
+                        }
+                        // If the full \uXXXX straddles the chunk boundary,
+                        // hand off to scalar.
+                        if hex_end > i + 32 {
+                            return validate_span_scalar(&span[pos..]);
+                        }
+                        for &h in &span[hex_start..hex_end] {
+                            if !h.is_ascii_hexdigit() {
+                                return Err(qjson_err::QJSON_INVALID_STRING);
+                            }
+                        }
+                        consumed = offset + 6;
+                    }
+                    _ => return Err(qjson_err::QJSON_INVALID_STRING),
+                }
+            }
         }
 
         i += 32;