test: stealth corpus real-dump similarity gates (T1–T12)#21

Merged
DavidOsipov merged 21 commits into master from stealth-corpus-real-dump-similarity on Jun 12, 2026

@ichmagmaus111

Collaborator

What

Implements the Stealth Corpus Real-Dump Similarity test plan (T1–T12,
docs/Plans/STEALTH_CORPUS_REAL_DUMP_SIMILARITY_TEST_PLAN_2026-06-08.md):
move release-facing "similar to real browsers" claims onto fixture-derived,
fail-closed gates instead of self-calibrated generator output.

  • T1–T4 (evidence model): fixture-derived family-lane oracle + contracts;
    truthful 1k iteration-tier naming; generated EvidenceFieldStatus /
    per-field status / extension-count histograms into
    ReviewedFamilyLaneBaselines.h; fail-closed gate on unavailable evidence.
  • T5 exact-field fixture gate · T6 extension-count similarity ·
    T7 wire-length fixture gate (+ nightly Monte Carlo reclassified as a
    generator-stability diagnostic) · T8 Chrome shuffle similarity ·
    T9 remove silent empty-baseline early returns from
    test_tls_multi_dump_windows_chrome_stats.cpp (fail-closed status asserts) ·
    T10 docs (real-corpus gates vs seed-stress diagnostics) ·
    T12 closeout handoff artifact.

32 files changed (+1950 / −104), per-task commits.

Sanctioned plan deviations (documented in-code, TDD §4.4)

  1. T7 wire-length is NOT byte-exact (plan drafted tolerance 0.0).
    TlsHelloBuilder.cpp injects 0..255 B of per-build padding-target entropy
    (anti-DPI), so the wire length is non-deterministic by design and a
    byte-exact gate could never go green without removing a security feature.
    Uses an entropy-bounded, fixture-anchored envelope instead.
  2. T8 chromium extension-set is checked via the reviewed order-template
    catalog, not forced to Exact. The chromium_linux_desktop cohort
    genuinely pools sets of size 15/16/17, so by the plan's own Exact rule it
    is a Catalog with a correctly-empty collapsed invariant; forcing Exact
    would fabricate evidence.
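
The Exact-versus-Catalog rule behind deviation 2 can be illustrated with a minimal Python sketch. It assumes the rule is "Exact only when every reviewed source agrees, otherwise a Catalog with an empty collapsed invariant"; the function name `classify_extension_set` and the placeholder extension ids are hypothetical and are not taken from the oracle script.

```python
def classify_extension_set(observed_sets):
    """Classify one field from the extension sets seen in reviewed dumps.

    Returns (status, collapsed_invariant, catalog). Assumed rule: a single
    agreed value is Exact; disagreeing sources form a Catalog whose collapsed
    invariant stays empty; no evidence at all is Unavailable.
    """
    catalog = {frozenset(s) for s in observed_sets}
    if not catalog:
        return "Unavailable", frozenset(), catalog
    if len(catalog) == 1:
        (only,) = catalog
        return "Exact", only, catalog
    return "Catalog", frozenset(), catalog


# Placeholder ids standing in for pooled sets of size 15, 16 and 17.
pooled = [range(15), range(16), range(17)]
status, invariant, catalog = classify_extension_set(pooled)
```

Under this rule a pooled cohort cannot become Exact without discarding one of the observed sets, which is the fabrication the deviation avoids.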

Verification state — READ BEFORE MERGE

  • Python analysis suites pass locally (oracle / iteration-tier /
    release-gate contracts — real red→green for T9/T10).
  • C++ source-verified against real headers and existing passing tests
    (generator behavior for T5/T6 is pinned by existing
    test_tls_corpus_ios_apple_tls_1k.cpp and *_baseline.cpp suites).
  • ⚠️ C++ was NOT compiled or run, and sanitizers were NOT run — authored on
    macOS, which cannot build tdlib-obf (zlib≥1.3.2 gate, missing htole32/64,
    std::atomic<std::shared_ptr> unsupported by Apple libc++; the sanitizer
    matrix is pinned to clang-22/gcc-16). The first Linux CI run
    (build → ctest → tools/ci/run_sanitizer_matrix.py) is the real gate.

Residual runtime risk to watch on CI

  • T9 site-1 (test_tls_multi_dump_windows_chrome_stats.cpp): removing the
    early return makes matches_exact_invariants + the conditional
    covers_observed_ech_payload_length run for Chrome147_Windows for the
    first time. The matches_exact_invariants part only enforces universal
    version/record constants (chromium_windows is a Catalog cohort), but the ECH
    payload coverage against the reviewed chromium_windows catalog is newly
    exercised — confirm on CI.
  • T7 firefox wire-length tolerance is reasoned (same mechanism as the
    passing apple baseline suite) but not locally executed.

🤖 Generated with Claude Code

ichmagmaus and others added 21 commits June 8, 2026 21:42
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Convert 1k-named corpus suites to the spot-or-full tier (was kQuickIterations=3), remove the now-unused quick_seed helpers, and rename the seedless structural cross_platform_contamination suite away from the 1k label. Add a contract test enforcing truthful 1k naming.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generate EvidenceFieldStatus per release-critical field, an extension-count histogram, and observed record/handshake length catalogs into ReviewedFamilyLaneBaselines.h. Regenerating also resyncs the previously stale committed header.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a C++ gate asserting release-critical fields are enforceable (Exact/Catalog/Policy) for reviewed lanes and Unavailable for synthetic fail-closed lanes. Treat authoritative multi-browser field divergence as a membership Catalog rather than Mixed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drive the generator over many seeds for chromium_linux_desktop,
firefox_linux_desktop and apple_ios_tls and require every emitted
ClientHello to match the reviewed exact invariants. Assert the
cipher-suite, extension-set and supported-version evidence statuses are
enforceable first, so an Unavailable or Mixed regression fails the gate
instead of passing vacuously. ECH mode per family follows the reviewed
ech_presence_required (Rfc9180Outer for chromium/firefox, Disabled for
apple_ios_tls).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a per-family gate requiring every generated ClientHello's non-GREASE,
non-padding extension count to appear in the reviewed extension-count
histogram Catalog for that lane. The count metric matches the generator
script's histogram derivation (GREASE and padding 0x0015 excluded), so the
generated counts (chromium 16, firefox 17, apple_ios_tls 13) each land in
their reviewed Catalog ({15,16,17} / {17} / {13}) rather than passing a
broad envelope. ECH mode per family follows ech_presence_required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
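
A minimal Python sketch of the count metric described in the commit above, assuming GREASE values follow the RFC 8701 pattern (0x0A0A, 0x1A1A, ..., 0xFAFA) and that padding is extension type 0x0015. The function names and the placeholder extension ids are hypothetical; the real gate is a C++ test.

```python
PADDING_EXTENSION = 0x0015


def is_grease(value: int) -> bool:
    # RFC 8701 GREASE values have identical bytes whose low nibble is 0xA.
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def counted_extensions(extension_types) -> int:
    # Count only non-GREASE, non-padding extensions.
    return sum(
        1 for t in extension_types
        if not is_grease(t) and t != PADDING_EXTENSION
    )


def count_in_reviewed_catalog(extension_types, reviewed_counts) -> bool:
    # Fail closed: an empty catalog must not pass vacuously.
    if not reviewed_counts:
        return False
    return counted_extensions(extension_types) in reviewed_counts


# Placeholder hello: one GREASE value, 16 ordinary extensions, then padding.
hello = [0x0A0A] + list(range(0x0100, 0x0110)) + [PADDING_EXTENSION]
```

With the reviewed chromium catalog {15, 16, 17} from the commit message, this placeholder hello counts 16 and is accepted, while the same hello checked against the firefox catalog {17} is rejected.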
Add a release-facing wire-length gate that bounds the generated ClientHello
length against the reviewed fixture wire-length Catalog (fail-closed on
Unavailable evidence), replacing reliance on the self-calibrated nightly
Monte Carlo envelope. Reclassify the nightly Monte Carlo as a diagnostic
generator-stability suite (comment-only; behavior unchanged).

Plan deviation, documented in-file per TDD sec 4.4: the gate is NOT
byte-exact, although the plan drafted a tolerance of 0.0. TlsHelloBuilder injects
0..255 B of per-build padding-target entropy as an anti-DPI feature, so the
emitted length is non-deterministic by design and a single-byte-exact gate
could never go green without removing that security jitter. The tolerance is
derived from that documented entropy budget plus the SNI-length delta; it
stays fixture-anchored (real dump lengths, not generator self-sampling),
which is the similarity guarantee the broad envelope lacked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Separate Chrome anchored-shuffle legality and diversity from real-corpus
evidence: generated chromium orders must be legal anchored permutations with
no duplicate extension types, must not collapse to a degenerate set of
sequences, and must use an extension set actually observed in the reviewed
corpus; fixed-order apple_ios_tls must equal its single reviewed template
without spurious shuffle.

Plan deviation, documented in-file per TDD sec 4.4: the draft compared the
generated set to invariants.non_grease_extension_set and assumed chromium
could be made Exact. The chromium_linux_desktop cohort genuinely pools sets
of size 15/16/17 (ECH/ALPS vary by source), so by the plan's own Exact rule
that field is a per-template Catalog with a correctly empty collapsed
invariant; forcing Exact would fabricate evidence. The set check instead
requires membership in the reviewed order-template set catalog, which is the
fixture-derived guarantee intended. No oracle/header change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
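
The legality and diversity checks in the commit above can be sketched as follows. This is an illustration only: it assumes a template without duplicate types and models anchoring as a set of positions whose values must not move (`anchored_indices` is a hypothetical parameter; the actual anchoring rules live in the C++ generator).

```python
def is_legal_anchored_permutation(template, generated, anchored_indices):
    # No duplicate extension types may appear.
    if len(generated) != len(set(generated)):
        return False
    # The generated order must be a permutation of the template.
    if sorted(generated) != sorted(template):
        return False
    # Anchored positions must hold the same value as in the template.
    return all(generated[i] == template[i] for i in anchored_indices)


def is_diverse(orders, min_distinct):
    # Shuffle must not collapse to a degenerate set of sequences.
    return len({tuple(order) for order in orders}) >= min_distinct


# Placeholder template with the last position anchored.
template = [10, 11, 12, 13, 99]
anchored = {4}
```

A fixed-order lane such as apple_ios_tls corresponds to the degenerate case where every index is anchored, so the only legal order is the template itself.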
Add test_similarity_release_gate_contract.py asserting release similarity
suites never early-return on empty/unreviewed baselines, and replace the
three such skips in test_tls_multi_dump_windows_chrome_stats.cpp with
fail-closed status assertions (ASSERT_NE Unavailable/Mixed; ASSERT_FALSE
empty wire-length catalog). chromium_windows non_ru_egress is a populated
multi-source Catalog (cipher/ext_set/groups/alpn/compress empty exact
invariants by design; supported_versions Exact; wire lengths populated), so
the matcher enforces the populated fields without the skip.

test_tls_multi_dump_ios_apple_tls_stats.cpp needed no change: it has no
empty-baseline early returns (verified by the new contract).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
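
A source-scan contract of the kind described above might look like the sketch below. The regular expression is a heuristic assumption, not the pattern used by test_similarity_release_gate_contract.py, and the field name `wire_lengths` in the samples is a placeholder.

```python
import re

# Heuristic: an `if (...empty())` whose body starts with a bare `return;`.
# It does not distinguish `!x.empty()`, so a real contract would need more care.
EMPTY_SKIP = re.compile(r"if\s*\([^)]*\bempty\(\)[^)]*\)\s*\{?\s*return\s*;")


def find_empty_baseline_skips(source: str):
    """Return the 1-based line numbers where a silent skip starts."""
    return [
        source.count("\n", 0, match.start()) + 1
        for match in EMPTY_SKIP.finditer(source)
    ]


skipping = (
    "void t() {\n"
    "  if (baseline.wire_lengths.empty()) {\n"
    "    return;\n"
    "  }\n"
    "}\n"
)
fail_closed = (
    "void t() {\n"
    "  ASSERT_FALSE(baseline.wire_lengths.empty());\n"
    "}\n"
)
```

The contract would then assert that the list is empty for every release similarity suite, so a reintroduced early return fails the Python suite before the C++ build runs.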
Add a doc contract (test_docs_separate_similarity_gates_from_seed_stress)
and the validation-topology / lessons paragraphs it pins: real-corpus
similarity gates consume reviewed fixtures and fail closed on
unavailable/mixed evidence, whereas seed-stress diagnostics prove generator
stability and may not serve as release denominators. Records the wire-length
padding-entropy lesson behind the T7 deviation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closeout handoff for the stealth-corpus-real-dump-similarity phase. Records
honest per-command status: the three plan-specified Python analysis suites
pass locally and all five C++ similarity gates are CMake-registered, while
C++ build/run_all_tests/ctest are marked not_run (macOS cannot build TDLib)
and deferred to Linux CI. Documents the two sanctioned plan deviations
(Task 7 entropy-bounded wire length, Task 8 template-catalog set membership)
and the Task 9 residual ECH-coverage risk for CI to confirm.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ASSERT_EQ/ASSERT_NE format both operands via StringBuilder on failure, but the
generated enum class EvidenceFieldStatus has no StringBuilder operator<<, so
those assertions failed to compile (Core gcc15/clang22, ASan/UBSan, TSan all
stopped on test_tls_multi_dump_windows_chrome_stats.cpp). Replace the enum
ASSERT_EQ/NE with ASSERT_TRUE(status == / != ...), the same idiom already used
in test_tls_generator_fixture_exact_fields_gate.cpp (which compiled). No
behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…S1192)

test_tls_generator_shuffle_similarity.cpp repeated the string literals
"chromium_linux_desktop" and "non_ru_egress" 3x each, which SonarCloud's
C++ profile flags as CRITICAL cpp:S1192. Hoist them to kChromiumLinuxDesktop /
kNonRuEgress Slice constants. Pre-emptive: SonarCloud has not yet analyzed
PR #21 (the sonar CI job fails at build on the pre-existing logging.cpp
std::atomic<std::shared_ptr> under libc++, so no analysis uploads).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…th in release gates

Closes PR #21 review findings 2 and 3 (release-facing similarity gates were
weaker than their names implied).

Finding 2 — exact-field gate skipped catalog-backed critical fields.
test_tls_generator_fixture_exact_fields_gate previously called only
matches_exact_invariants(), which skips any field whose ExactInvariants entry is
empty — precisely the Catalog-status fields where reviewed sources legitimately
disagree (apple cipher/groups/versions/alpn, chromium/firefox extension set). A
generator drift in those fields could survive the gate.

- build_family_lane_baselines.py now emits per-field observed-value catalogs
  (observed_cipher_suite_sequences, observed_extension_sets,
  observed_supported_versions_sequences) into SetMembershipCatalog; header
  regenerated, byte-deterministic and matching the generator self-test.
- FamilyLaneMatcher::matches_release_critical_field() dispatches on
  EvidenceFieldStatus: Exact -> non-empty exact equality; Catalog -> membership
  in the observed catalog; Policy -> fail closed (no named matcher yet);
  Unavailable/Mixed -> fail closed.
- The gate now runs that dispatch for cipher suites, extension set and supported
  versions in addition to matches_exact_invariants, and adds mutant/negative
  tests proving a wrong cipher list, extension set, or supported-versions list
  fails for both Exact and Catalog status.
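
The status dispatch listed above can be sketched in Python as follows. The enum members mirror the statuses named in this commit message; the function signature is an assumption made for illustration and does not reproduce the C++ `FamilyLaneMatcher` API.

```python
from enum import Enum


class EvidenceFieldStatus(Enum):
    EXACT = "Exact"
    CATALOG = "Catalog"
    POLICY = "Policy"
    UNAVAILABLE = "Unavailable"
    MIXED = "Mixed"


def matches_release_critical_field(status, generated, exact_value, observed_catalog):
    if status is EvidenceFieldStatus.EXACT:
        # An empty exact invariant must not match anything.
        return bool(exact_value) and generated == exact_value
    if status is EvidenceFieldStatus.CATALOG:
        # Membership in the values actually observed in reviewed dumps.
        return generated in observed_catalog
    # Policy (no named matcher yet), Unavailable and Mixed all fail closed.
    return False


# Placeholder cipher-suite sequences.
reviewed = (0x1301, 0x1302, 0x1303)
drifted = (0x1301, 0x1303, 0x1302)
```

The key difference from calling only `matches_exact_invariants()` is that a Catalog field with an empty exact invariant is still checked, so drift in that field no longer passes silently.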

Finding 3 — wire-length gate used a broad 15% envelope that admitted
lengths present in no reviewed dump (e.g. firefox ~1606..2545 vs observed
{1890..2213}).

- FamilyLaneMatcher::within_wire_length_byte_model() bounds the generated length
  to within max_byte_delta of some observed sample, expressed in bytes.
- test_tls_generator_wire_length_fixture_gate derives the budget from the
  generator mechanism: 255 B padding-target entropy (TlsHelloBuilder
  rng.bounded(256u)) + a fixture-derived 16 B SNI-length delta, replacing the
  arbitrary 15%. within_wire_length_envelope() is retained for the nightly
  self-calibrated Monte Carlo diagnostic.
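
To make the difference between the two models concrete, here is a minimal Python sketch. It assumes a symmetric byte distance to the nearest observed sample and uses only the two firefox endpoint lengths quoted above (1890 and 2213) as stand-in samples; the real reviewed catalog and the C++ signatures may differ.

```python
PADDING_ENTROPY_BYTES = 255   # from rng.bounded(256u): 0..255
SNI_LENGTH_DELTA_BYTES = 16   # fixture-derived delta quoted in the commit
MAX_BYTE_DELTA = PADDING_ENTROPY_BYTES + SNI_LENGTH_DELTA_BYTES  # 271


def within_wire_length_byte_model(length, observed_lengths, max_byte_delta):
    # Fail closed on an empty catalog.
    if not observed_lengths:
        return False
    return any(abs(length - sample) <= max_byte_delta for sample in observed_lengths)


def within_percent_envelope(length, observed_lengths, fraction):
    # The older, broader model: a percentage band around min and max.
    low = min(observed_lengths) * (1 - fraction)
    high = max(observed_lengths) * (1 + fraction)
    return low <= length <= high


observed = [1890, 2213]  # endpoints only; stand-ins for the reviewed samples
```

With these stand-ins, a length of 1610 B sits inside the 15% envelope but is 280 B from the nearest sample, so the byte model rejects it; 2000 B is accepted by both.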

Python generator self-test and the three PR analysis suites pass locally; the
C++ gates are validated on Linux CI (tdlib-obf does not build on macOS).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Maps each review finding (PR21_STEALTH_CORPUS_SIMILARITY_REVIEW_2026-06-11.md) to
its remediation, branch, and commit, and records the Linux-CI run commands
(Finding 4). Findings 2 and 3 land on this branch; the five runtime risks
(Finding 1, F1-F5) land on stealth-runtime-hardening, split out per the review's
own recommendation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Closes PR #21 review finding 1 (F1): create_transport() silently downgraded an
emulate_tls() ObfuscatedTcp connection to a plain tcp::ObfuscatedTransport when
make_transport_stealth_config or StealthTransportDecorator::create failed,
putting the unmasked legacy obfuscated-MTProto fingerprint that emulate_tls was
meant to hide on the wire — a masking downgrade exploitable by DPI.

Both failure branches now return a FailClosedStealthTransport instead of a
working legacy transport. It keeps the ObfuscatedTcp type for upstream logging
but refuses to operate: write() drops outbound data and can_write() is false so
the engine never hands it un-shaped bytes, and read_next() returns an error so
the MTProto reconnect path tears the connection down (and keeps refusing) rather
than silently using an un-shaped channel. This matches the existing compiled-out
(#else) policy that already LOG(FATAL)s rather than fall back to legacy
fingerprinting.

The structured WARNING and secret-sanitised status message are preserved; the
message changes from "disabled" to "unavailable; refusing ... (fail-closed)".
test_stream_transport_activation_fail_closed is updated to assert the fail-closed
contract (can_write() false, read_next() errors) instead of the old downgrade,
and the redaction/multiline/non-ascii log tests track the new wording.

Validated by source review; tdlib-obf C++ builds and runs on Linux CI only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
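
The fail-closed contract in the commit above can be modelled with a small Python sketch. All class and function names here are hypothetical stand-ins for the C++ types, and raising an exception stands in for `read_next()` returning an error status.

```python
class LegacyTransport:
    """Stand-in for the plain obfuscated transport."""

    def can_write(self):
        return True


class FailClosedTransport:
    """Refuses to operate instead of falling back to the legacy fingerprint."""

    def __init__(self, reason):
        self.reason = reason
        self.dropped_bytes = 0

    def can_write(self):
        return False

    def write(self, data: bytes):
        # Outbound data is dropped, never put on the wire un-shaped.
        self.dropped_bytes += len(data)

    def read_next(self):
        # Surfacing an error lets the reconnect path tear the connection down.
        raise ConnectionError("stealth transport unavailable; refusing (fail-closed)")


def create_transport(emulate_tls, make_stealth):
    if not emulate_tls:
        return LegacyTransport()
    try:
        return make_stealth()
    except RuntimeError as error:
        # Before the fix this branch returned a working legacy transport.
        return FailClosedTransport(str(error))


def failing_factory():
    raise RuntimeError("config rejected")
```

The point of the design is that a stealth setup failure can never yield a transport that writes, so the unmasked fingerprint cannot reach the wire by accident.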
…weight slots

Closes PR #21 review findings 4 and 5.

F4 — population-correlation defence. stable_selection_hash mixed only
destination, time bucket, and platform hints, so every installation on the same
proxy/destination/platform/time bucket deterministically selected the same
profile — a synchronized, DPI-correlatable population pattern. A per-install
salt is now mixed in (set_per_install_selection_salt). The salt is opt-in and
host-supplied: the host generates it once per installation, persists it, and
re-applies it on every launch (a salt that rotated each start would itself become
a fingerprint). It is intentionally NOT auto-minted inside the library, so it
couples profile choice to no unrelated global state; with no salt set it stays 0
and selection is byte-for-byte the legacy deterministic vector, leaving every
existing selection test unaffected. New API: set/get/reset_per_install_selection_salt.
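
A minimal sketch of the salting idea, assuming a SHA-256 based hash and assuming that a zero salt is simply not mixed in so the legacy vector is preserved. The real `stable_selection_hash` is C++ and its mixing function is not shown in this PR; the profile labels are placeholders.

```python
import hashlib


def stable_selection_hash(destination, time_bucket, platform, salt=0):
    material = f"{destination}|{time_bucket}|{platform}".encode()
    if salt != 0:
        # Assumption: salt 0 means "unset" and leaves the legacy input untouched.
        material += salt.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(material).digest()[:8], "big")


def pick_profile(profiles, destination, time_bucket, platform, salt=0):
    index = stable_selection_hash(destination, time_bucket, platform, salt)
    return profiles[index % len(profiles)]


profiles = ["profile_a", "profile_b", "profile_c"]  # placeholder labels
args = ("proxy.example", 42, "linux")
picks_across_installs = {pick_profile(profiles, *args, salt=s) for s in range(1, 65)}
```

For a fixed salt the choice stays deterministic across launches, while a population of installs with different salts no longer lands on one synchronized profile for the same destination, platform and time bucket.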

F5 — independent firefox weight slots. Firefox148 (Linux) and
Firefox149_MacOS26_3 (macOS) aliased the single `firefox148` weight slot, so an
operator could not tune or zero one Firefox lane without disabling the other.
Firefox149_MacOS26_3 now has its own ProfileWeights slot, bridged from the darwin
firefox ratio so default effective weights — and therefore selection behaviour —
are unchanged; only independent tunability is added. profile_weight,
profile_weight_for_runtime_validation, validate_profile_weights and the Darwin
allowed-weight check are updated; zero_profile_weights helpers and the defaults
contract track the new slot.

New regression tests: test_tls_profile_selection_per_install_entropy (different
salts de-correlate; fixed/zero salt deterministic) and
test_tls_profile_firefox_weight_independence (zeroing one firefox lane leaves the
other selectable). Validated by source review; built/run on Linux CI only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Closes PR #21 review finding 3 (F3). StealthConfig::from_secret selects a profile
at config-construction time (T1) to set the decorator's record-size cap, while
TlsInit::send_hello independently selects the ClientHello profile at hello-send
time (T2). Across a sticky-rotation-window boundary the two pick_runtime_profile
calls can land on different profiles, so a cap tied only to config.profile could
let the post-handshake records exceed the record_size_limit the wire actually
declared — a DRS fingerprint inconsistency (currently latent because every
profile maps to the same 16384 cap, concrete once a smaller-record profile ships).

apply_profile_record_size_limit now additionally clamps to
platform_record_size_floor(): the floor record_size_limit across every profile the
platform may select. This makes the decorator consistent with whichever profile
the wire used, independent of which profile config-time selection landed on. The
per-profile clamp is unchanged and the floor is a no-op while all profiles share
the same effective cap, so test_stealth_config_profile_record_limit_consistency
still holds; the darwin source-scan contract tracks the new call signature.

The temporal-divergence suite is updated: the cosmetic config.profile-vs-wire
divergence (tests A/B/D) is retained and documented as now record-size-harmless, a
new test asserts the decorator cap never exceeds the platform floor, and the
firefox weight-slot tests (E/F/G) are rewritten for the F5 independent-slot fix.
Validated by source review; built/run on Linux CI only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
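
The floor clamp in the commit above can be sketched as follows. The function names follow the commit message, but the signatures and the 4096-byte profile are hypothetical, chosen only to show the case the commit calls latent today.

```python
def platform_record_size_floor(limits_by_profile, allowed_profiles):
    # Smallest record_size_limit among every profile the platform may select.
    return min(limits_by_profile[p] for p in allowed_profiles)


def apply_profile_record_size_limit(requested_cap, config_profile,
                                    limits_by_profile, allowed_profiles):
    # Existing per-profile clamp.
    cap = min(requested_cap, limits_by_profile[config_profile])
    # Additional clamp: stay consistent with whichever profile the wire used.
    return min(cap, platform_record_size_floor(limits_by_profile, allowed_profiles))


today = {"profile_a": 16384, "profile_b": 16384}
future = {"profile_a": 16384, "small_record_profile": 4096}  # hypothetical
```

While every profile shares the 16384 cap the floor changes nothing; once a smaller-record profile exists, the decorator cap drops to that floor even if config-time selection picked the larger profile.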
…de mobile)

Closes PR #21 review finding 2 (F2): the verified browser-capture iOS Chromium
lane (Chrome147_IOSChromium) was pinned to weight 0 in the effective profile
weights, so iOS had only the advisory utls IOS14 lane — the main active masking
risk vs DPI on iOS.

The effective-weights flatteners (effective_profile_weights_for_platform and the
config loader's flatten_profile_selection) now carve a slice (1/7, == 10 of the
default 70) of the iOS share for the verified iOS Chromium lane; the remainder
stays with IOS14. This is done at flatten time, so the mobile policy schema
(ios14 + android = 100) and its config-loader parsing are unchanged and remain
backward-compatible — no new policy field, no sum-rule change. The loader
flattener also now sets firefox149_macos26_3 to match the default path (a
follow-on to the F5 slot split).
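
The flatten-time carve can be illustrated with a short Python sketch. It assumes integer division for the 1/7 slice, which reproduces the "10 of the default 70" figure; the function name and dictionary keys are hypothetical, and how the real flatteners round other shares is not stated here.

```python
def carve_ios_chromium_slice(ios_share: int) -> dict:
    # One seventh of the iOS share goes to the verified iOS Chromium lane.
    chromium = ios_share // 7
    return {
        "ios14": ios_share - chromium,
        "chrome147_ios_chromium": chromium,
    }


default_split = carve_ios_chromium_slice(70)
```

Because the carve happens after the policy is parsed, the policy-level rule (ios14 + android = 100) is untouched and the two flattened iOS lanes still sum to the original iOS share.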

Honest residuals, documented rather than papered over: at the default Unknown
transport_confidence iOS still selects advisory IOS14 (a cross-layer-claim profile
must not be used without confidence evidence — the conservative default), and
Android still has no verified browser capture so its only lane is advisory.
Closing those is a corpus/provenance task (a real Android capture) and a
release_gating curation decision for the team, not something to fix by marking
advisory evidence as release-grade.

New test test_tls_mobile_release_grade_lane covers: non-zero iOS Chromium weight,
iOS reaching the verified lane at established confidence, iOS defaulting to
advisory at Unknown, and Android's advisory-only lane. Built/run on Linux CI only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@DavidOsipov DavidOsipov merged commit 3085f8e into master Jun 12, 2026
17 of 18 checks passed