test: stealth corpus real-dump similarity gates (T1–T12) by ichmagmaus111 · Pull Request #21 · telemt/tdlib-obf

ichmagmaus111 · 2026-06-09T13:44:12Z

What

Implements the Stealth Corpus Real-Dump Similarity test plan (T1–T12,
docs/Plans/STEALTH_CORPUS_REAL_DUMP_SIMILARITY_TEST_PLAN_2026-06-08.md):
move release-facing "similar to real browsers" claims onto fixture-derived,
fail-closed gates instead of self-calibrated generator output.

T1–T4 (evidence model): fixture-derived family-lane oracle + contracts;
truthful 1k iteration-tier naming; generated EvidenceFieldStatus /
per-field status / extension-count histograms into
ReviewedFamilyLaneBaselines.h; fail-closed gate on unavailable evidence.
T5 exact-field fixture gate · T6 extension-count similarity ·
T7 wire-length fixture gate (+ nightly Monte Carlo reclassified as a
generator-stability diagnostic) · T8 Chrome shuffle similarity ·
T9 remove silent empty-baseline early returns from
test_tls_multi_dump_windows_chrome_stats.cpp (fail-closed status asserts) ·
T10 docs (real-corpus gates vs seed-stress diagnostics) ·
T12 closeout handoff artifact.

32 files changed (+1950 / −104), per-task commits.

Sanctioned plan deviations (documented in-code, TDD §4.4)

- T7 wire-length is NOT byte-exact (plan drafted tolerance 0.0).
  TlsHelloBuilder.cpp injects 0..255 B of per-build padding-target entropy
  (anti-DPI), so the wire length is non-deterministic by design and a
  byte-exact gate could never go green without removing a security feature.
  Uses an entropy-bounded, fixture-anchored envelope instead.
- T8 chromium extension-set is checked via the reviewed order-template
  catalog, not forced to Exact. The chromium_linux_desktop cohort
  genuinely pools sets of size 15/16/17, so by the plan's own Exact rule it
  is a Catalog with a correctly-empty collapsed invariant; forcing Exact
  would fabricate evidence.

Verification state — READ BEFORE MERGE

✅ Python analysis suites pass locally (oracle / iteration-tier /
release-gate contracts — real red→green for T9/T10).
✅ C++ source-verified against real headers and existing passing tests
(generator behavior for T5/T6 is pinned by existing
test_tls_corpus_ios_apple_tls_1k.cpp and *_baseline.cpp suites).
⚠️ C++ was NOT compiled or run, and sanitizers were NOT run — authored on
macOS, which cannot build tdlib-obf (zlib≥1.3.2 gate, missing htole32/64,
std::atomic<std::shared_ptr> unsupported by Apple libc++; the sanitizer
matrix is pinned to clang-22/gcc-16). The first Linux CI run
(build → ctest → tools/ci/run_sanitizer_matrix.py) is the real gate.

Residual runtime risk to watch on CI

- T9 site-1 (test_tls_multi_dump_windows_chrome_stats.cpp): removing the
  early return makes matches_exact_invariants + the conditional
  covers_observed_ech_payload_length run for Chrome147_Windows for the
  first time. The matches_exact_invariants part only enforces universal
  version/record constants (chromium_windows is a Catalog cohort), but the ECH
  payload coverage against the reviewed chromium_windows catalog is newly
  exercised; confirm on CI.
- T7 firefox wire-length tolerance is reasoned (same mechanism as the
  passing apple baseline suite) but not locally executed.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Convert 1k-named corpus suites to the spot-or-full tier (was kQuickIterations=3), remove the now-unused quick_seed helpers, and rename the seedless structural cross_platform_contamination suite away from the 1k label. Add a contract test enforcing truthful 1k naming. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Generate EvidenceFieldStatus per release-critical field, an extension-count histogram, and observed record/handshake length catalogs into ReviewedFamilyLaneBaselines.h. Regenerating also resyncs the previously stale committed header. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a C++ gate asserting release-critical fields are enforceable (Exact/Catalog/Policy) for reviewed lanes and Unavailable for synthetic fail-closed lanes. Treat authoritative multi-browser field divergence as a membership Catalog rather than Mixed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
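A minimal sketch of the status rule this commit describes, written in Python rather than the C++ of the real gate. The status names come from the generated EvidenceFieldStatus; `classify_field` and `lane_gate_ok` are hypothetical helpers, and the classification rule (agreement gives Exact, divergence among authoritative sources gives Catalog, no evidence gives Unavailable) is an assumption about how the oracle behaves, not a copy of it.

```python
from enum import Enum


class EvidenceFieldStatus(Enum):
    EXACT = "Exact"
    CATALOG = "Catalog"
    POLICY = "Policy"
    MIXED = "Mixed"
    UNAVAILABLE = "Unavailable"


# Statuses a release gate can actually enforce, per the commit message.
ENFORCEABLE = {
    EvidenceFieldStatus.EXACT,
    EvidenceFieldStatus.CATALOG,
    EvidenceFieldStatus.POLICY,
}


def classify_field(observed_values):
    """Hypothetical classifier: one hashable value per authoritative source."""
    distinct = set(observed_values)
    if not distinct:
        return EvidenceFieldStatus.UNAVAILABLE  # synthetic lane, no reviewed evidence
    if len(distinct) == 1:
        return EvidenceFieldStatus.EXACT  # every source agrees
    return EvidenceFieldStatus.CATALOG  # authoritative divergence becomes membership


def lane_gate_ok(statuses, reviewed):
    """Reviewed lanes must be enforceable; synthetic lanes must stay Unavailable."""
    if reviewed:
        return all(status in ENFORCEABLE for status in statuses.values())
    return all(status is EvidenceFieldStatus.UNAVAILABLE for status in statuses.values())
```

Under these assumptions a reviewed lane with one Unavailable field fails the gate, and so does a synthetic lane that suddenly reports Exact.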

Drive the generator over many seeds for chromium_linux_desktop, firefox_linux_desktop and apple_ios_tls and require every emitted ClientHello to match the reviewed exact invariants. Assert the cipher-suite, extension-set and supported-version evidence statuses are enforceable first, so an Unavailable or Mixed regression fails the gate instead of passing vacuously. ECH mode per family follows the reviewed ech_presence_required (Rfc9180Outer for chromium/firefox, Disabled for apple_ios_tls). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
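The vacuous-pass problem mentioned above can be illustrated with a small Python sketch. The stand-in `matches_exact_invariants` only mirrors the behaviour described in this PR (an empty invariant is skipped); the dictionary layout, the string statuses, and the `gate` helper are assumptions made for illustration.

```python
ENFORCEABLE = {"Exact", "Catalog", "Policy"}


def matches_exact_invariants(invariants, hello):
    """Stand-in matcher: fields with an empty invariant are skipped."""
    for field, expected in invariants.items():
        if not expected:
            continue  # nothing reviewed to compare against
        if hello.get(field) != expected:
            return False
    return True


def gate(statuses, invariants, hello, critical_fields):
    """Check evidence status first, so missing evidence cannot pass silently."""
    for field in critical_fields:
        if statuses.get(field, "Unavailable") not in ENFORCEABLE:
            return False  # fail closed before the matcher can skip the field
    return matches_exact_invariants(invariants, hello)
```

With an empty `cipher_suites` invariant the matcher alone accepts any ClientHello, while `gate` rejects it as soon as the field's status is Unavailable or Mixed.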

Add a per-family gate requiring every generated ClientHello's non-GREASE, non-padding extension count to appear in the reviewed extension-count histogram Catalog for that lane. The count metric matches the generator script's histogram derivation (GREASE and padding 0x0015 excluded), so the generated counts (chromium 16, firefox 17, apple_ios_tls 13) each land in their reviewed Catalog ({15,16,17} / {17} / {13}) rather than passing a broad envelope. ECH mode per family follows ech_presence_required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
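The count metric can be sketched in a few lines of Python. The GREASE test follows the reserved values 0x0A0A through 0xFAFA, and 0x0015 is the padding extension named above; the helper names and the example extension list are illustrative, and the catalog values are the ones quoted in this commit message.

```python
PADDING_EXTENSION = 0x0015


def is_grease(ext_type):
    """True for the reserved GREASE values 0x0A0A, 0x1A1A, ..., 0xFAFA."""
    return (ext_type & 0x0F0F) == 0x0A0A and (ext_type >> 8) == (ext_type & 0xFF)


def counted_extensions(extension_types):
    """Number of extensions after dropping GREASE and padding."""
    return sum(
        1 for t in extension_types if not is_grease(t) and t != PADDING_EXTENSION
    )


def count_in_catalog(extension_types, catalog):
    """Fail closed on an empty catalog, otherwise require membership."""
    return bool(catalog) and counted_extensions(extension_types) in catalog


# Illustrative ClientHello: GREASE, 16 ordinary extension types, padding.
example = [0x0A0A] + list(range(16)) + [PADDING_EXTENSION]
chromium_catalog = {15, 16, 17}
firefox_catalog = {17}
```

Here `count_in_catalog(example, chromium_catalog)` holds because the counted value is 16, while the same hello is rejected against the single-entry firefox catalog.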

Add a release-facing wire-length gate that bounds the generated ClientHello length against the reviewed fixture wire-length Catalog (fail-closed on Unavailable evidence), replacing reliance on the self-calibrated nightly Monte Carlo envelope. Reclassify the nightly Monte Carlo as a diagnostic generator-stability suite (comment-only; behavior unchanged). Plan deviation, documented in-file per TDD sec 4.4: the gate is NOT byte-exact, although the plan drafted a tolerance of 0.0. TlsHelloBuilder injects 0..255 B of per-build padding-target entropy as an anti-DPI feature, so the emitted length is non-deterministic by design and a single-byte-exact gate could never go green without removing that security jitter. The tolerance is derived from that documented entropy budget plus the SNI-length delta; it stays fixture-anchored (real dump lengths, not generator self-sampling), which is the similarity guarantee the broad envelope lacked. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Separate Chrome anchored-shuffle legality and diversity from real-corpus evidence: generated chromium orders must be legal anchored permutations with no duplicate extension types, must not collapse to a degenerate set of sequences, and must use an extension set actually observed in the reviewed corpus; fixed-order apple_ios_tls must equal its single reviewed template without spurious shuffle. Plan deviation, documented in-file per TDD sec 4.4: the draft compared the generated set to invariants.non_grease_extension_set and assumed chromium could be made Exact. The chromium_linux_desktop cohort genuinely pools sets of size 15/16/17 (ECH/ALPS vary by source), so by the plan's own Exact rule that field is a per-template Catalog with a correctly empty collapsed invariant; forcing Exact would fabricate evidence. The set check instead requires membership in the reviewed order-template set catalog, which is the fixture-derived guarantee intended. No oracle/header change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
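The three checks named above (legality, diversity, set membership) can be sketched as follows. This is an illustration under assumptions: "anchored" is taken to mean that certain extension types keep their template index, and the helper names, the toy extension numbers, and the diversity threshold are all hypothetical rather than taken from test_tls_generator_shuffle_similarity.cpp.

```python
def is_legal_anchored_permutation(order, template, anchored):
    """No duplicates, same extension types as the template, anchors unmoved."""
    if len(set(order)) != len(order):
        return False  # duplicate extension type
    if sorted(order) != sorted(template):
        return False  # not a permutation of the template
    return all(order[i] == t for i, t in enumerate(template) if t in anchored)


def set_in_catalog(order, observed_sets):
    """The unordered extension set must be one actually observed in the corpus."""
    return frozenset(order) in observed_sets


def is_diverse(orders, min_distinct):
    """Many seeds must not collapse to a degenerate handful of sequences."""
    return len({tuple(order) for order in orders}) >= min_distinct


template = [10, 1, 2, 3, 99]  # toy values, 10 and 99 play the anchored roles
anchored = {10, 99}
observed_sets = {frozenset(template)}
```

A fixed-order lane such as apple_ios_tls corresponds to the degenerate case where every generated order must equal the template exactly.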

Add test_similarity_release_gate_contract.py asserting release similarity suites never early-return on empty/unreviewed baselines, and replace the three such skips in test_tls_multi_dump_windows_chrome_stats.cpp with fail-closed status assertions (ASSERT_NE Unavailable/Mixed; ASSERT_FALSE empty wire-length catalog). chromium_windows non_ru_egress is a populated multi-source Catalog (cipher/ext_set/groups/alpn/compress empty exact invariants by design; supported_versions Exact; wire lengths populated), so the matcher enforces the populated fields without the skip. test_tls_multi_dump_ios_apple_tls_stats.cpp needed no change: it has no empty-baseline early returns (verified by the new contract). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
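A source-scan contract of this kind could look roughly like the sketch below. It is not the contents of test_similarity_release_gate_contract.py; the regular expression is a deliberately narrow heuristic and the C++ snippets are invented for illustration.

```python
import re

# Heuristic: `if (<something>.empty()) return;` with optional braces.
# A negated condition such as `if (!x.empty())` is intentionally not matched.
EARLY_RETURN_ON_EMPTY = re.compile(
    r"if\s*\(\s*[\w.\->:]+\.empty\(\)\s*\)\s*\{?\s*return\s*;"
)


def find_empty_baseline_skips(source):
    """Return every early-return-on-empty snippet found in C++ source text."""
    return [match.group(0) for match in EARLY_RETURN_ON_EMPTY.finditer(source)]


skipping_source = """
  if (baseline.wire_lengths.empty()) {
    return;
  }
"""

fail_closed_source = """
  ASSERT_FALSE(baseline.wire_lengths.empty());
"""
```

A contract test would then assert that `find_empty_baseline_skips` returns an empty list for every release similarity suite, so a reintroduced skip turns the Python suite red before the C++ gate can pass vacuously.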

Add a doc contract (test_docs_separate_similarity_gates_from_seed_stress) and the validation-topology / lessons paragraphs it pins: real-corpus similarity gates consume reviewed fixtures and fail closed on unavailable/mixed evidence, whereas seed-stress diagnostics prove generator stability and may not serve as release denominators. Records the wire-length padding-entropy lesson behind the T7 deviation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Closeout handoff for the stealth-corpus-real-dump-similarity phase. Records honest per-command status: the three plan-specified Python analysis suites pass locally and all five C++ similarity gates are CMake-registered, while C++ build/run_all_tests/ctest are marked not_run (macOS cannot build TDLib) and deferred to Linux CI. Documents the two sanctioned plan deviations (Task 7 entropy-bounded wire length, Task 8 template-catalog set membership) and the Task 9 residual ECH-coverage risk for CI to confirm. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ASSERT_EQ/ASSERT_NE format both operands via StringBuilder on failure, but the generated enum class EvidenceFieldStatus has no StringBuilder operator<<, so those assertions failed to compile (Core gcc15/clang22, ASan/UBSan, TSan all stopped on test_tls_multi_dump_windows_chrome_stats.cpp). Replace the enum ASSERT_EQ/NE with ASSERT_TRUE(status == / != ...), the same idiom already used in test_tls_generator_fixture_exact_fields_gate.cpp (which compiled). No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…S1192)

test_tls_generator_shuffle_similarity.cpp repeated the string literals "chromium_linux_desktop" and "non_ru_egress" 3x each, which SonarCloud's C++ profile flags as CRITICAL cpp:S1192. Hoist them to kChromiumLinuxDesktop / kNonRuEgress Slice constants. Pre-emptive: SonarCloud has not yet analyzed PR #21 (the sonar CI job fails at build on the pre-existing logging.cpp std::atomic<std::shared_ptr> under libc++, so no analysis uploads). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…th in release gates

Closes PR #21 review findings 2 and 3 (release-facing similarity gates were weaker than their names implied).

Finding 2: the exact-field gate skipped catalog-backed critical fields. test_tls_generator_fixture_exact_fields_gate previously called only matches_exact_invariants(), which skips any field whose ExactInvariants entry is empty. Those are precisely the Catalog-status fields where reviewed sources legitimately disagree (apple cipher/groups/versions/alpn, chromium/firefox extension set), so a generator drift in those fields could survive the gate.
- build_family_lane_baselines.py now emits per-field observed-value catalogs (observed_cipher_suite_sequences, observed_extension_sets, observed_supported_versions_sequences) into SetMembershipCatalog; the header is regenerated, byte-deterministic, and matches the generator self-test.
- FamilyLaneMatcher::matches_release_critical_field() dispatches on EvidenceFieldStatus: Exact -> non-empty exact equality; Catalog -> membership in the observed catalog; Policy -> fail closed (no named matcher yet); Unavailable/Mixed -> fail closed.
- The gate now runs that dispatch for cipher suites, extension set and supported versions in addition to matches_exact_invariants, and adds mutant/negative tests proving a wrong cipher list, extension set, or supported-versions list fails for both Exact and Catalog status.

Finding 3: the wire-length gate used a broad 15% envelope that admitted lengths present in no reviewed dump (e.g. firefox ~1606..2545 vs observed {1890..2213}).
- FamilyLaneMatcher::within_wire_length_byte_model() bounds the generated length to within max_byte_delta of some observed sample, expressed in bytes.
- test_tls_generator_wire_length_fixture_gate derives the budget from the generator mechanism: 255 B padding-target entropy (TlsHelloBuilder rng.bounded(256u)) + a fixture-derived 16 B SNI-length delta, replacing the arbitrary 15%.
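The difference between the two wire-length checks in Finding 3 can be made concrete with a short Python sketch. The 255 B and 16 B figures and the firefox endpoints 1890 and 2213 are quoted from this commit message; the function bodies are an assumed reading of the described behaviour, and only the two quoted endpoints are used as samples because the full observed catalog is not reproduced here.

```python
PADDING_ENTROPY_BYTES = 255   # per-build padding-target entropy
SNI_LENGTH_DELTA_BYTES = 16   # fixture-derived SNI-length delta
MAX_BYTE_DELTA = PADDING_ENTROPY_BYTES + SNI_LENGTH_DELTA_BYTES  # 271


def within_wire_length_byte_model(length, observed, max_byte_delta):
    """Accept only lengths within max_byte_delta of some observed sample."""
    if not observed:
        return False  # fail closed on unavailable evidence
    return any(abs(length - sample) <= max_byte_delta for sample in observed)


def within_percent_envelope(length, observed, percent):
    """The older, broader check: a percent band around the min and max."""
    if not observed:
        return False
    low = min(observed) * (1 - percent)
    high = max(observed) * (1 + percent)
    return low <= length <= high


observed_firefox = [1890, 2213]  # only the endpoints quoted in the message
```

A generated length of 1610 bytes sits inside the 15% band (its lower edge is about 1606) yet is 280 bytes away from the nearest sample, so the byte model rejects it while the percent envelope would have let it through.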
within_wire_length_envelope() is retained for the nightly self-calibrated Monte Carlo diagnostic. Python generator self-test and the three PR analysis suites pass locally; the C++ gates are validated on Linux CI (tdlib-obf does not build on macOS). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
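The status dispatch described under Finding 2 can be sketched as below. The branch semantics follow the commit message (Exact requires a non-empty exact value, Catalog requires membership, everything else fails closed); the Python signature, the string statuses, and the tuple encoding of cipher lists are assumptions, not the FamilyLaneMatcher API.

```python
def matches_release_critical_field(status, generated, exact_value, observed_catalog):
    """Dispatch on evidence status; anything not provably matched fails closed."""
    if status == "Exact":
        # An empty exact invariant must not count as a match.
        return bool(exact_value) and generated == exact_value
    if status == "Catalog":
        return generated in observed_catalog
    # Policy (no named matcher yet), Unavailable, Mixed, or an unknown status.
    return False


reviewed_ciphers = (0x1301, 0x1302, 0x1303)
observed_cipher_catalog = {reviewed_ciphers, (0x1301, 0x1303, 0x1302)}
```

A mutated cipher list then fails under both statuses: it is unequal to the Exact value and absent from the Catalog, which is the property the mutant/negative tests in this commit are described as proving.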

Maps each review finding (PR21_STEALTH_CORPUS_SIMILARITY_REVIEW_2026-06-11.md) to its remediation, branch, and commit, and records the Linux-CI run commands (Finding 4). Findings 2 and 3 land on this branch; the five runtime risks (Finding 1, F1-F5) land on stealth-runtime-hardening, split out per the review's own recommendation. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Closes PR #21 review finding 1 (F1): create_transport() silently downgraded an emulate_tls() ObfuscatedTcp connection to a plain tcp::ObfuscatedTransport when make_transport_stealth_config or StealthTransportDecorator::create failed, putting the unmasked legacy obfuscated-MTProto fingerprint that emulate_tls was meant to hide on the wire — a masking downgrade exploitable by DPI. Both failure branches now return a FailClosedStealthTransport instead of a working legacy transport. It keeps the ObfuscatedTcp type for upstream logging but refuses to operate: write() drops outbound data and can_write() is false so the engine never hands it un-shaped bytes, and read_next() returns an error so the MTProto reconnect path tears the connection down (and keeps refusing) rather than silently using an un-shaped channel. This matches the existing compiled-out (#else) policy that already LOG(FATAL)s rather than fall back to legacy fingerprinting. The structured WARNING and secret-sanitised status message are preserved; the message changes from "disabled" to "unavailable; refusing ... (fail-closed)". test_stream_transport_activation_fail_closed is updated to assert the fail-closed contract (can_write() false, read_next() errors) instead of the old downgrade, and the redaction/multiline/non-ascii log tests track the new wording. Validated by source review; tdlib-obf C++ builds and runs on Linux CI only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
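The fail-closed contract described above can be modelled in Python as follows. The class and factory names are hypothetical stand-ins for FailClosedStealthTransport and create_transport(), the error is raised as an exception where the C++ code returns an error status, and only the exception type (never its message) is recorded so that no secret-bearing text reaches the reason string.

```python
class FailClosedTransport:
    """Keeps the connection slot occupied but never moves un-shaped bytes."""

    def __init__(self, reason):
        self.reason = reason
        self.dropped_bytes = 0

    def can_write(self):
        return False  # the engine is never invited to hand over data

    def write(self, data):
        self.dropped_bytes += len(data)  # dropped, nothing reaches the wire

    def read_next(self):
        # Surfacing an error lets the reconnect path tear the connection down.
        raise ConnectionError(self.reason)


def create_transport(build_stealth_transport):
    """Return the stealth transport, or a refusing one if shaping fails."""
    try:
        return build_stealth_transport()
    except Exception as error:
        reason = f"stealth shaping unavailable ({type(error).__name__}); refusing (fail-closed)"
        return FailClosedTransport(reason)
```

The point of the design is that a failure in the stealth path leaves no branch that returns a working legacy transport, so a downgrade cannot happen by accident.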

…weight slots

Closes PR #21 review findings 4 and 5.

F4: population-correlation defence. stable_selection_hash mixed only destination, time bucket, and platform hints, so every installation on the same proxy/destination/platform/time bucket deterministically selected the same profile, a synchronized, DPI-correlatable population pattern. A per-install salt is now mixed in (set_per_install_selection_salt). The salt is opt-in and host-supplied: the host generates it once per installation, persists it, and re-applies it on every launch (a salt that rotated each start would itself become a fingerprint). It is intentionally NOT auto-minted inside the library, so it couples profile choice to no unrelated global state; with no salt set it stays 0 and selection is byte-for-byte the legacy deterministic vector, leaving every existing selection test unaffected. New API: set/get/reset_per_install_selection_salt.

F5: independent firefox weight slots. Firefox148 (Linux) and Firefox149_MacOS26_3 (macOS) aliased the single `firefox148` weight slot, so an operator could not tune or zero one Firefox lane without disabling the other. Firefox149_MacOS26_3 now has its own ProfileWeights slot, bridged from the darwin firefox ratio so default effective weights, and therefore selection behaviour, are unchanged; only independent tunability is added. profile_weight, profile_weight_for_runtime_validation, validate_profile_weights and the Darwin allowed-weight check are updated; zero_profile_weights helpers and the defaults contract track the new slot.

New regression tests: test_tls_profile_selection_per_install_entropy (different salts de-correlate; fixed/zero salt deterministic) and test_tls_profile_firefox_weight_independence (zeroing one firefox lane leaves the other selectable). Validated by source review; built/run on Linux CI only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
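A rough Python model of the salted selection and the independent weight slots follows. SHA-256, the byte layout, and the `pick_profile` helper are assumptions chosen for the sketch and say nothing about the hash the library actually uses; the property being illustrated is only that a zero salt leaves the legacy value untouched while distinct non-zero salts change it.

```python
import hashlib


def stable_selection_hash(destination, time_bucket, platform, salt=0):
    """Deterministic 64-bit value; salt 0 reproduces the unsalted result."""
    h = hashlib.sha256()
    h.update(destination.encode("utf-8") + b"\x00")
    h.update(time_bucket.to_bytes(8, "big"))
    h.update(platform.encode("utf-8") + b"\x00")
    if salt != 0:
        # Assumes a non-negative salt that fits in 64 bits.
        h.update(salt.to_bytes(8, "big"))
    return int.from_bytes(h.digest()[:8], "big")


def pick_profile(profiles, weights, selection_hash):
    """Weighted pick; a lane with weight 0 can never be chosen."""
    total = sum(weights)
    if total == 0:
        raise ValueError("no selectable profile")
    point = selection_hash % total
    for profile, weight in zip(profiles, weights):
        if point < weight:
            return profile
        point -= weight
    raise AssertionError("unreachable when total > 0")
```

With separate slots, passing weights `[0, 10]` for `["Firefox148", "Firefox149_MacOS26_3"]` zeroes one Firefox lane and still leaves the other selectable, which is the behaviour the F5 regression test is described as covering.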

Closes PR #21 review finding 3 (F3). StealthConfig::from_secret selects a profile at config-construction time (T1) to set the decorator's record-size cap, while TlsInit::send_hello independently selects the ClientHello profile at hello-send time (T2). Across a sticky-rotation-window boundary the two pick_runtime_profile calls can land on different profiles, so a cap tied only to config.profile could let the post-handshake records exceed the record_size_limit the wire actually declared — a DRS fingerprint inconsistency (currently latent because every profile maps to the same 16384 cap, concrete once a smaller-record profile ships). apply_profile_record_size_limit now additionally clamps to platform_record_size_floor(): the floor record_size_limit across every profile the platform may select. This makes the decorator consistent with whichever profile the wire used, independent of which profile config-time selection landed on. The per-profile clamp is unchanged and the floor is a no-op while all profiles share the same effective cap, so test_stealth_config_profile_record_limit_consistency still holds; the darwin source-scan contract tracks the new call signature. The temporal-divergence suite is updated: the cosmetic config.profile-vs-wire divergence (tests A/B/D) is retained and documented as now record-size-harmless, a new test asserts the decorator cap never exceeds the platform floor, and the firefox weight-slot tests (E/F/G) are rewritten for the F5 independent-slot fix. Validated by source review; built/run on Linux CI only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
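The floor clamp can be sketched in Python as below. The function names echo the commit message, but the signatures and the dictionary of per-profile limits are assumptions, and the 4096-byte profile is a hypothetical future lane used only to show when the floor stops being a no-op.

```python
def platform_record_size_floor(profile_limits):
    """Smallest record_size_limit among the profiles the platform may select."""
    return min(profile_limits.values())


def apply_profile_record_size_limit(current_cap, config_profile, profile_limits):
    """Clamp to the config-time profile, then to the platform-wide floor."""
    cap = min(current_cap, profile_limits[config_profile])
    return min(cap, platform_record_size_floor(profile_limits))


today = {"Chrome147_Windows": 16384, "Firefox148": 16384}
# Hypothetical: a smaller-record profile ships later.
future = {"Chrome147_Windows": 16384, "SmallRecordProfile": 4096}
```

While every profile shares 16384 the floor changes nothing; once a smaller profile exists, the cap follows the floor even if config-time selection landed on a larger profile than the one the wire later declared.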

…de mobile)

Closes PR #21 review finding 2 (F2): the verified browser-capture iOS Chromium lane (Chrome147_IOSChromium) was pinned to weight 0 in the effective profile weights, so iOS had only the advisory utls IOS14 lane, the main active masking risk vs DPI on iOS. The effective-weights flatteners (effective_profile_weights_for_platform and the config loader's flatten_profile_selection) now carve a slice (1/7, == 10 of the default 70) of the iOS share for the verified iOS Chromium lane; the remainder stays with IOS14. This is done at flatten time, so the mobile policy schema (ios14 + android = 100) and its config-loader parsing are unchanged and remain backward-compatible: no new policy field, no sum-rule change. The loader flattener also now sets firefox149_macos26_3 to match the default path (a follow-on to the F5 slot split).

Honest residuals, documented rather than papered over: at the default Unknown transport_confidence iOS still selects advisory IOS14 (a cross-layer-claim profile must not be used without confidence evidence; this is the conservative default), and Android still has no verified browser capture so its only lane is advisory. Closing those is a corpus/provenance task (a real Android capture) and a release_gating curation decision for the team, not something to fix by marking advisory evidence as release-grade.

New test test_tls_mobile_release_grade_lane covers: non-zero iOS Chromium weight, iOS reaching the verified lane at established confidence, iOS defaulting to advisory at Unknown, and Android's advisory-only lane. Built/run on Linux CI only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
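The flatten-time carve is simple arithmetic, sketched here in Python. The 1/7 ratio and the default share of 70 come from the commit message; the helper name and the use of floor division for shares that are not multiples of 7 are assumptions.

```python
def flatten_ios_share(ios_share):
    """Split the policy-level iOS share between the verified and advisory lanes."""
    chromium = ios_share // 7  # 1/7 of the share, rounded down (assumed rounding)
    return {
        "Chrome147_IOSChromium": chromium,
        "IOS14": ios_share - chromium,  # remainder stays with the advisory lane
    }
```

For the default share, `flatten_ios_share(70)` yields 10 for the verified lane and 60 for IOS14, and the two always sum back to the input, which is why the policy schema's ios14 + android = 100 rule needs no change.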

ichmagmaus and others added 21 commits June 8, 2026 21:42

test: add fixture-derived family lane oracle contracts

df6b019

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(stealth): harden runtime profile rotation and release gating

3b9d852

Fixed sqlite issues

7f52cc6

DavidOsipov merged commit 3085f8e into master Jun 12, 2026
17 of 18 checks passed

DavidOsipov merged 21 commits into master from stealth-corpus-real-dump-similarity
