Skip to content

Latest commit

 

History

History
296 lines (216 loc) · 10 KB

File metadata and controls

296 lines (216 loc) · 10 KB

LayerCake Architecture

V2 portable-domain runtime

LayerCake v2 has two domain execution modes:

  1. Additive ABI brick: a sparse residual uses host ABI states and host logits. It is cheap and compositional, but only bounded transfer is expected across independent cores.
  2. PX portable domain: a versioned byte-level decoder uses deterministic causal features and owns the selected domain prediction path. Host weights are not inputs, so a verified unchanged artifact produces identical logits and PPL across hosts.

The PX artifact contains:

  • a domain identifier and format version;
  • deterministic byte-feature contract;
  • recurrent decoder architecture metadata;
  • quantization contract;
  • canonical spec hash;
  • payload hash covering every stored tensor;
  • optional training/evaluation metadata.

The runtime rejects modified metadata or tensors. The selected Python artifact has 148,736 parameters, occupies 594,944 bytes in fp32 or 148,808 bytes with symmetric per-tensor int8 storage, and is dequantized at load time. Native int8 execution is not yet implemented.

PX should be understood as an installable domain capsule, not as evidence that host cores share internal representations or general intelligence.

v2 canonical ABI and byte-patch path

The v2 research path is additive:

UTF-8 bytes
  -> byte embeddings
  -> fixed/heuristic causal patches
  -> patch-level core
  -> versioned canonical ABI
  -> top-k sparse low-rank domain brick
  -> patch expansion
  -> byte logits

ABISpec binds the coordinate width, normalization, basis, interface contract, quantization contract, compatible brick types, and a stable SHA-256 hash. Bricks validate this contract before execution. Tokenized and byte-patch interfaces may share an ABI version, but that alone does not align independently trained coordinates; paired alignment losses and measured PPL gates are required.

Canonical generalization contract

V2 resolves the legacy cross-seed failure with two deterministic components:

  1. canonical_anchors.py maps every observed byte prefix to a fixed normalized ABI target. Both byte and byte-patch cores learn against the same target, independent of seed or model width.
  2. canonical_brick_head() creates a fixed ABI-to-byte-logit projection. Brick deltas no longer depend on each core's independently initialized ABI decoder.

For patch n, the global patch state summarizes completed bytes through patch n. It aligns to the context used to predict patch n+1; aligning it to the current patch's BOS context is an off-by-one error and is explicitly avoided.

The selected compact path uses a continuous local GRU state across patch boundaries. The global transformer runs at one quarter of byte sequence length while the local decoder preserves exact causal byte history.

bytes [B,T]
  -> byte embeddings
  -> fixed patches [B,T/4,4]
  -> compact global transformer [B,T/4,d_model]
  -> canonical ABI context [B,T/4,d_abi]
  -> repeat context over local bytes
  -> continuous causal GRU [B,T,d_model]
  -> byte logits

brick(ABI) - ABI
  -> fixed canonical head
  -> additive byte-logit correction

Overview

LayerCake is a transformer-based language model with a fixed-dimension bottleneck interface called the ABI (Architecture Bottleneck Interface). The ABI sits between the core transformer and any domain-specific extensions, enabling domain modules to be shared across models of different sizes.


Core Architectural Components

1. Core Transformer

The core follows a standard pre-norm causal transformer architecture:

Input IDs [B, T]
    → token_emb [vocab × d_model]  +  pos_emb [max_seq_len × d_model]
    → N × SimpleTransformerBlock

SimpleTransformerBlock (identical in LayerCake and the baseline):

x = x + Attn(LayerNorm(x))   # Pre-norm multi-head self-attention
x = x + FFN(LayerNorm(x))    # Pre-norm feed-forward (GELU)

All blocks use pre-norm (LayerNorm before attention/FFN), which provides better training stability than post-norm.

2. ABI Projection (the key innovation)

After the core transformer, a linear projection maps to a fixed d_abi=512 dimension:

h_core  [B, T, d_model]
    → core_to_abi  Linear(d_model, d_abi=512, bias=False)
    → LayerNorm(d_abi)
    → h_abi  [B, T, 512]

For the 48M config, d_model = d_abi = 512 (the projection is the identity up to learned rotation). For the 150M config, d_model = 768 projects down to 512. For 350M, d_model = 1024 projects down to 512.

Why 512? It is large enough to carry rich semantic information while being small enough that domain modules are parameter-efficient (~1.05M for DomainModuleLite).

3. Domain Modules

Domain modules operate entirely in d_abi=512 space. They compute a residual delta:

$$h_\text{out} = h_\text{abi} + \exp(\alpha) \cdot \bigl(F(h_\text{abi}) - h_\text{abi}\bigr)$$

  • $\alpha$ is a learned scalar, initialized to 0 (so exp(0) = 1.0, identity delta at init)
  • $F$ is the domain module network
  • The delta is weighted by $\exp(\alpha)$ which is trained to scale the domain contribution

Two implementations are provided:

DomainModuleLite (~1.05M params) — recommended for most use cases:

h → LayerNorm → SwiGLU(Linear(d_abi → d_ff), Linear(d_abi → d_ff)) → Linear(d_ff → d_abi)
delta = output - h_abi (weighted by exp(α))

DomainModule (~6.3M params) — for maximum domain capacity:

h → 2 × SimpleTransformerBlock(d_abi)
delta = output - h_abi (weighted by exp(α))

Because d_abi=512 is fixed, all domain modules have exactly the same weight shapes regardless of which model size they are embedded in. This is what makes the state dict directly copyable.

4. ABI Back-Projection

After domain modules apply their deltas, the representation is projected back to d_model:

h_abi_out  [B, T, 512]
    → abi_to_core  Linear(512, d_model, bias=False)
    → h  [B, T, d_model]
    → h = h + h_core   (residual from core output — helps training)
    → LayerNorm → LM Head → logits  [B, T, vocab]

5. Domain Router (optional)

An optional learned router selects which domain modules to activate:

h_abi.mean(dim=1)  [B, d_abi]
    → Linear(d_abi, hidden=256) → GELU → Linear(256, num_domains)
    → sigmoid(logits / temperature)
    → domain_mask  [B, num_domains]   (soft mask in [0, 1])

When use_router=False (default), domain masks are passed explicitly or set to None (core-only mode).


Parameter Counts

For the 48M configuration (d_model=512, d_abi=512, n_layers=6, n_heads=8, d_ff=2048, vocab=16K):

Component Parameters
Token embedding 512 × 16000 = 8.19M
Positional embedding 512 × 256 = 0.13M
Core blocks (6×) 6 × ~6.3M = ~37.8M
core_to_abi 512 × 512 = 0.26M
abi_to_core 512 × 512 = 0.26M
ABI LayerNorm 512 × 2 = ~0
LM head (tied) shared with token_emb
Domain modules (per domain, Lite) ~1.05M
Router (optional) ~0.13M
Total (core-only, no domains) ~35.96M

For comparison, the baseline transformer with matched parameters has 35.96M parameters using the same block class. The ABI adds projections but they are matched by slight d_ff expansion in the baseline for a true apples-to-apples comparison.


Forward Pass (Core-Only Mode)

# No domain modules active — equivalent to standard transformer
logits, _ = model(input_ids, domain_mask=None)

In core-only mode, the ABI projections are identity operations with no semantic effect (at init, before training). The model behaves as a standard causal LM.

Forward Pass (Domain-Active Mode)

# Activate chess domain with weight 1.0
domain_mask = torch.zeros(batch_size, num_domains)
domain_mask[:, domain_names.index("chess")] = 1.0
logits, (router_logits, router_probs) = model(input_ids, domain_mask=domain_mask)

Forward Pass (Learned Router)

logits, (router_logits, router_probs) = model(
    input_ids,
    use_learned_router=True,
    router_temperature=1.0,
)

ABI Interface for Thinkers

The ABI can also be used as a frozen black-box interface for "thinker" modules that modulate representations without touching the core:

# Get frozen ABI states (gradients detached — core stays frozen)
h_abi = model.get_abi_hidden_states(input_ids)  # [B, T, 512]

# Thinker applies modulations to h_abi (not shown here)
h_abi_modulated = thinker(h_abi)

# Decode modulated states back to logits (gradients flow through thinker only)
logits = model.decode_from_abi(h_abi_modulated)

This allows thinker modules to be trained while the core weights are completely frozen.


Why the ABI Enables Lossless Portability

Consider two models:

  • Model A: d_model=512, d_abi=512. The core_to_abi projection is 512→512.
  • Model B: d_model=768, d_abi=512. The core_to_abi projection is 768→512.

The domain module in both is DomainModuleLite(d_abi=512). Its weight matrices have shapes:

  • ln.weight: [512]
  • gate_proj.weight: [1024, 512]
  • up_proj.weight: [1024, 512]
  • down_proj.weight: [512, 1024]
  • log_alpha: [1]

These shapes are identical in both models. There is no d_model in any of them.

A state_dict copy therefore transfers all weights bit-for-bit with no mismatches.

The results/paste_proof.json file records MD5 checksums of each tensor for a 37.5M model and a 112.2M model, confirming they are identical.


Comparison to Standard Transformer (Baseline)

The baseline_lm.py uses the same SimpleTransformerBlock as LayerCake's core, ensuring that any performance difference is attributable to architecture (ABI + domains), not to implementation details like weight initialization or block variants.

At 20K training steps with matched parameters (35.96M each):

  • LayerCake adds core_to_abi and abi_to_core projections and an ABI LayerNorm
  • The baseline has slightly wider FFN layers to compensate for parameter matching
  • Both share the same causal masking, optimizer, scheduler, and data

See results/fair_comparison.json for the full benchmark.