An attempt to make a stability-first optimizer inspired by neural homeostasis and multi-scale memory, extending AdamW for heterogeneous architectures.
Biological neurons maintain homeostasis, stable internal activity levels even when stimuli vary. BiostatisV6 brings this principle to deep optimization by monitoring gradient “activity” and coherence across the model and dynamically adjusting learning behavior to remain within a healthy range.
- The optimizer accumulates all gradients into a flat tensor.
- It computes:
-
Global energy: average gradient magnitude
$$E_t = \text{mean}(g_t^2)$$ -
Global coherence:
$$C_t = \text{mean}(|\tanh(g_t)|)$$ which measures gradient activation saturation:- Small gradients
$(|g| < 1)$ →$\tanh(g) \approx g$ : linear, low activation - Large gradients
$(|g| > 2)$ →$\tanh(g) \to \pm 1$ : saturated, high activation -
$C_t \in [0, 1]$ thus reflects how "excited" or "quiet" the overall network is.
- Small gradients
-
Global energy: average gradient magnitude
- A homeostatic modulation term gently keeps coherence near the target (≈0.8), preventing either overexcitation or gradient collapse.
- An energy feedback term ensures global gradient magnitude doesn't overshoot or undershoot.
- Each parameter performs standard Adam-style momentum updates with bias correction.
- Two multi-scale exponential memories (decay = 0.9, 0.99) smooth variance across short and long time horizons — analogous to short-term and long-term synaptic memory.
- Coherence modulation: Uses cosine similarity between the momentum and current gradient to align directions (directional coherence).
- Selective ascent: If a parameter's signal-to-noise ratio is low, it adds a small "ascent" term to escape sharp minima.
-
Local homeostasis: Per-parameter energy and activation (via
$|\tanh(g)|$ ) are regulated to keep them in stable ranges.
- The optimizer merges global and local control:
$$h_t = 0.3h_{\text{global}} + 0.7h_{\text{local}}, \quad e_t = 0.3e_{\text{global}} + 0.7e_{\text{local}}$$ - This gives each parameter local adaptivity while maintaining mild global consistency — crucial for heterogeneous architectures (e.g., LLMs with both FFN and linear attention blocks).
Where:
-
$F_t$ : multi-scale memory (fractional gradient trace) -
$\tilde{g}_t$ : polarity-aligned gradient -
$h_t, e_t$ : hierarchical homeostatic gains - Weight decay is decoupled (AdamW-style)
Gradients → tanh → |·| → mean = global coherence
↓
Homeostasis feedback ← targets ~0.8 activation
↓
Local updates → multi-scale memory → directional modulation
↓
Hierarchical merge (30% global, 70% local)
↓
Final adaptive step
for each param group:
g_cat = concat(all grads)
global_energy = mean(g_cat**2)
global_coherence = mean(abs(tanh(g_cat)))
# weak global homeostasis (0.5× weaker, blended, clamped)
h_g = 1 - 0.5*ρ*tanh(global_coherence - c_target)
e_g_raw = 1 + 0.5*λ*(E_target - global_energy)
e_g = clip(0.9*e_g_raw + 0.1, [0.925, 1.075])
for each parameter p:
m = β1*m + (1-β1)*g
v = β2*v + (1-β2)*g²
# multi-scale memory
ema_i = ρ_i*ema_i + (1-ρ_i)*g
energy_flow = Σ(w_i * ema_i)
# coherence modulation
polarity = 0.5*sign(g)*tanh(cos(m, g))
adaptive_grad = g*(1+polarity)
if importance(m, v) < threshold:
adaptive_grad += ascent_strength*g
# local homeostasis (full strength, blended, clamped)
h_l = 1 - ρ*tanh(local_coherence - c_target)
e_l_raw = 1 + λ*(E_target - local_energy)
e_l = clip(0.8*e_l_raw + 0.2, [0.85, 1.15])
# hierarchical blend
h = 0.3*h_g + 0.7*h_l
e = 0.3*e_g + 0.7*e_l
# final update
Δθ = -lr * h * e * (m/√v + 0.05*energy_flow + 0.01*adaptive_grad)
θ ← θ * (1 - lr*wd) + Δθ| Task | Model | Metric | AdamW | BiostatisV6 | Δ | Time Overhead |
|---|---|---|---|---|---|---|
| CIFAR-10 | ResNet18 | Acc | 69.23% | 72.01% | +2.78% | +67% |
| CIFAR-100 | ResNet18 | Acc | 49.78% | 52.06% | +2.28% | +60% |
| Shakespeare | GPT-1 | Train PPL | 17.39 | 13.27 | -23.7% | +23% |
| WikiText-103 | GPT-2 (25M) | Val PPL | 249.68 | 248.21 | +0.59% | +34% |
| Optimizer | Top-5 Singular Value Concentration | Effective Rank |
|---|---|---|
| AdamW | 0.555 | 9.83 |
| BiostatisV6 | 0.596 | 9.41 |
➡️ Higher concentration & lower rank → more compact, structured representations.
✅ Recommended For:
- Vision tasks (CNNs, ResNets)
- Small or noisy language models (GPT-1, Shakespeare)
- Architectures mixing different blocks (e.g. FFN + linear attention)
- Non-convex or noisy optimization landscapes
- Mid-scale clean transformers (25M–100M params)
- Compute-limited setups (≈30–70% slower than AdamW)
❌ Not Ideal:
- Extremely stable datasets or latency-critical systems
from biostatis import BiostatisV6
optimizer = BiostatisV6(
model.parameters(),
lr=1e-3,
weight_decay=1e-2,
coherence_target=0.8,
energy_target=1e-3
)- ~1.9× optimizer memory (multi-scale EMAs)
- Slower iteration throughput (≈0.6× AdamW)
- Sparse gradients not yet supported
“BiostatisV6 stabilizes gradient dynamics via hierarchical homeostasis. It consistently improves vision tasks (+2–3%) and small LMs (+24%), while maintaining parity on well-behaved large models (GPT-2). This project was developed as a student-driven exploration into optimizer design, following theoretical norms as closely as possible. V6 wasn’t built to outperform AdamW on a specific dataset. It was built from curiosity, a remix of biological mechanisms and deep learning optimization, crafted from scratch to see what emerges Any constructive feedbacks are always welcomed.”
Future work:
- Component-wise energy targets (per module)
- Dynamic α_global scheduling
- Broader transformer-scale tuning
@misc{choudhury2025biostatisv6,
author = {Hindol Roy Choudhury and Chunlin Huang},
title = {BiostatisV6: A Biologically-Inspired Optimizer with Hierarchical Homeostasis},
year = {2025},
note = {Student-led optimizer research project},
url = {https://github.com/Hindolch/V6-optimizer},
}