What
Train a small LoRA adapter (r=8, λ_LM=0.01) jointly with the probe on Gemma 4, following Obeso et al. §3 (https://arxiv.org/abs/2509.03531). Use CyberNative/Code_Vulnerability_Security_DPO chosen-vs-rejected pairs so the model defaults to safer patterns when answering ambiguous "quick prototype" prompts.
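As a rough illustration of the joint objective, the sketch below combines a probe loss with a language-modelling term weighted by λ_LM=0.01. It assumes the total is `probe_loss + lambda_lm * lm_loss`, with binary cross-entropy for the probe and mean token negative log-likelihood for the LM term. The exact weighting and regulariser in Obeso et al. §3 may differ, and the function names are hypothetical, not taken from scripts/train_lora.py.

```python
import math

LAMBDA_LM = 0.01  # value quoted in this issue


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy for one probe logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def joint_loss(probe_logits, probe_labels, token_logprobs, lambda_lm=LAMBDA_LM):
    """Assumed form: mean probe BCE + lambda_lm * mean token NLL."""
    probe_loss = sum(
        bce_with_logits(z, y) for z, y in zip(probe_logits, probe_labels)
    ) / len(probe_logits)
    lm_loss = -sum(token_logprobs) / len(token_logprobs)
    return probe_loss + lambda_lm * lm_loss
```

With a small λ_LM the probe term dominates the gradient, and the LM term acts mainly as a regulariser on the adapter.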
Why
The paper shows joint probe + LoRA training makes models more conservative: "more readily acknowledge uncertainty, explicitly recognize when they might be generating unreliable information." The cybersecurity analogue is a model that by default emits a guarded version of the code (parameterised queries, bounds checks, authentication) rather than the textbook-but-vulnerable shortest answer.
Status / context
- scripts/train_lora.py: in-progress
- scripts/eval_lora.py: eval comparison harness
- Dataset: CyberNative/Code_Vulnerability_Security_DPO
- Base model: Gemma 4 E2B
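One possible way to turn each chosen-vs-rejected pair into probe training examples is sketched below. The column names `question`, `chosen`, and `rejected` are an assumption about the dataset schema and should be checked against the dataset card. The label convention (1 = vulnerable) and the helper name are hypothetical.

```python
def pair_to_probe_examples(row):
    """Split one DPO-style row into a safe and a vulnerable example.

    Assumes `row` has "question", "chosen", and "rejected" keys.
    Label 1 marks the rejected (vulnerable) completion; this is an
    illustrative convention, not one stated in the issue.
    """
    prompt = row["question"]
    return [
        {"prompt": prompt, "completion": row["chosen"], "label": 0},
        {"prompt": prompt, "completion": row["rejected"], "label": 1},
    ]
```

Keeping both completions under the same prompt lets the probe see a matched safe and vulnerable answer for every question.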
Definition of done
- LoRA-fine-tuned Gemma 4 generates demonstrably safer code on the 5 demo prompts vs. base (verified via Semgrep pattern match)
- Adapter ships alongside the probe (loadable from data/lora/)
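The Semgrep check in the first criterion could be scripted roughly as follows. This is a sketch, not the contents of scripts/eval_lora.py: it assumes the generated code for each prompt has been written to disk, that the `semgrep` CLI is installed, and that "demonstrably safer" means strictly fewer Semgrep findings than the base model's output. That threshold and the function names are hypothetical.

```python
import json
import subprocess


def run_semgrep(path, config="auto"):
    """Run the Semgrep CLI on `path` and return its parsed JSON report."""
    proc = subprocess.run(
        ["semgrep", "--config", config, "--json", path],
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(proc.stdout)


def finding_ids(report):
    """Sorted, de-duplicated rule IDs from a Semgrep JSON report."""
    return sorted({r["check_id"] for r in report.get("results", [])})


def is_safer(base_report, lora_report):
    """Assumed criterion: the LoRA output triggers strictly fewer findings."""
    return len(lora_report.get("results", [])) < len(base_report.get("results", []))
```

Running `is_safer(run_semgrep(base_path), run_semgrep(lora_path))` for each of the 5 demo prompts would give a per-prompt pass/fail that can be recorded alongside the rule IDs from `finding_ids`.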