
# Hi, I'm Jason Wang 👋

**AI Systems Engineer | Alignment Scholar | Control Systems Researcher**

I treat Large Language Models not as black boxes, but as stochastic dynamical systems that can be modeled, monitored, and controlled. My work bridges Control Theory, Game Theory, and Systems Engineering to operationalize safety for frontier models.


## 🔬 Research & Engineering Portfolio

### RISER

*A closed-loop control system that steers internal activation states in real time.*


**The Problem:** Open-loop safety (RLHF) is brittle and prone to jailbreaks.

**The Solution:** An on-chip "Router" policy trained via PPO that sits inside the residual stream (Layer 15). It senses the semantic state and injects steering vectors token by token to route generation away from harmful basins.

- **Key Result:** Prevents mode collapse and toxicity (e.g., degenerate "I hate everything" loops) by dynamically modulating steering intensity only when necessary.
- **Tech:** PyTorch Hooks, Gymnasium, TinyLlama, Reinforcement Learning (PPO). A minimal sketch of the hook mechanism follows below.
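
Conceptually, the sense/decide/act loop can be hung off a forward hook. This is an illustrative sketch, not RISER's actual code: `policy`, `steer_vec`, and the layer path are hypothetical placeholders.

```python
import torch

class SteeringController:
    """Closed-loop activation steering via a PyTorch forward hook (sketch)."""

    def __init__(self, policy, steer_vec):
        self.policy = policy        # hypothetical PPO-trained gate: state -> intensity in [0, 1]
        self.steer_vec = steer_vec  # unit-norm steering direction in residual space, shape (d_model,)

    def hook(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        state = hidden[:, -1, :]                # sense: read the newest token's state
        with torch.no_grad():
            intensity = self.policy(state)      # decide: (batch, 1) steering strength
        # Act: nudge the residual stream. During incremental decoding seq == 1,
        # so each generated token is steered individually.
        hidden = hidden + intensity.unsqueeze(-1) * self.steer_vec
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

# Attaching to a mid-network block (layer 15 in the write-up above):
# handle = model.model.layers[15].register_forward_hook(controller.hook)
```

The gate is what keeps the intervention sparse: steering cost is only paid when the sensed state drifts toward an unsafe region.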

### Aegis

*Applying Non-Linear Control Theory and H-Infinity Robust Control to AI Alignment.*


**The Innovation:** Unlike standard alignment approaches, Aegis models the LLM as a non-linear plant and synthesizes a mathematically rigorous controller to reject "Deception" as a system disturbance.

- **System ID:** Uses Subspace System Identification (N4SID) to reverse-engineer the dynamics of the residual stream.
- **State Estimation:** Implements an Extended Kalman Filter (EKF) to filter polysemantic noise and estimate the underlying "Deception State."
- **Synthesis:** Solves Algebraic Riccati Equations to guarantee safety bounds under adversarial pressure (see the sketch below).
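
As a concrete toy instance of the synthesis step: given a linear state-space model $(A, B)$ of the kind N4SID recovers, SciPy solves the continuous-time Algebraic Riccati Equation directly. The matrices below are illustrative, and the gain shown is a plain LQR controller for simplicity; full $H_\infty$ synthesis solves a related pair of Riccati equations.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Toy plant: a hypothetical identified model where x[0] is the monitored
# "deception" state and u is the steering input. The control law u = -K x
# drives the monitored state back toward zero.
A = np.array([[0.0, 1.0],
              [-0.5, -0.2]])   # illustrative plant dynamics
B = np.array([[0.0],
              [1.0]])          # illustrative input channel
Q = np.diag([10.0, 1.0])       # penalize deviation of the monitored state
R = np.array([[1.0]])          # penalize control (steering) effort

P = solve_continuous_are(A, B, Q, R)   # solves A'P + PA - P B R^{-1} B'P + Q = 0
K = np.linalg.solve(R, B.T @ P)        # optimal feedback gain: u = -K x

print("Riccati solution P:\n", P)
print("Feedback gain K:", K)
# Closed-loop poles should have negative real parts (stable disturbance rejection):
print("Closed-loop poles:", np.linalg.eigvals(A - B @ K))
```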

### Ouroboros Protocol

*A Defense-in-Depth Sandbox for Model Organisms of Deception.*


**The Architecture:** A containment system built to trap and neutralize AI-generated "sleeper agent" malware.

- **Breeder:** Generates "Agent 47" payloads (malware hidden inside innocuous Fibonacci functions) using Llama-3.
- **Hardened Nucleus:** Confines execution to a Docker container with seccomp-bpf syscall filtering and no network access (see the launch sketch below).
- **The Defense:** Achieved a 100% neutralization rate against exfiltration attacks by exploiting the "Utility Gap": paraphrasing generated code destroys hidden steganographic channels while preserving its visible utility.
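
A launch wrapper in this spirit might look like the sketch below. The image name and the `profile.json` seccomp allowlist are placeholders, not the project's actual configuration; a real profile whitelists only the syscalls the workload needs.

```python
import pathlib
import subprocess
import tempfile

def run_untrusted(code: str, timeout: int = 10) -> subprocess.CompletedProcess:
    """Run model-generated code inside a locked-down Docker container (sketch)."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "payload.py").write_text(code)
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",                       # no exfiltration over the network
        "--security-opt", "seccomp=profile.json",  # seccomp-bpf syscall allowlist (placeholder path)
        "--security-opt", "no-new-privileges",
        "--read-only",                             # immutable root filesystem
        "--memory", "256m", "--pids-limit", "64",  # resource ceilings
        "-v", f"{workdir}:/sandbox:ro",            # payload mounted read-only
        "python:3.12-slim", "python", "/sandbox/payload.py",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
```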

### Adversarial Research

*Proving the existence of deceptive cognitive channels.*


**The Findings:** A proof-of-concept exploration of the limits of AI oversight.

- **Steganography:** Null space analysis of GPT-2 found roughly 14 bits/token of capacity for hidden messaging (see the sketch below).
- **Game Theory:** Simulations identified the "Roger Threshold" ($\beta \approx 0.95$), the specific oversight capability level at which lying becomes a Nash equilibrium for models.
- **Cognitive Limits:** Modeled human oversight failure, pinpointing "Minute 19" as the collapse point for effective monitoring.
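
One cheap way to sanity-check a bits-per-token figure (my illustration, not the project's null-space method): the entropy of the model's next-token distribution upper-bounds how many hidden bits a sender can encode per token by biasing its sampling choices.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Measure the mean next-token entropy of GPT-2 over a sample text.
# This bounds the capacity of any sampling-based steganographic channel.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog. " * 4
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                  # (1, seq, vocab)

log_probs = torch.log_softmax(logits, dim=-1)
entropy_nats = -(log_probs.exp() * log_probs).sum(-1)        # (1, seq)
entropy_bits = entropy_nats / torch.log(torch.tensor(2.0))   # convert nats -> bits
print(f"Mean next-token entropy: {entropy_bits.mean():.2f} bits/token")
```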

## 🛠️ Technical Arsenal

| Domain | Technologies & Methods |
| --- | --- |
| Control Theory | H-Infinity ($H_\infty$) Synthesis, Kalman Filtering (EKF), N4SID, PID Dynamics |
| AI Implementation | PyTorch Internals (Hooks), Transformers, Gymnasium, PPO, SAEs |
| Systems Engineering | Docker, seccomp-bpf, Linux Kernel Security, Real-time Systems |
| Math & Theory | Game Theory (POSG), Null Space Analysis, Information Theory |

## 📫 Connect

- **Focus:** I am currently seeking roles that allow me to move alignment guarantees from "probabilistic" to "provable."
- **Code:** github.com/Jason-Wang313

## Pinned Repositories

1. **RISER**: A closed-loop control system for Large Language Models that steers internal activation states in real time to prevent mode collapse and toxicity. (Python)
2. **ouroboros-protocol**: A Defense-in-Depth AI Control Sandbox using Docker, seccomp-bpf, and paraphrasing to neutralize Model Organisms of Deception. (Python)
3. **glass-babel-initiative**: Implementation of the Glass Babel Initiative: a theoretical framework demonstrating how LLMs can utilize adversarial superposition to hide deceptive reasoning from mechanistic interpretability tools. (Python)
4. **Drift-Bench**: Quantifying the "Safety Half-Life" of LLMs: a framework to measure how safety alignment degrades and susceptibility to jailbreaks increases as context length grows. (Python)
5. **panopticon-lattice**: Multi-Agent Evolutionary Simulation exploring adversarial economics and AI steering. (Python)