SchusterLab/floquet-hpc

High-performance Floquet calculator for driven fluxonium qubit systems, targeting GPU-accelerated parameter sweeps on HPC clusters.

Reimplements the analysis from nikc2003/Fluxonium_Floquet with a custom propagator engine that drops the QuTiP dependency and achieves 30–50x speedup on GPU via batched split-operator propagation and Cayley-transform eigensystem extraction.

Physics

Periodically driven fluxonium-resonator composite:

H(t) = H0 + H1 · Ω_d · cos(ω_d · t)

H0 = H_q ⊗ I_r + I_q ⊗ ω_r·a†a + g · n_q ⊗ (a + a†)   (static Hamiltonian)
H1 = n_q ⊗ I_r                                              (drive coupling)

The code computes the one-period Floquet propagator U(T) = T exp(-i ∫₀ᵀ H(t) dt), extracts quasienergies and Floquet modes, and produces scar maps that reveal resonance structure in the (ω_d, χ_ac) parameter space.

Typical Hilbert space: dim_q=20 qubit levels × dim_r=8 resonator levels = 160×160 matrices.
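A minimal NumPy sketch of this construction. The qubit operators below are placeholder Hermitian matrices; in the real code H_q and n_q come from the fluxonium eigenbasis, and omega_r and g are illustrative values.

```python
import numpy as np

dim_q, dim_r = 20, 8                 # qubit levels x resonator levels
omega_r, g = 7.0, 0.1                # illustrative resonator frequency and coupling

# Placeholder qubit operators -- stand-ins for the fluxonium-eigenbasis matrices
rng = np.random.default_rng(0)
H_q = np.diag(np.sort(rng.uniform(0.0, 10.0, dim_q)))   # qubit energies
B = rng.standard_normal((dim_q, dim_q))
n_q = 1j * (B - B.T)                                    # Hermitian charge-like operator

a = np.diag(np.sqrt(np.arange(1, dim_r)), k=1)          # resonator lowering operator
I_q, I_r = np.eye(dim_q), np.eye(dim_r)

H0 = (np.kron(H_q, I_r)
      + np.kron(I_q, omega_r * (a.conj().T @ a))
      + g * np.kron(n_q, a + a.conj().T))               # static Hamiltonian, 160x160
H1 = np.kron(n_q, I_r)                                  # drive coupling
```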

Features

  • GPU-accelerated propagation via CuPy batched Strang splitting (fp32/fp64)
  • GPU-accelerated eigensystem via Cayley transform + batched Hermitian eigendecomposition
  • Displaced state tracking with polynomial fits for clean scar maps
  • AC Stark shift parametrization (χ_ac → amplitude conversion)
  • Automatic GPU memory management — sub-batch sizes scale to available VRAM
  • SLURM integration for Stanford Sherlock cluster

Installation

# CPU only
pip install -e .

# With GPU support (CUDA 12.x)
pip install -e ".[gpu]"

# Development
pip install -e ".[dev]"

Quick Start

# Local CPU test (small grid)
PYTHONPATH=src python scripts/run_scar_sweep.py \
    --backend cpu --n-omega 31 --n-chi 26 --displaced

# GPU sweep
PYTHONPATH=src python scripts/run_scar_sweep.py \
    --backend auto --n-omega 301 --n-chi 251 --displaced

# Plot results
PYTHONPATH=src python scripts/plot_scar_map.py output/scar_sweep.h5

Deploy to Sherlock

./scripts/deploy.sh                    # rsync + install on Sherlock
ssh sherlock
cd $GROUP_HOME/floquet-hpc
sbatch slurm/scar_sweep_displaced_full.sbatch

Project Structure

floquet-hpc/
├── src/floquet_hpc/
│   ├── engine.py            # Floquet propagator + eigensystem (CPU/GPU)
│   ├── hamiltonian.py       # Fluxonium-resonator Hamiltonian construction
│   ├── analysis.py          # State tracking, scar maps, branch tracking
│   ├── displaced_state.py   # Polynomial displaced state fitting
│   ├── chi_ac.py            # AC Stark shift ↔ amplitude conversion
│   ├── sweeps.py            # Parameter grid, chunked sweep, HDF5 I/O
│   └── cli.py               # CLI entry point
├── scripts/
│   ├── run_scar_sweep.py        # Main sweep script (bare + displaced modes)
│   ├── plot_scar_map.py         # Scar map visualization
│   ├── benchmark_batch_size.py  # GPU batch size benchmarks
│   ├── benchmark_eig.py         # Eigensystem method comparison
│   ├── deploy.sh                # rsync to Sherlock + setup venv
│   └── sherlock_env.sh          # Sherlock module loads
├── slurm/                       # SLURM job scripts for Sherlock
├── configs/                     # Sweep parameter configs
├── tests/                       # Unit tests
└── pyproject.toml

Example Output

Displaced scar map (301×251, ω_d ∈ [3.8, 4.2] GHz, χ_ac ∈ [0, 0.2] GHz)

Displaced scar map

Sharp resonance lines on a clean near-zero background. Left: ground state |0,0⟩. Right: first excited state |1,0⟩. The displaced state polynomial fit removes the smooth AC Stark drift, leaving only resonance features visible.

Wide frequency scan (401×201, ω_d ∈ [3.2, 4.8] GHz, χ_ac ∈ [0, 0.05] GHz)

Wide scar map

A forest of resonance lines across a broad frequency range at low drive amplitude. Multiple multi-photon transitions are visible, with different structure for ground and excited states.


Methods and Optimizations

1. Propagator: Strang Split-Operator (GPU)

The standard approach computes U(T) = ∏ₖ exp(-iH(tₖ)dt) using scipy.linalg.expm at each time step. This is LAPACK-optimized but inherently sequential.
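For reference, that baseline loop might look like the following sketch, with the drive envelope passed as a callable f(t):

```python
import numpy as np
from scipy.linalg import expm

def propagator_naive(H0, H1, f, T, n_steps):
    """One-period propagator via sequential expm (the baseline being replaced)."""
    dt = T / n_steps
    U = np.eye(H0.shape[0], dtype=complex)
    for k in range(n_steps):
        t = (k + 0.5) * dt                              # midpoint evaluation
        U = expm(-1j * (H0 + f(t) * H1) * dt) @ U       # one expm per time step
    return U
```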

Our approach: Since H(t) = H0 + f(t)·H1, we diagonalize H0 and H1 once:

H0 = V0 · diag(e0) · V0†
H1 = V1 · diag(e1) · V1†

Then each Strang splitting step is:

U_step = exp(-iH0·dt/2) · exp(-if(t)·H1·dt) · exp(-iH0·dt/2)

where each matrix exponential is computed as V·diag(exp(-ieₖ·dt))·V†, i.e. just batched matrix multiplies. The eigenvectors V0, V1 are shared across all parameter points and time steps.

Key optimizations:

  • fp32 (complex64): Consumer GPUs have ~30x more fp32 TFLOPS than fp64. For dim=160 with n_steps≥500, the precision loss is negligible. Results are cast back to fp64 for eigensystem extraction.
  • Batched across parameter points: All (ω_d, amplitude) points are propagated simultaneously at each time step. The time loop is sequential (1000 steps), but all matrix operations are batched.
  • Midpoint rule: Using t = (k+0.5)·dt instead of forward Euler reduces the required step count by ~30x (n_steps=1000 matches n_steps=30000 with forward Euler).
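Putting the splitting and the midpoint rule together, a single-matrix CPU sketch looks like this; the real engine performs the same multiplies batched over all parameter points on GPU in complex64:

```python
import numpy as np

def strang_propagator(H0, H1, f, T, n_steps):
    """Strang-split one-period propagator using precomputed eigendecompositions."""
    e0, V0 = np.linalg.eigh(H0)                         # diagonalize once
    e1, V1 = np.linalg.eigh(H1)
    dt = T / n_steps
    # exp(-i H0 dt/2) built from the eigendecomposition -- reused every step
    half0 = V0 @ np.diag(np.exp(-1j * e0 * dt / 2)) @ V0.conj().T
    U = np.eye(H0.shape[0], dtype=complex)
    for k in range(n_steps):
        t = (k + 0.5) * dt                              # midpoint rule
        drive = V1 @ np.diag(np.exp(-1j * f(t) * e1 * dt)) @ V1.conj().T
        U = half0 @ drive @ half0 @ U                   # matrix multiplies only
    return U
```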

2. Eigensystem: Cayley Transform (GPU)

Extracting Floquet quasienergies requires diagonalizing the unitary propagator U(T). CuPy lacks cp.linalg.eig for non-Hermitian matrices — only eigh (Hermitian) is available.

Cayley transform: For a unitary matrix U with eigenvalues e^{iθₖ}, the matrix

H = i(I - U)(I + U)⁻¹

is Hermitian with eigenvalues tan(θₖ/2) and the same eigenvectors as U.

Phase shift to avoid singularity: If U has an eigenvalue at -1 (θ=π), then (I+U) is singular. We apply U' = e^{iε}U with ε=0.01, which shifts all eigenphases away from π.

Eigenphase recovery: Rather than converting back from tan(θₖ/2), we recover exact eigenphases from V†UV where V are the Cayley eigenvectors. This handles fp32-induced non-unitarity gracefully — the eigenvectors of the Cayley-Hermitian matrix are still excellent approximations even when U is only approximately unitary.

Hermitization: After computing H, we enforce (H + H†)/2 to correct any numerical non-Hermiticity from the matrix inversion.
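A CPU sketch of this pipeline (phase shift, Cayley transform, Hermitization, eigh, eigenphase recovery), using NumPy in place of CuPy:

```python
import numpy as np

def cayley_eig(U, eps=0.01):
    """Eigenphases and eigenvectors of a (near-)unitary U via the Cayley transform."""
    I = np.eye(U.shape[0])
    Us = np.exp(1j * eps) * U                     # shift eigenphases away from pi
    H = 1j * (I - Us) @ np.linalg.inv(I + Us)     # Hermitian Cayley image
    H = (H + H.conj().T) / 2                      # enforce Hermiticity
    _, V = np.linalg.eigh(H)                      # eigenvectors shared with U
    # Recover exact eigenphases from V† U V rather than from tan(theta/2)
    phases = np.angle(np.diag(V.conj().T @ U @ V))
    return phases, V
```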

Why not other approaches?

  • torch.linalg.eig: Works for batched non-Hermitian eig on GPU, but cuSOLVER's geev processes matrices sequentially — for 160×160 it barely beats single-core CPU (see Benchmarks).
  • Hermitian proxy (U+U†)/2: Eigenvectors are wrong when cos(θ) has near-degeneracies (common in Floquet systems).
  • CPU multiprocessing: OpenBLAS thread contention and pickle overhead for large arrays eliminate any gains.

3. Displaced State Tracking

Standard scar maps compare Floquet modes against fixed H0 eigenstates. As drive amplitude increases, the AC Stark shift smoothly deforms the eigenstates, producing a noisy background that obscures resonance features.

Displaced states are polynomial fits to the smooth evolution of each tracked eigenstate:

|ψ_displaced(ω, A)⟩ = |bare⟩ + Σ_n Σ_{(i,j)} c_{n,i,j} · ω^i · A^j · |n⟩

Key design choices:

  • Exponent constraint j ≥ 1: excluding pure-ω terms ensures |ψ_displaced⟩ = |bare⟩ at A = 0
  • Overlap filtering: Points near resonances (overlap < 0.8) are excluded from the fit
  • Multi-segment fitting: The amplitude range is split into overlapping segments. Each segment uses the previous segment's polynomial as the reference for overlap filtering, enabling tracking through large AC Stark shifts.
  • Global refit: After per-segment masks are computed, a single global least-squares fit produces smooth coefficients across the full range.

The scar map is then: scar = 1 - |⟨displaced|floquet⟩|², which is ~0 away from resonances and spikes at avoided crossings.
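A simplified sketch of such a fit for a single basis-state component, keeping only monomials with j ≥ 1. The function name and the single-segment, unfiltered setup are illustrative; the real code adds overlap masking and multi-segment refits.

```python
import numpy as np

def fit_displaced_coeffs(omegas, amps, components, max_i=2, max_j=2):
    """Least-squares fit of c_{i,j} for one basis-state amplitude.
    Only monomials with j >= 1 are used, so the fit vanishes at A = 0."""
    exponents = [(i, j) for i in range(max_i + 1) for j in range(1, max_j + 1)]
    # Design matrix: one column per monomial omega^i * A^j
    X = np.stack([omegas**i * amps**j for i, j in exponents], axis=1)
    coeffs, *_ = np.linalg.lstsq(X, components, rcond=None)
    return exponents, coeffs
```

Given the fitted displaced states, each grid point's scar value is then 1 − |⟨displaced|floquet⟩|².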

4. Branch Tracking

Floquet modes are identified across the amplitude sweep using max-overlap continuity:

  1. At amplitude=0, modes are matched to bare states by overlap
  2. At each amplitude step, each tracked mode is the Floquet mode with maximum overlap with the previous step
  3. Global phase is fixed so the overlap with the bare state is real and positive (essential for smooth polynomial fitting)
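The three steps can be sketched as follows (hypothetical function; a real implementation would also guard against two branches claiming the same mode):

```python
import numpy as np

def track_branches(modes_by_amp, bare_states):
    """Max-overlap branch tracking with phase fixing.
    modes_by_amp: list of (dim, n_modes) Floquet-mode arrays, one per amplitude.
    bare_states:  (dim, n_tracked) array of bare H0 eigenstates."""
    refs = bare_states                                 # step 1: match to bare states
    tracked = []
    for modes in modes_by_amp:
        overlaps = np.abs(refs.conj().T @ modes)       # (n_tracked, n_modes)
        idx = np.argmax(overlaps, axis=1)              # step 2: best match per branch
        sel = modes[:, idx]
        # Step 3: fix global phase so the bare-state overlap is real and positive
        phase = np.angle(np.sum(bare_states.conj() * sel, axis=0))
        sel = sel * np.exp(-1j * phase)
        tracked.append(sel)
        refs = sel                                     # continuity reference
    return tracked
```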

5. Automatic GPU Memory Management

Sub-batch sizes are determined at runtime based on available GPU VRAM:

max_points = int(free_memory * 0.7 / bytes_per_point)

This ensures the code runs efficiently on any GPU from a 2080 Ti (11 GB) to an H100 (80 GB) without manual tuning.
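A sketch of the sizing rule. The per-point byte count here is a hypothetical estimate of how many dim×dim complex64 work matrices are alive per point; on GPU the free-byte figure would come from CuPy's cp.cuda.Device().mem_info.

```python
def max_sub_batch(free_bytes, dim, n_copies=8, dtype_bytes=8, safety=0.7):
    """Largest number of parameter points that fits in the available VRAM.
    n_copies (assumed here) estimates the dim x dim work matrices per point."""
    bytes_per_point = n_copies * dim * dim * dtype_bytes
    return max(1, int(free_bytes * safety / bytes_per_point))
```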


Benchmarks

All benchmarks use dim=160, n_steps=1000, ~5000 parameter points.

Propagator Performance

Backend               ms/point   Notes
NVIDIA H100 80GB      4.3        HBM3 bandwidth helps
NVIDIA L40S 48GB      7.5        Good fp32 throughput
NVIDIA V100 16GB      13.5       Older but capable
NVIDIA 2080 Ti 11GB   14.0       Consumer GPU baseline
CPU (scipy.expm)      37.0       Sequential, single-core

Propagation is compute-bound; batch size has no measurable effect on throughput.

Eigensystem Performance

Method             GPU       ms/point   Speedup vs CPU
Cayley + eigh      H100      0.7        49.5x
Cayley + eigh      L40S      2.4        15.5x
Cayley + eigh      V100      1.4        25x
Cayley + eigh      2080 Ti   8.0        8.2x
torch.linalg.eig   H100      20.0       1.6x (still slower than Cayley)
np.linalg.eig      CPU       35.1       1x (baseline)

The Cayley approach scales well across GPU tiers. torch.linalg.eig is slower due to cuSOLVER's sequential geev. Batch size has minimal effect (±5%) — cuSOLVER's eigh and inv dominate regardless.

End-to-End Sweep Timing (301×251 = 75,551 points)

GPU                 Propagation   Eigensystem   Total      ms/point
H100 (projected)    ~5.4 min      ~0.9 min      ~6.3 min   5.0
V100                18.5 min      1.8 min       20.6 min   15.4
2080 Ti             17.8 min      10.5 min      28.4 min   22.6
2080 Ti (CPU eig)   9.1 min       37.0 min      46.3 min   36.8

Batch Size Effect (L40S, 48GB)

sub_batch   Propagator (ms/pt)   Eigensystem (ms/pt)
512         7.4                  2.4
2048        7.5                  2.5
8192        7.5                  2.4

Conclusion: batch size doesn't meaningfully affect performance. The bottleneck operations (eigh, inv, batched matmul in the time loop) run at the same rate regardless of batch size. Auto-sizing is useful for memory management but not for speed.


Numerical Accuracy

Propagator (fp32 vs fp64)

  • Singular values of U deviate from 1 by ~1e-3 in fp32 (exact in fp64)
  • Quasienergies agree to ~1e-4 between fp32 and fp64 propagators
  • For dim=160 with n_steps=1000, fp32 precision is more than sufficient

Eigensystem (Cayley vs CPU)

  • Mean eigenvector residual ||Uv - λv||: 1.78e-4 (Cayley) vs 1.60e-4 (CPU)
  • Max quasienergy difference: 4.1e-4
  • The Cayley approach correctly handles fp32-induced non-unitarity by recovering eigenphases from V†UV rather than from the Cayley eigenvalues directly

Known Pitfall: Advanced Indexing Transpose

NumPy/CuPy/PyTorch advanced indexing with a[ix, :, iy] (advanced indices on axes 0 and 2) transposes the result — the slice dimension ends up after the advanced-indexed dimensions. All batched eigenvector sorting code must include .transpose(0, 2, 1) (NumPy/CuPy) or .permute(0, 2, 1) (PyTorch) to correct this.
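A minimal demonstration of the pitfall in NumPy:

```python
import numpy as np

a = np.zeros((4, 160, 5))           # e.g. (batch, dim, n_modes) eigenvector stack
ix = np.arange(4)[:, None]          # advanced index on axis 0, shape (4, 1)
iy = np.zeros((4, 5), dtype=int)    # advanced index on axis 2, e.g. per-batch sort order

# Advanced indices separated by a slice: the broadcast advanced dims come first,
# and the sliced dimension is appended after them.
out = a[ix, :, iy]                  # shape (4, 5, 160), NOT (4, 160, 5)

fixed = out.transpose(0, 2, 1)      # restore (batch, dim, n_modes)
```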


References

  • Original code: nikc2003/Fluxonium_Floquet
  • Zhu et al., PRB 87, 024510 (2013) — AC Stark shift formalism
  • Strang splitting: G. Strang, SIAM J. Numer. Anal. 5, 506 (1968)
  • Cayley transform for unitary matrices: Cayley, Phil. Trans. Roy. Soc. London 148, 17 (1858)
