SchusterLab/floquet-hpc

High-performance Floquet calculator for driven fluxonium qubit systems, targeting GPU-accelerated parameter sweeps on HPC clusters.

Reimplements the analysis from nikc2003/Fluxonium_Floquet with a custom propagator engine that drops the QuTiP dependency and achieves 30–50x speedup on GPU via batched split-operator propagation and Cayley-transform eigensystem extraction.

Physics

Periodically driven fluxonium-resonator composite:

H(t) = H0 + H1 · Ω_d · cos(ω_d · t)

H0 = H_q ⊗ I_r + I_q ⊗ ω_r·a†a + g · n_q ⊗ (a + a†)   (static Hamiltonian)
H1 = n_q ⊗ I_r                                              (drive coupling)

The code computes the one-period Floquet propagator U(T) = T exp(-i ∫₀ᵀ H(t) dt), extracts quasienergies and Floquet modes, and produces scar maps that reveal resonance structure in the (ω_d, χ_ac) parameter space.

Typical Hilbert space: dim_q=20 qubit levels × dim_r=8 resonator levels = 160×160 matrices.
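A minimal NumPy sketch of this construction. The qubit operators below are placeholder Hermitian matrices; in the real code H_q and n_q come from the fluxonium eigenbasis, and omega_r and g are illustrative values.

```python
import numpy as np

dim_q, dim_r = 20, 8                 # qubit levels x resonator levels
omega_r, g = 7.0, 0.1                # illustrative resonator frequency and coupling

# Placeholder qubit operators -- stand-ins for the fluxonium-eigenbasis matrices
rng = np.random.default_rng(0)
H_q = np.diag(np.sort(rng.uniform(0.0, 10.0, dim_q)))   # qubit energies
B = rng.standard_normal((dim_q, dim_q))
n_q = 1j * (B - B.T)                                    # Hermitian charge-like operator

a = np.diag(np.sqrt(np.arange(1, dim_r)), k=1)          # resonator lowering operator
I_q, I_r = np.eye(dim_q), np.eye(dim_r)

H0 = (np.kron(H_q, I_r)
      + np.kron(I_q, omega_r * (a.conj().T @ a))
      + g * np.kron(n_q, a + a.conj().T))               # static Hamiltonian, 160x160
H1 = np.kron(n_q, I_r)                                  # drive coupling
```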

Features

  • GPU-accelerated propagation via CuPy batched Strang splitting (fp32/fp64)
  • GPU-accelerated eigensystem via Cayley transform + batched Hermitian eigendecomposition
  • Displaced state tracking with polynomial fits for clean scar maps
  • AC Stark shift parametrization (χ_ac → amplitude conversion)
  • Automatic GPU memory management — sub-batch sizes scale to available VRAM
  • SLURM integration for Stanford Sherlock cluster

Installation

# CPU only
pip install -e .

# With GPU support (CUDA 12.x)
pip install -e ".[gpu]"

# Development
pip install -e ".[dev]"

Quick Start

# Local CPU test (small grid)
PYTHONPATH=src python scripts/run_scar_sweep.py \
    --backend cpu --n-omega 31 --n-chi 26 --displaced

# GPU sweep
PYTHONPATH=src python scripts/run_scar_sweep.py \
    --backend auto --n-omega 301 --n-chi 251 --displaced

# Plot results
PYTHONPATH=src python scripts/plot_scar_map.py output/scar_sweep.h5

Deploy to Sherlock

./scripts/deploy.sh                    # rsync + install on Sherlock
ssh sherlock
cd $GROUP_HOME/floquet-hpc
sbatch slurm/scar_sweep_displaced_full.sbatch

Project Structure

floquet-hpc/
├── src/floquet_hpc/
│   ├── engine.py            # Floquet propagator + eigensystem (CPU/GPU)
│   ├── hamiltonian.py       # Fluxonium-resonator Hamiltonian construction
│   ├── analysis.py          # State tracking, scar maps, branch tracking
│   ├── displaced_state.py   # Polynomial displaced state fitting
│   ├── chi_ac.py            # AC Stark shift ↔ amplitude conversion
│   ├── sweeps.py            # Parameter grid, chunked sweep, HDF5 I/O
│   └── cli.py               # CLI entry point
├── scripts/
│   ├── run_scar_sweep.py        # Main sweep script (bare + displaced modes)
│   ├── plot_scar_map.py         # Scar map visualization
│   ├── benchmark_batch_size.py  # GPU batch size benchmarks
│   ├── benchmark_eig.py         # Eigensystem method comparison
│   ├── deploy.sh                # rsync to Sherlock + setup venv
│   └── sherlock_env.sh          # Sherlock module loads
├── slurm/                       # SLURM job scripts for Sherlock
├── configs/                     # Sweep parameter configs
├── tests/                       # Unit tests
└── pyproject.toml

Example Output

Displaced scar map (301×251, ω_d ∈ [3.8, 4.2] GHz, χ_ac ∈ [0, 0.2] GHz)

Displaced scar map

Sharp resonance lines on a clean near-zero background. Left: ground state |0,0⟩. Right: first excited state |1,0⟩. The displaced state polynomial fit removes the smooth AC Stark drift, leaving only resonance features visible.

Wide frequency scan (401×201, ω_d ∈ [3.2, 4.8] GHz, χ_ac ∈ [0, 0.05] GHz)

Wide scar map

A forest of resonance lines across a broad frequency range at low drive amplitude. Multiple multi-photon transitions are visible, with different structure for ground and excited states.


Methods and Optimizations

1. Propagator: Strang Split-Operator (GPU)

The standard approach computes U(T) = ∏ₖ exp(-iH(tₖ)dt) using scipy.linalg.expm at each time step. This is LAPACK-optimized but inherently sequential.
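For reference, that baseline loop might look like the following sketch, with the drive envelope passed as a callable f(t):

```python
import numpy as np
from scipy.linalg import expm

def propagator_naive(H0, H1, f, T, n_steps):
    """One-period propagator via sequential expm (the baseline being replaced)."""
    dt = T / n_steps
    U = np.eye(H0.shape[0], dtype=complex)
    for k in range(n_steps):
        t = (k + 0.5) * dt                              # midpoint evaluation
        U = expm(-1j * (H0 + f(t) * H1) * dt) @ U       # one expm per time step
    return U
```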

Our approach: Since H(t) = H0 + f(t)·H1, we diagonalize H0 and H1 once:

H0 = V0 · diag(e0) · V0†
H1 = V1 · diag(e1) · V1†

Then each Strang splitting step is:

U_step = exp(-iH0·dt/2) · exp(-if(t)·H1·dt) · exp(-iH0·dt/2)

where each matrix exponential is computed as V·diag(exp(-ieₖ·dt))·V†, i.e. just batched matrix multiplies. The eigenvectors V0, V1 are shared across all parameter points and time steps.

Key optimizations:

  • fp32 (complex64): Consumer GPUs have ~30x more fp32 TFLOPS than fp64. For dim=160 with n_steps≥500, the precision loss is negligible. Results are cast back to fp64 for eigensystem extraction.
  • Batched across parameter points: All (ω_d, amplitude) points are propagated simultaneously at each time step. The time loop is sequential (1000 steps), but all matrix operations are batched.
  • Midpoint rule: Using t = (k+0.5)·dt instead of forward Euler reduces the required step count by ~30x (n_steps=1000 matches n_steps=30000 with forward Euler).
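Putting the splitting and the midpoint rule together, a single-matrix CPU sketch looks like this; the real engine performs the same multiplies batched over all parameter points on GPU in complex64:

```python
import numpy as np

def strang_propagator(H0, H1, f, T, n_steps):
    """Strang-split one-period propagator using precomputed eigendecompositions."""
    e0, V0 = np.linalg.eigh(H0)                         # diagonalize once
    e1, V1 = np.linalg.eigh(H1)
    dt = T / n_steps
    # exp(-i H0 dt/2) built from the eigendecomposition -- reused every step
    half0 = V0 @ np.diag(np.exp(-1j * e0 * dt / 2)) @ V0.conj().T
    U = np.eye(H0.shape[0], dtype=complex)
    for k in range(n_steps):
        t = (k + 0.5) * dt                              # midpoint rule
        drive = V1 @ np.diag(np.exp(-1j * f(t) * e1 * dt)) @ V1.conj().T
        U = half0 @ drive @ half0 @ U                   # matrix multiplies only
    return U
```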

2. Eigensystem: Cayley Transform (GPU)

Extracting Floquet quasienergies requires diagonalizing the unitary propagator U(T). CuPy lacks cp.linalg.eig for non-Hermitian matrices — only eigh (Hermitian) is available.

Cayley transform: For a unitary matrix U with eigenvalues e^{iθₖ}, the matrix

H = i(I - U)(I + U)⁻¹

is Hermitian with eigenvalues tan(θₖ/2) and the same eigenvectors as U.

Phase shift to avoid singularity: If U has an eigenvalue at -1 (θ=π), then (I+U) is singular. We apply U' = e^{iε}U with ε=0.01, which shifts all eigenphases away from π.

Eigenphase recovery: Rather than converting back from tan(θₖ/2), we recover exact eigenphases from V†UV where V are the Cayley eigenvectors. This handles fp32-induced non-unitarity gracefully — the eigenvectors of the Cayley-Hermitian matrix are still excellent approximations even when U is only approximately unitary.

Hermitization: After computing H, we enforce (H + H†)/2 to correct any numerical non-Hermiticity from the matrix inversion.
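A CPU sketch of this pipeline (phase shift, Cayley transform, Hermitization, eigh, eigenphase recovery), using NumPy in place of CuPy:

```python
import numpy as np

def cayley_eig(U, eps=0.01):
    """Eigenphases and eigenvectors of a (near-)unitary U via the Cayley transform."""
    I = np.eye(U.shape[0])
    Us = np.exp(1j * eps) * U                     # shift eigenphases away from pi
    H = 1j * (I - Us) @ np.linalg.inv(I + Us)     # Hermitian Cayley image
    H = (H + H.conj().T) / 2                      # enforce Hermiticity
    _, V = np.linalg.eigh(H)                      # eigenvectors shared with U
    # Recover exact eigenphases from V† U V rather than from tan(theta/2)
    phases = np.angle(np.diag(V.conj().T @ U @ V))
    return phases, V
```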

Why not other approaches?

  • torch.linalg.eig: Works for batched non-Hermitian eig on GPU, but cuSOLVER's geev processes matrices sequentially — for 160×160 it barely beats single-core CPU (see Benchmarks).
  • Hermitian proxy (U+U†)/2: Eigenvectors are wrong when cos(θ) has near-degeneracies (common in Floquet systems).
  • CPU multiprocessing: OpenBLAS thread contention and pickle overhead for large arrays eliminate any gains.

3. Displaced State Tracking

Standard scar maps compare Floquet modes against fixed H0 eigenstates. As drive amplitude increases, the AC Stark shift smoothly deforms the eigenstates, producing a noisy background that obscures resonance features.

Displaced states are polynomial fits to the smooth evolution of each tracked eigenstate:

|ψ_displaced(ω, A)⟩ = |bare⟩ + Σ_n Σ_{(i,j)} c_{n,i,j} · ω^i · A^j · |n⟩

Key design choices:

  • Exponent constraint j ≥ 1: excluding pure-ω terms ensures |ψ_displaced⟩ = |bare⟩ at A = 0
  • Overlap filtering: Points near resonances (overlap < 0.8) are excluded from the fit
  • Multi-segment fitting: The amplitude range is split into overlapping segments. Each segment uses the previous segment's polynomial as the reference for overlap filtering, enabling tracking through large AC Stark shifts.
  • Global refit: After per-segment masks are computed, a single global least-squares fit produces smooth coefficients across the full range.

The scar map is then: scar = 1 - |⟨displaced|floquet⟩|², which is ~0 away from resonances and spikes at avoided crossings.
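A simplified sketch of such a fit for a single basis-state component, keeping only monomials with j ≥ 1. The function name and the single-segment, unfiltered setup are illustrative; the real code adds overlap masking and multi-segment refits.

```python
import numpy as np

def fit_displaced_coeffs(omegas, amps, components, max_i=2, max_j=2):
    """Least-squares fit of c_{i,j} for one basis-state amplitude.
    Only monomials with j >= 1 are used, so the fit vanishes at A = 0."""
    exponents = [(i, j) for i in range(max_i + 1) for j in range(1, max_j + 1)]
    # Design matrix: one column per monomial omega^i * A^j
    X = np.stack([omegas**i * amps**j for i, j in exponents], axis=1)
    coeffs, *_ = np.linalg.lstsq(X, components, rcond=None)
    return exponents, coeffs
```

Given the fitted displaced states, each grid point's scar value is then 1 − |⟨displaced|floquet⟩|².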

4. Branch Tracking

Floquet modes are identified across the amplitude sweep using max-overlap continuity:

  1. At amplitude=0, modes are matched to bare states by overlap
  2. At each amplitude step, each tracked mode is the Floquet mode with maximum overlap with the previous step
  3. Global phase is fixed so the overlap with the bare state is real and positive (essential for smooth polynomial fitting)
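The three steps can be sketched as follows (hypothetical function; a real implementation would also guard against two branches claiming the same mode):

```python
import numpy as np

def track_branches(modes_by_amp, bare_states):
    """Max-overlap branch tracking with phase fixing.
    modes_by_amp: list of (dim, n_modes) Floquet-mode arrays, one per amplitude.
    bare_states:  (dim, n_tracked) array of bare H0 eigenstates."""
    refs = bare_states                                 # step 1: match to bare states
    tracked = []
    for modes in modes_by_amp:
        overlaps = np.abs(refs.conj().T @ modes)       # (n_tracked, n_modes)
        idx = np.argmax(overlaps, axis=1)              # step 2: best match per branch
        sel = modes[:, idx]
        # Step 3: fix global phase so the bare-state overlap is real and positive
        phase = np.angle(np.sum(bare_states.conj() * sel, axis=0))
        sel = sel * np.exp(-1j * phase)
        tracked.append(sel)
        refs = sel                                     # continuity reference
    return tracked
```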

5. Automatic GPU Memory Management

Sub-batch sizes are determined at runtime based on available GPU VRAM:

max_points = int(free_memory * 0.7 / bytes_per_point)

This ensures the code runs efficiently on any GPU from a 2080 Ti (11 GB) to an H100 (80 GB) without manual tuning.
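A sketch of the sizing rule. The per-point byte count here is a hypothetical estimate of how many dim×dim complex64 work matrices are alive per point; on GPU the free-byte figure would come from CuPy's cp.cuda.Device().mem_info.

```python
def max_sub_batch(free_bytes, dim, n_copies=8, dtype_bytes=8, safety=0.7):
    """Largest number of parameter points that fits in the available VRAM.
    n_copies (assumed here) estimates the dim x dim work matrices per point."""
    bytes_per_point = n_copies * dim * dim * dtype_bytes
    return max(1, int(free_bytes * safety / bytes_per_point))
```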


Benchmarks

All benchmarks use dim=160, n_steps=1000, ~5000 parameter points.

Propagator Performance

Backend               ms/point   Notes
NVIDIA H100 80GB      4.3        HBM3 bandwidth helps
NVIDIA L40S 48GB      7.5        Good fp32 throughput
NVIDIA V100 16GB      13.5       Older but capable
NVIDIA 2080 Ti 11GB   14.0       Consumer GPU baseline
CPU (scipy.expm)      37.0       Sequential, single-core

Propagation is compute-bound; batch size has no measurable effect on throughput.

Eigensystem Performance

Method             GPU       ms/point   Speedup vs CPU
Cayley + eigh      H100      0.7        49.5x
Cayley + eigh      L40S      2.4        15.5x
Cayley + eigh      V100      1.4        25x
Cayley + eigh      2080 Ti   8.0        8.2x
torch.linalg.eig   H100      20.0       1.6x (still slower than Cayley)
np.linalg.eig      CPU       35.1       1x (baseline)

The Cayley approach scales well across GPU tiers. torch.linalg.eig is slower due to cuSOLVER's sequential geev. Batch size has minimal effect (±5%) — cuSOLVER's eigh and inv dominate regardless.

End-to-End Sweep Timing (301×251 = 75,551 points)

GPU                 Propagation   Eigensystem   Total      ms/point
H100 (projected)    ~5.4 min      ~0.9 min      ~6.3 min   5.0
V100                18.5 min      1.8 min       20.6 min   15.4
2080 Ti             17.8 min      10.5 min      28.4 min   22.6
2080 Ti (CPU eig)   9.1 min       37.0 min      46.3 min   36.8

Batch Size Effect (L40S, 48GB)

sub_batch   Propagator (ms/pt)   Eigensystem (ms/pt)
512         7.4                  2.4
2048        7.5                  2.5
8192        7.5                  2.4

Conclusion: batch size doesn't meaningfully affect performance. The bottleneck operations (eigh, inv, batched matmul in the time loop) run at the same rate regardless of batch size. Auto-sizing is useful for memory management but not for speed.


Numerical Accuracy

Propagator (fp32 vs fp64)

  • Singular values of U deviate from 1 by ~1e-3 in fp32 (exact in fp64)
  • Quasienergies agree to ~1e-4 between fp32 and fp64 propagators
  • For dim=160 with n_steps=1000, fp32 precision is more than sufficient

Eigensystem (Cayley vs CPU)

  • Mean eigenvector residual ||Uv - λv||: 1.78e-4 (Cayley) vs 1.60e-4 (CPU)
  • Max quasienergy difference: 4.1e-4
  • The Cayley approach correctly handles fp32-induced non-unitarity by recovering eigenphases from V†UV rather than from the Cayley eigenvalues directly

Known Pitfall: Advanced Indexing Transpose

NumPy/CuPy/PyTorch advanced indexing with a[ix, :, iy] (advanced indices on axes 0 and 2) transposes the result — the slice dimension ends up after the advanced-indexed dimensions. All batched eigenvector sorting code must include .transpose(0, 2, 1) (NumPy/CuPy) or .permute(0, 2, 1) (PyTorch) to correct this.
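A minimal demonstration of the pitfall in NumPy:

```python
import numpy as np

a = np.zeros((4, 160, 5))           # e.g. (batch, dim, n_modes) eigenvector stack
ix = np.arange(4)[:, None]          # advanced index on axis 0, shape (4, 1)
iy = np.zeros((4, 5), dtype=int)    # advanced index on axis 2, e.g. per-batch sort order

# Advanced indices separated by a slice: the broadcast advanced dims come first,
# and the sliced dimension is appended after them.
out = a[ix, :, iy]                  # shape (4, 5, 160), NOT (4, 160, 5)

fixed = out.transpose(0, 2, 1)      # restore (batch, dim, n_modes)
```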


References

  • Original code: nikc2003/Fluxonium_Floquet
  • Zhu et al., PRB 87, 024510 (2013) — AC Stark shift formalism
  • Strang splitting: G. Strang, SIAM J. Numer. Anal. 5, 506 (1968)
  • Cayley transform for unitary matrices: Cayley, Phil. Trans. Roy. Soc. London 148, 17 (1858)
