I'm a student in the Master in High Performance Computing (MHPC) programme at SISSA / ICTP and a PhD student in High Energy Physics at SISSA, Trieste. My work focuses on parallel and GPU-accelerated scientific computing, from cache-aware single-core kernels up to distributed solvers running on Leonardo. More recently I've been applying that toolkit to ML/AI workloads: fine-tuning LLMs with rule-based RL, running multi-agent LLM systems on cluster GPUs, and Bayesian inference on real scientific data.
The repositories below collect coursework, group projects, and a few solo experiments. The HPC ones build with CMake or Make and ship with benchmarks where they make sense; the ML ones are reproducible end-to-end (data fetched from public archives, training reports checked in).
🌐 For the full experience visit the live portfolio — project cards, terminal hero, the works.
The six below are pinned and focus mainly on ML/AI + HPC engineering. The full set (9 projects, including atmospheric simulation, eigenvalue solvers, and modern C++) lives on the live portfolio site.
From-scratch REINFORCE / RLOO with PPO-style clipping and a KL penalty, applied to fine-tuning Llama-3.2-1B-Instruct for <think>…</think> reasoning on GSM8K. A miniature version, run on Leonardo, of the recipe behind DeepSeek-R1-style reasoners: rule-based reward, no preference data, no learned reward model. Includes a small ablation over KL strength and group size with a discussion of the format-vs-reasoning reward-hacking trade-off. PyTorch, Transformers, TRL.
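As a rough illustration of the ingredients named above, here is a minimal sketch of a rule-based reward, the RLOO leave-one-out baseline, and the PPO-style clipped objective for a single sample. The function names, the regular expression, and the 0.2 / 1.0 reward weights are assumptions made up for this example, not the repository's actual code or settings; the KL penalty is left out for brevity.

```python
import re

# Hypothetical format check: reasoning inside <think>...</think>, then an answer.
THINK_RE = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Illustrative reward: 0.2 for the format, +1.0 if the last number matches."""
    match = THINK_RE.match(completion.strip())
    if match is None:
        return 0.0  # no <think> block followed by an answer: no reward at all
    reward = 0.2  # assumed format bonus
    numbers = re.findall(r"-?\d+(?:\.\d+)?", match.group(2))
    if numbers and numbers[-1] == gold_answer:
        reward += 1.0  # assumed correctness bonus
    return reward


def rloo_advantages(rewards):
    """Leave-one-out baseline: compare each sample with the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one sample (to be maximised)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Because each baseline excludes the sample it is applied to, the advantages within a group sum to zero, which is one reason group size matters in an ablation of this kind.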
Small experiment where we treat the LLM as a particle in a statistical-mechanics system: we sweep the qwen2.5 family from 0.5b to 14b across eight temperatures, then study single-agent output distributions and N=4 multi-agent opinion dynamics over R=10 rounds. Headline finding: the genuine control parameter is model size, not sampling temperature; the Ising-style phase picture only fits the small models. Local Ollama, CrewAI for the multi-agent layer, SLURM template for CINECA Leonardo.
Quasi-periodic Gaussian Process built from scratch, i.e. hand-written kernel, log marginal likelihood via Cholesky factorisation, scipy.optimize for hyperparameters, posterior predictive from the standard equations.
We fit it to a real Kepler stellar light curve to recover a candidate rotation period. A Lomb–Scargle periodogram seeds the period prior. Data are fetched live from the STScI archive; nothing to download.
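For readers unfamiliar with the approach, the core computation can be sketched as below: a quasi-periodic kernel and the log marginal likelihood evaluated through a Cholesky factor. This is a minimal sketch under assumptions, not the repository's code: the kernel parameterisation (`amp`, `length`, `gamma`, `period`), the function names, and the white-noise term are one common choice picked for illustration.

```python
import numpy as np


def quasi_periodic_kernel(t1, t2, amp, length, gamma, period):
    """Squared-exponential envelope times a periodic term (assumed form):
    k(dt) = amp^2 * exp(-dt^2 / (2 length^2) - gamma * sin^2(pi dt / period))
    """
    dt = t1[:, None] - t2[None, :]
    envelope = -0.5 * (dt / length) ** 2
    periodic = -gamma * np.sin(np.pi * dt / period) ** 2
    return amp**2 * np.exp(envelope + periodic)


def log_marginal_likelihood(t, y, noise, amp, length, gamma, period):
    """log p(y | t, hyperparameters) for a zero-mean GP via Cholesky."""
    n = len(t)
    K = quasi_periodic_kernel(t, t, amp, length, gamma, period)
    K = K + noise**2 * np.eye(n)  # white-noise variance on the diagonal
    L = np.linalg.cholesky(K)  # K = L L^T
    # alpha = K^{-1} y, obtained from two solves against the factor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))  # equals -0.5 * log|K|
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant
```

In a fit, the negative of this quantity would be handed to an optimiser such as `scipy.optimize.minimize`, with the initial period taken from the periodogram peak. A production version would typically use triangular solves rather than the general `np.linalg.solve` shown here.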
GPU programming portfolio progressing from first CUDA kernels to a production-quality GPU-accelerated Lattice Boltzmann fluid solver. Shared-memory transpose with bank-conflict avoidance, distributed matrix multiplication via MPI + cuBLAS (Cannon-style), and a 2D Jacobi solver across multiple GPUs. The same kernel-writing fluency that's needed for fast ML training and inference paths.
Eight progressive HPC projects: distributed identity and matmul, Cannon's algorithm, OpenMP fundamentals, Jacobi solvers (pure MPI → hybrid MPI+OpenMP → parallel HDF5 output), and a 3D diffusion solver with FFTW3-MPI. Each project compares blocking, non-blocking, and collective communication strategies, the foundations behind any distributed-training stack.
Annotated C examples building intuition for why code performs the way it does on modern hardware: cache hierarchies (memory mountains, blocked transpose, AoS vs. SoA), branch prediction, loop reordering and unrolling for ILP, software prefetching, sparse-matrix layouts, FP rounding. Benchmarks across laptop, Leonardo, and LUMI. The microarchitectural floor every fast ML kernel sits on top of.
→ finishing the MHPC programme at SISSA / ICTP and the PhD in High Energy Physics at SISSA
→ looking for ML/AI + HPC engineering roles — GPU programming, distributed training infrastructure, LLM systems
→ open to research collaborations on ML for science, scientific computing, numerical methods and related topics
→ Always eager to learn new stuff and chat about cool projects, so feel free to reach out! 😉
