It is 2 AM.
You started training 6 hours ago. The loss curve is finally going down. The accuracy is climbing. You can see the finish line.
Then your screen goes dark.
```
Epoch 31/50 ━━━━━━━━━━━━━━━━━━━━━━ 62% | loss: 0.38 | acc: 71.4%

💥 Colab session disconnected.
Your runtime has timed out.
```
You stare at it for a full minute.
Then you restart. From epoch 1. Again.
This is not a rare edge case. This is the daily reality of every ML engineer, every data scientist, every student running experiments on a laptop that might die, a Colab that will disconnect, a server that will restart.
Every year, millions of GPU-hours are wasted on work that was already done. Not because the models were wrong. Because the infrastructure failed and there was no safety net.
loopz is that safety net. One decorator. That is it.
```bash
pip install loopz
```

```python
import loopz

@loopz.track("process_images", save_every=100)
def process(image_path):
    extract_and_save_features(image_path)

process(all_image_paths)
# 💥 crash at 60,000? just run again → resumes at 60,000 ✅
```

One decorator. One argument. Done.
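The idea behind it is simple: wrap the iterable, skip indices that already completed, and persist a cursor every `save_every` items. A stripped-down sketch of that idea (a toy illustration, not loopz's actual code):

```python
import functools
import json
import os

def track(job_name, save_every=100, path=None):
    """Toy resume mechanism: skip completed indices, persist the cursor."""
    path = path or os.path.expanduser(f"~/.loopz/{job_name}.json")

    def decorator(fn):
        @functools.wraps(fn)
        def runner(iterable):
            start = 0
            if os.path.exists(path):
                with open(path) as f:
                    start = json.load(f)["next_index"]
            for i, item in enumerate(iterable):
                if i < start:
                    continue  # already processed before the crash
                fn(item)
                if (i + 1) % save_every == 0:
                    os.makedirs(os.path.dirname(path), exist_ok=True)
                    with open(path, "w") as f:  # real loopz writes atomically
                        json.dump({"next_index": i + 1}, f)
        return runner
    return decorator
```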
loopz does not just remember where you were. It remembers everything.
```python
import loopz
import torch

model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

running_loss = [0.0]
best_acc = [0.0]

@loopz.track(
    "training",
    save_every=1,
    state={"model": model, "optimizer": optimizer, "scheduler": scheduler},
    loop_vars={"running_loss": running_loss, "best_acc": best_acc},
    notify=print,
)
def train(epoch):
    loss, acc = train_one_epoch(model, train_loader, optimizer, scheduler)
    running_loss[0] += loss
    best_acc[0] = max(best_acc[0], acc)
    print(f"Epoch {epoch} | loss={loss:.4f} | acc={acc:.4f}")

train(range(50))
# 💥 crashes at epoch 31? run the same script again →
```
```
🔁 loopz: Resuming 'training' from 31/50 (62.0% complete)
   ├── model weights   ✅ restored
   ├── optimizer state ✅ restored
   ├── lr scheduler    ✅ restored
   ├── random seed     ✅ restored ← deterministic resume
   └── loop variables  ✅ running_loss · best_acc restored

Epoch 31 | loss=0.3821 | acc=71.4% ← continues exactly here
Epoch 32 | loss=0.3744 | acc=72.1%
...
```
The crash never happened. Your training never knew.
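The "random seed restored" line is what makes the resume deterministic: shuffling, dropout, and augmentation pick up exactly where they left off. Capturing and restoring RNG state generically takes a handful of standard calls (a sketch of the usual Python/NumPy/PyTorch APIs, not loopz internals):

```python
import random

import numpy as np
import torch

def capture_rng_state():
    # Snapshot every RNG a typical training loop touches.
    state = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
    }
    if torch.cuda.is_available():
        state["cuda"] = torch.cuda.get_rng_state_all()
    return state

def restore_rng_state(state):
    # Put every RNG back exactly where it was at checkpoint time.
    random.setstate(state["python"])
    np.random.set_state(state["numpy"])
    torch.set_rng_state(state["torch"])
    if "cuda" in state and torch.cuda.is_available():
        torch.cuda.set_rng_state_all(state["cuda"])
```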
On every checkpoint, loopz atomically writes:
```
~/.loopz/
├── loopz_<hash>.json   ← position · timestamps · metadata
├── loopz_<hash>.state  ← model weights · optimizer · rng state
└── loopz_<hash>.vars   ← your loop accumulators
```

Atomic write (temp → rename) = zero corruption risk on crash.
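loopz's internals aren't shown here, but the temp → rename pattern itself fits in a few lines; a minimal sketch for a JSON payload:

```python
import json
import os
import tempfile

def atomic_write_json(path, data):
    """Write JSON so a crash mid-write can never corrupt `path`."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes onto disk first
        os.replace(tmp, path)     # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)            # leave the old checkpoint untouched
        raise
```

A crash before `os.replace` leaves the previous checkpoint intact; a crash after it leaves the new one complete. There is no in-between state.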
| Object | Supported |
|---|---|
| `torch.nn.Module` | ✅ |
| `torch.nn.DataParallel` | ✅ |
| `torch.nn.parallel.DistributedDataParallel` | ✅ |
| `torch.optim.Optimizer` (Adam, SGD, AdamW, …) | ✅ |
| `torch.optim.lr_scheduler.*` | ✅ |
| `torch.cuda.amp.GradScaler` | ✅ |
| `torch.Tensor` | ✅ |
| `numpy.ndarray` | ✅ |
| sklearn estimators | ✅ |
| Any picklable Python object | ✅ |
| Python / NumPy / PyTorch / CUDA random state | ✅ |
| Variables inside the loop (`running_loss`, `best_acc`, …) | ✅ |
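Most of the PyTorch entries above expose `state_dict()` / `load_state_dict()`, the standard snapshot protocol a checkpointer can lean on. A sketch of that pattern on its own (stand-in model, not loopz code):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# snapshot
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    },
    "checkpoint.pt",
)

# restore, e.g. after a crash
snapshot = torch.load("checkpoint.pt")
model.load_state_dict(snapshot["model"])
optimizer.load_state_dict(snapshot["optimizer"])
scheduler.load_state_dict(snapshot["scheduler"])
```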
```python
@loopz.track(
    job_name   = "my_job",   # unique name — identifies this job's checkpoint
    save_every = 10,         # checkpoint every N iterations
    state      = {...},      # ML objects to save/restore (optional)
    loop_vars  = {...},      # accumulators inside the loop (optional)
    notify     = callable,   # called on completion or crash (optional)
)
def process(item):
    ...

process(my_list)
```
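Any callable works for `notify`; the training example above simply passes `print`. Assuming loopz calls the hook with a single status string (the signature is an assumption here, mirrored from the `notify=print` usage), a webhook notifier might look like:

```python
import json
import urllib.request

def webhook_notify(message):
    # Hypothetical notifier: the URL is a placeholder, and the single-string
    # signature is assumed from the `notify=print` usage above.
    payload = json.dumps({"text": f"loopz: {message}"}).encode()
    req = urllib.request.Request(
        "https://example.com/my-webhook",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```

Pass it as `notify=webhook_notify` and a crash pings you instead of waiting for you to check the terminal.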
```
📋 loopz — 2 saved job(s):

🔁 training
   Progress : 31/50 (62.0%)
   Saved at : 2026-04-01 02:14:38
   Crashed  : Colab session disconnected

🔁 process_images
   Progress : 61,200/100,000 (61.2%)
   Saved at : 2026-04-01 01:58:02
```
Wipe a checkpoint. Start fresh next run.
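The call itself isn't shown in this section; assuming the helper is named `clear` and keyed by job name (both are assumptions, not confirmed API), it would look like:

```python
import loopz

loopz.clear("training")  # hypothetical helper name: wipes the 'training' checkpoint
```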
- 🤖 ML Training → model + optimizer + scheduler checkpointed every epoch
- 🖼️ Dataset Processing → never reprocess what is already done
- 🌐 Web Scraping → crash-safe iteration over URL lists
- 📥 Bulk Downloads → resume from last successful file
- 🔬 Long Experiments → any loop that might not finish in one run
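The scraping and bulk-download cases are just the quick-start pattern over a list of URLs; a sketch, with `fetch_and_store` as a hypothetical stand-in for your own download/parse logic:

```python
import loopz

@loopz.track("scrape_urls", save_every=50)
def scrape(url):
    fetch_and_store(url)  # hypothetical helper: your download/parse code

scrape(url_list)
```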
- Primitive loop vars — `int`, `float`, `str` cannot mutate in-place. Wrap them: `loss = [0.0]`, not `loss = 0.0` (see the sketch after this list)
- Multi-node DDP — single-machine DDP works; multi-node across separate machines does not
- Custom CUDA C++ ops — non-standard CUDA state may need manual checkpointing alongside loopz
- Non-picklable objects — skipped with a warning
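Why the wrapping matters: rebinding a bare `loss = 0.0` creates a brand-new float each iteration, so there is no stable object for `loop_vars` to snapshot and restore, whereas a one-element list is mutated in place and keeps its identity across checkpoints. A minimal sketch using the documented API:

```python
import loopz

total = [0.0]  # ✅ mutable container survives save/restore
count = [0]

@loopz.track("accumulate", save_every=100,
             loop_vars={"total": total, "count": count})
def step(x):
    total[0] += x  # mutate in place; never rebind the name
    count[0] += 1

step(range(1_000))
```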
```
v0.1 ████████████ core decorator + resume + ML state  ✅ shipped
v0.2 ████████░░░░ tqdm integration + async support    ✅ shipped
v0.3 ██████░░░░░░ notify hooks + Telegram/webhook     🔄 in progress
v0.4 ████░░░░░░░░ web dashboard for job status        📅 planned
v0.5 ██░░░░░░░░░░ multi-node DDP support              📅 planned
```
loopz is MIT licensed and built for the community.
If you have ever lost training progress — you already understand this project deeply enough to contribute.
- Fork the repo
- Create a branch: `git checkout -b feature/your-idea`
- Make your change, add a test
- Open a Pull Request — all sizes welcome
Built by a solo developer from India — after losing hours of Colab training one too many times.
This is not just a project. It is a frustration turned into a tool. Every ML student who has ever stared at a disconnected Colab session and felt their stomach drop — this is for you.
