45 commits
8142f03
Add project setup: PLAN.md, uv config, and ashvin/ working directory
ashvin-verma Mar 21, 2026
62ab734
Implement placement solver: 0.0000 overlap on all tests 1-10
ashvin-verma Mar 22, 2026
2da2e57
Achieve 0.0000 overlap on all 12 tests including 100K cells
ashvin-verma Mar 22, 2026
67c4890
Add WL optimization: gradient polish + re-legalize cycles
ashvin-verma Mar 22, 2026
3402250
Add cell swap WL optimization + gradient polish pipeline
ashvin-verma Mar 22, 2026
5387f73
Add optuna tuning + cell swap optimization
ashvin-verma Mar 22, 2026
09194ca
Add min-disturbance legalization (kept as option), edge visualization
ashvin-verma Mar 22, 2026
872ae60
Add multi-start solver, spectral init, fast cell swaps
ashvin-verma Mar 22, 2026
e87c03a
Update best config from optuna v3: WL 0.4091 on all tests
ashvin-verma Mar 22, 2026
a1298a5
Add barycentric refinement, scatter solver, streamline pipeline
ashvin-verma Mar 22, 2026
f181770
Targeted scatter: 12% WL improvement (0.45 → 0.40)
ashvin-verma Mar 22, 2026
af9718c
Multi-scatter WL optimization: 0.3842 avg WL, nuclear loss experiments
ashvin-verma Mar 22, 2026
1dc0c1b
Region-aware pre-positioning tested (reverted), 5-scatter iterations
ashvin-verma Mar 22, 2026
0d08356
Multi-pass compiler-style pipeline: legalize→scatter→GD→re-legalize
ashvin-verma Mar 22, 2026
d196dc2
Momentum barycentric, WL-aware legalization, spatial hash overlap check
ashvin-verma Mar 22, 2026
e1e4289
WIP: Net-aware legalizer (candidate-slot based)
ashvin-verma Mar 22, 2026
80fff43
Hybrid legalization: row-pack then net-aware refinement. WL 0.3613
ashvin-verma Mar 22, 2026
1b4240c
Detailed placement engine: pair swaps + cell reinsertion
ashvin-verma Mar 22, 2026
e3de003
New best: 0.3540 avg WL, rank 9. Detailed placement working.
ashvin-verma Mar 22, 2026
7246044
Add cell inflation, anchor loss, topology-preserving legalization
ashvin-verma Mar 23, 2026
40b0d24
Add iterative swap engine with cross-row reinsertion
ashvin-verma Mar 23, 2026
da96f37
Save architecture overhaul plan + Abacus legalizer + visual analysis
ashvin-verma Mar 23, 2026
757c809
Revert interleaved GD-legalize (regressed), save constructive plan
ashvin-verma Mar 23, 2026
41f1000
Test & revert row-aware GD (Step 2) — same ceiling as GD approach
ashvin-verma Mar 23, 2026
f7e10f9
Island-clustered init as multistart strategy
ashvin-verma Mar 23, 2026
be9b2ff
Test 4 init strategies: random wins 5/7 tests (avg 0.371)
ashvin-verma Mar 23, 2026
33ea8e6
WL-aware Abacus: wins 7/9 in isolation but pipeline co-adaptation blo…
ashvin-verma Mar 24, 2026
0775355
Constructive v2: legal-from-the-start, no GD needed
ashvin-verma Mar 24, 2026
97f196e
Macro avoidance via boundary projection in constructive v2
ashvin-verma Mar 24, 2026
7917d75
Zero overlap constructive placement via precomputed blocked intervals
ashvin-verma Mar 24, 2026
43cad7b
Zero overlap on all tests via precomputed blocked intervals
ashvin-verma Mar 24, 2026
ba81450
Reduce macro gaps + stronger swap engine with oscillation prevention
ashvin-verma Mar 24, 2026
aad3315
Bidirectional compaction + right-sized macro gaps = zero overlap all …
ashvin-verma Mar 24, 2026
3156149
Cluster-then-spread constructive: avg WL 0.406 (was 0.417)
ashvin-verma Mar 24, 2026
dd8b62f
BFS from macros tested, iterative averaging wins (0.406 vs 0.410)
ashvin-verma Mar 24, 2026
50e7cf7
BFS+avg converges to same local min as averaging alone — revert to si…
ashvin-verma Mar 24, 2026
64caa6c
WL-aware overlap resolution: avg 0.404 (was 0.406)
ashvin-verma Mar 24, 2026
75890a6
Hybrid GD+constructive spreading: avg 0.402, T2 best ever 0.315
ashvin-verma Mar 25, 2026
63bb79d
Within-row-only compact: same result as WL-aware spreading
ashvin-verma Mar 26, 2026
229dac9
Row redistribution: net zero (helps T2/T5/T8, hurts T6/T7)
ashvin-verma Mar 26, 2026
61e3e58
Load-balanced row assignment: avg 0.405 (was 0.406)
ashvin-verma Mar 27, 2026
c8dcb2f
GPU-friendly pair generation: torch ops instead of Python loops
ashvin-verma Mar 27, 2026
fad7151
Log GPU overlap speedup: 1.3-2.5x across all tests
ashvin-verma Mar 27, 2026
2f074d3
Fully vectorized pair generation for N<=2000
ashvin-verma Mar 27, 2026
25bc343
Final leaderboard submission: projected GD shelf legalizer
ashvin-verma Apr 22, 2026
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12
41 changes: 41 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,41 @@
# CLAUDE.md

## How to Run

This project runs under WSL. From a Windows terminal:

wsl -d Ubuntu-24.04

Then from the repo root (`/mnt/c/Users/ashvi/Documents/intern_challenge`):

uv run python test.py # upstream test suite (all 12 tests)
uv run python ashvin/run_tests.py # instrumented runner (timing + CSV)
uv run python ashvin/run_tests.py --tests 1,2,3 # run specific tests
uv run python ashvin/run_tests.py --tag experiment1 # tag for CSV filename

## Environment

- Python 3.12 (managed by uv)
- PyTorch with CUDA 12.8 (RTX 3080 Ti)
- Package manager: uv (see pyproject.toml)
- OS: WSL Ubuntu 24.04 on Windows 11

## Project Structure

- `placement.py` — challenge code (we implement `overlap_repulsion_loss()` here)
- `test.py` — upstream test harness (DO NOT MODIFY)
- `PLAN.md` — strategic roadmap
- `HISTORY.md` — raw experiment results log
- `PROGRESS.md` — analysis of each run: what worked, why, what to try next
- `ashvin/` — all custom code
- `ashvin/run_tests.py` — instrumented test runner with CSV output
- `ashvin/instrumented_train.py` — training wrapper with per-phase timing
- `ashvin/results/` — CSV output from experiments

## Conventions

1. `placement.py` is modified only for `overlap_repulsion_loss()` (the challenge). `test.py` is read-only.
2. All custom code goes in `ashvin/`.
3. Log every experiment: raw data in `HISTORY.md`, analysis in `PROGRESS.md`.
4. Primary metric: overlap_ratio (lower = better, 0.0 = perfect).
5. Secondary metric: normalized_wl (lower = better).
45 changes: 45 additions & 0 deletions HISTORY.md
@@ -0,0 +1,45 @@
# Experiment History

## Baseline — Placeholder Overlap Loss (2026-03-21)

**Config:** Default `train_placement()` params (1000 epochs, Adam lr=0.01, lambda_wl=1.0, lambda_overlap=10.0). `overlap_repulsion_loss()` is a placeholder returning constant 1.0.

| Test | Cells | Overlap | Norm WL | Time (s) |
|------|-------|---------|---------|----------|
| 1 | 22 | 0.9091 | 0.3435 | 16.16 |
| 2 | 28 | 0.8929 | 0.3450 | 0.62 |
| 3 | 32 | 0.9375 | 0.3492 | 0.59 |
| 4 | 53 | 0.8302 | 0.3866 | 0.93 |
| 5 | 79 | 0.9367 | 0.4173 | 0.82 |
| 6 | 105 | 0.7429 | 0.3443 | 0.83 |
| 7 | 155 | 0.7548 | 0.3403 | 0.90 |
| 8 | 157 | 0.8662 | 0.3784 | 0.89 |
| 9 | 208 | 0.6394 | 0.3787 | 0.87 |
| 10 | 2010 | 0.7846 | 0.3441 | 1.82 |

**Averages (tests 1-10):** overlap=0.8294, wl=0.3627, total_time=24.43s

**Notes:** Tests 11 (10K cells) and 12 (100K cells) not run — `calculate_cells_with_overlaps()` uses O(N^2) Python loops, too slow for large designs.

## Naive Overlap Loss — N×N Broadcasting (2026-03-21)

**Config:** Same default params. Implemented `overlap_repulsion_loss()` using pairwise broadcasting: `relu((w1+w2)/2 - |x1-x2|) * relu((h1+h2)/2 - |y1-y2|)`, upper triangle mask, normalized by pair count.

| Test | Cells | Overlap | Norm WL | Time (s) | Overlap Loss (s) |
|------|-------|---------|---------|----------|-------------------|
| 1 | 22 | 0.4091 | 0.5036 | 13.60 | 0.17 |
| 2 | 28 | 0.6429 | 0.4124 | 0.94 | 0.15 |
| 3 | 32 | 0.5000 | 0.6023 | 0.91 | 0.14 |
| 4 | 53 | 0.6038 | 0.4607 | 1.14 | 0.18 |
| 5 | 79 | 0.6076 | 0.5398 | 1.09 | 0.17 |
| 6 | 105 | 0.6476 | 0.4323 | 1.18 | 0.21 |
| 7 | 155 | 0.7097 | 0.3982 | 1.48 | 0.29 |
| 8 | 157 | 0.6815 | 0.4341 | 1.55 | 0.32 |
| 9 | 208 | 0.6202 | 0.4094 | 1.80 | 0.41 |
| 10 | 2010 | 0.8164 | 0.3486 | 67.84 | 30.79 |

**Averages (tests 1-10):** overlap=0.6239, wl=0.4541, total_time=91.53s

**vs Baseline:** overlap 0.83→0.62 (-25%), but wirelength worse 0.36→0.45 (tradeoff). Test 10 bottleneck: overlap loss 30.8s + backward 33.9s out of 66s training. O(N²) approach unusable for N>2000.

**Next:** Need hyperparameter tuning (more epochs, higher lambda_overlap, LR schedule) and scalable overlap engine for tests 11-12.
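The broadcasting loss described above can be sketched as follows. This is a NumPy stand-in for illustration, not the project's torch code; the torch version is the same with `np` → `torch` and `np.maximum(·, 0)` → `relu`.

```python
import numpy as np

def pairwise_overlap_loss(x, y, w, h):
    """Mean rectangle-overlap area over all unique cell pairs,
    via relu((w1+w2)/2 - |x1-x2|) * relu((h1+h2)/2 - |y1-y2|)."""
    dx = np.abs(x[:, None] - x[None, :])                      # N x N center distances
    dy = np.abs(y[:, None] - y[None, :])
    ox = np.maximum((w[:, None] + w[None, :]) / 2 - dx, 0.0)  # x-overlap length (ReLU)
    oy = np.maximum((h[:, None] + h[None, :]) / 2 - dy, 0.0)  # y-overlap length
    iu = np.triu_indices(len(x), k=1)                         # upper triangle = unique pairs
    return (ox * oy)[iu].mean()                               # normalized by pair count

# Two unit squares overlapping by 0.5 in x, plus one cell far away:
# one pair overlaps with area 0.5, so the mean over 3 pairs is 0.5/3.
x = np.array([0.0, 0.5, 10.0]); y = np.zeros(3)
w = np.ones(3); h = np.ones(3)
loss = pairwise_overlap_loss(x, y, w, h)
```

Both the memory and the time of this form are O(N²), which is exactly the scaling wall the entry above hits on test 10.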
253 changes: 253 additions & 0 deletions PLAN.md
@@ -0,0 +1,253 @@
# Goal
Build a strong solver for the partcl intern placement challenge.

Primary metric: overlap ratio = number of cells involved in overlaps / total cells.
Secondary metric: normalized wirelength.

Important constraint:
The test suite includes designs up to 10 macros + 100000 standard cells.
Do NOT use O(N^2) all-pairs overlap tensors except for tiny debugging cases.

# Problem framing
This is a scalable mixed-size overlap-removal problem, not full production PnR.
The best solution will likely be:
1. macro-aware
2. coarse-to-fine
3. spatially local
4. GPU-friendly
5. driven by search over solver schedules, not raw coordinate chromosomes

# Immediate tasks

## Task 1: inspect and instrument
- Read placement.py and test.py.
- Add timing breakdowns for:
- overlap loss
- wirelength loss
- optimizer step
- total runtime
- Add per-test logging:
- overlap_ratio
- num_cells_with_overlaps
- normalized_wl
- runtime
- Add seed control and CSV logging.

## Task 2: build a scalable overlap engine
Implement a spatial-hash or uniform-grid overlap candidate generator:
- bin cells by center
- only compare cells in same or neighboring bins
- support macros and std cells
- return candidate pairs
- compute overlap penalties only on candidate pairs

Need both:
- exact overlap metric for evaluation
- differentiable overlap loss for optimization
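A minimal sketch of the candidate generator, in plain Python with dict binning (the later GPU task replaces exactly this structure). Names here are illustrative; a real version would also insert each macro into every bin its bounding box touches, not just its center bin.

```python
from collections import defaultdict

def candidate_pairs(x, y, bin_size):
    """Uniform-grid candidate generator: bin cells by center, then
    compare only cells in the same or neighboring bins. Returns
    unique (i, j) pairs with i < j."""
    bins = defaultdict(list)
    for i, (cx, cy) in enumerate(zip(x, y)):
        bins[(int(cx // bin_size), int(cy // bin_size))].append(i)
    pairs = set()
    for (bx, by), cells in bins.items():
        for dx in (-1, 0, 1):                     # 3x3 neighborhood
            for dy in (-1, 0, 1):
                for j in bins.get((bx + dx, by + dy), ()):
                    for i in cells:
                        if i < j:
                            pairs.add((i, j))
    return pairs
```

Overlap penalties are then computed only on the returned pairs, so cost scales with local density rather than N².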

## Task 3: add a density term
Implement a bin overflow / density penalty:
- accumulate cell area into bins
- penalize overflow above target density
- make it differentiable if practical
- start with a simple smooth penalty
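A simple evaluation-side sketch of the bin-overflow penalty, assuming a square die of side `extent` and a quadratic penalty on overflow (both assumptions, not fixed by the plan). A differentiable torch version would spread each cell's area bilinearly across bins and accumulate with `index_add_`.

```python
import numpy as np

def density_penalty(x, y, area, nx, ny, extent, target_util=0.5):
    """Accumulate cell area into an nx x ny grid and penalize area
    above target utilization with a smooth quadratic penalty."""
    bw, bh = extent / nx, extent / ny
    bins = np.zeros((nx, ny))
    ix = np.clip((x / bw).astype(int), 0, nx - 1)   # bin index of each cell center
    iy = np.clip((y / bh).astype(int), 0, ny - 1)
    np.add.at(bins, (ix, iy), area)                 # unbuffered accumulate per bin
    overflow = np.maximum(bins - target_util * bw * bh, 0.0)
    return (overflow ** 2).sum()
```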

## Task 4: macro-first pipeline
Add a 2-stage solver:
- stage A: place / legalize macros first
- stage B: place std cells given macro anchors
- optional stage C: allow small macro nudges if hot bins remain

For macro placement, try:
- simulated annealing on macro coordinates
- or greedy local search with restarts

## Task 5: hot-bin repair
Implement a local repair pass:
- identify bins with highest overlap / overflow
- collect cells in those bins
- try batch local moves:
- small translations
- nearest-low-density-bin snap
- pair swaps
- short local reorder
- accept moves that reduce overlap first, wirelength second
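Two pieces of the repair pass above, sketched with illustrative names: bin selection and the lexicographic accept rule.

```python
import numpy as np

def hottest_bins(bin_overlap, k=3):
    """Pick the k grid bins with the largest accumulated overlap;
    the repair pass then tries local moves for cells in those bins."""
    flat = np.argsort(bin_overlap, axis=None)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, bin_overlap.shape))
            for i in flat]

def accept_move(d_overlap, d_wl):
    """Accept rule from the list above: reduce overlap first,
    wirelength second (deltas are candidate minus current)."""
    return d_overlap < 0 or (d_overlap == 0 and d_wl < 0)
```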

## Task 6: outer-loop search
Do NOT use GA over all cell coordinates.
Use evolutionary search over solver parameters and schedules:
- overlap weight schedule
- density weight schedule
- learning rate / temperature schedule
- bin size
- number of repair passes
- macro move radius
- restart count
- stage transition criteria

Represent one candidate as a compact config dict.
Each candidate decodes into a deterministic or semi-deterministic run.
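One candidate might look like the dict below. The keys are illustrative, not the project's actual parameter names; the point is that the EA mutates this flat genome, never raw coordinates.

```python
# One outer-loop candidate: a solver schedule as a compact config dict.
candidate = {
    "lambda_overlap_start": 10.0,   # overlap-weight ramp endpoints
    "lambda_overlap_end": 200.0,
    "lambda_density": 1.0,
    "lr": 0.01,
    "lr_schedule": "cosine",
    "bin_size": 4.0,
    "repair_passes": 3,
    "macro_move_radius": 8.0,
    "restarts": 2,
    "seed": 2001,                   # fixed seed -> deterministic decode
}
```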

## Task 7: GPU acceleration
Port bottlenecks first:
- binning
- sorting / grouping
- candidate pair generation
- overlap scoring
- batch move scoring

Use PyTorch or Triton if convenient.
Do not port high-level orchestration until kernels matter.

# Experiments to run

## Baseline set
1. repo baseline
2. scalable overlap only
3. scalable overlap + density
4. macro-first + scalable overlap + density
5. macro-first + hot-bin repair
6. outer-loop EA over schedules
7. macro SA + deterministic cell spreading
8. parallel multi-start SA on macro-only state

## Ablations
- no macro-first
- no density term
- no hot-bin repair
- no outer-loop search
- different bin sizes
- different overlap penalties:
- area
- squared area
- softplus on overlap lengths before multiply
- different schedules:
- fixed
- ramped overlap weight
- overlap-first then WL polish

# Acceptance criteria
- Solver handles all benchmark sizes without OOM
- Overlap ratio driven to ~0 on most or all tests
- Runtime remains competitive
- Wirelength improves once overlap is solved

# Guardrails
- Never introduce full NxN tensors for large cases
- Do not use GA over raw coordinates
- Do not spend time on RL or learned policies yet
- Keep every change behind a config flag
- Always run ablations and save results to CSV

# Deliverables
- clean solver code
- config-driven experiment runner
- CSV results
- short notes on what helped, what failed, and why

# Post-algo: competitive analysis & tuning

## Task 8: compile & document what we did
- Write up each heuristic, what worked, what didn't, with numbers
- Clean up PROGRESS.md into a coherent narrative
- Ensure all code is well-organized in ashvin/

## Task 9: competitor analysis
- Download competitor solutions from the old leaderboard PRs (partcleda/intern_challenge)
- Run their solutions through our test suite
- Compare: overlap, wirelength, runtime
- Plot inspections: how do their placements look vs ours?
- Identify what they got right that we missed

## Task 10: new heuristics (informed by competitor analysis + literature)

### Key competitor insights (old leaderboard, all achieved 0.0000 overlap):
- **Annealed softplus** (not ReLU): beta ramps 0.1→4.0. Smooth early, sharp late. Used by top 3.
- **Lambda ramping**: overlap weight 20→200 linear (Shashank) or 4*(e/E)^10 exponential (Brayden)
- **Warmup + cosine LR**: LinearLR 5% warmup then CosineAnnealing. Pawan's 1.74s solution.
- **Deterministic legalization**: row-packing guarantees 0.0000. Marcos, 2.3s for 100K cells.
- **Soft-Coulomb repulsion**: 1/r² global field for spreading (manuhalapeth, WL=0.2630)
- **Cell swaps on high-WL edges**: Shashank's WL secret (0.1310)

### Strategies to implement:

**Strategy A: Annealed activation + lambda ramp + more epochs**
- Replace ReLU with annealed softplus/GELU/leaky-ReLU (try all three)
- Ramp lambda_overlap from 10% to 100% over training
- Warmup LR (5%) + cosine decay
- Double epochs (1000 per stage → 2000 total or more)
- This is the common thread across ALL zero-overlap competitors
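The two schedules above can be sketched in a few lines; the endpoint values follow the ramps quoted from the leaderboard, and the function names are ours.

```python
import math

def softplus(x, beta):
    """Softplus with sharpness beta: (1/beta) * log(1 + exp(beta*x)).
    Approaches ReLU as beta grows. Written in the numerically
    stable form max(x, 0) + log1p(exp(-beta*|x|)) / beta."""
    return max(x, 0.0) + math.log1p(math.exp(-beta * abs(x))) / beta

def schedules(epoch, total, beta_lo=0.1, beta_hi=4.0, lam_lo=20.0, lam_hi=200.0):
    """Linear anneal of softplus beta (0.1 -> 4.0: smooth early, sharp
    late) and of lambda_overlap (20 -> 200) over training."""
    t = epoch / max(total - 1, 1)
    return beta_lo + t * (beta_hi - beta_lo), lam_lo + t * (lam_hi - lam_lo)
```

Early training then sees a soft, wide-gradient penalty that spreads cells globally; late training sees a near-ReLU penalty that drives residual overlaps to zero.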

**Strategy B: Simulated annealing for macro placement (Stage A replacement)**
- SA naturally maximizes entropy → spreads macros apart
- Perturbation: random macro translations, swaps
- Energy = overlap_area + alpha * wirelength
- Temperature schedule: high→low over iterations
- Accept worse moves probabilistically → escapes local minima
- Literature: TimberWolf (Sechen 1986), Dragon (Wang+ 2000) use SA for macro placement
- Our current gradient descent on 10 macros gets stuck; SA explores better
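The SA loop itself is generic; a sketch with geometric cooling and Metropolis acceptance is below. The `energy` and `perturb` callables (overlap + alpha·WL, and random macro translations/swaps) are supplied by the caller and are assumptions here.

```python
import math
import random

def anneal(state, energy, perturb, t0=1.0, t1=1e-3, iters=2000, seed=0):
    """Simulated annealing: geometric cooling t0 -> t1, accept worse
    moves with probability exp(-delta / t), track the best state seen."""
    rng = random.Random(seed)
    cur, cur_e = state, energy(state)
    best, best_e = cur, cur_e
    for k in range(iters):
        t = t0 * (t1 / t0) ** (k / (iters - 1))   # geometric temperature schedule
        cand = perturb(cur, rng)
        e = energy(cand)
        if e < cur_e or rng.random() < math.exp((cur_e - e) / t):
            cur, cur_e = cand, e                  # Metropolis accept
            if e < best_e:
                best, best_e = cand, e
    return best, best_e
```

On a toy 1-D quadratic this converges near the minimum; on macros, `state` would be the coordinate array and the same loop applies unchanged.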

**Strategy C: Deterministic legalization (guarantees 0.0000)**
- After gradient descent + SA, run row-based greedy packing
- Sort cells by x-coordinate, assign to rows, resolve conflicts by shifting
- Handles macros as fixed obstacles, packs std cells around them
- Marcos achieves 100K cells in 2.3s with this approach
- Eliminates need for our current greedy repair (which doesn't guarantee 0)
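A minimal sketch of the row-packing idea, without macro obstacles (a real version, as noted, treats macros as pre-blocked intervals per row). Function and argument names are ours.

```python
def legalize_rows(cells, row_ys, row_x0, row_x1):
    """Greedy row packing: sort cells by x, assign each to the
    nearest-y row with space left, place it at that row's current
    frontier. Overlap is 0 by construction.
    cells: list of (x, y, w); returns list of legal (x, y)."""
    frontier = {ry: row_x0 for ry in row_ys}      # next free x in each row
    placed = [None] * len(cells)
    order = sorted(range(len(cells)), key=lambda i: cells[i][0])
    for i in order:
        x, y, w = cells[i]
        for ry in sorted(row_ys, key=lambda ry: abs(ry - y)):  # nearest row first
            if frontier[ry] + w <= row_x1:        # cell still fits in this row
                placed[i] = (frontier[ry], ry)
                frontier[ry] += w
                break
    return placed
```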

**Strategy D: WL-aware post-optimization**
- After legalization (overlap = 0 guaranteed), optimize wirelength
- Cell swaps: for each high-WL edge, try swapping endpoints with neighbors
- Barycentric refinement: move each cell toward weighted center of its connected cells
- Accept moves only if overlap stays at 0
- This is where Shashank gets 0.1310 WL vs everyone else's 0.26+
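One barycentric pass can be sketched as below (names and the `alpha` damping factor are assumptions). In the real pipeline each proposed move is kept only if overlap stays at 0.

```python
def barycentric_step(pos, nets, alpha=0.5):
    """Move each cell a fraction alpha toward the mean position of
    all cells it shares a net with.
    pos: {cell: (x, y)}; nets: list of cell-name lists."""
    neighbors = {c: [] for c in pos}
    for net in nets:
        for c in net:
            neighbors[c].extend(n for n in net if n != c)
    new = {}
    for c, (x, y) in pos.items():
        ns = neighbors[c]
        if not ns:                                 # unconnected cell: leave in place
            new[c] = (x, y)
            continue
        bx = sum(pos[n][0] for n in ns) / len(ns)  # barycenter of connected cells
        by = sum(pos[n][1] for n in ns) / len(ns)
        new[c] = (x + alpha * (bx - x), y + alpha * (by - y))
    return new
```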

### Implementation order:
1. Strategy A first (quick win, config changes only)
2. Strategy C next (guarantees 0.0000, enables WL optimization)
3. Strategy B if Stage A still fails on some seeds
4. Strategy D last (WL polish, competitive edge)

## Task 11: optuna hyperparameter tuning
- Define search space over: activation type, beta schedule, lambda ramp curve,
LR + warmup, epochs per stage, bin_size, repair params
- Objective: minimize overlap_ratio, tiebreak on normalized_wl
- Run on tests 1-10 with budget ~100-200 trials
- Also evaluate on alternate seeds (2001-2010) to prevent overfitting
- Apply best config, verify generalization
- Record best config and results

## Task 12: GPU acceleration (originally Task 7)
- Port pair generation to GPU (current bottleneck for 100K cells)
- Vectorize bin assignment + neighbor lookup
- Use torch sorting + searchsorted instead of Python defaultdict
- Target: test 12 under 60s (currently 392s)
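The sort + searchsorted replacement for the defaultdict can be sketched as below. It is written with NumPy so it runs anywhere; `torch.sort` and `torch.searchsorted` have the same shape and run the identical logic on the GPU.

```python
import numpy as np

def group_by_bin(bin_ids):
    """Vectorized bin grouping: sort cell indices by bin id, then find
    each bin's segment with searchsorted. Cells of bin uniq[k] are
    order[starts[k]:starts[k+1]] -- no Python dict, no per-cell loop."""
    order = np.argsort(bin_ids, kind="stable")     # cell indices sorted by bin
    sorted_ids = bin_ids[order]
    uniq = np.unique(sorted_ids)                   # occupied bins, ascending
    starts = np.searchsorted(sorted_ids, uniq)     # segment start per bin
    return order, np.append(starts, len(bin_ids)), uniq
```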

# Longer-term plan (playing to win)

## Phase 1: Zero overlap (current priority)
- Strategy A + C should get us to 0.0000 on all tests
- Optuna tunes the exact schedule
- Validate on alternate seeds

## Phase 2: WL optimization (competitive edge)
- Strategy D (cell swaps + barycentric refinement)
- Multi-start: run solver 3-5 times with different seeds, pick best WL
- Edge sampling for large designs (Marcos: 50-80K edges/epoch)

## Phase 3: Scale + speed
- GPU acceleration for 100K cell tests
- Adaptive epoch count by problem size
- Target: all 12 tests under 60s total

## Phase 3.5: Benchmark competitors on tests 11-12
- Run top competitor solutions on the NEW test suite (tests 11-12: 10K and 100K cells)
- Most competitors used O(N²) approaches — they will OOM or timeout on 100K cells
- Document which competitors scale and which don't
- This is our competitive advantage: scalable spatial hash + legalization

## Phase 4: Outlandish ideas (if gap remains)
- Soft-Coulomb repulsion field (manuhalapeth)
- Graph neural network for initial placement prediction
- Force-directed placement with momentum
- Spectral placement (eigenvector-based initial positions)
- Reinforcement learning for schedule selection