diff --git a/LitReview.tex b/LitReview.tex deleted file mode 100644 index e69de29..0000000 diff --git a/ML Primary Lit/2021.eacl-main.74.pdf b/ML Primary Lit/2021.eacl-main.74.pdf deleted file mode 100644 index f460810..0000000 Binary files a/ML Primary Lit/2021.eacl-main.74.pdf and /dev/null differ diff --git a/ML Primary Lit/25a43194-c74c-4cd3-b60f-0a1f27f8b8af.pdf b/ML Primary Lit/25a43194-c74c-4cd3-b60f-0a1f27f8b8af.pdf deleted file mode 100644 index 8034481..0000000 Binary files a/ML Primary Lit/25a43194-c74c-4cd3-b60f-0a1f27f8b8af.pdf and /dev/null differ diff --git a/ML Primary Lit/NeurIPS-2020-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks-Paper.pdf b/ML Primary Lit/NeurIPS-2020-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks-Paper.pdf deleted file mode 100644 index d4805ea..0000000 Binary files a/ML Primary Lit/NeurIPS-2020-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks-Paper.pdf and /dev/null differ diff --git a/ML_Project_Report_Group_2.pdf b/ML_Project_Report_Group_2.pdf new file mode 100644 index 0000000..f9dbcc7 Binary files /dev/null and b/ML_Project_Report_Group_2.pdf differ diff --git a/REPORT_GUIDE.md b/REPORT_GUIDE.md deleted file mode 100644 index eb620af..0000000 --- a/REPORT_GUIDE.md +++ /dev/null @@ -1,490 +0,0 @@ -# RAG Chunk-Routing — Project & Report Writer Guide - -> General Project Layout and Advice for Report Writers - ---- - -## 1. What We Are Building - -We test a single question: *can a cheap, query-only classifier pick the right -chunk size for RAG retrieval and recover most of the "oracle gap" without paying -the cost of full retrieval fusion?* - -**The oracle gap** is the difference between the best any fixed chunk size achieves -and what a perfect router would achieve if it always picked the best size per -question. 
On our test set that gap is **8.19 F1 points**: - -- Best fixed-size baseline (128-token chunks): F1 = **0.2128** -- Oracle ceiling: F1 = **0.2947** -- Gap: **8.19 F1 points** - -We compare four system families: - -| System | Description | -|---|---| -| Fixed-128 / 256 / 512 | Always retrieve from one chunk-size index | -| RRF Fusion | Retrieve from all three indices, merge with Reciprocal Rank Fusion | -| Oracle ceiling | Always picks the per-question best size (unreachable upper bound) | -| Router | Learned classifier predicts best chunk size from query features alone | - -The corpus is medical literature on **Friedreich Ataxia** (~1.4 MB). The QA set -has **398 validated question-answer pairs** across three types: factoid, multihop, -and synthesis. - ---- - -## 2. Repository Layout - -``` -ML-Project/ -├── LitReview.tex # Literature review -├── neurips_2026.tex # NeurIPS 2026 paper template -├── neurips_2026.sty # NeurIPS style file -├── README.md # Original team working document -├── REPORT_GUIDE.md # ← this file -│ -└── rag-chunk-routing/ # Everything lives here - ├── configs/ # YAML hyperparameter files - ├── data/ # Raw corpus (never modified) - ├── artifacts/ # Derived data (rebuildable from scripts) - │ ├── chunks/ # Chunked corpus (JSONL per size) - │ ├── indices/ # FAISS retrieval indices - │ ├── qa/ # Validated QA pairs - │ ├── splits/ # Train/val/test split indices - │ ├── oracle/ # Oracle labels and evaluation grid - │ └── baselines/ # ★ Baseline metrics — primary figure data ★ - ├── experiments/ # CLI entry-point scripts (run these) - ├── rag_cr/ # Reusable Python library - ├── results/ # Timestamped run outputs - │ ├── 20260508_000000_fusion/ # ★ Latest fusion run ★ - │ └── 20260508_000001_router/ # ★ Latest router run ★ - ├── prompts/ # LLM prompt templates - ├── slurm/ # HPC job scripts - └── tests/ # Unit tests -``` - -### Three-tier data philosophy - -| Tier | Location | Rule | -|---|---|---| -| **Raw** | `data/` | Never touch.
Read-only forever. | -| **Artifacts** | `artifacts/` | Rebuildable by running the pipeline scripts. | -| **Results** | `results/` | Timestamped, append-only — never overwrite a past run. | - ---- - -## 3. Artifacts — Detailed Map - -### `artifacts/chunks/` - -Chunked corpus in JSONL format. Each line: `{chunk_id, size, start_char, end_char, text}`. - -| File | Chunks | -|---|---| -| `128.jsonl` | 2,911 | -| `256.jsonl` | fewer (larger chunks) | -| `512.jsonl` | fewer still | -| `1024.jsonl` | fewest (dropped from canonical experiments) | - -### `artifacts/indices/` - -FAISS dense-retrieval indices built with BAAI/BGE-small-en-v1.5 embeddings on CPU. -One `.faiss` file per chunk size. - -### `artifacts/qa/` - -| File | Contents | -|---|---| -| `qa_validated.jsonl` | 398 human-reviewed QA pairs: `{qa_id, question, answer, source_chunk_id, type, validated}` | -| `qa_rejected.jsonl` | Filtered pairs with reject reasons | - -Type breakdown by split: - -| Type | Train | Val | Test | -|---|---|---|---| -| factoid | 80 | 25 | 28 | -| multihop | 78 | 26 | 28 | -| synthesis | 79 | 26 | 28 | -| **Total** | **237** | **77** | **84** | - -### `artifacts/splits/` - -Stratified split (seed=42). `manifest.json` contains audit metadata. -`train.jsonl`, `val.jsonl`, `test.jsonl` contain split indices. - -### `artifacts/oracle/` - -| File | Contents | -|---|---| -| `eval_grid.jsonl` | Full 398×4 grid: every (question, chunk_size) pair scored with `em, f1, faithfulness, cost_tokens` | -| `labels.jsonl` | Per-question oracle label: `{qa_id, best_size, scores_by_size}` for the canonical 3-size action space | -| `labels_full.jsonl` | Same but including size 1024 (for the ablation) | - -### `artifacts/baselines/` — Primary data source for figures - -Six JSON files summarising system performance on the **test split**. 
- -#### `test_summary.json` — Fixed-size baselines (canonical 3 sizes) - -```json -{ - "128": {"n": 84, "f1": 0.2128, "em": 0.1429, "faithfulness": 0.3626}, - "256": {"n": 84, "f1": 0.1947, "em": 0.1190, "faithfulness": 0.3451}, - "512": {"n": 84, "f1": 0.1830, "em": 0.1190, "faithfulness": 0.3396} -} -``` - -#### `test_summary_full.json` — Fixed-size baselines including 1024 - -Size 1024: F1 = 0.1573. Confirms that adding 1024 is strictly worse than any of -the canonical three sizes. - -#### `oracle_gap.json` — Oracle ceiling analysis (canonical 3 sizes) - -```json -{ - "action_space": [128, 256, 512], - "oracle_f1_mean": 0.2947, - "best_baseline_f1": 0.2128, - "gap_f1_points": 8.19, - "per_type": { - "factoid": {"oracle_f1": ..., "best_baseline_f1": ..., "gap": 14.63}, - "multihop": {"..."}, - "synthesis":{"..."} - } -} -``` - -**Factoid questions have the largest gap (~14.63 pts).** - -#### `oracle_gap_full.json` — Oracle gap with all 4 sizes - -Gap with 1024 included is 10.61 pts (larger than the canonical 8.19 pts because -oracle can sometimes exploit 1024). Used to justify keeping the canonical 3-size -action space for the main experiments. - -#### `size_distribution.json` — Oracle-best size distribution (canonical) - -Which chunk size is "best" per question, by split: - -| Size | Train | Val | Test | -|---|---|---|---| -| 128 | 83.1% | 88.3% | 82.1% | -| 256 | 11.0% | 7.8% | 11.9% | -| 512 | 5.9% | 3.9% | 6.0% | - -**Key insight:** 128 wins 82% of the time. The learning problem is heavily -class-imbalanced, which is why a naive always-128 classifier achieves decent -accuracy but fails on F1 gap recovery. - -#### `size_distribution_full.json` — Same with 1024 - -1024 captures near-zero oracle selections — further justifying its exclusion. - ---- - -## 4. Results — Run Outputs - -All runs live under `rag-chunk-routing/results/` as timestamp-prefixed folders. - -### `results/SUMMARY.md` - -Auto-generated leaderboard comparing all systems evaluated so far. 
**Always check -this first** before opening individual run folders. - -### Key runs from 2026-05-08 - -#### `results/20260508_000000_fusion/metrics.json` - -RRF Fusion on the test split: - -| Metric | Value | -|---|---| -| F1 | **0.2233** | -| EM | 0.1429 | -| Faithfulness | 0.3669 | -| Total cost tokens | 68,815 | -| Mean cost per query | ~819 tokens | - -Fusion recovers **~12.8% of the oracle gap** at the cost of querying all three -indices simultaneously. - -#### `results/20260508_000001_router/metrics.json` - -Learned Router on the test split: - -| Metric | Value | -|---|---| -| F1 | **0.171** | -| EM | 0.107 | -| Router macro F1 (classification) | 0.292 | -| Mean cost per query | 569.1 tokens | -| Gap closure fraction | **−0.511** | - -The router currently **underperforms** the best fixed baseline by a large margin -(−51% gap recovery). This is the central negative result of the paper. - -### Historical runs - -| Timestamp prefix | Type | Notes | -|---|---|---| -| `20260507_164140` | fusion | Previous fusion run | -| `20260507_163556` | fusion | Earlier fusion attempt | -| `20260506_*` | fusion | Multiple earlier fusion iterations | -| `20260430_135557` | oracle | Oracle ceiling evaluation | -| `20260430_135552` | fixed_512 | Fixed-512 baseline | -| `20260428_202144` | fixed_512 | Earlier fixed-512 run | -| `20260421_115611` | fixed_256 | Fixed-256 baseline | - ---- - -## 5. Code Structure - -### `configs/` - -| File | Purpose | -|---|---| -| `base.yaml` | Master config: chunk sizes `[128,256,512]`, embedder (BGE-small), generation model (Qwen2.5-7B via Ollama), router feature/classifier grid | -| `cluster.yaml` | HPC variant (vLLM backend) | -| `eval_dry_run.yaml` | Quick smoke-test | - -All hyperparameters live here — never hardcode values in scripts. - -### `experiments/` — Pipeline entry points (run in order) - -``` -1. build_indices.py Chunk corpus → embed → build FAISS indices -2. generate_qa.py Synthetic QA via OpenAI -3. 
filter_qa.py Primary-F1 + LLM-judge filtering -4. validate_qa.py Interactive human review -5. make_splits.py Stratified train/val/test split -6. compute_oracle.py Score all (question, size) pairs -7. run_baselines.py Fixed-size metrics + oracle gap → artifacts/baselines/ -8. run_fusion.py RRF Fusion evaluation → results/ -9. train_router.py CV grid search + val re-ranking -10. run_router.py Router evaluation → results/ -11. make_frontier.py Accuracy-cost frontier plot -12. make_figures.py Report tables (LaTeX) and figures -13. make_router_figures.py Router-specific visualizations -``` - -### `rag_cr/` — Reusable library - -| Module | Role | -|---|---| -| `chunking.py` | Tokenizer-aware fixed-size chunking | -| `embedding.py` | BGE-small dense embeddings | -| `indexing.py` | FAISS build/persist/search | -| `retrieval.py` | Single-scale and RRF Fusion retrieval | -| `metrics.py` | EM, F1, faithfulness scoring | -| `oracle.py` | Oracle label derivation | -| `systems.py` | System abstractions (FixedSize, Fusion, Oracle, Router) | -| `router/features.py` | TF-IDF, MiniLM, and handcrafted query feature extractors | -| `router/models.py` | LogReg, LinearSVM, LightGBM classifier wrappers | -| `router/train.py` | 5-fold CV grid search + val re-ranking | - ---- - -## 6. Guide for Report Writers - -This section is for team members creating **figures and tables** for the NeurIPS -2026 submission. **You do not need to re-run any experiments.** All data is -already in `artifacts/baselines/` and `results/20260508_*/`. 
- ---- - -### Data cheat sheet - -| What you need | File | Key | -|---|---|---| -| Fixed-size F1 / EM / faithfulness | `artifacts/baselines/test_summary.json` | `"128"`, `"256"`, `"512"` | -| Oracle ceiling | `artifacts/baselines/oracle_gap.json` | `oracle_f1_mean`, `gap_f1_points` | -| Per-type oracle gap | `artifacts/baselines/oracle_gap.json` | `per_type` | -| Oracle-best size distribution | `artifacts/baselines/size_distribution.json` | `test`, `train`, `val` | -| 4-size ablation baselines | `artifacts/baselines/test_summary_full.json` | adds `"1024"` | -| 4-size ablation oracle gap | `artifacts/baselines/oracle_gap_full.json` | — | -| Fusion results | `results/20260508_000000_fusion/metrics.json` | `f1`, `em`, `faithfulness`, `cost_tokens_total` | -| Router results | `results/20260508_000001_router/metrics.json` | `f1`, `em`, `gap_closure_fraction`, `mean_cost_tokens` | -| Per-question predictions + cost | `artifacts/oracle/eval_grid.jsonl` | `f1`, `cost_tokens`, `type`, `chunk_size` | - ---- - -### Figures to produce - -#### Figure 1 — Main results bar chart - -**Goal:** Show all five systems vs. oracle ceiling on F1. - -- **X-axis:** Systems — Fixed-128, Fixed-256, Fixed-512, Fusion, Router -- **Y-axis:** F1 on test split -- **Add** a horizontal dashed line at Oracle F1 = 0.2947, labelled "Oracle ceiling" -- Optionally add error bars from `eval_grid.jsonl` (bootstrap or per-type std) - -Numbers to plot: - -| System | F1 | Source | -|---|---|---| -| Fixed-128 | 0.2128 | `test_summary.json` | -| Fixed-256 | 0.1947 | `test_summary.json` | -| Fixed-512 | 0.1830 | `test_summary.json` | -| Fusion | 0.2233 | `results/20260508_000000_fusion/metrics.json` | -| Router | 0.171 | `results/20260508_000001_router/metrics.json` | -| Oracle | 0.2947 | `oracle_gap.json` | - ---- - -#### Figure 2 — Per-type oracle gap breakdown - -**Goal:** Show where the gap is largest (factoid >> multihop >> synthesis). 
- -- **Type:** Horizontal bar chart, one row per question type -- **Bars:** oracle F1 (full) with best-baseline F1 marked inside (stacked or grouped) -- **Data source:** `oracle_gap.json` → `per_type` field - -**Key message for caption:** Factoid questions carry the largest gap (~14.6 pts), -explaining why the router's failure on factoid is the dominant contributor to its -overall underperformance. - ---- - -#### Figure 3 — Oracle-best chunk size distribution - -**Goal:** Show class imbalance (why the router defaults to predicting 128). - -- **Type:** Stacked horizontal bar chart, one bar per split (train / val / test) -- **Segments:** 128 (blue), 256 (orange), 512 (green) -- **Data source:** `size_distribution.json` - -**Key message for caption:** 128-token chunks are oracle-best for ~82% of test -questions — the router must overcome severe class imbalance to improve over -always-predict-128. - ---- - -#### Figure 4 — Accuracy vs. retrieval cost frontier - -**Goal:** Position each system on a cost-effectiveness plane. - -- **X-axis:** Mean retrieval cost per query (tokens) -- **Y-axis:** F1 on test split -- **Each system is one point.** Draw a Pareto frontier line connecting non-dominated - points. - -Approximate values (compute exact per-query means from `eval_grid.jsonl` -`cost_tokens` field if needed): - -| System | Approx. mean cost tokens | F1 | -|---|---|---| -| Fixed-128 | ~570 | 0.2128 | -| Fixed-256 | ~570 | 0.1947 | -| Fixed-512 | ~570 | 0.1830 | -| Fusion | ~820 | 0.2233 | -| Router | 569.1 | 0.171 | -| Oracle | ~570 | 0.2947 | - -**Key message for caption:** The router sits at single-index cost but below even -Fixed-128 quality — Fusion achieves better F1 at only moderate extra cost. - ---- - -#### Figure 5 — 4-size ablation (supplementary) - -**Goal:** Justify dropping size 1024 from the canonical action space. - -Same bar chart structure as Figure 1 but add Fixed-1024 (F1 = 0.1573, -`test_summary_full.json`). 
The oracle gap narrows (10.61 pts → 8.19 pts) when -1024 is dropped, because the oracle can no longer exploit 1024; explain this in the caption using -`oracle_gap_full.json`. - ---- - -#### Table 1 — Main results (LaTeX) - -Produce a LaTeX `booktabs` table: - -| System | F1 | EM | Faithfulness | Cost (tokens) | Gap Recovery | -|---|---|---|---|---|---| -| Fixed-128 | 0.2128 | 0.1429 | 0.3626 | ~570 | 0% (baseline) | -| Fixed-256 | 0.1947 | 0.1190 | 0.3451 | ~570 | −21.8% | -| Fixed-512 | 0.1830 | 0.1190 | 0.3396 | ~570 | −36.1% | -| RRF Fusion | 0.2233 | 0.1429 | 0.3669 | ~820 | +12.8% | -| Router | 0.171 | 0.107 | — | 569.1 | −51.1% | -| Oracle | 0.2947 | — | — | ~570 | 100% | - -Gap Recovery formula: -``` -gap_recovery = (system_F1 − best_baseline_F1) / (oracle_F1 − best_baseline_F1) - = (system_F1 − 0.2128) / (0.2947 − 0.2128) -``` - -Pre-computed values: -- Fusion: (0.2233 − 0.2128) / 0.0819 = **+12.8%** -- Router: (0.171 − 0.2128) / 0.0819 = **−51.1%** - ---- - -### Existing figure scripts - -`experiments/make_figures.py` and `experiments/make_router_figures.py` already -exist. **Check these first** — they may already read the right files and only need -minor updates for the latest 20260508 run paths. Only write new plotting code if -these scripts are missing a specific figure you need. - ---- - -### Plotting conventions - -- Use **matplotlib** (or seaborn on top of it). -- Target the NeurIPS text width of **5.5 in**; the template is single-column (`neurips_2026.sty` sets `textwidth=5.5in`). -- Use a **colourblind-safe palette** (`seaborn colorblind` or ColorBrewer Set2). -- Export as **PDF** for the LaTeX submission; **PNG at 300 dpi** for slides. -- Set `plt.rcParams["font.size"] = 9`, slightly below the 10 pt NeurIPS body text. -- All data is deterministic — figures must be reproducible with a fixed script and - no random seed required. - ---- - -## 7.
Current Status - -| Phase | Status | Description | -|---|---|---| -| Phase 1 | ✅ Done | Infrastructure: chunking, indexing, QA generation, validated splits | -| Phase 2 | ✅ Done | Core experiments: all systems have test-set numbers | -| Phase 3 | 🔄 In progress | Freeze, ablations, paper figures | - -**The router underperforms.** The paper frames this as a *negative result*: a -simple query-only classifier cannot reliably select chunk size, and the difficulty -stems from (a) severe class imbalance toward 128, (b) limited training data (237 -examples), and (c) the absence of any retrieval signal in the router's features. - ---- - -## 8. Reproduction Checklist (for developers, not report writers) - -To reproduce results from scratch: - -```bash -cd rag-chunk-routing - -# Build everything up to oracle labels -python experiments/build_indices.py --config configs/base.yaml -python experiments/compute_oracle.py --config configs/base.yaml - -# Run evaluations -python experiments/run_baselines.py --config configs/base.yaml -python experiments/run_fusion.py --config configs/base.yaml -python experiments/train_router.py --config configs/base.yaml -python experiments/run_router.py --config configs/base.yaml - -# Generate figures -python experiments/make_figures.py -python experiments/make_router_figures.py -``` - -`qa_validated.jsonl` and `eval_grid.jsonl` are already generated. Regenerating -them requires OpenAI API calls and Ollama inference — expensive and unnecessary -unless you suspect data corruption. 
- ---- - -*Last updated: 2026-05-09* diff --git a/neurips_2026.sty b/neurips_2026.sty deleted file mode 100644 index e728398..0000000 --- a/neurips_2026.sty +++ /dev/null @@ -1,420 +0,0 @@ -% partial rewrite of the LaTeX2e package for submissions to the -% Conference on Neural Information Processing Systems (NeurIPS): -% -% - uses more LaTeX conventions -% - line numbers at submission time replaced with aligned numbers from -% lineno package -% - \nipsfinalcopy replaced with [final] package option -% - automatically loads times package for authors -% - loads natbib automatically; this can be suppressed with the -% [nonatbib] package option -% - adds foot line to first page identifying the conference -% - adds preprint option for submission to e.g. arXiv -% - conference acronym modified -% - update foot line to display the track name -% -% Roman Garnett (garnett@wustl.edu) and the many authors of -% nips15submit_e.sty, including MK and drstrip@sandia -% -% last revision: April 2025 - -\NeedsTeXFormat{LaTeX2e} -\ProvidesPackage{neurips_2026}[2026/01/01 NeurIPS 2026 style file] - -% declare final option, which creates camera-ready copy -\newif\if@neuripsfinal\@neuripsfinalfalse -\DeclareOption{final}{ - \@neuripsfinaltrue - \@anonymousfalse -} - -% declare nonatbib option, which does not load natbib in case of -% package clash (users can pass options to natbib via -% \PassOptionsToPackage) -\newif\if@natbib\@natbibtrue -\DeclareOption{nonatbib}{ - \@natbibfalse -} - -% declare preprint option, which creates a preprint version ready for -% upload to, e.g., arXiv -\newif\if@preprint\@preprintfalse -\DeclareOption{preprint}{ - \@preprinttrue - \@anonymousfalse -} - -% determine the track of the paper in camera-ready mode -\newif\if@main\@maintrue -\DeclareOption{main}{ - \@maintrue - \newcommand{\@trackname}{\@neuripsordinal\ Conference on Neural Information Processing Systems (NeurIPS \@neuripsyear).} -} -\newif\if@position\@positionfalse -\DeclareOption{position}{ - 
\@positiontrue - \newcommand{\@trackname}{\@neuripsordinal\ Conference on Neural Information Processing Systems (NeurIPS \@neuripsyear) Position Paper Track.} -} -\newif\if@dandb\@dandbfalse -\DeclareOption{dandb}{ - \@dandbtrue - \@anonymousfalse - \newcommand{\@trackname}{\@neuripsordinal\ Conference on Neural Information Processing Systems (NeurIPS \@neuripsyear) Track on Datasets and Benchmarks.} -} -\newif\if@creativeai\@creativeaifalse -\DeclareOption{creativeai}{ - \@creativeaitrue - \@anonymousfalse - \newcommand{\@trackname}{\@neuripsordinal\ Conference on Neural Information Processing Systems (NeurIPS \@neuripsyear) Creative AI Track.} -} - -% For anonymous or non-anonymous -\newif\if@anonymous\@anonymoustrue - -% For workshop papers -\newcommand{\@workshoptitle}{} -\newcommand{\workshoptitle}[1]{\renewcommand{\@workshoptitle}{#1}} - -\newif\if@workshop\@workshopfalse -\DeclareOption{sglblindworkshop}{ - \@workshoptrue - \@anonymousfalse - \newcommand{\@trackname}{\@neuripsordinal\ Conference on Neural Information Processing Systems (NeurIPS \@neuripsyear) Workshop: \@workshoptitle.} -} -\DeclareOption{dblblindworkshop}{ - \@workshoptrue - \newcommand{\@trackname}{\@workshoptitle\ --- Course Project Report.} -} - -\ProcessOptions\relax - -% fonts -\renewcommand{\rmdefault}{ptm} -\renewcommand{\sfdefault}{phv} - -% change this every year for notice string at bottom -\newcommand{\@neuripsordinal}{39th} -\newcommand{\@neuripsyear}{2026} -\newcommand{\@neuripslocation}{San Diego} - -% acknowledgments -\usepackage{environ} -\newcommand{\acksection}{\section*{Acknowledgments and Disclosure of Funding}} -\NewEnviron{ack}{% - \acksection - \BODY -} - - -% load natbib unless told otherwise -\if@natbib - \RequirePackage{natbib} -\fi - -% set page geometry -\usepackage[verbose=true,letterpaper]{geometry} -\AtBeginDocument{ - \newgeometry{ - textheight=9in, - textwidth=5.5in, - top=1in, - headheight=12pt, - headsep=25pt, - footskip=30pt - } -
\@ifpackageloaded{fullpage} - {\PackageWarning{neurips_2026}{fullpage package not allowed! Overwriting formatting.}} - {} -} - -\widowpenalty=10000 -\clubpenalty=10000 -\flushbottom -\sloppy - - -% font sizes with reduced leading -\renewcommand{\normalsize}{% - \@setfontsize\normalsize\@xpt\@xipt - \abovedisplayskip 7\p@ \@plus 2\p@ \@minus 5\p@ - \abovedisplayshortskip \z@ \@plus 3\p@ - \belowdisplayskip \abovedisplayskip - \belowdisplayshortskip 4\p@ \@plus 3\p@ \@minus 3\p@ -} -\normalsize -\renewcommand{\small}{% - \@setfontsize\small\@ixpt\@xpt - \abovedisplayskip 6\p@ \@plus 1.5\p@ \@minus 4\p@ - \abovedisplayshortskip \z@ \@plus 2\p@ - \belowdisplayskip \abovedisplayskip - \belowdisplayshortskip 3\p@ \@plus 2\p@ \@minus 2\p@ -} -\renewcommand{\footnotesize}{\@setfontsize\footnotesize\@ixpt\@xpt} -\renewcommand{\scriptsize}{\@setfontsize\scriptsize\@viipt\@viiipt} -\renewcommand{\tiny}{\@setfontsize\tiny\@vipt\@viipt} -\renewcommand{\large}{\@setfontsize\large\@xiipt{14}} -\renewcommand{\Large}{\@setfontsize\Large\@xivpt{16}} -\renewcommand{\LARGE}{\@setfontsize\LARGE\@xviipt{20}} -\renewcommand{\huge}{\@setfontsize\huge\@xxpt{23}} -\renewcommand{\Huge}{\@setfontsize\Huge\@xxvpt{28}} - -% sections with less space -\providecommand{\section}{} -\renewcommand{\section}{% - \@startsection{section}{1}{\z@}% - {-2.0ex \@plus -0.5ex \@minus -0.2ex}% - { 1.5ex \@plus 0.3ex \@minus 0.2ex}% - {\large\bf\raggedright}% -} -\providecommand{\subsection}{} -\renewcommand{\subsection}{% - \@startsection{subsection}{2}{\z@}% - {-1.8ex \@plus -0.5ex \@minus -0.2ex}% - { 0.8ex \@plus 0.2ex}% - {\normalsize\bf\raggedright}% -} -\providecommand{\subsubsection}{} -\renewcommand{\subsubsection}{% - \@startsection{subsubsection}{3}{\z@}% - {-1.5ex \@plus -0.5ex \@minus -0.2ex}% - { 0.5ex \@plus 0.2ex}% - {\normalsize\bf\raggedright}% -} -\providecommand{\paragraph}{} -\renewcommand{\paragraph}{% - \@startsection{paragraph}{4}{\z@}% - {1.5ex \@plus 0.5ex \@minus 0.2ex}% - {-1em}% - 
{\normalsize\bf}% -} -\providecommand{\subparagraph}{} -\renewcommand{\subparagraph}{% - \@startsection{subparagraph}{5}{\z@}% - {1.5ex \@plus 0.5ex \@minus 0.2ex}% - {-1em}% - {\normalsize\bf}% -} -\providecommand{\subsubsubsection}{} -\renewcommand{\subsubsubsection}{% - \vskip5pt{\noindent\normalsize\rm\raggedright}% -} - -% float placement -\renewcommand{\topfraction }{0.85} -\renewcommand{\bottomfraction }{0.4} -\renewcommand{\textfraction }{0.1} -\renewcommand{\floatpagefraction}{0.7} - -\newlength{\@neuripsabovecaptionskip}\setlength{\@neuripsabovecaptionskip}{7\p@} -\newlength{\@neuripsbelowcaptionskip}\setlength{\@neuripsbelowcaptionskip}{\z@} - -\setlength{\abovecaptionskip}{\@neuripsabovecaptionskip} -\setlength{\belowcaptionskip}{\@neuripsbelowcaptionskip} - -% swap above/belowcaptionskip lengths for tables -\renewenvironment{table} - {\setlength{\abovecaptionskip}{\@neuripsbelowcaptionskip}% - \setlength{\belowcaptionskip}{\@neuripsabovecaptionskip}% - \@float{table}} - {\end@float} - -% footnote formatting -\setlength{\footnotesep }{6.65\p@} -\setlength{\skip\footins}{9\p@ \@plus 4\p@ \@minus 2\p@} -\renewcommand{\footnoterule}{\kern-3\p@ \hrule width 12pc \kern 2.6\p@} -\setcounter{footnote}{0} - -% paragraph formatting -\setlength{\parindent}{\z@} -\setlength{\parskip }{5.5\p@} - -% list formatting -\setlength{\topsep }{4\p@ \@plus 1\p@ \@minus 2\p@} -\setlength{\partopsep }{1\p@ \@plus 0.5\p@ \@minus 0.5\p@} -\setlength{\itemsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@} -\setlength{\parsep }{2\p@ \@plus 1\p@ \@minus 0.5\p@} -\setlength{\leftmargin }{3pc} -\setlength{\leftmargini }{\leftmargin} -\setlength{\leftmarginii }{2em} -\setlength{\leftmarginiii}{1.5em} -\setlength{\leftmarginiv }{1.0em} -\setlength{\leftmarginv }{0.5em} -\def\@listi {\leftmargin\leftmargini} -\def\@listii {\leftmargin\leftmarginii - \labelwidth\leftmarginii - \advance\labelwidth-\labelsep - \topsep 2\p@ \@plus 1\p@ \@minus 0.5\p@ - \parsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@ - 
\itemsep \parsep} -\def\@listiii{\leftmargin\leftmarginiii - \labelwidth\leftmarginiii - \advance\labelwidth-\labelsep - \topsep 1\p@ \@plus 0.5\p@ \@minus 0.5\p@ - \parsep \z@ - \partopsep 0.5\p@ \@plus 0\p@ \@minus 0.5\p@ - \itemsep \topsep} -\def\@listiv {\leftmargin\leftmarginiv - \labelwidth\leftmarginiv - \advance\labelwidth-\labelsep} -\def\@listv {\leftmargin\leftmarginv - \labelwidth\leftmarginv - \advance\labelwidth-\labelsep} -\def\@listvi {\leftmargin\leftmarginvi - \labelwidth\leftmarginvi - \advance\labelwidth-\labelsep} - -% create title -\providecommand{\maketitle}{} -\renewcommand{\maketitle}{% - \par - \begingroup - \renewcommand{\thefootnote}{\fnsymbol{footnote}} - % for perfect author name centering - \renewcommand{\@makefnmark}{\hbox to \z@{$^{\@thefnmark}$\hss}} - % The footnote-mark was overlapping the footnote-text, - % added the following to fix this problem (MK) - \long\def\@makefntext##1{% - \parindent 1em\noindent - \hbox to 1.8em{\hss $\m@th ^{\@thefnmark}$}##1 - } - \thispagestyle{empty} - \@maketitle - \@thanks - \@notice - \endgroup - \let\maketitle\relax - \let\thanks\relax -} - -% rules for title box at top of first page -\newcommand{\@toptitlebar}{ - \hrule height 4\p@ - \vskip 0.25in - \vskip -\parskip% -} -\newcommand{\@bottomtitlebar}{ - \vskip 0.29in - \vskip -\parskip - \hrule height 1\p@ - \vskip 0.09in% -} - -% create title (includes both anonymized and non-anonymized versions) -\providecommand{\@maketitle}{} -\renewcommand{\@maketitle}{% - \vbox{% - \hsize\textwidth - \linewidth\hsize - \vskip 0.1in - \@toptitlebar - \centering - {\LARGE\bf \@title\par} - \@bottomtitlebar - \if@anonymous - \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@} - Anonymous Author(s) \\ - Affiliation \\ - Address \\ - \texttt{email} \\ - \end{tabular}% - \else - \def\And{% - \end{tabular}\hfil\linebreak[0]\hfil% - \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces% - } - \def\AND{% - \end{tabular}\hfil\linebreak[4]\hfil% - 
\begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\ignorespaces% - } - \begin{tabular}[t]{c}\bf\rule{\z@}{24\p@}\@author\end{tabular}% - \fi - \vskip 0.3in \@minus 0.1in - } -} - -% add conference notice to bottom of first page -\newcommand{\ftype@noticebox}{8} -\newcommand{\@notice}{% - % give a bit of extra room back to authors on first page - \enlargethispage{2\baselineskip}% - \@float{noticebox}[b]% - \footnotesize\@noticestring% - \end@float% -} - -% abstract styling -\renewenvironment{abstract}% -{% - \vskip 0.075in% - \centerline% - {\large\bf Abstract}% - \vspace{0.5ex}% - \begin{quote}% -} -{ - \par% - \end{quote}% - \vskip 1ex% -} - -% For the paper checklist -\newcommand{\answerYes}[1][]{\textcolor{blue}{[Yes] #1}} -\newcommand{\answerNo}[1][]{\textcolor{orange}{[No] #1}} -\newcommand{\answerNA}[1][]{\textcolor{gray}{[NA] #1}} -\newcommand{\answerTODO}[1][]{\textcolor{red}{\bf [TODO]}} -\newcommand{\justificationTODO}[1][]{\textcolor{red}{\bf [TODO]}} - -% handle tweaks for camera-ready copy vs. 
submission copy -\if@preprint - \newcommand{\@noticestring}{% - Preprint.% - } -\else - \if@neuripsfinal - \newcommand{\@noticestring}{ - \@trackname - } - \else - \newcommand{\@noticestring}{% - 30562 --- Machine Learning and Artificial Intelligence, \@neuripsyear.% - } - - % hide the acknowledgements - \NewEnviron{hide}{} - \let\ack\hide - \let\endack\endhide - - % line numbers for submission - \RequirePackage{lineno} - \linenumbers - - % fix incompatibilities between lineno and amsmath, if required, by - % transparently wrapping linenomath environments around amsmath - % environments - \AtBeginDocument{% - \@ifpackageloaded{amsmath}{% - \newcommand*\patchAmsMathEnvironmentForLineno[1]{% - \expandafter\let\csname old#1\expandafter\endcsname\csname #1\endcsname - \expandafter\let\csname oldend#1\expandafter\endcsname\csname end#1\endcsname - \renewenvironment{#1}% - {\linenomath\csname old#1\endcsname}% - {\csname oldend#1\endcsname\endlinenomath}% - }% - \newcommand*\patchBothAmsMathEnvironmentsForLineno[1]{% - \patchAmsMathEnvironmentForLineno{#1}% - \patchAmsMathEnvironmentForLineno{#1*}% - }% - \patchBothAmsMathEnvironmentsForLineno{equation}% - \patchBothAmsMathEnvironmentsForLineno{align}% - \patchBothAmsMathEnvironmentsForLineno{flalign}% - \patchBothAmsMathEnvironmentsForLineno{alignat}% - \patchBothAmsMathEnvironmentsForLineno{gather}% - \patchBothAmsMathEnvironmentsForLineno{multline}% - } - {} - } - \fi -\fi - - -\endinput diff --git a/neurips_2026.tex b/neurips_2026.tex deleted file mode 100644 index 96b4952..0000000 --- a/neurips_2026.tex +++ /dev/null @@ -1,182 +0,0 @@ -\documentclass{article} - -\usepackage[dblblindworkshop, final]{neurips_2026} - -\usepackage[utf8]{inputenc} -\usepackage[T1]{fontenc} -\usepackage{hyperref} -\usepackage{url} -\usepackage{booktabs} -\usepackage{amsfonts} -\usepackage{amsmath} -\usepackage{microtype} -\usepackage{xcolor} -\usepackage{graphicx} - -\workshoptitle{30562 --- Machine Learning and Artificial Intelligence} 
-\title{Project Report Title} - -\author{% - Author One \\ - Bocconi University \\ - \texttt{author.one@studbocconi.it} \\ - \And - Author Two \\ - Bocconi University \\ - \texttt{author.two@studbocconi.it} \\ - \And - Author Three \\ - Bocconi University \\ - \texttt{author.three@studbocconi.it} \\ -} - - - -\begin{document} - -\maketitle - -\begin{abstract} -A concise summary of the problem, the approach, and the main findings. -The abstract should be self-contained and not exceed one paragraph. -\end{abstract} - -\section{Introduction} - -Motivate the problem you are solving. -What is the research question? -Why is it interesting or important? -Briefly summarise what you do and what you find. - -\section{Background and Related Work} - -Describe any prior work your project builds on. -Cite relevant papers and explain how your work relates to them. - -\section{Method} - -Describe your approach clearly and precisely. -Include any mathematical formulation, model architecture, or algorithm -that is central to your work. - -\section{Experiments} - -Describe your experimental setup: dataset(s), baselines, evaluation metrics, -and implementation details (model size, optimiser, hyperparameters). - -\subsection{Results} - -Present your main results using tables or figures. -Compare against baselines where applicable. - -% Fixed-size baselines and oracle -\input{rag-chunk-routing/results/figures/table_main_results} - -\subsection{Chunk-Size Router} - -We train a supervised router that maps each question to a predicted optimal -chunk size in $\{128, 256, 512\}$. -The router is trained on oracle labels (the chunk size that maximises F1 for -each question on the validation split) using a cross-validated grid search -over three feature extractors and three classifiers. - -\paragraph{Feature extractors.} -\textit{TF-IDF}: bag-of-bigrams with up to 10\,000 features. -\textit{MiniLM}: frozen sentence embeddings from -\texttt{all-MiniLM-L6-v2} (384 dimensions). 
-\textit{Handcrafted}: 13-dimensional deterministic features (query length,
-question-word one-hot, heuristic NER count, question-type one-hot).
-
-\paragraph{Classifiers.}
-Logistic regression, linear SVM, and LightGBM, all tuned via 5-fold
-stratified cross-validation on the training split ($n=237$).
-
-\paragraph{Two-pass model selection.}
-The top-3 grid cells by validation macro-F1 are re-ranked by their
-end-to-end validation RAG F1 (running the full pipeline with predicted
-chunk sizes), and the winner is selected by RAG F1.
-The selected model is \textit{MiniLM + logistic regression}
-(val macro-F1 = 0.416, val RAG F1 = 0.278).
-
-\paragraph{Test results.}
-Table~\ref{tab:router-results} and Figure~\ref{fig:router-comparison}
-report test-set performance.
-The router achieves a classification macro-F1 of 0.292 and a mean RAG F1
-of 0.171, below the best fixed-size baseline (0.213).
-The gap-closure fraction is $-0.51$, indicating that the router's
-mispredictions hurt more than its correct predictions help.
-
-\input{rag-chunk-routing/results/figures/table_router_results}
-
-\begin{figure}[h]
-  \centering
-  \includegraphics[width=0.72\linewidth]{rag-chunk-routing/results/figures/fig_router_comparison.pdf}
-  \caption{Mean F1 on the test split for each system. The dashed line
-    marks the best fixed-size baseline (size~128, F1~=~0.213). The
-    type-aware heuristic routes each question to its type's
-    best-on-average chunk size with no training; the trained router
-    (MiniLM~+~LR) underperforms despite end-to-end selection.}
-  \label{fig:router-comparison}
-\end{figure}
-
-\paragraph{Type-aware sanity baseline.}
-As a parameter-free sanity check, we route each test question to the
-chunk size with the highest mean F1 for its question type, as determined
-from the eval grid (no new inference required).
-This type-aware policy achieves a mean F1 of 0.229 and a gap-closure
-fraction of 0.20.
-Figure~\ref{fig:router-per-type} shows the per-type breakdown.
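-
-\paragraph{Gap-closure arithmetic.}
-As a worked check on the fractions reported above, we assume that gap
-closure is measured relative to the best fixed-size baseline and the oracle
-ceiling. This definition is inferred from the reported numbers rather than
-stated explicitly, and the symbols $\mathrm{GC}$, $F_1^{\mathrm{fixed}}$ and
-$F_1^{\mathrm{oracle}}$ are notation introduced here for illustration:
-\[
-  \mathrm{GC}(s) =
-  \frac{F_1(s) - F_1^{\mathrm{fixed}}}
-       {F_1^{\mathrm{oracle}} - F_1^{\mathrm{fixed}}}.
-\]
-With $F_1^{\mathrm{fixed}} \approx 0.213$ and an oracle gap of about $0.082$,
-the trained router gives $(0.171 - 0.213)/0.082 \approx -0.51$ and the
-type-aware policy gives $(0.229 - 0.213)/0.082 \approx 0.20$.
-Both values agree with the reported fractions up to rounding, and a negative
-value means the system falls below the fixed-size baseline.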
-
-\begin{figure}[h]
-  \centering
-  \includegraphics[width=0.82\linewidth]{rag-chunk-routing/results/figures/fig_router_per_type.pdf}
-  \caption{Per-type mean F1 on the test split. Factoid questions dominate
-    the dataset (82\% oracle-best at size~128); the trained router fails
-    on the minority types (multihop, synthesis) where the routing signal
-    matters most.}
-  \label{fig:router-per-type}
-\end{figure}
-
-\paragraph{Analysis.}
-Three factors explain the trained router's negative result.
-First, the training set is small (237 examples), limiting generalisation;
-classification macro-F1 drops from 0.416 on validation to 0.292 on test.
-Second, the oracle--baseline gap is only 8.2~F1~points
-(Table~\ref{tab:main-results}), so even modest routing errors erase the
-potential gain.
-Third, the type-aware heuristic demonstrates that question type is the
-most informative routing signal: routing by type alone outperforms the
-trained model, which suggests that the learned features do not capture
-information beyond what question type already encodes.
-
-% Per-type gap table
-\input{rag-chunk-routing/results/figures/table_per_type_gap}
-
-\subsection{Ablations}
-
-If applicable, include ablation studies that isolate the contribution
-of individual design choices.
-
-\section{Conclusion}
-
-Summarise what you did, what you found, and what the limitations are.
-Optionally, suggest directions for future work.
-
-\section*{References}
-
-\small
-
-% Use any consistent citation style.
-% Example:
-%
-% [1] Author, A. \& Author, B. (Year). Title. \textit{Venue}.
-
-\appendix
-
-\section{Additional Results and Implementation Details}
-
-Put here any supplementary figures, tables, proofs, or extended
-experimental details that did not fit in the main paper.
-The appendix has no page limit.
-
-\end{document}