The Evaluate tab lets you systematically benchmark your fine-tuned models using standardized tests. Instead of just chatting with your model to see if it "feels right," you can get objective scores that show exactly how well it performs.
After training a model, you might wonder: "Did my training actually improve the model?" The Evaluate tab answers this question with numbers.
The workflow:
- Train → Teach your model with your data
- Inference → Chat with it to see if it feels right
- Evaluate → Get objective benchmark scores
- Publish → Share with confidence
The Evaluate tab has four sections:
- Training run to evaluate: Pick from your completed training runs
- Base model: Auto-filled based on your training run
- Compare with base model: Toggle this to see if fine-tuning helped
Choose which benchmark to run:
| Benchmark | Category | What It Tests | Speed |
|---|---|---|---|
| ⚡ HellaSwag | Quick | Commonsense reasoning | Fast |
| ⚡ TruthfulQA MC2 | Quick | Truthfulness and factual accuracy | Fast |
| ⚡ ARC Easy | Quick | Elementary science questions | Fast |
| ⚡ Winogrande | Quick | Pronoun resolution | Fast |
| ⚡ BoolQ | Quick | Yes/no questions | Fast |
| ARC Challenge | Full | Harder science questions | Medium |
| MMLU | Full | 57 different knowledge tasks | Slow |
| MMLU-PRO | Full | Enhanced MMLU with 10 choices | Slow |
| GSM8K | Full | Grade school math word problems | Medium |
| IFEval | Advanced | Instruction following | Medium |
| BBH | Advanced | 23 challenging reasoning tasks | Slow |
| GPQA | Advanced | PhD-level questions | Medium |
| MuSR | Advanced | Multistep soft reasoning | Medium |
| HumanEval | Advanced | Python code generation | Medium |
Configuration options:
- Max samples: How many test questions to run (default: 100). Lower = faster, higher = more accurate.
- Batch size: How many questions to process at once (default: 4). Lower if you run out of GPU memory.
Click Run Evaluation to start. You'll see:
- Progress indicator showing which benchmark is running
- Status messages as the model loads and processes questions
Click Stop to cancel a running evaluation.
After completion, you'll see:
Metrics Table:
| Metric | Score |
|---|---|
| Acc | 69.00% |
| Acc Norm | 72.50% |
Visual Bars:
- Green bar for Accuracy
- Gold bar for Normalized Accuracy (when available)
With Comparison Mode enabled:
| Metric | Fine-tuned | Base Model | Δ Change |
|---|---|---|---|
| Acc | 69.00% | 65.00% | +4.00% |
Green Δ = improvement, Red Δ = regression
- Go to Evaluate tab
- Select your training run from the dropdown
- Keep benchmark as "⚡ HellaSwag (quick)"
- Keep max samples at 100
- Click Run Evaluation
- Wait for results (~2-3 minutes)
- Go to Evaluate tab
- Select your training run
- Check "Also evaluate base model for comparison"
- Select a benchmark (e.g., "TruthfulQA MC2")
- Set max samples to 200-500 for more accurate results
- Click Run Evaluation
- Compare fine-tuned vs base model scores
The percentage of questions answered correctly. Simple and straightforward.
Accuracy adjusted for answer length bias. Some models prefer shorter or longer answers regardless of correctness. Normalized accuracy corrects for this.
Which to use?
- For HellaSwag, ARC, MMLU: Use acc_norm (standard practice)
- For TruthfulQA: Use acc (MC2 format)
- When in doubt: Report both
The difference between your fine-tuned model and the base model:
- +5.00%: Your fine-tuning improved performance by 5 percentage points
- -2.00%: Your fine-tuning decreased performance by 2 percentage points
- 0.00%: No change
- Use quick benchmarks (⚡) first
- Start with 100 samples to get fast feedback
- Increase samples for final evaluation before publishing
- FineFoundry automatically clears GPU memory between evaluations
- If you get out-of-memory errors, lower the batch size
- Close other GPU-intensive applications
- General capability: HellaSwag, MMLU
- Truthfulness: TruthfulQA
- Reasoning: ARC, BBH
- Math: GSM8K
- Code: HumanEval
- Instruction following: IFEval
Good signs:
- Positive Δ on benchmarks related to your training data
- Maintained or improved scores on general benchmarks
Warning signs:
- Large negative Δ on general benchmarks (catastrophic forgetting)
- Very low scores overall (model may have overfit to training data)
- Lower the batch size to 2 or 1
- Reduce max samples
- Close other applications using GPU
- Use quick benchmarks (⚡)
- Reduce max samples to 50-100
- Avoid full MMLU (57 tasks) for quick tests
- Ensure you selected the correct training run
- Check that the adapter path exists
- Try a different benchmark to verify
- Base model evaluation requires loading the model twice
- Ensure you have enough GPU memory
- Try without comparison mode first
FineFoundry uses EleutherAI's lm-evaluation-harness under the hood. This is the same framework used by:
- HuggingFace Open LLM Leaderboard
- Academic research papers
- Industry model evaluations
The benchmarks are standardized, so your results are comparable to published model scores.
- Training Tab — Train your model first
- Inference Tab — Chat with your model
- Troubleshooting — General help
