# Running Endpoints with Llama Models

This example covers benchmarking two Llama models:

- **[Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)** — CNN/DailyMail summarization, offline and online modes
- **[Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)** — Open Orca, online (Poisson) mode

---

## Llama-3.1-8B-Instruct

### Dataset

The Llama-3.1-8B benchmark uses the [CNN/DailyMail](https://huggingface.co/datasets/abisee/cnn_dailymail) dataset. When using the provided config files, the validation split is downloaded automatically by the dataset entry `- name: cnn_dailymail::llama3_8b`.
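For reference, the entry sits in the config's dataset list roughly as sketched below. Only the `name` value comes from this README; the enclosing key is illustrative, so check `offline_llama3_8b_cnn.yaml` for the actual schema.

```yaml
# Illustrative fragment only -- the enclosing "datasets:" key is an assumption
datasets:
  - name: cnn_dailymail::llama3_8b   # preset; fetches the validation split automatically
```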

For post-training quantization calibration, use the repository-provided [`calibration-list.txt`](./calibration-list.txt), which corresponds to the [cnn-dailymail-calibration-list](https://github.com/mlcommons/inference/blob/v4.0/calibration/CNNDailyMail/calibration-list.txt):

```bash
uv run python download_cnndm.py --save-dir data --calibration-ids-file calibration-list.txt --split train
```

### Environment

```bash
export HF_TOKEN=<your Hugging Face token>
export HF_HOME=<path to your hf_home, e.g. ~/.cache/huggingface>
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
hf download $MODEL_NAME
```

### Launch the server

**Note:** To generate outputs that match MLPerf submissions produced with the legacy LoadGen, a custom chat template must be applied; the `cnn_dailymail::llama3_8b` preset handles this automatically. This requires the `--trust-request-chat-template` server flag. **Security warning:** the flag allows execution of request-provided chat templates, so use it only in trusted environments and never on publicly exposed endpoints.

```bash
docker run --runtime nvidia --gpus all \
-v ${HF_HOME}:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest --model ${MODEL_NAME} --trust-request-chat-template
```
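Once the container finishes loading the weights, a quick sanity check confirms the endpoint is serving the model. The `/v1/models` route is part of the OpenAI-compatible API that vLLM exposes; the port matches the `-p 8000:8000` mapping above.

```bash
# List the models served by the endpoint; expect meta-llama/Llama-3.1-8B-Instruct
curl -s http://localhost:8000/v1/models | python -m json.tool
```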

### Offline mode

```bash
uv run inference-endpoint benchmark from-config -c offline_llama3_8b_cnn.yaml --timeout 600
```

### Online mode

```bash
uv run inference-endpoint benchmark from-config -c online_llama3_8b_cnn.yaml --timeout 600
```

These configs run in performance-only mode by default. To also evaluate summarization quality, add `--mode both` and install the accuracy dependencies listed in the [Llama-2-70b accuracy setup](#accuracy-evaluation-setup-optional) section below.
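For example, to run the offline benchmark with accuracy evaluation enabled:

```bash
uv run inference-endpoint benchmark from-config -c offline_llama3_8b_cnn.yaml --mode both --timeout 600
```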

---

## Llama-2-70b-chat-hf

### Dataset

Download the preprocessed Open Orca dataset from the MLCommons R2 bucket. Navigate to your desired download directory and run:

```bash
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
https://inference.mlcommons-storage.org/metadata/llama-2-70b-open-orca-dataset.uri
```

The script downloads the dataset to the `./open_orca` directory. Additional instructions for downloading the model and dataset are in the [Reference Implementation for llama2-70b](https://github.com/mlcommons/inference/tree/master/language/llama2-70b).

### Environment

Go to [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) and request access, then create a Hugging Face access token with read permissions.

```bash
export MODEL_NAME=meta-llama/Llama-2-70b-chat-hf
export HF_TOKEN=<your Hugging Face token>
export HF_HOME=<path to your hf_home, e.g. ~/.cache/huggingface>
hf download $MODEL_NAME
```

### Accuracy evaluation setup (optional)

Accuracy evaluation requires additional packages. Skip this for performance-only runs.

```bash
uv pip install nltk evaluate rouge_score
uv run python -c 'import nltk; nltk.download("punkt"); nltk.download("punkt_tab")'
```
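As an optional sanity check, the one-liner below simply imports the three packages and tokenizes a sample string with the freshly downloaded punkt data; it is not part of the benchmark itself.

```bash
uv run python -c 'import evaluate, rouge_score, nltk; print(nltk.sent_tokenize("Setup looks good. Punkt data is available."))'
```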

### Launch the server

```bash
docker run --runtime nvidia --gpus all \
-v ${HF_HOME}:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest --model ${MODEL_NAME} --gpu-memory-utilization 0.95
```
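Before starting a long run, a single request can confirm end-to-end generation. The `/v1/chat/completions` route and payload follow the OpenAI-compatible API that vLLM serves; the prompt here is arbitrary.

```bash
# Minimal smoke test against the chat completions endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-70b-chat-hf",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```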

### Online mode

```bash
uv run inference-endpoint benchmark from-config -c online_llama2_70b_orca.yaml --timeout 600
```