42 changes: 42 additions & 0 deletions .gitignore
@@ -0,0 +1,42 @@
# Python
__pycache__/
*.py[cod]
*.egg-info/
*.egg
dist/
build/
*.whl

# Virtual environments
.venv/
venv/
env/

# IDE
.vscode/
.idea/
*.swp
*.swo
.DS_Store

# mypy
.mypy_cache/

# Test / pytest
.pytest_cache/
.coverage
htmlcov/
test_data.jsonl
test_tool_call.jsonl
results.jsonl

# Config secrets (never commit credentials)
llm_eval_kit.yaml
.env

# Blog drafts
blog/

# Lambda deployment artifacts
deploy_package/
*.zip
77 changes: 55 additions & 22 deletions README.md
@@ -1,18 +1,22 @@
# LLM Eval Kit

A Python SDK for building custom evaluation metrics for LLM evaluation on SageMaker Training Jobs, with built-in Pydantic validation.

For the official integration with AWS SageMaker Training Jobs, see the [official AWS SageMaker documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html).

## Installation

```bash
git clone https://github.com/aws/llm-eval-kit.git
cd llm-eval-kit
uv venv .venv && source .venv/bin/activate
uv pip install .
```

## Architecture

The SDK provides:

- **Pydantic Validation**: Automatic input/output validation using Pydantic models
- **PreProcessor**: For input data transformation with validation
- **PostProcessor**: For output data formatting with validation
@@ -27,6 +31,7 @@ The SDK provides:
See `example/run_example.py` for a complete working example to run locally.

### Run in AWS Lambda

You need to create a Lambda function (follow this [guide](https://docs.aws.amazon.com/lambda/latest/dg/getting-started.html)) and upload `llm-eval-kit` as a Lambda layer in order to use it.

A pre-built `llm-eval-kit-layer.zip` file is available on the [GitHub releases](https://github.com/aws/llm-eval-kit/releases) page.
@@ -35,10 +40,11 @@ Use the command below to upload the custom Lambda layer.

```bash
aws lambda publish-layer-version \
    --layer-name llm-eval-kit-layer \
    --zip-file fileb://llm-eval-kit-layer.zip \
    --compatible-runtimes python3.12 python3.11 python3.10 python3.9
```

You need to add this layer to your Lambda as a custom layer, along with the required AWS layer `AWSLambdaPowertoolsPythonV3-python312-arm64` (needed for the Pydantic dependency).

Then update your lambda code with:
@@ -72,7 +78,6 @@ def postprocessor(event: dict, context) -> dict:
"metric": "inverted_accuracy_custom",
"value": inverted_accuracy
})

# Add more metrics here

return {
@@ -92,31 +97,58 @@ lambda_handler = build_lambda_handler(
The SDK automatically validates:

### Preprocessing Input

```json
{
"process_type": "preprocess",
"data": {
"prompt": "what can you do?",
"gold": "Hello! How can I help you today?",
"system": "You are a helpful assistant"
}
}
```
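The shape check that the SDK's Pydantic models enforce on this payload can be sketched in plain Python. This is an illustrative stdlib-only sketch, not the llm-eval-kit implementation; the function name and error messages are made up for the example.

```python
# Illustrative sketch of the validation the SDK performs on a
# preprocessing payload. Field names match the example above; the real
# checks are done by Pydantic models inside llm-eval-kit.

REQUIRED_DATA_FIELDS = {"prompt", "gold", "system"}

def validate_preprocess_payload(payload: dict) -> dict:
    """Raise ValueError if the payload is not a valid preprocess input."""
    if payload.get("process_type") != "preprocess":
        raise ValueError("process_type must be 'preprocess'")
    data = payload.get("data")
    if not isinstance(data, dict):
        raise ValueError("data must be a JSON object")
    missing = REQUIRED_DATA_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return payload

payload = {
    "process_type": "preprocess",
    "data": {
        "prompt": "what can you do?",
        "gold": "Hello! How can I help you today?",
        "system": "You are a helpful assistant",
    },
}
validate_preprocess_payload(payload)  # passes without raising
```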

### Postprocessing Input

```json
{
"process_type": "postprocess",
"data": [
{
"prompt": "what can you do",
"inference_output": "Hello! How can I help you today?",
"gold": "Hello! How can I help you today?"
}
]
}
```
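A postprocessor typically folds the `data` list above into one or more metric values. The sketch below computes exact-match accuracy over such a list in plain Python; it is a minimal illustration, not SDK code.

```python
# Sketch of a metric a postprocessor might compute from the payload
# above: exact-match accuracy over the samples in "data".

def exact_match_accuracy(samples: list) -> float:
    """Fraction of samples whose inference_output equals gold exactly."""
    if not samples:
        return 0.0
    hits = sum(
        1 for s in samples
        if s["inference_output"].strip() == s["gold"].strip()
    )
    return hits / len(samples)

samples = [
    {
        "prompt": "what can you do",
        "inference_output": "Hello! How can I help you today?",
        "gold": "Hello! How can I help you today?",
    }
]
print(exact_match_accuracy(samples))  # → 1.0
```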

## RLVR Grader Framework

llm-eval-kit also includes a grader framework for building and deploying reward functions for Reinforcement Learning with Verifiable Rewards (RLVR) on Amazon Bedrock. This extends the SDK beyond SageMaker evaluation into RFT (Reinforcement Fine-Tuning) workflows.

Features include:

- Built-in graders for exact match, string similarity, and BFCL tool-calling evaluation
- A `@grader` decorator for writing custom reward functions
- Dataset loaders for JSONL, BFCL, and HuggingFace Hub
- One-command Lambda deployment of graders as reward functions
- A CLI for local evaluation, validation, and deployment

To use all grader features, install with the optional extras:

```bash
uv pip install -e ".[dev,datasets,deploy]"
```

For full documentation on the grader framework, see the [src/llm_eval_kit README](src/llm_eval_kit/README.md).

| Topic | Description |
|-------|-------------|
| [Graders](docs/graders.md) | Built-in graders, writing custom graders, the `@grader` decorator |
| [Datasets](docs/datasets.md) | Loading from JSONL, BFCL, and HuggingFace Hub |
| [Lambda Deployment](docs/deploy.md) | Deploy graders as AWS Lambda reward functions for RLVR |
| [CLI Reference](docs/cli.md) | All CLI commands and options |
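The `@grader` decorator registers a reward function under a name so the CLI and deployment tooling can find it. The snippet below re-creates that registry pattern from scratch to show the idea; it is an illustration, not the llm-eval-kit implementation (see [docs/graders.md](docs/graders.md) for the real API).

```python
# From-scratch illustration of a name-based grader registry, the pattern
# behind the @grader decorator. Not the llm-eval-kit implementation.

GRADERS = {}

def grader(name):
    """Register the decorated function in GRADERS under `name`."""
    def register(fn):
        GRADERS[name] = fn
        return fn
    return register

@grader("exact_match_demo")
def exact_match(response: str, ground_truth: str) -> float:
    """Reward 1.0 when the response matches the ground truth exactly."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

score = GRADERS["exact_match_demo"]("4", "4")
print(score)  # → 1.0
```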

## Testing

```bash
@@ -130,8 +162,9 @@ python example/run_example.py
## Development

```bash
# Create venv and install in development mode
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev,datasets,deploy]"

# Run tests with coverage
python -m pytest tests/ --cov=llm_eval_kit
67 changes: 67 additions & 0 deletions docs/cli.md
@@ -0,0 +1,67 @@
# CLI Reference

```
llm-eval-kit <command> [options]
```

## `evaluate`

Run a grader over a dataset.

```bash
llm-eval-kit evaluate --grader <name> --data <path> [options]
```

| Option | Description |
|--------|-------------|
| `--grader` | Built-in grader name (`exact_match`, `string_similarity`, `tool_call`) |
| `--grader-path` | Custom grader as `module.path:function_name` |
| `--data` | Path to JSONL dataset file (required) |
| `--format` | `jsonl` (default) or `bfcl` for BFCL-formatted files |
| `--output` | Write per-sample results to a JSONL file |
| `--max-samples` | Limit number of samples to evaluate |

Examples:

```bash
# Built-in grader
llm-eval-kit evaluate --grader exact_match --data samples.jsonl

# Custom grader with output
llm-eval-kit evaluate --grader-path my_module:my_grader --data samples.jsonl --output results.jsonl

# BFCL format with sample limit
llm-eval-kit evaluate --grader tool_call --data BFCL_v3_simple.json --format bfcl --max-samples 50
```

## `list-graders`

Show all registered graders.

```bash
llm-eval-kit list-graders
```

## `validate`

Check a dataset file for schema errors.

```bash
llm-eval-kit validate --data <path>
```

## `deploy`

Deploy a grader as an AWS Lambda function. Requires `uv pip install -e ".[deploy]"`.

```bash
llm-eval-kit deploy --grader <name> [options]
```

| Option | Description |
|--------|-------------|
| `--grader` | Built-in grader name |
| `--grader-path` | Custom grader as `module.path:function_name` |
| `--config` | Path to `llm_eval_kit.yaml` config file |

See [deploy.md](deploy.md) for the full deployment walkthrough.
98 changes: 98 additions & 0 deletions docs/datasets.md
@@ -0,0 +1,98 @@
# Datasets

llm-eval-kit supports loading evaluation data from JSONL files, BFCL-formatted files, and HuggingFace Hub.

## JSONL Format

Each line is a JSON object with `id`, `messages`, and `ground_truth`:

```jsonl
{"id": "1", "messages": [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "4"}], "ground_truth": "4"}
{"id": "2", "messages": [{"role": "user", "content": "Capital of France?"}, {"role": "assistant", "content": "Paris"}], "ground_truth": "Paris"}
```

Load from CLI:

```bash
llm-eval-kit evaluate --grader exact_match --data samples.jsonl
```

Load from Python:

```python
from llm_eval_kit.datasets.loader import load_jsonl

dataset = load_jsonl("samples.jsonl", max_samples=100)
```

Validate a file before running:

```bash
llm-eval-kit validate --data samples.jsonl
```

## BFCL Format

The [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) uses a specific JSONL format with `id`, `question` (list of message dicts), and `function` (tool definitions).

```bash
llm-eval-kit evaluate \
--grader tool_call \
--data BFCL_v3_multiple.json \
--format bfcl
```

```python
from llm_eval_kit.datasets.loader import load_bfcl

dataset = load_bfcl("BFCL_v3_multiple.json", max_samples=100)
```

## HuggingFace Hub

Pull datasets directly from HuggingFace. Requires `uv pip install -e ".[datasets]"`.

```python
from llm_eval_kit.datasets.loader import load_huggingface

dataset = load_huggingface(
"gorilla-llm/Berkeley-Function-Calling-Leaderboard",
split="train",
max_samples=50,
data_files="BFCL_v3_exec_simple.json", # pick a specific file
prompt_key="question",
ground_truth_key="ground_truth",
id_key="id",
response_key=None,
)
```

### Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `dataset_name` | (required) | HF dataset name (e.g. `"gorilla-llm/Berkeley-Function-Calling-Leaderboard"`) |
| `split` | `"train"` | Dataset split |
| `max_samples` | `None` | Limit number of samples |
| `token` | `None` | HF API token (falls back to `HF_TOKEN` env var) |
| `data_files` | `None` | Specific file(s) to load from the repo |
| `config_name` | `None` | Dataset config/subset name |
| `prompt_key` | `"prompt"` | Column name for the prompt |
| `response_key` | `"response"` | Column name for model response (`None` to skip) |
| `ground_truth_key` | `"ground_truth"` | Column name for ground truth (`None` to skip) |
| `id_key` | `"id"` | Column name for sample ID (`None` to auto-generate) |

### BFCL on HuggingFace

The BFCL repo has ~49 files with different schemas. You must use `data_files` to select one — loading the entire repo will fail.

Available files include: `BFCL_v3_simple.json`, `BFCL_v3_multiple.json`, `BFCL_v3_parallel.json`, `BFCL_v3_exec_simple.json`, `BFCL_v3_live_simple.json`, and more.

### Private/Gated Datasets

```python
dataset = load_huggingface(
"my-org/my-private-dataset",
token="hf_...", # or set HF_TOKEN env var
)
```