42 changes: 42 additions & 0 deletions .gitignore
@@ -0,0 +1,42 @@
# Python
__pycache__/
*.py[cod]
*.egg-info/
*.egg
dist/
build/
*.whl

# Virtual environments
.venv/
venv/
env/

# IDE
.vscode/
.idea/
*.swp
*.swo
.DS_Store

# mypy
.mypy_cache/

# Test / pytest
.pytest_cache/
.coverage
htmlcov/
test_data.jsonl
test_tool_call.jsonl
results.jsonl

# Config secrets (never commit credentials)
llm_eval_kit.yaml
.env

# Blog drafts
blog/

# Lambda deployment artifacts
deploy_package/
*.zip
77 changes: 55 additions & 22 deletions README.md
@@ -1,18 +1,22 @@
# LLM Eval Kit

A Python SDK for building custom evaluation metrics for LLM evaluation on SageMaker Training Jobs, with built-in Pydantic validation.

For the official integration with AWS SageMaker Training Jobs, see the [official AWS SageMaker documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-model-evaluation.html).

## Installation

```bash
git clone https://github.com/aws/llm-eval-kit.git
cd llm-eval-kit
uv venv .venv && source .venv/bin/activate
uv pip install .
```

## Architecture

The SDK provides:

- **Pydantic Validation**: Automatic input/output validation using Pydantic models
- **PreProcessor**: For input data transformation with validation
- **PostProcessor**: For output data formatting with validation
@@ -27,6 +31,7 @@ The SDK provides:
See `example/run_example.py` for a complete working example to run locally.

### Run in AWS Lambda

You need to create a Lambda function (follow this [guide](https://docs.aws.amazon.com/lambda/latest/dg/getting-started.html)) and upload `llm-eval-kit` as a Lambda layer in order to use it.

A pre-built `llm-eval-kit-layer.zip` file is available on the [GitHub releases](https://github.com/aws/llm-eval-kit/releases) page.
@@ -35,10 +40,11 @@ Use the command below to upload the custom Lambda layer.

```bash
aws lambda publish-layer-version \
    --layer-name llm-eval-kit-layer \
    --zip-file fileb://llm-eval-kit-layer.zip \
    --compatible-runtimes python3.12 python3.11 python3.10 python3.9
```

You need to add this layer to your Lambda as a custom layer, along with the required AWS layer `AWSLambdaPowertoolsPythonV3-python312-arm64` (needed for the Pydantic dependency).

Then update your lambda code with:
@@ -72,7 +78,6 @@ def postprocessor(event: dict, context) -> dict:
"metric": "inverted_accuracy_custom",
"value": inverted_accuracy
})

# Add more metrics here

return {
@@ -92,31 +97,58 @@ lambda_handler = build_lambda_handler(
The SDK automatically validates:

### Preprocessing Input

```json
{
"process_type": "preprocess",
"data": {
"prompt": "what can you do?",
"gold": "Hello! How can I help you today?",
"system": "You are a helpful assistant"
}
}
```
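The shape check that the SDK's Pydantic models enforce on this payload can be sketched in plain Python. This is an illustrative stdlib-only sketch, not the llm-eval-kit implementation; the function name and error messages are made up for the example.

```python
# Illustrative sketch of the validation the SDK performs on a
# preprocessing payload. Field names match the example above; the real
# checks are done by Pydantic models inside llm-eval-kit.

REQUIRED_DATA_FIELDS = {"prompt", "gold", "system"}

def validate_preprocess_payload(payload: dict) -> dict:
    """Raise ValueError if the payload is not a valid preprocess input."""
    if payload.get("process_type") != "preprocess":
        raise ValueError("process_type must be 'preprocess'")
    data = payload.get("data")
    if not isinstance(data, dict):
        raise ValueError("data must be a JSON object")
    missing = REQUIRED_DATA_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return payload

payload = {
    "process_type": "preprocess",
    "data": {
        "prompt": "what can you do?",
        "gold": "Hello! How can I help you today?",
        "system": "You are a helpful assistant",
    },
}
validate_preprocess_payload(payload)  # passes without raising
```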

### Postprocessing Input

```json
{
"process_type": "postprocess",
"data": [
{
"prompt": "what can you do",
"inference_output": "Hello! How can I help you today?",
"gold": "Hello! How can I help you today?"
}
]
}
```
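A postprocessor typically folds the `data` list above into one or more metric values. The sketch below computes exact-match accuracy over such a list in plain Python; it is a minimal illustration, not SDK code.

```python
# Sketch of a metric a postprocessor might compute from the payload
# above: exact-match accuracy over the samples in "data".

def exact_match_accuracy(samples: list) -> float:
    """Fraction of samples whose inference_output equals gold exactly."""
    if not samples:
        return 0.0
    hits = sum(
        1 for s in samples
        if s["inference_output"].strip() == s["gold"].strip()
    )
    return hits / len(samples)

samples = [
    {
        "prompt": "what can you do",
        "inference_output": "Hello! How can I help you today?",
        "gold": "Hello! How can I help you today?",
    }
]
print(exact_match_accuracy(samples))  # → 1.0
```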

## RLVR Grader Framework

llm-eval-kit also includes a grader framework for building and deploying reward functions for Reinforcement Learning with Verifiable Rewards (RLVR) on Amazon Bedrock. This extends the SDK beyond SageMaker evaluation into RFT (Reinforcement Fine-Tuning) workflows.

Features include:

- Built-in graders for exact match, string similarity, and BFCL tool-calling evaluation
- A `@grader` decorator for writing custom reward functions
- Dataset loaders for JSONL, BFCL, and HuggingFace Hub
- One-command Lambda deployment of graders as reward functions
- A CLI for local evaluation, validation, and deployment

To use all grader features, install with the optional extras:

```bash
uv pip install -e ".[dev,datasets,deploy]"
```

For full documentation on the grader framework, see the [src/llm_eval_kit README](src/llm_eval_kit/README.md).

| Topic | Description |
|-------|-------------|
| [Graders](docs/graders.md) | Built-in graders, writing custom graders, the `@grader` decorator |
| [Datasets](docs/datasets.md) | Loading from JSONL, BFCL, and HuggingFace Hub |
| [Lambda Deployment](docs/deploy.md) | Deploy graders as AWS Lambda reward functions for RLVR |
| [CLI Reference](docs/cli.md) | All CLI commands and options |
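The `@grader` decorator registers a reward function under a name so the CLI and deployment tooling can find it. The snippet below re-creates that registry pattern from scratch to show the idea; it is an illustration, not the llm-eval-kit implementation (see [docs/graders.md](docs/graders.md) for the real API).

```python
# From-scratch illustration of a name-based grader registry, the pattern
# behind the @grader decorator. Not the llm-eval-kit implementation.

GRADERS = {}

def grader(name):
    """Register the decorated function in GRADERS under `name`."""
    def register(fn):
        GRADERS[name] = fn
        return fn
    return register

@grader("exact_match_demo")
def exact_match(response: str, ground_truth: str) -> float:
    """Reward 1.0 when the response matches the ground truth exactly."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

score = GRADERS["exact_match_demo"]("4", "4")
print(score)  # → 1.0
```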

## Testing

```bash
@@ -130,8 +162,9 @@ python example/run_example.py
## Development

```bash
# Create venv and install in development mode
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev,datasets,deploy]"

# Run tests with coverage
python -m pytest tests/ --cov=llm_eval_kit
67 changes: 67 additions & 0 deletions docs/cli.md
@@ -0,0 +1,67 @@
# CLI Reference

```
llm-eval-kit <command> [options]
```

## `evaluate`

Run a grader over a dataset.

```bash
llm-eval-kit evaluate --grader <name> --data <path> [options]
```

| Option | Description |
|--------|-------------|
| `--grader` | Built-in grader name (`exact_match`, `string_similarity`, `tool_call`) |
| `--grader-path` | Custom grader as `module.path:function_name` |
| `--data` | Path to JSONL dataset file (required) |
| `--format` | `jsonl` (default) or `bfcl` for BFCL-formatted files |
| `--output` | Write per-sample results to a JSONL file |
| `--max-samples` | Limit number of samples to evaluate |

Examples:

```bash
# Built-in grader
llm-eval-kit evaluate --grader exact_match --data samples.jsonl

# Custom grader with output
llm-eval-kit evaluate --grader-path my_module:my_grader --data samples.jsonl --output results.jsonl

# BFCL format with sample limit
llm-eval-kit evaluate --grader tool_call --data BFCL_v3_simple.json --format bfcl --max-samples 50
```

## `list-graders`

Show all registered graders.

```bash
llm-eval-kit list-graders
```

## `validate`

Check a dataset file for schema errors.

```bash
llm-eval-kit validate --data <path>
```

## `deploy`

Deploy a grader as an AWS Lambda function. Requires `uv pip install -e ".[deploy]"`.

```bash
llm-eval-kit deploy --grader <name> [options]
```

| Option | Description |
|--------|-------------|
| `--grader` | Built-in grader name |
| `--grader-path` | Custom grader as `module.path:function_name` |
| `--config` | Path to `llm_eval_kit.yaml` config file |

See [deploy.md](deploy.md) for the full deployment walkthrough.
98 changes: 98 additions & 0 deletions docs/datasets.md
@@ -0,0 +1,98 @@
# Datasets

llm-eval-kit supports loading evaluation data from JSONL files, BFCL-formatted files, and HuggingFace Hub.

## JSONL Format

Each line is a JSON object with `id`, `messages`, and `ground_truth`:

```jsonl
{"id": "1", "messages": [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "4"}], "ground_truth": "4"}
{"id": "2", "messages": [{"role": "user", "content": "Capital of France?"}, {"role": "assistant", "content": "Paris"}], "ground_truth": "Paris"}
```

Load from CLI:

```bash
llm-eval-kit evaluate --grader exact_match --data samples.jsonl
```

Load from Python:

```python
from llm_eval_kit.datasets.loader import load_jsonl

dataset = load_jsonl("samples.jsonl", max_samples=100)
```

Validate a file before running:

```bash
llm-eval-kit validate --data samples.jsonl
```

## BFCL Format

The [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html) uses a specific JSONL format with `id`, `question` (list of message dicts), and `function` (tool definitions).

```bash
llm-eval-kit evaluate \
--grader tool_call \
--data BFCL_v3_multiple.json \
--format bfcl
```

```python
from llm_eval_kit.datasets.loader import load_bfcl

dataset = load_bfcl("BFCL_v3_multiple.json", max_samples=100)
```

## HuggingFace Hub

Pull datasets directly from HuggingFace. Requires `uv pip install -e ".[datasets]"`.

```python
from llm_eval_kit.datasets.loader import load_huggingface

dataset = load_huggingface(
"gorilla-llm/Berkeley-Function-Calling-Leaderboard",
split="train",
max_samples=50,
data_files="BFCL_v3_exec_simple.json", # pick a specific file
prompt_key="question",
ground_truth_key="ground_truth",
id_key="id",
response_key=None,
)
```

### Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `dataset_name` | (required) | HF dataset name (e.g. `"gorilla-llm/Berkeley-Function-Calling-Leaderboard"`) |
| `split` | `"train"` | Dataset split |
| `max_samples` | `None` | Limit number of samples |
| `token` | `None` | HF API token (falls back to `HF_TOKEN` env var) |
| `data_files` | `None` | Specific file(s) to load from the repo |
| `config_name` | `None` | Dataset config/subset name |
| `prompt_key` | `"prompt"` | Column name for the prompt |
| `response_key` | `"response"` | Column name for model response (`None` to skip) |
| `ground_truth_key` | `"ground_truth"` | Column name for ground truth (`None` to skip) |
| `id_key` | `"id"` | Column name for sample ID (`None` to auto-generate) |

### BFCL on HuggingFace

The BFCL repo has ~49 files with different schemas. You must use `data_files` to select one — loading the entire repo will fail.

Available files include: `BFCL_v3_simple.json`, `BFCL_v3_multiple.json`, `BFCL_v3_parallel.json`, `BFCL_v3_exec_simple.json`, `BFCL_v3_live_simple.json`, and more.

### Private/Gated Datasets

```python
dataset = load_huggingface(
"my-org/my-private-dataset",
token="hf_...", # or set HF_TOKEN env var
)
```