[Bug]: Custom dataset name not getting reflected in sample_idx_map.json #284

@anandhu-eng

Description

Bug Description

When we run a benchmark with a custom dataset name, the run fails at the accuracy-evaluation stage with a KeyError. Looking at sample_idx_map.json, it appears that for any dataset name that is not predefined (e.g. open_orca), the generic key "Dataset" is used instead.

Note:

The flow executes successfully if we modify the config snippet as follows:

datasets:
  - name: "Dataset"
    type: "accuracy"
    samples: 24576
    path: "/home/anandhusooraj/endpoints/open_orca_gpt4_tokenized_llama.sampled_24576.parquet"
    accuracy_config:
      eval_method: "rouge"
      extractor: "identity_extractor"
      ground_truth: "output"
  - name: "Dataset"
    type: "performance"
    samples: 24576
    path: "/home/anandhusooraj/endpoints/open_orca_gpt4_tokenized_llama.sampled_24576.parquet" 
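The mismatch can be reproduced in isolation. The map structure below is an assumption based on the observed sample_idx_map.json contents (everything stored under the generic "Dataset" key), not the actual writer code:

```python
import json

# Assumed shape of sample_idx_map.json when a custom dataset name is
# configured: the writer appears to fall back to the generic "Dataset" key.
sample_idx_map = json.loads('{"Dataset": {"0": 0, "1": 1}}')

# Lookup with the generic name succeeds ...
assert "Dataset" in sample_idx_map

# ... but the custom name from the config raises, matching the traceback
# in the Relevant Logs section.
try:
    sample_idx_map["Dataset-openorca"]
except KeyError as e:
    print(f"KeyError: {e}")  # prints: KeyError: 'Dataset-openorca'
```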

Steps to Reproduce

  1. Launch a vllm server with llama2-70b/llama2-7b
  2. Config file:
# Online Latency Benchmark
name: "online-llama2-70b-orca-benchmark"
version: "1.0"
type: "offline"
#benchmark_mode: "online"

model_params:
  name: "meta-llama/Llama-2-7b-chat-hf"
  temperature: 0
  top_p: 1
  max_new_tokens: 1024

datasets:
  - name: "Dataset-openorca"
    type: "accuracy"
    samples: 24576
    path: "/home/anandhusooraj/endpoints/open_orca_gpt4_tokenized_llama.sampled_24576.parquet"
    accuracy_config:
      eval_method: "rouge"
      extractor: "identity_extractor"
      ground_truth: "output"
  - name: "Dataset-openorca"
    type: "performance"
    samples: 24576
    path: "/home/anandhusooraj/endpoints/open_orca_gpt4_tokenized_llama.sampled_24576.parquet" 
settings:
  runtime:
    min_duration_ms: 600000 # 10 minutes
    #max_duration_ms: 600000 # 10 minutes
    scheduler_random_seed: 42 # For Poisson/distribution sampling
    dataloader_random_seed: 42 # For dataset shuffling
    n_samples_to_issue: 24576
  load_pattern:
    type: "max_throughput"
    #target_qps: 10

  client:
    num_workers: 4

metrics:
  collect:
    - "throughput"
    - "latency"
    - "ttft"
    - "tpot"

endpoint_config:
  endpoints:
    - "http://localhost:9000"
  api_key: null

report_dir: results/llama2_70b_orca_benchmark_mlperf_parq/

  3. Run command:
inference-endpoint benchmark from-config -c examples/06_Llama2-70B_Example/online_llama2_70b_orca_backup.yaml --timeout 600000

Environment

OS: Ubuntu 24.04
Python: 3.12.3
Endpoints repo latest commit hash: 8c0c63d

Relevant Logs

Error log:


(endp) anandhusooraj@mlc2:~/endpoints$ inference-endpoint benchmark from-config -c examples/06_Llama2-70B_Example/online_llama2_70b_orca_backup.yaml --timeout 600000
2026-04-16 14:46:42,753 - inference_endpoint.endpoint_client.cpu_affinity - INFO - CPU affinity: 224 online CPUs available to process
2026-04-16 14:46:42,763 - inference_endpoint.endpoint_client.cpu_affinity - INFO - CPU affinity: 112 physical cores across 2 NUMA nodes, requesting 5 for loadgen, 4 workers
2026-04-16 14:46:42,772 - inference_endpoint.endpoint_client.cpu_affinity - INFO - LoadGen pinned to 10 CPUs (5 physical cores)
2026-04-16 14:46:42,777 - inference_endpoint.commands.benchmark.execute - INFO - Loading tokenizer for model: meta-llama/Llama-2-7b-chat-hf
2026-04-16 14:46:42,884 - httpx - INFO - HTTP Request: HEAD https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/config.json "HTTP/1.1 401 Unauthorized"
2026-04-16 14:46:42,947 - httpx - INFO - HTTP Request: HEAD https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json "HTTP/1.1 401 Unauthorized"
2026-04-16 14:46:43,000 - httpx - INFO - HTTP Request: GET https://huggingface.co/api/models/meta-llama/Llama-2-7b-chat-hf/tree/main/additional_chat_templates?recursive=false&expand=false "HTTP/1.1 404 Not Found"
2026-04-16 14:46:43,064 - httpx - INFO - HTTP Request: GET https://huggingface.co/api/models/meta-llama/Llama-2-7b-chat-hf/tree/main?recursive=true&expand=false "HTTP/1.1 200 OK"
2026-04-16 14:46:43,240 - inference_endpoint.commands.benchmark.execute - INFO - Tokenizer loaded successfully
2026-04-16 14:46:43,241 - inference_endpoint.commands.benchmark.execute - INFO - Streaming: disabled (off)
2026-04-16 14:46:43,757 - inference_endpoint.commands.benchmark.execute - INFO - Loaded <inference_endpoint.dataset_manager.dataset.Dataset object at 0x7958a9e0e810> - 24576 samples
2026-04-16 14:46:44,319 - inference_endpoint.commands.benchmark.execute - INFO - Loaded 24576 samples
2026-04-16 14:46:44,319 - inference_endpoint.commands.benchmark.execute - INFO - Mode: TestMode.PERF, Target QPS: None, Responses: False
2026-04-16 14:46:44,319 - inference_endpoint.commands.benchmark.execute - INFO - Min Duration: 600.0s, Expected samples: 49152
2026-04-16 14:46:44,320 - inference_endpoint.commands.benchmark.execute - INFO - Scheduler: MaxThroughputScheduler (pattern: max_throughput)
meta-llama/Llama-2-7b-chat-hf (Streaming: False):   0%|                                                                                                         | 0/49152 [00:00<?, ?it/s]
2026-04-16 14:46:44,327 - inference_endpoint.commands.benchmark.execute - INFO - Connecting: ['http://localhost:9000']
2026-04-16 14:46:46,889 - inference_endpoint.endpoint_client.http_client - INFO - EndpointClient initialized with num_workers=4, endpoints=['http://localhost:9000/v1/chat/completions'], adapter=OpenAIMsgspecAdapter, accumulator=OpenAISSEAccumulator, transport=zmq
2026-04-16 14:46:46,890 - inference_endpoint.commands.benchmark.execute - INFO - Running...
2026-04-16 14:46:47,550 - inference_endpoint.load_generator.session - INFO - All performance samples issued
2026-04-16 14:46:48,121 - inference_endpoint.load_generator.session - INFO - All accuracy samples issued
meta-llama/Llama-2-7b-chat-hf (Streaming: False): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 49152/49152 [17:40<00:00, 46.37it/s]
----------------- Summary -----------------
Version: 0.1.0
Git SHA: 8c0c63d
Test started at: (timestamp_ns):4653411357432119, approx. wall-clock time: (2026-04-16 14:46:46)
Total samples issued: 24576
Total samples completed: 24576
Total samples failed: 0
Duration: 654.46 seconds
QPS: 37.55
TPS: 11089.98
----------------- End of Summary -----------------
2026-04-16 15:04:25,890 - inference_endpoint.load_generator.session - INFO - Report saved to results/llama2_70b_orca_benchmark_mlperf_parq/report.txt
2026-04-16 15:04:25,910 - inference_endpoint.commands.benchmark.execute - INFO - Cleaning up...
meta-llama/Llama-2-7b-chat-hf (Streaming: False): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 49152/49152 [17:41<00:00, 46.30it/s]
2026-04-16 15:04:25,912 - inference_endpoint.endpoint_client.http_client - INFO - [bfdeb7ec] Shutting down...
2026-04-16 15:04:26,424 - inference_endpoint.endpoint_client.http_client - INFO - [bfdeb7ec] Shutdown complete.
Traceback (most recent call last):
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/main.py", line 128, in run
    app.meta()
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/cyclopts/core.py", line 1889, in __call__
    result = _run_maybe_async_command(command, bound, resolved_backend)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/cyclopts/_run.py", line 50, in _run_maybe_async_command
    return command(*bound.args, **bound.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/main.py", line 73, in launcher
    app(tokens)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/cyclopts/core.py", line 1889, in __call__
    result = _run_maybe_async_command(command, bound, resolved_backend)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/cyclopts/_run.py", line 50, in _run_maybe_async_command
    return command(*bound.args, **bound.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/commands/benchmark/cli.py", line 112, in from_config
    _run(resolved, [], test_mode)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/commands/benchmark/cli.py", line 54, in _run
    run_benchmark(config, mode)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/commands/benchmark/execute.py", line 481, in run_benchmark
    finalize_benchmark(ctx, report, collector)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/commands/benchmark/execute.py", line 405, in finalize_benchmark
    scorer_instance = eval_cfg.scorer(
                      ^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/evaluation/scoring.py", line 228, in __init__
    super().__init__(*args, **kwargs)
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/evaluation/scoring.py", line 112, in __init__
    self.sample_index_map = self._load_sample_index_map()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anandhusooraj/endpoints/endp/lib/python3.12/site-packages/inference_endpoint/evaluation/scoring.py", line 123, in _load_sample_index_map
    return d[self.dataset_name]  # Implicitly raises KeyError
           ~^^^^^^^^^^^^^^^^^^^
KeyError: 'Dataset-openorca'
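A possible defensive fix — purely a sketch, the function name below mirrors _load_sample_index_map from the traceback but this is not the actual scoring.py implementation — would be to fall back to the sole key in the map, or at least fail with a message naming the available keys:

```python
def load_sample_index_map(index_map: dict, dataset_name: str) -> dict:
    """Look up a dataset's entry, tolerating a single generic key.

    Sketch only: the fallback behavior is a suggestion, not existing code.
    """
    if dataset_name in index_map:
        return index_map[dataset_name]
    # Fallback: if the writer stored everything under a single generic key
    # (the observed "Dataset" behavior), use that entry instead of crashing.
    if len(index_map) == 1:
        return next(iter(index_map.values()))
    # Otherwise fail loudly, but with actionable context.
    raise KeyError(
        f"{dataset_name!r} not found in sample_idx_map.json; "
        f"available keys: {sorted(index_map)}"
    )
```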

Before submitting

  • I searched existing issues and found no duplicates
