[WIP] Evaluation of platinum bench locally#4

Draft
Vahe1994 wants to merge 2 commits into MadryLab:main from Vahe1994:local_evals

@Vahe1994 Vahe1994 commented Feb 9, 2026

This PR makes several changes to Platinum Bench:

  1. Platinum Bench is now an installable package.
  2. Adds support for local evaluation using vLLM.
  3. Adds a command-line interface (CLI).
  4. Enables usage from external projects.
  5. Adds Hugging Face model support.
  6. Integrates W&B logging (including error artifacts).
  7. Improves seed handling.
  8. Updates the README and documentation.

  • Add vLLM support to evaluate models locally
  • Make platinumbench an installable Python package
  • Create a CLI and make it runnable from the terminal
  • Add an interface so other projects can call platinumbench
  • Add support for Hugging Face models
  • Add W&B logging, including error artifacts
  • Correct and update documentation
  • Test that local integration works
  • Fix random seed handling
  • Add the seed to the run key/identifier
  • Test with HSUW
  • Enable reasoning for Qwen models
  • Verify that OpenAI, vLLM, and local generation produce consistent results
  • CLI testing with vLLM serving
  • Add local evals for VLM
  • Check that evaluation with online API requests still works for LLMs
  • Test correctness
  • Update the README after the changes
  • Clean up the code
  • Add and fix project documentation
  • Check that evaluation with online API requests still works for VLMs
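
The seed-related items above ("Fix random seed handling" and "Add the seed to the run key/identifier") could look roughly like the sketch below. This is only an illustration of the idea, not the code in this PR: `make_run_key` and `seeded_rng` are hypothetical names, and the key layout (model, dataset, seed, short hash of the parameters) is an assumption. The point is that two runs differing only in seed get distinct identifiers, so their cached results and logs do not overwrite each other.

```python
import hashlib
import json
import random


def make_run_key(model: str, dataset: str, seed: int, **params) -> str:
    """Build a deterministic run identifier that includes the seed.

    Hypothetical helper: the field names and key layout are assumptions,
    not taken from the platinumbench code base.
    """
    # Serialize with sorted keys so the same settings always hash the same way.
    payload = json.dumps(
        {"model": model, "dataset": dataset, "seed": seed, "params": params},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    return f"{model}__{dataset}__seed{seed}__{digest}"


def seeded_rng(seed: int) -> random.Random:
    """Return a private RNG so seeding does not touch global random state."""
    return random.Random(seed)


# Same settings give the same key; changing only the seed changes the key.
key_a = make_run_key("example-model", "example-dataset", seed=0, temperature=0.0)
key_b = make_run_key("example-model", "example-dataset", seed=1, temperature=0.0)
print(key_a)
print(key_b)
```

Using a dedicated `random.Random(seed)` instance rather than calling `random.seed` globally keeps one component's sampling from silently shifting another's, which is one common source of irreproducible evaluation runs.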

Vahe1994 and others added 2 commits February 9, 2026 13:34
--------------------------------------------------

Co-authored-by: Dan Alistarh <d.alistarh@gmail.com>