
AI Gateway & LiteLLM Integration and LLM Evals UX fixes #3870

Merged
MuhammadKhalilzadeh merged 37 commits into develop from feat/eval-llm-via-ai-gateway-litellm on May 11, 2026

Conversation

@EfeAcar6431
Contributor

Describe your changes

  • Integrated AI Gateway for API key storage and retrieval; keys can be added directly from LLM Evals Settings
  • LiteLLM model catalog is now used across the Add Model modal and scorer configuration, with live model fetching
  • Backend proxy now prefixes model names per provider and handles non-JSON error responses gracefully (see the sketch after the screenshots below)
  • Removed temperature and max_tokens from the UI to prevent failures with models that reject these params
  • Settings page refactored: the Configuration page was removed, Use Case Settings merged in, and the result split into API Keys and Project Settings sections
  • Scorer modal rebuilt: simplified layout, tooltips, LiteLLM judge selector, PASS/FAIL defaults, deduplication, threshold saving fixed
  • Models table updated: combined MODEL column with sub-provider icon + name, PROVIDER column added, DATE renamed to DATE ADDED
  • New Experiment modal: saved models shown as selectable cards with no auto-fill; experiment names now use model - DD/MM/YY format
  • Playground page built: chat interface with Perplexity-style layout, model picker with sub-provider icons, image/document attachments, multimodal vision support, dictation with pulse animation, auto-focus
  • Bias and Toxicity scoring fixed: removed inverted display; 0% = no bias/toxicity = green across all views
  • Breadcrumbs fixed to match sidebar labels and icons for all tabs (Playground, Reports, Models, Bias audits)
  • Fixed NoneType crash in EvalServer when OpenRouter returns null content for a prompt
Screenshot 2026-05-09 at 01 37 38 · Screenshot 2026-05-09 at 01 38 06 · Screenshot 2026-05-09 at 01 38 21
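
For illustration, a minimal Python sketch of the per-provider model-name prefixing mentioned above. The helper and mapping are placeholders, not the actual proxy code; LiteLLM routes on the `provider/` prefix of the model string.

```python
# Illustrative only -- the real logic lives in the backend proxy routes.
PROVIDER_PREFIXES = {
    "openrouter": "openrouter/",
    "gemini": "gemini/",
}

def prefix_model(provider: str, model: str) -> str:
    """Prepend the LiteLLM provider prefix unless the model already carries one."""
    prefix = PROVIDER_PREFIXES.get(provider, "")
    return model if model.startswith(prefix) else f"{prefix}{model}"
```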

Write your issue number after "Fixes "

Fixes #3868, #3869

Please ensure all items are checked off before requesting a review:

  • I deployed the code locally.
  • I have performed a self-review of my code.
  • I have included the issue # in the PR.
  • I have labelled the PR correctly.
  • The issue I am working on is assigned to me.
  • I have avoided using hardcoded values to ensure scalability and maintain consistency across the application.
  • I have ensured that font sizes, color choices, and other UI elements are referenced from the theme.
  • My pull request is focused and addresses a single, specific feature.
  • If there are UI changes, I have attached a screenshot or video to this PR.

EfeAcar6431 and others added 30 commits on March 20, 2026 at 17:40
The reusable workflow and SDK pyproject.toml referenced
verifywise/verifywise instead of the actual org verifywise-ai/verifywise,
which would cause the fallback CI runner download to 404.

Made-with: Cursor
Adds a `verifywise` console command (registered as entry point) with
subcommands for all SDK namespaces: projects, experiments, datasets,
reports, metrics, models, scorers, and logs. Supports --json output,
env var auth (VW_API_URL, VW_API_TOKEN; see the sketch after this
commit), and detailed --help at every level. Includes 35 CLI tests
covering help, auth errors, formatting, and all command handlers.

Made-with: Cursor
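
A rough sketch of the env-var auth fallback; only the names VW_API_URL and VW_API_TOKEN come from the commit, and the helper itself is hypothetical:

```python
import os

def resolve_auth(api_url: str | None = None, api_token: str | None = None) -> tuple[str, str]:
    # Hypothetical helper: explicit arguments win, else fall back to the
    # environment variables named in the commit message.
    url = api_url or os.environ.get("VW_API_URL")
    token = api_token or os.environ.get("VW_API_TOKEN")
    if not url or not token:
        raise SystemExit("verifywise: set VW_API_URL and VW_API_TOKEN to authenticate")
    return url, token
```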
- Fix project_id not passed to eval subprocess config, causing
  NOT NULL violation on llm_evals_logs inserts
- Handle metrics config as list-of-dicts (CI runner format) in
  addition to dict format (frontend format) in run_evaluation.py
  (see the sketch after this commit)
- Fix SDK datasets.upload() missing required org_id form field
- Fix SDK projects.create() using snake_case instead of camelCase
  for useCase field, and extract project wrapper from response
- Update verifywise-eval.yml to use standalone action repo
- Add composite GitHub Action, SDK docs, and CI eval runner script

Made-with: Cursor
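
A minimal sketch of the dual-format metrics handling, assuming illustrative config shapes (the real ones live in run_evaluation.py):

```python
def normalize_metrics(config) -> dict:
    """Accept both shapes (assumed here for illustration):
    CI runner:  [{"name": "faithfulness", "threshold": 0.7}, ...]
    frontend:   {"faithfulness": {"threshold": 0.7}, ...}
    """
    if isinstance(config, list):
        return {m["name"]: {k: v for k, v in m.items() if k != "name"} for m in config}
    return dict(config)
```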
…mats

- Rewrite verifywise-eval.yml with clear examples for chatbot, RAG, and
  agent use cases; list all available metrics in header comments
- Simplify action.yml: env-based arg passing, concise threshold check
- Accept both snake_case and camelCase metric names in run_evaluation.py
  so the CI runner and frontend both work without translation (sketched
  after this commit)
- Add missing agent metrics: plan_adherence, argument_correctness,
  task_completion, step_efficiency

Made-with: Cursor
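
One way the metric-name normalization could work; the helper is illustrative, not the actual run_evaluation.py code:

```python
import re

def to_snake_case(name: str) -> str:
    # "planAdherence" -> "plan_adherence"; snake_case input passes through unchanged.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
```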
Sync updated action.yml and ci_eval_runner.py from verifywise-eval-action.

Made-with: Cursor
…rkflow

- Action now posts PR comments automatically (no extra step needed)
- Score deltas vs previous experiment shown in summary table
- Reusable workflow simplified (removed manual PR comment step)

Made-with: Cursor
The leaderboard sidebar item was accidentally merged from a feature
branch. Hide it behind import.meta.env.DEV so it only appears during
local development.

Made-with: Cursor
- Take upstream for EvalsDashboard UI files, workflow, package-lock
- Keep ours for routes using AI Gateway key injection (deepEvalRoutes, evaluationLlmApiKey)
- Style fix: multi-line LLMProvider union + catch(_error) in evaluationLlmApiKeysService

Made-with: Cursor
- Mirror AI Gateway keys in LLM Evals Settings page (add/delete via /ai-gateway/keys)
- Inject API keys from ai_gateway_api_keys table into experiment requests (judge, model, scorerApiKeys)
- Gracefully handle missing AI Gateway table (42P01) so experiments still work when AIGateway isn't initialised
- Return empty array instead of 502 for GET /ai-gateway/keys when service is down
- Guard Sequelize migration for key copy so it no-ops if either table is absent
- Fix migrate_to_shared_schema.py: cast created_by int→str; skip orphaned FK rows
- Fix Bias/Toxicity display: invert percentage for those metrics (0 = perfect); only Hallucination treated as inverse for colour logic
- Use the backend passed flag for metric pass/fail determination, with a
  score >= 0.5 fallback (sketched after this commit)

Co-authored-by: Cursor <cursoragent@cursor.com>
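
The pass/fail fallback in miniature; the dict keys are assumptions for illustration:

```python
def metric_passed(result: dict) -> bool:
    # Prefer the backend's explicit flag; fall back to the 0.5 score cutoff.
    passed = result.get("passed")
    return passed if passed is not None else result.get("score", 0.0) >= 0.5
```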
…a-ai-gateway-litellm

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	EvaluationModule/requirements.txt
…er model dropdowns

Replace hardcoded static JSON model arrays with a live catalog fetched from
AI Gateway GET /api/ai-gateway/v1/models (backed by litellm.model_cost; see
the sketch after this commit).

- evalModelsService: add getGatewayModelsForProvider() with 5-min module-level cache;
  strips provider prefix so IDs match existing format (gemini/gemini-2.5-pro → gemini-2.5-pro);
  filters to chat-mode models only; maps frontend provider IDs to LiteLLM keys
- NewExperimentModal: load gateway models lazily when user picks a provider (model step
  and judge step); update getProviderModels() to prefer live catalog with graceful
  fallback to bundled static JSON while loading or if AI Gateway is unreachable

Co-authored-by: Cursor <cursoragent@cursor.com>
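
Roughly what the catalog endpoint could do on the gateway side, sketched against litellm.model_cost (a real LiteLLM mapping whose entries carry `mode` and `litellm_provider`); the function itself is illustrative, not the gateway's actual code:

```python
import litellm

def gateway_models(provider: str) -> list[str]:
    # List chat-mode models for one provider and strip the "provider/" prefix
    # so IDs match the existing frontend format.
    models = []
    for model_id, info in litellm.model_cost.items():
        if info.get("litellm_provider") != provider or info.get("mode") != "chat":
            continue
        models.append(model_id.split("/", 1)[-1])  # gemini/gemini-2.5-pro -> gemini-2.5-pro
    return sorted(models)
```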
…rams to AI Gateway

Models like o1, o3, and other reasoning models reject temperature and max_tokens
in their API calls, causing experiments to fail silently.

- AIGateway/llm_service.py: add drop_params=True to both acompletion calls so
  LiteLLM silently drops any unsupported parameter instead of forwarding it to
  the provider and getting an error back (sketched after this commit)
- NewExperimentModal: remove temperature + max tokens input fields from the judge
  config step; remove them from the experiment payload and savePreferences call
- CreateScorerModal: remove temperature slider and max tokens text field from the
  model parameters section
- ProjectScorers: remove temperature/max_tokens from the judgeModel.params payload
  sent when creating or updating scorers
- useModelPreferences: drop temperature/maxTokens from the ModelPreferences type
  and all read/write paths

Co-authored-by: Cursor <cursoragent@cursor.com>
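
A minimal sketch of the drop_params usage; the model string and prompt are placeholders, but drop_params is a documented LiteLLM flag:

```python
import asyncio
import litellm

async def judge(prompt: str) -> str:
    # drop_params=True makes LiteLLM silently drop parameters the target model
    # rejects (e.g. temperature/max_tokens on o1-style reasoning models)
    # instead of forwarding them and surfacing a provider error.
    response = await litellm.acompletion(
        model="openai/o1",  # placeholder model string
        messages=[{"role": "user", "content": prompt}],
        drop_params=True,
    )
    return response.choices[0].message.content

# asyncio.run(judge("Grade this answer ..."))
```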
Resolved conflicts in CreateScorerModal.tsx and EvalsDashboard.tsx by
keeping our versions: improved scorer modal (SectionHeadings, tooltips,
LiteLLM model catalog, no Params button, Top P inline), and Settings
consolidation (Configuration tab removed, use case settings merged into
Settings, gear icon).

Co-authored-by: Cursor <cursoragent@cursor.com>
…l UX fixes

- Add Playground page with chat interface, file attachments, and dictation
- Backend: POST /api/deepeval/playground/chat with LiteLLM model routing
- Fix LiteLLM model string prefix (openrouter/*, gemini/*, etc.)
- Gracefully handle non-JSON AI Gateway error responses
- Fix delete API key bug (missing id argument)
- Models table: consolidate into MODEL + PROVIDER columns, add Moonshot/Qwen icons
- NewExperimentModal: saved model selection no longer fills provider grid or model field
- EvalServer: handle None content from OpenRouter (transient empty response;
  guard sketched after this commit)

Co-authored-by: Cursor <cursoragent@cursor.com>
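
The None-content guard in miniature; the helper name is illustrative, not the EvalServer's actual code:

```python
def safe_content(response) -> str:
    # OpenRouter can transiently return null content for a prompt; coalesce to ""
    # before any string operations to avoid the NoneType crash.
    content = response.choices[0].message.content
    return content if content is not None else ""
```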
…y scoring fixes

- Playground: display actual image previews in chat bubbles
- Playground: send multimodal content (image_url) for vision-capable models
  (message shape sketched after this commit)
- Playground: disable file upload with tooltip for text-only models
- Playground: "Add model" navigates to Models page and auto-opens add modal
- Playground: Perplexity-style layout (composer centered when empty, pinned to bottom when chatting)
- Playground: auto-focus textarea on load, after responses, file upload, and dictation
- Playground: fix provider display names in Add Model dropdown (OpenAI, OpenRouter, xAI, etc.)
- Breadcrumbs: add missing Playground, Reports entries; fix Models (Cpu) and Bias audits (Scale) icons
- Experiments: rename format to "model - DD/MM/YY"
- Bias/Toxicity: remove incorrect score inversion; 0% = no bias/toxicity = green
- Bias/Toxicity: fix getScoreColor and passed flag to treat lower-is-better correctly

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
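
For reference, the OpenAI-style multimodal message shape that LiteLLM forwards to vision-capable models; the shape is standard, the helper itself is illustrative:

```python
def vision_message(text: str, image_data_url: str) -> dict:
    # One user turn combining text with an attached image (e.g. a data: URL).
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_data_url}},
        ],
    }
```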
@EfeAcar6431 added this to the 2.4 milestone on May 9, 2026
@EfeAcar6431 self-assigned this on May 9, 2026
@EfeAcar6431 added the enhancement, frontend, and backend labels on May 9, 2026
EfeAcar6431 and others added 7 commits on May 9, 2026 at 01:43
- Add .venv/, **/venv/, **/.venv/, env/, ENV/ to gitignore
- Add coverage/, **/coverage/, .nyc_output/, htmlcov/, .coverage
- Add common Python cruft (*.pyc, *.so, __pycache__, .pytest_cache)
- Add **/node_modules/, **/dist/, **/build/, *.tsbuildinfo, *.log
- Untrack 520 EvalServer/.venv files and 11 Servers/coverage files

Co-authored-by: Cursor <cursoragent@cursor.com>
Format all files flagged by CI Prettier checks:
- Clients: api services, GovernanceOS components, ProviderIcons,
  ScorersTable, ModelsTable, DatasetsTable, EvalsDashboard pages
- Servers: AI Gateway routes, deepEvalRoutes, evaluationLlmApiKey
  route/utils, and migration file

Co-authored-by: Cursor <cursoragent@cursor.com>
…utes

Co-authored-by: Cursor <cursoragent@cursor.com>
- Update evaluationLlmApiKeysService tests: new /ai-gateway/keys endpoint,
  gateway payload shape (api_key, key_name), delete by numeric ID, hasKey
  via getAllKeys, verifyKey via /ai-gateway/keys/verify
- Update deepEval.repository test: deleteLlmApiKey now takes provider + id

Co-authored-by: Cursor <cursoragent@cursor.com>
Add 25 missing strings to both German and French dictionaries:
- Scorer UI: name/slug field descriptions, choice label, add choice,
  pass threshold tooltip, judge LLM description, use-case type label
- Models table: DATE ADDED, LAST RUN column headers
- Playground: Add model, Clear conversation, No saved models yet,
  Saved Models, AI Gateway is not running
- Provider names: Google, Groq, Meta, Microsoft, Moonshot AI,
  Nous Research, Perplexity, Qwen
- General: Project settings, An unexpected error occurred

Co-authored-by: Cursor <cursoragent@cursor.com>
Merge upstream/develop Governance OS strings alongside existing LLM Evals
additions for both de and fr locales.

Co-authored-by: Cursor <cursoragent@cursor.com>
@MuhammadKhalilzadeh merged commit 3cc2f02 into develop on May 11, 2026
6 checks passed

Development

Successfully merging this pull request may close these issues.

Store API keys for Evals in AI Gateway
