
AI Gateway & LiteLLM Integration and LLM Evals UX fixes #3870

Merged
MuhammadKhalilzadeh merged 37 commits into develop from feat/eval-llm-via-ai-gateway-litellm on May 11, 2026

Conversation

@EfeAcar6431
Contributor

Describe your changes

  • Integrated AI Gateway for API key storage and retrieval; keys can be added directly from LLM Evals Settings
  • LiteLLM model catalog is now used across the Add Model modal and scorer configuration, with live model fetching
  • Backend proxy now prefixes model names per provider and handles non-JSON error responses gracefully (see the sketch after the screenshots below)
  • Removed temperature and max_tokens from the UI to prevent failures with models that reject these params
  • Settings page refactored: the Configuration page was removed, Use Case Settings merged in, and the result split into API Keys and Project Settings sections
  • Scorer modal rebuilt: simplified layout, tooltips, LiteLLM judge selector, PASS/FAIL defaults, deduplication, threshold saving fixed
  • Models table updated: combined MODEL column with sub-provider icon + name, PROVIDER column added, DATE renamed to DATE ADDED
  • New Experiment modal: saved models shown as selectable cards with no auto-fill; experiment names now use model - DD/MM/YY format
  • Playground page built: chat interface with Perplexity-style layout, model picker with sub-provider icons, image/document attachments, multimodal vision support, dictation with pulse animation, auto-focus
  • Bias and Toxicity scoring fixed: removed inverted display; 0% = no bias/toxicity = green across all views
  • Breadcrumbs fixed to match sidebar labels and icons for all tabs (Playground, Reports, Models, Bias audits)
  • Fixed NoneType crash in EvalServer when OpenRouter returns null content for a prompt
Screenshot 2026-05-09 at 01 37 38 · Screenshot 2026-05-09 at 01 38 06 · Screenshot 2026-05-09 at 01 38 21
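
For illustration, a minimal Python sketch of the per-provider model-name prefixing mentioned above. The helper and mapping are placeholders, not the actual proxy code; LiteLLM routes on the `provider/` prefix of the model string.

```python
# Illustrative only -- the real logic lives in the backend proxy routes.
PROVIDER_PREFIXES = {
    "openrouter": "openrouter/",
    "gemini": "gemini/",
}

def prefix_model(provider: str, model: str) -> str:
    """Prepend the LiteLLM provider prefix unless the model already carries one."""
    prefix = PROVIDER_PREFIXES.get(provider, "")
    return model if model.startswith(prefix) else f"{prefix}{model}"
```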

Write your issue number after "Fixes "

Fixes #3868, #3869

Please ensure all items are checked off before requesting a review:

  • I deployed the code locally.
  • I have performed a self-review of my code.
  • I have included the issue # in the PR.
  • I have labelled the PR correctly.
  • The issue I am working on is assigned to me.
  • I have avoided using hardcoded values to ensure scalability and maintain consistency across the application.
  • I have ensured that font sizes, color choices, and other UI elements are referenced from the theme.
  • My pull request is focused and addresses a single, specific feature.
  • If there are UI changes, I have attached a screenshot or video to this PR.

EfeAcar6431 and others added 30 commits on March 20, 2026 at 17:40
The reusable workflow and SDK pyproject.toml referenced
verifywise/verifywise instead of the actual org verifywise-ai/verifywise,
which would cause the fallback CI runner download to 404.

Made-with: Cursor
Adds a `verifywise` console command (registered as entry point) with
subcommands for all SDK namespaces: projects, experiments, datasets,
reports, metrics, models, scorers, and logs. Supports --json output,
env var auth (VW_API_URL, VW_API_TOKEN; see the sketch after this
commit), and detailed --help at every level. Includes 35 CLI tests
covering help, auth errors, formatting, and all command handlers.

Made-with: Cursor
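
A rough sketch of the env-var auth fallback; only the names VW_API_URL and VW_API_TOKEN come from the commit, and the helper itself is hypothetical:

```python
import os

def resolve_auth(api_url: str | None = None, api_token: str | None = None) -> tuple[str, str]:
    # Hypothetical helper: explicit arguments win, else fall back to the
    # environment variables named in the commit message.
    url = api_url or os.environ.get("VW_API_URL")
    token = api_token or os.environ.get("VW_API_TOKEN")
    if not url or not token:
        raise SystemExit("verifywise: set VW_API_URL and VW_API_TOKEN to authenticate")
    return url, token
```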
- Fix project_id not passed to eval subprocess config, causing
  NOT NULL violation on llm_evals_logs inserts
- Handle metrics config as list-of-dicts (CI runner format) in
  addition to dict format (frontend format) in run_evaluation.py
  (see the sketch after this commit)
- Fix SDK datasets.upload() missing required org_id form field
- Fix SDK projects.create() using snake_case instead of camelCase
  for useCase field, and extract project wrapper from response
- Update verifywise-eval.yml to use standalone action repo
- Add composite GitHub Action, SDK docs, and CI eval runner script

Made-with: Cursor
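
A minimal sketch of the dual-format metrics handling, assuming illustrative config shapes (the real ones live in run_evaluation.py):

```python
def normalize_metrics(config) -> dict:
    """Accept both shapes (assumed here for illustration):
    CI runner:  [{"name": "faithfulness", "threshold": 0.7}, ...]
    frontend:   {"faithfulness": {"threshold": 0.7}, ...}
    """
    if isinstance(config, list):
        return {m["name"]: {k: v for k, v in m.items() if k != "name"} for m in config}
    return dict(config)
```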
…mats

- Rewrite verifywise-eval.yml with clear examples for chatbot, RAG, and
  agent use cases; list all available metrics in header comments
- Simplify action.yml: env-based arg passing, concise threshold check
- Accept both snake_case and camelCase metric names in run_evaluation.py
  so the CI runner and frontend both work without translation (sketched
  after this commit)
- Add missing agent metrics: plan_adherence, argument_correctness,
  task_completion, step_efficiency

Made-with: Cursor
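
One way the metric-name normalization could work; the helper is illustrative, not the actual run_evaluation.py code:

```python
import re

def to_snake_case(name: str) -> str:
    # "planAdherence" -> "plan_adherence"; snake_case input passes through unchanged.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
```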
Sync updated action.yml and ci_eval_runner.py from verifywise-eval-action.

Made-with: Cursor
…rkflow

- Action now posts PR comments automatically (no extra step needed)
- Score deltas vs previous experiment shown in summary table
- Reusable workflow simplified (removed manual PR comment step)

Made-with: Cursor
The leaderboard sidebar item was accidentally merged from a feature
branch. Hide it behind import.meta.env.DEV so it only appears during
local development.

Made-with: Cursor
- Take upstream for EvalsDashboard UI files, workflow, package-lock
- Keep ours for routes using AI Gateway key injection (deepEvalRoutes, evaluationLlmApiKey)
- Style fix: multi-line LLMProvider union + catch(_error) in evaluationLlmApiKeysService

Made-with: Cursor
- Mirror AI Gateway keys in LLM Evals Settings page (add/delete via /ai-gateway/keys)
- Inject API keys from ai_gateway_api_keys table into experiment requests (judge, model, scorerApiKeys)
- Gracefully handle missing AI Gateway table (42P01) so experiments still work when AIGateway isn't initialised
- Return empty array instead of 502 for GET /ai-gateway/keys when service is down
- Guard Sequelize migration for key copy so it no-ops if either table is absent
- Fix migrate_to_shared_schema.py: cast created_by int→str; skip orphaned FK rows
- Fix Bias/Toxicity display: invert percentage for those metrics (0 = perfect); only Hallucination treated as inverse for colour logic
- Use the backend passed flag for metric pass/fail determination, with a
  score >= 0.5 fallback (sketched after this commit)

Co-authored-by: Cursor <cursoragent@cursor.com>
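
The pass/fail fallback in miniature; the dict keys are assumptions for illustration:

```python
def metric_passed(result: dict) -> bool:
    # Prefer the backend's explicit flag; fall back to the 0.5 score cutoff.
    passed = result.get("passed")
    return passed if passed is not None else result.get("score", 0.0) >= 0.5
```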
…a-ai-gateway-litellm

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	EvaluationModule/requirements.txt
…er model dropdowns

Replace hardcoded static JSON model arrays with a live catalog fetched from
AI Gateway GET /api/ai-gateway/v1/models (backed by litellm.model_cost; see
the sketch after this commit).

- evalModelsService: add getGatewayModelsForProvider() with 5-min module-level cache;
  strips provider prefix so IDs match existing format (gemini/gemini-2.5-pro → gemini-2.5-pro);
  filters to chat-mode models only; maps frontend provider IDs to LiteLLM keys
- NewExperimentModal: load gateway models lazily when user picks a provider (model step
  and judge step); update getProviderModels() to prefer live catalog with graceful
  fallback to bundled static JSON while loading or if AI Gateway is unreachable

Co-authored-by: Cursor <cursoragent@cursor.com>
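
Roughly what the catalog endpoint could do on the gateway side, sketched against litellm.model_cost (a real LiteLLM mapping whose entries carry `mode` and `litellm_provider`); the function itself is illustrative, not the gateway's actual code:

```python
import litellm

def gateway_models(provider: str) -> list[str]:
    # List chat-mode models for one provider and strip the "provider/" prefix
    # so IDs match the existing frontend format.
    models = []
    for model_id, info in litellm.model_cost.items():
        if info.get("litellm_provider") != provider or info.get("mode") != "chat":
            continue
        models.append(model_id.split("/", 1)[-1])  # gemini/gemini-2.5-pro -> gemini-2.5-pro
    return sorted(models)
```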
…rams to AI Gateway

Models like o1, o3, and other reasoning models reject temperature and max_tokens
in their API calls, causing experiments to fail silently.

- AIGateway/llm_service.py: add drop_params=True to both acompletion calls so
  LiteLLM silently drops any unsupported parameter instead of forwarding it to
  the provider and getting an error back (sketched after this commit)
- NewExperimentModal: remove temperature + max tokens input fields from the judge
  config step; remove them from the experiment payload and savePreferences call
- CreateScorerModal: remove temperature slider and max tokens text field from the
  model parameters section
- ProjectScorers: remove temperature/max_tokens from the judgeModel.params payload
  sent when creating or updating scorers
- useModelPreferences: drop temperature/maxTokens from the ModelPreferences type
  and all read/write paths

Co-authored-by: Cursor <cursoragent@cursor.com>
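
A minimal sketch of the drop_params usage; the model string and prompt are placeholders, but drop_params is a documented LiteLLM flag:

```python
import asyncio
import litellm

async def judge(prompt: str) -> str:
    # drop_params=True makes LiteLLM silently drop parameters the target model
    # rejects (e.g. temperature/max_tokens on o1-style reasoning models)
    # instead of forwarding them and surfacing a provider error.
    response = await litellm.acompletion(
        model="openai/o1",  # placeholder model string
        messages=[{"role": "user", "content": prompt}],
        drop_params=True,
    )
    return response.choices[0].message.content

# asyncio.run(judge("Grade this answer ..."))
```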
Resolved conflicts in CreateScorerModal.tsx and EvalsDashboard.tsx by
keeping our versions: improved scorer modal (SectionHeadings, tooltips,
LiteLLM model catalog, no Params button, Top P inline), and Settings
consolidation (Configuration tab removed, use case settings merged into
Settings, gear icon).

Co-authored-by: Cursor <cursoragent@cursor.com>
…l UX fixes

- Add Playground page with chat interface, file attachments, and dictation
- Backend: POST /api/deepeval/playground/chat with LiteLLM model routing
- Fix LiteLLM model string prefix (openrouter/*, gemini/*, etc.)
- Gracefully handle non-JSON AI Gateway error responses
- Fix delete API key bug (missing id argument)
- Models table: consolidate into MODEL + PROVIDER columns, add Moonshot/Qwen icons
- NewExperimentModal: saved model selection no longer fills provider grid or model field
- EvalServer: handle None content from OpenRouter (transient empty response;
  guard sketched after this commit)

Co-authored-by: Cursor <cursoragent@cursor.com>
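
The None-content guard in miniature; the helper name is illustrative, not the EvalServer's actual code:

```python
def safe_content(response) -> str:
    # OpenRouter can transiently return null content for a prompt; coalesce to ""
    # before any string operations to avoid the NoneType crash.
    content = response.choices[0].message.content
    return content if content is not None else ""
```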
…y scoring fixes

- Playground: display actual image previews in chat bubbles
- Playground: send multimodal content (image_url) for vision-capable models
  (message shape sketched after this commit)
- Playground: disable file upload with tooltip for text-only models
- Playground: "Add model" navigates to Models page and auto-opens add modal
- Playground: Perplexity-style layout (composer centered when empty, pinned to bottom when chatting)
- Playground: auto-focus textarea on load, after responses, file upload, and dictation
- Playground: fix provider display names in Add Model dropdown (OpenAI, OpenRouter, xAI, etc.)
- Breadcrumbs: add missing Playground, Reports entries; fix Models (Cpu) and Bias audits (Scale) icons
- Experiments: rename format to "model - DD/MM/YY"
- Bias/Toxicity: remove incorrect score inversion; 0% = no bias/toxicity = green
- Bias/Toxicity: fix getScoreColor and passed flag to treat lower-is-better correctly

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
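
For reference, the OpenAI-style multimodal message shape that LiteLLM forwards to vision-capable models; the shape is standard, the helper itself is illustrative:

```python
def vision_message(text: str, image_data_url: str) -> dict:
    # One user turn combining text with an attached image (e.g. a data: URL).
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_data_url}},
        ],
    }
```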
@EfeAcar6431 added this to the 2.4 milestone on May 9, 2026
@EfeAcar6431 self-assigned this on May 9, 2026
@EfeAcar6431 added the enhancement, frontend, and backend labels on May 9, 2026
EfeAcar6431 and others added 7 commits on May 9, 2026 at 01:43
- Add .venv/, **/venv/, **/.venv/, env/, ENV/ to gitignore
- Add coverage/, **/coverage/, .nyc_output/, htmlcov/, .coverage
- Add common Python cruft (*.pyc, *.so, __pycache__, .pytest_cache)
- Add **/node_modules/, **/dist/, **/build/, *.tsbuildinfo, *.log
- Untrack 520 EvalServer/.venv files and 11 Servers/coverage files

Co-authored-by: Cursor <cursoragent@cursor.com>
Format all files flagged by CI Prettier checks:
- Clients: api services, GovernanceOS components, ProviderIcons,
  ScorersTable, ModelsTable, DatasetsTable, EvalsDashboard pages
- Servers: AI Gateway routes, deepEvalRoutes, evaluationLlmApiKey
  route/utils, and migration file

Co-authored-by: Cursor <cursoragent@cursor.com>
…utes

Co-authored-by: Cursor <cursoragent@cursor.com>
- Update evaluationLlmApiKeysService tests: new /ai-gateway/keys endpoint,
  gateway payload shape (api_key, key_name), delete by numeric ID, hasKey
  via getAllKeys, verifyKey via /ai-gateway/keys/verify
- Update deepEval.repository test: deleteLlmApiKey now takes provider + id

Co-authored-by: Cursor <cursoragent@cursor.com>
Add 25 missing strings to both German and French dictionaries:
- Scorer UI: name/slug field descriptions, choice label, add choice,
  pass threshold tooltip, judge LLM description, use-case type label
- Models table: DATE ADDED, LAST RUN column headers
- Playground: Add model, Clear conversation, No saved models yet,
  Saved Models, AI Gateway is not running
- Provider names: Google, Groq, Meta, Microsoft, Moonshot AI,
  Nous Research, Perplexity, Qwen
- General: Project settings, An unexpected error occurred

Co-authored-by: Cursor <cursoragent@cursor.com>
Merge upstream/develop Governance OS strings alongside existing LLM Evals
additions for both de and fr locales.

Co-authored-by: Cursor <cursoragent@cursor.com>
@MuhammadKhalilzadeh merged commit 3cc2f02 into develop on May 11, 2026
6 checks passed

Development

Successfully merging this pull request may close these issues.

Store API keys for Evals in AI Gateway
