AI Gateway & LiteLLM Integration and LLM Evals UX fixes #3870
Merged
MuhammadKhalilzadeh merged 37 commits into develop on May 11, 2026
Conversation
The reusable workflow and SDK pyproject.toml referenced verifywise/verifywise instead of the actual org verifywise-ai/verifywise, which would cause the fallback CI runner download to 404. Made-with: Cursor
Adds a `verifywise` console command (registered as entry point) with subcommands for all SDK namespaces: projects, experiments, datasets, reports, metrics, models, scorers, and logs. Supports --json output, env var auth (VW_API_URL, VW_API_TOKEN), and detailed --help at every level. Includes 35 CLI tests covering help, auth errors, formatting, and all command handlers. Made-with: Cursor
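The CLI source itself isn't reproduced in this conversation; a minimal sketch of how such an entry point could be laid out with argparse, where the namespaces, flags, and env var names come from the commit text and everything else is an assumption:

```python
# Hypothetical sketch of the `verifywise` entry point described above.
# Usage: verifywise --json projects list
import argparse
import json
import os
import sys

NAMESPACES = ["projects", "experiments", "datasets", "reports",
              "metrics", "models", "scorers", "logs"]

def main() -> int:
    parser = argparse.ArgumentParser(prog="verifywise")
    parser.add_argument("--json", action="store_true", help="emit raw JSON")
    sub = parser.add_subparsers(dest="namespace", required=True)
    for name in NAMESPACES:
        ns = sub.add_parser(name, help=f"{name} commands")
        ns.add_argument("action", choices=["list", "get", "create"])
    args = parser.parse_args()

    # Env var auth as named in the commit message
    api_url = os.environ.get("VW_API_URL")
    api_token = os.environ.get("VW_API_TOKEN")
    if not api_url or not api_token:
        print("error: set VW_API_URL and VW_API_TOKEN", file=sys.stderr)
        return 2

    # Real handlers would call the SDK client here; this just echoes the request.
    result = {"namespace": args.namespace, "action": args.action}
    print(json.dumps(result) if args.json else f"{args.namespace} {args.action}: ok")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```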
- Fix project_id not passed to the eval subprocess config, causing a NOT NULL violation on llm_evals_logs inserts
- Handle metrics config as list-of-dicts (CI runner format) in addition to dict format (frontend format) in run_evaluation.py (sketched below)
- Fix SDK datasets.upload() missing the required org_id form field
- Fix SDK projects.create() using snake_case instead of camelCase for the useCase field, and extract the project wrapper from the response
- Update verifywise-eval.yml to use the standalone action repo
- Add composite GitHub Action, SDK docs, and CI eval runner script

Made-with: Cursor
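A minimal sketch of that list-vs-dict normalisation, assuming a `name` field in each CI-runner entry (the real field names in run_evaluation.py aren't shown here):

```python
# Sketch: accept both the frontend's dict format and the CI runner's
# list-of-dicts format for the metrics config. Field names are assumed.
def normalize_metrics_config(metrics):
    """Return a {metric_name: settings} dict regardless of input shape."""
    if isinstance(metrics, dict):
        return metrics                      # frontend format: already a dict
    normalized = {}
    for entry in metrics:                   # CI runner format: list of dicts
        entry = dict(entry)
        name = entry.pop("name")
        normalized[name] = entry
    return normalized

assert normalize_metrics_config({"faithfulness": {"threshold": 0.7}}) == \
       normalize_metrics_config([{"name": "faithfulness", "threshold": 0.7}])
```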
…mats

- Rewrite verifywise-eval.yml with clear examples for chatbot, RAG, and agent use cases; list all available metrics in the header comments
- Simplify action.yml: env-based arg passing, concise threshold check
- Accept both snake_case and camelCase metric names in run_evaluation.py so the CI runner and frontend both work without translation (see the sketch after this list)
- Add missing agent metrics: plan_adherence, argument_correctness, task_completion, step_efficiency

Made-with: Cursor
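One way to accept both spellings is to canonicalise to snake_case before lookup; a sketch, with the regex conversion being an assumption rather than the PR's actual code:

```python
# Sketch: canonicalise incoming metric names so both naming styles resolve
# to the same key in the metric registry.
import re

def to_snake_case(metric_name: str) -> str:
    """planAdherence -> plan_adherence; snake_case input passes through."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", metric_name).lower()

for alias in ("planAdherence", "plan_adherence"):
    assert to_snake_case(alias) == "plan_adherence"
```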
Sync updated action.yml and ci_eval_runner.py from verifywise-eval-action. Made-with: Cursor
…rkflow

- Action now posts PR comments automatically (no extra step needed)
- Score deltas vs the previous experiment shown in the summary table (sketched below)
- Reusable workflow simplified (removed the manual PR comment step)

Made-with: Cursor
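A sketch of the delta rendering, assuming per-metric score dicts for the current and previous experiments (the real ci_eval_runner.py structures aren't shown in this PR):

```python
# Sketch: compute per-metric deltas vs the previous experiment and render
# a Markdown summary table for the PR comment. Shapes are assumed.
def delta_table(current: dict[str, float], previous: dict[str, float]) -> str:
    rows = ["| Metric | Score | Delta vs previous |", "|---|---|---|"]
    for metric, score in sorted(current.items()):
        prev = previous.get(metric)
        delta = "n/a" if prev is None else f"{score - prev:+.2f}"
        rows.append(f"| {metric} | {score:.2f} | {delta} |")
    return "\n".join(rows)

print(delta_table({"faithfulness": 0.82}, {"faithfulness": 0.75}))
```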
The leaderboard sidebar item was accidentally merged from a feature branch. Hide it behind import.meta.env.DEV so it only appears during local development. Made-with: Cursor
- Take upstream for the EvalsDashboard UI files, workflow, and package-lock
- Keep ours for routes using AI Gateway key injection (deepEvalRoutes, evaluationLlmApiKey)
- Style fix: multi-line LLMProvider union + catch(_error) in evaluationLlmApiKeysService

Made-with: Cursor
- Mirror AI Gateway keys in the LLM Evals Settings page (add/delete via /ai-gateway/keys)
- Inject API keys from the ai_gateway_api_keys table into experiment requests (judge, model, scorerApiKeys)
- Gracefully handle a missing AI Gateway table (42P01) so experiments still work when AIGateway isn't initialised
- Return an empty array instead of 502 for GET /ai-gateway/keys when the service is down
- Guard the Sequelize migration for the key copy so it no-ops if either table is absent
- Fix migrate_to_shared_schema.py: cast created_by int→str; skip orphaned FK rows (see the sketch after this list)
- Fix Bias/Toxicity display: invert the percentage for those metrics (0 = perfect); only Hallucination treated as inverse for colour logic
- Use the backend's passed flag for metric pass/fail determination, with a score >= 0.5 fallback

Co-authored-by: Cursor <cursoragent@cursor.com>
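A sketch of the two migrate_to_shared_schema.py fixes named above; `project_id` as the guarded foreign key and the row shape are assumptions based on the commit text:

```python
# Sketch: cast created_by to str and skip rows whose FK target is missing,
# so the migration no longer crashes on orphaned data.
def migrate_rows(rows, existing_project_ids):
    migrated, skipped = [], 0
    for row in rows:
        if row["project_id"] not in existing_project_ids:
            skipped += 1                    # orphaned FK row: skip, don't crash
            continue
        row = dict(row)
        row["created_by"] = str(row["created_by"])   # int -> str cast
        migrated.append(row)
    return migrated, skipped
```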
…a-ai-gateway-litellm

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#   EvaluationModule/requirements.txt
…er model dropdowns

Replace hardcoded static JSON model arrays with a live catalog fetched from AI Gateway GET /api/ai-gateway/v1/models (backed by litellm.model_cost; see the sketch after this list).

- evalModelsService: add getGatewayModelsForProvider() with a 5-minute module-level cache; strips the provider prefix so IDs match the existing format (gemini/gemini-2.5-pro → gemini-2.5-pro); filters to chat-mode models only; maps frontend provider IDs to LiteLLM keys
- NewExperimentModal: load gateway models lazily when the user picks a provider (model step and judge step); update getProviderModels() to prefer the live catalog, with a graceful fallback to the bundled static JSON while loading or if the AI Gateway is unreachable

Co-authored-by: Cursor <cursoragent@cursor.com>
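The gateway side isn't included in this diff; a sketch of how such an endpoint could derive a per-provider chat-model list from litellm.model_cost, with the filtering rules taken from the commit text and the function shape assumed:

```python
# Sketch only: derives a per-provider chat-model list from litellm.model_cost.
import litellm

def gateway_models_for_provider(provider: str) -> list[str]:
    """Return chat-mode model IDs for one LiteLLM provider, prefix stripped."""
    models = []
    for model_id, info in litellm.model_cost.items():
        if info.get("litellm_provider") != provider:
            continue
        if info.get("mode") != "chat":      # drop embedding/image/etc. entries
            continue
        # gemini/gemini-2.5-pro -> gemini-2.5-pro, matching the bundled JSON IDs
        models.append(model_id.split("/", 1)[-1])
    return sorted(models)

print(gateway_models_for_provider("gemini")[:5])
```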
…rams to AI Gateway

Models like o1, o3, and other reasoning models reject temperature and max_tokens in their API calls, causing experiments to fail silently.

- AIGateway/llm_service.py: add drop_params=True to both acompletion calls so LiteLLM silently drops any unsupported parameter instead of forwarding it to the provider and getting an error back (see the sketch after this list)
- NewExperimentModal: remove the temperature and max tokens input fields from the judge config step; remove them from the experiment payload and the savePreferences call
- CreateScorerModal: remove the temperature slider and max tokens text field from the model parameters section
- ProjectScorers: remove temperature/max_tokens from the judgeModel.params payload sent when creating or updating scorers
- useModelPreferences: drop temperature/maxTokens from the ModelPreferences type and all read/write paths

Co-authored-by: Cursor <cursoragent@cursor.com>
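drop_params is a standard LiteLLM completion argument; a minimal illustration of the behaviour the first bullet relies on, with a placeholder model and prompt (provider credentials must be set in the environment to run it):

```python
# Minimal illustration of drop_params: LiteLLM strips parameters the target
# model doesn't accept instead of forwarding them and getting an error back.
import asyncio
from litellm import acompletion

async def main():
    response = await acompletion(
        model="openai/o1",    # placeholder reasoning model
        messages=[{"role": "user", "content": "Summarise this dataset."}],
        temperature=0.2,      # o1 rejects this parameter...
        max_tokens=256,       # ...and this one
        drop_params=True,     # ...so LiteLLM drops both before the API call
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```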
Resolved conflicts in CreateScorerModal.tsx and EvalsDashboard.tsx by keeping our versions: improved scorer modal (SectionHeadings, tooltips, LiteLLM model catalog, no Params button, Top P inline), and Settings consolidation (Configuration tab removed, use case settings merged into Settings, gear icon). Co-authored-by: Cursor <cursoragent@cursor.com>
…l UX fixes

- Add Playground page with chat interface, file attachments, and dictation
- Backend: POST /api/deepeval/playground/chat with LiteLLM model routing
- Fix the LiteLLM model string prefix (openrouter/*, gemini/*, etc.; sketched below)
- Gracefully handle non-JSON AI Gateway error responses
- Fix delete API key bug (missing id argument)
- Models table: consolidate into MODEL + PROVIDER columns, add Moonshot/Qwen icons
- NewExperimentModal: saved model selection no longer fills the provider grid or model field
- EvalServer: handle None content from OpenRouter (transient empty response)

Co-authored-by: Cursor <cursoragent@cursor.com>
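A sketch of the prefix fix: LiteLLM routes on provider/model strings, so bare model IDs saved in the UI need namespacing before the completion call (the mapping below is illustrative, not the PR's actual table):

```python
# Sketch: prefix bare model IDs with their LiteLLM provider namespace.
PROVIDER_PREFIXES = {
    "openrouter": "openrouter/",
    "gemini": "gemini/",
    "groq": "groq/",
}

def to_litellm_model(provider: str, model_id: str) -> str:
    """Prefix a bare model ID unless it is already namespaced."""
    prefix = PROVIDER_PREFIXES.get(provider, "")
    if prefix and not model_id.startswith(prefix):
        return prefix + model_id
    return model_id

assert to_litellm_model("gemini", "gemini-2.5-pro") == "gemini/gemini-2.5-pro"
```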
…y scoring fixes

- Playground: display actual image previews in chat bubbles
- Playground: send multimodal content (image_url) for vision-capable models
- Playground: disable file upload with a tooltip for text-only models
- Playground: "Add model" navigates to the Models page and auto-opens the add modal
- Playground: Perplexity-style layout (composer centered when empty, pinned to the bottom when chatting)
- Playground: auto-focus the textarea on load, after responses, file upload, and dictation
- Playground: fix provider display names in the Add Model dropdown (OpenAI, OpenRouter, xAI, etc.)
- Breadcrumbs: add missing Playground and Reports entries; fix Models (Cpu) and Bias audits (Scale) icons
- Experiments: rename format to "model - DD/MM/YY"
- Bias/Toxicity: remove the incorrect score inversion; 0% = no bias/toxicity = green
- Bias/Toxicity: fix getScoreColor and the passed flag to treat lower-is-better metrics correctly

Co-authored-by: Cursor <cursoragent@cursor.com>
…a-ai-gateway-litellm
- Add .venv/, **/venv/, **/.venv/, env/, ENV/ to gitignore
- Add coverage/, **/coverage/, .nyc_output/, htmlcov/, .coverage
- Add common Python cruft (*.pyc, *.so, __pycache__, .pytest_cache)
- Add **/node_modules/, **/dist/, **/build/, *.tsbuildinfo, *.log
- Untrack 520 EvalServer/.venv files and 11 Servers/coverage files

Co-authored-by: Cursor <cursoragent@cursor.com>
Format all files flagged by CI Prettier checks:

- Clients: api services, GovernanceOS components, ProviderIcons, ScorersTable, ModelsTable, DatasetsTable, EvalsDashboard pages
- Servers: AI Gateway routes, deepEvalRoutes, evaluationLlmApiKey route/utils, and the migration file

Co-authored-by: Cursor <cursoragent@cursor.com>
…utes

Co-authored-by: Cursor <cursoragent@cursor.com>
- Update evaluationLlmApiKeysService tests: new /ai-gateway/keys endpoint, gateway payload shape (api_key, key_name), delete by numeric ID, hasKey via getAllKeys, verifyKey via /ai-gateway/keys/verify
- Update the deepEval.repository test: deleteLlmApiKey now takes provider + id

Co-authored-by: Cursor <cursoragent@cursor.com>
Add 25 missing strings to both the German and French dictionaries:

- Scorer UI: name/slug field descriptions, choice label, add choice, pass threshold tooltip, judge LLM description, use-case type label
- Models table: DATE ADDED, LAST RUN column headers
- Playground: Add model, Clear conversation, No saved models yet, Saved Models, AI Gateway is not running
- Provider names: Google, Groq, Meta, Microsoft, Moonshot AI, Nous Research, Perplexity, Qwen
- General: Project settings, An unexpected error occurred

Co-authored-by: Cursor <cursoragent@cursor.com>
Merge upstream/develop Governance OS strings alongside existing LLM Evals additions for both de and fr locales. Co-authored-by: Cursor <cursoragent@cursor.com>
MuhammadKhalilzadeh approved these changes on May 11, 2026
Describe your changes
Fixes #3868, #3869