[LLM] Build A/B testing framework for prompts and models #430
Merged
Conversation
@Menjay7 Great news! 🎉 Based on an automated assessment of this PR, the linked Wave issue(s) no longer count against your application limits. You can now already apply to more issues while waiting for a review of this PR. Keep up the great work! 🚀
Contributor
@Menjay7 please fix conflicts
Description
Summary
Introduces an A/B testing framework for evaluating different LLM prompts and model configurations. This framework enables controlled experiments, measures performance across predefined metrics, and supports data-driven decisions when optimizing prompts and model selection.
Changes Made
Added A/B testing infrastructure for prompt and model experiments.
Implemented experiment configuration with support for multiple variants.
Added traffic splitting and variant assignment logic.
Integrated experiment metadata into request processing.
Recorded experiment IDs, variants, and model information in logs.
Added configurable success metrics (latency, cost, quality, user feedback, etc.).
Implemented experiment result collection and aggregation.
Added feature flags to enable or disable experiments.
Added safeguards for fallback to the default prompt/model when experiments are disabled or fail.
Added unit and integration tests for experiment assignment and metrics collection.
Updated documentation with setup and usage instructions.
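The variant assignment, traffic splitting, and fallback items above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code from this PR: `Variant`, `DEFAULT`, and `assign_variant` are hypothetical names, and hashing the experiment and user IDs is one common way to get stable, weighted buckets.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str      # hypothetical variant label, e.g. "control"
    prompt: str    # prompt template used by this variant
    model: str     # model identifier used by this variant
    weight: float  # relative share of traffic


# Hypothetical default used when experiments are disabled or misconfigured.
DEFAULT = Variant("default", "You are a helpful assistant.", "base-model", 1.0)


def assign_variant(experiment_id, user_id, variants, enabled=True):
    """Return a stable variant for this user, or DEFAULT as a fallback."""
    if not enabled or not variants:
        return DEFAULT
    total = sum(v.weight for v in variants)
    if total <= 0:
        return DEFAULT
    # Hash experiment and user together so the bucket is stable across calls
    # and independent between experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0x100000000 * total  # value in [0, total)
    cumulative = 0.0
    for v in variants:
        cumulative += v.weight
        if point < cumulative:
            return v
    return variants[-1]  # guard against floating-point rounding
```

Because the bucket depends only on the two IDs, the same user keeps the same variant across requests without any stored state; the chosen variant's name and model can then be attached to the request log.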
Benefits
Enables safe rollout of prompt and model changes.
Supports objective comparison of prompt effectiveness.
Improves experimentation without impacting production stability.
Provides measurable insights for optimizing response quality, latency, and cost.
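To make the per-variant comparison concrete, here is a small sketch of result aggregation. The record layout (`variant`, `metrics`) and the metric names are assumptions for illustration, and the sample values are made up rather than taken from this PR.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records):
    """Group records by variant and average each numeric metric."""
    grouped = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for metric, value in rec["metrics"].items():
            grouped[rec["variant"]][metric].append(value)
    return {
        variant: {metric: mean(values) for metric, values in metrics.items()}
        for variant, metrics in grouped.items()
    }


# Illustrative sample records (hypothetical values).
records = [
    {"variant": "control", "metrics": {"latency_ms": 800, "cost_usd": 0.002}},
    {"variant": "control", "metrics": {"latency_ms": 1000, "cost_usd": 0.004}},
    {"variant": "treatment", "metrics": {"latency_ms": 600, "cost_usd": 0.003}},
]
summary = aggregate(records)
```

A plain mean is only a starting point; deciding between variants would normally also need sample sizes and some measure of variance.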
Testing
Verified deterministic variant assignment.
Tested traffic allocation across experiment groups.
Validated logging and metrics collection.
Confirmed fallback behavior when experiments are disabled.
Executed unit and integration tests successfully.
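The first two checks above (deterministic assignment and traffic allocation) can be sketched as a self-contained test. `bucket` and `check_assignment` are hypothetical helpers written for this example, and the 3% tolerance is an arbitrary assumption, not a threshold from the PR.

```python
import hashlib
from collections import Counter


def bucket(experiment_id, user_id, split=0.5):
    """Map a user to 'A' or 'B' from a stable hash; `split` is A's share."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "A" if fraction < split else "B"


def check_assignment(experiment_id="exp-1", users=10_000, split=0.5, tolerance=0.03):
    """Assert that assignment is repeatable and close to the configured split."""
    ids = [f"user-{i}" for i in range(users)]
    first = [bucket(experiment_id, u, split) for u in ids]
    second = [bucket(experiment_id, u, split) for u in ids]
    assert first == second, "assignment must be deterministic"
    share_a = Counter(first)["A"] / users
    assert abs(share_a - split) <= tolerance, f"A share {share_a:.3f} outside tolerance"
    return share_a
```

Calling `check_assignment()` returns the observed share of group A, which is convenient to log alongside the test result.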
Checklist
A/B testing framework implemented
Variant assignment logic added
Metrics collection integrated
Feature flag support included
Fallback mechanism implemented
Tests added and passing
Documentation updated
Closes #400