POC: Derrickp/implementing baseline evals #277
Draft
derrickpersson wants to merge 6 commits into
Pull request summary created by Squire AI
Summary
This pull request introduces a baseline evaluation framework for the project, focusing on integrating the 'promptfoo' tool for evaluation purposes. Key additions include scripts for dataset generation, evaluation configuration, and execution, alongside necessary dependencies like pandas. Documentation has been updated with instructions for running evaluations, and configuration files have been added to facilitate consistent evaluation workflows.
File Summary
File Changes:
README.md: Added instructions for running evaluations, including installation of promptfoo, dataset generation, and evaluation execution steps.
poetry.lock: Updated content hash reflecting changes in dependencies or configurations.
pyproject.toml: Added pandas as a dependency and introduced new scripts for dataset creation, configuration generation, and evaluation execution.
New Files:
__init__.py: Added an empty __init__.py file to the evals directory.
constants.py: Introduced constants for dataset creation and JSON schema definitions for response formats.
create_draft_dataset.py: Added script to generate draft datasets using OpenAI API and save results to CSV.
create_eval_config.py: Created evaluation config script for promptfoo with multiple tests and configurations.
pfoo_eval.yaml: Added a YAML configuration file for promptfoo evaluation with detailed prompts, providers, and tests.
reviewed_outputs_n=30.csv: Added a CSV file with reviewed output cases for evaluation purposes.
identify_violations_info.py: Added script containing code standards and examples for identifying code violations.
run_eval.py: Added a script to run evaluation using the promptfoo tool with a specified config file.
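The config-generation and run steps summarized above can be sketched roughly as follows. This is a minimal illustration and not the code in this pull request: the function names, the prompt text, the provider id, and the output file name are all hypothetical. It assumes promptfoo accepts a config with `prompts`, `providers`, and `tests` keys and is invoked as `promptfoo eval -c <config>`. The config is written as JSON here only to avoid a YAML dependency in the sketch.

```python
import json
import shutil
import subprocess
from pathlib import Path


def build_config(cases):
    """Build a promptfoo-style config dict from (code, expected_substring) pairs.

    The prompt wording and provider id are placeholders, not taken from the PR.
    """
    return {
        "prompts": ["Identify code standard violations in:\n{{code}}"],
        "providers": ["openai:gpt-4o-mini"],
        "tests": [
            {
                "vars": {"code": code},
                "assert": [{"type": "contains", "value": expected}],
            }
            for code, expected in cases
        ],
    }


def write_config(config, path):
    """Serialize the config as JSON (hypothetical file name chosen by the caller)."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


def build_eval_command(config_path):
    """Return the argv list for a promptfoo run against the given config file."""
    return ["promptfoo", "eval", "-c", str(config_path)]


def run_eval(config_path):
    """Run promptfoo if it is installed and return its exit code."""
    if shutil.which("promptfoo") is None:
        raise RuntimeError("promptfoo was not found on PATH")
    return subprocess.run(build_eval_command(config_path), check=False).returncode
```

Keeping command construction separate from execution lets the command be checked without promptfoo installed; `run_eval` is the only function that touches the external tool.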
Why Evals?
In this Proof of Concept:
yaml file
promptfoo tool
Learnings from this Proof of Concept:
Next Steps
demo-pyrepo)
IdentifyViolationsNode & ChangeAnalysisOracleNode)
IdentifyViolationsNode
Identify the exact rules we want to run evals for (ideally all of them, but we should start with 1-3)