POC: Derrickp/implementing baseline evals by derrickpersson · Pull Request #277 · SquireAI/squire-core

derrickpersson · 2024-10-29T22:47:07Z

Why Evals?

Evals are critical to building successful AI products.
Evals help you identify common LLM issues, e.g. hallucination.
Evals help you make decisions about your product faster. They give you a signal that should help steer your product in the right direction.

In this Proof of Concept:

Adopted OpenAI's script, which:
- generates a test dataset for ONE "framework" + the Identify Violations Node.
- creates an 'eval config' which provides the logic and mechanisms for how the results are evaluated -> outputs a yaml file
- runs the 'eval config' yaml file with the promptfoo tool

(Failures here are from hallucinations; the LLM-as-judge check passes for all scenarios.)
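The three steps above can be sketched roughly as follows. This is a minimal illustration, not the PoC's actual scripts: the CSV columns, the prompt text, the rule, and the provider id are all assumptions, and the config is serialized as JSON (which is also valid YAML) only to keep the sketch in the standard library. The `prompts` / `providers` / `tests` layout and the `contains` and `llm-rubric` assertion types follow promptfoo's config format.

```python
import csv
import io
import json

# Hypothetical reviewed dataset: one row per synthetic test case.
REVIEWED_CSV = """diff,expected_label
"def f(x): return eval(x)",VIOLATION
"def add(a, b): return a + b",CLEAN
"""

# Hypothetical prompt; the real one lives in the node being evaluated.
PROMPT = (
    "Rule: {{rule}}\n"
    "Diff:\n{{diff}}\n"
    "Answer VIOLATION or CLEAN, then explain."
)


def build_eval_config(csv_text: str, rule: str) -> dict:
    """Turn reviewed CSV rows into a promptfoo-style config dict."""
    tests = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        tests.append({
            "vars": {"rule": rule, "diff": row["diff"]},
            "assert": [
                # Cheap deterministic check on the expected label.
                {"type": "contains", "value": row["expected_label"]},
                # LLM-as-judge check, graded against a rubric.
                {
                    "type": "llm-rubric",
                    "value": "The explanation cites only code present in the diff.",
                },
            ],
        })
    return {
        "description": "Baseline eval sketch",
        "prompts": [PROMPT],
        "providers": ["openai:gpt-4o-mini"],  # assumed provider id
        "tests": tests,
    }


config = build_eval_config(REVIEWED_CSV, rule="Do not call eval() on untrusted input")
config_text = json.dumps(config, indent=2)
```

Writing `config_text` to a file and handing that file to promptfoo would correspond to the third step.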

Learnings from this Proof of Concept:

We have many different variables to optimize across; without a systematic approach it is going to be difficult to move our product forward in a productive manner.
- Rule prompt, agent prompt, the way we input data (i.e. how the file / diff is formatted), etc.
Evals which pass 100% on the first try are NOT useful evals. The goal is to create progressively harder evaluation cases to test your agent. Writing these harder cases is difficult: this PoC does not generate evals which our agent fails. Perhaps with more finessing of the prompting we could create such test cases using O1, but I could not get it to generate failing test cases.
Setting up evals for all these different scenarios will take quite a bit of effort, but the end result will give us a solid understanding of how our agents are working, where they might fall down & how we can get them to be more reliable.
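One way to act on the second learning is to track per-case pass rates and flag cases the agent always passes, since those add no signal. A minimal sketch, assuming results are available as hypothetical `(case_id, passed)` pairs (this is not promptfoo's output format):

```python
from collections import defaultdict

# Hypothetical results: (case_id, passed) for each run of each test case.
results = [
    ("case-1", True), ("case-1", True), ("case-1", True),
    ("case-2", True), ("case-2", False), ("case-2", True),
    ("case-3", False), ("case-3", False), ("case-3", False),
]


def pass_rates(rows):
    """Return the fraction of passing runs for each case id."""
    totals = defaultdict(lambda: [0, 0])  # case_id -> [passes, runs]
    for case_id, passed in rows:
        totals[case_id][0] += int(passed)
        totals[case_id][1] += 1
    return {case_id: p / n for case_id, (p, n) in totals.items()}


rates = pass_rates(results)
# Cases that always pass are candidates to replace with harder ones.
saturated = sorted(cid for cid, rate in rates.items() if rate == 1.0)
# Cases that fail at least sometimes still tell us something.
informative = sorted(cid for cid, rate in rates.items() if rate < 1.0)
```

If `saturated` ends up containing every case, as it did for the LLM-as-judge checks in this PoC, the dataset needs harder cases before it can guide prompt changes.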

Next Steps

Play with these evals yourselves -> try generating a dataset for another rule/framework!
Data mining -> we should extract some use cases from Production (provided we have permission from our customers) where our bot has failed; having even a small sample set of production data would help us generate more realistic synthetic data via O1 for our evals. (I used the demo-py repo as the starting point.)
Identify the top agents we should evaluate (my vote would be the IdentifyViolationsNode & ChangeAnalysisOracleNode)
- For the IdentifyViolationsNode, identify the exact rules we want to run evals for (ideally all of them, but we should start with 1-3)
The implementation for evals should have better integration with the codebase -> pull prompts directly from the nodes themselves vs. copy-pasting them into the evals section
Watch the OpenAI Webinar -> https://vimeo.com/1023317525/be082a1029?share=copy
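As a rough illustration of the "pull prompts directly from the nodes" step, the eval code could import each node and read its prompt rather than keeping a copy. The class below is a stand-in: the `PROMPT_TEMPLATE` attribute and the prompt text are hypothetical, and the real nodes may store their prompts differently.

```python
class IdentifyViolationsNode:
    """Stand-in for the real node; only the prompt attribute matters here."""

    # Hypothetical attribute name and text.
    PROMPT_TEMPLATE = "Rule: {{rule}}\nDiff:\n{{diff}}\nList any violations."


def prompts_for_eval(nodes):
    """Collect prompts from node classes so the eval config cannot drift."""
    return {node.__name__: node.PROMPT_TEMPLATE for node in nodes}


eval_prompts = prompts_for_eval([IdentifyViolationsNode])
```

With this shape, a prompt change in a node is picked up by the next eval run without touching the evals directory.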

derricks-squire

Pull request summary created by Squire AI

Summary

This pull request introduces a baseline evaluation framework for the project, focusing on integrating the 'promptfoo' tool for evaluation purposes. Key additions include scripts for dataset generation, evaluation configuration, and execution, alongside necessary dependencies like pandas. Documentation has been updated with instructions for running evaluations, and configuration files have been added to facilitate consistent evaluation workflows.

3accb2e...979eb94

File Summary

File Changes:

README.md: Added instructions for running evaluations, including installation of promptfoo, dataset generation, and evaluation execution steps.
poetry.lock: Updated content hash reflecting changes in dependencies or configurations.
pyproject.toml: Added pandas as a dependency and introduced new scripts for dataset creation, configuration generation, and evaluation execution.

New Files:

__init__.py: Added an empty __init__.py file to the evals directory.
constants.py: Introduced constants for dataset creation and JSON schema definitions for response formats.
create_draft_dataset.py: Added script to generate draft datasets using OpenAI API and save results to CSV.
create_eval_config.py: Created evaluation config script for promptfoo with multiple tests and configurations.
pfoo_eval.yaml: Added a YAML configuration file for promptfoo evaluation with detailed prompts, providers, and tests.
reviewed_outputs_n=30.csv: Added a CSV file with reviewed output cases for evaluation purposes.
identify_violations_info.py: Added script containing code standards and examples for identifying code violations.
run_eval.py: Added a script to run evaluation using the promptfoo tool with a specified config file.
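For readers unfamiliar with promptfoo, a runner like the one described for run_eval.py can be as small as a subprocess call. This is a sketch under assumptions, not the file's actual contents: the function names are hypothetical, the config path is assumed to be relative to the working directory, and `promptfoo eval -c <config>` is the CLI invocation being wrapped.

```python
import shutil
import subprocess


def build_command(config_path: str) -> list[str]:
    """Build the promptfoo CLI invocation for a given config file."""
    return ["promptfoo", "eval", "-c", config_path]


def run_eval(config_path: str) -> int:
    """Run the eval and return promptfoo's exit code."""
    if shutil.which("promptfoo") is None:
        raise RuntimeError("promptfoo is not on PATH; install it first")
    return subprocess.run(build_command(config_path), check=False).returncode


# Assumed location of the generated config.
command = build_command("pfoo_eval.yaml")
```

Calling `run_eval("pfoo_eval.yaml")` would then execute the eval, provided promptfoo is installed.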

derrickpersson added 5 commits October 28, 2024 12:58

WIP - adding baseline evals

979694f

Evals running, still failing

4b75f20

Running against larger dataset

577e3f6

Reworking to run with poetry

a672817

Running for larger dataset

979eb94

derrickpersson added the Do Not Merge label Oct 29, 2024

derricks-squire Bot reviewed Oct 29, 2024

Merge branch 'development' into derrickp/implementing-baseline-evals

5877754

derrickpersson wants to merge 6 commits into development from derrickp/implementing-baseline-evals
