POC: Derrickp/implementing baseline evals #277

Draft
derrickpersson wants to merge 6 commits into development from derrickp/implementing-baseline-evals

@derrickpersson

Why Evals?

  • Evals are critical to building successful AI products.
  • Evals help you identify common LLM issues, e.g. hallucination.
  • Evals help you make product decisions faster. They give you a signal that points your product improvements in the right direction.

In this Proof of Concept:

  • Adopted OpenAI's script, which:
    • generates a test dataset for ONE "framework" + the Identify Violations Node.
    • creates an 'eval config', which defines the logic and mechanisms for evaluating the results -> outputs a YAML file
    • runs the 'eval config' YAML file with the promptfoo tool
[Screenshot 2024-10-29 at 3:42:42 PM] (Failures here are from hallucinations; it correctly passes the LLM-as-judge check for all scenarios.)
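
As a rough illustration of the 'eval config' step above, here is a minimal sketch that turns reviewed dataset rows into a promptfoo-style config. It is not the script from this PR: the prompt text, the CSV column names (`rule`, `diff`, `expected`), and the provider id are all assumptions made for the example. It prints JSON rather than YAML only to stay dependency-free.

```python
import csv
import io
import json

# Hypothetical prompt; the real one belongs to the Identify Violations Node.
PROMPT = "Review this diff against the rule.\nRule: {{rule}}\nDiff: {{diff}}"


def build_eval_config(csv_text: str, provider: str = "openai:gpt-4o-mini") -> dict:
    """Turn reviewed dataset rows into a promptfoo-style config dict.

    The column names (rule, diff, expected) are assumed for this sketch.
    """
    tests = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        tests.append(
            {
                # Values substituted into the {{rule}} / {{diff}} placeholders.
                "vars": {"rule": row["rule"], "diff": row["diff"]},
                "assert": [
                    # Deterministic check: the output must parse as JSON.
                    {"type": "is-json"},
                    # LLM-as-judge check against the reviewed expectation.
                    {
                        "type": "llm-rubric",
                        "value": f"The answer should report: {row['expected']}",
                    },
                ],
            }
        )
    return {"prompts": [PROMPT], "providers": [provider], "tests": tests}


sample = "rule,diff,expected\nNo print statements,+print('debug'),one violation\n"
config = build_eval_config(sample)
print(json.dumps(config, indent=2))
```

Pairing a deterministic assertion with an LLM-as-judge rubric keeps format failures separate from judgment failures, which makes results like the ones in the screenshot easier to read.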

Learnings from this Proof of Concept:

  • We have many different variables to optimize across; without a systematic approach, it is going to be difficult to move our product forward productively.
    • Rule prompt, agent prompt, the way we input data (i.e. how the file / diff is formatted), etc.
  • Evals which pass 100% on the first try are NOT useful evals. The goal is to create progressively harder evaluation cases to test your agent. Writing these harder cases is itself hard: this PoC does not generate evals which our agent fails. Perhaps with more finessing of the prompting we could create such test cases using O1, but I could not get it to generate failing test cases.
  • Setting up evals for all these different scenarios will take quite a bit of effort, but the end result will give us a solid understanding of how our agents are working, where they might fall down, and how we can make them more reliable.
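
One way to make the "100% pass is not useful" point actionable is to flag cases the agent never fails, so they can be replaced with harder ones. The sketch below assumes a hypothetical result shape (one record per case and run, with `case_id` and `passed` fields); it is not the output format of promptfoo or of this PR.

```python
def summarize(results: list[dict]) -> dict:
    """Compute per-case pass rates and list the cases that always pass."""
    by_case: dict[str, list[bool]] = {}
    for record in results:
        by_case.setdefault(record["case_id"], []).append(record["passed"])

    pass_rates = {cid: sum(runs) / len(runs) for cid, runs in by_case.items()}
    # A case that passes on every run gives no signal about regressions.
    saturated = sorted(cid for cid, rate in pass_rates.items() if rate == 1.0)
    return {
        "pass_rates": pass_rates,
        "saturated": saturated,
        "all_saturated": len(saturated) == len(pass_rates),
    }


# Hypothetical results: case "a" always passes, case "b" fails half the time.
runs = [
    {"case_id": "a", "passed": True},
    {"case_id": "a", "passed": True},
    {"case_id": "b", "passed": True},
    {"case_id": "b", "passed": False},
]
summary = summarize(runs)
print(summary["saturated"])
```

If `all_saturated` is true, the dataset is in the state this PoC ended up in: every case passes, so it cannot yet tell a good prompt change from a bad one.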

Next Steps

  • Play with these evals yourselves -> try generating a dataset for another rule/framework!
  • Data mining -> we should extract some cases from Production (provided we have permission from our customers) where our bot has failed; even a small sample of production data would help us generate more realistic synthetic data via O1 for our evals. (I used the demo-py repo as the starting point.)
  • Identify the top agents we should evaluate (my vote would be the IdentifyViolationsNode & ChangeAnalysisOracleNode)
    • For the IdentifyViolationsNode, identify the exact rules we want to run evals for (ideally all of them, but we should start with 1-3)
  • The evals implementation should integrate better with the codebase -> pull prompts directly from the nodes themselves vs. copy-pasting them into the evals section
  • Watch the OpenAI Webinar -> https://vimeo.com/1023317525/be082a1029?share=copy
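
The "pull prompts directly from the nodes" step could look roughly like the sketch below. The node classes here are stand-ins, and the `PROMPT_TEMPLATE` attribute is a hypothetical name; how the real IdentifyViolationsNode and ChangeAnalysisOracleNode expose their prompts is not shown in this PR.

```python
class IdentifyViolationsNode:
    """Stand-in for the real node; only the prompt attribute matters here."""

    # Hypothetical attribute name and prompt text.
    PROMPT_TEMPLATE = "Identify violations of {{rule}} in {{diff}}."


class ChangeAnalysisOracleNode:
    """Stand-in for the real node."""

    PROMPT_TEMPLATE = "Summarize the intent of {{diff}}."


def prompts_for_eval(nodes: list[type]) -> dict[str, str]:
    """Collect each node's prompt so the evals never hold a pasted copy."""
    return {node.__name__: node.PROMPT_TEMPLATE for node in nodes}


prompts = prompts_for_eval([IdentifyViolationsNode, ChangeAnalysisOracleNode])
print(prompts["IdentifyViolationsNode"])
```

With a single source of truth, a prompt change in a node is picked up by the next eval run automatically, so the evals cannot drift from what ships.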

@derricks-squire (bot) left a comment

Pull request summary created by Squire AI

Summary

This pull request introduces a baseline evaluation framework for the project, focusing on integrating the 'promptfoo' tool for evaluation purposes. Key additions include scripts for dataset generation, evaluation configuration, and execution, alongside necessary dependencies like pandas. Documentation has been updated with instructions for running evaluations, and configuration files have been added to facilitate consistent evaluation workflows.

3accb2e...979eb94

File Summary

File Changes:

  • README.md: Added instructions for running evaluations, including installation of promptfoo, dataset generation, and evaluation execution steps.
  • poetry.lock: Updated content hash reflecting changes in dependencies or configurations.
  • pyproject.toml: Added pandas as a dependency and introduced new scripts for dataset creation, configuration generation, and evaluation execution.

New Files:

  • __init__.py: Added an empty __init__.py file to the evals directory.
  • constants.py: Introduced constants for dataset creation and JSON schema definitions for response formats.
  • create_draft_dataset.py: Added script to generate draft datasets using OpenAI API and save results to CSV.
  • create_eval_config.py: Created evaluation config script for promptfoo with multiple tests and configurations.
  • pfoo_eval.yaml: Added a YAML configuration file for promptfoo evaluation with detailed prompts, providers, and tests.
  • reviewed_outputs_n=30.csv: Added a CSV file with reviewed output cases for evaluation purposes.
  • identify_violations_info.py: Added script containing code standards and examples for identifying code violations.
  • run_eval.py: Added a script to run evaluation using the promptfoo tool with a specified config file.
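
For orientation, a runner like run_eval.py can be as small as the sketch below, which shells out to the promptfoo CLI with `promptfoo eval -c <config>`. This is an assumption about the general shape, not the script from this PR; the config path in the example is hypothetical.

```python
import shutil
import subprocess


def build_command(config_path: str) -> list[str]:
    """Build the promptfoo invocation for a given config file."""
    return ["promptfoo", "eval", "-c", config_path]


def run_eval(config_path: str) -> int:
    """Run the eval and return the CLI's exit code."""
    cmd = build_command(config_path)
    if shutil.which(cmd[0]) is None:
        # Fail early with a clear message if the CLI is not installed.
        raise FileNotFoundError("promptfoo CLI not found on PATH")
    return subprocess.run(cmd, check=False).returncode


# Hypothetical path; only the command is printed here, nothing is executed.
command = build_command("evals/pfoo_eval.yaml")
print(" ".join(command))
```

Returning the exit code instead of raising lets a caller such as a CI job decide how to treat a failed eval run.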
