🧪 Add tests and fix bugs for evaluate_gsm8k numeric parser#5
🧪 Add tests and fix bugs for evaluate_gsm8k numeric parser#5dhanush342 wants to merge 1 commit intomainfrom
Conversation
Co-authored-by: dhanush342 <187305764+dhanush342@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
There was a problem hiding this comment.
Pull request overview
Adds regression tests around evaluate_gsm8k.parse_numeric and adjusts its regexes to correctly parse additional GSM8K-style numeric formats (negative fractions and leading-dot decimals).
Changes:
- Extend fraction parsing to accept negative numerators (e.g.,
-1/2). - Extend decimal parsing to accept leading-dot decimals (e.g.,
.5,-.5). - Add a new test suite covering valid/invalid numeric strings and extraction from mixed text.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
evaluate_gsm8k.py |
Updates regex-based numeric extraction to handle additional numeric formats. |
test_evaluate_gsm8k.py |
Adds tests for parse_numeric, with module mocking to avoid importing heavy deps. |
Comments suppressed due to low confidence (1)
evaluate_gsm8k.py:25
parse_numericprefers the last numeric token, but the fraction branch usesre.search(...)which returns the first fraction in the string. This can yield inconsistent results for outputs containing multiple fractions (e.g., explanations before the final answer). Consider collecting all fraction matches (e.g., viare.findall/re.finditer) and using the last match to align with the “prefer last token” heuristic.
frac_match = re.search(r"(-?\d+)/(\d+)", s)
if frac_match:
try:
return float(Fraction(int(frac_match.group(1)), int(frac_match.group(2))))
except Exception:
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| from unittest.mock import MagicMock | ||
|
|
||
| # Mock out heavy dependencies that might be missing in sandbox | ||
| sys.modules['datasets'] = MagicMock() | ||
| sys.modules['inference'] = MagicMock() | ||
|
|
||
| import pytest | ||
| from evaluate_gsm8k import parse_numeric |
| sys.modules['datasets'] = MagicMock() | ||
| sys.modules['inference'] = MagicMock() | ||
|
|
||
| import pytest |
🎯 What: The testing gap in
parse_numericwithinevaluate_gsm8k.pyis now addressed with a robust test suite. Additionally, two silent parsing bugs were identified and fixed (support for negative fractions like-1/2and decimals without leading zeros like.5).📊 Coverage: Scenarios covered include:
None, invalid types, and division by zero.✨ Result: Test coverage for
parse_numericis now significantly improved, and the function correctly handles a wider range of numeric formats without throwing exceptions. The testing leveragesunittest.mockforsys.modulesto decouple the testing script from heavy ML dependencies.PR created automatically by Jules for task 1898371315961863988 started by @dhanush342