🧪 Add tests and fix bugs for evaluate_gsm8k numeric parser #5

Open
dhanush342 wants to merge 1 commit into main from jules-test-evaluate-gsm8k-1898371315961863988

Conversation

@dhanush342
Owner

🎯 What: parse_numeric in evaluate_gsm8k.py previously had no tests; this PR adds a test suite for it. Two silent parsing bugs were also identified and fixed: negative fractions such as -1/2 and decimals without a leading zero such as .5 are now supported.
📊 Coverage: Scenarios covered include:

  • Valid positive/negative integers, decimals, and fractions.
  • Empty strings, None, invalid types, and division by zero.
  • Extracting numbers from mixed text strings.
  • Falling back appropriately to the last numeric token.

Result: Test coverage for parse_numeric is now significantly improved, and the function handles a wider range of numeric formats without raising exceptions. The tests use unittest.mock to stub entries in sys.modules, which decouples the test script from heavy ML dependencies.
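
The sys.modules stubbing mentioned above can be scoped so the stubs do not leak into other tests. The following is a minimal sketch, not the code in this PR: the helper name `import_with_stubs` is hypothetical, and it assumes the module under test only needs its heavy dependencies to be importable, not functional.

```python
import importlib
import sys
from unittest.mock import MagicMock, patch


def import_with_stubs(module_name, stub_names):
    """Import module_name while each name in stub_names resolves to a MagicMock.

    Hypothetical helper for illustration. patch.dict restores sys.modules when
    the block exits, so the stubs (and anything first imported inside the
    block) are dropped from the module cache. The returned module object
    stays usable because the caller holds a reference to it.
    """
    stubs = {name: MagicMock() for name in stub_names}
    with patch.dict(sys.modules, stubs):
        return importlib.import_module(module_name)
```

A test file could then call `import_with_stubs("evaluate_gsm8k", ["datasets", "inference"])` and read `parse_numeric` from the returned module, instead of assigning to sys.modules at import time. One trade-off: because the module is not left in the cache, a later plain `import evaluate_gsm8k` would attempt a fresh import without the stubs.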

PR created automatically by Jules for task 1898371315961863988 started by @dhanush342

Co-authored-by: dhanush342 <187305764+dhanush342@users.noreply.github.com>
@google-labs-jules

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

Copilot AI review requested due to automatic review settings March 18, 2026 00:10

Copilot AI left a comment

Pull request overview

Adds regression tests around evaluate_gsm8k.parse_numeric and adjusts its regexes to correctly parse additional GSM8K-style numeric formats (negative fractions and leading-dot decimals).

Changes:

  • Extend fraction parsing to accept negative numerators (e.g., -1/2).
  • Extend decimal parsing to accept leading-dot decimals (e.g., .5, -.5).
  • Add a new test suite covering valid/invalid numeric strings and extraction from mixed text.
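
As a rough illustration of the two format changes listed above: the fraction pattern below is the one quoted in this review, while `DECIMAL_RE` and `find_decimals` are assumptions made for this sketch and may differ from what evaluate_gsm8k.py actually uses.

```python
import re

# Pattern quoted in the review: the numerator may carry an optional minus sign.
FRACTION_RE = re.compile(r"(-?\d+)/(\d+)")

# Assumed pattern (not taken from the PR): accepts "12", "12.5", ".5" and "-.5".
DECIMAL_RE = re.compile(r"-?(?:\d+\.\d+|\.\d+|\d+)")


def find_decimals(text):
    """Return every integer or decimal token in text as a float, in order."""
    return [float(token) for token in DECIMAL_RE.findall(text)]


print(FRACTION_RE.search("-1/2").groups())
print(find_decimals("x = .5 and y = -.5, total 12.5"))
```

Note that a pattern like this treats any hyphen directly before a digit or dot as a sign, so "5-3" would yield 5 and -3. Whether that matters depends on the model outputs being scored.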

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| evaluate_gsm8k.py | Updates regex-based numeric extraction to handle additional numeric formats. |
| test_evaluate_gsm8k.py | Adds tests for parse_numeric, with module mocking to avoid importing heavy deps. |

Comments suppressed due to low confidence (1)

evaluate_gsm8k.py:25

  • parse_numeric prefers the last numeric token, but the fraction branch uses re.search(...) which returns the first fraction in the string. This can yield inconsistent results for outputs containing multiple fractions (e.g., explanations before the final answer). Consider collecting all fraction matches (e.g., via re.findall/re.finditer) and using the last match to align with the “prefer last token” heuristic.
    frac_match = re.search(r"(-?\d+)/(\d+)", s)
    if frac_match:
        try:
            return float(Fraction(int(frac_match.group(1)), int(frac_match.group(2))))
        except Exception:
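
The suggestion in this suppressed comment could look roughly like the sketch below. It assumes the same fraction pattern as the quoted snippet; the function name `last_fraction` is hypothetical, and returning None on a zero denominator is a choice made for this example, not necessarily what parse_numeric does.

```python
import re
from fractions import Fraction

FRACTION_RE = re.compile(r"(-?\d+)/(\d+)")


def last_fraction(s):
    """Return the last fraction found in s as a float, or None if unusable."""
    matches = list(FRACTION_RE.finditer(s))
    if not matches:
        return None
    # Use the final match so the result follows the "prefer last token" heuristic.
    numerator, denominator = matches[-1].groups()
    try:
        return float(Fraction(int(numerator), int(denominator)))
    except ZeroDivisionError:
        # A zero denominator in the last match is treated as "no answer" here.
        return None
```

For an output such as "First 1/2, then 3/4", this returns 0.75, whereas re.search would pick up 1/2. An alternative design would walk the matches backwards and skip over a zero denominator instead of giving up.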

Comment thread test_evaluate_gsm8k.py
Comment on lines +2 to +9
from unittest.mock import MagicMock

# Mock out heavy dependencies that might be missing in sandbox
sys.modules['datasets'] = MagicMock()
sys.modules['inference'] = MagicMock()

import pytest
from evaluate_gsm8k import parse_numeric
Comment thread test_evaluate_gsm8k.py
sys.modules['datasets'] = MagicMock()
sys.modules['inference'] = MagicMock()

import pytest