🧪 Add tests and fix bugs for evaluate_gsm8k numeric parser by dhanush342 · Pull Request #5 · dhanush342/OpenMath

dhanush342 · 2026-03-18T00:10:40Z

🎯 What: Closes the testing gap around parse_numeric in evaluate_gsm8k.py by adding a dedicated test suite. Two silent parsing bugs were also identified and fixed: negative fractions such as -1/2 and decimals without a leading zero such as .5 are now supported.
📊 Coverage: Scenarios covered include:

Valid positive/negative integers, decimals, and fractions.
Empty strings, None, invalid types, and division by zero.
Extracting numbers from mixed text strings.
Falling back appropriately to the last numeric token.
✨ Result: parse_numeric is now tested against the scenarios listed above and handles a wider range of numeric formats without raising exceptions. The tests use unittest.mock to stub entries in sys.modules, which decouples the test script from heavy ML dependencies.
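
The sys.modules stubbing described in the result can be sketched as follows. This is a minimal illustration, not the PR's test file: the module name `heavy_ml_dependency_stub_example` is hypothetical and stands in for a package such as `datasets` that may be missing in a sandbox.

```python
import importlib
import sys
from unittest.mock import MagicMock

# Hypothetical module name standing in for a heavy dependency.
HEAVY = "heavy_ml_dependency_stub_example"

# Register the stub before anything tries to import the real package.
sys.modules[HEAVY] = MagicMock()

# The import machinery finds the entry in sys.modules and returns the stub,
# so no installation of the real package is needed.
stub = importlib.import_module(HEAVY)

# Attribute access and calls on the stub succeed and return further mocks.
dataset = stub.load_dataset("gsm8k", "main")
is_stubbed = isinstance(stub, MagicMock)
```

The stub has to be registered before the module under test is imported, which is why the mocking lines sit above the remaining imports in the test file.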

PR created automatically by Jules for task 1898371315961863988 started by @dhanush342

Co-authored-by: dhanush342 <187305764+dhanush342@users.noreply.github.com>

google-labs-jules · 2026-03-18T00:10:41Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

Copilot

Pull request overview

Adds regression tests around evaluate_gsm8k.parse_numeric and adjusts its regexes to correctly parse additional GSM8K-style numeric formats (negative fractions and leading-dot decimals).

Changes:

Extend fraction parsing to accept negative numerators (e.g., -1/2).
Extend decimal parsing to accept leading-dot decimals (e.g., .5, -.5).
Add a new test suite covering valid/invalid numeric strings and extraction from mixed text.
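
A rough sketch of how the two format extensions might look. Only the fraction regex `(-?\d+)/(\d+)` appears in this PR's excerpt; the number pattern, the name `parse_numeric_sketch`, and the choice to return None on division by zero are assumptions made for illustration and may differ from evaluate_gsm8k.py.

```python
import re
from fractions import Fraction

# Fraction pattern as quoted in the review excerpt: optional minus on the numerator.
FRACTION_RE = re.compile(r"(-?\d+)/(\d+)")
# Assumed number pattern: digits with an optional fractional part,
# or a bare leading-dot decimal such as .5 or -.5.
NUMBER_RE = re.compile(r"-?(?:\d+(?:\.\d+)?|\.\d+)")

def parse_numeric_sketch(text):
    """Return a float parsed from text, or None if nothing usable is found."""
    if not isinstance(text, str) or not text.strip():
        return None
    frac = FRACTION_RE.search(text)
    if frac:
        try:
            return float(Fraction(int(frac.group(1)), int(frac.group(2))))
        except ZeroDivisionError:
            # Assumption: a zero denominator yields None rather than an error.
            return None
    # Fall back to the last numeric token in the string.
    tokens = NUMBER_RE.findall(text)
    return float(tokens[-1]) if tokens else None
```

For instance, `parse_numeric_sketch("-1/2")` and `parse_numeric_sketch("-.5")` both evaluate to -0.5 under these assumed patterns.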

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `evaluate_gsm8k.py` | Updates regex-based numeric extraction to handle additional numeric formats. |
| `test_evaluate_gsm8k.py` | Adds tests for `parse_numeric`, with module mocking to avoid importing heavy deps. |

Comments suppressed due to low confidence (1)

evaluate_gsm8k.py:25

parse_numeric prefers the last numeric token, but the fraction branch uses re.search(...) which returns the first fraction in the string. This can yield inconsistent results for outputs containing multiple fractions (e.g., explanations before the final answer). Consider collecting all fraction matches (e.g., via re.findall/re.finditer) and using the last match to align with the “prefer last token” heuristic.

    frac_match = re.search(r"(-?\d+)/(\d+)", s)
    if frac_match:
        try:
            return float(Fraction(int(frac_match.group(1)), int(frac_match.group(2))))
        except Exception:
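
The suggestion to prefer the last fraction could be sketched like this. The helper name `last_fraction` is hypothetical, and returning None on a zero denominator is an assumption, since the excerpt above is cut off before the except body.

```python
import re
from fractions import Fraction

# Same fraction pattern as in the excerpt above.
FRACTION_RE = re.compile(r"(-?\d+)/(\d+)")

def last_fraction(text):
    """Return the last fraction in text as a float, or None if there is none."""
    matches = list(FRACTION_RE.finditer(text))
    if not matches:
        return None
    last = matches[-1]  # align with the "prefer last token" heuristic
    try:
        return float(Fraction(int(last.group(1)), int(last.group(2))))
    except ZeroDivisionError:
        # Assumption: treat a zero denominator as unparseable.
        return None
```

With `re.search`, the input "Half is 1/2, so the answer is 3/4" would resolve to 0.5; taking the last `finditer` match resolves it to 0.75 instead.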


+from unittest.mock import MagicMock
+
+# Mock out heavy dependencies that might be missing in sandbox
+sys.modules['datasets'] = MagicMock()
+sys.modules['inference'] = MagicMock()
+
+import pytest
+from evaluate_gsm8k import parse_numeric



🧪 Add test coverage and fix parsing bugs in parse_numeric

a35f466

Co-authored-by: dhanush342 <187305764+dhanush342@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 18, 2026 00:10

Copilot started reviewing on behalf of dhanush342 March 18, 2026 00:11

Copilot AI reviewed Mar 18, 2026

dhanush342 wants to merge 1 commit into main from jules-test-evaluate-gsm8k-1898371315961863988


2 participants
