Skip to content

fix(eval): add bin cap and fall back#533

Open
nina-xu wants to merge 3 commits into
mainfrom
ninaxu/fix-cap-num-bins-in-plot
Open

fix(eval): add bin cap and fall back#533
nina-xu wants to merge 3 commits into
mainfrom
ninaxu/fix-cap-num-bins-in-plot

Conversation

@nina-xu
Copy link
Copy Markdown
Contributor

@nina-xu nina-xu commented May 27, 2026

Summary

I ran into this issue where the report rendering failed because the text PCA graph failed, due to the auto bin function calculating 1B+ bins lol.

Log:

{"level": "info", "lineno": 90, "filename": "evaluator.py", "qual_name": "Evaluator.evaluate", "_category_display": "runtime", "timestamp": "2026-05-27T07:27:45.742", "message": "Evaluation complete."}
{"exc_info": true, "level": "error", "lineno": 110, "filename": "multimodal_report.py", "qual_name": "MultimodalReport.jinja_context", "_category_display": "runtime", "timestamp": "2026-05-27T07:27:49.653", "message": "Failed to get jinja context"}
{"exc_info": true, "level": "error", "lineno": 109, "filename": "render.py", "qual_name": "render_report", "_category_display": "runtime", "timestamp": "2026-05-27T07:27:54.229", "message": "Failed to render report."}

Steps to reproduce:
safe-synthesizer run --config script/slurm/configs/smollm3-nodp.yaml --data-source /lustre/fsw/portfolios/nemotron/users/ninaxu/datasets/telco_churn_10000.csv

This PR caps the bins to 1000, plus some other nice edge case catchers added by Cursor. Also planned: #539

Pre-Review Checklist

Ensure that the following pass:

  • make format && make check or via prek validation.
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

Other Notes

  • Closes #

Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

Walkthrough

The PR hardens the get_auto_bins function to compute histogram bin edges more robustly for PCA projection histograms. The rewritten function filters to finite values, caps adaptive bin counts, validates results, and provides deterministic safe defaults when data is degenerate or NumPy's computation fails. Three test cases verify pathological binning, non-finite inputs, and single-value distributions.

Changes

Robust auto-binning

Layer / File(s) Summary
Robust auto-binning implementation and validation
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py, tests/evaluation/components/test_multi_modal_figures.py
get_auto_bins now concatenates inputs, filters to finite values, and handles empty/constant data via deterministic fallback. The function caps NumPy's adaptive "auto" binning to ~1000 bins, wraps edge computation in try/except for robustness, and validates computed ranges to return safe [0.0, 1.0] defaults on invalid results. Three new test cases verify bin-count capping for pathological distributions, safe defaults for all non-finite inputs, and single-value range anchoring to a fixed-width interval.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested labels

bug, test

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(eval): add bin cap and fall back' directly reflects the main changes: adding a bin cap and fallback handling in the evaluation module's binning logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ninaxu/fix-cap-num-bins-in-plot

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 93.02326% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...sizer/evaluation/components/multi_modal_figures.py 88.46% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

@nina-xu nina-xu changed the title add bin cap and fall back fix(eval): add bin cap and fall back May 28, 2026
nina-xu added 2 commits May 28, 2026 06:51
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
@nina-xu nina-xu marked this pull request as ready for review May 28, 2026 19:08
@nina-xu nina-xu requested a review from a team as a code owner May 28, 2026 19:08
@coderabbitai coderabbitai Bot added bug Defects in shipped behavior test Test-only addition or change labels May 28, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py (1)

204-204: ⚡ Quick win

Consider making the return type more specific.

The return type annotation dict could be refined to dict[str, float] for better type safety and to help type checkers verify call sites.

🔧 Suggested type annotation improvement
-def get_auto_bins(x1: pd.Series, x2: pd.Series) -> dict:
+def get_auto_bins(x1: pd.Series, x2: pd.Series) -> dict[str, float]:
     """Get common histogram bins for training/synthetic PCA projections.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a07c5526-3a18-48de-bba5-a7ce1f8db7e8

📥 Commits

Reviewing files that changed from the base of the PR and between 6b7861b and 4c86b3f.

📒 Files selected for processing (2)
  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
  • tests/evaluation/components/test_multi_modal_figures.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (9)
**/*.{md,markdown,py}

📄 CodeRabbit inference engine (.cursor/rules/agent-markdown-style.mdc)

**/*.{md,markdown,py}: Avoid decorative bold (**text**) in list items, body text, and docstrings; use structural cues (headers, list markers, colons, backticks) for emphasis instead
Use backticks for code identifiers, paths, and CLI commands in markdown and docstrings

Files:

  • tests/evaluation/components/test_multi_modal_figures.py
  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
tests/**

📄 CodeRabbit inference engine (.cursor/rules/repo-navigation.mdc)

Tests should mirror the src/ directory structure in tests/

Files:

  • tests/evaluation/components/test_multi_modal_figures.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/repo-navigation.mdc)

Tests in tests/e2e/ should be auto-marked with e2e marker, tests in tests/smoke/ with smoke marker, others with unit marker

tests/**/*.py: Use absolute imports in tests/ (e.g., from nemo_safe_synthesizer.observability import get_logger).
Use fixture_ prefix convention for fixtures for grep-ability and to separate fixtures from test functions. Add a one-line docstring describing the fixture's purpose and data.
Use function-scoped fixtures by default. Session scope only when empirically justified by test runtime.
Use bare assert as the primary assertion style; pytest.raises() with match= for exceptions; pytest.approx() for floating-point comparisons.
Mark CUDA-dependent tests with @pytest.mark.e2e, @pytest.mark.smoke, or @pytest.mark.requires_gpu.
Mock only external boundaries, not internal implementation details.
Ensure test isolation: no shared mutable state or execution-order dependencies between tests. If something must be run first before executing a test, include it in the test or a fixture.
Use @pytest.mark.parametrize for testing multiple input combinations rather than copy-pasting similar tests.

Organize tests using pytest following the structure in tests/TESTING.md with support for unit tests, smoke tests, and end-to-end tests

Files:

  • tests/evaluation/components/test_multi_modal_figures.py

⚙️ CodeRabbit configuration file

Review tests against tests/TESTING.md. Check marker usage, fixture naming, tmp_path usage, determinism, and GPU/vLLM process-isolation requirements. Flag slop tests that only check that code runs, assert result is not None when stronger invariants exist, over-mock internal implementation details, patch around the bug instead of reproducing it, or add broad snapshot/golden churn without a clear contract. Flag change detector tests that fail on harmless refactors, formatting, record ordering, incidental wording, or private implementation details without demonstrating a behavior regression. Prefer existing fixtures or focused new fixtures for repeated setup; keep tests DRY when reasonable without making the behavior under test opaque. print() is allowed in tests.

Files:

  • tests/evaluation/components/test_multi_modal_figures.py
**/*.py

📄 CodeRabbit inference engine (STYLE_GUIDE.md)

**/*.py: Use American English spelling: 'initialize' not 'initialise', 'recognize' not 'recognise', 'color' not 'colour'.
Use observability.get_logger(__name__) for logging, never logging.getLogger() or structlog.get_logger() directly.
Use category loggers: .runtime for internals, .user for progress/results, .system for system events.
Never use print() for operational output. Use click.echo() for CLI output or sys.stdout.write() for raw output in tools.
Use extra={} in logging for structured data that downstream tools should query or aggregate; use f-strings for human-readable context.
Raise from the custom error hierarchy with dual inheritance: SafeSynthesizerError (base), UserError, DataError, ParameterError, GenerationError, InternalError.
Use NSSBaseModel for config/parameter models in config/ which define user-facing configuration. Use raw BaseModel or module-specific bases for data transfer objects and internal structures.
Use BaseSettings for env/CLI settings. Prefer AliasChoices on individual fields when a field needs to respond to both its Python name and an env var name.
Include Field(description=...) for Pydantic model fields as the canonical field docstring for API documentation and CLI help text.
Use assignment-style type = Field(default=..., description="...") as the default for Pydantic model fields because type checkers understand default, default_factory, and alias in assignment style.
Use Annotated only when the field carries additional metadata beyond Field() -- ValueValidator, AutoParam, DependsOnValidator, reusable constrained type aliases, nested-type constraints, or discriminated unions.
Put defaults as bare assignment (= value), not inside Field(default=...), when using Annotated. Exception: use assignment-style Field(default_factory=...) for defaults that cannot be expressed as bare assignments.
Use @dataclass(frozen=True) for immutable value objects and validators; mu...

Files:

  • tests/evaluation/components/test_multi_modal_figures.py
  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
**/*.{py,sh,yaml,yml}

📄 CodeRabbit inference engine (STYLE_GUIDE.md)

Include SPDX copyright header at the top: # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. and # SPDX-License-Identifier: Apache-2.0. The make format command handles this automatically.

Files:

  • tests/evaluation/components/test_multi_modal_figures.py
  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
**/*

📄 CodeRabbit inference engine (STYLE_GUIDE.md)

**/*: Include a newline at the end of all files, never trailing whitespace. This is enforced by pre-commit.
Use line length of 120 characters for code, comments, and docstrings (configured in ruff.toml).

Files:

  • tests/evaluation/components/test_multi_modal_figures.py
  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py

⚙️ CodeRabbit configuration file

**/*: Review as a senior maintainer for NeMo Safe Synthesizer. Prioritize issues that can change behavior, break user workflows, weaken privacy guarantees, hide failures, make tests unreliable, or create maintenance risk. Avoid generic style commentary unless it points to a concrete project convention that automated tools will not catch.
Comment only when the finding is actionable and tied to changed code. For each finding, state the impact, the condition that triggers it, and the smallest practical fix. Prefer one precise comment over broad advice. Do not ask for refactors outside the PR scope unless the changed code creates the problem.
Review type guidance: - Potential issue: use for correctness bugs, data loss, privacy leaks,
security risks, broken public APIs, invalid config behavior, missing
validation, hidden failures, nondeterministic tests, or CI breakage.

  • Refactor suggestion: use for local maintainability problems introduced
    by the diff when they have clear future cost, such as duplicated setup,
    unclear boundaries, over-mocking, avoidable complexity, or opaque test
    helpers.
  • Nitpick: avoid in chill mode. Do not emit formatting, import-order,
    wording, or style-only comments unless automated tools cannot catch the
    issue and it affects maintainability.

Severity guidance: - Critical: security/privacy leaks, data loss, training/test/holdout
contamination, or broken release/package/core pipeline execution.

  • Major: incorrect generation/training/evaluation behavior, broken
    CLI/SDK public API, invalid config defaults or validators, or GPU/vLLM
    cleanup and process-isolation bugs likely to fail CI or production
    runs.
  • Minor: localized bugs, missing focused tests for changed behavior, or
    bad test patterns that weaken regression coverage.
  • Trivial: small cleanup with no behavior impact. Usually suppress in
    chill mode.
  • Info: context only. Avoid unless it helps reviewers understand risk.
    Safe-Synthesizer-specific review focus: - Data ...

Files:

  • tests/evaluation/components/test_multi_modal_figures.py
  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
**/*test*.py

📄 CodeRabbit inference engine (AGENTS.md)

The unit_test marker is deprecated; use unit instead

Files:

  • tests/evaluation/components/test_multi_modal_figures.py
src/**/*.py

📄 CodeRabbit inference engine (STYLE_GUIDE.md)

src/**/*.py: Use relative imports in src/ (e.g., from ..observability import get_logger).
Do not use print() statements in library code. Use get_logger(__name__) from observability.py or click.echo() for CLI.
Do not use assert for validation in library code. Use if/raise for input validation. assert statements can be stripped by -O and must never guard correctness.

Files:

  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py

⚙️ CodeRabbit configuration file

Review library code against STYLE_GUIDE.md. Focus on behavior, API contracts, error handling, resource cleanup, typing, logging, and user-facing failures. Public APIs and nontrivial functions need Google-style docstrings.

Files:

  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
src/nemo_safe_synthesizer/evaluation/**/*.py

⚙️ CodeRabbit configuration file

Treat evaluation changes as correctness-sensitive. Check metric inputs, holdout usage, privacy metric semantics, report data shape, missing-data handling, and whether unavailable metrics fail or degrade intentionally.

Files:

  • src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
🧠 Learnings (1)
📚 Learning: 2026-05-27T22:20:37.354Z
Learnt from: kendrickb-nvidia
Repo: NVIDIA-NeMo/Safe-Synthesizer PR: 520
File: tests/generation/test_vllm_backend.py:556-587
Timestamp: 2026-05-27T22:20:37.354Z
Learning: In NVIDIA-NeMo/Safe-Synthesizer, `tests/conftest.py`’s `pytest_collection_modifyitems` hook applies pytest category markers automatically based on each test file’s path: tests under `/e2e/` get `pytest.mark.e2e`, tests under `/smoke/` get `pytest.mark.smoke`, and all other tests get `pytest.mark.unit`. Therefore, when reviewing pytest tests outside `tests/e2e/` and `tests/smoke/`, do not flag missing explicit `pytest.mark.unit` decorators on test classes/functions as an issue (the hook will add them during collection). If a new test directory/category is introduced, ensure the hook is updated so it’s categorized correctly.

Applied to files:

  • tests/evaluation/components/test_multi_modal_figures.py
🪛 Ruff (0.15.14)
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py

[warning] 230-230: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py (1)

204-246: LGTM!

tests/evaluation/components/test_multi_modal_figures.py (1)

203-226: LGTM!

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR hardens get_auto_bins against pathological PCA distributions where NumPy's bins="auto" heuristic would request a billion-plus bins and crash report rendering. The fix caps bin count at 1000 and adds guards for all-non-finite data and constant-value distributions.

  • Bin cap: min(auto_bin_count, 1000) prevents NumPy's adaptive estimator from producing an unreasonably large histogram when IQR ≈ 0 but range is huge.
  • Edge-case fallbacks: empty/all-non-finite data returns {start:0, end:1, size:1}; constant data returns a unit-width bin anchored at the observed value.
  • Tests: three new focused unit tests cover the cap path, non-finite input, and the degenerate (constant) distribution.

Confidence Score: 4/5

The change is self-contained and addresses a real crash in report rendering; the fallback logic is correct and well-tested.

The broad except Exception silently absorbs any numpy error — including unexpected bugs unrelated to bin-count overflow — and returns a plausible result with no log trace. This won't break anything today, but it removes visibility into future regressions in this code path.

The exception handler in multi_modal_figures.py (lines 230–235) deserves a second look to confirm the caught exception types are intentionally broad.

Important Files Changed

Filename Overview
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py get_auto_bins refactored: adds a 1000-bin cap, NaN/inf filtering, degenerate-distribution guard, and a broad exception fallback. Logic is correct; the except clause is wider than needed.
tests/evaluation/components/test_multi_modal_figures.py Three new unit tests cover the bin-cap path, all-non-finite input, and a constant-value (degenerate) distribution. Assertions are mathematically sound.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["get_auto_bins(x1, x2)"] --> B["Concatenate & filter: finite_data = data[isfinite]"]
    B --> C{finite_data empty?}
    C -- Yes --> Z1["return {start:0, end:1, size:1}"]
    C -- No --> D{min == max?}
    D -- Yes --> Z2["return _single_value_fallback(min)"]
    D -- No --> E["np.histogram_bin_edges(finite_data, bins='auto')"]
    E -- success --> F["bin_count = min(auto_bin_count, 1000)"]
    E -- exception --> G["bin_count = 1000, start = nanmin, end = nanmax"]
    F --> H{isfinite start & end?}
    G --> H
    H -- No --> Z3["return {start:0, end:1, size:1}"]
    H -- Yes --> I{start == end?}
    I -- Yes --> Z4["return _single_value_fallback(start)"]
    I -- No --> J["bin_size = (end - start) / bin_count"]
    J --> K{isfinite bin_size and > 0?}
    K -- No --> Z5["return {start:0, end:1, size:1}"]
    K -- Yes --> Z6["return {start, end, size: bin_size}"]
Loading

Reviews (1): Last reviewed commit: "increase to 1000 bins" | Re-trigger Greptile

Comment on lines +230 to +235
except Exception:
# Fallback for edge cases where numpy "auto" overestimates bin count
# and can raise ValueError/MemoryError before returning edges.
start = float(np.nanmin(finite_data))
end = float(np.nanmax(finite_data))
bin_count = max_bins
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent exception swallow hides unexpected errors

The bare except Exception catches everything — including AttributeError, TypeError, and other bugs that have nothing to do with the pathological-bin-count scenario. When any such error occurs, the fallback silently produces a plausible-looking result (start=min, end=max, bin_count=1000), making the issue invisible in logs and the report. Consider at minimum narrowing the caught types to (ValueError, MemoryError) (the two documented numpy overflow cases) or adding a warnings.warn / logger call so unexpected failures surface during debugging.

Comment on lines +221 to +222
if np.nanmin(finite_data) == np.nanmax(finite_data):
return _single_value_fallback(float(np.nanmin(finite_data)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 After filtering with np.isfinite, finite_data contains no NaN or infinite values by construction, so np.nanmin/np.nanmax is equivalent to np.min/np.max here. Using the plain variants makes the intent clearer.

Suggested change
if np.nanmin(finite_data) == np.nanmax(finite_data):
return _single_value_fallback(float(np.nanmin(finite_data)))
if np.min(finite_data) == np.max(finite_data):
return _single_value_fallback(float(np.min(finite_data)))

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:evaluation bug Defects in shipped behavior test Test-only addition or change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant