fix(eval): add bin cap and fall back#533
Conversation
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
WalkthroughThe PR hardens the ChangesRobust auto-binning
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Signed-off-by: Nina Xu <19981858+nina-xu@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py (1)
204-204: ⚡ Quick winConsider making the return type more specific.
The return type annotation
dictcould be refined todict[str, float]for better type safety and to help type checkers verify call sites.🔧 Suggested type annotation improvement
-def get_auto_bins(x1: pd.Series, x2: pd.Series) -> dict: +def get_auto_bins(x1: pd.Series, x2: pd.Series) -> dict[str, float]: """Get common histogram bins for training/synthetic PCA projections.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a07c5526-3a18-48de-bba5-a7ce1f8db7e8
📒 Files selected for processing (2)
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.pytests/evaluation/components/test_multi_modal_figures.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (9)
**/*.{md,markdown,py}
📄 CodeRabbit inference engine (.cursor/rules/agent-markdown-style.mdc)
**/*.{md,markdown,py}: Avoid decorative bold (**text**) in list items, body text, and docstrings; use structural cues (headers, list markers, colons, backticks) for emphasis instead
Use backticks for code identifiers, paths, and CLI commands in markdown and docstrings
Files:
tests/evaluation/components/test_multi_modal_figures.pysrc/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
tests/**
📄 CodeRabbit inference engine (.cursor/rules/repo-navigation.mdc)
Tests should mirror the
src/directory structure intests/
Files:
tests/evaluation/components/test_multi_modal_figures.py
tests/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/repo-navigation.mdc)
Tests in
tests/e2e/should be auto-marked withe2emarker, tests intests/smoke/withsmokemarker, others withunitmarker
tests/**/*.py: Use absolute imports intests/(e.g.,from nemo_safe_synthesizer.observability import get_logger).
Usefixture_prefix convention for fixtures for grep-ability and to separate fixtures from test functions. Add a one-line docstring describing the fixture's purpose and data.
Use function-scoped fixtures by default. Session scope only when empirically justified by test runtime.
Use bareassertas the primary assertion style;pytest.raises()withmatch=for exceptions;pytest.approx()for floating-point comparisons.
Mark CUDA-dependent tests with@pytest.mark.e2e,@pytest.mark.smoke, or@pytest.mark.requires_gpu.
Mock only external boundaries, not internal implementation details.
Ensure test isolation: no shared mutable state or execution-order dependencies between tests. If something must be run first before executing a test, include it in the test or a fixture.
Use@pytest.mark.parametrizefor testing multiple input combinations rather than copy-pasting similar tests.Organize tests using pytest following the structure in
tests/TESTING.mdwith support for unit tests, smoke tests, and end-to-end tests
Files:
tests/evaluation/components/test_multi_modal_figures.py
⚙️ CodeRabbit configuration file
Review tests against tests/TESTING.md. Check marker usage, fixture naming, tmp_path usage, determinism, and GPU/vLLM process-isolation requirements. Flag slop tests that only check that code runs, assert result is not None when stronger invariants exist, over-mock internal implementation details, patch around the bug instead of reproducing it, or add broad snapshot/golden churn without a clear contract. Flag change detector tests that fail on harmless refactors, formatting, record ordering, incidental wording, or private implementation details without demonstrating a behavior regression. Prefer existing fixtures or focused new fixtures for repeated setup; keep tests DRY when reasonable without making the behavior under test opaque. print() is allowed in tests.
Files:
tests/evaluation/components/test_multi_modal_figures.py
**/*.py
📄 CodeRabbit inference engine (STYLE_GUIDE.md)
**/*.py: Use American English spelling: 'initialize' not 'initialise', 'recognize' not 'recognise', 'color' not 'colour'.
Useobservability.get_logger(__name__)for logging, neverlogging.getLogger()orstructlog.get_logger()directly.
Use category loggers:.runtimefor internals,.userfor progress/results,.systemfor system events.
Never useprint()for operational output. Useclick.echo()for CLI output orsys.stdout.write()for raw output in tools.
Useextra={}in logging for structured data that downstream tools should query or aggregate; use f-strings for human-readable context.
Raise from the custom error hierarchy with dual inheritance:SafeSynthesizerError(base),UserError,DataError,ParameterError,GenerationError,InternalError.
UseNSSBaseModelfor config/parameter models inconfig/which define user-facing configuration. Use rawBaseModelor module-specific bases for data transfer objects and internal structures.
UseBaseSettingsfor env/CLI settings. PreferAliasChoiceson individual fields when a field needs to respond to both its Python name and an env var name.
IncludeField(description=...)for Pydantic model fields as the canonical field docstring for API documentation and CLI help text.
Use assignment-styletype = Field(default=..., description="...")as the default for Pydantic model fields because type checkers understanddefault,default_factory, andaliasin assignment style.
UseAnnotatedonly when the field carries additional metadata beyondField()--ValueValidator,AutoParam,DependsOnValidator, reusable constrained type aliases, nested-type constraints, or discriminated unions.
Put defaults as bare assignment (= value), not insideField(default=...), when usingAnnotated. Exception: use assignment-styleField(default_factory=...)for defaults that cannot be expressed as bare assignments.
Use@dataclass(frozen=True)for immutable value objects and validators; mu...
Files:
tests/evaluation/components/test_multi_modal_figures.pysrc/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
**/*.{py,sh,yaml,yml}
📄 CodeRabbit inference engine (STYLE_GUIDE.md)
Include SPDX copyright header at the top:
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.and# SPDX-License-Identifier: Apache-2.0. Themake formatcommand handles this automatically.
Files:
tests/evaluation/components/test_multi_modal_figures.pysrc/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
**/*
📄 CodeRabbit inference engine (STYLE_GUIDE.md)
**/*: Include a newline at the end of all files, never trailing whitespace. This is enforced bypre-commit.
Use line length of 120 characters for code, comments, and docstrings (configured inruff.toml).
Files:
tests/evaluation/components/test_multi_modal_figures.pysrc/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
⚙️ CodeRabbit configuration file
**/*: Review as a senior maintainer for NeMo Safe Synthesizer. Prioritize issues that can change behavior, break user workflows, weaken privacy guarantees, hide failures, make tests unreliable, or create maintenance risk. Avoid generic style commentary unless it points to a concrete project convention that automated tools will not catch.
Comment only when the finding is actionable and tied to changed code. For each finding, state the impact, the condition that triggers it, and the smallest practical fix. Prefer one precise comment over broad advice. Do not ask for refactors outside the PR scope unless the changed code creates the problem.
Review type guidance: - Potential issue: use for correctness bugs, data loss, privacy leaks,
security risks, broken public APIs, invalid config behavior, missing
validation, hidden failures, nondeterministic tests, or CI breakage.
- Refactor suggestion: use for local maintainability problems introduced
by the diff when they have clear future cost, such as duplicated setup,
unclear boundaries, over-mocking, avoidable complexity, or opaque test
helpers.- Nitpick: avoid in chill mode. Do not emit formatting, import-order,
wording, or style-only comments unless automated tools cannot catch the
issue and it affects maintainability.Severity guidance: - Critical: security/privacy leaks, data loss, training/test/holdout
contamination, or broken release/package/core pipeline execution.
- Major: incorrect generation/training/evaluation behavior, broken
CLI/SDK public API, invalid config defaults or validators, or GPU/vLLM
cleanup and process-isolation bugs likely to fail CI or production
runs.- Minor: localized bugs, missing focused tests for changed behavior, or
bad test patterns that weaken regression coverage.- Trivial: small cleanup with no behavior impact. Usually suppress in
chill mode.- Info: context only. Avoid unless it helps reviewers understand risk.
Safe-Synthesizer-specific review focus: - Data ...
Files:
tests/evaluation/components/test_multi_modal_figures.pysrc/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
**/*test*.py
📄 CodeRabbit inference engine (AGENTS.md)
The
unit_testmarker is deprecated; useunitinstead
Files:
tests/evaluation/components/test_multi_modal_figures.py
src/**/*.py
📄 CodeRabbit inference engine (STYLE_GUIDE.md)
src/**/*.py: Use relative imports insrc/(e.g.,from ..observability import get_logger).
Do not useprint()statements in library code. Useget_logger(__name__)fromobservability.pyorclick.echo()for CLI.
Do not useassertfor validation in library code. Useif/raisefor input validation.assertstatements can be stripped by-Oand must never guard correctness.
Files:
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
⚙️ CodeRabbit configuration file
Review library code against STYLE_GUIDE.md. Focus on behavior, API contracts, error handling, resource cleanup, typing, logging, and user-facing failures. Public APIs and nontrivial functions need Google-style docstrings.
Files:
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
src/nemo_safe_synthesizer/evaluation/**/*.py
⚙️ CodeRabbit configuration file
Treat evaluation changes as correctness-sensitive. Check metric inputs, holdout usage, privacy metric semantics, report data shape, missing-data handling, and whether unavailable metrics fail or degrade intentionally.
Files:
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
🧠 Learnings (1)
📚 Learning: 2026-05-27T22:20:37.354Z
Learnt from: kendrickb-nvidia
Repo: NVIDIA-NeMo/Safe-Synthesizer PR: 520
File: tests/generation/test_vllm_backend.py:556-587
Timestamp: 2026-05-27T22:20:37.354Z
Learning: In NVIDIA-NeMo/Safe-Synthesizer, `tests/conftest.py`’s `pytest_collection_modifyitems` hook applies pytest category markers automatically based on each test file’s path: tests under `/e2e/` get `pytest.mark.e2e`, tests under `/smoke/` get `pytest.mark.smoke`, and all other tests get `pytest.mark.unit`. Therefore, when reviewing pytest tests outside `tests/e2e/` and `tests/smoke/`, do not flag missing explicit `pytest.mark.unit` decorators on test classes/functions as an issue (the hook will add them during collection). If a new test directory/category is introduced, ensure the hook is updated so it’s categorized correctly.
Applied to files:
tests/evaluation/components/test_multi_modal_figures.py
🪛 Ruff (0.15.14)
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py
[warning] 230-230: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py (1)
204-246: LGTM!tests/evaluation/components/test_multi_modal_figures.py (1)
203-226: LGTM!
Greptile SummaryThis PR hardens
Confidence Score: 4/5The change is self-contained and addresses a real crash in report rendering; the fallback logic is correct and well-tested. The broad The exception handler in Important Files Changed
|
| except Exception: | ||
| # Fallback for edge cases where numpy "auto" overestimates bin count | ||
| # and can raise ValueError/MemoryError before returning edges. | ||
| start = float(np.nanmin(finite_data)) | ||
| end = float(np.nanmax(finite_data)) | ||
| bin_count = max_bins |
There was a problem hiding this comment.
Silent exception swallow hides unexpected errors
The bare except Exception catches everything — including AttributeError, TypeError, and other bugs that have nothing to do with the pathological-bin-count scenario. When any such error occurs, the fallback silently produces a plausible-looking result (start=min, end=max, bin_count=1000), making the issue invisible in logs and the report. Consider at minimum narrowing the caught types to (ValueError, MemoryError) (the two documented numpy overflow cases) or adding a warnings.warn / logger call so unexpected failures surface during debugging.
| if np.nanmin(finite_data) == np.nanmax(finite_data): | ||
| return _single_value_fallback(float(np.nanmin(finite_data))) |
There was a problem hiding this comment.
After filtering with
np.isfinite, finite_data contains no NaN or infinite values by construction, so np.nanmin/np.nanmax is equivalent to np.min/np.max here. Using the plain variants makes the intent clearer.
| if np.nanmin(finite_data) == np.nanmax(finite_data): | |
| return _single_value_fallback(float(np.nanmin(finite_data))) | |
| if np.min(finite_data) == np.max(finite_data): | |
| return _single_value_fallback(float(np.min(finite_data))) |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Summary
I ran into this issue where the report rendering failed because the text PCA graph failed, due to the auto bin function calculating 1B+ bins lol.
Log:
Steps to reproduce:
safe-synthesizer run --config script/slurm/configs/smollm3-nodp.yaml --data-source /lustre/fsw/portfolios/nemotron/users/ninaxu/datasets/telco_churn_10000.csvThis PR caps the bins to 1000, plus some other nice edge case catchers added by Cursor. Also planned: #539
Pre-Review Checklist
Ensure that the following pass:
make format && make checkor via prek validation.make testpasses locallymake test-e2epasses locallymake test-ci-containerpasses locally (recommended)/syncon this PR to trigger a run (auto-triggers on ready-for-review)Pre-Merge Checklist
Other Notes