Add diffusion based noise model #35

Draft
tHarvey303 wants to merge 16 commits into main from diffusion_noise_model
Conversation

@tHarvey303
Collaborator

This PR introduces a new generative machine learning model (ScoreBasedUncertaintyModel) to realistically simulate photometric measurement errors.

When generating mock galaxy catalogs, applying simple, uncorrelated Gaussian noise to true fluxes fails to capture the complex, real-world noise properties of surveys like COSMOS2020. Measurement errors are highly correlated across filters (e.g., due to blending or extraction methods) and exhibit heavy-tailed distributions.

This model uses a continuous-time diffusion framework (Variance Preserving SDE) to learn the full joint probability distribution of flux uncertainties across all bands in a survey, conditioned on the source's true magnitudes. This allows us to sample highly realistic, correlated noise for synthetic data.
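
For reference, the variance-preserving framework can be written out as follows. This is the standard VP-SDE formulation from Song et al. (2021), not a transcription of the code in this PR; the symbols $\beta(t)$ (noise schedule), $\alpha(t)$, $\sigma(t)$, the weighting $\lambda(t)$, the conditioning vector $c$ (the true magnitudes) and the network $s_\theta$ are notation assumed here for illustration.

$$
\mathrm{d}x = -\tfrac{1}{2}\beta(t)\,x\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}w
$$

$$
p_{0t}(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \alpha(t)\,x_0,\ \sigma(t)^2 I\right),
\qquad
\alpha(t) = \exp\!\left(-\tfrac{1}{2}\int_0^t \beta(s)\,\mathrm{d}s\right),
\qquad
\sigma(t)^2 = 1 - \alpha(t)^2
$$

With $x_t = \alpha(t)\,x_0 + \sigma(t)\,\epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$, the score of the perturbation kernel is $-\epsilon/\sigma(t)$, which gives the denoising score-matching loss

$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\left[\lambda(t)\,\left\lVert s_\theta(x_t, t, c) + \frac{\epsilon}{\sigma(t)} \right\rVert^2\right]
$$

and samples can be drawn by integrating the probability flow ODE backwards from $t = 1$ to $t \approx 0$:

$$
\frac{\mathrm{d}x}{\mathrm{d}t} = -\tfrac{1}{2}\beta(t)\left[x + s_\theta(x, t, c)\right]
$$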

Core Implementation

  • VP-SDE Diffusion Framework: Implements the continuous-time diffusion equations from Song et al. (2021). The model learns to reverse a noise-injection process to generate samples from the true uncertainty distribution.
  • Score-Matching Objective: Trains a neural network to estimate the score of the data distribution ($\nabla_x \log p_t(x)$) using denoising score matching.
  • Probability Flow ODE Sampler: Includes a fast, deterministic ODE solver for inference. Instead of the 500+ steps a stochastic sampler requires, it generates accurate uncertainty samples in ~50 steps, and the output is reproducible for a fixed initial noise draw.
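
To make the sampler idea concrete, here is a minimal sketch of a probability flow ODE integrated with fixed-step Euler in one dimension. It is not the solver in this PR: the linear schedule constants `BETA_MIN` and `BETA_MAX`, the function names, and the 50-step Euler scheme are assumptions, and the trained network is replaced by the analytic score of a Gaussian toy distribution so that the result can be checked by hand.

```python
import math

# Assumed linear noise schedule beta(t) = BETA_MIN + t * (BETA_MAX - BETA_MIN).
BETA_MIN, BETA_MAX = 0.1, 20.0


def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha(t):
    # alpha(t) = exp(-0.5 * integral_0^t beta(s) ds) for the linear schedule.
    return math.exp(-0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))


def gaussian_score(x, t, mean0, std0):
    # Exact score of p_t when the data distribution is N(mean0, std0^2).
    # In a real model this would be the trained score network.
    a = alpha(t)
    var_t = a * a * std0 * std0 + (1.0 - a * a)
    return -(x - a * mean0) / var_t


def probability_flow_sample(x_T, score_fn, n_steps=50, t_min=1e-3):
    # Integrate dx/dt = -0.5 * beta(t) * (x + score) from t = 1 down to t_min.
    dt = (1.0 - t_min) / n_steps
    x, t = x_T, 1.0
    for _ in range(n_steps):
        drift = -0.5 * beta(t) * (x + score_fn(x, t))
        x -= drift * dt  # stepping backwards in time
        t -= dt
    return x


def toy_score(x, t):
    return gaussian_score(x, t, mean0=2.0, std0=0.5)


sample_from_zero = probability_flow_sample(0.0, toy_score)
sample_from_one = probability_flow_sample(1.0, toy_score)
```

Because the update contains no random term, the same starting value always maps to the same sample; randomness enters only through the initial draw `x_T`. For this toy target the map is linear, so a starting value of 0 lands near the target mean and a unit change in `x_T` moves the output by roughly the target standard deviation.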

Neural Network Architecture

The underlying score estimator (_RobustScoreNetwork) is designed for training stability and for capturing high-frequency detail in the noise schedule:

  • Residual Connections: Prevent signal degradation across deep layers.
  • Gaussian Fourier Projections: Map the scalar diffusion time $t$ into a higher-dimensional periodic space, improving the network's ability to condition on time.
  • SiLU Activations: Used throughout to preserve gradients and prevent dead neurons during the complex score-matching task.
  • EMA Weight Tracking: Uses an Exponential Moving Average (AveragedModel) of the network weights during training to ensure smooth, artifact-free sampling at inference time.
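
Two of the components above are small enough to sketch directly. The snippet below is an illustration in NumPy, not the PyTorch code in this PR: the function names, the `n_features` and `scale` values, and the `decay` value are assumptions. It shows a Gaussian Fourier projection of the diffusion time and a single exponential-moving-average update of a weight array.

```python
import numpy as np


def gaussian_fourier_features(t, n_features=16, scale=16.0, seed=0):
    """Embed diffusion times t (shape [N]) as [sin, cos] features, shape [N, 2 * n_features]."""
    # Frequencies are drawn once from N(0, scale^2) and then kept fixed;
    # a fixed seed stands in for storing them as a non-trainable buffer.
    rng = np.random.default_rng(seed)
    freqs = rng.normal(0.0, scale, size=n_features)
    angles = 2.0 * np.pi * np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def ema_update(ema_weights, new_weights, decay=0.999):
    """One EMA step: keep `decay` of the running average, blend in the new weights."""
    return decay * ema_weights + (1.0 - decay) * new_weights


embedding = gaussian_fourier_features(np.array([0.0, 0.5, 1.0]))
ema = ema_update(np.zeros(2), np.ones(2), decay=0.9)
```

The random frequencies spread nearby values of $t$ across many sinusoids of different periods, which is what lets the following layers distinguish small changes in diffusion time. The EMA copy changes slowly, so sampling from it is less sensitive to the noise of individual optimizer steps.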

Here's an example: the true log-flux error vs magnitude density for the COSMOS2020 catalog is shown in blue, and the distribution recovered by the trained model is shown in red (dashed contours enclose 68% of points, solid contours 95%). The network learns the conditional relationships between the flux uncertainties in different filters, even for the HSC filters, where the distribution is complicated by the striping of the survey depths.

image

To Do

  1. Add tests
  2. Add example to docs
  3. Fully integrate this noise model into the existing training (single band noise model vs one per band).

@tHarvey303 tHarvey303 marked this pull request as draft May 6, 2026 14:48
@coderabbitai

coderabbitai Bot commented May 6, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 727eefb4-970c-4945-be64-fe02354eee92

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@tHarvey303
Collaborator Author

@copilot investigate why the tests are failing


Copilot AI commented May 6, 2026

@copilot investigate why the tests are failing

The failures were caused by a ValueError: setting an array element with a sequence in library.py's _scatter method.

When noise models are used, apply_noise is called with np.atleast_1d(flux) (a 1-element array), so it returns 1-element arrays. These were then being assigned to scalar positions in scattered_fluxes[i] and errors[i], which raised the error.

Fixed in f6e0bf6 by extracting the scalar from the returned arrays with [0] indexing:

scattered_fluxes[i] = scattered_flux[0]
errors[i] = sigma[0]
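
For context, the pattern behind this fix can be sketched in isolation. The `apply_noise` and `scatter` functions below are hypothetical stand-ins written for illustration, not the code in library.py, and the 10% error model is invented. The sketch shows only the scalar extraction and does not attempt to reproduce the original exception.

```python
import numpy as np


def apply_noise(flux, rng):
    """Hypothetical noise model: accepts a scalar or array, always returns 1-D arrays."""
    flux = np.atleast_1d(flux)
    sigma = 0.1 * np.abs(flux)  # invented 10% error, for illustration only
    return flux + rng.normal(0.0, sigma), sigma


def scatter(fluxes, seed=0):
    rng = np.random.default_rng(seed)
    scattered_fluxes = np.empty(len(fluxes))
    errors = np.empty(len(fluxes))
    for i, flux in enumerate(fluxes):
        scattered_flux, sigma = apply_noise(flux, rng)  # each has shape (1,)
        # Take the single element explicitly so a scalar is stored in each slot.
        scattered_fluxes[i] = scattered_flux[0]
        errors[i] = sigma[0]
    return scattered_fluxes, errors


scattered, errs = scatter([1.0, 2.0, 3.0])
```

Indexing with `[0]` makes the intent explicit: each loop iteration handles one flux and stores one scalar, regardless of how the noise model shapes its return values.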

@tHarvey303
Collaborator Author

@copilot There is still an issue with the coverage tests in the ltu-ili package. What is causing this?


2 participants