A424 robust scaling stats: request for pretraining median/IQR arrays #23

Hi BrainLM team,

Thank you for sharing BrainLM publicly.

I am a PhD student in the Department of Psychology at Seoul National University, and we are currently evaluating BrainLM (`vandijklab/brainlm`, 13M variant) as a frozen encoder for downstream fMRI analyses.

We would like to preserve the model’s native pretraining input distribution as closely as possible. In particular, we are trying to apply the same A424 per-parcel robust scaling described in `BrainLM_Toolkit.py`, where the median and IQR are computed over the UKB+HCP pretraining corpus.

Could you point us to the saved `median[424]` and `iqr[424]` arrays, or to any artifact that preserves them, such as an `.npz`, Hugging Face file, CSV, or internal preprocessing artifact?

We checked the three Hugging Face release variants (`old_13M`, `vitmae_111M`, and `vitmae_650M`) but could not find separate normalization-statistics files. From reading the toolkit, it looks like robust scaling was applied during data conversion and the recordings were then stored already normalized in the Arrow column `Voxelwise_RobustScaler_Normalized_Recording`. If so, the original median/IQR arrays may not have been preserved as a separate public artifact.

Could you confirm whether these pretraining robust-scaling statistics were preserved and, if so, whether they can be shared?

If they were not preserved, a simple confirmation would still be very helpful, since it would allow us to avoid speculation in our methods section and instead report that we re-derived approximate normalization statistics from an available reference corpus.
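For concreteness, here is a minimal sketch of the fallback we have in mind for re-deriving approximate statistics. It assumes each reference recording is an array shaped `(timepoints, 424)` and that statistics are pooled over all timepoints and subjects; the function name `derive_robust_stats` and the synthetic demo data are ours, not part of the toolkit, and the pooling scheme may differ from what was used in pretraining.

```python
import numpy as np

N_PARCELS = 424  # number of parcels in the A424 atlas


def derive_robust_stats(recordings):
    """Return per-parcel (median, iqr) pooled over all recordings.

    Hypothetical helper: `recordings` is assumed to be an iterable of
    arrays shaped (timepoints, 424), one array per scan.
    """
    pooled = np.concatenate(
        [np.asarray(r, dtype=np.float64) for r in recordings], axis=0
    )
    if pooled.ndim != 2 or pooled.shape[1] != N_PARCELS:
        raise ValueError(f"expected (timepoints, {N_PARCELS}), got {pooled.shape}")
    median = np.median(pooled, axis=0)
    q75, q25 = np.percentile(pooled, [75, 25], axis=0)
    return median, q75 - q25


# Synthetic stand-in for a reference corpus (three scans, 200 timepoints each).
rng = np.random.default_rng(0)
demo_corpus = [rng.normal(size=(200, N_PARCELS)) for _ in range(3)]
median, iqr = derive_robust_stats(demo_corpus)
print(median.shape, iqr.shape)
```

Statistics derived this way would only approximate the UKB+HCP values, which is why a confirmation either way would help us describe them accurately.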

One related question, which appears connected to issue #19: we read the current code around `BrainLM_Toolkit.py:298` as:

    recording = recording - data_median / IQR

which, because of Python operator precedence, evaluates as:

    recording - (data_median / IQR)

rather than the canonical robust-scaling transformation:

    (recording - data_median) / IQR

For compatibility with the released weights, our current wrapper reproduces the apparent implementation rather than the canonical formula. Could you confirm whether this reading matches the actual data transformation used for BrainLM pretraining?
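To illustrate why the distinction matters to us, the sketch below applies both readings to synthetic data. The variable names mirror the line quoted above, but the data and shapes are made up for illustration and this is not the toolkit's code.

```python
import numpy as np

# Synthetic recording: 100 timepoints x 4 parcels (illustrative only).
rng = np.random.default_rng(0)
recording = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

data_median = np.median(recording, axis=0)
q75, q25 = np.percentile(recording, [75, 25], axis=0)
IQR = q75 - q25

# Reading 1: the expression as written; division binds tighter than subtraction.
as_written = recording - data_median / IQR
# Reading 2: canonical robust scaling.
canonical = (recording - data_median) / IQR


def spread(x):
    """Per-column interquartile range."""
    hi, lo = np.percentile(x, [75, 25], axis=0)
    return hi - lo


# As written, each parcel is only shifted by a constant, so its spread is unchanged.
print(np.allclose(spread(as_written), IQR))
# The canonical form centers each parcel at 0 and rescales its IQR to 1.
print(np.allclose(np.median(canonical, axis=0), 0.0), np.allclose(spread(canonical), 1.0))
```

Under the first reading the input scale still depends on each parcel's raw IQR, so a wrapper using the canonical formula would feed the encoder a differently scaled signal.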

More broadly, we think it would be very useful for the brain foundation model community if pretraining-input artifacts such as normalization statistics were released alongside model weights, similar to how CV/NLP models commonly ship ImageNet preprocessing constants or tokenizers.

Thank you again for making BrainLM publicly available.

Best regards,  
Bogyeom Kim  
Department of Psychology  
Seoul National University
