A424 robust scaling stats — request for pretraining median/IQR arrays #23

@BogyeomKim

Hi BrainLM team,

Thank you for sharing BrainLM publicly.

I am a PhD student in the Department of Psychology at Seoul National University, and we are currently evaluating BrainLM (vandijklab/brainlm, 13M variant) as a frozen encoder for downstream fMRI analyses.

We would like to preserve the model’s native pretraining input distribution as closely as possible. In particular, we are trying to apply the same A424 per-parcel robust scaling described in BrainLM_Toolkit.py, where the median and IQR are computed over the UKB+HCP pretraining corpus.

Could you point us to the saved median[424] and iqr[424] arrays, or to any artifact that preserves them, such as an .npz, Hugging Face file, CSV, or internal preprocessing artifact?

We checked the three Hugging Face release variants (old_13M, vitmae_111M, and vitmae_650M), but could not find separate normalization-statistics files. From reading the toolkit, it looks like the robust scaling statistics may have been applied during data conversion, and the recordings were then stored already normalized in the Arrow column Voxelwise_RobustScaler_Normalized_Recording. If so, the original median/IQR arrays may not have been preserved as a separate public artifact.

Could you confirm whether these pretraining robust-scaling statistics were preserved and, if so, whether they can be shared?

If they were not preserved, a simple confirmation would still be very helpful, since it would allow us to avoid speculation in our methods section and instead report that we re-derived approximate normalization statistics from an available reference corpus.
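For concreteness, the fallback we have in mind is roughly the following. This is a minimal sketch under our own assumptions, not the toolkit's code: the helper name `derive_robust_stats`, the (n_timepoints, 424) array layout, and the pooling of all timepoints across recordings are hypothetical, and the data below is synthetic.

```python
import numpy as np

N_PARCELS = 424  # A424 atlas parcel count


def derive_robust_stats(recordings):
    """Per-parcel median and IQR over a list of (n_timepoints, 424) arrays.

    Hypothetical helper: pools all timepoints across recordings before
    computing the statistics along the time axis.
    """
    stacked = np.concatenate(recordings, axis=0)  # (total_timepoints, 424)
    median = np.median(stacked, axis=0)
    q75, q25 = np.percentile(stacked, [75, 25], axis=0)
    return median, q75 - q25


# Synthetic stand-in for a reference corpus (NOT real UKB/HCP data).
rng = np.random.default_rng(0)
corpus = [rng.normal(size=(200, N_PARCELS)) for _ in range(3)]
median, iqr = derive_robust_stats(corpus)
print(median.shape, iqr.shape)
```

Statistics derived this way would only approximate the pretraining values, which is why we would report them as re-derived rather than original.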

One related question: this appears connected to issue #19. We read the current code around BrainLM_Toolkit.py:298 as:

recording = recording - data_median / IQR

which, because of Python operator precedence, evaluates as:

recording - (data_median / IQR)

rather than the canonical robust-scaling transformation:

(recording - data_median) / IQR

For compatibility with the released weights, our current wrapper reproduces the expression as written (recording - data_median / IQR) rather than the canonical formula. Could you confirm whether this reading matches the actual data transformation used for BrainLM pretraining?
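To make the difference between the two readings concrete, here is a small sketch with toy scalar values. The numbers are arbitrary and chosen only for illustration; they are not taken from the toolkit or from any real recording.

```python
# Toy scalars, chosen only to make the difference visible.
recording, data_median, iqr = 10.0, 4.0, 2.0

# As written: division binds tighter than subtraction in Python.
as_written = recording - data_median / iqr

# Canonical robust scaling: center first, then divide by the IQR.
canonical = (recording - data_median) / iqr

print(as_written)  # 8.0
print(canonical)   # 3.0
```

Under the as-written reading, each parcel is only shifted by the constant data_median / IQR and its variance is left unscaled, so the two transformations produce inputs on different scales.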

More broadly, we think it would be very useful for the brain foundation model community if pretraining-input artifacts such as normalization statistics were released alongside model weights, similar to how CV/NLP models commonly ship ImageNet preprocessing constants or tokenizers.

Thank you again for making BrainLM publicly available.

Best regards,
Bogyeom Kim
Department of Psychology
Seoul National University
