Hi BrainLM team,
Thank you for sharing BrainLM publicly.
I am a PhD student in the Department of Psychology at Seoul National University, and we are currently evaluating BrainLM (vandijklab/brainlm, 13M variant) as a frozen encoder for downstream fMRI analyses.
We would like to preserve the model’s native pretraining input distribution as closely as possible. In particular, we are trying to apply the same A424 per-parcel robust scaling described in BrainLM_Toolkit.py, where the median and IQR are computed over the UKB+HCP pretraining corpus.
Could you point us to the saved median[424] and iqr[424] arrays, or to any artifact that preserves them, such as an .npz, Hugging Face file, CSV, or internal preprocessing artifact?
We checked the three Hugging Face release variants (old_13M, vitmae_111M, and vitmae_650M), but could not find separate normalization-statistics files. From reading the toolkit, it looks like the robust scaling statistics may have been applied during data conversion, and the recordings were then stored already normalized in the Arrow column Voxelwise_RobustScaler_Normalized_Recording. If so, the original median/IQR arrays may not have been preserved as a separate public artifact.
Could you confirm whether these pretraining robust-scaling statistics were preserved and, if so, whether they can be shared?
If they were not preserved, a simple confirmation would still be very helpful, since it would allow us to avoid speculation in our methods section and instead report that we re-derived approximate normalization statistics from an available reference corpus.
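For concreteness, here is a minimal sketch of what we mean by re-deriving approximate statistics. It is not taken from the BrainLM code. The function name, the time x parcels layout, and the "inclusive" quantile method are our own assumptions, and the toy data stands in for a real reference corpus with 424 parcels.

```python
import statistics


def per_parcel_robust_stats(recordings):
    """Return (medians, iqrs) with one value per parcel.

    `recordings` is a list of timepoints, each a list of parcel values
    (hypothetical layout: time x parcels, pooled over a reference corpus).
    """
    n_parcels = len(recordings[0])
    medians, iqrs = [], []
    for p in range(n_parcels):
        # Collect every timepoint for this parcel.
        column = [row[p] for row in recordings]
        # Quartiles with linear interpolation between data points
        # (an assumption; the original pipeline may use another rule).
        q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
        medians.append(statistics.median(column))
        iqrs.append(q3 - q1)
    return medians, iqrs


# Toy data: 5 timepoints x 2 parcels (the A424 case would have 424 parcels).
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
medians, iqrs = per_parcel_robust_stats(toy)
```

Statistics derived this way depend on the reference corpus, so they would only approximate the original UKB+HCP values.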
One related question, which appears connected to issue #19, concerns the scaling formula itself. We read the current code around BrainLM_Toolkit.py:298 as:
recording = recording - data_median / IQR
which, because of Python operator precedence, evaluates as:
recording - (data_median / IQR)
rather than the canonical robust-scaling transformation:
(recording - data_median) / IQR
For compatibility with the released weights, our current wrapper reproduces the apparent implementation rather than the canonical formula. Could you confirm whether this reading matches the actual data transformation used for BrainLM pretraining?
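To make the difference concrete, here is a small scalar sketch of the two readings. The numeric values are illustrative only and do not come from the BrainLM data; the variable names follow the line quoted from the toolkit.

```python
# Illustrative values only (not real pretraining statistics).
recording = 10.0
data_median = 4.0
IQR = 2.0

# As written: division binds tighter than subtraction,
# so this is recording - (data_median / IQR).
as_written = recording - data_median / IQR

# Canonical robust scaling: center first, then divide by the IQR.
canonical = (recording - data_median) / IQR
```

Under the as-written reading the signal is only shifted by the constant data_median / IQR and is never divided by the IQR, so its variance would stay on the original scale.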
More broadly, we think it would be very useful for the brain foundation model community if pretraining-input artifacts such as normalization statistics were released alongside model weights, similar to how CV/NLP models commonly ship ImageNet preprocessing constants or tokenizers.
Thank you again for making BrainLM publicly available.
Best regards,
Bogyeom Kim
Department of Psychology
Seoul National University