Closed
2 changes: 1 addition & 1 deletion .github/workflows/python-tests.yml
@@ -42,7 +42,7 @@ jobs:

     - name: Test with pytest
       run: |
-        pytest tests/ --ignore-glob='tests/test_ml_*.py' --cov=coco_pipe/ --cov-report=xml --verbose -s
+        pytest tests/ --cov=coco_pipe/ --cov-report=xml --verbose -s

     - name: Upload coverage reports to Codecov
       if: matrix.os == 'ubuntu-latest' && matrix.python-version == '3.10'
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -1,3 +1,4 @@
+exclude: '^coco_pipe/decoding/fm_hub/cbramod_src/'
 repos:
 - repo: https://github.com/pre-commit/pre-commit-hooks
   rev: v4.6.0
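
Reviewer note on the new top-level `exclude`: as far as I understand, pre-commit treats this value as a Python regular expression and searches for it in each file path relative to the repository root. The sketch below only mimics that matching with the standard `re` module to show what the `^` anchor does. The pattern is copied from the hunk above; the helper name `is_excluded` and all example paths are hypothetical.

```python
import re

# Pattern copied verbatim from the `exclude:` line in the hunk above.
EXCLUDE = re.compile(r"^coco_pipe/decoding/fm_hub/cbramod_src/")


def is_excluded(path: str) -> bool:
    """Return True if a repo-relative path would match the exclude pattern.

    Assumption: matching is done with `re.search`, so the leading `^`
    is what pins the pattern to the start of the path.
    """
    return EXCLUDE.search(path) is not None


# Hypothetical paths, for illustration only.
example_paths = [
    "coco_pipe/decoding/fm_hub/cbramod_src/model.py",
    "coco_pipe/decoding/_cache.py",
    "vendor/coco_pipe/decoding/fm_hub/cbramod_src/model.py",
]

for path in example_paths:
    print(path, "->", "skipped" if is_excluded(path) else "checked")
```

Because of the anchor, a copy of the vendored directory nested under another top-level folder would still be checked by the hooks.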
174 changes: 59 additions & 115 deletions README.md
@@ -42,127 +42,77 @@ Whether you're conducting clinical research, developing ML models for brain-comp

 For detailed development instructions, please see [CONTRIBUTING.md](CONTRIBUTING.md).

-## Using the ML Module
+## Using the Decoding Module

-CoCo Pipe provides two main ways to use the ML module:
+The supported modeling API is `coco_pipe.decoding.Experiment`. It is array-first:
+prepare `X` and `y` explicitly, then pass optional sample IDs, groups, feature
+names, and time labels when they matter for the analysis.

-### 1. Direct Python API Usage
+```python
+from coco_pipe.decoding import Experiment, ExperimentConfig
+from coco_pipe.decoding.configs import (
+    CVConfig,
+    FeatureSelectionConfig,
+    LogisticRegressionConfig,
+    TuningConfig,
+)

-You can use the ML module directly in your Python scripts by importing from `coco_pipe.io` for data loading and `coco_pipe.ml` for machine learning pipelines:
+config = ExperimentConfig(
+    task="classification",
+    models={"logreg": LogisticRegressionConfig(max_iter=500)},
+    metrics=["accuracy", "roc_auc"],
+    cv=CVConfig(strategy="stratified", n_splits=5, shuffle=True, random_state=42),
+    feature_selection=FeatureSelectionConfig(
+        enabled=True,
+        method="k_best",
+        n_features=20,
+        scoring="f_classif",
+    ),
+    tuning=TuningConfig(
+        enabled=True,
+        param_grid={"model__C": [0.1, 1.0, 10.0]},
+        scoring="roc_auc",
+        cv=CVConfig(strategy="stratified", n_splits=3, shuffle=True, random_state=42),
+    ),
+    n_jobs=1,
+)

-```python
-from coco_pipe.io import load_data
-from coco_pipe.ml import MLPipeline
-
-# Load your data into the canonical package container
-container = load_data(
-    "data/your_dataset.csv",
-    mode="tabular",
-    target_col="target_class",
-    sep=",",
+result = Experiment(config).run(
+    X,
+    y,
+    groups=subject_ids,
+    sample_ids=trial_ids,
+    feature_names=feature_names,
 )

-# Select a subset explicitly from the container when needed
-container = container.select(feature=["feat1", "feat2"], y=["case", "control"])
-X = container.X
-y = container.y
-
-# Configure and run ML pipeline
-config = {
-    "task": "classification",  # or 'regression'
-    "analysis_type": "baseline",  # Options: 'baseline', 'feature_selection', 'hp_search', 'hp_search_fs'
-    "models": "all",  # or list of specific models
-    "metrics": ["accuracy", "f1-score"],
-    "cv_strategy": "stratified",
-    "n_splits": 5,
-    "n_features": 10,  # For feature selection
-    "direction": "forward",  # For feature selection
-    "search_type": "grid",  # For hyperparameter search
-    "n_iter": 100,  # For random search
-    "scoring": "accuracy",
-    "n_jobs": -1
-}
-
-pipeline = MLPipeline(X=X, y=y, config=config)
-results = pipeline.run()
+summary = result.summary()
+predictions = result.get_predictions()
+splits = result.get_splits()
+selected = result.get_selected_features()
 ```

-### 2. Using the CLI Tool
-
-For batch processing or experiment management, use the CLI tool with a YAML configuration file:
-
-```yaml
-# -----------------------------------------------------------------------------
-# Toy config for MLPipeline
-# -----------------------------------------------------------------------------
-
-# Global parameters shared across analyses
-global_experiment_id: "toy_ml_config"
-data_path: "../datasets/toy_dataset.csv"
-results_dir: "../results"
-results_file: "toy_ml_config"
-
-# Default analysis parameters (can be overridden per analysis)
-defaults:
-  random_state: 42
-  n_jobs: -1
-  cv_kwargs:
-    strategy: "stratified"
-    n_splits: 5
-    shuffle: true
-    random_state: 42
-  covariates: ["age"]
-  spatial_units: ["regionX", "regionY"]
-  feature_names: ["feat1", "feat2", "feat3"]
-
-# List of analyses to run
-analyses:
-  - id: "classification_baseline"
-    task: "classification"
-    analysis_type: "baseline"
-    target_columns: ["target_class"]
-    row_filter:
-      - column: "age"
-        values: 13
-        operator: ">"
-      - column: "sex"
-        values: ["male"]
-    models:
-      - "Logistic Regression"
-      - "Random Forest"
-    metrics:
-      - "accuracy"
-      - "roc_auc"
-
-  - id: "regression_hp_search"
-    task: "regression"
-    analysis_type: "hp_search"
-    target_columns: ["target_reg"]
-    feature_names: ["feat1"]
-    spatial_units: ["regionX"]
-    models: "all"
-    metrics:
-      - "r2"
-      - "neg_mse"
-    cv_kwargs:
-      strategy: "kfold"
-      n_splits: 3
-    search_type: "grid"
-    n_iter: 20
-    scoring: "r2"
-```
+For grouped EEG studies, make the outer and inner CV decisions explicit:

-Run the analysis using:
+```python
+config = ExperimentConfig(
+    task="classification",
+    models={"logreg": LogisticRegressionConfig(max_iter=500)},
+    metrics=["accuracy"],
+    cv=CVConfig(strategy="group_kfold", n_splits=5),
+    tuning=TuningConfig(
+        enabled=True,
+        param_grid={"model__C": [0.1, 1.0, 10.0]},
+        scoring="accuracy",
+        cv=CVConfig(strategy="group_kfold", n_splits=3),
+    ),
+)

-```bash
-python scripts/run_ml.py --config configs/your_config.yml
+result = Experiment(config).run(X, y, groups=subject_ids)
 ```

-The pipeline will:
-- Load and preprocess your data
-- Run all specified analyses
-- Save results for each model/analysis
-- Generate a combined results file
+See the decoding documentation for feature selection, temporal decoding, result
+tables, plotting helpers, and report integration. Batch decoding CLIs are not
+part of the public surface yet; use the Python API for now.

 ## Documentation

@@ -179,12 +129,6 @@ Contributions are welcome! If you have suggestions or find any bugs, please open
 - Implement CSV loading and M/EEG data loading functionalities.
 - Develop comprehensive unit tests.

-#### ML Module
-- Restructure to mirror the design of the dim_reduction module.
-- Consolidate scripts within the main pipeline.
-- Add regression support and enhance cross-validation methods.
-- Update and expand unit tests.
-
 #### DL Module
 - Define and implement deep learning functionalities.
 - Create corresponding unit tests.
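
Reviewer note on the grouped-CV example added to README.md: passing `groups=subject_ids` with a `group_kfold` strategy is presumably meant to keep every subject's trials on one side of each split. The following is a minimal pure-Python sketch of that property only. It is not coco_pipe's or scikit-learn's implementation; the helper `group_folds`, its round-robin assignment, and the toy `subject_ids` are all hypothetical.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def group_folds(
    groups: Sequence[Hashable], n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where each group is tested exactly once.

    Simplified stand-in for group-wise k-fold: whole groups are assigned to
    folds round-robin, so no group is ever split across train and test.
    """
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    fold_of = {group: i % n_splits for i, group in enumerate(unique)}
    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, test_idx


# Toy data: one entry per trial, labelled with the subject it came from.
subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s6"]
folds = list(group_folds(subject_ids, n_splits=3))

for train_idx, test_idx in folds:
    train_subjects = {subject_ids[i] for i in train_idx}
    test_subjects = {subject_ids[i] for i in test_idx}
    # The intersection is empty for every fold: no subject leaks across the split.
    print(sorted(test_subjects), "shared with train:", sorted(train_subjects & test_subjects))
```

With plain (non-grouped) k-fold, trials from the same subject can land in both train and test, which tends to inflate decoding scores. That is why the README asks for the group strategy in both the outer `cv` and the inner `tuning.cv`.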
69 changes: 62 additions & 7 deletions coco_pipe/decoding/__init__.py
@@ -1,12 +1,67 @@
-from .configs import ExperimentConfig
-from .core import Experiment
-from .registry import get_estimator_cls, register_estimator
-from .utils import cross_validate_score
+"""
+Decoding Module
+===============
+
+Core module for scientific decoding and machine learning experiments on
+electrophysiological and behavioral data.
+"""
+
+from .configs import (
+    CheckpointConfig,
+    ClassicalModelConfig,
+    DeviceConfig,
+    ExperimentConfig,
+    FoundationEmbeddingModelConfig,
+    FrozenBackboneDecoderConfig,
+    LoRAConfig,
+    NeuralFineTuneConfig,
+    QuantizationConfig,
+    StatisticalAssessmentConfig,
+    TemporalDecoderConfig,
+    TrainerConfig,
+    TrainStageConfig,
+)
+from .experiment import Experiment
+from .registry import (
+    EstimatorCapabilities,
+    get_capabilities,
+    list_capabilities,
+    register_estimator,
+    register_estimator_spec,
+)
+from .result import ExperimentResult
+from .stats import (
+    aggregate_predictions_for_inference,
+    binomial_accuracy_test,
+    run_statistical_assessment,
+)

 __all__ = [
+    # Configs
     "ExperimentConfig",
-    "register_estimator",
-    "get_estimator_cls",
+    "ClassicalModelConfig",
+    "FoundationEmbeddingModelConfig",
+    "FrozenBackboneDecoderConfig",
+    "NeuralFineTuneConfig",
+    "TemporalDecoderConfig",
+    "LoRAConfig",
+    "QuantizationConfig",
+    "DeviceConfig",
+    "CheckpointConfig",
+    "TrainerConfig",
+    "TrainStageConfig",
+    "StatisticalAssessmentConfig",
+    # Execution
     "Experiment",
-    "cross_validate_score",
+    "ExperimentResult",
+    # Model Discovery & Metadata
+    "register_estimator",
+    "register_estimator_spec",
+    "get_capabilities",
+    "list_capabilities",
+    "EstimatorCapabilities",
+    # Stats Utilities
+    "run_statistical_assessment",
+    "binomial_accuracy_test",
+    "aggregate_predictions_for_inference",
 ]
82 changes: 82 additions & 0 deletions coco_pipe/decoding/_cache.py
@@ -0,0 +1,82 @@
+"""
+Cache-key helpers for decoding feature extraction.
+==================================================
+
+The decoding module uses these helpers to generate stable, split-safe keys for
+caching intermediate artifacts like embeddings or fitted preprocessing steps.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import json
+from typing import Any, Sequence
+
+
+def make_feature_cache_key(
+    train_sample_ids: Sequence[Any],
+    test_sample_ids: Sequence[Any],
+    preprocessing_fingerprint: str,
+    backbone_fingerprint: str,
+    extra_metadata: dict[str, Any] | None = None,
+    sort_ids: bool = True,
+) -> str:
+    """
+    Build a stable cache key for split-specific feature extraction artifacts.
+
+    This generates a SHA256 hex digest of a JSON-serialized payload containing
+    the identities of the train/test samples and the configuration of the
+    preprocessing and backbone modules. This ensures that fitted transforms
+    or extracted embeddings cannot be reused for incompatible splits or
+    different model configurations, preventing data leakage and silent errors.
+
+    The sample IDs are converted to strings to ensure stability across
+    different ID types. By default, IDs are sorted to ensure the cache key
+    is order-insensitive. If order-dependent preprocessing is used,
+    set `sort_ids=False`.
+
+    Parameters
+    ----------
+    train_sample_ids : Sequence[Any]
+        Sample IDs identifying the training fold.
+    test_sample_ids : Sequence[Any]
+        Sample IDs identifying the test/validation fold.
+    preprocessing_fingerprint : str
+        A unique hash or string representing the preprocessing configuration.
+    backbone_fingerprint : str
+        A unique hash or string representing the model/extractor configuration.
+    extra_metadata : dict[str, Any], optional
+        Additional dimensions that affect the output (e.g., time indices,
+        target labels, or stage names). Default is None.
+    sort_ids : bool, default=True
+        Whether to sort the sample IDs before hashing. Sorting makes the
+        key order-insensitive, which is usually desired for reproducibility.
+
+    Returns
+    -------
+    key : str
+        The SHA256 hex digest of the normalized JSON payload.
+    """
+    # 1. Normalize identifiers
+    train_ids = [str(value) for value in train_sample_ids]
+    test_ids = [str(value) for value in test_sample_ids]
+
+    if sort_ids:
+        train_ids.sort()
+        test_ids.sort()
+
+    payload = {
+        "train_sample_ids": train_ids,
+        "test_sample_ids": test_ids,
+        "preprocessing_fingerprint": preprocessing_fingerprint,
+        "backbone_fingerprint": backbone_fingerprint,
+    }
+
+    # 2. Handle metadata path (explicit for coverage)
+    if extra_metadata is not None:
+        payload["extra_metadata"] = extra_metadata
+    else:
+        payload["extra_metadata"] = {}
+
+    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
+    return hashlib.sha256(encoded).hexdigest()
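
Reviewer note on `make_feature_cache_key`: the docstring promises order-insensitive keys by default and distinct keys for different splits or configurations. The sketch below re-declares a condensed copy of the function from the diff above so that it runs standalone, then exercises those properties. The sample IDs, the fingerprint strings (`"prep-v1"`, `"backbone-v1"`, `"backbone-v2"`), and the `time_index` metadata key are made up for illustration.

```python
import hashlib
import json
from typing import Any, Optional, Sequence


def make_feature_cache_key(
    train_sample_ids: Sequence[Any],
    test_sample_ids: Sequence[Any],
    preprocessing_fingerprint: str,
    backbone_fingerprint: str,
    extra_metadata: Optional[dict] = None,
    sort_ids: bool = True,
) -> str:
    """Condensed copy of the function in the diff, kept here so the demo is self-contained."""
    train_ids = [str(value) for value in train_sample_ids]
    test_ids = [str(value) for value in test_sample_ids]
    if sort_ids:
        train_ids.sort()
        test_ids.sort()
    payload = {
        "train_sample_ids": train_ids,
        "test_sample_ids": test_ids,
        "preprocessing_fingerprint": preprocessing_fingerprint,
        "backbone_fingerprint": backbone_fingerprint,
        "extra_metadata": extra_metadata if extra_metadata is not None else {},
    }
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


# Same split, IDs listed in a different order: keys match when sort_ids=True.
base = make_feature_cache_key([3, 1, 2], [4, 5], "prep-v1", "backbone-v1")
reordered = make_feature_cache_key([1, 2, 3], [5, 4], "prep-v1", "backbone-v1")

# With sort_ids=False the ordering becomes part of the key.
ordered_a = make_feature_cache_key([3, 1, 2], [4, 5], "prep-v1", "backbone-v1", sort_ids=False)
ordered_b = make_feature_cache_key([1, 2, 3], [5, 4], "prep-v1", "backbone-v1", sort_ids=False)

# Changing the backbone, swapping train/test, or adding metadata changes the key.
other_backbone = make_feature_cache_key([3, 1, 2], [4, 5], "prep-v1", "backbone-v2")
swapped = make_feature_cache_key([4, 5], [3, 1, 2], "prep-v1", "backbone-v1")
with_meta = make_feature_cache_key(
    [3, 1, 2], [4, 5], "prep-v1", "backbone-v1", extra_metadata={"time_index": 10}
)

print("order-insensitive by default:", base == reordered)
print("order-sensitive when sort_ids=False:", ordered_a != ordered_b)
print("distinct keys:", len({base, other_backbone, swapped, with_meta}))
```

Two caveats follow from the code as written. IDs are stringified before hashing, so the integer `1` and the string `"1"` yield the same key, and sorting is lexicographic on those strings. `extra_metadata` goes straight into `json.dumps`, so values that are not JSON-serializable (for example a `set`) raise `TypeError`.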