HotProt: Genome-wide interactions of intrinsically disordered proteins with DNA reveal a regulatory grammar

Overview

HotProt is an interpretable deep learning model that predicts base-resolution intrinsically disordered protein (IDP)–DNA binding profiles from DNA sequence. Beyond achieving experiment-level accuracy, HotProt:

Decodes IDP–DNA regulatory grammar in human SKNMC cells using DisP-seq;
Attributes localized DNA sequence features to intrinsically disordered region (IDR) engagement on hundreds of TF ChIP-seq datasets in HepG2 cells.

HotProt advances research into the structure-informed DNA grammar that IDRs read and that quantitatively modulates protein–DNA binding across the human genome.

Installation

To get started with HotProt, first clone the repository and install the necessary dependencies:

git clone https://github.com/ma-compbio/HotProt.git
cd HotProt
pip install -r requirements.txt
pip install -e .

You may need to install the appropriate CUDA version for PyTorch. See the PyTorch website for installation instructions.

Data preparation

HotProt requires DNA sequences and, optionally, ATAC-seq data, as well as configuration files for model training, evaluation, and motif analysis.

Example configuration files for training and evaluation can be found in the configs directory:

./configs/assay: Config for assays like DisP-seq and ChIP-seq.
./configs/genome: Config for genome assembly (e.g., hg19 and GRCh38).
./configs/model: Config for our HotProt model and BPNet.
./configs/task: Config for training.
./configs/datamodule.yaml: Config for data loading.
./configs/experiment.yaml: Config for seed.
./configs/trainer.yaml: Config for Lightning trainer.

We provide all data necessary for training and evaluation on the SKNMC cell line, excluding the hg19 genome assembly:

Google Drive

The hg19 genome assembly can be downloaded from NCBI.

The complete DisP-seq data can be found in

Usage

Training a model

Training uses Lightning:

python -m hotprot.cli fit \
  -c configs/experiment.yaml \
  -c configs/trainer.yaml \
  -c configs/datamodule.yaml \
  -c configs/assay/disp_seq_sknmc.yaml \
  -c configs/genome/hg19.yaml \
  -c configs/task/disp_seq_sknmc.yaml \
  -c configs/model/hotprot_large.yaml

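The order of the configuration files matters. As is usual for Lightning's CLI, the files are applied in the order given, so a value set in a later file overrides the same value set in an earlier one.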

Feel free to modify the config files or override with command-line arguments.
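As an illustration of why the order matters, here is a minimal sketch of layered configuration merging. It assumes that later files take precedence over earlier ones, which is how Lightning's CLI treats repeated config arguments; the merge_configs helper and the dictionary contents are hypothetical and are not part of HotProt.

```python
from copy import deepcopy


def merge_configs(base, override):
    """Recursively merge `override` into a copy of `base`; `override` wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Scalars and lists are replaced outright by the later config.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for two YAML files (not HotProt's real keys).
trainer_cfg = {"trainer": {"max_epochs": 100, "precision": 32}}
task_cfg = {"trainer": {"max_epochs": 50}}

layered = merge_configs(trainer_cfg, task_cfg)         # task config given last
reversed_order = merge_configs(task_cfg, trainer_cfg)  # trainer config given last

print(layered)
print(reversed_order)
```

Swapping the two files changes which max_epochs value survives, while keys that appear in only one file are unaffected.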

Evaluating a model

python main.py eval \
    --exp_dir <experiment_directory> \
    [--overwrite_atac_paths <optional_atac_paths>]

The eval command evaluates a trained model from the specified experiment_directory (e.g., ./exp/train_2024-10-27_16:02:32) on held-out data. Optional ATAC-seq paths override those set in the config, if any, which is useful for cross-cell-type predictions. An eval.log file will be saved in the specified experiment directory.

Interpreting model predictions

python main.py interpret \
    --split <train|val|test|all> \
    --exp_dir <experiment_directory> \
    --ckpt_name <checkpoint_name> \
    [--seed <random_seed>] \
    [--n_background <number_of_background_sequences>] \
    [--n_samples <number_of_samples>] \
    [--batch_size <batch_size>]

This command generates interpretations (SHAP scores) for model predictions on a specific data split. ckpt_name should be the name of the checkpoint file in the specified experiment_directory (e.g., epoch_100.pth). One-hot encoded sequences and SHAP scores will be saved in the specified experiment directory.
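To make the two saved outputs concrete, the sketch below one-hot encodes a short DNA sequence and reduces a per-channel score matrix to one contribution per base. The A, C, G, T channel order, the channels-by-length layout, and the example scores are assumptions made for illustration; check the saved arrays for the layout HotProt actually uses.

```python
BASES = "ACGT"  # assumed channel order


def one_hot_encode(seq):
    """Return a 4 x L nested list; an unknown base such as N becomes an all-zero column."""
    seq = seq.upper()
    return [[1 if base == b else 0 for base in seq] for b in BASES]


def per_base_contribution(one_hot, scores):
    """Keep only the score of the observed base at each position."""
    length = len(one_hot[0])
    return [
        sum(one_hot[c][j] * scores[c][j] for c in range(len(BASES)))
        for j in range(length)
    ]


encoded = one_hot_encode("ACGN")

# Hypothetical attribution scores with the same 4 x L shape as the encoding.
scores = [
    [0.5, 0.1, 0.0, 0.2],
    [0.0, -0.3, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.0],
    [0.1, 0.1, 0.1, 0.1],
]

contributions = per_base_contribution(encoded, scores)
print(encoded)
print(contributions)
```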

Identifying motifs

python main.py identify \
    --exp_dir <experiment_directory> \
    --split <train|val|test|all> \
    --max_seqlets <max_seqlets_per_metacluster> \
    --window_size <window_size_in_bp> \
    [--n_leiden <number_of_leiden_clusterings>] \
    [--verbose]

Runs TF-MoDISco to identify motifs from model interpretation results on the specified data split. max_seqlets should be sufficiently large to capture all motifs (e.g., 1,000,000). window_size specifies the size of the window around each input sequence (20 kb long by default) to consider, and regions outside this window will be ignored. An HDF5 file containing the identified motifs will be saved in the specified experiment directory.
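The effect of window_size can be pictured as trimming each input to a sub-window before motif discovery. The sketch below assumes the window is centered on the input sequence, which the text above does not state explicitly; central_window and trim are hypothetical helpers, not HotProt functions.

```python
def central_window(seq_length, window_size):
    """Return (start, end) of a window of `window_size` bp centered in the sequence."""
    if window_size > seq_length:
        raise ValueError("window_size cannot exceed the sequence length")
    start = (seq_length - window_size) // 2
    return start, start + window_size


def trim(values, window_size):
    """Keep only the positions that fall inside the central window."""
    start, end = central_window(len(values), window_size)
    return values[start:end]


# A 20 kb input trimmed to a 10 kb window.
bounds = central_window(20000, 10000)
trimmed = trim(list(range(20000)), 10000)
print(bounds, len(trimmed))
```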

Generating a motif report

python main.py report \
    --h5py <path_to_motif_hdf5> \
    --output_dir <output_directory> \
    [--meme_db <path_to_meme_database>] \
    [--n_matches <top_tomtom_matches>]

Generates a motif report, optionally comparing motifs to a MEME database for motif matching. An HTML report will be saved in the specified output directory.

Annotating motif patterns

python main.py annotate \
    --exp_dir <experiment_directory> \
    --split <train|val|test|all> \
    --h5py_path <path_to_motif_hdf5> \
    --annot_yaml_path <path_to_annotation_yaml> \
    --output_dir <output_directory>

Annotates motifs by assigning them names based on their similarity to known motifs. The annotation YAML file should contain the names of the motifs and their corresponding identifiers (keys) in the HTML report. An example annotation YAML file is as follows:

actual_window_size: 20000  # Window size for training
trim_window_size: 10000  # Window size for motif identification
patterns:
  - name: 'Motif 1'
    key: 'pos_patterns.pattern_0'
    is_forward: true
  - name: 'Motif 1'
    key: 'pos_patterns.pattern_1'
    is_forward: false
  - name: 'Motif 2'
    key: 'pos_patterns.pattern_6'
    is_forward: true

Note that:

The name field assigns a name to the motif pattern. Patterns with the same name will be merged into a single motif.
The key field specifies the pattern name in the first column of the HTML report.
The is_forward field states whether the forward pattern shown in the report is the motif's actual forward orientation. Set it to false if you think the reported forward pattern is the reverse complement of the actual forward motif.

Annotated motif instances in YAML format will be saved in the specified output directory.
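The merging and orientation rules above can be sketched as follows. The annotation is written as a Python dictionary mirroring the example YAML so that the snippet needs no YAML parser; group_patterns and orient are hypothetical helpers that only illustrate the semantics of name and is_forward, not how HotProt implements them.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def group_patterns(annotation):
    """Group pattern keys by motif name; patterns sharing a name form one motif."""
    groups = {}
    for pattern in annotation["patterns"]:
        entry = (pattern["key"], pattern["is_forward"])
        groups.setdefault(pattern["name"], []).append(entry)
    return groups


def orient(consensus, is_forward):
    """Flip a reported pattern when it is annotated as the reverse complement."""
    return consensus if is_forward else reverse_complement(consensus)


# Mirrors the example annotation YAML above.
annotation = {
    "patterns": [
        {"name": "Motif 1", "key": "pos_patterns.pattern_0", "is_forward": True},
        {"name": "Motif 1", "key": "pos_patterns.pattern_1", "is_forward": False},
        {"name": "Motif 2", "key": "pos_patterns.pattern_6", "is_forward": True},
    ]
}

groups = group_patterns(annotation)
print(groups)
print(orient("GGAAT", False))  # hypothetical consensus sequence
```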

Calculating motif co-occurrences

python main.py co_occurrence \
    --config_path <path_to_config> \
    --annot_yaml_path <path_to_annotation_yaml> \
    --motif_instances_yaml_path <path_to_motif_instances_yaml> \
    --output_dir <output_directory>

Calculates the co-occurrences between motifs. The configuration file specifies parameters for motif co-occurrence calculation. Here's an example:

num_trials: 100
d_ranges:
  - low: 0
    high: 150
  - low: 150
    high: 300
  - low: 300
    high: 500
  - low: 500
    high: 999999999
seed: 0

num_trials: The number of trials (number of shuffles)
d_ranges: The distance ranges for the calculation
seed: Random seed for reproducibility

The output contingency matrices, in JSON format, will be saved in the specified output directory.
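For intuition about what num_trials, d_ranges, and seed control, here is a minimal sketch that bins pairwise distances between two motifs' instances and builds a shuffled null for comparison. It is not HotProt's implementation: treating each range as low-inclusive and high-exclusive, redrawing positions uniformly as the shuffle, and the example positions are all assumptions.

```python
import random

D_RANGES = [(0, 150), (150, 300), (300, 500), (500, 999999999)]


def count_pairs_by_range(positions_a, positions_b, d_ranges):
    """Count (a, b) pairs whose distance falls in each [low, high) range."""
    counts = [0] * len(d_ranges)
    for a in positions_a:
        for b in positions_b:
            distance = abs(a - b)
            for i, (low, high) in enumerate(d_ranges):
                if low <= distance < high:
                    counts[i] += 1
                    break
    return counts


def shuffled_null(positions_a, positions_b, d_ranges, seq_length, num_trials, seed):
    """Repeat the count after redrawing motif B positions uniformly at random."""
    rng = random.Random(seed)  # a fixed seed makes the null reproducible
    null = []
    for _ in range(num_trials):
        shuffled_b = [rng.randrange(seq_length) for _ in positions_b]
        null.append(count_pairs_by_range(positions_a, shuffled_b, d_ranges))
    return null


# Hypothetical motif positions within one 20 kb sequence.
positions_a = [100, 1000]
positions_b = [180, 1400, 5000]

observed = count_pairs_by_range(positions_a, positions_b, D_RANGES)
null = shuffled_null(positions_a, positions_b, D_RANGES, 20000, 100, 0)
print(observed)
```

Comparing the observed counts in each range against the distribution of shuffled counts indicates whether two motifs co-occur at a given spacing more often than expected by chance.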

Analyzing strand orientation preferences

python main.py preference \
    --config_path <path_to_config> \
    --annot_yaml_path <path_to_annotation_yaml> \
    --motif_instances_yaml_path <path_to_motif_instances_yaml> \
    --output_dir <output_directory>

Determines motif binding preferences for strand orientations. The configuration file has the exact same format as the motif co-occurrence calculation configuration file. However, you may want to specify different, finer-grained distance ranges for this analysis.

The output, also in JSON format, will be saved in the specified output directory. Note that:

An orientation of 1 stands for the forward strand, and 0 stands for the reverse strand.
shuffled_counts of shape (num_trials, 2) are counts within and outside a particular distance range for each shuffled version of the motif instances.
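One way to use shuffled_counts is to turn them into an empirical p-value for an observed count. The sketch below assumes the first column holds the within-range count and the second the outside-range count, and that the observed within-range count is read separately from the output; the numbers are made up for illustration.

```python
def empirical_p_value(observed_within, shuffled_counts):
    """One-sided empirical p-value with the usual +1 correction."""
    n_extreme = sum(
        1 for within, _outside in shuffled_counts if within >= observed_within
    )
    return (n_extreme + 1) / (len(shuffled_counts) + 1)


# Hypothetical shuffled_counts with num_trials = 4: [within, outside] per trial.
shuffled_counts = [[3, 7], [5, 5], [2, 8], [4, 6]]

p_value = empirical_p_value(5, shuffled_counts)
print(p_value)
```

The +1 in the numerator and denominator keeps the estimate away from exactly zero when no shuffle reaches the observed count.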
