HotProt: Genome-wide interactions of intrinsically disordered proteins with DNA reveal a regulatory grammar
HotProt is an interpretable deep learning model for predicting base-resolution intrinsically disordered protein (IDP)–DNA binding profiles from DNA sequence. Beyond achieving experiment-level accuracy, HotProt:
- Decodes IDP–DNA regulatory grammar in human SKNMC cells using DisP-seq;
- Attributes localized DNA sequence features to intrinsically disordered region (IDR) engagement on hundreds of TF ChIP-seq datasets in HepG2 cells.
HotProt advances research into the structure-informed DNA grammar that IDRs read and that quantitatively modulates protein–DNA binding across the human genome.
To get started with HotProt, first clone the repository and install the necessary dependencies:
git clone https://github.com/ma-compbio/HotProt.git
cd HotProt
pip install -r requirements.txt
pip install -e .
You may need to install the appropriate CUDA version for PyTorch. See the PyTorch website for installation instructions.
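After installation, a quick check can confirm whether the installed PyTorch build sees a CUDA device. This is a minimal sketch, not part of HotProt; the function name describe_torch_cuda is hypothetical, and only standard PyTorch attributes are used.

```python
import importlib.util


def describe_torch_cuda() -> str:
    """Return a one-line summary of the local PyTorch/CUDA setup."""
    # Avoid an ImportError when PyTorch has not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version the installed build targets.
        return f"PyTorch {torch.__version__} with CUDA {torch.version.cuda}"
    return f"PyTorch {torch.__version__} without a usable CUDA device"


if __name__ == "__main__":
    print(describe_torch_cuda())
```

If the summary reports no usable CUDA device on a GPU machine, reinstalling PyTorch with a build that matches the local driver is the usual fix.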
HotProt requires DNA sequences and, optionally, ATAC-seq data, as well as configuration files for model training, evaluation, and motif analysis.
Example configuration files for training and evaluation can be found in the configs directory:
- ./configs/assay: Config for assays like DisP-seq and ChIP-seq.
- ./configs/genome: Config for genome assembly (e.g., hg19 and GRCh38).
- ./configs/model: Config for our HotProt model and BPNet.
- ./configs/task: Config for training.
- ./configs/datamodule.yaml: Config for data loading.
- ./configs/experiment.yaml: Config for seed.
- ./configs/trainer.yaml: Config for Lightning trainer.
We provide all data necessary for training and evaluation on the SKNMC cell line, excluding the hg19 genome assembly:
The hg19 genome assembly can be downloaded from NCBI.
The complete DisP-seq data can be found in
Training uses Lightning:
python -m hotprot.cli fit \
-c configs/experiment.yaml \
-c configs/trainer.yaml \
-c configs/datamodule.yaml \
-c configs/assay/disp_seq_sknmc.yaml \
-c configs/genome/hg19.yaml \
-c configs/task/disp_seq_sknmc.yaml \
-c configs/model/hotprot_large.yaml
The order of the configuration files matters.
Feel free to modify the config files or override with command-line arguments.
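To see why the order matters, the sketch below merges nested dictionaries from left to right so that later values override earlier ones. It assumes the CLI composes config files this way, as Lightning-style CLIs typically do; the keys and values are hypothetical stand-ins, not the contents of the actual YAML files, and the real parser may treat some fields differently.

```python
from copy import deepcopy


def merge_configs(*configs: dict) -> dict:
    """Merge nested dicts left to right; later values override earlier ones."""
    merged: dict = {}
    for config in configs:
        _merge_into(merged, config)
    return merged


def _merge_into(target: dict, source: dict) -> None:
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Both sides are mappings: merge them key by key.
            _merge_into(target[key], value)
        else:
            # Otherwise the later value replaces the earlier one.
            target[key] = deepcopy(value)


# Hypothetical stand-ins for two config files (illustrative keys only).
trainer_cfg = {"trainer": {"max_epochs": 100, "precision": 32}}
model_cfg = {"trainer": {"max_epochs": 50}, "model": {"name": "hotprot_large"}}

merged = merge_configs(trainer_cfg, model_cfg)   # model_cfg wins on conflicts
swapped = merge_configs(model_cfg, trainer_cfg)  # trainer_cfg wins on conflicts

if __name__ == "__main__":
    print(merged["trainer"]["max_epochs"], swapped["trainer"]["max_epochs"])
```

Swapping the two files changes which max_epochs value survives, which is the practical reason to keep the documented order.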
python main.py eval \
--exp_dir <experiment_directory> \
[--overwrite_atac_paths <optional_atac_paths>]
The eval command evaluates, on held-out data, a trained model in the specified experiment_directory (e.g., ./exp/train_2024-10-27_16:02:32). Optional ATAC-seq paths can override those set, if any, in the config (useful for cross-cell type predictions). An eval.log file will be saved in the specified experiment directory.
python main.py interpret \
--split <train|val|test|all> \
--exp_dir <experiment_directory> \
--ckpt_name <checkpoint_name> \
[--seed <random_seed>] \
[--n_background <number_of_background_sequences>] \
[--n_samples <number_of_samples>] \
[--batch_size <batch_size>]
This command generates interpretations (SHAP scores) for model predictions on a specific data split. ckpt_name should be the name of the checkpoint file in the specified experiment_directory (e.g., epoch_100.pth). One-hot encoded sequences and SHAP scores will be saved in the specified experiment directory.
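For readers unfamiliar with how one-hot sequences and per-channel attribution scores relate, here is a minimal sketch in plain Python. The A, C, G, T channel order, the all-zero encoding of N, and the score values are assumptions made for illustration; the layout of the files HotProt actually saves is not described here.

```python
BASES = "ACGT"  # assumed channel order; the saved arrays may use another


def one_hot(seq: str) -> list:
    """Encode a DNA string as rows of 4 floats; unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def per_base_attribution(onehot: list, scores: list) -> list:
    """Keep, at each position, only the score of the base that is present."""
    return [
        sum(x * s for x, s in zip(oh_row, score_row))
        for oh_row, score_row in zip(onehot, scores)
    ]


# Made-up scores for a 4 bp sequence, one row per position, one column per base.
example_seq = "ACGN"
example_scores = [
    [0.5, -0.1, 0.0, 0.0],
    [0.2, 0.3, 0.0, 0.0],
    [0.0, 0.0, -0.4, 0.1],
    [0.9, 0.9, 0.9, 0.9],
]
attribution = per_base_attribution(one_hot(example_seq), example_scores)

if __name__ == "__main__":
    print(attribution)
```

Multiplying by the one-hot matrix discards scores for bases that are not in the sequence, so the masked position N contributes nothing.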
python main.py identify \
--exp_dir <experiment_directory> \
--split <train|val|test|all> \
--max_seqlets <max_seqlets_per_metacluster> \
--window_size <window_size_in_bp> \
[--n_leiden <number_of_leiden_clusterings>] \
[--verbose]
Runs TF-MoDISco to identify motifs from model interpretation results on the specified data split. max_seqlets should be sufficiently large to capture all motifs (e.g., 1,000,000). window_size specifies the size of the window around each input sequence (20 kb long by default) to consider, and regions outside this window will be ignored. An HDF5 file containing the identified motifs will be saved in the specified experiment directory.
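The effect of window_size can be pictured as index arithmetic. The sketch below assumes the window is centered on each input sequence, which is an assumption on our part; the helper name centered_window is hypothetical.

```python
from typing import Tuple


def centered_window(actual_size: int, window_size: int) -> Tuple[int, int]:
    """Return the [start, end) indices of a window centered in the sequence."""
    if not 0 < window_size <= actual_size:
        raise ValueError("window_size must be in (0, actual_size]")
    start = (actual_size - window_size) // 2
    return start, start + window_size


# A 10 kb window inside the default 20 kb input keeps positions 5000..14999.
start, end = centered_window(20_000, 10_000)

if __name__ == "__main__":
    print(start, end)
```

Under this assumption, anything in the first and last 5 kb of each 20 kb sequence would be ignored during motif identification.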
python main.py report \
--h5py <path_to_motif_hdf5> \
--output_dir <output_directory> \
[--meme_db <path_to_meme_database>] \
[--n_matches <top_tomtom_matches>]
Generates a motif report, optionally comparing motifs to a MEME database for motif matching. An HTML report will be saved in the specified output directory.
python main.py annotate \
--exp_dir <experiment_directory> \
--split <train|val|test|all> \
--h5py_path <path_to_motif_hdf5> \
--annot_yaml_path <path_to_annotation_yaml> \
--output_dir <output_directory>
Annotates motifs by assigning names to motifs based on their similarity to known motifs. The annotation YAML file should contain the names of the motifs and their corresponding identifiers (keys) in the HTML report. An example annotation YAML file is as follows:
actual_window_size: 20000 # Window size for training
trim_window_size: 10000 # Window size for motif identification
patterns:
  - name: 'Motif 1'
    key: 'pos_patterns.pattern_0'
    is_forward: true
  - name: 'Motif 1'
    key: 'pos_patterns.pattern_1'
    is_forward: false
  - name: 'Motif 2'
    key: 'pos_patterns.pattern_6'
    is_forward: true
Note that:
- The name field assigns a name to the motif pattern. Patterns with the same name will be merged into a single motif.
- The key field specifies the pattern name in the first column of the HTML report.
- The is_forward field specifies whether you think the reported forward motif pattern is the actual forward motif pattern. Set this to false if you think the reported forward motif pattern is the reverse complement of the actual forward motif pattern.
Annotated motif instances in YAML format will be saved in the specified output directory.
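The merging and orientation rules described for the annotation file can be illustrated with a short sketch. The pattern entries mirror the example annotation file; the helper names and the consensus string are hypothetical, and this is not HotProt's own implementation.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def group_by_name(patterns: list) -> dict:
    """Collect (key, is_forward) pairs under each shared motif name."""
    merged = defaultdict(list)
    for pattern in patterns:
        merged[pattern["name"]].append((pattern["key"], pattern["is_forward"]))
    return dict(merged)


def orient(consensus: str, is_forward: bool) -> str:
    """Flip a reported consensus when it was annotated as the reverse strand."""
    return consensus if is_forward else reverse_complement(consensus)


# Same entries as the example annotation file, written as Python dicts.
patterns = [
    {"name": "Motif 1", "key": "pos_patterns.pattern_0", "is_forward": True},
    {"name": "Motif 1", "key": "pos_patterns.pattern_1", "is_forward": False},
    {"name": "Motif 2", "key": "pos_patterns.pattern_6", "is_forward": True},
]
groups = group_by_name(patterns)

if __name__ == "__main__":
    print(groups)
    print(orient("GGAAT", False))  # hypothetical consensus string
```

Here the two patterns named Motif 1 end up in one group, and the second is read in reverse-complement orientation before merging.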
python main.py co_occurrence \
--config_path <path_to_config> \
--annot_yaml_path <path_to_annotation_yaml> \
--motif_instances_yaml_path <path_to_motif_instances_yaml> \
--output_dir <output_directory>
Calculates the co-occurrences between motifs. The configuration file specifies parameters for motif co-occurrence calculation. Here's an example:
num_trials: 100
d_ranges:
  - low: 0
    high: 150
  - low: 150
    high: 300
  - low: 300
    high: 500
  - low: 500
    high: 999999999
seed: 0
- num_trials: The number of trials (number of shuffles)
- d_ranges: The distance ranges for the calculation
- seed: Random seed for reproducibility
The output contingency matrices, in JSON format, will be saved in the specified output directory.
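As an illustration of how d_ranges partitions motif pairs, the sketch below bins pairwise distances between two sets of motif positions. It treats each range as half-open, [low, high), which is an assumption; the positions are made up, and the shuffling controlled by num_trials and seed is left out.

```python
# Distance ranges copied from the example configuration, as (low, high) pairs.
D_RANGES = [(0, 150), (150, 300), (300, 500), (500, 999999999)]


def range_index(distance: int, d_ranges: list):
    """Return the index of the range containing the distance, or None."""
    for index, (low, high) in enumerate(d_ranges):
        if low <= distance < high:  # assumed half-open interval
            return index
    return None


def count_pairs_by_range(positions_a: list, positions_b: list, d_ranges: list) -> list:
    """Count motif pairs (a, b) whose absolute distance falls in each range."""
    counts = [0] * len(d_ranges)
    for a in positions_a:
        for b in positions_b:
            index = range_index(abs(a - b), d_ranges)
            if index is not None:
                counts[index] += 1
    return counts


# Made-up motif positions (bp) within one sequence.
pair_counts = count_pairs_by_range([100, 1000], [180, 1400, 5000], D_RANGES)

if __name__ == "__main__":
    print(pair_counts)
```

Repeating the same count on shuffled motif instances would give the background against which observed counts are compared.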
python main.py preference \
--config_path <path_to_config> \
--annot_yaml_path <path_to_annotation_yaml> \
--motif_instances_yaml_path <path_to_motif_instances_yaml> \
--output_dir <output_directory>
Determines motif binding preferences for strand orientations. The configuration file has the exact same format as the motif co-occurrence calculation configuration file. However, you may want to specify different, finer-grained distance ranges for this analysis.
The output, also in JSON format, will be saved in the specified output directory. Note that:
- An orientation of 1 stands for the forward strand, and 0 stands for the reverse strand.
- shuffled_counts of shape (num_trials, 2) are counts within and outside a particular distance range for each shuffled version of the motif instances.
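One way to use shuffled_counts is as a permutation-style background for the observed counts. The sketch below computes an empirical p-value with the standard add-one correction; the observed and shuffled values are made up, and neither this statistic nor the helper names are confirmed to match HotProt's own analysis.

```python
def within_fraction(within: int, outside: int) -> float:
    """Fraction of counts that fall within the distance range."""
    total = within + outside
    return within / total if total else 0.0


def empirical_p_value(observed: tuple, shuffled_counts: list) -> float:
    """Share of shuffles at least as enriched as the observation (add-one)."""
    observed_fraction = within_fraction(*observed)
    at_least = sum(
        1
        for within, outside in shuffled_counts
        if within_fraction(within, outside) >= observed_fraction
    )
    return (1 + at_least) / (1 + len(shuffled_counts))


# Made-up example: 4 shuffles, each a (within, outside) pair.
observed = (30, 70)
shuffled = [(10, 90), (12, 88), (31, 69), (9, 91)]
p_value = empirical_p_value(observed, shuffled)

if __name__ == "__main__":
    print(p_value)
```

With only a few shuffles the smallest reachable p-value is 1 / (1 + num_trials), so a larger num_trials gives finer resolution.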