This repository contains the code developed for the manuscript "VarPPUD: predicting variant pathogenicity in the undiagnosed disease patients".
- Access to the Undiagnosed Disease Network database requires approval.
- The data were accessed via PIC-SURE.
Code/: files that implement the framework
- PIC-SURE/
HPDS_connection_manager.py: builds the connection to PIC-SURE
utils.py: functions for accessing data through PIC-SURE
- preprocess/
data_process.py: data preprocessing and cleaning based on inclusion criteria
feature_data_imputation.py: data imputation methods
data_generation.py: synthetic data generation using CTGAN with constraints
- feature/
feature_generation_gene.py: features generated based on genes
feature_generation_protein.py: features generated based on protein variants
feature_generation_variant.py: features generated based on nucleotide variants
- model/
model.py: functions to implement prediction and plot ROC and PR curves
main.py: loads the data and runs the code for predicting variant pathogenicity
method_comparison.py: prediction results using other state-of-the-art methods
- analysis/
statistics.py: statistical analysis of the included patients' data
visualization.py: figure visualizations
Data/: raw and intermediate data used in the work
- raw/: the raw data are available upon request; the data were accessed via PIC-SURE
- database/: databases used to generate features
- feature/: numerical feature representations
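Before running any script, it can help to confirm that the data layout described above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_data_dirs` is hypothetical, and it assumes the `Data/` directory sits at the repository root with the three subdirectories listed above.

```python
from pathlib import Path

# Subdirectory names taken from the Data/ layout described above.
EXPECTED_SUBDIRS = ("raw", "database", "feature")


def missing_data_dirs(data_root):
    """Return the expected subdirectories that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "Data" is assumed to be relative to the current working directory.
    missing = missing_data_dirs("Data")
    if missing:
        print("Missing data directories:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

An empty list means all three directories exist; it says nothing about whether the files inside them are complete.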
- Clone the repository: git clone https://github.com/hms-dbmi/VarPPUD
- Update the directory paths in all code files to point to the location of your data
- Run data_process.py to curate the included patients' information
- Generate and concatenate different features through feature_generation_gene.py, feature_generation_protein.py and feature_generation_variant.py
- Generate synthetic data using data_generation.py for external validation of the model
- Run main.py to predict the variant pathogenicity of undiagnosed patients
- Run statistics.py and visualization.py for statistical analysis and visualization of input data and results
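The steps above can be sketched as a small driver script. This is an illustration under stated assumptions rather than a script shipped with the repository: the names `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical, the subdirectory of each script is inferred from the Code/ layout, and each script is assumed to run without command-line arguments once its data paths have been updated.

```python
import subprocess
import sys
from pathlib import Path

# Script order follows the steps listed above. The subdirectory of each
# script is inferred from the Code/ layout and is an assumption about the
# actual checkout.
PIPELINE = [
    "preprocess/data_process.py",
    "feature/feature_generation_gene.py",
    "feature/feature_generation_protein.py",
    "feature/feature_generation_variant.py",
    "preprocess/data_generation.py",
    "model/main.py",
    "analysis/statistics.py",
    "analysis/visualization.py",
]


def build_commands(code_dir="Code", python=sys.executable):
    """Return one [interpreter, script_path] command per pipeline step."""
    code_dir = Path(code_dir)
    return [[python, str(code_dir / script)] for script in PIPELINE]


def run_pipeline(code_dir="Code", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(code_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the steps in order; keeping the dry run as the default makes it easy to check the paths first, since the real scripts need approved access to the data.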
This code supports the analysis presented in: “VarPPUD: predicting variant pathogenicity in the undiagnosed disease patients”.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.