dataset: synthetic from PANTHER #25
Open
tristan-f-r wants to merge 45 commits into main from synthetic
Commits (all authored by tristan-f-r):
20b1580 feat: synthetic pathways
fc12b4e Merge branch 'main' into synthetic
8ff381f Merge branch 'main' into synthetic
f7c0c2d fix: use full protein links to unify synthetic with databases
73b6d93 Merge branch 'main' into synthetic
2ce621a re-correct links
280b92a fix: interactome fetching
db30556 fix(diseases): fetch correct string links
0658528 chore: mv to scripts
e024e2c chore: move to scripts, Pathify
7b09381 style: fmt
2a5feec drop old thresholding
e389b32 begin sampling
af0ac30 chore: mv
d1ade54 rename
7483eea fix: compute weight counts normally
05cf6d6 feat: weight-preserving sampling
58e9717 feat: sampling
a0f7079 feat: scripted sampling
3bb00e8 chore: del some raw
775d144 drop all raw interactome files
5771bc7 feat: finish up tf mapping again
813235d feat: sampling on a pathway
7fb4642 style: fmt
83fee81 chore: drop p38 mapk, add notes
d7da699 init candidates explorer
d45ec82 chore: update directory urls
0d3b77e chore: drop all downloaded pathways
751a8f2 fix: file extensions and such
2fceaa9 chore: explore and such
ac5b93c feat: base thresholding workflow
5cb7352 chore: add paxtools
9d3e194 feat: trimming
81a4e4e style: fmt
d2cc7e4 feat: full interactome parsing
38aef2c refactor: isolate argparse parser
a881afd docs: suggestion
5e9653a Merge branch 'main' into synthetic
db5a09e reorganize, begin using owl file
310db00 add pathways.txt.gz
3a0b7df feat: automatically identify pathway ids
bcc9db8 mv
df30f24 chore: mv to synthetic_data
3adc508 feat: pc owl artifact gen!!
009a00b feat: all the fetching we will ever need
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| /intermediate | ||
| /processed | ||
| /raw |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| # Synthetic Data | ||
|
|
||
| ## PANTHER Pathway Fetching | ||
|
|
||
| This dataset depends on a 'sub'-dataset: a separate Snakemake workflow | ||
| that generates the pathway files and their associated metadata for use in this one. | ||
|
|
||
| Located under `./panther_pathways`, it provides TODO. | ||
|
|
||
| ## Download New PANTHER Pathways | ||
| 1. Visit [Pathway Commons](https://www.pathwaycommons.org/). | ||
| 2. Search for the desired pathway (e.g., "signaling") and filter the results by the **PANTHER pathway** data source. | ||
| Example: [Search for "Signaling" filtered by PANTHER pathway](https://apps.pathwaycommons.org/search?datasource=panther&q=Signaling&type=Pathway) | ||
| 3. Click on the desired pathway and download the **Extended SIF** version of the pathway. | ||
| 4. In the `raw/pathway-data/` folder, create a new subfolder named after the pathway you downloaded. | ||
| 5. Move the downloaded Extended SIF file to this new folder (as a `.txt` file). Rename the file to match the subfolder name exactly. | ||
|
|
||
| ## Sources and Targets | ||
|
|
||
| [Sources](http://wlab.ethz.ch/surfaceome/), or `table_S3_surfaceome.xlsx` (see [original paper](https://doi.org/10.1073/pnas.1808790115)), | ||
| are receptors from the in silico human surfaceome. | ||
|
|
||
| [Targets](https://guolab.wchscu.cn/AnimalTFDB4//#/), or `Homo_sapiens_TF.tsv` (see [original paper](https://doi.org/10.1093/nar/gkac907)), | ||
| are human transcription factors. | ||
|
|
||
| ## Steps to Generate SPRAS-Compatible Pathways | ||
|
|
||
| This entire workflow can also be done with `uv run snakemake --cores 1` inside this directory. | ||
|
|
||
| ### 1. Process PANTHER Pathways | ||
|
|
||
| 1. Open `Snakefile` and add the name of any new pathways to the `pathways` entry. | ||
| 2. Run the command: | ||
| ```sh | ||
| uv run scripts/process_panther_pathway.py <pathway> | ||
| ``` | ||
| 3. This will create five new files in the `intermediate/<pathway>/` directory (the outputs declared by the `process_panther_pathway` rule in the `Snakefile`): | ||
| - `edges.txt` | ||
| - `nodes.txt` | ||
| - `prizes.txt` | ||
| - `sources.txt` | ||
| - `targets.txt` | ||
|
|
||
| ### 2. Convert Pathways to SPRAS-Compatible Format | ||
| 1. In `panther_spras_formatting.py`, add the name of any new pathways to the `pathway_dirs` list on **line 8**. | ||
| 2. From the `synthetic_data/` directory, run the command: | ||
| ```sh | ||
| python scripts/panther_spras_formatting.py | ||
| ``` | ||
| 3. This will create a new folder named `spras-compatible-pathway-data`, containing subfolders for each PANTHER pathway in SPRAS-compatible format. | ||
| Each subfolder will include the following three files: | ||
| - `<pathway_name>_gs_edges.txt` | ||
| - `<pathway_name>_gs_nodes.txt` | ||
| - `<pathway_name>_node_prizes.txt` | ||
|
|
||
| ## Pilot Data | ||
| For the pilot data, use the list `["Wnt_signaling", "JAK_STAT_signaling", "Interferon_gamma_signaling", "FGF_signaling", "Ras"]` in both: | ||
| - the list in `combine.py` | ||
| - the list in `overlap_analytics.py` | ||
|
|
||
| Make sure the same pathways, `["Wnt_signaling", "JAK_STAT_signaling", "Interferon_gamma_signaling", "FGF_signaling", "Ras"]`, are also added to: | ||
| - the `pathways` vector in `ProcessPantherPathway.R` | ||
| - the list in `panther_spras_formatting.py` | ||
|
|
||
| **Once you’ve updated the pathway lists in all relevant scripts, run all the steps above to generate the Pilot dataset.** |
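Steps 4 and 5 of "Download New PANTHER Pathways" in the README above can be scripted. The following is a minimal sketch and not part of this PR: `install_pathway_file` is a hypothetical helper, and the layout `raw/pathway-data/<name>/<name>.txt` is inferred from the README wording (the `Snakefile` in this PR expects `raw/pathway-data/{pathway}.txt` instead, so adjust the target path to whichever layout the workflow settles on).

```python
import shutil
from pathlib import Path


def install_pathway_file(downloaded: Path, pathway_name: str, raw_dir: Path) -> Path:
    """Hypothetical helper: move a downloaded Extended SIF file to
    raw_dir/pathway-data/<pathway_name>/<pathway_name>.txt."""
    # Step 4: create a subfolder named after the pathway.
    target_dir = raw_dir / "pathway-data" / pathway_name
    target_dir.mkdir(parents=True, exist_ok=True)

    # Step 5: move the file there, renamed to match the subfolder, as .txt.
    target = target_dir / f"{pathway_name}.txt"
    shutil.move(str(downloaded), str(target))
    return target
```

A call such as `install_pathway_file(Path("~/Downloads/wnt.sif").expanduser(), "Wnt_signaling", Path("raw"))` would then leave the file where step 5 asks for it; the download path here is only an example.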
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| include: "../../cache/Snakefile" | ||
| from jsonc_parser.parser import JsoncParser | ||
|
|
||
| pathways = JsoncParser.parse_file("pathways.jsonc") | ||
|
|
||
| # TODO: deduplicate from sampling.py | ||
| thresholds = [str((x + 1) / 10) for x in range(10)] | ||
|
|
||
| rule all: | ||
| input: | ||
| "raw/9606.protein.links.full.v12.0.txt", | ||
| expand([ | ||
| "thresholded/{threshold}/{pathway}/interactome.txt", | ||
| "thresholded/{threshold}/{pathway}/gold_standard_edges.txt", | ||
| ], pathway=pathways, threshold=thresholds) | ||
|
|
||
| produce_fetch_rules({ | ||
| "raw/9606.protein.links.full.v12.0.txt": FetchConfig(["STRING", "9606", "9606.protein.links.full.txt.gz"], uncompress=True), | ||
| "raw/human-interactome/table_S3_surfaceome.xlsx": ["Surfaceome", "table_S3_surfaceome.xlsx"], | ||
| "raw/human-interactome/Homo_sapiens_TF.tsv": ["TranscriptionFactors", "Homo_sapiens_TF.tsv"], | ||
| "raw/human-interactome/HUMAN_9606_idmapping_selected.tsv": FetchConfig(["UniProt", "9606", "HUMAN_9606_idmapping_selected.tab.gz"], uncompress=True), | ||
| "raw/pc-panther-biopax.owl": ["PathwayCommons", "intermediate", "pc-panther-biopax.owl"], | ||
| "raw/denylist.txt": ["PathwayCommons", "denylist.txt"], | ||
| "raw/pathways.txt": FetchConfig(["PathwayCommons", "pathways.txt.gz"], uncompress=True) | ||
| }) | ||
|
|
||
| rule interactome: | ||
| input: | ||
| "raw/9606.protein.links.full.v12.0.txt", | ||
| "raw/9606.protein.aliases.txt" | ||
| output: | ||
| "processed/proteins_missing_aliases.csv", | ||
| "processed/removed_edges.txt", | ||
| "processed/interactome.tsv" | ||
| shell: | ||
| "uv run scripts/interactome.py" | ||
|
|
||
| rule process_tfs: | ||
| input: | ||
| "raw/human-interactome/Homo_sapiens_TF.tsv", | ||
| "raw/human-interactome/HUMAN_9606_idmapping_selected.tsv" | ||
| output: | ||
| "raw/human-interactome/Homo_sapiens_TF_Uniprot.tsv" | ||
| shell: | ||
| "uv run scripts/map_transcription_factors.py" | ||
|
|
||
| rule process_panther_pathway: | ||
| input: | ||
| "raw/pathway-data/{pathway}.txt", | ||
| "raw/human-interactome/table_S3_surfaceome.xlsx", | ||
| "raw/human-interactome/Homo_sapiens_TF_Uniprot.tsv" | ||
| output: | ||
| "intermediate/{pathway}/edges.txt", | ||
| "intermediate/{pathway}/nodes.txt", | ||
| "intermediate/{pathway}/sources.txt", | ||
| "intermediate/{pathway}/targets.txt", | ||
| "intermediate/{pathway}/prizes.txt" | ||
| shell: | ||
| "uv run scripts/process_panther_pathway.py {wildcards.pathway}" | ||
|
|
||
| rule make_spras_compatible: | ||
| input: | ||
| "intermediate/{pathway}/edges.txt", | ||
| "intermediate/{pathway}/nodes.txt", | ||
| "intermediate/{pathway}/sources.txt", | ||
| "intermediate/{pathway}/targets.txt", | ||
| "intermediate/{pathway}/prizes.txt" | ||
| output: | ||
| "processed/{pathway}/{pathway}_node_prizes.txt", | ||
| "processed/{pathway}/{pathway}_gs_edges.txt", | ||
| "processed/{pathway}/{pathway}_gs_nodes.txt" | ||
| shell: | ||
| "uv run scripts/panther_spras_formatting.py {wildcards.pathway}" | ||
|
|
||
| rule threshold: | ||
| input: | ||
| "processed/{pathway}/{pathway}_node_prizes.txt", | ||
| "processed/{pathway}/{pathway}_gs_edges.txt" | ||
| output: | ||
| expand("thresholded/{threshold}/{{pathway}}/interactome.txt", threshold=thresholds), | ||
| expand("thresholded/{threshold}/{{pathway}}/gold_standard_edges.txt", threshold=thresholds) | ||
| shell: | ||
| "uv run scripts/sampling.py {wildcards.pathway}" |
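To see which targets `rule all` requests, the `thresholds` expression and the `expand(...)` call can be reproduced in plain Python. This is an illustrative sketch only: it uses `itertools.product` as a stand-in for Snakemake's `expand`, which should yield the same set of paths (ordering aside), and the two pathway names are a sample rather than the full list from `pathways.jsonc`.

```python
from itertools import product

# Same values as the `thresholds` expression in the Snakefile above.
thresholds = [str((x + 1) / 10) for x in range(10)]

# Two illustrative names; the real list is read from pathways.jsonc.
pathways = ["Wnt signaling pathway", "Ras Pathway"]

templates = [
    "thresholded/{threshold}/{pathway}/interactome.txt",
    "thresholded/{threshold}/{pathway}/gold_standard_edges.txt",
]

# One path per (template, pathway, threshold) combination.
targets = [
    template.format(threshold=t, pathway=p)
    for template, p, t in product(templates, pathways, thresholds)
]

print(thresholds)
print(len(targets))
```

Each pathway contributes 20 target files (2 templates × 10 thresholds), so the number of targets grows linearly with the pathway list.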
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| /raw | ||
| /intermediate | ||
| /output |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| include: "../../../cache/Snakefile" | ||
|
|
||
| rule all: | ||
| input: | ||
| "output/pc-panther-biopax.owl" | ||
|
|
||
| produce_fetch_rules({ | ||
| "raw/pc-biopax.owl": FetchConfig(["PathwayCommons", "pc-biopax.owl.gz"], uncompress=True), | ||
| "raw/pathways.txt": FetchConfig(["PathwayCommons", "pathways.txt.gz"], uncompress=True) | ||
| }) | ||
|
|
||
| rule fetch_from_owl: | ||
| input: | ||
| "raw/pc-biopax.owl", | ||
| "raw/pathways.txt" | ||
| output: | ||
| "output/pc-panther-biopax.owl" | ||
| shell: | ||
| "uv run fetch_from_owl.py" |
datasets/synthetic_data/panther_pathways/fetch_from_owl.py (19 additions & 0 deletions)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| from pathlib import Path | ||
| from paxtools.fetch import fetch | ||
| from datasets.synthetic_data.util.parse_pc_pathways import parse_pc_pathways | ||
|
|
||
| current_directory = Path(__file__).parent.resolve() | ||
|
|
||
| def main(): | ||
| pathways_df = parse_pc_pathways(current_directory / 'raw' / 'pathways.txt') | ||
| print("Fetching pathways... [This may take some time. On the author's desktop machine, it took 15 minutes.]") | ||
| (current_directory / 'output').mkdir(exist_ok=True) | ||
| fetch( | ||
| current_directory / 'raw' / 'pc-biopax.owl', | ||
| output=(current_directory / 'output' / "pc-panther-biopax.owl"), | ||
| uris=list(pathways_df["PATHWAY_URI"]), | ||
| memory=f"{2**(16 - 1)}m" # 2**15 = 32768 MB, i.e. 32 GB of memory. This is why we don't run this in CI! | ||
| ) | ||
|
|
||
| if __name__ == "__main__": | ||
| main() |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| [ | ||
| // "CCKR signaling map", TODO: not a valid pathway name? | ||
| "Wnt signaling pathway", | ||
| "VEGF signaling pathway", | ||
| "Toll receptor signaling pathway", | ||
| "TGF-beta signaling pathway", | ||
| "PDGF signaling pathway", | ||
| "Notch signaling pathway", | ||
| "JAK/STAT signaling pathway", | ||
| "Interleukin signaling pathway", | ||
| "Interferon-gamma signaling pathway", | ||
| "Integrin signalling pathway", | ||
| "Insulin/IGF pathway-protein kinase B signaling cascade", | ||
| "Inflammation mediated by chemokine and cytokine signaling pathway", | ||
| "Hedgehog signaling pathway", | ||
| "FGF signaling pathway", | ||
| "FAS signaling pathway", | ||
| // "Endothelin signaling pathway", TODO: not a valid pathway name? | ||
| "EGF receptor signaling pathway", | ||
| "Cadherin signaling pathway", | ||
| "Apoptosis signaling pathway", | ||
| "Ras Pathway", | ||
| "PI3 kinase pathway", | ||
| "p38 MAPK pathway", | ||
| "Insulin/IGF pathway-mitogen activated protein kinase kinase/MAP kinase cascade", | ||
| "p53 pathway", | ||
| "Hypoxia response via HIF activation", | ||
| "Oxidative stress response", | ||
| "B cell activation", | ||
| "T cell activation" | ||
| ] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| import argparse | ||
| from pathlib import Path | ||
|
|
||
| import pandas | ||
| from paxtools.fetch import fetch | ||
| from paxtools.sif import toSIF | ||
|
|
||
| synthetic_directory = Path(__file__).parent.parent.resolve() | ||
|
|
||
| def parser(): | ||
| parser = argparse.ArgumentParser(prog="PANTHER pathway fetcher") | ||
|
|
||
| parser.add_argument("pathway_name", type=str) | ||
|
|
||
| return parser | ||
|
|
||
| def main(): | ||
| args = parser().parse_args() | ||
| curated_pathways_df = pandas.read_csv(synthetic_directory / 'intermediate' / 'curated_pathways.tsv', sep='\t') | ||
| associated_id = curated_pathways_df.loc[curated_pathways_df["Name"] == args.pathway_name].reset_index(drop=True).loc[0]["ID"] | ||
|
|
||
| pathway_data_dir = synthetic_directory / 'intermediate' / 'pathway-data' | ||
| pathway_data_dir.mkdir(exist_ok=True, parents=True) | ||
|
|
||
| fetch( | ||
| synthetic_directory / 'raw' / 'pc-panther-biopax.owl', | ||
| pathway_data_dir / Path(args.pathway_name).with_suffix(".owl"), | ||
| denylist=synthetic_directory / 'raw' / 'denylist.txt', | ||
| uris=[associated_id], | ||
| absolute=True | ||
| ) | ||
|
|
||
| toSIF( | ||
| pathway_data_dir / Path(args.pathway_name).with_suffix(".owl"), | ||
| pathway_data_dir / Path(args.pathway_name).with_suffix(".sif"), | ||
| # See the paxtools library for information about how these settings were retrieved. | ||
| # These are directly from PathwayCommons. | ||
| denylist=str(synthetic_directory / 'raw' / 'denylist.txt'), | ||
| chemDb=["chebi"], | ||
| seqDb=["hgnc"], | ||
| exclude=["NEIGHBOR_OF"], | ||
| extended=True, | ||
| ) | ||
|
|
||
| if __name__ == "__main__": | ||
| main() |