45 commits
20b1580
feat: synthetic pathways
tristan-f-r Jul 1, 2025
fc12b4e
Merge branch 'main' into synthetic
tristan-f-r Jul 28, 2025
8ff381f
Merge branch 'main' into synthetic
tristan-f-r Jan 6, 2026
f7c0c2d
fix: use full protein links to unify synthetic with databases
tristan-f-r Jan 6, 2026
73b6d93
Merge branch 'main' into synthetic
tristan-f-r Jan 24, 2026
2ce621a
re-correct links
tristan-f-r Jan 24, 2026
280b92a
fix: interactome fetching
tristan-f-r Jan 24, 2026
db30556
fix(diseases): fetch correct string links
tristan-f-r Jan 24, 2026
0658528
chore: mv to scripts
tristan-f-r Jan 30, 2026
e024e2c
chore: move to scripts, Pathify
tristan-f-r Jan 30, 2026
7b09381
style: fmt
tristan-f-r Jan 30, 2026
2a5feec
drop old thresholding
tristan-f-r Jan 30, 2026
e389b32
begin sampling
tristan-f-r Jan 30, 2026
af0ac30
chore: mv
tristan-f-r Jan 30, 2026
d1ade54
rename
tristan-f-r Jan 30, 2026
7483eea
fix: compute weight counts normally
tristan-f-r Jan 30, 2026
05cf6d6
feat: weight-preserving sampling
tristan-f-r Feb 1, 2026
58e9717
feat: sampling
tristan-f-r Feb 2, 2026
a0f7079
feat: scripted sampling
tristan-f-r Feb 2, 2026
3bb00e8
chore: del some raw
tristan-f-r Feb 3, 2026
775d144
drop all raw interactome files
tristan-f-r Feb 3, 2026
5771bc7
feat: finish up tf mapping again
tristan-f-r Feb 3, 2026
813235d
feat: sampling on a pathway
tristan-f-r Feb 3, 2026
7fb4642
style: fmt
tristan-f-r Feb 3, 2026
83fee81
chore: drop p38 mapk, add notes
tristan-f-r Feb 3, 2026
d7da699
init candidates explorer
tristan-f-r Feb 4, 2026
d45ec82
chore: update directory urls
tristan-f-r Feb 4, 2026
0d3b77e
chore: drop all downloaded pathways
tristan-f-r Feb 4, 2026
751a8f2
fix: file extensions and such
tristan-f-r Feb 4, 2026
2fceaa9
chore: explore and such
tristan-f-r Feb 4, 2026
ac5b93c
feat: base thresholding workflow
tristan-f-r Feb 9, 2026
5cb7352
chore: add paxtools
tristan-f-r Feb 11, 2026
9d3e194
feat: trimming
tristan-f-r Feb 12, 2026
81a4e4e
style: fmt
tristan-f-r Feb 12, 2026
d2cc7e4
feat: full interactome parsing
tristan-f-r Feb 18, 2026
38aef2c
refactor: isolate argparse parser
tristan-f-r Feb 18, 2026
a881afd
docs: suggestion
tristan-f-r Feb 18, 2026
5e9653a
Merge branch 'main' into synthetic
tristan-f-r Feb 24, 2026
db5a09e
reorganize, begin using owl file
tristan-f-r Feb 25, 2026
310db00
add pathways.txt.gz
tristan-f-r Feb 25, 2026
3a0b7df
feat: automatically identify pathway ids
tristan-f-r Feb 26, 2026
bcc9db8
mv
tristan-f-r Feb 26, 2026
df30f24
chore: mv to synthetic_data
tristan-f-r Feb 26, 2026
3adc508
feat: pc owl artifact gen!!
tristan-f-r Feb 26, 2026
009a00b
feat: all the fetching we will ever need
tristan-f-r Feb 26, 2026
4 changes: 3 additions & 1 deletion .devcontainer/devcontainer.json
@@ -16,6 +16,8 @@
         // For web display
         "ghcr.io/devcontainers/features/node:1": {},
         // For scripting
-        "ghcr.io/va-h/devcontainers-features/uv:1": {}
+        "ghcr.io/va-h/devcontainers-features/uv:1": {},
+        // For paxtools
+        "ghcr.io/devcontainers/features/java:1": {}
     }
 }
5 changes: 3 additions & 2 deletions README.md
@@ -39,21 +39,22 @@ uv run snakemake --cores 1

## Organization

-There are five primary folders in this repository:
+There are six primary folders in this repository:

```
.
├── cache
├── configs
├── datasets
├── spras
+├── tools
└── web
```

`spras` is the cloned submodule of [SPRAS](https://github.com/reed-compbio/spras), `web` is an
[astro](https://astro.build/) app which generates the `spras-benchmarking` [output](https://reed-compbio.github.io/spras-benchmarking/),
`configs` is the YAML file used to talk to SPRAS, and `datasets` contains the raw data. `cache` is utility for `datasets` which provides a convenient
-way to fetch online files for further processing.
+way to fetch online files for further processing. `tools` holds miscellaneous utilities for dataset-processing tasks shared across datasets.

The workflow runs as so:

40 changes: 40 additions & 0 deletions cache/directory.py
@@ -336,6 +336,46 @@ def download(self, output: str | PathLike):
            cached="https://drive.google.com/uc?id=1nI5hw-rYRZPs15UJiqokHpHEAabRq6Xj"
        )
    },
    "Surfaceome": {
        "table_S3_surfaceome.xlsx": CacheItem(
            name="Human surfaceome",
            unpinned="http://wlab.ethz.ch/surfaceome/table_S3_surfaceome.xlsx",
            cached="https://docs.google.com/uc?id=1cBXYbDnAJVet0lv3BRrizV5FuqfMbBr0"
        )
    },
    "TranscriptionFactors": {
        "Homo_sapiens_TF.tsv": CacheItem(
            name="Human transcription factors",
            # This server has anti-bot protection, so to respect their wishes, we don't download from the server.
            # The original URL is https://guolab.wchscu.cn/AnimalTFDB4_static/download/TF_list_final/Homo_sapiens_TF,
            # which is accessible from https://guolab.wchscu.cn/AnimalTFDB4//#/Download -> Homo sapiens
            # (also under the Internet Archive as of Feb 2nd, 2026. If the original artifact disappears, the drive link below should suffice.)
            cached="https://drive.google.com/uc?id=1fVi18GpudUlquRPHgUJl3H1jy54gO-uz",
        )
    },
    "PathwayCommons": {
        "pc-biopax.owl.gz": CacheItem(
            name="PathwayCommons Universal BioPAX file",
            cached="https://drive.google.com/uc?id=1R7uE2ky7fGlZThIWCOblu7iqbpC-aRr0",
            pinned="https://download.baderlab.org/PathwayCommons/PC2/v14/pc-biopax.owl.gz"
        ),
        "pathways.txt.gz": CacheItem(
            name="PathwayCommons Pathway Identifiers",
            cached="https://drive.google.com/uc?id=1SMwuuohuZuNFnTev4zRNJrBnBsLlCHcK",
            pinned="https://download.baderlab.org/PathwayCommons/PC2/v14/pathways.txt.gz",
        ),
        "denylist.txt": CacheItem(
            name="PathwayCommons small molecule denylist",
            cached="https://drive.google.com/uc?id=1QmISJXPvVljA8oKuNYRUNbJJvZKPa_-u",
            pinned="https://download.baderlab.org/PathwayCommons/PC2/v14/blacklist.txt"
        ),
        "intermediate": {
            "pc-panther-biopax.owl": CacheItem(
                name="PathwayCommons PANTHER-only BioPAX file",
                cached="https://drive.google.com/uc?id=1MklrD8CJ1BIjh_wWr_g5rrIJ5XJB7FUI"
            )
        }
    }
}
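
The Snakefiles in this PR refer to cache entries by a list of keys, such as `["PathwayCommons", "intermediate", "pc-panther-biopax.owl"]`. The following is a minimal sketch of how such a key path could resolve against a nested directory shaped like the one in this hunk. The real `CacheItem` class and the real lookup code are not shown in this diff, so `CacheItemSketch` and `lookup` are hypothetical stand-ins, not the repository's API.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Optional


@dataclass(frozen=True)
class CacheItemSketch:
    """Hypothetical stand-in for CacheItem; only the fields seen in this hunk."""
    name: str
    cached: str
    pinned: Optional[str] = None
    unpinned: Optional[str] = None


# A small nested directory in the same shape as the entries in this hunk.
directory = {
    "PathwayCommons": {
        "denylist.txt": CacheItemSketch(
            name="PathwayCommons small molecule denylist",
            cached="https://drive.google.com/uc?id=1QmISJXPvVljA8oKuNYRUNbJJvZKPa_-u",
            pinned="https://download.baderlab.org/PathwayCommons/PC2/v14/blacklist.txt",
        ),
        "intermediate": {
            "pc-panther-biopax.owl": CacheItemSketch(
                name="PathwayCommons PANTHER-only BioPAX file",
                cached="https://drive.google.com/uc?id=1MklrD8CJ1BIjh_wWr_g5rrIJ5XJB7FUI",
            )
        },
    }
}


def lookup(tree: dict, keys: list[str]) -> CacheItemSketch:
    """Walk the nested dict one key at a time; raise KeyError on a bad path."""
    node = reduce(lambda current, key: current[key], keys, tree)
    if not isinstance(node, CacheItemSketch):
        raise KeyError(f"{keys} names a folder, not a cache item")
    return node


item = lookup(directory, ["PathwayCommons", "intermediate", "pc-panther-biopax.owl"])
print(item.name)
```

Nested "folders" such as `intermediate` are plain dictionaries in this sketch, so a key path that stops at a folder is reported as an error rather than returned.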


5 changes: 5 additions & 0 deletions datasets/README.md
@@ -11,3 +11,8 @@ Many of the datasets here have been stripped of their extra post-analysis. Here,
- [`diseases`](https://github.com/Reed-CompBio/spras-benchmarking/tree/3c0155567dbc43278531b91f9173f6d4f4486dd8/datasets/diseases)
- [`depmap`](https://github.com/Reed-CompBio/spras-benchmarking/tree/b332c0ab53868f111cb89cd4e9f485e8c19aa9e3/datasets/depmap)
- [`yeast-osmotic-stress`](https://github.com/Reed-CompBio/spras-benchmarking/tree/8f69dcdf4a52607347fe3a962b753df396e44cda/yeast-osmotic-stress)

## `explore` folders

To motivate certain decisions made in-code, such as `synthetic-data`'s PANTHER pathway choices, we provide scripts that use live data
to assist in data curation. These folders can also contain exploratory CLIs for motivating e.g. magic constants.
Empty file added datasets/__init__.py
Empty file.
3 changes: 3 additions & 0 deletions datasets/synthetic_data/.gitignore
@@ -0,0 +1,3 @@
/intermediate
/processed
/raw
65 changes: 65 additions & 0 deletions datasets/synthetic_data/README.md
@@ -0,0 +1,65 @@
# Synthetic Data

## PANTHER Pathway Fetching

This dataset includes a sub-dataset: a separate Snakemake workflow that generates
the pathway files, and their associated metadata, used by this one.

Located under `./panther_pathways`, it provides TODO.

## Download New PANTHER Pathways
1. Visit [Pathway Commons](https://www.pathwaycommons.org/).
2. Search for the desired pathway (e.g., "signaling") and filter the results by the **PANTHER pathway** data source.
Example: [Search for "Signaling" filtered by PANTHER pathway](https://apps.pathwaycommons.org/search?datasource=panther&q=Signaling&type=Pathway)
3. Click on the desired pathway and download the **Extended SIF** version of the pathway.
4. In the `raw/pathway-data/` folder, create a new subfolder named after the pathway you downloaded.
5. Move the downloaded Extended SIF file to this new folder (as a `.txt` file). Rename the file to match the subfolder name exactly.
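
Steps 4 and 5 above can be sketched in Python. This is only an illustration of the folder layout those steps describe; `place_pathway_file` is a hypothetical helper, not a script in this repository, and it copies rather than moves the file so the original download is left untouched.

```python
import shutil
from pathlib import Path


def place_pathway_file(
    downloaded: Path,
    pathway_name: str,
    root: Path = Path("raw/pathway-data"),
) -> Path:
    """Copy a downloaded Extended SIF file to <root>/<pathway_name>/<pathway_name>.txt."""
    target_dir = root / pathway_name
    # Create the subfolder named after the pathway (step 4).
    target_dir.mkdir(parents=True, exist_ok=True)
    # Name the file after the subfolder, with a .txt extension (step 5).
    target = target_dir / f"{pathway_name}.txt"
    shutil.copyfile(downloaded, target)
    return target
```

For example, `place_pathway_file(Path("download.sif"), "Wnt_signaling")` would write `raw/pathway-data/Wnt_signaling/Wnt_signaling.txt`, where both the download name and the pathway name are placeholders.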

## Sources and Targets

[Sources](http://wlab.ethz.ch/surfaceome/), or `table_S3_surfaceome.xlsx` (see the [original paper](https://doi.org/10.1073/pnas.1808790115)),
are receptors from the in silico human surfaceome.

[Targets](https://guolab.wchscu.cn/AnimalTFDB4//#/), or `Homo_sapiens_TF.tsv` (see the [original paper](https://doi.org/10.1093/nar/gkac907)),
are human transcription factors.

## Steps to Generate SPRAS-Compatible Pathways

This entire workflow can also be done with `uv run snakemake --cores 1` inside this directory.

### 1. Process PANTHER Pathways

1. Open `pathways.jsonc` (the list the `Snakefile` reads into `pathways`) and add the name of any new pathways.
2. Run the command:
```sh
uv run scripts/process_panther_pathway.py <pathway>
```
3. This will create five new files in the respective `pathway` subfolder of the `pathway-data/` directory:
- `edges.txt`
- `nodes.txt`
- `prizes-100.txt`
- `sources.txt`
- `targets.txt`
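
A quick way to confirm that this step produced everything listed above is to check the pathway folder for the five file names. The sketch below assumes the file names exactly as listed; `missing_outputs` is a hypothetical helper, not part of the repository's scripts.

```python
from pathlib import Path

# File names as listed in step 3 above.
EXPECTED_FILES = ("edges.txt", "nodes.txt", "prizes-100.txt", "sources.txt", "targets.txt")


def missing_outputs(pathway_dir: Path) -> list[str]:
    """Return the expected file names that are absent from pathway_dir."""
    return [name for name in EXPECTED_FILES if not (pathway_dir / name).is_file()]
```

An empty list means all five files are present; otherwise the returned names show which outputs still need to be generated.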

### 2. Convert Pathways to SPRAS-Compatible Format
1. In `panther_spras_formatting.py`, add the name of any new pathways to the `pathway_dirs` list on **line 8**.
2. From the `synthetic_data/` directory, run the command:
```
python scripts/panther_spras_formatting.py
```
3. This will create a new folder named `spras-compatible-pathway-data`, containing subfolders for each PANTHER pathway in SPRAS-compatible format.
Each subfolder will include the following three files:
- `<pathway_name>_gs_edges.txt`
- `<pathway_name>_gs_nodes.txt`
- `<pathway_name>_node_prizes.txt`

## Pilot Data
For the pilot data, use the list `["Wnt_signaling", "JAK_STAT_signaling", "Interferon_gamma_signaling", "FGF_signaling", "Ras"]` in both:
- the list in `combine.py`
- the list in `overlap_analytics.py`

Make sure the same pathways, `["Wnt_signaling", "JAK_STAT_signaling", "Interferon_gamma_signaling", "FGF_signaling", "Ras"]`, are also added to:
- the `pathways` vector in `ProcessPantherPathway.R`
- the list in `panther_spras_formatting.py`

**Once you’ve updated the pathway lists in all relevant scripts, run all the steps above to generate the Pilot dataset.**
83 changes: 83 additions & 0 deletions datasets/synthetic_data/Snakefile
@@ -0,0 +1,83 @@
include: "../../cache/Snakefile"
from jsonc_parser.parser import JsoncParser

pathways = JsoncParser.parse_file("pathways.jsonc")

# TODO: deduplicate from sampling.py
thresholds = list(map(str, map(lambda x: (x + 1) / 10, range(10))))

rule all:
    input:
        "raw/9606.protein.links.full.v12.0.txt",
        expand([
            "thresholded/{threshold}/{pathway}/interactome.txt",
            "thresholded/{threshold}/{pathway}/gold_standard_edges.txt",
        ], pathway=pathways, threshold=thresholds)

produce_fetch_rules({
    "raw/9606.protein.links.full.v12.0.txt": FetchConfig(["STRING", "9606", "9606.protein.links.full.txt.gz"], uncompress=True),
    "raw/human-interactome/table_S3_surfaceome.xlsx": ["Surfaceome", "table_S3_surfaceome.xlsx"],
    "raw/human-interactome/Homo_sapiens_TF.tsv": ["TranscriptionFactors", "Homo_sapiens_TF.tsv"],
    "raw/human-interactome/HUMAN_9606_idmapping_selected.tsv": FetchConfig(["UniProt", "9606", "HUMAN_9606_idmapping_selected.tab.gz"], uncompress=True),
    "raw/pc-panther-biopax.owl": ["PathwayCommons", "intermediate", "pc-panther-biopax.owl"],
    "raw/denylist.txt": ["PathwayCommons", "denylist.txt"],
    "raw/pathways.txt": FetchConfig(["PathwayCommons", "pathways.txt.gz"], uncompress=True)
})

rule interactome:
    input:
        "raw/9606.protein.links.full.v12.0.txt",
        "raw/9606.protein.aliases.txt"
    output:
        "processed/proteins_missing_aliases.csv",
        "processed/removed_edges.txt",
        "processed/interactome.tsv"
    shell:
        "uv run scripts/interactome.py"

rule process_tfs:
    input:
        "raw/human-interactome/Homo_sapiens_TF.tsv",
        "raw/human-interactome/HUMAN_9606_idmapping_selected.tsv"
    output:
        "raw/human-interactome/Homo_sapiens_TF_Uniprot.tsv"
    shell:
        "uv run scripts/map_transcription_factors.py"

rule process_panther_pathway:
    input:
        "raw/pathway-data/{pathway}.txt",
        "raw/human-interactome/table_S3_surfaceome.xlsx",
        "raw/human-interactome/Homo_sapiens_TF_Uniprot.tsv"
    output:
        "intermediate/{pathway}/edges.txt",
        "intermediate/{pathway}/nodes.txt",
        "intermediate/{pathway}/sources.txt",
        "intermediate/{pathway}/targets.txt",
        "intermediate/{pathway}/prizes.txt"
    shell:
        "uv run scripts/process_panther_pathway.py {wildcards.pathway}"

rule make_spras_compatible:
    input:
        "intermediate/{pathway}/edges.txt",
        "intermediate/{pathway}/nodes.txt",
        "intermediate/{pathway}/sources.txt",
        "intermediate/{pathway}/targets.txt",
        "intermediate/{pathway}/prizes.txt"
    output:
        "processed/{pathway}/{pathway}_node_prizes.txt",
        "processed/{pathway}/{pathway}_gs_edges.txt",
        "processed/{pathway}/{pathway}_gs_nodes.txt"
    shell:
        "uv run scripts/panther_spras_formatting.py {wildcards.pathway}"

rule threshold:
    input:
        "processed/{pathway}/{pathway}_node_prizes.txt",
        "processed/{pathway}/{pathway}_gs_edges.txt"
    output:
        expand("thresholded/{threshold}/{{pathway}}/interactome.txt", threshold=thresholds),
        expand("thresholded/{threshold}/{{pathway}}/gold_standard_edges.txt", threshold=thresholds)
    shell:
        "uv run scripts/sampling.py {wildcards.pathway}"
Empty file.
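
The `thresholds` expression in `datasets/synthetic_data/Snakefile` is compact enough to be easy to misread. The sketch below reproduces it in plain Python and approximates how `rule all` combines thresholds with pathway names. It does not import Snakemake: the list comprehension stands in for `expand(...)`, whose ordering may differ, and the two pathway names are a sample rather than the full `pathways.jsonc` list.

```python
from itertools import product

# Same expression as in the Snakefile: ten thresholds from 0.1 to 1.0, as strings.
thresholds = list(map(str, map(lambda x: (x + 1) / 10, range(10))))

# An equivalent comprehension, which may be easier to read.
thresholds_alt = [str((i + 1) / 10) for i in range(10)]

# Two sample pathway names; the real list comes from pathways.jsonc.
sample_pathways = ["Wnt signaling pathway", "p53 pathway"]

patterns = [
    "thresholded/{threshold}/{pathway}/interactome.txt",
    "thresholded/{threshold}/{pathway}/gold_standard_edges.txt",
]

# Plain-Python approximation of the expand(...) call in `rule all`:
# every pattern is filled with every (pathway, threshold) combination.
targets = [
    pattern.format(threshold=t, pathway=p)
    for pattern in patterns
    for p, t in product(sample_pathways, thresholds)
]

print(thresholds)
print(len(targets))  # 2 patterns x 2 pathways x 10 thresholds = 40
```

Each pathway therefore contributes twenty target files, one interactome and one gold standard per threshold.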
3 changes: 3 additions & 0 deletions datasets/synthetic_data/panther_pathways/.gitignore
@@ -0,0 +1,3 @@
/raw
/intermediate
/output
19 changes: 19 additions & 0 deletions datasets/synthetic_data/panther_pathways/Snakefile
@@ -0,0 +1,19 @@
include: "../../../cache/Snakefile"

rule all:
    input:
        "output/pc-panther-biopax.owl"

produce_fetch_rules({
    "raw/pc-biopax.owl": FetchConfig(["PathwayCommons", "pc-biopax.owl.gz"], uncompress=True),
    "raw/pathways.txt": FetchConfig(["PathwayCommons", "pathways.txt.gz"], uncompress=True)
})

rule fetch_from_owl:
    input:
        "raw/pc-biopax.owl",
        "raw/pathways.txt"
    output:
        "output/pc-panther-biopax.owl"
    shell:
        "uv run fetch_from_owl.py"
Empty file.
19 changes: 19 additions & 0 deletions datasets/synthetic_data/panther_pathways/fetch_from_owl.py
@@ -0,0 +1,19 @@
from pathlib import Path
from paxtools.fetch import fetch
from datasets.synthetic_data.util.parse_pc_pathways import parse_pc_pathways

current_directory = Path(__file__).parent.resolve()

def main():
    pathways_df = parse_pc_pathways(current_directory / 'raw' / 'pathways.txt')
    print("Fetching pathways... [This may take some time. On the author's desktop machine, it took 15 minutes.]")
    (current_directory / 'output').mkdir(exist_ok=True)
    fetch(
        current_directory / 'raw' / 'pc-biopax.owl',
        output=(current_directory / 'output' / "pc-panther-biopax.owl"),
        uris=list(pathways_df["PATHWAY_URI"]),
        memory=f"{2**(16 - 1)}m" # this is why we don't run this in CI! This is 32gb of memory.
    )

if __name__ == "__main__":
    main()
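
The `memory` argument in `fetch_from_owl.py` is built arithmetically, so the resulting value is worth spelling out. The check below assumes the string is read like a JVM heap size, where the `m` suffix means mebibytes. That interpretation is an assumption, since the `paxtools` wrapper's handling of the argument is not shown in this diff.

```python
# 2**(16 - 1) = 2**15 = 32768; with the "m" suffix this is 32768 MiB.
heap_megabytes = 2 ** (16 - 1)
memory = f"{heap_megabytes}m"

print(memory)                 # 32768m
print(heap_megabytes / 1024)  # 32.0, i.e. 32 GiB under the assumption above
```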
31 changes: 31 additions & 0 deletions datasets/synthetic_data/pathways.jsonc
@@ -0,0 +1,31 @@
[
// "CCKR signaling map", TODO: not a valid pathway name?
"Wnt signaling pathway",
"VEGF signaling pathway",
"Toll receptor signaling pathway",
"TGF-beta signaling pathway",
"PDGF signaling pathway",
"Notch signaling pathway",
"JAK/STAT signaling pathway",
"Interleukin signaling pathway",
"Interferon-gamma signaling pathway",
"Integrin signalling pathway",
"Insulin/IGF pathway-protein kinase B signaling cascade",
"Inflammation mediated by chemokine and cytokine signaling pathway",
"Hedgehog signaling pathway",
"FGF signaling pathway",
"FAS signaling pathway",
// "Endothelin signaling pathway", TODO: not a valid pathway name?
"EGF receptor signaling pathway",
"Cadherin signaling pathway",
"Apoptosis signaling pathway",
"Ras Pathway",
"PI3 kinase pathway",
"p38 MAPK pathway",
"Insulin/IGF pathway-mitogen activated protein kinase kinase/MAP kinase cascade",
"p53 pathway",
"Hypoxia response via HIF activation",
"Oxidative stress response",
"B cell activation",
"T cell activation"
]
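
`pathways.jsonc` is JSON with `//` comments, which the Snakefile reads through `JsoncParser.parse_file`. For readers without that dependency, here is a standard-library sketch that handles only the comment style used in this file: whole lines starting with `//`. It is not a general JSONC parser (trailing commas and inline or block comments are not handled), and `parse_jsonc_lines` is a hypothetical helper name.

```python
import json


def parse_jsonc_lines(text: str) -> list:
    """Parse JSON after dropping lines that start with // (whole-line comments only)."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


sample = """[
  // "CCKR signaling map", TODO: not a valid pathway name?
  "Wnt signaling pathway",
  "JAK/STAT signaling pathway",
  "T cell activation"
]"""

print(parse_jsonc_lines(sample))
```

Commented-out entries, such as the two TODO lines in the file, are simply skipped, so they never reach the workflow's `pathways` list.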
Empty file.
46 changes: 46 additions & 0 deletions datasets/synthetic_data/scripts/fetch_pathway.py
@@ -0,0 +1,46 @@
import argparse
from pathlib import Path

import pandas
from paxtools.fetch import fetch
from paxtools.sif import toSIF

synthetic_directory = Path(__file__).parent.parent.resolve()

def parser():
    parser = argparse.ArgumentParser(prog="PANTHER pathway fetcher")

    parser.add_argument("pathway_name", type=str)

    return parser

def main():
    args = parser().parse_args()
    curated_pathways_df = pandas.read_csv(synthetic_directory / 'intermediate' / 'curated_pathways.tsv', sep='\t')
    associated_id = curated_pathways_df.loc[curated_pathways_df["Name"] == args.pathway_name].reset_index(drop=True).loc[0]["ID"]

    pathway_data_dir = synthetic_directory / 'intermediate' / 'pathway-data'
    pathway_data_dir.mkdir(exist_ok=True, parents=True)

    fetch(
        synthetic_directory / 'raw' / 'pc-panther-biopax.owl',
        pathway_data_dir / Path(args.pathway_name).with_suffix(".owl"),
        denylist=synthetic_directory / 'raw' / 'denylist.txt',
        uris=[associated_id],
        absolute=True
    )

    toSIF(
        pathway_data_dir / Path(args.pathway_name).with_suffix(".owl"),
        pathway_data_dir / Path(args.pathway_name).with_suffix(".sif"),
        # See the paxtools library for information about how these settings were retrieved.
        # These are directly from PathwayCommons.
        denylist=str(synthetic_directory / 'raw' / 'denylist.txt'),
        chemDb=["chebi"],
        seqDb=["hgnc"],
        exclude=["NEIGHBOR_OF"],
        extended=True,
    )

if __name__ == "__main__":
    main()
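
The name-to-ID lookup in `fetch_pathway.py` chains several pandas calls on one line, and a name with no matching row would surface as a bare index error. The sketch below shows the same lookup with the standard library and an explicit error message. It assumes only the `Name` and `ID` columns used in the script; `find_pathway_id` is a hypothetical helper and the sample IDs are placeholders, not real PathwayCommons URIs.

```python
import csv
import io


def find_pathway_id(tsv_text: str, pathway_name: str) -> str:
    """Return the ID of the first row whose Name column equals pathway_name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["Name"] == pathway_name:
            return row["ID"]
    raise KeyError(f"No curated pathway named {pathway_name!r}")


# Hypothetical table contents in the shape of curated_pathways.tsv.
sample_tsv = "Name\tID\nWnt signaling pathway\texample:wnt\np53 pathway\texample:p53\n"

print(find_pathway_id(sample_tsv, "p53 pathway"))
```

Whether a clearer failure is worth the extra lines is a judgment call; the one-line pandas form is equivalent whenever the name exists in the table.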