3 changes: 3 additions & 0 deletions .dvc/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
5 changes: 5 additions & 0 deletions .dvc/config
@@ -0,0 +1,5 @@
[core]
remote = jared-r2
['remote "jared-r2"']
url = s3://onezoom
endpointurl = https://9d168184d3ac384b6a159313dd90a75a.r2.cloudflarestorage.com
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
27 changes: 25 additions & 2 deletions .github/workflows/tests.yml
@@ -17,9 +17,32 @@ jobs:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: "3.10"
- uses: pre-commit/action@v3.0.0

dvc:
name: DVC
runs-on: ubuntu-latest
steps:
- name: Cancel Previous Runs
uses: styfle/cancel-workflow-action@0.6.0
with:
access_token: ${{ github.token }}
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Install dependencies
run: |
python3 -m pip install --upgrade pip
python3 -m pip install '.[dev]'
- name: Check DVC status
run: |
dvc remote modify --local jared-r2 access_key_id ${{ secrets.DVC_ACCESS_KEY_ID }}
dvc remote modify --local jared-r2 secret_access_key ${{ secrets.DVC_SECRET_ACCESS_KEY }}
dvc repro --allow-missing --dry | tee /dev/stderr | grep -q "Data and pipelines are up to date."
if dvc data status --not-in-remote | grep -q "Not in remote"; then exit 1; fi
test:
name: Python
runs-on: ubuntu-latest
@@ -40,7 +63,7 @@ jobs:
- name: Install dependencies
run: |
python3 -m pip install --upgrade pip
python3 -m pip install '.[test]'
python3 -m pip install '.[dev]'
- name: Test with pytest
run: |
python3 -m pytest tests --conf-file tests/appconfig.ini
26 changes: 24 additions & 2 deletions .pre-commit-config.yaml
@@ -11,6 +11,28 @@ repos:
rev: v0.4.5
hooks:
- id: ruff
args: [ "--fix", "--config", "ruff.toml" ]
args: [--fix, --config, ruff.toml]
- id: ruff-format
args: [ "--config", "ruff.toml" ]
args: [--config, ruff.toml]
- repo: https://github.com/iterative/dvc
rev: 3.67.0
hooks:
- id: dvc-pre-commit
additional_dependencies:
- .[all]
language_version: python3
stages:
- pre-commit
- id: dvc-pre-push
additional_dependencies:
- .[all]
language_version: python3
stages:
- pre-push
- id: dvc-post-checkout
additional_dependencies:
- .[all]
language_version: python3
stages:
- post-checkout
always_run: true
52 changes: 35 additions & 17 deletions README.markdown
@@ -13,17 +13,18 @@ The first step to using this repo is to create a Python virtual environment and
source .venv/bin/activate

# Install it
pip install -e .
pip install -e '.[dev]'

After the first time, you just need to run the `source .venv/bin/activate` each time you want to activate it in a new shell.
# Set up git hooks including linting and DVC
pre-commit install --hook-type pre-push --hook-type post-checkout --hook-type pre-commit

If you want to run the test suite, make sure the test requirements are also installed, with:
After the first time, you just need to run the `source .venv/bin/activate` each time you want to activate it in a new shell.

pip install -e '.[test]'
To be able to run the pipeline, you'll also need to install `wget`.

## Testing

Assuming you have installed the test requirements, you should be able to run
Assuming you have installed the 'dev' dependencies, you should be able to run

python -m pytest --conf-file tests/appconfig.ini

@@ -41,22 +42,39 @@ you will need a valid Azure Image cropping key in your appconfig.ini.

## Building the latest tree from OpenTree

### Setup
This project uses [DVC](https://dvc.org/) to manage the pipeline. The build parameters are defined in `params.yaml` and the pipeline stages are declared in `dvc.yaml`.
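A stage in `dvc.yaml` names its command, the parameters it reads from `params.yaml`, and its inputs and outputs. A hypothetical sketch of the shape (the stage and path names here are illustrative, not copied from the real `dvc.yaml`):

```yaml
stages:
  download_opentree:
    # ${ot_version} is interpolated from params.yaml, e.g. ot_version: "v16.1"
    cmd: download_opentree --version ${ot_version} --output-dir data/OpenTree
    params:
      - ot_version
    outs:
      - data/OpenTree/${ot_version}
```

Changing `ot_version` in `params.yaml` invalidates the stage, so the next `dvc repro` re-runs it.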

We assume that you want to build a OneZoom tree based on the most recent online OpenTree version.
You can check the most recent version of both the synthetic tree (`synth_id`) and the taxonomy (`taxonomy_version`) via the
[API](https://github.com/OpenTreeOfLife/germinator/wiki/Open-Tree-of-Life-Web-APIs) e.g. by running `curl -X POST https://api.opentreeoflife.org/v3/tree_of_life/about`. Later in the build, we use specific environment variables set to these version numbers. Assuming you are in a bash shell or similar, you can set them as follows:
### Quick start (using cached outputs)

You'll need to ask for the DVC remote credentials on the OneZoom Slack channel in order to pull cached results.
Then, if someone has already run the pipeline and pushed the results to the DVC remote, you can reproduce the build and any of the intermediate stages without downloading any of the massive source files:

```bash
source .venv/bin/activate
dvc repro --pull --allow-missing
```
```
OT_VERSION=14.9 #or whatever your OpenTree version is
OT_TAXONOMY_VERSION=3.6
OT_TAXONOMY_EXTRA=draft1 #optional - the draft for this version, e.g. `draft1` if the taxonomy_version is 3.6draft1
```

### Download
DVC will pull only the cached outputs needed for stages that haven't changed. If all stages are cached, nothing needs to be re-run.
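The credentials never go in the committed `.dvc/config`; the `dvc remote modify --local` commands (the same ones the CI workflow runs) write them to the gitignored `.dvc/config.local`, which ends up looking something like this (values elided):

```
['remote "jared-r2"']
    access_key_id = <your access key id>
    secret_access_key = <your secret access key>
```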

### Full build (first time / updating source data)

1. Set `ot_version` in `params.yaml` to the desired OpenTree synthesis version (e.g. `"v16.1"`). Available versions can be found in the [synthesis manifest](https://raw.githubusercontent.com/OpenTreeOfLife/opentree/master/webapp/static/statistics/synthesis.json). The OpenTree tree and taxonomy will be downloaded automatically by the `download_opentree` pipeline stage.

2. Some source files are unversioned upstream, so DVC will reuse cached results unless forced. To force a re-download of all of them with the latest upstream data:

```bash
dvc repro --force download_eol discover_enwiki_sql_url download_wikipedia_sql discover_wikidata_url download_and_filter_wikidata download_and_filter_pageviews
```

Note that `download_and_filter_wikidata` and `download_and_filter_pageviews` take several hours to run.

3. Run the pipeline and push results to the shared cache:

Constructing the full tree of life requires various files downloaded from the internet. They should be placed within the appropriate directories in the `data` directory, as [documented here](data/README.markdown).
```bash
dvc repro
dvc push
```

### Building the tree
4. Commit `dvc.lock` to git.

Once data files are downloaded, you should be set up to actually build the tree and other backend files, by following [these instructions](oz_tree_build/README.markdown).
For detailed step-by-step documentation, see [oz_tree_build/README.markdown](oz_tree_build/README.markdown).
2 changes: 2 additions & 0 deletions data/.gitignore
@@ -0,0 +1,2 @@
/js_output
/output_files
3 changes: 2 additions & 1 deletion data/EOL/.gitignore
@@ -3,4 +3,5 @@

# But not these files...
!.gitignore
!README.markdown
!README.markdown
!*.dvc
1 change: 1 addition & 0 deletions data/OZTreeBuild/AllLife/OpenTreeParts/.gitignore
@@ -0,0 +1 @@
/OpenTree_all

This file was deleted.

2 changes: 1 addition & 1 deletion data/OpenTree/.gitignore
@@ -3,4 +3,4 @@

# But not these files...
!.gitignore
!README.markdown
!README.markdown
32 changes: 17 additions & 15 deletions data/OpenTree/README.markdown
@@ -1,26 +1,28 @@
### Directory contents
Files herein are .gitignored. To get the site working, this folder should contain the following files (or symlinks to them)

* `draftversionXXX.tre`
* `ottYYY/taxonomy.tsv`

This folder contains versioned subdirectories of Open Tree of Life data, e.g. `v16.1/`. Each subdirectory is created by the `download_opentree` script and contains:

* `labelled_supertree_simplified_ottnames.tre` -- the raw downloaded tree
* `draftversion.tre` -- the tree with `mrca***` labels removed and whitespace normalised
* `taxonomy.tsv` -- the OTT taxonomy file

These subdirectories are .gitignored and tracked by DVC as pipeline outputs.

### How to get the files
* `draftversionXXX.tre` should contain an OpenTree newick file with simplified names and `mrca***` labels removed. This can be created from the OpenTree download file `labelled_supertree_simplified_ottnames.tre`. To get this file, you can either download the complete OpenTree distribution, or get the single necessary file by following the link from [https://tree.opentreeoflife.org/about/synthesis-release/](https://tree.opentreeoflife.org/about/synthesis-release/) to 'browse full output' then 'labelled_supertree/index.html' (usually at the end of the "Supertree algorithm" section). Make sure that you *don't* get the `...without_monotypic.tre` version, otherwise you will be missing some intermediate nodes, and the popularity ratings may suffer.

Removing the `mrca***` labels can be done by using a simple regular expression substitution, as in the following perl command:

```
# assumes you have defined OT_VERSION as an environment variable, e.g. > OT_VERSION=14.7
perl -pe 's/\)mrcaott\d+ott\d+/\)/g; s/[ _]+/_/g;' labelled_supertree_simplified_ottnames.tre > draftversion${OT_VERSION}.tre
```
Run the download script with the desired synthesis version:

* The OpenTree taxonomy, in a subfolder called ottYYY/ (where YYY is the OT_TAXONOMY_VERSION; the only important file is ottYYY/taxonomy.tsv). Get the `ottYYY.tgz` file (where YYY is the correct taxonomy version for your version XXX of the tree) from [http://files.opentreeoflife.org/ott](http://files.opentreeoflife.org/ott/) and unpack it. Alternatively, the latest is usually at [https://tree.opentreeoflife.org/about/taxonomy-version](https://tree.opentreeoflife.org/about/taxonomy-version).
```
download_opentree --version v16.1 --output-dir data/OpenTree
```

### Use
The script fetches the [synthesis manifest](https://raw.githubusercontent.com/OpenTreeOfLife/opentree/master/webapp/static/statistics/synthesis.json) to look up the correct OTT taxonomy version, then downloads both the labelled supertree and taxonomy automatically.
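The lookup itself amounts to a dictionary read over the parsed manifest. A minimal sketch, assuming (purely hypothetically) that the manifest maps each synthesis release to an object with an `ott_version` field; the real `synthesis.json` field names may differ:

```python
import json

def ott_version_for(manifest: dict, synth_version: str) -> str:
    """Return the OTT taxonomy version recorded for a synthesis release."""
    try:
        return manifest[synth_version]["ott_version"]  # assumed field name
    except KeyError:
        raise ValueError(f"unknown synthesis version: {synth_version}") from None

# Made-up manifest entry, shaped like the assumption above:
manifest = json.loads('{"v16.1": {"ott_version": "ott3.7"}}')
print(ott_version_for(manifest, "v16.1"))  # -> ott3.7
```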

This is also available as a DVC pipeline stage (`download_opentree` in `dvc.yaml`), so `dvc repro` will run it when `ot_version` changes in `params.yaml`.

These files are processed by the scripts in ServerScripts/TreeBuild/OpenTreeRefine to create an OpenTree without subspecies, with polytomies resolved, and with all nodes named.
### Use

Note that the `ott/taxonomy.tsv` file is also used by other scripts e.g. for popularity, TaxonMapping, etc.
These files are processed by the pipeline stages in `dvc.yaml` to create the full OneZoom tree. The `taxonomy.tsv` file is also used by other stages (e.g. for popularity mapping, EoL filtering, etc.).

NB: for the rationale of using `...simplified_ottnames` see
[https://github.com/OpenTreeOfLife/treemachine/issues/147#issuecomment-209105659](https://github.com/OpenTreeOfLife/treemachine/issues/147#issuecomment-209105659) and also [here](https://groups.google.com/forum/#!topic/opentreeoflife/EzqctKrJySk)
20 changes: 9 additions & 11 deletions data/README.markdown
@@ -1,13 +1,11 @@
# Downloading required data files

To build a tree, you will first need to download various files from the internet. These are not provided by OneZoom directly as they are (a) very large and (b) regularly updated. The files you will need are:

* Open Tree of Life files, to be downloaded into the `OpenTree` directory (see [OpenTree/README.markdown](OpenTree/README.markdown)
* `labelled_supertree_simplified_ottnames.tre` (subsequently converted to `draftversionXXX.tre`, as detailed in the instructions)
* `ottX.Y/taxonomy.tsv` (where X.Y is the OT_TAXONOMY_VERSION)
* Wikimedia files, to be downloaded into directories within the `Wiki` directory (see [Wiki/README.markdown](Wiki/README.markdown))
* `wd_JSON/latest-all.json.bz2`
* `wp_SQL/enwiki-latest-page.sql.gz`
* `wp_pagecounts/pageviews-YYYYMM-user.bz2` (several files for different months). Or download preprocessed files from a [release](https://github.com/OneZoom/tree-build/releases)
* EoL files, to be downloaded into the `EOL` directory (see [EOL/README.markdown](EOL/README.markdown))
* `identifiers.csv`
To build a tree, you will first need various data files from the internet. These are not provided by OneZoom directly as they are (a) very large and (b) regularly updated.

All source files are downloaded automatically by DVC pipeline stages:

- **Open Tree of Life** files, downloaded by the `download_opentree` stage into `OpenTree/<version>/` (see [OpenTree/README.markdown](OpenTree/README.markdown))
- **EOL provider IDs**, downloaded by the `download_eol` stage into `EOL/provider_ids.csv.gz`
- **Wikipedia SQL dump**, downloaded by the `download_wikipedia_sql` stage into `Wiki/wp_SQL/enwiki-page.sql.gz` (see [Wiki/README.markdown](Wiki/README.markdown))
- **Wikidata JSON dump**, streamed and filtered by the `download_and_filter_wikidata` stage (see [Wiki/README.markdown](Wiki/README.markdown))
- **Wikipedia pageviews**, streamed and filtered by the `download_and_filter_pageviews` stage (see [Wiki/README.markdown](Wiki/README.markdown))
3 changes: 2 additions & 1 deletion data/Wiki/.gitignore
@@ -3,4 +3,5 @@

# But not these files...
!.gitignore
!README.markdown
!README.markdown
!*.dvc
36 changes: 20 additions & 16 deletions data/Wiki/README.markdown
@@ -1,20 +1,24 @@
To allow mappings to wikipedia and popularity calculations, the following three files
should be uploaded to their respective directories (NB: these could be symlinks to
versions on external storage)
To allow mappings to wikipedia and popularity calculations, the following
files are downloaded and filtered automatically by pipeline stages:

* The `wd_JSON` directory should contain the wikidata JSON dump, as `latest-all.json.bz2`
(download from <http://dumps.wikimedia.org/wikidatawiki/entities/>)
* The `wp_SQL` directory should contain the en.wikipedia SQL dump file, as `enwiki-latest-page.sql.gz`
(download from <http://dumps.wikimedia.org/enwiki/latest/>)
* The `wp_pagecounts` directory should contain the wikipedia pagevisits dump files:
multiple files such as `wp_pagecounts/pageviews-202403-user.bz2` etc...
(download from <https://dumps.wikimedia.org/other/pageview_complete/monthly/>).
- **`download_wikipedia_sql`** downloads the en.wikipedia SQL dump
(`enwiki-page.sql.gz`, ~2 GB) from
<https://dumps.wikimedia.org/enwiki/>. To re-download the latest
version, run `dvc repro --force discover_enwiki_sql_url download_wikipedia_sql`.

For `wp_pagecounts`, as a much faster alternative, you can download preprocessed pageviews files from a [release](https://github.com/OneZoom/tree-build/releases).
- **`download_and_filter_wikidata`** streams the full Wikidata JSON dump
(`latest-all.json.bz2`, ~90 GB) from
<https://dumps.wikimedia.org/wikidatawiki/entities/>, filters it on the fly,
and writes only the small filtered output. To re-download with a fresh dump,
run `dvc repro --force discover_wikidata_url download_and_filter_wikidata`.
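The dump is a JSON array with one entity per line, which is what makes line-by-line streaming possible. A minimal sketch of the filtering shape, assuming (purely for illustration) that we keep entities carrying property `P685`, the NCBI taxon ID; the real filter criteria live in the pipeline code:

```python
import json

def filter_wikidata_lines(lines, wanted_property="P685"):
    """Yield the ids of entity lines that carry the given claim property."""
    for line in lines:
        line = line.strip().rstrip(",")  # each entity line ends with a comma
        if not line.startswith("{"):
            continue  # skip the opening "[" and closing "]" of the array
        entity = json.loads(line)
        if wanted_property in entity.get("claims", {}):
            yield entity["id"]

# Tiny stand-in for the (decompressed) latest-all.json stream:
sample = [
    "[",
    '{"id": "Q140", "claims": {"P685": [{}]}},',  # a taxon-like entity
    '{"id": "Q42", "claims": {"P69": [{}]}},',    # not taxon-like, dropped
    "]",
]
print(list(filter_wikidata_lines(sample)))  # -> ['Q140']
```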

You can download the gz file and unpack it in one command. e.g. from `data/Wiki/wp_pagecounts`, run:
```bash
wget https://github.com/OneZoom/tree-build/releases/download/pageviews-202306-202403/OneZoom_pageviews-202306-202403.tar.gz -O - | tar -xz
```
- **`download_and_filter_pageviews`** streams monthly `-user` dumps from
<https://dumps.wikimedia.org/other/pageview_complete/monthly/>, filters them
against the wikidata titles, and caches the small filtered outputs. Only the
most recent N months (configured via `--months` in the DVC stage) are
processed. To pick up newly published months, run
`dvc repro --force download_and_filter_pageviews`.
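Because only titles already matched against the wikidata output are kept, the filtered files stay small. A sketch of the per-line filter, assuming a space-separated pageview_complete layout (wiki code, article title, then count fields); check the real dumps before relying on the exact columns:

```python
def filter_pageviews(lines, keep_titles):
    """Keep en.wikipedia rows whose article title is in keep_titles."""
    for line in lines:
        fields = line.split(" ")
        if len(fields) >= 2 and fields[0] == "en.wikipedia" and fields[1] in keep_titles:
            yield line

# Made-up rows in the assumed layout:
sample = [
    "en.wikipedia Lion 36896 desktop 4210",
    "en.wikipedia Main_Page 15580374 desktop 999999",  # title not in the set
    "de.wikipedia Löwe 1234 desktop 300",              # wrong wiki, dropped
]
kept = list(filter_pageviews(sample, {"Lion", "Löwe"}))
print(kept)  # -> ['en.wikipedia Lion 36896 desktop 4210']
```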

You will then omit passing pageviews files when you later run `generate_filtered_files` (see [build steps](../../oz_tree_build/README.markdown)).
If someone has already run the pipeline and pushed results to the DVC remote,
you do not need to download these files yourself --
`dvc repro --pull --allow-missing` will pull the cached filtered outputs instead.
1 change: 1 addition & 0 deletions data/Wiki/wd_JSON/.gitignore
@@ -1,2 +1,3 @@
*
!.gitignore
!*.dvc
1 change: 1 addition & 0 deletions data/Wiki/wp_SQL/.gitignore
@@ -1,2 +1,3 @@
*
!.gitignore
!*.dvc
1 change: 1 addition & 0 deletions data/Wiki/wp_pagecounts/.gitignore
@@ -1,2 +1,3 @@
*
!.gitignore
!*.dvc
6 changes: 6 additions & 0 deletions data/filtered/.gitignore
@@ -0,0 +1,6 @@
# Ignore everything
*

# But not these files...
!.gitignore
!*.dvc
2 changes: 0 additions & 2 deletions data/output_files/.gitignore

This file was deleted.
