3 changes: 3 additions & 0 deletions .dvc/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
5 changes: 5 additions & 0 deletions .dvc/config
@@ -0,0 +1,5 @@
[core]
remote = jared-r2
['remote "jared-r2"']
url = s3://onezoom
endpointurl = https://9d168184d3ac384b6a159313dd90a75a.r2.cloudflarestorage.com
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
27 changes: 25 additions & 2 deletions .github/workflows/tests.yml
@@ -17,9 +17,32 @@ jobs:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: '3.10'
python-version: "3.10"
- uses: pre-commit/action@v3.0.0

dvc:
name: DVC
runs-on: ubuntu-latest
steps:
- name: Cancel Previous Runs
uses: styfle/cancel-workflow-action@0.6.0
with:
access_token: ${{ github.token }}
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.10"
- name: Install dependencies
run: |
python3 -m pip install --upgrade pip
python3 -m pip install '.[dev]'
- name: Check DVC status
run: |
dvc remote modify --local jared-r2 access_key_id ${{ secrets.DVC_ACCESS_KEY_ID }}
dvc remote modify --local jared-r2 secret_access_key ${{ secrets.DVC_SECRET_ACCESS_KEY }}
dvc repro --allow-missing --dry | tee /dev/stderr | grep -q "Data and pipelines are up to date."
if dvc data status --not-in-remote | grep -q "Not in remote"; then exit 1; fi
test:
name: Python
runs-on: ubuntu-latest
@@ -40,7 +63,7 @@ jobs:
- name: Install dependencies
run: |
python3 -m pip install --upgrade pip
python3 -m pip install '.[test]'
python3 -m pip install '.[dev]'
- name: Test with pytest
run: |
python3 -m pytest tests --conf-file tests/appconfig.ini
26 changes: 24 additions & 2 deletions .pre-commit-config.yaml
@@ -11,6 +11,28 @@ repos:
rev: v0.4.5
hooks:
- id: ruff
args: [ "--fix", "--config", "ruff.toml" ]
args: [--fix, --config, ruff.toml]
- id: ruff-format
args: [ "--config", "ruff.toml" ]
args: [--config, ruff.toml]
- repo: https://github.com/iterative/dvc
rev: 3.67.0
hooks:
- id: dvc-pre-commit
additional_dependencies:
- .[all]
language_version: python3
stages:
- pre-commit
- id: dvc-pre-push
additional_dependencies:
- .[all]
language_version: python3
stages:
- pre-push
- id: dvc-post-checkout
additional_dependencies:
- .[all]
language_version: python3
stages:
- post-checkout
always_run: true
52 changes: 35 additions & 17 deletions README.markdown
@@ -13,17 +13,18 @@ The first step to using this repo is to create a Python virtual environment and
source .venv/bin/activate

# Install it
pip install -e .
pip install -e '.[dev]'

After the first time, you just need to run the `source .venv/bin/activate` each time you want to activate it in a new shell.
# Set up git hooks including linting and DVC
pre-commit install --hook-type pre-push --hook-type post-checkout --hook-type pre-commit

If you want to run the test suite, make sure the test requirements are also installed, with:
After the first time, you just need to run the `source .venv/bin/activate` each time you want to activate it in a new shell.

pip install -e '.[test]'
To be able to run the pipeline, you'll also need to install `wget`.

## Testing

Assuming you have installed the test requirements, you should be able to run
Assuming you have installed the 'dev' dependencies, you should be able to run

python -m pytest --conf-file tests/appconfig.ini

@@ -41,22 +42,39 @@ you will need a valid Azure Image cropping key in your appconfig.ini.

## Building the latest tree from OpenTree

### Setup
This project uses [DVC](https://dvc.org/) to manage the pipeline. The build parameters are defined in `params.yaml` and the pipeline stages are declared in `dvc.yaml`.
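A stage in `dvc.yaml` names its command, the parameters it reads from `params.yaml`, and its inputs and outputs. A hypothetical sketch of the shape (the stage and path names here are illustrative, not copied from the real `dvc.yaml`):

```yaml
stages:
  download_opentree:
    # ${ot_version} is interpolated from params.yaml, e.g. ot_version: "v16.1"
    cmd: download_opentree --version ${ot_version} --output-dir data/OpenTree
    params:
      - ot_version
    outs:
      - data/OpenTree/${ot_version}
```

Changing `ot_version` in `params.yaml` invalidates the stage, so the next `dvc repro` re-runs it.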

We assume that you want to build a OneZoom tree based on the most recent online OpenTree version.
You can check the most recent version of both the synthetic tree (`synth_id`) and the taxonomy (`taxonomy_version`) via the
[API](https://github.com/OpenTreeOfLife/germinator/wiki/Open-Tree-of-Life-Web-APIs) e.g. by running `curl -X POST https://api.opentreeoflife.org/v3/tree_of_life/about`. Later in the build, we use specific environment variables set to these version numbers. Assuming you are in a bash shell or similar, you can set them as follows:
### Quick start (using cached outputs)

You'll need to ask for the DVC remote credentials on the OneZoom Slack channel in order to pull cached results.
Then, if someone has already run the pipeline and pushed the results to the DVC remote, you can reproduce the build and any of the intermediate stages without downloading any of the massive source files:

```bash
source .venv/bin/activate
dvc repro --pull --allow-missing
```
```
OT_VERSION=14.9 #or whatever your OpenTree version is
OT_TAXONOMY_VERSION=3.6
OT_TAXONOMY_EXTRA=draft1 #optional - the draft for this version, e.g. `draft1` if the taxonomy_version is 3.6draft1
```

### Download
DVC will pull only the cached outputs needed for stages that haven't changed. If all stages are cached, nothing needs to be re-run.
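The credentials never go in the committed `.dvc/config`; the `dvc remote modify --local` commands (the same ones the CI workflow runs) write them to the gitignored `.dvc/config.local`, which ends up looking something like this (values elided):

```
['remote "jared-r2"']
    access_key_id = <your access key id>
    secret_access_key = <your secret access key>
```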

### Full build (first time / updating source data)

1. Set `ot_version` in `params.yaml` to the desired OpenTree synthesis version (e.g. `"v16.1"`). Available versions can be found in the [synthesis manifest](https://raw.githubusercontent.com/OpenTreeOfLife/opentree/master/webapp/static/statistics/synthesis.json). The OpenTree tree and taxonomy will be downloaded automatically by the `download_opentree` pipeline stage.

2. Some source files are unversioned upstream, so DVC will reuse cached results unless forced. To force a re-download of all of them with the latest upstream data:

```bash
dvc repro --force download_eol discover_enwiki_sql_url download_wikipedia_sql discover_wikidata_url download_and_filter_wikidata download_and_filter_pageviews
```

Note that `download_and_filter_wikidata` and `download_and_filter_pageviews` take several hours to run.

3. Run the pipeline and push results to the shared cache:

Constructing the full tree of life requires various files downloaded from the internet. They should be placed within the appropriate directories in the `data` directory, as [documented here](data/README.markdown).
```bash
dvc repro
dvc push
```

### Building the tree
4. Commit `dvc.lock` to git.

Once data files are downloaded, you should be set up to actually build the tree and other backend files, by following [these instructions](oz_tree_build/README.markdown).
For detailed step-by-step documentation, see [oz_tree_build/README.markdown](oz_tree_build/README.markdown).
2 changes: 2 additions & 0 deletions data/.gitignore
@@ -0,0 +1,2 @@
/js_output
/output_files
3 changes: 2 additions & 1 deletion data/EOL/.gitignore
@@ -3,4 +3,5 @@

# But not these files...
!.gitignore
!README.markdown
!README.markdown
!*.dvc
1 change: 1 addition & 0 deletions data/OZTreeBuild/AllLife/OpenTreeParts/.gitignore
@@ -0,0 +1 @@
/OpenTree_all

This file was deleted.

2 changes: 1 addition & 1 deletion data/OpenTree/.gitignore
@@ -3,4 +3,4 @@

# But not these files...
!.gitignore
!README.markdown
!README.markdown
32 changes: 17 additions & 15 deletions data/OpenTree/README.markdown
@@ -1,26 +1,28 @@
### Directory contents
Files herein are .gitignored. To get the site working, this folder should contain the following files (or symlinks to them)

* `draftversionXXX.tre`
* `ottYYY/taxonomy.tsv`

This folder contains versioned subdirectories of Open Tree of Life data, e.g. `v16.1/`. Each subdirectory is created by the `download_opentree` script and contains:

* `labelled_supertree_simplified_ottnames.tre` -- the raw downloaded tree
* `draftversion.tre` -- the tree with `mrca***` labels removed and whitespace normalised
* `taxonomy.tsv` -- the OTT taxonomy file

These subdirectories are .gitignored and tracked by DVC as pipeline outputs.

### How to get the files
* `draftversionXXX.tre` should contain an OpenTree newick file with simplified names and `mrca***` labels removed. This can be created from the OpenTree download file `labelled_supertree_simplified_ottnames.tre`. To get this file, you can either download the complete OpenTree distribution, or get the single necessary file by following the link from [https://tree.opentreeoflife.org/about/synthesis-release/](https://tree.opentreeoflife.org/about/synthesis-release/) to 'browse full output' then 'labelled_supertree/index.html' (usually at the end of the "Supertree algorithm" section). Make sure that you *don't* get the `...without_monotypic.tre` version, otherwise you will be missing some intermediate nodes, and the popularity ratings may suffer.

Removing the `mrca***` labels can be done by using a simple regular expression substitution, as in the following perl command:

```
# assumes you have defined OT_VERSION as an environment variable, e.g. > OT_VERSION=14.7
perl -pe 's/\)mrcaott\d+ott\d+/\)/g; s/[ _]+/_/g;' labelled_supertree_simplified_ottnames.tre > draftversion${OT_VERSION}.tre
```
Run the download script with the desired synthesis version:

* The OpenTree taxonomy, in a subfolder called ottYYY/ (where YYY is the OT_TAXONOMY_VERSION; the only important file is ottYYY/taxonomy.tsv). Get the `ottYYY.tgz` file (where YYY is the correct taxonomy version for your version XXX of the tree) from [http://files.opentreeoflife.org/ott](http://files.opentreeoflife.org/ott/) and unpack it. Alternatively, the latest is usually at [https://tree.opentreeoflife.org/about/taxonomy-version](https://tree.opentreeoflife.org/about/taxonomy-version).
```
download_opentree --version v16.1 --output-dir data/OpenTree
```

### Use
The script fetches the [synthesis manifest](https://raw.githubusercontent.com/OpenTreeOfLife/opentree/master/webapp/static/statistics/synthesis.json) to look up the correct OTT taxonomy version, then downloads both the labelled supertree and taxonomy automatically.
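The lookup itself amounts to a dictionary read over the parsed manifest. A minimal sketch, assuming (purely hypothetically) that the manifest maps each synthesis release to an object with an `ott_version` field; the real `synthesis.json` field names may differ:

```python
import json

def ott_version_for(manifest: dict, synth_version: str) -> str:
    """Return the OTT taxonomy version recorded for a synthesis release."""
    try:
        return manifest[synth_version]["ott_version"]  # assumed field name
    except KeyError:
        raise ValueError(f"unknown synthesis version: {synth_version}") from None

# Made-up manifest entry, shaped like the assumption above:
manifest = json.loads('{"v16.1": {"ott_version": "ott3.7"}}')
print(ott_version_for(manifest, "v16.1"))  # -> ott3.7
```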

This is also available as a DVC pipeline stage (`download_opentree` in `dvc.yaml`), so `dvc repro` will run it when `ot_version` changes in `params.yaml`.

These files are processed by the scripts in ServerScripts/TreeBuild/OpenTreeRefine to create an OpenTree without subspecies, with polytomies resolved, and with all nodes named.
### Use

Note that the `ott/taxonomy.tsv` file is also used by other scripts e.g. for popularity, TaxonMapping, etc.
These files are processed by the pipeline stages in `dvc.yaml` to create the full OneZoom tree. The `taxonomy.tsv` file is also used by other stages (e.g. for popularity mapping, EoL filtering, etc.).

NB: for the rationale of using `...simplified_ottnames` see
[https://github.com/OpenTreeOfLife/treemachine/issues/147#issuecomment-209105659](https://github.com/OpenTreeOfLife/treemachine/issues/147#issuecomment-209105659) and also [here](https://groups.google.com/forum/#!topic/opentreeoflife/EzqctKrJySk)
20 changes: 9 additions & 11 deletions data/README.markdown
@@ -1,13 +1,11 @@
# Downloading required data files

To build a tree, you will first need to download various files from the internet. These are not provided by OneZoom directly as they are (a) very large and (b) regularly updated. The files you will need are:

* Open Tree of Life files, to be downloaded into the `OpenTree` directory (see [OpenTree/README.markdown](OpenTree/README.markdown)
* `labelled_supertree_simplified_ottnames.tre` (subsequently converted to `draftversionXXX.tre`, as detailed in the instructions)
* `ottX.Y/taxonomy.tsv` (where X.Y is the OT_TAXONOMY_VERSION)
* Wikimedia files, to be downloaded into directories within the `Wiki` directory (see [Wiki/README.markdown](Wiki/README.markdown))
* `wd_JSON/latest-all.json.bz2`
* `wp_SQL/enwiki-latest-page.sql.gz`
* `wp_pagecounts/pageviews-YYYYMM-user.bz2` (several files for different months). Or download preprocessed files from a [release](https://github.com/OneZoom/tree-build/releases)
* EoL files, to be downloaded into the `EOL` directory (see [EOL/README.markdown](EOL/README.markdown))
* `identifiers.csv`
To build a tree, you will first need various data files from the internet. These are not provided by OneZoom directly as they are (a) very large and (b) regularly updated.

All source files are downloaded automatically by DVC pipeline stages:

- **Open Tree of Life** files, downloaded by the `download_opentree` stage into `OpenTree/<version>/` (see [OpenTree/README.markdown](OpenTree/README.markdown))
- **EOL provider IDs**, downloaded by the `download_eol` stage into `EOL/provider_ids.csv.gz`
- **Wikipedia SQL dump**, downloaded by the `download_wikipedia_sql` stage into `Wiki/wp_SQL/enwiki-page.sql.gz` (see [Wiki/README.markdown](Wiki/README.markdown))
- **Wikidata JSON dump**, streamed and filtered by the `download_and_filter_wikidata` stage (see [Wiki/README.markdown](Wiki/README.markdown))
- **Wikipedia pageviews**, streamed and filtered by the `download_and_filter_pageviews` stage (see [Wiki/README.markdown](Wiki/README.markdown))
3 changes: 2 additions & 1 deletion data/Wiki/.gitignore
@@ -3,4 +3,5 @@

# But not these files...
!.gitignore
!README.markdown
!README.markdown
!*.dvc
36 changes: 20 additions & 16 deletions data/Wiki/README.markdown
@@ -1,20 +1,24 @@
To allow mappings to wikipedia and popularity calculations, the following three files
should be uploaded to their respective directories (NB: these could be symlinks to
versions on external storage)
To allow mappings to wikipedia and popularity calculations, the following
files are downloaded and filtered automatically by pipeline stages:

* The `wd_JSON` directory should contain the wikidata JSON dump, as `latest-all.json.bz2`
(download from <http://dumps.wikimedia.org/wikidatawiki/entities/>)
* The `wp_SQL` directory should contain the en.wikipedia SQL dump file, as `enwiki-latest-page.sql.gz`
(download from <http://dumps.wikimedia.org/enwiki/latest/>)
* The `wp_pagecounts` directory should contain the wikipedia pagevisits dump files:
multiple files such as `wp_pagecounts/pageviews-202403-user.bz2` etc...
(download from <https://dumps.wikimedia.org/other/pageview_complete/monthly/>).
- **`download_wikipedia_sql`** downloads the en.wikipedia SQL dump
(`enwiki-page.sql.gz`, ~2 GB) from
<https://dumps.wikimedia.org/enwiki/>. To re-download the latest
version, run `dvc repro --force discover_enwiki_sql_url download_wikipedia_sql`.

For `wp_pagecounts`, as a much faster alternative, you can download preprocessed pageviews files from a [release](https://github.com/OneZoom/tree-build/releases).
- **`download_and_filter_wikidata`** streams the full Wikidata JSON dump
(`latest-all.json.bz2`, ~90 GB) from
<https://dumps.wikimedia.org/wikidatawiki/entities/>, filters it on the fly,
and writes only the small filtered output. To re-download with a fresh dump,
run `dvc repro --force discover_wikidata_url download_and_filter_wikidata`.
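The dump is a JSON array with one entity per line, which is what makes line-by-line streaming possible. A minimal sketch of the filtering shape, assuming (purely for illustration) that we keep entities carrying property `P685`, the NCBI taxon ID; the real filter criteria live in the pipeline code:

```python
import json

def filter_wikidata_lines(lines, wanted_property="P685"):
    """Yield the ids of entity lines that carry the given claim property."""
    for line in lines:
        line = line.strip().rstrip(",")  # each entity line ends with a comma
        if not line.startswith("{"):
            continue  # skip the opening "[" and closing "]" of the array
        entity = json.loads(line)
        if wanted_property in entity.get("claims", {}):
            yield entity["id"]

# Tiny stand-in for the (decompressed) latest-all.json stream:
sample = [
    "[",
    '{"id": "Q140", "claims": {"P685": [{}]}},',  # a taxon-like entity
    '{"id": "Q42", "claims": {"P69": [{}]}},',    # not taxon-like, dropped
    "]",
]
print(list(filter_wikidata_lines(sample)))  # -> ['Q140']
```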

You can download the gz file and unpack it in one command. e.g. from `data/Wiki/wp_pagecounts`, run:
```bash
wget https://github.com/OneZoom/tree-build/releases/download/pageviews-202306-202403/OneZoom_pageviews-202306-202403.tar.gz -O - | tar -xz
```
- **`download_and_filter_pageviews`** streams monthly `-user` dumps from
<https://dumps.wikimedia.org/other/pageview_complete/monthly/>, filters them
against the wikidata titles, and caches the small filtered outputs. Only the
most recent N months (configured via `--months` in the DVC stage) are
processed. To pick up newly published months, run
`dvc repro --force download_and_filter_pageviews`.
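Because only titles already matched against the wikidata output are kept, the filtered files stay small. A sketch of the per-line filter, assuming a space-separated pageview_complete layout (wiki code, article title, then count fields); check the real dumps before relying on the exact columns:

```python
def filter_pageviews(lines, keep_titles):
    """Keep en.wikipedia rows whose article title is in keep_titles."""
    for line in lines:
        fields = line.split(" ")
        if len(fields) >= 2 and fields[0] == "en.wikipedia" and fields[1] in keep_titles:
            yield line

# Made-up rows in the assumed layout:
sample = [
    "en.wikipedia Lion 36896 desktop 4210",
    "en.wikipedia Main_Page 15580374 desktop 999999",  # title not in the set
    "de.wikipedia Löwe 1234 desktop 300",              # wrong wiki, dropped
]
kept = list(filter_pageviews(sample, {"Lion", "Löwe"}))
print(kept)  # -> ['en.wikipedia Lion 36896 desktop 4210']
```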

You will then omit passing pageviews files when you later run `generate_filtered_files` (see [build steps](../../oz_tree_build/README.markdown)).
If someone has already run the pipeline and pushed results to the DVC remote,
you do not need to download these files yourself --
`dvc repro --pull --allow-missing` will pull the cached filtered outputs instead.
1 change: 1 addition & 0 deletions data/Wiki/wd_JSON/.gitignore
@@ -1,2 +1,3 @@
*
!.gitignore
!*.dvc
1 change: 1 addition & 0 deletions data/Wiki/wp_SQL/.gitignore
@@ -1,2 +1,3 @@
*
!.gitignore
!*.dvc
1 change: 1 addition & 0 deletions data/Wiki/wp_pagecounts/.gitignore
@@ -1,2 +1,3 @@
*
!.gitignore
!*.dvc
6 changes: 6 additions & 0 deletions data/filtered/.gitignore
@@ -0,0 +1,6 @@
# Ignore everything
*

# But not these files...
!.gitignore
!*.dvc
2 changes: 0 additions & 2 deletions data/output_files/.gitignore

This file was deleted.
