CEPHALOPOD - Comprehensive Ensemble Pipeline for Habitat modelling Across Large-scale Ocean Pelagic Observation Datasets
This repository contains all code and technical documentation to understand the basis of CEPHALOPOD and install it on your local machine or institution computing cluster. CEPHALOPOD is a standardized and flexible framework for marine habitat modeling across various types and sources of input data. It is targeted at - but not limited to - plankton observation datasets.
CEPHALOPOD is also designed as the main tool for the Ecosystem Workbench of the Bluecloud2026 E.U. project. This initiative aims to enhance the accessibility, accuracy and inter-comparability of extensive plankton in-situ observation datasets, derived from traditional counting methods, quantitative imaging, and cutting-edge genomic (Metagenome Assembled Genome; MAG) techniques available from repositories such as OBIS, GBIF, AtlantECO or MATOU. It offers a sustainable working for generating ecosystem-level Essential Ocean and Biodiverisity Variables (EOVs & EBVs), adhering to rigorous QA/QC procedures and best practices in the habitat modeling field.
You can download and install CEPHALOPOD on your local machine or computing cluster by following the steps below. We recommend a minimum of 8 cores and 32 Gb of memory for a general usage across most case studies.
-
Initiate an R session (version 4.2. or above) on your local machine or computing cluster
-
Access a terminal at the designated installation path for CEPHALOPOD.
-
Execute the following command:
git -clone https://github.com/alexschickele/CEPHALOPOD.gitor download CEPHALOPOD as an archive and decompress it in the designated installation path. -
Navigate to the file located at: ./code/00_config.R
-
On the first use, execute the R library installation commands line by line, responding affirmatively to any interactive prompts. This ensures the installation of all necessary libraries and corresponding versions.
-
CEPHALOPOD should now be ready for utilization !
Now that you have CEPHALOPOD at your designated installation path, you should see several sub-folders. Here we provide an overview of their content and use.
-
The
./master.Rscript contains an example script demonstrating the utilization of CEPHALOPOD. This file serves as a template to be customized or replicated for each usage instance of CEPHALOPOD (e.g., by modifying the data source or model parameters; we will come to that later). -
The
./codefolder contains the collection of R scripts, each corresponding to the function associated with a distinct step within the CEPHALOPOD workflow. All R scripts are numbered in their order of use, matching the numbering in the./master.Rscript. -
The
./functionfolder contains custom R functions of general use across the different steps (e.g., definition of colorbars, operations on spatial data). -
The
./datafolder corresponds to the directory containing input data and model cache. Please note that some input data are local to the ETH Zurich cluster where CEPHALOPOD was developed, hence not present in this GitHub repository. Refer to the "Current status" section below. -
The
./outputdesignates the directory where CEPHALOPOD automatically stores its output for each instance or case study.
Now that you understand how to install CEPHALOPOD and what it contains, let's dive into an example from a master script. Please note that the following example does not correspond to a real case study and only aims to illustrate how to run CEPHALOPOD in a practical manner. For further information, please refer to associated publications and the technical documentation present in the R scripts in ./code.
The ./master.R script starts by a cleanup of your current R session and defines your CEPHALOPOD instance in the run_name object. The call to ./code/00_config.R will load all necessary libraries and functions in your R environment.
rm(list=ls())
closeAllConnections()
setwd("/your_installation_path")
source(file = "./code/00_config.R")
run_name <- "test"
We kickstart the modeling process by identifying the species available for analysis, from a defined source. We select these species based on specific criteria, such as a minimum sample size, the depth range for sampling, and the temporal range of the data. This selection ensures that the subsequent modeling efforts focus on species that meet the desired criteria. Parameters are the following :
DATA_SOURCE: either "occurrence", "biomass", "abundance", "MAG", "custom". This parameter will redirect the query to the appropriate database to source the data from. The latter corresponds to an already formatted .csv, .txt or .xlsx file.
SAMPLE_SELECT: a list containing the minimum sample size, target depth range and sample temporal range.
list_bio <- list_bio_wrapper(FOLDER_NAME = run_name,
DATA_SOURCE = "occurrence",
SAMPLE_SELECT = list(MIN_SAMPLE = 50, TARGET_MIN_DEPTH = 0, TARGET_MAX_DEPTH = 100, START_YEAR = 1950, STOP_YEAR = 2020))
Now we can look at the LIST_BIO object and build a vector of species identifier to extrapolate the distribution from. Here, we select all species in the Thalassiosira taxa. This code block can be customed depending on the taxa selection needed for your case study.
sp_list <- list_bio %>%
dplyr::filter(grepl("Thalassiosira ", scientificname)) %>%
dplyr::select(worms_id) %>%
unique() %>% pull()
In the second step of the modeling process, several critical actions are taken to prepare for subsequent stages. First, a dedicated output folder is created to systematically store the results of each species-level run. Global parameters for modeling are defined, including environmental variables, data types (e.g., proportions), and the selection of specific modeling methods. These parameters are stored into a CALL.Rdata object, which is referenced by all subsequent steps. This approach minimizes memory consumption and simplifies the execution of each step. Finally, a local repository of monthly environmental climatologies is established, utilizing .nc data. This step is crucial to ensure a uniform set and format of environmental predictors while also optimizing memory usage. Parameters are the following:
-
FOLDER_NAMEname of the run folder we want to work in -
SP_SELECTvector of identifiers (i.e., usually WoRMS identifiers), corresponding to the species to parallelize on -
FASTTRUEorFALSE; ifTRUE, CEPHALOPOD stops the processing of taxa that did not success at a quality check. It avoids unnecessary CPU and memory usage. -
LOAD_FROMload a previousLIST_BIOobject from another folder to be duplicated in the newFOLDER_NAME. It avoids repeating the initial steps of CEPHALOPOD. -
DATA_TYPEthe output type of the data, which can influence the sub-folder architecture. See details in the corresponding function. -
ENV_VARa list of .ncfiles to extract the main variable from, located inENV_PATH. The names correspond to the variable name of the corresponding.ncfile. If a!is specified at the start of the variable name, the variable is excluded from the list. -
ENV_PATHstring or vector of path to the root where the .ncare. -
METHOD_PAmethod of pseudo-absence, either "mindist" or "cumdist" or "density" (recommended) -
NB_PAnumber of pseudo-absences to generate -
PER_RANDOMratio of pseudo-absences that are sampled randomly in the background -
PA_ENV_STRATAifTRUE, increases the pseudo-absence sampling probability in geographical cells with environmental conditions distant from the conditions of the observations. -
DIST_PAifMETHOD_PA = "mindist", distance from presences (in meters), from which to define the background data. Expert use only -
BACKGROUND_FILTERadditional background filter for finer tuning, such as selecting pseudo-absences within the sampled background of a given campaign or instrument deployment. Passed by the user in the form of a 2 column data frame, x = longitude and y = latitude where the pseudo-absences can be sampled. Or a path to a raster object where pseudo-absences are sampled in non NA cells, weighted by the cell values. -
OUTLIERifTRUE, remove outliers further than 2.5 standard deviation from the mean of the observations -
RFEifTRUE, performs a Recursive Feature Elimination procedure that discards predictors that do not present significant importance in explaining the patterns in the observations, relative to the ones already selected. -
ENV_CORnumeric, removes the correlated environmental predictors from the query objects andCALLaccording to the defined threshold. ElseNULL. -
NFOLDnumber of folds, used defined integer -
FOLD_METHODmethod used to create the folds, integer between "kfold" and "lon"; respectively for normal k-fold or longitudinal block of observations (recommended to deal with spatial autocorrelation issues). -
MODEL_LISTlist of algorithms to use for calibration and hyperparameter selection -
LEVELSmaximum number of parameter values to test in each of the hyperparameter grids -
TARGET_TRANSFORMATIONpath to afunction(x, REVERSE = T/F)to transform the target variable prior to the model calibration step. -
ENSEMBLETRUEorFALSE; ifTRUE, computes an ensemble at the evaluation and projection steps -
N_BOOTSTRAPnumber of bootstrap to do for the projections and partial dependency plots -
CUTnumeric orNULL; if numeric, quantile (between 0 and 1) at which the projections are considered to be 0. Projection patches without observation are then removed.
The function is assigned to subfolder_list as it also returns the list of species to parallelize on.
subfolder_list <- run_init(FOLDER_NAME = run_name,
SP_SELECT = sp_list,
FAST = FALSE,
LOAD_FROM = NULL,
DATA_TYPE = "presence_only",
ENV_VAR = c("!climatology_s_0_50","!climatology_s_200_300"),
ENV_PATH = "/net/meso/work/nknecht/Masterarbeit/General_Pipeline/Data
/environmental_climatologies",
METHOD_PA = "density",
PER_RANDOM = 0.05,
OUTLIER = TRUE,
RFE = TRUE,
ENV_COR = 0.8,
NFOLD = 3,
FOLD_METHOD = "lon",
MODEL_LIST = c("GLM","GAM","RF","MLP","BRT","SVM"),
LEVELS = 3,
TARGET_TRANSFORMATION = "/net/meso/work/aschickele/Bluecloud_WB_local/function /target_transformation_yj_auto.R",
ENSEMBLE = TRUE,
N_BOOTSTRAP = 10,
CUT = 0)
Now we can retrieve the biological data for the selected taxa to extrapolate the distribution from. These data form the foundation for training and validating species distribution models. The availability and quality of this data are crucial for the success of the modeling process.
The function is built in parallel over each taxa considered. It does not provide output in the console, as all the retrieve data are saved in a QUERY.RData object. As for all subsequent functions, the parameters are following:
FOLDER_NAME: the name of the folder in which the run is saved
SUBFOLDER_NAME: the name of all sub folders corresponding to each species to parallelize over
mcmapply(FUN = query_bio_wrapper,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS))
We can now extract the environmental data corresponding to each biological sample at the time and location of the sample. In addition, the data is regridded and binned on a 1 x 1° monthly resolution. Taxa that do not meet the minimum number of occurrence after binning are discarded. Therefore, we assign the ouput of this function to an updated subfolder_list object.
subfolder_list <- mcmapply(FUN = query_env,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS)) %>%
unlist() %>%
na.omit(subfolder_list) %>%
.[grep("Error", ., invert = TRUE)] %>%
as.vector()
For occurrence data, we need to artificially generate pseudo-absences to have a balanced dataset between 0 and 1's. This balance is crucial for accurate model training and predictive performance and best practices recommend pseudo-absences following the same biases as presences. They are by default generated following the nearby presence density. The function provides a .PDF file with the presence and pseudo-absence distribution in the geographical space (or the observations for continuous or proportion data types).
mcmapply(FUN = pseudo_abs,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS))
In step 6, we rigorously check the biological and environmental dataset quality. First, we identify and handle biological outliers in the data to avoid introducing bias in the model training and extrapolated maps. Then, we assess the quality and relevance of environmental predictor variables used by discarding correlated environmental predictors and discarding predictors that do not explain a significant portion of the observed data. This step is crucial for a parsimonious feature set. The function updates subfolder_list with the species passing the quality check on the selected predictors. Finally, we perform a Multivariate Environmental Similarity Surface (MESS) analysis to identify the geographical areas outside the range of environmental values in which the model has been trained (i.e., environmental extrapolation).
Several .PDF files are provided in this steps, including a dendrogram of environmental predictor correlation, an a priori predictor importance and ranking of the predictors, as well as the corresponding loss after which, adding a predictor would not significantly increase the model performance.
subfolder_list <- mcmapply(FUN = query_check,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS)) %>%
unlist() %>%
na.omit(subfolder_list) %>%
as.vector()
Now that the input data have been thoroughly quality checked, we can create splits and resampling folds to facilitate the training and assessment of species distribution models. Proper partitioning of data is essential for robust modeling results, and assessing model performance in reproducing the observed patterns. The model training design is oriented around a n-time cross validation between train and test set to find the best hyperparameters for each algorithm. Then, each algorithm is tested against a final evaluation set to assess its performance against an independent dataset.
mcmapply(FUN = folds,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS))
Each algorithm is associated to a set of hyperparameter to best reproduce the observed biological target. This step focuses on configuring hyperparameters for training the species distribution models algorithms. These hyperparameters dictate the model's behavior during training and significantly influence model performance.
hyperparameter(FOLDER_NAME = run_name)
At this stage, we defined all necessary parameters and quality checked all input data. Therefore, we can start to fit species distribution models to the data. We employ various modeling algorithms, such as Generalized Linear Models (GLM), Generalized Additive Models (GAM), Random Forest (RF), Support Vector Machines (SVM), Boosted Regression Trees (BRT) and Multilayer Perceptrons (MLP), to capture the relationships between environmental features and the biological target(s).
From this steps on, all outputs are saved in a MODEL.Rdata object in each species corresponding sub folder. The input parameters remain the same as in the previous steps.
mcmapply(FUN = model_wrapper,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS))
To assess each of the algorithm's performance, we perform a serie of quality checks on the evaluation dataset.
-
FIT: the metric quantifying the fitting performance of the algorithm. How well does it reproduces the patterns in the observed data? -
VIP: the variable importance (PDF output), that quantifies the contribution of each environmental predictor.
mcmapply(FUN = eval_wrapper,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS))
All algorithm that successfully passed the evaluation step (if FAST = TRUE) are considered of sufficient quality to built spatial projections. We retrain a full model with the parameters and hyperparameters previously selected and perform spatial projections for n-bootstrap resamples to estimate the associated uncertainty. A quality check is performed on this uncertainty estimation.
NSD: the normalized standard deviation, quantifying the uncertainty between bootstraps at each grid-cell, averaged over all grid-cells of the projection domain.
mcmapply(FUN = proj_wrapper,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS))
Following the projections, CEPHALOPOD provides a summarized PDF output of the each projections and quality checks for each taxa, algorithm and ensembles. The projections can be provided in a monthly resolution or any lower temporal resolution.
mcmapply(FUN = standard_maps,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS))
In addition, we are also performing partial dependency plots for each species and environmental predictors. It shows the marginal response of the habitat suitability to each environmental feature. The corresponding outputs are also saved in a PDF file.
mcmapply(FUN = pdp,
FOLDER_NAME = run_name,
SUBFOLDER_NAME = subfolder_list,
mc.cores = min(length(subfolder_list), MAX_CLUSTERS))
Finally, all PDF files present in each species folder can be summarized in a unique summary for users, concatenating all quality checks and outputs of the present workbench ecosystem run.
user_synthesis(FOLDER_NAME = run_name)
Supplementary analysis such as diversity estimates can be performed by any user based on the information, data and output stored in the QUERY and MODEL.Rdata objects.
Here you will find updated informations concerning the code status from a development perspective.
- In its current state, the list of suggested predictors is also specific to local data. However, one can link to another predictor collection in the same format (WOA grid, 1x1 degree, depth resolved optional; variable = LAYER; dimensions = lon, lat, time, depth).
- The predictor set can be found online at: https://data.d4science.net/m9WC