Feature | Fetch data from Metalog and create a TSE object #814
raivo-otus wants to merge 5 commits into
Refactored from the previous "draft implementation" with assistance from Claude Opus 4.6 to fit mia style. In the process, the code was updated to use httr2 for downloads, with some optimizations to the logic.
#' Fetch data from the Metalog database as
#' \code{TreeSummarizedExperiment}
#'
#' \code{fetchMetalogTSE} downloads MetaPhlAn4 taxonomic profiles and
Not sure it is necessary to emphasize MetaPhlAn4 here (this may change in the future).
#' \code{fetchMetalogTSE} downloads MetaPhlAn4 taxonomic profiles and
#' associated sample metadata from the
#' \href{https://metalog.embl.de/}{Metalog} database and returns them as a
#' \code{TreeSummarizedExperiment} object. An optional sample list can be
I think this kind of situation would usually be written as \code{\link[SummarizedExperiment]{SummarizedExperiment}} for full linking (compare with the other R files).
#' @param collection \code{Character scalar}. The Metalog collection to
#' download. Must be one of \code{"human"}, \code{"animal"}, \code{"ocean"},
#' or \code{"environmental"}.
Is it not possible (or feasible) to load all samples? One might be interested in the comparisons.
Are all of these done with MetaPhlAn4?
Also declare what the default collection is.
Is it possible to give a vector, like: collection = c("human", "animal")?
I'm sure this can be implemented. Possibly add a mergeSE step if more than one collection is requested. Hmm.
Not critical but could be nice.
If not added in the function, then the examples could show how to combine SEs (or maybe this could be shown in any case as well). Does the tidyomics pkg help with that, btw?
#' @param samplelist \code{Character scalar} or \code{NULL}. File path to a
#' sample list exported from the Metalog web UI. Supported formats are
Could this support both file name and a vector input (listing the samples directly)?
Yes, this can be changed to also support a vector of samples to keep.
I think this would be important to have. Many users want to operate with their own sample lists in R.
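A minimal sketch of how one argument could accept either form, assuming the file holds one sample ID per line. The helper name `.resolve_samplelist` is hypothetical and not part of the PR, and the real Metalog export formats may need a dedicated parser:

```r
# Hypothetical helper (not part of the PR): normalise the `samplelist`
# argument into a character vector of sample IDs, or NULL for "keep all".
.resolve_samplelist <- function(samplelist) {
    if (is.null(samplelist)) {
        return(NULL)
    }
    if (!is.character(samplelist) || anyNA(samplelist)) {
        stop("'samplelist' must be NULL, a file path, or a character ",
            "vector of sample IDs.", call. = FALSE)
    }
    # A single string naming an existing regular file is treated as a path.
    is_file <- length(samplelist) == 1L &&
        file.exists(samplelist) && !dir.exists(samplelist)
    if (is_file) {
        # Assumption: one sample ID per line, blank lines ignored.
        ids <- trimws(readLines(samplelist, warn = FALSE))
        ids <- ids[nzchar(ids)]
    } else {
        # Otherwise the values are taken to be the sample IDs themselves.
        ids <- samplelist
    }
    unique(ids)
}

# Usage: both calls yield a plain character vector of IDs.
tmp <- tempfile(fileext = ".txt")
writeLines(c("S1", "S2", "", "S1"), tmp)
from_file <- .resolve_samplelist(tmp)
from_vector <- .resolve_samplelist(c("A", "B", "A"))
```

One caveat of this design is that a single sample ID that happens to match an existing file name would be read as a file, so a separate argument for the vector form might be the safer choice.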
#' abundance across the retained samples are also removed.
#' (Default: \code{NULL}).
#'
#' @param use.cache \code{Logical scalar}. Should previously downloaded files
Just check that use.cache is the commonly used name for this argument
#' \code{\link[TreeSummarizedExperiment:TreeSummarizedExperiment-class]{TreeSummarizedExperiment}}
#' object
#'
#' @name fetchMetalogTSE
consider simplifying the name, e.g. just fetchMetalog
#' @references
#' Metalog database: \url{https://metalog.embl.de/}
#'
#' The data is made available under the Open Database License (ODbL) v1.0.
If your derived data is an output or result produced from querying the database, such as a data object, this is called a "Produced Work", and you only need to include a notice that the content was obtained from the database and is available under the ODbL (source: opendatacommons). Something like: "Contains information from [DATABASE NAME], made available under the Open Database License (ODbL)."
-> Consider generating such a note with message() when the data is downloaded;
-> consider adding the note to the TreeSE metadata (or maybe it is there already..?)
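A rough sketch of both suggestions, using only base R. The helper names and the field names of the returned list are hypothetical; in the function itself the list could be stored under `metadata(tse)$metalog`:

```r
# Hypothetical helper: build the ODbL "Produced Work" notice once so the
# same text can be shown to the user and stored with the object.
.metalog_notice <- function(database = "Metalog") {
    paste0(
        "Contains information from ", database,
        ", made available under the Open Database License (ODbL)."
    )
}

# Hypothetical helper: emit the notice via message() and return provenance
# fields as a plain list. The field names are assumptions for illustration.
.metalog_provenance <- function(collection,
                                url = "https://metalog.embl.de/") {
    notice <- .metalog_notice()
    message(notice)
    list(
        source = url,
        collection = collection,
        license = "ODbL v1.0",
        notice = notice,
        retrieved = Sys.Date()
    )
}

# Usage: the message is printed once and the same text is kept in the list.
prov <- .metalog_provenance("human")
```

Because the text goes through `message()`, users who do not want it on every call can silence it with `suppressMessages()`.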
data_files <- .resolve_metalog_url(collection, meta.type, use.cache)
# Latest database file for taxonomy mapping
mapping_db <- .download_metalog_file(
    "https://metalog.embl.de/static/download/profiles/metaphlan4_clades.tsv.gz",
Consider adding this as a function argument instead (and setting this URL as the default value).
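One way this could look, as a sketch only: the argument name `mapping.url` and the wrapper `.fetch_mapping` are hypothetical, and `downloader` merely stands in for the PR's internal download helper so the example runs without network access.

```r
# Default taken from the URL hard-coded in the diff hunk under review.
.metalog_clades_url <-
    "https://metalog.embl.de/static/download/profiles/metaphlan4_clades.tsv.gz"

# Hypothetical wrapper: 'mapping.url' is an assumed argument name.
# 'downloader' is a stand-in for the real download helper; by default it
# simply returns the URL it was given.
.fetch_mapping <- function(mapping.url = .metalog_clades_url,
                           downloader = function(url) url) {
    ok <- is.character(mapping.url) && length(mapping.url) == 1L &&
        !is.na(mapping.url) && nzchar(mapping.url)
    if (!ok) {
        stop("'mapping.url' must be a non-empty character scalar.",
            call. = FALSE)
    }
    downloader(mapping.url)
}

# Usage: the default is used unless the caller overrides it.
default_target <- .fetch_mapping()
custom_target <- .fetch_mapping("https://example.org/clades.tsv.gz")
```

Exposing the URL this way would let users point to a mirror or a pinned file if the location on the Metalog server changes.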
We discussed with @TuomasBorman today that this function is slightly misaligned with the purpose of mia and might better serve as a separate small package. This would carry two major benefits;
Based on this I am leaning in favor of splitting this into a separate package.

All OK to me, but if it is only for a single function then we need to reconsider, and I am not sure if Bioconductor accepts such a minimal package. Do we have any other contents to include?
This is very similar to What user wants is:
Can this be achieved somehow? We could ask the Metalog people whether they are interested in collaborating once we have a working draft. It would help with future maintenance (if there are changes etc.). As there is no API currently(?), this is rather fragile, so it would be good to align with them.
The Metalog website offers an interactive way to explore the available metadata and download a sampleList file. Browsing and exploring metadata in the Metalog web UI, then downloading a sampleList file and passing it to the function, filters the result to the selected samples. I guess some functionality to browse available studies and metadata in R could be added, essentially to filter down to the selected samples in R and then pass a vector of IDs to the function to construct the tse with those samples? Neither approach sidesteps the need to download the full profiles, though.
Currently the function uses sparseMatrix internally and converts to a dense matrix just before tse construction. The use of sparseMatrix was necessary to be able to merge collections without obscene memory requirements. We discussed DelayedMatrix as an option. This is tempting, as working with such large matrices puts very heavy requirements on available RAM. If I understand the DelayedMatrix capabilities correctly, it would be possible to store the data on disk, essentially sacrificing speed to gain the ability to run the function with less RAM.
For context: my machine has 32 GB of RAM and had to use the majority of 20 GB of swap to compile the merged tse. Pulling only "human" (~70k samples) fits comfortably in 32 GB of RAM, but I suspect more common configurations with 8-16 GB of RAM will struggle when calling this function.
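To illustrate the sparse merge step, here is a small sketch using the Matrix package. It is not the PR's implementation: the helper `.merge_sparse` and the toy taxa and sample names are made up, and it assumes taxa-by-sample matrices with row names as taxon identifiers.

```r
library(Matrix)

# Hypothetical helper: combine two taxa-by-sample sparse matrices whose
# row sets differ, filling taxa absent from one input with implicit zeros.
.merge_sparse <- function(x, y) {
    taxa <- union(rownames(x), rownames(y))
    expand <- function(m) {
        # summary() lists the non-zero entries as 1-based (i, j, x) triplets.
        trip <- summary(m)
        sparseMatrix(
            i = match(rownames(m), taxa)[trip$i],
            j = trip$j,
            x = trip$x,
            dims = c(length(taxa), ncol(m)),
            dimnames = list(taxa, colnames(m))
        )
    }
    # Only non-zero entries are stored, so memory scales with the number
    # of observed abundances rather than taxa x samples.
    cbind(expand(x), expand(y))
}

# Toy collections: they share taxon "t2" but otherwise differ.
a <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(5, 3), dims = c(2, 2),
    dimnames = list(c("t1", "t2"), c("s1", "s2")))
b <- sparseMatrix(i = c(1, 2), j = c(1, 1), x = c(7, 2), dims = c(2, 1),
    dimnames = list(c("t2", "t3"), "s3"))
merged <- .merge_sparse(a, b)
```

The merged object stays sparse throughout; the memory spike described above would then come only from the final conversion to a dense matrix.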
DISCLAIMER: Refactored from previous "draft implementation" with assistance from Claude Opus 4.6 to fit mia style.
This PR adds a function to mia that fetches data from the Metalog database and compiles a ready-to-use tse object.
Provenance and license information is added to the metadata(tse)$metalog slot.
Unit tests are included; importantly, they use minimal mock input to avoid downloading big data files from Metalog.