
Feature | Fetch data from Metalog and create a TSE object #814

Open
raivo-otus wants to merge 5 commits into microbiome:devel from raivo-otus:feat/create-metalog-tse


@raivo-otus
Contributor

DISCLAIMER: Refactored from previous "draft implementation" with assistance from Claude Opus 4.6 to fit mia style.

This PR adds a function to mia that fetches data from the Metalog database and compiles a ready-to-use TSE object.
Provenance and license information is stored in the metadata(tse)$metalog slot.

Unit tests are included. Importantly, they use minimal mock input to avoid downloading large data files from Metalog.

Refactored from previous "draft implementation" with assistance from
Claude Opus 4.6 to fit mia style. In the process, downloads were switched
to httr2 and some of the logic was optimized.
Member

@antagomir antagomir left a comment


Looks good!

Comment thread R/fetchMetalogTSE.R
#' Fetch data from the Metalog database as
#' \code{TreeSummarizedExperiment}
#'
#' \code{fetchMetalogTSE} downloads MetaPhlAn4 taxonomic profiles and
Member

Not sure it is necessary to emphasize MetaPhlAn4 here (this may change in the future).

Comment thread R/fetchMetalogTSE.R
#' \code{fetchMetalogTSE} downloads MetaPhlAn4 taxonomic profiles and
#' associated sample metadata from the
#' \href{https://metalog.embl.de/}{Metalog} database and returns them as a
#' \code{TreeSummarizedExperiment} object. An optional sample list can be
Member

I think this kind of situation would usually be written as \code{\link[SummarizedExperiment]{SummarizedExperiment}} for full linking (compare with the other R files).

Comment thread R/fetchMetalogTSE.R Outdated
Comment on lines +11 to +13
#' @param collection \code{Character scalar}. The Metalog collection to
#' download. Must be one of \code{"human"}, \code{"animal"}, \code{"ocean"},
#' or \code{"environmental"}.
Member

Is it not possible (or feasible) to load all samples? One might be interested in the comparisons.

Are all of these done with MetaPhlAn4?

Member

Also declare what the default collection is.

Member

Is it possible to give a vector, like: collection = c("human", "animal")?

Contributor Author

I'm sure this can be implemented. Possibly add a mergeSE step if more than one collection is requested. Hmm.

Member

Not critical but could be nice.

Member

@antagomir antagomir Apr 4, 2026

If this is not added to the function, the examples could show how to combine SEs (or maybe this could be shown in any case). Does the tidyomics pkg help with that, btw?

Comment thread R/fetchMetalogTSE.R
Comment on lines +19 to +20
#' @param samplelist \code{Character scalar} or \code{NULL}. File path to a
#' sample list exported from the Metalog web UI. Supported formats are
Member

Could this support both file name and a vector input (listing the samples directly)?

Contributor Author

Yes, this can be changed to also support a vector of samples to keep.

Member

I think this would be important to have. Many users want to operate with their own sample lists in R.
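
A minimal sketch of how one argument could accept both kinds of input discussed in this thread. The helper name `resolve_samplelist` is hypothetical and not part of the PR, and the file format (one sample ID per line) is an assumption, since the formats exported by the Metalog web UI are not spelled out here.

```r
# Hypothetical helper, not part of the PR: accepts NULL, a file path, or a
# character vector of sample IDs, and returns a character vector (or NULL).
# Assumption: a sample list file holds one sample ID per line.
resolve_samplelist <- function(samplelist = NULL) {
    if (is.null(samplelist)) {
        return(NULL)
    }
    if (!is.character(samplelist) || anyNA(samplelist)) {
        stop("'samplelist' must be NULL, a file path, or a character vector ",
            "of sample IDs.", call. = FALSE)
    }
    # A single string naming an existing regular file is read as a file;
    # anything else is treated as sample IDs given directly.
    is_file <- length(samplelist) == 1L &&
        file.exists(samplelist) && !dir.exists(samplelist)
    ids <- if (is_file) readLines(samplelist, warn = FALSE) else samplelist
    # Drop surrounding whitespace, empty entries and duplicates.
    ids <- trimws(ids)
    unique(ids[nzchar(ids)])
}
```

With this shape, `resolve_samplelist("samples.txt")` and `resolve_samplelist(c("sampleA", "sampleB"))` would both yield a plain character vector that the downstream filtering step can use without caring where the IDs came from.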

Comment thread R/fetchMetalogTSE.R
#' abundance across the retained samples are also removed.
#' (Default: \code{NULL}).
#'
#' @param use.cache \code{Logical scalar}. Should previously downloaded files
Member

Just check that use.cache is the commonly used name for this argument

Comment thread R/fetchMetalogTSE.R
#' \code{\link[TreeSummarizedExperiment:TreeSummarizedExperiment-class]{TreeSummarizedExperiment}}
#' object
#'
#' @name fetchMetalogTSE
Member

consider simplifying the name, e.g. just fetchMetalog

Comment thread R/fetchMetalogTSE.R
#' @references
#' Metalog database: \url{https://metalog.embl.de/}
#'
#' The data is made available under the Open Database License (ODbL) v1.0.
Member

If your derived data is an output or result produced from querying the database (such as a data object), this is called a "Produced Work", and you only need to include a notice that the content was obtained from the database and is available under the ODbL (source: opendatacommons). Something like: "Contains information from [DATABASE NAME], made available under the Open Database License (ODbL)."

Member

-> Consider generating such note as message() when the data is downloaded;
-> consider adding the note in TreeSE metadata (or maybe it is there already..?)
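
A rough sketch of both suggestions, using base R only. The helper names `metalog_notice` and `metalog_provenance` and the list field names are hypothetical; the notice wording follows the template quoted above, and the PR may already store something equivalent in metadata(tse)$metalog.

```r
# Hypothetical helper, not part of the PR: builds the ODbL notice once so the
# same text can be shown to the user and stored alongside the data.
metalog_notice <- function(database = "Metalog",
        url = "https://metalog.embl.de/") {
    paste0(
        "Contains information from ", database, " (", url, "), ",
        "made available under the Open Database License (ODbL)."
    )
}

# Emits the notice via message() and returns a provenance list that a caller
# could place in metadata(tse)$metalog (field names are illustrative only).
metalog_provenance <- function(collection, quiet = FALSE) {
    notice <- metalog_notice()
    if (!quiet) {
        message(notice)
    }
    list(
        source = "Metalog",
        collection = collection,
        license = "ODbL v1.0",
        notice = notice,
        retrieved = format(Sys.Date())
    )
}
```

Using message() rather than print() or cat() keeps the notice on stderr, so users can silence it with suppressMessages() while the stored copy stays attached to the object.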

Comment thread R/fetchMetalogTSE.R
data_files <- .resolve_metalog_url(collection, meta.type, use.cache)
# Latest database file for taxonomy mapping
mapping_db <- .download_metalog_file(
"https://metalog.embl.de/static/download/profiles/metaphlan4_clades.tsv.gz",
Member

Consider making this a function argument instead (with this URL as the default value).

@raivo-otus
Contributor Author

We discussed with @TuomasBorman today that this function is slightly misaligned with the purpose of mia and might better serve as a separate small package.

This would carry two major benefits:

  • A more targeted package for "dynamic TSE loaders". It can be extended in the future to other databases that don't offer direct TSE imports but do expose data and metadata.

  • It does not add maintenance burden to mia for such a small, niche function.

Based on this I am leaning in favor of splitting this into a separate package.

@antagomir
Member

All OK to me but if it is only for a single function then we need to reconsider and I am not sure if Bioconductor accepts so minimal package. Do we have any other contents to include?

@TuomasBorman
Contributor

> All OK to me but if it is only for a single function then we need to reconsider and I am not sure if Bioconductor accepts so minimal package. Do we have any other contents to include?

This is very similar to the MGnifyR package. If there is only one function, that would not be enough for a package, I agree. On the other hand, the curatedMetagenomicData package is also very small.

What the user wants is:

  1. Check available studies
  2. Browse metadata on the studies
  3. Fetch selected studies

Can this be achieved somehow?

We could ask the Metalog people whether they are interested in collaborating once we have a working draft. It would help with future maintenance (if there are changes etc.). As there is no API currently(?), this is rather fragile, so it would be good to align with them.

@raivo-otus
Contributor Author

raivo-otus commented Apr 10, 2026

> > All OK to me but if it is only for a single function then we need to reconsider and I am not sure if Bioconductor accepts so minimal package. Do we have any other contents to include?
>
> This is very similar to the MGnifyR package. If there is only one function, that would not be enough for a package, I agree. On the other hand, the curatedMetagenomicData package is also very small.
>
> What the user wants is:
>
> 1. Check available studies
> 2. Browse metadata on the studies
> 3. Fetch selected studies
>
> Can this be achieved somehow?
>
> We could ask the Metalog people whether they are interested in collaborating once we have a working draft. It would help with future maintenance (if there are changes etc.). As there is no API currently(?), this is rather fragile, so it would be good to align with them.

The Metalog website offers an interactive way to explore the available metadata and download a sampleList file.
Metalog only offers downloads of the full profile collections, so downloading only a subset is not possible.

However, browsing and exploring metadata in the Metalog web UI, then downloading a sampleList file and passing it to the function, will filter the result down to the selected samples.

I guess some functionality to browse available studies and metadata in R could be added: essentially filtering down to the selected samples in R and passing a vector of IDs to the function to construct the TSE with those samples?

These don't sidestep the need to download the full profiles, though. Currently the function uses sparseMatrix internally and converts to a dense matrix just before TSE construction. The sparse representation was necessary to merge collections without obscene memory requirements.

We discussed using DelayedMatrix as an option. This is tempting, as working with such large matrices puts very heavy requirements on available RAM. If I understand the DelayedMatrix capabilities correctly, it would be possible to keep the data on disk, essentially sacrificing speed to gain the ability to run the function with less RAM.

For context: my machine has 32 GB of RAM and had to use the majority of its 20 GB of swap to compile the merged TSE. Pulling only "human" (~70k samples) fits comfortably in 32 GB of RAM, but I suspect more common configurations with 8-16 GB of RAM will struggle when calling this function.
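
To make the sparse-merge idea concrete, here is a small sketch using only the Matrix package. It is not the PR's implementation: `merge_sparse` is a hypothetical helper, and it assumes each collection is a feature-by-sample sparse matrix with taxa as row names and unique sample IDs as column names.

```r
library(Matrix)

# Hypothetical helper, not the PR's implementation: combine two sparse
# feature-by-sample matrices on the union of their row names, without ever
# materialising a dense matrix.
merge_sparse <- function(x, y) {
    features <- union(rownames(x), rownames(y))
    expand <- function(m) {
        # summary() lists the non-zero entries as (i, j, x) triplets;
        # re-index the rows into the union feature space.
        trip <- summary(m)
        sparseMatrix(
            i = match(rownames(m), features)[trip$i],
            j = trip$j,
            x = trip$x,
            dims = c(length(features), ncol(m)),
            dimnames = list(features, colnames(m))
        )
    }
    # Both pieces now share the same rows, so they can be bound by column.
    cbind(expand(x), expand(y))
}

# Toy input: two "collections" that share only taxB.
a <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(5, 3), dims = c(2, 2),
    dimnames = list(c("taxA", "taxB"), c("s1", "s2")))
b <- sparseMatrix(i = c(1, 2), j = c(1, 1), x = c(2, 7), dims = c(2, 1),
    dimnames = list(c("taxB", "taxC"), "s3"))
merged <- merge_sparse(a, b)
```

Taxa missing from one collection simply have no stored entries for its samples, so memory grows with the number of non-zero abundances rather than with features times samples. The cost reappears at the final dense conversion, which is where an on-disk backend would matter.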

3 participants