diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md new file mode 100644 index 0000000..5046dbd --- /dev/null +++ b/.github/CONTRIBUTING.md @@ -0,0 +1,106 @@ +# Contributing to RAMEN + +Hi there! This outlines how to propose a change to RAMEN. First of all, thanks +for considering contributing to our package! It's people like you that make it +rewarding for us - the project maintainers - to work on RAMEN. 😊 + +For a detailed discussion on contributing to this and other tidyverse packages, +please see the [development contributing guide](https://rstd.io/tidy-contrib) +and our [code review principles](https://code-review.tidyverse.org/). + +There are many ways you can contribute to this project (see the +[Open Source Guide](https://opensource.guide/how-to-contribute/)). Here are some +of them: + +## Engage with the package + +### Share the ideas + +Think RAMEN is useful? Let others discover it, by telling them in person, +via BlueSky or a blog post. + +Using RAMEN for a paper you are writing? Consider +[citing it](https://link.springer.com/article/10.1186/s13059-025-03864-4). + +### Ask a question + +Using RAMEN and got stuck? Browse the [documentation][website] to see if you +can find a solution. Still stuck? Post your question as an +[issue on GitHub][new_issue]. While we cannot offer user support, +we'll try to do our best to address it, as questions often lead to better +documentation or the discovery of bugs. + +Want to ask a question in private? Contact the package maintainer by +[email][mailto:email]. + +### Propose an idea 💡 + +Have an idea for a new our_package feature? Take a look at the +[documentation][website] and [issue list][issues] to see if it isn't included +or suggested yet. If not, suggest your idea as an [issue on GitHub][new_issue]. +While we can't promise to implement your idea, it helps to: + +* Explain in detail how it would work. +* Keep the scope as narrow as possible. 
+ +See below if you want to contribute code for your idea as well. + +## Improve the documentation + +Noticed a typo on the website? Think a function could use a better example? +Good documentation makes all the difference, so your help to improve it is very welcome! + +You can fix typos, spelling mistakes, or grammatical errors in the +documentation directly using the GitHub web interface, as long as the changes +are made in the _source_ file. This generally means you'll need to +edit [roxygen2 comments](https://roxygen2.r-lib.org/articles/roxygen2.html) in +an `.R`, not a `.Rd` file. You can find the `.R` file that generates the `.Rd` +by reading the comment in the first line. + +## Bigger changes + +If you want to make a bigger change, it's a good idea to first file an issue and +make sure someone from the team agrees that it’s needed. + +### Report a bug + +Using RAMEN and discovered a bug? That's annoying! Don't let others have +the same experience: report it in an [issue on GitHub][new_issue] so +we can fix it. If you’ve found a bug, please file an issue that illustrates the +bug with a minimal [reprex](https://www.tidyverse.org/help/#reprex) (this will +also help you write a unit test, if needed). +See our guide on [how to create a great issue](https://code-review.tidyverse.org/issues/) +for more advice. Please also provide your operating system name and version (e.g. Mac OS 10.13.6), +and any details about your local setup that might be helpful in troubleshooting. + +### Pull request process + +We try to follow the [GitHub flow](https://guides.github.com/introduction/flow/) for development. + +* Fork the package and clone onto your computer. If you haven't done this before, we recommend using `usethis::create_from_github("ErickNavarroD/RAMEN", fork = TRUE)`. + +* Install all development dependencies with `devtools::install_dev_deps()`, and then make sure the package passes R CMD check by running `devtools::check()`. 
+ If R CMD check doesn't pass cleanly, it's a good idea to ask for help before continuing. +* Create a Git branch for your pull request (PR). We recommend using `usethis::pr_init("brief-description-of-change")`. + +* Make your changes, commit to git, and then create a PR by running `usethis::pr_push()`, and following the prompts in your browser. + The title of your PR should briefly describe the change. + The body of your PR should contain `Fixes #issue-number`. + +* For user-facing changes, add a bullet to the top of `NEWS.md` (i.e. just below the first header). Follow the style described in . + +### Code style + +* New code should follow the tidyverse [style guide](https://style.tidyverse.org). + You can use the [styler](https://CRAN.R-project.org/package=styler) package to apply these styles, but please don't restyle code that has nothing to do with your PR. + +* We use [roxygen2](https://cran.r-project.org/package=roxygen2), with [Markdown syntax](https://cran.r-project.org/web/packages/roxygen2/vignettes/rd-formatting.html), for documentation. + +* We use [testthat](https://cran.r-project.org/package=testthat) for unit tests. + Contributions with test cases included are easier to accept. + +## Code of Conduct + +Please note that this package is released with a [Contributor +Code of Conduct](https://ropensci.org/code-of-conduct/). +By contributing to this project, you agree to abide by its terms. diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml new file mode 100644 index 0000000..562fe0f --- /dev/null +++ b/.github/workflows/R-CMD-check.yaml @@ -0,0 +1,51 @@ +# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples +# Need help debugging build failures? 
Start at https://github.com/r-lib/actions#where-to-find-help +on: + push: + branches: [main, master] + pull_request: + +name: R-CMD-check.yaml + +permissions: read-all + +jobs: + R-CMD-check: + runs-on: ${{ matrix.config.os }} + + name: ${{ matrix.config.os }} (${{ matrix.config.r }}) + + strategy: + fail-fast: false + matrix: + config: + - {os: macos-latest, r: 'release'} + - {os: windows-latest, r: 'release'} + - {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'} + - {os: ubuntu-latest, r: 'release'} + - {os: ubuntu-latest, r: 'oldrel-1'} + + env: + GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + R_KEEP_PKG_SOURCE: yes + + steps: + - uses: actions/checkout@v4 + + - uses: r-lib/actions/setup-pandoc@v2 + + - uses: r-lib/actions/setup-r@v2 + with: + r-version: ${{ matrix.config.r }} + http-user-agent: ${{ matrix.config.http-user-agent }} + use-public-rspm: true + + - uses: r-lib/actions/setup-r-dependencies@v2 + with: + extra-packages: any::rcmdcheck + needs: check + + - uses: r-lib/actions/check-r-package@v2 + with: + upload-snapshots: true + build_args: 'c("--no-manual","--compact-vignettes=gs+qpdf")' diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml new file mode 100644 index 0000000..0ab748d --- /dev/null +++ b/.github/workflows/test-coverage.yaml @@ -0,0 +1,62 @@ +# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples +# Need help debugging build failures? 
Start at https://github.com/r-lib/actions#where-to-find-help +on: + push: + branches: [main, master] + pull_request: + +name: test-coverage.yaml + +permissions: read-all + +jobs: + test-coverage: + runs-on: ubuntu-latest + env: + GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + + steps: + - uses: actions/checkout@v4 + + - uses: r-lib/actions/setup-r@v2 + with: + use-public-rspm: true + + - uses: r-lib/actions/setup-r-dependencies@v2 + with: + extra-packages: any::covr, any::xml2 + needs: coverage + + - name: Test coverage + run: | + cov <- covr::package_coverage( + quiet = FALSE, + clean = FALSE, + install_path = file.path(normalizePath(Sys.getenv("RUNNER_TEMP"), winslash = "/"), "package") + ) + print(cov) + covr::to_cobertura(cov) + shell: Rscript {0} + + - uses: codecov/codecov-action@v5 + with: + # Fail if error if not on PR, or if on PR and token is given + fail_ci_if_error: ${{ github.event_name != 'pull_request' || secrets.CODECOV_TOKEN }} + files: ./cobertura.xml + plugins: noop + disable_search: true + token: ${{ secrets.CODECOV_TOKEN }} + + - name: Show testthat output + if: always() + run: | + ## -------------------------------------------------------------------- + find '${{ runner.temp }}/package' -name 'testthat.Rout*' -exec cat '{}' \; || true + shell: bash + + - name: Upload test results + if: failure() + uses: actions/upload-artifact@v4 + with: + name: coverage-test-failures + path: ${{ runner.temp }}/package diff --git a/CITATION.cff b/CITATION.cff deleted file mode 100644 index 4d6dbdd..0000000 --- a/CITATION.cff +++ /dev/null @@ -1,36 +0,0 @@ -# This CITATION.cff file was generated with cffinit. -# Visit https://bit.ly/cffinit to generate yours today! - -cff-version: 1.2.0 -title: >- - RAMEN: Regional Association of DNA Methylome variability - with Exposome and geNome. -message: >- - If you use this software, please cite it using the - metadata from this file. -type: software -authors: - - given-names: Erick I. 
- family-names: Navarro-Delgado - email: erick.navarrodelgado@bcchr.ca - affiliation: The University of British Columbia - orcid: 'https://orcid.org/0000-0003-1040-3519' -repository-code: 'https://github.com/ErickNavarroD/RAMEN' -url: 'https://ericknavarrod.github.io/RAMEN/' -abstract: > - Regional Association of Methylome variability with the - Exposome and geNome (RAMEN) is an R package whose goal is - to identify Variable Methylated Regions (VMRs) in - microarray DNA methylation data. Additionally, using - Genotype (G) and Environmental (E) data, it can identify - which G, E, G+E or GxE model better explains this - variability. -keywords: - - DNA methylation - - Variable methylated regions - - gene-environment interaction - - multi-omic - - exposome -license: GPL-3.0+ -version: 1.0.0 -date-released: '2024-03-01' diff --git a/DESCRIPTION b/DESCRIPTION index f1d224b..01aaa7a 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,14 +1,14 @@ Package: RAMEN Title: RAMEN: Regional Association of Methylome variability with the Exposome and geNome -Version: 1.0.0 +Version: 2.0.0 Authors@R: person("Erick I.", "Navarro-Delgado", , "ericknadel98@hotmail.com", role = c("aut", "cre"), comment = c(ORCID = "0000-0003-1040-3519")) -Description: R package that identifies which genetic (G), environmental (E), additive (G+E) or interaction (GxE) model better explains DNA methylation levels in Variable Methylated Regions using microarray data. +Description: R package that identifies which genetic (G), environmental (E), additive (G+E) or interaction (GxE) effect better explains DNA methylation levels in Variable Methylated Loci using microarray data. 
+License: GPL (>= 3) Encoding: UTF-8 Roxygen: list(markdown = TRUE) -RoxygenNote: 7.2.3 +RoxygenNote: 7.3.3 Suggests: BiocStyle, knitr, @@ -23,6 +23,9 @@ Imports: foreach, GenomicRanges, glmnet, + IlluminaHumanMethylation450kanno.ilmn12.hg19, + IlluminaHumanMethylationEPICanno.ilm10b4.hg19, + IlluminaHumanMethylationEPICv2anno.20a1.hg38, IRanges, iterators, lifecycle, @@ -34,6 +37,7 @@ Imports: tibble VignetteBuilder: knitr Depends: - R (>= 2.10) + R (>= 4.2.0) LazyData: true URL: https://ericknavarrod.github.io/RAMEN/ +BugReports: https://github.com/ErickNavarroD/RAMEN/issues diff --git a/NAMESPACE b/NAMESPACE index 2257f4c..6a697a4 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -2,12 +2,12 @@ export("%>%") export(findCisSNPs) -export(findVMRs) +export(findVML) export(lmGE) export(medCorVMR) export(nullDistGE) export(selectVariables) -export(summarizeVMRs) +export(summarizeVML) importFrom(doRNG,"%dorng%") importFrom(foreach,"%do%") importFrom(foreach,"%dopar%") diff --git a/NEWS.md b/NEWS.md new file mode 100644 index 0000000..86a7b15 --- /dev/null +++ b/NEWS.md @@ -0,0 +1,35 @@ +# RAMEN 2.0.0 + +In this version, we have made an important change in RAMEN terminology across all the code and documentation to more accurately reflect the biological concepts represented by the data. The term "Variably Methylated Regions (VMR)" used in RAMEN v1 has been replaced by "Variably Methylated Loci (VML)" in RAMEN v2, as not all VML are composed of 2 or more highly variable probes. VML are further divided into Variably Methylated Regions (VMRs; previously named "canonical VMR" in RAMEN v1) and sparse Variably Methylated Probes (sVMPs; previously named "non-canonical VMR" in RAMEN v1). To be clear, there are no changes in how these VML are identified; we only changed how we label these categories. 
+ +| Updated name in RAMEN v2 | Deprecated name in RAMEN v1 | +|---------------------------------------|---------------------------------| +| Variably Methylated Loci (VML) | Variably Methylated Region (VMR) | +| Variably Methylated Region (VMR) | canonical VMR (cVMR) | +| sparse Variably Methylated Probe (sVMP) | non-canonical VMR (ncVMR) | + +: Terminology update + +- To reflect the terminology change, the following functions had a name change: `findVML()` (previously named `findVMRs()` in RAMEN v1) and `summarizeVML()` (previously named `summarizeVMRs()` in RAMEN v1). + +- `findVML()`: + + - Output: the list no longer separates VMRs and sVMPs into two different list elements. Now, a single element ("VML") is returned in the output list, which contains both VMRs and sVMPs, labelled accordingly under the *type* column; this VML element is now a data frame rather than a GenomicRanges object, to facilitate data wrangling and plotting. The function now automatically indexes the VML. + + - The user no longer needs to provide the array manifest when working with the Illumina 450k, EPICv1 or EPICv2 array. The `array_manifest` argument now accepts "IlluminaHumanMethylation450k", "IlluminaHumanMethylationEPICv1" and "IlluminaHumanMethylationEPICv2". + + - There is a new method to identify VML using ultrastable probes (probes whose DNA methylation is known to be stable independently of tissue and developmental stage) to discriminate Highly Variable Probes, which are then grouped into VML. This method is now the default. For more information please see the `findVML()` documentation and the package vignette. The previous default method to identify Highly Variable Probes (top 10% of probes with the highest variance in the data set) is still available using the argument `var_distribution = "all"`. + +- `nullDistGE()`: Prints messages to keep track of the progress. Fixed a bug that made doFuture parallelization strategies crash. 
+ +- All functions have examples in the documentation. + +- Added tests to reach a code coverage of \>90% in all functions. + +- Improved error catches to make functions stop early when the inputs are not in the right format. Fixed various bugs throughout the code (no user. + +- Added news, citation and contributing files to the repository. + +- Citation info is provided when loading the package. + +- The package repository has now informative badges and Continuous Integration checks. diff --git a/R/findCisSNPs.R b/R/findCisSNPs.R index 1a4c8d3..039fa4c 100644 --- a/R/findCisSNPs.R +++ b/R/findCisSNPs.R @@ -1,51 +1,86 @@ -#' Find cis SNPs around a set of Variable Methylated Regions (VMRs) +#' Find cis SNPs around a set of Variable Methylated Loci (VML) #' -#' Identification of genotyped Single Nucleotide Polymorphisms (SNPs) close to each VMR using a distance threshold. +#' Identification of genotyped Single Nucleotide Polymorphisms (SNPs) close to +#' each VML using a distance threshold. #' -#' **Important**: please make sure that the positions of the VMR data frame and the ones in the genotype information are from the same genome build. +#' **Important**: please make sure that the positions of the VML data frame and +#' the ones in the genotype information are from the same genome build. #' -#' @param VMRs_df A GRanges object converted to a data frame. Must contain the following columns: -#' "seqnames", "start", "end". These columns are present automatically when doing the object conversion and correspond to the chromosome number, and range of the region. -#' @param genotype_information A data frame with information about genotyped sites of interest. It must contain the following -#' columns: "CHROM" - chromosome number, "POS" - Genomic basepair position of SNP in the corresponding -#' chromosome (must contain values of class int), and "ID" - SNP ID. 
The nomenclature of CHROM must match with the one used in the VMRs_df seqnames column (i.e., if VMRs_df$seqnames uses 1, 2, 3, X, Y or Chr1, Chr2, Chr3, ChrX, ChrY, etc. as chromosome number, the genotype_information$CHROM values must be encoded in the same way). -#' @param distance The distance threshold to be used to identify cis SNPs. Default is 1 Mb. +#' @param VML_df A GRanges-like data frame (i.e. the same columns as a GRanges +#' object converted to a data frame). Must contain the following columns: +#' "seqnames", "start", "end". These columns are present automatically when +#' doing the object conversion and correspond to the chromosome number, and +#' range of the region. +#' @param genotype_information A data frame with information about genotyped +#' sites of interest. It must contain the following columns: "CHROM" +#' (chromosome number), "POS" (genomic basepair position of the SNP; must be an +#' integer), and "ID" (SNP ID). The nomenclature of CHROM must match the +#' one used in the VML_df seqnames column (i.e., if VML_df$seqnames uses 1, 2, +#' 3, X, Y or Chr1, Chr2, Chr3, ChrX, ChrY, etc. as chromosome number, the +#' genotype_information$CHROM values must be encoded in the same way). +#' @param distance The distance threshold in basepairs to be used to identify +#' cis SNPs. Default is 1 Mb. #' -#' @return A VMR_df object (a data frame compatible with GRanges conversion) with the following new columns: -#' - The cis SNPs identified for each VMR, the number of SNPs surrounding each VMR in the specified window -#' - VMR_index, which is created if not already existing based on the rownames of the VMR_df. 
+#' @return The same VML data frame (a data frame compatible with GRanges +#' conversion) with the following new columns: +#' - The cis SNPs identified for each VML and the number of SNPs surrounding +#' each VML in the specified window #' @export +#' @examples +#' ## Find VML in test data +#' VML <- RAMEN::findVML( +#' methylation_data = RAMEN::test_methylation_data, +#' array_manifest = "IlluminaHumanMethylationEPICv1", +#' cor_threshold = 0, +#' var_method = "variance", +#' var_distribution = "ultrastable", +#' var_threshold_percentile = 0.99, +#' max_distance = 1000 +#' ) +#' ## Find cis SNPs around VML +#' VML_with_cis_snps <- RAMEN::findCisSNPs( +#' VML_df = VML$VML, +#' genotype_information = RAMEN::test_genotype_information, +#' distance = 1e6 +#' ) +#' +findCisSNPs <- function(VML_df, genotype_information, distance = 1e6) { + CHROM <- NULL + # Check arguments + if (!is.data.frame(VML_df)) stop("Please make sure the VML_df object is a data frame.") + if (!is.data.frame(genotype_information)) stop("Please make sure the genotype_information object is a data frame.") + if (!all(c("seqnames", "start", "end") %in% colnames(VML_df))) stop("Please make sure the VML_df object has the required columns with the appropiate names (check documentation for further information)") + if (!all(c("CHROM", "POS", "ID") %in% colnames(genotype_information))) stop("Please make sure the genotype_information object has the required columns with the appropiate names (check documentation for further information)") + message("Reminder: please make sure that the positions of the VML data frame and the ones in the genotype information are from the same genome build.") + # Convert VML and snp data into a GenomicRanges object + VML_gr <- GenomicRanges::makeGRangesFromDataFrame(VML_df, keep.extra.columns = TRUE) + genotype_information <- genotype_information %>% + dplyr::arrange(CHROM) # important step for using Rle later when constructing the GenomicRanges object! 
+ seqnames_gr <- table(genotype_information$CHROM) + genot_gr <- GenomicRanges::GRanges( + seqnames = S4Vectors::Rle(names(seqnames_gr), as.numeric(seqnames_gr)), # Number of chromosome; as.numeric to convert from table to numeric vector + ranges = IRanges::IRanges(genotype_information$POS, + end = genotype_information$POS, + names = genotype_information$ID + ) + ) + # Extend each VML 1 Mb up and downstream + VML_extended <- VML_gr + distance -findCisSNPs = function(VMRs_df, genotype_information, distance = 1e6){ - #Check arguments - if(!all(c("seqnames","start","end") %in% colnames(VMRs_df))) stop("Please make sure the VMRs_df object has the required columns with the appropiate names (check documentation for further information)") - if(!all(c("CHROM","POS","ID") %in% colnames(genotype_information))) stop("Please make sure the genotype_information object has the required columns with the appropiate names (check documentation for further information)") - message("Important: please make sure that the positions of the VMR data frame and the ones in the genotype information are from the same genome build.") - #Convert VMR and snp data into a GenomicRanges object - VMRs_gr = GenomicRanges::makeGRangesFromDataFrame(VMRs_df, keep.extra.columns = TRUE) - genotype_information = genotype_information %>% dplyr::arrange(CHROM) #important step for using Rle later when constructing the GenomicRanges object! 
- seqnames_gr = table(genotype_information$CHROM) - genot_gr = GenomicRanges::GRanges( - seqnames = S4Vectors::Rle(names(seqnames_gr), as.numeric(seqnames_gr)), #Number of chromosome; as.numeric to convert from table to numeric vector - ranges = IRanges::IRanges(genotype_information$POS, end = genotype_information$POS , - names = genotype_information$ID)) - #Extend each VMR 1 Mb up and downstream - VMRs_extended = VMRs_gr + distance - - VMRs_df_with_cisSNPs = VMRs_df - if(!"VMR_index" %in% colnames(VMRs_df_with_cisSNPs)){ # Add a VMR index to each region if not already existing - VMRs_df_with_cisSNPs = VMRs_df_with_cisSNPs %>% - tibble::rownames_to_column(var = "VMR_index") + VML_df_with_cisSNPs <- VML_df + if (!"VML_index" %in% colnames(VML_df_with_cisSNPs)) { # Add a VML index to each region if not already existing + VML_df_with_cisSNPs <- VML_df_with_cisSNPs %>% + dplyr::mutate(VML_index = paste("VML", as.character(dplyr::row_number()), sep = "")) } - #### Get the number of overlaps per extended VMR #### - VMRs_df_with_cisSNPs$surrounding_SNPs = GenomicRanges::countOverlaps(VMRs_extended, genot_gr) + #### Get the number of overlaps per extended VML #### + VML_df_with_cisSNPs$surrounding_SNPs <- GenomicRanges::countOverlaps(VML_extended, genot_gr) - ####Identify the SNPs that are present in each VMR #### - snps_per_vmr_find = GenomicRanges::findOverlaps(VMRs_extended, genot_gr, select = "all") - rownames(genotype_information) = genotype_information$ID - VMRs_df_with_cisSNPs = VMRs_df_with_cisSNPs %>% - dplyr::mutate(SNP = sapply(snps_per_vmr_find, map_revmap_names, genotype_information)) + #### Identify the SNPs that are present in each VML #### + snps_per_vml_find <- GenomicRanges::findOverlaps(VML_extended, genot_gr, select = "all") + rownames(genotype_information) <- genotype_information$ID + VML_df_with_cisSNPs <- VML_df_with_cisSNPs %>% + dplyr::mutate(SNP = lapply(snps_per_vml_find, map_revmap_names, genotype_information)) - return(VMRs_df_with_cisSNPs) + 
+ return(VML_df_with_cisSNPs) } diff --git a/R/findVML.R b/R/findVML.R new file mode 100644 index 0000000..2be291d --- /dev/null +++ b/R/findVML.R @@ -0,0 +1,287 @@ +#' Map revmap column to probe names after reducing a GenomicRanges object +#' +#' Given a revmap row (e.g. 1 5 6), we map those positions to their corresponding probe names +#' (and end up with something like "cg00000029", "cg00000158", "cg00000165"). This is a helper function of findVML(). +#' +#' @param positions A revmap row in the form of a vector +#' @param manifest_hvp the manifest of the highly variable probes used in the findVML() function +#' with the probes as row names +#' +#' @return a vector with the names of the probes that make up one reduced region +#' +#' @examples +#' \dontrun{ +#' target = data.frame(row.names = c("a", "b", "c", "d"), values = c(1,1,1,1)) +#' query = c(2,1) +#' +#' map_revmap_names(positions = query, manifest_hvp = target) +#' ## Expected output: c("b", "a") +#' } + +map_revmap_names <- function(positions, manifest_hvp) { + # We start with 1 5 6 + # We want to end with cg00000029, cg00000158 cg00000165 + names <- c() + for (element in positions) { + names <- c(names, row.names(manifest_hvp)[element]) + } + return(names) +} + + +#' Identify Variable Methylated Loci in microarrays +#' +#' Identifies Highly Variable Probes (HVP) and groups them into Variable Methylated Loci (VML) given an Illumina manifest. The output of this function provides the HVPs, and the identified VML, which are made of Variable Methylated Regions and sparse Variable Methylated Probes. See Details below for more information. +#' +#' This function identifies HVPs based on MAD scores or variance, and groups them into VML, which are defined as genomic regions with high DNA methylation variability. To best capture methylome variability patterns in microarrays, we identify two types of VML: Variably Methylated Regions (VMRs) and sparse Variably Methylated Probes (sVMPs). 
+#' +#' On the one hand, we define VMRs as two or more proximal highly variable probes (default: < 1kb apart) with correlated DNAme levels (default: r > 0.15). Modelling DNAme variability through regions rather than individual CpGs provides several methodological advantages in association studies, since CpGs display a significant correlation for co-methylation when they are close (less than or equal to 1 kilobase). +#' +#' In addition to traditional VMRs, we also identify sparse Variably Methylated Probes (sVMPs), a second type of VML that takes into account the sparse and non-uniformly distributed coverage of CpGs in microarrays to tailor our analysis to this DNAme platform. sVMPs aim to retain genomic regions with high DNAme variability measured by single probes, where probe grouping based on proximity and correlation is therefore not applicable. This is particularly relevant in the Illumina EPIC v1 array, where most covered regulatory regions (up to 93%) are represented by just one probe. Notably, based on empirical comparisons with whole-genome bisulfite sequencing data, these single probes are mostly representative of local regional DNAme levels due to their positioning (98.5-99.5%). +#' +#' This function uses GenomicRanges::reduce() to group the regions, which is strand-sensitive. In the Illumina microarrays, the MAPINFO for all the probes is usually provided for the + strand. If you are using this array, we recommend first converting the strand of all the probes to "+". +#' +#' This function supports parallel computing for increased speed. To do so, you have to set the parallel backend +#' in your R session BEFORE running the function (e.g., *doParallel::registerDoParallel(4)*). 
After that, the function can be run as usual. When working with big datasets, the parallel backend might throw an error if you exceed the maximum allowed size of globals exported for future expression. This can be fixed by increasing the allowed size (e.g. running *options(future.globals.maxSize= +Inf)*) +#' +#' Note: this function does not exclude sex chromosomes. If you want to exclude them, you can do so in the methylation_data object before running the function. +#' +#' @param array_manifest Information about the probes on the array in a format compatible with the Bioconductor annotation packages. The user can specify one of the supported human microarrays ("IlluminaHumanMethylation450k" with the hg19 genome build, "IlluminaHumanMethylationEPICv1" with the hg19 genome build, or "IlluminaHumanMethylationEPICv2" with the hg38 genome build), or provide a manifest. The manifest requires the probe names as row names, and the following columns: "chr" (chromosome); "pos" (genomic location of the probe in the genome); and "strand" (this is very important to set up, since the VMRs will only be created based on CpGs on the same strand; if the positions are reported based on a single DNA strand, this should contain either a vector of only "+", "-" or "*" for all of the probes). +#' @param methylation_data A data frame containing M or B values, with samples as columns and probes as rows. Data is expected to have already passed through quality control and cleaning steps. +#' @param cor_threshold Numeric value (0-1) to be used as the median pearson correlation threshold for identifying VMRs (i.e. +#' all VMRs will have a median pairwise probe correlation higher than this threshold). +#' @param var_method A string indicating the metric to use to represent variability in the data set. The options are "mad" (median absolute deviation) +#' or "variance". 
+#' @param var_distribution A string indicating which probes in the data set should be used to create a variability distribution; the threshold to identify Highly Variable Probes (determined also with the var_threshold_percentile argument) is established based on this distribution. Option 1 is "ultrastable" (a subset of CpGs that are stably methylated/unmethylated across human tissues and developmental states described by [Edgar R., et al.](https://doi.org/10.1186/1756-8935-7-28) in 2014). This option is recommended, especially if you want to compare different populations or tissues, as the threshold value should be comparable. Option 2 is "all" (all probes in the data set). The "ultrastable" option is only compatible with Illumina human microarrays. The default is "ultrastable". +#' @param var_threshold_percentile The percentile (0-1) to be used as the cutoff to define Highly Variable Probes (which are then grouped into VML). If using the variability of the "ultrastable" probes, we recommend a high threshold (default is 0.99), since these probes are expected to display a very low variation in human tissues. If using the variability of "all" probes, we recommend using a percentile of 0.9 since it captures the top 10% most variable probes, which has been traditionally used in studies. It is important to note that the top 10% most variable probes will capture the same number of probes in a data set regardless of their overall variability levels, which might differ between tissues or populations. +#' @param max_distance Maximum distance in base pairs allowed for two probes to be grouped into a region. The default is 1000. +#' +#' @return A list with the following elements: +#' - $var_score_threshold: threshold used to define Highly Variable Probes (mad or variance, depending on the specified choice). 
+#' - $highly_variable_probes: a data frame with the probes that passed the variability score threshold imposed by the user, and their variability score (MAD score or variance). +#' - $VML: a GRanges-like data frame with VMRs (regions composed of two or more contiguous, correlated and proximal Highly Variable Probes), and sVMPs (highly variable probes without neighboring CpGs measured in *max_distance* on the array). +#' +#' @export +#' @examples +#' +#' VML <- RAMEN::findVML( +#' methylation_data = RAMEN::test_methylation_data, +#' array_manifest = "IlluminaHumanMethylationEPICv1", +#' cor_threshold = 0.15, +#' var_method = "variance", +#' var_distribution = "ultrastable", +#' var_threshold_percentile = 0.99, +#' max_distance = 1000 +#' ) +#' +findVML <- function(methylation_data, + array_manifest, + cor_threshold = 0.15, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000) { + #### Binding of variables used within the tidyverse framework #### + Methyl450_Loci <- epicv2_probes <- TargetID <- chr <- pos <- strand <- var_score <- probes <- . <- median_correlation <- n_VMPs <- VML_index <- type <- seqnames <- start <- end <- width <- NULL + #### Argument checks #### + # Check that the array manifest is in the right format + if (is.data.frame(array_manifest)) { + if (!all(c("chr", "pos", "strand") %in% colnames(array_manifest))) stop("The array_manifest data frame does not have the required columns. Please provide a manifest with the required columns or provide a string with one of the supported human microarrays ('IlluminaHumanMethylation450k', 'IlluminaHumanMethylationEPICv1','IlluminaHumanMethylationEPICv2')") + # Check that the array strand is in the format expected by the user + if (base::length(base::unique(array_manifest$strand)) > 1) warning("The manifest currently has more than one type of strands. Please note that this function is strand sensitive. 
So, probes in proximal coordinates but different strands on the manifest will not be grouped together. Many array manifests such as the Illumina EPIC one include the PROBE strand, but the position of the actual CpGs (pos) is reported in the same strand; in those cases we recommend setting all of the probes to the same strand.") + if (var_distribution == "ultrastable") { + # If the user provides their own manifest and is choosing to use the ultrastable probes, make sure that a good number of them is present in the data set. If not, throw an error + if (sum(row.names(array_manifest) %in% RAMEN::ultrastable_cpgs) < 100) stop("The var_distribution = 'ultrastable' option is only compatible with Illumina human microarrays at the moment. If you are using a human Illumina microarray please indicate it with its corresponding string, or make sure that it contains at least 100 ultrastable probes (RAMEN::ultrastable_cpgs). If not, please get the variability threshold based on all the probes in your data set (var_distribution = 'all', var_threshold_percentile = 0.9).") + } + } else if (is.character(array_manifest)) { + if (!array_manifest %in% c("IlluminaHumanMethylation450k", "IlluminaHumanMethylationEPICv1", "IlluminaHumanMethylationEPICv2")) stop("The string you provided in array_manifest is not currently supported in RAMEN. Please provide a manifest with the required columns or provide a string with one of the supported human microarrays ('IlluminaHumanMethylation450k', 'IlluminaHumanMethylationEPICv1','IlluminaHumanMethylationEPICv2')") + } else { + stop("The array_manifest object is neither a data.frame nor a string. 
Please provide a manifest with the required columns or provide a string with one of the supported human microarrays ('IlluminaHumanMethylation450k', 'IlluminaHumanMethylationEPICv1','IlluminaHumanMethylationEPICv2')") + } + #Check that cor_threshold is numeric and between 0 and 1 + if (!(is.numeric(cor_threshold) && cor_threshold >= 0 && cor_threshold <= 1)) { + stop("'cor_threshold' must be of type 'numeric' and from 0 to 1") + } + if (!is.data.frame(methylation_data)) stop("The methylation_data object must be a data frame with samples as columns and probes as rows.") + if (!var_distribution %in% c("all","ultrastable")) stop("'var_distribution' must be one of 'all' or 'ultrastable'") + # Check that the method choice is correct + if (var_method == "mad") { + var_scores <- apply(methylation_data, 1, stats::mad) %>% + as.data.frame() %>% + dplyr::rename("var_score" = ".") + } else if (var_method == "variance") { + var_scores <- apply(methylation_data, 1, stats::var) %>% + as.data.frame() %>% + dplyr::rename("var_score" = ".") + } else { + stop("The method must be either 'mad' or 'variance'. 
Please select one of those options") + } + + #### Identify highly variable probes #### + message("Identifying Highly Variable Probes...") + # Get the variability threshold + if (var_distribution == "all") { + var_threshold <- stats::quantile( + var_scores$var_score, + var_threshold_percentile + ) + } else if (var_distribution == "ultrastable") { + if (is.data.frame(array_manifest)) { + var_threshold <- stats::quantile( + var_scores[(row.names(var_scores) %in% RAMEN::ultrastable_cpgs), ], # Subset only ultrastable probes + var_threshold_percentile + ) + } else if (array_manifest == "IlluminaHumanMethylationEPICv2") { + # Get the name of the ultrastable probes in the EPICv2 format + epicv2_ultrastable_cpgs <- IlluminaHumanMethylationEPICv2anno.20a1.hg38::Other |> + data.frame() |> + dplyr::filter(Methyl450_Loci %in% RAMEN::ultrastable_cpgs) |> + tibble::rownames_to_column("epicv2_probes") |> + dplyr::pull(epicv2_probes) + var_threshold <- stats::quantile( + var_scores[(row.names(var_scores) %in% epicv2_ultrastable_cpgs), ], + var_threshold_percentile + ) + } else { + # EPICv1 or 450k (same probe name as the ultrastable probes) + var_threshold <- stats::quantile( + var_scores[(row.names(var_scores) %in% RAMEN::ultrastable_cpgs), ], # Subset only ultrastable probes + var_threshold_percentile + ) + } + } + # Replace the array manifest if the user provided a string with the name of the array + if (is.character(array_manifest)) { + if (array_manifest == "IlluminaHumanMethylation450k") { + manifest <- data.frame(IlluminaHumanMethylation450kanno.ilmn12.hg19::Locations) + } else if (array_manifest == "IlluminaHumanMethylationEPICv1") { + manifest <- data.frame(IlluminaHumanMethylationEPICanno.ilm10b4.hg19::Locations) + } else if (array_manifest == "IlluminaHumanMethylationEPICv2") { + manifest <- data.frame(IlluminaHumanMethylationEPICv2anno.20a1.hg38::Locations) + } + } else { + manifest <- array_manifest + } + # Filter the manifest to remove the probes that have no 
variability score information because they were not measured/did not pass the QC and are not highly variable + manifest_hvp <- manifest %>% + tibble::rownames_to_column(var = "TargetID") %>% + dplyr::select(c(TargetID, chr, pos, strand)) %>% + dplyr::filter( + !is.na(pos), # Remove probes with no map info + TargetID %in% row.names(var_scores %>% + dplyr::filter(var_score >= var_threshold)) + ) %>% # Remove probes that have no methylation information in the processed data and are not highly variable + dplyr::left_join( + var_scores %>% # Add variability information + tibble::rownames_to_column(var = "TargetID"), + by = "TargetID" + ) %>% + dplyr::arrange(chr) %>% # important step for using Rle later when constructing the GenomicRanges object! + as.data.frame() + rownames(manifest_hvp) <- manifest_hvp$TargetID + if (is.factor(manifest_hvp$chr)) manifest_hvp <- manifest_hvp %>% dplyr::mutate(chr = droplevels(chr)) + + #### Identify sparse Variable Methylated Probes#### + message("Identifying sparse Variable Methylated Probes") + full_manifest <- manifest %>% + tibble::rownames_to_column(var = "TargetID") %>% + dplyr::select(c(TargetID, chr, pos, strand)) %>% + dplyr::filter( + !is.na(pos), # Remove probes with no map info + TargetID %in% row.names(var_scores) + ) %>% # keep only the probes where we have methylation information + dplyr::arrange(chr) %>% # important step for using Rle later when constructing the GenomicRanges object! 
+ as.data.frame() + rownames(full_manifest) <- full_manifest$TargetID + if (is.factor(full_manifest$chr)) full_manifest <- full_manifest %>% dplyr::mutate(chr = droplevels(chr)) + + # Convert the full manifest to a GenomicRanges object + seqnames_full_manifest_gr <- table(full_manifest$chr) + full_manifest_gr <- GenomicRanges::GRanges( + seqnames = S4Vectors::Rle(names(seqnames_full_manifest_gr), as.numeric(seqnames_full_manifest_gr)), # Number of chromosome; as.numeric to convert from table to numeric vector + ranges = IRanges::IRanges(full_manifest$pos, + end = full_manifest$pos, + names = full_manifest$TargetID + ), + strand = S4Vectors::Rle( + rle(as.character(full_manifest$strand))$values, + rle(as.character(full_manifest$strand))$lengths + ) + ) + + #### Group the probes into regions to detect sVMPs#### + regions_full_manifest <- GenomicRanges::reduce(full_manifest_gr, with.revmap = TRUE, min.gapwidth = max_distance) + # Add the number of probes in each region + S4Vectors::mcols(regions_full_manifest)$n_probes <- vapply(S4Vectors::mcols(regions_full_manifest)$revmap, + length, + FUN.VALUE = numeric(1)) + # Substitute revmap with the name of the probes in each region + S4Vectors::mcols(regions_full_manifest)$probes <- lapply(S4Vectors::mcols(regions_full_manifest)$revmap, map_revmap_names, full_manifest) + # Remove revmap mcol + S4Vectors::mcols(regions_full_manifest)$revmap <- NULL + # Keep elements with only one probe + lonely_probes <- regions_full_manifest[(GenomicRanges::elementMetadata(regions_full_manifest)[, "n_probes"] <= 1)] %>% + as.data.frame() %>% + dplyr::pull(probes) %>% + unlist() + + #### Identify VMRs#### + message("Identifying Variable Methylated Regions...") + # convert the highly variable probes data frame to a GenomicRanges object + seqnames_gr <- table(manifest_hvp$chr) + gr <- GenomicRanges::GRanges( + seqnames = S4Vectors::Rle(names(seqnames_gr), as.numeric(seqnames_gr)), # Number of chromosome; as.numeric to convert from table to 
numeric vector + ranges = IRanges::IRanges(manifest_hvp$pos, + end = manifest_hvp$pos, + names = manifest_hvp$TargetID + ), + strand = S4Vectors::Rle( + rle(as.character(manifest_hvp$strand))$values, + rle(as.character(manifest_hvp$strand))$lengths + ), + var_score = manifest_hvp$var_score + ) # Metadata + + # Create the regions + candidate_VMRs <- GenomicRanges::reduce(gr, with.revmap = TRUE, min.gapwidth = max_distance) + # Add the number of probes in each region + S4Vectors::mcols(candidate_VMRs)$n_VMPs <- vapply(S4Vectors::mcols(candidate_VMRs)$revmap, + length, + FUN.VALUE = numeric(1)) + # Substitute revmap with the name of the probes in each VMR + S4Vectors::mcols(candidate_VMRs)$probes <- lapply(S4Vectors::mcols(candidate_VMRs)$revmap, map_revmap_names, manifest_hvp) + # Remove revmap mcol + S4Vectors::mcols(candidate_VMRs)$revmap <- NULL + + ### Capture canonical VMRs ### + message("Applying correlation filter to Variable Methylated Regions...") + VMRs <- candidate_VMRs[(GenomicRanges::elementMetadata(candidate_VMRs)[, "n_VMPs"] > 1)] %>% + data.frame() # Convert the GR to a data frame so that I can use medCorVMR() + ### Check for correlation between probes only if we have VMRs + if (nrow(VMRs) > 0) { + VMRs <- VMRs %>% + medCorVMR(VMR_df = ., methylation_data = methylation_data) %>% # Compute the median correlation of each region + dplyr::filter(median_correlation > cor_threshold) %>% # Remove VMRs whose CpGs are not correlated + GenomicRanges::makeGRangesFromDataFrame(keep.extra.columns = TRUE) # Create a GR object again + } else { + warning("No VMRs were found in this data set") + } + + ### Capture non-canonical VMRs ### + sVMPs <- candidate_VMRs[(GenomicRanges::elementMetadata(candidate_VMRs)[, "probes"] %in% lonely_probes)] # Select the lonely probes + GenomicRanges::mcols(sVMPs)$median_correlation <- rep(NA, nrow(GenomicRanges::mcols(sVMPs))) # Add a column of NAs under the name of median_correlation to match the strict_VMRs + + return(list( + 
var_score_threshold = var_threshold, + highly_variable_probes = var_scores %>% + tibble::rownames_to_column(var = "TargetID") %>% + dplyr::filter(TargetID %in% manifest_hvp$TargetID), + VML = data.frame(VMRs) %>% + rbind(data.frame(sVMPs)) %>% + dplyr::mutate( + type = ifelse(n_VMPs > 1, "VMR", "sVMP"), + VML_index = paste("VML", as.character(dplyr::row_number()), sep = "") + ) %>% + dplyr::select(VML_index, type, seqnames, start, end, width, strand, probes, n_VMPs, median_correlation) + )) +} diff --git a/R/findVMRs.R b/R/findVMRs.R deleted file mode 100644 index 10609b3..0000000 --- a/R/findVMRs.R +++ /dev/null @@ -1,213 +0,0 @@ -#' Map revmap column to probe names after reducing a GenomicRanges object -#' -#' Given a revmap row (e.g. 1 5 6), we map those positions to their corresponding probe names -#' (and end up with something like "cg00000029", "cg00000158", "cg00000165".This is a helper function -#' of findVMRs()). -#' -#' @param positions A revmap row in the form of a vector -#' @param manifest_hvp the manifest of the highly variable probes used in the findVMRs() function -#' with the probes as row names -#' -#' @return a vector with the names of the probes that conform one reduced region -#' -map_revmap_names = function(positions, manifest_hvp){ - #We start with 1 5 6 - #We want to end with cg00000029, cg00000158 cg00000165 - names = c() - for (element in positions){ - names =c(names, row.names(manifest_hvp)[element] ) - } - return(names) -} - - - -#' Identify Variable Methylated Regions in microarrays -#' -#' Identifies autosomal Highly Variable Probes (HVP) and merges them into Variable Methylated Regions (VMRs) given an Illumina manifest. -#' -#' This function identifies HVPs using MAD scores or variance metrics, and groups them into VMRs, which are defined as clusters of proximal and correlated HVPs (distance and correlation defined by the user). Output VMRs can be separated into canonical and non canonical. 
Canonical VMRs are regions that meet the correlation and closeness criteria. For guidance on which correlation threshold to use, we recommend checking the Supplementary Figure 1 of the CoMeBack R package (Gatev *et al.*, 2020) where a simulation to empirically determine a default guidance specification for a correlation threshold parameter dependent on sample size is done. As default, we use a threshold of 0.15 as per the CoMeBack authors minimum threshold suggestion. On the other hand, non canonical VMRs are regions that are composed of HVPs that have no nearby probes measured in the array (according to the max_distance parameter); this category was created to account for the Illumina EPIC array design, which has a high number of probes in regulatory regions that are represented by a single probe. Furthermore, these probes have been shown to be good representatives of the methylation state of its surroundings (Pidsley et al., 2016). By creating this category, we recover those informative HVPs that otherwise would be excluded from the analysis because of the array design. -#' -#' This function uses GenomicRanges::reduce() to group the regions, which is strand-sensitive. In the Illumina microarrays, the MAPINFO for all the probes -#' is usually provided as for the + strand. If you are using this array, we recommend to first -#' convert the strand of all the probes to "+". -#' -#' This function supports parallel computing for increased speed. To do so, you have to set the parallel backend -#' in your R session BEFORE running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, -#' the function can be run as usual. When working with big datasets, the parallel backend might throw an error if you exceed -#' the maximum allowed size of globals exported for future expression. This can be fixed by increasing the allowed size (e.g. 
running options(future.globals.maxSize= +Inf) ) -#' -#'Note: this function excludes sex chromosomes. -#' -#' @param array_manifest Information about the probes on the array. Requires the columns MAPINFO (basepair position -#' of the probe in the genome), CHR (chromosome), TargetID (probe name) and STRAND (this is very important to set up, since -#' the VMRs will only be created based on CpGs on the same strand; if the positions are reported based on a single DNA strand, this should contain either a vector of only "+", "-" or "*" for all of the probes). -#' @param methylation_data A data frame containing M or B values, with samples as columns and probes as rows. Data is expected to have already passed through quality control and cleaning steps. -#' @param cor_threshold Numeric value (0-1) to be used as the median pearson correlation threshold for identifying VMRs (i.e. -#' all VMRs will have a median pairwise probe correlation of this parameter). -#' @param var_method Method to use to measure variability in the data set. The options are "mad" (median absolute deviation) -#' or "variance". -#' @param var_threshold_percentile The percentile (0-1) to be used as cutoff to define Highly Variable Probes (and -#' therefore VMRs). The default is 0.9 because this percentile has been traditionally used in previous studies. -#' @param max_distance Maximum distance allowed for two probes to be grouped into a region. The default is 1000 -#' because this window has been traditionally used in previous studies. -#' -#' @return A list with the following elements: -#' - $var_score_threshold: threshold used to define Highly Variable Probes (mad or variance, depending on the specified choice). -#' - $highly_variable_probes: a data frame with the probes that passed the variability score threshold imposed by the user, and their variability score (MAD score or variance). 
-#' - $canonical_VMRs: a GRanges object with strict candidate VMRs - regions composed of two or more -#' contiguous, correlated and proximal Highly Variable Probes; thresholds depend on the ones specified -#' by the user) -#' - $non_canonical_VMRs: a GRanges object with highly variable probes without neighboring -#' CpGs measured in *max_distance* on the array. Category created to take into acccount the Illumina array design of single probes capturing the methylation state of regulatory regions. -#' -#' @export -#' @examples -#' #We need to modify the RAMEN::test_array_manifest object by assigning to -#' #row names to the probe ID column; it was saved this way because storing -#' #the TargetID as row names reduced significantly the size of the data set. -#' test_array_manifest_final = RAMEN::test_array_manifest %>% -#' tibble::rownames_to_column(var = "TargetID") -#' -#' VMRs = RAMEN::findVMRs(array_manifest = test_array_manifest_final, -#' methylation_data = RAMEN::test_methylation_data, -#' cor_threshold = 0, -#' var_method = "variance", -#' var_threshold_percentile = 0.9, -#' max_distance = 1000) -#' -findVMRs = function(array_manifest, - methylation_data, - cor_threshold = 0.15, - var_method = "variance", - var_threshold_percentile = 0.9, - max_distance = 1000){ - #Check that the array manifest is in the right format - if(!all(c("MAPINFO","CHR","TargetID","STRAND") %in% colnames(array_manifest))){ - stop("Please make sure the array manifest has the required columns with the appropiate names (check documentation for further information)") - } - #Check that the array strand is in the format expected by the user - if(base::length(base::unique(array_manifest$STRAND)) > 1){ - warning("The manifest currently has more than one type of strands. Please note that this function is strand sensitive. So, probes in proximal coordinates but different strands on the manifest will not be grouped together. 
Many array manifests such as the EPIC one include the PROBE strand, but the position of the actual CpGs (MAPINFO) is reported in the same strand; in those cases we recommend setting all of the probes to the same strand.") - } - #Check that the method choice is correct - if(var_method == "mad"){ - var_scores = apply(methylation_data, 1, stats::mad) %>% - as.data.frame() %>% - dplyr::rename("var_score" = ".") - } else if (var_method == "variance") { - var_scores = apply(methylation_data, 1, stats::var) %>% - as.data.frame() %>% - dplyr::rename("var_score" = ".") - } else { - stop("The method must be either 'mad' or 'variance'. Please select one of those options") - } - - ####Identify highly variable probes #### - message("Identifying Highly Variable Probes...") - var_threshold = stats::quantile(var_scores$var_score, var_threshold_percentile) - #Filter the manifest to remove the probes that have no variability score information because they were not measured/did not pass the QC and are not highly variable - manifest_hvp = array_manifest %>% - dplyr::select(c(TargetID, CHR, MAPINFO, STRAND)) %>% - dplyr::filter(!is.na(MAPINFO), #Remove probes with no map info - !CHR %in% c("X","Y"), #Remove sexual chromosomes - TargetID %in% row.names(var_scores %>% - dplyr::filter(var_score >= var_threshold))) %>% #Remove probes that have no methylation information in the processed data and are not highly variable - dplyr::left_join(var_scores %>% #Add variability information - tibble::rownames_to_column(var = "TargetID"), - by = "TargetID") %>% - dplyr::arrange(CHR) %>% #important step for using Rle later when constructing the GenomicRanges object! 
- as.data.frame() - rownames(manifest_hvp) = manifest_hvp$TargetID - if(is.factor(manifest_hvp$CHR)) manifest_hvp = manifest_hvp %>% dplyr::mutate(CHR = droplevels(CHR)) - - #### Identify probes with no neighbours#### - message("Identifying non canonical Variable Methylated Regions...") - full_manifest = array_manifest %>% - dplyr::select(c(TargetID, CHR, MAPINFO, STRAND)) %>% - dplyr::filter(!is.na(MAPINFO), #Remove probes with no map info - !CHR %in% c("X","Y"), #Remove sexual chromosomes - TargetID %in% row.names(var_scores)) %>% #keep only the probes where we have methylation information - dplyr::arrange(CHR) %>% #important step for using Rle later when constructing the GenomicRanges object! - as.data.frame() - rownames(full_manifest) = full_manifest$TargetID - if(is.factor(full_manifest$CHR)) full_manifest = full_manifest %>% dplyr::mutate(CHR = droplevels(CHR)) - - #Convert the full manifest to a GenomicRanges object - seqnames_full_manifest_gr = table(full_manifest$CHR) - full_manifest_gr = GenomicRanges::GRanges( - seqnames = S4Vectors::Rle(names(seqnames_full_manifest_gr), as.numeric(seqnames_full_manifest_gr)), #Number of chromosome; as.numeric to convert from table to numeric vector - ranges = IRanges::IRanges(full_manifest$MAPINFO, end = full_manifest$MAPINFO , - names = full_manifest$TargetID), - strand = S4Vectors::Rle(rle(as.character(full_manifest$STRAND))$values, - rle(as.character(full_manifest$STRAND))$lengths )) - - #### Group the probes into regions to detect non-canonical VMRs - regions_full_manifest = GenomicRanges::reduce(full_manifest_gr, with.revmap = TRUE, min.gapwidth = max_distance) - #Add the number of probes in each region - S4Vectors::mcols(regions_full_manifest)$n_probes = sapply(S4Vectors::mcols(regions_full_manifest)$revmap, length) - #Substitute revmap with the name of the probes in each region - S4Vectors::mcols(regions_full_manifest)$probes = sapply(S4Vectors::mcols(regions_full_manifest)$revmap, map_revmap_names, 
full_manifest) - #Remove revmap mcol - S4Vectors::mcols(regions_full_manifest)$revmap = NULL - #Keep elements with only one probe - lonely_probes = regions_full_manifest[(GenomicRanges::elementMetadata(regions_full_manifest)[,"n_probes"] <= 1)] %>% - as.data.frame() %>% - dplyr::pull(probes) %>% - unlist() - - #### Identify VMRs#### - message("Identifying canonical Variable Methylated Regions...") - #convert the highly variable probes data frame to a GenomicRanges object - seqnames_gr = table(manifest_hvp$CHR) - gr = GenomicRanges::GRanges( - seqnames = S4Vectors::Rle(names(seqnames_gr), as.numeric(seqnames_gr)), #Number of chromosome; as.numeric to convert from table to numeric vector - ranges = IRanges::IRanges(manifest_hvp$MAPINFO, end = manifest_hvp$MAPINFO , - names = manifest_hvp$TargetID), - strand = S4Vectors::Rle(rle(as.character(manifest_hvp$STRAND))$values, - rle(as.character(manifest_hvp$STRAND))$lengths ), - var_score = manifest_hvp$var_score) #Metadata - - #Create the regions - candidate_VMRs = GenomicRanges::reduce(gr, with.revmap = TRUE, min.gapwidth = max_distance) - #Add the number of probes in each region - S4Vectors::mcols(candidate_VMRs)$n_VMPs = sapply(S4Vectors::mcols(candidate_VMRs)$revmap, length) - #Add the width of each region - S4Vectors::mcols(candidate_VMRs)$width = S4Vectors::width(candidate_VMRs) - #Substitute revmap with the name of the probes in each VMR - S4Vectors::mcols(candidate_VMRs)$probes = sapply(S4Vectors::mcols(candidate_VMRs)$revmap, map_revmap_names, manifest_hvp) - #Remove revmap mcol - S4Vectors::mcols(candidate_VMRs)$revmap = NULL - - ### Capture canonical VMRs ### - message("Applying correlation filter to canonical Variable Methylated Regions...") - canonical_VMRs = candidate_VMRs[(GenomicRanges::elementMetadata(candidate_VMRs)[,"n_VMPs"] > 1)] %>% - #Check for correlation between probes in these strict regions # - data.frame() #Convert the GR to a data frame so that I can use medCorVMR() - ### Check that the VMRs 
contain surrounding probes only if we have potential canonical VMRs - if(nrow(canonical_VMRs) > 0){ - canonical_VMRs = canonical_VMRs %>% - medCorVMR(VMR_df = ., methylation_data = methylation_data) %>% # Compute the median correlation of each region - dplyr::filter(median_correlation > cor_threshold) %>% #Remove VMRs whose CpGs are not correlated - GenomicRanges::makeGRangesFromDataFrame(keep.extra.columns = TRUE) #Create a GR object again - colnames(S4Vectors::mcols(canonical_VMRs))[2] = "width" #Changing the name of one metadata variable that was modified when transforming from data frame to GR object - } else warning("No canonical VMRs were found in this data set") - - - ### Capture non-canonical VMRs ### - non_canonical_VMRs = candidate_VMRs[(GenomicRanges::elementMetadata(candidate_VMRs)[,"n_VMPs"] <= 1)] #Select the VMRs with 1 probe per region - non_canonical_VMRs = candidate_VMRs[(GenomicRanges::elementMetadata(candidate_VMRs)[,"probes"] %in% lonely_probes)] #Select the lonely probes - GenomicRanges::mcols(non_canonical_VMRs)$median_correlation = rep(NA, nrow(GenomicRanges::mcols(non_canonical_VMRs))) #Add a column of NAs under the name of median_correlation to match the strict_VMRs - - - - return(list( - var_score_threshold = var_threshold, - highly_variable_probes = var_scores %>% - tibble::rownames_to_column(var = "TargetID") %>% - dplyr::filter(TargetID %in% manifest_hvp$TargetID), - canonical_VMRs = canonical_VMRs, - non_canonical_VMRs = non_canonical_VMRs - )) -} - diff --git a/R/lmGE.R b/R/lmGE.R index 947c9e9..d6b72cd 100644 --- a/R/lmGE.R +++ b/R/lmGE.R @@ -1,23 +1,22 @@ - #' Fit linear G, E, G+E and GxE models and select the winning model #' -#' For a set of Variable Methylated Region (VMR), this function fits a set of genotype (G), environment (E), pairwise additive (G + E) or pairwise interaction (G x E) models, one variable at a time, and selects the best fitting one. 
Additional information for each winning model is provided, such as its R2, its R2 increase comparing it to a basal model (i.e., a model only fitted with the concomitant variables), the delta AIC/BIC to the next best model from a different category, and the explained variance decomposed for the G, E and GxE components (when applicable). +#' For a set of Variable Methylated Loci (VML), this function fits a set of genotype (G), environment (E), pairwise additive (G + E) or pairwise interaction (G x E) models, one variable at a time, and selects the best fitting one. Additional information for each winning model is provided, such as its R2, its R2 increase comparing it to a basal model (i.e., a model only fitted with the concomitant variables), the delta AIC/BIC to the next best model from a different category, and the explained variance decomposed for the G, E and GxE components (when applicable). If a VML has no variables selected in the selected_variables object, it will be returned with "B" (basal) as the best model (interpreted as no G or E associated effect). #' #' This function supports parallel computing for increased speed. To do so, you have to set the parallel backend -#' in your R session before running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, +#' in your R session before running the function (e.g., *doParallel::registerDoParallel(4)*). After that, #' the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). #' -#' For each VMR, this function computes a set of models using the variables indicated in the selected_variables object. From the indicated G and E variables, lmGE() fits four groups of models: +#' For each VML, this function computes a set of models using the variables indicated in the selected_variables object. 
From the indicated G and E variables, lmGE() fits four groups of models: #' - G: Genetics model - fitted one SNP at a time. #' - E: Environmental model - fitted one environmental variable at a time. #' - G+E: Additive model - fitted for each pairwise combination of G and E variables indicated in selected_variables. #' - GxE: Interaction model - fitted for each pairwise combination of G and E variables indicated in selected_variables. #' -#' These models are fit only if the VMR has G or E variables in the selected_variables object. If a VMR does not have neither G nor E variables, that VMR will be ignored and will not be returned in the output object. +#' These models are fit only if the VML has G or E variables in the selected_variables object. If a VML has neither G nor E variables, no G or E models are fit for it, and it is returned in the output object with "B" (basal) as the best explanatory model. #' #' **Model selection** #' -#' Following the model fitting stage, the best model **per group** is selected using Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). Both of these metrics are statistical approaches to select the best model in the same data set, and they have strengths and limitations that make them excel in different situations. We recommend using AIC because BIC assumes that the true model is in the set of compared models. Since this function fits models with individual variables, and we assume that DNAme variability is more likely to be influenced by more than one single SNP/environmental exposure at a time, we hypothesize that in most cases, the true model will not be in the set of compared models. Also, AIC excels in situations where all models in the model space are "incorrect", and AIC is preferentially used in cases where the true underlying function is unknown and our selected model could belong to a very large class of functions where the relationship could be pretty complex. 
It is worth mentioning however that, both metrics tend to pick the same model in a large number of scenarios. We suggest the users to read Arijit Chakrabarti & Jayanta K. Ghosh, 2011 for further information about the difference between these metrics. +#' Following the model fitting stage, the best model **per group** is selected using Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). Both of these metrics are statistical approaches to select the best model in the same data set, and they have strengths and limitations that make them excel in different situations. We recommend using AIC because BIC assumes that the true model is in the set of compared models. Since this function fits models with individual variables, and we assume that DNAme variability is more likely to be influenced by more than one single SNP/environmental exposure at a time, we hypothesize that in most cases, the true model will not be in the set of compared models. Also, AIC excels in situations where all models in the model space are "incomplete", and AIC is preferentially used in cases where the true underlying function is unknown and our selected model could belong to a very large class of functions where the relationship could be pretty complex. It is worth mentioning, however, that both metrics tend to pick the same model in a large number of scenarios. We suggest that users read Arijit Chakrabarti & Jayanta K. Ghosh, 2011 for further information about the difference between these metrics. #' #' After selecting the best model per group (G,E,G+E pr GxE), the model with the lowest AIC or BIC is declared as the winning model. The delta AIC/BIC and difference of R2 is computed relative to the model with the second lowest AIC/BIC (i.e., the best model from a different group to the winning one), and reported in the final object. 
#' @@ -25,18 +24,18 @@ #' #' Finally, the variance is decomposed and the relative R2 contribution of each of the variables of interest (G, E and GxE) is reported. This decomposition is done using the relaimpo R package, using the Lindeman, Merenda and Gold (lmg) method, which is based on the heuristic approach of averaging the relative R contribution of each variable over all input orders in the linear model. The estimation of the partitioned R2 of each factor in the models was conducted keeping the covariates always in the model as first entry (i.e., the variables specified in covariates did not change order). For further information, we suggest the users to read the documentation and publication of the relaimpo R package (Grömping, 2006). #' -#' @param selected_variables A data frame obtained with *RAMEN::selectVariables()*. This data frame must contain three columns: 'VMR_index' with characters of an unique ID of each VMR; ´selected_genot' and 'selected_env' with the SNPs and environmental variables, respectively, that will be used for fitting the genotype (G), environment (E), additive (G + E) or interaction (G x E) models. The columns 'selected_env' and 'selected_genot' must contain lists as elements; VMRs with no environmental or genotype selected variables must contain an empty list with NULL, NA , character(0) or "" inside. -#' @param model_selection Which metric to use to select the best model for each VMR. Supported options are "AIC" or BIC". More information about which one to use can be found in the Details section. +#' @param selected_variables A data frame obtained with *RAMEN::selectVariables()*. This data frame must contain three columns: 'VML_index' with a unique character ID for each VML; 'selected_genot' and 'selected_env' with the SNPs and environmental variables, respectively, that will be used for fitting the genotype (G), environment (E), additive (G + E) or interaction (G x E) models. 
The columns 'selected_env' and 'selected_genot' must contain lists as elements; VML with no environmental or genotype selected variables must contain an empty list with NULL, NA, character(0) or "" inside. +#' @param model_selection Which metric to use to select the best model for each VML. Supported options are "AIC" or "BIC". More information about which one to use can be found in the Details section. #' @inheritParams selectVariables #' @return A data frame with the following columns: -#' - VMR_index: The unique ID of the VMR +#' - VML_index: The unique ID of the VML #' - model_group: The group to which the winning model belongs to (i.e., G, E, G+E or GxE) #' - variables: The variable(s) that are present in the winning model (excluding the covariates, which are included in all the models) #' - tot_r_squared: R squared of the winning model #' - g_r_squared: Estimated R2 allocated to the G in the winning model, if applicable. #' - e_r_squared: Estimated R2 allocated to the E in the winning model, if applicable. #' - gxe_r_squared: Estimated R2 allocated to the interaction in the winning model (GxE), if applicable. -#' - AIC/BIC: AIC or BIC metric from the best model in each VMR (depending on the option specified in the argument model_selection). +#' - AIC/BIC: AIC or BIC metric from the best model in each VML (depending on the option specified in the argument model_selection). #' - second_winner: The second group that possesses the next best model after the winning one (i.e., G, E, G+E or GxE). This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other model groups to compare to. #' - delta_aic/delta_bic: The difference of AIC or BIC value (depending on the option specified in the argument model_selection) of the winning model and the best model from the second_winner group (i.e., G, E, G+E or GxE). 
This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other groups to compare to. #' - delta_r_squared: The R2 of the winning model - R2 of the second winner model. This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other groups to compare to. @@ -45,241 +44,418 @@ #' @importFrom foreach %dopar% #' @importFrom foreach %do% #' @export +#' @examples +#' ## Find VML in test data +#' VML <- RAMEN::findVML( +#' methylation_data = RAMEN::test_methylation_data, +#' array_manifest = "IlluminaHumanMethylationEPICv1", +#' cor_threshold = 0, +#' var_method = "variance", +#' var_distribution = "ultrastable", +#' var_threshold_percentile = 0.99, +#' max_distance = 1000 +#' ) +#' ## Find cis SNPs around VML +#' VML_with_cis_snps <- RAMEN::findCisSNPs( +#' VML_df = VML$VML, +#' genotype_information = RAMEN::test_genotype_information, +#' distance = 1e6 +#' ) +#' +#' ## Summarize methylation levels in VML +#' summarized_methyl_VML <- RAMEN::summarizeVML( +#' methylation_data = RAMEN::test_methylation_data, +#' VML_df = VML_with_cis_snps +#' ) +#' +#' ## Select relevant genotype and environmental variables +#' selected_vars <- RAMEN::selectVariables( +#' VML_df = VML_with_cis_snps, +#' genotype_matrix = RAMEN::test_genotype_matrix, +#' environmental_matrix = RAMEN::test_environmental_matrix, +#' covariates = RAMEN::test_covariates, +#' summarized_methyl_VML = summarized_methyl_VML, +#' seed = 1 +#' ) #' -lmGE = function(selected_variables, - summarized_methyl_VMR, - genotype_matrix, - environmental_matrix, - covariates = NULL, - model_selection = "AIC"){ - #Check arguments - # Check that genotype_matrix, environmental_matrix, covariate matrix (in case it is provided) and summarized_methyl_VMR have the same samples - if(!all(rownames(summarized_methyl_VMR) %in% colnames(genotype_matrix))) stop("Individual IDs in summarized_methyl_VMR do not match 
individual IDs in genotype_matrix") - if (!all(rownames(summarized_methyl_VMR) %in% rownames(environmental_matrix))) stop("Individual IDs in summarized_methyl_VMR do not match individual IDs in environmental_matrix") - if(!is.null(covariates)){ - if (!all(rownames(summarized_methyl_VMR) %in% rownames(covariates)))stop("Individual IDs in summarized_methyl_VMR do not match individual IDs in the covariates matrix")} - #Check that selected_variables has the right columns - if(!all(c("VMR_index","selected_genot", "selected_env") %in% colnames(selected_variables))) stop("Please make sure the selected_variables data frame contains the columns 'VMR_index', 'selected_genot' and 'selected_env'.") - #Check that the selected_genot and selected_env columns on selected_variables is a list and the index is characters - if(!is.list(selected_variables$selected_genot)) stop("Please make sure the 'selected_genot' column in selected_variables contains lists as elements") - if(!is.list(selected_variables$selected_env)) stop("Please make sure the 'selected_env' column in selected_variables contains lists as elements") - if(!is.character(selected_variables$VMR_index)) stop("Please make sure the 'VMR_index' column in selected_variables contains characters") - #Check that genotype, environment and covariates are matrices +#' ## Fit G, E, G+E and GxE models and select the winning one +#' lmge_res <- RAMEN::lmGE( +#' selected_variables = selected_vars, +#' summarized_methyl_VML = summarized_methyl_VML, +#' genotype_matrix = RAMEN::test_genotype_matrix, +#' environmental_matrix = RAMEN::test_environmental_matrix, +#' covariates = RAMEN::test_covariates, +#' model_selection = "AIC" +#' ) +#' +lmGE <- function(selected_variables, + summarized_methyl_VML, + genotype_matrix, + environmental_matrix, + covariates = NULL, + model_selection = "AIC") { + #### Binding of variables used within the tidyverse framework #### + selected_env <- selected_genot <- VML_i <- SNP <- env <- model_group <- AIC <- 
BIC <- tot_r_squared <- VML_index <- variables <- g_r_squared <- e_r_squared <- gxe_r_squared <- second_winner <- delta_aic <- delta_r_squared <- basal_AIC <- basal_rsquared <- delta_bic <- basal_BIC <- NULL + + # Check arguments + # Check that genotype_matrix, environmental_matrix, covariate matrix (in case it is provided) and summarized_methyl_VML have the same samples + if (!is.data.frame(summarized_methyl_VML)) stop("Please make sure summarized_methyl_VML is a data frame.") + if (!all(rownames(summarized_methyl_VML) %in% colnames(genotype_matrix))) stop("Individual IDs in summarized_methyl_VML do not match individual IDs in genotype_matrix") + if (!all(rownames(summarized_methyl_VML) %in% rownames(environmental_matrix))) stop("Individual IDs in summarized_methyl_VML do not match individual IDs in environmental_matrix") + if (!is.null(covariates)) { + if (!all(rownames(summarized_methyl_VML) %in% rownames(covariates))) stop("Individual IDs in summarized_methyl_VML do not match individual IDs in the covariates matrix") + } + # Check that selected_variables has the right columns + if (!all(c("VML_index", "selected_genot", "selected_env") %in% colnames(selected_variables))) stop("Please make sure the selected_variables data frame contains the columns 'VML_index', 'selected_genot' and 'selected_env'.") + # Check that the selected_genot and selected_env columns on selected_variables is a list and the index is characters + if (!is.list(selected_variables$selected_genot)) stop("Please make sure the 'selected_genot' column in selected_variables contains lists as elements") + if (!is.list(selected_variables$selected_env)) stop("Please make sure the 'selected_env' column in selected_variables contains lists as elements") + if (!is.character(selected_variables$VML_index)) stop("Please make sure the 'VML_index' column in selected_variables contains characters") + # Check that genotype, environment and covariates are matrices if (!is.matrix(genotype_matrix)) stop("Please 
make sure the genotype data is provided as a matrix.") if (!is.matrix(environmental_matrix)) stop("Please make sure the environmental data is provided as a matrix.") - if (!is.null(covariates)){ - if (!is.matrix(covariates)) stop("Please make sure the covariates data is provided as a matrix.")} - if(!model_selection %in% c("AIC", "BIC")) stop("Please make sure your model_selection method is 'AIC' or 'BIC'") + if (!is.null(covariates)) { + if (!is.matrix(covariates)) stop("Please make sure the covariates data is provided as a matrix.") + } + if (!model_selection %in% c("AIC", "BIC")) stop("Please make sure your model_selection method is 'AIC' or 'BIC'") + ## Check that genotype_matrix, environmental_matrix, and covariates (in case + ## it is provided) have only numeric values and no NA, NaN, Inf values + if ( + sum(vapply(genotype_matrix, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(genotype_matrix, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(genotype_matrix, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(genotype_matrix, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the genotype matrix contains only finite numeric values." + ) + if ( + sum(vapply(environmental_matrix, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(environmental_matrix, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(environmental_matrix, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(environmental_matrix, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the environmental matrix contains only finite numeric values." 
+ ) + if (!is.null(covariates)) { + if ( + sum(vapply(covariates, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(covariates, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(covariates, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(covariates, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the covariates matrix contains only finite numeric values." + ) + } - #Remove VMRs that have no selected G and no selected E - selected_variables = selected_variables %>% - dplyr::filter(!(selected_env %in% c(list(NULL), list(""), list(NA), list(character(0))) & + if ( + sum(vapply(summarized_methyl_VML, + is.na, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 || + sum(vapply(summarized_methyl_VML, + is.nan, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 || + sum(!vapply(summarized_methyl_VML, + is.numeric, + FUN.VALUE = logical(1) + ) + ) > 0 || + sum(vapply(summarized_methyl_VML, + is.infinite, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 + ) stop ( + "Please make sure the summarized_methyl_VML data frame contains only finite numeric values." 
+ ) + + + # Filter VML that have no selected G and no selected E + no_vars_VML <- selected_variables %>% + dplyr::filter((selected_env %in% c(list(NULL), list(""), list(NA), list(character(0))) & selected_genot %in% c(list(NULL), list(""), list(NA), list(character(0))))) + selected_variables <- selected_variables %>% + dplyr::filter(!(selected_env %in% c(list(NULL), list(""), list(NA), list(character(0))) & + selected_genot %in% c(list(NULL), list(""), list(NA), list(character(0))))) - #Select the winning model - winning_models = foreach::foreach(VMR_i = iterators::iter(selected_variables, by = "row"), - .combine = "rbind") %dopar% { #For every VMR - #Create the data frame with all the information for each VMR - summ_vmr_i = as.matrix(summarized_methyl_VMR[,VMR_i$VMR_index]) - colnames(summ_vmr_i) = "DNAme" - if (!VMR_i$selected_env %in% c(list(NULL), list(""), list(NA), list(character(0)))) { - if(length(VMR_i$selected_env[[1]]) == 1){ - env_i = environmental_matrix[rownames(summarized_methyl_VMR), unlist(VMR_i$selected_env)] %>% - as.matrix() - colnames(env_i) = unlist(VMR_i$selected_env) - } else env_i = environmental_matrix[rownames(summarized_methyl_VMR), unlist(VMR_i$selected_env)] - } else env_i = NULL - if (!VMR_i$selected_genot %in% c(list(NULL), list(""), list(NA), list(character(0)))) { - if(length(VMR_i$selected_genot[[1]]) == 1 ){ - genot_i = genotype_matrix[unlist(VMR_i$selected_genot),rownames(summarized_methyl_VMR)] %>% - as.matrix() - colnames(genot_i) = unlist(VMR_i$selected_genot) - } else { - genot_i = genotype_matrix[unlist(VMR_i$selected_genot),rownames(summarized_methyl_VMR)] %>% - t() - } - } else genot_i = NULL - if (!is.null(covariates)){ - if (ncol(covariates) == 1){ - covariates_i = covariates[rownames(summarized_methyl_VMR),] %>% #Match the covariates dataset with the VMRs information - as.matrix() - colnames(covariates_i) = colnames(covariates) - } else covariates_i = covariates[rownames(summarized_methyl_VMR),] - } - full_data_vmr_i = 
cbind(summ_vmr_i, env_i, genot_i, covariates_i) - colnames(full_data_vmr_i) = make.names(colnames(full_data_vmr_i)) - #Set the basal model (only covariates) - basal_model_formula = colnames(covariates) %>% - make.names() %>% - paste( collapse = " + ") + # Select the winning model + winning_models <- foreach::foreach( + VML_i = iterators::iter(selected_variables, by = "row"), + .combine = "rbind" + ) %dopar% { # For every VML + # Create the data frame with all the information for each VML + summ_vml_i <- as.matrix(summarized_methyl_VML[, VML_i$VML_index]) + colnames(summ_vml_i) <- "DNAme" + if (!VML_i$selected_env %in% c(list(NULL), list(""), list(NA), list(character(0)))) { + if (length(VML_i$selected_env[[1]]) == 1) { + env_i <- environmental_matrix[rownames(summarized_methyl_VML), unlist(VML_i$selected_env)] %>% + as.matrix() + colnames(env_i) <- unlist(VML_i$selected_env) + } else { + env_i <- environmental_matrix[rownames(summarized_methyl_VML), unlist(VML_i$selected_env)] + } + } else { + env_i <- NULL + } + if (!VML_i$selected_genot %in% c(list(NULL), list(""), list(NA), list(character(0)))) { + if (length(VML_i$selected_genot[[1]]) == 1) { + genot_i <- genotype_matrix[unlist(VML_i$selected_genot), rownames(summarized_methyl_VML)] %>% + as.matrix() + colnames(genot_i) <- unlist(VML_i$selected_genot) + } else { + genot_i <- genotype_matrix[unlist(VML_i$selected_genot), rownames(summarized_methyl_VML)] %>% + t() + } + } else { + genot_i <- NULL + } + if (!is.null(covariates)) { + if (ncol(covariates) == 1) { + covariates_i <- covariates[rownames(summarized_methyl_VML), ] %>% # Match the covariates dataset with the VML information + as.matrix() + colnames(covariates_i) <- colnames(covariates) + } else { + covariates_i <- covariates[rownames(summarized_methyl_VML), ] + } + } + full_data_vml_i <- cbind(summ_vml_i, env_i, genot_i, covariates_i) + colnames(full_data_vml_i) <- make.names(colnames(full_data_vml_i)) + # Set the basal model (only covariates) + 
basal_model_formula <- colnames(covariates) %>% + make.names() %>% + paste(collapse = " + ") - ## Fit models involving G if G has selected variables - if (!VMR_i$selected_genot %in% c(list(NULL), list(""), list(NA), list(character(0)))) { - models_g_involving_df = foreach::foreach(SNP = unlist(VMR_i$selected_genot), - .combine = "rbind") %do% { #For each SNP - ### Fit G models - model_g = stats::lm(data = as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", make.names(SNP), " + ", basal_model_formula) ) + ## Fit models involving G if G has selected variables + if (!VML_i$selected_genot %in% c(list(NULL), list(""), list(NA), list(character(0)))) { + models_g_involving_df <- foreach::foreach( + SNP = unlist(VML_i$selected_genot), + .combine = "rbind" + ) %do% { # For each SNP + ### Fit G models + model_g <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", make.names(SNP), " + ", basal_model_formula)) - #Create data frame structure for the results - model_g_df = data.frame(model_group = "G") - model_g_df$variables = list(SNP) - if(model_selection == "AIC") model_g_df$AIC = stats::AIC(model_g) - if(model_selection == "BIC") model_g_df$BIC = stats::BIC(model_g) - model_g_df$tot_r_squared = summary(model_g)$r.squared - #model_g_df$tot_adj_r_squared = summary(model_g)$adj.r.squared + # Create data frame structure for the results + model_g_df <- data.frame(model_group = "G") + model_g_df$variables <- list(SNP) + if (model_selection == "AIC") model_g_df$AIC <- stats::AIC(model_g) + if (model_selection == "BIC") model_g_df$BIC <- stats::BIC(model_g) + model_g_df$tot_r_squared <- summary(model_g)$r.squared - if (!VMR_i$selected_env %in% c(list(NULL), list(""), list(NA), list(character(0)))){ - ### Fit GxE and G+E models if E is not empty - models_joint_df = foreach::foreach(env = unlist(VMR_i$selected_env), #For every env var - .combine = "rbind") %do% { - #Fit G + E - model_ge = stats::lm(data = 
as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", make.names(SNP), " + ", make.names(env), " + ", basal_model_formula) ) + if (!VML_i$selected_env %in% c(list(NULL), list(""), list(NA), list(character(0)))) { + ### Fit GxE and G+E models if E is not empty + models_joint_df <- foreach::foreach( + env = unlist(VML_i$selected_env), # For every env var + .combine = "rbind" + ) %do% { + # Fit G + E + model_ge <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", make.names(SNP), " + ", make.names(env), " + ", basal_model_formula)) - #Create data frame structure for the results - model_ge_df = data.frame(model_group = "G+E") - model_ge_df$variables = list(c(SNP, env)) - if(model_selection == "AIC") model_ge_df$AIC = stats::AIC(model_ge) - if(model_selection == "BIC") model_ge_df$BIC = stats::BIC(model_ge) - model_ge_df$tot_r_squared = summary(model_ge)$r.squared - #Fit GxE - model_gxe = stats::lm(data = as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", make.names(SNP), " + ", make.names(env), " + ", make.names(SNP), "*", make.names(env), " + ", basal_model_formula) ) + # Create data frame structure for the results + model_ge_df <- data.frame(model_group = "G+E") + model_ge_df$variables <- list(c(SNP, env)) + if (model_selection == "AIC") model_ge_df$AIC <- stats::AIC(model_ge) + if (model_selection == "BIC") model_ge_df$BIC <- stats::BIC(model_ge) + model_ge_df$tot_r_squared <- summary(model_ge)$r.squared + # Fit GxE + model_gxe <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", make.names(SNP), " + ", make.names(env), " + ", make.names(SNP), "*", make.names(env), " + ", basal_model_formula)) - #Create data frame structure for the results - model_gxe_df = data.frame(model_group = "GxE") - model_gxe_df$variables = list(c(SNP, env)) - if(model_selection == "AIC") model_gxe_df$AIC = stats::AIC(model_gxe) - if(model_selection == "BIC") model_gxe_df$BIC = 
stats::BIC(model_gxe) - model_gxe_df$tot_r_squared = summary(model_gxe)$r.squared + # Create data frame structure for the results + model_gxe_df <- data.frame(model_group = "GxE") + model_gxe_df$variables <- list(c(SNP, env)) + if (model_selection == "AIC") model_gxe_df$AIC <- stats::AIC(model_gxe) + if (model_selection == "BIC") model_gxe_df$BIC <- stats::BIC(model_gxe) + model_gxe_df$tot_r_squared <- summary(model_gxe)$r.squared - #Return joint models - temp_models_joint = rbind(model_gxe_df, model_ge_df) - temp_models_joint - } - } else models_joint_df = NULL + # Return joint models + temp_models_joint <- rbind(model_gxe_df, model_ge_df) + temp_models_joint + } + } else { + models_joint_df <- NULL + } - #Return object with all the G-involved models - temp_models_g_involving = rbind(model_g_df, models_joint_df) - temp_models_g_involving - } - } else models_g_involving_df = NULL + # Return object with all the G-involved models + temp_models_g_involving <- rbind(model_g_df, models_joint_df) + temp_models_g_involving + } + } else { + models_g_involving_df <- NULL + } - ### Compute E models if E is not empty - if (!VMR_i$selected_env %in% c(list(NULL), list(""), list(NA), list(character(0)))){ #For each env var - models_e_df = foreach::foreach(env = unlist(VMR_i$selected_env), #For every env var - .combine = "rbind") %do% { - #Fit E models - model_e = stats::lm(data = as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", make.names(env), " + ", basal_model_formula) ) + ### Compute E models if E is not empty + if (!VML_i$selected_env %in% c(list(NULL), list(""), list(NA), list(character(0)))) { # For each env var + models_e_df <- foreach::foreach( + env = unlist(VML_i$selected_env), # For every env var + .combine = "rbind" + ) %do% { + # Fit E models + model_e <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", make.names(env), " + ", basal_model_formula)) - #Create data frame structure for the results - 
model_e_df = data.frame(model_group = "E") - model_e_df$variables = list(c(env)) - if(model_selection == "AIC") model_e_df$AIC = stats::AIC(model_e) - if(model_selection == "BIC") model_e_df$BIC = stats::BIC(model_e) - model_e_df$tot_r_squared = summary(model_e)$r.squared - #Return the final object - model_e_df - } - } else models_e_df = NULL + # Create data frame structure for the results + model_e_df <- data.frame(model_group = "E") + model_e_df$variables <- list(c(env)) + if (model_selection == "AIC") model_e_df$AIC <- stats::AIC(model_e) + if (model_selection == "BIC") model_e_df$BIC <- stats::BIC(model_e) + model_e_df$tot_r_squared <- summary(model_e)$r.squared + # Return the final object + model_e_df + } + } else { + models_e_df <- NULL + } - #Create object with the metrics for all the fitted models - all_models_VMR_i = rbind(models_g_involving_df, models_e_df) + # Create object with the metrics for all the fitted models + all_models_VML_i <- rbind(models_g_involving_df, models_e_df) - #Select the best model per category (G,E,GxE,G+E) and compute its delta AIC/BIC - if(model_selection == "AIC"){ - best_models_VMR_i = all_models_VMR_i %>% - dplyr::group_by(model_group) %>% - dplyr::filter(AIC == min(AIC)) %>% - dplyr::slice(1) %>% #In case there are more than one model per group with the exact same AIC, pick the first one - dplyr::arrange(AIC, dplyr::desc(tot_r_squared)) %>% - dplyr::ungroup() %>% - dplyr::mutate(delta_aic = abs(AIC - dplyr::lead(AIC))) - } else if (model_selection == "BIC"){ - best_models_VMR_i = all_models_VMR_i %>% - dplyr::group_by(model_group) %>% - dplyr::filter(BIC == min(BIC)) %>% - dplyr::slice(1) %>% #In case there are more than one model per group with the exact same AIC, pick the first one - dplyr::arrange(BIC,dplyr::desc(tot_r_squared) ) %>% - dplyr::ungroup() %>% - dplyr::mutate(delta_bic = abs(BIC - dplyr::lead(BIC))) - } + # Select the best model per category (G,E,GxE,G+E) and compute its delta AIC/BIC + if (model_selection == 
"AIC") { + best_models_VML_i <- all_models_VML_i %>% + dplyr::group_by(model_group) %>% + dplyr::filter(AIC == min(AIC)) %>% + dplyr::slice(1) %>% # In case there are more than one model per group with the exact same AIC, pick the first one + dplyr::arrange(AIC, dplyr::desc(tot_r_squared)) %>% + dplyr::ungroup() %>% + dplyr::mutate(delta_aic = abs(AIC - dplyr::lead(AIC))) + } else if (model_selection == "BIC") { + best_models_VML_i <- all_models_VML_i %>% + dplyr::group_by(model_group) %>% + dplyr::filter(BIC == min(BIC)) %>% + dplyr::slice(1) %>% # In case there are more than one model per group with the exact same AIC, pick the first one + dplyr::arrange(BIC, dplyr::desc(tot_r_squared)) %>% + dplyr::ungroup() %>% + dplyr::mutate(delta_bic = abs(BIC - dplyr::lead(BIC))) + } - #Create the final object that will be returned - if(model_selection == "AIC"){ - winning_model_VMR_i = best_models_VMR_i %>% - dplyr::filter(AIC == min(AIC)) %>% - #In case there is more than one model with the exact same AIC from different groups, pick the one with the highest tot_r_squared - dplyr::slice(1) %>% - dplyr::mutate(second_winner = best_models_VMR_i$model_group[2], - delta_r_squared = best_models_VMR_i$tot_r_squared[1] - best_models_VMR_i$tot_r_squared[2]) - }else if(model_selection == "BIC"){ - winning_model_VMR_i = best_models_VMR_i %>% - dplyr::filter(BIC == min(BIC)) %>% - #In case there is more than one model with the exact same AIC from different groups, pick the one with the highest tot_r_squared - dplyr::slice(1) %>% - dplyr::mutate(second_winner = best_models_VMR_i$model_group[2], - delta_r_squared = best_models_VMR_i$tot_r_squared[1] - best_models_VMR_i$tot_r_squared[2]) - } + # Create the final object that will be returned + if (model_selection == "AIC") { + winning_model_VML_i <- best_models_VML_i %>% + dplyr::filter(AIC == min(AIC)) %>% + # In case there is more than one model with the exact same AIC from different groups, pick the one with the highest tot_r_squared 
+ dplyr::slice(1) %>% + dplyr::mutate( + second_winner = best_models_VML_i$model_group[2], + delta_r_squared = best_models_VML_i$tot_r_squared[1] - best_models_VML_i$tot_r_squared[2] + ) + } else if (model_selection == "BIC") { + winning_model_VML_i <- best_models_VML_i %>% + dplyr::filter(BIC == min(BIC)) %>% + # In case there is more than one model with the exact same BIC from different groups, pick the one with the highest tot_r_squared + dplyr::slice(1) %>% + dplyr::mutate( + second_winner = best_models_VML_i$model_group[2], + delta_r_squared = best_models_VML_i$tot_r_squared[1] - best_models_VML_i$tot_r_squared[2] + ) + } - #Test the winning model against the basal one and decompose variance for the G, E and GxE components - model_basal = stats::lm(data = as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", basal_model_formula) ) #set the basal model for comparing the rest - if (model_selection == "AIC"){ - winning_model_VMR_i$basal_AIC = stats::AIC(model_basal) - } else if(model_selection == "BIC"){ - winning_model_VMR_i$basal_BIC = stats::BIC(model_basal)} - winning_model_VMR_i$basal_rsquared = summary(model_basal)$r.squared - if(winning_model_VMR_i$model_group == "G"){ - winning_lm = stats::lm(data = as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", make.names(unlist(winning_model_VMR_i$variables)), " + ", basal_model_formula) ) - r_decomp = relaimpo::calc.relimp.lm(object = winning_lm , - rela = FALSE, - type = "last") #This would be the equivalent to using lmg and setting always = covariates. 
- winning_model_VMR_i$g_r_squared = r_decomp$last[make.names(unlist(winning_model_VMR_i$variables))[1]] - winning_model_VMR_i$e_r_squared = NA_real_ - winning_model_VMR_i$gxe_r_squared = NA_real_ - }else if (winning_model_VMR_i$model_group == "E"){ - winning_lm = stats::lm(data = as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", make.names(unlist(winning_model_VMR_i$variables))[1], " + ", basal_model_formula) ) - r_decomp = relaimpo::calc.relimp.lm(object = winning_lm, - rela = FALSE, - type = "last") #This would be the equivalent to using lmg and setting always = covariates. - winning_model_VMR_i$g_r_squared = NA_real_ - winning_model_VMR_i$e_r_squared = r_decomp$last[make.names(unlist(winning_model_VMR_i$variables))[1]] - winning_model_VMR_i$gxe_r_squared = NA_real_ - }else if (winning_model_VMR_i$model_group == "G+E"){ - winning_lm = stats::lm(data = as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", make.names(unlist(winning_model_VMR_i$variables))[1], " + ",make.names(unlist(winning_model_VMR_i$variables))[2], " + ", basal_model_formula) ) - r_decomp = relaimpo::calc.relimp.lm(object = winning_lm, - rela = FALSE, - type = "lmg", - always = colnames(covariates_i)) - winning_model_VMR_i$g_r_squared = r_decomp$lmg[make.names(unlist(winning_model_VMR_i$variables))[1]] - winning_model_VMR_i$e_r_squared = r_decomp$lmg[make.names(unlist(winning_model_VMR_i$variables))[2]] - winning_model_VMR_i$gxe_r_squared = NA_real_ - }else if (winning_model_VMR_i$model_group == "GxE"){ - winning_lm = stats::lm(data = as.data.frame(full_data_vmr_i), formula = stringr::str_glue("DNAme ~ ", make.names(unlist(winning_model_VMR_i$variables))[1], " + ",make.names(unlist(winning_model_VMR_i$variables))[2], " + ", make.names(unlist(winning_model_VMR_i$variables))[1], "*",make.names(unlist(winning_model_VMR_i$variables))[2], " + ", basal_model_formula) ) - r_decomp = relaimpo::calc.relimp.lm(object = winning_lm, - rela = FALSE, - type = "lmg", - 
always = colnames(covariates_i)) #This slightly underestimates the relative importance compared to not using the covariates as the basal model, but in the interaction option the computational time is greatly increased if the relative contribution of all the other covariates is also estimated (which we dont look at anyways). So, because of the high dimensional nature of this package, this option will be used. - winning_model_VMR_i$g_r_squared = r_decomp$lmg[make.names(unlist(winning_model_VMR_i$variables))[1]] - winning_model_VMR_i$e_r_squared = r_decomp$lmg[make.names(unlist(winning_model_VMR_i$variables))[2]] - winning_model_VMR_i$gxe_r_squared = r_decomp$lmg[stringr::str_glue(make.names(unlist(winning_model_VMR_i$variables))[1], ":",make.names(unlist(winning_model_VMR_i$variables))[2])] - } + # Test the winning model against the basal one and decompose variance for the G, E and GxE components + model_basal <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", basal_model_formula)) # set the basal model for comparing the rest + if (model_selection == "AIC") { + winning_model_VML_i$basal_AIC <- stats::AIC(model_basal) + } else if (model_selection == "BIC") { + winning_model_VML_i$basal_BIC <- stats::BIC(model_basal) + } + winning_model_VML_i$basal_rsquared <- summary(model_basal)$r.squared + if (winning_model_VML_i$model_group == "G") { + winning_lm <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", make.names(unlist(winning_model_VML_i$variables)), " + ", basal_model_formula)) + r_decomp <- relaimpo::calc.relimp.lm( + object = winning_lm, + rela = FALSE, + type = "last" + ) # This would be the equivalent to using lmg and setting always = covariates. 
+ winning_model_VML_i$g_r_squared <- r_decomp$last[make.names(unlist(winning_model_VML_i$variables))[1]] + winning_model_VML_i$e_r_squared <- NA_real_ + winning_model_VML_i$gxe_r_squared <- NA_real_ + } else if (winning_model_VML_i$model_group == "E") { + winning_lm <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", make.names(unlist(winning_model_VML_i$variables))[1], " + ", basal_model_formula)) + r_decomp <- relaimpo::calc.relimp.lm( + object = winning_lm, + rela = FALSE, + type = "last" + ) # This would be the equivalent to using lmg and setting always = covariates. + winning_model_VML_i$g_r_squared <- NA_real_ + winning_model_VML_i$e_r_squared <- r_decomp$last[make.names(unlist(winning_model_VML_i$variables))[1]] + winning_model_VML_i$gxe_r_squared <- NA_real_ + } else if (winning_model_VML_i$model_group == "G+E") { + winning_lm <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", make.names(unlist(winning_model_VML_i$variables))[1], " + ", make.names(unlist(winning_model_VML_i$variables))[2], " + ", basal_model_formula)) + r_decomp <- relaimpo::calc.relimp.lm( + object = winning_lm, + rela = FALSE, + type = "lmg", + always = colnames(covariates_i) + ) + winning_model_VML_i$g_r_squared <- r_decomp$lmg[make.names(unlist(winning_model_VML_i$variables))[1]] + winning_model_VML_i$e_r_squared <- r_decomp$lmg[make.names(unlist(winning_model_VML_i$variables))[2]] + winning_model_VML_i$gxe_r_squared <- NA_real_ + } else if (winning_model_VML_i$model_group == "GxE") { + winning_lm <- stats::lm(data = as.data.frame(full_data_vml_i), formula = stringr::str_glue("DNAme ~ ", make.names(unlist(winning_model_VML_i$variables))[1], " + ", make.names(unlist(winning_model_VML_i$variables))[2], " + ", make.names(unlist(winning_model_VML_i$variables))[1], "*", make.names(unlist(winning_model_VML_i$variables))[2], " + ", basal_model_formula)) + r_decomp <- relaimpo::calc.relimp.lm( + object = 
winning_lm, + rela = FALSE, + type = "lmg", + always = colnames(covariates_i) + ) # This slightly underestimates the relative importance compared to not using the covariates as the basal model, but in the interaction option the computational time is greatly increased if the relative contribution of all the other covariates is also estimated (which we dont look at anyways). So, because of the high dimensional nature of this package, this option will be used. + winning_model_VML_i$g_r_squared <- r_decomp$lmg[make.names(unlist(winning_model_VML_i$variables))[1]] + winning_model_VML_i$e_r_squared <- r_decomp$lmg[make.names(unlist(winning_model_VML_i$variables))[2]] + winning_model_VML_i$gxe_r_squared <- r_decomp$lmg[stringr::str_glue(make.names(unlist(winning_model_VML_i$variables))[1], ":", make.names(unlist(winning_model_VML_i$variables))[2])] + } - winning_model_VMR_i$VMR_index = VMR_i$VMR_index - #Return final object - winning_model_VMR_i - } + winning_model_VML_i$VML_index <- VML_i$VML_index + # Return final object + winning_model_VML_i + } - #Compute FDR and rearrange columns - if (model_selection == "AIC"){ - winning_models = winning_models %>% - dplyr::select(VMR_index, model_group, variables, tot_r_squared, g_r_squared, e_r_squared, gxe_r_squared, AIC, second_winner, delta_aic, delta_r_squared, basal_AIC, basal_rsquared) %>% + # Rearrange columns + if (model_selection == "AIC") { + winning_models <- winning_models %>% + dplyr::select(VML_index, model_group, variables, tot_r_squared, g_r_squared, e_r_squared, gxe_r_squared, AIC, second_winner, delta_aic, delta_r_squared, basal_AIC, basal_rsquared) %>% as.data.frame() - } else if (model_selection == "BIC"){ - winning_models = winning_models %>% - dplyr::select(VMR_index, model_group, variables, tot_r_squared, g_r_squared, e_r_squared, gxe_r_squared, BIC, second_winner, delta_bic, delta_r_squared, basal_BIC, basal_rsquared) %>% + } else if (model_selection == "BIC") { + winning_models <- winning_models %>% + 
dplyr::select(VML_index, model_group, variables, tot_r_squared, g_r_squared, e_r_squared, gxe_r_squared, BIC, second_winner, delta_bic, delta_r_squared, basal_BIC, basal_rsquared) %>% as.data.frame() } - return(winning_models) -} + if (model_selection == "AIC") { + return(winning_models %>% + rbind(no_vars_VML %>% # Attach VML with no variables selected in selectVariables() + dplyr::select(-selected_genot, -selected_env) %>% # remove empty columns + dplyr::mutate( + model_group = "B", + variables = list(NA_character_), + tot_r_squared = NA_real_, + g_r_squared = NA_real_, + e_r_squared = NA_real_, + gxe_r_squared = NA_real_, + AIC = NA_real_, + second_winner = NA_character_, + delta_aic = NA_real_, + delta_r_squared = NA_real_, + basal_AIC = NA_real_, + basal_rsquared = NA_real_ + ))) + } + if (model_selection == "BIC") { + return(winning_models %>% + rbind(no_vars_VML %>% # Attach VML with no variables selected in selectVariables() + dplyr::select(-selected_genot, -selected_env) %>% # remove empty columns + dplyr::mutate( + model_group = "B", + variables = list(NA_character_), + tot_r_squared = NA_real_, + g_r_squared = NA_real_, + e_r_squared = NA_real_, + gxe_r_squared = NA_real_, + BIC = NA_real_, + second_winner = NA_character_, + delta_bic = NA_real_, + delta_r_squared = NA_real_, + basal_BIC = NA_real_, + basal_rsquared = NA_real_ + ))) + } +} diff --git a/R/medCorVMR.R b/R/medCorVMR.R index df13823..bcbf419 100644 --- a/R/medCorVMR.R +++ b/R/medCorVMR.R @@ -5,47 +5,64 @@ #' its median pairwise probe correlation. #' #' This function supports parallel computing for increased speed. To do so, you have to set the parallel backend -#' in your R session before running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, -#' the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). 
+#' in your R session before running the function (e.g., *doParallel::registerDoParallel(4)*). After that, the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). #' #' @param VMR_df GRanges object converted to a data frame. Must contain the following columns: #' "seqnames", "start", "end" (all of which are produced automatically when doing the object conversion) and "probes" (containing a list in which each element contains a vector with the probes #' constituting the VMR). -#' @inheritParams findVMRs +#' @inheritParams findVML #' @return A data frame like VMR_df with an extra column per region containing the median pairwise correlation. #' #' @importFrom foreach %dopar% #' @export #' -medCorVMR = function(VMR_df, methylation_data){ - if(!is.list(VMR_df$probes)){ +#' @examples +#' +#' #Create a VML data.frame +#' VMR_df <- data.frame(seqnames = c("chr21", "chr21"), +#' start = c(10861376, 10862171), +#' end = c(10862507, 10883548), +#' probes = I(list(c("cg15043638", "cg18287590", "cg17975851"), +#' c("cg13893907", "cg17035109", "cg06187584")))) +#' +#' # Compute median correlation for each VMR +#' medCorVMR(VMR_df = VMR_df, methylation_data = RAMEN::test_methylation_data) +#' +#' +#' +medCorVMR <- function(VMR_df, methylation_data) { + if (!is.list(VMR_df$probes)) { stop("Please make sure the 'probes' column in VMR_df is a column of lists") } - VMR_probes = VMR_df$probes #generate a list where each element will contain a vector with the probes present in one VMR - #Compute correlations - median_correlation = foreach::foreach(i = seq_along(VMR_probes), # For each VMR - .combine = "c" #Combine outputs in a vector - ) %dopar% { - if (length(VMR_probes[[i]]) == 1){ #If the VMR has one probe + VMR_probes <- VMR_df$probes # generate a list where each element will contain a vector with the probes present in one VMR + # Compute correlations + i <- NULL #Bind variable to the environment + median_correlation <- 
foreach::foreach( + i = seq_along(VMR_probes), # For each VMR + .combine = "c" # Combine outputs in a vector + ) %dopar% { + if (length(VMR_probes[[i]]) == 1) { # If the VMR has one probe NA - } - else{ - VMR_correlation = c() - for (probe_x_i in 1:(length(VMR_probes[[i]])-1)){ #For each probe except the last one - primary_probe = VMR_probes[[i]][probe_x_i] - for (probe_y_i in (probe_x_i+1):length(VMR_probes[[i]])){ #compute the pairwise correlation with the downstream probes - secondary_probe = VMR_probes[[i]][probe_y_i] - VMR_correlation = c(VMR_correlation, - stats::cor(unlist(methylation_data[primary_probe,]), #unlist added to make the subset df a vector - unlist(methylation_data[secondary_probe,]), - method= "pearson")) + } else { + VMR_correlation <- c() + for (probe_x_i in 1:(length(VMR_probes[[i]]) - 1)) { # For each probe except the last one + primary_probe <- VMR_probes[[i]][probe_x_i] + for (probe_y_i in (probe_x_i + 1):length(VMR_probes[[i]])) { # compute the pairwise correlation with the downstream probes + secondary_probe <- VMR_probes[[i]][probe_y_i] + VMR_correlation <- c( + VMR_correlation, + stats::cor(unlist(methylation_data[primary_probe, ]), # unlist added to make the subset df a vector + unlist(methylation_data[secondary_probe, ]), + method = "pearson" + ) + ) } } - median_correlation = stats::median(VMR_correlation) + median_correlation <- stats::median(VMR_correlation) } } - VMR_df$median_correlation = median_correlation + VMR_df$median_correlation <- median_correlation return(VMR_df) } diff --git a/R/nullDistGE.R b/R/nullDistGE.R index acbd51e..bb42f89 100644 --- a/R/nullDistGE.R +++ b/R/nullDistGE.R @@ -2,16 +2,16 @@ #' #' This function simulates the delta R squared distribution under the null hypothesis of G and E having no association with DNA methylation (DNAme) variability through a permutation analysis. 
To do so, this function shuffles the G and E variables in the dataset, which is followed by a the variable selection and modelling steps with *selectVariables()* and *lmGE()*.These steps are repeated several times as indicated in the *permutations* parameter. By using shuffled G and E data, we simulate the increase of R2 that would be observed in random data using the RAMEN methodology. #' -#' The core pipeline from the RAMEN package identifies the best explanatory model per VMR. However, despite these models being winners in comparison to models including any other G/E variable(s) in the dataset, some winning models might perform no better than what we would expect by chance. Therefore, the goal of this function is to create a distribution of increase in R2 under the null hypothesis of G and E having no associations with DNAme. The null distribution is obtained through shuffling the G and E variables in a given dataset and conducting the variable selection and G/E model selection. That way, we can simulate how much additional variance would be explained by the models defined as winners by the RAMEN methodology in a scenario where the G and E associations with DNAme are randomized. This distribution can be then used to filter out winning models in the non-shuffled dataset that do not add more to the explained variance of the basal model than what randomized data do. +#' The core pipeline from the RAMEN package identifies the best explanatory model per VML. However, despite these models being winners in comparison to models including any other G/E variable(s) in the dataset, some winning models might perform no better than what we would expect by chance. Therefore, the goal of this function is to create a distribution of increase in R2 under the null hypothesis of G and E having no associations with DNAme. The null distribution is obtained through shuffling the G and E variables in a given dataset and conducting the variable selection and G/E model selection. 
That way, we can simulate how much additional variance would be explained by the models defined as winners by the RAMEN methodology in a scenario where the G and E associations with DNAme are randomized. This distribution can be then used to filter out winning models in the non-shuffled dataset that do not add more to the explained variance of the basal model than what randomized data do. #' -#' Under the assumption that after adjusting for the concomitant variables all VMRs across the genome follow the same behavior regarding an increment of explained variance with randomized G and E data, we can pool the delta R squared values from all VMRs to create a null distribution taking advantage of the high number of VMRs in the dataset. This assumption decreases significantly the number of permutations required to create a null distribution and reduces the computational time. For further information please read the RAMEN paper (in preparation). +#' Under the assumption that after adjusting for the concomitant variables all VML across the genome follow the same behavior regarding an increment of explained variance with randomized G and E data, we can pool the delta R squared values from all VML to create a null distribution taking advantage of the high number of VML in the dataset. This assumption decreases significantly the number of permutations required to create a null distribution and reduces the computational time. For further information please read the RAMEN paper (in preparation). #' #' @param permutations description #' @inheritParams selectVariables #' @inheritParams lmGE #' #' @return A data frame with the following columns: -#' - VMR_index: The unique ID of the VMR. +#' - VML_index: The unique ID of the VML. 
#' - model_group: The group to which the winning model belongs to (i.e., G, E, G+E or GxE) #' - tot_r_squared: R squared of the winning model #' - R2_difference: the increase in R squared obtained by including the G/E variable(s) from the winning model (i.e., the R squared difference between the winning model and the model only with the concomitant variables specified in *covariates*; tot_r_squared - basal_rsquared in the lmGE output) @@ -19,75 +19,179 @@ #' #' @importFrom foreach %do% #' @export +#' @examples +#' ## Find VML in test data +#' VML <- RAMEN::findVML( +#' methylation_data = RAMEN::test_methylation_data, +#' array_manifest = "IlluminaHumanMethylationEPICv1", +#' cor_threshold = 0, +#' var_method = "variance", +#' var_distribution = "ultrastable", +#' var_threshold_percentile = 0.99, +#' max_distance = 1000 +#' ) +#' ## Find cis SNPs around VML +#' VML_with_cis_snps <- RAMEN::findCisSNPs( +#' VML_df = VML$VML, +#' genotype_information = RAMEN::test_genotype_information, +#' distance = 1e6 +#' ) #' +#' ## Summarize methylation levels in VML +#' summarized_methyl_VML <- RAMEN::summarizeVML( +#' methylation_data = RAMEN::test_methylation_data, +#' VML_df = VML_with_cis_snps +#' ) +#' +#' ## Simulate null distribution of G and E contributions on DNAme variability +#' null_dist <- RAMEN::nullDistGE( +#' VML_df = VML_with_cis_snps, +#' genotype_matrix = RAMEN::test_genotype_matrix, +#' environmental_matrix = RAMEN::test_environmental_matrix, +#' summarized_methyl_VML = summarized_methyl_VML, +#' permutations = 5, +#' covariates = RAMEN::test_covariates, +#' seed = 1, +#' model_selection = "AIC" +#' ) + + +nullDistGE <- function(VML_df, + genotype_matrix, + environmental_matrix, + summarized_methyl_VML, + permutations = 10, + covariates = NULL, + seed = NULL, + model_selection = "AIC") { + ## Check that genotype_matrix, environmental_matrix, and covariates (in case + ## it is provided) have only numeric values and no NA, NaN, Inf values + if 
(!is.data.frame(summarized_methyl_VML)) stop("Please make sure summarized_methyl_VML is a data frame.") + if ( + sum(vapply(genotype_matrix, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(genotype_matrix, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(genotype_matrix, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(genotype_matrix, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the genotype matrix contains only finite numeric values." + ) + if ( + sum(vapply(environmental_matrix, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(environmental_matrix, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(environmental_matrix, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(environmental_matrix, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the environmental matrix contains only finite numeric values." + ) + if (!is.null(covariates)) { + if ( + sum(vapply(covariates, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(covariates, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(covariates, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(covariates, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the covariates matrix contains only finite numeric values." + ) + } + if ( + sum(vapply(summarized_methyl_VML, + is.na, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 || + sum(vapply(summarized_methyl_VML, + is.nan, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 || + sum(!vapply(summarized_methyl_VML, + is.numeric, + FUN.VALUE = logical(1) + ) + ) > 0 || + sum(vapply(summarized_methyl_VML, + is.infinite, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 + ) stop ( + "Please make sure the summarized_methyl_VML data frame contains only finite numeric values." 
+ ) + -nullDistGE = function(VMRs_df, - genotype_matrix, - environmental_matrix, - summarized_methyl_VMR, - permutations = 10, - covariates = NULL, - seed = NULL, - model_selection = "AIC" -){ - #Get the shuffle order + # Get the shuffle order if (!is.null(seed)) set.seed(seed) - permutation_order = data.frame(sample(rownames(summarized_methyl_VMR), - size = length(rownames(summarized_methyl_VMR)))) - for (i in 1:(permutations-1)){ - permutation_order= cbind(permutation_order, - data.frame(sample(rownames(summarized_methyl_VMR), - size = length(rownames(summarized_methyl_VMR))))) + permutation_order <- data.frame(sample(rownames(summarized_methyl_VML), + size = length(rownames(summarized_methyl_VML)) + )) + for (i in seq_len(permutations - 1)) { # seq_len() skips the loop when permutations = 1 + permutation_order <- cbind( + permutation_order, + data.frame(sample(rownames(summarized_methyl_VML), + size = length(rownames(summarized_methyl_VML)) + )) + ) } - colnames(permutation_order) = 1:permutations + colnames(permutation_order) <- 1:permutations - #Put the environmental and genotype matrix in the same order to the summarized VMR object - genotype_matrix = genotype_matrix[,rownames(summarized_methyl_VMR)] - environmental_matrix = environmental_matrix[rownames(summarized_methyl_VMR),] + # Put the environmental and genotype matrices in the same order as the summarized VML object + genotype_matrix <- genotype_matrix[, rownames(summarized_methyl_VML)] + environmental_matrix <- environmental_matrix[rownames(summarized_methyl_VML), ] # Permutation analysis - null_dist = foreach::foreach(i = 1:permutations, .combine = rbind) %do% { - #Shuffle the datasets - permutated_genotype = genotype_matrix[,permutation_order[,i]] %>% + null_dist <- foreach::foreach(i = 1:permutations, .combine = rbind) %do% { + message("Starting permutation ", i, " of ", permutations) + # Shuffle the datasets + permutated_genotype <- genotype_matrix[, permutation_order[, i]] %>% as.matrix() - rownames(permutated_genotype) = rownames(genotype_matrix) - 
colnames(permutated_genotype) = colnames(genotype_matrix) - permutated_environment = environmental_matrix[permutation_order[,i],] %>% + rownames(permutated_genotype) <- rownames(genotype_matrix) + colnames(permutated_genotype) <- colnames(genotype_matrix) + permutated_environment <- environmental_matrix[permutation_order[, i], ] %>% as.matrix() - colnames(permutated_environment) = colnames(environmental_matrix) - rownames(permutated_environment) = rownames(environmental_matrix) + colnames(permutated_environment) <- colnames(environmental_matrix) + rownames(permutated_environment) <- rownames(environmental_matrix) # Run RAMEN - selected_variables = RAMEN::selectVariables(VMRs_df = VMRs_df, - genotype_matrix = permutated_genotype, - environmental_matrix = permutated_environment, - covariates = covariates, - summarized_methyl_VMR = summarized_methyl_VMR, - seed = 1) + message("Starting variable selection of permutation ", i, " of ", permutations) + selected_variables <- RAMEN::selectVariables( + VML_df = VML_df, + genotype_matrix = permutated_genotype, + environmental_matrix = permutated_environment, + covariates = covariates, + summarized_methyl_VML = summarized_methyl_VML, + seed = 1 + ) - lmGE_res = RAMEN::lmGE(selected_variables = selected_variables, - summarized_methyl_VMR = summarized_methyl_VMR, - genotype_matrix = permutated_genotype, - environmental_matrix = permutated_environment, - covariates = covariates, - model_selection = model_selection) - if(model_selection=="AIC"){ - results_perm = data.frame(VMR_index = lmGE_res$VMR_index, - tot_r_squared = lmGE_res$tot_r_squared, - model_group = lmGE_res$model_group, - R2_difference = lmGE_res$tot_r_squared - lmGE_res$basal_rsquared, - AIC_difference = lmGE_res$AIC - lmGE_res$basal_rsquared) - } else if (model_selection=="BIC"){ - results_perm = data.frame(VMR_index = lmGE_res$VMR_index, - model_group = lmGE_res$model_group, - tot_r_squared = lmGE_res$tot_r_squared, - R2_difference = lmGE_res$tot_r_squared - 
lmGE_res$basal_rsquared, - BIC_difference = lmGE_res$BIC - lmGE_res$basal_rsquared) + message("Starting lmGE in permutation ", i, " of ", permutations) + lmGE_res <- RAMEN::lmGE( + selected_variables = selected_variables, + summarized_methyl_VML = summarized_methyl_VML, + genotype_matrix = permutated_genotype, + environmental_matrix = permutated_environment, + covariates = covariates, + model_selection = model_selection + ) + if (model_selection == "AIC") { + results_perm <- data.frame( + VML_index = lmGE_res$VML_index, + model_group = lmGE_res$model_group, + tot_r_squared = lmGE_res$tot_r_squared, + R2_difference = lmGE_res$tot_r_squared - lmGE_res$basal_rsquared, + AIC_difference = lmGE_res$AIC - lmGE_res$basal_AIC + ) + } else if (model_selection == "BIC") { + results_perm <- data.frame( + VML_index = lmGE_res$VML_index, + model_group = lmGE_res$model_group, + tot_r_squared = lmGE_res$tot_r_squared, + R2_difference = lmGE_res$tot_r_squared - lmGE_res$basal_rsquared, + BIC_difference = lmGE_res$BIC - lmGE_res$basal_BIC + ) } - results_perm$permutation = i #add the number of permutation + message("Wrapping up permutation ", i, " of ", permutations) + results_perm$permutation <- i # add the number of permutation results_perm } return(null_dist) } - diff --git a/R/selectVariables.R b/R/selectVariables.R index c266c87..ac1fdd4 100644 --- a/R/selectVariables.R +++ b/R/selectVariables.R @@ -1,168 +1,287 @@ -#' Selection of environment and genotype variables for Variable Methylated Regions (VMRs) +#' Selection of relevant environment and genotype variables associated with Variably Methylated Loci (VML) #' -#' For each VMR, this function selects genotype and environmental variables using LASSO. +#' For each VML, this function selects potentially relevant genotype and environmental variables associated with DNA methylation levels of said VML using LASSO. See details below for more information. #' -#' This function supports parallel computing for increased speed. 
To do so, you have to set the parallel back-end -#' in your R session before running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). Please make sure that your data has no NAs, since the LASSO implementation we use in RAMEN does not support missing values. -#' -#' selectVariables() uses LASSO, which is an embedded variable selection method that penalizes models that are more complex (i.e., that contain more variables) in favor of simpler models (i.e. that contain less variables), but not at the expense of reducing predictive power. Using LASSO's variable screening property (with high probability, the LASSO estimated model includes the substantial covariates and drops the redundant ones) this function selects genotype and environment variables with potential relevance in the Variable Methylated Region (VMR) dataset (see also Bühlmann and van de Geer, 2011). For each VMR, LASSO is run three times: 1) including only the genotype variables for the selection step, 2) including only the environmental variables for the selection step, and 3) Including both the genotype and environmental variables in the selection step. This is done to ensure that the function captures the variables that are relevant within their own category (e.g., SNPs that are strongly associated with the DNAme levels of a VMR in the presence of the rest of the SNPs) or in the presence of the variables of the other category (e.g. SNPs that are strongly associated with the DNAme levels of a VMR in the presence of the rest of BOTH the SNPs AND environmental variables). Every time LASSO is run, the basal covariates (i.e., concomitant variables )indicated in the argument *covariates* are not penalized (i.e., those variables are always included in the models and their coefficients are not subjected to shrinkage). 
That way, only the most promising E and G variables in the presence of the concomitant variables will be selected. +#' selectVariables() uses LASSO, which is an embedded variable selection method that penalizes models that are more complex (i.e., that contain more variables) in favor of simpler models (i.e. that contain less variables), but not at the expense of reducing predictive power. Using LASSO's variable screening property (with high probability, the LASSO estimated model includes the substantial covariates and drops the redundant ones) this function selects genotype and environment variables with potential relevance in the Variable Methylated Loci (VML) dataset (see also Bühlmann and van de Geer, 2011). For each VML, LASSO is run three times: 1) including only the genotype variables for the selection step, 2) including only the environmental variables for the selection step, and 3) Including both the genotype and environmental variables in the selection step. This is done to ensure that the function captures the variables that are relevant within their own category (e.g., SNPs that are strongly associated with the DNAme levels of a VML in the presence of the rest of the SNPs) or in the presence of the variables of the other category (e.g. SNPs that are strongly associated with the DNAme levels of a VML in the presence of the rest of BOTH the SNPs AND environmental variables). Every time LASSO is run, the basal covariates (i.e., concomitant variables )indicated in the argument *covariates* are not penalized (i.e., those variables are always included in the models and their coefficients are not subjected to shrinkage). That way, only the most promising E and G variables in the presence of the concomitant variables will be selected. #' #' Each LASSO model uses a tuned lambda that minimizes the 5-fold cross-validation error within its corresponding data. 
This function uses the lambda.min value in contrast to lambda.1se because its goal within the RAMEN package is to use LASSO to reduce the number of variables that are going to be used next for fitting pairwise interaction models in *lmGE()*. Since at this step variables are being selected based only on main effects, it is preferable to cast a "wider net" and select a slightly higher number of variables that could potentially have a strong interaction effect when paired with another variable. Furthermore, since in this case LASSO is being used as a screening procedure to select variables that will be fit separately in independent models and compared, the overfitting issue of using lambda.min does not impose a big concern. After finding the best lambda value, the sequence of models is fit by coordinate descent using *glmnet()*. Random numbers in this function are created during the lambda cross validation and the LASSO stages. Setting a seed is highly encouraged for result reproducibility using the *seed* argument. Please note that setting a seed inside of this function modifies the seed globally (which is R's default behavior). #' +#' This function supports parallel computing for increased speed. To do so, you have to set the parallel back-end +#' in your R session before running the function (e.g., *doParallel::registerDoParallel(4)*). After that, the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). Please make sure that your data contains no NAs and is entirely numeric, since the LASSO implementation we use does not support missing or non-numeric values. +#' #' Note: If you want to conduct the variable selection step only in one data set (i.e., only in the genotype), you can set the argument *environmental_matrix = NULL*. #' #' -#' @param VMRs_df A data frame converted from a GRanges object. Recommended to use the output of *RAMEN::findCisSNPs()*. 
Must have one VMR per row, and contain the following columns: "VMR_index" (a unique ID for each VMR in VMRs_df AS CHARACTERS) and "SNP" (a column with a list as observation, containing the name of the SNPs surrounding the corresponding VMR). The SNPs contained in the "SNP" column must be present in the object that is indicated in the genotype_matrix argument, and it must contain all the VMRs contained in summarized_methyl_VMR. VMRs with no surrounding SNPs must have an empty list in the SNP column (either list(NULL), list(NA), list("") or list(character(0)) ). +#' @param VML_df A data frame converted from a GRanges object. Recommended to use the output of *RAMEN::findCisSNPs()*. Must have one VML per row, and contain the following columns: "VML_index" (a unique ID for each VML in VML_df AS CHARACTERS) and "SNP" (a column with a list as observation, containing the name of the SNPs surrounding the corresponding VML). The SNPs contained in the "SNP" column must be present in the object that is indicated in the genotype_matrix argument, and it must contain all the VML contained in summarized_methyl_VML. VML with no surrounding SNPs must have an empty list in the SNP column (either list(NULL), list(NA), list("") or list(character(0)) ). #' @param environmental_matrix A matrix of environmental variables. Only numeric values are supported. In case of factor variables, it is recommended to encode them as numbers or re-code them into dummy variables if there are more than two levels. Columns must correspond to environmental variables and rows to individuals. Row names must be the individual IDs. #' @param genotype_matrix A matrix of number-encoded genotypes. Columns must correspond to samples, and rows to SNPs. We suggest using a gene-dosage model, which would encode the SNPs ordinally depending on the genotype allele charge, such as 2 (AA), 1 (AB) and 0 (BB). The column names must correspond with individual IDs. 
-#' @param summarized_methyl_VMR A data frame containing each individual's VMR summarized region methylation. It is suggested to use the output of RAMEN::summarizeVMRs().Rows must reflects individuals, and columns VMRs The names of the columns must correspond to the index of said VMR, and it must match the index of VMRs_df$VMR_index. The names of the rows must correspond to the sample IDs, and must match with the IDs of the other matrices. +#' @param summarized_methyl_VML A data frame containing each individual's VML summarized methylation. It is suggested to use the output of RAMEN::summarizeVML(). Rows must correspond to individuals, and columns to VML. The names of the columns must correspond to the index of said VML, and must match VML_df$VML_index. The names of the rows must correspond to the sample IDs, and must match the IDs of the other matrices. #' @param covariates A matrix containing the covariates (i.e., concomitant variables / variables that are not the ones you are interested in) that will be adjusted for in the final GxE models (e.g., cell type proportions, age, etc.). Each column should correspond to a covariate and each row to an individual. Row names must correspond to the individual IDs. #' @param seed An integer number that initializes a pseudo-random number generator. Random numbers in this function are created during the lambda cross validation and the LASSO stages. Setting a seed is highly encouraged for result reproducibility. **Please note that setting a seed in this function modifies the seed globally**. #' #' @return A data frame with three columns: -#' - VMR_index: Unique VMR ID. -#' - selected_genot: List-containing column with the selected SNPs. -#' - selected_env: List-containing column with the selected environmental variables. +#' - VML_index: Unique VML ID. +#' - selected_genot: Column containing lists as values with the selected SNPs. 
+#' - selected_env: Column containing lists as values with the selected environmental variables. #' #' @importFrom doRNG %dorng% #' @export #' -selectVariables = function(VMRs_df, - genotype_matrix, - environmental_matrix, - covariates = NULL, - summarized_methyl_VMR, - seed = NULL) { - ## Arguments check - # Check that genotype_matrix, environmental_matrix, covariate matrix (in case it is provided) and summarized_methyl_VMR have the same samples - if(!all(rownames(summarized_methyl_VMR) %in% colnames(genotype_matrix))) stop("Individual IDs in summarized_methyl_VMR do not match individual IDs in genotype_matrix") - if (!all(rownames(summarized_methyl_VMR) %in% rownames(environmental_matrix))) stop("Individual IDs in summarized_methyl_VMR do not match individual IDs in environmental_matrix") - if(!is.null(covariates)){ - if (!all(rownames(summarized_methyl_VMR) %in% rownames(covariates)))stop("Individual IDs in summarized_methyl_VMR do not match individual IDs in the covariates matrix")} - #Check that VMRs_df has index and SNP column - if(!all(c("VMR_index","SNP") %in% colnames(VMRs_df))) stop("Please make sure the VMR data frame (VMRs_df) contains the columns 'SNP' and 'VMR_index'.") - #Check that the SNP column on VMRs_df is a list - if(!is.list(VMRs_df$SNP)) stop("Please make sure the 'SNP' column in VMRs_df is a column of lists") - if(!is.character(VMRs_df$VMR_index)) stop("Please make sure the 'VMR_index' column in VMRs_df is a column of characters") - #Check that genotype, environment and covariates are matrices +#' @examples +#' ## Find VML in test data +#' VML <- RAMEN::findVML( +#' methylation_data = RAMEN::test_methylation_data, +#' array_manifest = "IlluminaHumanMethylationEPICv1", +#' cor_threshold = 0, +#' var_method = "variance", +#' var_distribution = "ultrastable", +#' var_threshold_percentile = 0.99, +#' max_distance = 1000 +#' ) +#' ## Find cis SNPs around VML +#' VML_with_cis_snps <- RAMEN::findCisSNPs( +#' VML_df = VML$VML, +#' 
genotype_information = RAMEN::test_genotype_information, +#' distance = 1e6 +#' ) +#' +#' ## Summarize methylation levels in VML +#' summarized_methyl_VML <- RAMEN::summarizeVML( +#' methylation_data = RAMEN::test_methylation_data, +#' VML_df = VML_with_cis_snps +#' ) +#' +#' ## Select relevant genotype and environmental variables +#' selected_vars <- RAMEN::selectVariables( +#' VML_df = VML_with_cis_snps, +#' genotype_matrix = RAMEN::test_genotype_matrix, +#' environmental_matrix = RAMEN::test_environmental_matrix, +#' covariates = RAMEN::test_covariates, +#' summarized_methyl_VML = summarized_methyl_VML, +#' seed = 1 +#' ) +#' +selectVariables <- function(VML_df, + genotype_matrix, + environmental_matrix, + covariates = NULL, + summarized_methyl_VML, + seed = NULL) { + # Arguments check + ## Check that genotype_matrix, environmental_matrix, covariate matrix (in case it is provided) and summarized_methyl_VML have the same samples + if (!is.data.frame(summarized_methyl_VML)) stop("Please make sure summarized_methyl_VML is a data frame.") + if (!all(rownames(summarized_methyl_VML) %in% colnames(genotype_matrix))) stop("Individual IDs in summarized_methyl_VML do not match individual IDs in genotype_matrix") + if (!all(rownames(summarized_methyl_VML) %in% rownames(environmental_matrix))) stop("Individual IDs in summarized_methyl_VML do not match individual IDs in environmental_matrix") + if (!is.null(covariates)) { + if (!all(rownames(summarized_methyl_VML) %in% rownames(covariates))) stop("Individual IDs in summarized_methyl_VML do not match individual IDs in the covariates matrix") + } + ## Check that VML_df has index and SNP column + if (!all(c("VML_index", "SNP") %in% colnames(VML_df))) stop("Please make sure the VML data frame (VML_df) contains the columns 'SNP' and 'VML_index'.") + ## Check that the SNP column on VML_df is a list + if (!is.list(VML_df$SNP)) stop("Please make sure the 'SNP' column in VML_df is a column containing lists as values") + if 
(!is.character(VML_df$VML_index)) stop("Please make sure the 'VML_index' column in VML_df is a column of characters") + ## Check that genotype, environment and covariates are matrices if (!is.matrix(genotype_matrix)) stop("Please make sure the genotype data is provided as a matrix.") - if(!is.null(environmental_matrix)){ - if (!is.matrix(environmental_matrix)) stop("Please make sure the environmental data is provided as a matrix.")} - if (!is.null(covariates)){ - if (!is.matrix(covariates)) stop("Please make sure the covariates data is provided as a matrix.")} - if (sum(is.na(genotype_matrix)) > 1 | sum(is.na(environmental_matrix)) > 1 | sum(is.na(covariates)) ) stop("Data contains missing values. Please consider handling NAs by imputation or removal.") + if (!is.null(environmental_matrix)) { + if (!is.matrix(environmental_matrix)) stop("Please make sure the environmental data is provided as a matrix.") + } + if (!is.null(covariates)) { + if (!is.matrix(covariates)) stop("Please make sure the covariates data is provided as a matrix.") + } + ## Check that genotype_matrix, environmental_matrix, and covariates (in case + ## it is provided) have only numeric values and no NA, NaN, Inf values + if ( + sum(vapply(genotype_matrix, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(genotype_matrix, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(genotype_matrix, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(genotype_matrix, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the genotype matrix contains only finite numeric values." + ) + if ( + sum(vapply(environmental_matrix, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(environmental_matrix, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(environmental_matrix, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(environmental_matrix, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the environmental matrix contains only finite numeric values." 
+ ) + if (!is.null(covariates)) { + if ( + sum(vapply(covariates, is.na, FUN.VALUE = logical(1))) > 0 || + sum(vapply(covariates, is.nan, FUN.VALUE = logical(1))) > 0 || + sum(!vapply(covariates, is.numeric, FUN.VALUE = logical(1))) > 0 || + sum(vapply(covariates, is.infinite, FUN.VALUE = logical(1))) > 0 + ) stop ( + "Please make sure the covariates matrix contains only finite numeric values." + ) + } + + if ( + sum(vapply(summarized_methyl_VML, + is.na, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 || + sum(vapply(summarized_methyl_VML, + is.nan, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 || + sum(!vapply(summarized_methyl_VML, + is.numeric, + FUN.VALUE = logical(1) + ) + ) > 0 || + sum(vapply(summarized_methyl_VML, + is.infinite, + FUN.VALUE = logical(nrow(summarized_methyl_VML)) + ) + ) > 0 + ) stop ( + "Please make sure the summarized_methyl_VML data frame contains only finite numeric values." + ) ## Set the seed if (!is.null(seed)) set.seed(seed) - - lasso_results = foreach::foreach(VMR_i = iterators::iter(VMRs_df, by = "row"), .combine = "rbind") %dorng%{ - #Select summarized VMR information - summVMRi = summarized_methyl_VMR %>% - dplyr::select(VMR_i$VMR_index) + VML_i <- NULL # To avoid R CMD check note about undefined global variable + lasso_results <- foreach::foreach(VML_i = iterators::iter(VML_df, by = "row"), .combine = "rbind") %dorng% { + # Select summarized VML information + summVMLi <- summarized_methyl_VML %>% + dplyr::select(VML_i$VML_index) ## Prepare data - #subset the genotyping data and match genotype, environment and DNAme IDs - if(VMR_i$SNP %in% list(NULL) | # Catch VMRs with no surrounding SNPs - VMR_i$SNP %in% list("") | - VMR_i$SNP %in% list(NA) | - VMR_i$SNP %in% list(character(0))){ - genot_VMRi = c() - any_snp = FALSE - } else if (length(VMR_i$SNP[[1]]) == 1){ #Special case of sub-setting if SNP is only one because the result is a vector and not a matrix - genot_VMRi = 
genotype_matrix[unlist(VMR_i$SNP), rownames(summVMRi)] %>% + # subset the genotyping data and match genotype, environment and DNAme IDs + if (VML_i$SNP %in% list(NULL) | # Catch VML with no surrounding SNPs + VML_i$SNP %in% list("") | + VML_i$SNP %in% list(NA) | + VML_i$SNP %in% list(character(0))) { + genot_VMLi <- c() + any_snp <- FALSE + } else if (length(VML_i$SNP[[1]]) == 1) { # Special case of sub-setting if SNP is only one because the result is a vector and not a matrix + genot_VMLi <- genotype_matrix[unlist(VML_i$SNP), rownames(summVMLi)] %>% as.matrix() - colnames(genot_VMRi) = VMR_i$SNP[[1]] - any_snp = TRUE + colnames(genot_VMLi) <- VML_i$SNP[[1]] + any_snp <- TRUE } else { - genot_VMRi = genotype_matrix[unlist(VMR_i$SNP), rownames(summVMRi)] %>% + genot_VMLi <- genotype_matrix[unlist(VML_i$SNP), rownames(summVMLi)] %>% t() - any_snp = TRUE + any_snp <- TRUE } - if(ncol(environmental_matrix) == 1){ - environ_VMRi = environmental_matrix[rownames(summVMRi),] %>% + if (ncol(environmental_matrix) == 1) { + environ_VMLi <- environmental_matrix[rownames(summVMLi), ] %>% as.matrix() - colnames(environ_VMRi) = colnames(environmental_matrix) - } else environ_VMRi = environmental_matrix[rownames(summVMRi),] - environ_genot_VMRi = cbind(genot_VMRi, environ_VMRi) - #Bind covariates data - if (!is.null(covariates)){ - if (ncol(covariates) == 1){ - covariates_VMRi = covariates[rownames(summVMRi),] %>% #Match the covariates dataset with the VMRs information + colnames(environ_VMLi) <- colnames(environmental_matrix) + } else { + environ_VMLi <- environmental_matrix[rownames(summVMLi), ] + } + environ_genot_VMLi <- cbind(genot_VMLi, environ_VMLi) + # Bind covariates data + if (!is.null(covariates)) { + if (ncol(covariates) == 1) { + covariates_VMLi <- covariates[rownames(summVMLi), ] %>% # Match the covariates dataset with the VML information as.matrix() - colnames(covariates_VMRi) = colnames(covariates) - } else covariates_VMRi = covariates[rownames(summVMRi),] - 
genot_VMRi = cbind(genot_VMRi,covariates_VMRi) - environ_VMRi = cbind(environ_VMRi, covariates_VMRi) - environ_genot_VMRi = cbind(environ_genot_VMRi, covariates_VMRi) - ncol_covariates = ncol(covariates_VMRi) - } else ncol_covariates = 0 + colnames(covariates_VMLi) <- colnames(covariates) + } else { + covariates_VMLi <- covariates[rownames(summVMLi), ] + } + genot_VMLi <- cbind(genot_VMLi, covariates_VMLi) + environ_VMLi <- cbind(environ_VMLi, covariates_VMLi) + environ_genot_VMLi <- cbind(environ_genot_VMLi, covariates_VMLi) + ncol_covariates <- ncol(covariates_VMLi) + } else { + ncol_covariates <- 0 + } ### Run LASSOs ## Genotype only - #Get coefficients with the optimal lambda found by k-fold cross-validation - if (any_snp){ #Catch cases when VMRs dont have surrounding genotyped SNPs - coef_genot = stats::coef(glmnet::cv.glmnet(x = genot_VMRi, #Variables - y = summVMRi[,VMR_i$VMR_index], #Response - alpha = 1, - nfolds = 5, - penalty.factor = c(rep(1, ncol(genot_VMRi)- ncol_covariates), - rep(0, ncol_covariates))), #Unpenalize the variables in covariates (i.e., force LASSO to keep them in all the situations) - s = "lambda.min") - #Select the variables with a coefficient > 0 - coef_genot = coef_genot[abs(coef_genot[,1]) > 0,] - selected_vars_genot = names(coef_genot)[-1] - selected_vars_genot = selected_vars_genot[!selected_vars_genot %in% colnames(covariates)] #Remove covariates from selected variables - } else selected_vars_genot = character(0) + # Get coefficients with the optimal lambda found by k-fold cross-validation + if (any_snp) { # Catch cases when VML dont have surrounding genotyped SNPs + coef_genot <- stats::coef( + glmnet::cv.glmnet( + x = genot_VMLi, # Variables + y = summVMLi[, VML_i$VML_index], # Response + alpha = 1, + nfolds = 5, + penalty.factor = c( + rep(1, ncol(genot_VMLi) - ncol_covariates), + rep(0, ncol_covariates) + ) + ), # Unpenalize the variables in covariates (i.e., force LASSO to keep them in all the situations) + s = "lambda.min" 
+ ) + # Select the variables with a coefficient > 0 + coef_genot <- coef_genot[abs(coef_genot[, 1]) > 0, ] + selected_vars_genot <- names(coef_genot)[-1] + selected_vars_genot <- selected_vars_genot[!selected_vars_genot %in% colnames(covariates)] # Remove covariates from selected variables + } else { + selected_vars_genot <- character(0) + } - #Environment only - #Get coefficients with the optimal lambda found by k-fold cross-validation - if (!is.null(environ_VMRi)){ #catch scenario where users would not add environmental variables - coef_env = stats::coef(glmnet::cv.glmnet(x = environ_VMRi, #Variables - y = summVMRi[,VMR_i$VMR_index], #Response - alpha = 1, - nfolds = 5, - penalty.factor = c(rep(1, ncol(environ_VMRi)- ncol_covariates), #Unpenalize the variables in covariates (i.e., force LASSO to keep them in all the situations) - rep(0, ncol_covariates))), - s = "lambda.min") - #Select the variables with a coefficient > 0 - coef_env = coef_env[abs(coef_env[,1]) > 0,] - selected_vars_env = names(coef_env)[-1] #Remove the intercept from the variables - selected_vars_env = selected_vars_env[!selected_vars_env %in% colnames(covariates)] #Remove covariates from selected variables + # Environment only + # Get coefficients with the optimal lambda found by k-fold cross-validation + if (!is.null(environ_VMLi)) { # catch scenario where users would not add environmental variables + coef_env <- stats::coef( + glmnet::cv.glmnet( + x = environ_VMLi, # Variables + y = summVMLi[, VML_i$VML_index], # Response + alpha = 1, + nfolds = 5, + penalty.factor = c( + rep(1, ncol(environ_VMLi) - ncol_covariates), # Unpenalize the variables in covariates (i.e., force LASSO to keep them in all the situations) + rep(0, ncol_covariates) + ) + ), + s = "lambda.min" + ) + # Select the variables with a coefficient > 0 + coef_env <- coef_env[abs(coef_env[, 1]) > 0, ] + selected_vars_env <- names(coef_env)[-1] # Remove the intercept from the variables + selected_vars_env <- 
selected_vars_env[!selected_vars_env %in% colnames(covariates)] # Remove covariates from selected variables - if (any_snp){ - #Joint (environment + genotype) only when we have Genotype and Environmental variables. - #Get coefficients with the optimal lambda found by k-fold cross-validation - coef_joint = stats::coef(glmnet::cv.glmnet(x = environ_genot_VMRi, #Variables - y = summVMRi[,VMR_i$VMR_index], #Response - alpha = 1, - nfolds = 5, - penalty.factor = c(rep(1, ncol(environ_genot_VMRi) - ncol_covariates), - rep(0, ncol_covariates))), #Unpenalize the variables in covariates (i.e., force LASSO to keep them in all the situations) - s = "lambda.min") - #Select the variables with an abs(coefficient) > 0 - coef_joint = coef_joint[abs(coef_joint[,1]) > 0,] - selected_vars_joint = names(coef_joint)[-1] #Remove the intercept from the variables - selected_vars_joint = selected_vars_joint[!selected_vars_joint %in% colnames(covariates)] #Remove covariates from selected variables - } else selected_vars_joint = character(0) + if (any_snp) { + # Joint (environment + genotype) only when we have Genotype and Environmental variables. 
+ # Get coefficients with the optimal lambda found by k-fold cross-validation + coef_joint <- stats::coef( + glmnet::cv.glmnet( + x = environ_genot_VMLi, # Variables + y = summVMLi[, VML_i$VML_index], # Response + alpha = 1, + nfolds = 5, + penalty.factor = c( + rep(1, ncol(environ_genot_VMLi) - ncol_covariates), + rep(0, ncol_covariates) + ) + ), # Unpenalize the variables in covariates (i.e., force LASSO to keep them in all the situations) + s = "lambda.min" + ) + # Select the variables with an abs(coefficient) > 0 + coef_joint <- coef_joint[abs(coef_joint[, 1]) > 0, ] + selected_vars_joint <- names(coef_joint)[-1] # Remove the intercept from the variables + selected_vars_joint <- selected_vars_joint[!selected_vars_joint %in% colnames(covariates)] # Remove covariates from selected variables + } else { + selected_vars_joint <- character(0) + } } else { - selected_vars_env = character(0) - selected_vars_joint = character(0) + selected_vars_env <- character(0) + selected_vars_joint <- character(0) } - #Merge results - selected_union_genot = c(selected_vars_genot, selected_vars_joint) %>% + # Merge results + selected_union_genot <- c(selected_vars_genot, selected_vars_joint) %>% unique() %>% - dplyr::setdiff(colnames(environ_VMRi)) #Remove environmental variables and covariates from the joint selection - selected_union_env = c(selected_vars_env, selected_vars_joint) %>% + dplyr::setdiff(colnames(environ_VMLi)) # Remove environmental variables and covariates from the joint selection + selected_union_env <- c(selected_vars_env, selected_vars_joint) %>% unique() %>% - dplyr::setdiff(colnames(genot_VMRi)) #Remove genotype variables and covariates from the joint selection + dplyr::setdiff(colnames(genot_VMLi)) # Remove genotype variables and covariates from the joint selection ### Create final data frame - selected_variables_final = data.frame( - VMR_index = VMR_i$VMR_index) - selected_variables_final$selected_genot = list(selected_union_genot) - 
selected_variables_final$selected_env = list(selected_union_env) + selected_variables_final <- data.frame( + VML_index = VML_i$VML_index + ) + selected_variables_final$selected_genot <- list(selected_union_genot) + selected_variables_final$selected_env <- list(selected_union_env) selected_variables_final } - return(lasso_results) + return(data.frame(lasso_results)) } diff --git a/R/summarizeVML.R b/R/summarizeVML.R new file mode 100644 index 0000000..7305c0a --- /dev/null +++ b/R/summarizeVML.R @@ -0,0 +1,75 @@ +#' Summarize the methylation states of Variable Methylated Loci (VML) +#' +#' This function computes a representative methylation score for each Variable Methylated Locus (VML) in a dataset. It returns a data frame with the median methylation of each region per individual. +#' For each VML in a dataset, returns a data frame with the median methylation of that region (columns) per individual (rows) as a representative score. +#' +#' This function supports parallel computing for increased speed. To do so, you have to set the parallel backend in your R session BEFORE running the function (e.g., *doParallel::registerDoParallel(4)*). After that, +#' the function can be run as usual. +#' +#' @param VML_df A GRanges-like data frame. Must contain the following columns: +#' "seqnames", "start", "end" and "probes" (containing lists as elements, where each contains a vector with the probes constituting the VML). This is the "VML" object returned by the *findVML()* function. +#' @param methylation_data A data frame containing M or B values, with samples as columns and probes as rows. Row names must be the CpG probe IDs. +#' +#' @return A data frame with samples as rows, and VML as columns. The value inside each cell corresponds to the summarized methylation value of said VML in the corresponding individual. The column names correspond to the VML_index. 
+#' +#' @importFrom foreach %dopar% +#' @export +#' +#' @examples +#' ## Find VML in test data +#' VML <- RAMEN::findVML( +#' methylation_data = RAMEN::test_methylation_data, +#' array_manifest = "IlluminaHumanMethylationEPICv1", +#' cor_threshold = 0, +#' var_method = "variance", +#' var_distribution = "ultrastable", +#' var_threshold_percentile = 0.99, +#' max_distance = 1000 +#' ) +#' +#' ## Summarize methylation states of the found VML +#' summarized_VML <- RAMEN::summarizeVML( +#' VML_df = VML$VML, +#' methylation_data = RAMEN::test_methylation_data +#' ) +#' + +summarizeVML <- function(VML_df, + methylation_data) { + #Input checks + if (!is.data.frame(VML_df)) stop("Please provide a data frame in VML_df" ) + if (!"VML_index" %in% colnames(VML_df)) { # Add a VML index to each region if not already existing + VML_df <- VML_df %>% + dplyr::mutate(VML_index = paste("VML", as.character(dplyr::row_number()), sep = "")) + } + if (!is.data.frame(methylation_data)) { + if (is.matrix(methylation_data)) { + methylation_data <- as.data.frame(methylation_data) + } else { + stop("Please make sure the methylation data is a data frame or matrix with samples as columns and probes as rows.") + } + } + if (!all(unique(unlist(VML_df$probes)) %in% rownames(methylation_data))) { + warning("Some probes listed in the VML data frame are not found in the methylation data. Please check that all probes listed in the 'probes' column of the VML data frame are present in the row names of the methylation data frame to avoid having NAs.") + } + + # Check that probes is a list. 
+ if (!is.list(VML_df$probes)) { + stop("Please make sure the 'probes' column in the VML data frame is a column of lists") + } + + VML_index <- i <- NULL # To avoid R CMD check notes + summarized_VML <- foreach::foreach(i = VML_df$VML_index, .combine = "cbind") %dopar% { + probes <- VML_df %>% + dplyr::filter(VML_index == i) %>% + dplyr::pull(probes) %>% + unlist() + subset_meth <- methylation_data[probes, ] %>% + t() %>% + as.data.frame() + median <- data.frame(apply(subset_meth, 1, median)) + colnames(median) <- i + median + } + return(summarized_VML) +} diff --git a/R/summarizeVMRs.R b/R/summarizeVMRs.R deleted file mode 100644 index 8327c14..0000000 --- a/R/summarizeVMRs.R +++ /dev/null @@ -1,47 +0,0 @@ -#' Summarize the methylation states of Variable Methylated Regions (VMRs) -#' -#' For each VMR in a dataset, returns an object with the median methylation of that region per individual as representative score. -#' -#' This function supports parallel computing for increased speed. To do so, you have to set the parallel backend -#' in your R session BEFORE running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, -#' the function can be run as usual. -#' -#' @param VMRs_df A GRanges object converted to a data frame. Must contain the following columns: -#' "seqnames", "start", "end" (all of which are produced automatically when doing the object conversion) -#' and "probes" (containing a list where each element contains a vector with the probes -#' constituting the VMR). -#' @param methylation_data A data frame containing M or B values, with samples as columns and probes as rows. -#' -#' @return A data frame with samples as rows, and VMRs as columns. The value inside each cell corresponds to the summarized -#' methylation value of said VMR in the corresponding individual. 
The column names correspond to the VMR_index, which is created if not -#' already existing based on the rownames of the VMR_df. -#' -#' @importFrom foreach %dopar% -#' @export - -summarizeVMRs = function(VMRs_df, - methylation_data){ - if(!"VMR_index" %in% colnames(VMRs_df)){ # Add a VMR index to each region if not already existing - VMRs_df = VMRs_df %>% - tibble::rownames_to_column(var = "VMR_index") - } - - # Check that probes is a list. - if(!is.list(VMRs_df$probes)){ - stop("Please make sure the 'probes' column in VMRs_df is a column of lists") - } - - summarized_VMRs = foreach::foreach(i = VMRs_df$VMR_index, .combine = "cbind") %dopar% { - probes = VMRs_df %>% - dplyr::filter(VMR_index == i) %>% - dplyr::pull(probes) %>% - unlist() - subset_meth = methylation_data[probes, ] %>% - t() %>% - as.data.frame() - median = data.frame(apply(subset_meth,1,median)) - colnames(median) = i - median - } - return(summarized_VMRs) -} diff --git a/R/test_array_manifest.R b/R/test_array_manifest.R deleted file mode 100644 index c51fb4d..0000000 --- a/R/test_array_manifest.R +++ /dev/null @@ -1,15 +0,0 @@ -#' Array manifest example data set -#' -#' A subset of data from Illumina's EPIC array manifest (first 3,000 probes of the chromosome 21). -#' -#' @format ## `test_array_manifest` -#' A data frame with 3,000 rows and 3 columns: -#' \describe{ -#' \item{*rownames*}{Probe ID - for storage reasons, this variable was stored as row names, but rownames have to be converted to a new column called "TargetID" prior to its use in RAMEN.} -#' \item{MAPINFO}{Probe genomic position (h19)} -#' \item{CHR}{Chromosome} -#' \item{STRAND}{Strand} -#' ... 
-#' } -#' @source -"test_array_manifest" diff --git a/R/test_covariates.R b/R/test_covariates.R index a1ffe62..812a9cd 100644 --- a/R/test_covariates.R +++ b/R/test_covariates.R @@ -9,4 +9,3 @@ #' \item{covar1}{Concomitant variable drawn from a normal distribution with mean=0 and sd=1} #' } "test_covariates" - diff --git a/R/ultrastable_cpgs.R b/R/ultrastable_cpgs.R new file mode 100644 index 0000000..190964b --- /dev/null +++ b/R/ultrastable_cpgs.R @@ -0,0 +1,9 @@ +#' Ultrastable probes +#' +#' This data set contains the list of ultrastable probes identified by [Rachel Edgar et al. (2014)](https://doi.org/10.1186/1756-8935-7-28). This publication identified ultrastable CpGs across many tissues and conditions using the Illumina 450k array. Ultrastable probes are defined as CpGs consistently methylated or unmethylated in every sample (1,737 samples from 30 publicly available studies). These CpGs are used to create a "null DNAme variance" distribution in the RAMEN package, from which a threshold is taken to identify Highly Variable Probes. +#' +#' @format ## `ultrastable_cpgs` +#' A vector with the names of the 15,224 ultrastable probes identified by Edgar et al. (2014). The names of the probes are based on the Illumina 450k manifest. +#' +#' @source https://static-content.springer.com/esm/art%3A10.1186%2F1756-8935-7-28/MediaObjects/13072_2014_333_MOESM2_ESM.txt +"ultrastable_cpgs" diff --git a/R/zzz.R b/R/zzz.R new file mode 100644 index 0000000..5812301 --- /dev/null +++ b/R/zzz.R @@ -0,0 +1,24 @@ +.onAttach <- function(libname, pkgname) { + packageStartupMessage( + " __ _ ___ + )_) /_) )\\/) )_ )\\ ) + / \\ / / ( ( (__ ( \\( + + ( ) ( ( + ( ( ) ( ) + ) ) ( + _.(--'(''--.._ + /, _..-----).._,\\ + | `'''-----'''` | + \\ / + '. .' + '--.....--' + +If you use RAMEN for your analysis, please cite: Navarro-Delgado, E.I., +Czamara, D., Edwards, K. et al. 
RAMEN: Dissecting individual, additive +and interactive gene-environment contributions to DNA methylome variability +in cord blood. Genome Biol 26, 421 (2025). +https://doi.org/10.1186/s13059-025-03864-4", + domain = NULL, appendLF = TRUE + ) +} # ASCII letters were generated by https://ascii.co.uk/text; the bowl was taken from http://www.geocities.ws/SoHo/7373/food.html and modified diff --git a/RAMEN.Rproj b/RAMEN.Rproj new file mode 100644 index 0000000..69fafd4 --- /dev/null +++ b/RAMEN.Rproj @@ -0,0 +1,22 @@ +Version: 1.0 + +RestoreWorkspace: No +SaveWorkspace: No +AlwaysSaveHistory: Default + +EnableCodeIndexing: Yes +UseSpacesForTab: Yes +NumSpacesForTab: 2 +Encoding: UTF-8 + +RnwWeave: Sweave +LaTeX: pdfLaTeX + +AutoAppendNewline: Yes +StripTrailingWhitespace: Yes +LineEndingConversion: Posix + +BuildType: Package +PackageUseDevtools: Yes +PackageInstallArgs: --no-multiarch --with-keep.source +PackageRoxygenize: rd,collate,namespace diff --git a/README.Rmd b/README.Rmd index 30a2f28..8f73f3e 100644 --- a/README.Rmd +++ b/README.Rmd @@ -16,12 +16,16 @@ knitr::opts_chunk$set( # RAMEN +[![status](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active) [![DOI](https://zenodo.org/badge/585986641.svg)](https://zenodo.org/badge/latestdoi/585986641) +[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable) +[![Codecov test coverage](https://codecov.io/gh/ErickNavarroD/RAMEN/graph/badge.svg)](https://app.codecov.io/gh/ErickNavarroD/RAMEN) +[![R-CMD-check](https://github.com/ErickNavarroD/RAMEN/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ErickNavarroD/RAMEN/actions/workflows/R-CMD-check.yaml) ## Overview -Regional Association of Methylome variability with the Exposome and geNome (RAMEN) is an R package whose goal is to identify Variable Methylated Regions (VMRs) in microarray DNA methylation data. 
Additionally, using Genotype (G) and Environmental (E) data, it can identify which G, E, G+E or GxE model better explains this variability. +Regional Association of Methylome variability with the Exposome and geNome (RAMEN) is an R package whose goal is to identify genome-wide Variable Methylated Loci (VML) from microarray DNA methylation data. Then, using genomic and exposomic data, it can identify which model out of the following best explains the DNA methylation variability at each VML: genetic (G), environmental (E), additive (G+E) or interactive (GxE). ## Installation @@ -36,14 +40,14 @@ devtools::install_github("ErickNavarroD/RAMEN") RAMEN consists of six main functions: -- `findVMRs()` identifies Variable Methylated Regions (VMRs) in microarrays -- `summarizeVMRs()`summarizes the regional methylation state of each VMR -- `findCisSNPs()` identifies the SNPs in *cis* of each VMR -- `selectVariables()` conducts a LASSO-based variable selection strategy to identify potentially relevant *cis* SNPs and environmental variables -- `lmGE()` fits linear single-variable genetic (G) and environmental (E), and pairwise additive (G+E) and interaction (GxE) linear models and select the best explanatory model per VMR. -- `nullDistGE()` simulates a delta R squared null distribution of G and E effects on DNAme variability. Useful for filtering out poor-performing best explanatory models selected by *lmGE()*. -Altogether, these functions create a pipeline that takes a set of individuals with genotype, environmental exposure and DNA methylation information, and generates an estimation of the contribution of the genotype and environment to its DNA methylation variability. Functions that conduct computationally intensive tasks are compatible with parallel computing. 
+- `findVML()` identifies Variable Methylated Loci (VML) from microarray data +- `summarizeVML()` summarizes the regional methylation state of each VML +- `findCisSNPs()` identifies the SNPs in *cis* of each VML +- `selectVariables()` conducts a LASSO-based feature selection strategy to identify potentially relevant *cis* SNPs and environmental variables +- `lmGE()` fits single-variable genetic (G), environmental (E), pairwise additive (G+E) and pairwise interaction (GxE) linear models, and selects the best explanatory model for each VML. 
+- `nullDistGE()` simulates a null distribution of G and E effects on DNAme variability. Useful for filtering out poor-performing best explanatory models selected by *lmGE()*. -Altogether, these functions create a pipeline that takes a set of individuals with genotype, environmental exposure and DNA methylation information, and generates an estimation of the contribution of the genotype and environment to its DNA methylation variability. Functions that conduct computationally intensive tasks are compatible with parallel computing. +Altogether, these functions create a pipeline that takes a set of individuals with genome, exposome and DNA methylome information, and generates an estimation of the contribution of genetic variants and environmental exposures to its DNA methylation variability. Functions that conduct computationally intensive tasks are compatible with parallel computing. @@ -53,22 +57,22 @@ For a detailed tutorial on how to use RAMEN, please check the package's vignette ## Variations to the standard workflow -Besides using RAMEN for completing the analysis mentioned above, the package provides individual functions that could help users in other tasks, such as: +Besides using RAMEN for a gene-environment contribution analysis, the package provides individual functions that could help users in other tasks, such as: - - Reduction of tests prior to an EWAS or differential methylation analysis (i.e., conducting the analyses on identified VMRs to reduce redundant tests by grouping nearby correlated CpGs and to avoid tests in non-variant regions) - - Fit additive and interaction models given a set of variables of interest and select the best explanatory model for DNAme data. - - Quickly identify SNPs in *cis* of CpG probes for variable reduction during mQTL analyses. - - Get the median correlation of probes in regions of interest (with `medCorVMR()`). 
+ - Reduction of multiple hypothesis test burden in EWAS or differential methylation analysis by using VML instead of individual probes. + - Fit additive and interaction models given a set of variables of interest and select the best explanatory model for DNAme data (e.g. epistasis or ExE studies). + - Quickly identify SNPs in *cis* of CpG probes. + - Get the median correlation of probes in custom regions of interest with `medCorVMR()`. ## How to get help for RAMEN -If you have any question about RAMEN usage, please post an issue in this github repository so that future users also benefit from the discussion As an alternative option, you can contact Erick Navarro-Delgado at [erick.navarrodelgado\@bcchr.ca](mailto:erick.navarrodelgado@bcchr.ca){.email}. +If you have any questions about RAMEN usage, please post an issue in this GitHub repository so that future users also benefit from the discussion. As an alternative option, you can contact Erick Navarro-Delgado at [erick.delgado\@ubc.ca](mailto:erick.delgado@ubc.ca){.email}. ## Acknowledgments I want to thank Dr. Keegan Korthauer and Dr. Michael S. Kobor for their supervision, feedback and support throughout the development of this package. Also, I want to thank the members of the Kobor and Korthauer lab for their comments and discussion. -The RAMEN package logo was created by Carlos Cortés-Quiñones and Dorothy Lin. Carlos created the drawing, and Dorothy refined the logo and did the lettering. +The RAMEN package logo was created by Carlos Cortés-Quiñones and Dorothy Lin. Carlos created the drawing, and Dorothy refined the logo and did the lettering. ## Funding @@ -76,7 +80,14 @@ This work was supported by the University of British Columbia, the BC Children's ## Citing RAMEN -The manuscript detailing RAMEN and its use is currently under preparation. For more information about this please contact Erick I. Navarro-Delgado at [erick.navarrodelgado\@bcchr.ca](mailto:erick.navarrodelgado@bcchr.ca){.email}. 
+If you use RAMEN for any of your analyses, please cite the following publication: + + - Navarro-Delgado, E.I., Czamara, D., Edwards, K. et al. RAMEN: Dissecting individual, additive and interactive gene-environment contributions to DNA methylome variability in cord blood. *Genome Biol* 26, 421 (2025). https://doi.org/10.1186/s13059-025-03864-4 + +## Code of conduct +Please note that this package is released with a [Contributor +Code of Conduct](https://ropensci.org/code-of-conduct/). +By contributing to this project, you agree to abide by its terms. ## Licence diff --git a/README.md b/README.md index 00720d1..1aa3a23 100644 --- a/README.md +++ b/README.md @@ -5,16 +5,24 @@ +[![status](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active) [![DOI](https://zenodo.org/badge/585986641.svg)](https://zenodo.org/badge/latestdoi/585986641) +[![Lifecycle: +stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable) +[![Codecov test +coverage](https://codecov.io/gh/ErickNavarroD/RAMEN/graph/badge.svg)](https://app.codecov.io/gh/ErickNavarroD/RAMEN) +[![R-CMD-check](https://github.com/ErickNavarroD/RAMEN/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ErickNavarroD/RAMEN/actions/workflows/R-CMD-check.yaml) ## Overview Regional Association of Methylome variability with the Exposome and -geNome (RAMEN) is an R package whose goal is to identify Variable -Methylated Regions (VMRs) in microarray DNA methylation data. -Additionally, using Genotype (G) and Environmental (E) data, it can -identify which G, E, G+E or GxE model better explains this variability. +geNome (RAMEN) is an R package whose goal is to identify genome-wide +Variable Methylated Loci (VML) from microarray DNA methylation data. 
+Then, using genomic and exposomic data, it can identify which model out +of the following best explains the DNA methylation variability at each +VML: genetic (G), environmental (E), additive (G+E) or interactive +(GxE). ## Installation @@ -30,24 +38,24 @@ devtools::install_github("ErickNavarroD/RAMEN") RAMEN consists of six main functions: -- `findVMRs()` identifies Variable Methylated Regions (VMRs) in - microarrays -- `summarizeVMRs()`summarizes the regional methylation state of each VMR -- `findCisSNPs()` identifies the SNPs in *cis* of each VMR -- `selectVariables()` conducts a LASSO-based variable selection strategy +- `findVML()` identifies Variable Methylated Loci (VML) from microarray + data +- `summarizeVML()` summarizes the regional methylation state of each VML +- `findCisSNPs()` identifies the SNPs in *cis* of each VML +- `selectVariables()` conducts a LASSO-based feature selection strategy to identify potentially relevant *cis* SNPs and environmental variables -- `lmGE()` fits linear single-variable genetic (G) and environmental - (E), and pairwise additive (G+E) and interaction (GxE) linear models - and select the best explanatory model per VMR. -- `nullDistGE()` simulates a delta R squared null distribution of G and - E effects on DNAme variability. Useful for filtering out - poor-performing best explanatory models selected by *lmGE()*. +- `lmGE()` fits single-variable genetic (G), environmental (E), + pairwise additive (G+E) and pairwise interaction (GxE) linear models, + and selects the best explanatory model for each VML. +- `nullDistGE()` simulates a null distribution of G and E effects on + DNAme variability. Useful for filtering out poor-performing best + explanatory models selected by *lmGE()*. 
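To make the `summarizeVML()` step in the list above concrete, here is a minimal base-R sketch of the median summary it computes. The probe IDs, sample names and all `toy_*` objects are hypothetical and are not part of RAMEN; the package itself runs this loop through `foreach`, which the sketch replaces with `sapply()` for brevity.

```r
# Toy methylation data: probes as rows, samples as columns (hypothetical values)
toy_methylation <- data.frame(
  sample1 = c(0.10, 0.20, 0.90, 0.80),
  sample2 = c(0.30, 0.50, 0.40, 0.60),
  row.names = c("cg01", "cg02", "cg03", "cg04")
)

# Toy loci: each VML is defined by the probes it contains (list column)
toy_VML <- data.frame(VML_index = c("VML1", "VML2"))
toy_VML$probes <- list(c("cg01", "cg02"), c("cg03", "cg04"))

# For each locus, take the median across its probes for every sample
summarized <- sapply(
  setNames(toy_VML$probes, toy_VML$VML_index),
  function(p) apply(t(toy_methylation[p, , drop = FALSE]), 1, median)
)

# Samples end up as rows and loci as columns
summarized <- as.data.frame(summarized)
```

The result has one row per sample and one column per locus, which matches the layout described in the `summarizeVML()` documentation.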
Altogether, these functions create a pipeline that takes a set of -individuals with genotype, environmental exposure and DNA methylation -information, and generates an estimation of the contribution of the -genotype and environment to its DNA methylation variability. Functions +individuals with genome, exposome and DNA methylome information, and +generates an estimation of the contribution of genetic variants and +environmental exposures to its DNA methylation variability. Functions that conduct computationally intensive tasks are compatible with parallel computing. @@ -61,27 +69,25 @@ vignette or ## Variations to the standard workflow -Besides using RAMEN for completing the analysis mentioned above, the +Besides using RAMEN for a gene-environment contribution analysis, the package provides individual functions that could help users in other tasks, such as: -- Reduction of tests prior to an EWAS or differential methylation - analysis (i.e., conducting the analyses on identified VMRs to reduce - redundant tests by grouping nearby correlated CpGs and to avoid tests - in non-variant regions) +- Reduction of multiple hypothesis test burden in EWAS or differential + methylation analysis by using VML instead of individual probes. - Fit additive and interaction models given a set of variables of - interest and select the best explanatory model for DNAme data. -- Quickly identify SNPs in *cis* of CpG probes for variable reduction - during mQTL analyses. -- Get the median correlation of probes in regions of interest (with - `medCorVMR()`). + interest and select the best explanatory model for DNAme data + (e.g. epistasis or ExE studies). +- Quickly identify SNPs in *cis* of CpG probes. +- Get the median correlation of probes in custom regions of interest + with `medCorVMR()`. 
## How to get help for RAMEN If you have any questions about RAMEN usage, please post an issue in this GitHub repository so that future users also benefit from the discussion. As an alternative option, you can contact Erick Navarro-Delgado at -. +. ## Acknowledgments @@ -91,8 +97,8 @@ package. Also, I want to thank the members of the Kobor and Korthauer lab for their comments and discussion. The RAMEN package logo was created by Carlos Cortés-Quiñones and Dorothy -Lin. Carlos created the drawing, and Dorothy refined the logo and did -the lettering. +Lin. Carlos created the drawing, and Dorothy refined the logo and did the +lettering. ## Funding @@ -101,9 +107,19 @@ Children’s Hospital Research Institute and the Social Exposome Cluster. ## Citing RAMEN -The manuscript detailing RAMEN and its use is currently under -preparation. For more information about this please contact Erick I. -Navarro-Delgado at . +If you use RAMEN for any of your analyses, please cite the following +publication: + +- Navarro-Delgado, E.I., Czamara, D., Edwards, K. et al. RAMEN: + Dissecting individual, additive and interactive gene-environment + contributions to DNA methylome variability in cord blood. *Genome + Biol* 26, 421 (2025). + +## Code of conduct + +Please note that this package is released with a [Contributor Code of +Conduct](https://ropensci.org/code-of-conduct/). By contributing to this +project, you agree to abide by its terms. 
## Licence diff --git a/codemeta.json b/codemeta.json new file mode 100644 index 0000000..dbbcf02 --- /dev/null +++ b/codemeta.json @@ -0,0 +1,299 @@ +{ + "@context": "https://doi.org/10.5063/schema/codemeta-2.0", + "@type": "SoftwareSourceCode", + "identifier": "RAMEN", + "description": "R package that identifies which genetic (G), environmental (E), additive (G+E) or interaction (GxE) effect better explains DNA methylation levels in Variable Methylated Loci using microarray data.", + "name": "RAMEN: RAMEN: Regional Association of Methylome variability with the Exposome and geNome", + "codeRepository": "https://github.com/ErickNavarroD/RAMEN", + "issueTracker": "https://github.com/ErickNavarroD/RAMEN/issues", + "license": "https://spdx.org/licenses/GPL-3.0", + "version": "1.0.0.9003", + "programmingLanguage": { + "@type": "ComputerLanguage", + "name": "R", + "url": "https://r-project.org" + }, + "runtimePlatform": "R version 4.4.2 (2024-10-31)", + "author": [ + { + "@type": "Person", + "givenName": "Erick I.", + "familyName": "Navarro-Delgado", + "email": "ericknadel98@hotmail.com", + "@id": "https://orcid.org/0000-0003-1040-3519" + } + ], + "maintainer": [ + { + "@type": "Person", + "givenName": "Erick I.", + "familyName": "Navarro-Delgado", + "email": "ericknadel98@hotmail.com", + "@id": "https://orcid.org/0000-0003-1040-3519" + } + ], + "softwareSuggestions": [ + { + "@type": "SoftwareApplication", + "identifier": "BiocStyle", + "name": "BiocStyle", + "provider": { + "@id": "https://www.bioconductor.org", + "@type": "Organization", + "name": "Bioconductor", + "url": "https://www.bioconductor.org" + }, + "sameAs": "https://bioconductor.org/packages/release/bioc/html/BiocStyle.html" + }, + { + "@type": "SoftwareApplication", + "identifier": "knitr", + "name": "knitr", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": 
"https://CRAN.R-project.org/package=knitr" + }, + { + "@type": "SoftwareApplication", + "identifier": "rmarkdown", + "name": "rmarkdown", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=rmarkdown" + }, + { + "@type": "SoftwareApplication", + "identifier": "ggplot2", + "name": "ggplot2", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=ggplot2" + }, + { + "@type": "SoftwareApplication", + "identifier": "tidyr", + "name": "tidyr", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=tidyr" + }, + { + "@type": "SoftwareApplication", + "identifier": "testthat", + "name": "testthat", + "version": ">= 3.0.0", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=testthat" + } + ], + "softwareRequirements": { + "1": { + "@type": "SoftwareApplication", + "identifier": "doRNG", + "name": "doRNG", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=doRNG" + }, + "2": { + "@type": "SoftwareApplication", + "identifier": "dplyr", + "name": "dplyr", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + 
"sameAs": "https://CRAN.R-project.org/package=dplyr" + }, + "3": { + "@type": "SoftwareApplication", + "identifier": "foreach", + "name": "foreach", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=foreach" + }, + "4": { + "@type": "SoftwareApplication", + "identifier": "GenomicRanges", + "name": "GenomicRanges", + "provider": { + "@id": "https://www.bioconductor.org", + "@type": "Organization", + "name": "Bioconductor", + "url": "https://www.bioconductor.org" + }, + "sameAs": "https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html" + }, + "5": { + "@type": "SoftwareApplication", + "identifier": "glmnet", + "name": "glmnet", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=glmnet" + }, + "6": { + "@type": "SoftwareApplication", + "identifier": "IlluminaHumanMethylation450kanno.ilmn12.hg19", + "name": "IlluminaHumanMethylation450kanno.ilmn12.hg19" + }, + "7": { + "@type": "SoftwareApplication", + "identifier": "IlluminaHumanMethylationEPICanno.ilm10b4.hg19", + "name": "IlluminaHumanMethylationEPICanno.ilm10b4.hg19" + }, + "8": { + "@type": "SoftwareApplication", + "identifier": "IlluminaHumanMethylationEPICv2anno.20a1.hg38", + "name": "IlluminaHumanMethylationEPICv2anno.20a1.hg38" + }, + "9": { + "@type": "SoftwareApplication", + "identifier": "IRanges", + "name": "IRanges", + "provider": { + "@id": "https://www.bioconductor.org", + "@type": "Organization", + "name": "Bioconductor", + "url": "https://www.bioconductor.org" + }, + "sameAs": "https://bioconductor.org/packages/release/bioc/html/IRanges.html" + }, + "10": { + "@type": "SoftwareApplication", + "identifier": "iterators", + "name": "iterators", + 
"provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=iterators" + }, + "11": { + "@type": "SoftwareApplication", + "identifier": "lifecycle", + "name": "lifecycle", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=lifecycle" + }, + "12": { + "@type": "SoftwareApplication", + "identifier": "magrittr", + "name": "magrittr", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=magrittr" + }, + "13": { + "@type": "SoftwareApplication", + "identifier": "relaimpo", + "name": "relaimpo", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=relaimpo" + }, + "14": { + "@type": "SoftwareApplication", + "identifier": "S4Vectors", + "name": "S4Vectors", + "provider": { + "@id": "https://www.bioconductor.org", + "@type": "Organization", + "name": "Bioconductor", + "url": "https://www.bioconductor.org" + }, + "sameAs": "https://bioconductor.org/packages/release/bioc/html/S4Vectors.html" + }, + "15": { + "@type": "SoftwareApplication", + "identifier": "stats", + "name": "stats" + }, + "16": { + "@type": "SoftwareApplication", + "identifier": "stringr", + "name": "stringr", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=stringr" 
+ }, + "17": { + "@type": "SoftwareApplication", + "identifier": "tibble", + "name": "tibble", + "provider": { + "@id": "https://cran.r-project.org", + "@type": "Organization", + "name": "Comprehensive R Archive Network (CRAN)", + "url": "https://cran.r-project.org" + }, + "sameAs": "https://CRAN.R-project.org/package=tibble" + }, + "18": { + "@type": "SoftwareApplication", + "identifier": "R", + "name": "R", + "version": ">= 4.2.0" + }, + "SystemRequirements": null + }, + "fileSize": "3425.405KB", + "relatedLink": "https://ericknavarrod.github.io/RAMEN/", + "readme": "https://github.com/ErickNavarroD/RAMEN/blob/master/README.md", + "keywords": ["exposome", "genome", "methylation-analysis", "methylation-microarrays", "multiomics", "r-package", "bioinformatics-tool", "dna-methylation"] +} diff --git a/data-raw/test_array_manifest.R b/data-raw/test_array_manifest.R deleted file mode 100644 index 3ea4009..0000000 --- a/data-raw/test_array_manifest.R +++ /dev/null @@ -1,18 +0,0 @@ -## code to prepare `test_array_manifest` -temp <- tempfile() -download.file("https://webdata.illumina.com/downloads/productfiles/methylationEPIC/infinium-methylationepic-v-1-0-b4-manifest-file-csv.zip",temp, mode="wb") -unzip(temp) -fData_EPIC <- read_csv("MethylationEPIC_v-1-0_B4.csv", - skip = 7) -array_manifest = fData_EPIC %>% - dplyr::mutate(STRAND = rep(BiocGenerics::strand("+"), nrow(fData_EPIC))) %>% - dplyr::select(MAPINFO, CHR, IlmnID, STRAND) - -#Get the first 3k probes of the 21 chromosome -test_array_manifest = array_manifest %>% - filter(CHR == "21") %>% - arrange(as.numeric(MAPINFO)) %>% - slice_head(n = 3000) %>% - select(-IlmnID) #Remove this column because it takes a lot of space when saving the object, and it is already present in the rownames - -usethis::use_data(test_array_manifest, overwrite = TRUE) diff --git a/data-raw/test_covariates.R b/data-raw/test_covariates.R index fc703f4..58a9cf0 100644 --- a/data-raw/test_covariates.R +++ b/data-raw/test_covariates.R @@ -1,10 
+1,11 @@ ## code to prepare `test_covariates` dataset set.seed(123) -sample_size = 30 -test_covariates = matrix(rnorm(sample_size, 0, 1), - nrow = sample_size, - ncol = 1) -rownames(test_covariates) = paste("ID", as.character(1:sample_size), sep = "") -colnames(test_covariates) = "covar1" +sample_size <- 30 +test_covariates <- matrix(rnorm(sample_size, 0, 1), + nrow = sample_size, + ncol = 1 +) +rownames(test_covariates) <- paste("ID", as.character(1:sample_size), sep = "") +colnames(test_covariates) <- "covar1" usethis::use_data(test_covariates, overwrite = TRUE) diff --git a/data-raw/test_environmental_matrix.R b/data-raw/test_environmental_matrix.R index 1e79669..9775637 100644 --- a/data-raw/test_environmental_matrix.R +++ b/data-raw/test_environmental_matrix.R @@ -1,11 +1,16 @@ ## code to prepare `test_environmental_matrix` # Simulate environmental data for 100 variables set.seed(123) -sample_size = 30 -test_environmental_matrix = matrix(rnorm(100*sample_size, 0, 1), - nrow = sample_size, - ncol = 100) -rownames(test_environmental_matrix) = paste("ID", as.character(1:sample_size), sep = "") -colnames(test_environmental_matrix) = paste("E", as.character(1:100), sep = "") +sample_size <- 30 +test_environmental_matrix <- matrix(rnorm(100 * sample_size, 0, 1), + nrow = sample_size, + ncol = 100 +) +rownames(test_environmental_matrix) <- paste("ID", + as.character(1:sample_size), + sep = "") +colnames(test_environmental_matrix) <- paste("E", + as.character(1:100), + sep = "") usethis::use_data(test_environmental_matrix, overwrite = TRUE) diff --git a/data-raw/test_genotype_matrix.R b/data-raw/test_genotype_matrix.R index 825bc42..2ad1b1e 100644 --- a/data-raw/test_genotype_matrix.R +++ b/data-raw/test_genotype_matrix.R @@ -1,12 +1,22 @@ ## code to prepare `test_genotype_matrix` dataset goes here -# This code makes use of test_genotype_information, which was created by just extracting the SNP positions from a real private data set to have an example of the SNP IDs. 
+# This code makes use of test_genotype_information, which was created by just +# extracting the SNP positions from a real private data set to have an example +# of the SNP IDs. load(test_genotype_information.Rdata) set.seed(123) -test_genotype_matrix = matrix(rbinom(nrow(test_genotype_information)*sample_size, 2, 0.5), - ncol = sample_size, - nrow = nrow(test_genotype_information)) -colnames(test_genotype_matrix) = paste("ID", as.character(1:sample_size), sep = "") -rownames(test_genotype_matrix) = test_genotype_information$ID +test_genotype_matrix <- matrix( + rbinom(nrow(test_genotype_information) * sample_size, + 2, + 0.5 + ), + ncol = sample_size, + nrow = nrow(test_genotype_information) +) +colnames(test_genotype_matrix) <- paste("ID", + as.character(1:sample_size), + sep = "" + ) +rownames(test_genotype_matrix) <- test_genotype_information$ID usethis::use_data(test_genotype_matrix, overwrite = TRUE) diff --git a/data-raw/test_methylation_data.R b/data-raw/test_methylation_data.R index c2efe30..54bb1d9 100644 --- a/data-raw/test_methylation_data.R +++ b/data-raw/test_methylation_data.R @@ -1,35 +1,46 @@ ## code to prepare `test_methylation_data` dataset goes here temp <- tempfile() -download.file("https://webdata.illumina.com/downloads/productfiles/methylationEPIC/infinium-methylationepic-v-1-0-b4-manifest-file-csv.zip",temp, mode="wb") +download.file("https://webdata.illumina.com/downloads/productfiles/methylationEPIC/infinium-methylationepic-v-1-0-b4-manifest-file-csv.zip", temp, mode = "wb") unzip(temp) -fData_EPIC <- read_csv("MethylationEPIC_v-1-0_B4.csv", - skip = 7) -array_manifest = fData_EPIC %>% - dplyr::mutate(STRAND = rep(BiocGenerics::strand("+"), nrow(fData_EPIC))) %>% +fData_epic <- read_csv("MethylationEPIC_v-1-0_B4.csv", + skip = 7 +) +array_manifest <- fData_epic |> + dplyr::mutate(STRAND = rep(BiocGenerics::strand("+"), nrow(fData_epic))) |> dplyr::select(MAPINFO, CHR, IlmnID, STRAND) -#Get the first 3k probes of the 21 chromosome 
-test_array_manifest = array_manifest %>% - filter(CHR == "21") %>% - arrange(as.numeric(MAPINFO)) %>% - slice_head(n = 3000) %>% - select(-IlmnID) #Remove this column because it takes a lot of space when saving the object, and it is already present in the rownames +# Get the first 3k probes of the 21 chromosome +test_array_manifest <- array_manifest |> + filter(CHR == "21") |> + arrange(as.numeric(MAPINFO)) |> + slice_head(n = 3000) |> + select(-IlmnID) # Remove this column because it takes a lot of space when + #saving the object, and it is already present in the rownames ## Create DNAme dataset -sample_size = 30 +sample_size <- 30 # Simulate DNAme data set.seed(123) -distribution_betas = c(rbeta(n = sample_size*nrow(test_array_manifest)/3*2, 5,1), #Methylated distribution - 2 thirds of the distribution - rbeta(n = sample_size*nrow(test_array_manifest)/3, 2, 10)) #Unmethylated distribution - 1 third of the dist -m_values = log2(distribution_betas/(1-distribution_betas)) +distribution_betas <- c( + # Methylated distribution - 2 thirds of the distribution + rbeta(n = sample_size * nrow(test_array_manifest) / 3 * 2, 5, 1), + # Unmethylated distribution - 1 third of the distribution + rbeta(n = sample_size * nrow(test_array_manifest) / 3, 2, 10) +) +m_values <- log2(distribution_betas / (1 - distribution_betas)) -#Make it a data frame -test_methylation_data = matrix(m_values, - nrow = nrow(test_array_manifest), ncol = sample_size) %>% +# Make it a data frame +test_methylation_data <- matrix(m_values, + nrow = nrow(test_array_manifest), ncol = sample_size +) |> as.data.frame() -colnames(test_methylation_data) = paste("ID", as.character(1:sample_size), sep = "") -rownames(test_methylation_data) = rownames(test_array_manifest) +colnames(test_methylation_data) <- paste( + "ID", + as.character(1:sample_size), + sep = "" + ) +rownames(test_methylation_data) <- rownames(test_array_manifest) usethis::use_data(test_methylation_data, overwrite = TRUE) diff --git 
a/data-raw/ultrastable_cpgs.R b/data-raw/ultrastable_cpgs.R new file mode 100644 index 0000000..e27bb86 --- /dev/null +++ b/data-raw/ultrastable_cpgs.R @@ -0,0 +1,7 @@ +## code to prepare `ultrastable_cpgs` + +ultrastable_cpgs <- read.table("https://static-content.springer.com/esm/art%3A10.1186%2F1756-8935-7-28/MediaObjects/13072_2014_333_MOESM2_ESM.txt") |> + tibble::rownames_to_column("probe_id") |> + dplyr::pull(probe_id) + +use_data(ultrastable_cpgs) diff --git a/data/test_array_manifest.rda b/data/test_array_manifest.rda deleted file mode 100644 index c436193..0000000 Binary files a/data/test_array_manifest.rda and /dev/null differ diff --git a/data/test_genotype_information.rda b/data/test_genotype_information.rda index eedebcd..be858f1 100644 Binary files a/data/test_genotype_information.rda and b/data/test_genotype_information.rda differ diff --git a/data/ultrastable_cpgs.rda b/data/ultrastable_cpgs.rda new file mode 100644 index 0000000..e4d9aef Binary files /dev/null and b/data/ultrastable_cpgs.rda differ diff --git a/inst/CITATION b/inst/CITATION new file mode 100644 index 0000000..e1daa30 --- /dev/null +++ b/inst/CITATION @@ -0,0 +1,11 @@ +bibentry( + bibtype = "Article", + title = "RAMEN: Dissecting individual, additive and interactive gene-environment contributions to DNA methylome variability in cord blood", + author = "Navarro-Delgado, E.I., Czamara, D., Edwards, K. 
et al.", + journal = "Genome Biology", + year = "2025", + volume = "26", + number = "1", + pages = "421", + doi = "10.1186/s13059-025-03864-4" +) diff --git a/man/RAMEN-package.Rd b/man/RAMEN-package.Rd new file mode 100644 index 0000000..5daaba1 --- /dev/null +++ b/man/RAMEN-package.Rd @@ -0,0 +1,25 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/RAMEN-package.R +\docType{package} +\name{RAMEN-package} +\alias{RAMEN} +\alias{RAMEN-package} +\title{RAMEN: RAMEN: Regional Association of Methylome variability with the Exposome and geNome} +\description{ +\if{html}{\figure{logo.png}{options: style='float: right' alt='logo' width='120'}} + +R package that identifies which genetic (G), environmental (E), additive (G+E) or interaction (GxE) effect better explains DNA methylation levels in Variable Methylated Loci using microarray data. +} +\seealso{ +Useful links: +\itemize{ + \item \url{https://ericknavarrod.github.io/RAMEN/} + \item Report bugs at \url{https://github.com/ErickNavarroD/RAMEN/issues} +} + +} +\author{ +\strong{Maintainer}: Erick I. 
Navarro-Delgado \email{ericknadel98@hotmail.com} (\href{https://orcid.org/0000-0003-1040-3519}{ORCID}) + +} +\keyword{internal} diff --git a/man/figures/RAMEN_pipeline.png b/man/figures/RAMEN_pipeline.png index 04e02c6..91260dc 100644 Binary files a/man/figures/RAMEN_pipeline.png and b/man/figures/RAMEN_pipeline.png differ diff --git a/man/findCisSNPs.Rd b/man/findCisSNPs.Rd index 13377d4..bfa6470 100644 --- a/man/findCisSNPs.Rd +++ b/man/findCisSNPs.Rd @@ -2,30 +2,60 @@ % Please edit documentation in R/findCisSNPs.R \name{findCisSNPs} \alias{findCisSNPs} -\title{Find cis SNPs around a set of Variable Methylated Regions (VMRs)} +\title{Find cis SNPs around a set of Variable Methylated Loci (VML)} \usage{ -findCisSNPs(VMRs_df, genotype_information, distance = 1e+06) +findCisSNPs(VML_df, genotype_information, distance = 1e+06) } \arguments{ -\item{VMRs_df}{A GRanges object converted to a data frame. Must contain the following columns: -"seqnames", "start", "end". These columns are present automatically when doing the object conversion and correspond to the chromosome number, and range of the region.} +\item{VML_df}{A GRanges-like data frame (i.e. the same columns as a GRanges +object converted to a data frame). Must contain the following columns: +"seqnames", "start", "end". These columns are present automatically when +doing the object conversion and correspond to the chromosome number, and +range of the region.} -\item{genotype_information}{A data frame with information about genotyped sites of interest. It must contain the following -columns: "CHROM" - chromosome number, "POS" - Genomic basepair position of SNP in the corresponding -chromosome (must contain values of class int), and "ID" - SNP ID. The nomenclature of CHROM must match with the one used in the VMRs_df seqnames column (i.e., if VMRs_df$seqnames uses 1, 2, 3, X, Y or Chr1, Chr2, Chr3, ChrX, ChrY, etc. 
as chromosome number, the genotype_information$CHROM values must be encoded in the same way).} +\item{genotype_information}{A data frame with information about genotyped +sites of interest. It must contain the following columns: "CHROM" +(chromosome number), "POS" (genomic basepair position of the SNP; must be an +integer), and "ID" (SNP ID). The nomenclature of CHROM must match the +one used in the VML_df seqnames column (i.e., if VML_df$seqnames uses 1, 2, +3, X, Y or Chr1, Chr2, Chr3, ChrX, ChrY, etc. as chromosome number, the +genotype_information$CHROM values must be encoded in the same way).} -\item{distance}{The distance threshold to be used to identify cis SNPs. Default is 1 Mb.} +\item{distance}{The distance threshold in basepairs to be used to identify +cis SNPs. Default is 1 Mb.} } \value{ -A VMR_df object (a data frame compatible with GRanges conversion) with the following new columns: +The same VML data frame (a data frame compatible with GRanges +conversion) with the following new columns: \itemize{ -\item The cis SNPs identified for each VMR, the number of SNPs surrounding each VMR in the specified window -\item VMR_index, which is created if not already existing based on the rownames of the VMR_df. +\item The cis SNPs identified for each VML and the number of SNPs surrounding +each VML in the specified window } } \description{ -Identification of genotyped Single Nucleotide Polymorphisms (SNPs) close to each VMR using a distance threshold. +Identification of genotyped Single Nucleotide Polymorphisms (SNPs) close to +each VML using a distance threshold. } \details{ -\strong{Important}: please make sure that the positions of the VMR data frame and the ones in the genotype information are from the same genome build. +\strong{Important}: please make sure that the positions of the VML data frame and +the ones in the genotype information are from the same genome build. 
+} +\examples{ +## Find VML in test data +VML <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ) +## Find cis SNPs around VML +VML_with_cis_snps <- RAMEN::findCisSNPs( + VML_df = VML$VML, + genotype_information = RAMEN::test_genotype_information, + distance = 1e6 + ) + } diff --git a/man/findVML.Rd b/man/findVML.Rd new file mode 100644 index 0000000..41e2ad5 --- /dev/null +++ b/man/findVML.Rd @@ -0,0 +1,71 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/findVML.R +\name{findVML} +\alias{findVML} +\title{Identify Variable Methylated Loci in microarrays} +\usage{ +findVML( + methylation_data, + array_manifest, + cor_threshold = 0.15, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 +) +} +\arguments{ +\item{methylation_data}{A data frame containing M or B values, with samples as columns and probes as rows. Data is expected to have already passed through quality control and cleaning steps.} + +\item{array_manifest}{Information about the probes on the array in a format compatible with the Bioconductor annotation packages. The user can specify one of the supported human microarrays ("IlluminaHumanMethylation450k" with the hg19 genome build, "IlluminaHumanMethylationEPICv1" with the hg19 genome build, or "IlluminaHumanMethylationEPICv2" with the hg38 genome build), or provide a manifest. 
The manifest requires the probe names as row names, and the following columns: "chr" (chromosome); "pos" (genomic location of the probe in the genome); and "strand" (this is very important to set up, since the VMRs will only be created based on CpGs on the same strand; if the positions are reported based on a single DNA strand, this should be a vector containing only "+", "-" or "*" for all of the probes).} + +\item{cor_threshold}{Numeric value (0-1) to be used as the median Pearson correlation threshold for identifying VMRs (i.e. +all VMRs will have a median pairwise probe correlation higher than this threshold).} + +\item{var_method}{A string indicating the metric to use to represent variability in the data set. The options are "mad" (median absolute deviation) +or "variance".} + +\item{var_distribution}{A string indicating which probes in the data set should be used to create a variability distribution; the threshold to identify Highly Variable Probes (determined also with the var_threshold_percentile argument) is established based on this distribution. Option 1 is "ultrastable" (a subset of CpGs that are stably methylated/unmethylated across human tissues and developmental states described by \href{https://doi.org/10.1186/1756-8935-7-28}{Edgar R., et al.} in 2014). This option is recommended, especially if you want to compare different populations or tissues, as the threshold value should be comparable. Alternatively, the user can choose option 2: "all" (all probes in the data set). The "ultrastable" option is only compatible with Illumina human microarrays. The default is "ultrastable".} + +\item{var_threshold_percentile}{The percentile (0-1) to be used as cutoff to define Highly Variable Probes (which are then grouped into VML). If using the variability of the "ultrastable" probes, we recommend a high threshold (default is 0.99), since these probes are expected to display a very low variation in human tissues. 
If using the variability of "all" probes, we recommend using a percentile of 0.9 since it captures the top 10\% most variable probes, a cutoff that has been traditionally used in studies. It is important to note that the top 10\% most variable probes will capture the same amount of probes in a data set regardless of their overall variability levels, which might differ between tissues or populations.} + +\item{max_distance}{Maximum distance in base pairs allowed for two probes to be grouped into a region. The default is 1000.} +} +\value{ +A list with the following elements: +\itemize{ +\item $var_score_threshold: threshold used to define Highly Variable Probes (mad or variance, depending on the specified choice). +\item $highly_variable_probes: a data frame with the probes that passed the variability score threshold imposed by the user, and their variability score (MAD score or variance). +\item $VML: a GRanges-like data frame with VMRs (regions composed of two or more contiguous, correlated and proximal Highly Variable Probes), and sVMPs (highly variable probes without neighboring CpGs measured in \emph{max_distance} on the array). +} +} +\description{ +Identifies Highly Variable Probes (HVP) and groups them into Variable Methylated Loci (VML) given an Illumina manifest. The output of this function provides the HVPs, and the identified VML, which are made of Variable Methylated Regions and sparse Variable Methylated Probes. See Details below for more information. +} +\details{ +This function identifies HVPs based on MAD scores or variance, and groups them into VML, which are defined as genomic regions with high DNA methylation variability. To best capture methylome variability patterns in microarrays, we identify two types of VML: Variably Methylated Regions (VMRs) and sparse Variably Methylated Probes (sVMPs). + +On the one hand, we define VMRs as two or more proximal highly variable probes (default: < 1kb apart) with correlated DNAme level (default: r > 0.15). 
Modelling DNAme variability through regions rather than individual CpGs provides several methodological advantages in association studies, since CpGs display a significant correlation for co-methylation when they are close (less than or equal to 1 kilobase). + +In addition to traditional VMRs, we also identify sparse Variably Methylated Probes (sVMPs), a second type of VML that takes into account the sparse and non-uniformly distributed coverage of CpGs in microarrays to tailor our analysis to this DNAme platform. sVMPs aim to retain genomic regions with high DNAme variability measured by single probes, where probe grouping based on proximity and correlation is therefore not applicable. This is particularly relevant in the Illumina EPIC v1 array, where most covered regulatory regions (up to 93\%) are represented by just one probe. Notably, based on empirical comparisons with whole-genome bisulfite sequencing data, these single probes are mostly representative of local regional DNAme levels due to their positioning (98.5-99.5\%). + +This function uses GenomicRanges::reduce() to group the regions, which is strand-sensitive. In the Illumina microarrays, the MAPINFO for all the probes is usually provided for the + strand. If you are using this array, we recommend first converting the strand of all the probes to "+". + +This function supports parallel computing for increased speed. To do so, you have to set the parallel backend +in your R session BEFORE running the function (e.g., \emph{doParallel::registerDoParallel(4)}). After that, the function can be run as usual. 
When working with big datasets, the parallel backend might throw an error if you exceed the maximum allowed size of globals exported for future expression. This can be fixed by increasing the allowed size (e.g. running \emph{options(future.globals.maxSize= +Inf)}) + +Note: this function does not exclude sex chromosomes. If you want to exclude them, you can do so in the methylation_data object before running the function. +} +\examples{ + +VML <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0.15, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 +) + +} diff --git a/man/findVMRs.Rd b/man/findVMRs.Rd deleted file mode 100644 index f3312ff..0000000 --- a/man/findVMRs.Rd +++ /dev/null @@ -1,78 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/findVMRs.R -\name{findVMRs} -\alias{findVMRs} -\title{Identify Variable Methylated Regions in microarrays} -\usage{ -findVMRs( - array_manifest, - methylation_data, - cor_threshold = 0.15, - var_method = "variance", - var_threshold_percentile = 0.9, - max_distance = 1000 -) -} -\arguments{ -\item{array_manifest}{Information about the probes on the array. Requires the columns MAPINFO (basepair position -of the probe in the genome), CHR (chromosome), TargetID (probe name) and STRAND (this is very important to set up, since -the VMRs will only be created based on CpGs on the same strand; if the positions are reported based on a single DNA strand, this should contain either a vector of only "+", "-" or "*" for all of the probes).} - -\item{methylation_data}{A data frame containing M or B values, with samples as columns and probes as rows. 
Data is expected to have already passed through quality control and cleaning steps.} - -\item{cor_threshold}{Numeric value (0-1) to be used as the median pearson correlation threshold for identifying VMRs (i.e. -all VMRs will have a median pairwise probe correlation of this parameter).} - -\item{var_method}{Method to use to measure variability in the data set. The options are "mad" (median absolute deviation) -or "variance".} - -\item{var_threshold_percentile}{The percentile (0-1) to be used as cutoff to define Highly Variable Probes (and -therefore VMRs). The default is 0.9 because this percentile has been traditionally used in previous studies.} - -\item{max_distance}{Maximum distance allowed for two probes to be grouped into a region. The default is 1000 -because this window has been traditionally used in previous studies.} -} -\value{ -A list with the following elements: -\itemize{ -\item $var_score_threshold: threshold used to define Highly Variable Probes (mad or variance, depending on the specified choice). -\item $highly_variable_probes: a data frame with the probes that passed the variability score threshold imposed by the user, and their variability score (MAD score or variance). -\item $canonical_VMRs: a GRanges object with strict candidate VMRs - regions composed of two or more -contiguous, correlated and proximal Highly Variable Probes; thresholds depend on the ones specified -by the user) -\item $non_canonical_VMRs: a GRanges object with highly variable probes without neighboring -CpGs measured in \emph{max_distance} on the array. Category created to take into acccount the Illumina array design of single probes capturing the methylation state of regulatory regions. -} -} -\description{ -Identifies autosomal Highly Variable Probes (HVP) and merges them into Variable Methylated Regions (VMRs) given an Illumina manifest. 
-} -\details{ -This function identifies HVPs using MAD scores or variance metrics, and groups them into VMRs, which are defined as clusters of proximal and correlated HVPs (distance and correlation defined by the user). Output VMRs can be separated into canonical and non canonical. Canonical VMRs are regions that meet the correlation and closeness criteria. For guidance on which correlation threshold to use, we recommend checking the Supplementary Figure 1 of the CoMeBack R package (Gatev \emph{et al.}, 2020) where a simulation to empirically determine a default guidance specification for a correlation threshold parameter dependent on sample size is done. As default, we use a threshold of 0.15 as per the CoMeBack authors minimum threshold suggestion. On the other hand, non canonical VMRs are regions that are composed of HVPs that have no nearby probes measured in the array (according to the max_distance parameter); this category was created to account for the Illumina EPIC array design, which has a high number of probes in regulatory regions that are represented by a single probe. Furthermore, these probes have been shown to be good representatives of the methylation state of its surroundings (Pidsley et al., 2016). By creating this category, we recover those informative HVPs that otherwise would be excluded from the analysis because of the array design. - -This function uses GenomicRanges::reduce() to group the regions, which is strand-sensitive. In the Illumina microarrays, the MAPINFO for all the probes -is usually provided as for the + strand. If you are using this array, we recommend to first -convert the strand of all the probes to "+". - -This function supports parallel computing for increased speed. To do so, you have to set the parallel backend -in your R session BEFORE running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, -the function can be run as usual. 
When working with big datasets, the parallel backend might throw an error if you exceed -the maximum allowed size of globals exported for future expression. This can be fixed by increasing the allowed size (e.g. running options(future.globals.maxSize= +Inf) ) - -Note: this function excludes sex chromosomes. -} -\examples{ -#We need to modify the RAMEN::test_array_manifest object by assigning to -#row names to the probe ID column; it was saved this way because storing -#the TargetID as row names reduced significantly the size of the data set. -test_array_manifest_final = RAMEN::test_array_manifest \%>\% -tibble::rownames_to_column(var = "TargetID") - -VMRs = RAMEN::findVMRs(array_manifest = test_array_manifest_final, - methylation_data = RAMEN::test_methylation_data, - cor_threshold = 0, - var_method = "variance", - var_threshold_percentile = 0.9, - max_distance = 1000) - -} diff --git a/man/lmGE.Rd b/man/lmGE.Rd index a242e84..c3bbed1 100644 --- a/man/lmGE.Rd +++ b/man/lmGE.Rd @@ -6,7 +6,7 @@ \usage{ lmGE( selected_variables, - summarized_methyl_VMR, + summarized_methyl_VML, genotype_matrix, environmental_matrix, covariates = NULL, @@ -14,9 +14,9 @@ lmGE( ) } \arguments{ -\item{selected_variables}{A data frame obtained with \emph{RAMEN::selectVariables()}. This data frame must contain three columns: 'VMR_index' with characters of an unique ID of each VMR; ´selected_genot' and 'selected_env' with the SNPs and environmental variables, respectively, that will be used for fitting the genotype (G), environment (E), additive (G + E) or interaction (G x E) models. The columns 'selected_env' and 'selected_genot' must contain lists as elements; VMRs with no environmental or genotype selected variables must contain an empty list with NULL, NA , character(0) or "" inside.} +\item{selected_variables}{A data frame obtained with \emph{RAMEN::selectVariables()}. 
This data frame must contain three columns: 'VML_index' with a unique character ID for each VML; 'selected_genot' and 'selected_env' with the SNPs and environmental variables, respectively, that will be used for fitting the genotype (G), environment (E), additive (G + E) or interaction (G x E) models. The columns 'selected_env' and 'selected_genot' must contain lists as elements; VML with no environmental or genotype selected variables must contain an empty list with NULL, NA, character(0) or "" inside.} -\item{summarized_methyl_VMR}{A data frame containing each individual's VMR summarized region methylation. It is suggested to use the output of RAMEN::summarizeVMRs().Rows must reflects individuals, and columns VMRs The names of the columns must correspond to the index of said VMR, and it must match the index of VMRs_df$VMR_index. The names of the rows must correspond to the sample IDs, and must match with the IDs of the other matrices.} +\item{summarized_methyl_VML}{A data frame containing each individual's VML summarized methylation. It is suggested to use the output of RAMEN::summarizeVML(). Rows must correspond to individuals, and columns to VML. The names of the columns must correspond to the index of said VML, and it must match the index of VML_df$VML_index. The names of the rows must correspond to the sample IDs, and must match with the IDs of the other matrices.} \item{genotype_matrix}{A matrix of number-encoded genotypes. Columns must correspond to samples, and rows to SNPs. We suggest using a gene-dosage model, which would encode the SNPs ordinally depending on the genotype allele charge, such as 2 (AA), 1 (AB) and 0 (BB). The column names must correspond with individual IDs.} @@ -24,19 +24,19 @@ lmGE( \item{covariates}{A matrix containing the covariates (i.e., concomitant variables / variables that are not the ones you are interested in) that will be adjusted for in the final GxE models (e.g., cell type proportions, age, etc.).
Each column should correspond to a covariate and each row to an individual. Row names must correspond to the individual IDs.} -\item{model_selection}{Which metric to use to select the best model for each VMR. Supported options are "AIC" or BIC". More information about which one to use can be found in the Details section.} +\item{model_selection}{Which metric to use to select the best model for each VML. Supported options are "AIC" or BIC". More information about which one to use can be found in the Details section.} } \value{ A data frame with the following columns: \itemize{ -\item VMR_index: The unique ID of the VMR +\item VML_index: The unique ID of the VML \item model_group: The group to which the winning model belongs to (i.e., G, E, G+E or GxE) \item variables: The variable(s) that are present in the winning model (excluding the covariates, which are included in all the models) \item tot_r_squared: R squared of the winning model \item g_r_squared: Estimated R2 allocated to the G in the winning model, if applicable. \item e_r_squared: Estimated R2 allocated to the E in the winning model, if applicable. \item gxe_r_squared: Estimated R2 allocated to the interaction in the winning model (GxE), if applicable. -\item AIC/BIC: AIC or BIC metric from the best model in each VMR (depending on the option specified in the argument model_selection). +\item AIC/BIC: AIC or BIC metric from the best model in each VML (depending on the option specified in the argument model_selection). \item second_winner: The second group that possesses the next best model after the winning one (i.e., G, E, G+E or GxE). This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other model groups to compare to. 
\item delta_aic/delta_bic: The difference of AIC or BIC value (depending on the option specified in the argument model_selection) of the winning model and the best model from the second_winner group (i.e., G, E, G+E or GxE). This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other groups to compare to. \item delta_r_squared: The R2 of the winning model - R2 of the second winner model. This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other groups to compare to. @@ -45,14 +45,14 @@ A data frame with the following columns: } } \description{ -For a set of Variable Methylated Region (VMR), this function fits a set of genotype (G), environment (E), pairwise additive (G + E) or pairwise interaction (G x E) models, one variable at a time, and selects the best fitting one. Additional information for each winning model is provided, such as its R2, its R2 increase comparing it to a basal model (i.e., a model only fitted with the concomitant variables), the delta AIC/BIC to the next best model from a different category, and the explained variance decomposed for the G, E and GxE components (when applicable). +For a set of Variable Methylated Loci (VML), this function fits a set of genotype (G), environment (E), pairwise additive (G + E) or pairwise interaction (G x E) models, one variable at a time, and selects the best fitting one. Additional information for each winning model is provided, such as its R2, its R2 increase comparing it to a basal model (i.e., a model only fitted with the concomitant variables), the delta AIC/BIC to the next best model from a different category, and the explained variance decomposed for the G, E and GxE components (when applicable). If a VML has no variables selected in the selected_variables object, it will be returned with "B" (basal) as the best model (interpreted as no G or E associated effect). 
} \details{ This function supports parallel computing for increased speed. To do so, you have to set the parallel backend -in your R session before running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, +in your R session before running the function (e.g., \emph{doParallel::registerDoParallel(4)}). After that, the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). -For each VMR, this function computes a set of models using the variables indicated in the selected_variables object. From the indicated G and E variables, lmGE() fits four groups of models: +For each VML, this function computes a set of models using the variables indicated in the selected_variables object. From the indicated G and E variables, lmGE() fits four groups of models: \itemize{ \item G: Genetics model - fitted one SNP at a time. \item E: Environmental model - fitted one environmental variable at a time. @@ -60,11 +60,11 @@ For each VMR, this function computes a set of models using the variables indicat \item GxE: Interaction model - fitted for each pairwise combination of G and E variables indicated in selected_variables. } -These models are fit only if the VMR has G or E variables in the selected_variables object. If a VMR does not have neither G nor E variables, that VMR will be ignored and will not be returned in the output object. +These models are fit only if the VML has G or E variables in the selected_variables object. If a VML has neither G nor E variables, no G or E models are fit for it, and that VML will be returned in the output object with "B" (baseline) as the best explanatory model. \strong{Model selection} -Following the model fitting stage, the best model \strong{per group} is selected using Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).
Both of these metrics are statistical approaches to select the best model in the same data set, and they have strengths and limitations that make them excel in different situations. We recommend using AIC because BIC assumes that the true model is in the set of compared models. Since this function fits models with individual variables, and we assume that DNAme variability is more likely to be influenced by more than one single SNP/environmental exposure at a time, we hypothesize that in most cases, the true model will not be in the set of compared models. Also, AIC excels in situations where all models in the model space are "incorrect", and AIC is preferentially used in cases where the true underlying function is unknown and our selected model could belong to a very large class of functions where the relationship could be pretty complex. It is worth mentioning however that, both metrics tend to pick the same model in a large number of scenarios. We suggest the users to read Arijit Chakrabarti & Jayanta K. Ghosh, 2011 for further information about the difference between these metrics. +Following the model fitting stage, the best model \strong{per group} is selected using Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). Both of these metrics are statistical approaches to select the best model in the same data set, and they have strengths and limitations that make them excel in different situations. We recommend using AIC because BIC assumes that the true model is in the set of compared models. Since this function fits models with individual variables, and we assume that DNAme variability is more likely to be influenced by more than one single SNP/environmental exposure at a time, we hypothesize that in most cases, the true model will not be in the set of compared models. 
Also, AIC excels in situations where all models in the model space are "incomplete", and AIC is preferentially used in cases where the true underlying function is unknown and our selected model could belong to a very large class of functions where the relationship could be pretty complex. It is worth mentioning, however, that both metrics tend to pick the same model in a large number of scenarios. We suggest that users read Arijit Chakrabarti & Jayanta K. Ghosh, 2011 for further information about the difference between these metrics. After selecting the best model per group (G,E,G+E pr GxE), the model with the lowest AIC or BIC is declared as the winning model. The delta AIC/BIC and difference of R2 is computed relative to the model with the second lowest AIC/BIC (i.e., the best model from a different group to the winning one), and reported in the final object. @@ -72,3 +72,48 @@ After selecting the best model per group (G,E,G+E pr GxE), the model with the lo Finally, the variance is decomposed and the relative R2 contribution of each of the variables of interest (G, E and GxE) is reported. This decomposition is done using the relaimpo R package, using the Lindeman, Merenda and Gold (lmg) method, which is based on the heuristic approach of averaging the relative R contribution of each variable over all input orders in the linear model. The estimation of the partitioned R2 of each factor in the models was conducted keeping the covariates always in the model as first entry (i.e., the variables specified in covariates did not change order). For further information, we suggest the users to read the documentation and publication of the relaimpo R package (Grömping, 2006).
} +\examples{ +## Find VML in test data +VML <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ) +## Find cis SNPs around VML +VML_with_cis_snps <- RAMEN::findCisSNPs( + VML_df = VML$VML, + genotype_information = RAMEN::test_genotype_information, + distance = 1e6 + ) + +## Summarize methylation levels in VML +summarized_methyl_VML <- RAMEN::summarizeVML( + methylation_data = RAMEN::test_methylation_data, + VML_df = VML_with_cis_snps + ) + + ## Select relevant genotype and environmental variables + selected_vars <- RAMEN::selectVariables( + VML_df = VML_with_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML, + seed = 1 + ) + +## Fit G, E, G+E and GxE models and select the winning one +lmge_res <- RAMEN::lmGE( + selected_variables = selected_vars, + summarized_methyl_VML = summarized_methyl_VML, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + model_selection = "AIC" + ) + +} diff --git a/man/map_revmap_names.Rd b/man/map_revmap_names.Rd index f3f063e..1109bc3 100644 --- a/man/map_revmap_names.Rd +++ b/man/map_revmap_names.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/findVMRs.R +% Please edit documentation in R/findVML.R \name{map_revmap_names} \alias{map_revmap_names} \title{Map revmap column to probe names after reducing a GenomicRanges object} @@ -9,7 +9,7 @@ map_revmap_names(positions, manifest_hvp) \arguments{ \item{positions}{A revmap row in the form of a vector} -\item{manifest_hvp}{the manifest of the highly variable probes used in the 
findVMRs() function +\item{manifest_hvp}{the manifest of the highly variable probes used in the findVML() function with the probes as row names} } \value{ @@ -17,6 +17,14 @@ a vector with the names of the probes that conform one reduced region } \description{ Given a revmap row (e.g. 1 5 6), we map those positions to their corresponding probe names -(and end up with something like "cg00000029", "cg00000158", "cg00000165".This is a helper function -of findVMRs()). +(and end up with something like "cg00000029", "cg00000158", "cg00000165"). This is a helper function of findVML(). +} +\examples{ +\dontrun{ + target = data.frame(row.names = c("a", "b", "c", "d"), values = c(1,1,1,1)) + query = c(2,1) + + map_revmap_names(positions = query, manifest_hvp = target) + ## Expected output: c("b", "a") +} } diff --git a/man/medCorVMR.Rd b/man/medCorVMR.Rd index 034d0ad..9896a64 100644 --- a/man/medCorVMR.Rd +++ b/man/medCorVMR.Rd @@ -23,6 +23,20 @@ its median pairwise probe correlation. } \details{ This function supports parallel computing for increased speed. To do so, you have to set the parallel backend -in your R session before running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, -the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). +in your R session before running the function (e.g., \emph{doParallel::registerDoParallel(4)}). After that, the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf).
+} +\examples{ + +#Create a VML data.frame +VMR_df <- data.frame(seqnames = c("chr21", "chr21"), + start = c(10861376, 10862171), + end = c(10862507, 10883548), + probes = I(list(c("cg15043638", "cg18287590", "cg17975851"), + c("cg13893907", "cg17035109", "cg06187584")))) + +# Compute median correlation for each VMR +medCorVMR(VMR_df = VMR_df, methylation_data = RAMEN::test_methylation_data) + + + } diff --git a/man/nullDistGE.Rd b/man/nullDistGE.Rd index 357f345..14a2c0d 100644 --- a/man/nullDistGE.Rd +++ b/man/nullDistGE.Rd @@ -5,10 +5,10 @@ \title{Simulate a delta R squared null distribution of G and E effects on DNAme variability} \usage{ nullDistGE( - VMRs_df, + VML_df, genotype_matrix, environmental_matrix, - summarized_methyl_VMR, + summarized_methyl_VML, permutations = 10, covariates = NULL, seed = NULL, @@ -16,13 +16,13 @@ nullDistGE( ) } \arguments{ -\item{VMRs_df}{A data frame converted from a GRanges object. Recommended to use the output of \emph{RAMEN::findCisSNPs()}. Must have one VMR per row, and contain the following columns: "VMR_index" (a unique ID for each VMR in VMRs_df AS CHARACTERS) and "SNP" (a column with a list as observation, containing the name of the SNPs surrounding the corresponding VMR). The SNPs contained in the "SNP" column must be present in the object that is indicated in the genotype_matrix argument, and it must contain all the VMRs contained in summarized_methyl_VMR. VMRs with no surrounding SNPs must have an empty list in the SNP column (either list(NULL), list(NA), list("") or list(character(0)) ).} +\item{VML_df}{A data frame converted from a GRanges object. Recommended to use the output of \emph{RAMEN::findCisSNPs()}. Must have one VML per row, and contain the following columns: "VML_index" (a unique ID for each VML in VML_df AS CHARACTERS) and "SNP" (a column with a list as observation, containing the name of the SNPs surrounding the corresponding VML). 
The SNPs contained in the "SNP" column must be present in the object that is indicated in the genotype_matrix argument, and it must contain all the VML contained in summarized_methyl_VML. VML with no surrounding SNPs must have an empty list in the SNP column (either list(NULL), list(NA), list("") or list(character(0)) ).} \item{genotype_matrix}{A matrix of number-encoded genotypes. Columns must correspond to samples, and rows to SNPs. We suggest using a gene-dosage model, which would encode the SNPs ordinally depending on the genotype allele charge, such as 2 (AA), 1 (AB) and 0 (BB). The column names must correspond with individual IDs.} \item{environmental_matrix}{A matrix of environmental variables. Only numeric values are supported. In case of factor variables, it is recommended to encode them as numbers or re-code them into dummy variables if there are more than two levels. Columns must correspond to environmental variables and rows to individuals. Row names must be the individual IDs.} -\item{summarized_methyl_VMR}{A data frame containing each individual's VMR summarized region methylation. It is suggested to use the output of RAMEN::summarizeVMRs().Rows must reflects individuals, and columns VMRs The names of the columns must correspond to the index of said VMR, and it must match the index of VMRs_df$VMR_index. The names of the rows must correspond to the sample IDs, and must match with the IDs of the other matrices.} +\item{summarized_methyl_VML}{A data frame containing each individual's VML summarized methylation. It is suggested to use the output of RAMEN::summarizeVML(). Rows must correspond to individuals, and columns to VML. The names of the columns must correspond to the index of said VML, and it must match the index of VML_df$VML_index.
The names of the rows must correspond to the sample IDs, and must match with the IDs of the other matrices.} \item{permutations}{description} @@ -30,12 +30,12 @@ nullDistGE( \item{seed}{An integer number that initializes a pseudo-random number generator. Random numbers in this function are created during the lambda cross validation and the LASSO stages. Setting a seed is highly encouraged for result reproducibility. \strong{Please note that setting a seed in this function modifies the seed globally}.} -\item{model_selection}{Which metric to use to select the best model for each VMR. Supported options are "AIC" or BIC". More information about which one to use can be found in the Details section.} +\item{model_selection}{Which metric to use to select the best model for each VML. Supported options are "AIC" or BIC". More information about which one to use can be found in the Details section.} } \value{ A data frame with the following columns: \itemize{ -\item VMR_index: The unique ID of the VMR. +\item VML_index: The unique ID of the VML. \item model_group: The group to which the winning model belongs to (i.e., G, E, G+E or GxE) \item tot_r_squared: R squared of the winning model \item R2_difference: the increase in R squared obtained by including the G/E variable(s) from the winning model (i.e., the R squared difference between the winning model and the model only with the concomitant variables specified in \emph{covariates}; tot_r_squared - basal_rsquared in the lmGE output) @@ -46,7 +46,43 @@ A data frame with the following columns: This function simulates the delta R squared distribution under the null hypothesis of G and E having no association with DNA methylation (DNAme) variability through a permutation analysis. 
To do so, this function shuffles the G and E variables in the dataset, which is followed by a the variable selection and modelling steps with \emph{selectVariables()} and \emph{lmGE()}.These steps are repeated several times as indicated in the \emph{permutations} parameter. By using shuffled G and E data, we simulate the increase of R2 that would be observed in random data using the RAMEN methodology. } \details{ -The core pipeline from the RAMEN package identifies the best explanatory model per VMR. However, despite these models being winners in comparison to models including any other G/E variable(s) in the dataset, some winning models might perform no better than what we would expect by chance. Therefore, the goal of this function is to create a distribution of increase in R2 under the null hypothesis of G and E having no associations with DNAme. The null distribution is obtained through shuffling the G and E variables in a given dataset and conducting the variable selection and G/E model selection. That way, we can simulate how much additional variance would be explained by the models defined as winners by the RAMEN methodology in a scenario where the G and E associations with DNAme are randomized. This distribution can be then used to filter out winning models in the non-shuffled dataset that do not add more to the explained variance of the basal model than what randomized data do. +The core pipeline from the RAMEN package identifies the best explanatory model per VML. However, despite these models being winners in comparison to models including any other G/E variable(s) in the dataset, some winning models might perform no better than what we would expect by chance. Therefore, the goal of this function is to create a distribution of increase in R2 under the null hypothesis of G and E having no associations with DNAme. 
The null distribution is obtained through shuffling the G and E variables in a given dataset and conducting the variable selection and G/E model selection. That way, we can simulate how much additional variance would be explained by the models defined as winners by the RAMEN methodology in a scenario where the G and E associations with DNAme are randomized. This distribution can be then used to filter out winning models in the non-shuffled dataset that do not add more to the explained variance of the basal model than what randomized data do. -Under the assumption that after adjusting for the concomitant variables all VMRs across the genome follow the same behavior regarding an increment of explained variance with randomized G and E data, we can pool the delta R squared values from all VMRs to create a null distribution taking advantage of the high number of VMRs in the dataset. This assumption decreases significantly the number of permutations required to create a null distribution and reduces the computational time. For further information please read the RAMEN paper (in preparation). +Under the assumption that after adjusting for the concomitant variables all VML across the genome follow the same behavior regarding an increment of explained variance with randomized G and E data, we can pool the delta R squared values from all VML to create a null distribution taking advantage of the high number of VML in the dataset. This assumption decreases significantly the number of permutations required to create a null distribution and reduces the computational time. For further information please read the RAMEN paper (in preparation). 
+} +\examples{ +## Find VML in test data +VML <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ) +## Find cis SNPs around VML +VML_with_cis_snps <- RAMEN::findCisSNPs( + VML_df = VML$VML, + genotype_information = RAMEN::test_genotype_information, + distance = 1e6 + ) + +## Summarize methylation levels in VML +summarized_methyl_VML <- RAMEN::summarizeVML( + methylation_data = RAMEN::test_methylation_data, + VML_df = VML_with_cis_snps + ) + +## Simulate null distribution of G and E contributions on DNAme variability +null_dist <- RAMEN::nullDistGE( + VML_df = VML_with_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + summarized_methyl_VML = summarized_methyl_VML, + permutations = 5, + covariates = RAMEN::test_covariates, + seed = 1, + model_selection = "AIC" + ) } diff --git a/man/selectVariables.Rd b/man/selectVariables.Rd index 45450c3..12e0626 100644 --- a/man/selectVariables.Rd +++ b/man/selectVariables.Rd @@ -2,19 +2,19 @@ % Please edit documentation in R/selectVariables.R \name{selectVariables} \alias{selectVariables} -\title{Selection of environment and genotype variables for Variable Methylated Regions (VMRs)} +\title{Selection of relevant environment and genotype variables associated with Variably Methylated Loci (VML)} \usage{ selectVariables( - VMRs_df, + VML_df, genotype_matrix, environmental_matrix, covariates = NULL, - summarized_methyl_VMR, + summarized_methyl_VML, seed = NULL ) } \arguments{ -\item{VMRs_df}{A data frame converted from a GRanges object. Recommended to use the output of \emph{RAMEN::findCisSNPs()}. 
Must have one VMR per row, and contain the following columns: "VMR_index" (a unique ID for each VMR in VMRs_df AS CHARACTERS) and "SNP" (a column with a list as observation, containing the name of the SNPs surrounding the corresponding VMR). The SNPs contained in the "SNP" column must be present in the object that is indicated in the genotype_matrix argument, and it must contain all the VMRs contained in summarized_methyl_VMR. VMRs with no surrounding SNPs must have an empty list in the SNP column (either list(NULL), list(NA), list("") or list(character(0)) ).} +\item{VML_df}{A data frame converted from a GRanges object. Recommended to use the output of \emph{RAMEN::findCisSNPs()}. Must have one VML per row, and contain the following columns: "VML_index" (a unique ID for each VML in VML_df AS CHARACTERS) and "SNP" (a column with a list as observation, containing the name of the SNPs surrounding the corresponding VML). The SNPs contained in the "SNP" column must be present in the object that is indicated in the genotype_matrix argument, and it must contain all the VML contained in summarized_methyl_VML. VML with no surrounding SNPs must have an empty list in the SNP column (either list(NULL), list(NA), list("") or list(character(0)) ).} \item{genotype_matrix}{A matrix of number-encoded genotypes. Columns must correspond to samples, and rows to SNPs. We suggest using a gene-dosage model, which would encode the SNPs ordinally depending on the genotype allele charge, such as 2 (AA), 1 (AB) and 0 (BB). The column names must correspond with individual IDs.} @@ -22,28 +22,63 @@ selectVariables( \item{covariates}{A matrix containing the covariates (i.e., concomitant variables / variables that are not the ones you are interested in) that will be adjusted for in the final GxE models (e.g., cell type proportions, age, etc.). Each column should correspond to a covariate and each row to an individual. 
Row names must correspond to the individual IDs.} -\item{summarized_methyl_VMR}{A data frame containing each individual's VMR summarized region methylation. It is suggested to use the output of RAMEN::summarizeVMRs().Rows must reflects individuals, and columns VMRs The names of the columns must correspond to the index of said VMR, and it must match the index of VMRs_df$VMR_index. The names of the rows must correspond to the sample IDs, and must match with the IDs of the other matrices.} +\item{summarized_methyl_VML}{A data frame containing each individual's VML summarized methylation. It is suggested to use the output of RAMEN::summarizeVML(). Rows must reflect individuals, and columns VML. The names of the columns must correspond to the index of said VML, and it must match the index of VML_df$VML_index. The names of the rows must correspond to the sample IDs, and must match with the IDs of the other matrices.} \item{seed}{An integer number that initializes a pseudo-random number generator. Random numbers in this function are created during the lambda cross validation and the LASSO stages. Setting a seed is highly encouraged for result reproducibility. \strong{Please note that setting a seed in this function modifies the seed globally}.} } \value{ A data frame with three columns: \itemize{ -\item VMR_index: Unique VMR ID. -\item selected_genot: List-containing column with the selected SNPs. -\item selected_env: List-containing column with the selected environmental variables. +\item VML_index: Unique VML ID. +\item selected_genot: Column containing lists as values with the selected SNPs. +\item selected_env: Column containing lists as values with the selected environmental variables. } } \description{ -For each VMR, this function selects genotype and environmental variables using LASSO. +For each VML, this function selects potentially relevant genotype and environmental variables associated with DNA methylation levels of said VML using LASSO.
See details below for more information. } \details{ -This function supports parallel computing for increased speed. To do so, you have to set the parallel back-end -in your R session before running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). Please make sure that your data has no NAs, since the LASSO implementation we use in RAMEN does not support missing values. - -selectVariables() uses LASSO, which is an embedded variable selection method that penalizes models that are more complex (i.e., that contain more variables) in favor of simpler models (i.e. that contain less variables), but not at the expense of reducing predictive power. Using LASSO's variable screening property (with high probability, the LASSO estimated model includes the substantial covariates and drops the redundant ones) this function selects genotype and environment variables with potential relevance in the Variable Methylated Region (VMR) dataset (see also Bühlmann and van de Geer, 2011). For each VMR, LASSO is run three times: 1) including only the genotype variables for the selection step, 2) including only the environmental variables for the selection step, and 3) Including both the genotype and environmental variables in the selection step. This is done to ensure that the function captures the variables that are relevant within their own category (e.g., SNPs that are strongly associated with the DNAme levels of a VMR in the presence of the rest of the SNPs) or in the presence of the variables of the other category (e.g. SNPs that are strongly associated with the DNAme levels of a VMR in the presence of the rest of BOTH the SNPs AND environmental variables). 
Every time LASSO is run, the basal covariates (i.e., concomitant variables )indicated in the argument \emph{covariates} are not penalized (i.e., those variables are always included in the models and their coefficients are not subjected to shrinkage). That way, only the most promising E and G variables in the presence of the concomitant variables will be selected. +selectVariables() uses LASSO, which is an embedded variable selection method that penalizes models that are more complex (i.e., that contain more variables) in favor of simpler models (i.e., that contain fewer variables), but not at the expense of reducing predictive power. Using LASSO's variable screening property (with high probability, the LASSO estimated model includes the substantial covariates and drops the redundant ones) this function selects genotype and environment variables with potential relevance in the Variably Methylated Loci (VML) dataset (see also Bühlmann and van de Geer, 2011). For each VML, LASSO is run three times: 1) including only the genotype variables for the selection step, 2) including only the environmental variables for the selection step, and 3) including both the genotype and environmental variables in the selection step. This is done to ensure that the function captures the variables that are relevant within their own category (e.g., SNPs that are strongly associated with the DNAme levels of a VML in the presence of the rest of the SNPs) or in the presence of the variables of the other category (e.g., SNPs that are strongly associated with the DNAme levels of a VML in the presence of the rest of BOTH the SNPs AND environmental variables). Every time LASSO is run, the basal covariates (i.e., concomitant variables) indicated in the argument \emph{covariates} are not penalized (i.e., those variables are always included in the models and their coefficients are not subjected to shrinkage).
That way, only the most promising E and G variables in the presence of the concomitant variables will be selected. Each LASSO model uses a tuned lambda that minimizes the 5-fold cross-validation error within its corresponding data. This function uses the lambda.min value in contrast to lambda.1se because its goal within the RAMEN package is to use LASSO to reduce the number of variables that are going to be used next for fitting pairwise interaction models in \emph{lmGE()}. Since at this step variables are being selected based only on main effects, it is preferable to cast a "wider net" and select a slightly higher number of variables that could potentially have a strong interaction effect when paired with another variable. Furthermore, since in this case LASSO is being used as a screening procedure to select variables that will be fit separately in independent models and compared, the overfitting issue of using lambda.min does not impose a big concern. After finding the best lambda value, the sequence of models is fit by coordinate descent using \emph{glmnet()}. Random numbers in this function are created during the lambda cross validation and the LASSO stages. Setting a seed is highly encouraged for result reproducibility using the \emph{seed} argument. Please note that setting a seed inside of this function modifies the seed globally (which is R's default behavior). +This function supports parallel computing for increased speed. To do so, you have to set the parallel back-end +in your R session before running the function (e.g., \emph{doParallel::registerDoParallel(4)}). After that, the function can be run as usual. It is recommended to also set options(future.globals.maxSize= +Inf). Please make sure that your data has no NAs and is all numerical, since the LASSO implementation we use does not support missing or non-numerical values.
+ Note: If you want to conduct the variable selection step only in one data set (i.e., only in the genotype), you can set the argument \emph{environmental_matrix = NULL}. } +\examples{ +## Find VML in test data +VML <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ) +## Find cis SNPs around VML +VML_with_cis_snps <- RAMEN::findCisSNPs( + VML_df = VML$VML, + genotype_information = RAMEN::test_genotype_information, + distance = 1e6 + ) + +## Summarize methylation levels in VML +summarized_methyl_VML <- RAMEN::summarizeVML( + methylation_data = RAMEN::test_methylation_data, + VML_df = VML_with_cis_snps + ) + + ## Select relevant genotype and environmental variables + selected_vars <- RAMEN::selectVariables( + VML_df = VML_with_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML, + seed = 1 + ) + +} diff --git a/man/summarizeVML.Rd b/man/summarizeVML.Rd new file mode 100644 index 0000000..69e0905 --- /dev/null +++ b/man/summarizeVML.Rd @@ -0,0 +1,44 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/summarizeVML.R +\name{summarizeVML} +\alias{summarizeVML} +\title{Summarize the methylation states of Variable Methylated Loci (VML)} +\usage{ +summarizeVML(VML_df, methylation_data) +} +\arguments{ +\item{VML_df}{A GRanges-like data frame. Must contain the following columns: +"seqnames", "start", "end" and "probes" (containing lists as elements, where each contains a vector with the probes constituting the VML). 
This is the "VML" object returned by the \emph{findVML()} function.} + +\item{methylation_data}{A data frame containing M or B values, with samples as columns and probes as rows. Row names must be the CpG probe IDs.} +} +\value{ +A data frame with samples as rows, and VML as columns. The value inside each cell corresponds to the summarized methylation value of said VML in the corresponding individual. The column names correspond to the VML_index. +} +\description{ +This function computes a representative methylation score for each Variable Methylated Locus (VML) in a dataset. It returns a data frame with the median methylation of each region per individual. +For each VML in a dataset, it returns a data frame with the median methylation of that region (columns) per individual (rows) as a representative score. +} +\details{ +This function supports parallel computing for increased speed. To do so, you have to set the parallel backend in your R session BEFORE running the function (e.g., \emph{doParallel::registerDoParallel(4)}). After that, +the function can be run as usual.
+} +\examples{ +## Find VML in test data +VML <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ) + +## Summarize methylation states of the found VML +summarized_VML <- RAMEN::summarizeVML( + VML_df = VML$VML, + methylation_data = RAMEN::test_methylation_data + ) + +} diff --git a/man/summarizeVMRs.Rd b/man/summarizeVMRs.Rd deleted file mode 100644 index 87fc189..0000000 --- a/man/summarizeVMRs.Rd +++ /dev/null @@ -1,29 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/summarizeVMRs.R -\name{summarizeVMRs} -\alias{summarizeVMRs} -\title{Summarize the methylation states of Variable Methylated Regions (VMRs)} -\usage{ -summarizeVMRs(VMRs_df, methylation_data) -} -\arguments{ -\item{VMRs_df}{A GRanges object converted to a data frame. Must contain the following columns: -"seqnames", "start", "end" (all of which are produced automatically when doing the object conversion) -and "probes" (containing a list where each element contains a vector with the probes -constituting the VMR).} - -\item{methylation_data}{A data frame containing M or B values, with samples as columns and probes as rows.} -} -\value{ -A data frame with samples as rows, and VMRs as columns. The value inside each cell corresponds to the summarized -methylation value of said VMR in the corresponding individual. The column names correspond to the VMR_index, which is created if not -already existing based on the rownames of the VMR_df. -} -\description{ -For each VMR in a dataset, returns an object with the median methylation of that region per individual as representative score. -} -\details{ -This function supports parallel computing for increased speed. 
To do so, you have to set the parallel backend -in your R session BEFORE running the function (e.g., doFuture::registerDoFuture()) and then the evaluation strategy (e.g., future::plan(multisession)). After that, -the function can be run as usual. -} diff --git a/man/test_array_manifest.Rd b/man/test_array_manifest.Rd deleted file mode 100644 index c441085..0000000 --- a/man/test_array_manifest.Rd +++ /dev/null @@ -1,29 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/test_array_manifest.R -\docType{data} -\name{test_array_manifest} -\alias{test_array_manifest} -\title{Array manifest example data set} -\format{ -\subsection{\code{test_array_manifest}}{ - -A data frame with 3,000 rows and 3 columns: -\describe{ -\item{\emph{rownames}}{Probe ID - for storage reasons, this variable was stored as row names, but rownames have to be converted to a new column called "TargetID" prior to its use in RAMEN.} -\item{MAPINFO}{Probe genomic position (h19)} -\item{CHR}{Chromosome} -\item{STRAND}{Strand} -... -} -} -} -\source{ -\url{https://webdata.illumina.com/downloads/productfiles/methylationEPIC/infinium-methylationepic-v-1-0-b4-manifest-file-csv.zip} -} -\usage{ -test_array_manifest -} -\description{ -A subset of data from Illumina's EPIC array manifest (first 3,000 probes of the chromosome 21). -} -\keyword{datasets} diff --git a/man/ultrastable_cpgs.Rd b/man/ultrastable_cpgs.Rd new file mode 100644 index 0000000..cdca2a0 --- /dev/null +++ b/man/ultrastable_cpgs.Rd @@ -0,0 +1,22 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/ultrastable_cpgs.R +\docType{data} +\name{ultrastable_cpgs} +\alias{ultrastable_cpgs} +\title{Ultrastable probes} +\format{ +\subsection{\code{ultrastable_cpgs}}{ + +A vector with the name of the 15,224 ultrastable probes identified by Edgar et al. (2014). The name of the probes are based on the Illumina 450k manifest. 
+} +} +\source{ +https://static-content.springer.com/esm/art\%3A10.1186\%2F1756-8935-7-28/MediaObjects/13072_2014_333_MOESM2_ESM.txt +} +\usage{ +ultrastable_cpgs +} +\description{ +This data set contains the list of ultrastable probes identified by \href{https://doi.org/10.1186/1756-8935-7-28}{Rachel Edgar et al. (2014)}. This publication identified ultrastable CpGs across many tissues and conditions using the Illumina 450k array. Ultrastable probes are defined as CpGs consistently methylated or unmethylated in every sample (1,737 samples from 30 publicly available studies). These CpGs are used to create a "null DNAme variance" distribution in the RAMEN package, from which a threshold is taken to identify Highly Variable Probes. +} +\keyword{datasets} diff --git a/tests/testthat/test-findCisSNPs.R b/tests/testthat/test-findCisSNPs.R deleted file mode 100644 index 8849056..0000000 --- a/tests/testthat/test-findCisSNPs.R +++ /dev/null @@ -1,3 +0,0 @@ -test_that("multiplication works", { - expect_equal(2 * 2, 4) -}) diff --git a/tests/testthat/test-findVMRs.R b/tests/testthat/test-findVMRs.R deleted file mode 100644 index 8849056..0000000 --- a/tests/testthat/test-findVMRs.R +++ /dev/null @@ -1,3 +0,0 @@ -test_that("multiplication works", { - expect_equal(2 * 2, 4) -}) diff --git a/tests/testthat/test-lmGE.R b/tests/testthat/test-lmGE.R deleted file mode 100644 index 8849056..0000000 --- a/tests/testthat/test-lmGE.R +++ /dev/null @@ -1,3 +0,0 @@ -test_that("multiplication works", { - expect_equal(2 * 2, 4) -}) diff --git a/tests/testthat/test-medCorVMR.R b/tests/testthat/test-medCorVMR.R deleted file mode 100644 index 8849056..0000000 --- a/tests/testthat/test-medCorVMR.R +++ /dev/null @@ -1,3 +0,0 @@ -test_that("multiplication works", { - expect_equal(2 * 2, 4) -}) diff --git a/tests/testthat/test-nullDistGE.R b/tests/testthat/test-nullDistGE.R deleted file mode 100644 index 8849056..0000000 --- a/tests/testthat/test-nullDistGE.R +++ /dev/null @@ -1,3 +0,0 @@
-test_that("multiplication works", { - expect_equal(2 * 2, 4) -}) diff --git a/tests/testthat/test-selectVariables.R b/tests/testthat/test-selectVariables.R deleted file mode 100644 index 8849056..0000000 --- a/tests/testthat/test-selectVariables.R +++ /dev/null @@ -1,3 +0,0 @@ -test_that("multiplication works", { - expect_equal(2 * 2, 4) -}) diff --git a/tests/testthat/test-summarizeVMRs.R b/tests/testthat/test-summarizeVMRs.R deleted file mode 100644 index 8849056..0000000 --- a/tests/testthat/test-summarizeVMRs.R +++ /dev/null @@ -1,3 +0,0 @@ -test_that("multiplication works", { - expect_equal(2 * 2, 4) -}) diff --git a/tests/testthat/test-workflow.R b/tests/testthat/test-workflow.R new file mode 100644 index 0000000..a7c4f8e --- /dev/null +++ b/tests/testthat/test-workflow.R @@ -0,0 +1,717 @@ +# Since the input of some functions is the output of others in the package, +# the whole workflow is going to be tested in this script to minimize the +# run time. + +library(testthat) +library(dplyr) + +#### findVML() #### +VML <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 +) + +test_that("findVML variance calculation is correct", { + probe_test <- VML$highly_variable_probes[1:10, ] + observed_variance <- RAMEN::test_methylation_data[probe_test$TargetID, ] |> + apply(1, var) + names(observed_variance) <- NULL + expect_equal(observed_variance, probe_test$var_score) +}) + +test_that("findVML output structure is correct", { + expect_true(is.list(VML)) + expect_true("VML" %in% names(VML)) + expect_true("highly_variable_probes" %in% names(VML)) + expect_true("var_score_threshold" %in% names(VML)) + expect_true(is.data.frame(VML$VML)) + expect_true(is.data.frame(VML$highly_variable_probes)) +}) + +test_that("findVML handles a different var_method option", { + 
VML_result_mad <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "mad", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ) + probe_test <- VML_result_mad$highly_variable_probes[1:10, ] + observed_mad <- RAMEN::test_methylation_data[probe_test$TargetID, ] |> + apply(1, mad) + names(observed_mad) <- NULL + expect_equal(observed_mad, probe_test$var_score) + expect_true(is.list(VML_result_mad)) + expect_true("VML" %in% names(VML_result_mad)) +}) + +test_that("findVML throws errors when expected", { + expect_error( + RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 1000, + var_method = "ultrastable", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ), + "'cor_threshold' must be of type 'numeric' and from 0 to 1" + ) + expect_error( + RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "binomial", + var_threshold_percentile = 0.99, + max_distance = 1000 + ), + "'var_distribution' must be one of 'all' or 'ultrastable'" + ) + expect_error( + RAMEN::findVML( + methylation_data = as.matrix(RAMEN::test_methylation_data), + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ), + "The methylation_data object must be a data frame with samples as columns and probes as rows." 
+ ) + expect_error( + RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "a", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ) + ) + expect_error( + RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "a", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ), + "The method must be either 'mad' or 'variance'. Please select one of those options" + ) + +}) + +test_that("findVML works with EPICv2 probes", { + epic2_methylation_data <- RAMEN::test_methylation_data + rownames(epic2_methylation_data) <- data.frame(IlluminaHumanMethylationEPICv2anno.20a1.hg38::Locations) |> + dplyr::filter(chr == "chr21") |> + arrange(chr, pos) |> #Make sure to extract neighbouring probes to have VML + slice_head(n = nrow(RAMEN::test_methylation_data)) |> + rownames() + + VML_epic2 <- RAMEN::findVML( + methylation_data = epic2_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv2", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 + ) + expect_true(is.list(VML_epic2)) + expect_true(is.data.frame(VML_epic2$VML)) + expect_equal(ncol(VML_epic2$VML), 10 ) +}) + +test_that("findVML works with var_distribution = 'all' and mad score", { + VML_allvar <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "mad", + var_distribution = "all", + var_threshold_percentile = 0.9, + max_distance = 1000 + ) + expect_true(is.list(VML_allvar)) + expect_true(is.data.frame(VML_allvar$VML)) + expect_equal(ncol(VML_allvar$VML), 10 ) +}) + +test_that("sVMPs have no correlation", { + sVMPs <- VML$VML |> + dplyr::filter(type == "sVMP") 
|> + dplyr::pull(median_correlation) + expect_true(all(is.na(sVMPs))) +}) + +test_that("correlation is computed correctly", { + VML_test <- VML$VML |> + dplyr::filter(type == "VMR") |> + dplyr::arrange(n_VMPs) |> + dplyr::slice_tail(n = 1) # Get the VMR with highest number of HVPs + probes <- unlist(VML_test$probes) + methylation_subset <- RAMEN::test_methylation_data[probes, ] + cor_matrix <- cor(t(methylation_subset)) + cor_values <- cor_matrix[lower.tri(cor_matrix)] + median_cor <- median(cor_values) + expect_equal(VML_test$median_correlation, median_cor) +}) + +#### summarizeVML() #### +summarized_methyl_VML <- RAMEN::summarizeVML( + VML_df = VML$VML, + methylation_data = test_methylation_data +) + +test_that("summarizeVML output structure is correct", { + expect_true(is.data.frame(summarized_methyl_VML)) + expect_equal(ncol(summarized_methyl_VML), nrow(VML$VML)) + expect_equal(nrow(summarized_methyl_VML), ncol(test_methylation_data)) +}) + +test_that("summarizeVML adds VML_index when not present", { + VML_no_index <- VML$VML |> + dplyr::select(-VML_index) + summarized_no_index <- RAMEN::summarizeVML( + VML_df = VML_no_index, + methylation_data = test_methylation_data + ) + expect_true(is.data.frame(summarized_no_index)) + expect_true(all( + colnames(summarized_no_index) %in% paste0("VML", seq_len(nrow(VML_no_index)))) + ) + expect_equal(nrow(summarized_no_index), ncol(test_methylation_data)) +} +) + +test_that("summarizeVML values are correct", { + # First for sVMPs: the summarized value should be equal to the methylation + VML_test <- VML$VML |> + dplyr::filter(type == "sVMP") |> + dplyr::slice_head(n = 1) # Get the first sVMP + probe <- unlist(VML_test$probes) + expected <- RAMEN::test_methylation_data[probe, ] |> unlist() + observed <- summarized_methyl_VML[, VML_test$VML_index] + names(observed) <- rownames(summarized_methyl_VML) + expect_equal(observed, expected) + # now for VMRs: the summarized value should be the median across probes + VMR_test <- 
VML$VML |> + dplyr::filter(type == "VMR") |> + dplyr::slice_head(n = 1) # Get the first VMR + probes <- unlist(VMR_test$probes) + expected <- apply( + RAMEN::test_methylation_data[probes, ], + 2, + median + ) + observed <- summarized_methyl_VML[, VMR_test$VML_index] + names(observed) <- rownames(summarized_methyl_VML) + expect_equal(observed, expected) +}) + +test_that("summarizeVML throws errors when expected", { + expect_error( + RAMEN::summarizeVML( + VML_df = "a", + methylation_data = test_methylation_data + ), + "Please provide a data frame in VML_df" + ) + expect_error( + RAMEN::summarizeVML( + VML_df = VML$VML, + methylation_data = "a" + ), + "Please make sure the methylation data is a data frame or matrix with samples as columns and probes as rows." + ) +}) + +test_that("summarizeVML works when methylation_data is a matrix", { + summarized_methyl_VML_matrix <- RAMEN::summarizeVML( + VML_df = VML$VML, + methylation_data = as.matrix(test_methylation_data) + ) + expect_true(is.data.frame(summarized_methyl_VML_matrix)) + expect_equal(ncol(summarized_methyl_VML_matrix), nrow(VML$VML)) + expect_equal(nrow(summarized_methyl_VML_matrix), ncol(test_methylation_data)) + svmp <- VML$VML |> + dplyr::filter(type == "sVMP") |> + dplyr::slice_head(n = 1) # Get the first sVMP + expect_equal( + summarized_methyl_VML_matrix[, svmp$VML_index], + test_methylation_data[unlist(svmp$probes), ] |> unlist() |> unname() + ) +}) + +#### findCisSNPs() #### +VML_cis_snps <- RAMEN::findCisSNPs( + VML_df = VML$VML, + genotype_information = RAMEN::test_genotype_information, + distance = 1e+06 +) + +test_that("findCisSNPs adds a VML index when it is not present", { + VML_cis_snps_noID <- RAMEN::findCisSNPs( + VML_df = VML$VML |> + dplyr::select(-VML_index), + genotype_information = RAMEN::test_genotype_information, + distance = 1e+06) + expect_true("VML_index" %in% colnames(VML_cis_snps_noID)) +}) + +test_that("findCisSNPs output structure is correct", { + 
expect_true(is.data.frame(VML_cis_snps)) + expect_equal(ncol(VML_cis_snps), ncol(VML$VML) + 2) + expect_equal(nrow(VML_cis_snps), nrow(VML$VML)) + expect_true(all( + c(colnames(VML$VML), "surrounding_SNPs", "SNP") %in% + colnames(VML_cis_snps) + )) +}) + +test_that("findCisSNPs throws errors when expected", { + expect_error( + RAMEN::findCisSNPs(VML_df = VML$VML |> + dplyr::select(-seqnames), + genotype_information = RAMEN::test_genotype_information, + distance = 1e+06 + ), + "Please make sure the VML_df object has the required columns with the appropiate names (check documentation for further information)", + fixed = TRUE + ) + expect_error( + RAMEN::findCisSNPs(VML_df = VML$VML, + genotype_information = RAMEN::test_genotype_information |> + dplyr::select(-CHROM), + distance = 1e+06 + ), + "Please make sure the genotype_information object has the required columns with the appropiate names (check documentation for further information)", + fixed = TRUE + ) + expect_error( + RAMEN::findCisSNPs(VML_df = "a", + genotype_information = RAMEN::test_genotype_information, + distance = 1e+06 + ), + "Please make sure the VML_df object is a data frame.", + fixed = TRUE) + expect_error( + RAMEN::findCisSNPs(VML_df = VML$VML, + genotype_information = "a", + distance = 1e+06 + ), + "Please make sure the genotype_information object is a data frame.") +} +) + +test_that("findCisSNPs returns the right number of cis SNPs", { + VML_test <- data.frame( + VML_index = "1", + seqnames = "chr1", + start = 1000, + end = 2000, + type = "VMR" + ) + genot_info_test <- data.frame( + CHROM = c("chr1", "chr1", "chr1", "chr1"), + POS = c(1, 500, 2500, 4000), + ID = c("rs1", "rs2", "rs3", "rs4") + ) + test_1 <- RAMEN::findCisSNPs( + VML_df = VML_test, + genotype_information = genot_info_test, + distance = 1 + ) + test_500 <- RAMEN::findCisSNPs( + VML_df = VML_test, + genotype_information = genot_info_test, + distance = 500 + ) + test_1000 <- RAMEN::findCisSNPs( + VML_df = VML_test, + 
genotype_information = genot_info_test, + distance = 1000 + ) + test_2000 <- RAMEN::findCisSNPs( + VML_df = VML_test, + genotype_information = genot_info_test, + distance = 2000 + ) + expect_equal(test_1$surrounding_SNPs, 0) # no SNPs are within 1 bp + expect_equal(test_500$surrounding_SNPs, 2) # only rs2 and rs3 are within 500bp + expect_equal(test_1000$surrounding_SNPs, 3) # rs1, rs2 and rs3 are within 1000bp + expect_equal(test_2000$surrounding_SNPs, 4) # all 4 snps are within 2000bp +} +) + +#### selectVariables() #### +selected_variables <- RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML, + seed = 1 +) +test_that("selectVariables output structure is correct", { + expect_true(is.data.frame(selected_variables)) + expect_equal(ncol(selected_variables), 3) + expect_equal(nrow(selected_variables), nrow(VML_cis_snps)) + expect_true(all( + c("VML_index", "selected_genot", "selected_env") %in% + colnames(selected_variables) + )) +}) + +#Test that errors happen when expected +test_that("selectVariables throws errors when expected", { + expect_error( + RAMEN::selectVariables( + VML_df = "a", + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML + ), + "Please make sure the VML data frame (VML_df) contains the columns 'SNP' and 'VML_index'.", + fixed = TRUE + ) + #Test error when there are ID mismatches + test_genot <- RAMEN::test_genotype_matrix + colnames(test_genot) <- NULL + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = test_genot, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML + ), + "Individual IDs 
in summarized_methyl_VML do not match individual IDs in genotype_matrix", + fixed = TRUE + ) + #Test error when there is argument mismatch with the environmental_matrix + test_env <- RAMEN::test_environmental_matrix + rownames(test_env) <- NULL + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = test_env, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML + ), + "Individual IDs in summarized_methyl_VML do not match individual IDs in environmental_matrix", + fixed = TRUE + ) + #Test error when there is argument mismatch with the covariates + test_cov <- RAMEN::test_covariates + rownames(test_cov) <- NULL + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = test_cov, + summarized_methyl_VML = summarized_methyl_VML + ), + "Individual IDs in summarized_methyl_VML do not match individual IDs in the covariates matrix", + fixed = TRUE + ) + + #Test that matrix arguments throw errors if input is not a matrix + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = as.data.frame(RAMEN::test_environmental_matrix), + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML + ), + "Please make sure the environmental data is provided as a matrix.", + fixed = TRUE + ) + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = as.data.frame(RAMEN::test_genotype_matrix), + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML + ), + "Please make sure the genotype data is provided as a matrix.", + fixed = TRUE + ) + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = 
RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = as.data.frame(RAMEN::test_covariates), + summarized_methyl_VML = summarized_methyl_VML + ), + "Please make sure the covariates data is provided as a matrix.", + fixed = TRUE + ) + #Test missing columns in VML_df + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps |> + dplyr::select(-SNP), + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML + ), + "Please make sure the VML data frame (VML_df) contains the columns 'SNP' and 'VML_index'.", + fixed = TRUE + ) + #Test missing values in genotype matrix + #Introduce NA values + test_genot_na <- RAMEN::test_genotype_matrix + test_genot_na[1, 1] <- NA + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = test_genot_na, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML + ), + "Please make sure the genotype matrix contains only finite numeric values.", + fixed = TRUE + ) + #Test missing values in environmental matrix + #Introduce NA values + test_env_na <- RAMEN::test_environmental_matrix + test_env_na[1, 1] <- NA + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = test_env_na, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = summarized_methyl_VML + ), + "Please make sure the environmental matrix contains only finite numeric values.", + fixed = TRUE + ) + #Test missing values in covariates matrix + #Introduce NA values + test_cov_na <- RAMEN::test_covariates + test_cov_na[1, 1] <- NA + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = 
RAMEN::test_environmental_matrix, + covariates = test_cov_na, + summarized_methyl_VML = summarized_methyl_VML + ), + "Please make sure the covariates matrix contains only finite numeric values.", + fixed = TRUE + ) + #Test missing values in summarized methylation VML + test_summeth_na <- summarized_methyl_VML + test_summeth_na[1, 1] <- NA + expect_error( + RAMEN::selectVariables( + VML_df = VML_cis_snps, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + summarized_methyl_VML = test_summeth_na + ), + "Please make sure the summarized_methyl_VML data frame contains only finite numeric values.", + fixed = TRUE + ) +}) + + +#### lmGE() #### +# Use only 10 VML +lmge_res <- RAMEN::lmGE( + selected_variables = selected_variables[1:10,], + summarized_methyl_VML = summarized_methyl_VML, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + model_selection = "AIC" +) + +test_that("lmGE works with BIC model selection", { + lmge_res_bic <- RAMEN::lmGE( + selected_variables = selected_variables[1:10,], + summarized_methyl_VML = summarized_methyl_VML, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = RAMEN::test_covariates, + model_selection = "BIC" + ) + expect_true(is.data.frame(lmge_res_bic)) + expect_equal(ncol(lmge_res_bic), 13) + expect_equal(nrow(lmge_res_bic), 10) +}) + +test_that("lmGE output structure is correct", { + expect_true(is.data.frame(lmge_res)) + expect_equal(ncol(lmge_res), 13) + expect_equal(nrow(lmge_res), 10) +}) + +test_that("lmGE throws errors when expected", { + expect_error( + RAMEN::lmGE( + selected_variables = "a", + summarized_methyl_VML = summarized_methyl_VML, + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + covariates = 
RAMEN::test_covariates, + model_selection = "AIC" + ), + "Please make sure the selected_variables data frame contains the columns 'VML_index', 'selected_genot' and 'selected_env'.", + fixed = TRUE + ) +}) + +#### nullDistGE() #### +permutations <- 2 +null_dist <- RAMEN::nullDistGE( + VML_df = VML_cis_snps[1:10,], + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + summarized_methyl_VML = summarized_methyl_VML, + permutations = permutations, + covariates = RAMEN::test_covariates, + seed = 1, + model_selection = "AIC" +) +test_that("nullDistGE output structure is correct", { + expect_true(is.data.frame(null_dist)) + expect_equal(ncol(null_dist), 6) + expect_equal(nrow(null_dist), nrow(VML_cis_snps[1:10,])*permutations) +}) + +test_that("nullDistGE works with BIC", { + null_dist_bic <- RAMEN::nullDistGE( + VML_df = VML_cis_snps[1:10,], + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + summarized_methyl_VML = summarized_methyl_VML, + permutations = permutations, + covariates = RAMEN::test_covariates, + seed = 1, + model_selection = "BIC" + ) + expect_true(is.data.frame(null_dist_bic)) + expect_equal(ncol(null_dist_bic), 6) + expect_equal(nrow(null_dist_bic), nrow(VML_cis_snps[1:10,])*permutations) +}) + +test_that("nullDistGE throws errors when expected", { + expect_error( + RAMEN::nullDistGE( + VML_df = "a", + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + summarized_methyl_VML = summarized_methyl_VML, + permutations = 2, + covariates = RAMEN::test_covariates, + seed = 1, + model_selection = "AIC" + ), + "Please make sure the VML data frame (VML_df) contains the columns 'SNP' and 'VML_index'.", + fixed = TRUE + ) + #Test error when genotype_matrix has NA + #Introduce NA values + test_genot_na <- RAMEN::test_genotype_matrix + test_genot_na[1, 1] <- NA + expect_error( + RAMEN::nullDistGE( + 
VML_df = VML_cis_snps[1:10,], + genotype_matrix = test_genot_na, + environmental_matrix = RAMEN::test_environmental_matrix, + summarized_methyl_VML = summarized_methyl_VML, + permutations = 2, + covariates = RAMEN::test_covariates, + seed = 1, + model_selection = "AIC" + ), + "Please make sure the genotype matrix contains only finite numeric values.", + fixed = TRUE + ) + #Test error when environmental_matrix has NA + #Introduce NA values + test_env_na <- RAMEN::test_environmental_matrix + test_env_na[1, 1] <- NA + expect_error( + RAMEN::nullDistGE( + VML_df = VML_cis_snps[1:10,], + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = test_env_na, + summarized_methyl_VML = summarized_methyl_VML, + permutations = 2, + covariates = RAMEN::test_covariates, + seed = 1, + model_selection = "AIC" + ), + "Please make sure the environmental matrix contains only finite numeric values.", + fixed = TRUE + ) + #Test error when covariates has NA + #Introduce NA values + test_cov_na <- RAMEN::test_covariates + test_cov_na[1, 1] <- NA + expect_error( + RAMEN::nullDistGE( + VML_df = VML_cis_snps[1:10,], + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + summarized_methyl_VML = summarized_methyl_VML, + permutations = 2, + covariates = test_cov_na, + seed = 1, + model_selection = "AIC" + ), + "Please make sure the covariates matrix contains only finite numeric values.", + fixed = TRUE + ) + #Test error when summarized_methyl_VML has NA + #Introduce NA values + test_summeth_na <- summarized_methyl_VML + test_summeth_na[1, 1] <- NA + expect_error( + RAMEN::nullDistGE( + VML_df = VML_cis_snps[1:10,], + genotype_matrix = RAMEN::test_genotype_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, + summarized_methyl_VML = test_summeth_na, + permutations = 2, + covariates = RAMEN::test_covariates, + seed = 1, + model_selection = "AIC" + ), + "Please make sure the summarized_methyl_VML data frame 
contains only finite numeric values.", + fixed = TRUE + ) +} +) + +#### Clean environment #### +rm(VML, lmge_res, null_dist, summarized_methyl_VML, selected_variables, VML_cis_snps, permutations) diff --git a/vignettes/RAMEN.Rmd b/vignettes/RAMEN.Rmd index c3f0702..984e854 100644 --- a/vignettes/RAMEN.Rmd +++ b/vignettes/RAMEN.Rmd @@ -27,28 +27,29 @@ knitr::opts_chunk$set( ``` # Introduction -**Regional Association of Methylome variability with the Exposome and geNome (RAMEN)** is an R package whose goal is to integrate genomic, methylomic and exposomic data to model the contribution of genetics (G) and the environment (E) to DNA methylation (DNAme) variability. RAMEN identifies Variable Methylated Regions (VMRs) in microarray DNAme data and then, using genotype and environmental data, it identifies which of the following models better explains this variability in regions across the methylome: +**Regional Association of Methylome variability with the Exposome and geNome (RAMEN)** is an R package whose goal is to integrate genomic, methylomic and exposomic data to model the contribution of genetics (G) and the environment (E) to DNA methylation (DNAme) variability. 
RAMEN identifies Variable Methylated Loci (VML) in microarray DNAme data and then, using genotype and environmental data, it identifies which of the following models better explains this variability in regions across the methylome: ```{r modelstable, echo=FALSE} library(knitr) -models = data.frame(Model = c("DNAme ~ G + covars", "DNAme ~ E + covars", "DNAme ~ G + E + covars", "DNAme ~ G + E + G*E + covars"), - Name = c("Genetics", "Environmental exposure", "Additive", "Interaction"), - Abbreviation = c("G", "E", "G+E", "GxE")) - -kable(models, caption = 'Fitted models') +models <- data.frame( + Model = c("DNAme ~ G + covars", "DNAme ~ E + covars", "DNAme ~ G + E + covars", "DNAme ~ G + E + G*E + covars"), + Name = c("Genetics", "Environmental exposure", "Additive", "Interaction"), + Abbreviation = c("G", "E", "G+E", "GxE") +) +kable(models, caption = "Fitted models") ``` where G variables are represented by SNPs, E variables by environmental exposures, and where covars are concomitant variables (i.e. variables that are adjusted for in the model and not of interest in the study such as cell type proportion, age, etc.). 
The main [gene-environment interaction modeling][ Gene-environment interaction analysis] pipeline is conducted though six core functions: -- `findVMRs()` identifies Variable Methylated Regions (VMRs) in microarrays -- `summarizeVMRs()`summarizes the regional methylation state of each VMR -- `findCisSNPs()` identifies the SNPs in *cis* of each VMR +- `findVML()` identifies Variable Methylated Loci (VML) in microarrays +- `summarizeVML()` summarizes the regional methylation state of each VML +- `findCisSNPs()` identifies the SNPs in *cis* of each VML - `selectVariables()` conducts a LASSO-based variable selection strategy to identify potentially relevant *cis* SNPs and environmental variables -- `lmGE()` fits linear single-variable genetic (G) and environmental (E), and pairwise additive (G+E) and interaction (GxE) linear models and select the best explanatory model per VMR. +- `lmGE()` fits linear single-variable genetic (G) and environmental (E), and pairwise additive (G+E) and interaction (GxE) linear models and selects the best explanatory model per VML. - `nullDistGE()` simulates a delta R squared null distribution of G and E effects on DNAme variability. Useful for filtering out poor-performing best explanatory models selected by *lmGE()*. These functions are compatible with parallel computing, which is recommended due to the computationally intensive tasks conducted by the package. @@ -57,7 +58,9 @@ In addition to the [standard gene-environment interaction modeling pipeline][ Ge ## Citation -The manuscript detailing RAMEN and its use is currently under preparation. For more information about this please contact Erick I. Navarro-Delgado at [erick.navarrodelgado\@bcchr.ca](mailto:erick.navarrodelgado@bcchr.ca){.email}. +If you use RAMEN for any of your analyses, please cite the following publication: + + - Navarro-Delgado, E.I., Czamara, D., Edwards, K. et al.
RAMEN: Dissecting individual, additive and interactive gene-environment contributions to DNA methylome variability in cord blood. *Genome Biol* 26, 421 (2025). https://doi.org/10.1186/s13059-025-03864-4 # Gene-environment interaction analysis @@ -80,59 +83,91 @@ knitr::include_graphics("RAMEN_pipeline.png") where: - - DNAme data is grouped into VMRs, and then the DNAme state per individual is summarized in each region - - Using the identified VMRs and the genomic information, we identify the SNPs in *cis* for each VMR + - DNAme data is grouped into VML, and then the DNAme state per individual is summarized in each VML. + - Using the identified VML and the genomic information, we identify the SNPs in *cis* for each VML - Both the *cis* SNPs and the exposome data are subjected to the variable selection stage - - The selected variables (SNPs a and Es) enter the modelling stage, which outputs one single winning model per VMR + - The selected variables (Single Nucleotide Polymorphisms a and Environmental Exposures) enter the modelling stage, which outputs one single winning model per VML - The thresholds obtained from the simulated null distribution are used to remove winning models which performance are likely to be due to chance. -In the following sections we will go through each of these steps and guide the user regarding the recommended parameters to use in each function of the package. For illustration purposes, we provide small toy data sets that do not intend to simulate a real biological phenomenon. These data sets are already available in the RAMEN package. +In the following sections we will go through each of these steps and guide the user regarding the recommended parameters to use in each function of the package. For illustration purposes, we provide small toy data sets that do not intend to simulate the real biological phenomenon. These data sets are already available in the RAMEN package. 
```{r setup, warning=FALSE, message=FALSE} -#Load the packages used throughout the vignette +# Load the packages used throughout the vignette library(RAMEN) library(dplyr) library(ggplot2) library(tidyr) ``` -## Identify VMRs and summarize their methylation state +## Identify VML and summarize their methylation state -The first step of the pipeline is to identify the **Variable Methylated Regions**(VMRs) in the data set. You might be wondering *"What is a VMR and why do we use them instead of DNAme levels from each CpG site?"*. We use **regions** because it is well established that nearby CpG sites are [very likely to share a similar DNAme profile](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6093082/) and therefore work as functional units. Then, from a statistical point of view, testing separately proximal CpGs that are part of the same unit is redundant. On the other side, we use only **variable** regions because we are interested in the units that display a high level of variability; in other words, in non-variant sites there is no variability left to be explained by genetics or environment. So, in conclusion, we use **VMRs** to increase our power and reduce the multiple hypothesis testing burden by grouping probes that are likely to work as a biological unit, and by only focusing in the set of regions that are of interest of this study. +The first step of the pipeline is to identify the **Variable Methylated Loci**(VML) in the data set. You might be wondering *"What is a VML and why do we use them instead of DNAme levels from each CpG site?"*. We use **loci** because it is well established that nearby CpG sites are [very likely to share a similar DNAme profile](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6093082/) and therefore work as functional units. Then, from a statistical point of view, testing separately proximal CpGs that are part of the same unit is redundant. 
On the other side, we use only **variable** regions because we are interested in the units that display a high level of variability; in other words, in non-variant sites there is no variability left to be explained by genetics or environment. So, in conclusion, we use **VML** to increase our power and reduce the multiple hypothesis testing burden by grouping probes that are likely to work as a biological unit, and by only focusing in the set of regions that are of interest of this study. -RAMEN identifies 2 categories of VMRs: +RAMEN identifies 2 categories of VML: - - Canonical VMRs: Group of Highly Variable Probes that are proximal and correlated. Highly Variable Probes are defined as probes above a specific variance percentile threshold specified by the user (the default is 90th percentile). The proximity distance and pearson correlation threshold is specified by the user, and the defaults are 1 kilobase and 0.15 respectively. For guidance on which correlation threshold to use, we recommend checking the Supplementary Figure 1 of the CoMeBack R package (Gatev et al., 2020) where a simulation to empirically determine a default guidance specification for a correlation threshold parameter dependent on sample size is done. - - Non canonical VMRs: Regions that are composed of Highly Variable Probes that have no nearby probes measured in the array (according to the distance parameter specified by the user). This category was created to take into account the characteristics of the DNAme microarray plataform, which has probes covering non-homogenelously the genome. This is specially important for microarrays such as the EPIC array which has a high number of probes in regulatory regions that are represented by a single probe. Furthermore, these probes have been shown to be good representatives of the methylation state of its surroundings (Pidsley et al., 2016). 
By creating this category, we recover those informative HVPs that would otherwise be excluded from the analysis because of working with the canonical VMR definition in the array context. + - Variably Methylated Region (VMR): Group of Highly Variable Probes that are proximal and correlated. Highly Variable Probes are defined as probes above a specific variance percentile threshold specified by the user (more information below). The proximity distance and Pearson correlation thresholds are specified by the user, and the defaults are 1 kilobase and 0.15 respectively. For guidance on which correlation threshold to use, we recommend checking the Supplementary Figure 1 of the CoMeBack R package (Gatev et al., 2020) where a simulation to empirically determine a default guidance specification for a correlation threshold parameter dependent on sample size is done. + - sparse Variably Methylated Probe (sVMP): Genomic loci that are composed of a Highly Variable Probe that has no nearby probes measured in the array (according to the distance parameter specified by the user). This category was created to take into account the characteristics of the DNAme microarray platform, which covers the genome non-homogeneously. Due to the limited number of probes that can be measured in an array, this technology tends to interrogate the DNAme of genomic regions with a single probe. This is especially important for microarrays such as the EPIC array which has a high number of probes in regulatory regions that are represented by a single probe. Furthermore, there is empirical evidence that these probes are good representatives of the methylation state of their surroundings (Pidsley et al., 2016). By creating this category, we recover those informative HVPs that would otherwise be excluded from the analysis because of working with the canonical VMR definition in the context of a microarray.
-The first step is to identify **Variable Methylated Regions**(VMRs) using the `RAMEN::findVMRs()` function. This function uses GenomicRanges::reduce() to group the regions, which is strand-sensitive. In the Illumina microarrays, the MAPINFO for all the probes is usually provided as for the + strand. If you are using this array, we recommend to first convert the strand of all the probes to "+". +The first step is to identify **Variable Methylated Loci** (VML) using the `RAMEN::findVML()` function. This function uses GenomicRanges::reduce() to group the regions, which is strand-sensitive. In the Illumina microarrays, the MAPINFO for all the probes is usually provided for the + strand. If you are using this array, we recommend first converting the strand of all the probes to "+". For this step, we also recommend that users work with M-values because they are more appropriate for statistical analyses (see Pan Du, *et al.*, 2010, *BMC Bioinformatics*). -For this step, we recommend users to use M-values because its use is more appropiate for statistical analyses (see Pan Du, *et al.*, 2010, *BMC Bioinformatics*) +Now, there are a couple of options that we provide to define Highly Variable Probes, which are the building blocks of VML. Let's talk about two of the more important ones: +### var_method + +We need to choose a metric to quantify the variability of each probe across individuals. Different metrics exist for this purpose, each one with its own pros and cons. The user can choose between "MAD" (Median Absolute Deviation) and "variance". We recommend using variance, as it captures cases where the spread is driven by a "low" frequency of individuals that display a substantially different pattern compared to the mean - which could be potentially caused by a genetic variant or environmental exposure.
On the other hand, MAD is by nature more robust to outliers, so it only picks up cases where there is a consistent variability across most individuals (also MAD has been historically used as a spread metric in GxE methylome-wide studies). In simpler terms, let's say we have a study with 200 individuals. If in a probe, 110 individuals have similar DNAme levels, but 90 (45%) of them have different DNAme levels, the variance method could capture this scenario as a highly variable probe, while MAD will not. Let's see an example: + ```{r} -#We need to modify the RAMEN::test_array_manifest object by assigning to -#row names to the probe ID column; it was saved this way because storing -#the TargetID as row names reduced significantly the size of the data set. -test_array_manifest_final = RAMEN::test_array_manifest %>% - tibble::rownames_to_column(var = "TargetID") - -VMRs = RAMEN::findVMRs(array_manifest = test_array_manifest_final, - methylation_data = RAMEN::test_methylation_data, - cor_threshold = 0, - var_method = "variance", - var_threshold_percentile = 0.9, - max_distance = 1000) - -#Take a look at the resulting object -dplyr::glimpse(VMRs) +set.seed(1) +sample <- c( + rep(0.2, 110), + sample(x = 0:10, size = 90, replace = TRUE) / 10 +) +stats::var(sample) +stats::mad(sample) +``` + +You can see in this simplified example that variability that is not shared by at least 50% of the individuals is ignored by MAD (i.e. it is 0), but not by variance (i.e. it is >0). Because we want to capture probes where the variability is driven by less than half of the individuals in the population, which could be interesting, var_method = "variance" is the default. + +You might also wonder, does it make that much of a difference? From empirical evidence, MAD and variance are expected to display a high correlation, so using MAD or variance will lead to a similar set of Highly Variable Probes.
For instance, let's check the relation between variance and MAD score in the CHILD dataset used in RAMEN's first publication (see Navarro-Delgado EI, *et al.*, 2025, Genome Biology). + +```{r,echo= FALSE, fig.cap="MAD vs var relation"} +knitr::include_graphics("mad_var.png") ``` -As we can see, `RAMEN::findVMRs()` returns a list with four elements: +Additionally, if we were to take the top 10% of probes as highly variable, we found a 86% overlap between the two methods. So think of this more of a fine-tuning parameter rather than a game-changer. + +```{r,echo= FALSE, fig.cap="HVPs with mad vs var"} +knitr::include_graphics("hvps_mad_var.png") +``` + +### var_distribution + +The second argument that we are going to discuss is var_distribution. There are two options that you can choose from: "all" and "ultrastable". The "all" draws a variability distribution (MAD or variance) from all the probes in the array, and labels the top x% as HVPs (x is defined by the user with the var_threshold_percentile argument). So for example if we use a 90th percentile threshold, every probe with a variability score above the 90th percentile of the distribution (i.e. top 10%) will be labeled as Highly Variable Probe. This approach has been used in previous manuscripts, and allows the user to control the proportion of probes that will be labeled as HVPs. + +On the other hand, the "ultrastable" option defines the variability threshold using only the variability scores from probes that are located in ultrastable regions. Ultrastable probes display a very low variability across individuals independent of tissue and developmental stage. Therefore, using these regions to define Highly Variable Probes provides a more stable and comparable definition of HVPs across data sets. When using the "ultrastable" option, we aim to remove all probes that display the same variability behavior as the ultrastable probes (which become our "null distribution"). 
So we recommend using a high var_threshold_percentile (default for this option is 99th percentile). However, we don't recommend using the max value (100th percentile) as this can be very easily affected by outliers. The ultrastable probes used in RAMEN were identified by Edgar *et al.* (2014) using 1,737 samples from 30 publicly available studies. These probes are included in the RAMEN package as the `ultrastable_cpgs` data set. + +We recommend using the "ultrastable" option, as it provides a more objective and biologically meaningful definition of Highly Variable Probes. Using a fixed percentile threshold (e.g., 90th percentile) could lead to different definitions of HVPs across data sets, as the overall variability of DNAme can differ between cohorts. For instance, a cohort with a high level of environmental exposure variability might display a higher overall DNAme variability compared to a cohort with low environmental exposure variability. In this scenario, using a fixed percentile threshold will lead to both cohorts having the exact same number of HVPs, despite one of them being way more variable than the other, and a definition of HVPs unique to each data set. + +### Running `RAMEN::findVML()` + +So, after covering all the basics and understanding how the function works, we can start our analysis! Let's give it a try. + +```{r} +VML <- RAMEN::findVML( + methylation_data = RAMEN::test_methylation_data, + array_manifest = "IlluminaHumanMethylationEPICv1", + cor_threshold = 0, + var_method = "variance", + var_distribution = "ultrastable", + var_threshold_percentile = 0.99, + max_distance = 1000 +) - - var_score_threshold: The MAD-score or variance threshold used to define Highly Variable Probes. - - highly_variable_probes: a data frame with the probes that passed the variability score threshold imposed by the user, and their variability score (MAD score or variance).
- - canonical_VMRs: a GRanges object with strict candidate VMRs - regions composed of two or more contiguous, correlated and proximal Highly Variable Probes; thresholds depend on the ones specified by the user) - - non_canonical_VMRs: a GRanges object with highly variable probes without neighboring CpGs measured in max_distance on the array. Category created to take into acccount the Illumina array design of single probes capturing the methylation state of regulatory regions. +# Take a look at the resulting object +dplyr::glimpse(VML$var_score_threshold) # check the specific threshold that was used to label HVPs +head(VML$highly_variable_probes) # check the HVPs identified and their variability score +head(VML$VML) # Take a look at the identified VML data frame +``` Furthermore, we can see the following warning message in the chunk above: @@ -140,152 +175,158 @@ Furthermore, we can see the following warning message in the chunk above: #> Warning: executing %dopar% sequentially: no parallel backend registered ``` -This is printed in the screen just to warn us that `RAMEN::findVMRs()` is running sequentially. RAMEN supports parallel computing for increased speed, which is really important when working with real data sets that tend to contain information from the whole genome. To do so, you have to set the parallel backend in your R session BEFORE running the function (e.g., `doFuture::registerDoFuture()`) and then the evaluation strategy (e.g., `future::plan(multisession)`). After that, the function can be run normally. When working with big datasets, the parallel backend might throw an error if you exceed the maximum allowed size of globals exported for future expression. This can be fixed by increasing the allowed size (e.g. running `options(future.globals.maxSize= +Inf)`) +This is printed in the screen just to warn us that `RAMEN::findVML()` is running sequentially. 
RAMEN supports parallel computing for increased speed, which is really important when working with real data sets that tend to contain information from thousands of probes. To do so, you have to set the parallel backend in your R session BEFORE running the function (e.g., *doParallel::registerDoParallel(4)*). After that, the function can be run normally. When working with big datasets, the parallel backend might throw an error if you exceed the maximum allowed size of globals exported for future expression. This can be fixed by increasing the allowed size (e.g. running `options(future.globals.maxSize= +Inf)`) -Finally, we will convert the output of `RAMEN::findVMRs()` to a data frame, which is an object that we can easily use to produce plots and explore the results, and the object that is needed for the following parts of the pipeline. +Finally, we will extract the VML data frame, which we can use to produce plots and explore the results. This data frame will also be used for the following parts of the pipeline.
```{r} -VMRs_df = data.frame(VMRs[["canonical_VMRs"]]) %>% - rbind(as.data.frame(VMRs[["non_canonical_VMRs"]])) %>% - dplyr::select( -c(width.1,strand)) +VML_df <- VML$VML -head(VMRs_df) -``` -With the VMRs as a data frame, we can explore our data set using ggplot, such as the following example: - -```{r, fig.cap="Width of non canonical VMRs (base pairs)."} -VMRs_df %>% - dplyr::filter(width > 1) %>% #Only plot canonical VMRs - ggplot2::ggplot(aes(x = width))+ - ggplot2::geom_histogram(binwidth = 50, fill = "#BAB4D8")+ - ggplot2::theme_classic()+ - ggplot2::ggtitle("Canonical VMRs width (bp)") +# Example of an exploration plot +VML_df %>% + dplyr::filter(width > 1) %>% # Only plot VMRs, since sVMPs all have a length of 1 + ggplot2::ggplot(aes(x = width)) + + ggplot2::geom_histogram(binwidth = 50, fill = "#BAB4D8") + + ggplot2::theme_classic() + + ggplot2::ggtitle("VMRs width (bp)") ``` -Next, we want to summarize the DNAme level of each VMR per individual. To do this, we use `RAMEN::summarizeVMRs()`. For non canonical VMRs, there is nothing to summarize, so the DNAme level of the corresponding probe is returned. For canonical VMRs, the median DNAme level of the region is returned per individual as the representative value. +Next, we want to summarize the DNAme level of each VML per individual. To do this, we use `RAMEN::summarizeVML()`. For sparse VMPs, there is nothing to summarize as we have one probe per locus, so the DNAme level of the corresponding probe is returned. For VMRs, the median DNAme level of all the probes in the region is returned per individual as the representative value.
```{r} -summarized_methyl_VMR = RAMEN::summarizeVMRs(VMRs_df = VMRs_df, - methylation_data = test_methylation_data) +summarized_methyl_VML <- RAMEN::summarizeVML( + VML_df = VML_df, + methylation_data = test_methylation_data +) # Look at the resulting object -summarized_methyl_VMR[1:5,1:5] +summarized_methyl_VML[1:5, 1:5] ``` -The result is a data frame of VMR IDs as columns and individual IDs as rows. +The result is a data frame of VML IDs as columns and individual IDs as rows. ## Identify *cis* SNPs -After identifying the VMRs, we recommend to use only SNPs in *cis* of each VMR, since genetic variants that associate with DNAme changes tend to be more abundant in the surroundings of the corresponding DNAme site (McClay *et al.*, 2015). Also, the effect sizes of mQTLs (genetic variants associated with DNAme changes) are stronger in *cis* SNPs compared to *trans* SNPs. Then, by restricting the analysis to *cis* SNPs, we greatly reduce the number of variables while keeping most of the important ones. +After identifying the VML, we recommend to use only SNPs in *cis* of each loci, since genetic variants that associate with DNAme changes tend to be more abundant in the surroundings of the corresponding DNAme site (McClay *et al.*, 2015). Also, the effect sizes of mQTLs (genetic variants associated with DNAme changes) are stronger in *cis* SNPs compared to *trans* SNPs. Then, by restricting the analysis to *cis* SNPs, we greatly reduce the number of variables while keeping most of the important ones. -There is not a clear consensus on how close a SNP has to be from a DNAme site to be considered *cis* - the distance threshold tend to go from few kb to 1 megabase. We recommend to use a 1 Mb window to cast a wide net to catch potentially relevant SNPs. +There is not a clear consensus on how close a SNP has to be from a DNAme site to be considered *cis* - the distance threshold tend to go from few kb to 1 megabase. 
We recommend using a 1 Mb window to cast a wide net and catch most potentially relevant SNPs. ```{r} -VMRs_df = RAMEN::findCisSNPs(VMRs_df = VMRs_df, - genotype_information = RAMEN::test_genotype_information, - distance = 1e+06) +VML_cis_snps <- RAMEN::findCisSNPs( + VML_df = VML_df, + genotype_information = RAMEN::test_genotype_information, + distance = 1e+06 +) -#Take a look at the result -dplyr::glimpse(VMRs_df) +# Take a look at the result +head(VML_cis_snps) ``` -We can see that the resulting VMRs_df object is almost exactly the same, but with two new columns (*surrounding_SNPs* and *SNP*) that contain information about how many SNPs were found in *cis* and what are their IDs according to the genotype data that we have. - -It is important to highlight the following characteristics of the VMRs_df object: +We can see that the resulting data frame is almost exactly the same, but with two new columns (*surrounding_SNPs* and *SNP*) that contain information about how many SNPs were found in *cis* and what their IDs are according to the genotype data that we have. - - The *VMR_index* column is a character vector. This column corresponds to the unique identifier of each VMR in our data set. It is important to **keep it as a character** and be careful with it not being converted to numeric, which can happen if you save the VMRs_df object as a table, read it, and use that second object in the rest of the pipeline. - - The columns *probes* and *SNP* contain **lists**. This structure is really important for the rest of the analysis and columns containing lists will keep appearing in other function outputs. If you want to know the recommended way to save and load these objects, please check the [ Frequently Asked Questions][]. +It is important to highlight that the columns *probes* and *SNP* contain **lists** as values. This structure is really important for the rest of the analysis, and columns containing lists will keep appearing in other function outputs. 
If you want to know the recommended way to save and load these objects, please check the [ Frequently Asked Questions][]. We can also explore the resulting object through plots such as the following: -```{r cissnps, fig.cap="Disribution of SNPs in cis of VMRs."} -VMRs_df %>% - dplyr::mutate(type = case_when(n_VMPs == 1 ~ "non canonical", #non canonical VMRs have only 1 probe by definition - TRUE ~ "canonical"), - surrounding_SNPs = case_when( surrounding_SNPs > 3000 ~ 3000, - TRUE ~ surrounding_SNPs)) %>% +```{r cissnps, fig.cap="Distribution of SNPs in cis of VML."} +VML_cis_snps %>% + dplyr::mutate(surrounding_SNPs = case_when( + surrounding_SNPs > 3000 ~ 3000, + TRUE ~ surrounding_SNPs + )) %>% ggplot2::ggplot(aes(x = surrounding_SNPs)) + ggplot2::geom_density() + ggplot2::facet_grid("type") + + ggplot2::xlab("Number of cis SNPs") + ggplot2::theme_classic() + +# Check the average number of cis SNPs in our VML data set +mean(VML_cis_snps$surrounding_SNPs) ``` ## Conduct variable selection on genome and exposome variables The following stage in the pipeline is to screen the available variables in our environmental and *cis* SNPs data sets to identify the potentially relevant ones. This is achieved with the `RAMEN::selectVariables()` function. This function uses a data-driven approach based on LASSO, which is an embedded variable selection method commonly used in machine learning. -In a nutshell, LASSO penalizes models that are more complex (i.e., that contain more variables) in favor of simpler models (i.e. that contain less variables), but not at the expense of reducing predictive power. 
Using LASSO's variable screening property (i.e., with high probability, the LASSO estimated model includes the substantial covariates and drops the redundant ones) this function is intended to select genotype and environmental variables in each Variable Methylated Region (VMR) with potential relevance in the presence of the user-specified concomitant variables (which are known DNAme confounders such as age, cell type proportion, etc.). For more information about the method, we encourage the users to read the documentation of the function, and for further information about LASSO we direct readers to Bühlmann and Van de Geer, 2011. +In a nutshell, LASSO penalizes models that are more complex (i.e., that contain more variables) in favor of simpler models (i.e. that contain less variables), but not at the expense of reducing predictive power. Using LASSO's variable screening property (i.e., with high probability, the LASSO estimated model includes the substantial covariates and drops the redundant ones) this function is intended to select genotype and environmental variables in each VML with potential relevance in the presence of the user-specified concomitant variables (which are known DNAme confounders such as age, cell type proportion, etc.). For more information about the method, we encourage the users to read the documentation of the function, and for further information about LASSO we direct readers to Bühlmann and Van de Geer, 2011. 
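The screening idea described above can be sketched with the glmnet package, one common LASSO implementation. This is an assumption made for illustration only: the source does not state which implementation or settings `RAMEN::selectVariables()` uses, and all object names below (`covariates`, `candidates`, `dname_level`) are hypothetical toy data. The sketch leaves the concomitant variables unpenalized, so they always stay in the model, and applies the LASSO penalty only to the candidate variables.

```r
library(glmnet)

set.seed(1)
n <- 100

# Hypothetical toy data: 2 concomitant variables and 20 candidate variables
covariates <- matrix(rnorm(n * 2), n, 2,
                     dimnames = list(NULL, c("age", "cell_prop")))
candidates <- matrix(rnorm(n * 20), n, 20,
                     dimnames = list(NULL, paste0("var", 1:20)))

# Simulated DNAme level that depends on one covariate and two candidates
dname_level <- 0.5 * covariates[, "age"] +
  1.0 * candidates[, "var1"] - 0.8 * candidates[, "var2"] +
  rnorm(n, sd = 0.5)

x <- cbind(covariates, candidates)

# penalty.factor = 0 keeps the covariates in the model without shrinkage;
# the candidate variables get the usual LASSO penalty (factor 1)
penalty <- c(rep(0, ncol(covariates)), rep(1, ncol(candidates)))

# alpha = 1 requests the LASSO penalty; lambda is chosen by cross-validation
cv_fit <- cv.glmnet(x, dname_level, alpha = 1, penalty.factor = penalty)

# Candidate variables with a non-zero coefficient at the selected lambda
coefs <- as.matrix(coef(cv_fit, s = "lambda.min"))[, 1]
selected <- setdiff(names(coefs)[coefs != 0],
                    c("(Intercept)", colnames(covariates)))
selected
```

Because cross-validation folds are assigned at random, the selected set can change between runs unless a seed is set.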
-Overall, conducting our variable selection strategy reduces the overall computational time and improves the modeling performance by: +Overall, conducting our variable selection strategy reduces the downstream computational time and improves the modeling performance by: - - Reducing the universe of variables that will be used to fit models in the following stage (G/E/G+E/GxE model fitting and comparison) + - Reducing the space of variables that will be used to fit models in the following stage (G/E/G+E/GxE model fitting and comparison) - Removing redundant variables, which are highly expected in genetic and environmental data sets with a high number of variables - Limiting the interaction terms to scenarios where both the G and E main effects were selected to be potentially relevant, which can be thought of as an interaction variable selection using a weak hierarchy - Using LASSO, a method with good variable selection performance and scalability -Please make sure that your data has no NAs, since the LASSO implementation we use in RAMEN does not support missing values. If your data has missing values, consider [handling](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/) them. +Please make sure that your data has no NAs, since the LASSO implementation we use in RAMEN does not support missing values, and that all values are numeric. If your data has missing values, consider [handling](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/) them. 
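A quick pre-flight check along these lines can catch both problems before the variable selection step. The helper `check_lasso_input()` and the `toy_env` matrix are hypothetical and are not part of RAMEN; this is only one possible way to verify the two requirements mentioned above.

```r
# Hypothetical helper: stop early if a matrix is not numeric or has NAs
check_lasso_input <- function(m, name = "input") {
  if (!is.numeric(m)) {
    stop(name, " must be numeric", call. = FALSE)
  }
  if (anyNA(m)) {
    stop(name, " contains ", sum(is.na(m)), " missing value(s)", call. = FALSE)
  }
  invisible(TRUE)
}

# Hypothetical toy environmental matrix with one missing value
toy_env <- matrix(
  c(1, 0, 2, NA, 1, 1),
  nrow = 3,
  dimnames = list(paste0("ind", 1:3), c("exposure_a", "exposure_b"))
)

# tryCatch() is used only so the example returns the message instead of stopping
tryCatch(check_lasso_input(toy_env, "toy_env"),
         error = function(e) conditionMessage(e))

# After dropping the incomplete row, the check passes
toy_env_complete <- toy_env[complete.cases(toy_env), , drop = FALSE]
check_lasso_input(toy_env_complete, "toy_env_complete")
```

Dropping incomplete rows is used here only to keep the example short; whether removal or imputation is appropriate depends on the data set.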
```{r} -selected_variables = RAMEN::selectVariables( - VMRs_df = VMRs_df, +selected_variables <- RAMEN::selectVariables( + VML_df = VML_cis_snps, genotype_matrix = RAMEN::test_genotype_matrix, - environmental_matrix= RAMEN::test_environmental_matrix, + environmental_matrix = RAMEN::test_environmental_matrix, covariates = RAMEN::test_covariates, - summarized_methyl_VMR = summarized_methyl_VMR, + summarized_methyl_VML = summarized_methyl_VML, seed = 1 ) ``` Since LASSO makes use of Random Number Generation, setting a seed is highly encouraged for result's reproducibility using the *seed* argument. As a note, setting a seed inside of this function modifies the seed globally (which is R's default behavior). -The output of `RAMEN::selectVariables()` is an object with the VMR index, and the G and E variables selected for each VMR. +The output of `RAMEN::selectVariables()` is an object with the VML index, and the G and E variables selected for each VML. ```{r} dplyr::glimpse(selected_variables) ``` -We can see how using `RAMEN::selectVariables()` reduces the number of variables (originally 100 environmental variables and 785 SNPs per VMR on average as seen in Figure \@ref(fig:cissnps)). +We can see how using `RAMEN::selectVariables()` reduces the number of variables (originally 100 environmental variables and `r mean(VML_cis_snps$surrounding_SNPs)` SNPs per VML on average as seen in Figure \@ref(fig:cissnps)). 
```{r selectedvars, fig.cap="Number of G and E selected variables."} -selected_variables %>% - dplyr::left_join(VMRs_df %>% - select(c(VMR_index,n_VMPs)), - by = "VMR_index") %>% - dplyr::transmute(VMR_index = VMR_index, - VMR_type = case_when(n_VMPs > 1 ~ "canonical", - n_VMPs == 1 ~ "non canonical"), - genotype = lengths(selected_genot), - environment = lengths(selected_env)) %>% - tidyr::pivot_longer(-c(VMR_index, VMR_type)) %>% - dplyr::rename(group = name, - variables = value) %>% - ggplot2::ggplot(aes(x = VMR_type, y = variables)) + - ggplot2::geom_violin() + - ggplot2::geom_boxplot(width=0.1, outlier.shape=NA) + - ggplot2::facet_wrap(~group)+ - ggplot2::ggtitle("Selected variables") + +selected_variables %>% + dplyr::left_join( + VML_cis_snps %>% + select(c(VML_index, type)), + by = "VML_index" + ) %>% + dplyr::transmute( + VML_index = VML_index, + type = type, + Genome = lengths(selected_genot), + Exposome = lengths(selected_env) + ) %>% + tidyr::pivot_longer(-c(VML_index, type)) %>% + dplyr::rename( + group = name, + variables = value + ) %>% + ggplot2::ggplot(aes(x = type, y = variables)) + + ggplot2::geom_violin() + + ggplot2::geom_boxplot(width = 0.1, outlier.shape = NA) + + ggplot2::facet_wrap(~group) + + ggplot2::ggtitle("Selected variables") + ggplot2::theme_classic() ``` -It is also expected in real data to have VMRs where no SNP and/or no environmental variables were selected, since not all the DNAme sites in the genome are expected to show an association with the genetic variation or environmental exposures data sets that are captured in a study. The proportion of VMRs under these scenarios will depend on the data sets. +It is also expected in real data to have VML where no SNP and/or no environmental variables were selected, since not all the DNAme sites in the genome are expected to show an association with the genetic variation or environmental exposures data sets that are captured in a study. 
The proportion of VML under these scenarios will depend on the data sets. ### Author's note about variables interpretation -LASSO variable selection is not consistent when there is multicollinearity in the data (i.e., correlation between variables), which is expected due to the high amount of G and E variables that are present in studies of this kind. This means that if you were to run LASSO several times, and two variables were to be highly correlated, the method would select one and drop the other one at random. This is not a problem with the pipeline because the main conclusion per VMR is whether the DNAme is better explained by G and/or E components. As an example, if a VMR is better explained by SNP1 and SNP2, which are both highly correlated one with the other, LASSO will randomly pick SNP1 OR SNP2 (because they are relevant but they provide redundant information); if we were to fit a model with SNP1 or SNP2 in the following stage, the winning model would still be G. In other words, the main goal of the pipeline is to know whether the VMR's DNAme is better explained by G and/or E. The user is therefore warned to **be cautious not to over-interpret the individual selected variables**. Selected variables might be used as hypothesis generators of associations, keeping in mind that the selected variable might be representing other variables in the data set that provide similar information. +LASSO variable selection is not consistent when there is multicollinearity in the data (i.e., correlation between variables), which is expected due to the high amount of G and E variables that are present in studies of this kind. This means that if you were to run LASSO several times, and two variables were to be highly correlated, the method would select one and drop the other one at random. This is not a problem with the pipeline because the main conclusion per VML is whether the DNAme is better explained by G and/or E components. 
As an example, if a VML is better explained by SNP1 and SNP2, which are both highly correlated one with the other, LASSO will randomly pick SNP1 OR SNP2 (because they are relevant but they provide redundant information); if we were to fit a model with SNP1 or SNP2 in the following stage, the winning model would still be G. In other words, the main goal of the pipeline is to know whether the VML's DNAme is better explained by G and/or E. The user is therefore warned to **be cautious not to over-interpret the individual selected variables**. Selected variables might be used as hypothesis generators of associations, keeping in mind that the selected variable might be representing other variables in the data set that provide similar information. + +## Identify the best explanatory model (G/E/G+E/GxE) per VML -## Identify the best explanatory model (G/E/G+E/GxE) per VMR +### Fit and compare the models and select the best one -Now that we have selected the list of potentially relevant G and E variables, we will fit the models mentioned in Table \@ref(tab:modelstable) using the `RAMEN::lmGE()` function. This function fits, for each VMR, G and E models with all of the variables selected, as well as all their possible pairwise combinations of G+E and GxE. +Now that we have selected the list of potentially relevant G and E variables, we will fit the models mentioned in Table \@ref(tab:modelstable) using the `RAMEN::lmGE()` function. This function fits, for each VML, G and E models with all of the variables selected, as well as all their possible pairwise combinations of G+E and GxE. After fitting this model, the best model per group (group = G, E, G+E or GxE) is selected using Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). We recommend using AIC because BIC assumes that the true model is in the set of compared models. 
Since this function fits models with individual variables, and we assume that DNAme variability is more likely to be influenced by more than one single SNP/environmental exposure at a time, we hypothesize that in most cases, the true model will not be in the set of compared models. Also, AIC excels in situations where all models in the model space are "incorrect", and AIC is preferentially used in cases where the true underlying function is unknown and our selected model could belong to a very large class of functions where the relationship could be pretty complex. It is worth mentioning, however, that both metrics tend to pick the same model in a large number of scenarios. We suggest that users read Arijit Chakrabarti & Jayanta K. Ghosh, 2011 for further information about the difference between these metrics. After selecting the best model per group (G, E, G+E or GxE), the model with the lowest AIC or BIC will be declared as the winning model. Additionally, `RAMEN::lmGE()` conducts a variance decomposition analysis, so that the relative R2 contribution of each of the variables of interest (G, E and GxE) is reported. This decomposition is done using the `r CRANpkg("relaimpo")` R package, using the Lindeman, Merenda and Gold (lmg) method, which is based on the heuristic approach of averaging the relative R2 contribution of each variable over all input orders in the linear model. 
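The comparison described above can be illustrated with base R for a single SNP and a single exposure. This is a simplified sketch on hypothetical simulated data (`toy`, `snp`, `exposure`, `age`, `dname`), not the `RAMEN::lmGE()` implementation: it skips the search over all pairwise combinations of selected variables and the lmg variance decomposition, and only shows how the four model groups are fitted and ranked by AIC.

```r
set.seed(1)
n <- 100

# Hypothetical toy data: one SNP (0/1/2 allele counts), one exposure, one covariate
toy <- data.frame(
  snp = sample(0:2, n, replace = TRUE),
  exposure = rnorm(n),
  age = rnorm(n)
)

# Simulated DNAme driven by a SNP-by-exposure interaction
toy$dname <- 0.2 * toy$age + 1.5 * toy$snp * toy$exposure + rnorm(n, sd = 0.1)

# The covariate (age) is present in every model, including the basal one
models <- list(
  "G"   = lm(dname ~ age + snp, data = toy),
  "E"   = lm(dname ~ age + exposure, data = toy),
  "G+E" = lm(dname ~ age + snp + exposure, data = toy),
  "GxE" = lm(dname ~ age + snp * exposure, data = toy)
)
basal <- lm(dname ~ age, data = toy)

# Lower AIC means a better trade-off between fit and complexity
aic <- sapply(models, AIC)
winner <- names(which.min(aic))

# Increase in R2 of the winning model over the covariate-only model
r2_gain <- summary(models[[winner]])$r.squared - summary(basal)$r.squared

aic
winner
r2_gain
```

Replacing `AIC` with `BIC` in the `sapply()` call gives the BIC-based ranking for the same set of models.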
```{r} -lmge_res = RAMEN::lmGE( +lmge_res <- RAMEN::lmGE( selected_variables = selected_variables, - summarized_methyl_VMR = summarized_methyl_VMR, + summarized_methyl_VML = summarized_methyl_VML, genotype_matrix = RAMEN::test_genotype_matrix, environmental_matrix = RAMEN::test_environmental_matrix, covariates = RAMEN::test_covariates, @@ -297,14 +338,14 @@ dplyr::glimpse(lmge_res) ``` The output of `RAMEN::lmGE()` is a data frame with the following 13 columns: - - *VMR_index*: The index of the respective VMR - - *model_group*: The selected winning model (G, E, G+E or GxE) + - *VML_index*: The index of the respective VML + - *model_group*: The selected winning model (G, E, G+E or GxE). For the VML that had no variables selected and therefore no model could be fitted, this column will have "B" (baseline), which indicates that the best model was the basal one (i.e., no G or E variables improved the model since the variable selection stage); these VML will have NA in all of the following columns. - *variables*: The variable(s) that are present in the winning model (excluding the covariates, which are included in all the models) - *tot_r_squared*: total R squared of the winning model - *g_r_squared*: Estimated R2 allocated to the G component in the winning model, if applicable - *e_r_squared*: Estimated R2 allocated to the E in the winning model, if applicable. - *gxe_r_squared*: Estimated R2 allocated to the interaction in the winning model (GxE), if applicable. - - *AIC/BIC*: AIC or BIC metric from the best model in each VMR (depending on the option specified in the argument model_selection). + - *AIC/BIC*: AIC or BIC metric from the best model in each VML (depending on the option specified in the argument model_selection). - *second_winner*: The second group that possesses the next best model after the winning one (i.e., G, E, G+E or GxE). 
This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there are no other model groups to compare to. - *delta_aic/delta_bic*: The difference of AIC or BIC (depending on the option specified in the argument model_selection) of the winning model and the best model from the second_winner group (i.e., G, E, G+E or GxE). This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there are no other groups to compare to. - *delta_r_squared*: The R2 of the winning model - R2 of the second winner model. This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there are no other groups to compare to. @@ -313,43 +354,53 @@ The output of `RAMEN::lmGE()` is a data frame with the following 13 columns: ### Remove poor performing winning models -The core pipeline from the RAMEN package identifies the best explanatory model per VMR. However, despite these models being winners in comparison to models including any other G/E variable(s) in the dataset, some winning models might perform no better than what we would expect by chance. Therefore, The last step of the pipeline after identifying the winning models per VMR is to compute a null distribution to remove the winning models that are very likely to be winners by chance. To do so, we use `RAMEN::nullDistGE()`. +The core pipeline from the RAMEN package identifies the best explanatory model per VML. However, despite these models being winners in comparison to models including any other G/E variable(s) in the dataset, some winning models might perform no better than what we would expect by chance. Therefore, the last step of the pipeline is to compute a null distribution to remove the best models that are likely to be so by chance. To do so, we use `RAMEN::nullDistGE()`. 
-The goal of `RAMEN::nullDistGE()` is to create a distribution of increase in R2 after including the SNP/E/SNP*E variables under the null hypothesis of G and E having no associations with DNAme. The null distribution is obtained through shuffling the G and E variables in a given dataset and conducting the variable selection and G/E model selection. That way, we can simulate how much additional variance would be explained by the models defined as winners by the RAMEN methodology in a scenario where the G and E associations with DNAme are randomized. This distribution can be then used to filter out winning models in the non-shuffled dataset that do not add more to the explained variance of the basal model than what randomized data do. +The goal of `RAMEN::nullDistGE()` is to create a distribution of how much the R2 increases when we include the SNP, Environmental Exposure (EE) or SNPxEE variables **when G and E have no associations with DNAme.** This distribution, which describes what we expect when there is no effect (null distribution), is obtained by shuffling the G and E variables in a given dataset and then conducting the variable selection and G/E model selection. That way, we can simulate how much additional variance would be explained by the models defined as winners by the RAMEN methodology in a scenario where the G and E associations with DNAme are pure noise. This distribution can then be used to filter out winning models in our original dataset that do not add more to the explained variance of the basal model than what shuffled data do. -For clarification, please note that in this vignette when we refer to SNP*E, we are referring to the interaction term that is present in the the interaction model (i.e. interaction variable in the GxE model). +For clarification, please note that in this vignette when we refer to SNPxEE, we are referring to the interaction term that is present in the interaction model (i.e. interaction variable in the GxE model). 
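The shuffling idea can be illustrated for a single locus with base R. This is a deliberately simplified sketch and not how `RAMEN::nullDistGE()` is implemented: it uses hypothetical simulated data (`toy`), skips the variable selection and model selection stages, fits only one joint model, and assumes that G and E are permuted together across individuals, which is an assumption of this example.

```r
set.seed(1)
n <- 100

# Hypothetical toy data with no real G or E effect on DNAme
toy <- data.frame(
  snp = sample(0:2, n, replace = TRUE),
  exposure = rnorm(n),
  age = rnorm(n)
)
toy$dname <- 0.2 * toy$age + rnorm(n)

# R2 gained over the covariate-only model after adding shuffled G and E terms
shuffled_r2_gain <- function(data) {
  # Permute individuals for G and E, breaking their link to DNAme and covariates
  idx <- sample(nrow(data))
  data$snp <- data$snp[idx]
  data$exposure <- data$exposure[idx]
  basal <- lm(dname ~ age, data = data)
  joint <- lm(dname ~ age + snp * exposure, data = data)
  summary(joint)$r.squared - summary(basal)$r.squared
}

# Repeat the shuffling to build a small null distribution of R2 gains
null_r2 <- replicate(200, shuffled_r2_gain(toy))

# Example threshold: 95th percentile of the simulated null distribution
cutoff <- quantile(null_r2, 0.95)
cutoff
```

Even with pure noise the R2 gain is never negative, because adding terms to a linear model cannot lower its R2; this is why an empirical threshold is needed rather than a comparison against zero.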
-Under the assumption that after adjusting for the concomitant variables all VMRs across the genome follow the same behavior regarding an increment of explained variance with randomized G and E data, we can pool the delta R squared values from all VMRs to create a null distribution taking advantage of the high number of VMRs in the dataset. This assumption decreases significantly the number of permutations required to create a null distribution and reduces the computational time. For further information please read the RAMEN paper (in preparation). +Under the assumption that after adjusting for the concomitant variables all VML across the genome share a minimum increment of explained variance, we can pool the delta R squared values from all VML to create a null distribution taking advantage of the high number of VML in the dataset. This assumption significantly decreases the number of permutations required to create a null distribution and reduces the computational time. For further information on how this is done please read the RAMEN paper (Navarro-Delgado EI *et al.*, 2025). `RAMEN::nullDistGE()` shuffles the G and E variables in the dataset and runs findVML, selectVariables() and lmGE(). This is repeated as many times as indicated in the *permutations* parameter. The number of permutations that we recommend depends on the size of your VML data set. We recommend running as many permutations as needed to obtain ~300k informative observations in total (i.e., excluding VML whose best model was Baseline (B)). Since many VML will be labelled as B during the selectVariables() stage, we recommend using the following formula: -`RAMEN::nullDistGE()` simulates the delta R squared distribution under the null hypothesis of G and E having no association with DNA methylation (DNAme) variability through a permutation analysis. 
To do so, this function shuffles the G and E variables in the dataset, which is followed by a the variable selection and modelling steps with selectVariables() and lmGE().These steps are repeated several times as indicated in the *permutations* parameter. In other words, by using shuffled G and E data, we simulate the increase of R2 that would be observed in random data using the RAMEN methodology. +```{r} +vml_size <- nrow(VML_cis_snps) # Number of VML in your data set +desired_obs <- 400000 # We want ~300k informative observations, and we are adding +# 100k extra to account for the VML that will be labelled as Basal (B) during +# selectVariables() +(permutations <- ceiling(desired_obs / vml_size)) +``` + +Since this is a toy example, for demonstration purposes we will run only 2 permutations (this is the most time consuming part of the analysis!). But please make sure to run the recommended number of permutations in your real data analysis. ```{r} -# Compute the null distribution -null_dist = RAMEN::nullDistGE( - VMRs_df = VMRs_df, +# Compute the null distribution +null_dist <- RAMEN::nullDistGE( + VML_df = VML_cis_snps, genotype_matrix = RAMEN::test_genotype_matrix, environmental_matrix = RAMEN::test_environmental_matrix, - summarized_methyl_VMR = summarized_methyl_VMR, - permutations = 5, + summarized_methyl_VML = summarized_methyl_VML, + permutations = 2, covariates = RAMEN::test_covariates, seed = 1, model_selection = "AIC" ) -#Take a look at the object +# Take a look at the object head(null_dist) ``` -The output is a data frame where the most useful column for our purpose is *R2_difference*, which corresponds to the increase in R squared obtained by including the G/E variable(s) from the winning model (i.e., the R squared difference between the winning model and the model only with the concomitant variables specified in covariates; tot_r_squared - basal_rsquared in the lmGE output) +The output is a data frame where the most useful column for our purpose is 
*R2_difference*, which corresponds to the increase in R squared obtained by including the SNP/EE variable(s) from the best explanatory model (i.e., the R squared difference between the chosen final model and the model only with the concomitant variables specified in covariates; tot_r_squared - basal_rsquared in the lmGE output). We recommend using two different thresholds for the winning models depending on whether they are marginal (G or E) or joint models (G+E or GxE). The reason for this is that they have different R2_difference distributions. E and G models have a lower mean R2_difference because they have a single shuffled term in the model (SNP or E). In comparison, joint models have a higher mean R2_difference because they have two or three shuffled terms (SNP, E and SNP*E), which just by chance increases their probability of having a higher R2_difference. ```{r, fig.cap = "R2 difference (winner - basal) in a shuffled data set."} # See the distribution of R2_difference across different winning models -null_dist %>% +null_dist %>% + tidyr::drop_na() %>% # Remove Basal models from the results, where there is no difference between chosen model and basal model ggplot2::ggplot(aes(x = R2_difference)) + - ggplot2::geom_histogram() + + ggplot2::geom_histogram() + ggplot2::facet_grid("model_group") + + ggplot2::xlab("R2 difference") + ggplot2::theme_classic() ``` @@ -357,73 +408,88 @@ We suggest using the 95th percentile of those distributions as a threshold to re ```{r} # Get a cutoff of the 95th percentile of the null distribution for single and joint models -cutoff_single = quantile(null_dist %>% - filter(model_group %in% c("G","E")) %>% - pull(R2_difference), - 0.95) -cutoff_joint = quantile(null_dist %>% - filter(model_group %in% c("G+E","GxE")) %>% - pull(R2_difference), - 0.95) - -#Filter out bad performing winning models. 
-filtered_res = lmge_res %>% - dplyr::mutate(r2_difference_basal = tot_r_squared - basal_rsquared, - pass_cutoff_threshold = case_when(model_group %in% c("G", "E") ~ r2_difference_basal > cutoff_single, - model_group %in% c("G+E", "GxE") ~ r2_difference_basal > cutoff_joint)) %>% - dplyr::filter(pass_cutoff_threshold) %>% #Filter based on the cutoff threshold - dplyr::select(-pass_cutoff_threshold) #Drop temporary column - -#Check the final results +cutoff_single <- quantile( + null_dist %>% + filter(model_group %in% c("G", "E")) %>% + pull(R2_difference), + 0.95 +) +cutoff_joint <- quantile( + null_dist %>% + filter(model_group %in% c("G+E", "GxE")) %>% + pull(R2_difference), + 0.95 +) + +# Get a data frame with the final results +final_res <- lmge_res %>% + dplyr::mutate( + r2_difference_basal = tot_r_squared - basal_rsquared, + # Label if the best explanatory model passes its corresponding threshold + pass_cutoff_threshold = case_when( + model_group %in% c("G", "E") ~ r2_difference_basal > cutoff_single, + model_group %in% c("G+E", "GxE") ~ r2_difference_basal > cutoff_joint + ), + # Label the final model group, replacing bad performing winning models with "B" (basal) + model_group = case_when( + pass_cutoff_threshold ~ model_group, + TRUE ~ "B" + ) + ) %>% + dplyr::select(-pass_cutoff_threshold) # Drop temporary column + +# Keep only VML that have informative models with our data +filtered_res <- final_res %>% + dplyr::filter(model_group != "B") # Drop VML whose final model is basal (B) + +# Check the VML with informative models dplyr::glimpse(filtered_res) ``` -We can see that the final data set in this example dropped almost all of the VMRs we had. This is something expected since we are working with toy data coming from random sampling. 
- -We recommend the users of the package to include the number of VMRs where we could not find a conclusive best model in the final results report (either because no variables were selected with `RAMEN::selectVariables()` or because they did not pass the R2_difference threshold obtained with `RAMEN::nullDistGE()`). +We can see that the final data set in this example dropped almost all of the VML we had (only ```r nrow(filtered_res)```/```r nrow(lmge_res)``` survived!). This is something expected (and desired) since we are working with toy data coming from random sampling, so we should end up with almost no good-performing chosen models. -```{r finalresults, fig.cap="Variable Methylated Regions best explanatory models"} -# Include the VMRs with no winning model in the final results object -final_res = VMRs_df %>% - dplyr::left_join(filtered_res, - by = "VMR_index") +We recommend that users of the package report, in the final results, the number of VML with Basal models (i.e. those where we could not find a conclusive best model, either because no variables were selected with `RAMEN::selectVariables()` or because they did not pass the R2_difference threshold obtained with `RAMEN::nullDistGE()`). 
+```{r finalresults, fig.cap="Variable Methylated Loci best explanatory models"} # Plot final results -final_res %>% - dplyr::group_by(model_group) %>% - dplyr::summarise(count = n()) %>% +final_res %>% + dplyr::group_by(model_group) %>% + dplyr::summarise(count = n()) %>% ggplot2::ggplot(aes(x = model_group, y = count)) + ggplot2::geom_col() + ggplot2::xlab("Best explanatory model") + - ggplot2::ylab("VMRs") + - ggplot2::theme_classic() + ggplot2::ylab("VML") + + ggplot2::theme_classic() + +table(final_res$model_group) ``` So, we can see that for this toy example, we got the following results: - - VMRs better explained by a G model: 2 - - VMRs better explained by a E model: 1 - - VMRs better explained by a G+E model: 1 - - VMRs better explained by a GxE model: 3 - - VMRs with no conclusive explanatory model: 124 + - VML better explained by a G model: 2 + - VML better explained by a E model: 1 + - VML better explained by a G+E model: 1 + - VML better explained by a GxE model: 3 + - VML with no conclusive explanatory model: 111 +And that's it! We finished the tutorial. Now go grab some yummy food, we deserve it! ### Author's note about model interpretation -For model simplicity, each winning model have a single E and/or G variable (and its interaction term when applicable). That means that in a scenario where a given VMR is under the influence of 2 or more Es or Gs, only the one that better explains the VMR's DNAme alone will be selected. In other words, if a VMR in reality is influenced by e.g. folate intake and smoking, and we have information about both environmental exposures, the best model (E) will have only folate intake (in case that is the variable that better explains DNAme variability in that region alone). So, interpreting this as the VMR not being potentially under the influence of smoking might not be correct. We recommend the user to check the total R2 of the winning model to explore the remaining variance that is not explained by the winning model. 
+For model simplicity, each winning model has a single EE and/or SNP variable (and its interaction term when applicable). That means that in a scenario where a given VML is under the influence of 2 or more EEs or SNPs, only the one that better explains the VML's DNAme alone will be selected. In other words, if a VMR in reality is influenced by e.g. folate intake and smoking, and we have information about both environmental exposures, the best model (E) will have only folate intake (in case that is the variable that better explains DNAme variability in that region alone). So, interpreting this as the VMR not being potentially under the influence of smoking might not be correct. We recommend the user to check the total R2 of the winning model to explore the remaining variance that is not explained by the winning model. -We also stress that **interpretation of individual variables should be done with caution and used only as exploration and research hypothesis generation**. Please see [ Author's note about variables interpretation ][]) where we advice against over-interpretation of the selected variables; the same logic applies to the variables present in the winning models. +We also stress that **interpretation of individual variables should be done with caution and used as exploration and research hypothesis generation**. Please see [ Author's note about variables interpretation ][], where we advise against over-interpretation of the selected variables; the same logic applies to the variables present in the winning models. 
# Variations to the standard workflow Besides using RAMEN for completing the analysis mentioned above, the outputs of the package's function can help users in other DNAme analyses, such as: - - Reduction of tests prior to an EWAS or differential methylation analysis with `RAMEN::findVMRs()`(i.e., conducting the analysis on identified VMRs which 1) reduces redundant tests by grouping nearby correlated CpGs, and 2) avoids tests in non-variant regions). This can help to reduce the multiple hypothesis testing burden. - - Summarize a DNAme region of interest with `RAMEN::summarizeVMRs()` + - Reduction of tests prior to an EWAS or differential methylation analysis with `RAMEN::findVML()` (i.e., conducting the analysis on identified VML, which 1) reduces redundant tests by grouping nearby correlated CpGs, and 2) avoids tests in non-variant regions). This can help to reduce the multiple hypothesis testing burden. + - Summarize a DNAme region of interest with `RAMEN::summarizeVML()` - Easily conduct variable selection in high-dimensional data sets to identify potentially relevant variables from one or two independent data sets with `RAMEN::selectVariables()`. - - Fit additive and interaction models given two sets of variables of interest (not limited to G and E) and select the best explanatory model for DNAme data with `RAMEN::selectVariables()` and `RAMEN::lmGE()` (e.g. exploring the interaction between two environmental dimensions and their contribution to DNAme variability). - - Quickly identify SNPs in *cis* of CpG probes with `RAMEN::findCisSNPs()` (e.g. for cis mQTL analyses) - - Get the median correlation of probes in regions of interest (with `RAMEN::medCorVMR()`). + - Fit additive and interaction models given two sets of variables of interest (not limited to G and E) and select the best explanatory model for DNAme data with `RAMEN::selectVariables()` and `RAMEN::lmGE()` (e.g.
exploring the interaction between two environmental dimensions and their contribution to DNAme variability, or epistasis effects). + - Quickly identify SNPs in *cis* of CpG probes with `RAMEN::findCisSNPs()` + - Get the median correlation of probes in regions of interest (with `RAMEN::medCorVML()`). # Frequently Asked Questions @@ -438,10 +504,12 @@ Saving the data frames produced by RAMEN might seem difficult because it has lis data.table::fwrite(selected_variables, file = "path/selected_variables.csv") # Read the csv file and make lists the elements in the required columns -selected_variables = fread("path/selected_variables.csv", data.table = FALSE) %>% - mutate(selected_genot = str_split(selected_genot, pattern = "\\|"), # fwrite saves lists as strings separated by | - selected_env =str_split(selected_env, pattern = "\\|"), - VMR_index = as.character(VMR_index)) +selected_variables <- fread("path/selected_variables.csv", data.table = FALSE) %>% + mutate( + selected_genot = str_split(selected_genot, pattern = "\\|"), # fwrite saves lists as strings separated by |, so we need to split them + selected_env = str_split(selected_env, pattern = "\\|"), + VMR_index = as.character(VMR_index) + ) ``` 2.
Save files as .rds @@ -450,9 +518,8 @@ selected_variables = fread("path/selected_variables.csv", data.table = FALSE) %> # Example for saving the selected_variables object saveRDS(selected_variables, file = "path/selected_variables.Rds") -#Load the object +# Load the object readRDS(file = "path/selected_variables.Rds") - ``` # Session info diff --git a/vignettes/RAMEN_method.png b/vignettes/RAMEN_method.png index 4110f88..823fb02 100644 Binary files a/vignettes/RAMEN_method.png and b/vignettes/RAMEN_method.png differ diff --git a/vignettes/RAMEN_pipeline.png b/vignettes/RAMEN_pipeline.png index 04e02c6..91260dc 100644 Binary files a/vignettes/RAMEN_pipeline.png and b/vignettes/RAMEN_pipeline.png differ diff --git a/vignettes/hvps_mad_var.png b/vignettes/hvps_mad_var.png new file mode 100644 index 0000000..ed6b1f4 Binary files /dev/null and b/vignettes/hvps_mad_var.png differ diff --git a/vignettes/mad_var.png b/vignettes/mad_var.png new file mode 100644 index 0000000..dbd8350 Binary files /dev/null and b/vignettes/mad_var.png differ