---
title: "Variable selection and estimation using pointers"
author: "Lino Galiana"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using pointers with OpenCancer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This vignette explores how the `bigmemory` family of packages can be used to work efficiently with Epidemium data. **For now it is based on a sample of the real data, so the results presented here should not be relied upon.**
We assume the data have been imported with the `OpenCancer` package (see the dedicated vignette) and that a csv file has been stored in an `inst` subdirectory of the current working directory. An example file is available in `/vignettes/inst`, in the package directory. This example dataset illustrates the benefit of working with C++ pointers rather than with dataframes loaded into R memory (hence into RAM).
```{r loadpackage, message=FALSE,warning=FALSE}
library(OpenCancer)
datadir <- if (stringr::str_detect(getwd(),"/vignettes")) paste0(getwd(),"/inst") else paste0(getwd(),"/vignettes/inst")
```
# Variable selection by LASSO
```{r, include = TRUE,message = F, warning=F, eval = F}
url <- "https://github.com/EpidemiumOpenCancer/OpenCancer/raw/master/vignettes/inst/exampledf.csv"
download.file(url,destfile = paste0(datadir,"/exampledf.csv"))
```
```{r readbigmatrix, include = TRUE,message = F, warning=F, eval = T}
X <- bigmemory::read.big.matrix(paste0(datadir,"/exampledf.csv"), header = TRUE)
```
The matrix is not loaded into R memory. `X` is a pointer to a C++ object stored outside R, which the `bigmemory` package makes possible. As with any `big.matrix` object, the content of `X` can still be accessed by importing it into RAM, for instance with `X[, ]`. Working with pointers is a huge advantage in terms of memory:
```{r, include = FALSE, eval=FALSE}
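# mem_used() takes no argument: it reports the total memory used by R objects,
# which stays small because the content of X is not held in R memory
pryr::mem_used()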
```
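To give a rough idea of the memory gain, the sketch below compares the R-side footprint of an ordinary matrix with that of the same data held as a `big.matrix`. The simulated matrix `m` is purely illustrative and is not the Epidemium sample. Note that `object.size()` only measures what lives in R memory, so the data behind the pointer are not counted.

```r
library(bigmemory)

# Illustrative data (an assumption, not the Epidemium sample):
# a 100,000 x 10 numeric matrix held in R memory
set.seed(1)
m <- matrix(rnorm(1e6), ncol = 10)

# Same data stored outside R's heap; `bm` is only a pointer to it
bm <- as.big.matrix(m)

size_matrix  <- as.numeric(object.size(m))   # bytes used by the R matrix
size_pointer <- as.numeric(object.size(bm))  # bytes used by the pointer object

# Ratio of the two R-side footprints
size_matrix / size_pointer

# The content is only copied into RAM when explicitly requested
head(bm[, 1:3])
```

The pointer object stays the same size however large the underlying matrix is, which is why the gain grows with the data.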
The memory gain comes at a cost in terms of flexibility, since working with pointers requires C++ functions. However, several packages (mostly `biglasso` and `biganalytics`) make it possible to apply statistical functions to pointers.
The `big.simplelasso` function we created is designed to perform feature selection on an OpenCancer dataframe imported as a pointer. Assume the explained variable is called `'incidence'` (the default) and we want to run a 5-fold cross-validation:
```{r bigsimplelasso, message = F, warning=F}
pooledLASSO <- big.simplelasso(X, yvar = 'incidence',
                               labelvar = c("cancer", "age", "Country_Transco",
                                            "year", "area.x", "area.y"),
                               crossvalidation = TRUE,
                               nfolds = 5, returnplot = FALSE)
summary(pooledLASSO$model)
```
The `labelvar` argument excludes the listed variables from the set of features entering the LASSO.
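As a rough illustration of what excluding label variables amounts to, the sketch below drops the outcome and label columns from a small simulated `big.matrix` and then calls `biglasso::cv.biglasso` directly. It only conveys the general idea and is not the actual implementation of `big.simplelasso`. The data, the `X_demo` object and the column names are made up for the example.

```r
library(bigmemory)
library(biglasso)

# Simulated data (hypothetical): outcome, one label column, five features
set.seed(42)
n <- 300
feats <- matrix(rnorm(n * 5), ncol = 5,
                dimnames = list(NULL, paste0("x", 1:5)))
incidence <- 2 * feats[, "x1"] - 1.5 * feats[, "x2"] + rnorm(n)
year <- rep(2000:2009, length.out = n)
X_demo <- as.big.matrix(cbind(incidence = incidence, year = year, feats))

yvar <- "incidence"
labelvar <- c("year")

# Columns kept as candidate features: everything except outcome and labels
keep <- which(!colnames(X_demo) %in% c(yvar, labelvar))

# For this small demo the kept columns are copied through RAM
X_feat <- as.big.matrix(X_demo[, keep])

# 5-fold cross-validated LASSO on the remaining features
cvfit <- cv.biglasso(X_feat, X_demo[, yvar], nfolds = 5, seed = 1, ncores = 1)

# Coefficients at the cross-validated lambda
coef(cvfit)
```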
```{r, message = F, warning=F, fig.width = 8, fig.height=5}
plot(pooledLASSO$model)
```
In this case, out of `r nrow(pooledLASSO$model$fit$beta) - 1` candidate variables, the LASSO selects `r sum(pooledLASSO$coeff != 0)`.
Now suppose we want to perform feature selection separately for each age class. While a standard dataframe would allow `group_by + do` or `nest + mutate`, another method is needed for pointers. The `bigsplit` function from `bigtabulate` is useful here: it returns the row indices of each group. As an example, we keep only a few groups (elements 5 to 8 of the split):
```{r, message = F, warning=F}
library(foreach)  # provides foreach() and %do%

groupingvar <- c('age')
# Row indices of each age class (splitcol = NA returns indices, not values)
indices <- bigtabulate::bigsplit(X, groupingvar, splitcol = NA_real_)
indices <- indices[5:8]

# ESTIMATE ONE MODEL PER GROUP
# %do% evaluates the groups sequentially
model <- foreach(i = indices, .combine = 'list',
                 .multicombine = TRUE,
                 .maxcombine = nrow(X),
                 .errorhandling = 'pass',
                 .packages = c("bigmemory", "biglasso", "biganalytics",
                               'OpenCancer')) %do% {
  list(results = big.simplelasso(bigmemory::deepcopy(X, rows = i),
                                 yvar = 'incidence',
                                 labelvar = c("cancer", 'sex',
                                              "Country_Transco", "year",
                                              "area.x", "area.y"),
                                 crossvalidation = TRUE, nfolds = 5,
                                 returnplot = FALSE),
       indices = i)
}
```
The results are stored in a list, in the same order as the groups in `indices`.
```{r, message = F, warning=F}
summary(model[[1]]$results$model)
summary(model[[2]]$results$model)
summary(model[[3]]$results$model)
```
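The grouping step used in this section relies on `bigsplit` returning, for each level of the grouping variable, the row indices belonging to that level. The toy example below, built on made-up data (`X_toy` is hypothetical), shows that structure and how it can drive a group-wise computation without loading the whole matrix.

```r
library(bigmemory)
library(bigtabulate)

# Toy data (hypothetical): three age classes with four rows each
m <- cbind(age = rep(c(20, 40, 60), each = 4),
           incidence = c(1, 2, 3, 4, 10, 20, 30, 40, 5, 5, 5, 5))
X_toy <- as.big.matrix(m)

# With splitcol = NA, bigsplit returns row indices instead of values
idx <- bigsplit(X_toy, "age", splitcol = NA_real_)

# One element per age class, each holding the rows of that class
str(idx)

# Group-wise computation: only the rows of one group are read at a time
group_means <- sapply(idx, function(rows) mean(X_toy[rows, "incidence"]))
group_means
```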
# Feature selection and linear regression on selected features
`big.model.FElasso` performs feature selection on a `big.matrix` and returns a linear regression with selected features.
```{r, message = F, warning=F}
# POOLED OLS
pooledOLS <- big.model.FElasso(X, yvar = "incidence", returnplot = FALSE,
                               relabel = TRUE)
DTsummary.biglm(pooledOLS)$coefftab
DTsummary.biglm(pooledOLS)$modeltab
```
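The two-step logic (LASSO selection, then an unpenalised regression on the retained features) can be sketched with `biglasso` and `biganalytics` alone. This is a minimal illustration on simulated data, assuming the selection is made at the cross-validated lambda. It is not the code of `big.model.FElasso`, and `X_all`, `X_feat` and the variable names are hypothetical.

```r
library(bigmemory)
library(biglasso)
library(biganalytics)

# Simulated data (hypothetical): only x1 and x2 drive the outcome
set.seed(123)
n <- 500
feats <- matrix(rnorm(n * 5), ncol = 5,
                dimnames = list(NULL, paste0("x", 1:5)))
incidence <- 1 + 2 * feats[, "x1"] - 1.5 * feats[, "x2"] + rnorm(n)
X_all  <- as.big.matrix(cbind(incidence = incidence, feats))
X_feat <- as.big.matrix(feats)

# Step 1: LASSO with 5-fold cross-validation picks the features
cvfit <- cv.biglasso(X_feat, incidence, nfolds = 5, seed = 1, ncores = 1)
b <- as.matrix(coef(cvfit))
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")

# Step 2: unpenalised linear regression on the selected features only
fit <- biglm.big.matrix(reformulate(selected, response = "incidence"),
                        data = X_all)
summary(fit)
```

Refitting without the penalty removes the shrinkage that the LASSO applies to the retained coefficients.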
Regressions can also be run by group with the `groupingvar` argument. In that case, the result is a list with one element per group, each storing its fitted model in `results`:
```{r, message = F, warning=F, eval = F}
model <- big.model.FElasso(X, yvar = "incidence",
                           groupingvar = c('sex', 'age'),
                           returnplot = FALSE,
                           relabel = TRUE)
DTsummary.biglm(model[[38]]$results)$coefftab
DTsummary.biglm(model[[38]]$results)$modeltab
```