When generating a matrix of features for RIVER, how do the developers handle situations where no variant near a particular gene has a CADD annotation for features like TFBS or EncOCCombPVal? glmnet cannot handle NAs, but in my dataset 95% of genes have at least one missing feature annotation, so removing such cases would waste most of the data.
Ex:
| | cHmmTx | cHmmTssBiv | cHmmHet | cHmmBivFlnk | cHmmTxFlnk | TFBS | EncOCCombPVal |
|---|---|---|---|---|---|---|---|
| GTEX-111YS:ENSG00000007923 | 0.016 | 0 | 0 | 0 | 0.000 | NA | NA |
| GTEX-117YW:ENSG00000007923 | 0.000 | 0 | 0 | 0 | 0.000 | NA | NA |
| GTEX-1192X:ENSG00000007923 | 0.000 | 0 | 0 | 0 | 0.000 | NA | NA |
| GTEX-11EM3:ENSG00000007923 | 0.000 | 0 | 0 | 0 | 0.008 | NA | NA |
| GTEX-11EQ8:ENSG00000007923 | 0.000 | 0 | 0 | 0 | 0.000 | NA | NA |
| GTEX-11EQ9:ENSG00000007923 | 0.016 | 0 | 0 | 0 | 0.000 | NA | NA |