This repository was archived by the owner on Mar 23, 2021. It is now read-only.
51 changes: 29 additions & 22 deletions lessons/06_matching_reordering.md
@@ -10,34 +10,41 @@ Approximate time: 110 min

* Implement matching and re-ordering data within data structures.

## Matching data

Often when working with genomic data, we have a data file that corresponds with our metadata file. The data file contains measurements from the biological assay for each individual sample. In this case, the biological assay is gene expression and data was generated using RNA-Seq.

Let's read in our expression data (RPKM matrix) that we downloaded previously from https://raw.githubusercontent.com/hbc/NGS_Data_Analysis_Course/master/sessionII/data/counts.rpkm.csv:

```r
rpkm_data <- read.csv("data/counts.rpkm.csv")
```

If you don't have the metadata loaded, download it from https://github.com/hbc/NGS_Data_Analysis_Course/raw/master/sessionII/data/mouse_exp_design.csv and load it with:

```r
metadata <- read.csv("data/mouse_exp_design.csv")
```


Take a look at the first few lines of the data matrix to see what's in there.

```r
head(rpkm_data)
```

It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up.

```r
ncol(rpkm_data)
nrow(metadata)
```

What we want to know is, **do we have data for every sample that we have metadata for?**

## The `%in%` operator

Although it is lacking in [documentation](http://dr-k-lo.blogspot.com/2013/11/), this operator is widely used and convenient once you get the hang of it. The operator is used with the following syntax:

```r
vector1_of_values %in% vector2_of_values
```

@@ -49,7 +56,7 @@ It will take a vector as input to the left and will **evaluate each element to s

```r
A <- c(1,3,5,7,9,11) # odd numbers
B <- c(2,4,6,8,10,12) # even numbers

# test to see if each of the elements of A is in B
A %in% B
```

@@ -62,7 +69,7 @@ Since vector A contains only odd numbers and vector B contains only even numbers

```r
A <- c(1,3,5,7,9,11) # odd numbers
B <- c(2,4,6,8,1,5) # add some odd numbers in
```

@@ -74,7 +81,7 @@ A %in% B

```r
## [1] TRUE FALSE TRUE FALSE FALSE FALSE
```

The logical vector returned denotes which elements in `A` are also in `B` and which are not.

We saw previously that we could use the output from a logical expression to subset data by returning only the values corresponding to `TRUE`. Therefore, we can use the output logical vector to subset our data and return only those elements in `A` that are also in `B`:
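
A minimal sketch of that subsetting step, reusing the `A` and `B` vectors defined above. The variable name `in_both` is only illustrative and is not taken from the lesson:

```r
A <- c(1,3,5,7,9,11)   # odd numbers
B <- c(2,4,6,8,1,5)    # some odd numbers added in

in_both <- A %in% B    # logical vector, one value per element of A
A[in_both]             # keep only the elements of A where in_both is TRUE
## [1] 1 5
```

Storing the logical vector first makes it easy to inspect before it is used for subsetting; `A[A %in% B]` does the same thing in one step.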

@@ -116,7 +123,7 @@ Suppose we had **two vectors that had the same values but just not in the same o

```r
A <- c(10,20,30,40,50)
B <- c(50,40,30,20,10) # same numbers but backwards

# test to see if each element of A is in B
A %in% B
```

@@ -166,22 +173,22 @@ We have a list of IDs for marker genes of particular interest. We want to extrac
```r
important_genes <- c("ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619", "ENSMUSG00000047945", "ENSMUSG00000081010", "ENSMUSG00000030970")
```

2. Extract the rows containing the important genes from your `rpkm_data` dataset using the `%in%` operator.

3. **Extra Credit:** Using the `important_genes` vector, extract the rows containing the important genes from your `rpkm_data` dataset without using the `%in%` operator.

***

## Reordering data using indices
Indexing `[ ]` can be used to extract values from a dataset as we saw earlier, but we can also use it to rearrange our data values.

```r
teaching_team <- c("Mary", "Meeta", "Radhika")
```
![reordering](../img/teachin-team.png)

Remember that we can return values in a vector by specifying its position or index:

```r
teaching_team[c(2, 3)] # Extracting values from a vector
```

@@ -208,9 +215,9 @@ reorder_teach <- teaching_team[c(3, 1, 2)] # Saving the results to a variable

## The `match` function

Now that we know how to reorder using indices, we can use the `match()` function to match the values in two vectors. We'll be using it to evaluate which samples are present in both our counts and metadata dataframes, and then to re-order the columns in the counts matrix to match the row names in the metadata matrix.

`match()` takes at least 2 arguments:

1. a vector of values in the order you want
2. a vector of values to be reordered
@@ -226,13 +233,13 @@ second <- c("B","D","E","A","C") # same letters but different order
***How would you reorder `second` vector to match `first` using indices?***

If we had large datasets, it would be difficult to reorder them by searching for the indices of the matching elements. This is where the `match` function comes in really handy:

```r
match(first,second)
## [1] 4 1 5 2 3
```

The function should return a vector of size `length(first)`. Each number that is returned represents the index of the `second` vector where the matching value was observed.

Now, we can just use the indices to reorder the elements of the `second` vector to be in the same positions as the matching elements in the `first` vector:
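
As a sketch of this reordering step, the indices returned by `match()` can be saved and then used inside `[ ]`. The names `reorder_idx` and `second_reordered` follow the ones visible in this diff; the final `identical()` check is an added convenience, not something the lesson prescribes:

```r
first <- c("A","B","C","D","E")
second <- c("B","D","E","A","C")  # same letters but different order

reorder_idx <- match(first, second)      # positions in `second` of each value in `first`
second_reordered <- second[reorder_idx]  # reorder `second` using those positions
second_reordered
## [1] "A" "B" "C" "D" "E"

identical(second_reordered, first)       # confirm the two vectors now agree
## [1] TRUE
```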

@@ -248,7 +255,7 @@ second_reordered <- second[reorder_idx] # Reordering and saving the output to a

Now that we know how `match()` works, let's change vector `second` so that only a subset of the values is retained:

```r
first <- c("A","B","C","D","E")
second <- c("D","B","A") # remove values
```
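
A short sketch of what happens with these vectors: `match()` returns `NA` for every value of `first` that has no counterpart in `second`. The variable name `idx` and the `is.na()` filtering step are illustrative assumptions, shown here as one possible way to handle the unmatched values:

```r
first <- c("A","B","C","D","E")
second <- c("D","B","A")            # only a subset of the values in `first`

idx <- match(first, second)         # NA wherever a value of `first` is absent from `second`
idx
## [1]  3  2 NA  1 NA

second[idx]                         # indexing with NA yields NA at those positions
## [1] "A" "B" NA  "D" NA

second[idx[!is.na(idx)]]            # drop the NAs to keep only the matched values
## [1] "A" "B" "D"
```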
@@ -268,12 +275,12 @@ match(first,second)
### Reordering genomic data using `match()` function

Using the `match` function, we would now like to match the row names of our metadata to the column names of our expression data, so these will be the arguments for `match`. Using these two arguments, we will retrieve a vector of match indices. The resulting vector represents the re-ordering of the column names in our data matrix needed to make them identical to the row names in the metadata:

```r
rownames(metadata)

colnames(rpkm_data)

genomic_idx <- match(rownames(metadata), colnames(rpkm_data))
genomic_idx
```
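
To illustrate how such an index vector might then be applied, here is a sketch on small toy data frames. The objects `metadata_toy`, `rpkm_toy`, `toy_idx` and `rpkm_toy_ordered`, along with all of their values, are hypothetical stand-ins and not the lesson's real data:

```r
# Toy stand-ins for the lesson's objects (hypothetical values, not the real data)
metadata_toy <- data.frame(genotype = c("Wt", "KO", "Wt"),
                           row.names = c("sample1", "sample2", "sample3"))
rpkm_toy <- data.frame(sample3 = c(0.5, 2.1),
                       sample1 = c(1.3, 0.0),
                       sample2 = c(4.2, 3.3),
                       row.names = c("geneA", "geneB"))

# Position of each metadata row name among the expression column names
toy_idx <- match(rownames(metadata_toy), colnames(rpkm_toy))
toy_idx
## [1] 2 3 1

rpkm_toy_ordered <- rpkm_toy[, toy_idx]   # reorder columns by the match indices

# Every column name should now line up with the corresponding metadata row name
all(rownames(metadata_toy) == colnames(rpkm_toy_ordered))
## [1] TRUE
```

The same pattern should carry over to `rpkm_data` and `genomic_idx`, provided every metadata row name has a matching column (that is, `genomic_idx` contains no `NA` values).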