diff --git a/lessons/06_matching_reordering.md b/lessons/06_matching_reordering.md index 7cda074..65515e4 100644 --- a/lessons/06_matching_reordering.md +++ b/lessons/06_matching_reordering.md @@ -10,34 +10,41 @@ Approximate time: 110 min * Implement matching and re-ordering data within data structures. -## Matching data +## Matching data -Often when working with genomic data, we have a data file that corresponds with our metadata file. The data file contains measurements from the biological assay for each individual sample. In this case, the biological assay is gene expression and data was generated using RNA-Seq. +Often when working with genomic data, we have a data file that corresponds with our metadata file. The data file contains measurements from the biological assay for each individual sample. In this case, the biological assay is gene expression and data was generated using RNA-Seq. -Let's read in our expression data (RPKM matrix) that we downloaded previously: +Let's read in our expression data (RPKM matrix) that we downloaded previously from https://raw.githubusercontent.com/hbc/NGS_Data_Analysis_Course/master/sessionII/data/counts.rpkm.csv: ```r rpkm_data <- read.csv("data/counts.rpkm.csv") ``` +If you don't have the metadata loaded, download it from https://github.com/hbc/NGS_Data_Analysis_Course/raw/master/sessionII/data/mouse_exp_design.csv and load it with: + +```r +metadata <- read.csv("data/mouse_exp_design.csv") +``` + + Take a look at the first few lines of the data matrix to see what's in there. ```r head(rpkm_data) ``` -It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up. 
+It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order. We can do a quick check of the number of columns in the count data and the rows in the metadata and at least see if the numbers match up. ```r ncol(rpkm_data) nrow(metadata) ``` -What we want to know is, **do we have data for every sample that we have metadata?** +What we want to know is, **do we have data for every sample for which we have metadata?** ## The `%in%` operator - -Although lacking in [documentation](http://dr-k-lo.blogspot.com/2013/11/) this operator is well-used and convenient once you get the hang of it. The operator is used with the following syntax: + +Although lacking in [documentation](http://dr-k-lo.blogspot.com/2013/11/), this operator is widely used and convenient once you get the hang of it. The operator is used with the following syntax: ```r vector1_of_values %in% vector2_of_values @@ -49,7 +56,7 @@ It will take a vector as input to the left and will **evaluate each element to s A <- c(1,3,5,7,9,11) # odd numbers B <- c(2,4,6,8,10,12) # even numbers -# test to see if each of the elements of A is in B +# test to see if each of the elements of A is in B A %in% B ``` @@ -62,7 +69,7 @@ Since vector A contains only odd numbers and vector B contains only even numbers ```r A <- c(1,3,5,7,9,11) # odd numbers -B <- c(2,4,6,8,1,5) # add some odd numbers in +B <- c(2,4,6,8,1,5) # add some odd numbers in ``` ```r @@ -74,7 +81,7 @@ A %in% B ## [1] TRUE FALSE TRUE FALSE FALSE FALSE ``` -The logical vector returned denotes which elements in `A` are also in `B` and which are not. +The logical vector returned denotes which elements in `A` are also in `B` and which are not. We saw previously that we could use the output from a logical expression to subset data by returning only the values corresponding to `TRUE`. 
Therefore, we can use the output logical vector to subset our data, and return only those elements in `A`, which are also in `B` by returning only the TRUE values: @@ -116,7 +123,7 @@ Suppose we had **two vectors that had the same values but just not in the same o ```r A <- c(10,20,30,40,50) -B <- c(50,40,30,20,10) # same numbers but backwards +B <- c(50,40,30,20,10) # same numbers but backwards # test to see if each element of A is in B A %in% B @@ -166,7 +173,7 @@ We have a list of IDs for marker genes of particular interest. We want to extrac ```r important_genes <- c("ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619", "ENSMUSG00000047945", "ENSMUSG00000081010", "ENSMUSG00000030970") ``` - + 2. Extract the rows containing the important genes from your `rpkm_data` dataset using the `%in%` operator. 3. **Extra Credit:** Using the `important_genes` vector, extract the rows containing the important genes from your `rpkm_data` dataset without using the `%in%` operator. @@ -174,14 +181,14 @@ We have a list of IDs for marker genes of particular interest. We want to extrac *** ## Reordering data using indices -Indexing `[ ]` can be used to extract values from a dataset as we saw earlier, but we can also use it to rearrange our data values. +Indexing `[ ]` can be used to extract values from a dataset as we saw earlier, but we can also use it to rearrange our data values. ```r teaching_team <- c("Mary", "Meeta", "Radhika") ``` ![reordering](../img/teachin-team.png) -Remember that we can return values in a vector by specifying it's position or index: +Remember that we can return values in a vector by specifying its position or index: ```r teaching_team[c(2, 3)] # Extracting values from a vector @@ -208,9 +215,9 @@ reorder_teach <- teaching_team[c(3, 1, 2)] # Saving the results to a variable ## The `match` function -Now that we know how to reorder using indices, we can use the `match()` function to match the values in two vectors. 
We'll be using it to evaluate which samples are present in both our counts and metadata dataframes, and then to re-order the columns in the counts matrix to match the row names in the metadata matrix. +Now that we know how to reorder using indices, we can use the `match()` function to match the values in two vectors. We'll be using it to evaluate which samples are present in both our counts and metadata dataframes, and then to re-order the columns in the counts matrix to match the row names in the metadata matrix. -`match()` takes at least 2 arguments: +`match()` takes at least 2 arguments: 1. a vector of values in the order you want 2. a vector of values to be reordered @@ -226,13 +233,13 @@ second <- c("B","D","E","A","C") # same letters but different order ***How would you reorder `second` vector to match `first` using indices?*** If we had large datasets, it would be difficult to reorder them by searching for the indices of the matching elements. This is where the `match` function comes in really handy: - + ```r match(first,second) [1] 4 1 5 2 3 ``` -The function should return a vector of size `length(first)`. Each number that is returned represents the index of the `second` vector where the matching value was observed. +The function should return a vector of size `length(first)`. Each number that is returned represents the index of the `second` vector where the matching value was observed. 
Now, we can just use the indices to reorder the elements of the `second` vector to be in the same positions as the matching elements in the `first` vector: @@ -248,7 +255,7 @@ second_reordered <- second[reorder_idx] # Reordering and saving the output to a Now that we know how `match()` works, let's change vector `second` so that only a subset are retained: -```r +```r first <- c("A","B","C","D","E") second <- c("D","B","A") # remove values ``` @@ -268,12 +275,12 @@ match(first,second) ### Reordering genomic data using `match()` function Using the `match` function, we now would like to match the row names of our metadata to the column names of our expression data*, so these will be the arguments for `match`. Using these two arguments we will retrieve a vector of match indices. The resulting vector represents the re-ordering of the column names in our data matrix to be identical to the rows in metadata: - + ```r rownames(metadata) - + colnames(rpkm_data) - + genomic_idx <- match(rownames(metadata), colnames(rpkm_data)) genomic_idx ``` diff --git a/lessons/07_ggplot2.md b/lessons/07_ggplot2.md index a0788f9..55c991b 100644 --- a/lessons/07_ggplot2.md +++ b/lessons/07_ggplot2.md @@ -6,7 +6,7 @@ date: "Wednesday, September 8, 2017" Approximate time: 60 minutes -## Learning Objectives +## Learning Objectives * Plot graphs using the external package "ggplot2". * Use the "map" function for iterative tasks on data structures. @@ -14,7 +14,7 @@ Approximate time: 60 minutes ## Setting up a data frame for visualization -In this lesson we want to make various plots related to the average expression in each sample. When we make the plots, we also want to use all the metadata available to appropriately annotate the plots. +In this lesson we want to make various plots related to the average expression in each sample. When we make the plots, we also want to use all the metadata available to appropriately annotate the plots. Let's take a closer look at our counts data. 
Each column represents a sample in our experiment, and each sample has ~38K values corresponding to the expression of different transcripts. We want to compute **the average value of expression** for each sample eventually. Taking this one step at a time, what would we do if we just wanted the average expression for Sample 1 (across all transcripts)? We can use the R base package provided function called 'mean()`: @@ -28,7 +28,7 @@ Programming languages typically have a way to allow the execution of a single li ### The `map` family of functions -The `map()` family of functions is available from the **`purrr`** package, which is part of the tidyverse suite of packages. More detailed information is available in the [R for Data Science](http://r4ds.had.co.nz/iteration.html#the-map-functions) book. This family includes several functions, each taking a vector as input and outputting a vector of a specified type. For example, we can use these functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on. +The `map()` family of functions is available from the **`purrr`** package, which is part of the tidyverse suite of packages. More detailed information is available in the [R for Data Science](http://r4ds.had.co.nz/iteration.html#the-map-functions) book. This family includes several functions, each taking a vector as input and outputting a vector of a specified type. For example, we can use these functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on. - `map()` creates a list. - `map_lgl()` creates a logical vector. @@ -36,7 +36,7 @@ The `map()` family of functions is available from the **`purrr`** package, which - `map_dbl()` creates a "double" or numeric vector. - `map_chr()` creates a character vector. 
-The syntax for the `map()` family of functions is: +The syntax for the `map()` family of functions is: ```r ## DO NOT RUN @@ -46,16 +46,16 @@ map(object, function_to_apply) If you would like to practice with the `map()` family of functions, we have [additional materials](https://hbctraining.github.io/Intro-to-R/lessons/map_purrr.html) available. ### Wrangling our data with `map_dbl()` -To obtain **mean values for all samples** we can use the `map_dbl()` function which generates a numeric vector. +To obtain **mean values for all samples** we can use the `map_dbl()` function which generates a numeric vector. ```r library(purrr) # Load the purrr -samplemeans <- map_dbl(rpkm_ordered, mean) +samplemeans <- map_dbl(rpkm_ordered, mean) ``` We can add this 12 element containing vector as a column to our metadata data frame, thus combining the average expression with experimental metadata. The `cbind()` or "column bind" function allows us to do this very easily. - + ```r new_metadata <- cbind(metadata, samplemeans) ``` @@ -63,10 +63,10 @@ new_metadata <- cbind(metadata, samplemeans) Before we start to plot, we also want to add an additional metadata column to `new_metadata`, this new column lists the age of each of the mouse samples in days. ```r -age_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32) +age_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32) # Create a numeric vector with ages. Note that there are 12 elements here. - -new_metadata <- cbind(new_metadata, age_in_days) + +new_metadata <- cbind(new_metadata, age_in_days) # add the new vector as the last column to the new_metadata dataframe ``` @@ -78,7 +78,7 @@ When we are working with large sets of numbers it can be useful to display that More recently, R users have moved away from base graphic options towards `ggplot2` since it offers a lot more functionality as compared to the base R plotting functions. 
The `ggplot2` syntax takes some getting used to, but once you get it, you will find it's extremely powerful and flexible. We will start with drawing a simple x-y scatterplot of `samplemeans` versus `age_in_days` from the `new_metadata` data frame. `ggplot2` assumes that the input is a data frame. -Let's start by loading the `ggplot2` library, you downloaded and installed this library as part of the `tidyverse` package. +Let's start by loading the `ggplot2` library. You downloaded and installed this library as part of the `tidyverse` package. ```r library(ggplot2) @@ -86,13 +86,13 @@ library(ggplot2) The `ggplot()` function is used to **initialize the basic graph structure**, then we add to it. The basic idea is that you specify different parts of the plot, and add them together using the `+` operator. These parts are often referred to as layers. -Let's start: +Let's start: ```r -ggplot(new_metadata) # what happens? +ggplot(new_metadata) # what happens? ``` -You get an blank plot, because you need to **specify layers** using the `+` operator. +You get a blank plot, because you need to **specify layers** using the `+` operator. One type of layer is **geometric objects**. These are the actual marks we put on a plot. Examples include: @@ -100,7 +100,7 @@ One type of layer is **geometric objects**. These are the actual marks we put on * lines (`geom_line`, for time series, trend lines, etc) * boxplot (`geom_boxplot`, for, well, boxplots!) -For a more exhaustive list on all possible geometric objects and when to use them check out [Hadley Wickham's RPubs](http://rpubs.com/hadley/ggplot2-layers) or the [RStudio cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf). +For a more exhaustive list on all possible geometric objects and when to use them check out [Hadley Wickham's RPubs](http://rpubs.com/hadley/ggplot2-layers) or the [RStudio cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf). 
A plot **must have at least one `geom`**; there is no upper limit. You can add a `geom` to a plot using the `+` operator @@ -113,7 +113,7 @@ You will find that even though we have added a layer by specifying `geom_point`, * position (i.e., on the x and y axes) * color ("outside" color) -* fill ("inside" color) +* fill ("inside" color) * shape (of points) * linetype * size @@ -125,7 +125,7 @@ ggplot(new_metadata) + geom_point(aes(x = age_in_days, y= samplemeans)) ``` - ![ggscatter1](../img/ggscatter-1.png) + ![ggscatter1](../img/ggscatter-1.png) Now that we have the required aesthetics, let's add some extras like color to the plot. We can **`color` the points on the plot based on genotype**, by specifying the column header. You will notice that there are a default set of colors that will be used so we do not have to specify. Also, the **legend has been conveniently plotted for us!** @@ -133,20 +133,20 @@ Now that we have the required aesthetics, let's add some extras like color to th ```r ggplot(new_metadata) + - geom_point(aes(x = age_in_days, y= samplemeans, color = genotype)) + geom_point(aes(x = age_in_days, y= samplemeans, color = genotype)) ``` - ![ggscatter1.1](../img/ggscatter-2.png) + ![ggscatter1.1](../img/ggscatter-2.png) Alternatively, we could color based on celltype by changing it to `color =celltype`. Let's try something different and have both **celltype and genotype identified on the plot**. To do this we can assign the `shape` aesthetic the column header, so that each celltype is plotted with a different shaped data point. Add in `shape = celltype` to your aesthetic and see how it changes your plot: ```r ggplot(new_metadata) + geom_point(aes(x = age_in_days, y= samplemeans, color = genotype, - shape=celltype)) + shape=celltype)) ``` - ![ggscatter3](../img/ggscatter-3.png) + ![ggscatter3](../img/ggscatter-3.png) The **size of the data points** are quite small. 
We can adjust that within the `geom_point()` layer, but does **not** need to be **included in `aes()`** since we are specifying how large we want the data points, rather than mapping it to a variable. Add in the `size` argument by specifying a number for the size of the data point: @@ -154,11 +154,11 @@ The **size of the data points** are quite small. We can adjust that within the ` ```r ggplot(new_metadata) + geom_point(aes(x = age_in_days, y= samplemeans, color = genotype, - shape=celltype), size=3.0) + shape=celltype), size=3.0) ``` ![ggscatter4](../img/ggscatter-4.png) - + The labels on the x- and y-axis are also quite small and hard to read. To change their size, we need to add an additional **theme layer**. The ggplot2 `theme` system handles non-data plot elements such as: @@ -175,10 +175,10 @@ Let's add a layer `theme_bw()`. Do the axis labels or the tick labels get any la ggplot(new_metadata) + geom_point(aes(x = age_in_days, y= samplemeans, color = genotype, shape=celltype), size=3.0) + - theme_bw() + theme_bw() ``` -Not in this case. But we can add arguments using `theme()` to change it ourselves. Since we are adding this layer on top (i.e later in sequence), any features we change will override what is set in the `theme_bw()`. Here we'll **increase the size of the axes labels and axes tick labels to be 1.5 times the default size.** When modfying the size of text we often use the `rel()` function. In this way the size we specify is relative to the default (similar to `cex` for base plotting). We can also provide the number vaue as we did with the data point size, but can be cumbersome if you don't know what the default font size is to begin with. +Not in this case. But we can add arguments using `theme()` to change it ourselves. Since we are adding this layer on top (i.e later in sequence), any features we change will override what is set in the `theme_bw()`. 
Here we'll **increase the size of the axes labels and axes tick labels to be 1.5 times the default size.** When modifying the size of text we often use the `rel()` function. In this way the size we specify is relative to the default (similar to `cex` for base plotting). We can also provide the number value as we did with the data point size, but this can be cumbersome if you don't know what the default font size is to begin with. ```r ggplot(new_metadata) + @@ -186,13 +186,13 @@ ggplot(new_metadata) + shape=celltype), size=3.0) + theme_bw() + theme(axis.text = element_text(size=rel(1.5)), - axis.title = element_text(size=rel(1.5))) + axis.title = element_text(size=rel(1.5))) ``` - + ![ggscatter5](../img/ggscatter-5.png) - -> *NOTE:* You can use the `example("geom_point")` function here to explore a multitude of different aesthetics and layers that can be added to your plot. As you scroll through the different plots, take note of how the code is modified. You can use this with any of the different geometric object layers available in ggplot2 to learn how you can easily modify your plots! + +> *NOTE:* You can use the `example("geom_point")` function here to explore a multitude of different aesthetics and layers that can be added to your plot. As you scroll through the different plots, take note of how the code is modified. You can use this with any of the different geometric object layers available in ggplot2 to learn how you can easily modify your plots! > *NOTE:* RStudio provide this very [useful cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) for plotting using `ggplot2`. Different example plots are provided and the associated code (i.e which `geom` or `theme` to use in the appropriate situation.) 
@@ -232,7 +232,7 @@ personal_theme <- function(){ theme_bw() + theme(axis.text=element_text(size=rel(1.5)), axis.title=element_text(size=rel(1.5)), - plot.title=element_text(hjust=0.5)) + plot.title=element_text(hjust=0.5)) } ``` @@ -249,7 +249,7 @@ ggplot(new_metadata) + ## Boxplot -Now that we have all the required information for plotting with ggplot2 let's try plotting a boxplot. A boxplot provides a graphical view of the distribution of data based on a five number summary. The top and bottom of the box represent the (1) first and (2) third quartiles (25th and 75th percentiles, respectively). The line inside the box represents the (3) median (50th percentile). The whiskers extending above and below the box represent the (4) maximum, and (5) minimum of a data set. The whiskers of the plot reach the minimum and maximum values that are not outliers. +Now that we have all the required information for plotting with ggplot2, let's try plotting a boxplot. A boxplot provides a graphical view of the distribution of data based on a five number summary. The bottom and top of the box represent the (1) first and (2) third quartiles (25th and 75th percentiles, respectively). The line inside the box represents the (3) median (50th percentile). The whiskers extending above and below the box represent the (4) maximum and (5) minimum of a data set. The whiskers of the plot reach the minimum and maximum values that are not outliers. Outliers are determined using the interquartile range (IQR), which is defined as: Q3 - Q1. Any values that exceeds 1.5 x IQR below Q1 or above Q3 are considered outliers and are represented as points above or below the whiskers. These outliers are useful to identify any unexpected observations. @@ -278,22 +278,23 @@ There are two ways in which figures and plots can be output to a file (rather th The second option is to use R functions and have the write to file hard-coded in to your script. 
This would allow you to run the script from start to finish and automate the process (not requiring human point-and-click actions to save). In R’s terminology, **output is directed to a particular output device and that dictates the output format that will be produced**. A device must be created or “opened” in order to receive graphical output and, for devices that create a file on disk, the device must also be closed in order to complete the output. -Let's print our scatterplot to a pdf file format. First you need to initialize a plot using a function which specifies the graphical format you intend on creating i.e.`pdf()`, `png()`, `tiff()` etc. Within the function you will need to specify a name for your image, and the with and height (optional). This will open up the device that you wish to write to: +Let's print our scatterplot to a pdf file format. First you need to initialize a plot using a function which specifies the graphical format you intend on creating i.e.`pdf()`, `png()`, `tiff()` etc. Within the function you will need to specify a name for your image, and the width and height (optional). This will open up the device that you wish to write to: ```r pdf("figures/scatterplot.pdf") ``` +*There must be a `figures` directory in your current working directory* -If you wish to modify the size and resolution of the image you will need to add in the appropriate parameters as arguments to the function when you initialize. Then we plot the image to the device, using the ggplot scatterplot that we just created. +If you wish to modify the size and resolution of the image you will need to add in the appropriate parameters as arguments to the function when you initialize. Then we plot the image to the device, using the ggplot scatterplot that we just created. 
```r ggplot(new_metadata) + geom_point(aes(x = age_in_days, y= samplemeans, color = genotype, - shape=celltype), size=rel(3.0)) + shape=celltype), size=rel(3.0)) ``` -Finally, close the "device", or file, using the `dev.off()` function. There are also `bmp`, `tiff`, and `jpeg` functions, though the jpeg function has proven less stable than the others. - -```r +Finally, close the "device", or file, using the `dev.off()` function. There are also `bmp`, `tiff`, and `jpeg` functions, though the jpeg function has proven less stable than the others. + +```r dev.off() ``` diff --git a/lessons/08_intro_tidyverse.md b/lessons/08_intro_tidyverse.md index 0f060c9..a83a31a 100644 --- a/lessons/08_intro_tidyverse.md +++ b/lessons/08_intro_tidyverse.md @@ -28,13 +28,20 @@ res_tableOE <- read.csv(file = "data/Mov10oe_DE_results.csv", row.names = 1) library(tidyverse) ``` +If you haven't already loaded the metadata, download it [here](https://raw.githubusercontent.com/hbctraining/Intro-to-R-with-DGE/master/data/mouse_exp_design.csv) and load it with: + +```r + metadata <- read.csv(file="data/mouse_exp_design.csv") + + ``` + ## Tidyverse basics The Tidyverse suite of packages introduces users to a set of data structures, functions and operators to make working with data more intuitive, but is slightly different from the way we do things in base R. **Two important new concepts we will focus on are pipes and tibbles**. ### Pipes -Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing. +Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing. To make R code more human readable, the Tidyverse tools use the pipe, `%>%`, which was acquired from the `magrittr` package and is now part of the `dplyr` package that is installed automatically with Tidyverse. 
**The pipe allows the output of a previous command to be used as input to another command instead of using nested functions.** @@ -61,7 +68,7 @@ The pipe represents a much easier way of writing and deciphering R code, and so 1. Extract the `replicate` column from the `metadata` data frame (use the `$` notation) and save the values to a vector named `rep_number`. 2. Use the pipe (`%>%`) to perform two steps in a single line: - + 1. Turn `rep_number` into a factor. 2. Use the `head()` function to return the first six values of the `rep_number` factor. @@ -69,7 +76,7 @@ The pipe represents a much easier way of writing and deciphering R code, and so ### Tibbles -A core component of the [tidyverse](http://tidyverse.org/) is the [tibble](http://tibble.tidyverse.org/). **Tibbles are a modern rework of the standard `data.frame`, with some internal improvements** to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have numbers/symbols for column names, which is not normally allowed in base R. +A core component of the [tidyverse](http://tidyverse.org/) is the [tibble](http://tibble.tidyverse.org/). **Tibbles are a modern rework of the standard `data.frame`, with some internal improvements** to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have numbers/symbols for column names, which is not normally allowed in base R. **Important: [tidyverse](http://tidyverse.org/) is very opininated about row names**. These packages insist that all column data (e.g. `data.frame`) be treated equally, and that special designation of a column as `rownames` should be deprecated. [Tibble](http://tibble.tidyverse.org/) provides simple utility functions to handle rownames: `rownames_to_column()` and `column_to_rownames()`. 
More help for dealing with row names in tibbles can be found: @@ -77,7 +84,7 @@ A core component of the [tidyverse](http://tidyverse.org/) is the [tibble](http: help("rownames", "tibble") ``` -Tibbles can be created directly using the `tibble()` function or data frames can be converted into tibbles using `as_tibble(name_of_df)`. +Tibbles can be created directly using the `tibble()` function or data frames can be converted into tibbles using `as_tibble(name_of_df)`. >**NOTE:** The function `as_tibble()` will ignore row names, so if a column representing the row names is needed, then the function `rownames_to_column(name_of_df)` should be run prior to turning the data.frame into a tibble. Also, `as_tibble()` will not coerce character vectors to factors by default. @@ -90,7 +97,7 @@ Tibbles can be created directly using the `tibble()` function or data frames can *** -A nice feature of a tibble is that **when printing a variable to screen, it will show only the first 10 rows and the columns that fit to the screen by default**. This is nice since you don't have to specify `head()` to take a quick look at your dataset. +A nice feature of a tibble is that **when printing a variable to screen, it will show only the first 10 rows and the columns that fit to the screen by default**. This is nice since you don't have to specify `head()` to take a quick look at your dataset. 
```r @@ -98,8 +105,8 @@ A nice feature of a tibble is that **when printing a variable to screen, it will rpkm_data # Default printing of tibble -rpkm_data %>% - rownames_to_column() %>% +rpkm_data %>% + rownames_to_column() %>% as_tibble() ``` @@ -108,13 +115,13 @@ rpkm_data %>% > > ``` > # Printing of tibble with print() - change defaults -> rpkm_data %>% -> rownames_to_column() %>% -> as_tibble() %>% +> rpkm_data %>% +> rownames_to_column() %>% +> as_tibble() %>% > print(n = 20, width = Inf) > ``` -*** +*** ## Tidyverse tools @@ -144,11 +151,11 @@ To extract columns from a tibble we can use the `select()` function. ```r # Convert the res_tableOE data frame to a tibble -res_tableOE <- res_tableOE %>% - rownames_to_column(var="gene") %>% +res_tableOE <- res_tableOE %>% + rownames_to_column(var="gene") %>% as_tibble() -# extract selected columns from res_tableOE +# extract selected columns from res_tableOE res_tableOE %>% select(gene, baseMean, log2FoldChange, padj) ``` @@ -242,15 +249,15 @@ sub_res %>% ## # A tibble: 23,368 x 3 ## gene baseMean log10BaseMean ## - ## 1 1/2-SBSRNA4 45.7 1.66 - ## 2 A1BG 61.1 1.79 - ## 3 A1BG-AS1 176. 2.24 + ## 1 1/2-SBSRNA4 45.7 1.66 + ## 2 A1BG 61.1 1.79 + ## 3 A1BG-AS1 176. 2.24 ## 4 A1CF 0.238 -0.624 - ## 5 A2LD1 89.6 1.95 + ## 5 A2LD1 89.6 1.95 ## 6 A2M 5.86 0.768 ## 7 A2ML1 2.42 0.385 ## 8 A2MP1 1.32 0.121 - ## 9 A4GALT 64.5 1.81 + ## 9 A4GALT 64.5 1.81 ## 10 A4GNT 0.191 -0.718 ## # ... with 23,358 more rows @@ -266,16 +273,16 @@ sub_res %>% ## # A tibble: 23,368 x 4 ## symbol baseMean log2FoldChange padj ## - ## 1 1/2-SBSRNA4 45.7 0.268 0.264 - ## 2 A1BG 61.1 0.209 0.357 - ## 3 A1BG-AS1 176. -0.0519 0.781 - ## 4 A1CF 0.238 0.0130 NA - ## 5 A2LD1 89.6 0.345 0.0722 - ## 6 A2M 5.86 -0.274 0.226 - ## 7 A2ML1 2.42 0.240 NA - ## 8 A2MP1 1.32 0.0811 NA + ## 1 1/2-SBSRNA4 45.7 0.268 0.264 + ## 2 A1BG 61.1 0.209 0.357 + ## 3 A1BG-AS1 176. 
-0.0519 0.781 + ## 4 A1CF 0.238 0.0130 NA + ## 5 A2LD1 89.6 0.345 0.0722 + ## 6 A2M 5.86 -0.274 0.226 + ## 7 A2ML1 2.42 0.240 NA + ## 8 A2MP1 1.32 0.0811 NA ## 9 A4GALT 64.5 0.798 0.0000240 - ## 10 A4GNT 0.191 0.00952 NA + ## 10 A4GNT 0.191 0.00952 NA ## # ... with 23,358 more rows @@ -370,30 +377,30 @@ There are two main functions in Tidyr, `gather()` and `spread()`. These function ### gather() -The `gather()` function changes a wide data format into a long data format. This function is particularly helpful when using 'ggplot2' to get all of the values to plot into a single column. +The `gather()` function changes a wide data format into a long data format. This function is particularly helpful when using 'ggplot2' to get all of the values to plot into a single column. To use this function, you need to give the columns in the data frame you would like to gather together as a single column. Then, provide a name to give the column where all of the column names will be present using the `key` argument, and the name to give the column where all of the values will be present using the `value` argument. ```r -rpkm_data_tb <- rpkm_data %>% - rownames_to_column() %>% +rpkm_data_tb <- rpkm_data %>% + rownames_to_column() %>% as_tibble() gathered <- rpkm_data_tb %>% gather(colnames(rpkm_data_tb)[2:13], key = "samplename", value = "rpkm") -``` - +``` + ### spread() The `spread()` function is the reverse of the `gather()` function. The categories of the `key` column will become separate columns, and the values in the `value` column split across the associated `key` columns. ```r -gathered %>% - spread(key = "samplename", +gathered %>% + spread(key = "samplename", value = "rpkm") -``` +``` @@ -401,12 +408,12 @@ gathered %>% ## Stringr -Stringr is a powerful tool for working with sequences of characters, or **strings**. 
While there are a plethora of functions in stringr that are useful for working with strings, we will only cover a those we find to be the most useful: +Stringr is a powerful tool for working with sequences of characters, or **strings**. While there are a plethora of functions in stringr that are useful for working with strings, we will only cover those we find to be the most useful: - `str_c()` concatenates strings together - `str_split()` splits string by specifying a separator - `str_sub()` extracts characters from a string at specific locations -- `str_replace()` replaces a string with another string +- `str_replace()` replaces a string with another string - `str_to_()` group of functions that change the case of the strings, includes `str_to_upper()`, `str_to_lower()`, and `str_to_title()` - `str_detect()` identifies whether a pattern exists in each of the elements in a vector - `str_subset()` returns only those elements that match a pattern @@ -427,18 +434,18 @@ metadata <- metadata %>% In contrast to `str_c()`, `str_split()` will separate values based on a designated separator. ```r -metadata %>% - pull(sample) %>% +metadata %>% + pull(sample) %>% str_split("_") -``` +``` ### str_sub() For extracting characters from a string, the `str_sub()` function can be used to denote which positions in the string to extract: ```r -metadata %>% - pull(sample) %>% +metadata %>% + pull(sample) %>% str_sub(start = 1, end = 8) ``` @@ -499,8 +506,8 @@ metadata[idx, ] To only return those values that match a pattern, the `str_subset()` function will extract only those values: ```r -metadata %>% - pull(sample) %>% +metadata %>% + pull(sample) %>% str_subset("typeA_1") ```