From a5d58a5f6fa570c90ab40d492baed00e2375d2ed Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Tue, 26 May 2026 16:06:45 -0700 Subject: [PATCH 1/6] [Draft Concept] Dataframes --- concepts/dataframes/.meta/config.json | 7 + concepts/dataframes/about.md | 465 ++++++++++++++++++++++++++ concepts/dataframes/introduction.md | 1 + concepts/dataframes/links.json | 14 + config.json | 5 + 5 files changed, 492 insertions(+) create mode 100644 concepts/dataframes/.meta/config.json create mode 100644 concepts/dataframes/about.md create mode 100644 concepts/dataframes/introduction.md create mode 100644 concepts/dataframes/links.json diff --git a/concepts/dataframes/.meta/config.json b/concepts/dataframes/.meta/config.json new file mode 100644 index 00000000..8e7b02cf --- /dev/null +++ b/concepts/dataframes/.meta/config.json @@ -0,0 +1,7 @@ +{ + "authors": [ + "colinleach" + ], + "contributors": [], + "blurb": "Dataframes are rectangular collections of potentially heterogeneous data. They are central to most use of modern R, and to the Tidyverse ecosystem." +} \ No newline at end of file diff --git a/concepts/dataframes/about.md b/concepts/dataframes/about.md new file mode 100644 index 00000000..24d73f09 --- /dev/null +++ b/concepts/dataframes/about.md @@ -0,0 +1,465 @@ +# About + +In other parts of the syllabus, we have seen various data types with different characteristics. + +- Atomic [vectors][concept-vectors] are 1-dimensional and homogenous in type. +- [Lists][concept-lists] are 1-dimensional and elements can be of heterogenous types. +- [Matrices and arrays][concept-matrices-arrays] are multi-dimensional and homogeneous. + +This Concept will look at ways to store multi-dimensional, heterogenous data. +In practice, _most_ real-world data is like this, so we are now getting to the heart of how R is (mostly) used in practice. + +## Dataframe variants + +Over the decades, R has added multiple data types to handle tabular data. + +This syllabus will focus mainly on tibbles, but it is useful to know about some alternatives. + +### The `data.frame` + +In Base R, a [`data.frame`][web-dataframe] is a `list` of equal-length `vectors`. +This can be thought of as a rectangular table of data, in which each column is homogeneous, but each row can (and usually does) contain different types of data. + +An example to illustrate this: + +```R +# create the column vectors +languages <- c("Fortran", "R", "Python", "Julia") +created <- c(1957, 1993, 1991, 2012) +has.syllabus <- c(FALSE, TRUE, TRUE, TRUE) + +# join columns to create the dataframe +df <- data.frame(languages, created, has.syllabus) +df +#> languages created has.syllabus +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + +# look at the structure +str(df) +#> 'data.frame': 4 obs. of 3 variables: + #> $ languages : chr "Fortran" "R" "Python" "Julia" + #> $ created : num 1957 1993 1991 2012 + #> $ has.syllabus: logi FALSE TRUE TRUE TRUE + ``` + + We have a column of character strings, a column of numbers and a column of booleans. + When scaled up, this is an intuitive way to represent many collections of real world data. + +### The `tibble` + +The `data.frame` design is _old_. + +Multi-decade experience, plus changing patterns of how R is used, led to a redesign, creating a modernized alternative in the Tidyverse: [tibbles][web-tibble]. + +Compared to Base R, tibbles have: + +- Different defaults, to reduce common problems. +- Less willingness to coerce data types during input. +- More and clearer error messages. +- Different, usually better, display formats. + +In short, a `tibble` aims to "_do less and complain more_", also described as "_lazy and surly_". + +The types are usually interchangeable: any function which accepts a `data.frame` will also accept a `tibble`, and _vice versa_. + +For new work, using tibbles will probably help you create more robust code. +However, legacy code and legacy data is very plentiful in the R world, so the `data.frame` is likely to remain common for a long time. + +```R +# column vectors are same as for data.frame +library(tibble) +tbl <- tibble(languages, created, has.syllabus) +tbl + # A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + +str(tbl) +#> tibble [4 × 3] (S3: tbl_df/tbl/data.frame) +#> $ languages : chr [1:4] "Fortran" "R" "Python" "Julia" +#> $ created : num [1:4] 1957 1993 1991 2012 +#> $ has.syllabus: logi [1:4] FALSE TRUE TRUE TRUE +``` + +Note the default print format: the comment line with dimensions is printed automatically, and column types are also displayed. + +### The `data.table` + +Tibbles are one relatively recent evolution of the original `data.frame`, fully integrated into the Tidyverse packages and available in the Exercism test runner. + +Separately, [`data.table`][ref-data-table] is an alternative attempt to improve on the `data.frame`, in a third-party package. + +Both are well-respected, and there is inevitably much argument about which is "better". +Maybe there is some degree of consensus around the following points (_even if they will be criticized as simplistic_). + +- `tibble` is optimized mainly for ease of use, and integration with the Tidyverse ecosystem. +- `data.table` is optimized mainly for raw power and scalability, especially when working with very large datasets. + +In any case, `data.table` is not available within Exercism, so it is mentioned here just for completeness. + +## Working with tibbles + +Tibbles are a core part of the Tidyverse, so add them with either `library(tibble)` or `library(tidyverse)`. + +Documentation is fairly extensive, in the Tidyverse style: + +- A [website][web-tibble]. +- A [function reference][ref-tibbles]. +- A [chapter][book-tibble] in R for Data Science. + +### Creating a tibble + +Most simply, we can use the [`tibble()`][ref-tibble] function to join column vectors, as for `data.frame()`. +An example of this was shown in a previous section. + +If it is more convenient to enter values row-wise, the corresponding function is [`tribble()`][ref-tribble]. + +```R +tbl_r <- tribble( + # column names marked with tilde prefix + ~languages, ~created, ~has.syllabus, + "Fortran", 1957, FALSE, + "R", 1993, TRUE, + "Python", 1991, TRUE, + "Julia", 2012, TRUE +) + +tbl_r +# A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE +``` + +In practice, there are dozens of ways to create tibbles, as they are the default output format from a diverse range of Tidyverse functions. +We will return to this in a future Concept. + +## Manipulating a tibble + +The [Functional Programming][concept-funcprog] Concept discussed the [`purrr`][web-purrr] library to manipulate vectors and lists (1-D data structures). + +For dataframes (whether traditional or tibbles), the corresponding library to use is [`dplyr`][web-dplyr]. + +We introduced `dplyr` previously, in the [Switch Concept][concept-switch]. +That just used a few utility functions, but now we can start to explore the rest of this large library. + +### Subsetting + +Dataframes, including tibbles, can be treated as lists of column vectors, so list indexing recovers a specified column. + +```R +tbl +# A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + +tbl$created +#> [1] 1957 1993 1991 2012 +``` + +A dataframe can also be indexed with [matrix-style][concept-matrices-arrays] indexing. + +```R +tbl[c(2, 4), 1:2] + # A tibble: 2 × 2 +#> languages created +#> +#> 1 R 1993 +#> 2 Julia 2012 +``` + +In modern R with the Tidyverse ecosystem, [`dplyr`][web-dplyr] functions are generally more flexible and convenient, and will be the focus for the rest of this Concept. + +~~~~exercism/note +Because many (_not all!_) students interested in dataframes have previous experience of Python-Pandas and/or SQL, we will provide examples in those other languages for operations we descibe in R (where appropriate). + +Such examples are just a convenience for some students, so _please feel free to ignore them_. +~~~~ + +### Column-wise operations + +Get a single column with [`pull()`][ref-pull] with the name or sequential number (negative numbers to count right-to-left). + +```R +tbl |> pull(created) +#> [1] 1957 1993 1991 2012 +``` + +This is the same result as `tbl$created`, but using a pipeline-friendly function. + +To get multiple columns, the appropriate function is [`select()`][ref-select], which is highly versatile. +Get (or drop) columns based on properties of their name or type. + +```R +# Range with position and/or name +tbl |> select(1:created) + # A tibble: 4 × 2 +#> languages created +#> +#> 1 Fortran 1957 +#> 2 R 1993 +#> 3 Python 1991 +#> 4 Julia 2012 + +# Exclude a column +tbl |> select(!created) + # A tibble: 4 × 2 +#> languages has.syllabus +#> +#> 1 Fortran FALSE +#> 2 R TRUE +#> 3 Python TRUE +#> 4 Julia TRUE + +# Use type of column +tbl |> select(where(is.numeric)) + # A tibble: 4 × 1 +#> created +#> +#> 1 1957 +#> 2 1993 +#> 3 1991 +#> 4 2012 +``` + +Multiple criteria are allowed, using Boolean operators `&`, `|` and `!` (and, or not). + +Column names that are _valid R identifiers_ do not need quotes within a [`select()`][ref-select]. +Invalid names (e.g. those including spaces) can be enclosed in backticks, though renaming them might be better. + +The [`select()`][ref-select] function can work with a range of helper functions to pick column names: [`starts_with`][ref-starts_with], [`contains`][ref-starts_with], [`num_range`][ref-starts_with] and various others. +[`matches`][ref-starts_with] allows full [RegEx][concept-regex] matching. +See the [documentation][ref-select] for details. + +Such power seems quite silly with our toy dataframe of languages. +Fortunately, the [`starwars`][ref-starwars] tibble is included in `dplyr`, giving us something bigger to practice with. + +```R +# limit display to top 3 rows of non-list columns +starwars |> + select(!where(is.list)) |> + head(3) + # A tibble: 3 × 11 +#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species +#> +#> 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human +#> 2 C-3PO 167 75 NA gold yellow 112 none masculine Tatooine Droid +#> 3 R2-D2 96 32 NA white, blue red 33 none masculine Naboo Droid + +# pick a subset of columns +starwars |> + select(name | ends_with("color")) |> + head(5) + # A tibble: 5 × 4 +#> name hair_color skin_color eye_color +#> +#> 1 Luke Skywalker blond fair blue +#> 2 C-3PO NA gold yellow +#> 3 R2-D2 NA white, blue red +#> 4 Darth Vader none white yellow +#> 5 Leia Organa brown light brown +``` + +### Row-wise operations + +~~~~exercism/note +Clearly, `dplyr` provides powerful ways to select columns by name. + +Can we do similar things with row names? + +_No!_ +Traditional R dataframes can have row names, but (after a history of bugs and performance issues) row names are _not allowed_ in `tibbles`. + +If you want names, put them in a `` column (typically column 1), used like any other column. +Import functions such as [`as_tibble()`][ref-as_tibble] will create this automatically when importing data with named rows. + +If this row-name limitation seems oddly restrictive, remember that most large database systems handle tables the same way: Oracle, SQL Server, PostgreSQL, MySQL... + +[ref-as_tibble]: https://tibble.tidyverse.org/reference/as_tibble.html +~~~~ + +Get rows matching some criteria with [`filter()`][ref-filter], or exclude them with `filter_out()`. + +```R +starwars |> + select(name:mass) |> + filter(between(height, 150, 165) & !is.na(mass)) + # A tibble: 4 × 3 +#> name height mass +#> +#> 1 Leia Organa 150 49 +#> 2 Beru Whitesun Lars 165 75 +#> 3 Nien Nunb 160 68 +#> 4 Ben Quadinaros 163 65 +``` + +Filter criteria can be arbitrarily complex, but always based on row _contents_. + +If row _numbers_ are known, we can use a variety of [`slice()`][ref-slice] functions. + +```R +starwars |> + select(name | homeworld) |> + slice(20:25) + # A tibble: 6 × 2 +#> name homeworld +#> +#> 1 Palpatine Naboo +#> 2 Boba Fett Kamino +#> 3 IG-88 NA +#> 4 Bossk Trandosha +#> 5 Lando Calrissian Socorro +#> 6 Lobot Bespin + +# random sample of rows +starwars |> + select(name | homeworld) |> + slice_sample(n = 4) + # A tibble: 4 × 2 +#> name homeworld +#> +#> 1 Shaak Ti Shili +#> 2 Luminara Unduli Mirial +#> 3 Grievous Kalee +#> 4 Palpatine Naboo +``` + +To remove duplicate rows, use [`distinct()`][ref-distinct]. + +## Modifying a tibble + +First caveat: the [_copy-on-modify_][concept-functions] default means that the original tibble usually remains unchanged. + +Most modifications are applied column-wise. + +Column names can be changed with [`rename(newname = oldname)`][ref-rename], or `rename_with()` to apply a function. +A typical use would be cleaning up imported names to make them easier to work with in R, by removing whitespace and forcing a consistent format for related names. + +Note the syntax within `rename()`. +The _contents_ of column `oldname` are _bound_ to name `newname`, hence the order. + +Column order can be changed with [`relocate()`][ref-relocate]. +Specified column(s) are moved to the left-most position(s) by default, but a `.before` or `.after` argument can be used for finer positioning. + +```R +sw <- starwars |> select(name:species) |> slice(1:4) +sw + # A tibble: 4 × 11 +#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species +#> +#> 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human +#> 2 C-3PO 167 75 NA gold yellow 112 none masculine Tatooine Droid +#> 3 R2-D2 96 32 NA white, blue red 33 none masculine Naboo Droid +#> 4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human + +sw |> relocate(c(species, homeworld), .after = name) + # A tibble: 4 × 11 +#> name species homeworld height mass hair_color skin_color eye_color birth_year sex gender +#> +#> 1 Luke Skywalker Human Tatooine 172 77 blond fair blue 19 male masculine +#> 2 C-3PO Droid Tatooine 167 75 NA gold yellow 112 none masculine +#> 3 R2-D2 Droid Naboo 96 32 NA white, blue red 33 none masculine +#> 4 Darth Vader Human Tatooine 202 136 none white yellow 41.9 male masculine +``` + +For bigger changes, [`mutate()`][ref-mutate] lets you: + +- Create new columns that are functions of existing columns. +- Replace an existing column, by creating a new column with the same name. +- Delete a column, by setting its value to [`NULL`][concept-nothingness] + +Clearly, `mutate()` is powerful, potentially confusing, and a reason to be very grateful for copy-on-modify. + +There is no obvious reason to care about the [Body Mass Index][wiki-bmi] of Star Wars characters, but just in case: + +```R +starwars |> + select(c(name, species, height, mass)) |> + mutate(BMI = mass / (height / 100)^2) |> + head(4) + # A tibble: 4 × 5 +#> name species height mass BMI +#> +#> 1 Luke Skywalker Human 172 77 26.0 +#> 2 C-3PO Droid 167 75 26.9 +#> 3 R2-D2 Droid 96 32 34.7 +#> 4 Darth Vader Human 202 136 33.3 +``` + +Row-wise operations are less common for modifying single tibbles (merging multiple tibbles will be discussed in a later concept). + +One exception: [`arrange()`][ref-arrange] sorts rows by the values in one or more columns. + +```R +tbl + # A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + +tbl |> arrange(languages) + # A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 Julia 2012 TRUE +#> 3 Python 1991 TRUE +#> 4 R 1993 TRUE +``` + +## Summary + +Dataframes, whether traditional or tibbles, are central to the way modern R is typically used. + +Most of the Tidyverse functions (not just `dplyr`) take tibbles as input and/or create them as output. + +This concept just provided a brief introduction, barely scratching the surface of what is possible. + +Later concepts will discuss several other aspects of dataframes (_within the technical contraints of Exercism_). + +[web-dataframe]: https://bioinformatics.ccr.cancer.gov/docs/rintro/Lesson_3/ +[web-tibble]: https://tibble.tidyverse.org/ +[ref-tibbles]: https://tibble.tidyverse.org/reference/index.html +[book-tibble]: https://r4ds.had.co.nz/tibbles.html +[ref-data-table]: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html +[ref-tibble]: https://tibble.tidyverse.org/reference/tibble.html +[ref-tribble]: https://tibble.tidyverse.org/reference/tribble.html +[web-purrr]: https://purrr.tidyverse.org/index.html +[web-dplyr]: https://dplyr.tidyverse.org/index.html +[ref-pull]: https://dplyr.tidyverse.org/reference/pull.html +[ref-select]: https://dplyr.tidyverse.org/reference/select.html +[ref-relocate]: https://dplyr.tidyverse.org/reference/relocate.html +[ref-starts_with]: https://tidyselect.r-lib.org/reference/starts_with.html +[ref-starwars]: https://dplyr.tidyverse.org/reference/starwars.html +[ref-filter]: https://dplyr.tidyverse.org/reference/filter.html +[ref-slice]: https://dplyr.tidyverse.org/reference/slice.html +[ref-mutate]: https://dplyr.tidyverse.org/reference/mutate.html +[ref-distinct]: https://dplyr.tidyverse.org/reference/distinct.html +[ref-rename]: https://dplyr.tidyverse.org/reference/rename.html +[ref-arrange]: https://dplyr.tidyverse.org/reference/arrange.html +[concept-vectors]: https://exercism.org/tracks/r/concepts/vectors +[concept-lists]: https://exercism.org/tracks/r/concepts/lists +[concept-switch]: https://exercism.org/tracks/r/concepts/switch +[concept-funcprog]: https://exercism.org/tracks/r/concepts/functional-programming +[concept-matrices-arrays]: https://exercism.org/tracks/r/concepts/matrices-arrays +[concept-functions]: https://exercism.org/tracks/r/concepts/functions +[concept-regex]: https://exercism.org/tracks/r/concepts/regular-expressions +[concept-nothingness]: https://exercism.org/tracks/r/concepts/nothingness +[wiki-bmi]: https://en.wikipedia.org/wiki/Body_mass_index diff --git a/concepts/dataframes/introduction.md b/concepts/dataframes/introduction.md new file mode 100644 index 00000000..e10b99d0 --- /dev/null +++ b/concepts/dataframes/introduction.md @@ -0,0 +1 @@ +# Introduction diff --git a/concepts/dataframes/links.json b/concepts/dataframes/links.json new file mode 100644 index 00000000..7664e4b8 --- /dev/null +++ b/concepts/dataframes/links.json @@ -0,0 +1,14 @@ +[ + { + "url": "https://bioinformatics.ccr.cancer.gov/docs/rintro/Lesson_3/", + "description": "Introduction to dataframes in bioinformatics." + }, + { + "url": "https://tibble.tidyverse.org/", + "description": "The tibble library, a modern version of dataframes." + }, + { + "url": "https://bioinformatics.ccr.cancer.gov/docs/rintro/Lesson_3/", + "description": "The dplyr library, which makes dataframe manipulation much easier." + } + ] \ No newline at end of file diff --git a/config.json b/config.json index 15aecf3d..6163c321 100644 --- a/config.json +++ b/config.json @@ -1243,6 +1243,11 @@ "uuid": "5362cbb2-b9a7-48b4-a09b-1d68b6a1a6a4", "slug": "matrices-arrays", "name": "Matrices and Arrays" + }, + { + "uuid": "1412e5dc-87fd-4c39-8147-379eb99dfda4", + "slug": "dataframes", + "name": "Dataframes" } ], "key_features": [ From d3992a2d6099480721d912cbf5cd99e6c57e78fc Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Tue, 26 May 2026 16:09:05 -0700 Subject: [PATCH 2/6] add blank line --- concepts/dataframes/.meta/config.json | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/concepts/dataframes/.meta/config.json b/concepts/dataframes/.meta/config.json index 8e7b02cf..d567f42a 100644 --- a/concepts/dataframes/.meta/config.json +++ b/concepts/dataframes/.meta/config.json @@ -4,4 +4,4 @@ ], "contributors": [], "blurb": "Dataframes are rectangular collections of potentially heterogeneous data. They are central to most use of modern R, and to the Tidyverse ecosystem." -} \ No newline at end of file +} From 70a78a5c6841d428b7006623aacdf2c97858c9ec Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Tue, 9 Jun 2026 12:09:43 -0700 Subject: [PATCH 3/6] df-tibble conversions --- concepts/dataframes/about.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/concepts/dataframes/about.md b/concepts/dataframes/about.md index 24d73f09..d1a71de3 100644 --- a/concepts/dataframes/about.md +++ b/concepts/dataframes/about.md @@ -65,6 +65,8 @@ In short, a `tibble` aims to "_do less and complain more_", also described as "_ The types are usually interchangeable: any function which accepts a `data.frame` will also accept a `tibble`, and _vice versa_. +Conversions between the types are easy, with [`as_tibble(df)`][ref-as_tibble] and [`as.data.frame(tbl)`][ref-asdataframe]. + For new work, using tibbles will probably help you create more robust code. However, legacy code and legacy data is very plentiful in the R world, so the `data.frame` is likely to remain common for a long time. @@ -443,6 +445,8 @@ Later concepts will discuss several other aspects of dataframes (_within the tec [ref-tribble]: https://tibble.tidyverse.org/reference/tribble.html [web-purrr]: https://purrr.tidyverse.org/index.html [web-dplyr]: https://dplyr.tidyverse.org/index.html +[ref-as_tibble]: https://tibble.tidyverse.org/reference/as_tibble.html +[ref-asdataframe]: https://www.rdocumentation.org/packages/base/versions/3.1.1/topics/as.data.frame [ref-pull]: https://dplyr.tidyverse.org/reference/pull.html [ref-select]: https://dplyr.tidyverse.org/reference/select.html [ref-relocate]: https://dplyr.tidyverse.org/reference/relocate.html From 9a73bcc607d5c591f723da5f06e98b93cdaf72dd Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Tue, 9 Jun 2026 17:17:05 -0700 Subject: [PATCH 4/6] reviewer comments --- concepts/dataframes/about.md | 34 +++++++++++++--------------------- concepts/dataframes/links.json | 26 +++++++++++++------------- 2 files changed, 26 insertions(+), 34 deletions(-) diff --git a/concepts/dataframes/about.md b/concepts/dataframes/about.md index d1a71de3..6231b4f5 100644 --- a/concepts/dataframes/about.md +++ b/concepts/dataframes/about.md @@ -15,6 +15,18 @@ Over the decades, R has added multiple data types to handle tabular data. This syllabus will focus mainly on tibbles, but it is useful to know about some alternatives. +### The `data.table` + +The [`data.table`][ref-data-table] is an attempt to improve on the `data.frame`, in a third-party package. + +Tibbles and data.tables are both well-respected, and there is inevitably much argument about which is "better". +Maybe there is some degree of consensus around the following points (_even if they will be criticized as simplistic_). + +- `tibble` is optimized mainly for ease of use, and integration with the Tidyverse ecosystem. +- `data.table` is optimized mainly for raw power and scalability, especially when working with very large datasets. + +In any case, `data.table` is not available within Exercism, so it is mentioned here just for completeness. + ### The `data.frame` In Base R, a [`data.frame`][web-dataframe] is a `list` of equal-length `vectors`. @@ -92,20 +104,6 @@ str(tbl) Note the default print format: the comment line with dimensions is printed automatically, and column types are also displayed. -### The `data.table` - -Tibbles are one relatively recent evolution of the original `data.frame`, fully integrated into the Tidyverse packages and available in the Exercism test runner. - -Separately, [`data.table`][ref-data-table] is an alternative attempt to improve on the `data.frame`, in a third-party package. - -Both are well-respected, and there is inevitably much argument about which is "better". -Maybe there is some degree of consensus around the following points (_even if they will be criticized as simplistic_). - -- `tibble` is optimized mainly for ease of use, and integration with the Tidyverse ecosystem. -- `data.table` is optimized mainly for raw power and scalability, especially when working with very large datasets. - -In any case, `data.table` is not available within Exercism, so it is mentioned here just for completeness. - ## Working with tibbles Tibbles are a core part of the Tidyverse, so add them with either `library(tibble)` or `library(tidyverse)`. @@ -164,7 +162,7 @@ tbl # A tibble: 4 × 3 #> languages created has.syllabus #> -#> 1 Fortran 1957 FALSE +#> 1 Fortran 1957 FALSE T ar #> 2 R 1993 TRUE #> 3 Python 1991 TRUE #> 4 Julia 2012 TRUE @@ -186,12 +184,6 @@ tbl[c(2, 4), 1:2] In modern R with the Tidyverse ecosystem, [`dplyr`][web-dplyr] functions are generally more flexible and convenient, and will be the focus for the rest of this Concept. -~~~~exercism/note -Because many (_not all!_) students interested in dataframes have previous experience of Python-Pandas and/or SQL, we will provide examples in those other languages for operations we descibe in R (where appropriate). - -Such examples are just a convenience for some students, so _please feel free to ignore them_. -~~~~ - ### Column-wise operations Get a single column with [`pull()`][ref-pull] with the name or sequential number (negative numbers to count right-to-left). diff --git a/concepts/dataframes/links.json b/concepts/dataframes/links.json index 7664e4b8..21ff13ad 100644 --- a/concepts/dataframes/links.json +++ b/concepts/dataframes/links.json @@ -1,14 +1,14 @@ [ - { - "url": "https://bioinformatics.ccr.cancer.gov/docs/rintro/Lesson_3/", - "description": "Introduction to dataframes in bioinformatics." - }, - { - "url": "https://tibble.tidyverse.org/", - "description": "The tibble library, a modern version of dataframes." - }, - { - "url": "https://bioinformatics.ccr.cancer.gov/docs/rintro/Lesson_3/", - "description": "The dplyr library, which makes dataframe manipulation much easier." - } - ] \ No newline at end of file + { + "url": "https://bioinformatics.ccr.cancer.gov/docs/rintro/Lesson_3/", + "description": "Introduction to dataframes in bioinformatics." + }, + { + "url": "https://tibble.tidyverse.org/", + "description": "The tibble library, a modern version of dataframes." + }, + { + "url": "https://bioinformatics.ccr.cancer.gov/docs/rintro/Lesson_3/", + "description": "The dplyr library, which makes dataframe manipulation much easier." + } +] From 58decbda995076b36344d06fdec99dd39bf86301 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Wed, 10 Jun 2026 12:48:10 -0700 Subject: [PATCH 5/6] add pick() --- concepts/dataframes/about.md | 31 +++++++++++++++++++++++++------ 1 file changed, 25 insertions(+), 6 deletions(-) diff --git a/concepts/dataframes/about.md b/concepts/dataframes/about.md index 6231b4f5..f9f4202a 100644 --- a/concepts/dataframes/about.md +++ b/concepts/dataframes/about.md @@ -17,15 +17,14 @@ This syllabus will focus mainly on tibbles, but it is useful to know about some ### The `data.table` -The [`data.table`][ref-data-table] is an attempt to improve on the `data.frame`, in a third-party package. - -Tibbles and data.tables are both well-respected, and there is inevitably much argument about which is "better". +Tibbles (described below) and data.tables are both well-respected attempts to improve on the traditional `data.frame` in Base R. +There is inevitably much argument about which is "better". Maybe there is some degree of consensus around the following points (_even if they will be criticized as simplistic_). - `tibble` is optimized mainly for ease of use, and integration with the Tidyverse ecosystem. - `data.table` is optimized mainly for raw power and scalability, especially when working with very large datasets. -In any case, `data.table` is not available within Exercism, so it is mentioned here just for completeness. +In any case, [`data.table`][ref-data-table] is not available within Exercism, so it is mentioned here just for completeness. ### The `data.frame` @@ -162,7 +161,7 @@ tbl # A tibble: 4 × 3 #> languages created has.syllabus #> -#> 1 Fortran 1957 FALSE T ar +#> 1 Fortran 1957 FALSE #> 2 R 1993 TRUE #> 3 Python 1991 TRUE #> 4 Julia 2012 TRUE @@ -394,7 +393,26 @@ starwars |> #> 4 Darth Vader Human 202 136 33.3 ``` -Row-wise operations are less common for modifying single tibbles (merging multiple tibbles will be discussed in a later concept). +When you want to operate on a subset of the columns with functions such as `mutate()`, the `select() |> mutate()` sequence in the above example is one option. +Only the selected columns will be in the result. + +Alternatively, it can be convenient to use [`pick()`][ref-pick] _within_ the `mutate()` call: + +```R +starwars |> mutate(pick(c(name, species, height, mass)), BMI = mass / (height / 100)^2) |> head(4) +# A tibble: 4 × 15 + name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles + +1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooine Human +2 C-3PO 167 75 NA gold yellow 112 none mascu… Tatooine Droid +3 R2-D2 96 32 NA white, bl… red 33 none mascu… Naboo Droid +4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooine Human +# ℹ 2 more variables: starships , BMI +``` + +Only the `pick`ed columns are used in the mutation, but all columns are returned. + +**Row-wise operations** are less common for modifying single tibbles (merging multiple tibbles will be discussed in a later concept). One exception: [`arrange()`][ref-arrange] sorts rows by the values in one or more columns. @@ -447,6 +465,7 @@ Later concepts will discuss several other aspects of dataframes (_within the tec [ref-filter]: https://dplyr.tidyverse.org/reference/filter.html [ref-slice]: https://dplyr.tidyverse.org/reference/slice.html [ref-mutate]: https://dplyr.tidyverse.org/reference/mutate.html +[ref-pick]: https://dplyr.tidyverse.org/reference/pick.html [ref-distinct]: https://dplyr.tidyverse.org/reference/distinct.html [ref-rename]: https://dplyr.tidyverse.org/reference/rename.html [ref-arrange]: https://dplyr.tidyverse.org/reference/arrange.html From 8ceb1d9ffdb98913ddad4260670024fab719c857 Mon Sep 17 00:00:00 2001 From: Colin Leach Date: Thu, 11 Jun 2026 11:31:35 -0700 Subject: [PATCH 6/6] added intro, updated about --- concepts/dataframes/about.md | 54 ++++- concepts/dataframes/introduction.md | 331 ++++++++++++++++++++++++++++ 2 files changed, 377 insertions(+), 8 deletions(-) diff --git a/concepts/dataframes/about.md b/concepts/dataframes/about.md index f9f4202a..ca680495 100644 --- a/concepts/dataframes/about.md +++ b/concepts/dataframes/about.md @@ -400,14 +400,21 @@ Alternatively, it can be convenient to use [`pick()`][ref-pick] _within_ the `mu ```R starwars |> mutate(pick(c(name, species, height, mass)), BMI = mass / (height / 100)^2) |> head(4) -# A tibble: 4 × 15 - name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles - -1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooine Human -2 C-3PO 167 75 NA gold yellow 112 none mascu… Tatooine Droid -3 R2-D2 96 32 NA white, bl… red 33 none mascu… Naboo Droid -4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooine Human -# ℹ 2 more variables: starships , BMI +When you want to operate on a subset of the columns with functions such as `mutate()`, the `select() |> mutate()` sequence in the above example is one option. +Only the selected columns will be in the result. + +Alternatively, it can be convenient to use `pick()` _within_ the `mutate()` call: + +```R +starwars |> mutate(pick(c(name, species, height, mass)), BMI = mass / (height / 100)^2) |> head(4) + # A tibble: 4 × 15 +#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles +#> +#> 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooine Human +#> 2 C-3PO 167 75 NA gold yellow 112 none mascu… Tatooine Droid +#> 3 R2-D2 96 32 NA white, bl… red 33 none mascu… Naboo Droid +#> 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooine Human + # ℹ 2 more variables: starships , BMI ``` Only the `pick`ed columns are used in the mutation, but all columns are returned. @@ -436,6 +443,37 @@ tbl |> arrange(languages) #> 4 R 1993 TRUE ``` +~~~~exercism/caution +Many functions, such as `arrange()`, `filter()` and `mutate()` are [data-masking][ref-data-masking] and require data-masking variables. +For this reason, non-data-masking arguments (e.g. character vectors) need to be converted to be used in data-masking functions. + +A full treatment of how data-masking works in R is beyond the scope of this concept, but it's useful to know there are ways of making this conversion which include options such as: [`pick()`][ref-pick], `.data[[]]` and `!!sym()`. + +```R + # arrange() with string input fails silently +tbl |> arrange("languages") +#> A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + +tbl |> arrange(pick("languages")) +#> A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 Julia 2012 TRUE +#> 3 Python 1991 TRUE +#> 4 R 1993 TRUE +``` + +[ref-pick]: https://dplyr.tidyverse.org/reference/pick.html +[ref-data-masking]: https://rlang.r-lib.org/reference/topic-data-mask.html +~~~~ + ## Summary Dataframes, whether traditional or tibbles, are central to the way modern R is typically used. diff --git a/concepts/dataframes/introduction.md b/concepts/dataframes/introduction.md index e10b99d0..b96d19b8 100644 --- a/concepts/dataframes/introduction.md +++ b/concepts/dataframes/introduction.md @@ -1 +1,332 @@ # Introduction + +In other parts of the syllabus, we have seen various data types with different characteristics. + +- Atomic vectors are 1-dimensional and homogenous in type. +- Lists are 1-dimensional and elements can be of heterogenous types. +- Matrices and arrays are multi-dimensional and homogeneous. + +This Concept will look at ways to store multi-dimensional, heterogenous data. +In practice, _most_ real-world data is like this, so we are now getting to the heart of how R is (mostly) used in practice. + +## Dataframe variants + +Over the decades, R has added multiple data types to handle tabular data. + +This syllabus will focus mainly on tibbles, but it is useful to know about alternatives. + +### The `data.frame` + +In Base R, a `data.frame` is a `list` of equal-length `vectors`. +This can be thought of as a rectangular table of data, in which each column is homogeneous, but each row can (and usually does) contain different types of data. + +An example to illustrate this: + +```R +# create the column vectors +languages <- c("Fortran", "R", "Python", "Julia") +created <- c(1957, 1993, 1991, 2012) +has.syllabus <- c(FALSE, TRUE, TRUE, TRUE) + +# join columns to create the dataframe +df <- data.frame(languages, created, has.syllabus) +df +#> languages created has.syllabus +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + ``` + + We have a column of character strings, a column of numbers and a column of booleans. + When scaled up, this is an intuitive way to represent many collections of real world data. + +### The `tibble` + +The `data.frame` design is _old_. + +Multi-decade experience, plus changing patterns of how R is used, led to a redesign, creating a modernized alternative in the Tidyverse: tibbles. + +Compared to Base R, tibbles have: + +- Different defaults, to reduce common problems. +- Less willingness to coerce data types during input. +- More and clearer error messages. +- Different, usually better, display formats. + +In short, a `tibble` aims to "_do less and complain more_", also described as "_lazy and surly_". + +The types are usually interchangeable: any function which accepts a `data.frame` will also accept a `tibble`, and _vice versa_. + +Conversions between the types are easy, with `as_tibble(df)` and `as.data.frame(tbl)`. + +For new work, using tibbles will probably help you create more robust code. +However, legacy code and legacy data is very plentiful in the R world, so the `data.frame` is likely to remain common for a long time. + +```R +# column vectors are same as for data.frame +library(tibble) +tbl <- tibble(languages, created, has.syllabus) +tbl + # A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE +``` + +Note the default print format: the comment line with dimensions is printed automatically, and column types are also displayed. + +## Working with tibbles + +Tibbles are a core part of the Tidyverse, so add them with either `library(tibble)` or `library(tidyverse)`. + +### Creating a tibble + +Most simply, we can use the `tibble()` function to join column vectors, as for `data.frame()`. +An example of this was shown in a previous section. + +If it is more convenient to enter values row-wise, the corresponding function is `tribble()`. + +In practice, there are dozens of ways to create tibbles, as they are the default output format from a diverse range of Tidyverse functions. + +## Manipulating a tibble + +The Functional Programming Concept discussed the `purrr` library to manipulate vectors and lists (1-D data structures). + +For dataframes (whether traditional or tibbles), the corresponding library to use is `dplyr`. + +### Subsetting + +Dataframes, including tibbles, can be treated as lists of column vectors, so list indexing recovers a specified column. + +```R +tbl +# A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + +tbl$created +#> [1] 1957 1993 1991 2012 +``` + +A dataframe can also be indexed with matrix-style indexing. + +```R +tbl[c(2, 4), 1:2] + # A tibble: 2 × 2 +#> languages created +#> +#> 1 R 1993 +#> 2 Julia 2012 +``` + +In modern R with the Tidyverse ecosystem, `dplyr` functions are generally more flexible and convenient, and will be the focus for the rest of this Concept. + +### Column-wise operations + +Get a single column with `pull()` with the name or sequential number (negative numbers to count right-to-left). + +```R +tbl |> pull(created) +#> [1] 1957 1993 1991 2012 +``` + +This is the same result as `tbl$created`, but using a pipeline-friendly function. + +To get multiple columns, the appropriate function is `select()`, which is highly versatile. +Get (or drop) columns based on properties of their name or type. + +```R +# Range with position and/or name +tbl |> select(1:created) + # A tibble: 4 × 2 +#> languages created +#> +#> 1 Fortran 1957 +#> 2 R 1993 +#> 3 Python 1991 +#> 4 Julia 2012 + +# Use type of column +tbl |> select(where(is.numeric)) + # A tibble: 4 × 1 +#> created +#> +#> 1 1957 +#> 2 1993 +#> 3 1991 +#> 4 2012 +``` + +Multiple criteria are allowed, using Boolean operators `&`, `|` and `!` (and, or not). + +Such power seems quite silly with our toy dataframe of languages. +Fortunately, the `starwars` tibble is included in `dplyr`, giving us something bigger to practice with. + +```R +# pick a subset of columns +starwars |> + select(name | ends_with("color")) |> + head(5) + # A tibble: 5 × 4 +#> name hair_color skin_color eye_color +#> +#> 1 Luke Skywalker blond fair blue +#> 2 C-3PO NA gold yellow +#> 3 R2-D2 NA white, blue red +#> 4 Darth Vader none white yellow +#> 5 Leia Organa brown light brown +``` + +### Row-wise operations + +Get rows matching some criteria with `filter()`, or exclude them with `filter_out()`. + +```R +starwars |> + select(name:mass) |> + filter(between(height, 150, 165) & !is.na(mass)) + # A tibble: 4 × 3 +#> name height mass +#> +#> 1 Leia Organa 150 49 +#> 2 Beru Whitesun Lars 165 75 +#> 3 Nien Nunb 160 68 +#> 4 Ben Quadinaros 163 65 +``` + +Filter criteria can be arbitrarily complex, but always based on row _contents_. + +If row _numbers_ are known, we can use a variety of `slice()` functions. + +```R +starwars |> + select(name | homeworld) |> + slice(20:25) + # A tibble: 6 × 2 +#> name homeworld +#> +#> 1 Palpatine Naboo +#> 2 Boba Fett Kamino +#> 3 IG-88 NA +#> 4 Bossk Trandosha +#> 5 Lando Calrissian Socorro +#> 6 Lobot Bespin +``` + +## Modifying a tibble + +First caveat: the _copy-on-modify_ default means that the original tibble usually remains unchanged. + +Most modifications are applied column-wise. + +Column names can be changed with `rename(newname = oldname)`, or `rename_with()` to apply a function. +Note the syntax within `rename()`. +The _contents_ of column `oldname` are _bound_ to name `newname`, hence the order. + +Column order can be changed with `relocate()`. +Specified column(s) are moved to the left-most position(s) by default, but a `.before` or `.after` argument can be used for finer positioning. + +```R +sw <- starwars |> select(name:species) |> slice(1:4) +sw + # A tibble: 4 × 11 +#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species +#> +#> 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human +#> 2 C-3PO 167 75 NA gold yellow 112 none masculine Tatooine Droid +#> 3 R2-D2 96 32 NA white, blue red 33 none masculine Naboo Droid +#> 4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human + +sw |> relocate(c(species, homeworld), .after = name) + # A tibble: 4 × 11 +#> name species homeworld height mass hair_color skin_color eye_color birth_year sex gender +#> +#> 1 Luke Skywalker Human Tatooine 172 77 blond fair blue 19 male masculine +#> 2 C-3PO Droid Tatooine 167 75 NA gold yellow 112 none masculine +#> 3 R2-D2 Droid Naboo 96 32 NA white, blue red 33 none masculine +#> 4 Darth Vader Human Tatooine 202 136 none white yellow 41.9 male masculine +``` + +For bigger changes, `mutate()` lets you: + +- Create new columns that are functions of existing columns. +- Replace an existing column, by creating a new column with the same name. +- Delete a column, by setting its value to `NULL`. + +Clearly, `mutate()` is powerful, potentially confusing, and a reason to be very grateful for copy-on-modify. + +There is no obvious reason to care about the Body Mass Index of Star Wars characters, but just in case: + +```R +starwars |> + select(c(name, species, height, mass)) |> + mutate(BMI = mass / (height / 100)^2) |> + head(4) + # A tibble: 4 × 5 +#> name species height mass BMI +#> +#> 1 Luke Skywalker Human 172 77 26.0 +#> 2 C-3PO Droid 167 75 26.9 +#> 3 R2-D2 Droid 96 32 34.7 +#> 4 Darth Vader Human 202 136 33.3 +``` + +**Row-wise operations** are less common for modifying single tibbles (merging multiple tibbles will be discussed in a later concept). + +One exception: `arrange()` sorts rows by the values in one or more columns. + +```R +tbl +#> A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + +tbl |> arrange(languages) +#> A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 Julia 2012 TRUE +#> 3 Python 1991 TRUE +#> 4 R 1993 TRUE +``` + +~~~~exercism/caution +Many functions, such as `arrange()`, `filter()` and `mutate()` are data-masking and require data-masking variables. +For this reason, non-data-masking arguments (e.g. character vectors) need to be converted to be used in data-masking functions. + +A full treatment of how data-masking works in R is beyond the scope of this concept, but it's useful to know there are ways of making this conversion which include options such as: `pick()`, `.data[[]]` and `!!sym()`. + +```R + # arrange() with string input fails silently +tbl |> arrange("languages") +#> A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 R 1993 TRUE +#> 3 Python 1991 TRUE +#> 4 Julia 2012 TRUE + +tbl |> arrange(pick("languages")) +#> A tibble: 4 × 3 +#> languages created has.syllabus +#> +#> 1 Fortran 1957 FALSE +#> 2 Julia 2012 TRUE +#> 3 Python 1991 TRUE +#> 4 R 1993 TRUE +``` +~~~~