
[New Concept] Dataframes draft#506

Open
colinleach wants to merge 6 commits into exercism:main from colinleach:dataframes

@colinleach

Contributor

A big one: at least as important to R as the multiple-dispatch concept is in Julia.

I've tried to get the scope right: enough for an introductory concept, but deferring a lot to later concepts. Not sure I succeeded!

One obvious problem you will notice is that I boasted about including comparisons with Pandas and SQL syntax, then didn't include any. I still think it would be good in the About (not the Intro, obviously). I just need to find the time and brain-power. Using dplyr fully for the first time, I'm very impressed. Quite a few things will be harder in the other languages.

I still have no idea about an exercise to pair with this. Agreeing a concept scope will be a good first step.

@depial

depial commented May 28, 2026

Contributor

After a first look through, I see no problem with the scope, since it seems to hit the main points well without overextending.

I'll do a more careful reading whenever it's time to review ;)

@colinleach

Contributor Author

Another cop-out suggestion: I'm backing away from the idea of including Pandas and SQL comparisons in the first release of this concept. It's still a longer-term ambition, but probably not the most urgent priority as we try to move the syllabus towards public release.

Thoughts?

@depial

depial commented Jun 5, 2026

Contributor

It's still a longer-term ambition, but probably not the most urgent priority as we try to move the syllabus towards public release.

I can agree with this because adding those comparisons definitely falls into the above-and-beyond category since they are only relevant to the subset of students who have worked / will work with Pandas or SQL (large as that subset may be).

@depial

depial commented Jun 7, 2026

Contributor

As for exercises, I'm thinking we could create one around a built-in dataset. This is close to what you did in the about.md, but it could have more real-world context.

The questions I guess we'd have to answer are:

  1. Scope - From simple manipulations to a more complete data wrangling pipeline.
  2. Testing - Hard coded tests might be awkward with tibbles, but likely manageable by testing subsets.

I think answering the first question would help pick an appropriate dataset to play with.

This seems like the most straightforward way to come up with an exercise. Any thoughts?

@colinleach

Contributor Author

Ooh, I didn't know about the datasets library. I've confirmed it is available in the test runner. It's clearly a much wider range than the stuff in dplyr (we used starwars, there is also bands for practicing multi-table joins, and storms for time series plotting). Other common practice datasets (such as palmerpenguins) need to be installed from CRAN.

I've tended to assume we'll keep the scope relatively simple for this introductory concept: subsetting by rows and columns, column mutation, maybe some row sorting (though that's pretty boring).

There will be future concept(s) for broader data wrangling. I'm pretty sure of that, because it's the sort of thing I would enjoy writing (unlike some concepts that I just felt duty-bound to add, such as regex).

Testing will need some thought. Small subsets can be defined within the test file, bigger stuff could be a CSV or similar, imported with readr.

@colinleach

Contributor Author

Incidentally, I won't bother fixing merge conflicts until we're ready to merge. There are still multiple open PRs that will stomp on each other.

@colinleach

Contributor Author

When picking a dataset:

  • We probably want to avoid factors, or else convert them to string columns before the exercise stubs.
  • If we have dates/times, that would force another prereq on this concept.

Current plan, so far as there is one:
image

@depial

depial commented Jun 7, 2026

Contributor

Testing will need some thought. Small subsets can be defined within the test file, bigger stuff could be a CSV or similar, imported with readr.

The idea I had was that, since we will be doing deterministic manipulations, we don't have to compare entire tibbles. All we would likely need is a representative slice of the expected return that can be hard coded, and then we can compare it to the same slice of the student's actual return. As a simplified example, say there's a sorting operation to check. We probably just need to check a (few) chosen location(s) to make sure the correct order is there. This effectively turns a tibble comparison into a small vector comparison.
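
The slice-comparison idea can be sketched roughly as follows. This is only an illustration, assuming dplyr is available; `sort_by_height()` is a hypothetical stand-in for a student's function, and the expected values are the shortest and tallest trees in the built-in trees data.

```r
library(dplyr)

# Hypothetical student solution: sort the built-in trees data by Height.
sort_by_height <- function(tbbl) {
  tbbl |> arrange(Height)
}

result <- trees |> as_tibble() |> sort_by_height()

# Compare a few chosen positions instead of the whole tibble.
checked_rows <- c(1, nrow(result))
actual <- result$Height[checked_rows]
expected <- c(63, 87)  # hard-coded slice: first and last Height after sorting

stopifnot(identical(actual, expected))
```

Checking only the first and last rows keeps the expected values short enough to hard code, at the cost of not catching errors in the middle of the ordering; a real test would probably sample a few more positions.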

We probably want to avoid factors

I agree we should avoid them, mainly because I was thinking it might be nice to show how a dataset is loaded and made into a tibble as part of the exercise. Either we could make this into a task that the student has to do, or we could include the code as part of the stub so the student can at least see it. With the scope of the about.md right now, it would likely have to be the latter, which is fine since that allows for more task space for operations/manipulations.

Also, I vote to avoid dates and times since they seem like an unnecessary complication.

@colinleach

Contributor Author

we could make this into a task that the student has to do, or we could include the code as part of the stub so the student can at least see it

I think this is just as_tibble(). If it's not already clear from the about.md I should add it (it's buried in a note block at the moment).
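
As a minimal sketch of that conversion (assuming dplyr is installed, which re-exports `as_tibble()`; the variable names are only illustrative), a built-in data frame can be converted in each direction like this:

```r
library(dplyr)

# trees ships with base R (datasets package) as a plain data.frame.
tree_tibble <- as_tibble(trees)

# as.data.frame() converts back and drops the tibble classes.
tree_df <- as.data.frame(tree_tibble)

class(tree_tibble)  # "tbl_df" "tbl" "data.frame"
class(tree_df)      # "data.frame"
```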

@depial

depial commented Jun 9, 2026

Contributor

After a cursory look through the list of datasets, I've found three candidates that seem to meet our requirements (namely relatively small with numeric entries):

  • swiss
  • trees
  • USPersonalExpenditure

There are certainly more that might work, and I didn't look at every one on the list, but I did have a bit of a hard time finding something that was both sufficiently interesting (i.e. complex) while not involving time series or factors. If you don't like any of these, feel free to propose something else since I'm certainly not attached to any.

I figure that after we pick a dataset, we could talk about which manipulations we'd like to include in an exercise. I'd like to defer to your judgement on that since I feel you have a better idea of what might be most important, but I'd naively lean towards a core including:

  • loading dataset / creating tibble
  • columnwise operations
  • mutate

Other operations would almost certainly be tacked onto this core depending on the flow of the story (e.g. subsetting, arrange, etc). Let me know what your feelings are on the matter.

(it's buried in a note block at the moment)

I think I missed it because I had just skimmed the about.md to get a quick review and had expected to find this information closer to the beginning (e.g. in the Creating a tibble section).

@colinleach

Contributor Author

I'm intrigued by the Swiss one: it has a bit more structure for the exercise to work with. Also, it's a nostalgic reminder of places I've cycled through or (Porrentruy) been to many times on cross-country skis. Not a good basis for designing a syllabus!

I should certainly add a couple of lines on as_tibble() and as.data.frame() higher up the docs.

@colinleach

Contributor Author

I've added the interconversions bit, and (I think) fixed the merge conflict.

Looking more closely at swiss, it's not clear what operations we can do on it. Great for looking at correlations (lm() and scatter plots), not so much for calculating new columns. Am I missing something?

@depial

depial commented Jun 9, 2026

Contributor

Am I missing something?

Nope, I think you're on point here. After having a closer look, I'm having a bit of difficulty coming up with a narrative that works with different operations (and my mind keeps defaulting to looking at correlations). That said:

  • I was thinking there could be a way to combine Fertility and Infant.Mortality into a new column.
  • The above would involve columnwise operations.
  • Subsetting functions would be fairly easy to come up with.
  • arrange could be used trivially to sort by an arbitrary column, and relocate could be used at the same time.

Does that sound like enough to go on? Or is it a bit contrived, and should we explore another dataset?

With the trees dataset, I see potential for:

  • Renaming a column (Girth -> Diameter)
  • Creating a new column (e.g. difference between Volume and a cone calculated from Diameter and Height)
  • Subsetting and arrange are probably going to be easy for any dataset.

With the USPersonalExpenditure dataset, I see largely the same thing.
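
The trees suggestions above might look something like this sketch. It assumes dplyr is installed and relies on the units given in the dataset's documentation (diameter in inches, height in feet, volume in cubic feet); the column names `Cone` and `Excess` are hypothetical.

```r
library(dplyr)

tree_tibble <- trees |>
  as_tibble() |>
  # The built-in "Girth" column actually holds the diameter, so rename it.
  rename(Diameter = Girth) |>
  mutate(
    # Volume of a cone: pi * r^2 * h / 3, with the radius converted
    # from inches of diameter to feet (Diameter / 2 / 12).
    Cone = pi * (Diameter / 24)^2 * Height / 3,
    # How much the measured volume exceeds the cone estimate.
    Excess = Volume - Cone
  )
```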

@colinleach

colinleach commented Jun 9, 2026

Contributor Author

I'm starting to think we're going to have to supply our own dataset as a CSV file, so we have a better range of options. Let me think about it (after lunch). There may be something we can pick up from the astrophysics classes I used to take.

If we want to borrow from the Python world, there's a repo of Seaborn data. Titanic is used a lot in Pandas textbooks, I need to look through the others. Let's keep notes on things that may be useful in later concepts, even if unsuitable for this one.

@colinleach

Contributor Author

I see lots of data for:

  • time series plotting
  • grouping and pivoting
  • linear correlations and curve fitting

All good for future concepts, but we need something simpler for this first introduction.

Your idea about trees is starting to look attractive, working around the wrongly-labelled girth column. We could estimate actual girth (circumference) in one of the tasks. It's just a pity there aren't more columns, to practice subsetting.

@depial

depial commented Jun 9, 2026

Contributor

If you want, I can try to put something together for trees.

I like the idea of providing a proper Girth column after renaming the current one to Diameter. That gives four columns, and I think we could easily add another in the same task (e.g. Weight of timber) to give five. That might give us enough for somewhat meaningful subsetting and/or arranging.

Should I give it a go?

@colinleach

Contributor Author

Let's try it!

@depial

depial commented Jun 9, 2026

Contributor

I've thrown together a quick exemplar.R with four tasks to see if we can move ahead with the idea:

  1. Create the tibble and rename Girth
  2. Add columns Girth and Weight
  3. Rearrange the column order and arrange by the (new) first column.
  4. Subset by the given column names, between min-max height and max weight.
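
library(datasets)
library(tidyverse)

# Task 1: convert to a tibble; the built-in Girth column actually holds the diameter
tree_tibble <- trees |> as_tibble() |> rename(Diameter = Girth)

# Task 2: add the circumference (Girth) and an estimated Weight, rounded to rnd places
add_girth_weight <- function(tbbl, rnd = 1) {
    tbbl |> mutate(Girth = pi * Diameter, Weight = 35 * Volume) |> round(rnd)
}

# Task 3: rearrangement is a character vector of column names
rearrange <- function(tbbl, rearrangement) {
    # all_of() for tidy selection; .data[[...]] so arrange() sorts by the column, not by the string
    tbbl |> relocate(all_of(rearrangement)) |> arrange(.data[[rearrangement[1]]])
}

# Task 4: filter before select, so Height and Weight need not be in selection
lumber <- function(tbbl, selection, min_height, max_height, max_weight) {
    tbbl |> filter(between(Height, min_height, max_height) & Weight < max_weight) |> select(all_of(selection))
}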

The idea I'm having for the story is basically a lumber farm type operation. Is this a good enough start to continue with? If so, any ideas for further tasks or modifications to these ones?

Note: I've included a rnd argument in Task 2 so I could justify it being a function; otherwise, it seems like it could just be combined with Task 1, and I wouldn't want to over-complicate things right off the bat.

@depial depial left a comment

Contributor

Everything looks pretty good to me. Just a couple small suggestions.

Comment thread concepts/dataframes/about.md Outdated
Comment thread concepts/dataframes/about.md Outdated
Comment thread concepts/dataframes/links.json Outdated
@colinleach

Contributor Author

Is this a good enough start to continue with?

Yes, and better than anything I can think of right now. We may get a few more ideas once it's implemented.

I'll update the About, but we can't do much with the concept until it's clear what the exercise will need in the Intro.

@depial

depial commented Jun 10, 2026

Contributor

I'll try to open the PR for the exercise tomorrow so we can move everything along here.

Comment thread concepts/dataframes/about.md
Comment thread concepts/dataframes/about.md
Comment thread concepts/dataframes/about.md Outdated
3 R2-D2 96 32 NA white, bl… red 33 none mascu… Naboo Droid <chr> <chr>
4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooine Human <chr> <chr>
# ℹ 2 more variables: starships <list>, BMI <dbl>
```

Contributor

I just noticed that this needs to be formatted with #>

I'll be honest, I don't see what the purpose of pick() is here. Does it just save computation of using the entire dataframe in the mutation? If so, would it be better to use c(height, mass) since that's all that's needed for the computation?

Contributor Author

I've had similar questions running through my head. No answers.

@depial depial Jun 10, 2026

Contributor

Give me a minute to play with this... I'm starting to think the way I used it might be a bit of a hack

Contributor

I don't think I'll be figuring this out anytime soon...

@colinleach

Contributor Author

I've copied the introduction.md from the exercise as-is. I've tried to copy changes back into the about.md (especially the data-masking block at the end), though TBH it's anyone's guess if I caught everything.

@colinleach colinleach marked this pull request as ready for review June 11, 2026 18:33
@depial

depial commented Jun 11, 2026

Contributor

I've been looking into the issues we've run into with data masking and tidy selection, which fall under the umbrella concept of tidy evaluation. Long story short, they have made a more common thing easier (hard coded scripts with pipelines), which made something else harder (generalized code).

At a minimum, I think I should rewrite the caution block, since I have more knowledge about how things are working now, but I'm wondering how you're feeling about the inclusion of these ideas.

Summary and possible rewrite of caution block

Tidyverse works with tidy evaluation. In a nutshell, this is scoping behavior that allows dataset variables to be used in pipes without explicit mention of the dataset, as seen in examples throughout this concept. This streamlines a lot of one-off scripts, which are quite common in data science, but it introduces difficulties when trying to write general code or apps (e.g. wrapper functions).

A full treatment of tidy evaluation is beyond the scope of this concept, but, for the purposes of this exercise, the two main points are:

  • Data masking functions: mutate(), filter(), arrange()
  • Tidy selection functions: select(), relocate()

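Data masking functions require data-masking variables, which string vectors are not. This can be handled by wrapping a variable str that holds a column name: .data[[str]] for a single column, or pick(all_of(str)). If neither is used, the data-masking function may fail silently (for example, arrange(str) sorts by the constant string and leaves the row order unchanged).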

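Tidy selection functions don't accept outside character vectors directly. This can usually be handled with all_of(vector) or any_of(vector), and with {{ column }} for an unquoted column passed as a function argument. If these are not used, a deprecation warning or an error will usually be raised.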
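
A small sketch of both points, assuming dplyr is installed. The `toy` tibble and the variable names are made up for illustration and are not part of the exercise.

```r
library(dplyr)

toy <- tibble(name = c("oak", "ash", "elm"), height = c(20, 35, 12))
col <- "height"  # a column name held in a character variable

# Data masking: arrange() sees `col` as a constant string,
# so the rows keep their original order and no error is raised.
unsorted <- toy |> arrange(col)

# .data[[col]] looks the column up by name inside the data frame.
sorted <- toy |> arrange(.data[[col]])

# Tidy selection: all_of() states explicitly that `cols` is an outside vector.
cols <- c("height", "name")
reordered <- toy |> select(all_of(cols))

# any_of() is the lenient variant: names that are not columns are skipped.
lenient <- toy |> select(any_of(c("height", "missing_column")))
```

Here `unsorted` still starts with "oak", while `sorted` starts with "elm", which is the silent failure described above.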


Specifically:

  • Is it easy enough for a student to do the exercise to warrant inclusion (even if they don't understand that part)?
  • What are the chances you would include this topic in a future concept (either as part of another concept or on its own)?

While I find this information to be important and interesting, that certainly doesn't mean that the general R public will. Even if it's not high enough priority, I would like to try to include it if it's easy enough for a student to get through the exercise (again, even without knowledge), since I think it's helpful to make students at least aware of the issue.

If you think we should try to drop it, there are a few ways I could see doing that:

  1. Ditch data masking - Remove the need for pick() in Task 3.
  2. Ditch tidy selection - Remove the need for all_of() in Tasks 3 and 4.

If we drop these, there are no guarantees that students wouldn't naively try to use character strings (and/or outside vectors) anyway, so any subset that does will have to learn what went wrong on their own.

In the end, I would like to find a way to include it, albeit as concise as possible, but I need to defer to your judgement since you are more familiar with the usage space.

@colinleach

Contributor Author

Ummm...

I don't have a quick and simple answer for this, so let's take time and think about it. It raises interesting but non-trivial issues that I wasn't aware of - though I need to learn!

I'm tempted to park this for a few days, and try to get the middle of the syllabus sorted (functions, functional programming). I'd (maybe) even be willing to launch the syllabus without dataframes in the beta, so long as it is a fast-follower.

@colinleach

Contributor Author

Thinking about it some more (in the shower: a classic approach), these are some more options:

  1. Merge a simplified version of the concept + exercise initially, while creating an issue to address this in a post-launch revision.
  2. Remove it permanently from this introduction-to-dataframes concept, and address it properly in a later concept.

I'm currently moving towards (2), avoiding feature creep here. But would we be left with enough of an exercise?

@depial

depial commented Jun 11, 2026

Contributor

I have no problem waiting a bit on this in order to get this topic sorted. Tidy evaluation doesn't look terribly complex after reading a bit about it and seeing it as scoping behavior. I think it was just not something I'd thought of or encountered before since I'd only ever written hard coded scripts in the few times I've had to use R and Tidyverse before.

Here are two other links I found useful when trying to get a handle on the subject:

Mastering Shiny: Chapter 12
Github blog post w/ examples

The first one is the most comprehensive, but doesn't say much more than the link I provided in my previous post. The interesting thing about it is they are describing it so users can build Shiny apps (i.e. use wrapper functions and more generic code). The second one is more examples than explanation and shows some usage with !! that is different from other treatments of the topic I've seen, so I found it interesting but somewhat irrelevant to our needs.

@depial

depial commented Jun 11, 2026

Contributor

I'm currently moving towards (2), avoiding feature creep here. But would we be left with enough of an exercise?

We would likely be able to keep most of the exercise as is; the student would just have to hard code the columns in (à la the filter() part of Task 4). It would be a bit lamer (and the story a bit more contrived), but I don't think it would be a big deal.
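
For comparison, a hard-coded variant of Task 4 might look like this sketch. The function signature and the selected columns are assumptions for illustration, not the agreed exercise design; it assumes dplyr is installed and reuses the Weight = 35 * Volume estimate from the exemplar.

```r
library(dplyr)

# Rebuild a tibble like the one in the exemplar.
tree_tibble <- trees |>
  as_tibble() |>
  rename(Diameter = Girth) |>
  mutate(Weight = 35 * Volume)

# Columns are written directly in the pipeline, so neither all_of()
# nor .data is needed; only the numeric limits are passed in.
lumber <- function(tbbl, min_height, max_height, max_weight) {
  tbbl |>
    filter(between(Height, min_height, max_height) & Weight < max_weight) |>
    select(Diameter, Height, Weight)
}

small_trees <- lumber(tree_tibble, 70, 80, 1000)
```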

That said, I would think it's best to pause while we familiarize ourselves with the topic and we can make a better decision then.

EDIT: Just a thought, since this is a Tidyverse specific feature, I think it would have to make up a section in a wider concept, rather than be a concept in its own right.

@colinleach

Contributor Author

it would have to make up a section in a wider concept, rather than be a concept in its own right

No problem. Nothing written yet, but I've started thinking about a database-operations concept to follow this. Including pivoting and grouping (very Tidy data things), but I can imagine this fitting in with it.
