[New Concept] Dataframes draft #506
|
After a first look through, I see no problem with the scope since it seems to hit the main points well without overextending. I'll do a more careful reading whenever it's time to review ;) |
|
Another cop-out suggestion: I'm backing away from the idea of including Pandas and SQL comparisons in the first release of this concept. It's still a longer-term ambition, but probably not the most urgent priority as we try to move the syllabus towards public release. Thoughts? |
I can agree with this because adding those comparisons definitely falls into the above-and-beyond category since they are only relevant to the subset of students who have worked / will work with Pandas or SQL (large as that subset may be). |
|
As for exercises, I'm thinking we could create one around a built-in dataset. This is close to what you did in the The questions I guess we'd have to answer are:
I think answering the first question would help pick an appropriate dataset to play with. This seems like the most straightforward way to come up with an exercise. Any thoughts? |
|
Ooh, I didn't know about the I've tended to assume we'll keep the scope relatively simple for this introductory concept: subsetting by rows and columns, column mutation, maybe some row sorting (though that's pretty boring). There will be future concept(s) for broader data wrangling. I'm pretty sure of that, because it's the sort of thing I would enjoy writing (unlike some concepts that I just felt duty-bound to add, such as regex). Testing will need some thought. Small subsets can be defined within the test file, bigger stuff could be a CSV or similar, imported with |
|
Incidentally, I won't bother fixing merge conflicts until we're ready to merge. There are still multiple open PRs that will stomp on each other. |
The idea I had was that, since we will be doing deterministic manipulations, we don't have to compare entire tibbles. All we would likely need is a representative slice of the expected return that can be hard coded, and then we can compare it to the same slice of the student's actual return. As a simplified example, say there's a sorting operation to check. We probably just need to check a (few) chosen location(s) to make sure the correct order is there. This effectively turns a tibble comparison into a small vector comparison.
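As a rough sketch of the slice-comparison idea above: the tibble, the function name `sort_by_height`, and the chosen positions are all hypothetical, picked only to show the shape of such a check, and the real tests would presumably use the track's usual test framework rather than a bare comparison.

```r
library(dplyr)

# Hypothetical small dataset defined inside the test file.
trees_small <- tibble(
  Height = c(70, 65, 87, 72),
  Volume = c(10.3, 10.2, 77.0, 16.4)
)

# Hypothetical student solution: sort rows by Height, tallest first.
sort_by_height <- function(tbbl) {
  tbbl |> arrange(desc(Height))
}

result <- sort_by_height(trees_small)

# Compare only a representative slice (first and last rows of one column)
# instead of the whole tibble, so the check is a small vector comparison.
expected_slice <- c(87, 65)
actual_slice <- result$Height[c(1, 4)]
slice_matches <- identical(actual_slice, expected_slice)
```

The hard-coded slice keeps the test file small, at the cost of only catching errors that show up at the chosen positions.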
I agree we should avoid them, mainly because I was thinking it might be nice to show how a dataset is loaded and made into a tibble as part of the exercise. Either we could make this into a task that the student has to do, or we could include the code as part of the stub so the student can at least see it. With the scope of the Also, I vote to avoid dates and times since they seem like an unnecessary complication. |
I think this is just |
|
After a cursory look through the list of datasets, I've found three candidates that seem to meet our requirements (namely relatively small with numeric entries):
There are certainly more that might work, and I didn't look at every one on the list, but I did have a bit of a hard time finding something that was both sufficiently interesting (i.e. complex) while not involving time series or factors. If you don't like any of these, feel free to propose something else since I'm certainly not attached to any. I figure that after we pick a dataset, we could talk about which manipulations we'd like to include in an exercise. I'd like to defer to your judgement on that since I feel you have a better idea of what might be most important, but I'd naively lean towards a core including:
Other operations would almost certainly be tacked onto this core depending on the flow of the story (e.g. subsetting,
I think I missed it because I had just skimmed the |
|
I'm intrigued by the Swiss one: it has a bit more structure for the exercise to work with. Also, it's a nostalgic reminder of places I've cycled through or (Porrentruy) been to many times on cross-country skis. Not a good basis for designing a syllabus! I should certainly add a couple of lines on |
|
I've added the interconversions bit, and (I think) fixed the merge conflict. Looking more closely at |
Nope, I think you're on point here. After having a closer look, I'm having a bit of difficulty coming up with a narrative that works with different operations (and my mind keeps defaulting to looking at correlations). That said:
Does that sound like enough to go on? Or is it a bit contrived, and should we explore another dataset? With the
With the |
|
I'm starting to think we're going to have to supply our own dataset as a CSV file, so we have a better range of options. Let me think about it (after lunch). There may be something we can pick up from the astrophysics classes I used to take. If we want to borrow from the Python world, there's a repo of Seaborn data. Titanic is used a lot in Pandas textbooks, I need to look through the others. Let's keep notes on things that may be useful in later concepts, even if unsuitable for this one. |
|
I see lots of data for:
All good for future concepts, but we need something simpler for this first introduction. Your idea about |
|
If you want, I can try to put something together for I like the idea of providing a proper Should I give it a go? |
|
Let's try it! |
|
I've thrown together a quick
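library(datasets)
library(tidyverse)
# Task 1: Girth in `trees` is really a diameter (per its help page), hence the rename
tree_tibble <- trees |> as_tibble() |> rename(Diameter = Girth)
# Task 2: add derived columns, then round every column (all are numeric here)
add_girth_weight <- function(tbbl, rnd = 1) {
  tbbl |> mutate(Girth = pi * Diameter, Weight = 35 * Volume) |> round(rnd)
}
# Task 3: move the named columns to the front, then sort by the first of them
# all_of() accepts an outside character vector; .data[[...]] looks up a column by string
rearrange <- function(tbbl, rearrangement) {
  tbbl |> relocate(all_of(rearrangement)) |> arrange(.data[[rearrangement[1]]])
}
# Task 4: `selection` must include Height and Weight, since the filter runs after the select
lumber <- function(tbbl, selection, min_height, max_height, max_weight) {
  tbbl |> select(all_of(selection)) |> filter(between(Height, min_height, max_height) & Weight < max_weight)
}
The idea I'm having for the story is basically a lumber farm type operation. Is this a good enough start to continue with? If so, any ideas for further tasks or modifications to these ones? Note: I've included a |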
depial left a comment
Everything looks pretty good to me. Just a couple small suggestions.
Yes, and better than anything I can think of right now. We may get a few more ideas once it's implemented. I'll update the About, but we can't do much with the concept until it's clear what the exercise will need in the Intro. |
|
I'll try to open the PR for the exercise tomorrow so we can move everything along here. |
| 3 R2-D2 96 32 NA white, bl… red 33 none mascu… Naboo Droid <chr> <chr> | ||
| 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooine Human <chr> <chr> | ||
| # ℹ 2 more variables: starships <list>, BMI <dbl> | ||
| ``` |
I just noticed that this needs to be formatted with #>
I'll be honest, I don't see what the purpose of pick() is here. Does it just save computation of using the entire dataframe in the mutation? If so, would it be better to use c(height, mass) since that's all that's needed for the computation?
I've had similar questions running through my head. No answers.
Give me a minute to play with this... I'm starting to think the way I used it might be a bit of a hack
I don't think I'll be figuring this out anytime soon...
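For what it's worth, here is a small sketch of where pick() does and doesn't seem to matter. It is not the code under review: the tibble reuses the two heights and masses visible in the rows above, plus one hypothetical row with a missing mass, and the column name `n_missing` is made up for illustration.

```r
library(dplyr)

people <- tibble(
  name   = c("R2-D2", "Darth Vader", "Hypothetical"),
  height = c(96, 202, 180),   # cm
  mass   = c(32, 136, NA)     # kg; the third row is invented for the example
)

# Inside mutate(), columns are already visible by name (data masking),
# so a column-wise calculation like BMI needs no pick() at all.
with_bmi <- people |> mutate(BMI = mass / (height / 100)^2)

# pick() returns a tibble of the selected columns, which helps when a
# function expects a data frame or matrix rather than single vectors.
with_missing <- people |>
  mutate(n_missing = rowSums(is.na(pick(height, mass))))
```

On this reading, pick() is mainly a way to hand several columns to a function as one data frame; for plain arithmetic on two columns, naming them directly does the same job.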
|
I've copied the |
|
I've been looking into the issues we've run into with data masking and tidy selection, which fall under the umbrella concept of tidy evaluation. Long story short, they have made a more common thing easier (hard-coded scripts with pipelines), which made something else harder (generalized code). At a minimum, I think I should rewrite the caution block, since I have more knowledge about how things are working now, but I'm wondering how you're feeling about the inclusion of these ideas.
Summary and possible rewrite of caution block
Tidyverse works with tidy evaluation. In a nutshell, this is scoping behavior that allows dataset variables to be used in pipes without explicit mention of the dataset, as seen in examples throughout this concept. This streamlines a lot of one-off scripts, which are quite common in data science, but it introduces difficulties when trying to write general code or apps (e.g. wrapper functions). A full treatment of tidy evaluation is beyond the scope of this concept, but, for the purposes of this exercise, the two main points are:
Data masking functions require data-masking variables, which string vectors are not. This can be handled by wrapping a variable
Tidy selection functions don't allow for outside vectors. This can often be handled with
Specifically:
While I find this information to be important and interesting, that certainly doesn't mean that the general R public will. Even if it's not high enough priority, I would like to try to include it if it's easy enough for a student to get through the exercise (again, even without knowledge), since I think it's helpful to make students at least aware of the issue. If you think we should try to drop it, there are a few ways I could see doing that:
If we drop these, there are no guarantees that students wouldn't naively try to use character strings (and/or outside vectors) anyway, so any subset that does will have to learn what went wrong on their own. In the end, I would like to find a way to include it, albeit as concisely as possible, but I need to defer to your judgement since you are more familiar with the usage space. |
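To make the two points in the caution-block draft concrete, here is a minimal sketch of one common way to handle each. It assumes the `.data` pronoun for data masking and `all_of()` for tidy selection; both are standard dplyr/tidyselect tools, but they are not necessarily the ones the draft names. The tibble and the helper names `sort_by` and `keep` are hypothetical.

```r
library(dplyr)

df <- tibble(
  Height = c(70, 65, 87),
  Volume = c(10.3, 10.2, 77.0)
)

# Data masking (arrange, filter, mutate): a column name held in a string
# is not a data variable, so look it up through the .data pronoun.
sort_by <- function(tbbl, column) {
  tbbl |> arrange(.data[[column]])
}

# Tidy selection (select, relocate): wrap an outside character vector
# in all_of() so it is treated as a set of column names.
keep <- function(tbbl, columns) {
  tbbl |> select(all_of(columns))
}

sorted <- sort_by(df, "Height")
kept <- keep(df, c("Volume"))
```

A bare string passed to arrange() would be treated as a constant rather than a column, which leaves the rows unsorted without raising an error; that silent failure is arguably the main reason to warn students about it.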
|
Ummm... I don't have a quick and simple answer for this, so let's take time and think about it. It raises interesting but non-trivial issues that I wasn't aware of - though I need to learn! I'm tempted to park this for a few days, and try to get the middle of the syllabus sorted (functions, functional programming). I'd (maybe) even be willing to launch the syllabus without dataframes in the beta, so long as it is a fast-follower. |
|
Thinking about it some more (in the shower: a classic approach), these are some more options:
I'm currently moving towards (2), avoiding feature creep here. But would we be left with enough of an exercise? |
|
I have no problem waiting a bit on this in order to get this topic sorted. Tidy evaluation doesn't look terribly complex after reading a bit about it and seeing it as scoping behavior. I think it was just not something I'd thought of or encountered before since I'd only ever written hard coded scripts in the few times I've had to use R and Tidyverse before. Here are two other links I found useful when trying to get a handle on the subject: Mastering Shiny: Chapter 12 The first one is the most comprehensive, but doesn't say much more than the link I provided in my previous post. The interesting thing about it is they are describing it so users can build Shiny apps (i.e. use wrapper functions and more generic code). The second one is more examples than explanation and shows some usage with |
We would likely be able to keep most of the exercise as is; the student would just have to hard-code the columns in (à la the That said, I would think it's best to pause while we familiarize ourselves with the topic, so we can make a better decision then. EDIT: Just a thought: since this is a Tidyverse-specific feature, I think it would have to make up a section in a wider concept, rather than be a concept in its own right. |
No problem. Nothing written yet, but I've started thinking about a database-operations concept to follow this. Including pivoting and grouping (very Tidy data things), but I can imagine this fitting in with it. |

A big one: at least as important to R as the multiple-dispatch concept is in Julia.
I've tried to get the scope right: enough for an introductory concept, but deferring a lot to later concepts. Not sure I succeeded!
One obvious problem you will notice is that I boasted about including comparisons with Pandas and SQL syntax, then didn't include any. I still think it would be good in the About (not the Intro, obviously). I just need to find the time and brain-power. Using dplyr fully for the first time, I'm very impressed. Quite a few things will be harder in the other languages.
I still have no idea about an exercise to pair with this. Agreeing a concept scope will be a good first step.