Python & R Guide Update #4
clayton-halim wants to merge 4 commits into uwaterloo-datascience:master from clayton-halim:master
jxnl
left a comment
Awesome communication throughout the text; mostly just some style changes. Also, it's actually easier to output it as markdown/HTML.
| @@ -0,0 +1,545 @@ | |||
| { | |||
This file is in .ipynb_checkpoints, which you should add to the .gitignore.
| @@ -0,0 +1,96 @@ | |||
| --- | |||
| title: "R Kaggle Guide (Titanic)" | |||
| author: "UWaterloo Data Science Club" | |||
You're welcome to use your name here :) You should take credit for the tutorial.
| This guide will look at the Titanic dataset, we will see if we can predict what types of people would have survived on the Titanic. | ||
|
|
||
| So first we will import some useful libraries. R is old and there are confusing things about the language that came up over time, the tidyverse stack is a set of libraries that make these functions more consistent and powerful. | ||
| ```{R} |
You can add chunk options like the ones below to suppress warning messages.
{R include=FALSE, warning=FALSE}
| The `$` let's us select specific variables in a dataframe. | ||
|
|
||
| ```{R} | ||
| titanic_data$Survived <- as.factor(titanic_data$Survived) |
Is there a reason to use this notation vs. dplyr's mutate?
titanic_data %>%
  mutate(Survived = as.factor(Survived),
         Pclass = as.factor(Pclass),
         ...)
| We can observe the first `n` entries of our dataframe by using the `head()` function, likewise we to observe the last `n` entires we can use `tail()`. If there are too many variables, the output will omit them to save space. | ||
|
|
||
| ```{R} | ||
| head(titanic_data, 5) |
I'd love for you to introduce the %>% operator, just because it's the preferred way of doing things.
Perhaps explain what it does, and show that you can do both
head(df, 5) and df %>% head(5).
|
|
||
| ## INCOMPLETE SECTION | ||
|
|
||
| Another method of imputation is through prediction. It would be naive to use simple methods such as mean because we have other data that hint towards the age of a passenger. We can make a model to estimate the age from the other information we have. |
Please include a section on missing data mechanisms.
More information can be found in the missing data section of Elements of Statistical Learning.
Basically, the fact that a value is missing can itself be predictive, so we can always include is.na(feature) as a new indicator variable.
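For the Python side of the guide, the same indicator idea could look roughly like the sketch below. The toy dataframe and the `Age_missing` column name are made up for illustration, and the median fill is only a placeholder for whatever imputation the guide ends up using.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# Record which rows were missing BEFORE imputing, so that signal is kept
# as its own feature (the pandas analogue of is.na(feature) in R).
df["Age_missing"] = df["Age"].isna().astype(int)

# Placeholder imputation: fill the gaps with the median of the observed ages.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

The order matters here: once `Age` is filled in, the missingness information is gone unless the indicator was created first.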
R
Python
pandas.DataFrame.info() to determine the number of missing values in each column
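A minimal sketch of what that could look like, assuming a small made-up dataframe rather than the real Titanic file. `info()` reports non-null counts per column, so the missing count has to be read off against the row total; `isna().sum()` gives the missing counts directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Prints dtype and non-null count for each column.
df.info()

# Number of missing values per column, and the share of rows they represent.
missing_counts = df.isna().sum()
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```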