Python & R Guide Update #4

Open

clayton-halim wants to merge 4 commits into uwaterloo-datascience:master from clayton-halim:master

Member

clayton-halim commented Aug 18, 2017

R

Created guide:
- import
- imputation
- a little bit of plotting.

Python

Removed the large output in the Python guide.
Used pandas.DataFrame.info() to determine the number of missing values in each column.
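
A minimal sketch of that check, assuming made-up toy data rather than the actual Titanic file: info() prints the non-null count per column, and isna().sum() gives the missing counts directly.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the Titanic training set.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# info() reports the non-null count per column; comparing it with the
# total number of rows shows how many values are missing.
buf = io.StringIO()
df.info(buf=buf)
info_text = buf.getvalue()
print(info_text)

# isna().sum() states the number of missing values per column directly.
missing_counts = df.isna().sum()
print(missing_counts)
```

The info() summary stays compact no matter how many rows the dataframe has, which is the point of using it in place of the large output.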

clayton-halim added 3 commits

August 18, 2017 18:14


Add R Guide

753d9ce

Remove giant dataframe output

Add extra context to guide introduction

59b8d58

clayton-halim changed the title ~~Started R guide~~ Python & R Guide Update


Update section to find NA values through info()

697f93c

jxnl self-requested a review

August 19, 2017 04:09

jxnl requested changes

Collaborator

jxnl left a comment

Awesome communication throughout the text; my requests are mostly just style changes. It would also be easier to review if the guide were output as markdown/html.

Python Guide/.ipynb_checkpoints/Python Kaggle Guide (Titantic)-checkpoint.ipynb

		@@ -0,0 +1,545 @@
		{

Collaborator

jxnl Aug 19, 2017

This file is in .ipynb_checkpoints, which you should add to the .gitignore.

R Guide/R Kaggle Guide (Titanic).Rmd

@@ -0,0 +1,96 @@
+              ---
+              title: "R Kaggle Guide (Titanic)"
+              author: "UWaterloo Data Science Club"

Collaborator

jxnl Aug 19, 2017

You're welcome to use your name here :) You should take credit for the tutorial.

R Guide/R Kaggle Guide (Titanic).Rmd

+              This guide looks at the Titanic dataset to see if we can predict what types of people would have survived on the Titanic.
+              First, we will import some useful libraries. R is old, and confusing inconsistencies have accumulated in the language over time; the tidyverse stack is a set of libraries that makes these functions more consistent and powerful.
+              ```{R}

Collaborator

jxnl Aug 19, 2017

You can add chunk options like the ones below to hide the chunk and suppress warning messages.

{R include=FALSE, warning=FALSE}

R Guide/R Kaggle Guide (Titanic).Rmd

+              The `$` lets us select specific variables in a dataframe.
+              ```{R}
+              titanic_data$Survived <- as.factor(titanic_data$Survived)

Collaborator

jxnl Aug 19, 2017

Is there a reason to use this notation instead of dplyr's mutate()?

titanic_data %>%
   mutate(Survived=as.factor(Survived),
          Pclass=as.factor(Pclass), 
   ...)

R Guide/R Kaggle Guide (Titanic).Rmd

+              We can observe the first `n` entries of our dataframe by using the `head()` function; likewise, to observe the last `n` entries we can use `tail()`. If there are too many variables, the output will omit some of them to save space.
+              ```{R}
+              head(titanic_data, 5)

Collaborator

jxnl Aug 19, 2017

I'd love for you to introduce the %>% operator, since it's the preferred way of doing things.

Perhaps explain what it does, and show that you can do both:

head(df, 5) and df %>% head(5)

R Guide/R Kaggle Guide (Titanic).Rmd


		## INCOMPLETE SECTION

		Another method of imputation is through prediction. It would be naive to use simple methods such as mean because we have other data that hint towards the age of a passenger. We can make a model to estimate the age from the other information we have.

Collaborator

jxnl Aug 19, 2017

Please include a section on missing data mechanisms.

More information can be found in the missing data section of Elements of Statistical Learning.

The basic idea is that missingness can itself be predictive, so we can always include is.na(feature) as a new indicator feature.
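
For the Python guide, a rough pandas equivalent of the is.na(feature) indicator might look like the sketch below. It runs on made-up data, the Age_missing column name is only for illustration, and the median fill is a placeholder for whatever imputation the guide ends up using.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the Titanic dataset.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Survived": [0, 1, 1, 0],
})

# Record which rows were missing Age before imputing, so a model can
# still use the missingness itself as a signal.
df["Age_missing"] = df["Age"].isna().astype(int)

# Simple median imputation, used here only to keep the sketch short.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

The indicator has to be created before the fill; once Age is imputed, the information about which rows were originally missing is gone.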
