Stat155Notes/02-visualization.Rmd at master · Mac-STAT/Stat155Notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
```{r setup2, include=FALSE, warnings = FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, warnings=FALSE, message=FALSE, tidy=FALSE, fig.align = 'center')
options(warn=-1)
library(broom)
library(dplyr)
library(ggplot2)
library(stringr)
library(rvest)
library(mosaicData)
library(ggmosaic)
library(NHANES)
library(tidytext)
library(glue)
library(readr)
library(ggdag)
```


# Visualizing Data

The first step in any data analysis is to visually explore your data.

There is a saying that "a picture is worth a 1000 words." In making visualizations, our goal is to quickly and easily get a better understanding of the variability, structure, and relationships that exist in the data.

Here we will cover the standard appropriate graphics for univariate variation and bivariate relationships. We will also cover techniques for multivariate relationships (3 or more variables). The choice of the graphic depends on the type(s) of variable(s): quantitative or categorical. Thus, the first step in creating a graphic is to think about the variables that you are interested in visualizing and determining whether they are quantitative or categorical.

For each type of variable, we use a real dataset to illustrate the visualizations.

## Good Visualization Principles

Before we discuss the standard graphics, let's lay out the basic design principles for good data visualizations.

1. **Show the data:** This may be self-explanatory, but make sure that the data is the focus and driver of the visualization.

2. **Avoid distorting the data:** Avoid 3D charts as the added dimension distorts the comparison. The area in a graph should equal the magnitude of the data it is representing.

3. **Simplify:** In 1983, Edward Tufte said that "A large share of ink on a graphic should present data-information, the ink changing as the data change. Data-ink is the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented." Remove any unnecessary "ink" that does not assist in the presentation of the data. Remove distractions.

4. **Facilitate comparisons:** In order to explain variation, we want the graphics to facilitate comparisons between groups. The design should make it easier to compare between groups rather than harder.

5. **Use contrast:** Humans have developed to seek out visual contrast. When choosing colors and annotation, strive for more contrast in luminance (white to dark) to make it easier for everyone to visually perceive.

6. **Use color appropriately:** Think about your audience. A small proportion of the population is color-blind; try printing it in grayscale to see if the color palette is still effective. Also, every culture has different associations with colors; ask others for feedback on color choices. Neuroscience research has shown that humans are more sensitive to red and yellow, so those are good colors to use for highlighting key points.

7. **Annotate appropriately:** Informative text is crucial for providing data context. Make sure to use informative axis labels and titles. It may be worth adding text to explain extreme outliers.

For examples of good data visualizations in the news and discussion around them, check out the New York Times column ["What's Going on in This Graph?"](https://www.nytimes.com/column/whats-going-on-in-this-graph).

## Brief Intro to R

Throughout this class, we use R and RStudio to visualize, analyze, and model real data. To straighten out which is which: [R](https://cran.r-project.org/) is the name of the language itself (syntax, words, etc.) and [RStudio](https://www.rstudio.com/) is a convenient software interface that you'll interact with on the computer.

While you'll be learning about and using R throughout the course, this is not a course on R. Our focus will be on data and statistical modelling. We will be using R and RStudio as tools to help us get information from data.

### Basic Syntax

For this class, we will have data that we want to pass to a function that performs a particular operation (does something cool) on our data. Thus, we'll pass **inputs** as arguments to a **function**:

```{r eval=FALSE}
FunctionName(argument1 = a1, argument2 = a2,..., argumentk = ak)
```

Note the `FunctionName` and the use of parantheses. Inside the parantheses, the argument name (`argument1`) goes first and the value you are passing as an input is after = (`a1`).


We may want to save the **output** of the function by assigning it a name using the assignment operator, `<-`:

```{r eval=FALSE}
OutputName <- FunctionName(argument1 = a1, argument2 = a2,..., argumentk = ak)
```


R allows us to be lazy and not include the argument name as long as we provide the input in the correct order:

```{r eval=FALSE}
OutputName <- FunctionName(a1, a2,..., ak)
```

We can also **nest** functions by first performing one operation and then passing that as an input into another function. In the code below, `Function1()` would first run with the input `data` and create some output that is then passed as the first input in `Function2()`. So R evaluates functions from the inside-out.

```{r eval=FALSE}
Function2(Function1(data))
```

As we go through real examples below, notice the names of the functions that we use. The name comes right before `(` and the inputs we pass in right after `(`.

Additionally, we are going to use a shortcut that makes our code more readable. It is called a **pipe** and looks like `%>%`. What this does is pass the output on its left as the first argument to the function on the right. The following two sections of code do exactly the same thing but the second is easier to read. For this code, we take data and summarize the variable height and then take the mean of the heights.

```{r eval=FALSE}
summarize(data, mean(height))

data %>%
  summarize(mean(height))
```

There is so much more we could say about functions in R, but we will stop here for now.


With this in mind, we'll point to external references if you'd like to go deeper in your understanding of R as a programming language throughout this class.

To get a broad sense of R, you can work through R primers (https://rstudio.cloud/learn/primers) in RStudio Cloud in addition to any coursework and use the R cheatsheets available online (https://rstudio.cloud/learn/cheat-sheets).


## Anatomy of a ggplot command

To learn more about visualizing data with the ggplot2 R package, see [Hadley Wickham's textbook](https://r4ds.had.co.nz/data-visualisation.html).

In this course, we'll largely construct visualizations using the `ggplot()` function from the `ggplot2` R package. NOTE: `gg` is short for "grammar of graphics". Plots constructed from the `ggplot()` function are constructed in layers, and the syntax used to create plots is meant to reflect this layered construction. As you read through the rest of this chapter, pay attention to how the syntax generally follows this structure:

```{r eval=FALSE}
data %>%
    ggplot(aes(x = X_AXIS_VARIABLE, y = Y_AXIS_VARIABLE)) +
    VISUAL_LAYER1 +
    VISUAL_LAYER2 +
    VISUAL_LAYER3 + ...
```

We pass the aesthetic mapping from the data set to the plot with `aes()`. The visual layers are features such as points, lines, and panels. We'll introduce these soon. The `+`'s allow us to add layers to build up a plot (note this is not the pipe!).


```{block, type="reflect"}
What are the function names in the example above? There are only two as it is written right now.
```


## One Categorical Variable

First, we consider survey data of the electoral registrar in Whickham in the UK (Source: Appleton et al 1996). A survey was conducted in 1972-1974 to study heart disease and thyroid disease and a few baseline characteristics were collected: age and smoking status. 20 years later, a follow-up was done to check on mortality status (alive/dead).

Let's first consider the age distribution of this sample. Age, depending on how it is measured, could act as a quantitative variable or categorical variable. In this case, age is recorded as a quantitative variable because it is recorded to the nearest year. But, for illustrative purposes, let's create a categorical variable by separating age into intervals.

**Distribution:** *the way something is spread out (the way in which values vary).*

```{r}
# Note: anything to the right of a hashtag is a comment and is not evaluated as R code

library(dplyr) # Load the dplyr package
library(ggplot2) # Load the ggplot2 package
data(Whickham) # Load the data set from Whickham R package

# Create a new categorical variable with 4 categories based on age
Whickham <- Whickham %>%
    mutate(ageCat = cut(age, 4))

head(Whickham)
```

```{block, type="reflect"}
What do you lose when you convert a quantitative variable to a categorical variable? What do you gain?
```

### Bar Plot

One of the best ways to show the distribution of one categorical variable is with a bar plot. For a bar plot,

- The **height of the bars** is the only part that encodes the data (width is meaningless).
- The height can either represent the **frequency** (count of units) or the **relative frequency** (proportion of units).


```{r tidy=FALSE}
## Numerical summary (frequency and relative frequency)
Whickham %>%
    count(ageCat) %>%
    mutate(relfreq = n / sum(n))

## Graphical summary (bar plot)
Whickham %>%
    ggplot(aes(x = ageCat)) +
    geom_bar(fill="steelblue") +
    labs(x = 'Age Categories in Years', y = 'Counts') +
    theme_classic()
```

```{block type='reflect'}
What do you notice? What do you wonder?
```

### Pie Chart

Pie charts are only useful if you have 2 to 3 possible categories and you want to show relative group sizes.

This is the best use for a pie chart:

```{r, out.width = '.5\\textwidth', echo=FALSE, fig.align = 'center'}
knitr::include_graphics("Photos/pie.jpg")
```

We are intentionally not showing you how to make a pie chart because a bar chart is a better choice.

Here is a good summary of why many people strongly dislike pie charts: http://www.businessinsider.com/pie-charts-are-the-worst-2013-6. Keep in mind Visualization Principle #4: Facilitate Comparisons. We are much better at comparing heights of bars than areas of slices of a pie chart.


## Two Categorical Variables

Now, let's consider two other variables in the same Whickham data set. What is the relationship between the 20-year mortality outcome and smoking status at the beginning of the study?

### Side by Side Bar Plot

There are a few options for visualizing the relationship between two categorical variables. One option is to use a bar plot and add bars for different categories next to each other, called a **side-by-side bar plot**. For these plots,

- The **height of the bars** shows the frequency of the categories within subsets.

```{r tidy=FALSE}
## Numerical summary (frequency and overall relative frequency)
Whickham %>%
  count(outcome, smoker) %>%
  mutate(relfreq = n / sum(n))

## Graphical summary (side-by-side bar plot)
Whickham %>%
  ggplot(aes(x = smoker, fill = outcome)) +
  geom_bar(position = "dodge") +
  labs(x = 'Smoker Status', y = 'Counts', fill = '20 Year Mortality') +
  scale_fill_manual(values = c("steelblue", "lightblue")) +
  theme_classic()
```

```{block type='reflect'}
What additional information do you gain by considering smoking status?
```

### Stacked Bar Plot

Another way to show the same data is by stacking the bars on top of each other with a category. For a **stacked bar plot**,

- The **height** of the entire bar shows the **marginal distribution** (frequency of the X variable, ignoring the other variable).
- The **relative heights** show **conditional distributions** (frequencies within subsets), but it is hard to compare distributions between bars because the overall heights differ.
- The **widths** of the bars have no meaning.


```{r}
## Numerical summary (conditional distribution - conditioning on outcome)
Whickham %>%
    count(outcome, smoker) %>%
    group_by(outcome) %>%
    mutate(relfreq = n / sum(n))

## Numerical summary (conditional distribution - conditioning on smoker)
Whickham %>%
    count(outcome, smoker) %>%
    group_by(smoker) %>%
    mutate(relfreq = n / sum(n))

## Graphical summary (stacked bar plot)
Whickham %>%
    ggplot(aes(x = smoker, fill = outcome)) +
    geom_bar() +
    labs(x = 'Smoker Status', y = 'Counts', fill = '20 Year Mortality') +
    scale_fill_manual(values = c("steelblue", "lightblue")) +
    theme_classic()
```

```{block type='reflect'}
What information is highlighted when you stack the bars as compared to having them side-by-side?
```

### Stacked Bar Plot (Relative Frequencies)

We can adjust the stacked bar plot to make the heights the same, so that you can compare conditional distributions. For a **stacked bar plot based on proportions** (also called a **proportional bar plot**),

- The **relative heights** show **conditional distributions** (relative frequencies within subsets).
- The **widths** of the bars have no meaning.

The code below computes the conditional distributions first (fractions of outcomes within the two smoking groups) and then plots these proportions.


```{r}
Whickham %>%
    ggplot(aes(x = smoker, fill = outcome)) +
    geom_bar(position = "fill") +
    labs(x = 'Smoker Status', y = 'Proportion', fill = '20 Year Mortality') +
    scale_fill_manual(values = c("steelblue", "lightblue")) +
    theme_classic()
```

### Mosaic Plot

The best (Prof. Heggeseth's opinion) graphic for two categorical variables is a variation on the stacked bar plot called a **mosaic plot**. The total heights of the bars are the same so we can compare the conditional distributions. For a **mosaic plot**,

- The **relative height** of the bars shows the **conditional distribution** (relative frequency within subsets).
- The **width** of the bars shows the **marginal distribution** (relative frequency of the X variable, ignoring the other variable).
- Making mosaic plots in R requires another package: `ggmosaic`

```{r}
library(ggmosaic)
Whickham %>%
    ggplot() +
    geom_mosaic(aes(x = product(outcome, smoker), fill = outcome)) +
    labs(x = 'Smoker Status', y = '', fill = '20 Year Mortality') +
    scale_fill_manual(values = c("steelblue", "lightblue")) +
    theme_classic()
```

```{block, type="reflect"}
What information is highlighted when you focus on relative frequency in the mosaic plots as compared to other bar plots?
```

With this type of plot, you can see that there are more non-smokers than smokers. Also, you see that there is a higher mortality rate for non-smokers.

```{block, type="reflect"}
Does our data suggest that smoking *is associated* with a lower mortality rate?  Does our data suggest that smoking *reduces* mortality? Note the difference in these two questions - the second implies a cause and effect relationship.
```

Let's consider a third variable here, age distribution. We can create the same plot, separately for each age group.

```{r}
Whickham %>%
    ggplot() +
    geom_mosaic(aes(x = product(outcome, smoker), fill = outcome)) +
    facet_grid( . ~ ageCat) +
    labs(x = 'Smoker Status', y = '', fill = '20 Year Mortality') +
    scale_fill_manual(values = c("steelblue", "lightblue")) +
    theme_classic()
```

```{block, type="reflect"}
What do you gain by creating plots within subgroups?
```

```{block, type="reflect"}
How is it that our conclusions are exactly the opposite if we consider the relationship between smoking and mortality within age subsets? What might be going on?
```

This is called **Simpson's Paradox,** which is a situation in which you come to two different conclusions if you look at results overall versus within subsets (e.g. age groups).

Let's look at the marginal distribution of smoking status within each age group. For groups of people that were 68 years of age or younger, it was about 50-50 in terms of smoker vs. non smoker. But, the oldest age group were primarily nonsmokers.

Now look at the mortality rates within each age category. The 20-year mortality rate among young people (35 or less) was very low, but mortality increases with increased age. So the oldest age group had the highest mortality rate, due primarily to their age, and also had the highest rate of non-smokers. So when we look at everyone together (not subsetting by age), it looks like smoking is associated with a lower mortality rate, when in fact age was just confounding the relationship between smoking status and mortality.


## One Quantitative Variable

Next, we use data from one of the largest ongoing health studies in the USA, named NHANES. In particular, we will focus on data from the NHANES between 2009-2012 (Source: CDC). For more info about NHANES: https://www.cdc.gov/nchs/nhanes/index.htm.

Since sleep is vitally important to daily functioning, let's look at the number of hours of sleep respondents reported.

### Histogram

One main graphical summary we use for quantitative variables is a histogram. It resembles a bar plot, but there are a few key differences:

- The x-axis is a number line that is divided into intervals called **bins**. Bins technically do not all have to be of equal width but almost always are. When making histograms in R, R chooses a default bin width, but you have options to change the number and/or width of the bins/intervals.
- The **height** of the bars shows either the **frequency within intervals** (counts of units that fall into that bin/interval) or the **density** (fraction of units that fall into that bin/interval).
- Gaps between bars are meaningful. They indicate absence of values within an interval.

```{r}
data(NHANES)
NHANES %>%
    ggplot(aes(x = SleepHrsNight)) +
    geom_histogram(fill = "steelblue") +
    labs(x = 'Hours of Sleep (hours)', y = 'Counts') +
    theme_classic()
```

Note the warning message above: "Removed __ rows containing non-finite values (stat_bin)." Sometimes there is missing information for a variable for some units in the dataset. We cannot plot these because we don't know their values! This warning message is just a friendly reminder from R to let you know what it is doing.

Also note the message that R gives about bin width to remind us that we can choose this if we wish. If we want to specify the width of the intervals or bins, we can specify `binwidth = DESIRED_BIN_WIDTH` within `geom_histogram`.

```{r}
NHANES %>%
    ggplot(aes(x = SleepHrsNight)) +
    geom_histogram(binwidth = 1, fill = "steelblue") +
    labs(x = 'Hours of Sleep (hours)', y = 'Counts') +
    theme_classic()
```

Lastly, notice that the y-axis in the previous two histograms has been the counts (or frequency) within each sleep hour interval. We can adjust this to **density**, which is relative frequency adjusted for the width of interval so that the sum of the areas of the bars (height x width) equals 1.

```{r}
NHANES %>%
    ggplot(aes(x = SleepHrsNight)) +
    geom_histogram(aes(y = ..density..), binwidth = 1, fill = "steelblue") +
    geom_density(alpha = 0.2, fill = "steelblue", adjust = 3) +
    labs(x = 'Hours of Sleep (hours)', y = 'Density') +
    theme_classic()
```

The smooth curved line on this plot is called a **density plot**. It is essentially a smoother version of the histogram. Both the area under a density plot and the total area of all the rectangles in a density histogram equal 1.

When describing a distribution, we focus on three aspects of the histogram:

- **Shape:** Is it **symmetric** (can you fold it in half and the sides match up)? or is it **skewed to the right or left**? (A distribution is **left-skewed** if there is a long left tail and **right-skewed** if it has a long right tail.) How many **modes** ("peaks"/"bumps" in the distribution) do you see?
- **Center:** Where is a typical value located?
- **Spread** (or variation): How spread out are the values? Concentrated around one or more values or spread out?
- **Unusual features:** Are there **outliers** (points far from the rest)? Are there gaps? Why?


Here is another data set for comparison. Here are the annual salaries for the highest paid CEOs in 2016 (Source: NYTimes). To get the data, we are scraping the data from a NYTimes website. For fun, you can look at the code below.

```{r echo=FALSE}
nyturl <- 'https://www.nytimes.com/interactive/2017/05/26/business/highest-paid-ceos.html?mcubz=0'
dat <- read_html(nyturl)
ceo <- dat %>%
    html_nodes(".nytg-compensation , .nytg-year") %>%
    html_text() %>%
    str_replace('\\$|-','') #webscraping data
ceo <- data.frame(matrix(ceo,ncol = 2,byrow = TRUE))
names(ceo) <- c('year','salary')
ceo$salary <- as.numeric(ceo$salary)
ceo <- ceo %>%
    filter(year == '2016')
```

Let's create a density histogram of the annual salaries for the highest paid CEO's in the U.S. in 2016.

```{r}
ceo %>%
    ggplot(aes(x = salary)) +
    geom_histogram(aes(y = ..density..), binwidth = 5, fill = "steelblue") +
    geom_density(alpha = 0.2, fill = "steelblue") +
    labs(x = 'Salary ($ Millions)', y = 'Density') +
    theme_classic()
```

We note that some of the highest salaries were close to 200 million U.S. dollars (in 2016), but the majority of the salaries in this sample are closer to 50 million U.D. dollars.

```{block, type="reflect"}
Is this distribution of salaries left-skewed or right-skewed? In what populations do you think salaries might be left-skewed? Right-skewed?
```

### Center

There are some choices for numerically summarizing the center of a distribution:

- **Mean**: The sum of the values divided by the number of values (sample size), $\bar{y} = \frac{\sum^n_{i=1}y_i}{n}$
    + Sensitive to outliers, but it efficiently uses all the data
- **Median**: The "middle" value. The number for which half of the values are below and half are above.
    + Insensitive to outliers, but it doesn't use all the actual values
- **Trimmed means**:  Drop the lowest and highest k% and take the mean of the rest.
    + A good compromise, but not widely used.

```{block, type="mathbox"}
The Greek capital letter sigma, $\sum$, is used in mathematics to denote a sum. We let $y_i$ represent the value of the $i$th person for a variable called $y$. So $\sum^n_{i=1}y_i$ is the sum of all the $n$ values of a variable $y$, all the way from the 1st person to the $n$th person.
```

We can calculate all of these in R.

- **Hours of sleep per night from the NHANES dataset**

```{r}
NHANES %>%
  select(SleepHrsNight) %>%
  summary()

NHANES %>%
  summarize(
    mean(SleepHrsNight, na.rm = TRUE),
    median(SleepHrsNight, na.rm = TRUE),
    mean(SleepHrsNight, trim = 0.05, na.rm = TRUE)) # na.rm = TRUE removes missing values

#Trimmed mean: trim 5% from both tails before taking mean

```

- **CEO salary information from NYT**

```{r}
ceo %>%
  select(salary) %>%
  summary()  # Note the differences between mean and median

ceo %>%
  summarize(
    mean(salary),
    median(salary),
    mean(salary, trim = 0.05))
```

Note that the mean, median, and trimmed mean are all fairly close for the sleep hours distribution, which looks fairly symmetric.

Note also that the mean, median, and trimmed mean are somewhat different for the salary distribution, which looks right skewed. Often with right skewed distributions, the mean tends to be higher than the median because particularly large values are being summed in the calculation. The median and trimmed mean are not as sensitive to these outliers because of the sorting that is involved in their calculation.

### Boxplot

An alternative graphical summary is a boxplot, which is a simplification of the histogram. The plot consists of

- A Box: the bottom of the box is at the 25th percentile ($Q1$) and top of the box is at the 75th percentile ($Q3$)
- Line in Box: the line in the middle of the box is at the 50th percentile, the median
- Tails/Whiskers: The lines extend out from the box to most extreme observed values within $1.5 \times (Q3-Q1)$ from $Q1$ (bottom) or $Q3$ (top)
- Points: If any points are beyond $1.5 \times (Q3-Q1)$ from the box edges, they are considered outliers and are plotted separately

```{block, type="mathbox"}
A percentile is a measure indicating the value below which a given percentage of observations in a group of observations fall. So the 25th percentile is the value at which 25% of the values are below. The 95th percentile is the point at which 95% of the observations are below.
```

Here is a boxplot of the sleep amount from NHANES.

```{r}
NHANES %>%
    ggplot(aes(y = SleepHrsNight)) +
    geom_boxplot() +
    ylab('Hours of Sleep (hours)') +
    theme_classic()
```

Compare that to the boxplot of the CEO salaries.

```{r echo=TRUE}
ceo %>%
    ggplot(aes(y = salary)) +
    geom_boxplot() +
    ylab('Salary ($M)') +
    theme_classic()
```

Note: In these 2 plots above, the x-axis has number labels, but they don't mean anything.

Let's put the boxplots next to the histograms so we can better compare the two types of visualizations. Also, let's add the mean (red dashed), median (blue dotted), and 5% trimmed mean (purple dash-dot) as annotations.

```{r echo=FALSE}
layout(matrix(c(1,2),2,1),heights = c(10,5))
par(mar = c(0,5,4,2))
hist(NHANES$SleepHrsNight, col="lightblue",main="",ylab="# of people",xlab='Hours of Sleep (hrs)',xaxt='n')
abline(v = mean(NHANES$SleepHrsNight,na.rm=TRUE),col='red',lty=2) #na.rm = TRUE removes missing values
abline(v = median(NHANES$SleepHrsNight,na.rm=TRUE), col = 'blue',lty=3)
abline(v = mean(NHANES$SleepHrsNight,trim = 0.05,na.rm=TRUE), col='purple',lty=4)
par(mar = c(5,5,0,2),bty='n')
boxplot(NHANES$SleepHrsNight,horizontal = TRUE,xlab='Hours of Sleep at Night')
abline(v = mean(NHANES$SleepHrsNight,na.rm=TRUE),col='red',lty=2)
abline(v = median(NHANES$SleepHrsNight,na.rm=TRUE), col = 'blue',lty=3)
abline(v = mean(NHANES$SleepHrsNight,trim = 0.05,na.rm=TRUE), col='purple',lty=4)
```

For the hours of sleep, the mean, median, and 5% trimmed mean are all pretty much the same. Note also that the distribution looks pretty symmetric based on the histogram.

```{r echo=FALSE}
layout(matrix(c(1,2),2,1),heights = c(10,5))
par(mar = c(0,5,4,2))
hist(ceo$salary, col="lightgreen",main="",ylab="# of people",xaxt='n',xlim=c(0,200))
abline(v = mean(ceo$salary, na.rm = TRUE),col='red',lty=2) #na.rm = TRUE removes missing values
abline(v = median(ceo$salary, na.rm = TRUE), col = 'blue',lty=3)
abline(v = mean(ceo$salary, trim = 0.05, na.rm = TRUE), col='purple',lty=4)
par(mar = c(5,5,0,2),bty='n')
boxplot(ceo$salary,horizontal = TRUE,xlab='Salary ($N)',ylim=c(0,200))
abline(v = mean(ceo$salary, na.rm = TRUE),col='red',lty=2)
abline(v = median(ceo$salary, na.rm = TRUE), col = 'blue',lty=3)
abline(v = mean(ceo$salary, trim = 0.05, na.rm = TRUE), col='purple',lty=4)
```

For CEO salaries, the mean and 5% trimmed mean are a bit higher than the median. **The mean is always pulled toward the long tail.**

```{block, type="reflect"}
What would the boxplot look like if all of the values were exactly the same? Sometimes when making multiple boxplots for each of multiple groups, a group may only have one value or a small number of values that all happen to be identical. What will this look like?
```

### Spread

There are some choices for numerically summarizing the spread of a distribution:

- **Range**: the maximum value - the minimum value
    + Sensitive to the outliers since it's the difference of the extremes
    + Units (e.g. inches, pounds) are the same as the actual data
- **IQR**: the interquartile range : $Q3 - Q1$ (75th percentile - 25th percentile).
    + Length of the box in a boxplot
    + Spread of middle 50% of data
    + Like the median. Less sensitive because it doesn't use all of the data
    + Units are the same as the actual data
- **Standard deviation** (SD): Root mean squared deviations from mean, $s_y = \sqrt{\frac{\sum^n_{i=1}(y_i-\bar{y})^2}{n-1}}$
    + Roughly the average size of deviation from the mean ($n-1$ instead of $n$)
    + Uses all the data but very sensitive to outliers and skewed data (large values are first squared).
    + Units are the same as the actual data
- **Variance**: Square of the standard deviation
    + Units are the squared version of the actual data's units (e.g. squared inches, pounds)
    + Standard deviation is preferred for interpretability of units
    + Variance will come up when we discuss models in the next chapter

We can calculate all of these in R.

- **Hours of sleep per night from the NHANES dataset**

```{r echo=TRUE}
NHANES %>%
  summarize(
    diff(range(SleepHrsNight, na.rm = TRUE)),
    IQR(SleepHrsNight, na.rm = TRUE),
    sd(SleepHrsNight, na.rm = TRUE),
    var(SleepHrsNight, na.rm = TRUE))

# range gives max and min; take difference betwee max and min
# IQR = Q3-Q1
# sd = standard deviation
# var = variance
```

- **CEO salary information from NYT**

```{r tidy=TRUE, echo=TRUE}
ceo %>%
  summarize(
    diff(range(salary)),
    IQR(salary),
    sd(salary),
    var(salary))
```

### Some data accounting {#intro-zscore}

We've looked at different measures of the spread of a distribution. Do some measures of spread encompass a lot of the data? Just a little? Can we be more precise about how much of the data is encompassed by intervals created from different spread measures?

```{r echo=FALSE}
x <- rnorm(1500)
boxplot(x,horizontal = TRUE,xlab='Generated Data')
abline(v = range(x), col='blue',lty=3)
abline(v = quantile(x,c(.25,.75)),col='purple',lty=1)
abline(v = c(mean(x) - sd(x), mean(x) + sd(x)),col='green',lty=2)
```

```{block, type="reflect"}
What percentage of the data is between the blue dotted lines (length of interval is range)?

What percentage of the data is between the purple solid lines (length of interval is IQR)?

What percentage of the data is between the green dashed lines (length of interval is 2*SD)?
```

The code below computes the fraction of data points, `x`, that fall between the lower bound of 1 SD below the mean and the upper bound of 1 SD above the mean.

```{r}
sum(x > mean(x) - sd(x) & x < mean(x) + sd(x))/length(x)
```

So with this data set, about 68% of the data values fall within 1 SD of the mean.

```{block, type="reflect"}
If we had a different data set, do you know that answer to the following questions? *You should know the answer to 2 of them at this point...*

- What percentage of the data would be between the minimum and maximum (blue dotted lines above)?

- What percentage of the data would be between bottom and top of the box (purple solid lines above)?

- What percentage of the data would be between 1 SD below the mean and 1 SD above the mean (green dashed lines)?
```

### Z-scores

How do you decide when an outlier is really unusual (think: athletic victory being very impressive or a data point that may be a typing error such as a human weight of 3000 lbs)?

If the observation is far from the rest of the measurements in the data, we tend to say that the value is unusual. We want to quantify this idea of "unusual".

To do this, we often calculate a **z-score**, a standardized data value which we denote with the letter, $z$. To calculate a z-score,

- Calculate how far an observation, $y$, is below (or above) the mean of the sample, denoted as $\bar{y}$.
- Then divide the difference by the standard deviation (measure of spread), denoted as $s_y$.

\[ z = \frac{y - \bar{y}}{s_y} \]

The z-score tells you how many standard deviations the observation is above or below the mean.

```{block, type='reflect'}
Say that you got a z-score of 1 on an exam with mean = 80 and SD = 5. That means that you got an 85 on the exam because your exam is one SD above the mean ($mean + z \times SD = 80 + 1 \times 5$).

If you got a z = -2 on an exam with mean = 80 and SD = 5, that means you got a 70 on the exam because your exam is two SD below the mean  ($mean + z \times SD = 80 + -2 \times 5$).
```

In general, it is quite common to have z-scores between -3 and 3, but fairly unusual to have them greater than 3 or less than -3.

Often, if you have data with a **unimodal, symmetric distribution**,

- about 68% of the z-scores are between -1 and 1,
- about 95% of the z-scores are between -2 and 2,
- about 99.7% of the z-scores are between -3 and 3.

This is not true for every histogram, but it will be true for a particularly special distribution that we will see later when we cover models. (This distribution is called the **normal distribution** or **Gaussian distribution**.)

However, we do know that z-scores of 5 or larger in magnitude (ignoring negative sign) are very unusual, no matter the shape of the histogram/distribution. For those inclined, see the mathematical theorem below that tells us this.

```{block, type='mathbox'}
(Optional) Chebyshev's inequality gives bounds for the percentages no matter the shape of the distribution. It states that for any real number $k$ > 0, the chance of getting a z-score greater in magnitude (ignoring the negative sign) than $k$ is less than or equal to $1/k^2$,

$$P\left(|Z| \geq k\right) \leq \frac{1}{k^2}$$
where $Z = \frac{|X - \mu|}{\sigma}$ is a z-score, $\mu$ is the mean, and $\sigma$ is the standard deviation.

If we plug in values for $k$, we see that the chance of getting a z-score

- at least 3 in magnitude (> 3 or < -3) is less than $(1/3^2) = 0.11 = 11\%$.
- at least 4 in magnitude (> 4 or < -4) is less than $(1/4^2) = 0.06 = 6\%$.
- at least 5 in magnitude (> 5 or < -5) is less than $(1/5^2) = 0.04 = 4\%$.

This is true for any shaped distribution (skewed, bimodal, etc.). See [proof here](https://en.wikipedia.org/wiki/Markov%27s_inequality) based on probability theory.
```


In summary, for a quantitative variable,

- Use a histogram to display the distribution of one variable and describe the shape and any unusual features.
- For "well-behaved" distributions (symmetric, unimodal, no outliers), use the mean and standard deviation to describe the center and spread. Then z-scores will roughly follow the 68-95-99.7 rule stated above.
- For other distributions (skewed or bimodal), use the IQR and median. You can report both mean and median, but it's usually a good idea to state why.


## One Quant. and One Cat. Variable

Let's return to the NHANES data. We noticed variation in the amount people sleep. Why do some people sleep more than others? Are there any other characteristics that may be able to *explain that variation*?

Let's look at the distribution of hours of sleep at night within subsets or groups of the NHANES data.

### Multiple Histograms


**Does the recorded binary gender explain the variability in the hours of sleep?**

```{block, type='reflect'}
What are the *ethical implications* of collecting gender identity as a binary variable (male/female) if some individuals do not identify with these categories?

What might be the *causal mechanism* between gender identity and sleep? Might you be more interested in hormone levels, which might not necessarily correspond to gender identity? How might you change the data collection procedure so that the data can address the underlying research question?
```

Let's make a histogram for each gender category by adding `facet_grid(. ~ Gender)` which separates the data into groups defined by the variable, `Gender`, and creates two plots along the x-axis.

```{r}
NHANES %>%
    ggplot(aes(x = SleepHrsNight)) +
    geom_histogram(binwidth = 1, fill = "steelblue") +
    labs(x = 'Hours of Sleep (hours)') +
    facet_grid(. ~ Gender) +
    theme_classic()
```

```{block, type='reflect'}
Do you notice any differences in sleep hour distributions between males and females?

What is easy to compare and what is hard to compare between the two histograms?
```

**Does the number of children a woman has explain the variability in the hours of sleep?**

```{block, type='reflect'}
Who have we excluded from our analysis by asking this question?
```

```{r fig.width=10}
NHANES %>%
    filter(!is.na(nBabies)) %>%
    ggplot(aes(x = SleepHrsNight)) +
    geom_histogram(binwidth = 1, fill = "steelblue") +
    labs(x = 'Hours of Sleep (hours)') +
    facet_wrap(. ~ factor(nBabies), ncol = 4) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    theme_classic()
```

The 0 to 12 labels at the top of each of these panels correspond to the number of babies a woman had.

```{block, type='reflect'}
Do you notice any differences in sleep hour distributions between these groups?

Note the x and y axes are the same for all of the groups to facilitate comparison. What is easy to compare and what is hard to compare between the histograms?
```

**Does the number of days someone has felt depressed explain the variability in the hours of sleep?**

```{r}
NHANES %>%
    ggplot(aes(x = SleepHrsNight)) +
    geom_histogram(binwidth = 1, fill = "steelblue") +
    labs(x = 'Hours of Sleep (hours)') +
    facet_grid(. ~ Depressed) +
    theme_classic()
```

What's the rightmost "NA" category? Some individuals in this study did not answer questions about days that they might have felt depressed, but they did report their hours of sleep per night.

```{block, type='reflect'}
What type of biases might be at play here?
```

### Multiple Boxplots

Let's visualize the same information but with boxplots instead of histograms and see if we can glean any other information.

```{r}
NHANES %>%
    ggplot(aes(x = Gender, y = SleepHrsNight)) +
    geom_boxplot() +
    labs(x = 'Binary Gender', y = 'Hours of Sleep (hours)') +
    theme_classic()
```


```{r}
NHANES %>%
    ggplot(aes(x = factor(nBabies), y = SleepHrsNight)) +
    geom_boxplot() +
    labs(x = 'Number of Babies', y = 'Hours of Sleep (hours)') +
    theme_classic()
```


```{r}
NHANES %>%
    ggplot(aes(x = factor(Depressed), y = SleepHrsNight)) +
    geom_boxplot() +
    labs(x = 'Days Depressed', y = 'Hours of Sleep (hours)') +
    theme_classic()
```

```{block type='reflect'}
What is easy to compare and what is hard to compare between the boxplots?

Why might you use multiple boxplots instead of multiple histograms?
```

### Is this a Real Difference?

If we notice differences in the the sleep distributions for groups based on self-reported Depression, is it a "REAL" difference? That is, is there a difference in the general U.S. population? Remember, we only have a random *sample* of the population. *NHANES is supposed to be a representative sample of the U.S. population collected using a probability sampling procedure.*

What if there were no "REAL" difference? Then the Depressed group labels wouldn't be related to the hours of sleep.

**Investigation Plan:**

1. Take all of the observed data on the hours of sleep and randomly shuffle them into new groups (of the same sizes as before). This breaks any associations between the Depressed group labels and the reported hours of sleep.
2. Calculate the difference in mean hours of sleep between the groups. Record it.
3. Repeat steps 1 and 2 many times (say 1000 times).
4. Look at the differences based on random shuffles & compare to the observed difference.


```{r echo=TRUE}
library(mosaic)
#TRUE or FALSE (converted Depressed to a 2 category variable)
NHANES <- NHANES %>%
  mutate(DepressedMost = (Depressed == 'Most'))

obsdiff <- NHANES %>%
  with(lm(SleepHrsNight ~ DepressedMost)) %>%
  tidy() %>%
  filter(term == 'DepressedMostTRUE')

sim <- do(1000)*(
  NHANES %>%
    with(lm(SleepHrsNight ~ shuffle(DepressedMost)))
)

#Randomly shuffle the DepressedMost labels
#assuming no real difference in sleep
```


Below, we have a histogram of 1000 values calculated by randomly shuffling individuals in the sample into two groups (assuming no relationship between depression and sleep) and then finding the difference in the mean amount of sleep. The red vertical line showed the observed difference in mean amount of sleep in the data.

```{r}
sim %>%
  ggplot(aes(x = DepressedMostTRUE)) +
  geom_histogram(fill = 'steelblue') +
  geom_vline(aes(xintercept = estimate), obsdiff, color = 'red') +
  labs(x = 'Difference in Mean Hours of Sleep', y = 'Counts') +
  theme_classic()
```

```{block, type='reflect'}
The observed difference in mean hours of sleep (red line) is quite far from the distribution of differences that results when we break the association between depression status and sleep hours (through randomized shuffling of group labels). Thus, it is unlikely to get a difference that large if there were no relationhip.

What do you think this indicates? How might you use this as evidence for or against a "real" population difference?
```


## Two Quantitative Variables

To discuss two quantitative variables, let's switch to new data set and consider a thought experiment.

Imagine that you are an entrepreneur selling button-down dress shirts. Clothing sizing are quite variable across clothing brands, so we are going to use our own data to come up with appropriate sizes for our customers. Two of the key measurements that we will use are the neck size in centimeters and chest size in centimeters of a customer. There are other variables in the data set, but let's focus on these two for the moment.

### Scatterplot

When you have two quantitative variables, a **scatterplot** is the main appropriate graphical display of the relationship. Each point represents the neck and chest size of one customer.

```{r echo=TRUE}
body <- read.delim("Data/bodyfat.txt")

body %>%
    ggplot(aes(x = Neck, y = Chest)) +
    geom_point(color = 'steelblue') +
    labs(x = 'Neck size (cm)', y = 'Chest size (cm)') +
    theme_classic()
```

What do you notice about:

1. **Direction** of relationship (positive, negative, or neutral)
2. **Form** of relationship (linear, curved, none, or other)
3. **Strength** of relationship (compactness around the average relationship)
4. **Unusual** features (outliers, differences in variability in $y$ variable across different values of $x$ variable)

```{block type='reflect'}
How might you use this information to determine shirt sizes for your new business venture? Come up with a few ways you could define sizes such as small, medium, large, extra large, etc.
```

Suppose instead of *Chest* in inches and *Neck size* in cm, we plotted *Chest* in inches and *Neck size* in inches.

**Does the strength of the relationship change after transformation?**

Look at the plot in inches below. Does this plot look the same as the centimeters plot?

```{r}
body %>%
  ggplot(aes(x = Neck/2.54, y = Chest/2.54)) +
  geom_point(color = 'steelblue') +
  labs(x = 'Neck size (in)', y = 'Chest size (in)') +
  theme_classic()
```

You should see that the x-axes changed but the overall shape of the plot stayed the same. Thus, the strength of the relationship was not affected by tranforming neck size from centimeters to inches (by dividing by 2.54).

### Correlation Coefficient

Since **shifting** (adding or subtracting) and **scaling** (multiplying or dividing) make no difference in the strength of the relationship, let's standardize both variables into z-scores (recall z-scores from Section \@ref(intro-zscore)).

Below we plot Neck and Chest sizes after changing them to z-scores with the function `scale()` and we add some color:

```{r echo=FALSE}
body <- body %>%
    mutate(
      zNeck = scale(Neck),
      zChest = scale(Chest),
      Product = factor(sign(zChest*zNeck)))

body %>%
    ggplot(aes(x = zNeck, y = zChest, color = Product)) +
    geom_point() +
    geom_hline(yintercept = 0) +
    geom_vline(xintercept = 0) +
    xlab('Z-scores of Neck size') +
    ylab('Z-scores of Chest size') +
    guides(color = guide_legend(title = "Sign of the product\n of the z-scores")) +
    theme_classic() +
    coord_cartesian(xlim = c(-3.5,3.5), ylim = c(-3.5,3.5)) +
    annotate("text", x = c(3, -3, -3, 3), y = c(3.5, 3.5, -3.5, -3.5), label = paste("Quadrant", 1:4))
```

The blue points in the upper right (Quadrant 1) and lower left (Quadrant 3) quadrants are either both positive or both negative in their z-score values. This means that those individuals are above average in both Neck Size and Chest Size (upper right), or they are below average in both Neck Size and Chest Size (lower left). If we multiply the z-scores of the Neck and Chest values for the blue points, we will get a positive value.

The red points in the upper left (Quadrant 2) and lower right (Quadrant 4) quadrants are positive in one and negative in the other. This means that those individuals are either above average in Neck Size but below average in Chest Size (lower right) or they are below average in Neck Size and above average in Chest Size (upper left). If we multiply the z-scores of the Neck and Chest values for the red points, we will get a negative value.

```{block type='reflect'}
If we were to have a weaker positive relationship, how would this plot change?

If we were to have a stronger positive relationship, how would this plot change?

If we were to have a negative relationship, how would this plot change?
```

We want one number to represent **strength** and **direction** of a linear relationship.

- Points in Quadrants 1 and 3 (blue) have the z-scores of the **same sign**.
- Points in Quadrants 2 and 4 (red) have z-scores of the **opposite sign**.

**What if we took the product of the $z$-scores for $x$ and $y$ variables?**

Situation 1: An individual far above the means in both the $x$ and $y$ variables or far below the means in both the $x$ and $y$ variables has a very large, positive product of z-scores.

Situation 2: An individual far above the mean in $x$ and far below the mean in $y$ has a very large, negative product of z-scores. (The same goes for low $x$ and high $y$.)

The (almost) *average* of products of the $z$-scores is the **correlation coefficient**,

$$ r_{x,y} = \frac{\sum z_x z_y}{n-1} $$

We notate the correlation coefficient between variables $x$ and $y$ as $r_{x,y}$.

Some observations:

- If most of our data points follow Situation 1, the correlation coefficient is an average of mostly large positive values. Thus the correlation coefficient will be large and positive.
- If most of our data points follow Situation 2, the correlation coefficient is an average of mostly large negative values. Thus the correlation coefficient will be large and negative.
- If about an equal number of data points follow Situation 1 and Situation 2, we will be balancing positive and negative numbers, which results in a value close to zero. Thus, the correlation coefficient will be close to zero.

**Which points contribute the most to this average?**

Let's look at the correlation for the entire sample first. Then let's calculate the correlation for individuals around the mean Neck size.

```{r, echo=FALSE}
## The scale() function adds attributes to the columns zNeck, zChest, and Product
## Makes it so that dplyr::filter doesn't work
## Reread the data
body <- read.delim("Data/bodyfat.txt")
```

```{r, echo=TRUE}
body %>%
  summarize(cor(Neck, Chest)) # All data points used in calculation

body %>%
    filter(Neck > 35 & Neck < 40) %>% # Keep individuals with Neck size between 35cm and 40cm
      summarize(cor(Neck, Chest)) # Only middle subset of data points used in calculation
```

The value is much larger and more positive when all data points are used. The points that are far from the means of x and y have a larger product of z-scores and thus increase the correlation coefficient value.

### Properties

* $-1 \leq r \leq 1$ (due to the [Cauchy-Schwarz Inequality](https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality) for those are inclined)

* The sign of $r$ goes with the direction of the relationship.

* $r_{x,y} = r_{y,x}$, it doesn't matter which variable is $x$ and which is $y$.

* $r_{ax+b, cy+d} = r_{x,y}$, linear change of scale doesn't affect $r$. Why?

* $r$ measures strength of *linear* relationship (not a curved relationship).