IntroStatTextbook/06-explore-regression.Rmd at master · MTstateIntroStats/IntroStatTextbook · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Correlation and regression {#explore-regression}
<!-- Old reference: #cor-reg -->

```{r, include = FALSE}
source("_common.R")
```

::: {.chapterintro}
Linear regression is a very powerful statistical technique.
Many people have some familiarity with regression just from reading the news, where straight lines are overlaid on scatterplots.
Linear models can be used for prediction or to evaluate whether there is a linear relationship between two numerical variables.
:::

## Fitting a line, residuals, and correlation {#fit-line-res-cor}

It's helpful to think deeply about the line fitting process. In this section, we define the form of a linear model, explore criteria for what makes a good fit, and introduce a new statistic called *correlation*.

### Fitting a line to data

Figure \@ref(fig:perfLinearModel) shows two variables whose relationship can be modeled perfectly with a straight line.
The equation for the line is $y = 5 + 64.96 x$.
Consider what a perfect linear relationship means: we know the exact value of $y$ just by knowing the value of $x$.
This is unrealistic in almost any natural process.
For example, if we took family income ($x$), this value would provide some useful information about how much financial support a college may offer a prospective student ($y$).
However, the prediction would be far from perfect, since other factors play a role in financial support beyond a family's finances.

```{r perfLinearModel, fig.cap = "Requests from twelve separate buyers were simultaneously placed with a trading company to purchase Target Corporation stock (ticker `TGT`, December 28th, 2018), and the total cost of the shares were reported. Because the cost is computed using a linear formula, the linear fit is perfect."}
target <- simulated_scatter %>%
  filter(group == 4)

ggplot(target, aes(x = x, y = y)) +
  geom_smooth(method = "lm", color = "black", size = 0.5) +
  geom_point(col = COL["blue", "full"], size = 3) +
  scale_y_continuous(labels = label_dollar(scale = 0.001, suffix = "K", accuracy = 1), breaks = c(0, 1000, 2000)) +
  labs(
    x = "Number of Target Corporation Stocks to Purchase",
    y = "Total Cost of the Share Purchase"
  )
```

Linear regression is the statistical method for fitting a line to data where the relationship between two variables, $x$ and $y$, can be modeled by a straight line with some error:

$$ y = \beta_0 + \beta_1x + \varepsilon$$

The values $\beta_0$ and $\beta_1$ represent the model's parameters ($\beta$ is the Greek letter *beta*), and the error is represented by $\varepsilon$ (the Greek letter *epsilon*).
The parameters are estimated using data, and we write their point estimates as $b_0$ and $b_1$.
When we use $x$ to predict $y$, we usually call $x$ the explanatory or **predictor** variable, and we call $y$ the response. We also often drop the $\epsilon$ term when writing down the model since our main focus is often on the prediction of the average outcome. If the $\epsilon$ term is dropped, then we put a "hat" on $y$ ($\hat{y}$) to signal that the model yields a _prediction_ for $y$, and not the actual value.

```{r include=FALSE}
terms_chp_6 <- c("predictor")
```

It is rare for all of the data to fall perfectly on a straight line.
Instead, it's more common for data to appear as a *cloud of points*,
such as those examples shown in Figure \@ref(fig:imperfLinearModel).
In each case, the data fall around a straight line, even if none of the observations fall exactly on the line.
The first plot shows a relatively strong downward linear trend, where the remaining variability in the data around the line is minor relative to the strength of the relationship between $x$ and $y$.
The second plot shows an upward trend that, while evident, is not as strong as the first. The last plot shows a very weak downward trend in the data, so slight we can hardly notice it.
In each of these examples, we will have some uncertainty regarding our estimates of the model parameters, $\beta_0$ and $\beta_1$.
For instance, we might wonder, should we move the line up or down a little, or should we tilt it more or less?
As we move forward in this chapter, we will learn about criteria for line-fitting, and we will also learn about the uncertainty associated with estimates of model parameters.

```{r imperfLinearModel, fig.cap = "Three data sets where a linear model may be useful even though the data do not all fall exactly on the line.", out.width="100%"}
neg <- simulated_scatter %>% filter(group == 1)
pos <- simulated_scatter %>% filter(group == 2)
ran <- simulated_scatter %>% filter(group == 3)

p_neg <- ggplot(neg, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  labs(x = NULL, y = NULL)

p_pos <- ggplot(pos, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  labs(x = NULL, y = NULL)

p_ran <- ggplot(ran, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  labs(x = NULL, y = NULL)

p_neg + p_pos + p_ran
```

There are also cases where fitting a straight line to the data, even if there is a clear relationship between the variables, is not helpful.
One such case is shown in Figure \@ref(fig:notGoodAtAllForALinearModel) where there is a very clear relationship between the variables even though the trend is not linear.
We discuss nonlinear trends in this chapter and the next, but details of fitting nonlinear models are saved for a later course.

```{r notGoodAtAllForALinearModel, fig.cap = "The best fitting line for these data is flat, which is not useful in this nonlinear case. These data are from a physics experiment."}
bad <- simulated_scatter %>% filter(group == 5)

ggplot(bad, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  labs(
    x = "Angle of Incline (Degrees)",
    y = "Distance Traveled (m)"
    )
```

### Using linear regression to predict possum head lengths

Brushtail possums are a marsupial that lives in Australia, and a photo
of one is shown in Figure \@ref(fig:brushtail-possum).
Researchers captured 104 of these animals and took body measurements before releasing the animals back into the wild.
We consider two of these measurements: the total length of each possum, from head to tail, and the length of each possum's head.

```{r brushtail-possum, fig.cap = "The common brushtail possum of Australia. Photo by Greg Schecter, [flic.kr/p/9BAFbR](https://flic.kr/p/9BAFbR), CC BY 2.0 license. "}
knitr::include_graphics("03/figures/brushtail-possum/brushtail-possum.jpg")
```

::: {.data}
The [`possum`](http://openintrostat.github.io/openintro/reference/possum.html) data can be found in the [openintro](http://openintrostat.github.io/openintro) package.
:::

Figure \@ref(fig:scattHeadLTotalL) shows a scatterplot for the head length (mm) and total length (cm) of the possums.
Each point represents a single possum from the data.
The head and total length variables are associated: possums with an above average total length also tend to have above average head lengths.
While the relationship is not perfectly linear, it could be helpful to partially explain the connection between these variables with a straight line.

```{r scattHeadLTotalL, fig.cap = "A scatterplot showing head length against total length for 104 brushtail possums. A point representing a possum with head length 86.7 mm and total length 84 cm is highlighted."}
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)"
  ) +
  geom_point(data = tibble(x = 84, y = 86.7), aes(x = x, y = y), color = COL["red", "full"], size = 5, shape = 21, stroke = 2)
```

We want to describe the relationship between the head length and total length variables in the possum data set using a line.
In this example, we will use the total length as the predictor variable, $x$, to predict a possum's head length, $y$.
We could fit the linear relationship using technology (criteria to be discussed in Section \@ref(least-squares-regression)), as in Figure \@ref(fig:scattHeadLTotalLLine).

```{r scattHeadLTotalLLine, fig.cap = "A reasonable linear model was fit to represent the relationship between head length and total length."}
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)"
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.5)
```

The equation for this line is

$$\hat{y} = 43+0.57x.$$

A "hat" on $y$ is used to signify that this is an estimate.
We can use this line to discuss properties of possums.
For instance, the equation predicts a possum with a total length of 80 cm will have a head length of

$$\hat{y} = 43 + 0.57 \times 80 = 88.6 \text{ mm}.$$

The estimate may be viewed as an average: the equation predicts that possums with a total length of 80 cm will have an average head length of 88.6 mm.
Absent further information about an 80 cm possum, the prediction for head length that uses the average is a reasonable estimate.

There may be other variables that could help us predict the head length of a possum besides its length.
Perhaps the relationship would be a little different for male possums than female possums, or perhaps it would differ for possums from one region of Australia versus another region.
Plot A in Figure \@ref(fig:scattHeadLTotalL-sex-age) shows the relationship between total length and head length of brushtail possums, taking into consideration their sex.
Male possums (represented by blue triangles) seem to be larger in terms of total length and head length than female possums (represented by red circles).
Plot B in Figure \@ref(fig:scattHeadLTotalL-sex-age) shows the same relationship, taking into consideration their age.
It's harder to tell if age changes the relationship between total length and head length for these possums.

```{r scattHeadLTotalL-sex-age, fig.cap = "Relationship between total length and head lentgh of brushtail possums, taking into consideration their sex (Plot A) or age (Plot B).", fig.asp=1.236}
p_sex <- ggplot(possum, aes(x = total_l, y = head_l, shape = sex, color = sex)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c(COL["red","full"], COL["blue","full"])) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)",
    color = "Sex", shape = "Sex"
  )

p_age <- ggplot(possum, aes(x = total_l, y = head_l, color = age)) +
  geom_point(alpha = 0.6, size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)",
    color = "Age"
  ) +
  scale_color_viridis_c()

p_sex / p_age + plot_annotation(tag_levels = "A")
```

In Chapter \@ref(explore-mult-reg), we'll learn about how we can include more than one predictor in our model.
Before we get there, we first need to better understand how to best build a simple linear model with one predictor.

### Residuals

**Residuals** are the leftover variation in the data after accounting for the model fit:

$$\text{Data} = \text{Fit} + \text{Residual}$$

Each observation will have a residual, and three of the residuals for the linear model we fit for the data are shown in Figure \@ref(fig:scattHeadLTotalLLine-highlighted).
If an observation is above the regression line, then its residual, the vertical distance from the observation to the line, is positive.
Observations below the line have negative residuals.
One goal in picking the right linear model is for these residuals to be as small as possible.

```{r include=FALSE}
terms_chp_6 <- c(terms_chp_6, "residuals")
```

Figure \@ref(fig:scattHeadLTotalLLine-highlighted) is almost a replica of Figure \@ref(fig:scattHeadLTotalLLine), with three points from the data highlighted.
The observation marked by a red circle has a small, negative residual of about -1; the observation marked by a green diamond has a large residual of about +7; and the observation marked by a yellow triangle has a moderate residual of about -4.
The size of a residual is usually discussed in terms of its absolute value.
For example, the residual for the observation marked by a yellow triangle is larger than that of the observation marked by a red circle because $|-4|$ is larger than $|-1|$.

```{r scattHeadLTotalLLine-highlighted, fig.cap = "A reasonable linear model was fit to represent the relationship between head length and total length, with three points and their residuals highlighted."}
mod <- lm(head_l ~ total_l, data = possum)
preds <- predict(mod, data.frame(total_l = c(76, 85, 95.5)))
obs <- c(85.1, 98.6, 94)
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)"
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.5) +
  geom_point(data = possum %>% filter(total_l == 76), shape = 21, stroke = 2, size = 4, color = COL["red", "full"]) +
  geom_segment(aes(x = 76, y = preds[1], xend = 76, yend = obs[1]), color = COL["red", "full"], inherit.aes = FALSE) +
  geom_point(data = possum %>% filter(total_l == 85, head_l == 98.6), shape = 5, stroke = 2, size = 5, color = COL["green", "full"]) +
  geom_segment(aes(x = 85, y = preds[2], xend = 85, yend = obs[2]), color = COL["green", "full"], inherit.aes = FALSE) +
  geom_point(data = possum %>% filter(total_l == 95.5, head_l == 94), shape = 2, stroke = 2, size = 5, color = COL["yellow", "full"]) +
  geom_segment(aes(x = 95.5, y = preds[3], xend = 95.5, yend = obs[3]), color = COL["yellow", "full"], inherit.aes = FALSE)
```

::: {.onebox}
**Residual: Difference between observed and predicted.**

The **residual** of the $i^{th}$ observation $(x_i, y_i)$ is the difference of the observed response ($y_i$) and the response we would predict based on the model fit ($\hat{y}_i$):

$$e_i = y_i - \hat{y}_i$$

We typically identify $\hat{y}_i$ by plugging $x_i$ into the model.
:::


::: {.workedexample}
The linear fit shown in Figure \@ref(fig:scattHeadLTotalLLine-highlighted) is given as $\hat{y} = 43+0.57x$.
Based on this line, formally compute the residual of the observation
$(76.0, 85.1)$.
This observation is marked by a red circle in Figure \@ref(fig:scattHeadLTotalLLine-highlighted).
Check it against the earlier visual estimate, $-1$.

---

We first compute the predicted value of the observation marked by a red circle based on the model:

$$\hat{y} = 43+0.57x = 43+0.57\times 76.0 = 86.3mm$$


Next, we compute the difference of the actual head length and the predicted head length:

$$e = y - \hat{y} = 85.1 -  86.3 = -1.2 mm$$

The model's error is $e = -1.2$ mm, which is very close to the
visual estimate of $-1$ mm. The negative residual indicates that the linear model overpredicted head length for this particular possum.
:::


::: {.guidedpractice}
If a model underestimates an observation, will the residual be positive or negative? What about if it overestimates the observation?^[If a model underestimates an observation, then the model estimate is below the actual. The residual, which is the actual observation value minus the model estimate, must then be positive. The opposite is true when the model overestimates the observation: the residual is negative.]
:::


::: {.guidedpractice}
Compute the residuals for the observation marked by a green diamond, $(85.0, 98.6)$, and the observation marked by a yellow triangle, $(95.5, 94.0)$, in the figure using the linear relationship $\hat{y} = 43 + 0.57x$.^[Green diamond: $\hat{y} = 43+0.57x = 43+0.57\times 85.0 = 91.45 \rightarrow e = y - \hat{y} = 98.6-91.45=7.15$. This is close to the earlier estimate of 7. Yellow triangle: $\hat{y} = 43+0.57x = 97.44 \rightarrow e = -3.44$. This is also close to the estimate of -4.]
:::

Residuals are helpful in evaluating how well a linear model fits a data set.
We often display them in a residual plot such as the one shown in Figure \@ref(fig:scattHeadLTotalLResidualPlot) for the regression line in Figure \@ref(fig:scattHeadLTotalLLine-highlighted).
The residuals are plotted at their fitted values on the $x$-axis but with the vertical coordinate as the residual.
For instance, the point $(85.0, 98.6)$ (marked by the green diamond) had a predicted value of 91.45 mm and had a residual of 7.15 mm, so in the residual plot it is placed at $(91.45, 7.15)$.
Creating a residual plot is sort of like tipping the scatterplot over so the regression line is horizontal.

```{r scattHeadLTotalLResidualPlot, fig.cap = "Residual plot for the model predicting head length from total length for brushtail possums."}
m_head_total <- lm(head_l ~ total_l, data = possum)
m_head_total_aug <- augment(m_head_total)

ggplot(m_head_total_aug, aes(x = .fitted, y = .resid)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  labs(
    x = "Predicted values of head length (mm)",
    y = "Residuals"
  ) +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  geom_point(data = m_head_total_aug %>% filter(total_l == 76), shape = 21, stroke = 2, size = 4, color = COL["red", "full"]) +
  geom_segment(aes(x = preds[1], y = obs[1]-preds[1], xend = preds[1], yend = 0), color = COL["red", "full"], inherit.aes = FALSE) +
  geom_point(data = m_head_total_aug %>% filter(total_l == 85, head_l == 98.6), shape = 5, stroke = 2, size = 5, color = COL["green", "full"]) +
  geom_segment(aes(x = preds[2], y = obs[2]-preds[2], xend = preds[2], yend = 0), color = COL["green", "full"], inherit.aes = FALSE) +
  geom_point(data = m_head_total_aug %>% filter(total_l == 95.5, head_l == 94), shape = 2, stroke = 2, size = 5, color = COL["yellow", "full"]) +
  geom_segment(aes(x = preds[3], y = obs[3]-preds[3], xend = preds[3], yend = 0), color = COL["yellow", "full"], inherit.aes = FALSE)
```

::: {.workedexample}
One purpose of residual plots is to identify characteristics or patterns still apparent in data after fitting a model.
Figure \@ref(fig:sampleLinesAndResPlots) shows three scatterplots with linear models in the first row and residual plots in the second row. Can you identify any patterns remaining in the residuals?

---

In the first data set (first column), the residuals show no obvious patterns.
The residuals appear to be scattered randomly around the dashed line that represents 0.

The second data set shows a pattern in the residuals.
There is some curvature in the scatterplot, which is more obvious in the residual plot.
We should not use a straight line to model these data. Instead, a more advanced technique should be used.

The last plot shows very little upwards trend, and the residuals also show no obvious patterns.
It is reasonable to try to fit a linear model to the data.
However, it is unclear whether the slope parameter is statistically discernible from zero.
The point estimate of the slope parameter, labeled $b_1$, is not zero, but we might wonder if this could just be due to chance.
We will address this sort of scenario in Chapter \@ref(inference-reg).
:::

```{r sampleLinesAndResPlots, fig.cap="Sample data with their best fitting lines (top row) and their corresponding residual plots (bottom row).", out.width="100%"}
neg_lin <- simulated_scatter %>% filter(group == 6)
neg_cur <- simulated_scatter %>% filter(group == 7)
random  <- simulated_scatter %>% filter(group == 8)

neg_lin_mod <- augment(lm(y ~ x, data = neg_lin))
neg_cur_mod <- augment(lm(y ~ x, data = neg_cur))
random_mod  <- augment(lm(y ~ x, data = random))

p_neg_lin <- ggplot(neg_lin, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_neg_cur <- ggplot(neg_cur, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_random <- ggplot(random, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_neg_lin_res <- ggplot(neg_lin_mod, aes(x = .fitted, y = .resid)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_neg_cur_res <- ggplot(neg_cur_mod, aes(x = .fitted, y = .resid)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_random_res <- ggplot(random_mod, aes(x = .fitted, y = .resid)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1))

p_neg_lin     + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) +
p_neg_cur     + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) + p_random +
p_neg_lin_res + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) +
p_neg_cur_res + theme(plot.margin = unit(c(0, 10, 5, 0), "pt")) + p_random_res +
  plot_layout(ncol = 3, heights = c(2, 1))
```

### Describing linear relationships with correlation

We've seen plots with strong linear relationships and others with very weak linear relationships.
It would be useful if we could quantify the strength of these linear relationships with a statistic.

::: {.onebox}
**Correlation: strength and direction of a linear relationship.**

**Correlation** which always takes values between -1 and 1, is a summary statistic that describes the strength (by its magnitude) and direction (by its sign) of the linear relationship between two variables. We denote the correlation by $r$ or $R$.
:::

```{r include=FALSE}
terms_chp_6 <- c(terms_chp_6, "correlation")
```

We can compute the correlation using a formula, just as we did with the sample mean and standard deviation.
This formula is rather complex^[Formally, we can compute the correlation for observations $(x_1, y_1)$, $(x_2, y_2)$, \..., $(x_n, y_n)$ using the formula $$R = \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y}$$ where $\bar{x}$, $\bar{y}$, $s_x$, and $s_y$ are the sample means and standard deviations for each variable.],
and like with other statistics, we generally perform the calculations on a computer or calculator.
Figure \@ref(fig:posNegCorPlots) shows eight plots and their corresponding correlations. Only when the relationship is perfectly linear is the correlation either -1 or 1. If the relationship is strong and positive, the correlation will be near +1. If it is strong and negative, it will be near -1. If there is no apparent linear relationship between the variables, then the correlation will be near zero.

```{r posNegCorPlots, fig.cap="Sample scatterplots and their correlations. The first row shows variables with a positive relationshiop, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a low value in the other.", out.width="100%", fig.asp=0.5}
#library(ggpubr) # Adding here instead of _common.R to avoid collision with ggimage

simulated_scatter %>%
  filter(group %in% 9:16) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  theme_void() +
  facet_wrap(~group, nrow = 2, scales = "free_x") +
  theme(
    panel.border = element_rect(colour = "gray", fill = NA, size = 1),
    strip.background = element_blank(),
    strip.text.x = element_blank()
  ) +
  stat_cor(aes(label = ..r.label..))
```

The correlation is intended to quantify the strength and direction of a linear trend.
Nonlinear trends, even when strong, sometimes produce correlations that do not reflect the strength of the relationship; see three such examples in
Figure \@ref(fig:corForNonLinearPlots).

```{r corForNonLinearPlots, fig.cap="Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables, However, because the relationship is nonlinear, the correlation is relatively weak.", fig.asp=0.3, out.width="100%"}
simulated_scatter %>%
  filter(group %in% 17:19) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  theme_void() +
  facet_wrap(~group, nrow = 1, scales = "free_x") +
  theme(
    panel.border = element_rect(colour = "gray", fill = NA, size = 1),
    strip.background = element_blank(),
    strip.text.x = element_blank()
  ) +
  stat_cor(aes(label = ..r.label..))
```

::: {.guidedpractice}
No straight line is a good fit for the data sets represented in Figure \@ref(fig:corForNonLinearPlots).
Try drawing nonlinear curves on each plot.
Once you create a curve for each, describe what is important in your fit.^[We'll leave it to you to draw the lines. In general, the lines you draw should be close to most points and reflect overall trends in the data.]
:::


## Least squares regression {#least-squares-regression}

Fitting linear models by eye is open to criticism since it is based on an individual's preference. In this section, we use *least squares regression* as a more rigorous approach.

### Gift aid for freshman at Elmhurst College

This section considers family income and gift aid data from a random sample of fifty students in the freshman class of Elmhurst College in Illinois.
Gift aid is financial aid that does not need to be paid back, as opposed to a loan.

::: {.data}
The data from this study can be found in the [openintro](http://openintrostat.github.io/openintro) package: [`elmhurst`](http://openintrostat.github.io/openintro/reference/elmhurst.html).
:::

A scatterplot of the data is shown in Figure \@ref(fig:elmhurstScatterW2Lines) along with two linear fits.
The lines follow a negative trend in the data; students who have higher family incomes tended to have lower gift aid from the university.

```{r elmhurstScatterW2Lines, fig.cap = "Gift aid and family income for a random sample of 50 freshman students from Elmhurst College, shown with the least squares line (solid line) and line fit by minimizing the sum of the residual magnitudes (dashed line)."}
sum.abs.dev <- function(b, x, y){
  # Function to compute sum of absolute deviations
  # b = vector of regression coefficients
  return(sum(abs( y - (b[1]+b[2]*x) )))
}
abs.coef <- optim(lm(gift_aid ~ family_income, data=elmhurst)$coef, sum.abs.dev, x = elmhurst$family_income, y = elmhurst$gift_aid)$par
ggplot(elmhurst, aes(x = family_income, y = gift_aid)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  geom_abline(intercept = abs.coef[1], slope = abs.coef[2], size = 0.5, color = "black", lty=2) +
  scale_y_continuous(labels = label_dollar(scale = 1, suffix = "K", accuracy = 1)) +
  scale_x_continuous(labels = label_dollar(scale = 1, suffix = "K", accuracy = 1)) +
  labs(
    x = "Family income",
    y = "Gift aid from university"
  )
```

::: {.guidedpractice}
Is the correlation positive or negative in Figure \@ref(fig:elmhurstScatterW2Lines)?^[Larger family incomes are associated with lower amounts of aid, so the correlation will be negative. Using a computer, the correlation can be computed: -0.499.]
:::

### An objective measure for finding the best line

We begin by thinking about what we mean by "best". Mathematically, we want a line that has small residuals.
The first option that may come to mind is to minimize the sum of the residual magnitudes:

$$|e_1| + |e_2| + \dots + |e_n|$$

which we could accomplish with a computer program.
The resulting dashed line shown in Figure \@ref(fig:elmhurstScatterW2Lines) demonstrates this fit can be quite reasonable.
However, a more common practice is to choose the line that minimizes the sum of the squared residuals:

$$e_{1}^2 + e_{2}^2 + \dots + e_{n}^2$$

The line that minimizes this **least squares criterion** is represented as the solid line in Figure \@ref(fig:elmhurstScatterW2Lines).
This is commonly called the **least squares line**.
The following are three possible reasons to choose this option instead of trying to minimize the sum of residual magnitudes without any squaring:

```{r include=FALSE}
terms_chp_6 <- c(terms_chp_6, "least squares criterion", "least squares line")
```

1.  It is the most commonly used method.
2.  Computing the least squares line is widely supported in statistical software.
3.  In many applications, a residual twice as large as another residual is more than twice as bad. For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the residuals accounts for this discrepancy.

The first two reasons are largely for tradition and convenience; the last reason explains why the least squares criterion is typically most helpful.^[There are applications where the sum of residual magnitudes may be more useful, and there are plenty of other criteria we might consider. However, this book only applies the least squares criterion.]

### Finding and interpreting the least squares line

For the Elmhurst data, we could write the equation of our linear regression model as
$$aid = \beta_0 + \beta_{1}\times \textit{family_income} + \epsilon.$$
Here the model equation is set up to predict gift aid based on a student's family income, which would be useful to students considering Elmhurst.
The two unknown values $\beta_0$ and $\beta_1$ are the parameters of the linear regression model, and $\epsilon$ represents the random error.

The least squares regression line, computed based on the observed data, provides estimates of the parameters $\beta_0$ and $\beta_1$:
$$\widehat{aid} = b_0 + b_{1}\times \textit{family_income}.$$
In practice, this estimation is done using a computer in the same way that other estimates, like a sample mean, can be estimated using a computer or calculator.

The dataset where these data are stored is called `elmhurst` in the `openintro` package. Run the code below to load the data set into your RStudio session.

```{r, eval=FALSE}
library(openintro)
dat(elmhurst)
```

The first five rows of this dataset are given in Table \@ref(tab:elmhurst-data).

```{r elmhurst-data}
elmhurst %>%
  slice_head(n = 5) %>%
  kable(caption = "First five rows of the `elmhurst` dataset.") %>%
  kable_styling(full_width = FALSE)
```

We can see that family income is recorded in a variable called `family_income` and gift aid from university is recorded in a variable called `gift_aid`.
For now, we won't worry about the `price_paid` variable.
We should also note that these data are from the 2011-2012 academic year, and all monetary amounts are given in \$1,000s, i.e., the family income of the first student in the data shown in Table \@ref(tab:elmhurst-data) is \$92,900 and they received a gift aid of $21,700. (The data source states that all numbers have been rounded to the nearest whole dollar.)

Using these data, we can estimate the linear regression line by fitting a `l`inear `m`odel to the data with the `lm()` function in R.

```{r}
m_ga_fi <- lm(gift_aid ~ family_income, data = elmhurst) %>% tidy()
m_ga_fi_int   <- m_ga_fi %>% filter(term == "(Intercept)") %>% pull(estimate)
m_ga_fi_slope <- m_ga_fi %>% filter(term == "family_income") %>% pull(estimate)
```

```{r echo=TRUE}
lm(gift_aid ~ family_income, data = elmhurst)
```

The model output tells us that the intercept is approximately `r m_ga_fi_int` and the slope is approximately `r m_ga_fi_slope`.

But what do these values mean?
Interpreting parameters in a regression model is often one of the most important steps in the analysis.

::: {.workedexample}
The intercept and slope estimates for the Elmhurst data are $b_0$ = `r m_ga_fi_int` and $b_1$ = `r m_ga_fi_slope`.
What do these numbers really mean?

---

Interpreting the estimated slope is helpful in almost any application.
For each additional \$1,000 of family income, we would predict a student to receive a net difference of 1,000 $\times$ (-0.0431) = -\$43.10 in aid on average, i.e., \$43.10 *less*.
Note that a higher family income corresponds to less aid because the coefficient of family income is negative in the model.
We must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables because these data are observational.
That is, increasing a student's family income may not cause the student's aid to drop. (It would be reasonable to contact the college and ask if the relationship is causal, i.e., if Elmhurst College's aid decisions are partially based on students' family income.) A more appropriate interpretation would then be: An additional \$1,000 of family income is associated with an estimated decrease of \$43.10 in aid on average.

The estimated intercept $b_0$ = `r m_ga_fi_int` describes the average aid if a student's family had no income.
The meaning of the intercept is relevant to this application since the family income for some students at Elmhurst is \$0.
In other applications, the intercept may have little or no practical value if there are no observations where $x$ is near zero.
:::

::: {.onebox}
**Interpreting parameters estimated by least squares.**

The **slope** describes the estimated difference in the $y$ variable if the explanatory variable $x$ for a case happened to be one unit larger.

The **intercept** describes the average outcome of $y$ if $x=0$ *and* the linear model is valid all the way to $x=0$, which in many applications is not the case.
:::


::: {.workedexample}
Suppose a high school senior is considering Elmhurst College.
Can they simply use the linear equation that we have estimated to calculate her financial aid from the university?

---

She may use it as an estimate, though some qualifiers on this approach are important.
First, the data all come from one freshman class, and the way aid is determined by the university may change from year to year.
Second, the equation will provide an imperfect estimate.
While the linear equation is good at capturing the trend in the data, no individual student's aid will be perfectly predicted.
:::


Statistical software is usually used to compute the least squares line and the typical output generated as a result of fitting regression models looks like the one shown in Table \@ref(tab:rOutputForIncomeAidLSRLine).
For now we will focus on the first column of the output, which lists ${b}_0$ and ${b}_1$.
In Chapter \@ref(inference-reg) we will dive deeper into the remaining columns which give us information on how accurate and precise these values of intercept and slope that are calculated from a sample of 50 students are in estimating the population parameters of intercept and slope for *all* students.

```{r rOutputForIncomeAidLSRLine}
m_ga_fi %>%
  kable(caption = "Summary of least squares fit for the Elmhurst data.") %>%
  kable_styling(full_width = FALSE)
```

### Calculating the least squares regression line using summary statistics (special topic)

An alternative way of calculating the values of intercept and slope of a least squares line is manual calculations using formulas.
While this method is not commonly used by practicing statisticians and data scientists, it is useful to work through the first time you're learning about the least squares line and modeling in general.
Calculating these values by hand leverages two properties of the least squares line:

- The slope of the least squares line can be estimated by

$$b_1 = \frac{s_y}{s_x} r $$

where $r$ is the correlation between the two variables, and $s_x$ and $s_y$ are the sample standard deviations of the explanatory variable and response, respectively.

- If $\bar{x}$ is the sample mean of the explanatory variable and $\bar{y}$ is the sample mean of the vertical variable, then the point $(\bar{x}, \bar{y})$ is on the least squares line.

Table \@ref(tab:summaryStatsElmhurstRegr) shows the sample means for the family income and gift aid as \$101,780 and \$19,940, respectively.
We could plot the point $(102, 19.9)$ on Figure \@ref(fig:elmhurstScatterW2Lines) to verify it falls on the least squares line (the solid line).

```{r summaryStatsElmhurstRegr}
elmhurst %>%
  select(family_income, gift_aid) %>%
  summarise(
    fi_m = mean(family_income),
    fi_s = sd(family_income),
    ga_m = mean(gift_aid),
    ga_s = sd(gift_aid),
    r    = cor(family_income, gift_aid)
  ) %>%
  kable(
    col.names = c("mean", "sd", "mean", "sd", "R"),
    align = "ccccc",
    caption = "Summary statistics for family income and gift aid."
  ) %>%
  kable_styling(full_width = FALSE) %>%
  add_header_above(c("Family income, $x$" = 2, "Gift aid, $y$" = 2, " " = 1))
```

Next, we formally find the point estimates $b_0$ and $b_1$ of the parameters $\beta_0$ and $\beta_1$.

::: {.workedexample}
Using the summary statistics in Table \@ref(tab:summaryStatsElmhurstRegr), compute the slope for the regression line of gift aid against family income.

---

Compute the slope using the summary statistics from Table \@ref(tab:summaryStatsElmhurstRegr):

$$b_1 = \frac{s_y}{s_x} r = \frac{5.46}{63.2}(-0.499) = -0.0431$$
:::

You might recall the form of a line from math class, which we can use to find the model fit, including the estimate of $b_0$. Given the slope of a line and a point on the line, $(x_0, y_0)$, the equation for the line can be written as

$$y - y_0 = slope\times (x - x_0)$$


::: {.onebox}
**Identifying the least squares line from summary statistics.**

To identify the least squares line from summary statistics:

-   Estimate the slope parameter, $b_1 = (s_y / s_x) R$.
-   Noting that the point $(\bar{x}, \bar{y})$ is on the least squares line, use $x_0 = \bar{x}$ and $y_0 = \bar{y}$ with the point-slope equation: $y - \bar{y} = b_1 (x - \bar{x})$.
-   Simplify the equation, which would reveal that $b_0 = \bar{y} - b_1 \bar{x}$.
:::


::: {.workedexample}
Using the point (102, 19.9) from the sample means and the slope estimate $b_1 = -0.0431$, find the least-squares line for predicting aid based on family income.

---

Apply the point-slope equation using $(102, 19.9)$ and the slope $b_1 = -0.0431$:

$$\begin{aligned}
y - y_0  &= b_1 (x - x_0) \\
y - 19.9 &= -0.0431(x - 102)
\end{aligned}$$

Expanding the right side and then adding 19.9 to each side, the equation simplifies:

$$\begin{aligned}
\widehat{aid} = 24.3 - 0.0431 \times \textit{family_income}
\end{aligned}$$

Here we have replaced $y$ with $\widehat{aid}$ and $x$ with *family_income* to put the equation in context.
The final equation should always include a "hat" on the variable being predicted, whether it is a generic "$y$" or a named variable like "$aid$".
:::


### Extrapolation is treacherous

> *When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February $6^{th}$ it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.*^[http://www.cc.com/video-clips/l4nkoq]
>
>Stephen Colbert
>April 6th, 2010

Linear models can be used to approximate the relationship between two variables. However, these models have real limitations.
Linear regression is simply a modeling framework.
The truth is almost always much more complex than our simple line.
For example, we do not know how the data outside of our limited window will behave.

::: {.workedexample}
Use the model $\widehat{aid} = 24.3 - 0.0431 \times \textit{family_income}$ to estimate the aid of another freshman student whose family had income of $1 million.

---

We want to calculate the aid for a family with $1 million income.
Note that in our model, this will be represented as 1,000 since the data are in $1,000s.

$$24.3 - 0.0431 \times 1000 = -18.8 $$

The model predicts this student will have -$18,800 in aid (!).
However, Elmhurst College does not offer *negative aid* where they select some students to pay extra on top of tuition to attend.
:::

Applying a model estimate to values outside of the realm of the original data is called **extrapolation**.
Generally, a linear model is only an approximation of the real relationship between two variables.
If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.

```{r include=FALSE}
terms_chp_6 <- c(terms_chp_6, "extrapolation")
```

### Describing the strength of a fit

We evaluated the strength of the linear relationship between two variables earlier using the correlation, $r$. However, it is more common to explain the strength of a linear fit using $r^2$, called **r-squared**.
If provided with a linear model, we might like to describe how closely the data cluster around the linear fit.

```{r include=FALSE}
terms_chp_6 <- c(terms_chp_6, "r-squared")
```

The $r^2$ of a linear model describes the amount of variation in the response that is explained by the least squares line.
For example, consider the Elmhurst data, shown in Figure \@ref(fig:elmhurstScatterW2Lines).
The variance of the response variable, aid received, is about $s_{aid}^2 \approx 29.8$ million.
However, if we apply our least squares line, then this model reduces our uncertainty in predicting aid using a student's family income.
The variability in the residuals describes how much variation remains after using the model: $s_{_{RES}}^2 \approx 22.4$ million.
In short, there was a reduction of
$$\frac{s_{aid}^2 - s_{_{RES}}^2}{s_{aid}^2}
  = \frac{29800 - 22400}{29800}
  = \frac{7500}{29800}
  \approx 0.25$$
or about 25% in the data's variation by using information about family income for predicting aid using a linear model.
This corresponds exactly to the r-squared value:

$$r = -0.499 \rightarrow r^2 = 0.25$$
The squared correlation coefficient, $r^2$ (or $R^2$), is also called the **coefficient of determination**.
```{r include=FALSE}
terms_chp_6 <- c(terms_chp_6, "coefficient of determination")
```

::: {.onebox}
**Coefficient of determination: proportion of variability in the response explained by the model.**

Since $r$ is always between $-1$ and $1$, $r^2$ will always be between $0$ and $1$. This statistic is called the **coefficient of determination** and measures the proportion of variation in the response variable, $y$, that can be explained by the linear model with predictor $x$.
:::

::: {.workedexample}
Examine the scatterplot of head length (mm) versus total length (cm) of possums in Figure \@ref(fig:scattHeadLTotalLLine).
The correlation between these two variables is $R = 0.69$.
Find and interpret the coefficient of determination.

---

To find $r^2$, we square the correlation: $r^2 = (0.69)^2 = 0.48$.
This tells us that about 48% of variation in possum head length can be explained by total length.
This is visualized in Figure \@ref(fig:r-squared-explanation).
:::

```{r r-squared-explanation, fig.cap = "For these 104 possums, the range of head lengths is about 103 $-$ 83 = 20 mm. However, among possums of the same total length (e.g., 85 cm), the range in head lengths is reduced to about 10 mm, or about a 50% reduction, which matches $r^2 = 0.48$, or 48%."}
ggplot(possum, aes(x = total_l, y = head_l)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  labs(
    x = "Total Length (cm)",
    y = "Head Length (mm)"
  ) +
  geom_smooth(method = "lm", se = FALSE, color = "black", size = 0.5) +
  annotate("rect",
           xmin = 84, xmax = 86,
           ymin = min(possum$head_l[possum$total_l == 84]),
           ymax = max(possum$head_l[possum$total_l == 85]),
           fill = COL["gray","full"], alpha = 0.2) +
  annotate("text", x = 83, y = 93, hjust = 1, vjust = -1,
           label = "Variability in\nhead lengths\nfor 85 cm possums",
           color = COL["gray", "full"], size = 5) +
  annotate("rect",
           xmin = max(possum$total_l)-2, xmax = max(possum$total_l),
           ymin = min(possum$head_l), ymax = max(possum$head_l),
           fill = COL["green","full"], alpha = 0.2) +
  annotate("text", x = 93, y = 86, hjust = 1, vjust = 1,
           label = "Total variability in\nhead lengths\namong all possums",
           color = COL["green", "full"], size = 5)
```

::: {.guidedpractice}
If a linear model has a very strong negative relationship with a correlation of -0.97, how much of the variation in the response is explained by the explanatory variable?^[About $r^2 = (-0.97)^2 = 0.94$ or 94% of the variation is explained by the linear model.]
:::

More generally, $r^2$ can be calculated as a ratio of a measure of variability around the line divided by a measure of total variability.

::: {.onebox}
**Sums of squares to measure variability in $y$.**

We can measure the variability in the $y$ values by how far they tend to fall from their mean, $\bar{y}$. We define this value as the **total sum of squares**,

$$
SST = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2.
$$

Left-over variability in the $y$ values if we know $x$ can be measured by the **sum of squared errors**, or sum of squared residuals^[The difference $SST - SSE$ is called the **regression sum of squares**, $SSR$, and can also be calculated as $SSR = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + \cdots + (\hat{y}_n - \bar{y})^2$. $SSR$ represents the variation in $y$ that was accounted for in our model.],

$$
SSE = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2 = e_{1}^2 + e_{2}^2 + \dots + e_{n}^2
$$

The **coefficient of determination** can then be calculated as

$$
r^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}
$$
:::

```{r include=FALSE}
terms_chp_6 <- c(terms_chp_6, "total sum of squares", "sum of squared error")
```

::: {.workedexample}
Among 104 possums, the total variability in head length (mm) is $SST = 1315.2$^[$SST$ can be calculated by finding the sample variance, $s^2$ and multiplying by $n-1$.]. The sum of squared residuals is $SSE = 687.0$. Find $r^2$.

---

Since we know $SSE$ and $SST$, we can calculate $r^2$ as

$$
r^2 = 1 - \frac{SSE}{SST} = 1 - \frac{687.0}{1315.2} = 0.48,
$$
the same value we found when we squared the correlation: $r^2 = (0.69)^2 = 0.48$.
:::

### Categorical predictors with two levels (special topic) {#categorical-predictor-two-levels}

Categorical variables are also useful in predicting outcomes.
Here we consider a categorical predictor with two levels (recall that a *level* is the same as a *category*).
We'll consider Ebay auctions for a video game, *Mario Kart* for the Nintendo Wii, where both the total price of the auction and the condition of the game were recorded. Here we want to predict total price based on game condition, which takes values `used` and `new`.

::: {.data}
The [`mariokart`](http://openintrostat.github.io/openintro/reference/mariokart.html) data can be found in the [openintro](http://openintrostat.github.io/openintro) package.
:::

```{r}
mariokart_lt100 <- mariokart %>%
  filter(total_pr < 100) %>%
  mutate(cond = fct_relevel(cond, "used", "new"))
```

A plot of the auction data is shown in Figure \@ref(fig:marioKartNewUsed).
Note that the original dataset contains some Mario Kart games being sold at prices above \$100 but for this analysis we have limited our focus to the `r nrow(mariokart_lt100)` Mario Kart games that are sold below \$100.

```{r marioKartNewUsed, fig.cap="Total auction prices for the video game Mario Kart, divided into used ($x = 0$) and new ($x = 1$) condition games. The least squares regression line is also shown."}
mariokart_lt100 %>%
  filter(total_pr < 100) %>%
  ggplot(aes(x = as.numeric(cond) - 1, y = total_pr)) +
  geom_jitter(color = COL["blue","full"], alpha = 0.8, position = position_jitter(width = 0.05)) +
  geom_smooth(method = "lm", color = "black", se = FALSE, fullrange = TRUE) +
  scale_x_continuous(
    breaks = c(0, 1),
    minor_breaks = c(0, 1),
    limits = c(-0.2, 1.2),
    labels = c("0\n(used)", "1\n(new)")
    ) +
  labs(
    x = "Condition",
    y = "Total price"
  )
```

To incorporate the game condition variable into a regression equation, we must convert the categories into a numerical form.
We will do so using an **indicator variable** called `condnew`, which takes value 1 when the game is new and 0 when the game is used.
Using this indicator variable, the linear model may be written as

$$\widehat{price} = \beta_0 + \beta_1 \times condnew$$

<!-- ::: {.underconstruction} -->
<!-- Clean up population model equations - add error and remove hat. -->
<!-- ::: -->

The parameter estimates are given in Table \@ref(tab:marioKartNewUsedRegrSummary).

```{r include=FALSE}
terms_chp_6 <- c(terms_chp_6, "indicator variable")
```

```{r marioKartNewUsedRegrSummary}
m_total_pr_cond       <- lm(total_pr ~ cond, data = mariokart_lt100)
m_total_pr_cond_tidy  <- tidy(m_total_pr_cond)

m_total_pr_cond_int   <- m_total_pr_cond_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)
m_total_pr_cond_slope <- m_total_pr_cond_tidy %>%
  filter(term == "condnew") %>%
  pull(estimate)

m_total_pr_cond_tidy %>%
  kable(caption = "Least squares ression summary for the final auction price against the condition of the game.") %>%
  kable_styling(full_width = FALSE)
```

Using values from Table \@ref(tab:marioKartNewUsedRegrSummary), the model equation can be summarized as

$$\widehat{price} = `r m_total_pr_cond_int` + 10.90 \times condnew$$


::: {.workedexample}
Interpret the two parameters estimated in the model for the price of Mario Kart in eBay auctions.
The intercept is the estimated price when `condnew` takes value 0, i.e. when the game is in used condition.
That is, the average selling price of a used version of the game is \$42.87.

---

The slope indicates that, on average, new games sell for about \$10.90 more than used games.
:::


::: {.onebox}
**Interpreting model estimates for categorical predictors.**

The estimated intercept is the value of the response variable for the first category (i.e. the category corresponding to an indicator value of 0).
The estimated slope is the average change in the response variable between the two categories.
:::

We'll elaborate further on this topic in Chapter \@ref(explore-mult-reg), where we examine the influence of many predictor variables simultaneously using multiple regression.


## Outliers in linear regression {#outliers-in-regression}

In this section, we identify criteria for determining which outliers are important and influential.
Outliers in regression are observations that fall far from the cloud of points.
These points are especially important because they can have a strong influence on the least squares line.

### Types of outliers

:::{ .workedexample}
There are three plots shown in Figure \@ref(fig:outlierPlots1) along with the least squares line and residual plots.
For each scatterplot and residual plot pair, identify the outliers and note how they influence the least squares line.
Recall that an outlier is any point that doesn't appear to belong with the vast majority of the other points.

---

- A: There is one outlier far from the other points, though it only appears to slightly influence the line.

- B: There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn't very influential.

- C: There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud doesn't appear to fit very well.
:::

```{r outlierPlots1, fig.cap="Three plots, each with a least squares line and residual plot. All data sets have at least one outlier.", out.width="100%"}
d1 <- simulated_scatter %>% filter(group == 24)
d2 <- simulated_scatter %>% filter(group == 25)
d3 <- simulated_scatter %>% filter(group == 26)

m1_aug <- augment(lm(y ~ x, data = d1))
m2_aug <- augment(lm(y ~ x, data = d2))
m3_aug <- augment(lm(y ~ x, data = d3))

p_1 <- ggplot(d1, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1)) +
  labs(title = "A")

p_1_res <- ggplot(m1_aug, aes(x = .fitted, y = .resid)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1)) +
  ylim(-4, 4)

p_2 <- ggplot(d2, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1)) +
  labs(title = "B")

p_2_res <- ggplot(m2_aug, aes(x = .fitted, y = .resid)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1)) +
  ylim(-4, 4)

p_3 <- ggplot(d3, aes(x = x, y = y)) +
  geom_point(col = COL["blue", "full"], size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", color = "black", size = 0.5, se = FALSE) +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1)) +
  labs(title = "C")

p_3_res <- ggplot(m3_aug, aes(x = .fitted, y = .resid)) +
  geom_point(col = COL["blue", "full"], alpha = 0.7, size = 2) +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  theme_void() +
  theme(panel.border = element_rect(colour = "gray", fill = NA, size = 1)) +
  ylim(-4, 4)