DataVisProject/DataVisualization.Rmd at main · FHKrieg/DataVisProject · GitHub

---
title: "Data Visualization Story Telling"
author: "Vincent Krieg"
date: "11/14/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Republicans don't believe in climate change, so those states will use less renewable energy

It is common knowledge that part of the Republican Party in the United States denies the existence
of man-made climate change. Since they do not believe in man-made climate change, one could expect
the share of renewable energy to be lower in the states that voted Republican in the 2020
presidential election. If, in contrast, the percentage of votes for the Republican Party in 2020
has no effect, which factors play a more important role for the percentage of renewable energy
in a state's energy mix?

```{r}
library(readr)
library(tidyverse)
library(usmap)
library(ggplot2)
library(yhat)
library(gridExtra)
library(grid)
library(ggrepel)
```

### Import of the Data ###

```{r}
totalData <- read_csv("data/totalData.csv")
```
## Description of the dataset

* The column YEAR refers to the electricity data only; it does not apply to any of the columns after Solar.
* The column Type_of_Producer gives the type of electricity producer. The focus here is only on Total Electric Power Industry, because this research looks at the entire electricity production of the country and its states.
* The columns from Total to Solar hold the electricity produced, in MWh, by the source named in the column header.
* The column Geoname holds the full name of the state.
* The column GdpinMioUSD contains the GDP by state in million USD for the year 2020.
* The column SunHours contains the average sun hours per day.
* The column AvgWindSpeed contains the average wind speed throughout the year in mph.
* The column EducationIndex is an index that represents the level of education in a state.
* The column Population contains the population of the individual states.
* The columns dem_percent and rep_percent hold the percentage of votes the Democratic and the Republican Party won in the 2020 presidential election, by state.


## First visual inspection

The first plot is supposed to give an overview of the political outcome and the energy mix of each state side by side.
To achieve this, the election results are used to determine which of the two parties gained more votes in a state,
and the state is then marked in the winning party's color. The graph on the right shows the percentage
of renewable energy in the energy mix of each state. For this, the data was filtered
for the year 2020 and for the total amount of electricity of a state, accumulated over all segments. After this, the renewable energies were summed up and divided by the total amount of energy produced in the state. <br/>
__Graph 1__
```{r}
mapDemRep <- totalData %>% filter(YEAR=="2020") %>% filter(Type_of_Producer=='Total Electric Power Industry') %>% mutate(dem_rep=if_else(dem_percent>rep_percent,"Democratic","Republican"))
df_USMap <- usmap::us_map()
cols <- c("Republican" = "red", "Democratic" = "blue")
USplot <- plot_usmap(data = mapDemRep, values="dem_rep",regions = "states",labels = TRUE) +
            scale_fill_manual(values = cols, name = "Political Party which \nwon the state",
            guide = guide_legend(reverse = TRUE))+
            labs(title = "Presidential elections US 2020") +
            theme(legend.position = "bottom")
#To adjust the size of the text size for the state label
USplot$layers[[2]]$aes_params$size <- 2
```

__Graph 2__
```{r}
greenCard <- totalData %>% filter(YEAR=="2020") %>% filter(Type_of_Producer=='Total Electric Power Industry')  %>% mutate(renewableTotal=HydroElectric+Wind+wood+biomass+Geothermal+Solar) %>% mutate(renewablePercent= (renewableTotal/Total)*100)

USplot1 <-  plot_usmap(data =greenCard, values="renewablePercent",regions = "states",labels = TRUE) +
    scale_fill_continuous(low = "white", high = "darkgreen", name = "Renewable Energy in %", label = scales::comma)+
    labs(title = "Percentage of renewable energy used by state 2020") +
    theme(legend.position = "bottom")
USplot1$layers[[2]]$aes_params$size <- 2
```
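
The renewable share used for Graph 2 can be checked by hand on a small table. The sketch below uses made-up numbers and made-up state codes, not the project data. It also assumes that a missing value for a source should count as zero, which is a choice and not something the dataset documents: a chain of `+` would turn the whole sum into NA in that case, while `rowSums(..., na.rm = TRUE)` does not.

```{r}
# Hypothetical toy data: generation in MWh for two made-up states.
toyMix <- data.frame(
  state         = c("AA", "BB"),
  Total         = c(1000, 500),
  HydroElectric = c(100, NA),   # NA mimics a source that is not reported
  Wind          = c(50, 25),
  Solar         = c(50, 25)
)
renewableCols <- c("HydroElectric", "Wind", "Solar")

# Sum the renewable sources per row; na.rm = TRUE treats a missing
# source as zero (an assumption made for this sketch).
toyMix$renewableTotal <- rowSums(toyMix[, renewableCols], na.rm = TRUE)

# Share of renewables in the total generation, in percent.
toyMix$renewablePercent <- toyMix$renewableTotal / toyMix$Total * 100
toyMix[, c("state", "renewableTotal", "renewablePercent")]
```

For the toy rows this gives 200 of 1000 MWh (20 %) and 50 of 500 MWh (10 %).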

__Merger of Graph 1 and Graph 2__
```{r}

grid.arrange(USplot,USplot1, ncol=2, top=textGrob("Politics and renewable energies in the United States",gp=gpar(fontsize=20,font=3)))


```
Looking at the first graphic with the questions from the beginning in mind, some interesting observations can be made.

1. The states with the highest amount of renewables are in the northern part of the United States
2. Idaho and South Dakota were won by the Republicans but have a high percentage of renewables
3. There is a large range of percentages of renewables between the individual states
4. There is no obvious correlation between the percentage of renewables and the party that won a state


```{r}
greenCard <- greenCard %>%
  mutate(coalPercent= (Coal/Total)*100) %>%
  mutate(HydroElectricPercent= (HydroElectric/Total)*100) %>%
  mutate(gasPercent= (Gas/Total)*100) %>%
  mutate(petroleumPercent= (Petroleum/Total)*100) %>%
  mutate(windPercent= (Wind/Total)*100) %>%
  mutate(woodPercent= (wood/Total)*100) %>%
  mutate(nuclearPercent= (Nuclear/Total)*100) %>%
  mutate(biomassPercent= (biomass/Total)*100) %>%
  mutate(geothermalPercent= (Geothermal/Total)*100) %>%
  mutate(solarPercent= (Solar/Total)*100)


```

Based on the first plot, it would be interesting to see which type of renewable energy production is used by which state.
To get this information, the processed data from part one is used, and for each state the percentage of each power source is calculated. Additionally, a function is used to make the construction of the repeated graphs
easier.
All plots are generated, so that with a small adjustment to the code any of the other energy sources could easily be displayed.
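
The per-source percentages can also be written more compactly. The following is only a sketch on made-up data, assuming dplyr 1.0 or later for `across()`; the table `toySources` and its values are hypothetical. Note that `.names = "{.col}Percent"` keeps the original capitalization, so it would yield `CoalPercent` rather than the `coalPercent` used elsewhere in this document.

```{r}
library(dplyr)

# Hypothetical toy data with the same layout as greenCard (values made up).
toySources <- tibble(
  state = c("AA", "BB"),
  Total = c(200, 400),
  Coal  = c(100, 100),
  Wind  = c(50, 300)
)

# One across() call replaces one mutate() per source.
# Each selected column is divided by Total and written to a new
# column whose name is the source name plus "Percent".
toyPercent <- toySources %>%
  mutate(across(c(Coal, Wind), ~ .x / Total * 100, .names = "{.col}Percent"))
toyPercent
```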


```{r}

createUSmap <- function(colName,colorPlot, nameScale, titlePlot){
    returnPlot <-  plot_usmap(data =greenCard, values=colName,regions = "states") +
    scale_fill_continuous(low = "white", high = colorPlot,
                          name = nameScale, label = scales::comma)+
    labs(title = titlePlot)+
    theme(legend.position = "right",
        legend.key.height = unit(4, 'mm'),
        legend.key.width = unit(1, 'mm'),
        legend.title =element_text(size=8),
        plot.title = element_text(hjust = 0.5))
    returnPlot$layers[[2]]$aes_params$size <- 2
    return(returnPlot)
}


coal <- createUSmap("coalPercent","black","% of total \nenergymix \nby state", "Coal")
hydro <- createUSmap("HydroElectricPercent","darkblue","% of total \nenergymix \nby state", "Hydro Electric")
gas <- createUSmap("gasPercent","brown","% of total \nenergymix \nby state", "Natural Gas")
petr <- createUSmap("petroleumPercent","wheat4","% of total \nenergymix \nby state", "Petroleum")
wind <- createUSmap("windPercent","lightblue","% of total \nenergymix \nby state", "Wind")
nuc <- createUSmap("nuclearPercent","green","% of total \nenergymix \nby state", "Nuclear")
bio <- createUSmap("biomassPercent","darkgreen","% of total \nenergymix \nby state", "Biomass")
geo <- createUSmap("geothermalPercent","darkred","% of total \nenergymix \nby state", "Geothermal")
solar <- createUSmap("solarPercent","yellow","% of total \nenergymix \nby state", "Solar")
wood <- createUSmap("woodPercent","saddlebrown","% of total \nenergymix \nby state", "Wood")

```

After looking through all the different sources, the four most interesting ones, in my opinion, were picked and combined into one graphic below.

## Analysis of the individual parts of the energy mix in the 50 states

The darker the color, the higher the percentage of this source in the energy mix of the state.
Be aware of the color scales: the range between the minimum and maximum value differs between the production types. This
is on purpose, to better show the variation when the maximum percentage is rather low.

```{r}
grid.arrange(solar,wood,geo,wind, ncol=2,nrow=2,
             top=textGrob(expression(bold(underline("Renewable Energy Sources"))),
                          gp=gpar(fontsize=20,font=3)))


```


## Interesting aspects
__Solar__

* Mainly used in California and Nevada
* Texas is relatively low even though the state has favorable weather conditions for this type of energy production

__Wood__

* Maine and Vermont both get more than 15 % of their energy from wood
* Georgia and New Hampshire get a little over 4 % of their energy from wood

__Geothermal__

* Hardly adds to the supply of energy throughout the US
* The hotspot for this technology is Nevada
* Followed by California


__Wind__

* The relative percentage of wind as a source of power is greatest in the center of the country
* Wind does not account for much of the energy supply on the coasts


```{r}
greenCard <- greenCard[!(greenCard$state == "US-Total"),]
greenCard <- greenCard[!(greenCard$state == "DC"),]

greenCard <- greenCard %>%
  mutate(renewablePercent = log(renewablePercent)) %>%
  mutate(PopulationInMio = Population/1000000) %>%
  mutate(GdpinBUSD = GdpinMioUSD/1000)

CorrelationPlots <- pivot_longer(greenCard,cols = c("SunHours", "AvgWindSpeed", "PopulationInMio", "EducationIndex", "rep_percent", "GdpinBUSD"), names_to = "VarName", values_to = "VarValues")
```
US-Total and DC are dropped, as they are not states and the analysis is supposed to be based on the state level.
The percentage of renewable energy is log-transformed.
Population and GDP are rescaled to make the numbers more readable.
The data is pivoted to a long format so that facet_wrap can be used for the following graph.
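
What the pivot does is easiest to see on a tiny table. The sketch below uses two made-up rows and only two of the explanatory variables; `toyWide` and its values are hypothetical. Each explanatory column becomes its own set of rows, with the column name stored in `VarName` and the value in `VarValues`, which is the shape `facet_wrap(~VarName)` needs.

```{r}
library(tidyr)

# Hypothetical wide table: one row per state, one column per variable.
toyWide <- data.frame(
  state            = c("AA", "BB"),
  renewablePercent = c(2.1, 3.4),
  SunHours         = c(5.5, 7.0),
  AvgWindSpeed     = c(12, 9)
)

# Long format: one row per state and variable. Columns that are not
# pivoted (state, renewablePercent) are repeated for every variable.
toyLong <- pivot_longer(toyWide,
                        cols = c("SunHours", "AvgWindSpeed"),
                        names_to = "VarName",
                        values_to = "VarValues")
toyLong
```

Two states and two pivoted columns give four rows in the long table.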

```{r}
p <- ggplot(CorrelationPlots, mapping = aes(VarValues, renewablePercent)) +
        geom_point(size=0.75)+
        geom_smooth(method='lm')
p + facet_wrap(~VarName, scales = "free_x", ncol = 3,
               labeller = labeller(VarName =c("SunHours" = "Average sunshine hours per day",
                                              "AvgWindSpeed" = "Average Wind Speed in mph",
                                              "PopulationInMio"= "Population in mio",
                                              "EducationIndex"="Education index",
                                              "rep_percent" = "Percentage votes Republicans 2020 ",
                                              "GdpinBUSD" = "GDP in billion USD in 2020")))+
    ylab("Log percentage of renewable energy")+
    xlab("")+
    labs(title = "Correlation of individual factors on the percentage of renewable energy used" )+
    theme(axis.title.x=element_blank(),
          panel.spacing = unit(1, "lines"),
          strip.text=element_text(size=8),
          panel.background = element_rect(fill="white", linetype = "blank"),
          panel.grid.major = element_line(color = "black", size=0.2),
          panel.grid.minor = element_line(color = "gray", size=0.2),
          strip.background =element_rect(fill = "gray94"))


```
From the graphic one can tell that GDP and population by themselves do not make much sense to use in the regression. They appear to have a similar distribution. In addition, population and GDP are correlated with each other, which raises a multicollinearity concern: the Gauss-Markov assumptions rule out perfectly collinear regressors, and strongly correlated regressors inflate the standard errors of the estimates.
Therefore it makes sense to convert them to GDP per capita in million USD, which is obtained by dividing the GDP by the population of a state.
Interestingly, when the two factors are combined they become irrelevant for the regression.
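
How strongly two regressors overlap can be quantified before deciding to combine them. The sketch below is not run on the project data: it simulates made-up population and GDP values (all names and numbers are hypothetical) and computes the correlation and a variance inflation factor by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the others.

```{r}
set.seed(1)

# Made-up data: GDP is roughly proportional to population.
toyPop <- runif(50, min = 0.5, max = 40)        # population in millions
toyGdp <- 60 * toyPop + rnorm(50, sd = 100)     # GDP in billion USD
toySun <- runif(50, min = 3, max = 8)           # unrelated third predictor

# Pairwise correlation between the two suspicious regressors.
corPopGdp <- cor(toyPop, toyGdp)
corPopGdp

# Variance inflation factor of GDP: regress GDP on the other
# predictors and plug the resulting R^2 into 1 / (1 - R^2).
r2Gdp  <- summary(lm(toyGdp ~ toyPop + toySun))$r.squared
vifGdp <- 1 / (1 - r2Gdp)
vifGdp
```

A variance inflation factor close to 1 means little overlap; which larger value counts as problematic is a matter of convention.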

```{r}
greenCard <- greenCard %>% mutate(gdpPerCap = GdpinMioUSD/Population)
lmodel <- lm(renewablePercent ~ SunHours + AvgWindSpeed + gdpPerCap + EducationIndex +  rep_percent, data = greenCard)
summary(lmodel)

```
After the adjustments, only the average wind speed and the average sun hours appear to be significant factors, at the 10 % level, for the share of renewables used in a state. So only the natural factors that determine the productivity of
renewable energies appear to play a role. The percentage of Republican votes as well as the other factors are not significant for the percentage of renewable energy produced in a state.

__Notes to the regression__

  * only explains 19.24 % of the variation in the data
  * other relevant factors have been left out (e.g. a factor that would help describe the percentage of hydro)
  * the regression still explains part of the issue, as the p-value of the F-test that all β = 0 is smaller than 0.05
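
The two numbers in these notes can be read directly from the object returned by `summary()`. The sketch below shows where they live, using the built-in `mtcars` data as a stand-in, so the values it prints are unrelated to the regression above.

```{r}
# Stand-in model on a built-in dataset (not the project data).
toyFit     <- lm(mpg ~ wt + hp, data = mtcars)
toySummary <- summary(toyFit)

# Share of the variation explained by the model.
toyR2 <- toySummary$r.squared
toyR2

# Overall F-test of the hypothesis that all slope coefficients are zero.
# fstatistic holds the F value and both degrees of freedom.
fStat    <- toySummary$fstatistic
toyPValue <- unname(pf(fStat["value"], fStat["numdf"], fStat["dendf"],
                       lower.tail = FALSE))
toyPValue
```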


## Predict all the y's based on the linear regression
```{r}
greenCard <- data.frame(greenCard, y_hat = fitted(lmodel), e = residuals(lmodel))
```

## Error analysis

When looking at the plot below, which shows the fitted y against the actual y, one can tell that the regression does not
fit well. The predicted values lie above the red line in the first half and below it in the
second half. This translates to low actual y's being overestimated and high actual y's being underestimated.
The red line marks where the prediction would perfectly match the actual y value.
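
Part of this pattern is a general property of least squares and not specific to this model. For a regression with an intercept, the residuals are uncorrelated with the fitted values but positively correlated with the actual values, and the squared correlation equals 1 - R². A model with a low R² will therefore tend to overestimate low actual values and underestimate high ones. The sketch below checks this identity on the built-in `mtcars` data with a deliberately weak stand-in model.

```{r}
# Deliberately weak stand-in model on a built-in dataset.
toyWeakFit <- lm(mpg ~ qsec, data = mtcars)
toyWeakR2  <- summary(toyWeakFit)$r.squared

# Squared correlation between residuals and actual values ...
residActualCor2 <- cor(residuals(toyWeakFit), mtcars$mpg)^2
residActualCor2

# ... equals the unexplained share of the variation.
1 - toyWeakR2

# Residuals and fitted values, in contrast, are uncorrelated.
residFittedCor <- cor(residuals(toyWeakFit), fitted(toyWeakFit))
```

Plotting the residuals against the fitted values, rather than against the actual values, is a check that does not carry this built-in slope.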


```{r}
greenCard %>% ggplot(mapping = aes(renewablePercent, y_hat))+
                geom_point(color = "steelblue")+
                geom_line(data = greenCard,mapping= aes(renewablePercent, renewablePercent), color = "darkred")+
                scale_y_continuous(expand = c(0, 0),limits = c(0.5,4.5))+
                scale_x_continuous(expand = c(0, 0),limits = c(0.5,5))+
                ylab("Predicted log renewable energy percent")+
                xlab("Actual log renewable energy percent")+
                labs(title = "Error analysis")+
                theme_bw()+
                theme(plot.title = element_text(hjust = 0.5,size=20, face= "bold"))

```

__Filter the data to provide an overview of the energy sources in the whole of the United States__

```{r}
barPlotData <- totalData %>%
      mutate(coalPercent= (Coal/Total)*100) %>%
      mutate(HydroElectricPercent= (HydroElectric/Total)*100) %>%
      mutate(gasPercent= (Gas/Total)*100) %>%
      mutate(petroleumPercent= (Petroleum/Total)*100) %>%
      mutate(windPercent= (Wind/Total)*100) %>%
      mutate(woodPercent= (wood/Total)*100) %>%
      mutate(nuclearPercent= (Nuclear/Total)*100) %>%
      mutate(biomassPercent= (biomass/Total)*100) %>%
      mutate(geothermalPercent= (Geothermal/Total)*100) %>%
      mutate(solarPercent= (Solar/Total)*100) %>%
      filter(YEAR == 2020) %>%
      filter(state == "US-Total") %>%
      filter(Type_of_Producer =="Total Electric Power Industry")
barPlotData <- pivot_longer(barPlotData,
                            cols = c("coalPercent", "HydroElectricPercent", "gasPercent",
                                     "petroleumPercent", "windPercent","woodPercent",
                                     "nuclearPercent","biomassPercent","geothermalPercent","solarPercent"),
                            names_to = "VarName",
                            values_to = "VarValues")

barPlotData <- barPlotData[order(-barPlotData$VarValues),]

```

__Graphic energymix USA__

```{r}
barPlotData %>% ggplot(mapping = aes(fct_reorder(VarName,VarValues, .desc = TRUE), VarValues)) +
                    geom_col(aes(fill = fct_reorder(VarName, VarValues, .desc = TRUE)))+
                    ylab("Percentage in the US energymix")+
                    scale_fill_manual(values=
                          c("gasPercent"="burlywood4","nuclearPercent"="green","coalPercent"="black",
                             "windPercent"="lightblue","HydroElectricPercent"="darkblue","solarPercent"="yellow",
                             "biomassPercent"="darkgreen","petroleumPercent"="grey",
                            "geothermalPercent"="darkred","woodPercent"="saddlebrown"),
                         name="Energy Source",
                         labels=c("Gas","Nuclear", "Coal", "Wind", "Hydro","Solar",
                                  "Biomass","Petroleum", "Geothermal","Wood"))+
                    labs(title = "Energymix in the United States")+
                    scale_y_continuous(expand = c(0, 0),limits = c(0,45))+
                    theme_bw()+
                    theme(axis.title.x=element_blank(),
                          axis.text.x=element_blank(),
                          axis.ticks.x=element_blank(),
                          legend.position="right",
                          plot.title = element_text(hjust = 0.5))


```
When looking at the overall data, some interesting observations can be made:

* 40% of the energy comes from gas
* The top 3 energy sources
  * Make up 80% of the energy supply
  * Are not renewable energies
* Wind currently is the biggest renewable energy source
* Hydro is the second biggest renewable energy source


This plot shows that hydroelectric power is the second biggest renewable energy source throughout the US. In order
to improve the regression, it would make sense to include a factor that can explain the percentage of hydroelectric
supply. One factor to check could be the cubic meters of water flow in the rivers of a state. It might also make sense to run regressions on the individual renewable energy sources rather than on the overall share. The factors that drive, for example, solar and wind might be very different, so it might be a good idea to look at the sources individually rather than generalizing across them.