-
Notifications
You must be signed in to change notification settings - Fork 48
Expand file tree
/
Copy pathICI3D_Example_binomialDistribution.R
More file actions
335 lines (276 loc) · 11.7 KB
/
ICI3D_Example_binomialDistribution.R
File metadata and controls
335 lines (276 loc) · 11.7 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
## Fun with one of my favorite distributions: The binomial distribution
## Clinic on Meaningful Modeling of Epidemiological Data
## International Clinics on Infectious Disease Dynamics and Data Program
## African Institute for Mathematical Sciences, Muizenberg, Cape Town, RSA
##
## Jim Scott 2012
## 2026
##
## Some Rights Reserved
## CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
################################################################################
##
## By the end of this tutorial you should:
##
## * understand the terms:
## sample space, discrete, random variable, probability distribution
## * know the characteristics of a binomial process
## * know the binomial formula and how to use it
## * be able to utilize the functions dbinom, pbinom, and rbinom
## * be able to plot a binomial distribution in R
## * visualize how the parameters of the binomial distribution affect it.
# The binomial probability model is a discrete probability model that is
# commonly used in many applications, including modeling disease transmission.
# But, what IS it, exactly?
# The "discrete" part means that it produces random variables that have a finite
# number of possible values (e.g., integers). The "probability model" part means
# that it provides us with two important pieces of information: the possible
# outcomes that may occur and the probabilities with which each outcome occurs.
# There are a number of common discrete probability models that exist, but the
# binomial model is used to model the number of successes that occur in fixed
# number of independent trials. It's a good model for a process that:
# 1) has a fixed number of independent trials
# 2) has outcomes that can be categorized into either "success" or "failure"
# 3) has a constant probability of "success" per trial
# (i.e. the probability of a "success" doesn't change).
# Coin flipping is a common example of a binomial process. Suppose I flip a
# coin 10 times. One flip represents one trial. The number of trials is fixed
# (n=10). There are only two outcomes (heads or tails). The probability of
# a success (e.g. getting "tails") remains constant. The number of tails in
# 10 trials is a random variable that follows a binomial distribution.
# What is a random variable? Well, before we talk about that, let's define one
# other term - sample space. The sample space of a random phenomenon is the set
# of all possible outcomes that could occur. Note that an outcome does not need
# to be numeric. For example, consider tossing a coin one time. The sample space
# is {heads, tails}.
# If you were to toss the coin two times, the sample space would consist of:
# S = {heads heads, heads tails, tails heads, tails tails}
# and we would say the size of the sample space is 4.
# A random variable maps each outcome in the sample space to a numeric value.
# For example, if I defined the random variable X as the number of tails that
# occurs in two flips of a coin, I would observe the following:
#
# outcome X
# heads heads 0
# heads tails 1
# tails heads 1
# tails tails 2
#
# I could describe the distribution of X in the following manner:
#
# outcome probability
# X = 0 0.25
# X = 1 0.50
# X = 2 0.25
# Also, notice that the random variable X is binomial because we have:
# 1) fixed number of independent trials (n=2),
# 2) either a "success" (tails) or failure occurs (not tails) per trial, and
# 3) the probability of a "success" (getting tails) remains constant.
# Here the probability of a success in any one trial is p=0.5
# if we are tossing a fair coin.
# If we suspect that a random variable is binomial, we can use the binomial
# formula to calculate the probability that X equals some specific value. For
# example, if we didn't know that the probability of getting 0 tails on 2 flips
# is 0.25, we could have used the following formula to obtain it:
#
# Binomial formula:
# P(X = x) = n!/[(x!)(n-x)!] * p^x * (1-p)^(n-x)
#
# You might be familiar with seeing n!/[(x!)(n-x)!] written as "n choose x"
#
# In the above equation:
# x = number of successes (the observed value of the random variable X)
# n = number of trials (an input value describing the process)
# p = the probability of a success per trial (an input value describing
# the process)
# In our example, n = 2 and p = 0.5. To calculate the probability that X = 0
# (i.e. that we will not obtain any tails) use:
# P(X = 0) = 2!/[(0!)(2-0)!] * 0.5^0 * (1-0.5)^(2-0)
# = 1 * 1 * 0.25 = 0.25 (recall that 0! = 1)
# This has all been a preamble towards getting you prepared to use the binomial
# distribution in R. To obtain the calculation above in R, you need only
# execute the following command:
dbinom(0,2,0.5)
# The syntax is dbinom(x, n, p)
# See the help file using help(dbinom)
# That one isn't too hard to calculate in your head, but suppose you wanted to
# know the probability of getting exactly 20 tails if you flipped a coin 50
# times. Instead of calculating it with the formula (or a calculator), just
# use R. Here, n=50, p is still 0.5. Run:
dbinom(20, 50, 0.5)
# If you'd like to determine the entire distribution when n=50 and p=0.5 you
# could type:
dbinom(0:50, 50, 0.5)
# Nice, but not that visually stimulating. Let's plot it!
# We will use base R plotting commands in this tutorial, and in Tutorial 4,
# that we will do together during MMED, we will practice using the ggplot2
# library for plotting.
barplot(dbinom(0:50, size=50, prob=1/2),names.arg=0:50,ylab="Probability",
main="Binomial Distribution, n=50, p=1/2",xlab="Number of Tails",
col="light blue",ylim=c(0,.12))
# The binomial distribution has 2 parameters, n, the number of fixed trials, and
# p, the probability of a success. The distribution is completely determined by
# these values, just like a normal distribution is completely determined by its
# parameters, the mean and the standard deviation. To see how the distribution
# changes as p changes, try highlighting and running the following code (here
# we're holding n constant at 50), and watch the Plots window.
binom <- function(p) {
barplot(dbinom(0:50, size=50, prob=p),names.arg=0:50,ylab="Probability",
main=paste0("Binomial Distribution, n=50, p=",p),xlab="Number of Tails",
col="light blue", ylim=c(0,0.3))
Sys.sleep(0.1)
}
ignore <- sapply((1:19)/20, binom)
# If you didn't see all the plots, try clicking the back arrow a few times in
# the plot window to see all the plots that R produced.
# To get an idea of how the binomial distribution can change when n varies try
# running the following code and watch your Plots window. Here p is set to 1/2,
# after you've run the code once, try changing p to a different value.
binom <- function(n) {
barplot(dbinom(0:n, size=n, prob=1/2),names.arg=0:n,ylab="Probability",
main=paste0("Binomial Distribution, n=",n,", p=1/2"),xlab="Number of Tails",
col="light blue", ylim=c(0,0.5), xlim = c(0,25))
Sys.sleep(0.1)
}
ignore <- sapply(1:20, binom)
# Again, if you didn't see all the plots, try clicking the back arrow a few
# times in the plot window to see all the plots that R produced.
# The binomial formula provides you with just the area of one bar. To determine
# the area that is less than or equal to a specific value, for example, the
# probability that we get 20 or fewer tails in 50 flips, you could type:
sum(dbinom(0:20, 50, 0.5))
# or, equivalently
pbinom(20,50,0.5)
# The difference between these two commands is that sum(dbinom()) sums up the areas
# of each bar you specify, whereas pbinom sums up the bars that are less than or
# equal to the value of the first number in the parentheses.
# To create a visual of this you could execute the following:
ints <- 0:50
barplot(dbinom(ints, size=50, prob=1/2),names.arg=0:50,ylab="Probability",
main="Binomial Distribution, n=50, p=1/2",xlab="Number of Tails",
col=ifelse(ints<=20,"blue","light blue"))
# The area of the dark blue bars is:
pbinom(20,50,0.5)
# You can also use R to simulate binomially distributed random variables. For
# example, suppose n = 50 and p=1/2. You'd expect to get 25 successes, but due
# to random chance you won't always get 25. Sometimes you'll get a few more,
# sometimes a few less. To simulate the outcome of 50 coin tosses, just type:
rbinom(1,50,1/2)
# If you'd like to stimulate the results of 1000 people each flipping a coin
# 50 times, type:
mysims <- rbinom(1000,50,1/2)
mysims
# the output isn't so useful in that form...try plotting it instead:
hist(mysims, freq=FALSE, breaks=seq(-0.5,50.5,1),ylim=c(0,0.2),col="grey",
ylab="Number of Occurences",xlab="Number of Tails in 50 flips")
# The y-axis tells us the relative frequency of seeing different numbers of
# successes, in the 1000 trials. For many of the 1000 people, they get about
# 25 tails in their 50 flips, but some do get much fewer or many more tails.
# Your distribution should look similar to the following, theoretical binomial
# distribution:
barplot(dbinom(0:50, size=50, prob=1/2),names.arg=0:50,ylab="Probability",
main="Binomial Distribution, n=50, p=1/2",xlab="Number of Tails"
,ylim=c(0,0.2),col="light blue")
# Click the back arrow in the Plots window to compare and contrast. You should
# also notice that this particular binomial distribution is 'bell shaped' like the
# normal distribution. As a rule of thumb, whenever n*p > 10 and n*(1-p) > 10, you
# can use a normal distribution to *approximate* the binomial distribution, using
# mean = n*p and variance = n*p*(1-p). In this case, mean = 50*0.5 = 25 and
# variance = 50*0.5*0.5 = 12.5. The standard deviation of a normal distribution
# is the square root of the variance. In this case, sqrt(12.5) or 3.54. To compare
# the approximated normal distribution to your simulated binomial data, you could type:
hist(mysims, freq=FALSE, breaks=seq(-0.5,50.5,1),ylim=c(0,0.2),col="grey",
ylab="Number of occurences",xlab="Number of tails in 50 flips")
curve(dnorm(x,mean=25,sd=3.54), add=TRUE, col="blue",lwd=3)
# That concludes the tutorial!
# Refs:
# Introduction to Probability with R, K Baclawski. Chapman & Hall, 2008
# Introduction to the Practice of Statistics ed 7,
# DS Moore, GP McCabe, & BA Craig. W.H. Freeman, 2012
# To test your skills try the following problems.
#
#
#
# Roulette - on a roulette wheel, there are 18 red numbers, 18 black numbers,
# and 2 green numbers. The numbers 1 to 36 appear on the wheel along with 0 and 00.
# The probability that the roulette ball lands on any one number is the same
# (i.e. 1/38). Suppose you always bet on red, and you play 20 times.
#
#
# a) Determine the probability of each outcome (i.e., winning 0 times, 1 time,
# 2 times, etc)
# b) Determine the probability of winning at least 10 times
# C) Produce a plot of the probability distribution of the number of 'wins'
# d) Simulate the number of wins when you play roulette 20 times, for 1000
# replications
# e) Make a histogram of your results from part d)
# f) In how many of your replications did you win at least 10 times?
#
# Answers below
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
# a)
dbinom(0:20,20,18/38)
# b)
sum(dbinom(10:20, 20,18/38))
# or
1 - pbinom(9,20,18/38)
# C)
barplot(dbinom(0:20, size=20, prob=18/38),names.arg=0:20,ylab="Probability",
main="Binomial Distribution, n=20, p=18/38",xlab="Number of Wins"
,ylim=c(0,0.2))
# d)
data1<-rbinom(1000,20,18/38)
data1
# e)
hist(data1, freq=FALSE, breaks=seq(-0.5,21.5,1),ylim=c(0,0.2),col="grey",
ylab="Number of Occurences",xlab="Number of Wins in 20 plays")
# f)
length(data1[data1>9])
# or
sum(ifelse(data1>9,1,0))
# or
sum(data1>9)
# or just sort the numbers of wins and list,
# and count yourself, looking at the output on your Console
sort(data1)