# Probability {#ChapProbability}
## Student Learning Objective
This section extends the notion of variability that was introduced in
the context of data to other situations. The variability of the entire
*population* and the concept of a *random variable* are discussed. These
concepts are central to the development and interpretation of
statistical inference. By the end of the chapter the student should:
- Consider the distribution of a variable in a population and compute
parameters of this distribution, such as the mean and the standard
deviation.
- Become familiar with the concept of a random variable.
- Understand the relation between the distribution of the population
and the distribution of a random variable produced by sampling a
random subject from the population.
- Identify the distribution of the random variable in simple settings
and compute its expectation and variance.
## Different Forms of Variability
In the previous chapters we examined the variability in data. In the
statistical context, data is obtained by selecting a sample from the
target population and measuring the quantities of interest for the
subjects that belong to the sample. Different subjects in the sample may
obtain different values for the measurement, leading to variability in
the data.
This variability may be summarized with the aid of a *frequency table*,
a table of *relative frequency*, or via the *cumulative relative
frequency*. A graphical display of the variability in the data may be
obtained with the aid of the *bar plot*, the *histogram*, or the *box
plot*.
Numerical summaries may be computed in order to characterize the main
features of the variability. We used the *mean* and the *median* in
order to identify the location of the distribution. The *sample
variance*, or better yet the *sample standard deviation*, as well as the
*inter-quartile range* were all described as tools to quantify the
overall spread of the data.
The aim of all these graphical representations and numerical summaries
is to investigate the variability of the data.
The subject of this chapter is to introduce two other forms of
variability, variability that is not associated, at least not directly,
with the data that we observe. The first type of variability is the
*population variability*. The other type of variability is the
variability of a *random variable*.
The notions of variability that will be presented are abstract: they are
not given in terms of the data that we observe, and they have a
mathematical-theoretical flavor to them. At first, these abstract
notions may seem to you a waste of time, unrelated to the subject matter
of the course. The opposite is true. The very core of statistical
thinking is relating observed data to theoretical and abstract models of
a phenomenon. Via this comparison, and using the tools of statistical
inference that are presented in the second half of the book,
statisticians can extrapolate insights and make statements regarding the
phenomenon on the basis of the observed data. Thereby, the abstract
notions of variability that are introduced in this chapter, and extended
in the subsequent chapters up to the end of this part of the book, are
the essential foundations for the practice of statistics.
The first notion of variability is the variability that is associated
with the population. It is similar in its nature to the variability of
the data. The difference between these two types of variability is that
the former corresponds to the variability of the quantity of interest
across all members of the population and not only for those that were
selected to the sample.
In Chapters \@ref(ChapData) and \@ref(ChapDescriptiveStat) we examined the data
set “`ex.1`" which contained data on the sex and height of a sample of
100 observations. In this chapter we will consider the sex and height of
*all* the members of the population from which the sample was selected.
The size of the relevant population is 100,000, including the 100
subjects that composed the sample. When we examine the values of the
height across the entire population we can see that different people may
have different heights. This variability of the heights is the
population variability.
The other abstract type of variability, the variability of a random
variable, is a mathematical concept. The aim of this concept is to model
the notion of randomness in measurements or the uncertainty regarding
the outcome of a measurement. In particular we will initially consider
the variability of a random variable in the context of selecting one
subject at random from the population.
Imagine we have a population of size 100,000 and we are about to select
at random one subject from this population. We intend to measure the
height of the subject that will be selected. Prior to the selection and
measurement we are not certain what value of height will be obtained.
One may associate this notion of variability with uncertainty: different
subjects may obtain different values of the measurement, and we do not
know beforehand which subject will be selected. The resulting
variability is the variability of a random variable.
Random variables can be defined for more abstract settings. Their aim is
to provide models for randomness and uncertainty in measurements. Simple
examples of such abstract random variables will be provided in this
chapter. More examples will be introduced in the subsequent chapters.
The more abstract examples of random variables need not be associated
with a specific population. Still, the same definitions that are used
for the example of a random variable that emerges as a result of
sampling a single subject from a population will apply to the more
abstract constructions.
All types of variability, the variability of the data we dealt with
before as well as the other two types of variability, can be displayed
using graphical tools and characterized with numerical summaries.
Essentially the same type of plots and numerical summaries, possibly
with some modifications, may and will be applied.
A point to remember is that the variability of the data relates to a
concrete list of data values that is presented to us. In contrast, the
other types of variability are not associated with quantities we
actually get to observe: we get to see the data for the sample, but not
the data for the rest of the population. Yet, we can still discuss the
variability of a population that is out there, even though we do not
observe the list of measurements for the entire population. (The example
of a population that we give in this chapter was artificially
constructed and serves for illustration only. In the actual statistical
context one does not obtain measurements from the entire population,
only from the subjects that went into the sample.) The discussion of the
variability in this context is theoretical in its nature. Still, this
theoretical discussion is instrumental for understanding statistics.
## A Population
In this section we introduce the variability of a population and present
some numerical summaries that characterize this variability. Before
doing so, let us review with the aid of an example some of the numerical
summaries that were used for the characterization of the variability of
data.
Recall the file “`ex1.csv`" that contains data on the height and sex of
100 subjects. (The data file can be obtained from
<http://pluto.huji.ac.il/~msby/StatThink/Datasets/ex1.csv>.) We read the
content of the file into a data frame by the name “`ex.1`" and apply the
function “`summary`" to the data frame:
```{r}
ex.1 <- read.csv("_data/ex1.csv")
summary(ex.1)
```
We saw in the previous chapter that, when applied to a numeric sequence,
the function “`summary`" produces the smallest and largest values in the
sequence, the three quartiles (including the median) and the mean. If
the input of the same function is a factor then the outcome is the
frequency in the data of each of the levels of the factor. Here “`sex`"
is a factor with two levels. From the summary we can see that 54 of the
subjects in the sample are female and 46 are male.
Notice that when the input to the function “`summary`" is a data frame,
as is the case in this example, then the output is a summary of each of
the variables of the data frame. In this example two of the variables
are numeric (“`id`" and “`height`") and one variable is a factor
(“`sex`").
Recall that the mean is the arithmetic average of the data which is
computed by summing all the values of the variable and dividing the
result by the number of observations. Hence, if $n$ is the number of
observations ($n=100$ in this example) and $x_i$ is the value of the
variable for subject $i$, then one may write the mean in a formula form
as
$$\bar x = \frac{\mbox{Sum of all values in the data}}{\mbox{Number of values in the data}} = \frac{\sum_{i=1}^n x_i}{n}\;,$$
where $\bar x$ corresponds to the mean of the data and the symbol
“$\sum_{i=1}^n x_i$" corresponds to the sum of all values in the data.
The median is computed by ordering the data values and selecting a value
that splits the ordered data into two equal parts. The first and third
quartile are obtained by further splitting each of the halves into two
quarters.
Let us discuss the variability associated with an entire target
population. The file “`pop1.csv`" that contains the population data can
be found on the internet
(<http://pluto.huji.ac.il/~msby/StatThink/Datasets/pop1.csv>). It is a
CSV file that contains the information on sex and height of an entire
adult population of some imaginary city. (The data in “`ex.1`"
corresponds to a sample from this city.) Read the population data into
`R` and examine it:
```{r}
pop.1 <- read.csv(file="_data/pop1.csv")
summary(pop.1)
```
The object “`pop.1`" is a data frame of the same structure as the data
frame “`ex.1`". It contains three variables: a unique identifier of each
subject (`id`), the sex of the subject (`sex`), and its height
(`height`). Applying the function “`summary`" to the data frame produces
the summary of the variables that it contains. In particular, for the
variable “`sex`", which is a factor, it produces the frequency of its
two categories – 48,888 females and 51,112 males – a total of 100,000 subjects.
For the variable “`height`", which is a numeric variable, it produces
the extreme values, the quartiles, and the mean.
```{r Prob1, fig.cap='Bar Plot of Height', echo=FALSE, message=FALSE,warning=FALSE, out.width = '60%', fig.align = "center"}
plot(table(pop.1$height), xlab="height",ylab="Frequency")
```
Let us concentrate on the variable “`height`". A bar plot of the
distribution of the heights in the entire population is given in
Figure \@ref(fig:Prob1) [^4_1]. Recall that a vertical bar is placed
above each value of height that appears in the population, with the
height of the bar representing the frequency of the value in the
population. One may read out of the graph or obtain from the numerical
summaries that the variable takes integer values in the range between
117 and 217 (heights are rounded to the nearest centimeter). The
distribution is centered at 170 centimeter, with the central 50% of the
values spreading between 162 and 178 centimeters.
The mean of the height in the entire population is equal to 170
centimeter. This mean, just like the mean for the distribution of data,
is obtained by the summation of all the heights in the population
divided by the population size. Let us denote the size of the entire
population by $N$. In this example $N = 100,000$. (The size of the
sample for the data was called $n$ and was equal to $n=100$ in the
parallel example that deals with the data of a sample.) The mean of an
entire population is denoted by the Greek letter $\mu$ and is read
“*mew*". (The average for the data was denoted $\bar x$). The formula of
the population mean is:
$$\mu = \frac{\mbox{Sum of all values in the population}}{\mbox{Number of values in the population}}= \frac{\sum_{i=1}^N x_i}{N}\;.$$
Observe the similarity between the definition of the mean for the data
and the definition of the mean for the population. In both cases the
arithmetic average is computed. The only difference is that in the case
of the mean of the data the computation is with respect to the values
that appear in the sample whereas for the population all the values in
the population participate in the computation.
In actual life, we will not have all the values of a variable in the
entire population. Hence, we will not be able to compute the actual
value of the population mean. However, it is still meaningful to talk
about the population mean because this number exists, even though we do
not know what its value is. As a matter of fact, one of the issues in
statistics is to try to estimate this unknown quantity on the basis of
the data we do have in the sample.
A characteristic of the distribution of an entire population is called a
*parameter*. Hence, $\mu$, the population average, is a parameter. Other
examples of parameters are the population median and the population
quartiles. These parameters are defined exactly like their data
counterparts, but with respect to the values of the entire population
instead of the observations in the sample alone.
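For the artificial population read above these parameters can be
computed directly, something that is not possible for a real population,
where only the sample is observed:

```{r}
median(pop.1$height)                   # the population median
quantile(pop.1$height, c(0.25, 0.75))  # the population quartiles
```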
Another example of a parameter is the population variance. Recall that
the sample variance was defined with the aid of the deviations
$x_i - \bar x$, where $x_i$ is the value of the measurement for the
$i$th subject and $\bar x$ is the mean for the data. In order to compute
the sample variance these deviations were squared to produce the squared
deviations. The squares were summed up and then divided by the sample
size minus one ($n-1$). The sample variance, computed from the data, was
denoted $s^2$.
The population variance is defined in a similar way. First, the
deviations from the population mean $x_i - \mu$ are considered for each
of the members of the population. These deviations are squared and the
average of the squares is computed. We denote this parameter by
$\sigma^2$ (read “*sigma squared*"). A minor difference between the
sample variance and the population variance is that for the latter we
should divide the sum of squared deviations by the population size ($N$)
and not by the population size minus one ($N-1$):
$$\begin{aligned}
\sigma^2 =& \mbox{The average square deviation in the population}\\
=& \frac{\mbox{Sum of the squares of the deviations in the population}}{\mbox{Number of values in the population}}\\
=& \frac{\sum_{i=1}^N (x_i-\mu)^2}{N}\;.\end{aligned}$$
The standard deviation of the population, yet another parameter, is
denoted by $\sigma$ and is equal to the square root of the variance. The
standard deviation summarizes the overall variability of the measurement
across the population. Again, the typical situation is that we do not
know what the actual value of the standard deviation of the population
is. Yet, we may refer to it as a quantity and we may try to estimate its
value based on the data we do have from the sample.
For the height of the subjects in our imaginary city we get that the
variance is equal to $\sigma^2 =126.1576$. The standard deviation is
equal to $\sigma = \sqrt{126.1576} = 11.23199$. These quantities can be
computed in this example from the data frame “`pop.1`" with the aid of
the functions “`var`" and “`sd`", respectively[^4_2].
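Since “`var`" and “`sd`" divide by $N-1$ rather than by $N$, the
population parameters may also be computed directly from their
definitions. A sketch, assuming the data frame “`pop.1`" is loaded as
above:

```{r}
mu <- mean(pop.1$height)               # the population mean
sigma2 <- mean((pop.1$height - mu)^2)  # average squared deviation, dividing by N
sqrt(sigma2)                           # the population standard deviation
```

For a population of size 100,000 the difference between dividing by $N$
and by $N-1$ is negligible.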
## Random Variables
In the previous section we dealt with the variability of the population.
Next we consider the variability of a random variable. As an example,
consider taking a sample of size $n=1$ from the population (a single
person) and measuring his/her height.
The object `pop.1$height` is a sequence with 100,000 entries. Think of
it as a population. We will apply the function “`sample`" to this
sequence:
```{r}
sample(pop.1$height,1)
```
The first entry to the function is the given sequence of heights. When
we set the second argument to 1 then the function selects one of the
entries of the sequence at random, with each entry having the same
likelihood of being selected. Specifically, in this example an entry
that contains the value 162 was selected. Let us run the function again:
```{r}
sample(pop.1$height,1)
```
In this instance an entry with a different value was selected. Try to
run the command several times yourself and see what you get. Would you
necessarily obtain a different value in each run?
Now let us enter the same command without pressing the return key:
```{r, eval=FALSE}
sample(pop.1$height,1)
```
Can you tell, before pressing the key, what value will you get?
The answer to this question is of course *“No"*. There are 100,000
entries with a total of 94 distinct values. In principle, any of the
values may be selected and there is no way of telling in advance which
of the values will turn out as an outcome.
A random variable is the future outcome of a measurement, **before** the
measurement is taken. It does not have a specific value, but rather a
collection of potential values with a distribution over these values.
After the measurement is taken and the specific value is revealed, the
random variable ceases to be a random variable! Instead, it becomes
data.
Although one is not able to say what the outcome of a random variable
will turn out to be, one may still identify patterns in the potential
outcomes. For example, knowing that the distribution of heights in the
population ranges between 117 and 217 centimeter, one may say in advance
that the outcome of the measurement must also be in that interval.
Moreover, since there is a total of 3,476 subjects with height equal to
168 centimeter and since the likelihood of each subject to be selected
is equal then the likelihood of selecting a subject of this height is
3,476/100,000 = 0.03476. In the context of random variables we call this
likelihood *probability*. In the same vein, the frequency of subjects
with height 192 centimeter is 488, and therefore the probability of
measuring such a height is 0.00488. The frequency of subjects with
height 200 centimeter or above is 393, hence the probability of
obtaining a measurement in the range between 200 and 217 centimeter is
0.00393.
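These probabilities are simply relative frequencies in the population,
and may be computed in `R` (assuming the data frame “`pop.1`" is loaded
as before):

```{r}
mean(pop.1$height == 168)  # probability of selecting a subject of height 168
mean(pop.1$height == 192)  # probability of height 192
mean(pop.1$height >= 200)  # probability of height 200 or above
```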
### Sample Space and Distribution
Let us turn to the formal definition of a random variable: A random
variable refers to a numerical value, typically the outcome of an
observation, a measurement, or a function thereof.
A random variable is characterized via the collection of potential
values it may obtain, known as the *sample space* and the likelihood of
obtaining each of the values in the sample space (namely, the
probability of the value). In the given example, the sample space
contains the 94 integer values that are marked in
Figure \@ref(fig:Prob1). The probability of each value is the height
of the bar above the value, divided by the total frequency of 100,000
(namely, the relative frequency in the population).
We will denote random variables with capital Latin letters such as $X$,
$Y$, and $Z$. Values they may obtain will be marked by small Latin
letters such as $x$, $y$, $z$. For the probability of values we will use
the letter “$\Prob$". Hence, if we denote by $X$ the measurement of
height of a random individual that is sampled from the given population
then:
$$\Prob(X = 168) = 0.03476$$ and
$$\Prob(X \geq 200) = 0.00393\;.$$
Consider, as yet another example, the probability that the height of a
random person sampled from the population differs from 170 centimeter by
no more than 10 centimeters. (In other words, that the height is between
160 and 180 centimeters.) Denote by $X$ the height of that random
person. We are interested in the probability
$\Prob(|X -170| \leq 10)$.[^4_3]
The random person can be any of the subjects of the population with
equal probability. Thus, the sequence of the heights of the 100,000
subjects represents the distribution of the random variable $X$:
```{r}
pop.1 <- read.csv(file="_data/pop1.csv")
X <- pop.1$height
```
Notice that the object “`X`" is a sequence of length 100,000 that stores
all the heights of the population. The probability we seek is the
relative frequency, in this sequence, of values between 160 and 180.
First we compute the probability and then explain the method of
computation:
```{r}
mean(abs(X-170) <= 10)
```
We get that the height of a person randomly sampled from the population
is between 160 and 180 centimeters with probability 0.64541.
Let us produce a small example that will help us explain the computation
of the probability. We start by forming a sequence with 10 numbers:
```{r}
Y <- c(6.3, 6.9, 6.6, 3.4, 5.5, 4.3, 6.5, 4.7, 6.1, 5.3)
```
The goal is to compute the proportion of numbers that are in the range
$[4,6]$ (or, equivalently, $\{|Y-5| \leq 1\}$).
The function “`abs`" computes the absolute value of its input argument.
When the function is applied to the sequence “`Y-5`" it produces a
sequence of the same length with the distances between the components of
“`Y`" and the number 5:
```{r}
abs(Y-5)
```
Compare the resulting output to the original sequence. The first value
in the input sequence is 6.3. Its distance from 5 is indeed 1.3. The
fourth value in the input sequence is 3.4. The difference 3.4 - 5 is
equal to -1.6, and when the absolute value is taken we get a distance of
1.6.
The function “`<=`" expects an argument to the right and an argument to
the left. It compares each component to the left with the parallel
component to the right and returns a logical value, “`TRUE`" or
“`FALSE`", depending on whether the relation that is tested holds or
not:
```{r}
abs(Y - 5) <= 1
```
Observe that in this example the function “`<=`" produced 10 logical
values, one for each of the elements of the sequence to the left of it.
The first input in the sequence “`Y`" is 6.3, which is more than one
unit away from 5. Hence, the first output of the logical expression is
“`FALSE`". On the other hand, the last input in the sequence “`Y`" is
5.3, which is within the range. Therefore, the last output of the
logical expression is “`TRUE`".
Next, we compute the proportion of “`TRUE`" values in the sequence:
```{r}
mean(abs(Y - 5) <= 1)
```
When a sequence with logical values is entered into the function
“`mean`" then the function replaces the `TRUE`’s by 1 and the `FALSE`’s
by 0. The average produces then the relative frequency of `TRUE`’s in
the sequence as required. Specifically, in this example there are 4
`TRUE`’s and 6 `FALSE`’s. Consequently, the output of the final
expression is $4/10 = 0.4$.
The computation of the probability that the sampled height falls within
10 centimeter of 170 is based on the same code. The only differences are
that the input sequence “`Y`" is replaced by the sequence of population
heights “`X`", the number “`5`" is replaced by the number “`170`", and
the number “`1`" is replaced by the number “`10`". In both cases the
result of the computation is the relative proportion of times that the
values of the input sequence fall within the given range of the
indicated number.
The probability function of a random variable is defined for any value
that the random variable may obtain and produces the *distribution* of
the random variable. The probability function may emerge as a relative
frequency as in the given example or it may be a result of theoretical
modeling. Examples of theoretical random variables are presented mainly
in the next two chapters.
The sample space and the probability function specify the distribution
of a random variable. Consider an example: assume it is known that a
random variable $X$ may obtain the values 0, 1, 2, or 3. Moreover,
imagine that it is known that $\Prob(X=1) = 0.25$, $\Prob(X=2) = 0.15$,
and $\Prob(X=3)= 0.10$. What is $\Prob(X=0)$, the probability that $X$
is equal to 0?
The sample space, the collection of possible values that the random
variable may obtain, is the collection $\{0,1,2,3\}$. Observe that the
sum of the probabilities over the positive values is:
$$\Prob(X > 0) = \Prob(X=1) + \Prob(X=2) + \Prob(X=3) = 0.25 + 0.15 + 0.10 = 0.50\;.$$
It follows, since the sum of probabilities over the entire sample space
is equal to 1, that $\Prob(X=0) = 1- 0.5 = 0.5$.
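The complement rule used here can be spelled out in `R`:

```{r}
p.positive <- 0.25 + 0.15 + 0.10  # P(X > 0)
1 - p.positive                    # P(X = 0), by the complement rule
```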
```{r Probability1, echo=FALSE, out.width = '60%', fig.align = "center"}
tab1 = data.frame(Value=c(0,1,2,3),Probability=c(.5,.25,.15,.1),Cum.Prob=c(.5,.75,.9,1))
knitr::kable(
tab1,
booktabs = TRUE,
caption = 'The Distribution of $X$'
)
```
Table \@ref(tab:Probability1) summarizes the distribution of the random
variable $X$. Observe the similarity between the probability function
and the notion of relative frequency that was discussed in
Chapter \@ref(ChapData). Both quantities describe distribution. Both are
non-negative and sum to 1. Likewise, notice that one may define the
cumulative probability the same way the cumulative relative frequency is
defined: ordering the values of the random variable from smallest to
largest, the cumulative probability at a given value is the sum of the
probabilities for values less than or equal to the given value.
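The cumulative probabilities of Table \@ref(tab:Probability1) may be
produced with the function “`cumsum`", which computes the running sums
of a sequence:

```{r}
p <- c(0.50, 0.25, 0.15, 0.10)  # probabilities of the ordered values 0, 1, 2, 3
cumsum(p)                       # the cumulative probabilities
```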
Knowledge of the probabilities of a random variable (or the cumulative
probabilities) enables the computation of other probabilities that are
associated with the random variable. For example, considering the random
variable $X$ of Table \@ref(tab:Probability1), we may calculate the
probability of $X$ falling in the interval $[0.5, 2.3]$. Observe that
the given range contains two values from the sample space, 1 and 2,
therefore:
$$\Prob(0.5 \leq X \leq 2.3) = \Prob(X=1) + \Prob(X = 2) = 0.25 + 0.15 = 0.40\;.$$
Likewise, we may produce the probability of $X$ obtaining an odd value:
$$\Prob(X = \mbox{odd}) = \Prob(X=1) + \Prob(X=3) = 0.25 + 0.10 = 0.35\;.$$
Observe that both $\{0.5 \leq X \leq 2.3\}$ and $\{X = \mbox{odd}\}$
refer to subsets of values of the sample space. Such subsets are denoted
*events*. In both examples the probability of the event was computed by
the summation of the probabilities associated with values that belong to
the event.
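In `R`, the probability of an event may be obtained by summing the
probabilities over the values of the sample space that belong to it. A
sketch for the two events above:

```{r}
x <- c(0, 1, 2, 3)              # the sample space
p <- c(0.50, 0.25, 0.15, 0.10)  # the probabilities of the values
sum(p[x >= 0.5 & x <= 2.3])     # the event {0.5 <= X <= 2.3}
sum(p[x %% 2 == 1])             # the event {X = odd}
```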
### Expectation and Standard Deviation
We may characterize the center of the distribution of a random variable
and the spread of the distribution in ways similar to those used for the
characterization of the distribution of data and the distribution of a
population.
The *expectation* marks the center of the distribution of a random
variable. It is equivalent to the data average $\bar x$ and the
population average $\mu$, which was used in order to mark the location
of the distribution of the data and the population, respectively.
Recall from Chapter \@ref(ChapDescriptiveStat) that the average of the data
can be computed as the weighted average of the values that are present
in the data, with weights given by the relative frequency. Specifically,
we saw for the data
$$1,\; 1,\; 1,\; 2,\; 2,\; 3,\; 4,\; 4,\; 4,\; 4,\; 4$$ that
$$\frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11} =
1\times \frac{3}{11} + 2 \times \frac{2}{11} + 3 \times \frac{1}{11} + 4 \times \frac{5}{11}\;,$$
producing the value of $\bar x =2.727$ in both representations. Using a
formula, the equality between the two ways of computing the mean is
given in terms of the equation:
$$\bar x = \frac{\sum_{i=1}^n x_i}{n} = \sum_x \big(x \times (f_x/n)\big)\;.$$
In the first representation of the arithmetic mean, the average is
computed by the summation of all data points and dividing the sum by the
sample size. In the second representation, that uses a weighted sum, the
sum extends over all the unique values that appear in the data. For each
unique value the value is multiplied by the relative frequency of the
value in the data. These multiplications are summed up to produce the
mean.
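The equality of the two representations may be verified in `R` for the
small data set above:

```{r}
x <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)
mean(x)                                      # sum of all values divided by n
f <- table(x)                                # frequency of each unique value
sum(as.numeric(names(f)) * (f / length(x)))  # weighted average of unique values
```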
The expectation of a random variable is computed in the spirit of the
second formulation. The expectation of a random variable is marked with
the letter “$\Expec$" and is defined via the equation:
$$\Expec(X) = \sum_x \big(x \times \Prob(x)\big)\;.$$ In this definition
all the unique values of the sample space are considered. For each value
a product of the value and the probability of the value is taken. The
expectation is obtained by the summation of all these products. In this
definition the probability $\Prob(x)$ replaces the relative frequency
$f_x/n$ but otherwise, the definition of the expectation and the second
formulation of the mean are identical to each other.
Consider the random variable $X$ whose distribution is described in
Table \@ref(tab:Probability1). In order to obtain its expectation we
multiply each value in the sample space by the probability of that value.
Summation of the products produces the expectation (see
Table \@ref(tab:Probability2)):
$$\Expec(X) = 0 \times 0.5 + 1 \times 0.25 + 2 \times 0.15 + 3\times 0.10 = 0.85\;.$$
```{r Probability2, echo=FALSE, out.width = '60%', fig.align = "center"}
x = c(0,1,2,3)
p = c(.5,.25,.15,.1)
tab1 = data.frame(x=x,p=p,px=p*x)
knitr::kable(
tab1,
booktabs = TRUE,
col.names = c("Value","Probability","Value * Probability"),
caption = 'The Expectation of $X$'
)
cat(paste("Expectation = sum(Value*Probability) = ", sum(p*x)))
```
In the example of height we get that the expectation is equal to 170.035
centimeters. Notice that this expectation is equal to $\mu$, the mean of
the population[^4_4]. This is no accident. The expectation of a potential
measurement of a randomly selected subject from a population is equal to
the average of the measurement across all subjects.
The sample variance ($s^2$) is obtained as the sum of the squared
deviations from the average, divided by the sample size ($n$) minus 1:
$$s^2 = \frac{\sum_{i=1}^n (x_i - \bar x)^2}{n-1}\;.$$ A second
formulation for the computation of the same quantity is via the use of
relative frequencies. The formula for the sample variance takes the form
$$s^2 = \frac{n}{n-1}\sum_x \big((x - \bar x)^2\times (f_x/n)\big)\;.$$
In this formulation one considers each of the unique values that are
present in the data. For each value the deviation between the value and
the average is computed. These deviations are then squared and
multiplied by the relative frequency of the value. The products are
summed up. Finally, the sum is multiplied by the ratio between the
sample size $n$ and $n-1$ in order to correct for the fact that in the
sample variance the sum of squared deviations is divided by the sample
size minus 1 and not by the sample size.
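The agreement between the two formulations of the sample variance can
likewise be verified in R with the same data:

```r
# The data from the example above
x <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)
n <- length(x)
x.bar <- mean(x)

# First formulation: sum of squared deviations divided by n - 1
var.direct <- sum((x - x.bar)^2) / (n - 1)

# Second formulation: weighted sum over the unique values,
# corrected by the factor n/(n-1)
freq <- table(x)
vals <- as.numeric(names(freq))
var.weighted <- (n / (n - 1)) * sum((vals - x.bar)^2 * (freq / n))

var.direct    # agrees with var(x)
var.weighted  # the same value
```

Both expressions coincide with the value returned by the built-in
function `var`, which uses the $n-1$ divisor.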
In a similar way, the variance of a random variable may be defined via
the probabilities of the values that make up the sample space. For each
such value one computes the deviation from the expectation. This
deviation is then squared and multiplied by the probability of the
value. The products are summed up in order to produce the variance:
$$\Var(X) = \sum_x\big( (x-\Expec(X))^2 \times \Prob(x)\big)\;.$$ Notice
that the formula for the computation of the variance of a random
variable is very similar to the second formulation for the computation
of the sample variance. Essentially, the mean of the data is replaced by
the expectation of the random variable and the relative frequency of a
value is replaced by the probability of the value. Another difference is
that the correction factor is not used for the variance of a random
variable.
```{r tab3, echo=FALSE, error=FALSE}
x = c(0,1,2,3)
p = c(0.5,0.25,0.15,0.1)
xm = x-sum(p*x)
xx = data.frame(x,p,xm,xm2=xm^2,xm2p=p*xm^2)
knitr::kable(
xx,
booktabs = TRUE,
col.names= c("$x$", "$P(X=x)$", "$x - E[X]$", "$(x - E[X])^2$", "$P(X=x)*(x - E[X])^2$"),
caption = 'The Variance of $X$'
)
cat(paste("Var = sum(p*(x-sum(p*x))^2) = ",sum(p*(x-sum(p*x))^2)))
```
As an example consider the variance of the random variable $X$. The
computation of the variance of this random variable is carried out in
Table \@ref(tab:tab3). The values that the random variable may obtain,
namely the sample space, are given in the first column and the
probabilities of the values are given in the second column. In the third
column the deviation of each value from the expectation
$\Expec(X) = 0.85$ is computed. The fourth column contains the squares
of these deviations and the fifth and last column contains the products
of the squared deviations and the probabilities. The variance is
obtained by summing up the products in the last column. In the given
example:
$$\begin{aligned}
\Var(X) = & (0-0.85)^2 \times 0.5 + (1-0.85)^2 \times 0.25 \\ &+ (2-0.85)^2\times 0.15 + (3-0.85)^2\times 0.10= 1.0275\;.\end{aligned}$$
The standard deviation of a random variable is the square root of the
variance. The standard deviation of $X$ is
$\sqrt{\Var(X)} = \sqrt{1.0275} = 1.013657$.
In the example that involves the height of a subject selected from the
population at random we obtain that the variance is $126.1576$, equal to
the population variance, and the standard deviation is 11.23199, the
square root of the variance.
Other characterizations of the distribution that were computed for data,
such as the median, the quartiles, etc., may also be defined for random
variables.
## Probability and Statistics
Modern science may be characterized by a systematic collection of
empirical measurements and the attempt to model laws of nature using
mathematical language. The drive to deliver better measurements led to
the development of more accurate and more sensitive measurement tools.
Nonetheless, at some point it became apparent that measurements may not
be perfectly reproducible, and that repeated measurements of presumably
the exact same phenomenon will typically produce variability in the outcomes.
On the other hand, scientists also found that there are general laws
that govern this variability in repetitions. For example, it was
discovered that the average of several independent repeats of the
measurement is less variable and more reproducible than each of the
single measurements themselves.
Probability was first introduced as a branch of mathematics in the
investigation of uncertainty associated with gambling and games of
chance. During the early 19th century probability began to be used in
order to model variability in measurements. This application of
probability turned out to be very successful. Indeed, one of the major
achievements of probability was the development of the mathematical
theory that explains the phenomenon of reduced variability that is
observed when averages are used instead of single measurements. In
Chapter \@ref(ChapSampDist) we discuss the conclusions of this theory.
Statistics studies methods for inference based on data. Probability serves
as the mathematical foundation for the development of statistical
theory. In this chapter we introduced the probabilistic concept of a
random variable. This concept is key for understanding statistics. In
the rest of Part I of this book we discuss the probability theory that
is used for statistical inference. Statistical inference itself is
discussed in Part II of the book.
## Exercises
```{r tab4, echo=FALSE, out.width = '60%', fig.align = "center"}
knitr::kable(
data.frame(Value=0:5,Probability=paste(1:6,"p",sep="")),
booktabs = TRUE,
caption = 'The Distribution of $Y$'
)
```
```{exercise}
Table \@ref(tab:tab4) presents the
probabilities of the random variable $Y$.
These probabilities are a
function of the number $p$, the probability of the value “0". Answer the
following questions:
1. What is the value of $p$?
2. $\Prob(Y <3 )$ = ?
3. $\Prob(Y = \mbox{odd})$ = ?
4. $\Prob(1 \leq Y < 4)$ = ?
5. $\Prob(|Y -3| < 1.5)$ = ?
6. $\Expec(Y)$ = ?
7. $\Var(Y)$ = ?
8. What is the standard deviation of $Y$?
```
```{exercise}
One invests \$2 to participate in a game of chance.
In this game a coin is tossed three times. If all tosses end up “Head"
then the player wins \$10. Otherwise, the player loses the investment.
1. What is the probability of winning the game?
2. What is the probability of losing the game?
3. What is the expected gain for a player of this game?
(Notice that the expectation can obtain a negative value.)
```
## Summary
### Glossary {#glossary .unnumbered}
Random Variable:
: The probabilistic model for the value of a measurement, before the
measurement is taken.
Sample Space:
: The set of all values a random variable may obtain.
Probability:
: A number between 0 and 1 which is assigned to a subset of the sample
space. This number indicates the likelihood of the random variable
obtaining a value in that subset.
Expectation:
: The central value for a random variable. The expectation of the
random variable $X$ is marked by $\Expec(X)$.
Variance:
: The (squared) spread of a random variable. The variance of the
random variable $X$ is marked by $\Var(X)$. The standard deviation
is the square root of the variance.
### Discussion in the Forum {#discussion-in-the-forum .unnumbered}
Random variables are used to model situations in which the outcome,
before the fact, is uncertain. One component in the model is the sample
space. The sample space is the list of all possible outcomes. It
includes the outcome that took place, but also all other outcomes that
could have taken place but never did materialize. The rationale behind
the consideration of the sample space is the intention to put the
outcome that took place in context. What do you think of this rationale?
When forming your answer to this question you may give an example of a
situation from your own field of interest for which a random variable can
serve as a model. Identify the sample space for that random variable and
discuss the importance (or lack thereof) of the correct identification
of the sample space.
For example, consider a factory that produces car parts that are sold to
car makers. The role of the QA personnel in the factory is to validate
the quality of each batch of parts before the shipment to the client.
To achieve that, a sample of parts may be subjected to a battery of
quality tests. Say that 20 parts are selected for the sample. The number
of those among them that will not pass the quality testing may be
modeled as a random variable. The sample space for this random variable
consists of the numbers 0, 1, 2, …, 20.
The number 0 corresponds to the situation where all parts in the sample
passed the quality testing. The number 1 corresponds to the case where 1
part did not pass and the other 19 did. The number 2 describes the case
where 2 of the 20 did not pass and 18 did pass, etc.
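As an illustration, if one is willing to assume that the parts fail the
tests independently of each other with a common probability, then the
distribution over this sample space is Binomial. The 5% defect
probability below is a made-up figure used only for the sketch:

```r
# Hypothetical model: each of the 20 sampled parts fails the quality
# tests independently with probability 0.05 (an assumed figure)
n.parts <- 20
p.fail <- 0.05

sample.space <- 0:n.parts   # the possible values of the random variable
probs <- dbinom(sample.space, size = n.parts, prob = p.fail)

sum(probs)                  # the probabilities over the sample space sum to 1
sum(sample.space * probs)   # the expectation of the number of failing parts
```

Under this assumed model the expectation equals $20 \times 0.05 = 1$
failing part per sample. The Binomial distribution itself is discussed
in the next chapter.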
### Summary of Formulas {#summary-of-formulas .unnumbered}
Population Size:
: $N$ = the number of people, things, etc. in the population.
Population Average:
: $\mu = (1/N)\sum_{i=1}^N x_i$
Expectation of a Random Variable:
: $\Expec(X) = \sum_x \big(x \times \Prob(x)\big)$
Population Variance:
: $\sigma^2 = (1/N)\sum_{i=1}^N (x_i-\mu)^2$
Variance of a Random Variable:
: $\Var(X) = \sum_x\big( (x-\Expec(X))^2 \times \Prob(x)\big)$
[^4_1]: Such a bar plot can be produced with the expression
“`plot(table(pop.1$height))`".
[^4_2]: Observe that the function “`var`" computes the sample variance.
Consequently, the sum of squares is divided by $N-1$. We can correct
that when computing the population variance by multiplying the
result by $N-1$ and dividing by $N$. Notice that the difference
between the two quantities is negligible for a large population.
Henceforth we will use the functions “`var`" and “`sd`" to compute
the variance and standard deviations of populations without the
application of the correction.
[^4_3]: The expression $\{|X -170| \leq 10\}$ reads as “the absolute value
of the difference between $X$ and 170 is no more than 10". In other
words, $\{-10 \leq X - 170 \leq 10\}$, which is equivalent to the
statement that $\{160 \leq X \leq 180\}$. It follows that
$\Prob(|X-170|\leq 10) = \Prob(160\leq X \leq 180)$.
[^4_4]: The mean of the population can be computed with the expression
“`mean(pop.1$height)`".