- sample size : $n$
- population size : $N$
- population mean : $\mu$
- sample mean : $\bar{x}$
- population variance : $\sigma^2 = {\sum{(x_i - \mu)^2} \over N}$
- sample variance : $S_n^2 = {\sum(x_i - \bar{x})^2 \over n}$

On average the sample variance is less than the population variance, so in order to estimate the population variance, we modify it as follows:
- unbiased sample variance : $s^2 = {\sum(x_i - \bar{x})^2 \over n-1}$
- population standard deviation : $\sigma = \sqrt{\sigma^2} = \sqrt{\sum(x_i - \mu)^2 \over N}$
- sample standard deviation : $s = \sqrt{s^2} = \sqrt{\sum(x_i - \bar{x})^2 \over n-1}$
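A quick sketch of the biased vs. unbiased sample variance in plain Python (the data values here are made up for illustration):

```python
# Biased vs. unbiased sample variance; the data values are made up.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n                      # x-bar
ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations

biased = ss / n          # S_n^2: divide by n, underestimates sigma^2 on average
unbiased = ss / (n - 1)  # Bessel's correction: divide by n - 1

print(mean, biased, unbiased)   # 5.0 4.0 ~4.571
```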
A function mapping the outcome of an experiment to a number.
There are two types:
- discrete random variable
- continuous random variable
A single trial with two possible outcomes: success with probability $p$ and failure with probability $1-p$.
Expectation and variance of the Bernoulli distribution
$$E = p$$ $$\sigma^2 = p(1-p)$$
The binomial distribution is the sum of $n$ independent Bernoulli trials, $n > 1$.
example:
A player takes 6 shots, each with a 30% chance of making the shot. Each product $0.3^k \cdot 0.7^{6-k}$ below is the probability of one particular ordering of makes and misses; the binomial coefficient counts the orderings:

p(X = 0) = $\binom{6}{0}\,0.7^6 \approx 0.1176$
p(X = 1) = $\binom{6}{1}\,0.3 \cdot 0.7^5 \approx 0.3025$
p(X = 2) = $\binom{6}{2}\,0.3^2 \cdot 0.7^4 \approx 0.3241$
p(X = 3) = $\binom{6}{3}\,0.3^3 \cdot 0.7^3 \approx 0.1852$
p(X = 4) = $\binom{6}{4}\,0.3^4 \cdot 0.7^2 \approx 0.0595$
p(X = 5) = $\binom{6}{5}\,0.3^5 \cdot 0.7 \approx 0.0102$
p(X = 6) = $\binom{6}{6}\,0.3^6 \approx 0.0007$

In general:
$$p(X=k) = \binom{6}{k} \cdot 0.3^k \cdot 0.7^{6-k}$$
$$C_n^k = \binom{n}{k} = \frac{n!}{k!(n-k)!}$$
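The table above can be recomputed in a few lines of Python with `math.comb`:

```python
from math import comb

# Recompute the shooting example: each 0.3^k * 0.7^(6-k) is one ordering,
# and comb(6, k) counts the orderings.
p, n = 0.3, 6
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"p(X = {k}) = {prob:.4f}")

# The seven outcomes cover the whole sample space, so they sum to 1.
print(sum(pmf))
```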
Expectation and variance of the binomial distribution
$n$ independent trials, each with success probability $p$:
- $$p(x=k) = \binom{n}{k}p^k(1-p)^{n-k}$$
$$E(x) = np$$ $$\sigma^2 = np(1-p)$$
based on two assumptions:
- events happen at a stable rate; any period of time is no different from any other
- events in different time periods are independent

The expectation:
say X = number of cars passing in 1 hour, with expected value $\lambda$.
Split the hour into 60 minutes and examine whether 1 car passes in each minute, with probability ${\lambda \over 60}$ per minute.
WHAT IF more than 1 car passes in a minute ---> MORE GRANULAR
Set the interval to 1 sec:
$$p(X=k) = \binom{3600}{k}\left({\lambda \over 3600}\right)^k\left({1-{\lambda \over 3600}}\right)^{3600-k}$$
--> as the intervals get more and more granular ($n \to \infty$), this converges to the Poisson distribution:
$$p(X=k) = {\lambda^k e^{-\lambda} \over k!}$$
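The convergence can be checked numerically; the values of $\lambda = 9$ and $k = 5$ below are arbitrary choices for illustration:

```python
from math import comb, exp, factorial

lam, k = 9.0, 5   # e.g. lambda = 9 cars per hour; probability of exactly 5

def binom_approx(n):
    """Binomial(n, lam/n): one car-or-no-car check per sub-interval."""
    q = lam / n
    return comb(n, k) * q**k * (1 - q)**(n - k)

# The limit the binomial approaches: the Poisson pmf.
poisson = lam**k * exp(-lam) / factorial(k)

for n in (60, 3600, 10**6):   # minutes, seconds, ever finer intervals
    print(n, binom_approx(n))
print("limit:", poisson)
```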
NOTE :
the Z table is a cumulative distribution
Significance levels :
- $\mu \pm \sigma$ : p = 68%
- $\mu \pm 2\sigma$ : p = 95% --> common confidence interval
- $\mu \pm 3\sigma$ : p = 99.7%
the sample sum or sample mean from any distribution will approach a normal distribution given a large enough sample size
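A quick simulation of this (the population, sample size, and trial count below are arbitrary illustrative choices): sample means drawn from a uniform distribution still line up with the normal prediction.

```python
import random
import statistics

random.seed(0)

# Original population: uniform on [0, 1), mean 0.5, variance 1/12 -- not normal.
n, trials = 50, 2000
sample_means = [statistics.fmean(random.random() for _ in range(n))
                for _ in range(trials)]

print(statistics.fmean(sample_means))   # close to mu = 0.5
print(statistics.stdev(sample_means))   # close to sigma / sqrt(n) = sqrt(1/12/50)
```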
Standard Error :
$$\sigma_{\bar{x}} = {\sigma \over \sqrt{n}}$$
- $\mu$ : mean of the sampling distribution
- $\mu_0$ : mean of the original population

In most cases we are not able to get the standard deviation of the population, so the std of the sample is used to estimate the true std of the population, and subsequently to estimate the std of the sampling distribution.
p( $\bar{x}$ is within $2\sigma_{\bar{x}}$ of $\mu_{\bar{x}}$ ) = p( $\mu_{\bar{x}}$ is within $2\sigma_{\bar{x}}$ of $\bar{x}$ )
- For a point where the normalized z score is 2, the area under the curve before this point is about 97.5%, so the one-tail p value is 1 - 0.975 = 2.5%.
- For a two-tail test, both tails count: the p value is 2 * (1 - 0.975) = 5%.
- The two-tail test gives a larger p value for the same z score, making it stricter, so it is generally recommended.
If n > 30, the sample sd is considered close to $\sigma$. If n < 30, we use the t-distribution instead of the normal distribution to judge the significance level.
For two independent random variables X, Y:
- If Z = X + Y: $\mu_Z = \mu_X + \mu_Y$, $\sigma_Z^2 = \sigma_X^2 + \sigma_Y^2$
- If G = X - Y: $\mu_G = \mu_X - \mu_Y$, $\sigma_G^2 = \sigma_X^2 + \sigma_Y^2$

NOTE :
It is the variances that add, not the standard deviations.
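This can be verified exactly by enumerating a small example; two fair dice are used here as an arbitrary illustration:

```python
from itertools import product

# Two independent fair dice: enumerate every equally likely joint outcome.
faces = list(range(1, 7))

def var(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

vx = var(faces)                                          # 35/12 for one die
v_sum = var([x + y for x, y in product(faces, faces)])   # Var(X + Y)
v_diff = var([x - y for x, y in product(faces, faces)])  # Var(X - Y)

print(vx, v_sum, v_diff)   # the variances add in BOTH cases: 2 * vx
```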
If we have samples of x and y, even with different sample sizes (n, m), we will have their sampling distributions; if we are interested in the difference of their means, we will have a diff sampling distribution like this:
$$\mu_{\bar{x}-\bar{y}} = \mu_{\bar{x}} - \mu_{\bar{y}}$$ $$\sigma_{\bar{x}-\bar{y}}^2 = {\sigma_x^2 \over n} + {\sigma_y^2 \over m}$$
The one-random-variable test is to test:
- Given a population mean, we can calculate the probability of getting that sample, and do a hypothesis test (test whether the sample mean is consistent with the population mean), accepting or rejecting the hypothesis based on a p value.
- Or calculate the confidence interval when the population mean is not provided.

The two-random-variable test is to test how different the two variables are:
- Given the means, we can calculate the probability of getting those samples, and do a hypothesis test (test whether they are different; null hypothesis: they are not different, the mean of the difference is 0, and they should have the same sd, so use the pooled sd to estimate it if possible), accepting or rejecting the hypothesis based on a p value.
- Or calculate the confidence interval when the mean is not provided.
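A minimal sketch of the one-variable test, using `math.erf` for the normal CDF; all the numbers ($\mu_0$, sample mean, sd, n) are hypothetical:

```python
from math import erf, sqrt

def two_tail_p(z):
    """Two-tail p value for a standard normal z score, via the error function."""
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))   # P(Z <= |z|)
    return 2 * (1 - cdf)

# Hypothetical numbers: null hypothesis mu0 = 100; a sample of n = 40
# with mean 103 and sample sd 9 (n > 30, so the normal approximation is OK).
mu0, xbar, s, n = 100, 103, 9, 40
z = (xbar - mu0) / (s / sqrt(n))   # z score of the sample mean
p = two_tail_p(z)
print(z, p)   # reject the null at the 5% level if p < 0.05
```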
To find the line that represents the data points best, the fitted line should minimize the squared error, with the line being $y = mx + b$.

SE_line (squared error against the line) :
$$SE_{line} = \sum(y_i - (mx_i + b))^2$$

We can solve for m and b using partial derivatives (set ${\partial SE_{line} \over \partial m} = 0$ and ${\partial SE_{line} \over \partial b} = 0$).
- total variation of y : $SE_{\bar{y}} = \sum(y_i - \bar{y})^2$
- total variation NOT described by the line : $SE_{line}$

Total variation = variation described by the line + variation not described by the line
$$R^2 = 1 - {SE_{line} \over SE_{\bar{y}}}$$
Here, R squared is the coefficient of determination, showing what % of the total variation is described by the variation in x according to the line.
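The whole fit can be sketched in a few lines; the data points below are made up for illustration:

```python
# Least-squares fit y = m x + b and R^2, from the formulas above.
# The (x, y) points are made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Closed-form solution of the partial-derivative equations.
m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
b = ybar - m * xbar

se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
se_ybar = sum((y - ybar) ** 2 for y in ys)
r2 = 1 - se_line / se_ybar

print(m, b, r2)   # 0.6 2.2 0.6
```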
A test of whether distributions are different

| Day | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
| Expected % | 10 | 10 | 15 | 20 | 30 | 15 |
| Observed | 30 | 14 | 34 | 45 | 57 | 20 |
| Expected | 20 | 20 | 30 | 40 | 60 | 30 |

Chi-square statistic
$$\chi^2 = \sum_i {(O_i - E_i)^2 \over E_i}$$
df = 6 - 1 = 5, as we take 6 sums
$E_i$ = expected count, $E_i = N p_i$
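Recomputing the statistic for the table above:

```python
# Chi-square goodness-of-fit statistic for the day-of-week table above.
observed = [30, 14, 34, 45, 57, 20]
expected = [20, 20, 30, 40, 60, 30]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)   # ~11.44; the 5% critical value for df = 5 is ~11.07, so reject
```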
F statistic
$$F = {SSB / df_B \over SSW / df_W}$$
- SSB : sum of squares between groups
- SSW : sum of squares within groups
- SST : sum of squares total (SST = SSB + SSW)
Example :
| c1 | c2 | c3 |
|---|---|---|
| 3 | 5 | 5 |
| 2 | 3 | 6 |
| 1 | 4 | 7 |
$$\bar{x_1} = 2, \bar{x_2} = 4, \bar{x_3} = 6$$ $$\bar{\bar{x}} = 4$$
$$SST = (3-4)^2 + (2-4)^2 + (1-4)^2 + (5-4)^2 + (3-4)^2 + \dots = 30$$
$$SSW = (3-2)^2 + (2-2)^2 + (1-2)^2 + (5-4)^2 + (3-4)^2 + \dots = 6$$
$$SSB = (2-4)^2 \cdot 3 + (4-4)^2 \cdot 3 + (6-4)^2 \cdot 3 = 24$$
Note that SST = SSW + SSB (30 = 6 + 24).
- dfT = mn - 1 = 8
- dfW = mn - m = 6
- dfB = m - 1 = 2

(with m = 3 groups and n = 3 observations per group)
$$F = {SSB/df_B \over SSW/df_W} = {24/2 \over 6/6} = 12$$
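The sums of squares and the F statistic for this example, computed directly:

```python
# ANOVA sums of squares and F statistic for the 3-group example above.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
m = len(groups)        # number of groups
n = len(groups[0])     # observations per group

grand = sum(sum(g) for g in groups) / (m * n)   # grand mean x-double-bar
means = [sum(g) / n for g in groups]            # group means

ssw = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)
ssb = sum(n * (mu - grand) ** 2 for mu in means)
sst = sum((x - grand) ** 2 for g in groups for x in g)

f = (ssb / (m - 1)) / (ssw / (m * n - m))
print(sst, ssw, ssb, f)   # 30, 6, 24, F = (24/2)/(6/6) = 12
```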
| G1 | G2 | sum |
|---|---|---|
| a | b | a+b |
| c | d | c+d |
| a+c | b+d | n= a+b+c+d |
Example
| M | F | sum |
|---|---|---|
| 1 | 9 | 10 |
| 11 | 3 | 14 |
| 12 | 12 | 24 |
NOTE
The question to answer here is: given the row and column sums,
how likely is it to get the distribution observed?
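For the 2x2 example above, that probability can be computed exactly with binomial coefficients (this is the hypergeometric probability used by Fisher's exact test):

```python
from math import comb

# Probability of the observed 2x2 table (a, b / c, d) given its
# row and column sums, assuming the two groups are independent.
a, b, c, d = 1, 9, 11, 3
n = a + b + c + d

# Choose which a+c of the n items land in the first column:
# comb(a+b, a) ways in row 1, comb(c+d, c) ways in row 2.
p = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)
print(p)   # ~0.00135: this split is very unlikely under independence
```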