Christian Althaus, Judith Bouman, Martin Wohlfender
Tu, 09:00-12:30: Reproducible reports and GitHub (Christian Althaus, Alan Haynes)
Schedule - Basic Statistics

| Time | Topic | Lecturer(s) |
|------|-------|-------------|
| Thu, 09:00-12:30 | Inference about the mean | Ben Spycher |
| Thu, 13:30-17:00 | Non-normal and dependent/paired data | Beatriz Vidondo |
| Fri, 09:00-12:30 | Inference about proportions and rates | Ben Spycher |
| Fri, 13:30-17:00 | Continue R project with a guided data analysis | Ben Spycher, Beatriz Vidondo |
Inference about the mean
| Time | Duration | Topic |
|------|----------|-------|
| 09:00-10:00 | 60 min | Getting started, descriptive analysis of quantitative data, measures of location and dispersion |
| 10:00-10:15 | 15 min | Coffee break |
| 10:15-11:30 | 75 min | Central limit theorem, confidence intervals, p-values, comparing groups |
| 11:30-11:45 | 15 min | Coffee break |
| 11:45-12:30 | 45 min | Inference about the mean assuming normality, t-tests for independent and paired samples, ANOVA |
Before you begin looking at the data…
ask yourself…
Do I know what I am trying to find out?
Do I understand where the data come from?
Do I have a plan?
Know your questions
“Figuring out” your questions only after seeing the data is not a good idea.
Texas sharpshooter by Dirk-Jan Hoek CC-BY
Know where your data come from
Data collection:
Context and purpose? Who/what are the subjects? How were they selected? Who is in and who is not? How were measurements made? How were data entered? Were they checked and cleaned?
Data preprocessing:
Which files were merged and how? How were data coded? Were any missing values imputed? Derived variables? Were any subjects excluded? Any manual or other error prone steps?
Have a plan
Analysis plan: step-by-step outline of how you will approach your questions
Background
Aims
Methods: data sources, study population, variables (outcomes, exposures, covariates), dealing with missing data, statistical methods, sensitivity analyses
Make sure data types are correctly represented in R
Read the data in again with id as integer, and sex and respsymptoms as factors (categorical variables):
data <- data |>
  mutate(id = as.integer(id),
         sex = factor(sex, levels = c(0, 1), labels = c("f", "m")),
         respsymptoms = factor(respsymptoms, levels = c(0, 1), labels = c("no", "yes")),
         asthma_hist = factor(asthma_hist))
data
# A tibble: 636 × 7
id fev1 age height sex respsymptoms asthma_hist
<int> <dbl> <dbl> <dbl> <fct> <fct> <fct>
1 638 0.64 7.47 118. m yes current asthma
2 527 0.68 8.21 113. f yes never
3 33 0.81 8.42 121. f no never
4 289 0.81 7.46 109. f no never
5 254 0.82 8.95 128 f yes current asthma
6 355 0.83 8.48 115 f yes previous asthma
7 484 0.85 8.47 132. m yes previous asthma
8 86 0.86 7.56 113 f no never
9 601 0.86 8.58 127. m no never
10 242 0.89 9.05 109. m yes never
# ℹ 626 more rows
The histogram approaches the true distribution (probability density function) as the sample size increases.
Descriptive statistics
A statistic = a quantity calculated from the data
A descriptive statistic = a statistic summarizing features of the sample, typically features of the distribution of a variable
Descriptive statistics = the process of describing the sample using such summary statistics
IMPORTANT: describing the sample … as opposed to the population
Descriptive statistics
An easy way to obtain some important descriptive statistics on your variables.
summary(data)
id fev1 age height sex
Min. : 1.0 Min. :0.640 Min. : 7.116 Min. :105.6 f:335
1st Qu.:160.8 1st Qu.:1.397 1st Qu.: 8.493 1st Qu.:119.9 m:301
Median :319.5 Median :1.580 Median : 8.909 Median :124.0
Mean :319.8 Mean :1.595 Mean : 8.984 Mean :124.1
3rd Qu.:479.2 3rd Qu.:1.790 3rd Qu.: 9.627 3rd Qu.:128.0
Max. :638.0 Max. :2.690 Max. :10.440 Max. :149.0
respsymptoms asthma_hist
no :491 current asthma : 59
yes:145 never :450
previous asthma:127
Note: There are no missing values in this data set. Otherwise, summary() would show an additional row per variable giving the number of NA's (missing values).
Location and spread
Location: Measures where on the x-axis an average value would lie (central tendency), e.g. the mean or median.
Spread: Measures how widely the values scatter around that location (dispersion). A key measure is the sample variance
\[s_x^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\overline{x})^2\]
and its square root, the sample standard deviation.
Alternative notation: \(s_x\), \(\text{SD}\), \(\text{SD}_x\)
Why divide by \(n-1\)?
The sample mean is closer to the data points than the true mean would be.
The average of the squared deviations from \(\overline{x}\) (dividing by \(n\)) therefore underestimates the true variance. Dividing by \(n-1\) (degrees of freedom) corrects for the bias.
Degrees of freedom (df)
Relative to the true mean \(\mu\) all n deviations \((x_{i}-\mu)\) are free to vary during sampling (\(\text{df}=n\)).
Relative to the sample mean \(\overline{x}\) only \(n-1\) deviations \((x_{i}-\overline{x})\) are free to vary (\(\text{df}=n-1\)).
We “use up” one df by calculating \(\overline{x}\). More generally, one df is lost for every fitted parameter used to calculate the mean (relevant in regression modelling).
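A quick simulation can illustrate this (a minimal sketch, not from the course material): dividing the squared deviations from \(\overline{x}\) by \(n\) is biased low, while dividing by \(n-1\) is not.

```r
# Minimal simulation sketch (illustration only): dividing by n underestimates
# the true variance, dividing by n - 1 removes the bias
set.seed(1)
n <- 5
sims <- replicate(1e5, {
  x <- rnorm(n)                        # true variance = 1
  c(biased = mean((x - mean(x))^2),    # sum of squares divided by n
    unbiased = var(x))                 # sum of squares divided by n - 1
})
rowMeans(sims)  # biased averages about (n - 1)/n = 0.8, unbiased about 1
```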
Other measures of dispersion
Range: difference between highest (maximum) and lowest value (minimum) \[\text{Range}=x_{(n)}-x_{(1)}\]
Interquartile range (IQR): Difference between upper and lower quartile (includes 50% of the observations).
Note that IQR and range are mostly reported as intervals rather than the length of these intervals.
tabl_summary

| Characteristic | N = 636 |
|----------------|---------|
| fev1 | |
| Mean (SD) | 1.59 (0.30) |
| Median (IQR) | 1.58 (1.40, 1.79) |
| Range | 0.64, 2.69 |
| age | |
| Mean (SD) | 8.98 (0.72) |
| Median (IQR) | 8.91 (8.49, 9.63) |
| Range | 7.12, 10.44 |
| height | |
| Mean (SD) | 124 (6) |
| Median (IQR) | 124 (120, 128) |
| Range | 106, 149 |
Standardizing variables
If the values \(x_{1}, x_{2}, \ldots, x_{n}\) have mean \(\overline{x}\) and standard deviation \(s_x\)
then the standardized values \(z_i\)
\[z_i=\frac{x_{i}-\overline{x}}{s_x}\]
have sample mean \(\overline{z}=0\) and standard deviation \(s_z=1\).
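For example, with the \(FEV_1\) values from the data set above (a short illustration; scale() would give the same result):

```r
# Standardize fev1: the z-scores have sample mean 0 and standard deviation 1
z <- (data$fev1 - mean(data$fev1)) / sd(data$fev1)
c(mean = mean(z), sd = sd(z))   # mean is 0 up to rounding error, sd is 1
```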
Normal distribution
We often use the normal distribution to model the distribution of continuous variables.
Normal distribution
If \(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\), its probability density function (pdf) \(f(x)\) is given by \[f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]
The standardized variable \(Z=\frac{X-\mu}{\sigma}\) then follows a standard normal distribution with mean 0 and variance 1. \[f(z)=\frac{1}{\sqrt{2\pi }}e^{-\frac{1}{2}z^2}\]
Normal distribution
To calculate the probability that \(X<a\) we need the area under the standard normal density below the value \(\frac{a-\mu}{\sigma}\).
Example
Body height of men
Example
Probability that a randomly selected man is taller than 180cm:
\[\text{Pr}(X > 180)= 1-\text{Pr}(X \le 180) = 1 - \text{Pr}\left(Z \le \frac{180-171.5}{6.5}\right) \] We can use the cumulative distribution function pnorm in R
# Using z transformation and standard normal
z <- (180 - 171.5) / 6.5
1 - pnorm(z)
[1] 0.09548885
# or
pnorm(z, lower.tail = FALSE)
[1] 0.09548885
# or directly using the distribution of X
1 - pnorm(180, mean = 171.5, sd = 6.5)
[1] 0.09548885
Remember 1.96
Some commonly used quantiles of the standard normal distribution (you will shortly see why).
qnorm(c(0.95, 0.975, 0.995))
[1] 1.644854 1.959964 2.575829
Inference about the mean
Population and samples
Inferential statistics
Inferential statistics = Inferring properties of the population based on the sample
Parametric statistics: We make specific (parametric) assumptions about the true distribution of a variable. Example:
Assumptions: \(FEV_1\) is normally distributed within children of the same sex (\(i\)) and age (\(j\)) with means \(\mu_{i,j}\) and variances \(\sigma_{i,j}^2\).
Inference: We want to estimate the means \(\mu_{i,j}\) and test whether these differ between sexes.
Non-parametric statistics: We make no parametric assumptions about the true distribution. Example:
Using the histogram to estimate probability density function
What’s the problem?
We never observe the full population, only 1 sample. Any inference about the population is subject to error.
Sampling variation: Chance variation between samples.
Sampling distribution: The distribution of a statistic across different samples.
Regardless of the underlying distribution, the sampling distribution becomes more and more …
narrow
unimodal
symmetrical
focused around the true mean
Note: The sample mean is an unbiased estimator. Its sampling distribution is always centered around the true mean, regardless of sample size. The last point is a visual impression due to the narrowing of the curve.
The standard deviation of the sampling distribution is called the standard error.
Assuming that the observations \(x_i\) are independent, the standard error of the mean is
\[SE=\frac{\sigma}{\sqrt{n}}\]
where \(\sigma\) is the standard deviation of the underlying distribution.
As \(\sigma\) is usually unknown, we estimate it using the sample standard deviation \(s\)
\[\hat{SE}=\frac{s}{\sqrt{n}}\]
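For example, for \(FEV_1\) in the full sample (a short illustration using the data set above):

```r
# Estimated standard error of the mean FEV1 (assuming independent observations)
x <- data$fev1
sd(x) / sqrt(length(x))
```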
Central limit theorem (CLT)
As sample size increases, the sampling distribution of the mean approaches a normal distribution.
Specifically, the distribution of
\[z=\frac{\overline{x}-\mu}{SE}\] approaches the standard normal distribution. Here \(\mu\) is the mean of the underlying distribution, i.e. the true mean.
Central limit theorem (CLT)
Histograms of standardized means \(z\) for increasing sample sizes
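A small simulation along these lines (a sketch, not the code behind the slide figures) shows the effect: even for a skewed underlying distribution, the standardized means look approximately standard normal once \(n\) is moderately large.

```r
# Simulation sketch: standardized means from a skewed (exponential) distribution
set.seed(1)
n <- 50                                # try n = 5, 50, 500 to see the CLT at work
z <- replicate(10000, {
  x <- rexp(n, rate = 1)               # true mean 1, true sd 1
  (mean(x) - 1) / (sd(x) / sqrt(n))    # standardized mean using the estimated SE
})
hist(z, breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE)            # standard normal density for comparison
```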
\(FEV_1\) by age and sex
data$age_cat <- cut(data$age, seq(6, 12, 2), right = FALSE)
tbl_fev1 <- data %>%
  group_by(sex, age_cat) %>%
  summarise(n = n(), mean = mean(fev1), sd = sd(fev1))
tbl_fev1
# A tibble: 6 × 5
# Groups: sex [2]
sex age_cat n mean sd
<fct> <fct> <int> <dbl> <dbl>
1 f [6,8) 29 1.30 0.232
2 f [8,10) 275 1.53 0.267
3 f [10,12) 31 1.83 0.300
4 m [6,8) 24 1.35 0.257
5 m [8,10) 254 1.67 0.296
6 m [10,12) 23 1.85 0.256
Assuming we have a large sample, we can use the CLT to construct an approximate 95%-confidence interval (95%-CI) for the mean.
\[[\overline{x}-1.96\cdot\hat{SE},\; \overline{x}+1.96\cdot\hat{SE}]\]
Defining feature: A 95%-CI is defined such that, under repeated sampling, 95% of the intervals are expected to contain the true mean.
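The group-wise intervals below were presumably computed along these lines (a sketch; the column names match the output, and dplyr is assumed to be loaded as elsewhere in these slides):

```r
# Normal-approximation 95%-CIs per sex and age group (sketch)
tbl_fev1 <- tbl_fev1 %>%
  mutate(SE = sd / sqrt(n),
         CI_norm_lb = mean - 1.96 * SE,
         CI_norm_ub = mean + 1.96 * SE)
tbl_fev1
```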
# A tibble: 6 × 8
# Groups: sex [2]
sex age_cat n mean sd SE CI_norm_lb CI_norm_ub
<fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 f [6,8) 29 1.30 0.232 0.0431 1.21 1.38
2 f [8,10) 275 1.53 0.267 0.0161 1.50 1.56
3 f [10,12) 31 1.83 0.300 0.0538 1.73 1.94
4 m [6,8) 24 1.35 0.257 0.0525 1.25 1.45
5 m [8,10) 254 1.67 0.296 0.0186 1.63 1.71
6 m [10,12) 23 1.85 0.256 0.0534 1.75 1.96
Note: These 95%-CIs are based on the normal approximation, which can be poor in small samples, e.g. in the age groups 6-7 and 10-11 years.
Differences in means
Let’s compare the means between the sexes in the largest age group (8-9 years):
comp_mean <- tbl_fev1 %>%
  filter(age_cat == "[8,10)") %>%
  select(mean, SE, CI_norm_lb, CI_norm_ub)
print(comp_mean, digits = 3)
# A tibble: 2 × 5
# Groups: sex [2]
sex mean SE CI_norm_lb CI_norm_ub
<fct> <dbl> <dbl> <dbl> <dbl>
1 f 1.53 0.0161 1.50 1.56
2 m 1.67 0.0186 1.63 1.71
Mean \(FEV_1\) is higher in boys than in girls by 138 ml.
Large deviations of the observed difference \(D = \overline{x}_2-\overline{x}_1\) from 0 on either side provide evidence against the null hypothesis \(H_0\) of no true difference.
Note: \(H_0\) typically states the absence of a difference between groups or of an association.
Another approach - Testing
Under \(H_0: \Delta=0\), the statistic \(z_D=\frac{D}{SE_D}\) approximately follows the standard normal distribution.
The combined shaded area beyond \(|z_D|\) on either side of 0 gives us the two-sided p-value.
The P-value
P-value: The conditional probability (under repeated sampling) of obtaining a test statistic as extreme or more extreme than the one observed given \(H_0\) is true.
Calculate the two-sided p-value:
D <- comp_mean$mean[comp_mean$sex == "m"] - comp_mean$mean[comp_mean$sex == "f"]
SE_D <- sqrt(sum(comp_mean$SE^2))
z <- D / SE_D
p <- 2 * pnorm(-abs(z))
sprintf("D = %1.3f, SE_D = %1.3f, z_D = %1.3f, P = %1.3e", D, SE_D, z, p)
Normality assumption: We assume the underlying distribution (population) of our measurements \(X\) is normal with unknown mean \(\mu\) and standard deviation \(\sigma\).
Under normality, the sampling distribution of the mean \(\overline{x}\) is exactly normal with mean \(\mu\) and standard deviation \(SE=\frac{\sigma}{\sqrt{n}}\) (no approximation involved).
Remember: The standard error is the standard deviation of a sampling distribution.
The normality assumption
Example: Assume that the body height of men is normally distributed with \(\mu=170 \text{ cm}\) and \(\sigma=10 \text{ cm}\) (made-up values).
Caveat: \(\sigma\) is unknown. We can only estimate \(SE\) using \(\hat{SE}=\frac{s}{\sqrt{n}}\).
The normality assumption
In small samples, the distribution of \(z=\frac{\overline{x}-\mu}{\hat{SE}}\) is somewhat wider than the standard normal distribution (heavier tails).
Student’s t-distribution
Under the normality assumption, the statistic \(t=\frac{\overline{x}-\mu}{\hat{SE}}\) follows the \(t\)-distribution with \(n-1\) degrees of freedom (one df lost because the mean was used to obtain \(s\)).
Quantiles of the t-distribution
The \(t\)-distribution approaches the standard normal distribution as the df increase:
| df | 5 | 50 | 100 | 500 | 1000 |
|----|--------|--------|--------|--------|--------|
| \(t_{\text{df},0.975}\) | 2.5706 | 2.0086 | 1.9840 | 1.9647 | 1.9623 |
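These quantiles can be reproduced with qt():

```r
# 97.5% quantiles of the t-distribution for increasing degrees of freedom
qt(0.975, df = c(5, 50, 100, 500, 1000))
```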
Better 95%-CIs for the mean
If normality holds, we can calculate exact confidence intervals even from small samples by replacing 1.96 with the corresponding quantile of the \(t\)-distribution:
\[[\overline{x}-t_{n-1,\,0.975}\cdot\hat{SE},\; \overline{x}+t_{n-1,\,0.975}\cdot\hat{SE}]\]
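For example (a sketch using the data set above, assuming the youngest girls are the subgroup of interest):

```r
# Exact t-based 95%-CI for mean FEV1 in the youngest girls (sketch)
x <- data$fev1[data$sex == "f" & data$age_cat == "[6,8)"]
n <- length(x)
mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
# t.test(x)$conf.int returns the same interval
```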
If normality holds in both populations, we can use the \(t\)-distributed statistics to test \(H_0\). We distinguish two situations:
| Assumptions | Test statistic | df of \(t\)-distribution | R t.test option |
|-------------|----------------|--------------------------|-----------------|
| Equal variances | \(t=\frac{D}{s_p\cdot\sqrt{1/n_1+1/n_2}}\) | \(n-2\) | var.equal = TRUE |
| Unequal variances | \(t=\frac{D}{\sqrt{s_1^2/n_1+s_2^2/n_2}}\) | complicated formula | var.equal = FALSE |
Here \(s_p\) is a pooled estimator of the common standard deviation. For more details see the Wikipedia article
The second version is known as Welch’s test. It is the default in R (we recommend keeping it).
Two-sided p-values are obtained by summing the tail areas of the corresponding \(t\)-distribution below \(-|t|\) and above \(|t|\) (as for the \(z\)-test using the standard normal distribution).
Two-sample \(t\)-test
Let’s test for differences in mean \(FEV_1\) between sexes in the smaller age groups, starting with the youngest (6-7 years):
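The output below could be produced by a call along these lines (a sketch; the subsetting by age group is an assumption):

```r
# Welch two-sample t-test (var.equal = FALSE is the default) for FEV1 by sex
# in the youngest age group
t.test(fev1 ~ sex, data = subset(data, age_cat == "[6,8)"))
```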
Welch Two Sample t-test
data: fev1 by sex
t = -0.7894, df = 46.944, p-value = 0.4338
alternative hypothesis: true difference in means between group f and group m is not equal to 0
95 percent confidence interval:
-0.19012150 0.08296633
sample estimates:
mean in group f mean in group m
1.295172 1.348750
Interpretation: There is no evidence of a difference in mean \(FEV_1\) between sexes in the youngest age group (P=0.43).
Two-sample \(t\)-test
Here a summary output (last columns) of the two-sample t-test for all age groups:
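By contrast, a one-sample \(t\)-test compares a sample mean against a fixed reference value. The output below tests whether mean \(FEV_1\) in one subgroup equals 1.4 L; a call along these lines would produce it (a sketch, assuming the index vector ind selects that subgroup):

```r
# One-sample t-test against a reference mean of 1.4 L
# (`ind` is assumed to index the subgroup of interest)
t.test(data$fev1[ind], mu = 1.4)
```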
One Sample t-test
data: data$fev1[ind]
t = -2.4337, df = 28, p-value = 0.02158
alternative hypothesis: true mean is not equal to 1.4
95 percent confidence interval:
1.206940 1.383404
sample estimates:
mean of x
1.295172
Paired-samples \(t\)-test
Situations with dependent samples \(x_{1,i}\), \(x_{2,i}\) (\(i=1,...,n\)):
Repeated measurements on the same subjects (e.g. pre- post treatment, using different devices)
Measurements on persons matched on certain characteristics (e.g. same family)
Paired-samples \(t\)-test: To test \(H_0: \mu_1=\mu_2\), we run a one-sample \(t\)-test on the pairwise differences \(d_i = x_{1,i} - x_{2,i}\), testing \(H_0: \delta=0\), where \(\delta\) is the population mean of the pairwise differences.
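A minimal sketch with made-up values, illustrating the equivalence described above:

```r
# Paired t-test vs one-sample t-test on the pairwise differences (made-up data)
pre  <- c(1.32, 1.45, 1.51, 1.28, 1.60)   # hypothetical first measurements
post <- c(1.40, 1.49, 1.58, 1.35, 1.66)   # hypothetical second measurements
t.test(post, pre, paired = TRUE)          # tests H0: delta = 0
t.test(post - pre, mu = 0)                # equivalent one-sample test
```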
Comparing multiple groups - ANOVA
Analysis of variance (ANOVA) is a method for comparing means (despite the name) across multiple groups.
It is a generalization of the \(t\)-test to multiple, say \(k\), groups/samples.
It assumes normality within groups and equal variances across groups.
When \(k=2\) it is equivalent to the \(t\)-test (with equal variances)
ANOVA partitions the variability in the data into between-group and within-group sums of squares:
\[SS_{B}=\sum_{j=1}^{k} n_j\,(\hat\mu_j-\hat\mu)^2, \qquad SS_{W}=\sum_{j=1}^{k}\sum_{i\,\in\, \text{group } j}(x_{i}-\hat\mu_j)^2\]
Here the \(\hat\mu_j\) stand for the group-wise sample means, \(\hat\mu\) for the overall sample mean, and the \(x_i\) for individual \(FEV_1\) values.
ANOVA
A large value of \(SS_{B}\) relative to \(SS_{W}\) provides evidence against \(H_0\).
We use the \(F\)-statistic to test \(H_0\):
\[F=\frac{SS_{B}/(k-1)}{SS_{W}/(n-k)}\]
In our case \(F=\) 3.896
This value must be compared against the \(F\)-distribution with \(k-1=2\) and \(n-k=332\) degrees of freedom.
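The \(F\)-statistic can also be computed "by hand" from the sums of squares (a sketch, assuming dat is the analysis subset used on this slide; the result is stored as F so that the pf() call below works on it):

```r
# Computing SS_B, SS_W and the F-statistic by hand (sketch)
grp  <- dat$asthma_hist
x    <- dat$fev1
k    <- nlevels(grp)
n    <- length(x)
mu_j <- tapply(x, grp, mean)            # group-wise means
n_j  <- tabulate(grp)                   # group sizes
SS_B <- sum(n_j * (mu_j - mean(x))^2)   # between-group sum of squares
SS_W <- sum((x - ave(x, grp))^2)        # within-group sum of squares
F <- (SS_B / (k - 1)) / (SS_W / (n - k))
F
```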
ANOVA
The \(p\)-value is the tail area of the \(F\)-distribution above the test statistic.
p <- pf(F, 2, 332, lower.tail = FALSE)
p

[1] 0.02125017
This yields \(p=\) 0.02125
We thus have some evidence against \(H_0\), suggesting that mean \(FEV_1\) differs by asthma history.
ANOVA in R
All in one using the anova function:
anova(lm(fev1 ~ asthma_hist, data = dat))
Analysis of Variance Table
Response: fev1
Df Sum Sq Mean Sq F value Pr(>F)
asthma_hist 2 0.6473 0.32363 3.8964 0.02125 *
Residuals 332 27.5755 0.08306
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: ANOVA can be treated as a special case of the linear regression model. Here, the result of a linear model, executed with lm, is passed on to the anova function.
Summary
Measures of central tendency include the mean and median. The median is robust against outliers.
Measures of spread include variance, standard deviation, IQR and range.
The sample mean is an unbiased estimator.
For large \(n\) (sample size), the means from independent samples will vary normally around the true mean.
Increasing \(n\) by a factor of 100 reduces the standard error of the sample mean by a factor of 10 (i.e. divides it by 10).
Confidence intervals based on the t-distribution and t-tests for comparisons of means (one-sample and two-sample situation) are appropriate
if the normality assumption holds (regardless of sample size) or
more generally, when sample sizes are large.
The \(F\)-test (in ANOVA) generalizes the two-sample \(t\)-test to the comparison of multiple groups
While Welch’s \(t\)-test (the default in R) allows for unequal variances between groups, ANOVA strictly assumes equal variances across groups.