Lecture 1
The aim of this lecture is to help you understand:
- that there exist mathematical functions that describe different distributions
- what makes the normal distribution normal and what its properties are
- how random fluctuations affect sampling and parameter estimates
- the function of the sampling distribution and the standard error
- the Central Limit Theorem
With this knowledge you’ll build a solid foundation for understanding all the statistics we will be learning in this programme!
Before we start, let’s make clear an important distinction between sample statistics, population parameters, and parameter estimates.
What we really want when we’re analysing data using statistical methods is to know the value of the population parameter(s) of interest. This could be the population mean, or the difference between the means of two populations. The problem is that we can’t measure these parameters directly, because it’s not possible to observe the entire population.
What we do instead is collect a sample and observe the sample statistic, such as the sample mean. We then use this sample statistic as an estimate – the best guess – of the value of the population parameter.
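To make this concrete, here is a minimal sketch in R. The population values (a mean of 170 and an SD of 10) are made up for the illustration; in real research they are unknown, which is exactly why we estimate them:

# draw a sample of 50 observations from a hypothetical population
# with mean 170 and SD 10 (made-up values for illustration)
smpl <- rnorm(50, mean = 170, sd = 10)
# the sample mean is our best guess -- the estimate -- of the population mean
mean(smpl)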
To make this distinction clear in notation, we use Greek letters for population parameters, Latin letters for sample statistics, and letters with a hat for parameter estimates. For the mean, \(\mu\) is the population parameter, \(\bar{x}\) is the sample statistic, and \(\hat{\mu}\) is the parameter estimate.
Same goes for, for example, the standard deviation: \(\sigma\) is the population parameter, \(s\) is the sample statistic, and \(\hat{\sigma}\) is the parameter estimate.
Let’s briefly revisit the general topic of distributions. Numerically speaking, a distribution is the number of observations for each value of a variable. We could ask the people in our sample whether they like cats, dogs, both, or neither, and the resulting counts would be the distribution of the “pet person” variable (or whatever we choose to call it).
Inspecting a variable’s distribution tells us which values occur more often and which less often.
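A minimal sketch of this in R (the preference probabilities are made up for illustration); table() counts the observations for each value:

# simulate 100 made-up answers to the pet question
pets <- sample(c("cats", "dogs", "both", "neither"), 100,
               replace = TRUE, prob = c(.35, .35, .2, .1))
# the counts per value are the distribution of the variable
table(pets)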
We can visualise a distribution as the shape formed by the bars of a bar chart (for a categorical variable, such as eye colour) or a histogram (for a continuous variable, such as age):
# packages: tidyverse for tibble()/ggplot(), cowplot for plot_grid()
library(tidyverse)
library(cowplot)
theme_col <- "darkorange" # density-line colour; defined in the course setup chunk

# simulate 555 people with an eye colour (categorical) and an age (continuous)
df <- tibble(eye_col = sample(c("Brown", "Blue", "Green", "Gray"), 555,
                              replace = TRUE, prob = c(.55, .39, .04, .02)),
             age = rnorm(length(eye_col), 20, .65))

# bar chart of the categorical variable
p1 <- df %>%
  ggplot(aes(x = eye_col)) +
  geom_bar(fill = c("skyblue4", "chocolate4", "slategray", "olivedrab"), colour = NA) +
  labs(x = "Eye colour", y = "Count")

# histogram of the continuous variable, with a scaled density curve on top
p2 <- df %>%
  ggplot(aes(x = age)) +
  geom_histogram() +
  stat_density(aes(y = after_stat(density) * 80), geom = "line",
               color = theme_col, linewidth = 1) +
  labs(x = "Age (years)", y = "Count")
plot_grid(p1, p2)
Here’s how we can code this simulation:
- we draw 20 observations from the population using rnorm()
- we calculate the mean of this sample with mean()
- we repeat this procedure 100,000 times using replicate(), saving the means of all the samples in an object called x_bar
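Here is a minimal sketch of the simulation; the population mean and SD (20 and .65, borrowed from the age variable simulated above) stand in for whatever population we are sampling from:

# draw one sample of 20, reduce it to its mean, and repeat 100,000 times;
# x_bar then holds the means of all 100,000 samples
x_bar <- replicate(100000, mean(rnorm(20, mean = 20, sd = .65)))
# the distribution of x_bar is the sampling distribution of the mean
hist(x_bar)
# its SD is the standard error, which should be close to sigma / sqrt(n)
sd(x_bar)
.65 / sqrt(20)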
- A distribution is the number of observations for each value of a variable
- There are many mathematically well-described distributions
- Each has a formula allowing the calculation of the probability of drawing an arbitrary range of values (see the sketch after this list)
- The normal distribution is bell-shaped and symmetrical, and is fully described by two parameters: its mean and its standard deviation
- In a normal distribution, it is true that ~68% of observations lie within ±1 SD of the mean, ~95% within ±1.96 SD, and ~99% within ±2.58 SD
- Every known distribution has its own critical values
- Statistics of random samples differ from the parameters of the population
- The distribution of a sample statistic over repeated sampling is the sampling distribution
- The standard error of a parameter estimate is the SD of its sampling distribution
- The Central Limit Theorem says that, as the sample size gets larger, the sampling distribution of the mean approaches a normal distribution, whatever the shape of the population distribution
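To make the first few points concrete, here is a minimal sketch using R’s built-in functions for the normal distribution (the standard normal, with mean 0 and SD 1, is used for simplicity):

# probability of drawing a value within 1 SD of the mean (~68%)
pnorm(1) - pnorm(-1)
# critical values: the cut-offs enclosing the middle 95% (roughly -1.96 and 1.96)
qnorm(c(.025, .975))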
Understanding distributions, sampling distributions, standard errors, and the CLT is most of what you need in order to understand all the statistical techniques we will cover.