Lecture 1

Objectives

The aim of this lecture is to help you understand

  • that there exist mathematical functions that describe different distributions

  • what makes the normal distribution normal and what are its properties

  • how random fluctuations affect sampling and parameter estimates

  • the function of the sampling distribution and the standard error

  • the Central Limit Theorem

With this knowledge you’ll build a solid foundation for understanding all the statistics we will be learning in this programme!

It’s all Greek to me!

Before we start, let’s make clear an important distinction between sample statistics, population parameters, and parameter estimates.

What we really want when we’re analysing data using statistical methods is to know the value of the population parameter(s) of interest. This could be the population mean, or the difference between two populations. The problem is that we can’t directly measure these parameters because it’s not possible to observe the entire population.

What we do instead is collect a sample and observe the sample statistic, such as the sample mean. We then use this sample statistic as an estimate – the best guess – of the value of the population parameter.

To make this distinction clear in notation, we use Greek letters for population parameters, Latin letters for sample statistics, and Greek letters with a hat for parameter estimates:

  • \(\mu\) is the population mean
  • \(\bar{x}\) is the sample mean
  • \(\hat{\mu}\) is the estimate of the population mean

Same goes for, for example, the standard deviation: \(\sigma\) is the population parameter, \(s\) is the sample statistic, and \(\hat{\sigma}\) is the parameter estimate.
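The distinction can be made concrete with a small simulation. The sketch below assumes a hypothetical normally distributed population with \(\mu = 100\) and \(\sigma = 15\); these values, the sample size of 50, and the seed are picked only for illustration, and in real research the parameters would be unknown.

```r
set.seed(1)                            # only so the example is reproducible

mu    <- 100                           # population mean (assumed; normally unknown)
sigma <- 15                            # population SD (assumed; normally unknown)

x <- rnorm(50, mean = mu, sd = sigma)  # one random sample of 50 observations

x_bar <- mean(x)                       # sample mean, our best guess of mu
s     <- sd(x)                         # sample SD as computed by sd() (n - 1 denominator)

c(x_bar = x_bar, s = s)
```

The sample values land near, but not exactly on, the population values, which is why they are treated as estimates rather than as the parameters themselves.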

Distributions again

Let’s briefly revisit the general topic of distributions. Numerically speaking, a distribution is the number of observations for each value of a variable. We could ask the people in our sample whether they like cats, dogs, both, or neither, and the resulting counts would be the distribution of the “pet person” variable (or whatever we choose to call it).

Inspecting a variable’s distribution gives information about which values occur more often and which less often.
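In R, such a tally can be obtained with table(). The sketch below uses made-up responses to the hypothetical “pet person” question; the data are invented purely for illustration.

```r
# Invented responses from ten hypothetical participants
pet_person <- c("Cats", "Dogs", "Both", "Dogs", "Neither",
                "Dogs", "Cats", "Both", "Dogs", "Cats")

counts <- table(pet_person)  # number of observations per value
counts

prop.table(counts)           # the same distribution expressed as proportions
```

In this invented sample, "Dogs" is the most frequent value and "Neither" the least frequent.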

We can visualise a distribution as the shape formed by the bars of a bar chart/histogram.

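library(tidyverse)  # tibble(), %>%, ggplot2
library(cowplot)    # plot_grid()

# Assumption: theme_col is not defined in this section; any valid colour works
theme_col <- "steelblue"

# Simulate eye colour (discrete) and age (continuous) for 555 people
df <- tibble(eye_col = sample(c("Brown", "Blue", "Green", "Gray"), 555,
                        replace = TRUE, prob = c(.55, .39, .04, .02)),
             age = rnorm(length(eye_col), 20, .65))

# Bar chart; bars are ordered alphabetically (Blue, Brown, Gray, Green)
p1 <- df %>%
  ggplot(aes(x = eye_col)) +
  geom_bar(fill = c("skyblue4", "chocolate4", "slategray", "olivedrab"), colour = NA) +
  labs(x = "Eye colour", y = "Count")

# Histogram with a density curve rescaled (x 80) to roughly match the count axis
p2 <- df %>%
  ggplot(aes(x = age)) +
  geom_histogram() +
  stat_density(aes(y = after_stat(density) * 80), geom = "line", color = theme_col, lwd = 1) +
  labs(x = "Age (years)", y = "Count")

plot_grid(p1, p2)  # show both plots side by side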
Figure 1: Visualising distributions using a bar chart for a discrete variable (eye colour) and a histogram for a continuous variable (age)

Here’s how we can code this simulation:

  • we draw 20 observations from the population using rnorm()
  • we calculate the mean of this sample with mean()
  • we repeat this procedure 100,000 times using replicate(), saving the means of all the samples in an object called x_bar

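n <- 20       # sample size
mu <- 173     # population mean
sigma <- 23   # population SD
# draw 100,000 samples of size n and store each sample's mean
x_bar <- replicate(100000, mean(rnorm(n, mu, sigma)))
mean(x_bar)   # mean of the sample means, close to mu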
[1] 172.9956
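
The spread of these simulated sample means can be compared with the theoretical standard error of the mean, \(\sigma / \sqrt{n}\). That formula is a standard result, not something derived in this section. The sketch below repeats the simulation so that it runs on its own; the seed is arbitrary and only there for reproducibility.

```r
set.seed(2024)   # arbitrary seed, only for reproducibility

n     <- 20
mu    <- 173
sigma <- 23
x_bar <- replicate(100000, mean(rnorm(n, mu, sigma)))

sd(x_bar)        # SD of the simulated sampling distribution
sigma / sqrt(n)  # theoretical standard error, about 5.14
```

The two numbers should agree closely, which illustrates that the standard error is simply the SD of the sampling distribution.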

Take-home message

  • A distribution is the number of observations for each value of a variable

  • There are many mathematically well-described distributions

    • Normal (Gaussian) distribution is one of them
  • Each has a formula allowing the calculation of the probability of drawing a value from any given range

  • Normal distribution is

    • continuous
    • unimodal
    • symmetrical
    • bell-shaped
    • it’s the right proportions that make a distribution normal!
  • In a normal distribution it is true that

    • ∼68.2% of the data is within ±1 SD from the mean
    • ∼95.4% of the data is within ±2 SD from the mean
    • ∼99.7% of the data is within ±3 SD from the mean
  • Every known distribution has its own critical values

  • Statistics of random samples differ from parameters of a population

  • The distribution of a sample statistic across repeated samples is its sampling distribution

  • Standard error of a parameter estimate is the SD of its sampling distribution

    • Provides a margin of error for the estimated parameter
    • The larger the sample, the less the estimate varies from sample to sample
  • Central Limit Theorem

    • Really important!
    • With increasing sample size, the sampling distribution of the mean tends to – or approximates – normal even if population distribution is not normal
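
Two of the points above can be checked directly in R. The sketch below first uses pnorm() to recover the proportions within ±1, ±2 and ±3 SD and qnorm() for a familiar critical value, then illustrates the Central Limit Theorem with a skewed population. The exponential population, the sample sizes, the seed, and the helper function skew() are assumptions chosen only for illustration.

```r
# Proportion of a normal distribution within ±1, ±2 and ±3 SD of the mean
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 3)  # 0.683 0.954 0.997

# Critical value cutting off the middle 95% of the standard normal distribution
qnorm(0.975)           # about 1.96

# CLT: means of samples drawn from a skewed (exponential) population
set.seed(42)
means_n2  <- replicate(10000, mean(rexp(2)))   # samples of size 2
means_n50 <- replicate(10000, mean(rexp(50)))  # samples of size 50

# Hypothetical helper: a simple skewness measure, 0 for a symmetrical distribution
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(n2 = skew(means_n2), n50 = skew(means_n50))
```

The skewness of the sample means shrinks towards 0 as the sample size grows, which is the pattern the Central Limit Theorem describes.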

Understanding distributions, sampling distributions, standard errors, and the CLT is most of what you need to understand all the stats techniques we will cover.