Introduction
Confidence intervals (CI) are a way of estimating a population parameter (e.g., mean, variance) from a sample of data. They provide a range of values within which the true population parameter is likely to fall, with a certain level of confidence. For example, a 95% confidence level means that if we repeated the sampling many times and built an interval from each sample, about 95% of those intervals would contain the true population parameter.
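The confidence interval formula is:
$$\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$$
where:
- $\bar{x}$ is the sample mean, or the sample proportion $\hat{p}$ for probabilities
- $t$ is the t-value from the t-distribution with $n - 1$ degrees of freedom corresponding to the desired confidence level (e.g., approximately 1.96 for a 95% confidence interval when $n$ is large)
- $s$ is the sample standard deviation, calculated as:
  - for continuous metrics: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
  - for probabilities: $s = \sqrt{\hat{p}(1 - \hat{p})}$
- $n$ is the sample size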
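To make the probability case of the formula concrete, here is a minimal sketch using only the standard library. The counts (30 successes out of 200 trials) are hypothetical numbers chosen for illustration, and 1.96 is used as a large-sample approximation of the t-value.

```python
import math

# Hypothetical data for illustration: 30 successes out of 200 trials.
successes = 30
n = 200

p_hat = successes / n                 # sample proportion
s = math.sqrt(p_hat * (1 - p_hat))    # standard deviation for a probability
t_value = 1.96                        # large-sample approximation for 95%

margin = t_value * s / math.sqrt(n)   # half-width of the interval
ci = (p_hat - margin, p_hat + margin)

print(f"p_hat = {p_hat:.3f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

The structure is the same as for a continuous metric; only the expression for s changes.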
Now let’s see how to implement this in Python, step by step, on synthetic data.
Generate random data
We begin by generating synthetic data, drawing a random sample of size 100 from a normal distribution with mean 40 and standard deviation 10.
statistic | value |
count | 100.00 |
mean | 40.26 |
std | 9.15 |
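The sampling step described above could be sketched as follows. The original code is not shown here, so the seed, the variable names, and the use of pandas for the summary are assumptions. With a different seed the summary statistics will not match the table exactly.

```python
import numpy as np
import pandas as pd

# Assumption: the seed is arbitrary and only makes the sketch reproducible.
rng = np.random.default_rng(seed=42)

# Draw 100 values from a normal distribution with mean 40 and std 10.
sample = rng.normal(loc=40, scale=10, size=100)

# Summarize the sample; describe() reports the n-1 (sample) standard deviation.
summary = pd.Series(sample, name="value").describe().loc[["count", "mean", "std"]]
print(summary.round(2))
```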
Calculate confidence interval
Since the formula is straightforward, we can easily compute the confidence interval without any additional library:
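A minimal sketch of this manual calculation, assuming the same synthetic sample as above (the seed and variable names are illustrative, not taken from the original), with the t-value approximated by 1.96:

```python
import math
import numpy as np

# Assumption: arbitrary seed, used only for reproducibility.
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=40, scale=10, size=100)

n = len(sample)
mean = sample.mean()
s = sample.std(ddof=1)   # sample standard deviation (divides by n - 1)
t_value = 1.96           # approximate t-value for a 95% confidence level

margin = t_value * s / math.sqrt(n)
ci_manual = (mean - margin, mean + margin)

print(f"mean = {mean:.2f}, 95% CI = [{ci_manual[0]:.2f}, {ci_manual[1]:.2f}]")
```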
But more conveniently, we can compute it as a one-liner with the scipy package:
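One way to write the scipy version is with `scipy.stats.t.interval`, sketched below. Whether the original used exactly this call is an assumption, and the seed and variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Assumption: arbitrary seed, used only for reproducibility.
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=40, scale=10, size=100)

# 95% interval from the t-distribution with n - 1 degrees of freedom,
# centered on the sample mean and scaled by the standard error of the mean.
ci_scipy = stats.t.interval(
    0.95, df=len(sample) - 1, loc=sample.mean(), scale=stats.sem(sample)
)

print(f"95% CI = [{ci_scipy[0]:.2f}, {ci_scipy[1]:.2f}]")
```

Here `stats.sem` computes s / sqrt(n) using the n - 1 standard deviation, so the only difference from the manual method is the exact t-value.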
The intervals are almost identical. The difference in the second decimal place arises because the first, “manual” method approximates the t-value as 1.96, whereas the exact t-value for a 95% interval with 99 degrees of freedom is about 1.98.
Plot distribution and confidence interval
Finally, let’s plot a histogram of the sample distribution with the population mean, sample mean, and confidence interval of the sample mean.
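A plot of this kind could be produced with matplotlib roughly as follows. The styling, the seed, and the output file name `confidence_interval.png` are assumptions made for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Assumption: arbitrary seed, used only for reproducibility.
rng = np.random.default_rng(seed=42)
population_mean = 40
sample = rng.normal(loc=population_mean, scale=10, size=100)

sample_mean = sample.mean()
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample) - 1, loc=sample_mean, scale=stats.sem(sample)
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(sample, bins=20, color="lightgray", edgecolor="white", label="Sample")
ax.axvline(population_mean, color="black", linestyle="--", label="Population mean")
ax.axvline(sample_mean, color="tab:blue", label="Sample mean")
# Shade the 95% confidence interval of the sample mean.
ax.axvspan(ci_low, ci_high, color="tab:blue", alpha=0.2, label="95% CI of the mean")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.legend()

# Hypothetical file name; use plt.show() instead in an interactive session.
fig.savefig("confidence_interval.png", dpi=150)
```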
The plot above shows the sample distribution with a population mean of 40, as well as the sample mean of 40.26 and a 95% confidence interval of [38.46, 42.05].