Basic Math for Data Science, etc.

The normal distribution is the basis of modern statistics. Do you remember the t-distribution, the $\chi^2$-distribution, and the F-distribution from the introductory stats course? These distributions are all derived from the normal distribution.

So, why do we see the normal distribution everywhere? People’s heights, weights, IQs, students’ grades, …: so many things around us are approximately normally distributed. Of course, not everything is normally distributed, but the normal distribution is arguably the most common distribution we see in this world. But why is that?

It can be explained by the central limit theorem, one of the central results of statistics. The theorem essentially states that the sum of a large number of independent random variables is approximately normally distributed, whatever the distribution of the individual variables (as long as their variances are finite). Many things are indeed the sum of many random factors. (Actually, they may be a complicated function of many variables, but often such a function can be approximated as a weighted sum of the random variables. This is one of the most important topics in calculus, which we will talk about in the future.)
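
Stated a bit more carefully, and assuming the simplest setting where the variables $X_1, \dots, X_n$ are independent and identically distributed with mean $\mu$ and finite variance $\sigma^2 > 0$, the theorem says that the standardized sum converges in distribution to the standard normal. The symbols $X_i$, $\mu$, $\sigma$, and $S_n$ are just notation introduced here for this sketch:

$$
\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty, \qquad \text{where } S_n = X_1 + \cdots + X_n .
$$

In other words, for large $n$ the sum $S_n$ is approximately normally distributed with mean $n\mu$ and variance $n\sigma^2$.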

Here’s some R code that demonstrates the central limit theorem:

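n = 10  # number of uniform draws per sum (example value; try 1, 2, 5, ...)

x = c()
for(i in 1:1000){
  x = c(x, sum(runif(n)))  # append the sum of n uniform(0, 1) draws to x
}
hist(x, breaks=30)  # histogram of the 1000 sums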

In this code, runif(n) generates n random numbers uniformly distributed between 0 and 1. The sum of these n numbers is computed 1000 times, and each sum is appended to x.

The plot below is the histogram. Change the value of n and see what happens.

When $n=1$, we see the uniform distribution of the individual random variable. But once we start adding multiple random variables ($n\ge 2$), the shape changes quickly: with $n=2$ the histogram is already triangular, and with only a few more terms it becomes bell-shaped. I will give you an intuitive explanation of this phenomenon in another post (coming soon!).
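
One way to check the bell shape more directly is to overlay the normal curve that the central limit theorem predicts. A uniform random number on (0, 1) has mean 1/2 and variance 1/12, so the sum of n of them has mean n/2 and variance n/12. The following is only a minimal sketch: the names `sums`, `mu`, and `sigma`, the fixed seed, and the value of n are choices made for this illustration.

```r
set.seed(1)  # fixed seed so the sketch is reproducible (illustrative choice)
n = 10       # number of uniform draws per sum (example value)

# 1000 sums of n uniform(0, 1) draws, the same quantity the loop above builds
sums = replicate(1000, sum(runif(n)))

# Mean and standard deviation of a sum of n uniform(0, 1) variables
mu = n / 2
sigma = sqrt(n / 12)

# Histogram on the density scale so it is comparable with a density curve
hist(sums, breaks=30, freq=FALSE)

# Normal density with the matching mean and standard deviation
curve(dnorm(x, mean=mu, sd=sigma), add=TRUE)
```

For small n the curve and the histogram visibly disagree (for example, the flat histogram at n = 1), and the agreement should improve as n grows.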

By the way, this interactive page is powered by R running on my server. I will tell you about this web app in the near future.
