In this article we will learn about the normal distribution in R. We will look at generating a set of values that follow a normal distribution, finding probabilities for outcomes under a normal distribution, and visualizing the normal distribution.



Theory

To begin with, we need to define what the normal distribution is. You have probably heard the term many times, since it is used so widely, and that is exactly why it is crucial to understand it.

When we refer to the term “distribution” it is often about the spread of the data.

Data can be spread in various ways.

There can be more observations with values less than the average (the majority of observations are on the left of the mean and the spread is more on the right) and vice versa.

These distributions aren’t symmetric around the mean and have a non-zero skewness.

Thus, these distributions aren’t normal.

So what exactly is a normal distribution?

The normal distribution is a common type of continuous probability distribution with a characteristic “bell shape”, in which the data is symmetrical around the mean.

I understand this definition may not be easy to grasp right away if you are just starting to learn statistics. But bear with me: the examples throughout this article, along with some visual help, should make these points clearer.

Going back to the normal distribution, there are a few key things you should know about it:

  1. It has a unique “bell shape”
  2. Mean = median
  3. Skewness = 0
  4. About 68% of the data falls within ±1 standard deviation of the mean
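
The 68% figure in point 4 can be checked with pnorm(), a function this article introduces in Part 4. A minimal sketch, assuming a standard normal distribution (mean 0, standard deviation 1, which are the defaults of pnorm()); the variable name is my own:

```r
# Probability of falling within 1 standard deviation of the mean
# for a standard normal distribution (mean = 0, sd = 1 by default)
within_one_sd <- pnorm(1) - pnorm(-1)
within_one_sd  # about 0.6827
```

The same proportion holds for any mean and standard deviation, because the interval is measured in units of standard deviations.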

Okay, enough of theory! Let’s run the numbers and do some visualizations to help us better understand what this is about!



Application

Below are the steps we are going to take to make sure we do understand the concept of normal distribution and how to work with it in R:

  1. Creating sample normal distribution using rnorm() command in R
  2. Descriptive statistics of normal distribution in R
  3. Plotting normal distribution in R
  4. Finding probability using pnorm() command in R



Part 1. Creating sample normal distribution using rnorm() command in R

Let’s think of a scenario that will be intuitive to understand!

I suggest: assume an economics course in university with 1000 students enrolled. The first semester is halfway through and everyone wrote their first midterm exam. The professor is inputting the grades into an Excel spreadsheet. When he runs the numbers, he sees that the average for the midterm is 70% with more than half of the students having grades in the range between 60% and 80%.

Sounds like a realistic scenario, doesn’t it?

Let’s try to work with it and see what we get.

Some important information that we need here is:

  1. There are 1000 students
  2. Mean (or average) is 70
  3. Standard deviation is 10 (assume this roughly)

This information is enough to create a sample from a normal distribution in R with these properties.

R has a built-in command, rnorm(), which generates a dataset of random numbers given the parameters you set.

The short theoretical explanation of the function is the following:

rnorm(n, mean= , sd= )

This function generates a set of n normally distributed numbers with the mean and sd you set.

Let’s call our dataset “x” and go ahead and generate 1000 normally distributed numbers with mean = 70 and standard deviation = 10.

You can do it using the following code:


x <- rnorm(1000, mean = 70, sd = 10)

Amazing! We generated our dataset!

Note: every time you run this line it will generate a new set of numbers. The numbers will most likely not be identical, yet they will follow the same distribution.
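
If you want the same numbers on every run, you can fix the random seed with set.seed() before calling rnorm(). A small sketch; the seed value 123 is an arbitrary choice for illustration, and the results quoted in this article were not necessarily produced with it:

```r
# The seed value 123 is arbitrary; any integer works
set.seed(123)
x1 <- rnorm(1000, mean = 70, sd = 10)

# Resetting the same seed reproduces the same draws
set.seed(123)
x2 <- rnorm(1000, mean = 70, sd = 10)

identical(x1, x2)  # TRUE: same seed, same numbers
```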



Part 2. Descriptive statistics of normal distribution in R

After we have created our normally distributed dataset in R, we should take a look at some of its descriptive statistics.

Let's find the mean, median, skewness, and kurtosis of this distribution.

The mean and median commands are already built into R, but for skewness and kurtosis we will need to install and load an additional package, e1071.


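# install.packages("e1071")  # run once if the package is not installed
library(e1071)               # provides skewness() and kurtosis()

mean(x)
median(x)
skewness(x)
kurtosis(x)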

The results I got are the following:
mean = 69.8924
median = 69.74109
skewness = -0.003629289
kurtosis = 0.01726331

As I mentioned earlier in this article, the mean and median of a normal distribution are equal. We see here that the sample values are very close. If we increase the number of observations in the dataset (n) to, say, 100000, the gap between the mean and median will tend to be even smaller.

The same logic works for skewness and kurtosis, which tend to get closer to 0 as we increase the number of observations (n). Note that kurtosis() from e1071 reports excess kurtosis, which is 0 for a normal distribution.
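
To see the effect of the sample size yourself, you can compare the gap between mean and median for two values of n. This is only a sketch: mean_median_gap() is a hypothetical helper I made up for illustration, the seed is arbitrary, and in any single run the larger sample is likely, but not guaranteed, to show the smaller gap.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

# Hypothetical helper: absolute gap between sample mean and sample median
# for n simulated grades with mean 70 and standard deviation 10
mean_median_gap <- function(n) {
  grades <- rnorm(n, mean = 70, sd = 10)
  abs(mean(grades) - median(grades))
}

gap_small <- mean_median_gap(1000)
gap_large <- mean_median_gap(100000)

# Print both gaps side by side
c(n_1000 = gap_small, n_100000 = gap_large)
```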



Part 3. Plotting normal distribution in R

Once we have the basic descriptive statistics for the dataset, its properties should become clearer.

Yet the best way to get a more thorough understanding of the parts above is often to connect them to data visualization.

Here is the distribution plot of our dataset:

example of normal distribution in R

Another useful way to visualize data is a histogram:

example of histogram in R
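
The code behind the two plots above is not shown here. One way to draw similar plots in base R, as a rough sketch (the titles, axis labels, and number of bins are my own choices, not necessarily those used for the figures above):

```r
# Simulated grades, as in Part 1
x <- rnorm(1000, mean = 70, sd = 10)

# Distribution (kernel density) plot
plot(density(x), main = "Distribution of midterm grades", xlab = "Grade (%)")

# Histogram; breaks = 30 is a suggestion, R may pick a nearby number of bins
hist(x, breaks = 30, main = "Histogram of midterm grades", xlab = "Grade (%)")
```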

Recall that our mean and median are very close to 70.

Both of the graphs above show that most of the observations are distributed very close to the mean.

I mentioned before that roughly 68% of the data is located within ±1 standard deviation of the mean.

The graph below shows the plotted distribution with the mean (red line) and the interval of ±1 standard deviation (green lines).

normal distribution with standard deviation intervals
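
A plot like this can be sketched in base R with abline(), and the 68% claim can be checked against the simulated data at the same time. The seed, title, and variable names below are my own assumptions for illustration:

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable
x <- rnorm(1000, mean = 70, sd = 10)

m <- mean(x)  # sample mean
s <- sd(x)    # sample standard deviation

plot(density(x), main = "Grades with mean and +/- 1 SD", xlab = "Grade (%)")
abline(v = m, col = "red")                  # mean
abline(v = c(m - s, m + s), col = "green")  # mean minus and plus 1 SD

# Share of simulated grades within 1 standard deviation of the mean
share_within <- mean(x >= m - s & x <= m + s)
share_within
```

The printed share should land near 0.68, though it varies a little from sample to sample.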



Part 4. Finding probability using pnorm() command in R

Up to this point we have discussed what the normal distribution is, the descriptive statistics of a normal distribution in R, and plotting a normal distribution in R along with its mean and standard deviation on the same graph.

Throughout the article we have been working with a sample dataset of student grades that follows a normal distribution.

Now let's get into the use of this theory and see an applied example which will help you understand the commands when working with normal distribution in R.

Example 1: Usage of pnorm() in R

Consider the following question: What is the probability that a randomly chosen exam paper (x) will have a grade of less than 50% (x<50%)?

Basically, what we are trying to do here is determine how likely it is that a randomly chosen exam from our 1000 students is a fail, taking 50% as the pass mark.

Sounds quite applied, doesn't it?

In order to answer this question we will need to use the pnorm() command in R.

Since I haven't used it before in this article, I will give a brief introduction to this function and its features.

The short theoretical explanation of the function is the following:

pnorm(x, mean= , sd= )

Here, "x" is the value of interest: the function returns the probability of observing a value below it (R's own documentation names this argument q). "mean" and "sd" refer to the average and the standard deviation of the distribution we are working with.

Let's put it into the context of our example!

Recall from Part 1 that we created a normal distribution in R with mean = 70 and standard deviation = 10.

Now, the value "x" that we are interested in is 50.

Below is the plot that illustrates the question and what we are going to find. The value of "x" is set to 50 (purple line). We are going to find the probability that a randomly drawn number from our distribution falls to the left of the purple line (that is, is less than 50).

calculating cumulative distribution function

You can find the probability by plugging the parameters into the formula and using the following code:


pnorm(50, mean=70, sd=10)

The answer I got is 0.02275013, or about 2.28%.

Therefore, the probability that a randomly drawn grade from this distribution is less than 50 is about 2.28%.
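
Two quick checks may help make this number less abstract. The sketch below first recomputes the probability from the z-score (how many standard deviations 50 lies below the mean), then compares it with the share of simulated grades below 50. The seed and variable names are my own assumptions, and the simulated share will only roughly match the theoretical value.

```r
# Theoretical probability of a grade below 50
p_theory <- pnorm(50, mean = 70, sd = 10)

# Same probability via the z-score: 50 is 2 standard deviations below 70
z <- (50 - 70) / 10
p_z <- pnorm(z)

# Share of simulated grades below 50 (varies from sample to sample)
set.seed(1)  # arbitrary seed, only so the run is repeatable
x <- rnorm(1000, mean = 70, sd = 10)
p_sample <- mean(x < 50)

c(theory = p_theory, z_score = p_z, sample = p_sample)
```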



Example 2: Usage of pnorm() in R

Let's think of a slightly more complicated example.

Consider the following question: What is the probability that a randomly chosen exam paper will have a "B" grade?

Assume that "B" grade range is between 70% and 75%.

Paraphrasing this question in numerical terms: What is the probability that a randomly chosen exam paper (x) will have a grade of between 70% and 75% (70%<x<75%)?

In order to answer this question we will use the same tool as in the previous example: the pnorm() command in R.

The logic here is to find the probability of x<75%, then the probability of x<70%, and subtract the latter from the former to find the probability of the area between them.

In order to shape this problem in a more visual way, please take a look at the plot below:

example 2 with normal distribution in R and using pnorm() command

Visually, we are trying to find the probability that a randomly selected number from our distribution falls between the two purple lines (between 70% and 75%).

Recall from Part 1 that we created a normal distribution in R with mean = 70 and standard deviation = 10.

The only difference is that now we have two "x"s: 70 and 75.

You can find the probability of the interval between 70 and 75 by plugging the parameters into the formula and using the following code:


pnorm(75, mean=70, sd=10)-pnorm(70, mean=70, sd=10)

The answer I got is 0.1914625 or 19.15%.

Therefore, the probability that a randomly drawn grade from this distribution is between 70 and 75 is 19.15%.
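
If you need interval probabilities often, the subtraction can be wrapped in a small function. The sketch below uses prob_between(), a hypothetical helper of my own and not a built-in R function. It also shows the lower.tail argument of pnorm(), which returns the probability above a value instead of below it.

```r
# Hypothetical helper: P(lower < X < upper) for a normal distribution
prob_between <- function(lower, upper, mean = 0, sd = 1) {
  pnorm(upper, mean = mean, sd = sd) - pnorm(lower, mean = mean, sd = sd)
}

# Probability of a "B" grade (between 70 and 75)
p_b <- prob_between(70, 75, mean = 70, sd = 10)
p_b  # about 0.1915

# Probability of a grade above 75, using the upper tail directly
p_above <- pnorm(75, mean = 70, sd = 10, lower.tail = FALSE)
p_above  # about 0.3085
```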



The blog has a lot of other interesting articles about Statistics in R which you can read to learn about more commands and functionality of R.