In this article we will learn how to calculate descriptive statistics in R. We cover the mean, median, mode, variance, standard deviation, skewness, and kurtosis, plus a few supporting functions such as summary(), min(), max(), range(), and sum().
Theory
Mean – the arithmetic average of a set of numbers: their sum divided by their count.
Median – the middle value of a sorted set of numbers, separating the higher half from the lower half.
Mode – the value that appears most often in a set of numbers.
Variance – the average squared deviation of the values from their mean.
Standard deviation – the square root of the variance; it measures the spread of a set of numbers in the same units as the data.
Skewness – a measure of the asymmetry or “skew” of a probability distribution.
Kurtosis – a measure of how heavy the tails of a probability distribution are, often loosely described as the “sharpness” of its peak.
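Mode is the one measure above that base R does not compute directly: its mode() function reports how an object is stored (for example "numeric"), not the most frequent value. A minimal sketch of a workaround, using made-up numbers and a hypothetical helper named stat_mode() (not part of R or of any package), could look like this:

```r
# Toy data, invented for illustration only
x <- c(2, 4, 4, 5, 7, 9, 11)

# Hypothetical helper for numeric vectors: the most frequent value.
# If several values tie, the smallest of them is returned.
stat_mode <- function(v) {
  counts <- table(v)                             # frequency of each distinct value
  as.numeric(names(counts)[which.max(counts)])   # value with the highest count
}

mode(x)        # "numeric": the storage mode, not the statistical mode
stat_mode(x)   # 4
```

Returning only one value on ties is a simplification; for multimodal data you may prefer to return every value that reaches the maximum count.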
Application
Below are the steps we will take to master descriptive statistics in R:
- Loading sample dataset: economics
- How to calculate mean in R
- How to calculate median in R
- How to calculate variance in R
- How to calculate standard deviation in R
- How to calculate skewness in R
- How to calculate kurtosis in R
- Summary statistics in R
- Finding minimum value in R
- Finding maximum value in R
- Finding range in R
- How to calculate the sum of values in R
Part 1. Loading sample dataset: economics
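R has a variety of datasets built into it and into its packages. The economics dataset used here ships with the ggplot2 package, so that package has to be installed and loaded first. Copying the data into a new object isn’t strictly required, but it’s a good practice to get familiar with 🙂
I prefer to call the data I work with “mydata”, so here are the commands you would use for that:
install.packages("ggplot2")
library(ggplot2)
mydata<-economics
This dataset is a US economic time series.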
Take a look at the dataset and the variables it contains:
View(mydata)
The variable (column) we will be working with in this tutorial is "unemploy", which is the number of unemployed (in thousands).
To simplify things, we will store this column as a separate numeric vector:
unemployment<-mydata$unemploy
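Before computing anything, it can be worth confirming that the extracted column is what you expect. The sketch below assumes the ggplot2 package is installed (it is where the economics data lives) and uses only base R functions for the checks:

```r
# Assumption: ggplot2 is installed; it provides the economics dataset
data("economics", package = "ggplot2")
unemployment <- economics$unemploy

is.numeric(unemployment)   # is the column a numeric vector?
length(unemployment)       # number of observations
anyNA(unemployment)        # are there any missing values?
head(unemployment)         # first six values
```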
Part 2. How to calculate Mean in R
To calculate the mean of a set of numbers, R has the built-in function mean():
mean(unemployment)
You should get the mean equal to 7771.557.
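If your own data contains missing values, mean() returns NA unless you ask it to drop them. A small sketch with an invented vector (not the economics data) shows the na.rm argument and the equivalent calculation by hand:

```r
# Toy vector with one missing value, invented for illustration
x <- c(10, 20, NA, 40)

mean(x)                  # NA: a single missing value propagates
mean(x, na.rm = TRUE)    # missing values are dropped first

# The same result by hand: sum of the non-missing values over their count
sum(x, na.rm = TRUE) / sum(!is.na(x))
```

The same na.rm argument is accepted by median(), var(), sd(), min(), max(), range(), and sum().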
Part 3. How to calculate Median in R
To calculate the median of a set of numbers, R has the built-in function median():
median(unemployment)
You should get the median equal to 7494.
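How the "middle" value is picked depends on whether the number of observations is odd or even. The following sketch uses two invented vectors to illustrate both situations:

```r
# Toy vectors, invented for illustration
odd  <- c(3, 1, 7)       # sorted: 1 3 7
even <- c(3, 1, 7, 5)    # sorted: 1 3 5 7

median(odd)    # 3: the single middle value
median(even)   # 4: the average of the two middle values, 3 and 5
```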
Part 4. How to calculate Variance in R
To calculate the variance of a set of numbers, R has the built-in function var():
var(unemployment)
You should get the variance equal to: 6979956.
Part 5. How to calculate Standard Deviation in R
To calculate the standard deviation of a set of numbers, R has the built-in function sd():
sd(unemployment)
You should get the standard deviation equal to: 2641.961.
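It helps to know that var() and sd() compute the sample versions of these statistics, dividing by n - 1 rather than n, and that sd() is simply the square root of var(). A minimal sketch on invented numbers:

```r
# Toy data, invented for illustration only
x <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(x)

sample_var <- sum((x - mean(x))^2) / (n - 1)   # what var() computes
pop_var    <- sum((x - mean(x))^2) / n         # population version, for comparison

all.equal(sample_var, var(x))        # TRUE
all.equal(sqrt(sample_var), sd(x))   # TRUE: sd is the square root of var
```

For a long series such as the unemployment data the two denominators give almost the same result; the difference matters mainly for small samples.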
Part 6. How to calculate Skewness in R
Base R doesn't have a skewness function, so we will need an additional package, e1071, in order to calculate skewness in R.
You can learn more about the e1071 package in its documentation on CRAN.
To install the package and load it into your workspace, use the following code:
install.packages("e1071")
library(e1071)
To calculate the skewness of a set of numbers, this package provides a command skewness():
skewness(unemployment)
You should get the skewness equal to: 0.6966289.
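To see what such a number represents, skewness can also be computed by hand in base R. The sketch below divides the third central moment by the cubed standard deviation; as far as I understand, this is the formula behind the default (type = 3) of skewness() in e1071, but treat that correspondence as an assumption and check the package documentation. The data are invented:

```r
# Toy data with one large value on the right, invented for illustration
x <- c(1, 2, 3, 4, 10)

m3   <- mean((x - mean(x))^3)   # third central moment
skew <- m3 / sd(x)^3            # standardised by the sample sd
skew                            # positive: the long tail is on the right

# A perfectly symmetric vector has zero skewness
sym <- c(1, 2, 3, 4, 5)
mean((sym - mean(sym))^3) / sd(sym)^3   # 0
```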
Once we have calculated the value for skewness, there are three possible cases:
Case 1: skewness < 0
In this case we will have a left-skewed distribution (negatively skewed).
What's another way to think about it?
It's typically the case when the mean of the dataset is less than the median (mean < median): most values are concentrated to the right of the mean, while the extreme values lie in a long tail to the left.
Case 2: skewness = 0
In this case we will have a symmetric distribution (no skew); the normal distribution is the best-known example.
What's another way to think about it?
It's the case when the mean of the dataset is equal to the median (mean = median) and the values are distributed symmetrically around the mean.
Case 3: skewness > 0
In this case we will have a right-skewed distribution (positively skewed).
What's another way to think about it?
It's typically the case when the mean of the dataset is greater than the median (mean > median): most values are concentrated to the left of the mean, while the extreme values lie in a long tail to the right.
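The mean versus median rule of thumb from the three cases can be checked on small invented vectors. This is only a sketch: the rule holds for typical one-peaked data but is not guaranteed for every dataset.

```r
# Toy data, invented for illustration
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)   # one extreme value on the right
left_skewed  <- -right_skewed                # mirror image: extreme value on the left

mean(right_skewed) > median(right_skewed)    # TRUE: the tail pulls the mean up
mean(left_skewed)  < median(left_skewed)     # TRUE: the tail pulls the mean down
```

In the unemployment data the mean (7771.557) is above the median (7494), which agrees with the positive skewness reported above.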
Part 7. How to calculate Kurtosis in R
As with skewness, base R doesn't have a kurtosis function, so we will again use the e1071 package.
If you have already installed and loaded it in Part 6, you can skip this step; otherwise use the following code:
install.packages("e1071")
library(e1071)
To calculate the kurtosis of a set of numbers, this package provides a command kurtosis():
kurtosis(unemployment)
You should get the kurtosis equal to: 0.7938124.
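Kurtosis can be computed by hand in the same spirit: the fourth central moment divided by the fourth power of the standard deviation. The sketch below also subtracts 3 to obtain excess kurtosis, which, as far as I understand, is what kurtosis() in e1071 returns with its default type = 3; treat that correspondence as an assumption and check the package documentation. The data are invented:

```r
# Toy data, invented for illustration only
x <- c(1, 2, 3, 4, 10)

m4         <- mean((x - mean(x))^4)   # fourth central moment
raw_kurt   <- m4 / sd(x)^4            # kurtosis on the scale where normal = 3
excess_kurt <- raw_kurt - 3           # excess kurtosis, where normal = 0

raw_kurt
excess_kurt
```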
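Once we have calculated the value for kurtosis, there are three possible cases. Note that kurtosis() in e1071 returns excess kurtosis (kurtosis minus 3), so add 3 to its output before comparing it with the thresholds below; the 0.7938124 above corresponds to a kurtosis of about 3.79.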
Case 1: kurtosis < 3
In this case we will have a platykurtic distribution (flatter than normal, with a lower peak).
What's another way to think about it?
It's the case when the distribution has thinner tails than the normal distribution. Thin tails mean fewer outliers (extreme values).
Case 2: kurtosis = 3
In this case we will have a mesokurtic distribution (normal).
What's another way to think about it?
It's the case when the distribution has tails as heavy as those of the normal distribution.
Case 3: kurtosis > 3
In this case we will have a leptokurtic distribution (sharper than normal, with a higher peak).
What's another way to think about it?
It's the case when the distribution has thicker tails than the normal distribution. Thick tails mean more outliers (extreme values).
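Because kurtosis() in e1071 reports excess kurtosis (the value with 3 already subtracted), the three cases can be told apart by the sign of its output. The sketch below wraps that logic in a hypothetical helper, kurtosis_label(), which is not part of any package; the tolerance argument is likewise an assumption made for illustration.

```r
# Hypothetical helper: label a distribution from its EXCESS kurtosis
# (raw kurtosis minus 3). `tol` decides how close to 0 counts as "equal".
kurtosis_label <- function(excess, tol = 1e-8) {
  if (abs(excess) < tol) {
    "mesokurtic"     # raw kurtosis = 3
  } else if (excess < 0) {
    "platykurtic"    # raw kurtosis < 3
  } else {
    "leptokurtic"    # raw kurtosis > 3
  }
}

kurtosis_label(0.7938124)   # "leptokurtic": the value reported for unemployment
kurtosis_label(-1.2)        # "platykurtic"
kurtosis_label(0)           # "mesokurtic"
```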
Part 8. Summary Statistics in R
R has a built-in function, summary(), that provides a brief overview of the dataset.
For a numeric vector it shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.
It is often very useful to see these statistics together (unless you are looking for a specific one, in which case you can just use the applicable command).
So let's go ahead and find some summary statistics in R:
summary(unemployment)
The output is a single row of six labelled values: Min., 1st Qu., Median, Mean, 3rd Qu., and Max.
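The quartiles that summary() reports can also be requested directly with the base R function quantile(). A short sketch on invented numbers shows how the two line up:

```r
# Toy data, invented for illustration only
x <- c(2, 4, 4, 5, 7, 9, 11)

s <- summary(x)
s                    # Min. 1st Qu. Median Mean 3rd Qu. Max.

# The 1st quartile, median, and 3rd quartile on their own
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q                    # 4, 5, and 8 for this vector
```

Individual entries can be picked out by name, for example s[["Median"]].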
Part 9. Finding Minimum Value in R
To find the minimum value of a set of numbers, R has the built-in function min():
min(unemployment)
You will find that the minimum value in this dataset is: 2685.
Part 10. Finding Maximum Value in R
To find the maximum value of a set of numbers, R has the built-in function max():
max(unemployment)
You will find that the maximum value in this dataset is: 15352.
Part 11. Finding Range in R
To find the range of a set of numbers, R has the built-in function range():
range(unemployment)
The range of the dataset is from its minimum value to its maximum value. Therefore, for this dataset, the range is from 2685 to 15352.
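Note that range() returns two numbers, the minimum and the maximum, rather than the distance between them. If you need that distance as a single value, diff() can be applied to the result, as in this sketch on invented numbers:

```r
# Toy data, invented for illustration only
x <- c(2, 4, 4, 5, 7, 9, 11)

r <- range(x)   # a vector of length 2: c(min, max)
r[1]            # 2, same as min(x)
r[2]            # 11, same as max(x)
diff(r)         # 9: the width of the range
```

For the unemployment data this width would be 15352 - 2685 = 12667.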
Part 12. Calculating the Sum of Values in R
To calculate the sum of a set of numbers, R has the built-in function sum():
sum(unemployment)
The sum of the values in this dataset is: 4460874.
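If you find yourself calling these functions one after another, they can be gathered into a single named vector. The sketch below uses a hypothetical helper called describe() (a name chosen for illustration, not a base R function) and only base R statistics, so it does not include skewness or kurtosis:

```r
# Hypothetical helper: collect the base R statistics from this article
describe <- function(v) {
  c(n      = length(v),
    min    = min(v),
    max    = max(v),
    sum    = sum(v),
    mean   = mean(v),
    median = median(v),
    var    = var(v),
    sd     = sd(v))
}

# Toy data, invented for illustration only
stats <- describe(c(2, 4, 4, 5, 7, 9, 11))
stats
stats[["mean"]]   # 6
```

With the data from this article you would call describe(unemployment) instead.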