In this article we will learn how to do Levene’s test in R using leveneTest() function to test for homogeneity of variances across samples from the same distribution.
Theory
Levene’s test is an inferential statistic that tests if samples drawn from the same distribution have equal variances.
Levene’s test is an alternative to Bartlett’s test. It’s considered more robust since it is less sensitive to data deviations from normal distribution (performs well in symmetric heavy-tailed distributions). Levene’s test tests the null hypothesis that variances across samples are equal:
Levene’s test tests the null hypothesis that variances across samples are equal:
$$H_0: {\sigma_1^2 = \sigma_2^2 = … = \sigma_n^2}$$
against the alternative hypothesis that at least for one pair of samples the variances won’t be equal:
$$H_1: {\sigma_i^2 ≠ \sigma_j^2}$$
for any i and j.
It’s test statistic is quite long so in this article I won’t go into explanation of it in detail, but please see the image attached below for the formula:
Unlike Bartlett’s test, Levene’s test performs well when your data comes from a non-normal distribution.
If you are not certain about the distribution of your variable, you should test for normality.
If you find that it follows a normal distribution or a nearly normal distribution, you should use Bartlett’s test because it will have a better performance.
Application
Below are the steps we are going to take to make sure we do learn how to test for heteroscedasticity using Levene’s test in R:
- Loading sample dataset: titanic_train from titanic package
- Preparing the dataset with dplyr
- Basic leveneTest() function description
- Performing Levene’s test in R
Part 1. Loading sample dataset: titanic_train from titanic package
To illustrate the performance of Levene’s test in R we will need a dataset with two columns: one with numerical data, the other with categorical data (or levels).
In this tutorial I will be using the titanic_train dataset from titanic package.
It contains 891 observations across 12 variables.
In order to install and “call” the package into your workspace, you should use the following code:
install.packages("titanic")
library(titanic)
The titanic_train dataset is a part of this package and can be added to the workspace directly.
I prefer to I prefer to call the data I work with “mydata”, so here is the code you would use for that:
mydata<-titanic_train
Take a look at the dataset and the variables it contains:
View(mydata)
For this tutorial we are interested in two variables: Parch and Embarked.
Part 2. Preparing the dataset with dplyr
At this point we identified which variables we will be working with.
Now we need to select the columns from the original dataset and explore the observations further.
We will need dplyr package to help us with data manipulation.
In order to install and "call" the package into your workspace, you should use the following code:
install.packages("dplyr")
library(dplyr)
Now let's go ahead and select the columns we need:
mydata<-select(mydata, Parch, Embarked)
Our new dataset is 891 observations over 2 variables.
Next, usually we would look into data for missing values and correct the dataset for it.
Finding and replacing missing values is a separate topic, so for now assume that I found these for you in this dataset.
What you will see if you scroll through this dataset is that 2 rows in "Embarked" are empty.
Let's go ahead and solve it:
mydata<-filter(mydata, Embarked!="")
After this data manipulation, our dataset became 889 observations across 2 variables.
From the introduction, you can recall that the Bartlett test performs well when applied to normal distributions.
Let's take a look at the distribution of the dataset we are working with.
Plotting the histogram for the "Parch" variable yields the following:
Right away we see that this is a non-normal distribution.
When working on your research, I highly recommend actually testing for normality before using these tests.
To drill into more details, remember that we have three groups for "Embarked": S, C, Q.
We can construct a boxplot to see the dispersion of the observations:
For the three groups the mean appears to be around the same value of 0. That's even better for us.
Now that our data preparation step is complete, let's look into the details of the leveneTest() function.
Part 3. Basic leveneTest() function description
The short theoretical explanation of the function is the following:
levene.Test(x, g)
Here, "x" is the vector of numeric values that represent particular samples of the population (in our case "Parch").
"g" is the vector with group values corresponding to each "x" value (in our case "Embarked").
The full description of this command and its arguments is available here.
Part 4. Performing Levene's test in R
Up to this point we discussed the theory behind Levene's test as well as key underlying assumptions.
We have also prepared our dataset and corrected it for missing values.
Let's recap the null and alternative hypothesis for this test.
Null hypothesis: variances across samples are equal.
Alternative hypothesis: at least one sample has different variance.
We will test the null hypothesis at 0.05 significance level or (95%).
Now, let's go ahead and perform the Levene's test in R!
You can use the following code:
leveneTest(mydata$Parch, mydata$Embarked)
Your output should look like this:
We can see the p-value of 0.07552 which is larger than 0.05 rejection region.
This mean that we can't reject the null hypothesis that the variance is equal across all samples at 0.05 significance level.
This concludes our article on Levene's test in R. You can learn more about statistical analysis in the Statistics in R section.