In this article we will learn how to do Bartlett’s test in R using bartlett.test() function to test for homogeneity of variances of samples from a distribution.
Theory
Let’s begin with homogeneity. Homogeneity (or in statistics: homoscedasticity) refers to variances being equal.
Bartlett’s test (or Bartlett’s test for homogeneity of variances) is a statistical test to determine whether N samples are from a population with equal variance.
This test is particularly useful when the assumption of equal variances is made and allows us to test it.
Bartlett’s test tests the null hypothesis that variances across samples are equal:
$$H_0: {\sigma_1^2 = \sigma_2^2 = … = \sigma_n^2}$$
against the alternative hypothesis that at least for one pair of samples the variances won’t be equal:
$$H_1: {\sigma_i^2 ≠ \sigma_j^2}$$
for any i and j.
It’s test statistic is quite long so in this article I won’t go into explanation of it in detail, but please see the image attached below for the formula:
One important thing to note is that Bartlett’s test works well when you are working with normal distributions.
Otherwise it may be very sensitive to non normality and provide misleading results.
You can test for normality of your data and if you find that it follows a non normal distribution, you should look into Levene’s test.
Application
Below are the steps we are going to take to make sure we do learn how to test for heteroscedasticity using Bartlett’s test in R:
- Loading sample dataset: titanic_train from titanic package
- Preparing the dataset with dplyr
- Basic bartlett.test() function description
- Performing Bartlett’s test in R
Part 1. Loading sample dataset: titanic_train from titanic package
To illustrate the performance of Bartlett’s test in R we will need a dataset with two columns: one with numerical data, the other with categorical data (or levels).
In this tutorial I will be using the titanic_train dataset from titanic package.
It contains 891 observations across 12 variables.
In order to install and “call” the package into your workspace, you should use the following code:
install.packages("titanic")
library(titanic)
The titanic_train dataset is a part of this package and can be added to the workspace directly.
I prefer to I prefer to call the data I work with “mydata”, so here is the code you would use for that:
mydata<-titanic_train
Take a look at the dataset and the variables it contains:
View(mydata)
For this tutorial we are interested in two variables: Age and Embarked.
Part 2. Preparing the dataset with dplyr
At this point we identified which variables we will be working with.
Now we need to select the columns from the original dataset and explore the observations further.
We will need dplyr package to help us with data manipulation.
In order to install and "call" the package into your workspace, you should use the following code:
install.packages("dplyr")
library(dplyr)
Now let's go ahead and select the columns we need:
mydata<-select(mydata, Age, Embarked)
Our new dataset is 891 observations over 2 variables.
Next, usually we would look into data for missing values and correct the dataset for it.
Finding and replacing missing values is a separate topic, so for now assume that I found these for you in this dataset.
What you will see if you scroll through this dataset is that some values in "Age" are missing, and 2 rows in "Embarked" are empty.
Let's go ahead and solve it:
mydata<-mydata[complete.cases(mydata), ]
mydata<-filter(mydata, Embarked!="")
After this data manipulation, our dataset became 712 observations across 2 variables.
From the introduction, you can recall that the Bartlett test performs well when applied to normal distributions.
Let's take a look at the distribution of the dataset we are working with.
Plotting the histogram for the "Age" variable yields the following:
It doesn't look perfectly normal to me but from the real data you will face it's quite a good distribution to work with.
When working on your research, I highly recommend actually testing for normality before using these tests.
For the purposes of this tutorial it is not so crucial at this point.
To drill into more details, remember that we have three groups for "Embarked": S, C, Q.
We can construct a boxplot to see the dispersion of the observations:
For the three groups the mean appears to be around the same value of 30. That's even better for us.
Now that our data preparation step is complete, let's look into the details of the bartlett.test() function.
Part 3. Basic bartlett.test() function description
The short theoretical explanation of the function is the following:
bartlett.test(x, g)
Here, "x" is the vector of numeric values that represent particular samples of the population (in our case "Age").
"g" is the vector with group values corresponding to each "x" value (in our case "Embarked").
The full description of this command and its arguments is available here.
Part 4. Performing Bartlett's test in R
Up to this point we discussed the theory behind Bartlett's test as well as key underlying assumptions.
We have also prepared our dataset and corrected it for missing values.
Let's recap the null and alternative hypothesis for this test.
Null hypothesis: variances across samples are equal.
Alternative hypothesis: at least one sample has different variance.
We will test the null hypothesis at 0.05 significance level or (95%).
Now, let's go ahead and perform the Bartlett's test in R!
You can use the following code:
bartlett.test(mydata$Age, mydata$Embarked)
Your output should look like this:
We can see the p-value of 0.2094 which is larger than 0.05 rejection region.
This mean that we can't reject the null hypothesis that the variance is equal across all samples at 0.05 significance level.
This concludes our article on Bartlett's test in R. You can learn more about statistical analysis in the Statistics in R section.