In this article we will learn how to perform a chi-square test in R using chisq.test().


Theory

The chi-square test (or chi-square test of independence) is used to determine whether there is a statistically significant association (a “relationship”) between two categorical variables.

Both variables must be categorical.
For example: Yes/No, Male/Female, Student/Employed/Retired, and so on.
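
To make the idea concrete, here is a minimal sketch of what the test computes. The 2x2 table below is made up (the counts and the labels "group" and "answer" are hypothetical, not taken from any dataset in this article). The statistic compares the observed counts with the counts we would expect if the two variables were independent:

```r
# Hypothetical counts, invented purely for illustration
observed <- matrix(c(20, 30, 30, 20), nrow = 2,
                   dimnames = list(group = c("g1", "g2"),
                                   answer = c("Yes", "No")))

# Expected count in each cell if the two variables were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E
stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper tail of the chi-square distribution
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

stat
df
p_value
```

Keep in mind that for a 2x2 table chisq.test() applies Yates' continuity correction by default, so its statistic matches this hand calculation only when you pass correct = FALSE.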



Application

Below are the steps we will follow to perform a chi-square test in R:

  1. Loading sample dataset: Aids2 from MASS package
  2. Preparing the dataset
  3. Basic chisq.test() command description
  4. Performing chi-square test in R



Part 1. Loading sample dataset: Aids2 from MASS package

As mentioned above, we need a dataset with categorical variables to demonstrate how to do chi-square test in R.

Many data science blogs cover this topic, and most of them use the same few datasets.

I want to offer something different and avoid the datasets that appear in almost every other article.

Here, I will be working with the Aids2 dataset from the MASS package in R.

Aids2 is the Australian AIDS Survival Data.

It is part of the MASS package, which is maintained by Brian Ripley and provides datasets and functions to support the book Modern Applied Statistics with S by Venables and Ripley.

I encourage you to read about this package. There is a lot of useful information and interesting datasets you may want to use in the future.

Okay, let’s get back to the topic!

To install the package and load it into your workspace, use the following code:


install.packages("MASS")  # only needed once; MASS usually ships with R
library(MASS)             # load the package for the current session

As I mentioned above, the Aids2 dataset is a part of this package. If you don’t install and load the package, you won’t have access to it.

Note: If you have your own data in a CSV or Excel file, you can follow the same procedure to arrive at the result. But make sure that you are working with categorical variables.
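
If you go the CSV route, a sketch along these lines may help. The file contents and the column names (gender, employed, group) are hypothetical; the example writes a tiny made-up file to a temporary location only so that it runs on its own:

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(gender   = c("Male", "Female", "Female", "Male"),
                     employed = c("Yes", "No", "Yes", "Yes"),
                     group    = c(1, 2, 2, 1)),
          csv_path, row.names = FALSE)

# stringsAsFactors = TRUE stores text columns as factors (categorical variables)
own_data <- read.csv(csv_path, stringsAsFactors = TRUE)

# Categories coded as numbers are read as numeric, so convert them explicitly
own_data$group <- factor(own_data$group)

str(own_data)
```

With your own file you would replace csv_path with the path to that file and skip the write.csv() step.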

I prefer to call the data I work with “mydata”, so here is the command you would use for that:


mydata <- Aids2

Take a look at the dataset and the variables it contains:


View(mydata)



Part 2. Preparing the dataset

We see a 2843x7 data frame which contains seven variables.

The two variables we will be working with in this article are:

  1. state - categorical variable (NSW/Other/QLD/VIC) for state of origin (region in Australia)
  2. status - categorical variable A/D (alive/dead) at the end of observation
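
If you prefer the console to the data viewer, a quick check like the one below confirms that both columns are stored as factors and lists their categories. It is only a sketch and works directly on Aids2, so it does not depend on the mydata copy:

```r
library(MASS)

# Both columns should be factors, i.e. categorical variables
is.factor(Aids2$state)
is.factor(Aids2$status)

# The categories (levels) of each variable
levels(Aids2$state)
levels(Aids2$status)

# How many observations fall into each category
summary(Aids2[, c("state", "status")])
```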

We will need to convert the observations in these two columns into a two-way table, also called a contingency table.

Once converted, the table will show how many observations fall into each combination of a value from column 1 and a value from column 2.

R has a built-in command table() which converts selected columns into a contingency table with counts.

You can read more about this command in its help page (run ?table in the R console).

Now we will create the contingency table in R (let's call it "freq_data") using the following code:


freq_data <- table(mydata$state, mydata$status)

Now we can take a look at our contingency table. Just type in freq_data and run the line.


freq_data

R prints a table with four rows (one per state) and two columns (one per status), where each cell holds a count of patients.
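
Raw counts can be hard to compare across states of different sizes. As an optional sketch, the code below rebuilds the table with named dimensions (the labels state and status are my own choice) and adds totals and row proportions:

```r
library(MASS)

# Naming the arguments labels the two dimensions in the printed table
freq_data <- table(state = Aids2$state, status = Aids2$status)

# Counts with row and column totals appended
addmargins(freq_data)

# Share of A/D within each state (each row sums to 1)
round(prop.table(freq_data, margin = 1), 3)
```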



Part 3. Basic chisq.test() command description

The basic usage of the function is the following:

chisq.test(object)

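Here, "object" refers to either a numeric vector or a matrix. A contingency table created with table() works as well.

The full description of this command and its arguments is available in its help page (run ?chisq.test in the R console).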
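
Beyond the printed summary, chisq.test() returns an object whose pieces you can access by name. Here is a small sketch on a made-up 2x3 matrix of counts (the numbers are hypothetical and chosen only for illustration):

```r
# Hypothetical 2x3 table of counts, made up for illustration
counts <- matrix(c(12, 18, 25, 15, 20, 10), nrow = 2)

result <- chisq.test(counts)

class(result)      # the result is an object of class "htest"
names(result)      # the components stored in the result

result$statistic   # the chi-square statistic
result$parameter   # degrees of freedom
result$p.value     # p-value of the test
result$expected    # expected counts under independence
```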



Part 4. Performing chi-square test in R

So far we have reviewed the basic intuition behind the chi-square test, prepared the dataset, and looked at the function we will use to perform the chi-square test in R.

Let's dive a little deeper into the statistical implication of this test.

We have a null hypothesis and an alternative hypothesis.

The null hypothesis (or H0) is that the person's "status" (A/D) is independent of their "state" of origin.

The alternative hypothesis (or Ha) is that the person's "status" (A/D) is not independent of their "state" of origin.

We will test the null hypothesis at the 0.05 significance level. In other words, we accept a 5% chance of rejecting the null hypothesis when it is actually true.

Now, let's go ahead and do the chi-square test in R!

As mentioned in "Part 3. Basic chisq.test() command description", the function needs an "object" to work with.

In our case it's the contingency table freq_data that we created.

You can use the following code to perform the chi-square test in R:


chisq.test(freq_data)

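Once you run this line, R prints the test statistic (X-squared), the degrees of freedom (df = 3 for our 4x2 table), and the p-value.

In our case the p-value = 0.1872.

As the p-value of 0.1872 is greater than 0.05, we fail to reject the null hypothesis (H0).

Therefore, we do not have sufficient evidence to conclude that the person's "status" (A/D) depends on their "state" of origin. Note that this is not proof of independence; the data are simply consistent with it.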
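
If you want to make the decision in code rather than by reading the printout, something like the following sketch could work. The names result and alpha are my own, and the "expected counts of at least 5" check is a common rule of thumb rather than a strict requirement:

```r
library(MASS)

freq_data <- table(Aids2$state, Aids2$status)
result <- chisq.test(freq_data)

alpha <- 0.05              # significance level chosen above

result$p.value             # the p-value as a plain number
result$p.value < alpha     # TRUE would mean: reject H0

# The chi-square approximation is usually considered reliable
# when every expected count is at least 5
result$expected
all(result$expected >= 5)
```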



This concludes our article on chi-square test in R. You can learn more about statistical analysis in the Statistics in R section.