In this article we will learn how to calculate a confidence interval in R using the CI() command from the Rmisc package.
Theory
In general, a confidence interval is a range of values, computed from sample data, that is likely to contain an unknown population value.
It comes with a confidence level (for example 95%): if we repeated the sampling many times, about that share of the intervals built this way would contain the true value.
In statistics, it is mainly used to estimate a population parameter from sample data.
Most often it is used with mean values (estimating the population mean from a sample dataset that has some sample mean).
Logically, as you increase the sample size, the sample parameter tends to get closer to the population parameter, and therefore the confidence interval gets narrower.
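To see the effect of sample size on its own, here is a minimal sketch in base R. It assumes the usual t-based interval for a mean, whose half-width is the t quantile times s / sqrt(n), and it holds the sample standard deviation fixed at an arbitrary value of 1. The name half_width and the chosen sample sizes are mine, for illustration only.

```r
# Half-width of a t-based confidence interval for a mean,
# assuming a fixed sample standard deviation s (1 is an illustrative value).
half_width <- function(n, s = 1, level = 0.95) {
  alpha <- 1 - level
  qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

n_values <- c(15, 30, 60, 120)
widths <- sapply(n_values, half_width)

# Print sample sizes next to the corresponding half-widths
print(data.frame(n = n_values, half_width = round(widths, 3)))
```

With the spread held constant, each increase in n gives a smaller half-width, which is the narrowing described above.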
We all know the good old iris dataset at this point.
I know it’s been overused and appears in so many articles on R.
Yet, I chose to use it in this tutorial because it has 150 observations ready and I don’t have to build a synthetic dataset to show how to calculate a confidence interval in R.
Let’s do the practical part now!
Application
Below are the steps we are going to take to master calculating confidence intervals in R:
- Installing Rmisc package
- Basic CI() command description
- Loading sample dataset: iris
- Calculate confidence interval in R
- Calculate confidence interval for sample from dataset in R
Part 1. Installing Rmisc package
Base R doesn’t have a dedicated function that returns just the confidence interval of a mean, so we will use an additional package to find a confidence interval in R.
There are several packages that have functionality which can help us with calculating confidence intervals in R.
I prefer the command from the Rmisc package for its simple syntax.
You can learn more about this package here.
In order to install the package and “call” it into your R (RStudio) environment, you should use the following code:
install.packages("Rmisc")
library(Rmisc)
Great! The package is now loaded to our environment.
Let’s look into the CI() command!
Part 2. Basic CI() command description
Briefly, the function has the following form:
CI(x, ci=a)
Here, “x” is a vector of data, “a” is the confidence level you are using for your confidence interval (for example 0.95 or 0.99).
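To make it clearer what such a function does, here is a base R sketch of a t-based confidence interval for a mean. As far as I know, CI() reports this kind of interval and returns the upper bound, the mean and the lower bound, but treat that as an assumption and check the package documentation. The helper name manual_ci and the small input vector are hypothetical, not part of Rmisc.

```r
# Hypothetical helper (not part of Rmisc): t-based confidence interval
# for the mean of a numeric vector, using base R only.
manual_ci <- function(x, ci = 0.95) {
  m <- mean(x)
  n <- length(x)
  # Margin of error: t quantile times the standard error of the mean
  err <- qt(1 - (1 - ci) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(upper = m + err, mean = m, lower = m - err)
}

# Illustrative input values
manual_ci(c(4.9, 5.1, 5.4, 5.0, 5.6), ci = 0.95)
```

Raising ci (for example to 0.99) widens the interval, because a larger t quantile is used.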
Now, let’s prepare our dataset and apply the CI() function to calculate a confidence interval in R.
Part 3. Loading sample dataset: iris
For the purposes of this article I will use iris, a dataset that is popular in the R community.
As I mentioned before, it has been overused across R articles, yet this time I chose to work with it because its 150 ready-made observations simplify my presentation of the results.
You can read more about this dataset here.
If you have your own data that you want to work with right away, you can import your dataset and follow the same procedures as in this article.
I prefer to call the data I work with “mydata”, so here is the command you would use for that:
mydata <- iris
You can take a look at your dataset using the following code:
View(mydata)
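View() opens a spreadsheet-style viewer, which suits an interactive session such as RStudio. If you are working in a plain console or a script, a few base R commands give a similar quick look. This is only an optional sketch:

```r
mydata <- iris

# Console-friendly alternatives to View()
head(mydata)   # first six rows
str(mydata)    # column names and types
dim(mydata)    # number of rows and columns
```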
At this point our data is ready, so let's get into calculating a confidence interval in R!
Part 4. Calculate confidence interval in R
I will go over a few different cases for calculating a confidence interval.
For the purposes of this article, we will be working with the first variable/column of the iris dataset, which is Sepal.Length.
First, let's calculate the population mean of Sepal.Length, that is, the mean over all 150 rows. It should be equal to 5.843333.
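The mean itself can be computed with base R's mean(). A minimal sketch (the variable name pop_mean is mine):

```r
mydata <- iris

# Mean of the Sepal.Length column over all 150 rows
pop_mean <- mean(mydata$Sepal.Length)
print(pop_mean)
```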
Calculate 95% confidence interval in R
CI(mydata$Sepal.Length, ci=0.95)
You will observe that the 95% confidence interval is between 5.709732 and 5.976934.
Intuitively, this tells us that we are 95% confident that the population mean falls within the range mentioned above.
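If you want to double-check the result without Rmisc, base R's t.test() reports a t-based confidence interval for the mean as part of its output. A minimal sketch, assuming the interval above is the standard t-based one (the variable names tt and ci_base are mine):

```r
# Base R cross-check: t.test() also reports a t-based interval for the mean
tt <- t.test(iris$Sepal.Length, conf.level = 0.95)

# Extract the lower and upper bounds as a plain numeric vector
ci_base <- as.numeric(tt$conf.int)
print(ci_base)
```

The two numbers printed should agree with the lower and upper bounds reported above.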
Calculate 95% confidence interval in R for small sample from population
This example is a little more advanced in terms of data preparation code, but is very similar in terms of calculating the confidence interval.
Our dataset has 150 observations (population), so let's take 15 random observations from it (small sample).
This small sample will represent 10% of the entire dataset.
To do this, we need to pick 15 random row numbers and create a subset using them.
Using the sample() command in R, we create a set of 15 random row numbers (call it "index_s"):
index_s <- sample(1:nrow(mydata), 15)
My random numbers are: 36, 33, 27, 41, 6, 2, 20, 1, 17, 12, 43, 44, 26, 45, 9.
These are the IDs of rows of our dataset.
Using them, we can create a subset of population data (call it "mydata_ss"):
mydata_ss <- mydata[index_s, ]
You can take a look at the small sample:
View(mydata_ss)
Now our small sample dataset is ready and we can calculate the 95% confidence interval for the small sample:
CI(mydata_ss$Sepal.Length, ci=0.95)
In this small sample I observed that the 95% confidence interval is between 4.827444 and 5.145889, with a sample mean equal to 4.986667.
Note: your random small sample data may be different and produce different results.
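If you want the same random rows every time you run the code, you can fix the random number generator with set.seed() before calling sample(). A minimal sketch; the seed value 123 is arbitrary and will not reproduce the row numbers listed above:

```r
mydata <- iris

# Fix the random number generator state so the draw is repeatable.
# The seed value 123 is arbitrary.
set.seed(123)
index_s <- sample(1:nrow(mydata), 15)

# Re-seeding with the same value reproduces the same row numbers
set.seed(123)
index_again <- sample(1:nrow(mydata), 15)
identical(index_s, index_again)
```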
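Notice that the small sample mean is roughly 4.99 while the population mean is roughly 5.84.
Quite a significant difference, isn't it?
Part of this gap comes from the particular draw: every row number listed above is below 50, and the first 50 rows of iris are all setosa flowers, which have the shortest sepals of the three species.
Well, let's take a look at how the width of the confidence interval and the sample mean change as we increase the sample size.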
Calculate 95% confidence interval in R for large sample from population
Our dataset has 150 observations (population), so let's take 120 random observations from it (large sample).
This large sample will represent 80% of the entire dataset.
Same as in the example before, using the sample() command in R, we create a set of 120 random row numbers (call it "index_l"):
index_l <- sample(1:nrow(mydata), 120)
Using them, we can create a subset of population data (call it "mydata_ls"):
mydata_ls <- mydata[index_l, ]
You can take a look at the large sample:
View(mydata_ls)
Now our large sample dataset is ready and we can calculate the 95% confidence interval for the large sample:
CI(mydata_ls$Sepal.Length, ci=0.95)
In the large sample I observed that the 95% confidence interval is between 5.702847 and 6.007153, with a sample mean equal to 5.855.
Note: your random large sample data may be different and produce different results.
Notice that the large sample mean is roughly 5.85 while the population mean is roughly 5.84.
It's much closer to the population mean than the small sample mean (4.99).
Also notice that as our sample size increased, the confidence interval narrowed, although only slightly in these two particular draws.
Small sample: (4.827444; 5.145889) with distance between values = 0.318445.
Large sample: (5.702847; 6.007153) with distance between values = 0.304306.
The key finding to keep in mind from the above examples is that large sample parameters tend to be closer to population parameters than small sample parameters (in our case the parameter is the mean, or average).
In addition, the larger the sample, the narrower the confidence interval tends to be.
Intuitively, the more observations we have, the better our estimates will be.
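A single pair of random samples can be misleading, so one way to check this pattern is to repeat the sampling many times and compare the average interval width for each sample size. The sketch below uses base R's t.test() instead of CI(), assuming both give the standard t-based interval. The seed, the 1000 repetitions and the helper name ci_width are my own choices.

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable
x <- iris$Sepal.Length

# Width of the 95% t-based interval for one random sample of size n
ci_width <- function(n) {
  s <- sample(x, n)
  diff(as.numeric(t.test(s, conf.level = 0.95)$conf.int))
}

# Average width over 1000 random samples of each size
avg_width_small <- mean(replicate(1000, ci_width(15)))
avg_width_large <- mean(replicate(1000, ci_width(120)))

print(c(n15 = avg_width_small, n120 = avg_width_large))
```

Averaged over many draws, the intervals from samples of 15 should come out clearly wider than those from samples of 120.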
This concludes our article on how to calculate confidence interval in R. You can learn more about various statistical concepts in the Statistics in R section.