In this article, we will learn how to filter a data frame by a value in a column in R using the filter() command from the dplyr package.
Theory
It is often the case, when importing data into R, that our dataset will have a lot of observations on all kinds of objects.
Each of these observations belongs to some group, and for the vast majority of projects we will be interested in analyzing a particular group or finding group-specific metrics.
Even to do simple descriptive statistics for variables that belong to a certain group, we will need to filter the dataset in R.
Filtering the dataset by a specific value of interest is a very useful skill to have, as it narrows down the observations and can simplify the other statistical commands you will run in R.
Just think how much easier it would be if you could narrow 100,000 observations down to the 2,000 observations of interest.
Whether you are interested in testing for normality or just running a simple linear regression, filtering will help you clean the dataset before starting the more complex tasks.
Application
Below are the steps we are going to take to make sure we master the skill of filtering a data frame by value in R:
- Installing dplyr package
- Basic filter() command description
- Loading sample dataset: mtcars
- Filter by single value in R
- Filter by multiple values in R
Part 1. Installing dplyr package
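The filter() command for data frames is not part of base R (the filter() function that ships with R in the stats package is a different function, meant for time series), so we will need to install an additional package, dplyr, in order to filter a dataset by value this way.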
You can learn more about dplyr package here.
In order to install and “call” the package into your R (R Studio) environment, you should use the following code:
install.packages("dplyr")
library(dplyr)
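If you rerun your script often, you may not want to reinstall the package every time. The sketch below is one possible way to install dplyr only when it is missing; it assumes you have an internet connection and a CRAN mirror configured for the first run.

```r
# Install dplyr only if it is not already available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Attach the package so that filter() refers to the dplyr version
library(dplyr)
```

After library(dplyr) runs, R prints a note that filter() from the stats package is masked. This is expected and means the dplyr version will be used.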
Once we have the package installed and ready, it’s time to discuss the capabilities and syntax of the filter() function in R.
Part 2. Basic filter() command description
The very brief theoretical explanation of the function is the following:
filter(data, conditions)
Here, “data” refers to the dataset you are going to filter, and “conditions” refers to one or more logical conditions that a row must satisfy in order to be kept.
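To make the syntax more concrete, here is a small sketch that uses the built-in mtcars dataset (introduced in more detail in Part 3). It illustrates that several conditions can be passed to filter() separated by commas, and that doing so keeps the same rows as joining the conditions with the & operator. The object names by_comma and by_and are just illustrative names I picked for this sketch.

```r
library(dplyr)

# Two conditions separated by a comma: a row must satisfy both to be kept
by_comma <- filter(mtcars, cyl == 4, am == 1)

# The same two conditions joined explicitly with the "and" operator
by_and <- filter(mtcars, cyl == 4 & am == 1)

# Check that both calls kept the same rows
all.equal(by_comma, by_and)
```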
It is also important to remember the list of operators used in filter() command in R:
- == : exactly equal
- != : not equal to
- > : greater than
- < : less than
- >= : greater than or equal to
- <= : less than or equal to
- & : and
- | : or
As you will learn in the later sections of this article, these can be used on their own as well as in combination with other operators.
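Before applying these operators to a dataset, it may help to see what they return on a plain vector. The short sketch below uses a made-up vector called cyl_values (not part of any dataset); each comparison returns one TRUE or FALSE per element, and filter() keeps the rows where the result is TRUE.

```r
# A small illustrative vector of cylinder counts
cyl_values <- c(4, 6, 8)

cyl_values == 6                    # FALSE  TRUE FALSE
cyl_values != 8                    #  TRUE  TRUE FALSE
cyl_values >= 6                    # FALSE  TRUE  TRUE
cyl_values == 4 | cyl_values == 6  #  TRUE  TRUE FALSE
cyl_values > 4 & cyl_values < 8    # FALSE  TRUE FALSE
```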
Now let’s prepare our dataset and get started on how to apply filter() function in R.
Part 3. Loading sample dataset: mtcars
Similar to the majority of my articles and for simplicity, we will be working with one of the datasets already built into R.
If you have your own data that you want to work with right away, you can import your dataset and follow the same procedures as in this article.
The prebuilt dataset I will be working with to show the application of filter() command in R is mtcars.
This dataset provides observations on 32 cars across 11 variables (weight, fuel efficiency, engine, and so on).
I prefer to call the data I work with “mydata”, so here is the command you would use for that:
mydata<-mtcars
You can take a look at your dataset using the following code:
View(mydata)
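View() opens a spreadsheet-style viewer, which may not be available when you run a script outside of R Studio. As an alternative sketch, the base R commands below print a quick summary of the dataset directly in the console.

```r
mydata <- mtcars

# Number of rows and columns (32 observations, 11 variables)
dim(mydata)

# First three rows of the dataset
head(mydata, 3)

# Name and type of every variable
str(mydata)
```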
At this point, our data is ready, so let's get into examples of filtering in R!
Part 4. Filter by single value in R
When working with the operators mentioned above, please note that == and != can be used with characters as well as numerical data.
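All variables in mtcars are numeric, so to illustrate filtering on character data the sketch below first adds a character column. The column name model is my own choice for this illustration; it simply copies the car names stored in the row names of the dataset.

```r
library(dplyr)

mydata <- mtcars

# Illustrative helper column: car names taken from the row names
mydata$model <- rownames(mydata)

# Keep only the row whose model is exactly "Mazda RX4"
one_car <- filter(mydata, model == "Mazda RX4")

# Keep every row except that one
other_cars <- filter(mydata, model != "Mazda RX4")

nrow(one_car)
nrow(other_cars)
```

Note that the comparison is exact and case sensitive, so "Mazda RX4 Wag" is not matched by model == "Mazda RX4".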
Example set 1: Filtering by single value and single condition in R
Example 1: Assume we want to filter our dataset to include only cars with a V-shaped engine.
The variable in the mtcars dataset that represents the type of engine is vs (0 = V-shaped, 1 = straight).
In technical terms, we want to keep only those observations where vs = 0.
I will call this subset "ex11_mydata".
You can filter the original dataset using the following code:
ex11_mydata<-filter(mydata, vs==0)
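After filtering, it is a good habit to check that the result looks the way you expect. The sketch below repeats the filter from this example and then confirms that only vs = 0 remains and that the number of rows matches a direct count on the original data.

```r
library(dplyr)

mydata <- mtcars
ex11_mydata <- filter(mydata, vs == 0)

# Only the value 0 should be left in the vs column
unique(ex11_mydata$vs)

# The number of rows kept should equal the number of vs == 0 cars in the original data
nrow(ex11_mydata)
sum(mydata$vs == 0)
```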
Example 2: Assume we want to filter our dataset to include only cars whose number of cylinders is anything other than 8.
The variable in mtcars dataset that represents the number of cylinders is cyl.
In technical terms, we want to keep only those observations where cyl is not equal to 8 (or, using the operator notation, !=8).
I will call this subset "ex12_mydata".
You can filter the original dataset using the following code:
ex12_mydata<-filter(mydata, cyl!=8)
Example 3: Assume we want to filter our dataset to include only cars that have gross horsepower equal to 180 or greater.
The variable in the mtcars dataset that represents gross horsepower is hp.
In technical terms, we want to keep only those observations where hp is greater than or equal to 180 (or, using the operator notation, >=180).
I will call this subset "ex13_mydata".
You can filter the original dataset using the following code:
ex13_mydata<-filter(mydata, hp>=180)
Similarly, you can practice using all other operators and filter datasets in R by single value.
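As a starting point for that practice, here is a short sketch with the operators not used in the examples above. The thresholds (20 miles per gallon, a weight of 3.5, a quarter-mile time of 17 seconds) are arbitrary values I chose for illustration, and the subset names are made up as well.

```r
library(dplyr)

mydata <- mtcars

# Less than: cars with fuel efficiency below 20 miles per gallon
ex_lt_mydata <- filter(mydata, mpg < 20)

# Greater than: cars with weight above 3.5 (in 1000 lbs)
ex_gt_mydata <- filter(mydata, wt > 3.5)

# Less than or equal to: cars with a quarter-mile time of 17 seconds or less
ex_le_mydata <- filter(mydata, qsec <= 17)

# Compare how many rows each condition keeps
c(nrow(ex_lt_mydata), nrow(ex_gt_mydata), nrow(ex_le_mydata))
```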
Example set 2: Filtering by single value and multiple conditions in R
Example 1: Assume we want to filter our dataset to include only cars with number of cylinders equal to 4 or 6.
As discussed in one of the previous examples, the variable in mtcars dataset that represents the number of cylinders is cyl.
In technical terms, we want to keep only those observations where cyl is equal to 4 or equal to 6 (using the operator notation ==4 and ==6).
I will call this subset "ex21_mydata".
You can filter the original dataset using the following code:
ex21_mydata<-filter(mydata, cyl==4 | cyl==6)
Generally, filtering by a single variable with multiple conditions will involve the "or" operator.
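When the list of accepted values grows, chaining many "or" conditions becomes hard to read. A common alternative, sketched below, is base R's %in% operator, which checks whether each value appears in a given set. The object names with_or and with_in are illustrative.

```r
library(dplyr)

mydata <- mtcars

# Chained "or" conditions, as in the example above
with_or <- filter(mydata, cyl == 4 | cyl == 6)

# The same selection written with %in% and a vector of accepted values
with_in <- filter(mydata, cyl %in% c(4, 6))

# Check that both approaches kept the same rows
all.equal(with_or, with_in)
```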
Going forward you will see how the variety of filter operator combinations in R can change when we look at filtering by multiple values with single or multiple conditions.
Part 5. Filter by multiple values in R
This type of filtering is considered to be slightly more complex, yet you will see that it's just a small extension of the previous part (in terms of logic and code).
The main difference is that we will be placing conditions on more than one variable in the dataset, while everything else will remain the same.
Example 1: Assume we want to filter our dataset to include only cars that have a V-shaped engine and 8 cylinders.
In technical terms, we want to keep only those observations where vs is equal to 0 and cyl is equal to 8 (using the operator notation vs==0 and cyl==8).
I will call this subset "ex31_mydata".
You can filter the original dataset using the following code:
ex31_mydata<-filter(mydata, vs==0 & cyl==8)
Example 2: Assume we want to filter our dataset to include only cars that have 8 cylinders or have 180 horsepower or more.
In technical terms, we want to keep only those observations where cyl is equal to 8 or hp is equal to or greater than 180 (using the operator notation cyl==8 or hp>=180).
I will call this subset "ex32_mydata".
ex32_mydata<-filter(mydata, cyl==8 | hp>=180)
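Because "and" and "or" are easy to mix up, the sketch below runs the same two conditions with both operators so you can compare the results. The names either_mydata and both_mydata are illustrative. The "or" version keeps a car if at least one condition holds, while the "and" version keeps it only if both hold, so the "and" result can never have more rows.

```r
library(dplyr)

mydata <- mtcars

# "or": keep cars with 8 cylinders, or with 180+ horsepower, or both
either_mydata <- filter(mydata, cyl == 8 | hp >= 180)

# "and": keep only cars with 8 cylinders that also have 180+ horsepower
both_mydata <- filter(mydata, cyl == 8 & hp >= 180)

# Compare the number of rows kept by each version
nrow(either_mydata)
nrow(both_mydata)
```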
Using the same logic, you can extend the application of the filter() command in R to any number of conditions and work with very large datasets.
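One thing to watch for when mixing "and" and "or" in a longer condition is the order of evaluation: in R, & is evaluated before |. The sketch below, with an arbitrary threshold of 100 horsepower and illustrative object names, shows how parentheses change which rows are kept.

```r
library(dplyr)

mydata <- mtcars

# With parentheses: (4 or 6 cylinders) and at least 100 horsepower
grouped <- filter(mydata, (cyl == 4 | cyl == 6) & hp >= 100)

# Without parentheses, R reads this as:
# 4 cylinders, or (6 cylinders and at least 100 horsepower)
ungrouped <- filter(mydata, cyl == 4 | cyl == 6 & hp >= 100)

# The second version also keeps 4-cylinder cars below 100 horsepower
nrow(grouped)
nrow(ungrouped)
```

When in doubt, adding parentheses makes the intended grouping explicit.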
This concludes our article on how to filter by value in R. You can learn more about working with more commands from dplyr package in the Data Manipulation section.