In this article we will learn how to remove rows with NA from dataframe in R. We will walk through a complete tutorial on how to treat missing values using complete.cases() function in R.

Theory

The real world data that data scientists work with often isn’t perfect. It can contain wrong entries, mistakes, different data types, missing values and so on.

In this article we will focus on working with missing values in R dataframe.

Why is it such an important topic?

Well it all starts with how functions in R work. A lot of functions that perform descriptive statistics operations or rounding, when used on columns in which rows have NA or missing values, fail and give errors.

If you think about it, it makes sense. How can you possibly find the average of a set of numbers where some of them are “unknown”? You can’t round them either!

In this article we will learn how to subset data with complete entries. When you are certain you data is clean and complete, you can go ahead and analyze it.


Application

Below are the steps we are going to take to make sure we do learn how to remove rows with NA and handle missing values in R dataframe:

  1. Creating sample dataframe that includes missing values 
  2. Basic complete.cases() function description
  3. Removing rows with NA from R dataframe



Part 1. Creating sample dataframe that includes missing values

The first step we will need to take is create some arbitrary dataset to work with.

One of the popular examples is a customer list with their information that a company can use for its marketing purposes or some promotional activity.

Let’s create a dataframe with the following columns: id, name, phone, email.



mydata <- data.frame(
  id = c("0001","0002","0003","0004"),
  name = c("John Smith", "Eric Parker","Mark Austin", "David Richards"), 
  phone = c("(999) 999-9999", "(888) 888-8888", NA, "(666) 666-6666"), 
  email = c("john@smith.com", NA, "mark@austin.com", "david@richards.com")
)


Perfect! Now we can take a look at the dataframe we just created:



show(mydata)


It will produce the following output:


    id           name          phone              email
1 0001     John Smith (999) 999-9999     john@smith.com
2 0002    Eric Parker (888) 888-8888               <NA>
3 0003    Mark Austin           <NA>    mark@austin.com
4 0004 David Richards (666) 666-6666 david@richards.com


As you can see from the dataframe above, we introduced some NAs or missing values. This is very similar to what you see in the actual business datasets.

Now let's discuss the R function that will help us clean this messy data!


Part 2. Basic complete.cases() function description

The complete.cases() function description is built into R already, so we can skip the step of installing additional packages.

Here is a theoretical explanation of the function:

complete.cases(data)

This function accepts a sequence of dataframes and returns a logical vector with "TRUE"/"FALSE" showing which observations are "complete" ("TRUE") and which are missing ("FALSE").

Essentially the function goes through every observation and asks a question "Is there a value?" If yes, then it returns "TRUE", if the value is missing it returns "FALSE".

For each object that you apply this function to, you will get a logical vector with results.

A nice capacity of this function that is very useful when removing rows with NAs (missing values), is that it allows to pass a whole dataframe, or if you want, you can just pass a single column. In the section below we will walk through several examples of how to remove rows with NAs (missing values).


Part 3. Removing rows with NA from R dataframe

At this point, our problem is outlined, we covered the theory and the function we will use, and we are all ready and equipped to do some applied examples of removing rows with NA in R.

Recall our dataset. We have missing values in two columns: "phone" and "email". Depending on the business problem you are presented with, the solutions can vary. Here are the two potential cases that you can have:

  1. You want to clean up the entire dataframe by removing all rows with NA from the dataframe.
  2. You want to clean only some specific column of the dataframe.

We will show how to approach both of these.

Business problem: You are an analyst and your manager gives you the following customer data and asks to clean it up. The manager wants to receive two files:
1. A list of customers that have both phone and email (with respective entries).
2. A list of customers that have a phone regardless if they have/don't have an email (with respective entries.

Here is the data:


    id           name          phone              email
1 0001     John Smith (999) 999-9999     john@smith.com
2 0002    Eric Parker (888) 888-8888               <NA>
3 0003    Mark Austin           <NA>    mark@austin.com
4 0004 David Richards (666) 666-6666 david@richards.com



Remove all rows with NA

From the above you see that all you need to do is remove rows with NA which are 2 (missing email) and 3 (missing phone number).

First, let's apply the complete.cases() function to the entire dataframe and see what results it produces:



complete.cases(mydata)


And we get:


[1] FALSE FALSE FALSE  TRUE


What the function did is it looked through each row and in each row it checked every column (observation). For example, it looks at the first row and sees:


1 0001     John Smith (999) 999-9999     john@smith.com


There are no missing values, so it returns "TRUE".

Then it moves on two the second row and sees:


2 0002    Eric Parker (888) 888-8888               <NA>


Here, there is one NA (missing value), so it returns "FALSE". Note: it doesn't matter if there is only one or more NAs. One is enough so return "FALSE".

And the function keeps iterating through all rows while appending "TRUE"/"FALSE" result for each row into a logical vector.

Now we know which rows are complete and all that's left to do is to take the original dataframe and clean it up from missing values:



df_all <- mydata[complete.cases(mydata),]


The above manipulation basically tells R to only keep rows where the logical vector has "TRUE" for all columns.
We can take a look at the result:



show(df_all)


Output:


    id           name          phone              email
1 0001     John Smith (999) 999-9999     john@smith.com
4 0004 David Richards (666) 666-6666 david@richards.com


We now have a list of customers who have entered both their phone and email.

Looks like we solved the first part!


Remove rows with NA in a particular column

From the above you see that all you need to do is remove rows with NA. In this case it is row 3 (missing phone number).

Our procedure will be identical to the first case in terms of functionality. What we will do differently is that instead of applying complete.cases() to the entire dataframe, we will focus on a specific column which is "phone":



complete.cases(mydata$phone)


And we get:


[1]  TRUE  TRUE FALSE  TRUE


The function did the same procedure as in the first example, with the only difference that it only checked for missing values in the column we specified.

Now we know which rows are complete (have a phone entered) and all that's left to do is to take the original dataframe and clean it up from missing values:



df_phone <- mydata[complete.cases(mydata$phone),]


The above manipulation basically tells R to only keep rows where the logical vector has "TRUE" for rows in the "phone" column.
We can take a look at the result:



show(df_phone)


Output:


    id           name          phone              email
1 0001     John Smith (999) 999-9999     john@smith.com
2 0002    Eric Parker (888) 888-8888               <NA>
4 0004 David Richards (666) 666-6666 david@richards.com


We see that the observation that was dropped is row 3, where the "phone" entry was NA.


This concludes the article on how to remove rows with NA (missing values) from R dataframe.

If you liked this article, I encourage you to take a look at the Data Manipulation in R section where you will find a lot of useful information and master the skill of data wrangling.