Yeah NAh

A short post about filtering/subsetting data when there are NA values in R
code
R
Author

Dan Miller

Published

December 3, 2024

Filtering in R

When we want to subset some data in R we have a few options, but really all of them under the hood are returning our data (often rows) based on logical, numeric or character indices.

So in the simplest form in base R we can use the [ function to return the elements we want.

# Return the first two elements using numeric indices:
c(1,2,3)[1:2]
[1] 1 2
# And now Boolean:
c(1,2,3)[c(TRUE,TRUE,FALSE)]
[1] 1 2
# And how we would generally do it, by actually testing for equality
dat <- c(1,2,3)
dat[dat != 3]
[1] 1 2

But what if our data has NA values?

# Filtering by testing equality/inequality with an NA value
dat <- c(1,2,3,NA)
dat[dat != 3]
[1]  1  2 NA

That looks ok, right? It’s returning all the things that don’t equal 3. But is it really return the NA value from the original dat?

Considering that we said we subset with logical, numeric or character indices, what does dat != 3 return?

dat != 3
[1]  TRUE  TRUE FALSE    NA

NA is not one of the valid values for indexing. The documentation for [ gives us a clue:

NAs in indexing: When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA in the corresponding element of a logical, integer, numeric, complex or character result, and NULL for a list.

So what does it matter? Well, it becomes clearer that it’s not returning our original NA when we look at more complex data.

dat <- data.frame(name = LETTERS[1:16], 
                  feature = sample(c(1:3,NA),16,replace = TRUE))
dat
   name feature
1     A       1
2     B       3
3     C       3
4     D       3
5     E       2
6     F      NA
7     G       3
8     H      NA
9     I       2
10    J       1
11    K       2
12    L       3
13    M       2
14    N       3
15    O       2
16    P       3
dat_base <- dat[dat$feature != 3,]
dat_base
     name feature
1       A       1
5       E       2
NA   <NA>      NA
NA.1 <NA>      NA
9       I       2
10      J       1
11      K       2
13      M       2
15      O       2

The rows where feature was NA are not actually returned, instead a whole missing row is returned in their place. This is somewhat unexpected unless you know what is happening and it’s probably unlikely to be the desired result.

Filtering with NAs

dplyr::filter() and base::subset() are a little clearer and provide results we are more likely to want when filtering with NA values present. First subset():

“For ordinary vectors, the result is simply x[subset & !is.na(subset)].”

subset(dat, feature != 3)
   name feature
1     A       1
5     E       2
9     I       2
10    J       1
11    K       2
13    M       2
15    O       2

And dplyr::filter() is clearer again:

From the dplyr::filter() documentation:

“The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.”

👀 dplyr::filter() documentation

dat |> dplyr::filter(feature != 3)
  name feature
1    A       1
2    E       2
3    I       2
4    J       1
5    K       2
6    M       2
7    O       2

But what if I want the NA rows?!?!

Well, we need to create a vector that explicitly includes the NA values:

dat$feature != 3 | is.na(dat$feature)
 [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
[13]  TRUE FALSE  TRUE FALSE
dat[dat$feature != 3 | is.na(dat$feature),]
   name feature
1     A       1
5     E       2
6     F      NA
8     H      NA
9     I       2
10    J       1
11    K       2
13    M       2
15    O       2
subset(dat, feature != 3 | is.na(feature))
   name feature
1     A       1
5     E       2
6     F      NA
8     H      NA
9     I       2
10    J       1
11    K       2
13    M       2
15    O       2
dat |> dplyr::filter(feature != 3 | is.na(feature))
  name feature
1    A       1
2    E       2
3    F      NA
4    H      NA
5    I       2
6    J       1
7    K       2
8    M       2
9    O       2