A subtle flaw in pull()

R
dplyr
SE/NSE

Demonstrating a subtle scoping flaw in dplyr::pull() and providing a menu of reliable workarounds.

Author

Richard Layton

Published

2024-08-01

Summary

In the current version of dplyr, if x is not a column name in data frame d, then pull(d, x) attempts to look up the value of x in the environment instead of returning NULL or an error. There are ways to augment pull() to yield the expected results, though base R alternatives [[ and $ may also be used reliably and predictably.

A current dplyr article, “Programming with dplyr”, states that the pull() function, like select(), uses “tidy selection” for working with data frame columns (Wickham et al., 2023). Accordingly, the select() documentation includes <tidy-select> in its arguments list—but the pull() documentation (unexpectedly) does not.

The thumbnail image Pull by Jeremy Brooks (2014) is licensed under CC BY-NC 2.0.

This may be simply a minor documentation error, but close examination reveals a subtle flaw in pull() that users should be aware of—if column name x does not exist in data frame d, then pull(d, x) unexpectedly attempts to operate on the name x in the environment, if present, instead of returning a NULL or an error.

In this post, I compare the behavior of pull() to that of [[, $, and subset() for extracting a single column x from a data frame when column x exists and when it does not, while the environment contains another name x and the value it references.

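To distinguish a name in the environment (and the value it references) from the name of a data frame column (and the values it references), I borrow two terms from the dplyr article cited above: env-variables are “programming” variables that live in an environment, usually created with <-; data-variables are “statistical” variables that live in a data frame.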

For example, in the following code chunk, I create an env-variable, df, that contains two data-variables, x and y. I then use $ to extract the data-variable x from the env-variable df.

df <- data.frame(x = c("ex", "ex"), y = c("why", "why"))
df
#>    x   y
#> 1 ex why
#> 2 ex why

df$x
#> [1] "ex" "ex"
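The distinction can also be seen by listing the two kinds of names separately. This is a small sketch in base R; the environment name scratch is hypothetical and used only to keep the example isolated from the session.

```r
# Hypothetical scratch environment, so the listing is not affected by
# whatever else is defined in the session.
scratch <- new.env()
scratch$df <- data.frame(x = c("ex", "ex"), y = c("why", "why"))

# Env-variables: names bound in the environment.
ls(scratch)
#> [1] "df"

# Data-variables: column names inside the data frame.
names(scratch$df)
#> [1] "x" "y"

# x is a data-variable only; it is not bound in the environment itself.
exists("x", envir = scratch, inherits = FALSE)
#> [1] FALSE
```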

For reference, the R and package versions I’m using are as follows.

R.version$version.string
#> [1] "R version 4.4.1 (2024-06-14 ucrt)"

library("dplyr")
packageVersion("dplyr")
#> [1] '1.1.4'
packageVersion("tidyselect") # for all_of()
#> [1] '1.2.1'
packageVersion("magrittr") # for %>%
#> [1] '2.0.3'
packageVersion("rlang") # for !!
#> [1] '1.1.4'

Extracting a single column

I start each example with two assignments: the name d referencing the data frame created above; and the name x, referencing a string value.

d <- df
x <- "y"

For extracting a single column as a vector from a data frame, the following are roughly equivalent. Each should extract the data-variable x from the env-variable d.

    d[["x"]]
    d$x
    pull(d, x)

Using [[ or $

The base R operator [[ matches a character value to the column names of a data frame and returns the matching column, if any, as a vector. By default there is no partial matching; that is, d[["x"]] is equivalent to d[["x", exact = TRUE]]. The return is the expected data-variable x in vector form.

d[["x"]]
#> [1] "ex" "ex"

The base R operator $ is similar: d$name is equivalent to d[["name", exact = FALSE]] (partial matching enabled). In this case, partial matching is not relevant; we extract the expected column.

d$x
#> [1] "ex" "ex"

identical(d[["x"]], d$x)
#> [1] TRUE
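For a case where partial matching does matter, here is a brief sketch with a hypothetical data frame pm whose only column has a longer name. It is an illustration only and is not used elsewhere in the post.

```r
# Hypothetical data frame with one long column name.
pm <- data.frame(xylophone = c("ex", "ex"))

# [[ matches exactly by default, so the abbreviation finds nothing.
pm[["xy"]]
#> NULL

# [[ with partial matching requested.
pm[["xy", exact = FALSE]]
#> [1] "ex" "ex"

# $ matches partially by default.
pm$xy
#> [1] "ex" "ex"
```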

If name x does not exist in d,

d <- select(df, -x)

both [[ and $ return NULL.

d[["x"]]
#> NULL

d$x
#> NULL

Both [[ and $ produce reliable and easily predictable results. As expected in the examples above, the env-variable x and the value it references, "y", are irrelevant to [[ and $. This turns out (sometimes) not to be the case with pull().

Using pull()

Reset. (Binding the name x to the value "y" again is unnecessary, but I repeat the assignment with each reset just to remind us that the environment contains this name and value.)

d <- df
x <- "y"

dplyr::pull() extracts a column as a vector, similar to $. Here, the data-variable x is correctly extracted from d.

pull(d, x)
#> [1] "ex" "ex"

If name x does not exist in d,

d <- select(df, -x)

then, like $, we expect pull() to return NULL (or error)—but it doesn’t.

pull(d, x)
#> [1] "why" "why"

pull(), not finding the data-variable x in d, has unexpectedly operated on the env-variable x and used its value to pull the y data-variable from d, exactly as if we had written pull(d, y)

identical(pull(d, x), pull(d, y))
#> [1] TRUE

—behavior surely contrary to a user’s expectations. In general, one expects such behavior only when deliberately using syntax designed to use env-variables to extract data-variables from data frames (the topic of the section below on programming safely).

To borrow a conclusion from John Mount (2018), the unfortunate coincidence that the name x has a value in the environment should be irrelevant to pull().
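A minimal sketch of how such a fall-through can arise under data masking, assuming (not confirmed in this post) that pull() evaluates its argument against a mask of column names before consulting the calling environment. It uses rlang::eval_tidy() directly; the two mask lists are illustrative stand-ins, not dplyr internals.

```r
library(rlang)

x <- "y"  # env-variable, as in the examples above

# Mask that contains x: the data-variable is found first.
mask_with_x <- list(x = 1L, y = 2L)
eval_tidy(quote(x), data = mask_with_x)
#> [1] 1

# Mask without x: the lookup falls through to the environment,
# where x is bound to the string "y".
mask_without_x <- list(y = 1L)
eval_tidy(quote(x), data = mask_without_x)
#> [1] "y"
```

Under that assumption, the string "y" would then be treated as a column name, which is consistent with the result shown above.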

Some background

I was re-reading John Mount’s opinion/tutorial piece (cited above) that demonstrated that dplyr::select() at the time had the same sort of flaw as the one I discuss in this post. Running Mount’s examples today shows that the flaw in select() has since been corrected.

I had been working with pull() in another context and John’s article prompted me to compare standard evaluation (SE) and non-standard evaluation (NSE) approaches to the task for which pull() is designed, inspiring me to write this post.

Mount also showed that base R subset() is known to have a similar mal-feature. To illustrate, I set up subset() to extract one column as a vector.

Reset.

d <- df
x <- "y"

The argument drop = TRUE yields a vector when a single column is selected.

subset(d, select = x, drop = TRUE)
#> [1] "ex" "ex"

If name x does not exist in d,

d <- select(df, -x)

then, like pull(), the column y is returned instead of the expected NULL or error.

subset(d, select = x, drop = TRUE)
#> [1] "why" "why"

However, the subset() documentation does include the following warning about the potential for “unanticipated consequences” of subset’s non-standard evaluation (NSE) interface:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

The pull() documentation does not include such a warning.
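For comparison, here is a sketch of the standard subsetting function [ that the warning recommends, applied to the same task. The names demo_df and no_x are hypothetical stand-ins for the data frames used above, and the error is captured with tryCatch() only so the message can be shown.

```r
demo_df <- data.frame(x = c("ex", "ex"), y = c("why", "why"))
x <- "y"  # env-variable; irrelevant to [ when the column name is quoted

# Column present: returned as a vector.
demo_df[, "x", drop = TRUE]
#> [1] "ex" "ex"

# Column absent: [ raises an error rather than consulting the environment.
no_x <- demo_df["y"]
result <- tryCatch(
  no_x[, "x", drop = TRUE],
  error = function(e) conditionMessage(e)
)
result
#> [1] "undefined columns selected"
```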

Using pull() safely

To use pull() safely to extract column x, we have three forms (at least) that currently return the expected results, including NULL or an error when the column doesn’t exist.

Reset.

d <- df
x <- "y"

1. Quote the data-variable name.

pull(d, "x")
#> [1] "ex" "ex"

2. Use the all_of() selection helper with the column name in quotes.

pull(d, all_of("x"))
#> [1] "ex" "ex"

3. Use the .data pronoun with $.

pull(d, .data$x)
#> [1] "ex" "ex"

If name x does not exist in d,

d <- select(df, -x)

all three forms ignore the env-variable x and return the expected errors.

pull(d, "x")
#> Error in `pull()`:
#> ! Can't extract columns that don't exist.
#> ✖ Column `x` doesn't exist.

pull(d, all_of("x"))
#> Error in `pull()`:
#> ℹ In argument: `all_of("x")`.
#> Caused by error in `all_of()`:
#> ! Can't subset elements that don't exist.
#> ✖ Element `x` doesn't exist.

pull(d, .data$x)
#> Error in `pull()`:
#> ℹ In argument: `x`.
#> Caused by error in `.data$x`:
#> ! Column `x` not found in `.data`.

The safer syntax may be inconvenient enough to defeat the purpose of non-standard evaluation in the first place: typing column names without quotation marks and writing code that reads cleanly in pipes. For the moment, the alternatives [[ or $ may be more attractive.

Of course, one can treat pull(), like subset(), as a convenience function best used interactively, where the existence of the desired column can be confirmed before pulling.
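One way to confirm a column exists before pulling is sketched below. The data frame demo_df is a hypothetical stand-in with no column x, and the message text is illustrative.

```r
library(dplyr)

demo_df <- data.frame(y = c("why", "why"))  # hypothetical: no column x

# Check the column names first, then pull by bare name only if x is there.
if ("x" %in% names(demo_df)) {
  pull(demo_df, x)
} else {
  message("Column x not found; nothing pulled.")
}
```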

Programming safely

When programming, it is often useful to have an env-variable that references a character vector populated with column names you expect a function to find and operate on. In the examples below, we use the env-variable x in function arguments to pull the column specified by its value, in this case, "y".

Reset.

d <- df
x <- "y"

Create function f to operate on the env-variable var using square brackets [[ to extract the column specified by the value of var. The y column is returned as desired.

f <- function(dframe, var) {
  dframe[[var]]
}

f(d, x)
#> [1] "why" "why"

Function g using pull() and the all_of() selection helper yields a similar result.

g <- function(dframe, var) {
  dframe %>% 
    pull(all_of(var))
}

g(d, x)
#> [1] "why" "why"

Function h using pull() and the .data pronoun also yields the desired result.

h <- function(dframe, var) {
  dframe %>% 
    pull(.data[[var]])
}

h(d, x)
#> [1] "why" "why"

As does function q using the rlang injection operator !!.

q <- function(dframe, var) {
  dframe %>% 
    pull(!!var)
}

q(d, x)
#> [1] "why" "why"

And lastly, if the column named by the value of x (here, y) does not exist in d,

d <- select(df, -y)

all four functions return NULL or an error.

f(d, x)
#> NULL

g(d, x)
#> Error in `pull()`:
#> ℹ In argument: `all_of(var)`.
#> Caused by error in `all_of()`:
#> ! Can't subset elements that don't exist.
#> ✖ Element `y` doesn't exist.

h(d, x)
#> Error in `pull()`:
#> ℹ In argument: `y`.
#> Caused by error in `.data[["y"]]`:
#> ! Column `y` not found in `.data`.

q(d, x)
#> Error in `pull()`:
#> ! Can't extract columns that don't exist.
#> ✖ Column `y` doesn't exist.
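If a uniform NULL is preferred over a mix of NULL and errors, one option is to wrap the pull() call in tryCatch(). The helper pull_or_null() below is hypothetical (it is not part of dplyr), and this sketch converts any error to NULL, which may hide unrelated problems.

```r
library(dplyr)

# Hypothetical helper: return NULL, as [[ does, when the column is missing.
# Note that every error is converted to NULL, not only missing-column errors.
pull_or_null <- function(dframe, var) {
  tryCatch(
    pull(dframe, all_of(var)),
    error = function(e) NULL
  )
}

demo_df <- data.frame(x = c("ex", "ex"), y = c("why", "why"))

pull_or_null(demo_df, "y")
#> [1] "why" "why"

pull_or_null(demo_df, "z")
#> NULL
```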

Conclusion

It appears to be an oversight that pull() attempts to operate on an env-variable if the intended data-variable doesn’t exist. Workarounds exist, though the base R alternatives [[ and $ may also be used reliably and predictably.

References

Mount, J. (2018). A subtle flaw in some popular R NSE interfaces. Blog post. https://win-vector.com/2018/09/23/a-subtle-flaw-in-some-popular-r-nse-interfaces/
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr