按名称删除数据帧列

I have a number of columns that I would like to remove from a data frame. I know that we can delete them individually using something like:

df$x <- NULL

But I was hoping to do this with fewer commands.

Also, I know that I could drop columns using integer indexing like this:

df <- df[ -c(1, 3:6, 12) ]

But I am concerned that the relative position of my variables may change.

Given how powerful R is, I figured there might be a better way than dropping each column one by one.

转载于:https://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name

You can use a simple list of names :

DF <- data.frame(
  x=1:10,
  y=10:1,
  z=rep(5,10),
  a=11:20
)
drops <- c("x","z")
DF[ , !(names(DF) %in% drops)]

Or, alternatively, you can make a list of those to keep and refer to them by name :

keeps <- c("y", "a")
DF[keeps]

EDIT : For those still not acquainted with the drop argument of the indexing function, if you want to keep one column as a data frame, you do:

keeps <- "y"
DF[ , keeps, drop = FALSE]

drop=TRUE (or not mentioning it) will drop unnecessary dimensions, and hence return a vector with the values of column y.

You could use %in% like this:

df[, !(colnames(df) %in% c("x","bar","foo"))]

There's also the subset command, useful if you know which columns you want:

df <- data.frame(a = 1:10, b = 2:11, c = 3:12)
df <- subset(df, select = c(a, c))

UPDATED after comment by @hadley: To drop columns a,c you could do:

df <- subset(df, select = -c(a, c))

I keep thinking there must be a better idiom, but for subtraction of columns by name, I tend to do the following:

df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)

# return everything except a and c
df <- df[,-match(c("a","c"),names(df))]
df

There is a potentially more powerful strategy based on the fact that grep() will return a numeric vector. If you have a long list of variables as I do in one of my dataset, some variables that end in ".A" and others that end in ".B" and you only want the ones that end in ".A" (along with all the variables that don't match either pattern, do this:

dfrm2 <- dfrm[ , -grep("\\.B$", names(dfrm)) ]

For the case at hand, using Joris Meys example, it might not be as compact, but it would be:

DF <- DF[, -grep( paste("^",drops,"$", sep="", collapse="|"), names(DF) )]

Another possibility:

df <- df[, setdiff(names(df), c("a", "c"))]

df <- df[, grep('^(a|c)$', names(df), invert=TRUE)]

If you want remove the columns by reference and avoid the internal copying associated with data.frames then you can use the data.table package and the function :=

You can pass a character vector names to the left hand side of the := operator, and NULL as the RHS.

library(data.table)

df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)
# or more simply  DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10) #

DT[, c('a','b') := NULL]

If you want to predefine the names as as character vector outside the call to [, wrap the name of the object in () or {} to force the LHS to be evaluated in the calling scope not as a name within the scope of DT.

del <- c('a','b')
DT <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, (del) := NULL]
DT <-  <- data.table(a=1:10, b=1:10, c=1:10, d=1:10)
DT[, {del} := NULL]
# force or `c` would also work.

You can also use set, which avoids the overhead of [.data.table, and also works for data.frames!

df <- data.frame(a=1:10, b=1:10, c=1:10, d=1:10)
DT <- data.table(df)

# drop `a` from df (no copying involved)

set(df, j = 'a', value = NULL)
# drop `b` from DT (no copying involved)
set(DT, j = 'b', value = NULL)

Out of interest, this flags up one of R's weird multiple syntax inconsistencies. For example given a two-column data frame:

df <- data.frame(x=1, y=2)

This gives a data frame

subset(df, select=-y)

but this gives a vector

df[,-2]

This is all explained in ?[ but it's not exactly expected behaviour. Well at least not to me...

within(df, rm(x))

is probably easiest, or for multiple variables:

within(df, rm(x, y))

Or if you're dealing with data.tables (per How do you delete a column by name in data.table?):

dt[, x := NULL]   # deletes column x by reference instantly

dt[, !"x", with=FALSE]   # selects all but x into a new data.table

or for multiple variables

dt[, c("x","y") := NULL]

dt[, !c("x", "y"), with=FALSE]

In the development version of data.table (installation instructions), with = FALSE is no longer necessary:

dt[ , !"x"]
dt[ , !c("x", "y")]

list(NULL) also works:

dat <- mtcars
colnames(dat)
# [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
# [11] "carb"
dat[,c("mpg","cyl","wt")] <- list(NULL)
colnames(dat)
# [1] "disp" "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"

DF <- data.frame(
  x=1:10,
  y=10:1,
  z=rep(5,10),
  a=11:20
)
DF

Output:

    x  y z  a
1   1 10 5 11
2   2  9 5 12
3   3  8 5 13
4   4  7 5 14
5   5  6 5 15
6   6  5 5 16
7   7  4 5 17
8   8  3 5 18
9   9  2 5 19
10 10  1 5 20

DF[c("a","x")] <- list(NULL)

Output:

Here is a dplyr way to go about it:

#df[ -c(1,3:6, 12) ]  # original
df.cut <- df %>% select(-col.to.drop.1, -col.to.drop.2, ..., -col.to.drop.6)  # with dplyr::select()

I like this because it's intuitive to read & understand without annotation and robust to columns changing position within the data frame. It also follows the vectorized idiom using - to remove elements.

Another dplyr answer. If your variables have some common naming structure, you might try starts_with(). For example

library(dplyr)
df <- data.frame(var1 = rnorm(5), var2 = rnorm(5), var3 = rnorm (5), 
                 var4 = rnorm(5), char1 = rnorm(5), char2 = rnorm(5))
df
#        var2      char1        var4       var3       char2       var1
#1 -0.4629512 -0.3595079 -0.04763169  0.6398194  0.70996579 0.75879754
#2  0.5489027  0.1572841 -1.65313658 -1.3228020 -1.42785427 0.31168919
#3 -0.1707694 -0.9036500  0.47583030 -0.6636173  0.02116066 0.03983268
df1 <- df %>% select(-starts_with("char"))
df1
#        var2        var4       var3       var1
#1 -0.4629512 -0.04763169  0.6398194 0.75879754
#2  0.5489027 -1.65313658 -1.3228020 0.31168919
#3 -0.1707694  0.47583030 -0.6636173 0.03983268

If you want to drop a sequence of variables in the data frame, you can use :. For example if you wanted to drop var2, var3, and all variables in between, you'd just be left with var1:

df2 <- df1 %>% select(-c(var2:var3) )  
df2
#        var1
#1 0.75879754
#2 0.31168919
#3 0.03983268

There's a function called dropNamed() in Bernd Bischl's BBmisc package that does exactly this.

BBmisc::dropNamed(df, "x")

The advantage is that it avoids repeating the data frame argument and thus is suitable for piping in magrittr (just like the dplyr approaches):

df %>% BBmisc::dropNamed("x")

Another solution if you don't want to use @hadley's above: If "COLUMN_NAME" is the name of the column you want to drop:

df[,-which(names(df) == "COLUMN_NAME")]

Dplyr Solution

I doubt this will get much attention down here, but if you have a list of columns that you want to remove, and you want to do it in a dplyr chain I use one_of() in the select clause:

Here is a simple, reproducable example:

undesired <- c('mpg', 'cyl', 'hp')

mtcars %>%
  select(-one_of(undesired))

Documentation can be found by running ?one_of or here:

http://genomicsclass.github.io/book/pages/dplyr_tutorial.html

Provide the data frame and a string of comma separated names to remove:

remove_features <- function(df, features) {
  rem_vec <- unlist(strsplit(features, ', '))
  res <- df[,!(names(df) %in% rem_vec)]
  return(res)
}

Usage:

remove_features(iris, "Sepal.Length, Petal.Width")

Beyond select(-one_of(drop_col_names)) demonstrated in earlier answers, there are a couple other dplyr options for dropping columns using select() that do not involve defining all the specific column names (using the dplyr starwars sample data for some variety in column names):

library(dplyr)
starwars %>% 
  select(-(name:mass)) %>%        # the range of columns from 'name' to 'mass'
  select(-contains('color')) %>%  # any column name that contains 'color'
  select(-starts_with('bi')) %>%  # any column name that starts with 'bi'
  select(-ends_with('er')) %>%    # any column name that ends with 'er'
  select(-matches('^f.+s$')) %>%  # any column name matching the regex pattern
  select_if(~!is.list(.)) %>%     # not by column name but by data type
  head(2)

# A tibble: 2 x 2
homeworld species
  <chr>     <chr>  
1 Tatooine  Human  
2 Tatooine  Droid

Find the index of the columns you want to drop using which. Give these indexes a negative sign (*-1). Then subset on those values, which will remove them from the dataframe. This is an example.

DF <- data.frame(one=c('a','b'), two=c('c', 'd'), three=c('e', 'f'), four=c('g', 'h'))
DF
#  one two three four
#1   a   d     f    i
#2   b   e     g    j

DF[which(names(DF) %in% c('two','three')) *-1]
#  one four
#1   a    g
#2   b    h