I'd like to remove the lines in this data frame that:
a) contain NAs across all columns. Below is my example data frame.
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA NA
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA NA NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
Basically, I'd like to get a data frame such as the following.
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
b) contain NAs in only some columns, so I can also get this result:
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Reposted from: https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
Also check complete.cases:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit is nicer for just removing all NAs. complete.cases allows partial selection by including only certain columns of the data frame:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Your solution can't work. If you insist on using is.na, then you have to do something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases is quite a lot clearer, and faster.
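For anyone who wants to run the snippets above, here is a minimal sketch that rebuilds the question's data frame (named final, as in the answer; the result names are my own) and compares the complete.cases and rowSums forms:

```r
# Rebuild the question's example data frame (values copied from the question).
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Part a): keep rows with no NA in any column.
all_complete <- final[complete.cases(final), ]

# Part b): keep rows with no NA in rnor and cfam (columns 5 and 6).
some_complete <- final[complete.cases(final[, c("rnor", "cfam")]), ]

# The is.na/rowSums form should select the same rows as part b).
same_rows <- final[rowSums(is.na(final[, c("rnor", "cfam")])) == 0, ]
```

Selecting the columns by name rather than by position (5:6) is only a readability choice here; both address the same two columns in this data frame.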
Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).
I prefer the following way to check whether rows contain any NAs:
row.has.na <- apply(final, 1, function(x){any(is.na(x))})
This returns a logical vector whose values denote whether there is any NA in a row. You can use it to see how many rows you'll have to drop:
sum(row.has.na)
and eventually drop them
final.filtered <- final[!row.has.na,]
Filtering on NAs in only certain columns is a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.
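A rough sketch of the column-restricted variant, on a small made-up data frame (the values are hypothetical; only the column names come from the question):

```r
# Hypothetical toy data; column names borrowed from the question.
final <- data.frame(
  gene = c("g1", "g2", "g3"),
  mmul = c(NA, 2, NA),
  rnor = c(NA, 2, 1),
  cfam = c(NA, 2, 2),
  stringsAsFactors = FALSE
)

# Flag rows that have an NA in rnor or cfam only; drop = FALSE keeps a
# data frame even if a single column is selected.
row.has.na <- apply(final[, c("rnor", "cfam"), drop = FALSE], 1,
                    function(x) any(is.na(x)))

sum(row.has.na)                          # number of rows that would be dropped
final.filtered <- final[!row.has.na, ]   # row 3 survives despite its NA in mmul
```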
Another option if you want greater control over how rows are deemed to be invalid is:
final <- final[!(is.na(final$rnor)) | !(is.na(final$cfam)),]
Using the above, this:
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA 2
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA 2 NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
Becomes:
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA 2
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA 2 NA
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
...where only row 5 is removed since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.
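To illustrate how changing the boolean logic changes the result, here is a small sketch on made-up data (the values and the result names are hypothetical):

```r
# Hypothetical toy data with the same two columns the answer tests.
final <- data.frame(
  gene = c("g1", "g2", "g3", "g4"),
  rnor = c(NA, 2, NA, 1),
  cfam = c(2, NA, NA, 2),
  stringsAsFactors = FALSE
)

# OR of the negations: keep a row if at least one of the two values is present.
keep_either <- final[!is.na(final$rnor) | !is.na(final$cfam), ]

# AND of the negations: keep a row only if both values are present.
keep_both <- final[!is.na(final$rnor) & !is.na(final$cfam), ]
```

With |, only the row that is NA in both columns is dropped; with &, any row that is NA in either column is dropped.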
This will return the rows that have at least ONE non-NA value.
final[rowSums(is.na(final))<length(final),]
This will return the rows that have at least TWO non-NA values.
final[rowSums(is.na(final))<(length(final)-1),]
We can also use the subset function for this.
finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))
This keeps only the rows in which neither mmul nor rnor is NA.
If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results. So they are deleted after a certain threshold. This function will allow you to choose how many NAs the row can have before it's deleted:
delete.na <- function(DF, n=0) {
DF[rowSums(is.na(DF)) <= n,]
}
By default, it removes every row that contains an NA:
delete.na(final)
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
Or specify the maximum number of NAs allowed:
delete.na(final, 2)
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
I am a synthesizer:). Here I combined the answers into one function:
#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
    if (!is.null(col)) {
        # R converts a single row/col to a vector if the parameter col has only one col
        # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
        df.temp = df[,col,drop=FALSE]
    } else {
        df.temp = df
    }
    if (length(n)==1){
        if (n==0) {
            # simply call complete.cases which might be faster
            result = df[complete.cases(df.temp),]
        } else {
            # credit: http://stackoverflow.com/a/30461945/2292993
            log <- apply(df.temp, 2, is.na)
            logindex <- apply(log, 1, function(x) sum(x) == n)
            result = df[logindex, ]
        }
    }
    if (length(n)==2){
        min = n[1]; max = n[2]
        log <- apply(df.temp, 2, is.na)
        logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
        result = df[logindex, ]
    }
    return(result)
}
For your first question, I have code that I am comfortable with to get rid of all NAs. Thanks to @Gregor for making it simpler.
final[!(rowSums(is.na(final))),]
For the second question, the code is just an alteration of the previous solution.
final[as.logical((rowSums(is.na(final))-5)),]
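Notice that the subtracted value must equal the number of columns being tested for NA (the code above uses 5; ncol(final) gives the count when every column is tested). This eliminates rows that are NA in all of those columns: their rowSums equals the column count, so the difference is zero and as.logical turns it into FALSE. This time, as.logical is necessary.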
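The hard-coded 5 above has to match the number of columns being tested. A small sketch that uses ncol() instead, on made-up data (the values and result names are hypothetical):

```r
# Hypothetical toy data: row 2 is NA in every column.
final <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 6),
  c = c(7, NA, NA)
)

# Subtract the column count: an all-NA row gives 0, which as.logical
# converts to FALSE, so that row is dropped.
keep <- as.logical(rowSums(is.na(final)) - ncol(final))
not_all_na <- final[keep, ]

# A direct comparison that states the same intent without the subtraction.
same <- final[rowSums(is.na(final)) < ncol(final), ]
```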
If you like pipes (%>%), tidyr's new drop_na is your friend:
library(tidyr)
df %>% drop_na()
# gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674 0 2 2 2 2
# 6 ENSG00000221312 0 1 2 3 2
df %>% drop_na(rnor, cfam)
# gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674 0 2 2 2 2
# 4 ENSG00000207604 0 NA NA 1 2
# 6 ENSG00000221312 0 1 2 3 2
Assuming dat as your dataframe, the expected output can be achieved using
1. rowSums
> dat[!rowSums((is.na(dat))),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
2. lapply
> dat[!Reduce('|',lapply(dat,is.na)),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
Using the dplyr package, we can filter NA as follows:
dplyr::filter(df, !is.na(columnname))
Use data.table and na.omit() with the optional param cols=.
na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).
If you are not using data.table, use complete.cases().
On a vanilla data.frame, complete.cases is faster than na.omit() or tidyr::drop_na(). Notice that na.omit.data.frame does not support cols=.
Here is a comparison of base (blue), dplyr/tidyr (pink), and data.table (yellow) methods for dropping either all or select missing observations, on a notional dataset of 1 million observations of 20 numeric variables with an independent 5% likelihood of being missing, and a subset of 4 variables for part 2.
Your results may vary based on length, width, and sparsity of your particular dataset.
Note log scale on y axis.
#------- Adjust these assumptions for your own use case ------------
row_size <- 1e6L
col_size <- 20 # not including ID column
p_missing <- 0.05 # likelihood of missing observation (except ID col)
col_subset <- 18:21 # second part of question: filter on select columns
#------- System info for benchmark ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr); packageVersion('dplyr') # 0.7.4
library(tidyr); packageVersion('tidyr') # 0.8.0
library(microbenchmark)
#------- Example dataset using above assumptions --------------------
fakeData <- function(m, n, p){
set.seed(123)
m <- matrix(runif(m*n), nrow=m, ncol=n)
m[m<p] <- NA
return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)),
stringsAsFactors = FALSE),
data.frame(fakeData(row_size, col_size, p_missing) )
)
dt <- data.table(df)
par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
microbenchmark(
df[complete.cases(df), ],
na.omit(df),
df %>% drop_na,
dt[complete.cases(dt), ],
na.omit(dt)
), xlab='',
main = 'Performance: Drop any NA observation',
col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
microbenchmark(
df[complete.cases(df[,col_subset]), ],
#na.omit(df), # col subset not supported in na.omit.data.frame
df %>% drop_na(col_subset),
dt[complete.cases(dt[,col_subset,with=FALSE]), ],
na.omit(dt, cols=col_subset) # see ?na.omit.data.table
), xlab='',
main = 'Performance: Drop NA obs. in select cols',
col=c('lightblue','salmon',rep('beige',2))
)
delete.dirt <- function(DF, dart=c('NA')) {
dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
DF <- DF[dirty_rows, ]
}
mydata <- delete.dirt(mydata)
The function above deletes every row of the data frame that has 'NA' in any column and returns the resulting data. If you want to check for multiple values, such as NA and ?, change dart=c('NA') in the function parameters to dart=c('NA', '?').
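One caveat, as far as I can tell: %in% with 'NA' matches the string "NA", not a real missing value. The sketch below repeats the function (with the kept rows renamed for clarity) on made-up data to show the difference; adding NA itself to dart is my own suggestion, not part of the original answer:

```r
# Same logic as the function above; clean_rows marks the rows to keep.
delete.dirt <- function(DF, dart = c('NA')) {
  clean_rows <- apply(DF, 1, function(r) !any(r %in% dart))
  DF[clean_rows, ]
}

# Hypothetical toy data: row 2 holds the string "NA", row 3 a real missing value.
mydata <- data.frame(
  a = c("x", "NA", NA, "?"),
  b = c("p", "q", "r", "s"),
  stringsAsFactors = FALSE
)

# Default: only the row with the string "NA" is dropped.
string_only <- delete.dirt(mydata)

# Adding NA to dart also drops the row with the real missing value.
with_missing <- delete.dirt(mydata, dart = c(NA, 'NA', '?'))
```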
My guess is that this could be more elegantly solved in this way
m <- matrix(1:25, ncol = 5)
m[c(1, 6, 13, 25)] <- NA
df <- data.frame(m)
library(dplyr)
df %>%
filter_all(any_vars(is.na(.)))
#> X1 X2 X3 X4 X5
#> 1 NA NA 11 16 21
#> 2 3 8 NA 18 23
#> 3 5 10 15 20 NA
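Note that filter_all(any_vars(is.na(.))) returns the rows that contain an NA, which is the complement of what the question asks to keep. A base R sketch of both halves on the same toy data (the result names are my own):

```r
# Same toy data as in the answer above.
m <- matrix(1:25, ncol = 5)
m[c(1, 6, 13, 25)] <- NA
df <- data.frame(m)

# Rows with at least one NA (the rows shown in the output above).
with_na <- df[!complete.cases(df), ]

# Rows with no NA at all (the rows the question wants to keep).
without_na <- df[complete.cases(df), ]
```

Unlike the dplyr output, base subsetting keeps the original row names rather than renumbering them.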