Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second has numbers which represent the number of times I saw that category.
For example:
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum the Frequencies:
Category Frequency
First 30
Second 5
Third 34
How would I do this in R?
Source: https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group
Using aggregate:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
Category x
1 First 30
2 Second 5
3 Third 34
In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
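For example, a fuller call might look like this (Metric2 and Metric3 are hypothetical extra numeric columns, shown only to illustrate the pattern):
# Sketch only: assumes x also has numeric columns Metric2 and Metric3
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3),
          by=list(Category=x$Category), FUN=sum)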
(Embedding @thelatemail's comment,) aggregate has a formula interface too:
aggregate(Frequency ~ Category, x, sum)
Or if you want to aggregate multiple columns, you could use the . notation (works for one column too):
aggregate(. ~ Category, x, sum)
Or tapply:
tapply(x$Frequency, x$Category, FUN=sum)
First Second Third
30 5 34
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
If x is a dataframe with your data, then the following will do what you want:
require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
Or using plyr:
library(plyr)
ddply(x, .(Category), summarise, sum = sum(Frequency))
Just to add a third option:
require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
EDIT: this is a very old answer. Now I would recommend using group_by and summarise from dplyr, as in @docendo's answer below.
You can also just use the by() function:
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
Let's compare that to the same thing using a data.frame and aggregate as above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
And if you want to keep the original column name, this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference will become more noticeable with larger datasets, as the code below demonstrates:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
For multiple aggregations, you can combine lapply and .SD as follows:
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
More recently, you can also use the dplyr package for that purpose:
library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
Or, for multiple summary columns (works with one column too):
x %>%
group_by(Category) %>%
summarise_each(funs(sum))
Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
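For example, the equivalent of the summarise_each call above would be (a sketch using summarise_all on the same x data frame):
library(dplyr)
x %>%
  group_by(Category) %>%
  summarise_all(sum)   # sums every non-grouping column, like summarise_each(funs(sum))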
Or, if you have multiple columns to group by, you can specify all of them in group_by separated with commas:
mtcars %>%
group_by(cyl, gear) %>% # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns
For more information, including the %>% operator, see the introduction to dplyr.
Several years later, just to add another simple base R solution that isn't present here for some reason: xtabs
xtabs(Frequency ~ Category, df)
# Category
# First Second Third
# 30 5 34
Or if you want a data.frame back:
as.data.frame(xtabs(Frequency ~ Category, df))
# Category Freq
# 1 First 30
# 2 Second 5
# 3 Third 34
While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.
Here is an example of how this question can be answered with sqldf:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")
## Category Frequency
## 1 First 30
## 2 Second 5
## 3 Third 34
Using cast instead of recast (note 'Frequency' is now 'value'):
df <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))
install.packages("reshape")   # if not already installed
library(reshape)
result <- cast(df, Category ~ ., fun.aggregate=sum)
to get:
Category (all)
First 30
Second 5
Third 34