Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second has numbers which represent the number of times I saw that category.
For example:
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum the Frequencies:
Category Frequency
First 30
Second 5
Third 34
How would I do this in R?
Source: https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group
Using aggregate:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
Category x
1 First 30
2 Second 5
3 Third 34
In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
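For example, a fuller call might look like this (Metric2 and Metric3 are hypothetical extra numeric columns, shown only to illustrate the pattern):
# Sketch only: assumes x also has numeric columns Metric2 and Metric3
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3),
          by=list(Category=x$Category), FUN=sum)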
(Embedding @thelatemail's comment,) aggregate has a formula interface too:
aggregate(Frequency ~ Category, x, sum)
Or if you want to aggregate multiple columns, you could use the . notation (works for one column too):
aggregate(. ~ Category, x, sum)
Or tapply:
tapply(x$Frequency, x$Category, FUN=sum)
First Second Third
30 5 34
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
If x is a dataframe with your data, then the following will do what you want:
require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
Or using plyr:
library(plyr)
ddply(x, .(Category), summarise, sum = sum(Frequency))
Just to add a third option:
require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
EDIT: this is a very old answer. Now I would recommend using group_by and summarise from dplyr, as in @docendo's answer below.
You can also just use the by() function:
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
Let's compare that to the same thing using a data.frame and aggregate as above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
And if you want to keep the original column name, this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference will become more noticeable with larger datasets, as the code below demonstrates:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
For multiple aggregations, you can combine lapply and .SD as follows:
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
More recently, you can also use the dplyr package for that purpose:
library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
Or, for multiple summary columns (works with one column too):
x %>%
group_by(Category) %>%
summarise_each(funs(sum))
Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
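For example, the equivalent of the summarise_each call above would be (a sketch using summarise_all on the same x data frame):
library(dplyr)
x %>%
  group_by(Category) %>%
  summarise_all(sum)   # sums every non-grouping column, like summarise_each(funs(sum))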
Or, if you have multiple columns to group by, you can specify all of them in group_by separated with commas:
mtcars %>%
group_by(cyl, gear) %>% # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns
For more information, including the %>% operator, see the introduction to dplyr.
Several years later, just to add another simple base R solution that isn't present here for some reason: xtabs
xtabs(Frequency ~ Category, df)
# Category
# First Second Third
# 30 5 34
Or if you want a data.frame back:
as.data.frame(xtabs(Frequency ~ Category, df))
# Category Freq
# 1 First 30
# 2 Second 5
# 3 Third 34
While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.
Here is an example of how this question can be answered with sqldf:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")
## Category Frequency
## 1 First 30
## 2 Second 5
## 3 Third 34
Using cast instead of recast (note 'Frequency' is now 'value'):
df <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))
install.packages("reshape")   # if not already installed
library(reshape)
result <- cast(df, Category ~ ., fun.aggregate=sum)
to get:
Category (all)
First 30
Second 5
Third 34