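Choosing an optimized fitness plan with a naive Bayes model
On the first run I got an error saying that dict has no groupby() function. After I switched to df.groupby(), the console filled up with red error output. This is the first time I have run into this. Can anyone help me?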
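For reference, the first error can be reproduced in a few lines. This is only a minimal sketch: the small `raw` dict is a made-up miniature of the training data, not the real data set. It shows that a plain dict of Series has no `groupby`, while the same dict wrapped in a `pd.DataFrame` does.

```python
import pandas as pd

# Hypothetical miniature of the training data, used only to reproduce the error.
raw = {
    "Team": pd.Series(["i100", "i100", "i500"]),
    "Size": pd.Series([12, 11, 12]),
}

# A plain dict has no groupby attribute, which is what the first error reported.
dict_has_groupby = hasattr(raw, "groupby")

# Wrapping the dict in a DataFrame makes groupby available.
frame = pd.DataFrame(raw)
team_sizes = frame.groupby("Team").size()

print(dict_has_groupby)
print(team_sizes.to_dict())
```

So any groupby call in the script has to go through the DataFrame, not through the dict it was built from.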
Training data:
import numpy as np
import pandas as pd

# Build the data set
name = [0, 1, 2, 3, 4, 5, 6, 7]
# Column names must not contain leading spaces, otherwise data_means['Height'] raises a KeyError later
data = {
    "Gender": pd.Series(['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female'], index=name),
    "Height": pd.Series([6.00, 5.92, 5.58, 5.92, 5.00, 5.50, 5.75, 5.42], index=name),
    "Weight": pd.Series([180, 190, 170, 165, 100, 150, 150, 130], index=name),
    "Size": pd.Series([12, 11, 12, 10, 6, 8, 9, 7], index=name),
    "Team": pd.Series(['i100', 'i100', 'i500', 'i100', 'i500', 'i100', 'i100', 'i500'], index=name),
}
data_df = pd.DataFrame(data)
# Compute the share of i100 and i500 (the prior probabilities)
n_i100 = data_df['Team'][data_df['Team'] == 'i100'].count()
n_i500 = data_df['Team'][data_df['Team'] == 'i500'].count()
total_ppl = data_df['Team'].count()
P_i100 = n_i100 * 1.0 / total_ppl
P_i500 = n_i500 * 1.0 / total_ppl
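# reset_index() is a pandas method that resets the index; set_index() sets a column as the index
# df1 holds the number of people for each (Team, Gender) pair, indexed by Team
df1 = (
    data_df.groupby(['Team', 'Gender']).size()
    .rename('cnt').reset_index().set_index('Team')
)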
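# A DataFrame is a data structure similar to an Excel sheet: a two-dimensional table. pd.DataFrame() builds one from a dict or a Series
# df2 holds the number of people in i100 and in i500
df2 = pd.DataFrame(data_df.groupby(['Team']).size().rename('total'))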
# merge() joins two tables on a shared key; here df3 is df1 joined with df2 on the Team index
df3 = df1.merge(df2, left_index=True, right_index=True)
# Conditional probability P(Gender | Team)
df3['p'] = df3['cnt'] * 1.0 / df3['total']
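# Group by team and compute the mean of each numeric column
# (Gender is not numeric, so it is dropped before aggregating)
data_means = data_df.drop(columns='Gender').groupby('Team').mean()
# Group by team and compute the variance of each numeric column
# (groupby must be called on the DataFrame data_df; the dict data has no groupby())
data_variance = data_df.drop(columns='Gender').groupby('Team').var()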
# Means for i100 (.loc returns a scalar instead of a one-element Series)
i100_height_mean = data_means.loc['i100', 'Height']
i100_weight_mean = data_means.loc['i100', 'Weight']
i100_size_mean = data_means.loc['i100', 'Size']
# Variances for i100
i100_height_variance = data_variance.loc['i100', 'Height']
i100_weight_variance = data_variance.loc['i100', 'Weight']
i100_size_variance = data_variance.loc['i100', 'Size']
# Means for i500
i500_height_mean = data_means.loc['i500', 'Height']
i500_weight_mean = data_means.loc['i500', 'Weight']
i500_size_mean = data_means.loc['i500', 'Size']
# Variances for i500
i500_height_variance = data_variance.loc['i500', 'Height']
i500_weight_variance = data_variance.loc['i500', 'Weight']
i500_size_variance = data_variance.loc['i500', 'Size']
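# Conditional probability of the discrete variable, i.e. gender
def p_x_given_y_1(team, gender):
    # Team is the index of df3, not a column, so filter on df3.index
    mask = (df3.index == team) & (df3['Gender'] == gender).to_numpy()
    return df3.loc[mask, 'p'].values[0]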
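# Conditional probability of a continuous variable assumed to be normally distributed
def p_x_given_y_2(x, mean_y, variance_y):
    # Plug the parameters into the normal density:
    # 1 / sqrt(2 * pi * variance) * exp(-(x - mean)^2 / (2 * variance))
    p = 1 / np.sqrt(2 * np.pi * variance_y) * np.exp(-(x - mean_y) ** 2 / (2 * variance_y))
    return p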
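# Build the data for Tom
name1 = [0]
# The Series hold one value each, so they must use name1 (one label), not name (eight labels)
person = {
    "Gender": pd.Series(['female'], index=name1),
    "Height": pd.Series([6.00], index=name1),
    "Weight": pd.Series([130], index=name1),
    "Size": pd.Series([8], index=name1),
}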
# Posterior score 1: Tom belongs to i100
P1 = P_i100 * p_x_given_y_1('i100', person['Gender'][0]) * \
    p_x_given_y_2(person['Height'][0], i100_height_mean, i100_height_variance) * \
    p_x_given_y_2(person['Weight'][0], i100_weight_mean, i100_weight_variance) * \
    p_x_given_y_2(person['Size'][0], i100_size_mean, i100_size_variance)
# Posterior score 2: Tom belongs to i500 (uses the i500 prior, P_i500)
P2 = P_i500 * p_x_given_y_1('i500', person['Gender'][0]) * \
    p_x_given_y_2(person['Height'][0], i500_height_mean, i500_height_variance) * \
    p_x_given_y_2(person['Weight'][0], i500_weight_mean, i500_weight_variance) * \
    p_x_given_y_2(person['Size'][0], i500_size_mean, i500_size_variance)
print(P1, P2)
# Compare the two scores
if P1 > P2:
    print("Tom is a better fit for i100")
else:
    print("Tom is a better fit for i500")
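As a sanity check on the continuous part of the model, here is a minimal standalone sketch of the normal density, using only the standard library. The name `gaussian_pdf` is illustrative and not part of the script above. The density of N(mean, variance) at x is 1 / sqrt(2π·variance) · exp(-(x - mean)² / (2·variance)), so the standard normal at 0 should come out as 1 / sqrt(2π).

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution N(mean, variance) evaluated at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coeff * math.exp(exponent)


# The standard normal density peaks at x = 0 with value 1 / sqrt(2 * pi).
peak = gaussian_pdf(0.0, 0.0, 1.0)

# The density is symmetric around the mean.
left = gaussian_pdf(-1.5, 0.0, 2.0)
right = gaussian_pdf(1.5, 0.0, 2.0)

print(peak, left, right)
```

Note that the second argument pair is the mean and the variance (not the standard deviation), which matches what groupby(...).var() returns.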
The problem I ran into:
grouped = df.groupby(['A', 'B'])
grouped.aggregate(np.sum)
A | B | C | D |
---|---|---|---|
bar | one | 0.078877 | -0.667510 |
bar | three | 0.275751 | 0.685817 |
bar | two | 0.182907 | -0.306387 |
foo | one | -0.690335 | 0.409347 |
foo | three | -0.826608 | 1.170842 |
foo | two | -0.181721 | -2.612407 |
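Note: as the result above shows, after aggregation each group's name becomes part of the new index. Passing as_index=False to groupby() keeps the group keys as ordinary columns instead of turning them into the index.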
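The difference can be seen in a small sketch. The DataFrame below is made-up illustration data, not the table shown above; only the column names A, B and C are borrowed from it.

```python
import pandas as pd

# Hypothetical data with two grouping columns and one value column.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Default: the group keys A and B become a MultiIndex on the result.
indexed = df.groupby(["A", "B"]).sum()

# as_index=False: A and B stay as ordinary columns and the index is 0..n-1.
flat = df.groupby(["A", "B"], as_index=False).sum()

print(indexed)
print(flat)
```

With the flat form, the result can be filtered with expressions like `flat["A"] == "bar"` directly, without going through the index.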