朴素贝叶斯模型,groupby()函数出现问题

用朴素贝叶斯模型选取优化健身计划
开始时运行结果是:提示错误dict没有groupby()函数,后来用df.groupby()方法改后飘来一片红,本人第一次遇见这种情况啊,有哪位可以帮帮我吗?
训练数据:

img


代码:

from os import rename

import numpy as np
import pandas as pd

# 建立数据集
name = [0, 1, 2, 3, 4, 5, 6, 7]
data = {
    "Gender": pd.Series(['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female'], index=name),
    " Height": pd.Series([6.00, 5.92, 5.58, 5.92, 5.00, 5.50, 5.75, 5.42], index=name),
    " Weight": pd.Series([180, 190, 170, 165, 100, 150, 150, 130], index=name),
    "Size": pd.Series([12, 11, 12, 10, 6, 8, 9, 7], index=name),
    "Team": pd.Series(['i100', 'i100', 'i500', 'i100', 'i500', 'i100', 'i100', 'i500'])
}
data_df = pd.DataFrame(data)
# 计算i100和i500的占比(先验概率)
n_i100 = data['Team'][data['Team'] == 'i100'].count()
n_i500 = data['Team'][data['Team'] == 'i500'].count()
total_ppl = data['Team'].count()
P_i100 = n_i100 * 1.0 / total_ppl
P_i500 = n_i500 * 1.0 / total_ppl
# reset_index()是pandas库中的一个方法,表示重置索引,set_index()表示建立索引
df1 = data_df.groupby(['Team', 'Gender']).size(). \
    rename('cnt').reset_index().set_index('Team')
# Dataframe是一种数据结构,类似于excel,是一个二维表,pd.Dataframe()的作用是从字典中导入数据
# df2表示的是i100和i500的个数
df2 = pd.DataFrame(data_df.groupby(['Team']).size().rename('total'))
# merge()表示将两个表通过共有的列连接起来,这里的df3就是将df1和df3连接起来
df3 = df1.merge(df2, left_index=True, right_index=True)
df3['p'] = df3['cnt']*1.0/df3['total']
# 数据分组,计算均值
data_means = data_df.groupby('Team').mean()
# 数据分组,计算方差
data_variance = data.groupby('Team').var()
# i00的均值
i100_height_mean = data_means['Height'][data_means.index == 'i100']
i100_weight_mean = data_means['Weight'][data_means.index == 'i100']
i100_size_mean = data_means['Size'][data_means.index == 'i100']
# i100的方差
i100_height_variance = data_variance['Height'][data_variance.index == 'i100']
i100_weight_variance = data_variance['Weight'][data_variance.index == 'i100']
i100_size_variance = data_variance['Size'][data_variance.index == 'i100']
# i500的均值
i500_height_mean = data_means['Height'][data_means.index == 'i500']
i500_weight_mean = data_means['Weight'][data_means.index == 'i500']
i500_size_mean = data_means['Size'][data_means.index == 'i500']
# i500的方差
i500_height_variance = data_variance['Height'][data_variance.index == 'i500']
i500_weight_variance = data_variance['Weight'][data_variance.index == 'i500']
i500_size_variance = data_variance['Size'][data_variance.index == 'i500']
# 计算离散变量,也就是性别的条件概率
def p_x_given_y_1(team,gender):
    return df3['p'][df3['Team'] == team][df3['Gender'] == gender].values[0]
# 对于正态分布的连续性分布计算其条件概率
def p_x_given_y_2(x,mean_y, variance_y):
# 把参数带入概率密度公式
    p = 1/(np.sprt(2*np.pi)*variance_y) * np.exp((-(x-mean_y))**2)/(2*variance_y)
    return p
# 建立Tom数据集
name1 = [0]
person = {
    "Gender": pd.Series(['female'], index=name),
    " Height": pd.Series([6.00], index=name),
    " Weight": pd.Series([130], index=name),
    "Size": pd.Series([8], index=name),
}
# 计算后验概率1
P1 = P_i100 * p_x_given_y_1('i100', person['Gender'][0]) * \
p_x_given_y_2(person['Height'][0], i100_height_mean, i100_height_variance) * \
p_x_given_y_2(person['Weight'][0], i100_weight_mean, i100_weight_variance) * \
p_x_given_y_2(person['Size'][0], i100_size_mean, i100_size_variance)
# 计算后验概率2
P2 = P_i100 * p_x_given_y_1('i500', person['Gender'][0]) * \
p_x_given_y_2(person['Height'][0], i500_height_mean, i500_height_variance) * \
p_x_given_y_2(person['Weight'][0], i500_weight_mean, i500_weight_variance) * \
p_x_given_y_2(person['Size'][0], i500_size_mean, i500_size_variance)
print(P1, P2)
# 进行比较
if(P1>P2):
    print("Tom适合i100")
else:
    print("Tom适合i500")

出现的问题:

img

img

img

  • 这篇文章:pandas中的groupby函数的分组结果怎么保存成DataFrame 也许有你想要的答案,你可以看看
  • 除此之外, 这篇博客: pandas--groupby相关操作中的 2.按A、B两列对df进行分组,并使用聚合函数aggregate对每组求和 部分也许能够解决你的问题, 你可以仔细阅读以下内容或跳转源博客中阅读:
  • grouped=df.groupby(['A','B'])  
    grouped.aggregate(np.sum)
    
    CD
    AB
    barone0.078877-0.667510
    three0.2757510.685817
    two0.182907-0.306387
    fooone-0.6903350.409347
    three-0.8266081.170842
    two-0.181721-2.612407

    注意:通过上面的结果可以看到。聚合完成后每组都有一个组名作为新的索引,使用as_index=False可以忽略组名。