Using the data(iris) command to load the data and analyze the different iris species

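1. In the iris data, what is the sample size of virginica?

2. Treating each petal as a rectangle, do the petal areas of the three iris species differ significantly, and which species has the largest petal area? Show the difference in a box plot, using a different color for each species. Save the result as a JPG image with a width:height ratio of 4:3.

3. How strongly are sepal length and petal length correlated in each of the three species? Are all of the correlations significant? Which species shows the strongest correlation?

4. Do the sepal lengths (Sepal.Length) of setosa and versicolor differ significantly, and which species has the longer sepals?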

The answers below follow your questions in order.
1. After loading the iris dataset with data(iris), you can use the table() function to see how many samples each species has:

table(iris$Species)

setosa versicolor  virginica 
   50         50          50

So the sample size of virginica is 50.

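2. Petal.Area is not a column of iris, so compute it first as length × width (the rectangle assumption), then draw the box plot with boxplot(), giving each species its own color. For example:

iris$Petal.Area <- iris$Petal.Length * iris$Petal.Width
boxplot(Petal.Area ~ Species, data = iris, col = c("#0000FF", "#00FF00", "#FF0000"))

This gives a box plot in which each species is drawn in a different color.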
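The box plot shows the differences but does not test them. In R, a one-way ANOVA such as summary(aov(I(Petal.Length * Petal.Width) ~ Species, data = iris)) would do that directly. As a rough illustration of what such a test computes, here is a minimal plain-Python sketch of the F statistic. The helper name `one_way_f` and the toy (length, width) pairs are made up for the example and are not iris measurements.

```python
from statistics import mean


def one_way_f(groups):
    """One-way ANOVA F statistic for a list of samples (hypothetical helper)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy (length, width) pairs per group; area = length * width (rectangle assumption)
toy = {
    "A": [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],
    "B": [(2.0, 1.0), (3.0, 1.0), (4.0, 1.0)],
    "C": [(3.0, 1.0), (4.0, 1.0), (5.0, 1.0)],
}
areas = [[length * width for length, width in pairs] for pairs in toy.values()]
f_stat = one_way_f(areas)
print(f_stat)  # F = 3.0 for these toy numbers
```

A large F relative to the F distribution with (k - 1, N - k) degrees of freedom indicates that at least one group mean differs; the p-value lookup is left out of this sketch.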

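To save the result as a JPG image, you can use the dev.copy() and dev.off() functions, for example:

dev.copy(jpeg, "boxplot.jpg", width=800, height=600)
dev.off()

This gives a JPG image with width:height = 4:3. (jpeg() measures width and height in pixels by default, so width=4, height=3 would produce a 4×3-pixel file.)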

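3. plot() shows the relationship between sepal length and petal length, and cor.test() run separately for each species gives both the correlation coefficient and its p-value. (cor() on the whole data frame pools the three species and returns a single coefficient.) For example:

plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)
by(iris, iris$Species, function(d) cor.test(d$Sepal.Length, d$Petal.Length))

This gives the scatter plot and the per-species correlation results.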
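To make the significance question concrete, here is a minimal plain-Python sketch of Pearson's r and the t statistic that R's cor.test() bases its p-value on, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The helper names `pearson_r` and `r_to_t` and the toy numbers are assumptions for illustration, not iris values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Toy measurements for one hypothetical group
sepal = [5.0, 5.5, 6.0, 6.5, 7.0]
petal = [1.0, 1.6, 1.9, 2.7, 3.0]
r = pearson_r(sepal, petal)
t = r_to_t(r, len(sepal))
print(round(r, 3), round(t, 2))
```

Running the same computation once per species, rather than on the pooled data, is what answers the "which species correlates most strongly" question.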

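4. To compare the sepal lengths of setosa and versicolor, you can use the t.test() function, for example:

t.test(iris$Sepal.Length[iris$Species == "setosa"], iris$Sepal.Length[iris$Species == "versicolor"])

This gives the test result for the difference between the two species' sepal lengths.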
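R's t.test() defaults to Welch's test, which does not assume equal variances. A minimal plain-Python sketch of the statistic and its degrees of freedom follows. The helper name `welch_t` and the toy samples are illustrative assumptions, and the p-value step (a t-distribution lookup) is left out.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test (hypothetical helper)."""
    # Squared standard errors of the two sample means
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy samples, not iris data
group_a = [1.0, 2.0, 3.0, 4.0]
group_b = [2.0, 3.0, 4.0, 5.0]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 4), round(dof, 2))
```

The sign of t shows which group has the larger mean: a negative value here means the first sample's mean is smaller than the second's.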

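1. In the iris dataset, the sample size of virginica is 50.

2. A box plot can be drawn with the boxplot function, using a different color for each species. The box plot suggests, and a test such as aov() confirms, that the petal areas of the three species differ significantly; virginica has the largest petal area. To write the plot to a JPG image with width:height = 4:3, use the jpeg() device in R (print() does not write image files in R).

3. In R, cor.test computes the correlation between sepal length and petal length for each species (corrcoef is a MATLAB function name). The correlation is strongest in virginica (about 0.86), followed by versicolor (about 0.75); in setosa it is weak (about 0.27) and not significant at the 0.05 level.

4. In R, t.test compares the sepal lengths (Sepal.Length) of setosa and versicolor (ttest2 is a MATLAB function name). The difference is significant, and versicolor has the longer sepals.

The above answers are for reference only!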

Rough summary statistics:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
import warnings

# Set the seaborn plotting style.

sns.set(style="darkgrid", font_scale=1.2)
# Set a font with Chinese glyphs (only needed if labels contain Chinese text)
plt.rcParams["font.family"] = "SimHei"
# Whether to use the Unicode minus sign; False falls back to the ASCII hyphen
plt.rcParams["axes.unicode_minus"] = False

# Ignore warnings.

warnings.filterwarnings("ignore")
# Load the iris dataset
iris = load_iris()
# iris.data: the feature matrix of the iris dataset
print(iris.data[:10])
# iris.target: the class of each flower (values 0, 1, 2)
print(iris.target[::20])
# iris.feature_names: names of the feature columns
print(iris.feature_names)
# iris.target_names: names of the iris classes
print(iris.target_names)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
[0 0 0 1 1 2 2 2]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']
# Merge the features with their class labels into complete records.
# Note that np.concatenate takes the arrays as one list: [iris.data, iris.target.reshape(-1, 1)]

data = np.concatenate([iris.data, iris.target.reshape(-1, 1)], axis=1)
data = pd.DataFrame(data,
                    columns=["sepal length", "sepal width", "petal length", "petal width", "type"])
data.sample(10)  # sample(): draw random rows

Output:

frequency = data["type"].value_counts()
print(frequency)
# Compute the percentage of each class
percentage = frequency * 100 / len(data)
print(percentage)
2.0    50
1.0    50
0.0    50
Name: type, dtype: int64
2.0    33.333333
1.0    33.333333
0.0    33.333333
Name: type, dtype: float64
# Draw a bar chart of the class counts
frequency.plot(kind="bar")
# Or draw it with seaborn
sns.countplot(x="type", data=data)

Box plot analysis:

# boxplot: draws a box plot; figsize: sets the figure size.
# Uses the `data` frame and column names defined above.
data.boxplot(column="petal length", by="type", grid=False, figsize=(6, 6))

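Distribution shape:
Skewness measures the direction and degree of asymmetry of a distribution. It is also called the coefficient of skewness.
It characterizes how asymmetric the probability density curve is about the mean; intuitively, it reflects the relative lengths of the two tails of the density curve.
A normal distribution has skewness 0, with tails of equal length on both sides. Let bs denote the skewness.
bs < 0: the distribution is negatively skewed, also called left-skewed. Fewer observations lie to the left of the mean than to the right, and the left tail is longer than the right, because a small number of very small values stretch the left tail of the curve.
bs > 0: the distribution is positively skewed, also called right-skewed. Fewer observations lie to the right of the mean than to the left, and the right tail is longer than the left, because a small number of very large values stretch the right tail of the curve.
bs close to 0: the distribution can be regarded as symmetric.
When a distribution may deviate from normality in its skewness, the skewness can be used to test for normality. For a right-skewed distribution the usual ordering is mean > median > mode; for a left-skewed distribution it is reversed, mode > median > mean. For a normal distribution the three are equal.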
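As a small check on the sign convention, the moment-based skewness g1 = m3 / m2^(3/2) can be computed directly. This is a minimal sketch in plain Python; the helper name `skewness` and the toy lists are assumptions for illustration. pandas' Series.skew() applies a bias correction, so its values differ slightly from g1 on small samples.

```python
def skewness(xs):
    """Moment-based sample skewness g1 = m3 / m2**1.5 (no bias correction)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5


# Toy data: one small outlier, one large outlier, and a symmetric list
left_tailed = [1, 8, 9, 9, 10, 10, 10]
right_tailed = [1, 1, 1, 2, 2, 3, 10]
symmetric = [1, 2, 3, 4, 5]

for name, xs in [("left", left_tailed), ("right", right_tailed), ("symmetric", symmetric)]:
    print(name, round(skewness(xs), 3))
```

The left-tailed list yields a negative value, the right-tailed list a positive one, and the symmetric list zero.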

# Build left-skewed data.
t1 = np.random.randint(1, 11, size=100)
t2 = np.random.randint(11, 21, size=500)
t3 = np.concatenate([t1, t2])
left_skew = pd.Series(t3)
# Build right-skewed data.
t1 = np.random.randint(1, 11, size=500)
t2 = np.random.randint(11, 21, size=100)
t3 = np.concatenate([t1, t2])
right_skew = pd.Series(t3)
# Compute the skewness.
print(left_skew.skew(), right_skew.skew())
# Draw kernel density plots (fill replaces the deprecated shade argument).
sns.kdeplot(left_skew, fill=True, label="left-skewed")
sns.kdeplot(right_skew, fill=True, label="right-skewed")
plt.legend()

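Kurtosis:
Kurtosis describes how heavy the tails (and how sharp the peak) of a distribution are. It is the ratio of the fourth central moment to the fourth power of the standard deviation; subtracting 3 gives the excess kurtosis, which is what pandas kurt() reports.
For a normal distribution, the excess kurtosis is 0.
Kurtosis > 0: heavier tails and a sharper peak than a normal distribution with the same variance. It does not by itself imply a smaller variance, since kurtosis is unaffected by scale.
Kurtosis < 0: the opposite, lighter tails and a flatter peak.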
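The definition can be checked with a minimal plain-Python sketch of excess kurtosis, g2 = m4 / m2^2 - 3. The helper name `excess_kurtosis` and the toy data are assumptions for illustration; pandas' Series.kurt() adds a small-sample bias correction, so it will not match g2 exactly. Rescaling the data changes the variance but leaves the kurtosis unchanged.

```python
def excess_kurtosis(xs):
    """Moment-based excess kurtosis g2 = m4 / m2**2 - 3 (no bias correction)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - m) ** 4 for x in xs) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Toy data: a two-point distribution (no tails) and one with two extreme values
two_point = [-1.0, 1.0] * 50
heavy_tailed = [0.0] * 98 + [-10.0, 10.0]
rescaled = [10.0 * x for x in heavy_tailed]  # 100x the variance, same shape

print(excess_kurtosis(two_point))
print(excess_kurtosis(heavy_tailed), excess_kurtosis(rescaled))
```

The two-point data give a negative value, the data with extreme values give a positive one, and the rescaled copy gives the same kurtosis as the original despite its larger variance.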

# Standard normal distribution.
standard_normal = pd.Series(np.random.normal(0, 1, size=10000))
print("Standard normal kurtosis:", standard_normal.kurt(), "std:", standard_normal.std())
print("Sepal width kurtosis:", data["sepal width"].kurt(), "std:", data["sepal width"].std())
print("Petal length kurtosis:", data["petal length"].kurt(), "std:", data["petal length"].std())
sns.kdeplot(standard_normal, label="standard normal")
sns.kdeplot(data["sepal width"], label="sepal width")
sns.kdeplot(data["petal length"], label="petal length")
plt.legend()  # show the curve labels

Standard normal kurtosis: 0.03850702132794659 std: 1.0042152792418642
Sepal width kurtosis: 0.2282490424681929 std: 0.435866284936698
Petal length kurtosis: -1.4021034155217518 std: 1.7652982332594667