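The data originally had 100 dimensions. I reduced it to 20 dimensions with LLE before using it as the input variables. I also tried training on the original data and got similar results. As the results show, the predicted values have almost no relationship with the true values: whatever the true value is, the prediction is nearly a constant.
This is the code I used, taking the AdaBoost model as an example. I also tried RF, GBRT, and other models, and they all behave the same way. Is this a problem with the model training or with the dataset?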
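# input data
import numpy as np
import pandas as pd
X2 = pd.read_excel('n=20.xlsx', engine='openpyxl')
# assuming target.xlsx holds a single column, squeeze it into a 1-D Series as scikit-learn regressors expect
y = pd.read_excel('target.xlsx', engine='openpyxl').squeeze('columns')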
# randomly splitting the data into training and testing sets (80%-20%)
from sklearn.model_selection import train_test_split
X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=0.20, random_state=10)
# standardizing the features; the scaler is fitted on the training set only so no test-set statistics leak in
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X2_train = scaler.fit_transform(X2_train)
X2_test = scaler.transform(X2_test)
# building the ensemble learning model
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score
regr_3 = AdaBoostRegressor(n_estimators=3, learning_rate=0.1, random_state=90)
# 10-fold cross-validation on the training set
scores = cross_val_score(regr_3, X2_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)
print('10-fold mean RMSE:', np.mean(np.sqrt(-scores)))
# training the model
regr_3.fit(X2_train, y_train)
# predicting the results
Z1 = regr_3.predict(X2_train)
Z2 = regr_3.predict(X2_test)
# plotting the scatter for the training and testing sets
import matplotlib.pyplot as plt
xx = np.linspace(-0.001,0.02,100)
yy = xx
plt.figure()
plt.plot(xx, yy, c='k', linewidth=2)
plt.scatter(y_train, Z1, marker='s')
plt.scatter(y_test, Z2, marker='o')
plt.grid()
plt.legend(['y=x', 'Training set', 'Testing set'], loc = 'upper left', fontsize=13)
plt.tick_params (axis = 'both', which = 'major')
plt.axis('tight')
plt.xlabel('real', fontsize=18)
plt.ylabel('predicted', fontsize=18)
plt.tick_params(labelsize=14)
plt.title('adaboost', fontsize=18)
plt.tight_layout()
plt.show()
I'd suggest running a correlation analysis first to determine whether the problem lies in the dataset.
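One possible way to run that check is sketched below. It is only an illustration: the helper name `feature_target_report` and the synthetic `X_demo` / `y_demo` data are made up here and stand in for the poster's `X2` and `y`. It ranks each feature by Pearson correlation, Spearman correlation, and mutual information with the target. If every score is close to zero, the features probably carry little signal, and no choice of model is likely to help.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def feature_target_report(X, y, random_state=0):
    """Score every feature against the target (hypothetical helper)."""
    X = pd.DataFrame(X)
    y = pd.Series(np.asarray(y).ravel(), index=X.index)
    report = pd.DataFrame(
        {
            # linear association
            "pearson": X.corrwith(y),
            # monotonic (rank-based) association
            "spearman": X.corrwith(y, method="spearman"),
            # general, possibly non-linear dependence
            "mutual_info": mutual_info_regression(
                X.values, y.values, random_state=random_state
            ),
        },
        index=X.columns,
    )
    return report.sort_values("mutual_info", ascending=False)


# Synthetic stand-in data: only feature "f0" actually drives the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 5)), columns=[f"f{i}" for i in range(5)]
)
y_demo = 2.0 * X_demo["f0"] + 0.1 * rng.normal(size=300)

report = feature_target_report(X_demo, y_demo)
print(report)
```

On the real data, the call would be `feature_target_report(X2, y)`. Mutual information is included because Pearson and Spearman can both miss non-monotonic relationships.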
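The model probably hasn't converged yet. Has the loss on the test set actually come down? If not, the predictions are meaningless, because the model isn't properly trained yet.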
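A rough way to check this for AdaBoost is to track the error after each boosting stage with `staged_predict` and compare it against a trivial predict-the-mean baseline. The sketch below is an assumption-laden illustration, not the poster's setup: the data is synthetic, `staged_rmse` is a made-up helper, and `n_estimators=100` is chosen only so the curve has enough points to look at.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def staged_rmse(model, X, y):
    """RMSE of the ensemble after each boosting stage (hypothetical helper)."""
    return [
        float(np.sqrt(mean_squared_error(y, pred)))
        for pred in model.staged_predict(X)
    ]


# Synthetic stand-in data with a learnable signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

model = AdaBoostRegressor(n_estimators=100, learning_rate=0.1, random_state=90)
model.fit(X_tr, y_tr)
train_curve = staged_rmse(model, X_tr, y_tr)
test_curve = staged_rmse(model, X_te, y_te)

# Baseline that always predicts the training mean, i.e. a constant.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_rmse = float(np.sqrt(mean_squared_error(y_te, baseline.predict(X_te))))

print("stages fitted:", len(test_curve))
print("test RMSE, first stage:", test_curve[0])
print("test RMSE, last stage:", test_curve[-1])
print("baseline test RMSE:", baseline_rmse)
```

If the test RMSE on the real data never drops below the baseline, the model is effectively predicting a constant, which matches the flat scatter plot described in the question. If it is still falling at the last stage, more estimators may help.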