I'm using R to run a binary logistic analysis on a dataset. Training with xgb.train completes and prints the expected output, but when I then call xgb.importance to get the variable importances, there is no error and also no output at all; nothing happens.
library(skimr)    # skim()
library(caret)    # createDataPartition(), dummyVars()
library(xgboost)  # xgb.DMatrix(), xgb.train(), xgb.importance()

boston <- read.csv(file.choose())
skim(boston)

# Fix variable types: convert columns 1, 3, 4 and 21 to factors
for (i in c(1, 3, 4, 21)) {
  boston[, i] <- factor(boston[, i])
}
set.seed(42)
# 85% of rows go to train+validation, the remaining 15% are held out as the test set
trains <- createDataPartition(
  y = boston$Length.of.hospital.stay,
  p = 0.85,
  list = FALSE
)
trains2 <- sample(trains, nrow(boston) * 0.7)  # 70% of all rows for training
valids  <- setdiff(trains, trains2)            # the rest of trains for validation
data_train <- boston[trains2, ]
data_valid <- boston[valids, ]
data_test  <- boston[-trains, ]
table(data_train$Length.of.hospital.stay)
table(data_valid$Length.of.hospital.stay)
table(data_test$Length.of.hospital.stay)
colnames(boston)
# One-hot encode the predictors (columns 1:19); fullRank = TRUE drops one level per factor
dvfunc <- dummyVars(~ ., data = data_train[, 1:19], fullRank = TRUE)
data_trainx <- predict(dvfunc, newdata = data_train[, 1:19])
data_trainy <- ifelse(data_train$Length.of.hospital.stay == "NO", 0, 1)
data_validx <- predict(dvfunc, newdata = data_valid[, 1:19])
data_validy <- ifelse(data_valid$Length.of.hospital.stay == "NO", 0, 1)
data_testx <- predict(dvfunc, newdata = data_test[, 1:19])
data_testy <- ifelse(data_test$Length.of.hospital.stay == "NO", 0, 1)
dtrain <- xgb.DMatrix(data = data_trainx, label = data_trainy)
dvalid <- xgb.DMatrix(data = data_validx, label = data_validy)
dtest  <- xgb.DMatrix(data = data_testx,  label = data_testy)
watchlist <- list(train = dtrain, test = dvalid)
# Train the model
fit_xgb_reg <- xgb.train(
  data = dtrain,
  eta = 0.3,
  gamma = 0.001,
  max_depth = 2,
  subsample = 0.7,
  colsample_bytree = 0.4,
  objective = "binary:logistic",
  nrounds = 1000,
  watchlist = watchlist,
  verbose = 1,
  print_every_n = 100,
  early_stopping_rounds = 200
)
fit_xgb_reg
importance_matrix <- xgb.importance(model = fit_xgb_reg)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)
When the run reaches the model-training step, the log reads:
[1] train-logloss:0.439797 test-logloss:0.439797
Multiple eval metrics are present. Will use test_logloss for early stopping.
Will train until test_logloss hasn't improved in 200 rounds.
[101] train-logloss:0.002339 test-logloss:0.002339
[201] train-logloss:0.002339 test-logloss:0.002339
Stopping. Best iteration:
[21] train-logloss:0.002339 test-logloss:0.002339
which is exactly the output I want.
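Since early stopping fired, it can help to confirm which round the returned model actually kept; a minimal check, assuming the fields that recent R versions of xgboost attach to the model when early_stopping_rounds is used:

fit_xgb_reg$best_iteration  # round with the lowest test_logloss (21 in the log above)
fit_xgb_reg$best_score      # the test_logloss reached at that round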
Running
importance_matrix <- xgb.importance(model = fit_xgb_reg)
gives
Empty data.table (0 rows and 4 cols): Feature,Gain,Cover,Frequency
which suggests the data are missing.
But running the code
importance_matrix <- xgb.importance(model = fit_xgb_reg)
print(importance_matrix)
does display the data:
Feature Gain Cover Frequency
1: Total.blood.loss 0.167833001 0.064426668 0.06392694
2: HB.Decreased.value 0.156806719 0.031960656 0.03196347
3: ALB.Decreased.value 0.088609260 0.055453526 0.05479452
4: BMI 0.079037023 0.133290604 0.13242009
5: Total.blood.volume 0.064410934 0.113359875 0.11415525
6: cost 0.061274356 0.095499871 0.09589041
7: age 0.054263127 0.049857638 0.05022831
8: ALB.Preoperative 0.049471451 0.069283011 0.06849315
9: ALB.. 0.048777618 0.027424782 0.02739726
10: HB..preoperative. 0.043222795 0.072537008 0.07305936
11: weight 0.033926769 0.054270254 0.05479452
12: D.dimer 0.033483713 0.041747298 0.04109589
13: ALB.Postoperative 0.031290170 0.058621242 0.05936073
14: height 0.023252573 0.041057056 0.04109589
15: HB..Postoperative. 0.022152296 0.036533508 0.03652968
16: Prothrombin.time 0.019281749 0.022026106 0.02283105
17: ASA.2 0.012353804 0.009084074 0.00913242
18: surgery.site.1 0.005650289 0.009429195 0.00913242
19: gender.1 0.003094026 0.009577104 0.00913242
20: ASA.1 0.001808328 0.004560526 0.00456621
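For the original "nothing happens" symptom, note that an assignment in R never auto-prints its value, so the bare line importance_matrix <- xgb.importance(model = fit_xgb_reg) is silent by design; you have to display the result explicitly. A minimal sketch of the usual ways to see the table:

importance_matrix <- xgb.importance(model = fit_xgb_reg)   # silent: assignment does not print
importance_matrix                                          # type the name at the console, or
print(importance_matrix)                                   # print explicitly, or
(importance_matrix <- xgb.importance(model = fit_xgb_reg)) # wrap the assignment in parentheses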
Is that the Boston housing price dataset?
1. Output the XGBoost feature importances
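The snippets below use model_XGB, X_train, X_test, y_train and y_test without defining them; a minimal setup sketch so they run end to end (the dataset and split parameters here are placeholders, not from the original post):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Any binary-classification table works here; breast_cancer is just a stand-in
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model_XGB = XGBClassifier()
model_XGB.fit(X_train, y_train)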
from matplotlib import pyplot
# Bar chart of the raw importance scores, one bar per feature
pyplot.bar(range(len(model_XGB.feature_importances_)), model_XGB.feature_importances_)
pyplot.show()
# Alternatively, plot feature importance using XGBoost's built-in function
from xgboost import plot_importance
plot_importance(model_XGB)
pyplot.show()
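One caveat: the bar chart above plots model_XGB.feature_importances_, while plot_importance reads XGBoost's internal importance counters, and depending on the xgboost version the two may default to different importance types (e.g. weight vs. gain), so the rankings need not match.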
2. Select features based on feature importance
from numpy import sort
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Use each importance score in turn as a selection threshold
thresholds = sort(model_XGB.feature_importances_)
for thresh in thresholds:
    # select features whose importance is >= thresh
    selection = SelectFromModel(model_XGB, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train a fresh model on the reduced feature set
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # evaluate it on the correspondingly reduced test set
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))
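Note that the loop retrains a fresh XGBClassifier with default hyperparameters at every threshold, so the printed lines trace the trade-off between the number of retained features (n) and accuracy; a common choice is the smallest n whose accuracy is still acceptable.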
Cause
You are (most likely) trying to plot a vector that consists entirely of missing (NA) values.
Here is an example:
> x=rep(NA,100)
> y=rnorm(100)
> plot(x,y)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
In your case, this means that in your line plot(costs, pseudor2, type="l"), the vector costs is entirely NA. You will have to work out why that is, but that is the explanation of the error.
Solution
Check the data types, and try converting the offending column to numeric.
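A minimal sketch of that check, with df and costs as illustrative names (not from the original code):

str(df$costs)                                   # inspect the current type
# as.numeric() on a factor returns the level codes, not the values,
# so convert to character first
df$costs <- as.numeric(as.character(df$costs))
sum(is.na(df$costs))                            # count values that failed to convert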
Reference: "XGBoost: outputting feature importance and selecting features" (CSDN): https://blog.csdn.net/weixin_40009026/article/details/110773061