R XGBoost: xgb.importance produces no variable-importance output

Problem description and background

I am using R to run a binary classification (binary:logistic) analysis on some data. Training with xgb.train completes and prints all the expected output, but when I then call xgb.importance to get the variable importances, there is no error and also no output at all.

Relevant code (text, not screenshots)

Binary-classification XGBoost

library(skimr)    # skim()
library(caret)    # createDataPartition(), dummyVars()
library(xgboost)  # xgb.DMatrix(), xgb.train(), xgb.importance()

boston <- read.csv(file.choose())

skim(boston)

# fix variable types: convert these columns to factors
for (i in c(1, 3, 4, 21)) {
  boston[, i] <- factor(boston[, i])
}

set.seed(42)

trains <- createDataPartition(
  y = boston$Length.of.hospital.stay,
  p = 0.85,
  list = FALSE
)

trains2 <- sample(trains, nrow(boston) * 0.7)
valids  <- setdiff(trains, trains2)

data_train <- boston[trains2, ]
data_valid <- boston[valids, ]
data_test  <- boston[-trains, ]

table(data_train$Length.of.hospital.stay)
table(data_valid$Length.of.hospital.stay)
table(data_test$Length.of.hospital.stay)

colnames(boston)
dvfunc <- dummyVars(~ ., data = data_train[, 1:19], fullRank = TRUE)
data_trainx <- predict(dvfunc, newdata = data_train[, 1:19])
data_trainy <- ifelse(data_train$Length.of.hospital.stay == "NO", 0, 1)

data_validx <- predict(dvfunc, newdata = data_valid[, 1:19])
data_validy <- ifelse(data_valid$Length.of.hospital.stay == "NO", 0, 1)

data_testx <- predict(dvfunc, newdata = data_test[, 1:19])
data_testy <- ifelse(data_test$Length.of.hospital.stay == "NO", 0, 1)

dtrain <- xgb.DMatrix(data = data_trainx, label = data_trainy)
dvalid <- xgb.DMatrix(data = data_validx, label = data_validy)
dtest  <- xgb.DMatrix(data = data_testx,  label = data_testy)

watchlist <- list(train = dtrain, test = dvalid)

# train the model
fit_xgb_reg <- xgb.train(
  data = dtrain,
  eta = 0.3,
  gamma = 0.001,
  max_depth = 2,
  subsample = 0.7,
  colsample_bytree = 0.4,
  objective = "binary:logistic",
  nrounds = 1000,
  watchlist = watchlist,
  verbose = 1,
  print_every_n = 100,
  early_stopping_rounds = 200
)

fit_xgb_reg

importance_matrix <- xgb.importance(model = fit_xgb_reg)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)

Run results and error output

When the training step runs, the output is
[1] train-logloss:0.439797 test-logloss:0.439797
Multiple eval metrics are present. Will use test_logloss for early stopping.
Will train until test_logloss hasn't improved in 200 rounds.

[101] train-logloss:0.002339 test-logloss:0.002339
[201] train-logloss:0.002339 test-logloss:0.002339
Stopping. Best iteration:
[21] train-logloss:0.002339 test-logloss:0.002339
which is exactly what I expect.
Running importance_matrix <- xgb.importance(model = fit_xgb_reg) then returns
Empty data.table (0 rows and 4 cols): Feature,Gain,Cover,Frequency
i.e., the importance table comes back empty.
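
A quick way to check whether the booster actually contains any splits (an empty importance table usually means every tree ended up as a single leaf) is to inspect the labels and dump the trees. A minimal diagnostic sketch, assuming the objects from the code above:

# sanity check: both classes 0 and 1 should be present in the labels
table(data_trainy)
# text dump of the trees; trees consisting only of "leaf=" lines contain no splits
tree_dump <- xgb.dump(fit_xgb_reg)
head(tree_dump, 20)
# passing feature names explicitly also rules out a name-mapping problem
xgb.importance(feature_names = colnames(data_trainx), model = fit_xgb_reg)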

Expected result

Running
importance_matrix <- xgb.importance(model = fit_xgb_reg)
print(importance_matrix)
should print a table like:
Feature Gain Cover Frequency
1: Total.blood.loss 0.167833001 0.064426668 0.06392694
2: HB.Decreased.value 0.156806719 0.031960656 0.03196347
3: ALB.Decreased.value 0.088609260 0.055453526 0.05479452
4: BMI 0.079037023 0.133290604 0.13242009
5: Total.blood.volume 0.064410934 0.113359875 0.11415525
6: cost 0.061274356 0.095499871 0.09589041
7: age 0.054263127 0.049857638 0.05022831
8: ALB.Preoperative 0.049471451 0.069283011 0.06849315
9: ALB.. 0.048777618 0.027424782 0.02739726
10: HB..preoperative. 0.043222795 0.072537008 0.07305936
11: weight 0.033926769 0.054270254 0.05479452
12: D.dimer 0.033483713 0.041747298 0.04109589
13: ALB.Postoperative 0.031290170 0.058621242 0.05936073
14: height 0.023252573 0.041057056 0.04109589
15: HB..Postoperative. 0.022152296 0.036533508 0.03652968
16: Prothrombin.time 0.019281749 0.022026106 0.02283105
17: ASA.2 0.012353804 0.009084074 0.00913242
18: surgery.site.1 0.005650289 0.009429195 0.00913242
19: gender.1 0.003094026 0.009577104 0.00913242
20: ASA.1 0.001808328 0.004560526 0.00456621

Is that the Boston housing-price dataset?


1. Output XGBoost feature importances
from matplotlib import pyplot

# model_XGB is assumed to be an already-fitted XGBClassifier
pyplot.bar(range(len(model_XGB.feature_importances_)), model_XGB.feature_importances_)
pyplot.show()

# Alternatively, plot feature importance with XGBoost's built-in function
from xgboost import plot_importance

plot_importance(model_XGB)
pyplot.show()
# 2. Select features based on feature importance

from numpy import sort
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# fit a model using each importance value as a threshold
thresholds = sort(model_XGB.feature_importances_)
for thresh in thresholds:
    # select the features whose importance is at least the threshold
    selection = SelectFromModel(model_XGB, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train a new model on the reduced feature set
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # evaluate the reduced model on the test set
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%"
          % (thresh, select_X_train.shape[1], accuracy * 100.0))

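For the R code in the question, the same selection idea can be sketched directly from the xgb.importance() table. A rough illustrative sketch using the objects from the question; the 0.05 Gain cutoff is arbitrary, not a recommendation:

# keep only features whose Gain passes the cutoff, then retrain on the reduced matrix
imp  <- xgb.importance(model = fit_xgb_reg)
keep <- imp$Feature[imp$Gain >= 0.05]
dtrain_sel <- xgb.DMatrix(data = data_trainx[, keep, drop = FALSE],
                          label = data_trainy)
fit_sel <- xgb.train(data = dtrain_sel, objective = "binary:logistic",
                     max_depth = 2, eta = 0.3, nrounds = 100)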

Cause
You may (possibly) be trying to plot a vector that consists entirely of missing (NA) values.

Here is an example:

> x=rep(NA,100)
> y=rnorm(100)
> plot(x,y)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

In your case, this means that in the line plot(costs, pseudor2, type="l"), costs is entirely NA. You will have to work out why that is so, but that is the explanation of the error.

**Solution**
Check the data types, and try converting them to numeric.
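
A minimal sketch of that check, assuming the data_trainx matrix built in the question:

# predict(dummyVars, ...) should already return a numeric matrix
str(data_trainx)
storage.mode(data_trainx)               # expect "double"
storage.mode(data_trainx) <- "double"   # coerce in place if anything slipped through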

Reference:

python xgboost variable importance — XGBoost: outputting feature importances and selecting features by importance
https://blog.csdn.net/weixin_40009026/article/details/110773061