CatBoost with a user-defined loss function: how to fix the predict index error
from catboost import CatBoostClassifier
from catboost import Pool
from catboost import cv
from sklearn.metrics import accuracy_score
Building the model
# Wrong approach: custom_loss only adds Accuracy as an extra tracked metric;
# it does not change the loss function or drive model selection
model = CatBoostClassifier(
    custom_loss=['Accuracy'],
    random_seed=42,
    logging_level='Silent'
)
# Correct approach: use eval_metric together with use_best_model so the best iteration is kept
model = CatBoostClassifier(eval_metric='Accuracy', use_best_model=True, random_seed=42)
A note on logging_level: CatBoost has several parameters to control verbosity: verbose, silent and logging_level.
By default logging is verbose, so you see the loss value on every iteration. If you want less logging, use one of these parameters. Setting two of them simultaneously is not allowed.
silent has two possible values: True and False.
verbose can also be True or False, but it can additionally be an integer N, in which case logging is printed on every N-th iteration.
logging_level can be 'Silent', 'Verbose', 'Info' or 'Debug':
'Silent' means no output to stdout (except for important warnings) and is the same as silent=True or verbose=False.
'Verbose' is the default logging mode. It is the same as verbose=True or silent=False.
'Info' prints out the trees that are selected on every iteration.
'Debug' prints a lot of debug info.
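The "every N-th iteration" behavior of verbose=N can be illustrated with a plain-Python sketch (no CatBoost required; `train_loop` is a made-up name for illustration, and whether iteration 0 is logged is an assumption of this sketch):

```python
def train_loop(iterations, verbose):
    """Collect the iteration numbers that would be logged with verbose=N."""
    logged = []
    for i in range(iterations):
        # With verbose=N, the loss line is printed on every N-th iteration
        if verbose and i % verbose == 0:
            logged.append(i)
    return logged

print(train_loop(500, 100))  # [0, 100, 200, 300, 400]
```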
This parameter can be set in two places:
1) model creation
2) fitting of the created model.
Training the model
model.fit(X_train,y_train,cat_features=cate_features_index,eval_set=(X_test,y_test))
Output
bestTest = 0.8295964126
bestIteration = 53
Shrink model to first 54 iterations.
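Note the off-by-one in that log line: iterations are 0-indexed, so bestIteration = 53 means the shrunk model keeps trees 0 through 53, i.e. 54 trees. A tiny sketch of the arithmetic (the helper name is hypothetical):

```python
def trees_kept(best_iteration):
    # Iterations are 0-indexed, so the shrunk model keeps best_iteration + 1 trees
    return best_iteration + 1

print(trees_kept(53))  # 54, matching "Shrink model to first 54 iterations"
```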
Validating the model with cross-validation
cv_params = model.get_params()
cv_params.update({
'loss_function': 'Logloss'
})
cv_data = cv(Pool(X,y,cat_features=cate_features_index),cv_params,plot=True)
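`cv` returns a table of per-iteration metrics averaged over the folds. A minimal, CatBoost-free sketch of how you would pick the best iteration from such results (the dict below is fabricated sample data; only the column name `test-Logloss-mean` mirrors CatBoost's naming):

```python
# Fabricated cv results shaped like CatBoost's output columns
cv_data = {'test-Logloss-mean': [0.62, 0.55, 0.51, 0.49, 0.50]}

losses = cv_data['test-Logloss-mean']
best_value = min(losses)                  # lowest mean Logloss across folds
best_iteration = losses.index(best_value) # iteration where it was reached

print(best_value, best_iteration)  # 0.49 3
```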
Setting and updating parameters
params = {
'iterations': 500,
'learning_rate': 0.1,
'eval_metric': 'Accuracy',
'random_seed': 666,
'logging_level': 'Silent',
'use_best_model': False
}
params.update({'iterations': 1000})
print(params)
# Output
{'iterations': 1000, 'learning_rate': 0.1, 'eval_metric': 'Accuracy', 'random_seed': 666, 'logging_level': 'Silent', 'use_best_model': False}
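One caveat: `params.update(...)` mutates the dict in place, so the original 500-iteration config is gone. If you want to keep several parameter variants around, dict unpacking builds a modified copy instead (plain Python, nothing CatBoost-specific):

```python
base_params = {'iterations': 500, 'learning_rate': 0.1}

# Build a modified copy; base_params is left untouched
long_run = {**base_params, 'iterations': 1000}

print(base_params['iterations'])  # 500
print(long_run['iterations'])     # 1000
```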
Pool is CatBoost's data structure for organizing training data. Numpy arrays and pandas DataFrames also work, but Pool is recommended: it is better in both memory usage and speed.
Using Pool
train_pool = Pool(X_train,y_train,cat_features=cate_features_index)
test_pool = Pool(X_test,y_test,cat_features=cate_features_index)
model = CatBoostClassifier(**params)
model.fit(train_pool,eval_set=test_pool)
earlystop_model_1 = CatBoostClassifier(**params)
earlystop_model_1.fit(train_pool,eval_set=test_pool, early_stopping_rounds=200, verbose=20)
Comparing the results
simple_model_accuracy = accuracy_score(y_test, model.predict(X_test))
earlystop_model_1_accuracy = accuracy_score(y_test, earlystop_model_1.predict(X_test))
print('Simple model tree count: {0}'.format(model.tree_count_))  # model.tree_count_ gives the number of trees in the final model
print('Simple model validation accuracy: {:.4}'.format(simple_model_accuracy))
print('Early-stopped model 1 tree count: {}'.format(earlystop_model_1.tree_count_))
print('Early-stopped model 1 validation accuracy: {:.4}'.format(earlystop_model_1_accuracy))
# Output
bestTest = 0.8385650224
bestIteration = 171
Simple model tree count:1000
Simple model validation accuracy: 0.8161
Early-stopped model 1 tree count: 372
Early-stopped model 1 validation accuracy: 0.8296
As the output shows, early stopping shortens training, helps avoid overfitting, and yields a more accurate model.
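The logic behind early_stopping_rounds can be sketched in plain Python: stop once the eval metric has not improved for N consecutive rounds (the function name and the toy scores are made up for illustration):

```python
def early_stop_iteration(scores, patience):
    """Return the index training would stop at, given per-iteration
    accuracy-like scores (higher is better) and an early-stopping patience."""
    best, best_i = float('-inf'), 0
    for i, s in enumerate(scores):
        if s > best:
            best, best_i = s, i          # new best iteration
        elif i - best_i >= patience:
            return i                     # no improvement for `patience` rounds: stop
    return len(scores) - 1               # never triggered: train to the end

scores = [0.70, 0.78, 0.80, 0.79, 0.79, 0.79, 0.78]
print(early_stop_iteration(scores, patience=3))  # 5
```

The best model is still the one at the peak (index 2 here); CatBoost keeps it when use_best_model is set.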
# Refit the model with early stopping before inspecting feature importance
model.fit(train_pool, eval_set=test_pool, early_stopping_rounds=200, verbose=20)
feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}:{}'.format(name, score))
# Output
Sex:24.170501191468333
Age:16.631944921637867
Fare:13.148998881800521
Pclass:12.825448667869106
Embarked:7.7271347452781
Cabin:6.581032890648493
Ticket:6.561360471647209
Parch:6.460657202234844
SibSp:5.892921027415583
PassengerId:0.0
Name:0.0
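The sorted(zip(importances, names), reverse=True) trick above works because Python tuples compare element-wise: pairs are ranked by the importance score first, with names breaking ties. A self-contained sketch with made-up numbers:

```python
feature_importances = [24.2, 16.6, 0.0, 0.0]
feature_names = ['Sex', 'Age', 'PassengerId', 'Name']

# Tuples sort by their first element (the score); reverse=True gives descending order
ranked = sorted(zip(feature_importances, feature_names), reverse=True)
for score, name in ranked:
    print('{}:{}'.format(name, score))
```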
model = CatBoostClassifier(**params).fit(train_pool,eval_set=test_pool, early_stopping_rounds=200, verbose=20)
eval_metrics = model.eval_metrics(test_pool,['AUC','F1','Logloss'],plot=True)
model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
model.save_model('catboost_model.dump')
model = CatBoostClassifier()
model.load_model('catboost_model.dump')
print(model.get_params())
print(model.random_seed_)
print(model.learning_rate_)