CatBoost user-defined loss functions

CatBoost user-defined loss function: how do I fix an indexing error in predict?


In case you haven't solved this problem yet:
  • The CatBoost basics section of this blog post, Training CatBoost on the Titanic data, may solve your problem; you can read the content below carefully, or jump to the original post:
    • Imports
    from catboost import CatBoostClassifier
    from catboost import Pool
    from catboost import cv
    from sklearn.metrics import accuracy_score
    
    • Model training
      Now create the model, using the default parameters. The author considers the defaults a good starting point, so only the metric settings are specified here.

    Building the model

    # Incorrect example: custom_loss only records extra metrics for logging
    model = CatBoostClassifier(
        custom_loss=['Accuracy'],  # logged as an additional metric, not used for model selection
        random_seed=42,
        logging_level='Silent'
    )
    
    # Correct example: use_best_model relies on eval_metric on the eval set
    model = CatBoostClassifier(eval_metric='Accuracy', use_best_model=True, random_seed=42)
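As for the original question, a truly user-defined loss in CatBoost is an object with a calc_ders_range method passed as loss_function=...; an index error at fit or predict time usually means it returned the wrong number of derivative pairs. A minimal Logloss-style sketch of that interface (the class name is my own, and this is an illustration, not the asker's code):

```python
import math

class LoglossObjective:
    """User-defined objective for CatBoost (passed as loss_function=...).

    CatBoost calls calc_ders_range(approxes, targets, weights) and expects
    a list with exactly one (first_derivative, second_derivative) pair per
    object; returning fewer pairs, or indexing targets out of range, is a
    typical cause of index errors during training or prediction.
    """

    def calc_ders_range(self, approxes, targets, weights):
        assert len(approxes) == len(targets)
        result = []
        for i in range(len(targets)):
            p = 1.0 / (1.0 + math.exp(-approxes[i]))  # sigmoid of the raw score
            der1 = targets[i] - p                     # first derivative of log-likelihood
            der2 = -p * (1.0 - p)                     # second derivative
            w = weights[i] if weights is not None else 1.0
            result.append((der1 * w, der2 * w))
        return result
```

It would then be passed as CatBoostClassifier(loss_function=LoglossObjective(), eval_metric='Logloss', ...); note that a custom objective cannot itself be used as eval_metric.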
    

    About logging_level

    CatBoost has several parameters to control verbosity: verbose, silent and logging_level.
    By default logging is verbose, so you see the loss value on every iteration. If you want to see less logging, use one of these parameters. It is not allowed to set two of them simultaneously.
    silent has two possible values: True and False.
    verbose can also be True or False, but it can also be an integer N, in which case logging is printed on every N-th iteration.
    logging_level can be 'Silent', 'Verbose', 'Info' and 'Debug':
    'Silent' means no output to stdout (except for important warnings) and is the same as silent=True or verbose=False.
    'Verbose' is the default logging mode. It is the same as verbose=True or silent=False.
    'Info' prints out the trees that are selected on every iteration.
    'Debug' prints a lot of debug info.
    This parameter can be set in two places:
    1) model creation
    2) fitting of the created model.
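The rules above can be summarized as a small table of equivalent settings (a sketch; choose exactly one option per model, since combining them raises an error):

```python
# Equivalent ways to silence CatBoost training output (choose exactly one;
# setting two of these parameters at once is not allowed):
silent_variants = [
    {'silent': True},
    {'verbose': False},
    {'logging_level': 'Silent'},
]

# Print the loss only on every 50th iteration instead of every iteration:
periodic_logging = {'verbose': 50}
```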

    Training the model

    model.fit(X_train,y_train,cat_features=cate_features_index,eval_set=(X_test,y_test))
    

    Output

    bestTest = 0.8295964126
    bestIteration = 53
    Shrink model to first 54 iterations.
    

    Cross-validating the model

    cv_params = model.get_params()
    cv_params.update({
        'loss_function': 'Logloss'
    })
    cv_data = cv(Pool(X,y,cat_features=cate_features_index),cv_params,plot=True)
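catboost.cv returns a pandas DataFrame of per-iteration metric statistics. A sketch of pulling the best cross-validated score out of it (the values below are made up, and the 'test-Accuracy-mean' column name is an assumption based on the eval_metric='Accuracy' setting used earlier):

```python
import pandas as pd

# Stand-in for the DataFrame returned by catboost.cv:
cv_data = pd.DataFrame({
    'iterations': [0, 1, 2],
    'test-Accuracy-mean': [0.71, 0.79, 0.83],
    'test-Accuracy-std': [0.02, 0.02, 0.01],
})

# Best mean test score across folds and the iteration that achieved it:
best_score = cv_data['test-Accuracy-mean'].max()
best_iter = int(cv_data['test-Accuracy-mean'].idxmax())
print('Best CV accuracy: {:.2f} (iteration {})'.format(best_score, best_iter))
```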
    

    Setting and updating parameters

    params = {
        'iterations': 500,
        'learning_rate': 0.1,
        'eval_metric': 'Accuracy',
        'random_seed': 666,
        'logging_level': 'Silent',
        'use_best_model': False
    }
    params.update({'iterations': 1000})
    print(params)
    # Output
    {'iterations': 1000, 'learning_rate': 0.1, 'eval_metric': 'Accuracy', 'random_seed': 666, 'logging_level': 'Silent', 'use_best_model': False}
    

    Pool is CatBoost's own container for organizing data; numpy arrays and DataFrames also work, but Pool is recommended for its better memory use and speed.
    Using Pool

    train_pool = Pool(X_train,y_train,cat_features=cate_features_index)
    test_pool = Pool(X_test,y_test,cat_features=cate_features_index)
    
    • Using early stopping to prevent overfitting and save training time
    model = CatBoostClassifier(**params)
    model.fit(train_pool,eval_set=test_pool)
    
    earlystop_model_1 = CatBoostClassifier(**params)
    earlystop_model_1.fit(train_pool,eval_set=test_pool, early_stopping_rounds=200, verbose=20)
    

    Comparing the results

    simple_model_accuracy = accuracy_score(y_test, model.predict(X_test))
    model_1_accuracy = accuracy_score(y_test, earlystop_model_1.predict(X_test))
    print('Simple model tree count: {0}'.format(model.tree_count_))  # model.tree_count_ gives the number of trees kept
    print('Simple model validation accuracy: {:.4}'.format(simple_model_accuracy))
    print('Early-stopped model 1 tree count: {}'.format(earlystop_model_1.tree_count_))
    print('Early-stopped model 1 validation accuracy: {:.4}'.format(model_1_accuracy))
    # Output
    bestTest = 0.8385650224
    bestIteration = 171
    Simple model tree count:1000
    Simple model validation accuracy: 0.8161
    Early-stopped model 1 tree count: 372
    Early-stopped model 1 validation accuracy: 0.8296
    

    As shown, early stopping shortens training, helps avoid overfitting, and yields a more accurate model (0.8296 vs. 0.8161 on the validation set).

    • Feature importance and feature selection
    model.fit(train_pool,eval_set=test_pool, early_stopping_rounds=200, verbose=20)
    feature_importances = model.get_feature_importance(train_pool)
    feature_names = X_train.columns
    for score,name in sorted(zip(feature_importances,feature_names),reverse=True):
        print('{}:{}'.format(name,score))
    # Output
    Sex:24.170501191468333
    Age:16.631944921637867
    Fare:13.148998881800521
    Pclass:12.825448667869106
    Embarked:7.7271347452781
    Cabin:6.581032890648493
    Ticket:6.561360471647209
    Parch:6.460657202234844
    SibSp:5.892921027415583
    PassengerId:0.0
    Name:0.0
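Features with zero importance (PassengerId and Name above) are candidates for removal before retraining. A minimal filtering sketch, with scores copied (rounded) from the output above:

```python
# (feature, importance) pairs taken from the feature-importance output:
importances = [
    ('Sex', 24.17), ('Age', 16.63), ('Fare', 13.15), ('Pclass', 12.83),
    ('Embarked', 7.73), ('Cabin', 6.58), ('Ticket', 6.56), ('Parch', 6.46),
    ('SibSp', 5.89), ('PassengerId', 0.0), ('Name', 0.0),
]

# Keep only features that contributed to the model:
selected = [name for name, score in importances if score > 0]
dropped = [name for name, score in importances if score == 0]
print('Dropped:', dropped)  # -> Dropped: ['PassengerId', 'Name']
```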
    
    • Evaluating a trained model on new data (eval_metrics)
      CatBoost provides an eval_metrics method that computes specified metrics for a trained model at every iteration, with optional visualization. It can be used to evaluate a trained model on a new dataset.
    model = CatBoostClassifier(**params).fit(train_pool,eval_set=test_pool, early_stopping_rounds=200, verbose=20)
    eval_metrics = model.eval_metrics(test_pool,['AUC','F1','Logloss'],plot=True)
    
    • Saving and loading a model
    model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
    model.save_model('catboost_model.dump')
    model = CatBoostClassifier()
    model.load_model('catboost_model.dump')
    
    print(model.get_params())
    print(model.random_seed_)
    print(model.learning_rate_)
    

If you have already solved this problem, please consider sharing your solution as a blog post and leaving the link in the comments, to help more people ^-^