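# In a channel-marketing scenario, the treatment and control groups are defined by whether an SMS reached the user. 5.28 million users were selected for the treatment group and 5.28 million for the control group, but 50% of the treatment group was lost before delivery, so only 50% were actually reached. I have tried two setups (A: proportionally downsample the 5.28M selected treatment users and the 5.28M control users; B: proportionally downsample the 2.64M treatment users actually reached and the 5.28M control users).
# The sample composition is as follows: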
Group                  Sample size  Positives  Negatives  Pos:neg ratio  Total modeling samples
Treatment              5.27M        2474       15834      1:6.4
Control (not reached)  5.27M        1046       25104      1:24
Total                               3520       40938      1:11.6         44458
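The table shows that negatives were downsampled at different rates in the two arms (1:6.4 versus 1:24). The sketch below only works through that arithmetic: it compares the application rate in the full population with the rate in the modeling sample for each arm. The population size of 5,270,000 per arm is an assumption read off the "5.27M" column, and the dictionary keys are illustrative names. Whether this sampling difference explains the ranking problem is not established here; it is one thing worth ruling out.

```python
# Counts copied from the sample table; 5_270_000 approximates "5.27M" per arm (assumption).
arms = {
    "treated": {"population": 5_270_000, "pos": 2474, "neg_sampled": 15834},
    "control": {"population": 5_270_000, "pos": 1046, "neg_sampled": 25104},
}

summary = {}
for name, a in arms.items():
    neg_population = a["population"] - a["pos"]
    summary[name] = {
        # Application rate before any downsampling
        "population_rate": a["pos"] / a["population"],
        # Application rate the model actually sees in training
        "sample_rate": a["pos"] / (a["pos"] + a["neg_sampled"]),
        # Share of the arm's negatives that survived downsampling
        "neg_keep_fraction": a["neg_sampled"] / neg_population,
    }

for name, s in summary.items():
    print(f"{name}: population rate {s['population_rate']:.4%}, "
          f"modeling-sample rate {s['sample_rate']:.2%}, "
          f"negatives kept {s['neg_keep_fraction']:.3%}")

gap_population = summary["treated"]["population_rate"] - summary["control"]["population_rate"]
gap_sample = summary["treated"]["sample_rate"] - summary["control"]["sample_rate"]
print(f"treated-control gap: population {gap_population:.4%}, modeling sample {gap_sample:.2%}")
```

If the two arms keep different fractions of their negatives, the gap between arms in the modeling sample mixes the real treatment effect with the sampling difference, so scores trained on it are not on the population scale.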
# Part of the script:
import pandas as pd

# Pass 1: score every user as if the SMS had reached them (if_arrive = 1).
# .copy() keeps the observed if_arrive in yangben_all from being overwritten.
model2 = yangben_all.copy()
model2['if_arrive'] = 1
for col in predictors:
    if model2[col].dtype == 'object':
        # Decode GBK-encoded bytes, then cast to category for the model
        model2[col] = model2[col].apply(lambda x: x.decode('gbk') if isinstance(x, bytes) else x)
        model2[col] = model2[col].astype('category')
model2['score'] = gbm1.predict(model2[predictors])
result_df_a = model2[['user_id', 'score']].copy()
print(result_df_a.shape)
# Pass 2: score every user as if the SMS had not reached them (if_arrive = 0).
# .copy() keeps this pass from overwriting the columns of yangben_all.
model3 = yangben_all.copy()
model3['if_arrive'] = 0
for col in predictors:
    if model3[col].dtype == 'object':
        # Decode GBK-encoded bytes, then cast to category for the model
        model3[col] = model3[col].apply(lambda x: x.decode('gbk') if isinstance(x, bytes) else x)
        model3[col] = model3[col].astype('category')
model3['score2'] = gbm1.predict(model3[predictors])
result_df_b = model3[['user_id', 'score2']].copy()
print(result_df_b.shape)
# Join the two scoring passes; score_diff is the predicted uplift (reached minus not reached)
result_df_all = pd.merge(result_df_a, result_df_b, on='user_id', how='inner', sort=False)
result_df_all['score_diff'] = result_df_all['score'] - result_df_all['score2']
# Bin the predicted uplift score into deciles
result_df_all["score_diff_lv"] = pd.qcut(result_df_all["score_diff"], 10)
result_df_all_lv = result_df_all.groupby(["score_diff_lv"], as_index=True)
   score_diff_lv      min_bin    max_bin
0  (-0.105, 0.00942]  -0.103506  0.009417
1  (0.00942, 0.0152]  0.009417   0.015241
2  (0.0152, 0.0247]   0.015241   0.024690
3  (0.0247, 0.0432]   0.024690   0.043207
4  (0.0432, 0.0832]   0.043207   0.083177
5  (0.0832, 0.126]    0.083177   0.126363
6  (0.126, 0.164]     0.126363   0.163937
7  (0.164, 0.198]     0.163937   0.198153
8  (0.198, 0.236]     0.198153   0.236462
9  (0.236, 0.451]     0.236462   0.451162
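The two scoring passes in the script differ only in the value forced into `if_arrive`, so they can be folded into one helper. The following is a minimal sketch under stated assumptions: `score_counterfactual` and `ToyModel` are hypothetical names, and `ToyModel` merely stands in for `gbm1` so the example runs on its own. The helper also checks that the treatment column is among the predictors, since otherwise both passes return identical scores and `score_diff` is zero everywhere.

```python
import pandas as pd


def score_counterfactual(df, model, predictors, treatment_col, value):
    """Return model scores with treatment_col forced to `value` for every row."""
    if treatment_col not in predictors:
        raise ValueError(f"{treatment_col} must be one of the predictors")
    scored = df.copy()              # never mutate the caller's frame
    scored[treatment_col] = value   # counterfactual treatment assignment
    return pd.Series(model.predict(scored[predictors]), index=df.index)


class ToyModel:
    """Hypothetical stand-in for gbm1: a base rate from x plus a fixed treatment lift."""

    def predict(self, X):
        return (0.01 * X["x"] + 0.05 * X["if_arrive"]).to_numpy()


# Toy data, not from the post
users = pd.DataFrame({"user_id": [1, 2, 3], "x": [1.0, 2.0, 3.0], "if_arrive": [1, 0, 1]})
predictors = ["x", "if_arrive"]
model = ToyModel()

uplift = (score_counterfactual(users, model, predictors, "if_arrive", 1)
          - score_counterfactual(users, model, predictors, "if_arrive", 0))

# The observed if_arrive stays intact, so it can be used later for evaluation
result = users[["user_id", "if_arrive"]].assign(score_diff=uplift)
print(result)
```

Because the helper works on a copy, the observed treatment flag survives both passes and is still available when the deciles are evaluated against actual outcomes.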
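# Users and applications per decile and observed treatment flag. if_arrive and shenqing_flag
# (application flag) are assumed to have been merged back into result_df_all (step not shown).
result_lv = result_df_all.groupby(['score_diff_lv', 'if_arrive']).agg({'user_id': 'count', 'shenqing_flag': 'sum'}).reset_index()
result_lv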
    score_diff_lv      if_arrive  user_id  shenqing_flag
0   (-0.105, 0.00942]  0          506019   21
1   (-0.105, 0.00942]  1          607708   39
2   (0.00942, 0.0152]  0          515787   10
3   (0.00942, 0.0152]  1          597894   45
4   (0.0152, 0.0247]   0          520117   33
5   (0.0152, 0.0247]   1          593444   79
6   (0.0247, 0.0432]   0          530328   55
7   (0.0247, 0.0432]   1          583327   135
8   (0.0432, 0.0832]   0          531341   170
9   (0.0432, 0.0832]   1          582315   336
10  (0.0832, 0.126]    0          530088   235
11  (0.0832, 0.126]    1          583568   508
12  (0.126, 0.164]     0          530668   183
13  (0.126, 0.164]     1          582988   416
14  (0.164, 0.198]     0          530439   154
15  (0.164, 0.198]     1          583217   366
16  (0.198, 0.236]     0          532594   120
17  (0.198, 0.236]     1          581062   318
18  (0.236, 0.451]     0          535430   65
19  (0.236, 0.451]     1          578226   232
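To compare predicted and observed ranking directly, the observed uplift per decile can be computed from the grouped counts as the treated application rate minus the control application rate. The sketch below hard-codes the counts from the output above; the variable names are illustrative.

```python
# (control_users, control_applications, treated_users, treated_applications) per decile,
# copied from the grouped output, lowest score_diff bin first.
counts = [
    (506019, 21, 607708, 39),
    (515787, 10, 597894, 45),
    (520117, 33, 593444, 79),
    (530328, 55, 583327, 135),
    (531341, 170, 582315, 336),
    (530088, 235, 583568, 508),
    (530668, 183, 582988, 416),
    (530439, 154, 583217, 366),
    (532594, 120, 581062, 318),
    (535430, 65, 578226, 232),
]

# Observed uplift = treated application rate - control application rate
observed_uplift = [t_app / t_n - c_app / c_n for c_n, c_app, t_n, t_app in counts]

for i, u in enumerate(observed_uplift):
    print(f"decile {i}: observed uplift {u:.4%}")

# A well-ranked model would show non-decreasing uplift from decile 0 to decile 9.
is_monotonic = all(a <= b for a, b in zip(observed_uplift, observed_uplift[1:]))
best = max(range(len(observed_uplift)), key=observed_uplift.__getitem__)
print("monotonic:", is_monotonic, "| decile with highest observed uplift:", best)
```

On these counts the observed uplift is positive in every decile but peaks in the middle of the range rather than in the top decile, so the ranking is not monotonic.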
Bin                Treatment users  Treatment applications  Control users  Control applications  Total users  Total applications  Treatment apply rate  Control apply rate  Adjusted uplift
(0.237, 0.426]     579145           138                     534450         43                    1113595      181                 0.024%                0.008%              0.014%
(0.217, 0.237]     583021           132                     530696         40                    1113717      172                 0.023%                0.008%              0.013%
(0.2, 0.217]       586028           131                     527626         51                    1113654      182                 0.022%                0.010%              0.011%
(0.182, 0.2]       588730           151                     524909         53                    1113639      204                 0.026%                0.010%              0.013%
(0.156, 0.182]     590526           157                     523149         55                    1113675      212                 0.027%                0.011%              0.014%
(0.127, 0.156]     590443           201                     523213         83                    1113656      284                 0.034%                0.016%              0.015%
(0.0991, 0.127]    592015           264                     521636         94                    1113651      358                 0.045%                0.018%              0.023%
(0.0742, 0.0991]   589894           316                     523691         134                   1113585      450                 0.054%                0.026%              0.023%
(0.0513, 0.0742]   588882           390                     524849         192                   1113731      582                 0.066%                0.037%              0.024%
(0.00253, 0.0513]  585065           594                     528592         301                   1113657      895                 0.102%                0.057%              0.035%
Total              5873749          2474                    5262811        1046                  11136560     3520                0.042%                0.020%              0.018%
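(The two sets of results above come from different data versions, but they show the same problem; they are only examples.)
As shown above, in my uplift S-Learner model the predicted uplift score and the actual uplift are ranked in opposite order, whereas a higher predicted score should correspond to a higher actual uplift. I have checked the code and nothing is reversed, and I have tried every sample definition I could think of, but the problem persists. I cannot find the cause. Could anyone take a look and tell me where the problem lies?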
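For SMS, have you ruled out interference from the telecom carriers? For example, the unreachable users may already have been put on a blacklist.
Are the regions evenly distributed? The comparison is only valid with uniform random sampling.
Could you describe your application scenario in more detail?
Try randomly shuffling the treatment and control groups.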
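The replies all point at the same check: whether the users actually reached are comparable to the control group. One rough way to test this is the standardized mean difference (SMD) of each feature between arms. The sketch below is an illustration only: the feature values are made up (not from the post), `standardized_mean_difference` is a hypothetical helper, and an absolute SMD above roughly 0.1 is a commonly used rule of thumb rather than a hard threshold.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """(mean_t - mean_c) divided by the pooled standard deviation of the two groups."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical values of one feature (say, account age in months); not from the post.
reached = [24, 30, 36, 28, 33, 40, 27, 35]      # treatment users the SMS reached
not_reached = [6, 12, 9, 15, 8, 11, 14, 7]      # treatment users lost before delivery
control = [20, 10, 30, 12, 26, 35, 9, 22]

# Setup B in the question: reached users only versus control
smd_reached = standardized_mean_difference(reached, control)
# Setup A in the question: everyone selected for treatment versus control
smd_assigned = standardized_mean_difference(reached + not_reached, control)

print("reached only vs control:", round(smd_reached, 2))
print("all selected vs control:", round(smd_assigned, 2))
```

In this toy data the reached subset differs sharply from the control group while the full selected group does not, which is the pattern to look for if delivery loss is not random (for example, blacklisted numbers or uneven regional coverage).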