推广

旅游平台用户流失预警

iseeyu2年前 (2024-02-21)推广119

2、数据探索

# 加载包
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.cluster import KMeans
from sklearn import metrics
pd.set_option('display.max_rows', 200) 
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 200)
# 获取数据
df_orign = pd.read_csv('userlostprob_train.txt', sep='\t') #
df = df_orign.copy()  # 复制一份数据
df.shape

(689945, 51)

2.1 目标变量分布

df['label'].value_counts() 
0    500588
1    189357
Name: label, dtype: int64
  • 流失和未流失的用户比例2:5,样本不算特别不平衡,此处不做处理。

2.2 处理异常值

df.describe()

观察到用户偏好价格delta_price1、delta_price2,以及当前酒店可订最低价 lowestprice存在一些负值,理论上酒店的价格不可能为负。同时数据分布比较集中,因此采取中位数填充。而客户价值customer_value_profit、ctrip_profits也不应该为负值,这里将其填充为0。deltaprice_pre2_t1是酒店价格与对手价差均值,可以为负值,无需处理。

# 过滤掉前面非数值的字段
df_min=df.min().iloc[4:]  
# 查看值为负的字段
index=df_min[df_min<0].index.tolist()
# 查看存在负值的字段的值分布
plt.figure(figsize=(20,10))
for i in range(len(index)):
    plt.subplot(2,3,i+1)
    plt.hist(df[index[i]],bins=100)
    plt.title(index[i])

  • 数据分布比较集中,因此采取中位数填充。
neg1=['delta_price1','delta_price2','lowestprice']   # 填充中位数
neg2=['customer_value_profit','ctrip_profits']  # 填充0
for col in neg1:
    df.loc[df[col]<0,col]=df[col].median()
for col in neg2:
    df.loc[df[col]<0,col]=0
  • 24小时内登陆时长内登录时长不应该超过24小时,将大于24的值改为24
df.loc[df['landhalfhours']>24,['landhalfhours']] = 24

2.3 格式转换

  • 访日期d和入住日期arrival是字符串格式,需要进行格式转换,将字符串格式转换为日期格式
df['d']=pd.to_datetime(df['d'],format="%Y-%m-%d")
df['arrival']=pd.to_datetime(df['arrival'],format="%Y-%m-%d")

3、特征工程

  • 数据和特征决定了机器学习效果的上限,而模型和算法只是逼近这个上限。特征工程是建模前的关键步骤,特征处理得好,可以提升模型的性能。

3.1 缺失值处理

  • 查看字段缺失比例
na_rate=(len(df)-df.count())/len(df)  # 计算缺失值比例(类型一维序列)
na_rate.sort_values(ascending=True,inplace=True) # 排序
na_rate=pd.DataFrame(na_rate,columns=['rate'])  # 转化为数据框
# 画图
plt.figure(figsize=(6,12)) 
plt.barh(na_rate.index,na_rate['rate'],alpha = 0.5)
plt.xlabel('na_rate') # 添加轴标签
plt.xlim([0,1]) # 刻度范围
for x,y in enumerate(na_rate['rate']):
    plt.text(y,x,'%.2f%%'%y)  # 添加数值标签

根据上图显示:几乎所有字段都存在缺失值,存在缺失值的字段均为连续性字段,其中historyvisit_7ordernum缺失值超过80%,已没有分析的必要,考虑将其删除。剩余存在缺失值的字段,考虑用其他值来填充空值。

  • 使用相关字段相互填充
    计算字段的相关性发现:commentnums_pre和novoters_pre相关性较强;commentnums_pre2和novoters_pre2相关性较强。

    此处取上图结果的中位数65%作为评分率,考虑用novoters_pre*65%来填充commentnums_pre;
    commentnums_pre/65%来填充novoters_pre。 填充了commentnums_pre和novoters_pre的部分缺失值,剩余缺失值用中位数填充。

def fill_commentnum_novoter_pre(x):
    if (x.isnull()['commentnums_pre'])&(x.notnull()['novoters_pre']): 
        x['commentnums_pre'] = x['novoters_pre']*0.65
    elif (x.notnull()['commentnums_pre'])&(x.isnull()['novoters_pre']):
        x['novoters_pre'] = x['commentnums_pre']/0.65
    else:
        return x
    return x
df[['commentnums_pre','novoters_pre']] = df[['commentnums_pre','novoters_pre']].apply(fill_commentnum_novoter_pre,axis=1)
def fill_commentnum_novoter_pre2(x):
    if (x.isnull()['commentnums_pre2'])&(x.notnull()['novoters_pre2']):
        x['commentnums_pre2'] = x['novoters_pre2']*0.65
    elif (x.notnull()['commentnums_pre2'])&(x.isnull()['novoters_pre2']):
        x['novoters_pre2'] = x['commentnums_pre2']/0.65
    else:
        return x
    return x
df[['commentnums_pre2','novoters_pre2']] = df[['commentnums_pre2','novoters_pre2']].apply(fill_commentnum_novoter_pre2,axis=1)
# 均值填充(极端值影响不大,符合近似正态分布的字段)
fill_mean = ['cancelrate','landhalfhours','visitnum_oneyear','starprefer','price_sensitive','lowestprice','customereval_pre2',
            'uv_pre2','lowestprice_pre2','novoters_pre2','commentnums_pre2','businessrate_pre2','lowestprice_pre','hotelcr','cancelrate_pre']
df[fill_mean] = df[fill_mean].apply(lambda x:x.fillna(x.mean()))

#中位数填充
fill_median = ['ordernum_oneyear','commentnums_pre','novoters_pre','uv_pre','ordercanncelednum','ordercanceledprecent',
               'lasthtlordergap','cityuvs','cityorders','lastpvgap','historyvisit_avghotelnum','businessrate_pre','cr','uv_pre','cr_pre'
                ,'novoters_pre','commentnums_pre','novoters','hoteluv','ctrip_profits','customer_value_profit']
df[fill_median] = df[fill_median].apply(lambda x:x.fillna(x.median()))

#0填充
df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']] = df[['deltaprice_pre2_t1','historyvisit_visit_detailpagenum']].apply(lambda x:x.fillna(0))
  • 分段填充
    consuming_capacity和starprefer相关,考虑通过starprefer分段来填充consuming_capacity。 看一下这两个字段的描述情况:

    将starprefer分为三段:<60,60~80,>80

fill1 = df.loc[df['starprefer']<60,['consuming_capacity']].mean()
fill2 = df.loc[(df['starprefer']<80)&(df['starprefer']>=60),['consuming_capacity']].mean()
fill3 = df.loc[df['starprefer']>=80,['consuming_capacity']].mean()
def fill_consuming_capacity(x):
    if x.isnull()['consuming_capacity']:
        if x['starprefer']<60:
            x['consuming_capacity'] = fill1
        elif (x['starprefer']<80)&(x['starprefer']>=60):
            x['consuming_capacity'] = fill2
        else:
            x['consuming_capacity'] = fill3
    else:
        return x
    return x
df[['consuming_capacity','starprefer']] = df[['consuming_capacity','starprefer']].apply(fill_consuming_capacity,axis=1)
  • 聚类填充
    commentnums和novoters、cancelrate、hoteluv存在较强相关性
    考虑通过聚类取中位数的方式来填充commentnums。
#commentnums:当前酒店点评数
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
km = KMeans(n_clusters=4)
data = df.loc[:,['commentnums','novoters','cancelrate','hoteluv']]
ss = StandardScaler()  # 聚类算距离,需要先标准化
data[['novoters','cancelrate','hoteluv']] = pd.DataFrame(ss.fit_transform(data[['novoters','cancelrate','hoteluv']]))

km.fit(data.iloc[:,1:])
label_pred = km.labels_
data['label_pred'] = label_pred
data.loc[(data['commentnums'].isnull())&(data['label_pred']==0),['commentnums']] = (data.loc[data['label_pred'] == 0,'commentnums']).median()
data.loc[(data['commentnums'].isnull())&(data['label_pred']==1),['commentnums']] = (data.loc[data['label_pred'] == 1,'commentnums']).median()
data.loc[(data['commentnums'].isnull())&(data['label_pred']==2),['commentnums']] = (data.loc[data['label_pred'] == 2,'commentnums']).median()
data.loc[(data['commentnums'].isnull())&(data['label_pred']==3),['commentnums']] = (data.loc[data['label_pred'] == 3,'commentnums']).median()
df['commentnums'] = data['commentnums']

取starprefer和consuming_capacity聚类后每类avgprice的均值来填充avgprice的空值

# avgprice:starprefer,consuming_capacity
km = KMeans(n_clusters=5)
data = df.loc[:,['avgprice','starprefer','consuming_capacity']]
ss = StandardScaler()  # 聚类算距离,需要先标准化
data[['starprefer','consuming_capacity']] = pd.DataFrame(ss.fit_transform(data[['starprefer','consuming_capacity']]))
km.fit(data.iloc[:,1:])
label_pred = km.labels_
data['label_pred'] = label_pred
# metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
data.loc[(data['avgprice'].isnull())&(data['label_pred']==0),['avgprice']] = (data.loc[data['label_pred'] == 0,'avgprice']).mean()
data.loc[(data['avgprice'].isnull())&(data['label_pred']==1),['avgprice']] = (data.loc[data['label_pred'] == 1,'avgprice']).mean()
data.loc[(data['avgprice'].isnull())&(data['label_pred']==2),['avgprice']] = (data.loc[data['label_pred'] == 2,'avgprice']).mean()
data.loc[(data['avgprice'].isnull())&(data['label_pred']==3),['avgprice']] = (data.loc[data['label_pred'] == 3,'avgprice']).mean()
data.loc[(data['avgprice'].isnull())&(data['label_pred']==4),['avgprice']] = (data.loc[data['label_pred'] == 4,'avgprice']).mean()
df['avgprice'] = data['avgprice']

取consuming_capacity和avgprice聚类后的中位数来填充delta_price1

# delta_price1:consuming_capacity,avgprice
km = KMeans(n_clusters=6)
data = df.loc[:,['delta_price1','consuming_capacity','avgprice']]
ss = StandardScaler()  # 聚类算距离,需要先标准化
data[['consuming_capacity','avgprice']] = pd.DataFrame(ss.fit_transform(data[['consuming_capacity','avgprice']]))

km.fit(data.iloc[:,1:])
label_pred = km.labels_
data['label_pred'] = label_pred
# metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
data.loc[(data['delta_price1'].isnull())&(data['label_pred']==0),['delta_price1']] = (data.loc[data['label_pred'] == 0,'delta_price1']).median()
data.loc[(data['delta_price1'].isnull())&(data['label_pred']==1),['delta_price1']] = (data.loc[data['label_pred'] == 1,'delta_price1']).median()
data.loc[(data['delta_price1'].isnull())&(data['label_pred']==2),['delta_price1']] = (data.loc[data['label_pred'] == 2,'delta_price1']).median()
data.loc[(data['delta_price1'].isnull())&(data['label_pred']==3),['delta_price1']] = (data.loc[data['label_pred'] == 3,'delta_price1']).median()
data.loc[(data['delta_price1'].isnull())&(data['label_pred']==4),['delta_price1']] = (data.loc[data['label_pred'] == 4,'delta_price1']).median()
data.loc[(data['delta_price1'].isnull())&(data['label_pred']==5),['delta_price1']] = (data.loc[data['label_pred'] == 5,'delta_price1']).median()
df['delta_price1'] = data['delta_price1']

取 consuming_capacity和avgprice聚类delta_price2的中位数来填充delta_price2

# delta_price2: consuming_capacity,avgprice
km = KMeans(n_clusters=5)
data = df.loc[:,['delta_price2','avgprice','consuming_capacity']]
ss = StandardScaler()  # 聚类算距离,需要先标准化
data[['avgprice','consuming_capacity']] = pd.DataFrame(ss.fit_transform(data[['avgprice','consuming_capacity']]))

km.fit(data.iloc[:,1:])
label_pred = km.labels_
data['label_pred'] = label_pred
#metrics.calinski_harabaz_score(data.iloc[:,1:],km.labels_)
data.loc[(data['delta_price2'].isnull())&(data['label_pred']==0),['delta_price2']] = (data.loc[data['label_pred'] == 0,'delta_price2']).median()
data.loc[(data['delta_price2'].isnull())&(data['label_pred']==1),['delta_price2']] = (data.loc[data['label_pred'] == 1,'delta_price2']).median()
data.loc[(data['delta_price2'].isnull())&(data['label_pred']==2),['delta_price2']] = (data.loc[data['label_pred'] == 2,'delta_price2']).median()
data.loc[(data['delta_price2'].isnull())&(data['label_pred']==3),['delta_price2']] = (data.loc[data['label_pred'] == 3,'delta_price2']).median()
data.loc[(data['delta_price2'].isnull())&(data['label_pred']==4),['delta_price2']] = (data.loc[data['label_pred'] == 4,'delta_price2']).median()
df['delta_price2'] = data['delta_price2']
  • 以上,缺失值处理完毕

3.2 新增字段

  • 时间字段
    新增字段:访问日期和入住日期间隔天数booking_gap、入住日期是星期几week_day、入住日期是否是周末is_weekend
#格式为年-月-日
df[['d','arrival']] = df[['d','arrival']].apply(lambda x:pd.to_datetime(x,format='%Y-%m-%d'))
#访问日期和入住日期间隔天数
df['booking_gap'] = ((df['arrival']-df['d'])/np.timedelta64(1,'D')).astype(int)
#入住日期是星期几
df['week_day'] = df['arrival'].map(lambda x:x.weekday())
#入住日期是否是周末
df['is_weekend'] = df['week_day'].map(lambda x: 1 if x in (5,6) else 0)
  • 是否是同一个样本【选取部分客户行为指标】
    查看字段sid,发现95%都是老用户,新用户很少,一周内部分用户可能会下多个订单,为了方便后续划分训练集和验证集,此处添加一个user_tag来区分是否是同一个用户的订单。
df['user_tag'] = df['ordercanceledprecent'].map(str) + df['ordercanncelednum'].map(str) + df['ordernum_oneyear'].map(str) +\
                  df['starprefer'].map(str) + df['consuming_capacity'].map(str) + \
                 df['price_sensitive'].map(str) + df['customer_value_profit'].map(str) + df['ctrip_profits'].map(str) +df['visitnum_oneyear'].map(str) + \
                  df['historyvisit_avghotelnum'].map(str) + df['businessrate_pre2'].map(str) +\
                df['historyvisit_visit_detailpagenum'].map(str) + \
                  df['delta_price2'].map(str) +  \
                df['commentnums_pre2'].map(str) + df['novoters_pre2'].map(str) +df['customereval_pre2'].map(str) + df['lowestprice_pre2'].map(str)
df['user_tag'] = df['user_tag'].apply(lambda x : hash(x))
  • 用户字段和酒店字段
    选取部分用户相关字段进行聚类创建用户字段user_group,选取部分酒店相关字段进行聚类创建酒店字段hotel_group。
user_group = ['ordercanceledprecent','ordercanncelednum','ordernum_oneyear',
             'historyvisit_visit_detailpagenum','historyvisit_avghotelnum']
hotel_group = ['commentnums', 'novoters', 'lowestprice', 'hotelcr', 'hoteluv', 'cancelrate']
#聚类之前先标准化
km_user = pd.DataFrame(df[user_group])
km_hotel = pd.DataFrame(df[hotel_group])
ss = StandardScaler()
for i in range(km_user.shape[1]):
    km_user[user_group[i]] = ss.fit_transform(df[user_group[i]].values.reshape(-1, 1)).ravel()
ss = StandardScaler()
for i in range(km_hotel.shape[1]):
    km_hotel[hotel_group[i]] = ss.fit_transform(df[hotel_group[i]].values.reshape(-1, 1)).ravel()
df['user_group'] = KMeans(n_clusters=3).fit_predict(km_user)
# score = metrics.calinski_harabaz_score(km_user,KMeans(n_clusters=3).fit(km_user).labels_)
# print('数据聚calinski_harabaz指数为:%f'%(score)) #3:218580.269018  4:218580.416497 5:218581.368953 6:218581.203569 
df['hotel_group'] = KMeans(n_clusters=5).fit_predict(km_hotel)
# score = metrics.calinski_harabaz_score(km_hotel,KMeans(n_clusters=3).fit(km_hotel).labels_)
# print('数据聚calinski_harabaz指数为:%f'%(score))  #3:266853.481135  4:268442.314369 5:268796.468103 6:268796.707149

3.3 连续特征离散化

historyvisit_avghotelnum大部分都小于5,将字段处理成小于等于5和大于5的离散值;
ordercanncelednum大部分都小于5,将字段处理成小于等于5和大于5的离散值;
sid等于1是新访设为0,其他设为1为老用户。
avgprice、lowestprice、starprefer、consuming_capacity和h进行数值分段离散化。

df['historyvisit_avghotelnum'] = df['historyvisit_avghotelnum'].apply(lambda x: 0 if x<=5 else 1)
df['ordercanncelednum'] = df['ordercanncelednum'].apply(lambda x: 0 if x<=5 else 1)
df['sid'] = df['sid'].apply(lambda x: 0 if x==1 else 1)  
#分段离散化
def discrete_avgprice(x):
    if x<=200:
        return 0
    elif x<=400:
        return 1
    elif x<=600:
        return 2
    else:
        return 3
    
def discrete_lowestprice(x):
    if x<=100:
        return 0
    elif x<=200:
        return 1
    elif x<=300:
        return 2
    else:
        return 3
    
def discrete_starprefer(x):
    if x==0:
        return 0
    elif x<=60:
        return 1
    elif x<=80:
        return 2
    else:
        return 3
    
def discrete_consuming_capacity(x):
    if x<0:
        return 0
    elif x<=20:
        return 1
    elif x<=40:
        return 2
    elif x<=60:
        return 3
    else:
        return 4
    
def discrete_h(x):
    if x>=0 and x<6:#凌晨访问
        return 0
    elif x<12:#上午访问
        return 1
    elif x<18:#下午访问
        return 2
    else:
        return 3#晚上访问
    
df['avgprice'] = df['avgprice'].map(discrete_avgprice)
df['lowestprice'] = df['lowestprice'].map(discrete_lowestprice)
df['starprefer'] = df['starprefer'].map(discrete_starprefer)
df['consuming_capacity'] = df['consuming_capacity'].map(discrete_consuming_capacity)
df['h'] = df['h'].map(discrete_h)
  • 对当前的数值型类别变量进行离散特征热编码,此处用OneHotEncoder方法
discrete_field = ['historyvisit_avghotelnum','ordercanncelednum'
                  ,'avgprice','lowestprice','starprefer','consuming_capacity','user_group',
                 'hotel_group','is_weekend','week_day','sid','h']
encode_df = pd.DataFrame(preprocessing.OneHotEncoder(handle_unknown='ignore').fit_transform(df[discrete_field]).toarray())
encode_df_new = pd.concat([df.drop(columns=discrete_field,axis=1),encode_df],axis=1)

3.4 删除字段

去掉两类字段: d、arrival、sampleid、firstorder_bu这几个对分析没有意义的字段; historyvisit_totalordernum和ordernum_oneyear这两个字段值相等,此处取ordernum_oneyear这个字段,删除historyvisit_totalordernum; decisionhabit_user和historyvisit_avghotelnum数值较一致,此处选择historyvisit_avghotelnum,删除decisionhabit_user。

encode_df_new = encode_df_new.drop(columns=['d','arrival','sampleid','historyvisit_totalordernum','firstorder_bu','decisionhabit_user','historyvisit_7ordernum'],axis=1)
encode_df_new.shape

4、模型训练

4.1 划分训练集和验证集

为了保证训练集和验证集独立同分布,将数据按照user_tag进行排序,取前70%作为训练集,剩余的作为验证集。

ss_df_new = encode_df_new
num = ss_df_new.shape[0]
df_sort = ss_df_new.sort_values(by=['user_tag'],ascending=True)
train_df = df_sort.iloc[:int(num*0.7),:]
test_df = df_sort.iloc[int(num*0.7):,:]
train_y = train_df['label']
train_x = train_df.iloc[:,1:]
test_y = test_df['label']
test_x = test_df.iloc[:,1:]

4.2比较各个模型的训练效果

所有模型的调参都采用GridSearchCV网格搜索进行。

  • 决策树
from sklearn.tree import DecisionTreeClassifier
bdt = DecisionTreeClassifier(random_state=0,max_depth=30, min_samples_split=70)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('precision>=0.97时对应的最大recall为:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('auc得分为:{}'.format(auc_test))
返回
0.0
0.8340018840954033
  • 随机森林
#调整的参数:
#n_estimators
#max_depth
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
rf.fit(train_x,train_y)
predict_train = rf.predict_proba(train_x)[:,1]
predict_test = rf.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('precision>=0.97时对应的最大recall为:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('auc得分为:{}'.format(auc_test))
返回
0.666135416301797
0.9616117844760916
  • Adaboost
from sklearn.ensemble import AdaBoostClassifier
bdt = AdaBoostClassifier(algorithm="SAMME",
                         n_estimators=600, learning_rate=1)
bdt.fit(train_x,train_y)
predict_train = bdt.predict_proba(train_x)[:,1]
predict_test = bdt.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('precision>=0.97时对应的最大recall为:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('auc得分为:{}'.format(auc_test))
返回
0.00019265123121650496
0.7300356696791559
  • GBDT
#调整的参数:
#n_estimators
#max_depth和min_samples_split
#min_samples_split和min_samples_leaf
#max_features
#subsample
#learning_rate,需要配合调整n_estimators
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
#最终的参数结果
gbc = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
gbc.fit(train_x,train_y)
predict_train = gbc.predict_proba(train_x)[:,1]
predict_test = gbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('调参之后:测试集中precision>=0.97对应的最大recall为:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('auc得分为:{}'.format(auc_test))
返回
0.15988300816140671
0.8808204850185188
  • xgboost
#调整的参数:
#迭代器个数n_estimators
#min_child_weight以及max_depth
#gamma值
##subsample 和 colsample_bytree
#learning_rate,需要配合调整n_esgtimators


from xgboost.sklearn import XGBClassifier
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc.fit(train_x,train_y)
predict_train = xgbc.predict_proba(train_x)[:,1]
predict_test = xgbc.predict_proba(test_x)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('precision>=0.97时对应的最大recall为:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('auc得分为:{}'.format(auc_test))
返回
0.7640022417597814
0.9754939563495324
  • 根据上述结果可知,xgboost的训练效果最好,当precision>=0.97时,recall最大能达到76.4%。

5、模型融合

后面也尝试了模型堆叠的方法,看是否能得到更好的效果,首先利用上述提到的各个模型,根据特征重要性选取了57个特征,然后利用KFold方法进行5折交叉验证,得到五种模型的验证集和测试集结果,分别作为第二层的训练数据集和测试数据集,并用逻辑回归模型来训练这五个特征,最终得到的结果是当precision>=0.97时,recall最大能达到78.3%,比原来的76.4%稍有提高。

  • 选取重要特征
#筛选特征
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier

def get_top_n_features(train_x, train_y):

    # random forest
    rf_est = RandomForestClassifier(n_estimators=300,max_depth=50)
    rf_est.fit(train_x, train_y)
    feature_imp_sorted_rf = pd.DataFrame({'feature': train_x.columns,
                                          'importance': rf_est.feature_importances_}).sort_values('importance', ascending=False)

    # AdaBoost
    ada_est =AdaBoostClassifier(n_estimators=600,learning_rate=1)
    ada_est.fit(train_x, train_y)
    feature_imp_sorted_ada = pd.DataFrame({'feature': train_x.columns,
                                           'importance': ada_est.feature_importances_}).sort_values('importance', ascending=False)

    
    # GradientBoosting
    gb_est = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
    gb_est.fit(train_x, train_y)
    feature_imp_sorted_gb = pd.DataFrame({'feature':train_x.columns,
                                          'importance': gb_est.feature_importances_}).sort_values('importance', ascending=False)

    # DecisionTree
    dt_est = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)
    dt_est.fit(train_x, train_y)
    feature_imp_sorted_dt = pd.DataFrame({'feature':train_x.columns,
                                          'importance': dt_est.feature_importances_}).sort_values('importance', ascending=False)
    
    # xgbc
    xg_est = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
    xg_est.fit(train_x, train_y)
    feature_imp_sorted_xg = pd.DataFrame({'feature':train_x.columns,
                                          'importance': xg_est.feature_importances_}).sort_values('importance', ascending=False)

    
    return feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg

feature_imp_sorted_rf,feature_imp_sorted_ada,feature_imp_sorted_gb,feature_imp_sorted_dt,feature_imp_sorted_xg = get_top_n_features(train_x, train_y)
top_n_features = 35
features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']
features_top_n_gb = feature_imp_sorted_gb.head(top_n_features)['feature']
features_top_n_dt = feature_imp_sorted_dt.head(top_n_features)['feature']
features_top_n_xg = feature_imp_sorted_xg.head(top_n_features)['feature']
features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_gb, features_top_n_dt,features_top_n_xg], 
                               ignore_index=True).drop_duplicates()
    
features_importance = pd.concat([feature_imp_sorted_rf, feature_imp_sorted_ada, 
                                   feature_imp_sorted_gb, feature_imp_sorted_dt,feature_imp_sorted_xg],ignore_index=True)
train_x_new = pd.DataFrame(train_x[features_top_n])
test_x_new = pd.DataFrame(test_x[features_top_n])
features_top_n

最终从79个特征中选取了57个。

  • 第一层模型训练
#第一层
from sklearn.model_selection import KFold
ntrain = train_x_new.shape[0]
ntest = test_x_new.shape[0]
kf = KFold(n_splits = 5, random_state=0, shuffle=False)

def get_out_fold(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((5, ntest))
    oof_train_prob = np.zeros((ntrain,))
    oof_test_prob = np.zeros((ntest,))
    oof_test_skf_prob = np.empty((5, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.fit(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)
        oof_train_prob[test_index] = clf.predict_proba(x_te)[:,1]
        oof_test_skf_prob[i, :] = clf.predict_proba(x_test)[:,1]
        print('现在是第{}层'.format(i))
        print('训练集索引如下:')
        print(train_index)
        print('测试集索引如下:')
        print(test_index)
    oof_test[:] = oof_test_skf.mean(axis=0)
    oof_test_prob[:] = oof_test_skf_prob.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1),oof_train_prob.reshape(-1, 1), oof_test_prob.reshape(-1, 1)
rf = RandomForestClassifier(n_estimators=300,max_depth=50)
ada = AdaBoostClassifier(n_estimators=600,learning_rate=1)
gb = GradientBoostingClassifier(loss='deviance',random_state=2019,learning_rate=0.05, n_estimators=200,min_samples_split=4,
                        min_samples_leaf=1,max_depth=11,max_features='sqrt', subsample=0.8)
dt = DecisionTreeClassifier(random_state=0,min_samples_split=70,max_depth=30)

x_train = train_x_new.values 
x_test = test_x_new.values 
y_train =train_y.values
rf_oof_train, rf_oof_test,rf_oof_train_prob, rf_oof_test_prob = get_out_fold(rf, x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test,ada_oof_train_prob, ada_oof_test_prob = get_out_fold(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test,gb_oof_train_prob, gb_oof_test_prob = get_out_fold(gb, x_train, y_train, x_test) # Gradient Boost
dt_oof_train, dt_oof_test,dt_oof_train_prob, dt_oof_test_prob = get_out_fold(dt, x_train, y_train, x_test) # Decision Tree
xgbc = XGBClassifier(learning_rate=0.05, objective= 'binary:logistic', nthread=1,  scale_pos_weight=1, seed=27,
                    subsample=0.6, colsample_bytree=0.6, gamma=0, reg_alpha= 0, reg_lambda=1,max_depth=38,min_child_weight=1,n_estimators=210)
xgbc_oof_train, xgbc_oof_test,xgbc_oof_train_prob, xgbc_oof_test_prob = get_out_fold(xgbc, x_train, y_train, x_test) # XGBClassifier
print("Training is complete")
  • 第二层模型训练
    将第一层的输出结果作为训练集和测试集
#划分训练集和测试集
train_x2_prob = pd.DataFrame(np.concatenate((rf_oof_train_prob, ada_oof_train_prob, gb_oof_train_prob, dt_oof_train_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob'])
test_x2_prob = pd.DataFrame(np.concatenate((rf_oof_test_prob, ada_oof_test_prob, gb_oof_test_prob, dt_oof_test_prob), axis=1),columns=['rf_prob','ada_prob','gb_prob','dt_prob'])
#逻辑回归模型训练
from sklearn.linear_model import LogisticRegression
#调参
# param_rf4 = {'penalty': ['l1','l2'],'C':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]}
# rf_est4 = LogisticRegression()
# rfsearch4 = GridSearchCV(estimator=rf_est4,param_grid=param_rf4,scoring='roc_auc',iid=False,cv=5)
# rfsearch4.fit(train_x2_prob,train_y)
# print('每个参数值的平均得分:{}'.format(rfsearch4.cv_results_['mean_test_score']))
# print('最佳参数值为:{}'.format(rfsearch4.best_params_))
# print('最佳参数值roc_auc得分为:{}'.format(rfsearch4.best_score_))
#调参结果:C=0.1,penalty='l2'
lr = LogisticRegression(C=0.1,penalty='l2')
lr.fit(train_x2_prob,train_y)
predict_train = lr.predict_proba(train_x2_prob)[:,1]
predict_test = lr.predict_proba(test_x2_prob)[:,1]
pr_train,re_train,thre_train = metrics.precision_recall_curve(train_y,predict_train)
pr_test,re_test,thre_test = metrics.precision_recall_curve(test_y,predict_test)
auc_train = metrics.roc_auc_score(train_y,predict_train)
auc_test = metrics.roc_auc_score(test_y,predict_test)
prt_train = pd.DataFrame({'precision':pr_train,'recall':re_train})
prt_test = pd.DataFrame({'precision':pr_test,'recall':re_test})
print('precision>=0.97时对应的最大recall为:')
print(prt_test.loc[prt_test['precision']>=0.97,'recall'].max())
print('auc得分为:{}'.format(auc_test))
返回
0.7832498511331395
0.9763271659779821

通过堆叠的方法,将recall值从76.4%提高到78.3%。

扫描二维码推送至手机访问。

版权声明:本文由西安泽虎代运营发布,如需转载请注明出处。

转载请注明出处https://0291.com.cn/post/57116.html

相关文章

营销计划与管理——制订营销计划

营销计划与管理——制订营销计划

计划指导和协调公司的所有营销工作。它是公司战略计划过程的具体成果,概括了公司的最终目标及其实现手段。为了达到指导公司行动的最终目的,营销计划必须有效地将公司的目标和拟议的行动方案传达给相应的利益相关者——公司员工、合作者、股东和投资者。营销计划的范围比商业计划的范围更窄,因...

小编教你网站建设需要简洁精致。

小编教你网站建设需要简洁精致。

随着技能的不断提升,越来越多的网站功能可以通过技能来实现。可以说,只要市场有需求,就会发现并形成新的功能。但是企业选择建网站的时候,不能说目前最常用的是哪些功能,用户需要的更多,很难添加到网站上。最终会形成这样一种情况:功能复杂,但操作繁琐,不利于用户体验和网站优化 通常对网站复杂程度把握不当,肯定...

极速推在哪里展现(极速推在哪里设置)

极速推在哪里展现(极速推在哪里设置)

系统会根据您设置的订单条件,比如定向人群,快速智能地帮助您的宝贝投放给更多高价值活跃的消费者人群链路上,目前以淘内推荐流量为主要渠道。...

网课时代,护眼学习平板怎么选——教多多E15教

网课时代,护眼学习平板怎么选——教多多E15教

作者:邪恶的麦叔叔 前言 现在家长最操心的就是孩子的教育问题了,疫情反反复复,有点风吹草动学校就要停课改为在家上网课。而双减也是把压力转移到了家长的身上,培训班被叫停更让家长不知所措,想给孩子补习功课只能通过线上学习了。 对于学龄期的孩子来说,Ta们还比较贪玩,自控力没有那么好,学习还...

#少坤老师 互联网营销心理学第九讲:网络营销策略(网络营销策略包括哪四种)

#少坤老师 互联网营销心理学第九讲:网络营销策略(网络营销策略包括哪四种)

是企业根据自身所在的市场中所处地位不同而采取的一些网络营销组合,以互联网络为基础,利用数字化的信息和网络媒体的交互性来辅助营销目标实现的一种新型的市场营销方式。简单来说,网络营销就是以互联网为主要手段,为达到一定营销目的而进行的营销活动。 网络营销要经过以下几个步骤: 一、战略整体规划...

淘宝投入产出比计算公式,在哪里查看,费用投入产出比怎么算(投入产出比怎么计算)

淘宝投入产出比计算公式,在哪里查看,费用投入产出比怎么算(投入产出比怎么计算)

ROI的公式是:投入产出比ROI=总成交金额/花费,然后,我们分解里面的每一个因素,其中成交金额=成交笔数*客单价,每一笔的客单价乘以订单量等于成交金额。...

现在,非常期待与您的又一次邂逅

我们努力让每一部企业宣传片和抖音短视频成为商业大片