Sunday, October 21, 2018

The article list is now live

Blogger originally came with one built in (the little widget now at the bottom of the page). But while reworking the site so it could be reached from mainland China, I removed the referenced Blogger code, and since the original setup wasn't saved properly, restoring it ran into problems. Worn out and unwilling to keep fighting with it, I simply wrote my own.
It uses the RSS feed that Blogger provides and parses the returned XML. The feed supplies up to five hundred post URLs and is rich in information: timestamps, labels, and so on are all included. The downside is bandwidth: a single request returns a file about 1 MB in size, since it actually contains the full text of every post. Blogger also has a built-in sitemap that is only about 2.5 KB, but it lacks post titles and other details, so I didn't use it. As for the code, I basically searched online and copied something over, changing only the selector portion and the output format.
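The article list itself runs in the page (presumably client-side JavaScript), but the idea is simple enough to sketch. The following Python snippet is a rough, hedged illustration with a placeholder blog URL: it fetches the standard Blogger Atom feed and pulls titles, publish dates, and links out of the XML.

# Hypothetical sketch of the same idea in Python: fetch the Blogger post feed
# and extract titles, publish dates and links. The blog URL is a placeholder;
# "feeds/posts/default" with max-results is the standard Blogger Atom feed.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.blogspot.com/feeds/posts/default?max-results=500"
ATOM = "{http://www.w3.org/2005/Atom}"

with urllib.request.urlopen(FEED_URL) as resp:
    root = ET.fromstring(resp.read())

for entry in root.iter(ATOM + "entry"):
    title = entry.find(ATOM + "title").text
    published = entry.find(ATOM + "published").text
    # the post URL is the <link rel="alternate"> element
    link = next(l.get("href") for l in entry.findall(ATOM + "link")
                if l.get("rel") == "alternate")
    print(published, title, link)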

 

Friday, October 19, 2018

A Basic Machine Learning Workflow Example

Consumer repeat-purchase prediction


1. Data preparation

2. Feature extraction

3. Feature selection

4. Algorithm comparison

5. Fine-tuning

In [5]:
%matplotlib inline

import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
from deep_learning.data_convert import *
from sklearn import metrics
import xgboost as xgb
from basic_model.feature_extraction.user_profile import operate_days, purchase_days
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from feature_selector import FeatureSelector
import featuretools as ft
from featuretools.primitives import make_agg_primitive
from featuretools.variable_types import Categorical, Numeric
import os
In [2]:
root_path = get_root_path()
df = pd.read_csv(root_path + './data/data_format1/user_log_format1.csv', nrows=500000)
In [3]:
user_info = get_user_info()
In [4]:
train_pairs = get_train_pairs()
test_pairs = get_test_pairs()

Data structure


User click, purchase, and other action logs

In [7]:
df.sample(6)
Out[7]:
user_id item_id cat_id seller_id brand_id time_stamp action_type
315210 404274 806011 629 1986 2158.0 1111 0
388593 375180 781616 1095 3638 5446.0 1111 0
19556 106446 880184 1467 1253 7907.0 525 0
450498 259 783997 1349 184 1360.0 1110 0
295620 210345 105430 1095 780 516.0 1031 0
342376 12840 1063731 302 742 1016.0 720 0

User profile data

In [8]:
user_info.sample(5)
Out[8]:
user_id age_range gender
128191 168385 0 0
355699 83601 5 0
134076 232512 6 0
317952 18559 4 0
7024 201150 7 1

Repeat-purchase training data

In [9]:
train_pairs.sample(5)
Out[9]:
user_id seller_id label
202530 229076 1501 0
169668 259444 160 0
38192 105969 2447 0
47697 146700 467 0
65502 313665 4282 0

Data for which predictions must be submitted

In [10]:
test_pairs.sample(5)
Out[10]:
user_id seller_id prob
213760 72947 3319 NaN
60735 413745 3635 NaN
248962 69467 3629 NaN
147533 224815 297 NaN
129898 324604 1474 NaN

Missing values

In [11]:
df.isnull().any()
Out[11]:
user_id        False
item_id        False
cat_id         False
seller_id      False
brand_id        True
time_stamp     False
action_type    False
dtype: bool

Approximate distribution of the number of action records per user

In [12]:
df.groupby(["user_id"]).size().describe()
Out[12]:
count    3382.000000
mean      147.841514
std       193.169136
min         3.000000
25%        45.000000
50%        90.000000
75%       174.000000
max      2877.000000
dtype: float64
In [13]:
df.groupby(["user_id"]).size().hist()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x21746834b00>

Distribution of user action types

In [14]:
df["action_type"].hist()
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x217468c7320>

Distribution of the repeat-purchase label

In [15]:
train_pairs["label"].describe()
Out[15]:
count    260864.000000
mean          0.061151
std           0.239607
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: label, dtype: float64
In [16]:
train_pairs["label"].hist()
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x217469750f0>

Feature extraction

For the structured data, features are extracted with the automated feature-engineering tool Featuretools.


In [17]:
log_df = get_train_log(500000)
log_df = log_df.loc[log_df['action_type'] == 2]
log_df["month"] = log_df["time_stamp"].map(lambda x: int(x / 100))
log_df['data'] = log_df["time_stamp"].map(lambda x: '2016-' + str(int(x / 100)) + '-' + str(int(x // 100)))
user_df = get_user_info()
log_df = log_df.merge(user_df, on="user_id", how="inner")
log_df["index"] = log_df.index
log_df.drop(labels=['time_stamp', 'action_type', "gender", "age_range", "month"], axis=1, inplace=True)
es = ft.EntitySet(id="logs")
es = es.entity_from_dataframe(entity_id="logs",
                              dataframe=log_df,
                              index="index",
                              time_index="data",
                              variable_types={"user_id": ft.variable_types.Categorical,
                                              "item_id": ft.variable_types.Categorical,
                                              "cat_id": ft.variable_types.Categorical,
                                              "seller_id": ft.variable_types.Categorical,
                                              "brand_id": ft.variable_types.Categorical
                                              }
                              )

es = es.entity_from_dataframe(entity_id="users",
                              dataframe=user_df,
                              index="user_id",
                              variable_types={"age_range": ft.variable_types.Categorical,
                                              "gender": ft.variable_types.Categorical}
                              )

user_relationship = ft.Relationship(es["users"]["user_id"],
                                    es["logs"]["user_id"])

es = es.add_relationship(user_relationship)

es = es.normalize_entity(base_entity_id="logs",
                         new_entity_id="sellers",
                         index="seller_id")

es = es.normalize_entity(base_entity_id="logs",
                         new_entity_id="catalogs",
                         index="cat_id")

es = es.normalize_entity(base_entity_id="logs",
                         new_entity_id="brands",
                         index="brand_id")

es = es.normalize_entity(base_entity_id="logs",
                         new_entity_id="items",
                         index="item_id")

feature_defs = ft.dfs(entityset=es,
                      target_entity="sellers",
                      agg_primitives=["std", "mean", "count"],
                      max_depth=4,
                      where_primitives=[],
                      trans_primitives=[],
                      features_only=True
                      )

Features that can be extracted automatically

In [18]:
feature_defs
Out[18]:
[<Feature: COUNT(logs)>,
 <Feature: STD(logs.users.COUNT(logs))>,
 <Feature: STD(logs.catalogs.COUNT(logs))>,
 <Feature: STD(logs.brands.COUNT(logs))>,
 <Feature: STD(logs.items.COUNT(logs))>,
 <Feature: MEAN(logs.users.COUNT(logs))>,
 <Feature: MEAN(logs.catalogs.COUNT(logs))>,
 <Feature: MEAN(logs.brands.COUNT(logs))>,
 <Feature: MEAN(logs.items.COUNT(logs))>]
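The call above returns only feature definitions (features_only=True). As a minimal sketch, they could be materialized with ft.calculate_feature_matrix; the seller_buy.csv loaded below was presumably generated in a separate run with a richer set of primitives, which is why it has far more columns.

# Sketch: materialize the feature matrix from the definitions above.
# (seller_buy.csv, loaded in the next cell, appears to come from a separate,
# larger run, hence its much bigger column count.)
feature_matrix = ft.calculate_feature_matrix(features=feature_defs, entityset=es)
feature_matrix.head()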

The corresponding generated features

In [19]:
seller_buy_features = pd.read_csv(os.path.join(get_root_path(), "feature_vectors", "seller_buy.csv"), index_col=0)
In [21]:
seller_buy_features.sample(5)
Out[21]:
COUNT(logs) MONTH(first_logs_time) DAY(first_logs_time) COUNT(logs WHERE age_range = 1) COUNT(logs WHERE age_range = 3) COUNT(logs WHERE month = 10) COUNT(logs WHERE month = 11) COUNT(logs WHERE gender = 2) COUNT(logs WHERE month = 9) COUNT(logs WHERE age_range = 5) ... STD(logs.items.COUNT(logs) WHERE month = 6) STD(logs.items.COUNT(logs) WHERE age_range = 6) TREND(logs.catalogs.COUNT(logs), data) TREND(logs.items.COUNT(logs), data) TREND(logs.brands.COUNT(logs), data) TREND(logs.users.COUNT(logs), data) MEDIAN(logs.users.COUNT(logs)) MEDIAN(logs.catalogs.COUNT(logs)) MEDIAN(logs.brands.COUNT(logs)) MEDIAN(logs.items.COUNT(logs))
index
4897 200 6 6 0.0 46 57.0 113.0 8.0 9.0 26.0 ... 0.0000 22.1037 1.7501 0.3587 0.0000 -0.0159 12.0 61722.0 200.0 75.0
403 294 5 5 0.0 60 27.0 136.0 21.0 72.0 39.0 ... 40.9565 45.9035 -0.0132 -0.1075 1.2654 -0.0500 8.0 1015.0 291.0 68.0
2265 377 5 5 0.0 139 67.0 51.0 4.0 72.0 30.0 ... 12.5168 16.6649 -0.2432 -0.0077 0.0000 -0.0147 11.0 10034.0 377.0 10.0
4597 498 5 5 0.0 95 57.0 195.0 24.0 97.0 73.0 ... 3.5979 20.0499 37.0294 0.1373 1.1922 -0.0343 13.0 4212.5 840.0 8.0
4351 276 5 5 0.0 63 50.0 167.0 14.0 19.0 18.0 ... 4.2890 59.7204 -189.3830 0.5104 0.0000 -0.0531 8.5 8751.0 276.0 143.0

5 rows × 198 columns

In [22]:
seller_buy_features.iloc[:,:10].describe()
Out[22]:
COUNT(logs) MONTH(first_logs_time) DAY(first_logs_time) COUNT(logs WHERE age_range = 1) COUNT(logs WHERE age_range = 3) COUNT(logs WHERE month = 10) COUNT(logs WHERE month = 11) COUNT(logs WHERE gender = 2) COUNT(logs WHERE month = 9) COUNT(logs WHERE age_range = 5)
count 4995.000000 4995.000000 4995.000000 4995.000000 4995.000000 4995.000000 4995.000000 4995.000000 4995.000000 4995.000000
mean 659.087888 5.485085 5.485085 0.019219 173.305105 88.657457 270.999199 30.209209 73.683083 79.416016
std 1196.207834 1.175409 1.175409 0.192029 332.168473 168.686292 602.828244 61.905816 139.227856 178.754836
min 8.000000 5.000000 5.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 142.000000 5.000000 5.000000 0.000000 33.000000 15.000000 50.000000 5.000000 11.000000 14.000000
50% 293.000000 5.000000 5.000000 0.000000 71.000000 38.000000 96.000000 12.000000 30.000000 33.000000
75% 649.000000 5.000000 5.000000 0.000000 165.000000 91.000000 229.000000 29.000000 75.000000 79.000000
max 18877.000000 11.000000 11.000000 6.000000 4864.000000 3024.000000 10126.000000 1096.000000 2143.000000 5014.000000
In [23]:
seller_buy_features.iloc[:,-13:-4].boxplot(figsize=(20,10))
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x21747fc2240>
In [24]:
seller_buy_features.iloc[:,-10:].hist(figsize=(20,10))
Out[24]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000217469CF358>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021748F18390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021748F3F6A0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000021748F689B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021748F93CC0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021748F93CF8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000021748FF0320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021749019630>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021749042940>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000002174906BC50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000021749095F60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000217490C62B0>]],
      dtype=object)

Feature extraction with word2vec

Assuming that the order in which a user makes purchases and the items purchased carry some correlation, several features were extracted over brands, sellers, categories, and so on (a training sketch follows the cells below).

In [27]:
df[df["action_type"] == 2][["user_id","seller_id", "time_stamp"]].groupby('user_id').apply(lambda x: x.sort_values('time_stamp')).head(7)
Out[27]:
user_id seller_id time_stamp
user_id
16 61991 16 650 914
61956 16 3948 924
61964 16 4644 1025
61951 16 1435 1111
259 450473 259 740 527
450474 259 740 527
450318 259 2445 1111
In [28]:
df[df["action_type"] == 2][["user_id","seller_id", "time_stamp"]].groupby('user_id').apply(lambda x: x.sort_values('time_stamp')).rename(columns={'user_id':'user'}).reset_index()["seller_id"].head(10)
Out[28]:
0     650
1    3948
2    4644
3    1435
4     740
5     740
6    2445
7     184
8    4129
9     184
Name: seller_id, dtype: int64
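The word2vec training itself is not shown in this notebook. A rough sketch of what it might look like on the per-user seller sequences, assuming gensim (which is not imported above):

# Hypothetical sketch: train word2vec on the per-user seller sequences and
# average the seller vectors into a per-user embedding. gensim is an
# assumption here; it is not used elsewhere in this notebook.
from gensim.models import Word2Vec

purchases = df[df["action_type"] == 2].sort_values("time_stamp")
# one "sentence" per user: the sellers bought from, in chronological order
seqs = (purchases.groupby("user_id")["seller_id"]
                 .apply(lambda s: [str(v) for v in s])
                 .tolist())

w2v = Word2Vec(seqs, size=32, window=5, min_count=1, workers=4)  # size= in gensim<4, vector_size= in 4.x

def user_vector(sellers):
    # mean of the embeddings of the sellers a user bought from
    return np.mean([w2v.wv[str(s)] for s in sellers], axis=0)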

Removing useless features

For example, features with a single unique value, features with a large fraction of missing data, and features that are highly correlated with other features.


In [29]:
def select_features_without_label(features: pd.DataFrame, missing_threshold=0.7, correlation_threshold=0.95) -> pd.DataFrame:
    # Drop features that are mostly missing, constant, or highly collinear,
    # using only FeatureSelector's label-free methods.
    fs = FeatureSelector(data=features)
    fs.identify_missing(missing_threshold)
    fs.identify_single_unique()
    fs.identify_collinear(correlation_threshold)
    return fs.remove(methods=['missing', 'single_unique', 'collinear'])
In [30]:
features_less = select_features_without_label(seller_buy_features)
No labels provided. Feature importance based methods are not available.
8 features with greater than 0.70 missing values.

0 features with a single unique value.

76 features with a correlation magnitude greater than 0.95.

Removed 82 features.

Algorithm selection

A linear Logistic Regression classifier, XGBoost, and a DNN, among others, were compared; XGBoost gave the best results.
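The comparison itself is not reproduced in this notebook. A minimal sketch of how the candidates could be scored against one another with cross-validated AUC (X and y are placeholder names for the assembled feature matrix and labels):

# Hypothetical comparison sketch: cross-validated AUC for two of the candidates.
# X and y are placeholders for the feature matrix and the repeat-purchase labels.
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

for name, clf in [("logistic regression", LogisticRegression()),
                  ("xgboost", XGBClassifier(max_depth=3, n_estimators=200))]:
    scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=5)
    print(name, scores.mean())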


Approximate XGBoost parameters used

In [ ]:
param = {'booster': 'gbtree', 'objective': 'binary:logistic', 'eval_metric': 'auc', 'max_depth': 3, 'lambda': 10,
         'subsample': 0.80, 'colsample_bytree': 0.75, 'min_child_weight': 3, 'eta': 0.03, 'seed': 0, 'silent': 1,
         "gamma": 1}

evallist = [(dtrain, 'train'), (dvalid, 'eval')]
num_round = 2000
bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=30)
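dtrain and dvalid are not defined in the cell above. Assuming a feature matrix X and labels y (placeholder names), they could be built roughly like this:

# Hypothetical sketch of how dtrain / dvalid above could be constructed.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)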

Approximate DNN parameters

In [ ]:
import keras
from keras.layers import BatchNormalization, Activation

periods = 15
epochs = 90
DEPTH = 7

inputs = keras.Input(shape=(621,), name="input")
x = inputs

for i in range(DEPTH):
    x = keras.layers.Dense(100)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = keras.layers.Dropout(0.5)(x)

output = keras.layers.Dense(1, activation="sigmoid", name='output')(x)

model = keras.Model(inputs=[inputs], outputs=[output])
optimizer = keras.optimizers.Adam(decay=0.2, amsgrad=True)
model.compile(optimizer=optimizer, loss="mean_squared_error", metrics=['accuracy'])
# train_pair / valid_pair, y_result_train / y_result_test and tb_callback are prepared elsewhere
for i in range(periods):
    model.fit(train_pair, y_result_train, shuffle=True, batch_size=256, validation_data=(valid_pair, y_result_test),
              epochs=int(epochs / periods), callbacks=[tb_callback], class_weight={1: 0.938, 0: 0.062})

Fine-tuning

Mainly tuning the hyperparameters and features; random search was used for the hyperparameter search.

In [ ]:
import scipy.stats as st

one_to_left = st.beta(10, 1)  
from_zero_positive = st.expon(0, 50)

params = {  
    "n_estimators": st.randint(20, 200),
    "max_depth": st.randint(1, 10),
    "learning_rate": st.uniform(0.01, 0.4),
    "colsample_bytree": one_to_left,
    "subsample": one_to_left,
    "gamma": st.uniform(0, 20),
    'reg_alpha': from_zero_positive,
    "min_child_weight": from_zero_positive,
}

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

xgbcla = XGBClassifier(n_jobs=-1, booster='gbtree', objective='binary:logistic', eval_metric='auc', silent=1)

train_no_duplicate = train_log.drop_duplicates(subset=["user_id", "seller_id"], keep='first')

gs = RandomizedSearchCV(xgbcla, params, n_jobs=2)  
gs.fit(train_no_duplicate.drop(columns=["label"] + drop_list).values, train_no_duplicate["label"])
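
Once the search has finished, the selected configuration can be read off the fitted search object (minimal usage sketch):

# Best hyperparameters found by the random search and the corresponding CV score.
print(gs.best_params_)
print(gs.best_score_)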