Zhi Wang's Machine Learning blog

ideas worth practising

Deep Learning with TensorFlow: A First Taste of Scikit Flow

An Introduction to Scikit Flow

If you have heard of deep learning, you have probably heard of TensorFlow, Google's open-source deep learning framework.
If you have tried to learn or use TensorFlow, you will love Scikit Flow, because it gets you productive with TensorFlow much faster.
And if you have experience with Scikit-learn, the way Scikit Flow works will feel very familiar.
For a detailed introduction, see this blog post: http://terrytangyuan.github.io/2016/03/14/scikit-flow-intro/

A Quick Test Drive

To try it out, I use the data from one of my own projects: a binary classification problem on which a GBM model previously reached an AUC of 0.72.

The Data

For confidentiality reasons the data itself is not shown here. Two things to note:

  1. The same training and test sets are used as in the earlier GBM experiment
  2. The positive and negative classes in the training set are imbalanced

We start directly from the preprocessed data; X_train, y_train, X_test, y_test are already prepared:

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cross_validation import train_test_split
from tensorflow.contrib import learn
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
(30466, 67) (30466,) (7781, 67) (7781,)

Resampling

There are many ways to handle an imbalanced training set; here I simply use under-sampling. If you are interested in other approaches, have a look at the unbalanced_dataset library.

from unbalanced_dataset import UnderSampler
import numpy as np

print("Before under-sampling:")
print(np.count_nonzero(y_train == 1), np.count_nonzero(y_train == 0))

# randomly drop majority-class samples to balance the classes
US = UnderSampler(verbose=False)
usx, usy = US.fit_transform(X_train, y_train)

print("After under-sampling:")
print(np.count_nonzero(usy == 1), np.count_nonzero(usy == 0))
Before under-sampling:
4440 26026
After under-sampling:
4440 3782
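As an aside, the unbalanced_dataset project later evolved into imbalanced-learn; random under-sampling can also be sketched with plain NumPy. The function below is a hypothetical stand-in for UnderSampler, shown on toy data rather than the project's real arrays:

```python
import numpy as np

def random_under_sample(X, y, seed=42):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = np.random.RandomState(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # identify minority and majority classes
    if len(pos_idx) < len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

# toy data: 10 positives, 50 negatives
X = np.arange(120).reshape(60, 2)
y = np.array([1] * 10 + [0] * 50)
usx, usy = random_under_sample(X, y)
print(np.count_nonzero(usy == 1), np.count_nonzero(usy == 0))  # 10 10
```

The seed makes the sampling reproducible; in practice one would also consider class-weighting or over-sampling the minority class instead of discarding data.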

Building a Deep Neural Network Model

  1. First, create a 3-layer deep neural network
  2. Hold out 10% of the training samples as a validation set
  3. Use early stopping to control training; skflow.monitors.ValidationMonitor provides the early_stopping_rounds parameter for this
# a 3-layer DNN with 10, 20 and 10 units per hidden layer
classifier = learn.TensorFlowDNNClassifier(hidden_units=[10, 20, 10],
                                           n_classes=2, steps=5000)

# hold out 10% of the resampled data as a validation set
X_train, X_val, y_train, y_val = train_test_split(usx, usy,
                                                  test_size=0.1,
                                                  random_state=42)
val_monitor = learn.monitors.ValidationMonitor(X_val, y_val,
                                               early_stopping_rounds=200,
                                               n_classes=2)

Passing val_monitor to the fit function enables the early-stopping control:

classifier.fit(X_train, y_train, val_monitor)  # fit on the 90% split so the validation data stays unseen
Step #99, avg. train loss: 0.98097, avg. val loss: nan
Step #199, avg. train loss: 0.53269, avg. val loss: nan
Step #300, epoch #1, avg. train loss: 0.52669, avg. val loss: nan
Step #400, epoch #1, avg. train loss: 0.51898, avg. val loss: nan
Step #500, epoch #1, avg. train loss: 0.53059, avg. val loss: nan
Step #600, epoch #2, avg. train loss: 0.51597, avg. val loss: nan
Step #700, epoch #2, avg. train loss: 0.52027, avg. val loss: nan
Step #800, epoch #2, avg. train loss: 0.50008, avg. val loss: nan
Step #900, epoch #3, avg. train loss: 0.50545, avg. val loss: nan
Step #1000, epoch #3, avg. train loss: 0.50853, avg. val loss: nan

Stopping. Best step:
 step 829 with loss 0.468855381012

TensorFlowDNNClassifier(batch_size=32, class_weight=None, clip_gradients=5.0,
            config=None, continue_training=False, dropout=None,
            hidden_units=[10, 20, 10], learning_rate=0.1, n_classes=2,
            optimizer='Adagrad', steps=5000, verbose=1)

Evaluating the Model

Evaluating on the held-out test set gives an AUC of 0.64:

print("AUC:", metrics.roc_auc_score(y_test, classifier.predict(X_test)))
print("Metrics:")
print(metrics.classification_report(y_test, classifier.predict(X_test)))
AUC: 0.638526423392
Metrics:
             precision    recall  f1-score   support

          0       0.93      0.73      0.82      6941
          1       0.20      0.55      0.29       840

avg / total       0.85      0.71      0.76      7781
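One caveat on the evaluation above: roc_auc_score is given hard 0/1 predictions, which collapses the ROC curve to a single operating point and typically understates the AUC. Scoring on the positive-class probability (TensorFlowDNNClassifier also exposes predict_proba) uses the full ranking. A small sketch with made-up labels and scores illustrates the gap:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
# hypothetical positive-class probabilities from some classifier
y_score = np.array([0.2, 0.4, 0.6, 0.9, 0.55, 0.3])
# hard labels obtained by thresholding at 0.5
y_pred = (y_score >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, y_score)  # uses the full ranking
auc_label = roc_auc_score(y_true, y_pred)   # only one threshold survives
print(auc_proba, auc_label)  # 0.777... vs 0.666...
```

So the 0.64 reported here is likely a lower bound on what the network actually achieves when scored on probabilities.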

Summary

Using Scikit Flow, this post built a 3-layer deep neural network and showed how convenient these tools make it to apply deep learning.
Of course, there is still room to improve the model, for example better handling of the imbalanced training set, and hyperparameter tuning with grid search.
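As a sketch of the grid-search direction, scikit-learn's GridSearchCV can tune any estimator exposing a fit/predict interface; here a GradientBoostingClassifier on synthetic data stands in for the real model and data (the parameter grid and dataset are illustrative, not the project's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
# note: this was sklearn.grid_search in the sklearn version used in this post
from sklearn.model_selection import GridSearchCV

# synthetic binary-classification data as a stand-in for the real project data
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The same pattern applies to the Scikit Flow classifier, since it follows the scikit-learn estimator interface; candidate grids there would cover hidden_units, learning_rate, and steps.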