NLP (18): IMDB Sentiment Analysis with a 1D Convolutional Network

Original link: http://www.one2know.cn/nlp18/

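  • Preparation

    The Keras IMDB dataset contains movie reviews, each encoded as a sequence of word indices, together with the corresponding sentiment label (0 = negative, 1 = positive)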

import pandas as pd
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense,Dropout,Activation
from keras.layers import Embedding
from keras.layers import Conv1D,GlobalAveragePooling1D
from keras.datasets import imdb
from sklearn.metrics import accuracy_score,classification_report

# Parameters: keep only the 6000 most frequent words; maximum length of a single review is 400 tokens
max_features = 6000
max_length = 400
(x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)
print(len(x_train),'train observations')
print(len(x_test),'test observations')

wind = imdb.get_word_index() # maps each word to its integer index, so numbers stand in for words
# Invert the mapping: word => index becomes index => word
revind = dict((v,k) for k,v in wind.items())
print(x_train[0])
print(y_train[0])

def decode(sent_list): # decode with the inverse dictionary: numbers => words
    new_words = []
    for i in sent_list:
        # load_data shifts every index by 3 (0 = padding, 1 = start, 2 = unknown word)
        new_words.append(revind.get(i - 3,'?'))
    comb_words = " ".join(new_words)
    return comb_words
print(decode(x_train[0]))

Output:

25000 train observations
25000 test observations
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, ...]
1
? this film was just brilliant casting location scenery ...
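
The integer encoding returned by `imdb.load_data` reserves a few low indices, which is why a reverse lookup has to account for an offset. Below is a minimal sketch with a toy vocabulary, assuming the Keras defaults `start_char=1`, `oov_char=2` and `index_from=3`. The `toy_word_index` dictionary and the `encode` / `decode_ids` helpers are hypothetical and exist only for illustration; they are not part of Keras.

```python
START, OOV, INDEX_FROM = 1, 2, 3  # assumed Keras defaults for imdb.load_data

# Hypothetical word index in the style of imdb.get_word_index(): word -> frequency rank
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4, "rumbustious": 5}

def encode(words, num_words):
    """Encode words the way load_data does: start marker, shifted ranks, OOV for rare words."""
    ids = [START]
    for w in words:
        idx = toy_word_index[w] + INDEX_FROM
        ids.append(idx if idx < num_words else OOV)
    return ids

def decode_ids(ids):
    """Map shifted indices back to words; reserved indices become '?'."""
    reverse = {rank + INDEX_FROM: w for w, rank in toy_word_index.items()}
    return " ".join(reverse.get(i, "?") for i in ids)

encoded = encode(["the", "film", "was", "rumbustious"], num_words=8)
print(encoded)              # [1, 4, 5, 6, 2]
print(decode_ids(encoded))  # ? the film was ?
```

With `num_words=8` the rare word falls outside the vocabulary and is replaced by the unknown-word index 2, which is what `num_words=max_features` does to rare words in the real dataset.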
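  • How to implement it

    1. Preprocessing: pad or truncate every review to one fixed length

    2. Build and validate the 1D CNN model

    3. Evaluate the model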
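
Step 1 relies on `sequence.pad_sequences`, which by default pads short sequences with zeros at the front and truncates long ones from the front. Below is a pure-Python sketch of that default behaviour. The helper `pad_pre` is hypothetical and only mimics `padding='pre'`, `truncating='pre'`, `value=0`; it is not the Keras implementation.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad each sequence with `value` and keep only its last `maxlen` items."""
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                        # truncate from the front
        out.append([value] * (maxlen - len(s)) + s)  # pad at the front
    return out

padded = pad_pre([[1, 14, 22], [5, 6, 7, 8, 9, 10]], maxlen=4)
print(padded)  # [[0, 1, 14, 22], [7, 8, 9, 10]]
```

After this step every review has the same length, so the whole dataset fits into a single 2D array of shape (number of reviews, maxlen).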

  • Code
import pandas as pd
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense,Dropout,Activation
from keras.layers import Embedding
from keras.layers import Conv1D,GlobalAveragePooling1D
from keras.datasets import imdb
from sklearn.metrics import accuracy_score,classification_report

# Parameters: keep only the 6000 most frequent words; maximum length of a single review is 400 tokens
max_features = 6000
max_length = 400
(x_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)
# print(x_train) # a list of reviews, each one a list of word indices
# print(y_train) # an array of 0s and 1s
# print(len(x_train),'train observations')
# print(len(x_test),'test observations')

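wind = imdb.get_word_index() # maps each word to its integer index, so numbers stand in for words
# Invert the mapping: word => index becomes index => word
revind = dict((v, k) for k, v in wind.items())
# print(x_train[0])
# print(y_train[0])

def decode(sent_list): # decode with the inverse dictionary: numbers => words
    new_words = []
    for i in sent_list:
        # load_data shifts every index by 3 (0 = padding, 1 = start, 2 = unknown word)
        new_words.append(revind.get(i - 3, '?'))
    comb_words = " ".join(new_words)
    return comb_words
# print(decode(x_train[0]))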

# Pad (or truncate) every review to the maximum length of 400 so that all inputs have the same length
x_train = sequence.pad_sequences(x_train,maxlen=max_length)
x_test = sequence.pad_sequences(x_test,maxlen=max_length)
print('x_train.shape:',x_train.shape)
print('x_test.shape:',x_test.shape)

## Keras deep learning framework: 1D CNN model
# Hyperparameters
batch_size = 32
embedding_dims = 60
num_kernels = 260
kernel_size = 3
hidden_dims = 300
epochs = 3
# Build the model
model = Sequential()
model.add(Embedding(max_features,embedding_dims,input_length=max_length))
model.add(Dropout(0.2))
model.add(Conv1D(num_kernels,kernel_size,padding='valid',activation='relu',strides=1))
model.add(GlobalAveragePooling1D())
model.add(Dense(hidden_dims))
model.add(Dropout(0.5))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

model.fit(x_train,y_train,batch_size=batch_size,epochs=epochs,validation_split=0.2)

# Model predictions: threshold the sigmoid output at 0.5 to get class labels
# (predict_classes is not available in newer Keras releases)
y_train_predclass = (model.predict(x_train,batch_size=batch_size) > 0.5).astype('int32').ravel()
y_test_preclass = (model.predict(x_test,batch_size=batch_size) > 0.5).astype('int32').ravel()

print('\n\nCNN 1D - Train accuracy:',round(accuracy_score(y_train,y_train_predclass),3))
print('\nCNN 1D of Training data\n',classification_report(y_train,y_train_predclass))
print('\nCNN 1D - Train Confusion Matrix\n\n',pd.crosstab(y_train,y_train_predclass,
                    rownames=['Actuall'],colnames=['Predicted']))
print('\nCNN 1D - Test accuracy:',round(accuracy_score(y_test,y_test_preclass),3))
print('\nCNN 1D of Test data\n',classification_report(y_test,y_test_preclass))
print('\nCNN 1D - Test Confusion Matrix\n\n',pd.crosstab(y_test,y_test_preclass,
                    rownames=['Actuall'],colnames=['Predicted']))
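
The hyperparameters above fix the output shape and parameter count of every layer, so the figures printed by `model.summary()` can be checked by hand. Below is a small sketch of that arithmetic, using the standard formulas for an embedding layer, a 1D convolution with `padding='valid'` and stride 1, and dense layers. The names of the count variables are chosen here for illustration only.

```python
max_features, max_length = 6000, 400
embedding_dims, num_kernels, kernel_size, hidden_dims = 60, 260, 3, 300

# 'valid' convolution with stride 1 shortens the sequence by kernel_size - 1
conv_out_len = max_length - kernel_size + 1

embedding_params = max_features * embedding_dims                         # one vector per word
conv_params = kernel_size * embedding_dims * num_kernels + num_kernels   # weights + biases
dense1_params = num_kernels * hidden_dims + hidden_dims                  # pooled features -> hidden
dense2_params = hidden_dims * 1 + 1                                      # hidden -> single output
total_params = embedding_params + conv_params + dense1_params + dense2_params

print(conv_out_len)   # 398
print(embedding_params, conv_params, dense1_params, dense2_params)  # 360000 47060 78300 301
print(total_params)   # 485661
```

Dropout, pooling and activation layers have no trainable weights, so they contribute nothing to the total.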

Output:

Using TensorFlow backend.
x_train.shape: (25000, 400)
x_test.shape: (25000, 400)
WARNING:tensorflow:From D:\Python37\Lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 400, 60)           360000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 60)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 260)          47060     
_________________________________________________________________
global_average_pooling1d_1 ( (None, 260)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 300)               78300     
_________________________________________________________________
dropout_2 (Dropout)          (None, 300)               0         
_________________________________________________________________
activation_1 (Activation)    (None, 300)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 301       
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
=================================================================
Total params: 485,661
Trainable params: 485,661
Non-trainable params: 0
_________________________________________________________________
None
WARNING:tensorflow:From D:\Python37\Lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
2019-07-07 15:27:37.848057: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2

   32/20000 [..............................] - ETA: 7:03 - loss: 0.6929 - acc: 0.5000
   64/20000 [..............................] - ETA: 4:13 - loss: 0.6927 - acc: 0.5156
   96/20000 [..............................] - ETA: 3:19 - loss: 0.6933 - acc: 0.5000
  128/20000 [..............................] - ETA: 2:50 - loss: 0.6935 - acc: 0.4844
  160/20000 [..............................] - ETA: 2:32 - loss: 0.6931 - acc: 0.4813
  (the remaining progress output for all epochs is omitted here)
  
CNN 1D - Train accuracy: 0.949

CNN 1D of Training data
               precision    recall  f1-score   support

           0       0.94      0.96      0.95     12500
           1       0.95      0.94      0.95     12500

    accuracy                           0.95     25000
   macro avg       0.95      0.95      0.95     25000
weighted avg       0.95      0.95      0.95     25000

CNN 1D - Train Confusion Matrix

 Predicted      0      1
Actuall                
0          11938    562
1            715  11785

CNN 1D - Test accuracy: 0.876

CNN 1D of Test data
               precision    recall  f1-score   support

           0       0.86      0.89      0.88     12500
           1       0.89      0.86      0.87     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000

CNN 1D - Test Confusion Matrix

 Predicted      0      1
Actuall                
0          11144   1356
1           1744  10756
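
The accuracy, precision and recall in the reports follow directly from the confusion matrices. Below is a short check for the test matrix above, where rows are actual classes and columns are predicted classes. The variable names are illustrative.

```python
# Test confusion matrix: rows = actual class, columns = predicted class
tn, fp = 11144, 1356   # actual 0: predicted 0, predicted 1
fn, tp = 1744, 10756   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_neg = tn / (tn + fn)   # of everything predicted 0, how much is really 0
recall_neg = tn / (tn + fp)      # of everything really 0, how much was found
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

print(round(accuracy, 3))                               # 0.876
print(round(precision_neg, 2), round(recall_neg, 2))    # 0.86 0.89
print(round(precision_pos, 2), round(recall_pos, 2))    # 0.89 0.86
```

The same calculation on the training matrix gives (11938 + 11785) / 25000, about 0.949, so the gap between 0.949 on the training set and 0.876 on the test set indicates some overfitting.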