Yelper: Sentiment Analysis of Review Text

2018-08-09

Notebooks:

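If the stars value in a review represents a user's coarse-grained rating of a business, then the review text he or she writes represents a fine-grained one. Sentiment analysis of the review text gives us an additional, finer-grained angle for judging how good a business is.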

Sentiment Analysis EDA

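Before the actual sentiment analysis, we ran an exploratory data analysis (EDA) on a Las Vegas business named Mon Ami Gabi. The words users choose clearly reflect their opinion of the business. For example, the 10 most positive and 10 most negative words used are as follows (the larger the value, the more positive; the smaller, the more negative):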

The x-axis is the sentiment score (Sentiment); the y-axis is the word (Word).

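In addition, the overall sentiment of the review text also reflects users' attitudes, as in the 5 most positive and 5 most negative reviews:

The 5 most positive reviews: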

Loved the chicken =] very good
Great food and awesome views
Amazing awesome lovely So nice to find a gem - had a lovely breakfast here. Very reasonably priced, excellent food.
Very good, excellent service , Steak Frite is outstanding . Mussels are very good . Excellent location.
Delicious seafood. Great service. Awesome cocktails!

The 5 most negative reviews:

The food was decent. Servicer was terrible. They charge me for two bottles I stole ….. Wtf .
Do the patio and the bloody Mary bar.
Very bad Had nothing to do with French food. Except for few French words on the menus maybe!!! What a shame…
This is the most horrible restaurant in the strip. Really bad service , the wine was warm .. The waiter was super rude .. And the ladies on the entrance when we reported what happens they just stay quite. Horrible. Food also expensive and really bad
Worst food I ever had. Complained to the staff, they were rude. Called the manager over, he was even more rude and basically didn’t believe me that the food was bad. Needless to say, I ended up in urgent care and was sick all night long. Never coming back here again. The only thing French about this restaurant is the attitude of its staff.

(On closer inspection, the negative reviews are noticeably longer than the positive ones.)

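The rest of this post uses two methods for sentiment analysis of review text. The first is based on the Afinn library; it has no explicit learning step and serves as a baseline for comparison with the second. The second is based on a CNN model and involves actual training. Both are described below.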

Building the Dataset

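One open question is how to decide whether a review is positive or negative. Here, we label the text of reviews with a stars value of 5 as positive and the text of reviews with a stars value of 1 or 2 as negative. Reviews with a stars value of 3 or 4 are middling, cannot be cleanly labeled either way, and are simply discarded. Note that we did not use the time-filtered stars values as the measure of positive or negative, because the sentiment of the text has almost nothing to do with time.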

This gives us a dataset of inputs (text) with corresponding classes (positive or negative). For the detailed processing, see the notebook Review_Text_Sentiment_Analysis.
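The labeling rule above can be sketched in a few lines of plain Python. This is only an illustration: the function names `label_review` and `build_dataset`, the dictionary layout of a review, and the sample records are all assumptions made here, not code from the notebook.

```python
def label_review(stars):
    """Return 1 (positive), 0 (negative), or None (discarded) for a star rating."""
    if stars == 5:
        return 1
    if stars in (1, 2):
        return 0
    return None  # 3- and 4-star reviews are treated as middling and dropped


def build_dataset(reviews):
    """Turn an iterable of {'text', 'stars'} dicts into parallel text/label lists."""
    texts, labels = [], []
    for review in reviews:
        label = label_review(review["stars"])
        if label is not None:
            texts.append(review["text"])
            labels.append(label)
    return texts, labels


# Hypothetical sample records, made up for illustration.
sample = [
    {"text": "Great food and awesome views", "stars": 5},
    {"text": "It was fine.", "stars": 3},
    {"text": "Worst food I ever had.", "stars": 1},
    {"text": "Pretty good overall.", "stars": 4},
    {"text": "Really bad service.", "stars": 2},
]
texts, labels = build_dataset(sample)
print(labels)  # [1, 0, 0]
```

Dropping the 3- and 4-star reviews shrinks the dataset but keeps the two classes clearly separated, which is the trade-off the text describes.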

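Sentiment Analysis with Afinn

Afinn is a wordlist-based sentiment analysis library that can directly compute the sentiment score of a piece of text or a single word. Its official description reads: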

AFINN sentiment analysis in Python: Wordlist-based approach for sentiment analysis.

For the detailed computation, see the notebook Review_Text_Sentiment_Analysis. However, this method reaches an accuracy of only 0.5124, about the same as random guessing.
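To make the wordlist idea concrete, here is a minimal standard-library sketch of how such a baseline could work. It does not use the Afinn package: the tiny `TOY_WORDLIST`, its valence values, and the "score above 0 means positive" decision rule are all assumptions for illustration, and the notebook's actual rule may differ.

```python
import re

# Tiny illustrative wordlist (word -> valence). These entries are assumptions
# for the demo, not the real AFINN list.
TOY_WORDLIST = {
    "great": 3, "awesome": 4, "excellent": 3, "delicious": 3,
    "bad": -3, "terrible": -3, "horrible": -3, "rude": -2, "worst": -3,
}


def sentiment_score(text, wordlist=TOY_WORDLIST):
    """Sum the valence of every known word in the text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(wordlist.get(word, 0) for word in words)


def predict_label(text, threshold=0):
    """Predict 1 (positive) if the score exceeds the threshold, else 0 (negative)."""
    return 1 if sentiment_score(text) > threshold else 0


def accuracy(texts, labels):
    """Fraction of texts whose predicted label matches the true label."""
    correct = sum(predict_label(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


print(sentiment_score("Great food and awesome views"))  # 7
print(sentiment_score("Really bad service, the waiter was super rude"))  # -5
```

Because there is nothing to train, the only tunable piece in a baseline like this is the threshold, and words missing from the list contribute nothing to the score.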

Sentiment Analysis with a CNN

This part is our focus. Let's see how the CNN crushes the previous method...

Text Classification with a CNN

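Most of the time we apply CNNs to image processing, but CNNs are just as powerful for text classification! The difference is that images are two-dimensional, while text is one-dimensional.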

The architecture we follow here comes from the paper A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification:

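If you have studied 2D CNNs before (diagram), you will notice that this architecture has the same overall structure: input layer + convolutional layer (with activation function) + pooling layer + fully connected layer (including the output layer).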

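In the figure above:

The input layer is a 7*5 matrix in which each row vector represents one word (such as like). The CNN is "one-dimensional" in the sense that the row vectors (words) are the basic units, stacked along a single axis, so the kernels slide in that one direction only.
The convolutional layer uses 3 kernel sizes with 2 kernels each; convolving the input with these kernels yields 6 feature maps in 3 sizes.
The pooling layer uses max pooling to take the largest value from each feature map; these values are assembled into a column vector of 6 features and fed to the fully connected layer.
The fully connected layer (including the output layer) uses the softmax function to compute the classification result.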

The explanation here may not be very clear; the paper's own caption explains it as follows:

Illustration of a Convolutional Neural Network (CNN) architecture for sentence classification. Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states.
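The shapes in the caption can be checked with a small pure-Python sketch: a 7*5 sentence matrix, region sizes 2, 3 and 4 with 2 filters each, then 1-max pooling. The random weights, the omission of bias terms, and the choice of ReLU as the activation are assumptions made here for illustration; only the sizes come from the caption.

```python
import random


def conv_feature_map(sentence, kernel):
    """Slide a (region_size x dim) kernel down the rows of the sentence matrix.

    Returns one ReLU-activated value per window position (no padding, no bias).
    """
    region_size = len(kernel)
    feature_map = []
    for start in range(len(sentence) - region_size + 1):
        window = sentence[start:start + region_size]
        total = sum(
            w * x
            for kernel_row, word_row in zip(kernel, window)
            for w, x in zip(kernel_row, word_row)
        )
        feature_map.append(max(0.0, total))  # ReLU (an assumed activation)
    return feature_map


rng = random.Random(0)
n_words, dim = 7, 5  # 7 words, each a 5-dimensional vector
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]

feature_maps = []
for region_size in (2, 3, 4):
    for _ in range(2):  # 2 filters per region size
        kernel = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(region_size)]
        feature_maps.append(conv_feature_map(sentence, kernel))

# 1-max pooling: keep only the largest value of each feature map.
pooled = [max(fm) for fm in feature_maps]

print([len(fm) for fm in feature_maps])  # [6, 6, 5, 5, 4, 4]
print(len(pooled))  # 6
```

Each feature map has length 7 - region_size + 1, which is why the maps are "variable-length" before pooling and why pooling is needed to get a fixed-size vector of 6 features.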

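Training and Testing

With the method in hand, the next step is implementation. We use Keras, a deep learning framework that is relatively easy to pick up. The implementation involves quite a few details (such as converting text strings into integer sequences); please see the notebook Review_Text_Sentiment_Analysis. The core code is shown here. It follows the same layer sequence as the 1D CNN described above: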

from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import Dense, Activation
from keras.layers import Conv1D, GlobalMaxPooling1D

# Embedding layer
input_dim = vacabulary_size # vocabulary size
output_dim = 128 # size of each word's embedding vector
input_length = maxlen # length of each input sequence, i.e. the number of word indices per review

# Convolution layer
kernel_size = 5 # kernel size
filters = 64 # number of kernels

# Training parameters
batch_size = 30
epochs = 2

# Define the Sequential model
cnn_model = Sequential()

# Turn the input word indices into dense vectors of shape (batch_size, input_length, output_dim)
cnn_model.add(Embedding(input_dim, output_dim, input_length=input_length))

# Add the convolution layer
cnn_model.add(Conv1D(filters,
                     kernel_size,
                     padding='valid',
                     activation='relu',
                     strides=1))

# Add the pooling layer
cnn_model.add(GlobalMaxPooling1D())

# Add the fully connected layer
cnn_model.add(Dense(1))
cnn_model.add(Activation('sigmoid'))

# Define the loss function, optimizer, and metric
cnn_model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

# Train
cnn_model.fit(X_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_data=(X_test, y_test))

# Compute the accuracy on the test set
score, acc = cnn_model.evaluate(X_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
# Test score: 0.11281012577892074
# Test accuracy: 0.9573922869095017

This time the accuracy reaches about 0.957!
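The preprocessing detail mentioned earlier (converting text strings into integer sequences) can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's code: the helper names, the regex tokenizer, and the choice to left-pad and left-truncate with 0 are assumptions made here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts):
    """Map each word to an integer index starting at 1, most frequent first.

    Index 0 is reserved for padding.
    """
    counts = Counter(word for text in texts for word in tokenize(text))
    return {word: index for index, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded_sequences(texts, vocabulary, maxlen):
    """Convert texts to lists of word indices, each exactly maxlen long (maxlen > 0)."""
    sequences = []
    for text in texts:
        ids = [vocabulary[word] for word in tokenize(text) if word in vocabulary]
        ids = ids[-maxlen:]  # keep the last maxlen tokens if the text is too long
        sequences.append([0] * (maxlen - len(ids)) + ids)  # left-pad with 0
    return sequences


# Hypothetical sample texts, made up for illustration.
sample_texts = ["Great food", "Great food and great service"]
vocabulary = build_vocabulary(sample_texts)
padded = texts_to_padded_sequences(sample_texts, vocabulary, maxlen=4)
print(padded[0])  # [0, 0, 1, 2]
```

With a mapping like this, the embedding layer's vocabulary size would need to be at least `len(vocabulary) + 1` so that the padding index 0 also has a row.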