NLP Sentiment Analysis with an LSTM Model, Part 2

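This is another model I found on CSDN. It calls the LSTM layer through Keras and runs to a little over 70 lines. What caught my eye was the prediction interface at the very bottom, so I decided to set it up and have a look~
[Link] Chinese Sentiment Classification (Based on LSTM)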

import pandas as pd
from tensorflow.contrib import learn
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential, load_model
from keras.layers import Dense
import numpy as np
import os

MAX_DOCUMENT_LEN = 30
EMBEDDING_SIZE = 100

def train():
    train = pd.read_csv("all.csv", encoding="utf-8")
    all_word = set([word for seq in train.content.values for word in list(seq)])

    if os.path.exists("processor"):
        processor = learn.preprocessing.VocabularyProcessor.restore("processor")
    else:
        processor = learn.preprocessing.VocabularyProcessor(max_document_length=MAX_DOCUMENT_LEN)
        processor.fit(all_word)
        processor.save("processor")

    # Batch generator for fit_generator: yields (x, y) batches forever
    def get_generate(batch_size):
        while True:

            sample_data = train.sample(batch_size)
            labels = sample_data.label.values
            contents = sample_data.content.values

            contents = [list(seq) for seq in contents]
            x = []
            for seq in contents:
                item_str = ""
                for word in seq:
                    item_str += " " + str(word)
                x.append(item_str)

            x = np.array(list(processor.transform(x)))
            y = np.array([[1, 0] if int(item) == 0 else [0, 1] for item in labels])
            yield x, y

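    # Build the model: embedding -> LSTM -> softmax over the two classes
    model = Sequential()
    # Size the embedding from the processor's vocabulary, which also counts the reserved unknown-token id 0
    model.add(Embedding(input_dim=len(processor.vocabulary_), output_dim=EMBEDDING_SIZE))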
    model.add(LSTM(units=128, activation="relu"))
    model.add(Dense(2, activation="softmax", name="logit"))

    model.compile(loss='categorical_crossentropy', optimizer='sgd')

    # Keras 2 counts batches per epoch; samples_per_epoch is the legacy Keras 1 name
    model.fit_generator(get_generate(50), steps_per_epoch=10000, epochs=10)
    model.save("lstm.model")

def pre():
    text = input("Enter text: ")
    text = list(text)
    item_str = ""
    for item in text:
        item_str += " " + str(item)

    processor = learn.preprocessing.VocabularyProcessor.restore("processor")
    input_data = np.array(list(processor.transform([item_str])))
    print(input_data)
    model = load_model("lstm.model")
    target = model.predict(input_data)
    print(target)
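
Both train() and pre() turn a sentence into space-separated characters before handing it to the vocabulary processor. A minimal sketch of that step on its own follows. The helper name to_char_tokens is made up for illustration, and unlike the loops above it does not emit a leading space.

```python
def to_char_tokens(text):
    """Join the characters of `text` with single spaces so each one is a separate token."""
    return " ".join(list(text))

# Made-up example sentence; each character becomes its own whitespace-delimited token.
print(to_char_tokens("屏幕很好"))
```

Keeping this in one helper would let train() and pre() share exactly the same preprocessing.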

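First, when the code is pasted into an interactive Python session, a blank line inside a function ends the function definition, so the blank lines and indentation inside each function need to be checked again. (Saved to a .py file and run as a script, blank lines inside a function are harmless.)
Second, the dataset must have a header row and be labeled: the first column is the text and the second column is its sentiment label. The code reads these two columns as content and label.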
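
A quick count of the labels before training would catch a file that contains only one class. This is a minimal sketch using only the standard library. It assumes the header names content and label that the training code reads (train.content, train.label). The helper name label_counts and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file):
    """Count how many rows carry each label in a CSV that has a `label` column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"].strip() for row in reader)

# Made-up sample in the same layout as all.csv: header row, then text and label.
sample = io.StringIO("content,label\n屏幕很好,1\n太贵了,0\n一般般,0\n")
counts = label_counts(sample)
print(counts)

if len(counts) < 2:
    print("warning: only one class present, the model has nothing to separate")
```

For the real file, the same function can be called on open("all.csv", encoding="utf-8", newline="").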
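So I took my own preprocessed CSV of Weibo comments on the new iPad release, modified it slightly (added a header row and filled the second column with 0), and fed it in. At the time I gave no thought to what the second column meant: the screenshot in the original post showed nothing but 0s, so I assumed it was a leftover from the author's preprocessing and simply copied it. That caused the problem described later.
Third, "from tensorflow.contrib import learn" raised an error. From what I could find, the learn module seemed to depend on the GCC compiler, so I installed GCC/G++ and set up the environment variables, after which the import no longer failed.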

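Training then ran without trouble (I forgot to take a screenshot...). The loss fell very quickly, from 20 to 3 and then to 0.16, and by the end of epoch 1 it was already 0.0024. Convergence that fast looked suspicious...
Sure enough, in epoch 2 the loss settled at the order of 10^-7 and did not drop another order of magnitude for the rest of training.
Each epoch ran through the 10000 set in fit_generator and took about 4 to 5 minutes, so the whole run took close to an hour. Well, this is what training a model is supposed to feel like. Maybe the first LSTM finished in ten minutes because its 2000 words were too few?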

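When I tried the prediction function, the first thing I noticed was that it prints a tensor rather than a positive/negative label, although [1.0, -0.00..] is enough to tell the two apart. But when I entered clearly negative text, the output was still [1.0, -0.00..]. The 12-element array printed before it comes from print(input_data), so it holds the vocabulary index of each character rather than a per-character sentiment score (which is what I first took it for). So why did the output tensor never change??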
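
To get a label instead of a raw tensor, the position of the largest value in the output can be taken. This is a minimal sketch in plain Python. The helper name predicted_label and the probability rows are made up for illustration. The index-to-label mapping follows the one-hot encoding in the generator, where label 0 becomes [1, 0] and any other label becomes [0, 1].

```python
def predicted_label(probs):
    """Return the index of the largest value in one softmax output row (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Made-up rows of the shape model.predict returns for a single input.
print(predicted_label([0.98, 0.02]))  # index 0 corresponds to label 0
print(predicted_label([0.30, 0.70]))  # index 1 corresponds to label 1
```

With a real prediction, the row to pass in would be target[0].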
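Thinking it over: for one thing, the "new iPad release, Weibo comments" dataset was one I had processed myself, so its overall quality may be poor, and the targets and polarity in it are not very distinct to begin with.
Then I went back and opened all.csv, and it suddenly hit me that every label I had written was 0. Well......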
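
With every label equal to 0, every target is [1, 0], so the model only has to output [1, 0] for any input. That is consistent with both the constant prediction and the loss stalling around 10^-7. The sketch below assumes Keras clips predicted probabilities to [epsilon, 1 - epsilon] with its default epsilon of 1e-7 before taking the log. The function name clipped_cross_entropy is made up for illustration and is not the Keras implementation.

```python
import math

EPSILON = 1e-7  # assumption: Keras's default backend epsilon

def clipped_cross_entropy(y_true, y_pred, eps=EPSILON):
    """Categorical cross-entropy for one sample, with predictions clipped to [eps, 1 - eps]."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in y_pred]
    return -sum(t * math.log(p) for t, p in zip(y_true, clipped))

# A perfectly confident prediction of class 0 against the target [1, 0].
loss_floor = clipped_cross_entropy([1, 0], [1.0, 0.0])
print(loss_floor)
```

Under that assumption the loss cannot go much below 10^-7 however long training runs, which matches what the log showed from epoch 2 onward.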