Text classification, like image classification, is a very common classification task: it maps a variable-length text sequence to a text category. This section focuses on sentiment analysis, classifying movie reviews as expressing positive or negative sentiment.
Preparing the Dataset
The first step is to prepare the dataset. Here we use the Large Movie Review Dataset v1.0 (aclImdb), which contains movie reviews together with binary sentiment labels. The label distribution is balanced: half positive and half negative, plus some unlabeled reviews intended for unsupervised learning. Ratings are on a 10-point scale; reviews with a score >= 7 are labeled positive and reviews with a score <= 4 are labeled negative.
To download the dataset, you can use the built-in function:
import d2lzh as d2l
d2l.download_imdb(data_dir='data')
Or download it manually: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Although the archive is only about 80 MB, the automatic download is very slow, so I still recommend downloading it with Xunlei (Thunder) and extracting it manually (the download function above also extracts the archive automatically).
Let's first see what this dataset contains; a screenshot of the directory on my machine is shown below:


You can see the train and test sets, each of which contains neg and pos reviews, i.e. negative and positive reviews respectively:

Each text file is one review, and the file name has the form id_rating; for example, 200_8.txt in the figure above is review id 200 with a rating of 8.
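For reference, the id and rating can be pulled out of such a file name with a couple of string operations; parse_review_filename below is just a hypothetical helper for illustration, not part of d2lzh:

import os

def parse_review_filename(path):
    """Split a file name like '200_8.txt' into (id, rating)."""
    name = os.path.splitext(os.path.basename(path))[0]   # '200_8'
    review_id, rating = name.split('_')
    return int(review_id), int(rating)

print(parse_review_filename('200_8.txt'))   # (200, 8)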
There are also .feat files, shown in the figure below:

These .feat files are in LIBSVM format, an ASCII sparse-vector format for labeled data. Take the 200th review underlined in red in the figure: what do the numbers after the leading 8 mean?
8 0:5 1:2 3:1 4:2 6:4 7:7 8:4 9:2 10:2 11:3 16:1 17:3 ... ...
The leading 8 is the label (the review's rating), 0:5 means the first word in the vocabulary appears 5 times, 1:2 means the second word appears twice, and so on.
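As a quick illustration, here is a minimal sketch of parsing one such LIBSVM line into a rating and a bag-of-words dictionary (the line string is simply the example above; parse_feat_line is a made-up helper):

def parse_feat_line(line):
    """Parse one LIBSVM-formatted line: '<label> <index>:<count> ...'."""
    parts = line.split()
    rating = int(parts[0])              # leading value is the label, i.e. the rating
    counts = {}
    for item in parts[1:]:
        idx, cnt = item.split(':')
        counts[int(idx)] = int(cnt)     # word index -> number of occurrences
    return rating, counts

rating, counts = parse_feat_line('8 0:5 1:2 3:1 4:2 6:4 7:7 8:4 9:2 10:2 11:3 16:1 17:3')
print(rating)      # 8
print(counts[0])   # 5 -> the first word appears 5 times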
Next we read the training and test sets with the built-in read_imdb function. Note that this built-in function expects a particular directory layout: move the whole aclImdb directory into the data directory one level up, e.g. on my machine it lives at D:\data\aclImdb.
train_data, test_data = d2l.read_imdb("train"), d2l.read_imdb("test")
print(train_data[1])
'''
(pygpu) D:\DOG-BREED>python test.py
["i went to this movie expecting an artsy scary film. what i got was scare after scare. it's a horror film at it's core. it's not dull like other horror films where a haunted house just has ghosts and gore. this film doesn't even show you the majority of the deaths it shows the fear of the characters. i think one of the best things about the concept where it's not just the house thats haunted its whoever goes into the house. they become haunted no matter where they are. office buildings, police stations, hotel rooms... etc. after reading some of the external reviews i am really surprised that critics didn't like this film. i am going to see it again this week and am excited about it.<br /><br />i gave this film 10 stars because it did what a horror film should. it scared the s**t out of me.", 1]
'''
The return value is a list whose elements are a review plus its positive/negative label. This particular review praises the horror film, and the trailing 1 marks it as a positive example.
The source of the two functions above, from ../envs/pygpu/Lib/site-packages/d2lzh/utils.py:
def download_imdb(data_dir='../data'):
    """Download the IMDB data set for sentiment analysis."""
    url = ('http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz')
    sha1 = '01ada507287d82875905620988597833ad4e0903'
    fname = gutils.download(url, data_dir, sha1_hash=sha1)
    with tarfile.open(fname, 'r') as f:
        f.extractall(data_dir)

def read_imdb(folder='train'):
    """Read the IMDB data set for sentiment analysis."""
    data = []
    for label in ['pos', 'neg']:
        folder_name = os.path.join('../data/aclImdb/', folder, label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '').lower()
                data.append([review, 1 if label == 'pos' else 0])
    random.shuffle(data)
    return data
Preprocessing the Dataset
Once the training and test sets read correctly, we tokenize the reviews. Tokenization here splits on spaces, using the built-in get_tokenized_imdb function, which also lowercases the tokens.
def get_tokenized_imdb(data):
    """Get the tokenized IMDB data set for sentiment analysis."""
    def tokenizer(text):
        return [tok.lower() for tok in text.split(' ')]
    return [tokenizer(review) for review, _ in data]
Then we build a Vocabulary from the tokenized training set, filtering out words that appear fewer than 5 times (min_freq=5).
def get_vocab_imdb(data):
    """Get the vocab for the IMDB data set for sentiment analysis."""
    tokenized_data = get_tokenized_imdb(data)
    counter = collections.Counter([tk for st in tokenized_data for tk in st])
    return text.vocab.Vocabulary(counter, min_freq=5)

tokenized_data = d2l.get_tokenized_imdb(train_data)
vocab=d2l.get_vocab_imdb(train_data)
print(len(vocab))#46151
After filtering out low-frequency words, the vocabulary contains 46,151 tokens. The returned variable vocab is of type mxnet.contrib.text.vocab.Vocabulary, and we can inspect its attributes and methods:
dir(mxnet.contrib.text.vocab.Vocabulary)
'''
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_index_counter_keys', '_index_unknown_and_reserved_tokens', 'idx_to_token', 'reserved_tokens', 'to_indices', 'to_tokens', 'token_to_idx', 'unknown_token']
'''
print(vocab.idx_to_token[1])#the
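The to_indices and to_tokens methods convert between tokens and integer ids in both directions. The exact ids below are illustrative and depend on the corpus (except index 1, which we just saw is 'the'; out-of-vocabulary words typically map to the unknown index 0):

print(vocab.to_indices(['the', 'movie']))      # e.g. [1, 23]
print(vocab.to_tokens([1]))                    # ['the']
print(vocab.to_indices('someveryrareword'))    # typically 0, the unknown-token index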
Because the reviews have different lengths, they cannot be combined directly into mini-batches. A helper function fixes every review to a length of 500, truncating longer reviews and padding shorter ones with '<pad>'. This function, preprocess_imdb, is also included in the d2lzh package.
features, labels = d2l.preprocess_imdb(train_data, vocab)
print(features.shape, labels.shape)#(25000, 500) (25000,)
From the shapes we can see that every review has been fixed to a length of 500.
print(features)
'''
[[5.0000e+00 5.3200e+02 0.0000e+00 ... 0.0000e+00 0.0000e+00 0.0000e+00]
 [2.0100e+02 5.4810e+03 4.2891e+04 ... 1.6000e+01 2.9200e+02 1.1000e+01]
 [0.0000e+00 0.0000e+00 3.6000e+01 ... 0.0000e+00 0.0000e+00 0.0000e+00]
 ...
 [9.0000e+00 2.2600e+02 3.0000e+00 ... 0.0000e+00 0.0000e+00 0.0000e+00]
 [2.8690e+03 1.2220e+03 1.4000e+01 ... 1.1538e+04 5.2700e+02 2.9000e+01]
 [9.0000e+00 1.9900e+02 1.2108e+04 ... 0.0000e+00 0.0000e+00 0.0000e+00]]
<NDArray 25000x500 @cpu(0)>
'''
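As a quick sanity check, you can map the indices of one padded row back to tokens; this is only a sketch, assuming index 0 doubles as both the unknown token and the padding value here:

first_row = features[0].asnumpy().astype(int).tolist()
print(vocab.to_tokens(first_row[:5]))   # the first few tokens of the first review
print(first_row.count(0))               # how many positions are padding / unknown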
Here is the source code:
def preprocess_imdb(data, vocab):
    """Preprocess the IMDB data set for sentiment analysis."""
    max_l = 500

    def pad(x):
        return x[:max_l] if len(x) > max_l else x + [0] * (max_l - len(x))

    tokenized_data = get_tokenized_imdb(data)
    features = nd.array([pad(vocab.to_indices(x)) for x in tokenized_data])
    labels = nd.array([score for _, score in data])
    return features, labels
If you try to look up the index of '<pad>', print(vocab.token_to_idx['<pad>']) raises an error:
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    print(vocab.token_to_idx['<pad>'])
KeyError: '<pad>'
So when creating the Vocabulary, specify reserved_tokens=['<pad>'] to reserve this token:
def get_vocab_imdb(data):
    """Get the vocab for the IMDB data set for sentiment analysis."""
    tokenized_data = d2l.get_tokenized_imdb(data)
    counter = collections.Counter([tk for st in tokenized_data for tk in st])
    return text.vocab.Vocabulary(counter, min_freq=5, reserved_tokens=['<pad>'])
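With the reserved token in place, the lookup no longer raises a KeyError. In mxnet.contrib.text the unknown token usually sits at index 0 and reserved tokens follow it, so the value below is only what I would expect, not guaranteed:

vocab = get_vocab_imdb(train_data)
print(vocab.token_to_idx['<pad>'])   # e.g. 1, right after the unknown token at index 0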
Creating the Data Iterators
With the dataset in shape, we build the data iterators; each iteration returns one mini-batch of data. With 25,000 training examples and a batch size of 64, that comes to 391 batches.
batch_size = 64
#train_set = gdata.ArrayDataset(*d2l.preprocess_imdb(train_data, vocab))
train_set = gdata.ArrayDataset(features, labels)
test_set = gdata.ArrayDataset(*d2l.preprocess_imdb(test_data, vocab))
train_iter = gdata.DataLoader(train_set, batch_size, shuffle=True)
test_iter = gdata.DataLoader(test_set, batch_size)
print(len(train_iter))
for X, y in train_iter:
    print(X.shape, y.shape)
    break
'''
391
(64, 500) (64,)
'''
Creating the RNN Model
Once the data iterators test out fine, the next step is to choose a recurrent neural network model and see how well it performs.
First, each word is embedded, i.e. turned into a feature vector by an embedding layer. We then use a bidirectional recurrent neural network to further encode the feature sequence and obtain sequence information, and finally a fully connected layer transforms the encoded sequence into the output.
Concretely, we concatenate the hidden states of the bidirectional LSTM at the initial and final time steps and pass this as the representation of the feature sequence to the output layer for classification. In the BiRNN class implemented below, the Embedding instance is the embedding layer, the LSTM instance is the hidden layer that encodes the sequence, and the Dense instance is the output layer that produces the classification result.
class BiRNN(nn.Block):
    def __init__(self, vocab, embed_size, num_hiddens, num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        # Word embedding layer
        self.embedding = nn.Embedding(input_dim=len(vocab), output_dim=embed_size)
        # Setting bidirectional=True makes this a bidirectional recurrent network
        self.encoder = rnn.LSTM(hidden_size=num_hiddens, num_layers=num_layers,
                                bidirectional=True, input_size=embed_size)
        self.decoder = nn.Dense(2)

    def forward(self, inputs):
        # The LSTM expects the sequence length (number of words) as the first dimension,
        # so inputs of shape (batch size, number of words) must be transposed
        embeddings = self.embedding(inputs.T)
        outputs = self.encoder(embeddings)
        # Concatenate the hidden states of the initial and final time steps
        # as the input to the fully connected output layer
        encoding = nd.concat(outputs[0], outputs[-1])
        outs = self.decoder(encoding)
        return outs

# Create a bidirectional recurrent network with 2 hidden layers
embed_size, num_hiddens, num_layers, ctx = 100, 100, 2, d2l.try_all_gpus()
net = BiRNN(vocab=vocab, embed_size=embed_size, num_hiddens=num_hiddens, num_layers=num_layers
)
net.initialize(init.Xavier(), ctx=ctx)
#print(net)
'''
BiRNN(
  (embedding): Embedding(46152 -> 100, float32)
  (encoder): LSTM(100 -> 100, TNC, num_layers=2, bidirectional)
  (decoder): Dense(None -> 2, linear)
)
'''
The LSTM (long short-term memory) update equations are as follows (from the rnn.LSTM source):
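These are the standard LSTM gate equations, where σ is the sigmoid function and ⊙ denotes element-wise multiplication:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$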
Training the Model
Because the training set for sentiment classification is not very large, the model is prone to overfitting, so here we use the pretrained glove.6B.100d.txt word vectors as the feature vector of each word.
Note that the chosen pretrained word vectors have dimension 100, which must match the embedding layer's output size embed_size in the model we created, and these word vectors are no longer updated during training.
glove_embedding = text.embedding.create("glove", pretrained_file_name="glove.6B.100d.txt", vocabulary=vocab
)
net.embedding.weight.set_data(glove_embedding.idx_to_vec)
net.embedding.collect_params().setattr('grad_req', 'null')

lr, num_epochs = 0.01, 5
trainer=gluon.Trainer(net.collect_params(),'adam',{'learning_rate':lr})
loss=gloss.SoftmaxCrossEntropyLoss()
d2l.train(train_iter, test_iter, net, loss, trainer, ctx, num_epochs)

print(d2l.predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'good']))
print(d2l.predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'bad']))
'''
training on [gpu(0)]
epoch 1, loss 0.6553, train acc 0.605, test acc 0.738, time 65.4 sec
epoch 2, loss 0.4273, train acc 0.807, test acc 0.809, time 65.4 sec
epoch 3, loss 0.3514, train acc 0.851, test acc 0.849, time 65.5 sec
epoch 4, loss 0.3054, train acc 0.874, test acc 0.859, time 65.6 sec
epoch 5, loss 0.2765, train acc 0.887, test acc 0.843, time 65.6 sec
positive
negative
'''
The source of the prediction function is as follows:
def predict_sentiment(net, vocab, sentence):
    """Predict the sentiment of a given sentence."""
    sentence = nd.array(vocab.to_indices(sentence), ctx=try_gpu())
    label = nd.argmax(net(sentence.reshape((1, -1))), axis=1)
    return 'positive' if label.asscalar() == 1 else 'negative'
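As a quick usage sketch, an arbitrary sentence can be split on whitespace before being passed in; the sentence below is made up, and the prediction of course depends on the trained model:

tokens = 'the plot was dull and the acting was terrible'.split(' ')
print(d2l.predict_sentiment(net, vocab, tokens))   # likely 'negative' for a review like this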