
nlplmt 2017-05-17 22:53

Gated-Attention Readers for Text Comprehension

In this paper we study the problem of answering cloze-style questions over short documents. We introduce a new attention mechanism which uses multiplicative interactions between the query embedding and intermediate states of a recurrent neural network reader. This enables the reader to build query-specific representations of tokens in the document which are further used for answer selection. Our model, the Gated-Attention Reader, outperforms all state-of-the-art models on several large-scale benchmark datasets for this task—the CNN & Daily Mail news stories and Children’s Book Test. We also provide a detailed analysis of the performance of our model and several baselines over a subset of questions manually annotated with certain linguistic features. The analysis sheds light on the strengths and weaknesses of several existing models.

The core idea is an attention mechanism that introduces multiplicative interactions between the query representation and the intermediate states of the recurrent reader. This interaction lets the document text and the query act on each other, strengthening the flow of information so the correct answer can be learned, much like human reading: when we read to answer a specific question, we likewise focus attention on the parts of the question and the document that are most relevant. As a result, the reader can attend to different parts of the document depending on the query it is given.


Core algorithm walkthrough:

First, the query is encoded with a bidirectional GRU: the final forward and final backward hidden states are taken together as the query representation q. Each layer of the model uses its own GRU encoder.
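
As a concrete illustration, here is a minimal PyTorch sketch of such a query encoder. This is not the authors' code; the layer sizes and names are assumptions.

# Minimal sketch of the query encoder (illustrative; dimensions are assumptions).
# A bidirectional GRU reads the query; the last forward and last backward hidden
# states are concatenated to form the query vector q.
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, query_ids):                # (batch, query_len)
        emb = self.embed(query_ids)              # (batch, query_len, emb_dim)
        outputs, h_n = self.gru(emb)             # h_n: (2, batch, hidden_dim)
        # Concatenate final forward and backward states -> (batch, 2 * hidden_dim)
        q = torch.cat([h_n[0], h_n[1]], dim=-1)
        return outputs, q                        # per-token outputs feed the GA layers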

 

Gated-Attention mechanism by applying an element-wise multiplication between the query embedding q^(i-1) and the outputs e^(i-1) from the previous layer:

At each layer, the query representation is applied to every token of the document; the authors call this gated attention. The operation is an element-wise multiplication, which differs from the traditional attention mechanism, where a weighted sum is computed over the tokens.
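
A sketch of one Gated-Attention layer, following the paper's description (the shapes and function name are my own): each document token attends over the query tokens to build a token-specific query vector, which then gates that token by element-wise multiplication rather than a weighted sum.

# One Gated-Attention layer (illustrative sketch, not the authors' code).
import torch
import torch.nn.functional as F

def gated_attention(doc, query):
    """doc:   (batch, doc_len,   dim) document token states from the previous layer
       query: (batch, query_len, dim) query token states
       Returns gated document states with the same shape as `doc`."""
    scores = torch.bmm(doc, query.transpose(1, 2))   # (batch, doc_len, query_len)
    alpha = F.softmax(scores, dim=-1)                # attention over query tokens
    q_tilde = torch.bmm(alpha, query)                # token-specific query vectors
    return doc * q_tilde                             # element-wise gating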

 

To obtain the probability that a particular token in the document answers the query we take an inner-product between outputs of the last layer q^(K) and e^(K)_t, and pass through a soft-max layer:

The last layer takes an inner product (a dot product between two vectors) to obtain a probability distribution, then applies the attention-sum idea: predictions for identical document words are added together. The distribution is over tokens of the document, so if the same word occurs at several positions, the probabilities at each position are summed to give that word's final predicted probability. In this way the answer is found directly in the document; cloze-style reading comprehension assumes the answer appears in the document at least once, so at the final stage we can select the answer straight from the document.
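
A hedged sketch of this answer layer (names and shapes are assumptions): the softmax over inner products gives per-position probabilities, and scatter_add_ sums the mass over identical token ids.

# Answer layer sketch: inner product + softmax + attention-sum over repeated tokens.
import torch
import torch.nn.functional as F

def answer_distribution(doc_states, q, doc_ids, vocab_size):
    """doc_states: (batch, doc_len, dim) last-layer document states e^(K)
       q:          (batch, dim)          last-layer query vector q^(K)
       doc_ids:    (batch, doc_len)      token ids of the document."""
    logits = torch.bmm(doc_states, q.unsqueeze(-1)).squeeze(-1)  # (batch, doc_len)
    probs = F.softmax(logits, dim=-1)                            # per-position probs
    # Sum probability mass over identical tokens: P(w) = sum over t with d_t = w
    word_probs = torch.zeros(doc_ids.size(0), vocab_size, device=probs.device)
    word_probs.scatter_add_(1, doc_ids, probs)
    return word_probs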


Experimental analysis

Three parameters were tuned on the validation set for each dataset: the number of layers K, the GRU hidden state sizes (both query and document) d, and the dropout rate p. We experimented with K = 2, 3, d = 256, 384 and p = 0.1, 0.2, 0.3, 0.4, 0.5. Memory constraints prevented us from experimenting with higher K.
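
The reported search amounts to a small grid; a sketch under that assumption follows. train_and_validate is a hypothetical placeholder, not part of the paper.

# Hypothetical stand-in for a full training + validation run; returns accuracy.
from itertools import product

def train_and_validate(num_layers, hidden_dim, dropout):
    return 0.0  # replace with the real validation accuracy

grid = {"K": [2, 3], "d": [256, 384], "p": [0.1, 0.2, 0.3, 0.4, 0.5]}

best = None
for K, d, p in product(grid["K"], grid["d"], grid["p"]):
    acc = train_and_validate(num_layers=K, hidden_dim=d, dropout=p)
    if best is None or acc > best[0]:
        best = (acc, {"K": K, "d": d, "p": p})
print(best)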
