Bi-Directional Attention Flow For Machine Comprehension

We introduce the Bi-Directional Attention Flow (BiDAF) network, a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity.

BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation.

MODEL

Character Embedding Layer
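Maps each word to a high-dimensional vector space using a character-level CNN.
Let $\{x_1, \dots, x_T\}$ and $\{q_1, \dots, q_J\}$ represent the words in the input context paragraph and the query, respectively.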
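A minimal NumPy sketch of how a character-level CNN can turn one word into a fixed-size vector: embed each character, slide 1D filters over the character sequence, then max-pool over positions. The function name `char_cnn_embed`, the character vocabulary size (30), and the character embedding size (8) are assumptions made for illustration; the 100 filters of width 5 follow the experiment settings in these notes. No non-linearity is applied here, which is a simplification.

```python
import numpy as np

def char_cnn_embed(char_ids, char_table, filters, bias):
    """Embed one word from its characters with a 1D CNN and max-pooling.

    char_ids:   (L,) integer character indices of the word
    char_table: (V, c) character embedding table (hypothetical values)
    filters:    (n_filters, width, c) convolution filters
    bias:       (n_filters,)
    returns:    (n_filters,) fixed-size word vector
    """
    width = filters.shape[1]
    chars = char_table[char_ids]                      # (L, c)
    # Zero-pad words shorter than the filter width so one window exists.
    if chars.shape[0] < width:
        pad = np.zeros((width - chars.shape[0], chars.shape[1]))
        chars = np.vstack([chars, pad])
    # Collect every window of `width` consecutive characters.
    n_windows = chars.shape[0] - width + 1
    windows = np.stack([chars[i:i + width] for i in range(n_windows)])  # (n_windows, width, c)
    # Apply all filters to all windows at once.
    feats = np.einsum("nwc,fwc->nf", windows, filters) + bias           # (n_windows, n_filters)
    # Max-pool over positions to get one value per filter.
    return feats.max(axis=0)

rng = np.random.default_rng(0)
char_table = rng.normal(size=(30, 8))     # assumed: 30 characters, 8-dim embeddings
filters = rng.normal(size=(100, 5, 8))    # 100 filters of width 5
bias = np.zeros(100)
word_vec = char_cnn_embed(np.array([3, 7, 1, 12, 5, 9, 2]), char_table, filters, bias)
print(word_vec.shape)
```

Because of the max-pooling step, the output size depends only on the number of filters, not on the word length.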
Word Embedding Layer
Maps each word to a vector using pre-trained GloVe word vectors.
The concatenation of the character and word embedding vectors is passed to a two-layer Highway Network.
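As a rough sketch of the two-layer Highway Network step: each layer mixes a transformed version of its input with the input itself through a learned gate. The ReLU transform, the parameter names, and the 200-dimensional input (assumed here to be a 100-dim character vector concatenated with a 100-dim word vector) are illustrative assumptions, not details taken from these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = t * relu(x W_h + b_h) + (1 - t) * x."""
    h = np.maximum(0.0, x @ W_h + b_h)   # candidate transform (ReLU assumed)
    t = sigmoid(x @ W_t + b_t)           # transform gate in (0, 1)
    return t * h + (1.0 - t) * x         # gate chooses between transform and carry

def highway_network(x, params):
    """Apply highway layers in sequence; `params` is a list of (W_h, b_h, W_t, b_t)."""
    for W_h, b_h, W_t, b_t in params:
        x = highway_layer(x, W_h, b_h, W_t, b_t)
    return x

rng = np.random.default_rng(0)
dim = 200                                 # assumed: 100-dim char vector + 100-dim word vector
params = [
    (rng.normal(scale=0.1, size=(dim, dim)), np.zeros(dim),
     rng.normal(scale=0.1, size=(dim, dim)), np.zeros(dim))
    for _ in range(2)                     # two layers, as in the notes
]
x = rng.normal(size=(7, dim))             # 7 words, one row per word
print(highway_network(x, params).shape)
```

The input and output dimensions are equal by construction, since the carry path adds the untransformed input back in.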
Contextual Embedding Layer
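Uses a Long Short-Term Memory network (LSTM) in both directions on top of the embeddings to model the temporal interactions between words, giving $\mathbf{H}\in\mathbb{R}^{2d\times T}$ for the context and $\mathbf{U}\in\mathbb{R}^{2d\times J}$ for the query.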
Attention Flow Layer
Inputs to the layer are the contextual vectors of the context, $\mathbf{H}$, and of the query, $\mathbf{U}$.
Outputs of the layer are the query-aware vector representations of the context words, $\mathbf{G}$, along with the contextual embeddings from the previous layer.

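Both attentions, from context to query and from query to context, are derived from a shared similarity matrix $\mathbf{S}\in \mathbb{R}^{T\times J}$ between the contextual embeddings of the context ($\mathbf{H}$) and the query ($\mathbf{U}$), where $\mathbf{S}_{tj}=\alpha(\mathbf{H}_{:t}, \mathbf{U}_{:j})\in \mathbb{R}$.
$\alpha(\mathbf{h}, \mathbf{u})=\mathbf{w}^\intercal_{(\mathbf{S})}[\mathbf{h};\mathbf{u};\mathbf{h}\circ\mathbf{u}]$, where $\mathbf{w}_{(\mathbf{S})}\in\mathbb{R}^{6d}$ is a trainable weight vector, $\circ$ is elementwise multiplication, and $[;]$ is vector concatenation.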
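The similarity matrix can be computed without building every concatenated vector $[\mathbf{h};\mathbf{u};\mathbf{h}\circ\mathbf{u}]$ explicitly, because the dot product with the weight vector splits into three parts. The sketch below assumes NumPy arrays `H` of shape `(2d, T)` and `U` of shape `(2d, J)`; the function name and the random inputs are illustrative only.

```python
import numpy as np

def similarity_matrix(H, U, w):
    """S[t, j] = w^T [h; u; h * u] with h = H[:, t] and u = U[:, j].

    H: (2d, T) context vectors, U: (2d, J) query vectors, w: (6d,) weights.
    """
    two_d = H.shape[0]
    # Split w into the parts that multiply h, u, and h * u.
    w_h, w_u, w_hu = w[:two_d], w[two_d:2 * two_d], w[2 * two_d:]
    term_h = (w_h @ H)[:, None]               # (T, 1), depends only on t
    term_u = (w_u @ U)[None, :]               # (1, J), depends only on j
    term_hu = (H * w_hu[:, None]).T @ U       # (T, J), weighted elementwise product
    return term_h + term_u + term_hu

rng = np.random.default_rng(0)
d, T, J = 4, 6, 3                             # small illustrative sizes
H = rng.normal(size=(2 * d, T))
U = rng.normal(size=(2 * d, J))
w = rng.normal(size=6 * d)
print(similarity_matrix(H, U, w).shape)
```

Splitting the weight vector this way avoids materializing a `(T, J, 6d)` tensor, which matters when the context is long.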

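Context-to-query (C2Q) attention: signifies which query words are most relevant to each context word. Each row of $\mathbf{S}$ is normalized with a softmax over the query words, and the attended query vectors form $\tilde{\mathbf{U}}\in\mathbb{R}^{2d\times T}$.
Query-to-context (Q2C) attention: signifies which context words have the closest similarity to one of the query words. A softmax over the column-wise maximum of $\mathbf{S}$ gives a single attended context vector, which is tiled $T$ times to form $\tilde{\mathbf{H}}\in\mathbb{R}^{2d\times T}$.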
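A small NumPy sketch of both attention directions, given a similarity matrix `S` of shape `(T, J)`. The final combination of $\mathbf{H}$, $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{H}}$ into $\mathbf{G}$ is written here as a plain concatenation $[\mathbf{h};\tilde{\mathbf{u}};\mathbf{h}\circ\tilde{\mathbf{u}};\mathbf{h}\circ\tilde{\mathbf{h}}]$, which is one simple choice and is not spelled out in these notes; the function names are illustrative.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(H, U, S):
    """H: (2d, T) context, U: (2d, J) query, S: (T, J) similarity matrix."""
    T = H.shape[1]
    # C2Q: for each context word, a distribution over the J query words.
    a = softmax(S, axis=1)                        # (T, J)
    U_tilde = U @ a.T                             # (2d, T)
    # Q2C: one distribution over the T context words, from the best match per word.
    b = softmax(S.max(axis=1), axis=0)            # (T,)
    h_tilde = H @ b                               # (2d,)
    H_tilde = np.tile(h_tilde[:, None], (1, T))   # (2d, T), same vector in every column
    # Assumed combination: concatenate along the feature axis.
    G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)  # (8d, T)
    return U_tilde, H_tilde, G

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 6))
U = rng.normal(size=(8, 3))
S = rng.normal(size=(6, 3))
U_tilde, H_tilde, G = attention_flow(H, U, S)
print(U_tilde.shape, H_tilde.shape, G.shape)
```

Note that the attention is computed per context word and never collapses the context into a single summary vector, so $\mathbf{G}$ keeps one column per context position.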
Modeling Layer
Uses two layers of bi-directional LSTM.

The input to the modeling layer is $\mathbf{G}$ , which encodes the query-aware representations of context words.
The output of the modeling layer captures the interaction among the context words conditioned on the query.
Output Layer
The output layer is application-specific.

Experiments

Uses the SQuAD dataset:
- a machine comprehension dataset built on a large set of Wikipedia articles, with more than 100,000 questions
- 90k/10k train/dev question-context tuples
Each paragraph and question is tokenized by a regular-expression-based word tokenizer (PTB Tokenizer).
The character-embedding CNN uses 100 1D filters, each with a width of 5.
The model has about 2.6M parameters.
Uses the AdaDelta optimizer with a mini-batch size of 60.
The initial learning rate is 0.5, and training runs for 12 epochs.
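To give a feel for the regular-expression-based tokenization mentioned in the setup above, here is a deliberately simplified sketch. It is not the PTB Tokenizer: the pattern and the function name `regex_tokenize` are assumptions for illustration, and real PTB rules handle contractions, quotes, and brackets differently.

```python
import re

# Assumed pattern: a run of word characters, or any single non-space symbol.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    """Split text into word and punctuation tokens with one regular expression."""
    return _TOKEN_RE.findall(text)

print(regex_tokenize("Where was BiDAF evaluated?"))
# ['Where', 'was', 'BiDAF', 'evaluated', '?']
```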

Discussion

Ablation Study

Word-level embedding is better at representing the semantics of each word as a whole.
Char-level embedding can better handle **out-of-vocab (OOV)** or rare words.
C2Q attention proves to be critical: removing it causes a drop of more than 10 points on both metrics.
The static attention mechanism (used in this paper) outperforms dynamically computed attention by more than 3 points.
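SQuAD is usually scored with Exact Match (EM) and token-level F1, which are presumably the two metrics referred to above. The sketch below is a simplified version of both: it only lowercases and splits on whitespace, whereas the official evaluation script also strips punctuation and articles and takes the maximum over several reference answers. The function names are illustrative.

```python
from collections import Counter

def exact_match(prediction, truth):
    """1.0 if the answers are identical after lowercasing and trimming, else 0.0."""
    return float(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction, truth):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Park", "the park"))
print(f1_score("in the park", "the park"))
```

In the second call the prediction contains both reference tokens plus one extra, so precision is 2/3, recall is 1, and F1 is 0.8, while EM for that pair would be 0.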

Visualization

Yumere
