Paper Review

Bi-Directional Attention Flow For Machine Comprehension

The paper introduces the Bi-Directional Attention Flow (BiDAF) network, a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity.

BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation.

MODEL

1. Character Embedding Layer
Maps each word to a vector space using character-level CNNs (a code sketch follows).
Let $\{x_1, \dots, x_T\}$ and $\{q_1, \dots, q_J\}$ denote the words in the input context paragraph and the query, respectively.
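
A minimal PyTorch sketch of this layer, assuming a character CNN with max-over-time pooling (the 100 filters of width 5 match the experimental setup below); the class name and dimensions are placeholders, not the authors' code:

```python
import torch
import torch.nn as nn

class CharEmbedding(nn.Module):
    """Character-level word embedding: embed each character, run a 1D CNN
    over the character sequence, and max-pool over time."""
    def __init__(self, num_chars=100, char_dim=8, num_filters=100, width=5):
        super().__init__()
        self.embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # 100 filters of width 5, as in the experimental setup below
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=width, padding=width // 2)

    def forward(self, char_ids):
        # char_ids: (batch, num_words, max_word_len) integer character indices
        b, t, w = char_ids.shape
        x = self.embed(char_ids.view(b * t, w))         # (b*t, w, char_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))    # (b*t, num_filters, w)
        x = x.max(dim=2).values                         # max-over-time pooling
        return x.view(b, t, -1)                         # (batch, num_words, num_filters)
```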

2. Word Embedding Layer
Uses pre-trained GloVe word vectors.
The concatenation of the character and word embedding vectors is passed to a two-layer Highway Network (sketched below).
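
A brief sketch of such a two-layer Highway Network (Srivastava et al., 2015); `dim` stands for the size of the concatenated char+word embedding and is a placeholder:

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Highway network: a learned gate g mixes a nonlinear transform of the
    input with the input itself, y = g * relu(Wx) + (1 - g) * x."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))                    # gate value in [0, 1]
            x = g * torch.relu(transform(x)) + (1 - g) * x
        return x
```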

3. Contextual Embedding Layer
Uses a bi-directional Long Short-Term Memory network (LSTM); the outputs of the two directions are concatenated, giving $\mathbf{H}\in\mathbb{R}^{2d\times T}$ for the context and $\mathbf{U}\in\mathbb{R}^{2d\times J}$ for the query.
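
A minimal sketch with placeholder sizes, assuming a single bi-directional LSTM applied to both inputs (whether its weights are shared between context and query is an implementation choice, not something this summary pins down):

```python
import torch
import torch.nn as nn

d, input_dim, T, J = 100, 200, 40, 10        # placeholder dimensions
ctx_lstm = nn.LSTM(input_dim, d, bidirectional=True, batch_first=True)

context_emb = torch.randn(1, T, input_dim)   # highway-network output, context
query_emb = torch.randn(1, J, input_dim)     # highway-network output, query
H, _ = ctx_lstm(context_emb)                 # (1, T, 2d): both directions concatenated
U, _ = ctx_lstm(query_emb)                   # (1, J, 2d)
```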

4. Attention Flow Layer
Inputs to the layer are the contextual vectors of the context, $\mathbf{H}$, and of the query, $\mathbf{U}$.
Outputs of the layer are the query-aware vector representations of the context words, $\mathbf{G}$, along with the contextual embeddings from the previous layer.

Both attentions, from context to query and from query to context, are derived from a shared similarity matrix $\mathbf{S}\in \mathbb{R}^{T\times J}$, where $\mathbf{S}_{tj}=\alpha(\mathbf{H}_{:t}, \mathbf{U}_{:j})\in \mathbb{R}$ and
$\alpha(\mathbf{h}, \mathbf{u})=\mathbf{w}^\intercal_{(\mathbf{S})}[\mathbf{h};\mathbf{u};\mathbf{h}\circ\mathbf{u}]$,
with a trainable weight vector $\mathbf{w}_{(\mathbf{S})}\in\mathbb{R}^{6d}$, $[;]$ denoting row-wise concatenation, and $\circ$ elementwise multiplication.

Context-to-query (C2Q) attention signifies which query words are most relevant to each context word: $\mathbf{a}_t = \mathrm{softmax}(\mathbf{S}_{t:})\in\mathbb{R}^{J}$ and $\tilde{\mathbf{U}}_{:t}=\sum_j \mathbf{a}_{tj}\mathbf{U}_{:j}$, giving $\tilde{\mathbf{U}}\in\mathbb{R}^{2d\times T}$.
Query-to-context (Q2C) attention signifies which context words have the closest similarity to one of the query words: $\mathbf{b} = \mathrm{softmax}(\max_{\mathrm{col}}(\mathbf{S}))\in\mathbb{R}^{T}$ and $\tilde{\mathbf{h}}=\sum_t \mathbf{b}_{t}\mathbf{H}_{:t}$, tiled $T$ times to give $\tilde{\mathbf{H}}\in\mathbb{R}^{2d\times T}$.
The two are combined per context word as $\mathbf{G}_{:t}=[\mathbf{H}_{:t};\tilde{\mathbf{U}}_{:t};\mathbf{H}_{:t}\circ\tilde{\mathbf{U}}_{:t};\mathbf{H}_{:t}\circ\tilde{\mathbf{H}}_{:t}]\in\mathbb{R}^{8d}$.
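
A sketch of the whole attention flow layer in PyTorch, using batch-first shapes ($\mathbf{H}$: (B, T, 2d), $\mathbf{U}$: (B, J, 2d)); the class layout is mine, but the computation follows the equations above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFlow(nn.Module):
    """BiDAF attention flow. H: (B, T, 2d) context, U: (B, J, 2d) query;
    returns G: (B, T, 8d). `d` is the per-direction LSTM hidden size."""
    def __init__(self, d):
        super().__init__()
        # alpha(h, u) = w_S^T [h; u; h*u]; the concatenation is 6d-dimensional
        self.w_s = nn.Linear(6 * d, 1, bias=False)

    def forward(self, H, U):
        B, T, _ = H.shape
        J = U.size(1)
        # Similarity matrix S: (B, T, J), one score per (context, query) word pair
        h = H.unsqueeze(2).expand(B, T, J, H.size(2))
        u = U.unsqueeze(1).expand(B, T, J, U.size(2))
        S = self.w_s(torch.cat([h, u, h * u], dim=-1)).squeeze(-1)

        # C2Q: softmax over query words for each context word
        a = F.softmax(S, dim=2)                          # (B, T, J)
        U_tilde = torch.bmm(a, U)                        # (B, T, 2d)

        # Q2C: softmax over context words of the per-row maximum similarity
        b = F.softmax(S.max(dim=2).values, dim=1)        # (B, T)
        h_tilde = torch.bmm(b.unsqueeze(1), H)           # (B, 1, 2d)
        H_tilde = h_tilde.expand(B, T, H.size(2))        # tiled T times

        # G_t = [h; u~; h o u~; h o h~]
        return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)
```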

5. Modeling Layer
Uses two layers of bi-directional LSTM.

The input to the modeling layer is $\mathbf{G}$, which encodes the query-aware representations of context words.
The output of the modeling layer, $\mathbf{M}\in\mathbb{R}^{2d\times T}$, captures the interaction among the context words conditioned on the query.
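
A minimal sketch with placeholder sizes; the layer is simply two stacked bi-directional LSTMs over $\mathbf{G}$:

```python
import torch
import torch.nn as nn

d, T = 100, 40                                # placeholder sizes
modeling_lstm = nn.LSTM(8 * d, d, num_layers=2, bidirectional=True, batch_first=True)
G = torch.randn(1, T, 8 * d)                  # stand-in for the attention flow output
M, _ = modeling_lstm(G)                       # (1, T, 2d)
```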

6. Output Layer
The output layer is application-specific. For SQuAD, it predicts the start and end indices of the answer span: $p^1=\mathrm{softmax}(\mathbf{w}^\intercal_{(p^1)}[\mathbf{G};\mathbf{M}])$ for the start, and, after passing $\mathbf{M}$ through another bi-directional LSTM to obtain $\mathbf{M}^2$, $p^2=\mathrm{softmax}(\mathbf{w}^\intercal_{(p^2)}[\mathbf{G};\mathbf{M}^2])$ for the end; training minimizes the sum of the negative log probabilities of the true start and end indices.
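
A sketch of the SQuAD span output following the equations above; returning log-probabilities is a convenience choice of this sketch, not the paper's formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanOutput(nn.Module):
    """Start distribution from [G; M]; end distribution from [G; M2],
    where M2 is M passed through one more bi-directional LSTM."""
    def __init__(self, d):
        super().__init__()
        self.end_lstm = nn.LSTM(2 * d, d, bidirectional=True, batch_first=True)
        self.w_p1 = nn.Linear(10 * d, 1, bias=False)     # [G; M] is 8d + 2d = 10d
        self.w_p2 = nn.Linear(10 * d, 1, bias=False)     # [G; M2] is likewise 10d

    def forward(self, G, M):
        p1 = F.log_softmax(self.w_p1(torch.cat([G, M], dim=-1)).squeeze(-1), dim=1)
        M2, _ = self.end_lstm(M)
        p2 = F.log_softmax(self.w_p2(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=1)
        return p1, p2   # (B, T) log-probabilities of start / end positions
```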

Experiments

1. SQuAD is a machine comprehension dataset on a large set of Wikipedia articles, with more than 100,000 questions (90k/10k train/dev question-context tuples).
2. Each paragraph and question is tokenized by a regular-expression-based word tokenizer (PTB Tokenizer).
3. 100 1D filters are used for the CNN character embedding, each with a width of 5.
4. The model has about 2.6M parameters.
5. Training runs with the AdaDelta optimizer at an initial learning rate of 0.5 for 12 epochs.

Discussion

Ablation Study

1. Word-level embedding is better at representing the semantics of each word as a whole.
2. Character-level embedding can better handle **out-of-vocabulary (OOV)** words.
3. C2Q attention proves to be critical: removing it drops both metrics (EM and F1) by more than 10 points.
4. The static attention mechanism used in this paper outperforms dynamically computed attention by more than 3 points.

Visualization
