Bi-Directional Attention Flow For Machine Comprehension
We introduce the Bi-Directional Attention Flow (BiDAF) network, a hierarchical multi-stage architecture for modeling representations of the context paragraph at different levels of granularity.
BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation.
MODEL
Character Embedding Layer
Maps each word to a vector space using character-level CNNs (see the sketch below).
Let {x_1, ..., x_T} and {q_1, ..., q_J} represent the words in the input context paragraph and the query, respectively.
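A minimal sketch of the character-level embedding in PyTorch, assuming the usual CNN-with-max-pooling design: the character vocabulary size and character embedding dimension are illustrative assumptions, while the 100 filters of width 5 match the experimental setup described later.

```python
import torch
import torch.nn as nn

class CharEmbedding(nn.Module):
    """Character-level CNN: embed each character, convolve over the
    characters of a word, and max-pool to get one vector per word."""
    def __init__(self, num_chars=100, char_dim=8, num_filters=100, width=5):
        super().__init__()
        self.embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # 100 1D filters of width 5, as in the paper's experiments.
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=width, padding=width // 2)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len, word_len) integer character indices
        b, t, w = char_ids.shape
        x = self.embed(char_ids.view(b * t, w))       # (b*t, word_len, char_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (b*t, num_filters, word_len)
        x = x.max(dim=2).values                       # max-over-time pooling
        return x.view(b, t, -1)                       # (batch, seq_len, num_filters)
```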
Word Embedding Layer
Use pre-trained GloVe word vectors to map each word to a high-dimensional vector space.
The concatenation of the character and word embedding vectors is passed to a two-layer Highway Network, yielding X ∈ R^{d×T} for the context and Q ∈ R^{d×J} for the query (a sketch of the highway layer follows).
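A minimal sketch of the two-layer highway network that fuses the concatenated character and word embeddings; the ReLU nonlinearity and batch-first shapes are implementation assumptions.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Per layer: y = g * relu(W_t x) + (1 - g) * x, with gate g = sigmoid(W_g x)."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):
        # x: (batch, seq_len, dim), the concatenated char + word embeddings
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))                        # transform gate
            x = g * torch.relu(transform(x)) + (1.0 - g) * x  # carry the rest unchanged
        return x
```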
Contextual Embedding Layer
Use a bi-directional Long Short-Term Memory network (LSTM) on top of the embeddings to model temporal interactions between words, producing H ∈ R^{2d×T} from the context and U ∈ R^{2d×J} from the query (sketch below).
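A sketch of the contextual embedding step: a bi-directional LSTM whose forward and backward states are concatenated, giving 2d-dimensional vectors per position (H for the context, U for the query). Applying one shared module to both sequences is an implementation choice here, not something stated in these notes.

```python
import torch.nn as nn

class ContextualEmbedding(nn.Module):
    """Bi-directional LSTM over the highway-network outputs; the output
    concatenates forward and backward states, so each position gets 2*d dims."""
    def __init__(self, input_dim, d):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, d, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) -> (batch, seq_len, 2*d)
        out, _ = self.lstm(x)
        return out

# Applied to the context to obtain H (batch, T, 2d) and to the query to obtain U (batch, J, 2d).
```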
Attention Flow Layer
Inputs to the layer are the contextual vectors H of the context and U of the query.
Outputs of the layer are the query-aware vector representations of the context words, G, along with the contextual embeddings from the previous layer. Both attentions, from context to query and from query to context, are derived from a shared similarity matrix S ∈ R^{T×J}, where S_tj = α(H_:t, U_:j) = w_S^T [h; u; h∘u].
Context-to-query (C2Q) attention: which query words are most relevant to each context word.
Query-to-context (Q2C) attention: which context words have the closest similarity to one of the query words and are hence critical for answering the query (see the sketch below).
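A sketch of the attention-flow computation: the shared similarity matrix S, C2Q attention (softmax over each row of S), Q2C attention (softmax over the column-wise max of S, then tiled across context positions), and the query-aware representation G = [H; Ũ; H∘Ũ; H∘H̃]. The batch-first tensor layout is an implementation assumption.

```python
import torch
import torch.nn as nn

class AttentionFlow(nn.Module):
    """Bi-directional attention between context H and query U (both of width dim = 2d)."""
    def __init__(self, dim):
        super().__init__()
        # alpha(h, u) = w^T [h; u; h * u], a single trainable weight vector
        self.w = nn.Linear(3 * dim, 1, bias=False)

    def forward(self, H, U):
        # H: (batch, T, 2d) context, U: (batch, J, 2d) query
        T, J = H.size(1), U.size(1)
        H_exp = H.unsqueeze(2).expand(-1, -1, J, -1)      # (batch, T, J, 2d)
        U_exp = U.unsqueeze(1).expand(-1, T, -1, -1)      # (batch, T, J, 2d)
        S = self.w(torch.cat([H_exp, U_exp, H_exp * U_exp], dim=-1)).squeeze(-1)  # (batch, T, J)

        # Context-to-query: attended query vector for each context word
        a = torch.softmax(S, dim=2)                       # (batch, T, J)
        U_tilde = torch.bmm(a, U)                         # (batch, T, 2d)

        # Query-to-context: weight context words by their best match to any query word
        b = torch.softmax(S.max(dim=2).values, dim=1)     # (batch, T)
        h_tilde = torch.bmm(b.unsqueeze(1), H)            # (batch, 1, 2d)
        H_tilde = h_tilde.expand(-1, T, -1)               # tiled across the T context positions

        # G: query-aware representation of each context word, (batch, T, 8d)
        return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)
```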
Modeling Layer
Use two layers of bi-directional LSTM. The input to the modeling layer is G, which encodes the query-aware representations of the context words.
The output of the modeling layer, M ∈ R^{2d×T}, captures the interaction among the context words conditioned on the query.
Output Layer
The output layer is application-specific. For the QA task it predicts the start and end indices of the answer phrase in the paragraph from [G; M] and [G; M^2], where M^2 is M passed through another bi-directional LSTM (see the sketch below).
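A sketch of the span-prediction output layer under the same batch-first assumption as above; `dim` stands for the contextual width 2d, so [G; M] has width 5*dim.

```python
import torch
import torch.nn as nn

class SpanOutput(nn.Module):
    """Predicts start/end distributions over context positions for QA."""
    def __init__(self, dim):
        # dim = 2d; G is 4*dim wide, M and M2 are dim wide
        super().__init__()
        self.start = nn.Linear(5 * dim, 1, bias=False)   # scores over [G; M]
        self.end_lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.end = nn.Linear(5 * dim, 1, bias=False)     # scores over [G; M2]

    def forward(self, G, M):
        # G: (batch, T, 4*dim), M: (batch, T, dim)
        p1 = torch.softmax(self.start(torch.cat([G, M], dim=-1)).squeeze(-1), dim=1)
        M2, _ = self.end_lstm(M)                         # (batch, T, dim)
        p2 = torch.softmax(self.end(torch.cat([G, M2], dim=-1)).squeeze(-1), dim=1)
        return p1, p2                                    # start/end probabilities over T positions
```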
Experiments
- Use the SQuAD dataset, a machine comprehension dataset on a large set of Wikipedia articles with more than 100,000 questions.
- 90k/10k train/dev question-context tuples
- Each paragraph and question is tokenized by a regular-expression-based word tokenizer (PTB Tokenizer)
- 100 1D filters for CNN char embedding, each with a width of 5.
- The model has about 2.6M parameters.
- Use AdaDelta optimizer, with a mini-batch size of 60.
- Initial learning rate 0.5 for 12 epochs.
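A sketch of the optimization setup listed above (AdaDelta with initial learning rate 0.5, mini-batch size 60, 12 epochs). `BiDAF`, `train_set`, and the batch field names are hypothetical placeholders; only the hyperparameter values come from the notes above.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

model = BiDAF()                       # hypothetical module assembling the layers sketched above
optimizer = torch.optim.Adadelta(model.parameters(), lr=0.5)  # AdaDelta, initial lr 0.5
loader = DataLoader(train_set, batch_size=60, shuffle=True)   # mini-batch size of 60

for epoch in range(12):                                       # trained for 12 epochs
    for batch in loader:
        optimizer.zero_grad()
        p1, p2 = model(batch["context"], batch["query"])      # start/end distributions
        # Sum of negative log-likelihoods of the true start and end indices.
        loss = F.nll_loss(torch.log(p1), batch["start"]) + F.nll_loss(torch.log(p2), batch["end"])
        loss.backward()
        optimizer.step()
```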
Discussion
Ablation Study
- Word-level embedding is better at representing the semantics of each word as a whole
- Char-level embedding can better handle **out-of-vocabulary (OOV) words**
- C2Q attention proves to be critical: removing it drops both metrics (EM and F1) by more than 10 points
- The static attention mechanism used in this paper outperforms dynamically computed attention by more than 3 points.
Visualization
- The paper visualizes the feature spaces at the word and contextual embedding layers and the attention matrices learned by the model.