Bi-Directional Attention Flow For Machine Comprehension
We introduce the Bi-Directional Attention Flow (BiDAF) network, a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity.
BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation.
Character Embedding Layer
Maps each word to a high-dimensional vector space using a character-level CNN.
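The character-level mapping can be sketched as a 1D convolution over character vectors followed by max-pooling over the word. This is a minimal NumPy sketch, not the authors' implementation: the filter count (100) and width (5) come from the model details in these notes, while the 8-dimensional character vectors, the ReLU nonlinearity, the zero-padding of short words, and the toy `ord(ch) % n_chars` lookup are assumptions made for illustration.

```python
import numpy as np

def char_cnn_word_vector(word, char_emb, filters, bias, width=5):
    """Return one fixed-size vector for `word` from its characters.

    char_emb: (n_chars, char_dim) character embedding table
    filters:  (n_filters, width * char_dim) flattened 1D convolution filters
    bias:     (n_filters,)
    """
    # Toy character lookup (hypothetical): hash each character into the table.
    ids = [ord(ch) % char_emb.shape[0] for ch in word]
    X = char_emb[ids]                                  # (L, char_dim)
    # Zero-pad short words so at least one window of `width` characters exists.
    if X.shape[0] < width:
        pad = np.zeros((width - X.shape[0], X.shape[1]))
        X = np.vstack([X, pad])
    # Slide a window of `width` characters over the word.
    n_windows = X.shape[0] - width + 1
    windows = np.stack([X[i:i + width].ravel() for i in range(n_windows)])
    # Apply every filter to every window, then ReLU (assumed nonlinearity).
    feats = np.maximum(windows @ filters.T + bias, 0.0)  # (n_windows, n_filters)
    # Max-pool over positions: one value per filter, independent of word length.
    return feats.max(axis=0)

rng = np.random.default_rng(0)
n_chars, char_dim, n_filters, width = 128, 8, 100, 5
char_emb = rng.normal(size=(n_chars, char_dim))
filters = rng.normal(scale=0.1, size=(n_filters, width * char_dim))
bias = np.zeros(n_filters)

vec = char_cnn_word_vector("comprehension", char_emb, filters, bias, width)
print(vec.shape)  # (100,)
```

Because the pooling runs over positions, every word yields a vector of the same size regardless of its length.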
Word Embedding Layer
Uses pre-trained GloVe word vectors.
The concatenation of the character and word embedding vectors is passed to a two-layer Highway Network.
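To make the highway step concrete, here is a minimal NumPy sketch of a two-layer highway network applied to concatenated character and word vectors. It assumes the standard highway formulation (a sigmoid gate mixing a transformed input with the untouched input); the ReLU transform, the 100-dimensional word vectors, the random weights, and all function names are illustrative assumptions rather than details given in these notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = gate * transform(x) + (1 - gate) * x."""
    transform = np.maximum(x @ W_h + b_h, 0.0)   # candidate output (ReLU assumed)
    gate = sigmoid(x @ W_t + b_t)                # how much of the transform to use
    return gate * transform + (1.0 - gate) * x   # the rest is carried through unchanged

def two_layer_highway(x, params):
    """Apply highway layers in sequence; `params` holds one weight tuple per layer."""
    for W_h, b_h, W_t, b_t in params:
        x = highway_layer(x, W_h, b_h, W_t, b_t)
    return x

rng = np.random.default_rng(0)
n_words = 7
char_vecs = rng.normal(size=(n_words, 100))   # e.g. char-CNN outputs
word_vecs = rng.normal(size=(n_words, 100))   # e.g. GloVe vectors (dimension assumed)
x = np.concatenate([char_vecs, word_vecs], axis=1)   # (7, 200)

dim = x.shape[1]
params = [
    (rng.normal(scale=0.1, size=(dim, dim)), np.zeros(dim),
     rng.normal(scale=0.1, size=(dim, dim)), np.full(dim, -2.0))
    for _ in range(2)
]
out = two_layer_highway(x, params)
print(out.shape)  # (7, 200)
```

The negative gate bias is a common initialization choice that biases each layer toward passing its input through; it is an assumption here, not something the notes specify.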
Contextual Embedding Layer
Uses a Long Short-Term Memory (LSTM) network, run in both directions, to model the temporal interactions between words.
Attention Flow Layer
Inputs to the layer are the contextual vector representations of the context, $H$, and the query, $U$.
Outputs of the layer are the query-aware vector representations of the context words, $G$, along with the contextual embeddings from the previous layer.
Both attentions, from context to query and from query to context, are derived from a shared similarity matrix $S \in \mathbb{R}^{T \times J}$, where $S_{tj}$ is the similarity between the $t$-th context word and the $j$-th query word.
Context-to-query (C2Q) attention: which query words are most relevant to each context word.
Query-to-context (Q2C) attention: which context words have the closest similarity to one of the query words.
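The two attention directions can be sketched in NumPy as follows. This is a hedged illustration, not the reference implementation: it assumes a trilinear similarity of the form w·[h; u; h∘u], a softmax over query words for C2Q, a softmax over each context word's maximum similarity for Q2C, and a concatenation [h; ũ; h∘ũ; h∘h̃] to form the query-aware representation. The function name `attention_flow` and the row-major layout (one row per word) are choices made for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(H, U, w):
    """H: (T, D) context vectors, U: (J, D) query vectors, w: (3*D,) similarity weights."""
    T, D = H.shape
    w_h, w_u, w_hu = w[:D], w[D:2 * D], w[2 * D:]
    # Shared similarity matrix: S[t, j] = w . [h_t; u_j; h_t * u_j]  (assumed form)
    S = (H @ w_h)[:, None] + (U @ w_u)[None, :] + (H * w_hu) @ U.T   # (T, J)

    # C2Q: for each context word, a distribution over query words.
    A = softmax(S, axis=1)            # (T, J), each row sums to 1
    U_tilde = A @ U                   # (T, D) attended query vector per context word

    # Q2C: weight context words by their best match with any query word.
    b = softmax(S.max(axis=1), axis=0)    # (T,)
    h_tilde = b @ H                       # (D,) single attended context vector
    H_tilde = np.tile(h_tilde, (T, 1))    # (T, D) repeated for every context position

    # Query-aware representation of each context word.
    G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=1)   # (T, 4*D)
    return S, A, b, G

rng = np.random.default_rng(0)
T, J, D = 6, 3, 8
H = rng.normal(size=(T, D))
U = rng.normal(size=(J, D))
w = rng.normal(size=3 * D)
S, A, b, G = attention_flow(H, U, w)
print(S.shape, G.shape)  # (6, 3) (6, 32)
```

Both attention vectors are computed from the same `S`, which is the point of the shared similarity matrix: C2Q normalizes along the query axis, while Q2C first takes the maximum along that axis and then normalizes along the context axis.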
Modeling Layer
Uses two layers of bi-directional LSTM.
The input to the modeling layer is $G$, which encodes the query-aware representations of context words.
The output of the modeling layer captures the interaction among the context words conditioned on the query.
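As a rough illustration of the modeling layer's shape flow, the sketch below stacks two bi-directional LSTM layers over a query-aware input in plain NumPy. The gate ordering, the random weight initialization, the hidden size, and the helper names (`lstm_pass`, `bilstm`, `init_params`) are all assumptions made for this example; a real model would use a deep learning framework and trained weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(X, W, b, hidden):
    """Single-direction LSTM over X (T, in_dim); returns hidden states (T, hidden)."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x in X:
        z = W @ np.concatenate([x, h]) + b     # all four gates at once: (4 * hidden,)
        i, f, o, g = np.split(z, 4)            # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def bilstm(X, fwd_params, bwd_params, hidden):
    """Run the sequence in both directions and concatenate: (T, 2 * hidden)."""
    fwd = lstm_pass(X, *fwd_params, hidden)
    bwd = lstm_pass(X[::-1], *bwd_params, hidden)[::-1]   # re-align to original order
    return np.concatenate([fwd, bwd], axis=1)

def init_params(in_dim, hidden, rng):
    W = rng.normal(scale=0.1, size=(4 * hidden, in_dim + hidden))
    return W, np.zeros(4 * hidden)

rng = np.random.default_rng(0)
T, d = 5, 4
G_in = rng.normal(size=(T, 8 * d))   # stand-in for the query-aware input, (T, 8d) assumed

# Two stacked bi-directional layers; the second reads the 2d-wide output of the first.
layer1 = (init_params(8 * d, d, rng), init_params(8 * d, d, rng))
layer2 = (init_params(2 * d, d, rng), init_params(2 * d, d, rng))
M = bilstm(bilstm(G_in, *layer1, d), *layer2, d)
print(M.shape)  # (5, 8)
```

Each row of `M` depends on the whole sequence in both directions, which is how the layer captures interactions among context words once the query information is already mixed into its input.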
Output Layer
The output layer is application-specific.
- Uses the SQuAD dataset, a machine comprehension dataset built on a large set of Wikipedia articles, with more than 100,000 questions.
- 90k/10k train/dev question-context tuples
- Each paragraph and question is tokenized by a regular-expression-based word tokenizer (PTB Tokenizer).
- 100 1D filters for CNN char embedding, each with a width of 5.
- The model has about 2.6M parameters.
- Uses the AdaDelta optimizer with a mini-batch size of 60.
- Initial learning rate of 0.5, trained for 12 epochs.
- Word-level embedding is better at representing the semantics of each word as a whole.
- Char-level embedding can better handle **out-of-vocab (OOV)** or rare words.
- C2Q attention proves to be critical: removing it causes a drop of more than 10 points on both metrics (EM and F1).
- The static attention mechanism (used in this paper) outperforms dynamically computed attention by more than 3 points.