
Paper Review

BiDAF Review and Notes

Bi-Directional Attention Flow For Machine Comprehension

We introduce the Bi-Directional Attention Flow (BiDAF) network, a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity.

BiDAF includes character-level, word-level, and contextual embeddings, and uses bi-directional attention flow to obtain a query-aware context representation.

MODEL


  1. Character Embedding Layer
    Maps each word to a high-dimensional vector space using character-level CNNs (a sketch follows below).
    Let $\{x_1, \dots, x_T\}$ and $\{q_1, \dots, q_J\}$ denote the words in the input context paragraph and in the query, respectively.
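
A minimal PyTorch sketch of this step, assuming a hypothetical 16-dimensional character embedding and the 100 filters of width 5 reported in the experiments below; names and tensor layout are illustrative, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

class CharCNNEmbedding(nn.Module):
    """Character-level word embedding: embed characters, run a 1D CNN,
    then max-pool over time to get one fixed-size vector per word."""
    def __init__(self, num_chars, char_emb_dim=16, num_filters=100, width=5):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_emb_dim, num_filters, width, padding=width // 2)

    def forward(self, char_ids):                        # (batch, num_words, word_len)
        b, t, w = char_ids.shape
        x = self.char_emb(char_ids.view(b * t, w))      # (b*t, word_len, char_emb_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))    # (b*t, num_filters, word_len)
        x = x.max(dim=2).values                         # max-over-time pooling
        return x.view(b, t, -1)                         # (batch, num_words, num_filters)
```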

  2. Word Embedding Layer
    Use GloVe pre-trained model.
    The concatenation of the character and word embedding vectors is passed to a two-layer Highway Network (a sketch follows below).
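
A sketch of the two-layer highway network; the input dimension `dim` is an assumption (e.g., 100-d GloVe plus 100 char-CNN filters would give 200):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Two-layer highway network applied to the concatenated [char; word] embedding.
    Each layer mixes a non-linear transform with the identity via a learned gate."""
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.transforms = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, x):                      # (batch, num_words, dim)
        for transform, gate in zip(self.transforms, self.gates):
            g = torch.sigmoid(gate(x))         # transform gate
            x = g * torch.relu(transform(x)) + (1 - g) * x
        return x
```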

  3. Contextual Embedding Layer
    Use a bi-directional Long Short-Term Memory Network (LSTM) on top of the embeddings and concatenate the outputs of the two directions, giving $\mathbf{H} \in \mathbb{R}^{2d \times T}$ for the context and $\mathbf{U} \in \mathbb{R}^{2d \times J}$ for the query (a minimal sketch follows).
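
A minimal sketch of this layer, assuming the embedding output is 200-dimensional and the hidden size is d = 100; sharing one LSTM between context and query is an implementation choice here, not stated above:

```python
import torch.nn as nn

# Bidirectional LSTM over the context and query embeddings.
# Forward and backward outputs are concatenated, so each position is 2d-dimensional.
contextual_lstm = nn.LSTM(input_size=200, hidden_size=100,
                          batch_first=True, bidirectional=True)
# H, _ = contextual_lstm(context_emb)   # (batch, T, 2d) -- contextual context vectors
# U, _ = contextual_lstm(query_emb)     # (batch, J, 2d) -- contextual query vectors
```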

  4. Attention Flow Layer
    Inputs to the layer are the contextual vectors of the context, $\mathbf{H}$, and of the query, $\mathbf{U}$.
    Outputs of the layer are the query-aware vector representations of the context words, $\mathbf{G}$, along with the contextual embeddings from the previous layer.

    Both attentions, from context to query and from query to context, are derived from a shared similarity matrix $\mathbf{S} \in \mathbb{R}^{T \times J}$, where $\mathbf{S}_{tj} = \alpha(\mathbf{H}_{:t}, \mathbf{U}_{:j}) \in \mathbb{R}$.
    $\alpha(\mathbf{h}, \mathbf{u}) = \mathbf{w}^\intercal_{(\mathbf{S})}[\mathbf{h};\mathbf{u};\mathbf{h}\circ\mathbf{u}]$

    Context-to-query (C2Q) attention: which query words are most relevant to each context word; yields $\tilde{\mathbf{U}} \in \mathbb{R}^{2d \times T}$.
    Query-to-context (Q2C) attention: which context words have the closest similarity to one of the query words; yields $\tilde{\mathbf{H}} \in \mathbb{R}^{2d \times T}$.
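
A sketch of the whole attention flow layer, assuming batch-first tensors H: (batch, T, 2d) and U: (batch, J, 2d); it builds the shared similarity matrix S, computes the C2Q and Q2C attended vectors, and concatenates them with H into G with 8d features per context position:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFlow(nn.Module):
    """Bi-directional attention flow: H (batch, T, 2d), U (batch, J, 2d) -> G (batch, T, 8d)."""
    def __init__(self, d):
        super().__init__()
        # alpha(h, u) = w^T [h; u; h∘u]; each of h, u, h∘u is 2d-dimensional.
        self.w = nn.Linear(6 * d, 1, bias=False)

    def forward(self, H, U):
        T, J = H.size(1), U.size(1)
        # Similarity matrix S ∈ R^{T×J} over all (context word, query word) pairs.
        h = H.unsqueeze(2).expand(-1, -1, J, -1)                    # (B, T, J, 2d)
        u = U.unsqueeze(1).expand(-1, T, -1, -1)                    # (B, T, J, 2d)
        S = self.w(torch.cat([h, u, h * u], dim=-1)).squeeze(-1)    # (B, T, J)

        # C2Q: for each context word, attend over the query words.
        U_tilde = torch.bmm(F.softmax(S, dim=2), U)                 # (B, T, 2d)

        # Q2C: attend over context words via the max similarity per context word,
        # then tile the single attended vector across all T positions.
        b = F.softmax(S.max(dim=2).values, dim=1)                   # (B, T)
        h_tilde = torch.bmm(b.unsqueeze(1), H)                      # (B, 1, 2d)
        H_tilde = h_tilde.expand(-1, T, -1)                         # (B, T, 2d)

        # G_:t = [h; ũ; h∘ũ; h∘h̃]
        return torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=-1)
```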


  5. Modeling Layer
    Use two layers of bi-directional LSTM.

    The input to the modeling layer is G\mathbf{G}, which encodes the query-aware representations of context words.
    The output of the modeling layer captures the interaction among the context words conditioned on the query.
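
A corresponding sketch of the modeling layer, assuming d = 100 so that G carries 8d = 800 features per position:

```python
import torch.nn as nn

# Two stacked bidirectional LSTM layers over G; each output position M[:, t, :]
# is a 2d-dimensional, query-conditioned representation of context word t.
modeling_lstm = nn.LSTM(input_size=8 * 100, hidden_size=100, num_layers=2,
                        batch_first=True, bidirectional=True)
# M, _ = modeling_lstm(G)   # (batch, T, 2d)
```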

  6. Output Layer
    The output layer is application-specific; for SQuAD question answering, it predicts the start and end indices of the answer phrase in the paragraph.

Experiments

  1. Use SQuAD dataset
    • A machine comprehension dataset built from a large set of Wikipedia articles, with more than 100,000 questions.
    • 90k/10k train/dev question-context tuples
  2. Each paragraph and question is tokenized by a regular-expression-based word tokenizer (PTB Tokenizer).
  3. 100 1D filters for CNN char embedding, each with a width of 5.
  4. The model has about 2.6M parameters.
  5. Use AdaDelta optimizer, with a mini-batch size of 60.
  6. Initial learning rate of 0.5, trained for 12 epochs.


Discussion

Ablation Study

  1. Word-level embedding is better at representing the semantics of each word as a whole
  2. Char-level embedding can better handle **out-of-vocabulary (OOV) words**
  3. C2Q attention proves to be critical: removing it drops both metrics (EM and F1) by more than 10 points
  4. The static attention mechanism (used in this paper) outperforms dynamically computed attention by more than 3 points.

Visualization