
[cs224n] Lecture 9 Machine Translation and Advanced Recurrent LSTMs and GRUs


[slide] [video]

  • Gated Recurrent Units by Cho et al. (2014)
  • Long Short-Term Memory by Hochreiter and Schmidhuber (1997)

Recap

Word2Vec
$$J_t(\theta)=\log{\sigma(u_o^T v_c)} + \sum_{j\sim P(w)}\left[\log{\sigma(-u_j^T v_c)}\right]$$
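
For concreteness, a minimal NumPy sketch of this negative-sampling objective; the array names (`v_c`, `u_o`, `u_neg`) are illustrative, not taken from any lecture code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_o, u_neg):
    """J_t for one (center, outside) pair with k sampled negative words.

    v_c   : (d,)   center word vector
    u_o   : (d,)   observed outside word vector
    u_neg : (k, d) negative samples drawn from the noise distribution P(w)
    """
    positive = np.log(sigmoid(u_o @ v_c))                # log sigma(u_o^T v_c)
    negative = np.sum(np.log(sigmoid(-(u_neg @ v_c))))   # sum_j log sigma(-u_j^T v_c)
    return positive + negative
```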

GloVe
$$J(\theta)=\frac{1}{2}\sum_{i,j=1}^{W} f(P_{ij})\left(u_i^T v_j - \log{P_{ij}}\right)^2$$
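
A sketch of the GloVe cost in NumPy. The weighting function f used here, min((x/x_max)^alpha, 1), is the usual choice from the GloVe paper and is an assumption of this sketch rather than something fixed by the recap.

```python
import numpy as np

def glove_objective(U, V, P, x_max=100.0, alpha=0.75):
    """Weighted least-squares cost over a co-occurrence matrix P.

    U, V : (W, d) word / context vectors
    P    : (W, W) co-occurrence counts; pairs with P_ij = 0 are skipped
    """
    mask = P > 0
    f = np.minimum((P / x_max) ** alpha, 1.0)    # weighting function f(P_ij)
    log_P = np.log(np.where(mask, P, 1.0))       # safe log; masked entries become 0
    diff = U @ V.T - log_P
    return 0.5 * np.sum(mask * f * diff ** 2)
```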

Nnet & Max-margin
$$J=\max(0,\, 1 - s + s_c)$$
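
A tiny sketch of this max-margin (hinge) loss, where s is the score of a correct window and s_c the score of a corrupted one:

```python
def max_margin_loss(s, s_c):
    """J = max(0, 1 - s + s_c): zero once the correct score beats the corrupted one by a margin of 1."""
    return max(0.0, 1.0 - s + s_c)

print(max_margin_loss(s=2.0, s_c=0.5))  # 0.0, margin satisfied
print(max_margin_loss(s=0.2, s_c=0.5))  # 1.3, margin violated
```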

Recurrent Neural Networks
$$\begin{aligned} h_t &= \sigma\left(W^{(hh)}h_{t-1}+W^{(hx)}x_{[t]}\right) \\ \hat{y}_t &= \text{softmax}\left(W^{(S)}h_t\right) \end{aligned}$$
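
One step of this vanilla RNN in NumPy; taking the non-linearity sigma to be tanh is an assumption of this sketch, not something the recap fixes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t, W_hh, W_hx, W_S):
    """h_t = sigma(W_hh h_{t-1} + W_hx x_t); y_hat_t = softmax(W_S h_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)
    y_hat_t = softmax(W_S @ h_t)   # distribution over the vocabulary
    return h_t, y_hat_t
```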

Cross Entropy Error
$$J^{(t)}(\theta)=-\sum_{j=1}^{|V|}y_{t,j}\log{\hat{y}_{t,j}}$$
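
A minimal sketch of this per-time-step cross-entropy with a one-hot target:

```python
import numpy as np

def cross_entropy(y_true, y_hat, eps=1e-12):
    """J^(t) = -sum_j y_{t,j} log yhat_{t,j}."""
    return -np.sum(y_true * np.log(y_hat + eps))

y_true = np.array([0.0, 0.0, 1.0, 0.0])   # correct word is index 2
y_hat  = np.array([0.1, 0.2, 0.6, 0.1])   # model's predicted distribution
print(cross_entropy(y_true, y_hat))        # ~0.51, i.e. -log 0.6
```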

Mini-batched SGD
$$\theta^{new} = \theta^{old}-\alpha\nabla_\theta J_{t:t+B}(\theta)$$
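
The mini-batched SGD update as a one-line sketch; `grad` stands for the gradient of the mini-batch loss J_{t:t+B} and is assumed to be computed elsewhere.

```python
import numpy as np

def sgd_update(theta, grad, lr=0.01):
    """theta_new = theta_old - alpha * grad_theta J_{t:t+B}(theta)."""
    return theta - lr * grad

theta = np.zeros(3)
grad = np.array([0.5, -1.0, 2.0])
print(sgd_update(theta, grad))   # [-0.005  0.01  -0.02]
```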

Machine Translation

Current statistical machine translation systems

Bayes' rule
$$\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e)\,p(e)$$

  • $f$: source language
  • $e$: target language
  • $p(f|e)$: translation model
  • $p(e)$: language model
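
To make the noisy-channel decomposition concrete, here is a hedged sketch of decoding as an argmax over candidate translations; `candidates`, `translation_logprob`, and `language_logprob` are hypothetical stand-ins for the components of a real SMT system.

```python
def decode(f, candidates, translation_logprob, language_logprob):
    """Pick e maximizing p(f|e) * p(e), i.e. log p(f|e) + log p(e).

    f                   : source sentence
    candidates          : iterable of candidate target sentences e
    translation_logprob : callable giving log p(f|e)
    language_logprob    : callable giving log p(e)
    """
    return max(candidates,
               key=lambda e: translation_logprob(f, e) + language_logprob(e))
```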

Step 1: Alignment

  • Align how each word in the source sentence can be matched to words in the target sentence
  • one-to-many or many-to-many alignments are possible
    (figure: word alignment examples)

After many steps

Each phrase in the source language has many possible translations, resulting in a large search space.

Decode: search for the best among many hypotheses

A hard search problem that also incorporates the language model


Neural Machine Translation


  • A single recurrent neural network
  • Encode the source word vectors with an RNN
  • Decode using the last hidden state produced by the encoder from the source sentence

Encoder
$$h_t = \phi(h_{t-1}, x_t) = f\left(W^{(hh)}h_{t-1}+W^{(hx)}x_t\right)$$
Decoder
$$\begin{aligned} h_t &= \phi(h_{t-1})=f\left(W^{(hh)}h_{t-1}\right) \\ y_t &= \text{softmax}\left(W^{(S)}h_t\right) \end{aligned}$$

Compute every hidden state in the decoder from (see the sketch after this list):

  • the previous hidden state
  • the last hidden vector of the encoder, $c=h_T$
  • the previously predicted output word $y_{t-1}$
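
A minimal NumPy sketch of the encoder and of a decoder step that conditions on all three quantities above; the extra matrices W_hc and W_hy (for the encoder summary c and the previous output word) are illustrative names introduced for this sketch, not notation from the lecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, W_hh, W_hx, h0):
    """Run the encoder RNN over the source word vectors and return c = h_T."""
    h = h0
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x)
    return h

def decoder_step(h_prev, c, y_prev, W_hh, W_hc, W_hy, W_S):
    """One decoder step using h_{t-1}, the encoder summary c = h_T, and y_{t-1}."""
    h_t = np.tanh(W_hh @ h_prev + W_hc @ c + W_hy @ y_prev)
    return h_t, softmax(W_S @ h_t)
```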

Maximize the log-probability of all target words conditioned on the source words (equivalently, minimize the cross-entropy error):
$$\max_{\theta}\frac{1}{N}\sum_{n=1}^N\log{p_\theta\left(y^{(n)}|x^{(n)}\right)}$$
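
As a sketch, this objective can be computed from the per-token log-probabilities the decoder assigns to the correct target words of each sentence pair; the nested-list input format here is just an assumption for illustration.

```python
import numpy as np

def mean_log_likelihood(per_sentence_token_logprobs):
    """(1/N) sum_n log p_theta(y^(n) | x^(n)): each sentence's log-probability
    is the sum of its per-token log-probabilities."""
    return np.mean([np.sum(lp) for lp in per_sentence_token_logprobs])

print(mean_log_likelihood([[-0.1, -0.3], [-0.2, -0.05, -0.4]]))  # (-0.4 + -0.65) / 2 = -0.525
```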

GRUs

  • A standard RNN computes the hidden layer directly
    $h_t=f\left(W^{(hh)}h_{t-1}+W^{(hx)}x_t\right)$
  • A GRU first computes an update gate based on the current input word vector and the hidden state (a full GRU step is sketched after this list)
  • The gates use the $\sigma$ non-linearity, so their values lie between 0 and 1 and can be treated as probabilities
    $z_t=\sigma\left(W^{(z)}x_t+U^{(z)}h_{t-1}\right)$
  • It also computes a reset gate
    $r_t=\sigma\left(W^{(r)}x_t+U^{(r)}h_{t-1}\right)$
  • New memory content
  • The reset gate controls how much of the previous hidden state to keep or to throw away
    $\tilde{h}_t = \tanh\left(Wx_t+r_t\circ Uh_{t-1}\right)$
  • Final memory
    $h_t = z_t\circ h_{t-1}+(1-z_t)\circ \tilde{h}_t$
    • If $z_t$ is 1, the hidden state from the previous step is simply copied over
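
Putting the four equations together, a minimal NumPy sketch of a single GRU step (matrix names follow the equations above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """One GRU step: update gate, reset gate, new memory content, final memory."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)            # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))    # new memory content
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde         # final memory
    return h_t
```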

Long Short-Term Memories (LSTMs)

  • Input gate (how much the current cell matters)
  • Forget gate (if 0, forget the past)
  • Output gate (how much of the cell is exposed)
  • New memory cell
    $$\begin{aligned} i_t &= \sigma\left(W^{(i)}x_t + U^{(i)}h_{t-1}\right) \\ f_t &= \sigma\left(W^{(f)}x_t+U^{(f)}h_{t-1}\right) \\ o_t &= \sigma\left(W^{(o)}x_t + U^{(o)}h_{t-1}\right) \\ \tilde{c}_t &= \tanh\left(W^{(c)}x_t + U^{(c)}h_{t-1}\right) \\ c_t &= f_t\circ c_{t-1} + i_t \circ \tilde{c}_t \\ h_t &= o_t \circ \tanh(c_t) \end{aligned}$$
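
And a matching NumPy sketch of one LSTM step, following the six equations above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_i, U_i, W_f, U_f, W_o, U_o, W_c, U_c):
    """One LSTM step returning the new hidden state and memory cell."""
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev)      # input gate: how much the new cell matters
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev)      # forget gate: 0 means forget the past
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev)      # output gate: how much of the cell is exposed
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev)  # new memory cell candidate
    c_t = f_t * c_prev + i_t * c_tilde           # final memory cell
    h_t = o_t * np.tanh(c_t)                     # hidden state
    return h_t, c_t
```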