
# [cs224n] Lecture 9: Machine Translation and Advanced Recurrent LSTMs and GRUs

[slide] [video]

• Gated Recurrent Units by Cho et al. (2014)
• Long Short-Term Memory (LSTM) by Hochreiter and Schmidhuber (1997)

## Recap

Word2Vec
$J_t(\theta)=\log{\sigma(u_o^Tv_c)} + \sum_{j\sim P(w)}[\log{\sigma(-u_j^Tv_c)}]$

GloVe
$J(\theta)=\frac{1}{2}\sum_{i,j=1}^Wf(P_{ij})(u_i^Tv_j - \log{P_{ij}})^2$

Nnet & Max-margin
$J=\max(0, 1 - s + s_c)$

Recurrent Neural Networks
$$\begin{aligned} h_t &= \sigma\left(W^{(hh)}h_{t-1}+W^{(hx)}x_{[t]}\right) \\ \hat{y}_t &= \text{softmax}(W^{(S)}h_t) \end{aligned}$$

Cross Entropy Error
$J^{(t)}(\theta)=-\sum_{j=1}^{|V|}y_{t,j}\log{\hat{y}_{t,j}}$

Mini-batched SGD
$\theta^{new} = \theta^{old}-\alpha\nabla_\theta J_{t:t+B}(\theta)$
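As a quick sketch of the mini-batched SGD update above, here is a toy version on a least-squares loss; the loss function, learning rate, and data are invented for illustration:

```python
import numpy as np

def sgd_step(theta, x_batch, y_batch, lr=0.1):
    """One mini-batched SGD update on the toy loss
    J = mean((x @ theta - y)^2) over the examples in the batch."""
    # Gradient of J with respect to theta, averaged over the mini-batch
    grad = 2 * x_batch.T @ (x_batch @ theta - y_batch) / len(x_batch)
    # theta_new = theta_old - alpha * grad
    return theta - lr * grad
```

Each call performs exactly one update with one mini-batch; repeating it over shuffled batches is the full training loop.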

## Machine Translation

### Current statistical machine translation systems

Bayes' rule
$\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e)\,p(e)$

• $f$: source language
• $e$: target language
• $p(f|e)$: translation model
• $p(e)$: language model
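The argmax above can be sketched on toy numbers; the candidate translations and all probabilities below are invented for illustration:

```python
import math

# Toy noisy-channel decoding for a fixed source sentence f.
translation_model = {   # p(f | e): faithfulness to the source
    "the house": 0.30,
    "house the": 0.30,
    "a home": 0.15,
}
language_model = {      # p(e): fluency of the target sentence
    "the house": 0.050,
    "house the": 0.001,
    "a home": 0.040,
}

def best_translation(candidates):
    """argmax_e p(f|e) p(e), computed in log space for stability."""
    return max(candidates,
               key=lambda e: math.log(translation_model[e])
                           + math.log(language_model[e]))
```

Note how "house the" is ruled out by the language model even though the translation model scores it as high as "the house".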

#### Step 1: Alignment

• align each word in the source sentence with the target word(s) it can correspond to
• one-to-many or many-to-many alignments are possible

#### After many steps

Each phrase in the source language has many possible translations, resulting in a large search space

#### Decode: Search for best of many hypotheses

Hard search problem that also includes language model

### Neural Machine Translation

• Single recurrent neural network
• encode word vectors through an RNN
• decode using the last hidden state encoded from the source sentence

Encoder
$h_t = \phi(h_{t-1}, x_t) = f(W^{(hh)}h_{t-1}+W^{(hx)}x_t)$
Decoder
$$\begin{aligned} h_t &= \phi(h_{t-1})=f(W^{(hh)}h_{t-1}) \\ y_t &= \text{softmax}(W^{(S)}h_t) \end{aligned}$$

Compute every hidden state in the decoder from

• Previous hidden state
• Last hidden vector of encoder $c=h_T$
• Previous predicted output word $y_{t-1}$

Minimize the cross-entropy error for all target words conditioned on the source words, i.e. maximize the log-likelihood
$\underset{\theta}{\max}\frac{1}{N}\sum_{n=1}^N\log{p_\theta (y^{(n)}|x^{(n)})}$
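The encoder and decoder recurrences above can be sketched with NumPy; the vocabulary size, hidden size, random (untrained) weights, and greedy argmax readout are all toy assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                               # toy vocabulary and hidden sizes
Whh = rng.normal(scale=0.1, size=(d, d))  # W^{(hh)}
Whx = rng.normal(scale=0.1, size=(d, d))  # W^{(hx)}
Ws = rng.normal(scale=0.1, size=(V, d))   # W^{(S)}

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def encode(xs):
    """Encoder: h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t);
    the last hidden state summarizes the source sentence."""
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(Whh @ h + Whx @ x)
    return h

def decode(h, steps):
    """Simplest decoder from the notes: h_t = f(W^{(hh)} h_{t-1}),
    y_t = softmax(W^{(S)} h_t), with a greedy argmax at each step."""
    ys = []
    for _ in range(steps):
        h = np.tanh(Whh @ h)
        ys.append(int(np.argmax(softmax(Ws @ h))))
    return ys
```

In practice the decoder would also condition on the previous predicted word and stop at an end-of-sentence token; this sketch keeps only the two equations shown above.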

## GRUs

• a standard RNN computes the hidden layer directly
$h_t=f(W^{(hh)}h_{t-1}+W^{(hx)}x_t)$
• a GRU first computes an update gate from the current input word vector and the hidden state
• the gates use the $\sigma$ non-linearity, so their values in $[0,1]$ can be read as probabilities
$z_t=\sigma(W^{(z)}x_t+U^{(z)}h_{t-1})$
• it also computes a reset gate
$r_t=\sigma(W^{(r)}x_t+U^{(r)}h_{t-1})$
• New memory content
• the reset gate decides how much of the previous hidden state to keep or discard
$\tilde{h}_t = \tanh(Wx_t+r_t\circ Uh_{t-1})$
• Final memory
$h_t = z_t\circ h_{t-1}+(1-z_t)\circ \tilde{h}_t$
• if $z_t$ is 1, the previous step's hidden state is simply copied forward
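A minimal NumPy sketch of one GRU step, combining the gate equations above; the parameter names (`Wz`, `Uz`, ...) and shapes are assumptions for illustration, not a trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step following the equations above; p maps the
    (hypothetical) parameter names to weight matrices."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)            # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)            # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + r * (p["U"] @ h_prev))  # new memory content
    return z * h_prev + (1 - z) * h_tilde                    # final memory
```

With all-zero weights both gates equal 0.5, so the output is an even mix of the previous hidden state and the (zero) candidate memory.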

## Long Short-Term Memories (LSTMs)

• Input gate (how much the current cell matters)
• Forget gate (if 0, forget the past)
• Output gate (how much of the cell is exposed)
• New memory cell
$$\begin{aligned} i_t &= \sigma(W^{(i)}x_t + U^{(i)}h_{t-1}) \\ f_t &= \sigma(W^{(f)}x_t+U^{(f)}h_{t-1}) \\ o_t &= \sigma(W^{(o)}x_t + U^{(o)}h_{t-1}) \\ \tilde{c}_t &= \tanh(W^{(c)}x_t + U^{(c)}h_{t-1}) \\ c_t &= f_t\circ c_{t-1} + i_t \circ \tilde{c}_t \\ h_t &= o_t \circ \tanh(c_t) \end{aligned}$$
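The six LSTM equations above can be sketched the same way; again, the parameter names (`Wi`, `Ui`, ...) and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step mirroring the equations above; p maps the
    (hypothetical) parameter names to weight matrices."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev)        # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev)        # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev)        # output gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev)  # new memory cell
    c = f * c_prev + i * c_tilde                         # final cell state
    h = o * np.tanh(c)                                   # exposed hidden state
    return h, c
```

Unlike the GRU, the cell state $c_t$ is carried alongside the hidden state, and the output gate controls how much of it is exposed as $h_t$.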
