
[cs224n] Lecture 9 Machine Translation and Advanced Recurrent LSTMs and GRUs


[slide] [video]

  • Gated Recurrent Units by Cho et al. (2014)
  • Long Short-Term Memory by Hochreiter and Schmidhuber (1997)

Recap

Word2Vec
$$J_t(\theta)=\log{\sigma(u_o^T v_c)} + \sum_{j\sim P(w)}\left[\log{\sigma(-u_j^T v_c)}\right]$$
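
For concreteness, a minimal NumPy sketch of this negative-sampling objective; the array names (`v_c`, `u_o`, `u_neg`) are illustrative, not taken from any lecture code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_o, u_neg):
    """J_t for one (center, outside) pair with k sampled negative words.

    v_c   : (d,)   center word vector
    u_o   : (d,)   observed outside word vector
    u_neg : (k, d) negative samples drawn from the noise distribution P(w)
    """
    positive = np.log(sigmoid(u_o @ v_c))                # log sigma(u_o^T v_c)
    negative = np.sum(np.log(sigmoid(-(u_neg @ v_c))))   # sum_j log sigma(-u_j^T v_c)
    return positive + negative
```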

GloVe
$$J(\theta)=\frac{1}{2}\sum_{i,j=1}^{W} f(P_{ij})\left(u_i^T v_j - \log{P_{ij}}\right)^2$$
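
A sketch of the GloVe cost in NumPy. The weighting function f used here, min((x/x_max)^alpha, 1), is the usual choice from the GloVe paper and is an assumption of this sketch rather than something fixed by the recap.

```python
import numpy as np

def glove_objective(U, V, P, x_max=100.0, alpha=0.75):
    """Weighted least-squares cost over a co-occurrence matrix P.

    U, V : (W, d) word / context vectors
    P    : (W, W) co-occurrence counts; pairs with P_ij = 0 are skipped
    """
    mask = P > 0
    f = np.minimum((P / x_max) ** alpha, 1.0)    # weighting function f(P_ij)
    log_P = np.log(np.where(mask, P, 1.0))       # safe log; masked entries become 0
    diff = U @ V.T - log_P
    return 0.5 * np.sum(mask * f * diff ** 2)
```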

Nnet & Max-margin
$$J=\max(0,\, 1 - s + s_c)$$
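
A tiny sketch of this max-margin (hinge) loss, where s is the score of a correct window and s_c the score of a corrupted one:

```python
def max_margin_loss(s, s_c):
    """J = max(0, 1 - s + s_c): zero once the correct score beats the corrupted one by a margin of 1."""
    return max(0.0, 1.0 - s + s_c)

print(max_margin_loss(s=2.0, s_c=0.5))  # 0.0, margin satisfied
print(max_margin_loss(s=0.2, s_c=0.5))  # 1.3, margin violated
```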

Recurrent Neural Networks
$$\begin{aligned} h_t &= \sigma\left(W^{(hh)}h_{t-1}+W^{(hx)}x_{[t]}\right) \\ \hat{y}_t &= \text{softmax}\left(W^{(S)}h_t\right) \end{aligned}$$
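
One step of this vanilla RNN in NumPy; taking the non-linearity sigma to be tanh is an assumption of this sketch, not something the recap fixes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t, W_hh, W_hx, W_S):
    """h_t = sigma(W_hh h_{t-1} + W_hx x_t); y_hat_t = softmax(W_S h_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)
    y_hat_t = softmax(W_S @ h_t)   # distribution over the vocabulary
    return h_t, y_hat_t
```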

Cross Entropy Error
$$J^{(t)}(\theta)=-\sum_{j=1}^{|V|}y_{t,j}\log{\hat{y}_{t,j}}$$
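
A minimal sketch of this per-time-step cross-entropy with a one-hot target:

```python
import numpy as np

def cross_entropy(y_true, y_hat, eps=1e-12):
    """J^(t) = -sum_j y_{t,j} log yhat_{t,j}."""
    return -np.sum(y_true * np.log(y_hat + eps))

y_true = np.array([0.0, 0.0, 1.0, 0.0])   # correct word is index 2
y_hat  = np.array([0.1, 0.2, 0.6, 0.1])   # model's predicted distribution
print(cross_entropy(y_true, y_hat))        # ~0.51, i.e. -log 0.6
```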

Mini-batched SGD
$$\theta^{new} = \theta^{old}-\alpha\nabla_\theta J_{t:t+B}(\theta)$$
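
The mini-batched SGD update as a one-line sketch; `grad` stands for the gradient of the mini-batch loss J_{t:t+B} and is assumed to be computed elsewhere.

```python
import numpy as np

def sgd_update(theta, grad, lr=0.01):
    """theta_new = theta_old - alpha * grad_theta J_{t:t+B}(theta)."""
    return theta - lr * grad

theta = np.zeros(3)
grad = np.array([0.5, -1.0, 2.0])
print(sgd_update(theta, grad))   # [-0.005  0.01  -0.02]
```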

Machine Translation

Current statistical machine translation systems

Bayes' rule
$$\hat{e} = \arg\max_e p(e|f) = \arg\max_e p(f|e)\,p(e)$$

  • $f$: source language
  • $e$: target language
  • $p(f|e)$: translation model
  • $p(e)$: language model
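
To make the noisy-channel decomposition concrete, here is a hedged sketch of decoding as an argmax over candidate translations; `candidates`, `translation_logprob`, and `language_logprob` are hypothetical stand-ins for the components of a real SMT system.

```python
def decode(f, candidates, translation_logprob, language_logprob):
    """Pick e maximizing p(f|e) * p(e), i.e. log p(f|e) + log p(e).

    f                   : source sentence
    candidates          : iterable of candidate target sentences e
    translation_logprob : callable giving log p(f|e)
    language_logprob    : callable giving log p(e)
    """
    return max(candidates,
               key=lambda e: translation_logprob(f, e) + language_logprob(e))
```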

Step 1: Alignment

  • Align how each word in the source sentence can be matched to words in the target sentence
  • one-to-many or many-to-many alignments are possible
    (figure: word alignment examples)

After many steps

Each phrase in the source language has many possible translations, resulting in a large search space.

Decode: search for the best among many hypotheses

A hard search problem that also incorporates the language model


Neural Machine Translation


  • A single recurrent neural network
  • Encode the source word vectors with an RNN
  • Decode using the last hidden state produced by the encoder from the source sentence

Encoder
$$h_t = \phi(h_{t-1}, x_t) = f\left(W^{(hh)}h_{t-1}+W^{(hx)}x_t\right)$$
Decoder
$$\begin{aligned} h_t &= \phi(h_{t-1})=f\left(W^{(hh)}h_{t-1}\right) \\ y_t &= \text{softmax}\left(W^{(S)}h_t\right) \end{aligned}$$

Compute every hidden state in the decoder from (see the sketch after this list):

  • the previous hidden state
  • the last hidden vector of the encoder, $c=h_T$
  • the previously predicted output word $y_{t-1}$
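
A minimal NumPy sketch of the encoder and of a decoder step that conditions on all three quantities above; the extra matrices W_hc and W_hy (for the encoder summary c and the previous output word) are illustrative names introduced for this sketch, not notation from the lecture.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, W_hh, W_hx, h0):
    """Run the encoder RNN over the source word vectors and return c = h_T."""
    h = h0
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x)
    return h

def decoder_step(h_prev, c, y_prev, W_hh, W_hc, W_hy, W_S):
    """One decoder step using h_{t-1}, the encoder summary c = h_T, and y_{t-1}."""
    h_t = np.tanh(W_hh @ h_prev + W_hc @ c + W_hy @ y_prev)
    return h_t, softmax(W_S @ h_t)
```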

Maximize the log-probability of all target words conditioned on the source words (equivalently, minimize the cross-entropy error):
$$\max_{\theta}\frac{1}{N}\sum_{n=1}^N\log{p_\theta\left(y^{(n)}|x^{(n)}\right)}$$
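
As a sketch, this objective can be computed from the per-token log-probabilities the decoder assigns to the correct target words of each sentence pair; the nested-list input format here is just an assumption for illustration.

```python
import numpy as np

def mean_log_likelihood(per_sentence_token_logprobs):
    """(1/N) sum_n log p_theta(y^(n) | x^(n)): each sentence's log-probability
    is the sum of its per-token log-probabilities."""
    return np.mean([np.sum(lp) for lp in per_sentence_token_logprobs])

print(mean_log_likelihood([[-0.1, -0.3], [-0.2, -0.05, -0.4]]))  # (-0.4 + -0.65) / 2 = -0.525
```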

GRUs

  • A standard RNN computes the hidden layer directly
    $h_t=f\left(W^{(hh)}h_{t-1}+W^{(hx)}x_t\right)$
  • A GRU first computes an update gate based on the current input word vector and the hidden state (a full GRU step is sketched after this list)
  • The gates use the $\sigma$ non-linearity, so their values lie between 0 and 1 and can be treated as probabilities
    $z_t=\sigma\left(W^{(z)}x_t+U^{(z)}h_{t-1}\right)$
  • It also computes a reset gate
    $r_t=\sigma\left(W^{(r)}x_t+U^{(r)}h_{t-1}\right)$
  • New memory content
  • The reset gate controls how much of the previous hidden state to keep or to throw away
    $\tilde{h}_t = \tanh\left(Wx_t+r_t\circ Uh_{t-1}\right)$
  • Final memory
    $h_t = z_t\circ h_{t-1}+(1-z_t)\circ \tilde{h}_t$
    • If $z_t$ is 1, the hidden state from the previous step is simply copied over
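
Putting the four equations together, a minimal NumPy sketch of a single GRU step (matrix names follow the equations above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    """One GRU step: update gate, reset gate, new memory content, final memory."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)            # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))    # new memory content
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde         # final memory
    return h_t
```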

Long Short-Term Memories (LSTMs)

  • Input gate (how much the current cell matters)
  • Forget gate (if 0, forget the past)
  • Output gate (how much of the cell is exposed)
  • New memory cell
    $$\begin{aligned} i_t &= \sigma\left(W^{(i)}x_t + U^{(i)}h_{t-1}\right) \\ f_t &= \sigma\left(W^{(f)}x_t+U^{(f)}h_{t-1}\right) \\ o_t &= \sigma\left(W^{(o)}x_t + U^{(o)}h_{t-1}\right) \\ \tilde{c}_t &= \tanh\left(W^{(c)}x_t + U^{(c)}h_{t-1}\right) \\ c_t &= f_t\circ c_{t-1} + i_t \circ \tilde{c}_t \\ h_t &= o_t \circ \tanh(c_t) \end{aligned}$$
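
And a matching NumPy sketch of one LSTM step, following the six equations above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_i, U_i, W_f, U_f, W_o, U_o, W_c, U_c):
    """One LSTM step returning the new hidden state and memory cell."""
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev)      # input gate: how much the new cell matters
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev)      # forget gate: 0 means forget the past
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev)      # output gate: how much of the cell is exposed
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev)  # new memory cell candidate
    c_t = f_t * c_prev + i_t * c_tilde           # final memory cell
    h_t = o_t * np.tanh(c_t)                     # hidden state
    return h_t, c_t
```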