Enriching Word Vectors with Subword Information

A review of FastText, one of the word embedding methods.

Model

Take into account morphology
Consider subword units
Represent each word as the sum of its character n-grams

skip-gram introduced by Mikolov et al.
$\begin{gathered} w\in\{1, ..., W\} \\ \text{Maximize the following log-likelihood} \\ \displaystyle\sum_{t=1}^T\sum_{c\in C_t}\log{p(w_c|w_t)} \end{gathered}$

$W$ is the vocabulary size, and $C_t$ is the set of context words surrounding $w_t$
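To make the double sum concrete, here is a minimal sketch that enumerates the (center, context) pairs the objective runs over. The function name `context_pairs` and the symmetric `window` parameter are assumptions for illustration; the post does not state a window size.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for a symmetric window around each position t."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)                # left edge of C_t, clipped to the sentence
        hi = min(len(tokens), t + window + 1)  # right edge of C_t (exclusive)
        for c in range(lo, hi):
            if c != t:                         # the center word is not its own context
                pairs.append((center, tokens[c]))
    return pairs


# Hypothetical toy sentence with a window of 1 word on each side.
pairs = context_pairs(["I", "go", "to", "school"], window=1)
```

Each pair contributes one $\log{p(w_c|w_t)}$ term to the log-likelihood.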

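Defining $p(w_c|w_t)$ with a softmax, as in the skip-gram model above, is not well suited to FastText
This is because that formulation implies that, given $w_t$, only a single context word $w_c$ is predicted
Instead, the problem can be framed as predicting whether each context word is present or absent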

For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary.
For a chosen context position $c$ , using the binary logistic loss, we obtain the following negative log-likelihood

$\log{(1 + e^{-s(w_t, w_c)})} + \sum_{n\in \mathcal{N}_{t,c}}\log{(1+e^{s(w_t, n)})}$

$\mathcal{N}_{t,c}$ is the set of negative examples sampled from the vocabulary
$s$ is the scoring function
$s(w_t, w_c) = \mathbf{u}_{w_t}^T\mathbf{v}_{w_c}$
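The loss above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: vectors are plain lists, the helper names (`dot`, `softplus`, `negative_sampling_loss`) are hypothetical, and the negatives are passed in already sampled.

```python
import math


def dot(a, b):
    """Scoring function s: inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def softplus(x):
    """log(1 + e^x), written so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def negative_sampling_loss(u_t, v_c, v_negs):
    """Binary logistic loss for one (center, context) pair plus its negatives."""
    loss = softplus(-dot(u_t, v_c))      # positive example: log(1 + e^{-s(w_t, w_c)})
    for v_n in v_negs:
        loss += softplus(dot(u_t, v_n))  # negative example: log(1 + e^{s(w_t, n)})
    return loss


# Toy 2-d vectors: the context vector is aligned with the center vector,
# the single negative is orthogonal to it.
loss = negative_sampling_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
```

Here the positive term is $\log{(1+e^{-1})}$ and the negative term is $\log{2}$, so the loss is their sum.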

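My own reading of the formula above: the dot product of the two vectors is $\mathbf{u}_{w_t}\cdot\mathbf{v}_{w_c} = \|\mathbf{u}_{w_t}\|\,\|\mathbf{v}_{w_c}\|\cos{\theta}$, so if we assume that both vectors are unit vectors, the value of the score function is determined by $\cos{\theta}$ alone.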

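The first term, $\log{(1+e^{-s(w_t,w_c)})}$, gets smaller as the angle $\theta$ approaches 0 (as the two vectors point in the same direction), i.e. the loss decreases. The second term behaves the opposite way: it gets smaller as $\theta$ grows, so it is lower at orthogonality than at $\theta = 0$ and lowest when the vectors point in opposite directions.

In short, this loss function pulls the context word toward the same direction as the center word and pushes the negative samples away from it, toward orthogonal or even opposite directions.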
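As a quick numerical check of this interpretation, the sketch below evaluates both loss terms at a few angles, assuming unit vectors so that the score equals $\cos{\theta}$. The function names are hypothetical.

```python
import math


def positive_term(theta):
    """log(1 + e^{-cos(theta)}): the term applied to a true context word."""
    return math.log1p(math.exp(-math.cos(theta)))


def negative_term(theta):
    """log(1 + e^{cos(theta)}): the term applied to a negative sample."""
    return math.log1p(math.exp(math.cos(theta)))


# Same direction, orthogonal, opposite direction.
angles = [0.0, math.pi / 2, math.pi]
pos = [positive_term(a) for a in angles]
neg = [negative_term(a) for a in angles]
```

The positive term grows with the angle while the negative term shrinks, and both equal $\log{2}$ at $\theta = \pi/2$.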

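One more remark: each word has two vectors (an input vector and an output vector).
For example, in the sentence “I go to school”, there is a vector for go when go is the center word ( $\mathbf{u}_{go}$ in the notation of $s$ above ), and there is a separate vector for go as a context word ( $\mathbf{v}_{go}$ ), used for instance when to is the center word and go is its context word.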

This is described in the word2vec paper, Mikolov et al. (2013).

The method above that uses negative examples is the Negative Sampling introduced in Mikolov et al. (2013).

Subword model

Everything above was a description of basic word2vec; what follows describes this paper's model, which uses character n-grams.

The existing skip-gram model represents each word with a distinct vector, which has the limitation of ignoring the internal structure of words
To take this internal structure into account, the paper defines a new scoring function $s$

Description of the model

Each word $w$ can be represented as a bag of character n-grams
Each word gets the boundary symbols < and > as its prefix and suffix
The word itself is also included in its set of n-grams

For $n=3$, the word where gives
<wh, whe, her, ere, re>, plus the special sequence <where>

Note
The sequence <her>, corresponding to the word her, is different from the trigram her taken from where
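The extraction rules above can be sketched as follows. The function name `char_ngrams` and the `n_min` / `n_max` parameters are assumptions for illustration; with the defaults, only trigrams are extracted, matching the $n=3$ example.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Return the character n-grams of a word, including the word itself."""
    wrapped = f"<{word}>"  # add the boundary symbols as prefix and suffix
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    if wrapped not in grams:
        grams.append(wrapped)  # special sequence: the whole wrapped word
    return grams


where_grams = char_ngrams("where")
her_grams = char_ngrams("her")
```

Because of the boundary symbols, the whole word her becomes <her>, which is a different entry from the trigram her found inside where.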

$G$ is the size of the n-gram dictionary
$\mathcal{G}_w$ is the set of n-grams appearing in word $w$
$\mathbf{z}_g$ is the vector representation of n-gram $g$

$\begin{gathered} \mathcal{G}_w \subset \{1,...,G\} \\ s(w,c) = \sum_{g\in \mathcal{G}_w}\mathbf{z}_g^T\mathbf{v}_c \end{gathered}$
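A minimal sketch of this scoring function, summing $\mathbf{z}_g^T\mathbf{v}_c$ over the n-grams of the word. Several things here are assumptions for illustration: a plain dict `z` stands in for the n-gram dictionary, n-grams missing from it are simply skipped, only $n=3$ is used, and the vector values are made up.

```python
def ngrams_of(word, n=3):
    """Set of character n-grams of the wrapped word, plus the wrapped word itself."""
    wrapped = f"<{word}>"
    grams = {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)}
    grams.add(wrapped)
    return grams


def subword_score(word, v_c, z, n=3):
    """s(w, c): sum over the n-grams g of w of the dot product z_g . v_c."""
    total = 0.0
    for g in ngrams_of(word, n):
        z_g = z.get(g)
        if z_g is None:
            continue  # assumption: n-grams without a vector contribute nothing
        total += sum(a * b for a, b in zip(z_g, v_c))
    return total


# Hypothetical 2-d n-gram vectors for the word "where".
z = {
    "<wh": [1.0, 0.0], "whe": [0.0, 1.0], "her": [1.0, 1.0],
    "ere": [0.5, 0.0], "re>": [0.0, 0.5], "<where>": [1.0, 0.0],
}
v_c = [2.0, 1.0]
score = subword_score("where", v_c, z)
```

Equivalently, the word vector is the sum of its n-gram vectors, and the score is that sum dotted with the context vector.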

With the formulation above

Information (representations) about the n-grams inside a word can be shared across words
Representations can be learned even for rare words
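The sharing effect can be illustrated with a toy example: where and there have several trigrams in common, so n-gram vectors learned from one also contribute to the other. The vector values, the helper names, and the choice of there as the hypothetical rare word are all assumptions.

```python
def ngrams_of(word, n=3):
    """Set of character n-grams of the wrapped word, plus the wrapped word itself."""
    wrapped = f"<{word}>"
    grams = {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)}
    grams.add(wrapped)
    return grams


def word_vector(word, z, dim, n=3):
    """Sum the vectors of the n-grams of `word` that have an entry in z."""
    vec = [0.0] * dim
    for g in ngrams_of(word, n):
        if g in z:
            for i, x in enumerate(z[g]):
                vec[i] += x
    return vec


# Hypothetical n-gram vectors, learned only from occurrences of "where".
z = {
    "<wh": [1.0, 0.0], "whe": [0.0, 1.0], "her": [1.0, 1.0],
    "ere": [0.5, 0.0], "re>": [0.0, 0.5], "<where>": [1.0, 0.0],
}

shared = ngrams_of("where") & ngrams_of("there")
there_vec = word_vector("there", z, dim=2)
```

Even though no entry in `z` came from there itself, its vector is non-zero thanks to the shared trigrams her, ere and re>.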

Experiments

To be added.


Yumere
