

[cs224n] Lecture 4 Word Window Classification and Neural Networks

Word Window Classification and Neural Networks

Overview

  • Classification background
  • Updating word vectors for classification
  • Window classification & cross entropy derivation tips
  • A single layer neural network
  • Max-Margin loss and backprop

So far, the Word2Vec methods using Skip-gram & CBOW have been unsupervised methods.

Goal:
p(y|x) = \frac{exp(W_y x)}{\sum_{c=1}^C exp(W_c x)}
\text{where } W \in \mathbb{R}^{C \times d}

Details of softmax

p(y|x) = \frac{exp(f_y)}{\sum_{c=1}^C exp(f_c)} = softmax(f)_y

\text{where } f_y, f_c \text{ are the unnormalized outputs}

For each training example {x, y}, our objective is to maximize the probability of the correct class y

Hence, minimize the negative log probability of that class

-\log p(y|x) = -\log \frac{exp(f_y)}{\sum_{c=1}^C exp(f_c)}

Maximize probability = Minimize negative log probability
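
A minimal NumPy sketch of this computation; the class count, dimensionality, random weights, and true class index below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def softmax(f):
    """Numerically stable softmax over a vector of unnormalized scores."""
    f = f - f.max()              # shifting doesn't change the result, avoids overflow
    exp_f = np.exp(f)
    return exp_f / exp_f.sum()

# Toy sizes (illustrative): C classes, d-dimensional inputs
C, d = 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(C, d))      # W in R^{C x d}
x = rng.normal(size=d)           # input vector
y = 1                            # index of the correct class

f = W @ x                        # unnormalized scores, f_c = W_c x
p = softmax(f)                   # p(.|x) = softmax(f)
loss = -np.log(p[y])             # -log p(y|x): the quantity we minimize
print(p, loss)
```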


Regularization term

  • The full loss function over any dataset really includes a regularization term over all parameters θ
  • Regularization will prevent overfitting when we have a lot of features (or, later, a very powerful/deep model)

J(\theta) = \frac{1}{N}\sum_{i=1}^N -\log \frac{e^{f_{y_i}}}{\sum_{c=1}^C e^{f_c}} + \lambda \sum_k \theta_k^2
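
A hedged NumPy sketch of this full objective on a small batch; here the only parameters are W, and the default value of lambda is an arbitrary illustrative choice:

```python
import numpy as np

def regularized_loss(W, X, Y, lam=1e-3):
    """Average negative log likelihood plus an L2 penalty on the parameters.

    W: (C, d) weights, X: (N, d) inputs, Y: (N,) true class indices,
    lam: regularization strength lambda (illustrative default).
    """
    F = X @ W.T                                    # (N, C) unnormalized scores
    F = F - F.max(axis=1, keepdims=True)           # numerical stability
    P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)
    nll = -np.log(P[np.arange(len(Y)), Y]).mean()  # (1/N) sum_i -log p(y_i | x_i)
    reg = lam * np.sum(W ** 2)                     # lambda * sum_k theta_k^2
    return nll + reg
```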

(Figure: x-axis is model power or number of training iterations; training error (blue) keeps decreasing while test error (red) eventually rises, which is the overfitting that regularization counters.)


Window Classification

  • Classifying single words is rarely done
    • antonyms, ambiguous named entities, etc.
  • So, we need window classification

Window classification means classifying a word in the context window of its neighboring words.

  • Many possibilities exist for classifying one word in context, e.g. averaging all the word vectors in the window, but that loses position information; concatenating the window's word vectors keeps it (see the sketch below)
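
A sketch of building the concatenated window representation instead of averaging; the window size, sentence length, dimensionality, and the edge-clamping padding are made-up choices for illustration:

```python
import numpy as np

def window_vector(word_vectors, center, m=2):
    """Concatenate the center word's vector with its m neighbors on each side.

    word_vectors: (T, d) matrix of vectors for a sentence of T words.
    Returns a vector of length (2m + 1) * d that preserves position information,
    unlike averaging. Out-of-range positions are clamped to the sentence edges
    (a simple padding choice for illustration).
    """
    T, _ = word_vectors.shape
    idxs = [min(max(i, 0), T - 1) for i in range(center - m, center + m + 1)]
    return np.concatenate([word_vectors[i] for i in idxs])

# Toy sentence of 6 words with 4-dimensional word vectors
vecs = np.random.default_rng(1).normal(size=(6, 4))
x_window = window_vector(vecs, center=3)   # shape (5 * 4,) = (20,)
```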

Updating concatenated word vectors
video link: work through the derivation yourself… it isn't hard
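
The key point the video derives, summarized here as a hedged sketch: since the window vector is just the concatenation of the word vectors, the gradient of the loss with respect to it splits back into one gradient per word vector, and each word vector in the window gets updated.

```latex
x_{window} = \begin{bmatrix} x_{t-m} \\ \vdots \\ x_t \\ \vdots \\ x_{t+m} \end{bmatrix} \in \mathbb{R}^{(2m+1)d}
\qquad\Rightarrow\qquad
\nabla_{x_{window}} J = \begin{bmatrix} \nabla_{x_{t-m}} J \\ \vdots \\ \nabla_{x_t} J \\ \vdots \\ \nabla_{x_{t+m}} J \end{bmatrix}
```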

  • Training with only a softmax function is not a good approach
  • A single softmax can only produce a linear decision boundary
  • That is why we need to use a neural network, which can learn a non-linear decision boundary (a minimal scoring network is sketched below)
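
A minimal sketch of a single hidden layer network that scores a window, in the usual s = u^T f(Wx + b) form; the layer sizes and the tanh nonlinearity are illustrative assumptions:

```python
import numpy as np

def window_score(x, W, b, u):
    """Score a concatenated window vector with one hidden layer.

    x: (n,) window vector, W: (h, n), b: (h,), u: (h,).
    The elementwise nonlinearity is what allows a non-linear
    decision boundary, unlike a plain softmax classifier.
    """
    z = W @ x + b        # hidden layer pre-activation
    a = np.tanh(z)       # f(z): elementwise nonlinearity
    return float(u @ a)  # scalar score s = u^T a
```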


Max-margin loss

Idea for the objective function: make the score s of the true window larger and the corrupt window's score s_c lower, i.e. minimize

J = max(0, 1 - s + s_c)
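
A tiny sketch of this hinge loss, where s is the score of a true window and s_c the score of a corrupted window; the example scores are made up:

```python
def max_margin_loss(s, s_c):
    """J = max(0, 1 - s + s_c): the loss is zero once the true window
    outscores the corrupt window by a margin of at least 1, so only
    margin-violating examples contribute to the gradient."""
    return max(0.0, 1.0 - s + s_c)

print(max_margin_loss(2.1, 1.5))  # 0.4: margin not yet satisfied
print(max_margin_loss(3.0, 1.5))  # 0.0: true window wins by more than 1
```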

The rest is mostly backprop in a multi-layer perceptron, so there isn't much here worth summarizing separately.