## [cs224n] Lecture 4 Word Window Classification and Neural Networks


Yumere 2018.03.25 15:27

## Word Window Classification and Neural Networks

### Overview

• Classification background
• Updating word vectors for classification
• Window classification & cross entropy derivation tips
• A single layer neural network
• Max-Margin loss and backprop

So far, the Word2Vec methods using Skip-gram & CBOW have been unsupervised. Classification, in contrast, is a supervised task: we are given a training dataset $\{x_i, y_i\}_{i=1}^N$ of inputs $x_i$ and labels $y_i$.

Goal:
$p(y|x)=\frac{\exp(W_y x)}{\sum_{c=1}^C \exp(W_c x)}$
$\text{where } W\in\mathbb{R}^{C\times d}$

Details of the softmax:

$p(y|x) = \frac{\exp(f_y)}{\sum_{c=1}^C \exp(f_c)} = \text{softmax}(f)_y$

where $f_y$ and $f_c$ are the unnormalized outputs (here $f = Wx$)

For each training example {$x,y$}, our objective is to maximize the probability of the correct class $y$

Hence, minimize the negative log probability of that class

$-\log{p(y|x)} = -\log{\frac{\exp(f_y)}{\sum_{c=1}^C \exp(f_c)}}$

Maximize probability = Minimize negative log probability
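
A minimal numpy sketch of this softmax classifier and its negative-log-likelihood loss; the weight matrix `W`, input `x`, and label `y` below are toy placeholders:

```python
import numpy as np

def softmax(f):
    """Numerically stable softmax over the class scores f."""
    f = f - np.max(f)            # shift for stability; softmax is shift-invariant
    exp_f = np.exp(f)
    return exp_f / np.sum(exp_f)

# Toy setup: C = 3 classes, d = 5 input dimensions (illustrative values)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))      # W in R^{C x d}
x = rng.normal(size=5)           # input vector
y = 1                            # index of the correct class

p = softmax(W @ x)               # p(y|x) for all classes
loss = -np.log(p[y])             # negative log probability of the true class
print(p, loss)
```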

### Regularization term

• The really full loss function over any dataset includes a regularization term over all parameters $\theta$
• Regularization will prevent overfitting when we have a lot of features (or, later, a very powerful/deep model)

$J(\theta) = \frac{1}{N}\displaystyle\sum_{i=1}^N-\log{\frac{e^{f_{y_i}}}{\sum_{c=1}^Ce^{f_c}}} + \lambda\sum_k\theta_k^2$
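
As a sketch, the regularized loss just adds the L2 penalty $\lambda\sum_k\theta_k^2$ to the averaged cross-entropy; the values below are made up:

```python
import numpy as np

def regularized_loss(data_losses, theta, lam):
    """Mean cross-entropy over the dataset plus the L2 penalty lam * sum(theta_k^2)."""
    return np.mean(data_losses) + lam * np.sum(theta ** 2)

losses = np.array([0.7, 0.2, 1.1])   # hypothetical per-example -log p(y_i|x_i)
theta  = np.array([0.5, -1.0, 2.0])  # hypothetical flattened parameter vector
print(regularized_loss(losses, theta, lam=0.01))
```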

[Figure: training error (blue) keeps falling as model power / training iterations increase (x-axis), while test error (red) eventually rises without regularization.]

### Window Classification

• Classifying single words is rarely done
• e.g. antonyms and ambiguous named entities can only be resolved in context
• So, we need window classification

Window classification means classifying a word in the context window of its neighboring words.

• Many possibilities exist for classifying one word in context, e.g. averaging all the word vectors in a window, but that loses position information; instead, concatenate them, as in the sketch below
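
A minimal sketch of building the concatenated window vector, assuming a toy embedding matrix `L` (rows are word vectors) and window size $m = 2$; all names and values are illustrative:

```python
import numpy as np

# Illustrative embedding matrix: vocabulary of 10 words, d = 4 dimensions
rng = np.random.default_rng(0)
L = rng.normal(size=(10, 4))

def window_input(word_ids, center, m=2):
    """Concatenate the vectors of the center word and its m neighbors on each side
    (assumes the center index is at least m away from the sentence boundary)."""
    ids = word_ids[center - m : center + m + 1]
    return np.concatenate([L[i] for i in ids])   # shape: ((2m+1) * d,)

sentence = [3, 1, 4, 1, 5, 9, 2]                 # toy word indices
x_window = window_input(sentence, center=2)      # window around the 3rd word
print(x_window.shape)                            # (20,) = 5 * 4
```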

Updating concatenated word vectors
Video link: work through it yourself… it is not difficult. (A gradient-update sketch follows.)
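
The key step is that the gradient with respect to the concatenated window vector is split back into per-word slices, each of which updates its own word vector. A sketch continuing the toy setup above; `grad_x` is a hypothetical gradient $\partial J/\partial x$:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(10, 4))     # same toy embedding matrix as above

def update_word_vectors(word_ids, center, grad_x, lr=0.01, m=2, d=4):
    """Split the gradient on the concatenated window vector into per-word
    pieces and take an SGD step on each word vector in L."""
    ids = word_ids[center - m : center + m + 1]
    for j, i in enumerate(ids):
        L[i] -= lr * grad_x[j * d : (j + 1) * d]

sentence = [3, 1, 4, 1, 5, 9, 2]
grad_x = rng.normal(size=20)     # hypothetical gradient dJ/dx for the window
update_word_vectors(sentence, center=2, grad_x=grad_x)
```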

• Training with only a softmax function is not a good approach
• Having only a single softmax means you can only draw a linear decision boundary
• That is why we should use a neural network, as in the figure on the right below (see the sketch after this list)
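
A minimal sketch of the single-hidden-layer scorer from the lecture, $s = u^\top f(Wx + b)$ with $f = \tanh$; the dimensions below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 20))     # hidden layer weights (8 hidden units, 20-dim window)
b = np.zeros(8)                  # hidden layer bias
u = rng.normal(size=8)           # scoring vector

def score(x):
    """s = u^T a, where a = f(Wx + b) and f is an elementwise nonlinearity."""
    a = np.tanh(W @ x + b)       # hidden activations give a non-linear boundary
    return u @ a                 # scalar window score

x = rng.normal(size=20)          # concatenated window vector
print(score(x))
```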

### Max-margin loss

Idea for the objective function: make the score of the true window larger and the corrupt window's score lower; then minimize

$J = \max(0, 1 - s + s_c)$
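
Here $s$ is the score of the true window and $s_c$ the score of a corrupt window (e.g. the center word replaced by a random word). A minimal sketch of this hinge loss; the example scores are made up:

```python
def max_margin_loss(s_true, s_corrupt):
    """J = max(0, 1 - s + s_c): zero once the true window's score beats
    the corrupt window's score by a margin of at least 1."""
    return max(0.0, 1.0 - s_true + s_corrupt)

print(max_margin_loss(2.0, 0.5))   # margin satisfied -> 0.0
print(max_margin_loss(0.3, 0.5))   # margin violated  -> 1.2
```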

The rest is mostly backprop through a multi-layer perceptron, so there is not much else to summarize.
