

[cs224n] Lecture 4 Word Window Classification and Neural Networks

Word Window Classification and Neural Networks

Overview

  • Classification background
  • Updating word vectors for classification
  • Window classification & cross entropy derivation tips
  • A single layer neural network
  • Max-Margin loss and backprop

So far, the Word2Vec methods using Skip-gram & CBOW have been unsupervised methods.

Goal:
p(y|x) = \frac{exp(W_y x)}{\sum_{c=1}^C exp(W_c x)}
\text{where } W \in \mathbb{R}^{C \times d}

Details of softmax

p(y|x) = \frac{exp(f_y)}{\sum_{c=1}^C exp(f_c)} = softmax(f)_y

\text{where } f_y, f_c \text{ are the unnormalized outputs}

For each training example {x, y}, our objective is to maximize the probability of the correct class y

Hence, minimize the negative log probability of that class

-\log p(y|x) = -\log \frac{exp(f_y)}{\sum_{c=1}^C exp(f_c)}

Maximize probability = Minimize negative log probability
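
A minimal NumPy sketch of this computation; the class count, dimensionality, random weights, and true class index below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def softmax(f):
    """Numerically stable softmax over a vector of unnormalized scores."""
    f = f - f.max()              # shifting doesn't change the result, avoids overflow
    exp_f = np.exp(f)
    return exp_f / exp_f.sum()

# Toy sizes (illustrative): C classes, d-dimensional inputs
C, d = 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(C, d))      # W in R^{C x d}
x = rng.normal(size=d)           # input vector
y = 1                            # index of the correct class

f = W @ x                        # unnormalized scores, f_c = W_c x
p = softmax(f)                   # p(.|x) = softmax(f)
loss = -np.log(p[y])             # -log p(y|x): the quantity we minimize
print(p, loss)
```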


Regularization term

  • The full loss function over any dataset really includes a regularization term over all parameters θ
  • Regularization will prevent overfitting when we have a lot of features (or, later, a very powerful/deep model)

J(\theta) = \frac{1}{N}\sum_{i=1}^N -\log \frac{e^{f_{y_i}}}{\sum_{c=1}^C e^{f_c}} + \lambda \sum_k \theta_k^2
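
A hedged NumPy sketch of this full objective on a small batch; here the only parameters are W, and the default value of lambda is an arbitrary illustrative choice:

```python
import numpy as np

def regularized_loss(W, X, Y, lam=1e-3):
    """Average negative log likelihood plus an L2 penalty on the parameters.

    W: (C, d) weights, X: (N, d) inputs, Y: (N,) true class indices,
    lam: regularization strength lambda (illustrative default).
    """
    F = X @ W.T                                    # (N, C) unnormalized scores
    F = F - F.max(axis=1, keepdims=True)           # numerical stability
    P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)
    nll = -np.log(P[np.arange(len(Y)), Y]).mean()  # (1/N) sum_i -log p(y_i | x_i)
    reg = lam * np.sum(W ** 2)                     # lambda * sum_k theta_k^2
    return nll + reg
```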

(Figure: x-axis is model power or number of training iterations; training error (blue) keeps decreasing while test error (red) eventually rises, which is the overfitting that regularization counters.)


Window Classification

  • Classifying single words is rarely done
    • antonyms, ambiguous named entities, etc.
  • So, we need window classification

Window classification means classifying a word in the context window of its neighboring words.

  • Many possibilities exist for classifying one word in context, e.g. averaging all the word vectors in the window, but that loses position information; concatenating the window's word vectors keeps it (see the sketch below)
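
A sketch of building the concatenated window representation instead of averaging; the window size, sentence length, dimensionality, and the edge-clamping padding are made-up choices for illustration:

```python
import numpy as np

def window_vector(word_vectors, center, m=2):
    """Concatenate the center word's vector with its m neighbors on each side.

    word_vectors: (T, d) matrix of vectors for a sentence of T words.
    Returns a vector of length (2m + 1) * d that preserves position information,
    unlike averaging. Out-of-range positions are clamped to the sentence edges
    (a simple padding choice for illustration).
    """
    T, _ = word_vectors.shape
    idxs = [min(max(i, 0), T - 1) for i in range(center - m, center + m + 1)]
    return np.concatenate([word_vectors[i] for i in idxs])

# Toy sentence of 6 words with 4-dimensional word vectors
vecs = np.random.default_rng(1).normal(size=(6, 4))
x_window = window_vector(vecs, center=3)   # shape (5 * 4,) = (20,)
```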

Updating concatenated word vectors
video link: work through the derivation yourself… it isn't hard
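
The key point the video derives, summarized here as a hedged sketch: since the window vector is just the concatenation of the word vectors, the gradient of the loss with respect to it splits back into one gradient per word vector, and each word vector in the window gets updated.

```latex
x_{window} = \begin{bmatrix} x_{t-m} \\ \vdots \\ x_t \\ \vdots \\ x_{t+m} \end{bmatrix} \in \mathbb{R}^{(2m+1)d}
\qquad\Rightarrow\qquad
\nabla_{x_{window}} J = \begin{bmatrix} \nabla_{x_{t-m}} J \\ \vdots \\ \nabla_{x_t} J \\ \vdots \\ \nabla_{x_{t+m}} J \end{bmatrix}
```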

  • Training with only a softmax function is not a good approach
  • A single softmax can only produce a linear decision boundary
  • That is why we need to use a neural network, which can learn a non-linear decision boundary (a minimal scoring network is sketched below)
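
A minimal sketch of a single hidden layer network that scores a window, in the usual s = u^T f(Wx + b) form; the layer sizes and the tanh nonlinearity are illustrative assumptions:

```python
import numpy as np

def window_score(x, W, b, u):
    """Score a concatenated window vector with one hidden layer.

    x: (n,) window vector, W: (h, n), b: (h,), u: (h,).
    The elementwise nonlinearity is what allows a non-linear
    decision boundary, unlike a plain softmax classifier.
    """
    z = W @ x + b        # hidden layer pre-activation
    a = np.tanh(z)       # f(z): elementwise nonlinearity
    return float(u @ a)  # scalar score s = u^T a
```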


Max-margin loss

Idea for the objective function: make the score s of the true window larger and the corrupt window's score s_c lower, i.e. minimize

J = max(0, 1 - s + s_c)
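
A tiny sketch of this hinge loss, where s is the score of a true window and s_c the score of a corrupted window; the example scores are made up:

```python
def max_margin_loss(s, s_c):
    """J = max(0, 1 - s + s_c): the loss is zero once the true window
    outscores the corrupt window by a margin of at least 1, so only
    margin-violating examples contribute to the gradient."""
    return max(0.0, 1.0 - s + s_c)

print(max_margin_loss(2.1, 1.5))  # 0.4: margin not yet satisfied
print(max_margin_loss(3.0, 1.5))  # 0.0: true window wins by more than 1
```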

The rest is mostly backprop in a multi-layer perceptron, so there isn't much here worth summarizing separately.