Wednesday Group PAPER REVIEW

Discussion

Why the number 0.25?
- 0.25 was the conventional choice at the time
How to turn off grad?
- Set requires_grad=False on the Tensor
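
As a minimal sketch of the point above, assuming PyTorch: the sizes, the random stand-in for pre-trained vectors, and the SGD optimizer are all hypothetical choices for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab_size, embed_dim = 10, 4

# Stand-in for pre-trained vectors (random here; a real run would load word2vec).
pretrained = torch.randn(vocab_size, embed_dim)

# Option 1: freeze the embedding when it is created.
static_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# Option 2: switch gradients off on an existing parameter.
other_emb = nn.Embedding(vocab_size, embed_dim)
other_emb.weight.requires_grad = False  # same effect as .requires_grad_(False)

# Hand only the trainable parameters to the optimizer.
model = nn.Sequential(static_emb, nn.Linear(embed_dim, 2))
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
```

Here `trainable` holds only the weight and bias of the linear layer, so the frozen embedding is never updated.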
L2 constraint on the weight vector?
- This is not gradient clipping; it is a constraint on the weight vectors themselves
- employ dropout on the penultimate layer with a constraint on l2-norms of the weight vectors (Hinton et al., 2012)
  - We use the standard, stochastic gradient descent procedure for training the dropout neural networks on mini-batches of training cases, but we modify the penalty term that is normally used to prevent the weights from growing too large. Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit.
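
The upper bound described in the quote can be sketched roughly as follows, assuming PyTorch and an `nn.Linear` layer whose rows are the incoming weight vectors of its output units. The helper name `clip_weight_norms` and the bound of 3.0 are illustrative assumptions, and the call would typically run after each optimizer step.

```python
import torch
import torch.nn as nn

def clip_weight_norms(layer: nn.Linear, max_norm: float = 3.0) -> None:
    """Rescale each row of layer.weight so that its L2 norm is at most max_norm."""
    with torch.no_grad():
        w = layer.weight                          # shape: (out_features, in_features)
        norms = w.norm(p=2, dim=1, keepdim=True)  # one norm per output unit
        # Rows already inside the bound keep scale 1; the rest are shrunk onto it.
        scale = torch.clamp(max_norm / (norms + 1e-12), max=1.0)
        w.mul_(scale)

fc = nn.Linear(5, 2)
with torch.no_grad():
    fc.weight.fill_(10.0)   # deliberately far outside the bound
clip_weight_norms(fc, max_norm=3.0)
```

Unlike gradient clipping, nothing here touches the gradients: only the weights are rescaled, and only when a row exceeds the bound.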
How to implement multi-channel?
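
One possible answer, as a hedged sketch in PyTorch: two embedding tables start from the same vectors, one kept static and one fine-tuned, and they are stacked as two input channels so that each filter sees both and the results are summed. The class name `MultiChannelEmbedding`, the sizes, and the use of `nn.Conv2d` are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiChannelEmbedding(nn.Module):
    def __init__(self, pretrained: torch.Tensor):
        super().__init__()
        # Both channels start from the same vectors; only one is updated by training.
        self.static = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)
        self.tuned = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> output: (batch, 2, seq_len, embed_dim)
        return torch.stack([self.static(token_ids), self.tuned(token_ids)], dim=1)

vocab_size, embed_dim, seq_len = 20, 8, 7   # hypothetical sizes
emb = MultiChannelEmbedding(torch.randn(vocab_size, embed_dim))

# A window of 3 words spanning the full embedding width.
# Conv2d sums the contributions of the 2 input channels for every filter.
conv = nn.Conv2d(in_channels=2, out_channels=4, kernel_size=(3, embed_dim))

ids = torch.randint(0, vocab_size, (5, seq_len))
features = conv(emb(ids)).squeeze(3)        # (batch, 4, seq_len - 3 + 1)
```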
time-series
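- Early on, language problems were approached with DNNs, which could not take in sequential (time-series) information
- → N-grams came into use
- → RNNs tried to solve the problem with a hidden state
- → CNNs can also get an n-gram-like effect through their filters (⇒ the current paper)
- → The Transformer tries to handle the same sequential problem with positional encoding
- Conclusion: language is a problem in which the order of words in a sentence is an important factor, and different models inject this 'time' element in different ways; one cannot say that only RNNs are time-series models and the others are not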
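
To make the n-gram analogy for CNN filters concrete, here is a small sketch assuming PyTorch, with hypothetical sizes and random values. A filter of width n scores every window of n consecutive words, which is the same sliding dot product that `conv1d` computes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, embed_dim, n = 6, 4, 3             # hypothetical sizes; n is the window width

sentence = torch.randn(seq_len, embed_dim)  # one row per word, in sentence order
filt = torch.randn(n, embed_dim)            # one filter covering n consecutive words

# Explicit sliding window: one score per n-gram position.
manual = torch.stack(
    [(sentence[i:i + n] * filt).sum() for i in range(seq_len - n + 1)]
)

# The same computation with conv1d, which expects (batch, channels, length).
x = sentence.t().unsqueeze(0)               # (1, embed_dim, seq_len)
w = filt.t().unsqueeze(0)                   # (1, embed_dim, n)
conv = F.conv1d(x, w).squeeze()             # (seq_len - n + 1,)
```

Because each score depends on which words are adjacent, reordering the sentence changes the output, which is how word order enters the model without a recurrent state.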

ByteNet, proposed to address the long-term dependency problem: https://arxiv.org/abs/1610.10099

Words missing from word2vec must be given random vectors (Words not present in the set of pre-trained words are initialized randomly) :
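
A rough sketch of that initialization, assuming PyTorch: the helper `build_embedding_matrix`, the toy vectors, and the uniform range with bound 0.25 are hypothetical choices for illustration and are not confirmed by these notes.

```python
import torch

def build_embedding_matrix(vocab, pretrained, embed_dim, bound=0.25, seed=0):
    """Return a (len(vocab), embed_dim) tensor.

    Words found in `pretrained` keep their vectors; all other words are
    drawn from U(-bound, bound).
    """
    gen = torch.Generator().manual_seed(seed)
    rows = []
    for word in vocab:
        if word in pretrained:
            rows.append(torch.as_tensor(pretrained[word], dtype=torch.float32))
        else:
            # Not in the pre-trained set: random initialization.
            rows.append((torch.rand(embed_dim, generator=gen) * 2 - 1) * bound)
    return torch.stack(rows)

# Toy stand-in for word2vec: a dict from word to vector.
toy_vectors = {"good": [0.1, 0.2, 0.3], "movie": [0.0, -0.1, 0.4]}
matrix = build_embedding_matrix(["good", "movie", "zzyzx"], toy_vectors, embed_dim=3)
```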
CNN - static :
CNN - multi-channel :
L2 norm weight :
CNN max-over-time-pooling
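
Max-over-time pooling can be sketched as below, assuming PyTorch and a hand-written feature map of hypothetical shape (batch, num_filters, num_positions): each filter keeps only its largest activation across all positions in the sentence.

```python
import torch

# Hypothetical feature maps: 1 sentence, 2 filters, 3 window positions.
feature_maps = torch.tensor([[[0.1, 0.9, 0.3],
                              [0.5, 0.2, 0.4]]])

# Keep the largest activation of each filter across all positions.
pooled, _ = feature_maps.max(dim=2)   # shape: (batch, num_filters)
```

The pooled vector has one value per filter regardless of how many positions there were, so sentences of different lengths map to a fixed-size representation.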

. Performance difference between fastText and GloVe

. CNN > RNN?

. Is word2vec uniform or not?