CLIP

details

trained from scratch
- Neither an ImageNet-pretrained image encoder nor a pretrained text encoder is used for initialization
efficient pre-training method
- I recall some follow-up work arguing that CLIP is problematic because it only learns co-occurrence, but CLIP made that choice for the sake of “efficient” training
  - https://github.com/long8v/PTIR/issues/106
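- The convergence-speed graph compares three objectives:
  - Transformer Language Model: given an image, generate the caption, as with a captioning loss
  - Bag-of-Words Prediction: a classifier that predicts whether each piece of text related to the image is present or not
  - Bag-of-Words Contrastive probably converges faster than Bag-of-Words Prediction because it does not have to predict the “exact word”?
    - This is more reasonable when the image-text labels are noisy, since the text only needs to carry the semantic information
    - In that light, the bag-of-words multi-label classification in PaLI-X actually looked better than a contrastive objective (OCR has to predict the exact words)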
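To make the contrastive objective concrete, here is a minimal pure-Python sketch of a symmetric image-text contrastive loss. It is an illustration under assumptions, not the reference implementation: the function names are made up for this note, the embeddings are toy values, and the default `logit_scale` of 1 / 0.07 simply mirrors the temperature initialization listed under training.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def contrastive_loss(image_embs, text_embs, logit_scale=1 / 0.07):
    # Row i of image_embs and row i of text_embs are assumed to be a matched pair.
    image_embs = [l2_normalize(v) for v in image_embs]
    text_embs = [l2_normalize(v) for v in text_embs]
    n = len(image_embs)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_embs]
        for img in image_embs
    ]
    # Image -> text: each row should pick its own column.
    loss_i = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text -> image: each column should pick its own row.
    loss_t = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i + loss_t) / 2

# Toy batch: matched pairs score a much lower loss than swapped pairs.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

The loss only asks which caption in the batch goes with which image, which is why no exact word ever has to be predicted.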
text encoder
- input transform (image side)
  - Only a random square crop is used; no other augmentation is applied
- GPT-2 style Transformer
  - 63M
  - 12 layers / 512 hidden dim / 8 num heads
  - vocab
    - BPE-trained vocab, size 49,152
  - max seq len 76
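- [SOS] / [EOS] tokens are added at the start and end; the last-layer feature at the [EOS] token is used as the text feature
- Masked Self-attention
  - 1. To preserve the ability to initialize from a pretrained language model (?)
    - The text feature is read from the last-layer [EOS] token anyway, so it seems like it should not matter whether attention is masked
    - +) The model is trained from scratch, so why would this matter?
  - 2. To allow language modeling as an auxiliary loss (→ future work)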
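A small sketch of how masked (causal) self-attention interacts with [EOS] pooling may help with the question above. Under a causal mask, [EOS] is the one real-token position that can attend to the whole caption, so it is a natural place to read the text feature. The token ids (`SOS_ID`, `EOS_ID`, `PAD_ID`) and helper names are hypothetical, and the "hidden states" are placeholder strings rather than real activations.

```python
SOS_ID, EOS_ID, PAD_ID = 1, 2, 0  # hypothetical ids, for illustration only

def causal_mask(seq_len):
    # mask[i][j] is True when position i may attend to position j (j <= i).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def visible_positions(mask, i):
    # Positions that query position i is allowed to attend to.
    return [j for j, allowed in enumerate(mask[i]) if allowed]

def pool_text_feature(last_layer_states, token_ids, eos_id=EOS_ID):
    # Take the last-layer state at the first [EOS] position as the text feature.
    return last_layer_states[token_ids.index(eos_id)]

# Caption of two word tokens, padded to length 6.
token_ids = [SOS_ID, 11, 12, EOS_ID, PAD_ID, PAD_ID]
mask = causal_mask(len(token_ids))
eos_pos = token_ids.index(EOS_ID)

print(visible_positions(mask, 1))        # an early token sees only its prefix
print(visible_positions(mask, eos_pos))  # [EOS] sees every real token

states = ["h_sos", "h_11", "h_12", "h_eos", "h_pad", "h_pad"]
print(pool_text_feature(states, token_ids))
```

So masking does not hide anything from the pooled feature itself; it only changes what the earlier positions can see, which is what would matter for a language-modeling loss.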
Scaling
- For ResNets, width / depth / resolution are scaled up at the same rate
- For the text encoder, performance was not very sensitive to its capacity, so only its width (hidden dim) was increased, in proportion to the ResNet's width increase
training
- 5 ResNets, 3 ViTs
- ResNet
  - ResNet-50 / ResNet-101
  - RN50x4 / RN50x16 / RN50x64 (roughly 4x / 16x / 64x the compute of ResNet-50)
- ViT
  - ViT-B/32, ViT-B/16, ViT-L/14
- 32 epochs / AdamW
- lr cosine scheduler
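- Hyperparameters were searched with ResNet-50 trained for 1 epoch, then chosen heuristically for the larger models
- temperature initialized to ~0.07 / clipped so that the logit scale does not exceed 100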
- bs 32,768
- mixed-precision
- gradient checkpointing / half-precision Adam statistics
- RN50x64: 18 days on 592 V100s
- ViT-L/14: 12 days on 256 V100s
- ViT-L/14 was additionally trained for one more epoch at a higher resolution of 336
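The temperature and learning-rate bullets above can be sketched as follows. This is a hedged illustration: it assumes the temperature is stored as a log-parameterized logit scale (scale = 1 / temperature), and the function names and the `peak_lr` value in the usage lines are placeholders, not values from the notes.

```python
import math

MAX_LOGIT_SCALE = 100.0  # logits are never scaled by more than this

def init_log_logit_scale(temperature=0.07):
    # Learnable parameter stored in log space: log(1 / temperature).
    return math.log(1.0 / temperature)

def clipped_logit_scale(log_logit_scale):
    # Scale actually applied to the similarity logits, after clipping.
    return min(math.exp(log_logit_scale), MAX_LOGIT_SCALE)

def cosine_lr(step, total_steps, peak_lr):
    # Plain cosine decay from peak_lr at step 0 down to 0 at total_steps.
    progress = step / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

scale_at_init = clipped_logit_scale(init_log_logit_scale())
scale_if_runaway = clipped_logit_scale(math.log(500.0))
print(round(scale_at_init, 2), scale_if_runaway)

# peak_lr here is an arbitrary placeholder value.
print([round(cosine_lr(s, 100, peak_lr=1.0), 3) for s in (0, 50, 100)])
```

Clipping in this form only bounds how sharp the softmax can get; below the bound the scale is still free to be learned.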
detailed hparams

As described above, the text transformer's embedding dimension and number of heads were adjusted to match the ViT (512 → 768, num_heads 8 → 12).
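A quick arithmetic check on these numbers, as a rough sketch: the per-head dimension stays the same when width and head count grow together, and a back-of-the-envelope parameter count for the width-512 text encoder lands near the 63M listed earlier. The estimate assumes a standard Transformer block with a 4x MLP expansion and ignores biases, LayerNorm, and the final projection; the function names are made up for this note.

```python
def head_dim(width, num_heads):
    # Per-head dimension; width must divide evenly across heads.
    assert width % num_heads == 0
    return width // num_heads

def approx_text_encoder_params(width, layers=12, vocab_size=49152, context_len=76):
    # Per layer: 4*d^2 for attention (Q, K, V, output) + 8*d^2 for a 4x MLP.
    per_layer = 12 * width * width
    # Token embeddings plus learned positional embeddings.
    embeddings = (vocab_size + context_len) * width
    return layers * per_layer + embeddings

print(head_dim(512, 8), head_dim(768, 12))   # 64 64
print(approx_text_encoder_params(512))       # 62953472, about 63M
```

Under these assumptions, roughly 25M of the 63M sits in the embedding table rather than in the Transformer layers.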

performance by vision model scale

ocr capability
- dataset
  - MNIST
  - SVHN: Street View House Numbers
    - blurry
    - http://ufldl.stanford.edu/housenumbers/
  - IIIT5K : cropped word
  - Hateful Memes
    - https://ai.facebook.com/blog/hateful-memes-challenge-and-data-set/
  - SST-2
    - Originally an NLP dataset for sentiment classification
    - https://huggingface.co/datasets/sst2
    - It seems the sentences were rendered as digital images for this evaluation
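- results
  - Digitally rendered text works well (MNIST / SVHN do poorly because the digits are handwritten or blurry)
  - Performance on Hateful Memes is good too
  - With linear probing on SST-2, performance was similar to using CBoW!
  - Not sure how the zero-shot performance was evaluated
    - For sentiment classification, what would the target prompts be?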
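For the open question above, here is a generic sketch of how zero-shot classification with prompts usually works: each class name is dropped into a text template, the prompts are embedded, and the image is assigned to the prompt with the highest cosine similarity. Everything specific here is an assumption for illustration: the template `"a {} review of a movie."`, the class names, and the toy embeddings are hypothetical and not taken from the paper, and `encode_text` stands in for a real text encoder.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_names, prompt_template, encode_text):
    # Build one prompt per class, score each against the image embedding.
    prompts = [prompt_template.format(name) for name in class_names]
    scores = [cosine(image_emb, encode_text(p)) for p in prompts]
    best = max(range(len(class_names)), key=scores.__getitem__)
    return class_names[best], dict(zip(prompts, scores))

# Toy stand-in for a text encoder: a lookup table of made-up embeddings.
TOY_TEXT_EMBS = {
    "a positive review of a movie.": [0.9, 0.1],
    "a negative review of a movie.": [0.1, 0.9],
}

# Made-up embedding of an image containing a rendered sentence.
rendered_sentence_emb = [0.8, 0.3]

label, scores = zero_shot_classify(
    rendered_sentence_emb,
    ["positive", "negative"],
    "a {} review of a movie.",  # hypothetical template
    TOY_TEXT_EMBS.__getitem__,
)
print(label)
```

With a real model, the rendered-sentence image would go through the image encoder and the prompts through the text encoder; the comparison step stays the same.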