CS 222: Natural Language Processing (NLP)
8-2: VLMs
Spring 2025
Slides modified from CMU 10-423/10-623 Generative AI & MIT EfficientML.ai
VLM intuition: take the text prompt (e.g., "How to feed a pig efficiently? …") and convert it into a sequence of tokens. Treat the image the same way: encode it as a sequence of tokens that the transformer can accept as input.
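To make the text half concrete, here is a minimal sketch of tokenization. The choice of the GPT-2 tokenizer is an assumption for illustration; the slides do not specify which tokenizer a VLM uses:

```python
# Minimal sketch of text tokenization (GPT-2 tokenizer is an
# assumed choice; the slides do not name a specific tokenizer).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ids = tokenizer("How to feed a pig efficiently?")["input_ids"]
print(ids)                                   # list of integer token ids
print(tokenizer.convert_ids_to_tokens(ids))  # the subword tokens
```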
VLM Encoder
Two common types of VLM encoders:
a. CLIP-based VLM encoder (used in GPT-4V)
b. VQ-VAE-based VLM encoder
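As a sketch of option (a), the snippet below extracts image features with a pretrained CLIP vision encoder via Hugging Face transformers. The checkpoint name and image path are assumptions for illustration; the slides name CLIP but no specific weights:

```python
# Sketch: extracting image features with a CLIP vision encoder.
# Checkpoint and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)   # (1, 512) image embedding
print(feats.shape)
```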
VLM - ViT - CLIP
Convert 2D Images to a Sequence of Patches
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [Dosovitskiy et al., 2021]
Each patch is a token.
Image size: 96×96. Patch size: 32×32.
Number of tokens: 3×3 = 9. Dimension of each token: 3×32×32 = 3072.
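A minimal PyTorch sketch of this patching step (the framework choice is ours, not the slides'), confirming 9 tokens of dimension 3072:

```python
# Sketch: split a 96x96 RGB image into non-overlapping 32x32 patches.
import torch

img = torch.randn(1, 3, 96, 96)                     # (batch, channels, H, W)
patches = img.unfold(2, 32, 32).unfold(3, 32, 32)   # (1, 3, 3, 3, 32, 32)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 9, 3 * 32 * 32)
print(tokens.shape)                                 # torch.Size([1, 9, 3072])
```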
Practical Implementation
Convert the 2D image to a sequence of patches:
32×32 Conv, stride 32, padding 0; in_channels=3, out_channels=768
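That convolution is exactly the patch embedding in code. A minimal PyTorch sketch with the slide's dimensions (768 is ViT-Base's hidden size):

```python
# Sketch: the patch-embedding convolution from the slide.
# A 32x32 conv with stride 32 maps each non-overlapping patch
# to one 768-dim token.
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=768,
                        kernel_size=32, stride=32, padding=0)

img = torch.randn(1, 3, 96, 96)
feat = patch_embed(img)                    # (1, 768, 3, 3)
tokens = feat.flatten(2).transpose(1, 2)   # (1, 9, 768): 9 tokens
print(tokens.shape)
```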
Apply the Standard Transformer Encoder
1. Convert the 2D image to a sequence of patches.
2. Feed the patch embeddings to the standard transformer encoder.
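Putting the two steps together, a compact ViT-style encoder sketch. The depth and head count are illustrative assumptions, and the [CLS] token and final LayerNorm of the real ViT are omitted for brevity:

```python
# Sketch: patch embedding + standard transformer encoder (ViT-style).
# depth/heads are illustrative assumptions, not values from the slides.
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    def __init__(self, img=96, patch=32, dim=768, depth=2, heads=8):
        super().__init__()
        n_tokens = (img // patch) ** 2                 # 9 for 96/32
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                              # x: (B, 3, 96, 96)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, 9, dim)
        return self.encoder(t + self.pos)              # (B, 9, dim)

out = TinyViTEncoder()(torch.randn(2, 3, 96, 96))
print(out.shape)   # torch.Size([2, 9, 768])
```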
Image Classification Results
ViT is inferior to CNNs when the dataset size is limited.
[Figure: classification accuracy of CNN vs. ViT as pre-training dataset size varies; from "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [Dosovitskiy et al., 2021]]
ViT surpasses CNNs when pre-training with a large dataset.
[Figure: with large-scale pre-training, the ViT curves overtake the CNN baseline; from Dosovitskiy et al., 2021]
Motivation
● ViT needs large datasets to work well.
● Labeling large datasets is costly.
Image credit: https://web.cs.ucdavis.edu/~hpirsiav/papers/transfer_cvpr18.pdf
Solution: train with an unlabeled dataset.
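One widely used instance of such label-free training, consistent with the CLIP encoder named above, is contrastive pre-training on web image-text pairs, which needs no manual class labels. A minimal sketch of CLIP's symmetric contrastive loss; the embedding size and temperature value are illustrative assumptions:

```python
# Sketch: CLIP-style symmetric contrastive loss over a batch of
# image/text embedding pairs (no class labels needed).
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize, then compute pairwise cosine similarities.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B)
    targets = torch.arange(logits.size(0))      # matched pairs on diagonal
    # Cross-entropy in both directions (image->text and text->image).
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```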