CS 222: Natural Language Processing (NLP)
8-1: PEFT & LoRA, Spring 2025
Slides modified from CMU 10-423/10-623 Generative AI and Princeton COS 597G
LLM Learning Paradigms
2: In-Context Learning (No Parameter Updates)
● Pretrain a language model on the language modeling (LM) task.
● Manually design a "prompt" (e.g., few-shot) that demonstrates how to formulate the task as a generation task (see the example below).
● No need to update the model weights at all!
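For example, a hypothetical few-shot prompt for a sentiment task might look like the following sketch; the task, examples, and labels are illustrative and not from the slides.

```python
# Hypothetical few-shot prompt: the task is posed as text generation, and the
# frozen LM is simply asked to continue the pattern. No weights are updated.
prompt = (
    "Review: The food was amazing. Sentiment: positive\n"
    "Review: Service was slow and rude. Sentiment: negative\n"
    "Review: Great view, decent prices. Sentiment:"
)
# The model's continuation (e.g., " positive") is read off as the prediction.
```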
3: Parameter-Efficient Fine-Tuning (Update Partial Parameters)
● Standard fine-tuning requires a separate model per task, making it impractical to store thousands of full models for personalized use.
● If we fine-tuned only a subset of the parameters for each task, we could alleviate those storage costs. This is parameter efficiency.
Parameter-Efficient Fine-Tuning Goal: adapt a pretrained LLM to each new task while updating and storing only a small fraction of its parameters.
Parameter-Efficient Fine-Tuning
a. Subset: pick a subset of the parameters and fine-tune only those, e.g., only the top K layers of a (K+L)-layer deep neural network (a minimal sketch follows this list).
b. Adapters: add additional layers that have few parameters and tune only the parameters of those layers, keeping all others fixed.
c. Prefix Tuning: for a Transformer LM, pretend as if there exist many tokens that came before your sequence and tune the keys/values corresponding to those tokens.
d. LoRA: learn a small delta for each of the parameter matrices, with the delta chosen to be low rank.
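A minimal sketch of approach (a), assuming the Hugging Face GPT-2 implementation (where the Transformer blocks live in model.transformer.h); the choice of K = 2 and the learning rate are illustrative, not from the slides.

```python
import torch
from transformers import GPT2LMHeadModel

K = 2  # number of top Transformer layers to fine-tune (illustrative choice)

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Freeze everything ...
for p in model.parameters():
    p.requires_grad = False

# ... then unfreeze only the top K Transformer blocks and the final LayerNorm.
for block in model.transformer.h[-K:]:
    for p in block.parameters():
        p.requires_grad = True
for p in model.transformer.ln_f.parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")

# Only the unfrozen parameters go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```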
B: Adapter Modules
An adapter is a narrow bottleneck MLP inserted into each Transformer layer: a down-projection W_down ∈ R^{d×r} followed by a nonlinearity and an up-projection W_up ∈ R^{r×d}, where the bottleneck width r is much smaller than the hidden size d (often around 1% of the original dimension). Only the adapter parameters are trained; the rest of the network stays fixed.
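A minimal PyTorch sketch of such a bottleneck adapter; the module name, the ReLU nonlinearity, and the zero-initialized up-projection are assumptions for illustration rather than details from the slide.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project to r, nonlinearity, up-project back to d,
    with a residual connection. Only these parameters are trained."""

    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_model, r)   # down-projection d -> r (W_down above)
        self.up = nn.Linear(r, d_model)     # up-projection r -> d (W_up above)
        nn.init.zeros_(self.up.weight)      # start near-identity: adapter output ~ 0
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

# Example: hidden size 768 with a narrow bottleneck r = 8.
adapter = Adapter(d_model=768, r=8)
h = torch.randn(2, 16, 768)                 # (batch, seq_len, d_model)
print(adapter(h).shape)                     # torch.Size([2, 16, 768])
```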
[Figure: Decoder-only Transformer (stack of Transformer layers over "The bat made noise ...", where each hidden state h_t is used to predict p(w_t | h_t)) vs. Encoder-only Transformer (stack over "[CLS] cat sat [MASK]", predicting the masked token from its hidden state).]
Adapter (B) vs. Top K (A)
C: Prefix Tuning
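Building on item (c) above, here is a minimal, simplified sketch (single attention head, no causal mask) of how trainable key/value vectors for virtual prefix tokens can be prepended to an otherwise frozen attention layer; all module and parameter names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    """Self-attention with a learned 'virtual prefix': trainable key/value
    vectors are prepended as if extra tokens preceded the sequence. The base
    projections stay frozen; only the prefix parameters are trained."""

    def __init__(self, d_model: int, prefix_len: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        for p in self.parameters():
            p.requires_grad = False  # pretend these come from the frozen LM

        # Trainable keys/values for the virtual prefix tokens.
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        bsz = x.size(0)
        q = self.q_proj(x)
        k = torch.cat([self.prefix_k.unsqueeze(0).expand(bsz, -1, -1), self.k_proj(x)], dim=1)
        v = torch.cat([self.prefix_v.unsqueeze(0).expand(bsz, -1, -1), self.v_proj(x)], dim=1)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v

attn = PrefixAttention(d_model=64, prefix_len=5)
print(attn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
print([n for n, p in attn.named_parameters() if p.requires_grad])  # ['prefix_k', 'prefix_v']
```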
D: Low-Rank Adaptation (LoRA)
How large are LLMs?
Fine-Tuning LLMs without Regularization
Question: Why don't LLMs overfit when we fine-tune them without regularization?
Hypothesis: They are intrinsically low-dimensional.
LoRA Key Idea
A standard linear layer computes

z = W_0 x,   where W_0 ∈ R^{d×k}, x ∈ R^k, z ∈ R^d.

LoRA keeps W_0 frozen and adds a trainable low-rank branch in parallel:

z = W_0 x + B A x,   where A ∈ R^{r×k}, B ∈ R^{d×r}, and r << min(d, k).

The learned update ΔW = BA has rank at most r, and only A and B are trained.

[Figure inspired by He et al. (2022): x feeds both the frozen Linear layer W_0 and the low-rank path A, B; their outputs are summed to give z.]
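A minimal sketch of a LoRA linear layer matching the notation above; the zero-initialization of B and small Gaussian initialization of A follow the LoRA paper's convention, while the plain alpha scaling here is a simplification for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W_0 plus a trainable low-rank update: z = W_0 x + B A x."""

    def __init__(self, d: int, k: int, r: int, alpha: float = 1.0):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)            # pretrained weight, kept frozen
        self.W0.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^{r x k}, small Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # B in R^{d x r}, zero init so BA starts at 0
        self.alpha = alpha                               # simple scaling factor (simplified)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., k) -> z: (..., d); only A and B receive gradients.
        return self.W0(x) + self.alpha * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d=768, k=768, r=8)
x = torch.randn(4, 768)
print(layer(x).shape)                                    # torch.Size([4, 768])
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
n_frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(n_trainable, n_frozen)                             # 12288 trainable vs. 589824 frozen
```

After training, the low-rank update can be merged into the base weight (W_0 + alpha * B A), so the adapted model adds no extra inference latency.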