Instructor: Yue Dong
Office hour: Thursday 8 am - 9:30 am, MRB 4135 (or right after each class)

CS 222: Natural Language Processing (NLP)
8-1: PEFT & LoRA
Spring 2025
Slides modified from CMU 10-423/10-623 Generative AI and Princeton COS 597G


LLM Learning Paradigms

2: In-context Learning (No Parameter Updates)

  • Pretrain a language model on the generic language modeling (LM) task
  • Manually design a "prompt" (e.g., few-shot) that demonstrates how to formulate the task as a generation task
  • No need to update the model weights at all!

[Figure: GPT-3 few-shot prompting example (English-to-French translation, e.g., "cheese => fromage"), Brown et al. 2020]
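To make the few-shot idea concrete, here is a minimal illustrative example (not from the slides; the task, reviews, and labels are invented) of formulating sentiment classification as text generation: worked demonstrations are concatenated into a prompt and a frozen pretrained LM simply continues the text.

```python
# Minimal few-shot prompt construction (illustrative; examples are made up).
# The pretrained LM sees worked demonstrations plus a query and continues the text;
# no gradient updates are performed.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
    ("A masterpiece of modern cinema.", "positive"),
]
query = "The plot made no sense and the acting was wooden."

prompt = ""
for text, label in demonstrations:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # feed this string to a frozen pretrained LM and read off its continuation
```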

3: Parameter-Efficient Fine-Tuning (Update Partial Params)

  • Standard fine-tuning requires a separate model per task, making it impractical to store thousands of full models for personalized use.
  • If we fine-tuned only a subset of the parameters for each task, we could alleviate storage costs. This is parameter efficiency.

Image: He et al. (2022)

Parameter-Efficient Fine-Tuning Goal:

  • fine-tune fewer parameters,
  • but match the performance of full fine-tuning.

Parameter-Efficient Fine-Tuning

  a. Subset: pick a subset of the parameters and fine-tune only those (e.g., only the top K layers of a K+L-layer deep neural network); a minimal sketch of this option follows below
  b. Adapters: add additional layers that have few parameters and tune only the parameters of those layers, keeping all others fixed
  c. Prefix Tuning: for a Transformer LM, pretend as if there exist many tokens that came before your sequence and tune the keys/values corresponding to those tokens
  d. LoRA: learn a small delta for each of the parameter matrices, with the delta chosen to be low rank
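As a rough sketch of option (a), the snippet below freezes all parameters and then unfreezes only the top k blocks. It assumes a toy model that exposes its blocks as an ordered `layers` attribute; the class and attribute names are illustrative and not tied to any particular library.

```python
import torch.nn as nn

class ToyStack(nn.Module):
    """Stand-in for a deep network with an ordered list of layers."""
    def __init__(self, depth: int = 6, dim: int = 32):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

def freeze_all_but_top_k(model: nn.Module, k: int) -> None:
    """Option (a): freeze everything, then unfreeze only the top-k layers."""
    for p in model.parameters():
        p.requires_grad = False
    for layer in model.layers[-k:]:
        for p in layer.parameters():
            p.requires_grad = True

model = ToyStack()
freeze_all_but_top_k(model, k=2)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable}/{total}")
```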

B: Adapter Modules

  • An adapter layer: a feed-forward neural network with one hidden layer and a residual connection
  • For input dimension d, the adapter layer also has output dimension d, but bottlenecks to a lower dimension r in the middle

Figure from https://arxiv.org/pdf/1902.

[Adapter bottleneck diagram: the input of dimension d is projected down to r by W_down ∈ R^{d×r}, passed through a nonlinearity, and projected back up to d by W_up ∈ R^{r×d}; r is usually a small fraction (on the order of 1%) of the original dimension d.]
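Below is a minimal PyTorch sketch of the adapter described above: a two-layer bottleneck (d → r → d) with a residual connection. The class and argument names are my own, and the near-identity (zero) initialization of the up-projection is a common choice rather than something stated on the slide.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project d -> r, nonlinearity, project r -> d, add residual.

    Only these parameters are trained; the surrounding transformer stays frozen.
    """
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)   # down-projection: d -> r (r << d, often ~1% of d)
        self.up = nn.Linear(r, d)     # up-projection:   r -> d
        self.act = nn.GELU()
        # Zero-init the up-projection so training starts near the frozen model (a common choice).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

h = torch.randn(2, 16, 768)      # (batch, seq_len, d)
out = Adapter(d=768, r=8)(h)     # same shape as the input
```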

[Figure: Decoder-only Transformer vs. Encoder-only Transformer. The decoder-only model stacks Transformer layers and predicts the next token p(w_t | h_t) at every position (example: "The bat made noise ..."); the encoder-only model predicts masked tokens from their hidden states (example input: "[CLS] cat [MASK] sat").]

Adapter (B) vs. Top K (A)

  • Pretrained Model: BERT-Large
  • Baseline Method: fine-tune the top K layers
  • Adapters achieve nearly the performance of full fine-tuning (i.e., ~0% delta) but with substantially fewer parameters

C: Prefix Tuning

  1. Inject (dummy) prefix tokens, indexed by P_idx, before the real tokens
  2. Represent the i-th prefix token's activation by trainable parameters: h_i = P_θ[i, :]
  3. For each i, let P_θ[i, :] = MLP(Q_θ[i, :]), because having Q_θ of lower dimension than P_θ improves stability during training
  4. During training, keep all Transformer parameters fixed except for θ (see the sketch below)

Figure from http://arxiv.org/abs/2101.
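Here is a minimal sketch of the trainable pieces of prefix tuning as described in steps 2 and 3: a small matrix Q_θ is optimized and mapped through an MLP to the full-width prefix activations P_θ. The sizes and names are illustrative, the frozen Transformer itself is omitted, and a full implementation would produce per-layer key/value prefixes rather than the single set of activations shown here.

```python
import torch
import torch.nn as nn

class PrefixParams(nn.Module):
    """Trainable prefix activations P_theta = MLP(Q_theta); the Transformer itself stays frozen."""
    def __init__(self, prefix_len: int, d_model: int, d_small: int = 64):
        super().__init__()
        # Q_theta: the smaller matrix that is actually optimized (improves training stability).
        self.Q = nn.Parameter(torch.randn(prefix_len, d_small) * 0.02)
        # MLP mapping each row Q_theta[i, :] to a full-width prefix activation P_theta[i, :].
        self.mlp = nn.Sequential(
            nn.Linear(d_small, d_model),
            nn.Tanh(),
            nn.Linear(d_model, d_model),
        )

    def forward(self) -> torch.Tensor:
        return self.mlp(self.Q)   # P_theta with shape (prefix_len, d_model)

prefix = PrefixParams(prefix_len=10, d_model=768)
P = prefix()                      # prepend these activations to the real tokens' states
print(P.shape)                    # torch.Size([10, 768])
```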

D: Low-Rank Adaptation (LoRA)

How large are LLMs?

| Model | Creators | Year of release | Training Data (# tokens) | Model Size (# parameters) |
|---|---|---|---|---|
| GPT-2 | OpenAI | 2019 | ~10 billion (40 GB) | 1.5 billion |
| GPT-3 (cf. ChatGPT) | OpenAI | 2020 | 300 billion | 175 billion |
| PaLM | Google | 2022 | 780 billion | 540 billion |
| Chinchilla | DeepMind | 2022 | 1.4 trillion | 70 billion |
| LaMDA (cf. Bard) | Google | 2022 | 1.56 trillion | 137 billion |
| LLaMA | Meta | 2023 | 1.4 trillion | 65 billion |
| LLaMA-2 | Meta | 2023 | 2 trillion | 70 billion |
| GPT-4 | OpenAI | 2023 | ?? | (1.76 trillion) |
| Gemini (Ultra) | Google | 2023 | ?? | (1.5 trillion) |
| LLaMA-3 | Meta | 2024 | 15 trillion | 405 billion |

Fine-Tuning LLMs without Regularization

Question: Why don't LLMs overfit when we fine-tune them without regularization?
Hypothesis: They are intrinsically low-dimensional.

LoRA Key Idea

  • Keep the original pretrained parameters W_0 fixed during fine-tuning
  • Learn an additive modification ΔW to those parameters
  • Define ΔW via a low-rank decomposition: ΔW = BA, where BA has rank r, which is much less than the input dimension k or the output dimension d

z = W_0 x + BAx = (W_0 + BA)x
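Putting the key idea into code, here is a minimal PyTorch sketch of a LoRA linear layer computing z = W_0 x + BAx with W_0 frozen; the class name is my own, and the initialization (random A, zero B, so that ΔW = BA = 0 at the start of fine-tuning) follows common practice rather than anything stated on the slide.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """z = W_0 x + B A x, with W_0 frozen and only A, B trained."""
    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)             # pretrained weight W_0 in R^{d x k}
        self.W0.weight.requires_grad = False              # frozen during fine-tuning
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^{d x r}; zero => BA = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., k) -> z: (..., d)
        return self.W0(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(d=768, k=768, r=8)
z = layer(torch.randn(4, 768))
print(z.shape)  # torch.Size([4, 768])
```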

LoRA Linear Layer

Standard Linear Layer:  z = W_0 x,  where W_0 ∈ R^{d×k}, x ∈ R^k, z ∈ R^d.

LoRA Linear Layer:  z = W_0 x + BAx,  where W_0 ∈ R^{d×k}, A ∈ R^{r×k}, B ∈ R^{d×r}, and r ≪ min(d, k).

Figure inspired by He et al. (2022).
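Because ΔW = BA has the same shape d×k as W_0, the learned update can be folded back into the pretrained weight once fine-tuning is done, so inference needs no extra matrix multiplies. A minimal sketch, continuing the hypothetical LoRALinear class from the previous snippet:

```python
import torch

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> None:
    """Fold Delta W = B A into the pretrained weight, then zero B.

    After merging, forward() computes (W_0 + BA)x with a single matmul,
    because the explicit BAx term contributes nothing once B is zero.
    """
    layer.W0.weight += layer.B @ layer.A   # W_0 <- W_0 + BA  (both are d x k)
    layer.B.zero_()                        # keeps forward() numerically unchanged

merge_lora(layer)                          # `layer` from the previous LoRALinear sketch
```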