Instructor: Yue Dong
Office hour: Thursday 8 am - 9:30 am, MRB 4135 (or right after each class)

CS 222: Natural Language Processing (NLP)
8-1: PEFT & LoRA
Spring 2025
Slides modified from CMU 10-423/10-623 Generative AI and Princeton COS 597G


LLM Learning Paradigms

2: In-context Learning (No Parameter Updates)

  • Pretrain a language model on the generic language modeling (LM) task
  • Manually design a "prompt" (e.g., few-shot) that demonstrates how to formulate the task as a generation task
  • No need to update the model weights at all!

[Figure: GPT-3 few-shot prompting example (English-to-French translation, e.g., "cheese => fromage"), Brown et al. 2020]
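To make the few-shot idea concrete, here is a minimal illustrative example (not from the slides; the task, reviews, and labels are invented) of formulating sentiment classification as text generation: worked demonstrations are concatenated into a prompt and a frozen pretrained LM simply continues the text.

```python
# Minimal few-shot prompt construction (illustrative; examples are made up).
# The pretrained LM sees worked demonstrations plus a query and continues the text;
# no gradient updates are performed.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
    ("A masterpiece of modern cinema.", "positive"),
]
query = "The plot made no sense and the acting was wooden."

prompt = ""
for text, label in demonstrations:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # feed this string to a frozen pretrained LM and read off its continuation
```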

3: Parameter-Efficient Fine-Tuning (Update Partial Params)

  • Standard fine-tuning requires a separate model per task, making it impractical to store thousands of full models for personalized use.
  • If we fine-tuned only a subset of the parameters for each task, we could alleviate storage costs. This is parameter efficiency.

Image: He et al. (2022)

Parameter-Efficient Fine-Tuning Goal:

  • fine-tune fewer parameters,
  • but match the performance of full fine-tuning.

Parameter-Efficient Fine-Tuning

  a. Subset: pick a subset of the parameters and fine-tune only those (e.g., only the top K layers of a K+L-layer deep neural network); a minimal sketch of this option follows below
  b. Adapters: add additional layers that have few parameters and tune only the parameters of those layers, keeping all others fixed
  c. Prefix Tuning: for a Transformer LM, pretend as if there exist many tokens that came before your sequence and tune the keys/values corresponding to those tokens
  d. LoRA: learn a small delta for each of the parameter matrices, with the delta chosen to be low rank
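As a rough sketch of option (a), the snippet below freezes all parameters and then unfreezes only the top k blocks. It assumes a toy model that exposes its blocks as an ordered `layers` attribute; the class and attribute names are illustrative and not tied to any particular library.

```python
import torch.nn as nn

class ToyStack(nn.Module):
    """Stand-in for a deep network with an ordered list of layers."""
    def __init__(self, depth: int = 6, dim: int = 32):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

def freeze_all_but_top_k(model: nn.Module, k: int) -> None:
    """Option (a): freeze everything, then unfreeze only the top-k layers."""
    for p in model.parameters():
        p.requires_grad = False
    for layer in model.layers[-k:]:
        for p in layer.parameters():
            p.requires_grad = True

model = ToyStack()
freeze_all_but_top_k(model, k=2)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable}/{total}")
```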

B: Adapter Modules

  • An adapter layer: a feed-forward neural network with one hidden layer and a residual connection
  • For input dimension d, the adapter layer also has output dimension d, but bottlenecks to a lower dimension r in the middle

Figure from https://arxiv.org/pdf/1902.

[Adapter bottleneck diagram: the input of dimension d is projected down to r by W_down ∈ R^{d×r}, passed through a nonlinearity, and projected back up to d by W_up ∈ R^{r×d}; r is usually a small fraction (on the order of 1%) of the original dimension d.]
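Below is a minimal PyTorch sketch of the adapter described above: a two-layer bottleneck (d → r → d) with a residual connection. The class and argument names are my own, and the near-identity (zero) initialization of the up-projection is a common choice rather than something stated on the slide.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: project d -> r, nonlinearity, project r -> d, add residual.

    Only these parameters are trained; the surrounding transformer stays frozen.
    """
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)   # down-projection: d -> r (r << d, often ~1% of d)
        self.up = nn.Linear(r, d)     # up-projection:   r -> d
        self.act = nn.GELU()
        # Zero-init the up-projection so training starts near the frozen model (a common choice).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

h = torch.randn(2, 16, 768)      # (batch, seq_len, d)
out = Adapter(d=768, r=8)(h)     # same shape as the input
```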

[Figure: Decoder-only Transformer vs. Encoder-only Transformer. The decoder-only model stacks Transformer layers and predicts the next token p(w_t | h_t) at every position (example: "The bat made noise ..."); the encoder-only model predicts masked tokens from their hidden states (example input: "[CLS] cat [MASK] sat").]

Adapter (B) vs. Top K (A)

  • Pretrained Model: BERT-Large
  • Baseline Method: fine-tune the top K layers
  • Adapters achieve nearly the performance of full fine-tuning (i.e., ~0% delta) but with substantially fewer parameters

C: Prefix Tuning

  1. Inject (dummy) prefix tokens, indexed by P_idx, before the real tokens
  2. Represent the i-th prefix token's activation by trainable parameters: h_i = P_θ[i, :]
  3. For each i, let P_θ[i, :] = MLP(Q_θ[i, :]), because having Q_θ of lower dimension than P_θ improves stability during training
  4. During training, keep all Transformer parameters fixed except for θ (see the sketch below)

Figure from http://arxiv.org/abs/2101.
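Here is a minimal sketch of the trainable pieces of prefix tuning as described in steps 2 and 3: a small matrix Q_θ is optimized and mapped through an MLP to the full-width prefix activations P_θ. The sizes and names are illustrative, the frozen Transformer itself is omitted, and a full implementation would produce per-layer key/value prefixes rather than the single set of activations shown here.

```python
import torch
import torch.nn as nn

class PrefixParams(nn.Module):
    """Trainable prefix activations P_theta = MLP(Q_theta); the Transformer itself stays frozen."""
    def __init__(self, prefix_len: int, d_model: int, d_small: int = 64):
        super().__init__()
        # Q_theta: the smaller matrix that is actually optimized (improves training stability).
        self.Q = nn.Parameter(torch.randn(prefix_len, d_small) * 0.02)
        # MLP mapping each row Q_theta[i, :] to a full-width prefix activation P_theta[i, :].
        self.mlp = nn.Sequential(
            nn.Linear(d_small, d_model),
            nn.Tanh(),
            nn.Linear(d_model, d_model),
        )

    def forward(self) -> torch.Tensor:
        return self.mlp(self.Q)   # P_theta with shape (prefix_len, d_model)

prefix = PrefixParams(prefix_len=10, d_model=768)
P = prefix()                      # prepend these activations to the real tokens' states
print(P.shape)                    # torch.Size([10, 768])
```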

D: Low-Rank Adaptation (LoRA)

How large are LLMs?

| Model | Creators | Year of release | Training Data (# tokens) | Model Size (# parameters) |
|---|---|---|---|---|
| GPT-2 | OpenAI | 2019 | ~10 billion (40 GB) | 1.5 billion |
| GPT-3 (cf. ChatGPT) | OpenAI | 2020 | 300 billion | 175 billion |
| PaLM | Google | 2022 | 780 billion | 540 billion |
| Chinchilla | DeepMind | 2022 | 1.4 trillion | 70 billion |
| LaMDA (cf. Bard) | Google | 2022 | 1.56 trillion | 137 billion |
| LLaMA | Meta | 2023 | 1.4 trillion | 65 billion |
| LLaMA-2 | Meta | 2023 | 2 trillion | 70 billion |
| GPT-4 | OpenAI | 2023 | ?? | (1.76 trillion) |
| Gemini (Ultra) | Google | 2023 | ?? | (1.5 trillion) |
| LLaMA-3 | Meta | 2024 | 15 trillion | 405 billion |

Fine-Tuning LLMs without Regularization

Question: Why don't LLMs overfit when we fine-tune them without regularization?
Hypothesis: They are intrinsically low-dimensional.

LoRA Key Idea

  • Keep the original pretrained parameters W_0 fixed during fine-tuning
  • Learn an additive modification ΔW to those parameters
  • Define ΔW via a low-rank decomposition: ΔW = BA, where BA has rank r, which is much less than the input dimension k or the output dimension d

z = W_0 x + BAx = (W_0 + BA)x
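Putting the key idea into code, here is a minimal PyTorch sketch of a LoRA linear layer computing z = W_0 x + BAx with W_0 frozen; the class name is my own, and the initialization (random A, zero B, so that ΔW = BA = 0 at the start of fine-tuning) follows common practice rather than anything stated on the slide.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """z = W_0 x + B A x, with W_0 frozen and only A, B trained."""
    def __init__(self, d: int, k: int, r: int):
        super().__init__()
        self.W0 = nn.Linear(k, d, bias=False)             # pretrained weight W_0 in R^{d x k}
        self.W0.weight.requires_grad = False              # frozen during fine-tuning
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^{d x r}; zero => BA = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., k) -> z: (..., d)
        return self.W0(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(d=768, k=768, r=8)
z = layer(torch.randn(4, 768))
print(z.shape)  # torch.Size([4, 768])
```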

LoRA Linear Layer

Standard Linear Layer:  z = W_0 x,  where W_0 ∈ R^{d×k}, x ∈ R^k, z ∈ R^d.

LoRA Linear Layer:  z = W_0 x + BAx,  where W_0 ∈ R^{d×k}, A ∈ R^{r×k}, B ∈ R^{d×r}, and r ≪ min(d, k).

Figure inspired by He et al. (2022).
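Because ΔW = BA has the same shape d×k as W_0, the learned update can be folded back into the pretrained weight once fine-tuning is done, so inference needs no extra matrix multiplies. A minimal sketch, continuing the hypothetical LoRALinear class from the previous snippet:

```python
import torch

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> None:
    """Fold Delta W = B A into the pretrained weight, then zero B.

    After merging, forward() computes (W_0 + BA)x with a single matmul,
    because the explicit BAx term contributes nothing once B is zero.
    """
    layer.W0.weight += layer.B @ layer.A   # W_0 <- W_0 + BA  (both are d x k)
    layer.B.zero_()                        # keeps forward() numerically unchanged

merge_lora(layer)                          # `layer` from the previous LoRALinear sketch
```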