Association Rules & Frequent Itemsets
All you ever wanted to know about diapers,
beers and their correlation!
The Market-Basket Problem
- Given a database of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
Implication here means co-occurrence, not causality!
The Market-Basket Problem
Given a database of transactions
where each transaction is a collection of items (purchased by a customer in a visit)
find all rules that correlate the presence of one set of
items with that of another set of items
Example: 30% of all transactions that contain diapers also contain
beer; 5% of all transactions contain both items
- 30%: confidence of the rule
- 5%: support of the rule
We are interested in finding all rules,
rather than verifying that a particular rule holds
Applications of Market-Basket Analysis
- Supermarkets
- Placement
- Advertising
- Sales
- Coupons
- Many applications outside market basket data analysis
- Prediction (telecom switch failure)
- Web usage mining
- Many different types of association rules
Definition: Frequent Itemset
- Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset
- An itemset that contains k items
- Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
- Support
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
- Association Rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}
- Rule Evaluation Metrics
- Support (s)
- Fraction of transactions that contain both X and Y
- Confidence (c)
- Measures how often items in Y appear in transactions that contain X
Example: for {Milk, Diaper} → {Beer} over the transactions above,
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
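To make the two metrics concrete, here is a minimal Python sketch (the transaction list and helper names are illustrative, not part of the lecture) that reproduces the numbers above:

# Support and confidence of a rule X -> Y over the example market baskets
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    s = support_count(X | Y, transactions) / len(transactions)
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))  # (0.4, 0.666...)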
Aspects of Association Rule Mining
- How do we generate rules fast?
- Performance measured in
- Number of database scans
- Number of itemsets that must be counted
- Which are the interesting rules?
Association Rule Mining Task
- Given a set of transactions T, the goal of
association rule mining is to find all rules
having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
- Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
Finding Association Rules
Two-step approach:
1. Frequent Itemset Generation
- Generate all itemsets whose support ≥ minsup
2. Rule Generation
- Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still
computationally expensive
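Step 2 is comparatively cheap once the frequent itemsets and their support counts are known; the sketch below (illustrative Python, names not from the slides) shows how each rule follows from one binary partition of a frequent itemset:

from itertools import combinations

def generate_rules(frequent, minconf):
    """frequent: dict mapping frozenset(itemset) -> support count (output of step 1).
    Yields (antecedent, consequent, confidence) for every rule meeting minconf."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for X in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent, so frequent[X] exists
                conf = count / frequent[X]
                if conf >= minconf:
                    yield X, itemset - X, conf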
Frequent Itemset Generation
(Figure: the itemset lattice over the items A, B, C, D, E, from the null set at the top through all 1-, 2-, 3- and 4-itemsets down to ABCDE.)
Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
- Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database
- Match each transaction against every candidate
- Complexity ~ O(NMw) => Expensive since M = 2^d !!!
(Figure: the N transactions of the database, each of width at most w, are matched against the list of M candidate itemsets.)
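A direct Python translation of the brute-force approach (illustrative, usable only on toy data) makes the cost visible: it enumerates all 2^d candidates and performs one database scan per candidate.

from itertools import combinations

def brute_force_frequent(transactions, minsup):
    """transactions: list of sets; minsup: fraction. Returns {itemset: support count}."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(items, k)):      # M = 2^d candidates
            count = sum(1 for t in transactions if cand <= t)    # one full scan each
            if count / n >= minsup:
                frequent[cand] = count
    return frequent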
The Apriori Algorithm
- Join Step: Ck is generated by joining Lk-1 with itself
- Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_support;
end
return ∪k Lk;
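The pseudo-code translates almost line for line into Python; the following is a minimal, unoptimized sketch (function and variable names are illustrative):

from itertools import combinations

def apriori(transactions, minsup_count):
    """transactions: iterable of item collections; returns {frequent itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= minsup_count}
    frequent = dict(Lk)
    k = 2
    while Lk:
        prev = set(Lk)
        # Join step, with Apriori pruning: every (k-1)-subset must be frequent
        Ck = {a | b for a in prev for b in prev
              if len(a | b) == k
              and all(frozenset(s) in prev for s in combinations(a | b, k - 1))}
        # One scan of the database counts all candidates contained in each transaction
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= minsup_count}
        frequent.update(Lk)
        k += 1
    return frequent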
Apriori Algorithm from Agrawal et al. (1993)
Apriori Algorithm Example (s = 50%)
Database D (minsup = 50%, i.e. support count ≥ 2):
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3
Generate C2 from L1: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
Generate C3 from L2: {2 3 5}
Scan D → C3 count: {2 3 5}: 2
L3: {2 3 5}: 2
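Running the Python sketch from the previous slide on this database (assuming the apriori function defined there) reproduces the trace:

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, minsup_count=2)
# L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3
# L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
# L3: {2 3 5}: 2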
Algorithm to Guess Itemsets
- Naïve way:
- Extend all itemsets with all possible items
- More sophisticated:
- Join Lk-1 with itself, adding only a single, final item
e.g.: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} produces {1 2 3 4} and {1 3 4 5}
- Remove itemsets with an unsupported subset
e.g.: {1 3 4 5} has an unsupported subset, {1 4 5}, if minsup = 50%
- Use the database to further refine Ck
Apriori: How to Generate Candidates?
STEP 1: Self-join operation
STEP 2: Subset filtering
How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
- Method:
- Candidate itemsets are stored in a hash-tree
- Leaf node of hash-tree contains a list of itemsets and counts
- Interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction
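The sketch below illustrates the subset function with an ordinary Python dictionary standing in for the hash tree; it is a simplification of the idea, not the hash-tree data structure itself:

from itertools import combinations

def count_candidates(transactions, candidates, k):
    """candidates: collection of frozensets of size k. Returns {candidate: count}.
    Each transaction's k-subsets are generated and looked up by hashing; the real
    hash tree avoids enumerating subsets that cannot match any stored candidate."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        if len(t) < k:
            continue
        for subset in map(frozenset, combinations(sorted(t), k)):
            if subset in counts:
                counts[subset] += 1
    return counts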
Example of Generating Candidate Itemsets
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning based on the Apriori principle:
- acde is removed because ade is not in L3
- C4 = {abcd}
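A small Python sketch of the two candidate-generation steps, join then prune, using the example above as a check (itemsets are represented as sorted tuples; names are illustrative):

from itertools import combinations

def apriori_gen(L_prev, k):
    """L_prev: set of frequent (k-1)-itemsets as sorted tuples. Returns Ck."""
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # Join step: merge two (k-1)-itemsets that agree on their first k-2 items
            if a[:k-2] == b[:k-2] and a[k-2] < b[k-2]:
                cand = a + (b[k-2],)
                # Prune step: every (k-1)-subset of the candidate must be frequent
                if all(s in L_prev for s in combinations(cand, k-1)):
                    candidates.add(cand)
    return candidates

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3, 4))  # {('a','b','c','d')}; acde is pruned because ade is not in L3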
Run Time of Apriori
- k passes over data where k is the size of the
largest candidate itemset
- Memory chunking algorithm ⇒ 2 passes over data on disk, but multiple passes in memory
- Toivonen 1996 gives a statistical (sampling) technique which requires 1 + ε passes (but more memory)
- Brin 1997, Dynamic Itemset Counting ⇒ 1 + ε passes (less memory)
Methods to Improve Apriori’s Efficiency
- Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (see the sketch after this list)
- Sampling: mining on a subset of given data
- lower support threshold
- a method to determine the completeness
- Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent
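As one illustration, here is a hedged Python sketch of the partitioning idea; the helper name and parameters are assumptions, and any frequent-itemset miner (e.g. the apriori sketch above) can serve as the local mining step:

def partition_mine(transactions, minsup_fraction, n_partitions, mine):
    """Two-scan partitioning sketch: an itemset frequent in the whole database must be
    locally frequent in at least one partition, so the union of the local results is a
    superset of the true answer, which a final scan then verifies."""
    n = len(transactions)
    size = -(-n // n_partitions)                      # ceiling division
    parts = [transactions[i:i + size] for i in range(0, n, size)]
    # Scan 1: mine each (in-memory) partition with a proportional local threshold
    candidates = set()
    for p in parts:
        local_min = max(1, int(minsup_fraction * len(p)))
        candidates |= set(mine(p, local_min))         # mine returns {itemset: count}
    # Scan 2: count the surviving candidates over the whole database
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt >= minsup_fraction * n}

# e.g. partition_mine([frozenset(t) for t in D], 0.5, 2, apriori)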
Is Apriori Fast Enough? — Performance Bottlenecks
- The core of the Apriori algorithm:
- Use frequent (k–1)-itemsets to generate candidate frequent k-itemsets
- Use database scan and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
- Huge candidate sets:
- 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
- Multiple scans of database:
- Needs (n + 1) scans, where n is the length of the longest pattern