
Association rules for frequent item sets, Exams of Data Mining

Association rules for frequent itemsets in data mining

Typology: Exams

2017/2018

Uploaded on 10/20/2018

priti-narkar

Association Rules & Frequent Itemsets

All you ever wanted to know about diapers, beers and their correlation!

Data Mining: Association Rules 2

The Market-Basket Problem

  • Given a database of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer}, {Milk, Bread} → {Eggs,Coke}, {Beer, Bread} → {Milk}

Implication here means co-occurrence, not causality!


The Market-Basket Problem

Given a database of transactions, where each transaction is a collection of items (purchased by a customer in a visit), find all rules that correlate the presence of one set of items with that of another set of items.

Example: 30% of all transactions that contain diapers also contain beer; 5% of all transactions contain both items.
  • 30%: confidence of the rule
  • 5%: support of the rule

We are interested in finding all rules, rather than verifying that a particular rule holds.


Applications of Market-Basket Analysis

  • Supermarkets
    • Placement
    • Advertising
    • Sales
    • Coupons
  • Many applications outside market basket data analysis
    • Prediction (telecom switch failure)
    • Web usage mining
  • Many different types of association rules
    • Temporal
    • Spatial
    • Causal

Definition: Frequent Itemset

  • Itemset
    • A collection of one or more items
      • Example: {Milk, Bread, Diaper}
    • k-itemset
      • An itemset that contains k items
  • Support count (σ)
    • Frequency of occurrence of an itemset
    • E.g. σ({Milk, Bread,Diaper}) = 2
  • Support
    • Fraction of transactions that contain an itemset
    • E.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
    • An itemset whose support is greater than or equal to a minsup threshold
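
The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `support_count`, `support` and `is_frequent` are made up for this example, and the transactions are the five market-basket rows used throughout these slides.

```python
# The five market-basket transactions from the slides, one set per TID.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent when its support is at least minsup."""
    return support(itemset, transactions) >= minsup


example = {"Milk", "Bread", "Diaper"}
print(support_count(example, TRANSACTIONS))  # 2
print(support(example, TRANSACTIONS))        # 0.4
```

With minsup = 0.4 the example itemset counts as frequent; with any higher threshold it does not.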


Definition: Association Rule

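Example: {Milk, Diaper} → {Beer}

$$s = \frac{\sigma(\{\text{Milk}, \text{Diaper}, \text{Beer}\})}{|T|} = \frac{2}{5} = 0.4$$

$$c = \frac{\sigma(\{\text{Milk}, \text{Diaper}, \text{Beer}\})}{\sigma(\{\text{Milk}, \text{Diaper}\})} = \frac{2}{3} \approx 0.67$$
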
  • Association Rule
    • An implication expression of the form X → Y, where X and Y are itemsets
    • Example: {Milk, Diaper} → {Beer}
  • Rule Evaluation Metrics
    • Support (s)
      • Fraction of transactions that contain both X and Y
    • Confidence (c)
      • Measures how often items in Y appear in transactions that contain X
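
As a rough sketch, the two metrics can be computed directly from support counts. The helper names `sigma`, `rule_support` and `rule_confidence` are hypothetical, and the code assumes the left-hand side X occurs in at least one transaction (otherwise the confidence is undefined). Exact fractions are used so that 2/3 is not rounded.

```python
from fractions import Fraction

# The five market-basket transactions from the slides.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def sigma(itemset, transactions):
    """Support count: number of transactions containing the whole itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def rule_support(x, y, transactions):
    """s(X -> Y) = sigma(X u Y) / |T|."""
    return Fraction(sigma(set(x) | set(y), transactions), len(transactions))


def rule_confidence(x, y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X); assumes sigma(X) > 0."""
    return Fraction(sigma(set(x) | set(y), transactions), sigma(x, transactions))


s = rule_support({"Milk", "Diaper"}, {"Beer"}, TRANSACTIONS)
c = rule_confidence({"Milk", "Diaper"}, {"Beer"}, TRANSACTIONS)
print(s, c)  # 2/5 2/3
```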



Aspects of Association Rule Mining

  • How do we generate rules fast?
    • Performance measured in
      • Number of database scans
      • Number of itemsets that must be counted
  • Which are the interesting rules?


Association Rule Mining Task

  • Given a set of transactions T, the goal of association rule mining is to find all rules having
    • support ≥ minsup threshold
    • confidence ≥ minconf threshold
  • Brute-force approach:
    • List all possible association rules
    • Compute the support and confidence for each rule
    • Prune rules that fail the minsup and minconf thresholds

⇒ Computationally prohibitive!
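
To get a feel for why this is prohibitive, here is a minimal brute-force sketch over the five market-basket transactions. The names `all_rules` and `brute_force_rules` are invented for this illustration. Even with only 6 distinct items the enumeration visits 602 rules (3^6 − 2^7 + 1), and the count grows exponentially with the number of items.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def sigma(itemset, transactions):
    """Support count of an itemset."""
    return sum(1 for t in transactions if itemset <= t)


def all_rules(items):
    """Yield every rule X -> Y with X and Y non-empty and disjoint."""
    items = sorted(items)
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = frozenset(itemset) - frozenset(lhs)
                    yield frozenset(lhs), rhs


def brute_force_rules(transactions, minsup, minconf):
    """List all rules, compute s and c for each, prune by the thresholds."""
    items = set().union(*transactions)
    n = len(transactions)
    kept = []
    for x, y in all_rules(items):
        count_x = sigma(x, transactions)
        if count_x == 0:
            continue  # confidence is undefined when X never occurs
        count_xy = sigma(x | y, transactions)
        s, c = count_xy / n, count_xy / count_x
        if s >= minsup and c >= minconf:
            kept.append((x, y, s, c))
    return kept


total = sum(1 for _ in all_rules(set().union(*TRANSACTIONS)))
print(total)  # 602
```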


Mining Association Rules

Example of Rules:

{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)


Observations:

  • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  • Rules originating from the same itemset have identical support but can have different confidence
  • Thus, we may decouple the support and confidence requirements
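
The observations can be checked with a small sketch that enumerates every binary partition of one itemset. The function name `rules_from_itemset` is hypothetical, and the code assumes the itemset occurs in at least one transaction, which always holds for a frequent itemset.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def sigma(itemset, transactions):
    """Support count of an itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, transactions):
    """Return (X, Y, s, c) for every binary partition X -> Y of the itemset."""
    itemset = frozenset(itemset)
    count_all = sigma(itemset, transactions)
    s = count_all / len(transactions)  # identical for every partition
    rules = []
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            c = count_all / sigma(lhs, transactions)  # depends on X only
            rules.append((lhs, itemset - lhs, s, c))
    return rules


for x, y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, TRANSACTIONS):
    print(sorted(x), "->", sorted(y), f"s={s:.1f}", f"c={c:.2f}")
```

The support is computed once per itemset, while the confidence needs one extra support count per left-hand side. That is what makes it possible to handle the two thresholds in separate steps.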

Finding Association Rules

Two-step approach:

1. Frequent Itemset Generation

  • Generate all itemsets whose support ≥ minsup

2. Rule Generation

  • Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
  • Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation

  • Brute-force approach:
    • Each itemset in the lattice is a candidate frequent itemset
    • Count the support of each candidate by scanning the database
    • Match each transaction against every candidate
    • Complexity ~ O(NMw) ⇒ expensive, since M = 2^d

[Figure: the N market-basket transactions are matched against a list of M candidates; w is the maximum transaction width.]
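
A minimal sketch of this brute-force counting, under the assumption that the whole database fits in memory. The function name `brute_force_frequent_itemsets` and the `comparisons` counter are made up for illustration; the counter records how many transaction-versus-candidate matches are performed, which is N × M.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def brute_force_frequent_itemsets(transactions, minsup):
    """Treat every non-empty itemset in the lattice as a candidate."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    comparisons = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            count = 0
            for t in transactions:  # match each transaction against the candidate
                comparisons += 1
                if candidate <= t:
                    count += 1
            if count / n >= minsup:
                frequent[candidate] = count
    return frequent, comparisons


frequent, comparisons = brute_force_frequent_itemsets(TRANSACTIONS, minsup=0.6)
print(len(frequent), comparisons)  # 8 315
```

With 6 items there are 2^6 − 1 = 63 non-empty candidates, so 5 transactions already cost 315 subset tests, although only 8 itemsets turn out to be frequent at minsup = 60%.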


The Apriori Algorithm

  • Join Step: Ck is generated by joining Lk-1 with itself
  • Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
  • Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
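
The pseudo-code could be turned into runnable Python roughly as follows. This is a sketch, not the original authors' implementation: the names `generate_candidates` and `apriori` are invented here, the threshold is given as an absolute count (`min_count`) rather than a fraction, and the join simply unions any two frequent k-itemsets and keeps the result if it has k+1 items and passes the prune test, which yields the same candidates as the prefix-based join after pruning.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Build C_k from L_{k-1}: join, then prune by the Apriori principle."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            # Keep unions of size k whose (k-1)-subsets are all frequent.
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, min_count):
    """Return {frequent itemset: support count} for all sizes."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        candidates = generate_candidates(level, k + 1)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent


# Four transactions over items 1..5; min_count = 2 corresponds to s = 50%.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
print(result[frozenset({2, 3, 5})])  # 2
```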


Apriori Algorithm from Agrawal et al. (1993)


Apriori Algorithm Example (s = 50%)

Database D

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1

itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1

itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (generated from L1)

itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with counts

itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2

itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (generated from L2)

itemset
{2 3 5}

Scan D → L3

itemset  sup
{2 3 5}  2


Algorithm to Guess Itemsets

  • Naïve way:
    • Extend all itemsets with all possible items
  • More sophisticated:
    • Join Lk-1 with itself, adding only a single, final item
      • e.g.: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} produces {1 2 3 4} and {1 3 4 5}
    • Remove itemsets with an unsupported subset
      • e.g.: {1 3 4 5} has an unsupported subset: {1 4 5} if minsup = 50%
    • Use the database to further refine Ck

Apriori: How to Generate Candidates?

STEP 1: Self-join operation

STEP 2: Subset filtering

How to Count Supports of Candidates?

  • Why is counting the supports of candidates a problem?
    • The total number of candidates can be very large
    • One transaction may contain many candidates
  • Method:
    • Candidate itemsets are stored in a hash-tree
    • Leaf node of hash-tree contains a list of itemsets and counts
    • Interior node contains a hash table
    • Subset function: finds all the candidates contained in a transaction
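
A full hash-tree is more than a short example can show, so the sketch below swaps in a simpler technique that plays the role of the subset function: the candidates of size k are kept in a flat hash table (a Python dict), and each transaction's k-subsets are looked up in it. The name `count_supports` is hypothetical. This avoids comparing every transaction against every candidate, but unlike a hash-tree it enumerates all k-subsets of a transaction, which is only reasonable for short transactions.

```python
from itertools import combinations


def count_supports(transactions, candidates, k):
    """Count the support of each k-item candidate using hash lookups."""
    # Candidates are stored as sorted tuples so that lookups are canonical.
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in transactions:
        # Enumerate the k-subsets of the transaction and probe the table.
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [{1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5}, {3, 5}]
counts = count_supports(D, C2, k=2)
print(counts[(2, 5)])  # 3
```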


Example of Generating Candidate Itemsets

  • L3 = { abc, abd, acd, ace, bcd }
  • Self-joining: L3 * L3
    • abcd from abc and abd
    • acde from acd and ace
  • Pruning based on the Apriori principle:
    • acde is removed because ade is not in L3
  • C4 = { abcd }
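
The two steps can be sketched as separate functions. The names `self_join` and `prune` are chosen for this illustration only, and itemsets are represented as sorted tuples so that "sharing the first k-2 items" is a simple prefix comparison.

```python
from itertools import combinations


def self_join(prev_frequent):
    """STEP 1: join pairs of (k-1)-itemsets that agree on all but the last item."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    joined = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:  # same prefix, different final item
                joined.append(a + (b[-1],))
    return joined


def prune(candidates, prev_frequent):
    """STEP 2: drop candidates that have an infrequent (k-1)-subset."""
    prev = {tuple(sorted(s)) for s in prev_frequent}
    return [
        c for c in candidates
        if all(sub in prev for sub in combinations(c, len(c) - 1))
    ]


L3 = ["abc", "abd", "acd", "ace", "bcd"]
joined = self_join(L3)
C4 = prune(joined, L3)
print(["".join(c) for c in joined])  # ['abcd', 'acde']
print(["".join(c) for c in C4])      # ['abcd']
```

The same functions reproduce the numeric example: joining {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} yields {1 2 3 4} and {1 3 4 5}, and pruning removes {1 3 4 5} because {1 4 5} is missing.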


Run Time of Apriori

  • k passes over data, where k is the size of the largest candidate itemset
  • Memory chunking algorithm ⇒ 2 passes over data on disk but multiple in memory
  • Toivonen 1996 gives a statistical technique which requires 1 + ε passes (but more memory)
  • Brin 1997 - Dynamic Itemset Counting ⇒ 1 + ε passes (less memory)

Methods to Improve Apriori’s Efficiency

  • Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  • Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
  • Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  • Sampling: mining on a subset of given data
    • lower support threshold
    • a method to determine the completeness
  • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
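
As one illustration, the partitioning idea can be sketched in two passes. Everything here is an assumption made for the example: the names `local_frequent` and `partitioned_frequent_itemsets`, the `num_partitions` parameter, and in particular the local miner, which is a plain brute-force subset counter standing in for whatever in-memory algorithm a real implementation would use.

```python
from itertools import combinations
from math import ceil


def local_frequent(partition, minsup):
    """Brute-force miner for one small in-memory partition (stand-in)."""
    counts = {}
    for t in partition:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] = counts.get(subset, 0) + 1
    return {s for s, c in counts.items() if c / len(partition) >= minsup}


def partitioned_frequent_itemsets(transactions, minsup, num_partitions):
    n = len(transactions)
    size = ceil(n / num_partitions)
    partitions = [transactions[i:i + size] for i in range(0, n, size)]
    # Pass 1: an itemset frequent in the whole database must be frequent
    # in at least one partition, so the union of local results is a
    # complete set of global candidates.
    candidates = set()
    for p in partitions:
        candidates |= local_frequent(p, minsup)
    # Pass 2: count the candidates against the full database.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return {c: k for c, k in counts.items() if k / n >= minsup}


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = partitioned_frequent_itemsets(D, minsup=0.5, num_partitions=2)
print(sorted(result))
```

The first pass can produce many false candidates (locally frequent but globally infrequent); the second pass removes them, so the final result matches a direct run over the whole database.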

Is Apriori Fast Enough? — Performance Bottlenecks

  • The core of the Apriori algorithm:
    • Use frequent ( k – 1)-itemsets to generate candidate frequent k-itemsets
    • Use database scan and pattern matching to collect counts for the candidate itemsets
  • The bottleneck of Apriori: candidate generation
    • Huge candidate sets:
      • 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets
      • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
    • Multiple scans of database:
      • Needs ( n + 1 ) scans, where n is the length of the longest pattern
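
The two candidate counts quoted above can be checked with a short calculation. The first assumes every pair of frequent 1-itemsets becomes a candidate 2-itemset; the second counts all subsets of a 100-item pattern.

$$\binom{10^4}{2} = \frac{10^4 \,(10^4 - 1)}{2} = 49{,}995{,}000 \approx 5 \times 10^7$$

$$2^{100} = \left(2^{10}\right)^{10} = 1024^{10} \approx 1.27 \times 10^{30}$$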