data mining algorithms - stanford university · data mining cs102 data mining looking for patterns...
TRANSCRIPT
![Page 1: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/1.jpg)
CS102Data Mining
Data Mining Algorithms
CS102Spring 2020
![Page 2: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/2.jpg)
CS102Data Mining
Data Tools and Techniques
§ Basic Data Manipulation and AnalysisPerforming well-defined computations or asking well-defined questions (“queries”)
§ Data MiningLooking for patterns in data
§ Machine LearningUsing data to build models and make predictions
§ Data VisualizationGraphical depiction of data
§ Data Collection and Preparation
![Page 3: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/3.jpg)
CS102Data Mining
Data MiningLooking for patterns in data
Similar to unsupervised machine learning• Popularity predates popularity of machine learning• “Data mining” often associated with specific data
types and patterns
We will focus on “market-basket” data• Widely applicable (despite the name)
And two types of data mining patterns• Frequent item-sets• Association rules
![Page 4: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/4.jpg)
CS102Data Mining
Other Data and Patterns
Other types of data• Networks/graphs• Streams• Text (“text mining”)
Other patterns• Similar items• Structural patterns in large graphs/networks• Clusters, anomalies
Specific techniquesfor each one
![Page 5: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/5.jpg)
CS102Data Mining
(In)Famous Early Success Stories
Victoria’s Secret
Walmart
Beer & Diapers
![Page 6: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/6.jpg)
CS102Data Mining
Market-Basket Data
Originated with retail data• Each shopper buys “market basket” of groceries• Mine data for patterns in buying habits
General definition• Domain of items• Transaction - one or more items occurring
together• Dataset – set of transactions (usually large)
![Page 7: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/7.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
![Page 8: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/8.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries
![Page 9: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/9.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
![Page 10: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/10.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods
![Page 11: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/11.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
![Page 12: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/12.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses
![Page 13: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/13.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcript
![Page 14: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/14.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students
![Page 15: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/15.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
![Page 16: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/16.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
Movies
![Page 17: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/17.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
Movies Person
![Page 18: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/18.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
Movies PersonSymptoms
![Page 19: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/19.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
Movies PersonSymptoms Patient
![Page 20: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/20.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
Movies PersonSymptoms PatientMenu items
![Page 21: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/21.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
Movies PersonSymptoms PatientMenu items Restaurant customer
![Page 22: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/22.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
Movies PersonSymptoms PatientMenu items Restaurant customer
Words
![Page 23: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/23.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cartOnline goods Virtual shopping cart
University courses Student transcriptUniversity students Party
Movies PersonSymptoms PatientMenu items Restaurant customer
Words Document
![Page 24: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/24.jpg)
CS102Data Mining
Data Mining Algorithms
Frequent Item-Sets – sets of items that occur frequently together in transactions• Groceries bought together• Courses taken by same students• Students going to parties together• Movies watched by same people
Association Rules – When certain items occur together, another item frequently occurs with them• Shoppers who buy phone + charger also buy case• Students who take Databases also take Machine
Learning• Diners who order curry and rice also order bread
![Page 25: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/25.jpg)
CS102Data Mining
Frequent Item-SetsSets of items that occur frequently
together in transactions
ØHow large is a “set”?ØWhat does “frequently” mean?
![Page 26: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/26.jpg)
CS102Data Mining
Frequent Item-SetsSets of items that occur frequently
together in transactions
ØHow large is a “set”?Usually specify a minimum min-set-sizePossibly also a maximum max-set-size
ØWhat does “frequently” mean?Notion of support
![Page 27: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/27.jpg)
CS102Data Mining
Support
Support for a set of items S in a dataset of transactions is the fraction of the transactions containing S:
Specify support-threshold for frequent item-setsOnly return sets wheresupport > support-threshold
# of transactions containing S____________________________________
total # of transactions
![Page 28: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/28.jpg)
CS102Data Mining
Your TurnTransactions:
T1: milk, eggs, juiceT2: milk, juice, cookiesT3: eggs, chipsT4: milk, eggsT5: milk, juice, cookies, chips
What are the frequent item-sets if:• min-set-size = 2 (no max-set-size)• support-threshold = 0.3
Support:# of transactions containing S
____________________________________
total # of transactions
![Page 29: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/29.jpg)
CS102Data Mining
Computing Frequent Item-Sets“Apriori” algorithm
Efficiency relies on the following property:
Or the inverse:
If S is a frequent item-set satisfyingsupport-threshold t, then every subset of S is alsoa frequent item-set satisfying support-threshold t.
If S is not a frequent item-set satisfyingsupport-threshold t, then no superset of S can bea frequent item-set satisfying support-threshold t.
![Page 30: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/30.jpg)
CS102Data Mining
Association RulesWhen a set of items S occurs together,
another item i frequently occurs with themS à i
ØHow large is a “set”?ØWhat does “occurs together” mean?ØWhat does “frequently occurs with them” mean?
![Page 31: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/31.jpg)
CS102Data Mining
Association RulesWhen a set of items S occurs together,
another item i frequently occurs with themS à i
ØHow large is a “set”?Usually specify a minimum min-set-size for SPossibly also a maximum max-set-size for S
ØWhat does “occurs together” mean?ØWhat does “frequently occurs with them” mean?
![Page 32: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/32.jpg)
CS102Data Mining
Association RulesWhen a set of items S occurs together,
another item i frequently occurs with themS à i
ØHow large is a “set”?Usually specify a minimum min-set-size for SPossibly also a maximum max-set-size for S
ØWhat does “occurs together” mean?Notion of support
ØWhat does “frequently occurs with them” mean?Notion of confidence
![Page 33: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/33.jpg)
CS102Data Mining
Support and ConfidenceSupport for association rule S à i in a dataset of transactions is fraction of transactions containing S:
Confidence for association rule S à i in a dataset of transactions is the fraction of transactions containing S that also contain i:
# of transactions containing S____________________________________
total # of transactions
# of transactions containing S and i__________________________________________
# of transactions containing S
![Page 34: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/34.jpg)
CS102Data Mining
Support and Confidence
Specify support-threshold and confidence-threshold for association rules
Only return rules where:support > support-threshold andconfidence > confidence-threshold
![Page 35: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/35.jpg)
CS102Data Mining
Your TurnTransactions:
T1: milk, eggs, juiceT2: milk, juice, cookiesT3: eggs, chipsT4: milk, eggsT5: milk, juice, cookies, chips
What are the association rules S à i if:• min-set-size = 1 (no max-set-size)• support-threshold = 0.5• confidence-threshold = 0.5
# of transactions containing S____________________________________
total # of transactions
# of transactions containing S and i__________________________________________
# of transactions containing SConfidence:
Support:
Reminder: support andconfidence must be >
threshold, not ≥
![Page 36: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/36.jpg)
CS102Data Mining
Computing Association Rules
1. Use frequent item-sets to find left-hand sides S satisfying support threshold
2. Then extend to find right-hand sides S à isatisfying confidence threshold
NOT a property:
If S à i is an association rule satisfyingsupport-threshold t and confidence-threshold c,
and S’Í S, then S’ à i is an an associationrule satisfying support-threshold t and
confidence-threshold c.
Why Not?
![Page 37: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/37.jpg)
CS102Data Mining
Association Rules: LiftAssociation rule S à i might have high confidence because item i appears frequently, not because it’s associated with S.
Confidence for association rule S à i in a dataset of transactions is the fraction of transactions containing S that also contain i :
# of transactions containing S and i__________________________________________
# of transactions containing S
Lift
divided by the overall frequency of i :,
#trans containing S and i #trans containing i______________________________ ÷ _______________________
#trans containing S total #trans
![Page 38: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/38.jpg)
CS102Data Mining
Lift: ExamplesTransactions:
T1: milk, eggs, juiceT2: milk, juice, cookiesT3: eggs, chipsT4: milk, eggsT5: milk, juice, cookies, chips
juice à cookies Lift = (2/3) ÷ (2/5) = 10/6 = 1.67
eggs à milk Lift = (2/3) ÷ (4/5) = 10/12 = 0.83
Lift:#trans containing S and i #trans containing i
______________________________ ÷ _______________________
#trans containing S total #trans
Lift = 1: no associationLift > 1: associationLift < 1: anti-association
![Page 39: Data Mining Algorithms - Stanford University · Data Mining CS102 Data Mining Looking for patterns in data Similar to unsupervised machine learning •Popularity predates popularity](https://reader036.vdocuments.us/reader036/viewer/2022062506/5fbb87018afc8e40ad6447e9/html5/thumbnails/39.jpg)
CS102Data Mining
Data Mining Algorithms
CS102Spring 2020