Data Mining, Hadoop and the Single Node (September 2011)
Map-Reduce

[Diagram: map-reduce data flow from Input through Shuffle to Output.]
MapR's Streaming Performance

[Bar chart: read and write throughput in MB per sec (scale 0–2250), comparing raw hardware, MapR, and Hadoop; higher is better.
Tests: i. 16 streams x 120 GB, ii. 2000 streams x 1 GB, on 11 x 7200 rpm SATA and 11 x 15K rpm SAS disks.]
Terasort on MapR

[Bar chart: elapsed time in minutes for 1.0 TB (scale 0–60) and 3.5 TB (scale 0–300) sorts, MapR vs Hadoop; lower is better.
10+1 nodes: 8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm.]
Data Flow Expected Volumes

[Diagram: expected volumes between node and storage – 6 x 1 Gb/s = 600 MB/s at the node; 12 x 100 MB/s = 900 MB/s at storage.]
MUCH faster for some operations

[Chart: file create rate vs. number of files (millions), on the same 10 nodes.]
Universal export to self

[Diagram: a cluster node runs its own NFS server; a task on that node reaches the cluster's file system through it.]
[Diagram: several cluster nodes, each running its own NFS server and tasks – nodes are identical.]
Sharded text indexing

[Diagram: input documents → map (assign documents to shards) → reducer (index text to local disk, then copy the index to clustered index storage) → search engine (copy to local disk typically required before the index can be loaded).]
Conventional data flow

[Diagram: same indexing flow. Failure of a reducer causes garbage to accumulate on local disk; failure of a search engine requires another download of the index from clustered storage.]
Simplified NFS data flows

[Diagram: the reducer indexes directly into its task work directory via NFS; the search engine reads the mirrored index directly from clustered index storage. Failure of a reducer is cleaned up by the map-reduce framework.]
K-means, the movie

[Diagram: iterative loop – input points and current centroids feed "assign to nearest centroid"; the assignments are aggregated into new centroids, which feed the next pass.]
But …
Parallel Stochastic Gradient Descent

[Diagram: input split across workers that each train a sub-model; the sub-models are averaged into the final model.]
Variational Dirichlet Assignment

[Diagram: input feeds "gather sufficient statistics", which updates the model; the updated model feeds the next pass.]
Old tricks, new dogs

• Mapper
  – assign point to cluster
  – emit cluster id, (1, point)
• Combiner and reducer
  – sum counts, weighted sum of points
  – emit cluster id, (n, sum/n)
• Output to HDFS

(I/O: centroids read from HDFS to local disk by the distributed cache; points read from local disk via the distributed cache; output written by map-reduce.)
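The mapper and combiner logic above can be sketched as plain Java (Hadoop plumbing and Mahout types omitted; `KMeansStep`, `nearestCentroid`, and `newCentroids` are illustrative names, not any real API):

```java
// Sketch of the k-means map/combine step: assign each point to its
// nearest centroid, then emit per-cluster (n, sum/n) as the new centroid.
public class KMeansStep {
    // Mapper logic: find the centroid nearest to a point (squared Euclidean distance).
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // Combiner/reducer logic: sum counts and points per cluster, emit sum/n.
    static double[][] newCentroids(double[][] points, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearestCentroid(p, centroids);
            counts[c]++;
            for (int i = 0; i < dim; i++) sums[c][i] += p[i];
        }
        for (int c = 0; c < k; c++) {
            if (counts[c] > 0) {
                for (int i = 0; i < dim; i++) sums[c][i] /= counts[c];
            } else {
                sums[c] = centroids[c].clone();   // keep empty clusters in place
            }
        }
        return sums;
    }
}
```

In the real job, `newCentroids` would be split across combiners and reducers keyed by cluster id; the arithmetic is the same.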
Old tricks, new dogs

• Mapper
  – assign point to cluster
  – emit cluster id, (1, point)
• Combiner and reducer
  – sum counts, weighted sum of points
  – emit cluster id, (n, sum/n)
• Output to MapR FS (instead of HDFS)

(I/O: read via NFS; output written by map-reduce.)
Poor man's Pregel

• Mapper
• Lines in bold can use conventional I/O via NFS

    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary
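A toy, single-process simulation of that loop: each worker repeatedly reads all peers' models (as it would via NFS), accumulates them into its own, writes the result back, and synchronizes. Everything here is hypothetical scaffolding – real code would use files and a barrier, and "accumulate" would be whatever the algorithm needs (here, simple averaging):

```java
// Toy "poor man's Pregel": per-iteration, every worker replaces its model
// with the average of all workers' models, then all workers synchronize.
public class PoorMansPregel {
    static double[] run(double[] initialModels, int iterations) {
        double[] models = initialModels.clone();
        for (int iter = 0; iter < iterations; iter++) {    // while not done
            double[] next = new double[models.length];
            for (int w = 0; w < models.length; w++) {
                double sum = 0;
                for (double m : models) sum += m;          // read + accumulate input models
                next[w] = sum / models.length;             // worker w's new model
            }
            models = next;                                 // write model + synchronize
        }
        return models;                                     // emit summary
    }
}
```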
Mahout
• Scalable Data Mining for Everybody
What is Mahout?

• Recommendations (people who x this also x that)
• Clustering (segment data into groups)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
Classification in Detail

• Naive Bayes family
  – Hadoop-based training
• Decision forests
  – Hadoop-based training
• Logistic regression (aka SGD)
  – fast on-line (sequential) training
So What?

• On-line training has low overhead for small and moderate size data-sets

[Chart annotation: "big starts here"]
An Example
And Another

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
...

Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
Mahout's SGD

• Learns on-line, per example
  – O(1) memory
  – O(1) time per training example
• Sequential implementation
  – fast, but not parallel
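A minimal sketch of the on-line update behind those O(1) bounds, assuming plain logistic regression with a fixed learning rate (this is not Mahout's `OnlineLogisticRegression`, which adds annealing, regularization, and hashed encoding – only the weight vector is kept, and each example touches it once):

```java
// Plain on-line logistic regression: constant memory (one weight vector),
// constant time per training example.
public class Sgd {
    final double[] w;
    final double rate;

    Sgd(int features, double rate) {
        this.w = new double[features];
        this.rate = rate;
    }

    double predict(double[] x) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += w[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-s));        // logistic link
    }

    // One on-line step: gradient of the log-likelihood for a single example.
    void train(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < x.length; i++) w[i] += rate * error * x[i];
    }
}
```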
Special Features

• Hashed feature encoding
• Per-term annealing
  – learn the boring stuff once
• Auto-magical learning-knob turning
  – learns the correct learning rate, the correct learning rate for learning the learning rate, ...
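The hashed-encoding idea can be sketched as follows. This single-probe version is a simplification (Mahout's encoders can probe more than one location to soften collisions); the point is that terms map straight into a fixed-size vector by hashing, so no dictionary is kept and memory stays constant in the vocabulary size:

```java
// Feature hashing: a term's index is its hash modulo the vector size.
// Collisions are possible and simply add together (see "Feature Collisions").
public class HashedEncoder {
    final double[] vector;

    HashedEncoder(int numFeatures) {
        vector = new double[numFeatures];
    }

    int indexOf(String term) {
        // Math.floorMod keeps the index non-negative for negative hash codes.
        return Math.floorMod(term.hashCode(), vector.length);
    }

    void addTerm(String term, double weight) {
        vector[indexOf(term)] += weight;
    }
}
```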
Feature Encoding

Hashed Encoding

Feature Collisions
Learning Rate Annealing

[Chart: learning rate vs. number of training examples seen; the rate decays as training progresses.]
Per-term Annealing

[Chart: learning rate vs. number of training examples seen, with separate decay curves for a common feature and a rare feature.]
General Structure

• OnlineLogisticRegression
  – traditional logistic regression
  – stochastic gradient descent
  – per-term annealing
  – too fast (for the disk + encoder)
Next Level

• CrossFoldLearner
  – contains multiple primitive learners
  – on-line cross validation
  – 5x more work
And Again

• AdaptiveLogisticRegression
  – 20 x CrossFoldLearner
  – evolves good learning and regularization rates
  – 100 x more work than the basic learner
  – still faster than disk + encoding
A Comparison

• Traditional view
  – 400 x (read + OLR)
• Revised Mahout view
  – 1 x (read + mu x 100 x OLR) x eta
  – mu = efficiency from killing losers early
  – eta = efficiency from stopping early
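Plugging illustrative numbers into that comparison shows the shape of the trade-off; the cost units and the values of mu and eta below are made up, not measured:

```java
// Back-of-envelope cost model from the slide: 400 full read+train passes
// vs. one pass where 100 adaptive learners run, losers are killed early
// (mu) and training stops early (eta).
public class CostCompare {
    static double traditional(double read, double olr) {
        return 400 * (read + olr);
    }

    static double mahout(double read, double olr, double mu, double eta) {
        return 1 * (read + mu * 100 * olr) * eta;
    }
}
```

With hypothetical costs read = 10, OLR = 1, mu = 0.25, eta = 0.5, the traditional view costs 4400 units against 17.5 for the revised view; the exact ratio depends entirely on mu and eta.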
Click Modeling Architecture

[Diagram: input plus side-data pass through feature extraction and down-sampling, then a data join (map-reduce, now via NFS), feeding sequential SGD learning.]
Click Modeling Architecture

[Diagram: same pipeline, but with several sequential SGD learners running in parallel; map-reduce cooperates with NFS.]
Deployment

• Training
  – ModelSerializer.writeBinary(..., model)
• Deployment
  – m = ModelSerializer.readBinary(...)
  – r = m.classifyScalar(featureVector)
The Upshot

• One machine can go fast
  – SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
  – simple sample server farm