Data-mining, Hadoop and the Single Node

Upload: ted-dunning

Post on 09-Jun-2015


DESCRIPTION

Talk given on September 20 to the Bay Area data mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.

TRANSCRIPT

Page 1: Data mining-2011-09

Data-mining, Hadoop and the Single Node

Page 2: Data mining-2011-09

Map-Reduce

Input Output

Shuffle

Page 3: Data mining-2011-09

MapR's Streaming Performance

[Bar chart: read and write throughput in MB per sec, 0–2250 scale, for raw hardware, MapR, and stock Hadoop]

Tests: i. 16 streams x 120 GB; ii. 2000 streams x 1 GB

Hardware: 11 x 7200 rpm SATA; 11 x 15K rpm SAS

Higher is better

Page 4: Data mining-2011-09

Terasort on MapR

[Bar chart: elapsed time in minutes for MapR vs. Hadoop, at 1.0 TB (0–60 min scale) and 3.5 TB (0–300 min scale)]

10+1 nodes: 8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm

Lower is better

Page 5: Data mining-2011-09

Data Flow Expected Volumes

[Diagram: node connected to storage]

Node: 6 x 1 Gb/s = 600 MB/s

Storage: 12 x 100 MB/s = 900 MB/s

Page 6: Data mining-2011-09

MUCH faster for some operations

[Chart: file create rate vs. # of files (millions), on the same 10 nodes]

Page 7: Data mining-2011-09

[Diagram: a cluster node runs an NFS server; a task on that node reaches the cluster through it]

Universal export to self

Page 8: Data mining-2011-09

[Diagram: three cluster nodes, each running its own NFS server and task]

Nodes are identical

Page 9: Data mining-2011-09

Sharded text Indexing

[Diagram: input documents → map-reduce reducers → local disk → clustered index storage → search engine local disk]

Assign documents to shards.

Index text to local disk and then copy the index to the distributed file store.

Copy to local disk is typically required before the index can be loaded.

Page 10: Data mining-2011-09

Conventional data flow

[Same diagram: input documents → map-reduce reducers → local disk → clustered index storage → search engine local disk]

Failure of a reducer causes garbage to accumulate on the local disk.

Failure of a search engine requires another download of the index from clustered storage.

Page 11: Data mining-2011-09

Simplified NFS data flows

[Diagram: input documents → map-reduce reducers index to the task work directory via NFS → clustered index storage → search engine]

Failure of a reducer is cleaned up by the map-reduce framework.

The search engine reads the mirrored index directly.

Page 12: Data mining-2011-09

K-means, the movie

[Diagram: iterative loop — assign each input point to the nearest centroid, aggregate new centroids, repeat]

Page 13: Data mining-2011-09

But …

Page 14: Data mining-2011-09

Parallel Stochastic Gradient Descent

[Diagram: iterative loop — train a sub-model on each partition of the input, average the models, repeat]
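The averaging step in that loop can be sketched concretely. This is a minimal single-process illustration of the pattern, not Mahout's implementation: each partition trains its own sub-model by sequential SGD, and the sub-models are averaged into one.

```python
def train_sub_model(data, epochs=50, lr=0.5):
    # Sequential SGD for a one-parameter linear model y = w * x
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x   # gradient of squared error
            w -= lr * grad
    return w

def parallel_sgd(partitions):
    # Train an independent sub-model per input partition, then average
    models = [train_sub_model(p) for p in partitions]
    return sum(models) / len(models)

# Toy data drawn from y = 3x, split across two "nodes"
data = [(x, 3.0 * x) for x in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)]
w = parallel_sgd([data[:3], data[3:]])
```

With both sub-models converging near w = 3, the average lands there too; in the real algorithm the average becomes the starting model for the next pass.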

Page 15: Data mining-2011-09

Variational Dirichlet Assignment

[Diagram: iterative loop — gather sufficient statistics from the input, update the model, repeat]

Page 16: Data mining-2011-09

Old tricks, new dogs

• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)

• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)

• Output to HDFS

(Centroids are read from HDFS to local disk by the distributed cache and read from local disk by each task; new centroids are written by map-reduce.)
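The mapper/combiner arithmetic above can be sketched in a few lines. This is an illustrative one-dimensional toy, not the Mahout code; the point is that emitting (n, sum/n) pairs lets partial means be recombined exactly via the counts.

```python
def map_point(centroids, point):
    # Mapper: assign a point to its nearest centroid, emit (cluster id, (1, point))
    cid = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return cid, (1, point)

def combine(pairs):
    # Combiner/reducer: sum counts and weighted sums, emit (n, sum/n) per cluster
    acc = {}
    for cid, (n, s) in pairs:
        cn, cs = acc.get(cid, (0, 0.0))
        acc[cid] = (cn + n, cs + n * s)   # n * s restores the sum from a partial mean
    return {cid: (n, s / n) for cid, (n, s) in acc.items()}

centroids = [0.0, 10.0]
points = [1.0, 2.0, 9.0, 11.0]
result = combine(map_point(centroids, p) for p in points)
# cluster 0 gets {1, 2} -> mean 1.5; cluster 1 gets {9, 11} -> mean 10.0
```

Because the combiner's output has the same (n, mean) shape as its input, it can be applied any number of times before the final reduce without changing the answer.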

Page 17: Data mining-2011-09

Old tricks, new dogs

• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)

• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)

• Output to MapR FS (replacing HDFS)

(Centroids are read directly via NFS; new centroids are written by map-reduce.)

Page 18: Data mining-2011-09

Poor man’s Pregel

• Mapper

• Lines in bold can use conventional I/O via NFS

while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
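The loop above can be instantiated for k-means to show the shape concretely. This is a single-process sketch under assumed details (a fixed iteration count stands in for the "done" test, and the synchronize step is a no-op here):

```python
def pregel_style_kmeans(points, centroids, iterations=5):
    # The while-loop from the slide, instantiated for 1-D k-means
    model = list(centroids)
    for _ in range(iterations):                 # while not done
        acc = {i: (0, 0.0) for i in range(len(model))}
        for x in points:                        # for each input: accumulate model
            i = min(range(len(model)), key=lambda j: abs(x - model[j]))
            n, s = acc[i]
            acc[i] = (n + 1, s + x)
        model = [s / n if n else model[i]       # write model
                 for i, (n, s) in sorted(acc.items())]
        # synchronize / reset input format would happen here across nodes
    return model                                # emit summary

result = pregel_style_kmeans([1.0, 2.0, 9.0, 11.0], [0.0, 10.0])
```

In the distributed version, each mapper runs this loop over its own input split and the synchronize step exchanges models between nodes via ordinary file I/O.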

Page 19: Data mining-2011-09
Page 20: Data mining-2011-09

Mahout

• Scalable Data Mining for Everybody

Page 21: Data mining-2011-09

What is Mahout?

• Recommendations (people who x this also x that)

• Clustering (segment data into groups)

• Classification (learn decision making from examples)

• Stuff (LDA, SVD, frequent item-set, math)


Page 23: Data mining-2011-09

Classification in Detail

• Naive Bayes Family
  – Hadoop based training

• Decision Forests
  – Hadoop based training

• Logistic Regression (aka SGD)
  – fast on-line (sequential) training


Page 25: Data mining-2011-09

So What?

• Online training has low overhead for small and moderate size data-sets

[Chart annotation: "big" starts here]

Page 26: Data mining-2011-09

An Example

Page 27: Data mining-2011-09

And Another

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor. ...

Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?

Page 28: Data mining-2011-09

Mahout’s SGD

• Learns on-line, per example
  – O(1) memory
  – O(1) time per training example

• Sequential implementation
  – fast, but not parallel
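A single on-line step is easy to see in miniature. This is a generic logistic-regression SGD update, not Mahout's code: the only state is the weight vector, and each example costs one gradient step.

```python
import math

def sgd_update(w, x, y, lr=0.1):
    # One online step: O(1) memory (only w), O(1) time per example
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    return [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]

def predict(w, x):
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

# w[0] acts as a bias via the constant 1.0 in each feature vector
w = [0.0, 0.0]
for x, y in [([1.0, 2.0], 1), ([1.0, -2.0], 0)] * 200:
    w = sgd_update(w, x, y)
```

Nothing here requires a second pass over the data, which is why a single machine can chew through very large training sets as fast as it can read them.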

Page 29: Data mining-2011-09

Special Features

• Hashed feature encoding

• Per-term annealing
  – learn the boring stuff once

• Auto-magical learning knob turning
  – learns correct learning rate, learns correct learning rate for learning learning rate, ...
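Hashed feature encoding can be sketched in a few lines. This is the general hashing trick, not Mahout's encoder: each feature name is hashed into a slot of a fixed-size vector, so the vocabulary never needs to be known in advance; occasional collisions are the price.

```python
import hashlib

def _slot(name, dim):
    # Stable hash (Python's built-in str hash is salted per process)
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % dim

def hashed_encode(features, dim=16):
    # Fold arbitrarily many named features into a fixed-size vector
    v = [0.0] * dim
    for name, value in features.items():
        v[_slot(name, dim)] += value
    return v

vec = hashed_encode({"word:hadoop": 1.0, "word:mahout": 2.0})
```

Real encoders typically use a much larger dim and may probe more than one slot per feature to soften collisions further.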

Page 30: Data mining-2011-09

Feature Encoding

Page 31: Data mining-2011-09

Hashed Encoding

Page 32: Data mining-2011-09

Feature Collisions

Page 33: Data mining-2011-09

Learning Rate Annealing

[Chart: learning rate (y-axis) decaying as # training examples seen (x-axis) grows]
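One common annealing schedule produces exactly this curve. The formula below is a standard choice and an assumption here, not necessarily Mahout's exact schedule:

```python
def annealed_rate(n, mu0=1.0, decay=0.001):
    # Learning rate shrinks as more training examples (n) are seen
    return mu0 / (1.0 + decay * n)
```

Early examples move the model a lot; after many examples the rate is small and the model settles down.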

Page 34: Data mining-2011-09

Per-term Annealing

[Chart: learning rate vs. # training examples seen — a common feature's rate decays quickly, a rare feature's rate stays high longer]

Page 35: Data mining-2011-09

General Structure

• OnlineLogisticRegression
  – Traditional logistic regression
  – Stochastic Gradient Descent
  – Per term annealing
  – Too fast (for the disk + encoder)

Page 36: Data mining-2011-09

Next Level

• CrossFoldLearner
  – contains multiple primitive learners
  – online cross validation
  – 5x more work

Page 37: Data mining-2011-09

And again

• AdaptiveLogisticRegression
  – 20 x CrossFoldLearner
  – evolves good learning and regularization rates
  – 100 x more work than basic learner
  – still faster than disk + encoding

Page 38: Data mining-2011-09

A comparison

• Traditional view
  – 400 x (read + OLR)

• Revised Mahout view
  – 1 x (read + mu x 100 x OLR) x eta
  – mu = efficiency from killing losers early
  – eta = efficiency from stopping early
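To see why the revised view wins, it helps to plug in numbers. The relative costs and the values of mu and eta below are illustrative assumptions; the slide defines the symbols but gives no figures.

```python
# Assumed relative costs: reading dominates a single OLR update
read, olr = 1.0, 0.01
# Assumed efficiency factors: kill 80% of losers early, stop halfway
mu, eta = 0.2, 0.5

traditional = 400 * (read + olr)                 # 400 full passes
revised = 1 * (read + mu * 100 * olr) * eta      # one pass, 100 learners
speedup = traditional / revised
```

Under these assumptions the single pass with 100 adaptive learners costs 0.6 units against 404 for the traditional sweep, because the data is read once and most candidate learners are culled before they do much work.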

Page 39: Data mining-2011-09

Click modeling architecture

[Diagram: Input → feature extraction and down-sampling → data join with side-data → sequential SGD learning. The early stages run as map-reduce; the hand-off to the learner now happens via NFS.]

Page 40: Data mining-2011-09

Click modeling architecture

[Diagram: Input → feature extraction and down-sampling (map-reduce) → data join with side-data (map-reduce) → several sequential SGD learners running in parallel. Map-reduce cooperates with NFS.]

Page 41: Data mining-2011-09

Deployment

• Training
  – ModelSerializer.writeBinary(..., model)

• Deployment
  – m = ModelSerializer.readBinary(...)
  – r = m.classifyScalar(featureVector)
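The same write-once, read-anywhere pattern can be sketched outside Mahout. This is a generic serialization round-trip with a stand-in model class, not Mahout's ModelSerializer API:

```python
import io
import pickle

class TinyModel:
    # Stand-in for a trained classifier; only the pattern matters here
    def __init__(self, w):
        self.w = w
    def classify_scalar(self, features):
        return sum(wi * xi for wi, xi in zip(self.w, features))

# Training side: serialize the model once
buf = io.BytesIO()
pickle.dump(TinyModel([0.5, -0.25]), buf)

# Deployment side: read it back and score feature vectors
buf.seek(0)
m = pickle.load(buf)
r = m.classify_scalar([2.0, 4.0])
```

Because the deployed model is a small, self-contained object, scaling out is just a matter of copying it to as many servers as needed.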

Page 42: Data mining-2011-09

The Upshot

• One machine can go fast
  – SITM trains on 2 billion examples in 3 hours

• Deployability pays off big
  – simple sample server farm