decision tree induction in hierarchic distributed systems

14
Decision Tree Induction in Hierarchic Distributed Systems With: Amir Bar-Or, Ran Wolff, Daniel Keren

Upload: seven

Post on 13-Feb-2016

30 views

Category:

Documents


3 download

DESCRIPTION

With: Amir Bar-Or, Ran Wolff, Daniel Keren. Decision Tree Induction in Hierarchic Distributed Systems. Motivation. Large distributed computation is costly Especially data intensive and synchronization intensive ones; e.g., data mining Decision tree induction: - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Decision Tree Induction in Hierarchic Distributed Systems

Decision Tree Induction in Hierarchic Distributed

Systems

With:Amir Bar-Or, Ran Wolff, Daniel Keren

Page 2: Decision Tree Induction in Hierarchic Distributed Systems

MotivationLarge distributed computation is

costly

• Especially data intensive and synchronization intensive ones; e.g., data mining

• Decision tree induction:– Collect global statistics (thousands) for every

attribute (thousands) in every tree node (hundreds)

– Global statistics – global synchronization

Page 3: Decision Tree Induction in Hierarchic Distributed Systems

MotivationHierarchy Helps

• Simplifies synchronization– Synchronize on each level

• Simplifies communication• An “industrial strength” architecture– The way real systems (including grids)

are often organized

Page 4: Decision Tree Induction in Hierarchic Distributed Systems

Motivation Mining highly dimensional data

• Thousands of sources • Central control• Examples:– Genomically enriched healthcare data– Text repositories

Page 5: Decision Tree Induction in Hierarchic Distributed Systems

Objectives of the Algorithm• Exact results– Common approaches would either• Collect a sample of the data• Build independent models at each site and

then use centralized meta-learning atop of them

• Communication efficiency– Naive approach: collect exact statistics

for each tree node would result in GBytes of communication

Page 6: Decision Tree Induction in Hierarchic Distributed Systems

Decision tree in a Teaspoon• A tree were at each level the learning

samples are splitted according to one attribute’s value

• Hill-climbing heuristic is used to induce the tree– The attribute that maximized a gain

function is taken– Gain functions: Gini or Information Gain

• No real need to compute the gain

Page 7: Decision Tree Induction in Hierarchic Distributed Systems

Main Idea• Infer deterministic bounds on the

gain of each attribute• Improve bounds until best attribute is

provenly better than the rest• Communication efficiency is

achieved because bounds require just limited data– Partial statistics for promising attributes– Rough bound on irrelevant attributes

Page 8: Decision Tree Induction in Hierarchic Distributed Systems

Hierarchical Algorithm• At each level of the hierarchy–Wait for reports from all descendants• Contain upper and lower bounds on the gain

of each attribute, number of samples from each class

– Use descendant's report to compute cumulative bounds

– If no clear separation, request descendants to tighten bounds by sending more data

– At worst, all data is gathered

Page 9: Decision Tree Induction in Hierarchic Distributed Systems

Deterministic Bounds• Upper

bound

• Lower bound

iii nPGnPG /

2

111212

1

/1/1 n,nminn+nn+PGPG

Page 10: Decision Tree Induction in Hierarchic Distributed Systems

Performance Figures• 99% reduction in

communication bandwidth

• Out of 1000 SNP, only ~12 were reported to higher levels of the hierarchy

• Percent declines with hierarchy level

Page 11: Decision Tree Induction in Hierarchic Distributed Systems

Performance Figures• 99% reduction in

communication bandwidth

• Out of 1000 SNP, only ~12 were reported to higher levels of the hierarchy

• Percent declines with hierarchy level

Page 12: Decision Tree Induction in Hierarchic Distributed Systems

More Performance Figures• Larger datasets

require lower bandwidth

• Outlier noise is not a big issue– White noise even

better

Page 13: Decision Tree Induction in Hierarchic Distributed Systems

More Performance Figures• Larger datasets

require lower bandwidth

• Outlier noise is not a big issue– White noise even

better

Page 14: Decision Tree Induction in Hierarchic Distributed Systems

Future Work• Text mining• Incremental algorithm• Accommodation of failure• Testing on a real grid system• Is this a general framework?