Apache Hama @ Samsung SW Academy

Post on 11-Nov-2014

Apache Hama: Bulk Synchronous Parallel Computing

Edward J. Yoon <edwardyoon@apache.org>

Who Am I

• Edward J. Yoon
  – @eddieyoon

• Founder of Apache Hama

• PMC member of Apache BigTop

• Oracle Employee

What’s Hama?

• Open Source – Under Apache 2.0 License

• Written in Java

• Apache Top Level Project

Characteristics

• A general BSP computing engine
  – M/R-like Input/Output Formatter
    • SequenceFile, Text, Accumulo, HBase, …, etc.
  – Job Manager
  – Checkpoint Recovery

• Streaming and Pipes
  – Python, C++, …, etc.

• Graph and Machine Learning Packages
  – K-means, Gradient Descent, Collaborative Filtering

Bulk Synchronous Parallel?

• Originally introduced by Valiant

• A sequence of supersteps
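
A superstep is usually described as three phases: local computation, message exchange, and a barrier synchronization after which the messages become visible. The following is a minimal plain-Java sketch of that cycle, not Hama code. The class `SuperstepSketch`, the method `allSum`, and the ring topology are assumptions made for illustration: each thread plays one peer, passes a token to its right-hand neighbor every superstep, and ends up holding the sum of all peers' values.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical illustration of BSP supersteps; not part of Apache Hama.
class SuperstepSketch {

    // Every peer starts with one value and finishes with the sum of all values.
    static long[] allSum(long[] values) {
        final int n = values.length;
        final long[] result = new long[n];
        if (n == 0) {
            return result;
        }
        // Two inbox buffers, alternated per superstep, so that a fast peer
        // never overwrites a message a slow peer has not read yet.
        final long[][] inbox = new long[2][n];
        final CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] peers = new Thread[n];

        for (int p = 0; p < n; p++) {
            final int id = p;
            peers[p] = new Thread(() -> {
                long acc = values[id];
                long token = values[id];
                try {
                    for (int step = 0; step < n - 1; step++) {
                        // Communication: send the token to the next peer in the ring.
                        inbox[step % 2][(id + 1) % n] = token;
                        // Barrier synchronization: the superstep ends here for all peers.
                        barrier.await();
                        // Local computation on the message delivered by the barrier.
                        token = inbox[step % 2][id];
                        acc += token;
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new RuntimeException(e);
                }
                result[id] = acc;
            });
            peers[p].start();
        }
        for (Thread t : peers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return result;
    }
}
```

Calling `SuperstepSketch.allSum(new long[] {1, 2, 3})` runs two supersteps on three peers. The barrier is the only synchronization point, which is what makes the model easy to reason about.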

Compared to M/R and MPI

• Supports a message-passing style of application development

• Provides a small, flexible, and easy-to-use API

• Can perform better than MPI for communication-intensive applications

• Guarantees that deadlocks and collisions cannot occur in the communication mechanisms

So, fit for what?

• Processing Big Data with complicated relationships
  – e.g., graph or network.

• Iterative or Recursive scientific applications

• Continuous Event Processing

Which is the Big Data?

Could be applied to

• Analyze user actions and patterns

• Social Target Marketing

• Observe evolution of social networks

• Detect anomalies rapidly in real time

• Business Intelligence

Internals

• Pluggable RPC Architecture for message transfer
  – e.g., Hadoop RPC, Avro RPC, …, etc.

• Message Collector, Bundler, and Compressor to reduce network overhead and contention
  – e.g., Snappy, Bzip2, …, etc.
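
To show why bundling and compressing messages can reduce network overhead, here is a small plain-Java sketch. It is not Hama's bundler: the class `MessageBundleSketch` and its methods are hypothetical, and it uses the JDK's built-in Deflate codec in place of Snappy or Bzip2 so that it needs no external library. Many small messages for one destination are packed into a single compressed payload and unpacked on the receiving side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical illustration of message bundling; not part of Apache Hama.
class MessageBundleSketch {

    // Packs all messages for one destination into a single compressed payload.
    static byte[] bundle(List<String> messages) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out =
                 new DataOutputStream(new DeflaterOutputStream(bytes))) {
            out.writeInt(messages.size());      // message count first
            for (String m : messages) {
                out.writeUTF(m);                // then each message
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restores the original message list from a payload produced by bundle().
    static List<String> unbundle(byte[] payload) {
        try (DataInputStream in = new DataInputStream(
                 new InflaterInputStream(new ByteArrayInputStream(payload)))) {
            int count = in.readInt();
            List<String> messages = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                messages.add(in.readUTF());
            }
            return messages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One payload per destination means one transfer instead of many. Repetitive messages, which are common in iterative graph jobs, also compress well.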

BSP API

public abstract void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException, SyncException;

BSP Examples

• Pi Calculation

• Sparse Matrix-Vector Multiplication

• K-means Clustering

• Gradient Descent
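
As a rough illustration of the Pi Calculation example, the sketch below mimics the logic a `bsp()` method could follow: every peer computes a local Monte Carlo estimate, sends it to a master peer, and after the barrier the master averages what it received. It is a single-threaded simulation in plain Java, not the Hama example itself. The class `PiBspSketch`, the method `estimatePi`, and the seeding scheme are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical single-process simulation of a BSP pi estimator; not Hama code.
class PiBspSketch {

    // numPeers must be at least 1; the master's inbox collects one estimate per peer.
    static double estimatePi(int numPeers, int samplesPerPeer, long seed) {
        List<Double> masterInbox = new ArrayList<>();

        // Superstep 0: each peer samples points in the unit square and
        // "sends" its local estimate to the master.
        for (int peer = 0; peer < numPeers; peer++) {
            Random rnd = new Random(seed + peer);
            int inside = 0;
            for (int i = 0; i < samplesPerPeer; i++) {
                double x = rnd.nextDouble();
                double y = rnd.nextDouble();
                if (x * x + y * y <= 1.0) {
                    inside++;               // point falls inside the quarter circle
                }
            }
            masterInbox.add(4.0 * inside / samplesPerPeer);
        }

        // Barrier: finishing the loop above stands in for the synchronization
        // that makes all sent messages visible to the master.

        // Superstep 1: the master averages the received estimates.
        double sum = 0.0;
        for (double estimate : masterInbox) {
            sum += estimate;
        }
        return sum / masterInbox.size();
    }
}
```

For example, `PiBspSketch.estimatePi(4, 100000, 42L)` simulates four peers. Accuracy improves as the number of samples grows.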

Graph API

public void compute(Iterator<M> messages) throws IOException;

Graph Examples

• In-link Count

• Single Source Shortest Path

• PageRank

• Bipartite Matching

• Semi-Clustering
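
To make the vertex-centric style behind these examples more concrete, here is a plain-Java sketch of Single Source Shortest Path written as message-driven supersteps. It is a simulation, not Hama's graph package. The class `SsspSketch`, the edge encoding (`edges[v]` holds `{target, weight}` pairs), and the assumption of non-negative weights are all illustrative choices. In each superstep a vertex takes the minimum of its incoming distance messages and, if that improves its own distance, sends updated distances to its neighbors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical vertex-centric SSSP simulation; not part of Apache Hama.
class SsspSketch {

    // edges[v] is an array of {target, weight} pairs; weights are assumed non-negative.
    // Unreachable vertices keep the distance Long.MAX_VALUE.
    static long[] shortestPaths(int[][][] edges, int source) {
        int n = edges.length;
        long[] dist = new long[n];
        Arrays.fill(dist, Long.MAX_VALUE);

        List<List<Long>> inbox = emptyInbox(n);
        inbox.get(source).add(0L);          // seed message for the source vertex

        boolean messagesSent = true;
        while (messagesSent) {              // one iteration = one superstep
            List<List<Long>> nextInbox = emptyInbox(n);
            messagesSent = false;
            for (int v = 0; v < n; v++) {
                // The per-vertex "compute" step: take the best incoming distance.
                long best = Long.MAX_VALUE;
                for (long m : inbox.get(v)) {
                    best = Math.min(best, m);
                }
                if (best < dist[v]) {
                    dist[v] = best;
                    for (int[] edge : edges[v]) {
                        nextInbox.get(edge[0]).add(best + edge[1]);
                        messagesSent = true;
                    }
                }
                // Otherwise the vertex stays silent (it "votes to halt").
            }
            inbox = nextInbox;              // barrier: messages become visible
        }
        return dist;
    }

    private static List<List<Long>> emptyInbox(int n) {
        List<List<Long>> inbox = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            inbox.add(new ArrayList<>());
        }
        return inbox;
    }
}
```

The computation stops on its own once a superstep produces no messages, which is the usual termination condition in this model.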

Find Maximum Value
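
Finding the maximum value is a common introductory example for vertex-centric graph computing. Here is a sketch of one way it could work, simulated in plain Java rather than written against the Hama Graph API. The class `MaxValueSketch` and the adjacency-array encoding are assumptions for illustration. Every vertex first sends its value to its neighbors. In later supersteps a vertex adopts any larger value it receives and forwards it, and otherwise stays silent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical vertex-centric "find maximum value" simulation; not Hama code.
class MaxValueSketch {

    // neighbors[v] lists the vertices that v sends messages to.
    static int[] propagateMax(int[] values, int[][] neighbors) {
        int n = values.length;
        int[] current = values.clone();
        List<List<Integer>> inbox = emptyInbox(n);

        boolean firstSuperstep = true;
        boolean messagesSent = true;
        while (messagesSent) {                      // one iteration = one superstep
            List<List<Integer>> nextInbox = emptyInbox(n);
            messagesSent = false;
            for (int v = 0; v < n; v++) {
                // In the first superstep every vertex announces its value.
                boolean changed = firstSuperstep;
                for (int m : inbox.get(v)) {
                    if (m > current[v]) {
                        current[v] = m;             // adopt the larger value
                        changed = true;
                    }
                }
                if (changed) {
                    for (int u : neighbors[v]) {
                        nextInbox.get(u).add(current[v]);
                        messagesSent = true;
                    }
                }
            }
            inbox = nextInbox;                      // barrier: deliver messages
            firstSuperstep = false;
        }
        return current;
    }

    private static List<List<Integer>> emptyInbox(int n) {
        List<List<Integer>> inbox = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            inbox.add(new ArrayList<>());
        }
        return inbox;
    }
}
```

On a connected undirected graph, every vertex ends up holding the global maximum, and the loop ends when a superstep sends no messages.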

SSSP Performance

• SSSP on a random graph of 1 billion edges is computed in 400 seconds on one Oracle BDA
