Simplifying Application Development on Hadoop WidasConcepts Unternehmensberatung GmbH Maybachstraße 2 71299 Wimsheim http Big Data Engineer, WidasConcepts Vinoth Kannan Cascading User Group Meet Berlin, Germany 26.05.2014

Upload: vinoth-kumar-kannan

Post on 19-Aug-2014

DESCRIPTION

Leading Data with Innovation. My talk on using Cascading for data-driven applications, presented during the meetup in Berlin.

TRANSCRIPT

Page 1: Cascading User Group Meet

Simplifying Application Development on Hadoop

WidasConcepts Unternehmensberatung GmbH Maybachstraße 2 71299 Wimsheim http://www.widas.de

Vinoth Kannan, Big Data Engineer, WidasConcepts

Cascading User Group Meet

Berlin, Germany, 26.05.2014

Page 2: Cascading User Group Meet

What is Hadoop?

“Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.” (Wikipedia)

Principles of Hadoop:

• Designed for: Batch Processing

• Possible to: Horizontal Scaling

• Works on: Bringing Computation to Data

Page 3: Cascading User Group Meet

Main Features

Reliable and Redundant

• No performance or data loss even on failure

Powerful

• Possible to have huge clusters (largest 40,000 nodes)

• Supports “Best of Breed Analytics”

Scalable

• Linearly scalable with increase in data volume

Cost Efficient

• No need for expensive hardware. Supports commodity hardware

Simple and flexible APIs

• Great ecosystem with a multitude of supporting solutions

Page 4: Cascading User Group Meet

Traditional vs. Hadoop

Traditional: More and larger servers are necessary to accomplish tasks:

• computing capacity

• data capacity

Hadoop: Instead of upgrading the server, the cluster size is increased with more machines

Page 5: Cascading User Group Meet

What is MapReduce?

MapReduce is a programming model for running applications, mostly on Hadoop

Mapper

• Converts input (K,V) to new (K,V)

Shuffle

• Sorts and groups similar keys with all their values

Reducer

• Translates the values of each unique key to a new (K,V)
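
The three stages above can be sketched in plain Python. This is only an in-memory illustration of the programming model, not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the sample lines are assumptions made up for this example.

```python
from collections import defaultdict


def mapper(_key, line):
    """Map: turn one input (K, V) pair into new (K, V) pairs, here (word, 1)."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: sort the keys and group every value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce: collapse the values of one unique key into a new (K, V) pair."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce one after another in a single process."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    return [reducer(key, values) for key, values in shuffle(mapped)]


# Hypothetical input: (line offset, line text) pairs.
lines = [(0, "the quick fox"), (1, "the lazy dog"), (2, "the fox")]
result = run_job(lines)
print(result)
```

On a real cluster the map and reduce calls run on many machines and the shuffle moves data between them; the sketch only shows what each stage does to the key/value pairs.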

Page 6: Cascading User Group Meet

MapReduce Paradigm

[Figure: the three stages Map, Shuffle and Reduce. Input pairs (K1, V1) enter the Map stage, intermediate pairs (K2, V2), (K3, V3) and (K4, V4) pass through Shuffle, and the Reduce stage emits (K5, V5), (K6, V6) and (K7, V7).]

Page 7: Cascading User Group Meet

Map Reduce with Multiple Data Sources

Input: HDFS, Cassandra, SQL, HBase

Processing: MapReduce job

Output: HDFS, Neo4j, SQL, MongoDB

Page 8: Cascading User Group Meet

Jumping on the Hadoop Bandwagon

Page 9: Cascading User Group Meet

Challenges with Map Reduce

• Complex jobs that require multiple mappers and reducers

• Chaining multiple MR jobs and scheduling them together

• Wrong level of granularity of MR

• Transforming business rules into the Map Reduce paradigm

• Testing and maintaining the code

Page 10: Cascading User Group Meet

Growing Opportunities in Hadoop

• With the growing number of Hadoop jobs, there is a huge gap in the skillset required to meet the demand

• Enterprises have already made huge investments in existing business processes and training

Page 11: Cascading User Group Meet

How to Train Your Elephant ?!

Page 12: Cascading User Group Meet

Cascading

Page 13: Cascading User Group Meet

What is Cascading?

Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop.

Developed by Chris Wensel in 2007

Underlying motivation for developing the Cascading Java framework:

• Difficulty for Java developers to write MapReduce code

• MapReduce is based on functional programming elements

Page 14: Cascading User Group Meet

Enterprise Data Flow - Challenge

Business Goals

Data Sources

Using existing skillset, business processes and tools

Page 15: Cascading User Group Meet

Cascading Building Blocks – High-level Overview

Stack, from top to bottom:

• Cascading

• MapReduce

• HDFS (Distributed Storage)

Page 16: Cascading User Group Meet

Cascading in Short

• Functional programming way to Hadoop

• Alternative and easy API for MapReduce

• Reusable Java components

• Possibility for test-driven development

• Can be used with any JVM-based language: Java, JRuby, Clojure, etc.

Page 17: Cascading User Group Meet

Cascading Building Blocks

Pipes

Sinks

Taps

Flow

Page 18: Cascading User Group Meet

Sample Look of a Cascading Flow

Source Tap

Sink Tap

Pipe Assembly

Flow
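
How these four pieces fit together can be sketched with a toy model in Python. This is not the Cascading API: `ListTap`, `Flow`, `drop_failed` and `user_only` are hypothetical names invented for this illustration, and the data lives in memory instead of on HDFS.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class ListTap:
    """Hypothetical in-memory tap: a named place to read tuples from or write them to."""
    name: str
    tuples: List[tuple] = field(default_factory=list)


@dataclass
class Flow:
    """Binds a source tap, a pipe assembly and a sink tap into one runnable unit."""
    source: ListTap
    assembly: List[Callable[[Iterable[tuple]], Iterable[tuple]]]
    sink: ListTap

    def complete(self):
        # Stream the source tuples through every pipe, then store them in the sink.
        stream = iter(self.source.tuples)
        for pipe in self.assembly:
            stream = pipe(stream)
        self.sink.tuples = list(stream)


def drop_failed(stream):
    """Pipe 1: keep only tuples whose status field is 'ok'."""
    return (t for t in stream if t[1] == "ok")


def user_only(stream):
    """Pipe 2: project each tuple down to the user field."""
    return ((t[0],) for t in stream)


# Hypothetical login records: (user, status).
source = ListTap("logins", [("anna", "ok"), ("ben", "failed"), ("cara", "ok")])
sink = ListTap("successful_users")
Flow(source, [drop_failed, user_only], sink).complete()
print(sink.tuples)
```

The point of the separation is that the pipe assembly knows nothing about where the data comes from or goes to, so the same assembly can be tested against in-memory taps and later bound to cluster storage.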

Page 19: Cascading User Group Meet

Cascading Pipe Assemblies

Original Tuple Streams

Transformed Tuple Streams

Pipe

Each

GroupBy

CoGroup

Every

SubAssembly

Page 20: Cascading User Group Meet

The Quintessential WordCount Example

Page 21: Cascading User Group Meet

The Quintessential WordCount Example

Page 22: Cascading User Group Meet

The Quintessential WordCount Example

Page 23: Cascading User Group Meet

The Quintessential WordCount Example

Initialize properties and tell Hadoop which jar file to use

Page 24: Cascading User Group Meet

The Quintessential WordCount Example

Word-count

Page 25: Cascading User Group Meet

The Quintessential WordCount Example

Word-count
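
The code shown on these slides did not survive in the transcript. As a rough stand-in, the sketch below expresses word count in the Each, GroupBy and Every style of a pipe assembly, using plain Python generators. The helper names (`each`, `group_by`, `every`, `split_words`) are assumptions for this illustration and are not Cascading classes; the speaker's original example used the Cascading Java API.

```python
from itertools import groupby
from operator import itemgetter


def each(stream, function):
    """Each: apply a function to every tuple; it may emit zero or more tuples."""
    for tup in stream:
        yield from function(tup)


def group_by(stream, key_index=0):
    """GroupBy: sort on the key field and emit (key, list of tuples) per key."""
    ordered = sorted(stream, key=itemgetter(key_index))
    for key, group in groupby(ordered, key=itemgetter(key_index)):
        yield key, list(group)


def every(groups, aggregator):
    """Every: run an aggregator once per group."""
    for key, group in groups:
        yield key, aggregator(group)


def split_words(tup):
    """Function used by Each: split a one-field line tuple into one-field word tuples."""
    (line,) = tup
    for word in line.lower().split():
        yield (word,)


# Hypothetical input lines, one tuple per line.
lines = [("to be or not to be",), ("to do",)]
counts = list(every(group_by(each(lines, split_words)), len))
print(counts)
```

Here `len` plays the role of a count aggregator: each group holds one tuple per occurrence of the word, so its length is the word count.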

Page 26: Cascading User Group Meet

Typical Pipe Assembly

CSV

NoSQL

Sequence File

Flow Definition

Flow A

Page 27: Cascading User Group Meet

Cascading Multiple Flows

Flow A

Flow E

Flow B

Flow C

Flow D

Flow F

Flow G

Flow H
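
The arrows between the flows are lost in this transcript, so the sketch below uses an invented dependency layout. It illustrates one way several flows can be ordered: a flow must run after any flow that writes the data it reads. The flow names, dataset names and the `schedule` helper are hypothetical, and Python's `graphlib` (3.9+) stands in for the scheduling that a framework would do.

```python
from graphlib import TopologicalSorter

# Hypothetical flows, each described by the datasets it reads and writes.
flows = {
    "Flow A": {"reads": {"raw_logs"}, "writes": {"clean_logs"}},
    "Flow B": {"reads": {"clean_logs"}, "writes": {"sessions"}},
    "Flow C": {"reads": {"clean_logs"}, "writes": {"errors"}},
    "Flow D": {"reads": {"sessions", "errors"}, "writes": {"report"}},
}


def schedule(flows):
    """Return a run order in which every flow comes after the flows feeding it."""
    # Map each dataset to the flow that produces it.
    producers = {
        dataset: name
        for name, flow in flows.items()
        for dataset in flow["writes"]
    }
    # A flow depends on the producers of the datasets it reads.
    graph = {
        name: {producers[d] for d in flow["reads"] if d in producers}
        for name, flow in flows.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = schedule(flows)
print(order)
```

Flow B and Flow C do not depend on each other, so either may come first (or both could run in parallel); only the position of Flow A before them and Flow D after them is fixed.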

Page 28: Cascading User Group Meet

Cascading Pipe Assemblies

lhs pipe definition

rhs pipe definition

Join lhs & rhs pipes

Join pipe assembly
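
The join code on this slide is not part of the transcript. The idea of joining a left-hand and a right-hand tuple stream on a shared key can be sketched as follows; `co_group` is a hypothetical helper written for this illustration (an inner join on one field), not the Cascading CoGroup class, and the sample data is made up.

```python
from collections import defaultdict


def co_group(lhs, rhs, lhs_key=0, rhs_key=0):
    """Inner join: pair every lhs tuple with every rhs tuple that shares its key."""
    # Index the right-hand stream by its key field.
    index = defaultdict(list)
    for tup in rhs:
        index[tup[rhs_key]].append(tup)
    # Emit one combined tuple per matching pair, in lhs order.
    for left in lhs:
        for right in index.get(left[lhs_key], []):
            yield left + right


# lhs pipe: (user_id, name); rhs pipe: (user_id, login_date). Both are hypothetical.
users = [(1, "anna"), (2, "ben"), (3, "cara")]
logins = [(1, "2014-05-26"), (3, "2014-05-25"), (1, "2014-05-27")]
joined = list(co_group(users, logins))
print(joined)
```

Users without a matching login (here user 2) drop out of the result, which is the behaviour of an inner join; other join types would keep them with empty fields.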

Page 29: Cascading User Group Meet

Cascading Real-world Data Flow Use Cases

Analytics on login information

Analytics from ClickStream Data

Page 30: Cascading User Group Meet

Support for Multiple Data Sources

HDFS

Cassandra

MongoDB

ElasticSearch

HBase

Memcached

Neo4j

Solr

ElephantDB

RDBMS

Splunk

http://www.cascading.org/extensions/

Page 31: Cascading User Group Meet

Support for Major Serializers

http://www.cascading.org/extensions/

JSON

Avro

Kryo

Thrift

Page 32: Cascading User Group Meet

Predictive Models on Hadoop

Page 33: Cascading User Group Meet

Cascading Pattern

Cascading Pattern is a machine learning project within the Cascading development framework, used to build enterprise data workflows.

Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group.

PMML is supported by most of the popular analytical tools, such as R, SAS, Teradata, Weka, KNIME, Microsoft and others.

http://www.dmg.org/

Page 34: Cascading User Group Meet

Cascading Pattern on CarbookPlus

• Track trips

• Maintain logbook

• Get notified about the best gas stations

• Manage and compare vehicle cost

• Fleet management

• Social platform connecting drivers

www.carbookplus.com

Page 35: Cascading User Group Meet

CarbookPlus Fuel Cost Prediction

“MDM: Mobilitäts Daten Marktplatz” (Mobility Data Marketplace) is a German federal government organization that provides open data about fuel prices across Germany in real time.

http://www.mdm-portal.de/

Our Objective:

• Store the data from MDM into HDFS

• Process and clean the data with Cascading

• Build a model with R, predicting the fuel price trend for the next 7 days and 24 hours

• Export the model as PMML

• Scale out on the Hadoop cluster with Cascading Pattern

• Store the results in MongoDB
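
The step of applying an exported PMML model to records can be illustrated with a small Python sketch. Everything specific here is an assumption: the talk does not say which model type was built in R, so a linear regression is used as an example, the PMML snippet is heavily simplified (no namespace, no data dictionary), and the predictor names and coefficients are invented. Cascading Pattern itself performs this scoring inside a flow on the cluster.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical PMML regression model (real exports carry more metadata).
PMML = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.50">
      <NumericPredictor name="hour" coefficient="0.002"/>
      <NumericPredictor name="crude_price" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_model(pmml_text):
    """Read the intercept and the per-field coefficients from the PMML text."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Linear regression: intercept plus the sum of coefficient * field value."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_model(PMML)
# Hypothetical cleaned records, one per prediction.
records = [{"hour": 8, "crude_price": 100.0}, {"hour": 18, "crude_price": 100.0}]
predictions = [round(score(model, r), 3) for r in records]
print(predictions)
```

Because the model travels as a file, the scoring code needs no knowledge of R; the same file could be swapped for a retrained model without touching the data flow.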

Page 36: Cascading User Group Meet

Exporting a PMML Model from R

Export model as PMML file

Page 37: Cascading User Group Meet

Cascading Pattern Flow Definition

Page 38: Cascading User Group Meet

Fuel Cost Predictor Result

Page 39: Cascading User Group Meet

Algorithms Supported by Cascading Pattern

• Random Forest

• Linear Regression

• Logistic Regression

• K-Means Clustering

• Hierarchical Clustering

• Multinomial Model

https://github.com/cascading/pattern

Page 40: Cascading User Group Meet

Future of Cascading

• Cascading Pattern to support more predictive models: Neural Network, Support Vector Machine

• More new features in Cascading 3.0

Cascading 3.0 stack, from top to bottom:

• Cascading 3.0

• Execution Engine: Spark, Tez, Storm

• YARN (Cluster Resource Management)

• HDFS (Distributed Storage)

Page 41: Cascading User Group Meet

When do you Start ?

Page 42: Cascading User Group Meet

Questions?

Q & A

Thank you!

Vinoth Kannan

Big Data Engineer, WidasConcepts GmbH

www.widas.de

@WidasConcepts, @vinoth4v

/WidasConcepts

Credits:

www.soundcloud.com

www.concurrentinc.com

www.cascading.org