Cascading User Group Meet

Post on 19-Aug-2014

DESCRIPTION

Leading Data with Innovation. My talk on using Cascading for data-driven applications, presented at the meetup in Berlin.

TRANSCRIPT

Simplifying Application Development on Hadoop

WidasConcepts Unternehmensberatung GmbH Maybachstraße 2 71299 Wimsheim http://www.widas.de

Vinoth Kannan, Big Data Engineer, WidasConcepts

Cascading User Group Meet

Berlin, Germany, 26.05.2014

What is Hadoop?

“Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.” (Wikipedia)

Principles of Hadoop:

• Designed for: Batch Processing

• Possible to: Horizontal Scaling

• Works on: Bringing Computation to Data

Main Features

Reliable and Redundant

• No performance or data loss even on failure

Powerful

• Possible to have huge clusters (largest 40,000 nodes)

• Supports “Best of Breed Analytics”

Scalable

• Linearly scalable with increase in data volume

Cost Efficient

• No need for expensive hardware. Supports commodity hardware

Simple and flexible APIs

• Great ecosystem with multitude of solutions to support

Traditional vs. Hadoop

Traditional: More and larger servers are necessary to accomplish tasks:

• computing capacity

• data capacity

Hadoop: Instead of upgrading the server, the cluster size is increased with more machines

What is MapReduce?

MapReduce is a programming model for running applications, mostly on Hadoop

Mapper

• Converts input (K,V) to new (K,V)

Shuffle

• Sorts the keys and groups each key together with all its values

Reducer

• Translates the values of each unique key into new (K,V) pairs
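The three phases above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the model, not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the word-count mapper and reducer are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Mapper: turn every input (K, V) pair into zero or more new (K, V) pairs."""
    mapped = []
    for key, value in records:
        mapped.extend(mapper(key, value))
    return mapped


def shuffle_phase(pairs):
    """Shuffle: collect all values per key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Reducer: turn each (key, [values]) group into new (K, V) pairs."""
    reduced = []
    for key, values in groups:
        reduced.extend(reducer(key, values))
    return reduced


def word_mapper(offset, line):
    # Hypothetical mapper: the input key (a line offset) is ignored.
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(word, counts):
    # Hypothetical reducer: add up the ones emitted for each word.
    return [(word, sum(counts))]


def run_job(records):
    """Chain the three phases for a word-count job."""
    mapped = map_phase(records, word_mapper)
    grouped = shuffle_phase(mapped)
    return reduce_phase(grouped, sum_reducer)


if __name__ == "__main__":
    lines = [(0, "Hadoop stores data"), (1, "Cascading processes data on Hadoop")]
    for word, count in run_job(lines):
        print(word, count)
```

On a real cluster the mapper and reducer run on many machines in parallel and the shuffle moves data across the network; the sketch keeps only the data flow.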

MapReduce Paradigm

[Diagram: Map, Shuffle, Reduce. Input pairs (K1, V1) are mapped to intermediate pairs (K2, V2), (K3, V3), (K4, V4), which are shuffled by key and reduced to output pairs (K5, V5), (K6, V6), (K7, V7)]

MapReduce with Multiple Data Sources

Input: HDFS, Cassandra, SQL, HBase

Processing: MapReduce job

Output: HDFS, Neo4j, SQL, MongoDB

Jumping on the Hadoop Bandwagon

Challenges with MapReduce

• Complex jobs which require multiple mappers and reducers

• Chaining multiple MR jobs and scheduling them together

• Wrong level of granularity of MR

• Transforming business rules into the MapReduce paradigm

• Testing and maintaining the code

Growing opportunities in Hadoop

• With the growing job trends in Hadoop, there is a huge gap in the skillset required to meet the demand

• Huge investment already made by enterprises in existing business processes and training

How to Train Your Elephant?!

Cascading

What is Cascading?

Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop.

Developed by Chris Wensel in 2007

Underlying motivation for developing the Cascading Java framework:

• Difficulty for Java developers to write MapReduce code

• MapReduce is based on functional programming elements

Enterprise Data Flow - Challenge

Business Goals

Data Sources

Using existing skillsets, business processes and tools

Cascading Building Blocks – High-level Overview

Cascading

MapReduce

HDFS: Distributed Storage

Cascading in Short

• Functional programming way to Hadoop

• Alternative and easy API for MapReduce

• Reusable Java components

• Possibility for test-driven development

• Can be used with any JVM-based language: Java, JRuby, Clojure, etc.

Cascading Building Blocks

Taps

Pipes

Sinks

Flow

Sample Look of a Cascading Flow

Flow: Source Tap → Pipe Assembly → Sink Tap

Cascading Pipe Assemblies

Original Tuple Streams

Transformed Tuple Streams

Pipe

Each

GroupBy

CoGroup

Every

SubAssembly

The quintessential WordCount Example

Initialize properties and tell Hadoop which jar file to use

Word-count
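The code shown on these slides did not survive in the transcript. As a rough stand-in, the sketch below models the word count as a pipe assembly in plain Python. It is not the Cascading Java API: the helpers `each`, `group_by` and `every` only mirror the roles of the Each, GroupBy and Every pipes, and `run_flow`, the in-memory list used as source and the dictionary used as sink are assumptions made for the example.

```python
import re
from collections import defaultdict


def each(tuples, function):
    """Each: apply a function to every tuple; it may emit zero or more tuples."""
    for t in tuples:
        yield from function(t)


def group_by(tuples, key_index):
    """GroupBy: collect tuples that share the same value in the key field."""
    groups = defaultdict(list)
    for t in tuples:
        groups[t[key_index]].append(t)
    return sorted(groups.items())


def every(groups, aggregator):
    """Every: apply an aggregator to each group and emit one tuple per group."""
    for key, members in groups:
        yield (key, aggregator(members))


def split_words(t):
    """Function used by `each`: split a one-field line tuple into word tuples."""
    (line,) = t
    for word in re.findall(r"[a-z0-9']+", line.lower()):
        yield (word,)


def word_count_assembly(tuples):
    """The pipe assembly: Each(split) -> GroupBy(word) -> Every(count)."""
    words = each(tuples, split_words)
    grouped = group_by(words, key_index=0)
    return every(grouped, aggregator=len)


def run_flow(source_lines, sink):
    """The flow: read from the source, run the assembly, write to the sink."""
    source = ((line,) for line in source_lines)
    for word, count in word_count_assembly(source):
        sink[word] = count
    return sink


if __name__ == "__main__":
    print(run_flow(["To be, or not to be"], {}))
```

The point of the structure is that the assembly knows nothing about where tuples come from or go to; swapping the source or sink does not touch `word_count_assembly`.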

Typical Pipe Assembly

CSV

NoSQL

Sequence File

Flow Definition

Flow A

Cascading Multiple Flows

Flow A

Flow E

Flow B

Flow C

Flow D

Flow F

Flow G

Flow H
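When several flows are chained, each flow can only start once the flows producing its inputs have finished. The sketch below shows one way to derive such an order in plain Python (3.9+ for `graphlib`). It is an illustration, not Cascading's scheduler: the `schedule` function, the flow names and every path in the example are hypothetical, and the dependency shape is not taken from the slide's diagram.

```python
from graphlib import TopologicalSorter


def schedule(flows):
    """Order flows so that every flow runs after the flows that write its inputs.

    `flows` maps a flow name to a pair (source paths, sink paths).
    """
    # Remember which flow produces each path.
    producers = {}
    for name, (_, sinks) in flows.items():
        for path in sinks:
            producers[path] = name

    # A flow depends on the producers of the paths it reads.
    graph = {}
    for name, (sources, _) in flows.items():
        graph[name] = {
            producers[path]
            for path in sources
            if path in producers and producers[path] != name
        }

    # static_order() yields each node after all of its predecessors.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    # Hypothetical flows and paths, chosen only for the illustration.
    flows = {
        "Flow A": ({"raw/logins"}, {"stage/clean"}),
        "Flow B": ({"stage/clean"}, {"stage/sessions"}),
        "Flow C": ({"stage/clean"}, {"stage/users"}),
        "Flow D": ({"stage/sessions", "stage/users"}, {"out/report"}),
    }
    print(schedule(flows))
```

Flows with no dependency between them (here Flow B and Flow C) may appear in either order and could run in parallel.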

Cascading Pipe Assemblies

lhs pipe definition

rhs pipe definition

Join lhs & rhs pipes

Join pipe assembly
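The join of a left-hand and a right-hand pipe can be pictured with a small plain-Python sketch. It performs an inner join on one key field and is not the Cascading CoGroup API; the function name `co_group`, the dictionary-based rows and all field names in the example are assumptions.

```python
from collections import defaultdict


def co_group(lhs, rhs, lhs_key, rhs_key):
    """Inner-join two tuple streams (lists of dicts) on the given key fields."""
    # Index the right-hand side by its join key.
    rhs_by_key = defaultdict(list)
    for row in rhs:
        rhs_by_key[row[rhs_key]].append(row)

    # Emit one merged row per matching (left, right) pair.
    joined = []
    for left in lhs:
        for right in rhs_by_key.get(left[lhs_key], []):
            joined.append({**left, **right})
    return joined


if __name__ == "__main__":
    # Hypothetical streams with distinct field names on each side.
    logins = [
        {"user_id": 1, "page": "home"},
        {"user_id": 2, "page": "search"},
        {"user_id": 3, "page": "home"},
    ]
    users = [
        {"id": 1, "name": "Ana"},
        {"id": 2, "name": "Ben"},
    ]
    for row in co_group(logins, users, lhs_key="user_id", rhs_key="id"):
        print(row)
```

Rows without a match on the other side (user 3 above) are dropped, which is what makes this an inner join; the field names on both sides are kept distinct so that merging the rows does not overwrite values.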

Cascading Real-World Data Flow Use Cases

Analytics on login information

Analytics from ClickStream Data

Support for Multiple Data Sources

HDFS

Cassandra

MongoDB

ElasticSearch

HBase

Memcached

Neo4j

Solr

ElephantDB

RDBMS

Splunk

http://www.cascading.org/extensions/

Support for Major Serializers

JSON, Avro, Kryo, Thrift

http://www.cascading.org/extensions/

Predictive Models on Hadoop

Cascading Pattern

Cascading Pattern is a machine learning project within the Cascading development framework, used to build enterprise data workflows

Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group

PMML is supported by most of the popular analytical tools such as R, SAS, Teradata, Weka, KNIME, Microsoft, etc.

http://www.dmg.org/
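To give a feel for what a PMML file carries, the sketch below reads a heavily simplified linear regression model and scores one record in plain Python. It is not how Cascading Pattern works internally. The XML fragment is an assumption made for the example: real PMML files declare an XML namespace and include further sections such as a data dictionary and mining schema, and the predictor names `x1` and `x2` and all coefficients are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stripped-down, hypothetical PMML fragment (no namespace, no data dictionary).
PMML_DOC = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.0">
      <NumericPredictor name="x1" coefficient="0.5"/>
      <NumericPredictor name="x2" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Extract the intercept and per-field coefficients from the fragment."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model: intercept + sum(coefficient * field value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


if __name__ == "__main__":
    model = load_linear_model(PMML_DOC)
    print(score(model, {"x1": 4.0, "x2": 3.0}))
```

Because the model is just data, it can be trained in one tool and applied record by record in another, which is the property that makes scoring easy to distribute across a cluster.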

Cascading Pattern on CarbookPlus

Track trips

Maintain logbook

Get notified about the best gas stations

Manage and compare vehicle cost

Fleet management

Social platform connecting drivers

www.carbookplus.com

CarbookPlus Fuel Cost Prediction

MDM (Mobilitäts Daten Marktplatz, "Mobility Data Marketplace") is a German federal government organization that provides open data about fuel prices across Germany in real time.

http://www.mdm-portal.de/

Our Objective:

• Store the data from MDM in HDFS

• Process and clean the data with Cascading

• Build a model with R, predicting the fuel price trend for the next 7 days and 24 hours

• Export the model as PMML

• Scale out on the Hadoop cluster with Cascading Pattern

• Store the results in MongoDB

Exporting PMML model from R

Export model as PMML file

Cascading Pattern Flow Definition

Fuel Cost Predictor Result

Algorithms Supported by Cascading Pattern

• Random Forest

• Linear Regression

• Logistic Regression

• K-Means Clustering

• Hierarchical Clustering

• Multinomial Model

https://github.com/cascading/pattern

Future of Cascading

Cascading Pattern to support more predictive models:

• Neural Network

• Support Vector Machine

More new features in Cascading 3.0

Cascading 3.0

Execution Engine: Spark, Tez, Storm

YARN: Cluster Resource Management

HDFS: Distributed Storage

When Do You Start?

Questions?

Q & A

Thank you!!

Vinoth Kannan

Big Data Engineer, WidasConcepts GmbH

www.widas.de

@WidasConcepts, @vinoth4v

/WidasConcepts

vinoth.kannan@widas.de

Credits

www.soundcloud.com

www.concurrentinc.com

www.cascading.org
