Cascading User Group Meet
DESCRIPTION
Leading Data with Innovation. My talk on using Cascading for data-driven applications, presented at the meetup in Berlin.
TRANSCRIPT
Simplifying Application Development on Hadoop
WidasConcepts Unternehmensberatung GmbH Maybachstraße 2 71299 Wimsheim http://www.widas.de
Vinoth Kannan, Big Data Engineer, WidasConcepts
Cascading User Group Meet
Berlin, Germany, 26.05.2014
What is Hadoop?
“Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.” (Wikipedia)
Principles of Hadoop:
• Designed for: Batch Processing
• Possible to: Horizontal Scaling
• Works on: Bringing Computation to Data
Main Features
Reliable and Redundant
• No performance or data loss even on failure
Powerful
• Possible to have huge clusters (largest 40,000 nodes)
• Supports “Best of Breed Analytics”
Scalable
• Linearly scalable with increase in data volume
Cost Efficient
• No need for expensive hardware. Supports commodity hardware
Simple and flexible APIs
• Great ecosystem with multitude of solutions to support
Traditional vs. Hadoop
Traditional: More and larger servers are necessary to accomplish tasks:
• computing capacity
• data capacity
Hadoop: Instead of upgrading the server, the cluster size is increased with more machines
What is MapReduce?
MapReduce is a programming model for running applications, mostly on Hadoop
Mapper
• Converts input (K,V) to new (K,V)
Shuffle
• Sorts and groups identical keys with all their values
Reducer
• Translates the values of each unique key to a new (K,V)
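The three phases above can be sketched in plain Java, without Hadoop. This is only an illustration of the paradigm on an in-memory list: the class name MapReduceSketch, its method names, and the sample input are assumptions made for this sketch and are not part of the talk or of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of Map, Shuffle and Reduce; no Hadoop involved.
// All names here are hypothetical and chosen for this sketch only.
class MapReduceSketch {

    // Map: convert each input line into new (K,V) pairs of the form (word, 1).
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle: sort by key and group every key with all of its values.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: translate the values of each unique key into one (word, count) pair.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Chains the three phases in order.
    static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }

    public static void main(String[] args) {
        // Prints {cascading=1, hadoop=1, hello=2}
        System.out.println(wordCount(List.of("Hello Hadoop", "hello Cascading")));
    }
}
```

On a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data over the network; the sketch only shows how the (K,V) pairs change shape from phase to phase.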
MapReduce Paradigm
[Diagram: (K,V) pairs flowing through the Map, Shuffle, and Reduce stages]
MapReduce with Multiple Data Sources
Input: HDFS, Cassandra, SQL, HBase
Processing: MapReduce job
Output: HDFS, Neo4j, SQL, MongoDB
Jumping on the Hadoop Bandwagon
Challenges with MapReduce
• Complex jobs that require multiple mappers and reducers
• Chaining multiple MR jobs and scheduling them together
• Wrong level of granularity of MR
• Transforming business rules into the MapReduce paradigm
• Testing and maintaining the code
Growing opportunities in Hadoop
• With the growing job trends in Hadoop, there is a huge gap in the skillset required to meet the demand
• Huge investment already made by enterprises in existing business processes and training
How to Train Your Elephant ?!
Cascading
What is Cascading?
Cascading is an open-source Java framework that provides an application development platform for building data applications on Hadoop.
Developed by Chris Wensel in 2007
Underlying motivation for developing the Cascading Java framework:
• Difficulty for Java developers to write MapReduce code
• MapReduce is based on functional programming elements
Enterprise Data Flow - Challenge
Business Goals
Data Sources
Using existing skillset, business processes and tools
Cascading Building Blocks – High-level Overview
Cascading
MapReduce
HDFS (Distributed Storage)
Cascading in Short
• Functional programming way to Hadoop
• Alternative and easy API for MapReduce
• Reusable Java components
• Possibility for test-driven development
• Can be used with any JVM-based language (Java, JRuby, Clojure, etc.)
Cascading Building Blocks
Pipes
Sinks
Taps
Flow
Sample Look of Cascading Flow
Source Tap
Sink Tap
Pipe Assembly
Flow
Cascading Pipe Assemblies
Original Tuple Streams
Transformed Tuple Streams
Pipe
Each
GroupBy
CoGroup
Every
SubAssembly
The quintessential WordCount Example
Initialize properties and tell Hadoop which jar file to use
Word-count
Typical Pipe Assembly
CSV
NoSQL
Sequence File
Flow Definition
Flow A
Cascading Multiple Flows
Flow A
Flow E
Flow B
Flow C
Flow D
Flow F
Flow G
Flow H
Cascading Pipe Assemblies
lhs pipe definition
rhs pipe definition
Join lhs & rhs pipes
Join pipe assembly
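The slide's code for these four steps did not survive in the transcript. As a rough stand-in, the plain-Java sketch below shows what joining an lhs and an rhs tuple stream on a common key produces, assuming an inner join. It does not use the Cascading API: the class JoinSketch, the method innerJoin, and the sample user and login records are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of joining two tuple streams on a shared key.
// Hypothetical names and data; this is not Cascading code.
class JoinSketch {

    // Inner join on the first field: only keys present on both sides survive.
    // Each result tuple is (key, lhs value, rhs value).
    static List<String[]> innerJoin(List<String[]> lhs, List<String[]> rhs) {
        // Index the rhs stream by key so each lhs tuple can find its matches.
        Map<String, List<String[]>> rhsByKey = new HashMap<>();
        for (String[] right : rhs) {
            rhsByKey.computeIfAbsent(right[0], k -> new ArrayList<>()).add(right);
        }

        List<String[]> joined = new ArrayList<>();
        for (String[] left : lhs) {
            for (String[] right : rhsByKey.getOrDefault(left[0], List.of())) {
                joined.add(new String[] { left[0], left[1], right[1] });
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // lhs stream: (userId, name); rhs stream: (userId, loginDate). Made-up sample data.
        List<String[]> users = List.of(
                new String[] { "u1", "Alice" },
                new String[] { "u2", "Bob" });
        List<String[]> logins = List.of(
                new String[] { "u1", "2014-05-26" },
                new String[] { "u3", "2014-05-27" });

        for (String[] row : innerJoin(users, logins)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

In Cascading itself the two inputs would be separate pipes and the join would be declared as part of the pipe assembly, with the framework planning the underlying MapReduce jobs.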
Cascading real-world Data Flow Use Cases
Analytics on login information
Analytics from ClickStream Data
Support for Multiple Data Sources
HDFS
Cassandra
MongoDB
ElasticSearch
HBase
Memcached
Neo4j
Solr
ElephantDB
RDBMS
Splunk
http://www.cascading.org/extensions/
Support for Major Serializers
http://www.cascading.org/extensions/
JSON
Avro
Kryo
Thrift
Predictive Models on Hadoop
Cascading Pattern
Cascading Pattern is a machine learning project within the Cascading development framework, used to build enterprise data workflows
Pattern uses the industry-standard Predictive Model Markup Language (PMML), an XML-based file format developed by the Data Mining Group
PMML is supported by most of the popular analytical tools, such as R, SAS, Teradata, Weka, KNIME, Microsoft, etc.
http://www.dmg.org/
Cascading Pattern on CarbookPlus
• Track trips
• Maintain logbook
• Get notified about best gas stations
• Manage and compare vehicle cost
• Fleet management
• Social platform connecting drivers
www.carbookplus.com
CarbookPlus Fuel Cost Prediction
“MDM: Mobilitäts Daten Marktplatz” is a German federal government organization that provides open data about fuel prices across Germany in real time.
http://www.mdm-portal.de/
Our Objective:
• Store the data from MDM into HDFS
• Process and clean the data with Cascading
• Build a model with R, predicting the fuel price trend for the next 7 days & 24 hours
• Export the model as PMML
• Scale out on the Hadoop cluster with Cascading Pattern
• Store the results in MongoDB
Exporting PMML model from R
Export model as PMML file
Cascading Pattern Flow Definition
Fuel Cost Predictor Result
Algorithms Supported by Cascading Pattern
• Random Forest
• Linear Regression
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Multinomial Model
https://github.com/cascading/pattern
Future of Cascading
Cascading Pattern to support more predictive models:
• Neural Network
• Support Vector Machine
More new features in Cascading 3.0
Cascading 3.0
Execution Engine: Spark, Tez, Storm
YARN (Cluster Resource Management)
HDFS (Distributed Storage)
When do you start?
Questions?
Q & A
Thank you!!
Vinoth Kannan
Big Data Engineer
WidasConcepts GmbH
www.widas.de
@WidasConcepts
@vinoth4v
/WidasConcepts
Credits
www.soundcloud.com
www.concurrentinc.com
www.cascading.org