PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos

Download PDX Hadoop: Enterprise Data Workflows with Cascading and Mesos

Post on 20-Aug-2015

1.645 views

Category:

Technology

1 download

TRANSCRIPT

1. Paco Nathan liber118.com/pxn/ Enterprise Data Workows with Cascading and Mesos Licensed under a Creative Commons Attribution- NonCommercial-NoDerivs 3.0 Unported License. 1Saturday, 27 July 13 2. Cascading / Cascalog / Scalding Enterprise Data Workflows with Cascading Cluster Computing with Mesos Looking ahead 2Saturday, 27 July 13 3. Cascading origins API author Chris Wensel worked as a system architect at an Enterprise rm well-known for many popular data products. Wensel was following the Nutch open source project where Hadoop started. Observation: would be difcult to nd Java developers to write complex Enterprise apps in MapReduce potential blocker for leveraging new open source technology. 3Saturday, 27 July 13 4. Cascading functional programming Key insight: MapReduce is based on functional programming back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease stafng problems as Main Street Enterprise rms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workows: leverages JVM and Java-based tools without any need to create new languages allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters 4Saturday, 27 July 13 5. Cascading functional programming Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading used for their large-scale production deployments new case studies for Cascading apps are mostly based on domain-specic languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology Dan Woods, 2013-04-17 Forbes forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming- practices-will-improve-your-return-from-technology/ 5Saturday, 27 July 13 6. Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Cascading integrations partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc. serialization: Avro, Thrift, Kryo, JSON, etc. topologies: Apache Hadoop, tuple spaces, local mode 6Saturday, 27 July 13 7. Cascading deployments case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. 7Saturday, 27 July 13 8. Cascading deployments case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc. workow abstraction addresses: stafng bottleneck; system integration; operational complexity; test-driven development 8Saturday, 27 July 13 9. Document Collection Word Count Tokenize GroupBy token Count R M 1 map 1 reduce 18 lines code gist.github.com/3900702 WordCount conceptual ow diagram cascading.org/category/impatient 9Saturday, 27 July 13 10. WordCount Cascading app in Java String docPath = args[ 0 ]; String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); Document Collection Word Count Tokenize GroupBy token Count R M 10Saturday, 27 July 13 11. mapreduce Every('wc')[Count[decl:'count']] Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] GroupBy('wc')[by:['token']] Each('token')[RegexSplitGenerator[decl:'token'][args:1]] Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] [head] [tail] [{2}:'token', 'count'] [{1}:'token'] [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text'] wc[{1}:'token'] [{1}:'token'] [{2}:'token', 'count'] [{2}:'token', 'count'] [{1}:'token'] [{1}:'token'] WordCount generated ow diagram Document Collection Word Count Tokenize GroupBy token Count R M 11Saturday, 27 July 13 12. (ns impatient.core (:use [cascalog.api] [cascalog.more-taps :only (hfs-delimited)]) (:require [clojure.string :as s] [cascalog.ops :as c]) (:gen-class)) (defmapcatop split [line] "reads in a line of string and splits it by regex" (s/split line #"[[](),.)s]+")) (defn -main [in out & args] (? ?word) (c/count ?count))) ; Paul Lam ; github.com/Quantisan/Impatient WordCount Cascalog / Clojure Document Collection Word Count Tokenize GroupBy token Count R M 12Saturday, 27 July 13 13. github.com/nathanmarz/cascalog/wiki implements Datalog in Clojure, with predicates backed by Cascading for a highly declarative language run ad-hoc queries from the Clojure REPL approx. 10:1 code reduction compared with SQL composable subqueries, used for test-driven development (TDD) practices at scale Leiningen build: simple, no surprises, in Clojure itself more new deployments than other Cascading DSLs Climate Corp is largest use case: 90% Clojure/Cascalog has a learning curve, limited number of Clojure developers aggregators are the magic, and those take effort to learn WordCount Cascalog / Clojure Document Collection Word Count Tokenize GroupBy token Count R M 13Saturday, 27 July 13 14. import com.twitter.scalding._ class WordCount(args : Args) extends Job(args) { Tsv(args("doc"), ('doc_id, 'text), skipHeader = true) .read .flatMap('text -> 'token) { text : String => text.split("[ [](),.]") } .groupBy('token) { _.size('count) } .write(Tsv(args("wc"), writeHeader = true)) } WordCount Scalding / Scala Document Collection Word Count Tokenize GroupBy token Count R M 14Saturday, 27 July 13 15. github.com/twitter/scalding/wiki extends the Scala collections API so that distributed lists become pipes backed by Cascading code is compact, easy to understand nearly 1:1 between elements of conceptual ow diagram and function calls extensive libraries are available for linear algebra, abstract algebra, machine learning e.g., Matrix API, Algebird, etc. signicant investments by Twitter, Etsy, eBay, etc. great for data services at scale less learning curve than Cascalog WordCount Scalding / Scala Document Collection Word Count Tokenize GroupBy token Count R M 15Saturday, 27 July 13 16. Workow Abstraction pattern language Cascading uses a plumbing metaphor in the Java API, to dene workows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Data is represented as ows of tuples. Operations within the ows bring functional programming aspects into Java A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199 16Saturday, 27 July 13 17. Workow Abstraction literate programming Cascading workows generate their own visual documentation: ow diagrams in formal terms, ow diagrams leverage a methodology called literate programming provides intuitive, visual representations for apps great for cross-team collaboration Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Literate Programming Don Knuth literateprogramming.com 17Saturday, 27 July 13 18. Workow Abstraction business process following the essence of literate programming, Cascading workows provide statements of business process this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) this is especially apparent in large-scale Cascalog apps: Specify what you require, not how to achieve it. by virtue of the pattern language, the ow planner then determines how to translate business process into efcient, parallel jobs at scale 18Saturday, 27 July 13 19. Follow-Up blog, developer community, code/wiki/gists, maven repo, commercial products, etc.: cascading.org zest.to/group11 github.com/Cascading conjars.org goo.gl/KQtUL concurrentinc.com 19Saturday, 27 July 13 20. Cascading / Cascalog / Scalding Enterprise Data Workflows with Cascading Cluster Computing with Mesos Looking ahead 20Saturday, 27 July 13 21. Anatomy of an Enterprise app Denition a typical Enterprise workow which crosses through multiple departments, languages, and technologies ETL data prep predictive model data sources end uses 21Saturday, 27 July 13 22. Anatomy of an Enterprise app Denition a typical Enterprise workow which crosses through multiple departments, languages, and technologies ETL data prep predictive model data sources end uses ANSI SQL for ETL 22Saturday, 27 July 13 23. Anatomy of an Enterprise app Denition a typical Enterprise workow which crosses through multiple departments, languages, and technologies ETL data prep predictive model data sources end usesJ2EE for business logic 23Saturday, 27 July 13 24. Anatomy of an Enterprise app Denition a typical Enterprise workow which crosses through multiple departments, languages, and technologies ETL data prep predictive model data sources end uses SAS for predictive models 24Saturday, 27 July 13 25. Anatomy of an Enterprise app Denition a typical Enterprise workow which crosses through multiple departments, languages, and technologies ETL data prep predictive model data sources end uses SAS for predictive modelsANSI SQL for ETL most of the licensing costs 25Saturday, 27 July 13 26. Anatomy of an Enterprise app Denition a typical Enterprise workow which crosses through multiple departments, languages, and technologies ETL data prep predictive model data sources end usesJ2EE for business logic most of the project costs 26Saturday, 27 July 13 27. ETL data prep predictive model data sources end uses Lingual: DW ANSI SQL Pattern: SAS, R, etc. PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workow components into an integrated app one among many, typically based on 100% open source a compiler sees it all cascading.org 27Saturday, 27 July 13 28. a compiler sees it all ETL data prep predictive model data sources end uses Lingual: DW ANSI SQL Pattern: SAS, R, etc. PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workow components into an integrated app one among many, typically based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "etl" ) .addSource( "example.employee", emplTap ) .addSource( "example.sales", salesTap ) .addSink( "results", resultsTap ); SQLPlanner sqlPlanner = new SQLPlanner() .setSql( sqlStatement ); flowDef.addAssemblyPlanner( sqlPlanner ); cascading.org 28Saturday, 27 July 13 29. a compiler sees it all ETL data prep predictive model data sources end uses Lingual: DW ANSI SQL Pattern: SAS, R, etc. PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workow components into an integrated app one among many, typically based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "classifier" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap ); PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlModel ) ) .retainOnlyActiveIncomingFields(); flowDef.addAssemblyPlanner( pmmlPlanner ); 29Saturday, 27 July 13 30. cascading.org ETL data prep predictive model data sources end uses Lingual: DW ANSI SQL Pattern: SAS, R, etc. PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workow components into an integrated app one among many, typically based on 100% open source visual collaboration for the business logic is a great way to improve how teams work together Failure Traps bonus allocation employee PMML classifier quarterly sales Join Count leads 30Saturday, 27 July 13 31. Lingual CSV data in local le system cascading.org/lingual 31Saturday, 27 July 13 32. Lingual shell prompt, catalog cascading.org/lingual 32Saturday, 27 July 13 33. Lingual queries cascading.org/lingual 33Saturday, 27 July 13 34. # load the JDBC package library(RJDBC) # set up the driver drv

Recommended

View more >