cascading concurrent yahoo lunch_nlearn
Post on 14-Jul-2015
102 Views
Preview:
TRANSCRIPT
DRIVING INNOVATION THROUGH DATABUILDING PRODUCTION-GRADE HADOOP APPLICATIONS WITH CASCADINGSupreet Oberoi VP Field Engineering, Concurrent Inc
ABOUT ME
2
• I am a Data Engineer, not a Data Scientist
• I help Enterprises develop decisions on building their “Big Data” roadmap and technical strategy — use cases, products, technology decisions, employee skills
• I design Hadoop applications with the intent to operationalize them in Enterprise settings — applications on which business depend, and last longer than the technologies underneath them…
• This talk is about learning how to design your BD strategy that leverages best
BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN
3
• Open Language
• Open Data
• Open Hardware
• Open Compute Platform
• Open Development Platform
OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE
4
• Don’t equate architecture with language; develop architecture to support multiple
• Support SQL and SQL-like languages
• Encourage development in proven & scalable languages as Java
• Develop architecture to support change of programming languages (even for same app)
• Have common performance-management tools across all programming environments
OPEN DATA ENABLES REUSE OF DATA AND APPS
5
• Prevent exclusive access to data sets through proprietary tools
• Promote a common meta-data repository
• Forbid storing data in proprietary formats
• Build seamless integration capabilities
Develop a common operating picture by promoting reuse with open data
OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE
6
• Get commodity hardware — commodity hardware will always cost less than “optimized” specialized hardware (note: definition of “specialized” is up for debate)
• Develop and maintain a cluster that can be reused by different applications and technology stacks — avoid custom software installations on the cluster, or setting up dedicated clusters for given tech stacks
• Harness the power of collective from the cluster — avoid fragmenting the cluster if possible
OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM
7
• Ensure that moving your application from one Hadoop compute platform (e.g. MapReduce) to another (e.g., Tez) does not:
• impact application code
• impact production-monitoring tools
• Resist compute platforms that require your enterprise to acquire significantly new skills (even if it is easy) to become productive
• Avoid new platforms that partition the cluster
• Avoid platforms that do not support Open Data
Make tradeoffs between reliability & speed based on your business context
OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY
8
• Invest in picking the correct development platform — open, easy, scalable, popular, tools, …
• Bet on a sustainable open source platform
• Measure the vitality of the community:
• number of downloads, extensions (living ecosystem), extensible architecture, consumers of the technology, code stability…
Development platforms improve developer productivity and operational excellence — picking a correct platform gives you best practices developed by the community, achieving higher quality
A proven platform provides tools to get your apps to production
GET TO KNOW CONCURRENT
9
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application development and management
Products and Technology
• CASCADINGOpen Source - The most widely used application infrastructure for building Big Data apps with over 175,000 downloads each month
• DRIVEN Enterprise data application management for Big Data apps
Proven — Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.
Founded: 2008 HQ: San Francisco, CA
CEO: Gary Nakamura CTO, Founder: Chris Wensel
www.concurrentinc.com
BIG DATA — OPERATIONALIZE YOUR DATA APPS WITH CASCADING
10
“It’s all about the apps”There needs to be a comprehensive solution for building, deploying, running
and managing this new class of enterprise applications.
Business Strategy Data & Technology
ChallengesSkill sets, systems integration, standard op procedure and
operational visibility
Connecting Business and Data
DATA APPLICATIONS - ENTERPRISE NEEDS
11
Enterprise Data Application Infrastructure
• Need reliable, reusable tooling to quickly build and consistently deliver data products
• Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets
• Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA), without having to rewrite the application
• Need operational visibility for entire data application lifecycle
Cascading Apps
CASCADING - DE-FACTO FOR DATA APPS
12
New Fabrics
ClojureSQL Ruby
StormTez
System Integration
Mainframe DB / DW Data Stores HadoopIn-Memory
• Standard for enterprise data app development
• Your programming language of choice
• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …
WORD COUNT EXAMPLE WITH CASCADING
13
String docPath = args[ 0 ];String wcPath = args[ 1 ];Properties properties = new Properties();AppProps.setApplicationJarClass( properties, Main.class );HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
configuration
integration// create source and sink tapsTap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
processing
// specify a regex to split "document" text lines into token streamFields token = new Fields( "token" );Fields text = new Fields( "text" );RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );// only returns "token"Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );// determine the word countsPipe wcPipe = new Pipe( "wc", docPipe );wcPipe = new GroupBy( wcPipe, token );wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
scheduling
// connect the taps, pipes, etc., into a flow definitionFlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap );// create the FlowFlow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of WorkwcFlow.complete(); // <<-- Runs jobs on Cluster
• Functions • Filters • Joins ‣ Inner / Outer / Mixed ‣ Asymmetrical / Symmetrical
• Merge (Union) • Grouping ‣ Secondary Sorting ‣ Unique (Distinct)
• Aggregations ‣ Count, Average, etc
SOME COMMON PATTERNS
14
filter
filter
function
functionfilterfunctiondata
PipelineSplit Join
Merge
data
Topology
• Java API
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
CASCADING
15
Process Planner
Processing API Integration APIScheduler API
Scheduler
Apache Hadoop
Cascading
Data Stores
ScriptingScala, Clojure, JRuby, Jython, Groovy
Enterprise Java
THE STANDARD FOR DATA APPLICATION DEVELOPMENT
16
www.cascading.org
Build data apps that are
scale-free
Design principals ensure best practices at any scale
Test-Driven Development
Efficiently test code and process local files before
deploying on a cluster
Staffing Bottleneck
Use existing Java,Scala, SQL, modeling skill sets
Operational Complexity
Simple - Package up into one jar and hand to
operations
Application Portability
Write once, then run on different computation
fabrics
Systems Integration
Hadoop never lives alone. Easily integrate to existing
systems
Proven application development framework for building data apps
Application platform that addresses:
STRONG ORGANIC GROWTH
17
175,000+ downloads / month7000+ Deployments
CASCADING DATA APPLICATIONS
18
Enterprise ITExtract Transform Load
Log File Analysis Systems Integration Operations Analysis
Corporate AppsHR Analytics
Employee Behavioral Analysis Customer Support | eCRM
Business Reporting
TelecomData processing of Open Data
Geospatial Indexing Consumer Mobile Apps Location based services
Marketing / RetailMobile, Social, Search Analytics
Funnel Analysis Revenue Attribution
Customer Experiments Ad Optimization
Retail Recommenders
Consumer / EntertainmentMusic Recommendation Comparison Shopping Restaurant Rankings
Real Estate Rental Listings
Travel Search & Forecast
FinanceFraud and Anomaly Detection
Fraud Experiments Customer Analytics
Insurance Risk Metric
Health / BiotechAggregate Metrics For Govt
Person Biometrics Veterinary Diagnostics Next-Gen Genomics
Argonomics Environmental Maps
BUSINESSES DEPEND ON US
19
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for
use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework
BUSINESSES DEPEND ON US
20
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
BUSINESSES DEPEND ON US
21
• Scalding (Scala)
• Makes complex analysis of very large data sets simple
• Machine learning, linear algebra to improve
• 30,000 jobs a day — this works @ scale
• Ad quality (matching users and ad effectiveness)
Confidential
CASCADING DEPLOYMENTS
2222
BROAD SUPPORT
23
Hadoop ecosystem supports Cascading
… AND INCLUDES RICH SET OF EXTENSIONS
24
http://www.cascading.org/extensions/
CASCADING 3.0
25
“Write once and deploy on your fabric of choice.”
• The Innovation — Cascading 3.0 will allow for data apps to execute on existing and emerging fabrics through its new customizable query planner.
• Cascading 3.0 will support — Local In-Memory, Apache MapReduce and soon thereafter (3.1) Apache Tez, Apache Spark and Apache Storm
Enterprise Data Applications
MapReduceLocal In-Memory
Other Custom
Computation Fabrics
USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO HADOOP
26
• Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps
• Supports integrating with any data source that can be accessed through JDBC — Cascading Tap can be created for any source supporting JDBC
• Great for migration of data, integrating with non-Big Data assets — extends life of existing IT assets in an organization
Query Planner
JDBC API Lingual APIProvider API
Cascading
Apache Hadoop
Lingual
Data Stores
CLI / Shell Enterprise Java
Catalog
SCALDING
27
• Scalding is a language binding to Cascading for Scala • The name Scalding comes from the combining of SCALa and cascaDING
• Scalding is great for Scala developers; can crisply write constructs for matrix math…
• Scalding has very large commercial deployments at: • Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality • Ebay - Use cases include search analytics and other production data pipelines
PATTERN SCORES MODELS AT SCALE
28
• Pattern is an open source project that allows to leverage Predictive Model Markup Language (PMML) models and translate them into Cascading apps.
• PMML is an XML-based popular analytics framework that allows applications to describe data mining and machine learning algorithms
• PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows
• Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle • Open source frameworks - R, Weka, KNIME, RapidMiner
• Pattern is great for migrating your model scoring to Hadoop from your decision systems
Confidential29
PATTERN SCORES MODELS AT SCALEStep 1: Train your model with industry-leading Tools
Step 2: Score your models at scale with Pattern
OPERATIONAL EXCELLENCE WITH DRIVEN
From Development — Building and Testing• Design & Development • Debugging • Tuning
To Production — Monitoring and Tracking• Maintain Business SLAs • Balance & Controls • Application and Data Quality • Operational Health • Real-time Insights
Visibility Through All Stages of App Lifecycle
30
• Understand the anatomy of your Hive app • Track execution of queries as single business process • Identify outlier behavior by comparison with historical runs • Analyze rich operational meta-data • Correlate Hive app behavior with other events on cluster
DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
31
DRIVEN ARCHITECTURE
• Easily comprehend, debug, and tune your data applications
• Get rich insights on your application performance
• Monitor applications in real-time • Compare app performance with
historical (previous) iterations
DEEPER VISUALIZATION INTO YOUR HADOOP CODE
33
Debug and optimize your Hadoop applications more effectively with Driven
• Quickly breakdown how often applications execute based on their tags, teams, or names
• Immediately identify if any application is monopolizing cluster resources
• Understand the utilization of your cluster with a timeline of all applications running
GET OPERATIONAL INSIGHTS WITH DRIVEN
34
Visualize the activity of your applications to help maintain SLAs
• Easily keep track of all your applications by segmenting them with user-defined tags
• Segment your applications for trending analysis, cluster analysis, and developing chargeback models
• Quickly breakdown how often applications execute based on their tags, teams, or names
ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY
35
Segment your applications for greater insights across all your applications
• Invite others to view and collaborate on a specific application
• Gain visibility to all the apps and their owners associated with each team
• Simply manage your teams and the users assigned to them
COLLABORATE WITH TEAMS
36
Utilize teams to collaborate and gain visibility over your set of applications
• Identify problematic apps with their owners and teams
• Search for groups of applications segmented by user-defined tags
• Compare specific applications with their previous iterations to ensure that your application can meet its SL
MANAGE PORTFOLIO OF BIG DATA APPLICATIONS
37
Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for
• Understand the anatomy of your Hive app • Track execution of queries as single business process • Identify outlier behavior by comparison with historical runs • Analyze rich operational meta-data • Correlate Hive app behavior with other events on cluster
DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
38
• Cascading framework enables developers to intuitively create data applications that scale and are robust, future-proof, supporting new execution fabrics without requiring a code rewrite
• Scalding — a Scala-based extension to Cascading — provides crisp programming constructs for algorithm developers and data scientists
• Driven — an application visualization product — provides rich insights into how your applications executes, improving developer productivity by 10x
• Cascading 3.0 opens up the query planner — write apps once, run on any fabric
SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING
39
Concurrent offers training classes for Cascading (DEC 9) & Scalding (NOV 4)
CONTACT INFORMATION
Supreet Oberoisupreet@concurrentinc.com
650-868-7675 (m) @supreet_online
DRIVING INNOVATION THROUGH DATATHANK YOUSupreet Oberoi
DRIVING INNOVATION THROUGH DATABACKUPSupreet Oberoi
DIFFERENT PHILOSOPHY THAN GUI TOOLS
43
• Cascading is a general-purpose framework to develop data applications; supports development through the entire lifecycle of data - from staging to final data sets
• Developing with an API is more productive and intuitive than with a UI — incorporate best-practices
• Can do in three lines of code what takes 20-clicks in an app (fluent API with IDE makes it even simpler)
• Can test locally and deploy production without code change
• Because it is based in code: debuggable, extensible, deployable, traceable
• GUI tools do not help in visualizing the must-have insights
• Real-time application visualization, application bottlenecks, anatomy of the application, application dependencies, cost breakdown of an operation (join), bottlenecks due to code, data, network, cluster
Confidential44
• Hierarchical Clustering • K-Means Clustering • Linear Regression • Logistic Regression • Random Forest
algorithms extended based on customer use cases –
PATTERN: ALGOS IMPLEMENTED
CASCADING 3.0 IMPACT - DATA APP DEVELOPMENT FOR SPARK ON ROBUST FRAMEWORK
45
• Cascading 3.0 will ease application migration to Spark
• Enterprises can standardize on one API to meet business challenges and solve a variety of business problems ranging from simple to complex, regardless of latency or scale
• Third party products, data apps, frameworks and dynamic programming languages on Cascading will immediately benefit from this portability
• Even more operational visibility from development through production with Driven
BUSINESSES DEPEND ON US
46
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize add yields
• http://www.durkheimproject.org
top related