Keynote #2 - Operability in Hadoop Ecosystem @ ABDW17, Pune
TRANSCRIPT
Agenda
• Big Data so far
• Operability Definition
• Components of Operability
• Laws of Operability
• Guiding Principles
Big Data Journey So Far
• 1990s: Data at Rest (Databases, Scale-Up)
• 2007-2009: Batch (MapReduce, Scale-Out)
• 2015-2016 …Today: Data in Motion (Real-Time Streaming, Scale-Out)
Productization & Operations of Big Data
• Big Data is neither Productized nor Operationalized
• Total Cost of Ownership (TCO) = Cost to Develop + Cost to Launch + Cost of Ongoing Operations
• Time to Value = Time to Develop + Time to Test/Launch + Time to Continue Extracting Value
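A back-of-the-envelope sketch of the TCO formula in Python; every figure below is a hypothetical placeholder, not a number from the talk:

# Hypothetical cost figures (USD); real numbers vary widely by project.
cost_to_develop = 500_000   # build the pipeline
cost_to_launch  = 150_000   # test, certify, and launch it
cost_to_operate = 400_000   # ongoing operations over the product's life

tco = cost_to_develop + cost_to_launch + cost_to_operate
print(f"TCO: ${tco:,}")  # TCO: $1,050,000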
Operability Definition
• Can the enterprise operate the product/application to meet its SLA under the planned total cost of ownership?
Operability Components
• SLA
  • Latency
  • Resources
  • Uptime
• Fault Tolerance and High Availability
• SecOps: Security and Certifications; Laws
• Resource Cost: Scalability and Performance
• DevOps: Ease of Integration and native operational support
• Operational Expertise
• Maintenance: Ease of Upgrading and Backward Compatibility
Laws of Operability for a Pipeline
• What are the laws of operability?
• Why are these laws even needed?
  • Measure
  • Predict/Forecast
  • Make Architectural Decisions
  • Evaluate the Impact of Native Hadoop Applications on Operations
Uptime
1: Non-Native Hadoop Pipeline
   Job A (Cluster X) → Job B (Hadoop) → Job C (Cluster Y), each hop with Uptime = 95%
2: Native Hadoop Pipeline
   Job A → Job B → Job C, all on Hadoop, Uptime = 95%

Uptime for Non-Native Hadoop Pipeline
• Cluster X: 365 * .95 ≈ 347 days
• Hadoop: 347 * .95 ≈ 329 days
• Cluster Y: 329 * .95 ≈ 312 days
• Downtime of ~52 days a year

Uptime for Native Hadoop Pipeline
• Hadoop: 365 * .95 ≈ 347 days
• Downtime of ~18 days a year
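A minimal sketch of the uptime arithmetic above, assuming (as the slide does) 95% uptime per hop:

# Pipeline uptime is the product of the uptimes of its hops.
def pipeline_uptime_days(hop_uptimes, days_per_year=365):
    days = days_per_year
    for u in hop_uptimes:
        days *= u
    return days

non_native = pipeline_uptime_days([0.95, 0.95, 0.95])  # Cluster X -> Hadoop -> Cluster Y
native = pipeline_uptime_days([0.95])                  # everything runs on Hadoop

print(f"Non-native: up {non_native:.0f} days, down {365 - non_native:.0f} days")  # down ~52
print(f"Native:     up {native:.0f} days, down {365 - native:.0f} days")          # down ~18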
Cost Structure
1: Non-Native Hadoop Pipeline
   Job A (Cluster X) → Job B (Hadoop) → Job C (Cluster Y)
2: Native Hadoop Pipeline
   Job A → Job B → Job C, all on Hadoop

Cost of Non-Native Hadoop Pipeline =
  Resources needed for Cluster X
  + Resources needed for Hadoop
  + Resources needed for Cluster Y

Cost of Native Hadoop Pipeline =
  Resources needed for Hadoop (already invested)

Resources = Machines (Hardware + Software) + Human (Expertise)
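The same additive cost law as a toy calculation; every dollar figure here is a hypothetical placeholder:

# Per-cluster resource cost = Machines (hardware + software) + Human (expertise).
cluster_x = 200_000 + 120_000   # machines + humans, per year
hadoop    = 300_000 + 150_000   # already invested in either design
cluster_y = 180_000 + 120_000

non_native_cost = cluster_x + hadoop + cluster_y  # three clusters to buy and staff
native_cost     = hadoop                          # reuses the existing Hadoop investment

print(f"Non-native: ${non_native_cost:,}/year")   # $1,070,000
print(f"Native:     ${native_cost:,}/year")       # $450,000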
Single Point of Failure
1: Non-Native Hadoop Pipeline
   Job A (Cluster X) → Job B (Hadoop) → Job C (Cluster Y)
2: Native Hadoop Pipeline
   Job A → Job B → Job C, all on Hadoop

No Single Point of Failure, Pipeline 1 = No (Cluster X) and Yes (Hadoop) and Yes (Cluster Y) = No
No Single Point of Failure, Pipeline 2 = Yes (Hadoop) = Yes
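The boolean composition above, sketched in code; the per-cluster statuses mirror the slide's example:

# A pipeline has no single point of failure only if every hop is SPOF-free.
def no_single_point_of_failure(hops):
    return all(hops.values())

pipeline_1 = {"Cluster X": False, "Hadoop": True, "Cluster Y": True}  # non-native
pipeline_2 = {"Hadoop": True}                                         # native

print(no_single_point_of_failure(pipeline_1))  # False
print(no_single_point_of_failure(pipeline_2))  # True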
Ease of Integration and DevOps
1: Non-Native Hadoop Pipeline
   Job A (Cluster X) → Job B (Hadoop) → Job C (Cluster Y)
2: Native Hadoop Pipeline
   Job A → Job B → Job C, all on Hadoop

Fully Integrated with DevOps Tools, Pipeline 1 = No (Cluster X) and Yes (Hadoop) and Yes (Cluster Y) = No
Fully Integrated with DevOps Tools, Pipeline 2 = Yes (Hadoop) = Yes
Laws of Operability
A pipeline of Jobs 1…n spread across Clusters 1…n (the infrastructure for Big Data pipeline processing):
• Uptime = U1 * U2 * … * Un
• Cost = C1 + C2 + … + Cn
• No Single Point of Failure = S1 and S2 and … and Sn
• Easy to Integrate = I1 and I2 and … and In
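A minimal sketch that puts all four laws together; the Cluster record and its figures are hypothetical, introduced only to illustrate how the laws compose across n hops:

from dataclasses import dataclass
from math import prod

@dataclass
class Cluster:
    uptime: float       # U_i: fraction of the year the cluster is up
    cost: float         # C_i: yearly resource cost (machines + humans)
    no_spof: bool       # S_i: True if the cluster has no single point of failure
    integrated: bool    # I_i: True if the cluster is wired into DevOps tooling

def pipeline_operability(clusters):
    return {
        "uptime": prod(c.uptime for c in clusters),                 # U1 * U2 * ... * Un
        "cost": sum(c.cost for c in clusters),                      # C1 + C2 + ... + Cn
        "no_spof": all(c.no_spof for c in clusters),                # S1 and S2 and ... and Sn
        "easy_to_integrate": all(c.integrated for c in clusters),   # I1 and I2 and ... and In
    }

# The three-hop non-native pipeline from the earlier slides:
print(pipeline_operability([
    Cluster(0.95, 320_000, no_spof=False, integrated=False),  # Cluster X
    Cluster(0.95, 450_000, no_spof=True,  integrated=True),   # Hadoop
    Cluster(0.95, 300_000, no_spof=True,  integrated=True),   # Cluster Y
]))

Each extra hop multiplies uptime down, adds cost, and can only flip the two booleans from yes to no, which is the quantitative reading of the guiding principles below.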
Cost Structure for Big Data Products (Fortune 50)
Functional Design
• Read File
• Hash Join
• Apply Rules
• Write Results
• Upload Results
Operational Design
• Parallel read
• Parallel write
• Skew analysis of entire data flow
• Which design meets SLA
• Analyze single point of failure
• Bottleneck analysis: Data, Compute, CPU, Memory, Disk, I/O
• Node outage, fault tolerance. Data center outage?
• Multi-Tenancy: Multiple apps at the same time
• Uptime analysis
• Infrastructure requirements and design (Hadoop grid, and node design)
• Error handling
• Alerts, and escalation policy
• Integration with current monitoring: Webservices
• Launch runbooks
• Testing and certification design+infrastructure
• Upgrade path, runbooks
• Audit: Intermediate results
• Versioning and backward compatibility
• Expertise, and outsourcing
• Support structure, escalation
• Pre and Post launch support. Ongoing cost
• DevOps Training
• Security and Access
• …
Functional Cost < 20%; Operational Cost > 80%
Guiding Principles of Operability
A pipeline of Jobs 1…n spread across Clusters 1…n (the infrastructure for Big Data pipeline processing):
• Cost = at most 20% Functional, at least 80% Operational
• Operability has to be a first-class citizen of the platform; it cannot be slapped on
• Operability is inversely proportional to the number of hops (clusters) in a pipeline
• Operability is vastly higher when taken care of by the platform as opposed to user code
• Operability is a design decision on day one
Operability – The Graveyard of Big Data Projects