© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
Cisco IT’s Hadoop Adventure with MapR
Robert Novak, Cisco Big Data Partner CSE, October 2015
With thanks to Alex Garbarini and Virendra Singh, Cisco IT
• ~20 year full stack sysadmin (retired)
• 10+ year big data admin (Hadoop since 2009)
• Cisco UCS C-Series early adopter (2011)
• Victor Kiam of Big Data on UCS (kinda)
• Cisco Big Data Safari Tour Guide since 2014
• Blog at rsts11.com and Cisco Blogs
• Twitter: @gallifreyan
• Also found at: linkedin.com/in/rnovak
• Motivation for Hadoop Deployment at Cisco
• Hadoop Platform
• Timeline
• Key Decisions / Lessons Learned
• Data Lake Considerations
• Use Cases
• The Quiz
In 2011-2012, Data Wars Were Beginning
• Growing beyond 20 years of database-based application development
  - Cost Structure
  - Development Methodology & Project Lifecycle
  - Programming Model
  - Maturity Curve of the Technology Is Different
• FUD – Fear, Uncertainty and Doubt
• Availability of agile, skilled workforce
• Reuse legacy skills and apply new tools
• Rapid pace of innovation and constantly changing industry dynamics
Cisco IT’s Hadoop Adventure (so far)
• 2011: POCs
• July 2012: Multi-tenant Shared Platform
• Starting 2013: Use Cases Deployment
• 2014: Enterprise Data Lake
• Growth & Expanding Ecosystem …to infinity and beyond!
Key Decisions and Rationale
• Open Source vs Distribution: Operational Excellence, Availability, Performance, Skill Set
• Architecture: Cisco UCS Integrated Infrastructure (scalability, sustainability, repeatable performance)
• Ecosystem: Hive (SQL), Mahout, HBase, Spark
• Environment Lifecycle: Production, Stage, Development & Technical POC (isolate usage by risk and development lifecycle)
• Data Lake: Data Governance, reduce cost, increase efficiency, eliminate duplication and silos
Lessons from Technology Journey
• Architecture and Distribution Considerations
  - Multi-tenant
  - Mission Critical Features
  - Plan For Scalability, Expect The Unexpected
• Support: Open Source or Distribution
  - Opex Or Capex, Licenses Or Headcount, You Pay One Way Or Another
• Leverage Skills
  - Components That Empower Users’ Existing Skills, Like Informatica and SQL
Lessons from Technology Journey
• Hive doesn’t support ANSI SQL
  - Reusable UDFs for Hive were created
• Workload management
  - Cisco uses Tidal Enterprise Scheduler for workload management and error handling
• Hadoop scales linearly
  - Our platform grew 100% in the first year
  - Invest in architecture that enables rapid growth with minimal pain
Shared Data ⇔ Rich Analytics
Supply Chain
Engineering
Finance
Advanced Services
Marketing
Sales
IT
Cisco Services
Security
Enterprise Platform(s)
Enterprise Data Lake
• Metadata-driven utilities to automate data ingestion
• Access control driven by metadata
• Scalable
• Cost effective
• Unify resources
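The slide does not describe how the metadata-driven utilities were built. As a rough illustration of the idea only, the sketch below keeps one catalog entry per source table and derives both a Sqoop import command and a read-access decision from it. The `DatasetMeta` class, the `CATALOG` entry, the JDBC URL, the paths, and the group names are all hypothetical; only the Sqoop options `--connect`, `--table`, and `--target-dir` are real.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetMeta:
    """Hypothetical catalog entry describing one source table."""
    name: str
    jdbc_url: str
    table: str
    target_dir: str
    allowed_groups: frozenset = field(default_factory=frozenset)


def sqoop_import_args(meta: DatasetMeta) -> list:
    """Build a Sqoop import command line purely from the catalog entry."""
    return [
        "sqoop", "import",
        "--connect", meta.jdbc_url,
        "--table", meta.table,
        "--target-dir", meta.target_dir,
    ]


def can_read(meta: DatasetMeta, user_groups) -> bool:
    """Grant read access only if the user shares a group with the dataset."""
    return bool(meta.allowed_groups & set(user_groups))


# Hypothetical catalog: adding a dataset means adding metadata, not code.
CATALOG = [
    DatasetMeta(
        name="orders",
        jdbc_url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",
        table="ORDERS",
        target_dir="/datalake/raw/orders",
        allowed_groups=frozenset({"finance", "supply_chain"}),
    ),
]

for meta in CATALOG:
    print(" ".join(sqoop_import_args(meta)))
```

The point of the design is that ingestion and access control read the same record, so a new source is onboarded by registering metadata rather than by writing a new job.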
Cisco IT Use Cases for Hadoop in Production
• Migrate ETL Processing from EDW
• Data Lake & Ad Hoc Data Analysis
• Data Archiving
• Customer Segmentation
• Multi-Channel Scoring
• Content Auto-tagging
• Smart Analytics Offerings
• Service Opportunity Identification
• Organization Network Analytics
• Engineering Source Code Monitoring
• Access Control and Auditing
Categories: Data Platform Option to Reduce Cost; Marketing & Content Management; Services; Risk & Compliance
Operational Cost / Productivity
• Distributed Name Node
• Snapshots
• Volume Based Disaster Recovery
• Higher performance and fewer nodes ($)
• More efficient use of resources, scale with data and processing
• HBase, MapR-DB, and Hadoop on the same cluster
• NFS (Fully Read & Write)
• Multiple versions of components on the same cluster
Performance
High Availability
Data Platform Reference Architecture v3
[Diagram; recoverable labels are listed below]
• Data Sources: Databases; ERP; SFDC; Customer Registry; Customer Network, Product Usage; Docs, Cases, Content, Social Media, Clickstream; IT App & System Logs & Config.; Internet of Everything (IoE); all other sources
• Big Data Platform (Hadoop & Spark on UCS): Machine Learning, Data Archiving, Data Science
• Mission Critical Reporting (Legacy EDW): Financial SSOTs, Stable core, Controlled Change
• Agile Analytics (SAP HANA on UCS): Predictive Engine, Real time BI
• Cisco Data Virtualization (Composite): Logical Data Abstraction Layer across transactional, SaaS, Big Data & DW; Rapid Prototyping / Data Integration / Data Services
• Data Consumption: Self Service Dashboard; Rapid Business Intell.; Data Exploration; Mission Critical Operational Reports; Financial Reporting & Extract; Operational Intelligence (Splunk UI); Real time Predictive; Data Analysis, Text Analytics; Machine Learning, Statistical Analysis (R); Machine Data Insights (e.g. in supply chain)
• Other labels: Operational Intelligence; Index & Search; SAS; Data Storage and Processing (Hadoop & Spark, HANA); Analytics & Modeling; Network of Truth; Experience Toolkit (Mobile / Browser / Data Service)
Thank you.
Cisco Hadoop Platform – Physical Architecture
Multi UCS cluster Hadoop environment; Multi-Tenant model for PROD and DEV/Stage

Hardware
• Cisco Unified Computing System C240 M3
• Cisco UCS 62XXUP Fabric Interconnects (per domain)
• Cisco Nexus 2232PP 10 GE Fabric Extenders (per rack)
• N7K
• Uplinks: 8 x 10 Gb/s each (80 Gb/s)

Benefits
• Scalability
• High Performance
• Unified Management
• Operational Simplicity
• High Availability

Components | Details
OS | RHEL 6.4
Distribution | MapR (M7)
Server (node) | UCS C240 M3, 16 cores (32 with Hyper-Threading)
Processor | E5-2655
Memory/Node | 256 GB
Storage/Node | 24*1 TB (22 HDFS)
No-SQL | HBase (MapR M7)
Services | ZooKeeper, CLDB, WebServer, JobTracker: 3 nodes each; File Server, TaskTracker: across all nodes; Platfora: 4 nodes

Production Capacity
No. of Nodes | 54
Cores | 864 (Hyper-Threading enabled)
Total Memory | 13824 GB
Storage | 1188 TB
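The production totals on this slide follow directly from the per-node figures. The sketch below redoes that arithmetic using only numbers stated on the slide; the constant and function names are illustrative. Note that 864 matches 54 nodes x 16 physical cores, so the Hyper-Threading thread count would be twice that.

```python
# Per-node and per-cluster figures as stated on the slide.
NODES = 54
CORES_PER_NODE = 16          # physical cores per node
THREADS_PER_CORE = 2         # Hyper-Threading enabled
MEMORY_GB_PER_NODE = 256
DATA_DISKS_PER_NODE = 22     # 22 of the 24 disks per node hold data
DISK_TB = 1
UPLINKS = 8
UPLINK_GBPS = 10


def production_totals() -> dict:
    """Derive cluster-wide capacity from the per-node figures."""
    cores = NODES * CORES_PER_NODE
    return {
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "memory_gb": NODES * MEMORY_GB_PER_NODE,
        "storage_tb": NODES * DATA_DISKS_PER_NODE * DISK_TB,
        "uplink_gbps": UPLINKS * UPLINK_GBPS,
    }


totals = production_totals()
for key, value in totals.items():
    print(f"{key}: {value}")
```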
Hadoop Lifecycles

Components | POC | DEV | QA | Production

Software
OS | RHEL 6.4 | RHEL 6.4 | RHEL 6.4 | RHEL 6.4
Hadoop Distribution | MapR M7 3.1.0 | MapR M7 3.1.0 | MapR M7 3.1.0 | MapR M7 3.1.0

Server-Cluster
Cisco UCS Servers | UCS C210 M2 | UCS C210 M2 / C240 M3 | UCS C240 M3 | UCS C240 M3
Processor | Intel® Xeon® X5675 | Intel® Xeon® X5675 | Intel® Xeon® X5675 | Intel® Xeon® E5-2655
Memory per Node | 48 GB | 48 GB / 256 GB | 256 GB | 256 GB
Storage per Node (HDFS) | 14*1 TB 7200 RPM SATA | 14*1 TB / 22*1 TB 7200 RPM SATA | 22*1 TB 7200 RPM SATA | 22*1 TB 7200 RPM SATA

Rack Level
No. of Nodes | 4 | 18 | 8 | 54
Processors/Cores | 48 | 240 | 128 | 864
Memory | 4x48 = 192 GB | 12x48 + 6x256 GB | 8x256 GB | 54x256 = 13.8 TB
Storage Capacity (3-way replication, compression) | 4x18 = 72 TB | 12x14 + 6x22 = 257 TB | 150 TB | 1188 TB
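The table gives memory as products and storage as totals alongside a note on 3-way replication. As a rough sketch, the code below evaluates the memory expressions and estimates how much unique data each environment could hold if the storage figures are raw capacity and every block is stored three times. Treating the figures as raw, and ignoring compression and filesystem overhead, are assumptions; the names are illustrative.

```python
REPLICATION_FACTOR = 3  # "3 way Replication" from the table

# Memory expressions exactly as written in the table, in GB.
MEMORY_GB = {
    "POC": 4 * 48,
    "DEV": 12 * 48 + 6 * 256,
    "QA": 8 * 256,
    "Production": 54 * 256,
}

# Storage totals as stated in the table, in TB (assumed to be raw capacity).
RAW_TB = {"POC": 72, "DEV": 257, "QA": 150, "Production": 1188}


def usable_tb(raw_tb: float, replication: int = REPLICATION_FACTOR) -> float:
    """Unique data that fits when every block is stored `replication` times."""
    return raw_tb / replication


usable = {env: usable_tb(raw) for env, raw in RAW_TB.items()}

for env in RAW_TB:
    print(f"{env}: {MEMORY_GB[env]} GB memory, "
          f"{RAW_TB[env]} TB raw, about {usable[env]:.0f} TB before compression")
```

Compression would raise the effective figure, so these estimates are a lower bound under the stated assumptions.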
Cisco UCS Big Data Common Platform Architecture (CPA)
A Highly Scalable Architecture Designed to Meet a Variety of Scale-Out Application Demands
• UCS Fabric Interconnects provide high-speed, fully redundant, active-active connectivity
• Unified fabric (single wire management)
  - 66% reduction in switch ports
  - 66% reduction in cables
• Powered by UCS C-Series Rack servers
  - Form factor extension to UCS blade system
• UCS Manager
  - Global view of the cluster
  - Proactive monitoring of health
  - 1 Click system software management
• UCS Central
  - Unified management across clusters (thousands of nodes)
  - Application isolation

Business Benefits
• Operational Simplification: Simplified and policy-based management
• Modular Solution: Modular framework that can scale from small to very large
• Risk Reduction: Pre-validation, tighter integration and optimizations reduce integration and deployment risk
• Lower TCO: Unified fabric, unified management and infrastructure optimized for performance lower TCO significantly

Architectural Benefits
• Scalability: Modular building block, scalable up to 7.2 PB with single management domain
• Performance: Best-in-class performance of compute and network for massively scale-out applications
• Management and Monitoring: Unified management across clusters (thousands of nodes)

Hadoop Requirements
• Distributed powerful computing
• Reliable Hardware
• Local storage in PB
• Low Latency
• Low Cost
• Scalability and Performance
• Manageability
Hadoop Use Cases: Organization (vs) Adoption Level
Production Pipeline
• EDS: iCAM; Party Ranking Service; Teradata ETL Offload; Data Lake
• CSTG: Connected Analytics Network Deployment (CAND); Smart Call Home; Cloud Consumption (Sentinel); NOS Online; Network SSOT
• Marketing: Multi-Channel Scoring; Automatic Qualified Leads
• CWCS Metadata: Content Auto-Tagging
• CITS: Cisco Partner Annuity Initiative; Social Media Services
• GIS: Collaboration Dashboard; Item, BOM & Compliance Data Analytics
• Legal: Data Warehouse Expansion
• Supply Chain: Measurement; ACTS; TST
Hadoop Platform Security Current State

Edge Servers (load balanced, job submission)
• Sqoop: A tool for moving data to/from non-Hadoop data stores
• Pig: A high-level data flow language
• Hive: SQL-like language to query and analyze data using MR
• Impala: Interactive SQL tool on Hadoop
• Mahout: Data mining algorithms using MR
• R: Statistical & machine learning language
• Oozie: A job control workflow
• Flume: Tool to ingest/stream log data
• TES Agent: To allow scheduled jobs to execute

Cluster services
• CLDB, MapR-FS, JobTracker, ZooKeeper

[Diagram; recoverable labels are listed below]
• Users and clients: Hadoop Developer / Data Analyst; Hadoop Admins; Business User; Tableau Dashboards; Pentaho BI & DI Platform; iCAM Servers
• Access paths: Secure Shell Login; Generic User ID; Admin; Replication
• Controls: ACL to limit access; Used for Authentication; Port opened for Hadoop Services (CLDB, JobTracker, File System & ZooKeeper)