Full Speed Ahead: The Briefing Room with John Myers and MapR
TRANSCRIPT
Twitter Tag: #briefr The Briefing Room
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
September: HADOOP 2.0
October: DATA MANAGEMENT
November: ANALYTICS
Analyst: John Myers
John Myers is Managing Research Director at Enterprise Management Associates
MapR
MapR develops Apache Hadoop-related software
Its Hadoop distribution boasts data protection, no single point of failure and industry-leading performance
The MapR distribution also features the complete Apache Spark stack, including Spark SQL, Spark Streaming, MLlib and GraphX
Guest: Sameer Nori
Sameer Nori is the Senior Product Marketing Manager for MapR
®© 2015 MapR Technologies 2
Agenda
1. Customer Requirements
2. Hadoop ecosystem and The MapR Data Platform
3. Evolution of SQL-on-Hadoop
4. Customer Examples
MapR Architected a Platform for the Age of Big Data
[Figure: evolution of apps, databases, operational app platform and storage across the 1980s, 2000s and 2010s. Labels shown: Monolithic, Web, Big data apps, RDBMS, UNIX, Linux, SAN/NAS, Scale out, Structured, Unstructured, Operational, Analytics.]
What MapR Customers Demand
1. Efficiency at scale
– Multi-tenancy: ability to support multiple teams/projects on one platform
– Resource management: MUST support Hadoop and non-Hadoop workloads
2. Real-time: MUST support real-time and batch workloads on one cluster
3. Reliable
– Business continuity
– Must meet SLAs
4. Secure
– MUST integrate with existing security & data governance standards
5. Agile
– MUST support governed and exploratory BI on one platform
Architecting for Production Success
[Timeline figure with markers at 2004, 2006, 2009, 2011, 2013 and 2015. Milestones shown:]
• Google publishes details of GFS
• Hadoop developed at Yahoo!
• MapR in stealth
• MapR becomes Hadoop technology leader
• MapR-DB – real-time, in-Hadoop DB
• MapR 5.0 – Extending Real-time beyond Hadoop for Big Data Apps
Built for the enterprise. Built for today’s use cases. Built for as-it-happens, agile businesses.
High Availability (HA) Everywhere
No NameNode architecture
• Distributed metadata can self-heal
• No practical limit on # of files
MapReduce/YARN HA
• Jobs are not impacted by failures
• Meet your data processing SLAs
NFS HA
• High throughput and resilience for NFS-based data ingestion, import/export and multi-client access
Instant recovery
• Files and tables are accessible within seconds of a node failure or cluster restart
Rolling upgrades
• Upgrade the software with no downtime
HA is built in
• No special configuration to enable HA
• All MapR customers operate with HA
Disaster Recovery: Mirroring
• Flexible
– Choose the volumes/directories to mirror
– You don’t need to mirror the entire cluster
– Any remote cluster can run active volumes mirrored to other clusters
– Scheduled/incremental to set low RPO
– Promotable mirrors to set low RTO
• Fast
– No performance impact
– Block-level (8KB) deltas
– Automatic compression
• Safe
– Point-in-time consistency
– End-to-end checksums
• Easy
– Graceful handling of network issues
– No third-party software
– Takes less than two minutes to configure!
[Figure: mirroring topologies. Labels shown: Production, WAN, Production, Research, Datacenter 1, Datacenter 2, WAN, EC2.]
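The idea of block-level (8KB) deltas on the mirroring slide can be illustrated with a small Python sketch. This is not MapR's implementation, which the slides do not describe; it only assumes that each 8 KB block is compared by checksum and that only changed blocks are shipped to the mirror. The function names `block_digests`, `changed_blocks` and `apply_delta` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8 KB, the delta granularity named on the slide


def block_digests(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Map block index -> new block contents for every block that differs."""
    old_digests = block_digests(old, block_size)
    delta = {}
    for index, digest in enumerate(block_digests(new, block_size)):
        if index >= len(old_digests) or digest != old_digests[index]:
            start = index * block_size
            delta[index] = new[start:start + block_size]
    return delta


def apply_delta(old, delta, new_length, block_size=BLOCK_SIZE):
    """Rebuild the new contents on the mirror from the old copy plus the delta."""
    block_count = (new_length + block_size - 1) // block_size
    blocks = []
    for index in range(block_count):
        start = index * block_size
        blocks.append(delta.get(index, old[start:start + block_size]))
    return b"".join(blocks)[:new_length]


# Three zero-filled blocks on the source; one byte changes in the second block.
source = bytes(3 * BLOCK_SIZE)
updated = bytearray(source)
updated[BLOCK_SIZE + 10] = 0xFF
updated = bytes(updated)

delta = changed_blocks(source, updated)      # only block 1 needs to travel
mirror = apply_delta(source, delta, len(updated))
```

In this sketch a one-byte change sends a single 8 KB block across the WAN instead of the whole file, which is the intuition behind incremental mirroring with a low RPO.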
Multi-tenancy
Isolation
• Tasks sandboxed so they don’t impact other tasks or system daemons
• System resources protected from runaway jobs
• Volume-based data placement
• Label-based job scheduling
Quotas
• Storage quotas by volume/user/group
• CPU and memory quotas by queue/user/group
Security and delegation
• Wire-level authentication and encryption (Kerberos not required)
• Fine-grained administration permissions including volume-level delegation
• Authenticate users to AD, LDAP and Kerberos via Linux PAM
Reporting
• Detailed reporting on resource usage (75+ different metrics)
• All reports are available via UI, CLI and REST API
Data Increasingly Stored in Non-Relational Datastores
[Figure: timeline from 1980 to 2020 comparing relational databases with non-relational datastores]
• Volume: GBs-TBs (relational) vs. TBs-PBs (non-relational)
• Database: fixed schema, DBA controls structure (relational) vs. dynamic / flexible schema, application controls structure (non-relational)
• Structure: structured (relational) vs. structured, semi-structured and unstructured (non-relational)
• Development: planned, release cycle = months-years (relational) vs. iterative, release cycle = days-weeks (non-relational)
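The contrast between a fixed schema and a dynamic, application-controlled schema can be illustrated with a short Python sketch. It is not taken from the presentation: the sample records and the helper `infer_schema` are hypothetical, and the sketch only shows how structure can be discovered from the data at read time rather than declared up front by a DBA.

```python
import json


def infer_schema(records):
    """Return {field: sorted list of type names seen} across all records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}


# Hypothetical raw JSON lines whose structure varies from record to record.
raw_lines = [
    '{"user": "a17", "clicks": 3}',
    '{"user": "b42", "clicks": 5, "referrer": "search"}',
    '{"user": "c03", "device": {"os": "linux"}}',
]

records = [json.loads(line) for line in raw_lines]
schema = infer_schema(records)
```

New fields such as `referrer` or `device` simply appear in the inferred schema when the application starts writing them; no table definition has to change first.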
Drill’s Role in the Enterprise Data Architecture
Raw data
• JSON, CSV, ...
“Optimized” data
• Parquet, …
Centrally-structured data
• Schemas in Hive Metastore
Relational data
• Highly-structured data
Hive, Impala, Spark SQL
Oracle, Teradata
Exploration (known and unknown questions)
Drill is Designed for a Wide Set of Use Cases
• Use cases: raw data exploration, JSON analytics, data hub analytics, …
• Data sources: Hive, HBase, files, directories, …
• File formats: JSON, Parquet, text files, …
…
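As a rough illustration of raw data exploration with Drill, the sketch below packages a SQL query over a JSON file as a request for Drill's REST interface. The host, port and file path are assumptions for a local test setup and are not taken from the presentation, and the helper `build_drill_request` is hypothetical. The request is built but not sent, so the sketch runs without a Drill instance.

```python
import json
import urllib.request

# Assumed address of a local Drill web endpoint; adjust for a real cluster.
DRILL_URL = "http://localhost:8047/query.json"


def build_drill_request(sql, url=DRILL_URL):
    """Package a SQL string as a JSON POST request for Drill's REST interface."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical raw JSON file queried in place, with no schema defined up front.
sql = "SELECT t.`user`, t.clicks FROM dfs.`/data/clickstream.json` t LIMIT 10"
request = build_drill_request(sql)

# Sending is left commented out because it needs a running Drill instance:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)["rows"]
```

The point of the sketch is that the query names a file path directly; no table has to be created or loaded before the data can be explored.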
Cisco: 360° Customer View
Cisco uses integrated customer data to increase revenues
OBJECTIVES
• Create shared view of customer & operations across 75,000 employees
• Increase revenue opportunities with sales partners
CHALLENGES
• Customer information was siloed in different divisions
• Customer interactions were inconsistent and not satisfying
• Missed opportunities for upselling/cross selling
SOLUTION
• Use MapR to collect customer information across touch points
• Integrate billing, support, manufacturing, social media, websites, dial-in data
• Generate new sales leads internally and for partners
Business Impact
Cisco was able to analyze service sales opportunities in 1/10 the time, at 1/10 the cost, and generated $40 million in incremental service bookings in the first year.
[Figure: Architecture for Sales Partner Opportunities]
Cisco Data Platforms Reference Architecture
“The entire market is starting to realize that data is everywhere and an agile ecosystem is paramount. The marketplace demands the flexibility to meet specific needs and decisions are being made based on how well the ecosystem players are integrated.”
Arvind Bedi, Director IT, Cisco Systems
[Architecture diagram. Labels shown:]
• Data Sources: databases; docs, cases, content, social media, clickstream; ERP; SFDC; all other sources
• MapR Data Platform (Data Storage and Processing, Big Data Platform): MapR Distribution for Hadoop with Streaming (Spark Streaming, Storm), Batch (MR, Spark, Hive, Pig, …), Interactive (Drill, Impala), MapR-DB and MapR-FS; SAP HANA on UCS
• Data Consumption (Mobile/Browser/Data Service): agile analytics; self service dashboard; rapid business model; data exploration; real time predictive; mission critical reporting; mission critical operational reports; financial reporting & extract
• Other labels: data security, infrastructure; customer network, product usage; Internet of Everything (IoE); data analysis, text analytics; machine learning, statistical analysis; machine data insights; financials, stable core, controlled change; Network of Trust
comScore: Internet Analytics and Ad Optimization
comScore delivers insights about online consumer behavior
OBJECTIVES
• Provide digital analytics services: syndicated and custom solutions in audience measurement, e-commerce, advertising, video & mobile
CHALLENGES
• Keeping up with data. In the past 5 years, comScore’s volume of new data/month has grown from 100 billion to 1.7 trillion records
SOLUTION
• comScore chose MapR for NFS, performance, operational efficiency
• MapR processes over 1.7 trillion Internet and mobile records/month, reaching more than 90% of the Internet population
• MapR streaming writes eliminated Cassandra staging cluster cost
Business Impact
“HDFS is great internally, but to get data in and out of Hadoop, you have to do some kind of HDFS export. With MapR, you can just mount [HDFS] as NFS and then use native tools whether they’re in Windows, Unix, Linux or whatever.” - Mike Brown, comScore CTO
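The comScore quote about mounting the cluster over NFS and using native tools can be pictured with a small Python sketch. It assumes the cluster is exposed as an ordinary directory; a temporary directory stands in for the real mount point (for example a path such as /mapr/<cluster-name>, which is an assumption here), and the file name and the helper `count_rows` are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def count_rows(path):
    """Count data rows in a CSV file using only standard file I/O."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# Stand-in for an NFS mount point; a temporary directory keeps the
# sketch runnable on any machine.
mount_point = Path(tempfile.mkdtemp())
export_file = mount_point / "daily_records.csv"
export_file.write_text("site,visits\nexample.com,120\nexample.org,45\n")

row_count = count_rows(export_file)
```

Nothing in `count_rows` is Hadoop-specific. Once the data sits behind a mounted path, any tool that reads files can work with it, with no separate export step.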
Getting Started with MapR
On-Demand Training: https://www.mapr.com/training
MapR Sandbox: https://www.mapr.com/sandbox
Discussion Questions
• What sets Apache Drill apart from other SQL-on-Hadoop options? Several are either in “development” or available with standard distributions.
• How does Spark work with MapReduce to provide both “high speed” and “high capacity”? Many business users “want it all and they want it now”…
© 2015 Enterprise Management Associates, Inc. Slide 18
Discussion Questions
• Data sets without a “structure,” or with variable, multi-structured content, cause issues for SQL toolsets. How does MapR approach the ingestion of those variable sources before they are “finalized” or during times of flux?
• Continuous data streams are becoming more important as part of sensor and IoT use cases. How does MapR handle the truly real-time aspects of data ingestion as well as data query?
Discussion Questions
• EMA research is showing the growth of data democratization or the penetration of data “work” and decision making in organizations. How many users of MapR environments are business stakeholders vs technologists?
Upcoming Topics
www.insideanalysis.com
September: HADOOP 2.0
October: DATA MANAGEMENT
November: ANALYTICS