TRANSCRIPT
1/31/2012 ©MapR Technologies - Confidential 1
Big Data Analytics The Network is the Bottleneck
Data Volume Growing 44x
2010: 1.2 Zettabytes
2020: 35.2 Zettabytes
Data is Growing Faster than Moore’s Law
Business Analytics Requires a New Approach
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
IDC Digital Universe Study 2011
The Next Generation Distribution
• Complete Distribution for Apache Hadoop
• Integrated, tested, hardened
• Supported
• 100% Hadoop, HBase, HDFS API compatible
• Unique advanced features
• No changes required to Hadoop applications
• Runs on commodity hardware
Innovations of Next Generation Distribution
• High Availability Architecture
• Snapshots
• Mirroring
• NFS Access
• Graphical Management
• Speed jobs by more than 2X
• Save $$$ on hardware
Importance of File-based Access
• File Browsers: access directly, “drag & drop”
• Standard Linux Commands & Tools: grep, sed, sort, tar
• Applications: random read, random write, log directly
Hadoop Cluster
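To make "random read, random write, log directly" concrete, here is a minimal sketch of what an application could do against cluster storage once it is exposed as an ordinary file system. It uses only standard Python file calls. The helper names (`append_log`, `overwrite_at`, `read_at`) are hypothetical, and a temporary local directory stands in for the cluster mount point, whose real path is not given in the slides.

```python
import os
import tempfile


def append_log(path, message):
    """Append one log line, as an application would when logging directly."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(message + "\n")


def overwrite_at(path, offset, data):
    """Random write: overwrite bytes in place at the given offset."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_at(path, offset, length):
    """Random read: fetch `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Assumption: on a real cluster this would be a directory under the NFS
# mount point; a temporary local directory stands in for it here.
mount_dir = tempfile.mkdtemp()
log_path = os.path.join(mount_dir, "app.log")

append_log(log_path, "job started")
append_log(log_path, "job finished")
overwrite_at(log_path, 4, b"STARTED")   # in-place update, no rewrite of the file
first_line = read_at(log_path, 0, 11)
print(first_line)
```

Because these are plain file operations, the same file would also be readable with grep, sed, sort, or tar without any Hadoop-specific client.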
High Availability and Data Protection
MapR Distribution
Hive, Pig, Oozie, Sqoop, Plume, HBase, Mahout, Cascading, Nagios Integration, Ganglia Integration, Flume, More
MapReduce
MapR’s Lockless Storage Services™
Distributed NameNode HA™
JobTracker HA™
• High availability
• Stateful failover
• Unlimited number of files
[Diagram: Active Files and Snapshots referencing Data Blocks A, B, C, D, D’]
• Recover from app or user errors
• Zero performance loss on write
• Easy recovery with drag and drop
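One plausible reading of the snapshot diagram is that a snapshot shares unchanged data blocks with the active file, and a later change to block D is written to a new block D’ while the snapshot keeps pointing at D. The toy model below sketches that idea. It is an illustration under that assumption, not MapR's implementation, and the `Volume` class and its methods are hypothetical.

```python
class Volume:
    """Toy volume: files map to lists of block ids; snapshots share blocks."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.active = {}      # file name -> list of block ids
        self.snapshots = {}   # snapshot name -> {file name -> list of block ids}
        self._next_id = 0

    def _new_block(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write_file(self, name, chunks):
        self.active[name] = [self._new_block(c) for c in chunks]

    def snapshot(self, snap_name):
        # Copy only the block-id lists; no data blocks are duplicated.
        self.snapshots[snap_name] = {n: list(ids) for n, ids in self.active.items()}

    def update_block(self, name, index, data):
        # The new version goes to a fresh block; the old block stays for snapshots.
        self.active[name][index] = self._new_block(data)

    def read(self, name, snap_name=None):
        table = self.active if snap_name is None else self.snapshots[snap_name]
        return b"".join(self.blocks[i] for i in table[name])


vol = Volume()
vol.write_file("f", [b"A", b"B", b"C", b"D"])
vol.snapshot("s1")
vol.update_block("f", 3, b"D'")   # active file now uses D', snapshot still uses D
print(vol.read("f"))
print(vol.read("f", "s1"))
```

In this model, taking a snapshot copies only block references, and recovering from a user error amounts to reading the file back through the snapshot's block list.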
File Create Benchmark
[Chart: Total Files (M). Series: MapR Distribution (out of box); Standard Distributions (out of box, tuned)]
Testing completed on 10 node cluster, 2x Quad-Core, 24G DRAM, 12 x 1TB SATA Drives @ 7200 rpm
MapR Performance Advantages
[Chart: Terasort (lower is better). Elapsed time in minutes, 3.5 TB. Legend: MapR, Other]
[Chart: YCSB on HBase (higher is better). Record inserts per sec (000s), WAL Off and WAL On. Legend: MapR, Other]
10 node cluster, 2x Quad-Core, 24G DRAM, 12 x 1TB SATA Drives @ 7200 rpm, Quad NICs