Inside MapR's M7
DESCRIPTION
This talk takes a technological deep dive into MapR M7, including some of the key challenges that were solved during its implementation. MapR's M7 is a clean-room reimplementation of the HBase API, written in C++ and fully integrated into the MapR platform. In the process of implementing M7, we learned some lessons and solved some interesting challenges. Ted Dunning shares some of these experiences and lessons. Many of them apply to high-performance query systems in general and can be applied much more widely. Some of the resulting techniques have already been adopted by the Apache Drill project, but there are many more places where they can be used.
TRANSCRIPT
1©MapR Technologies - Confidential
Inside MapR's M7: How to get a million ops per second on 10 nodes
Me, Us
Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG
MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, industry-standard APIs
Tonight
Hash tag: #mapr #fast
See also: @ApacheMahout @ApacheDrill
@ted_dunning and @mapR
MapR does MapReduce (fast)
TeraSort record: 1 TB in 54 seconds (1003 nodes)
MinuteSort record: 1.5 TB in 59 seconds (2103 nodes)
MapR: Lights Out Data Center Ready
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• Five 9's (99.999%) of uptime
Reliable Compute / Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end checksumming
• Strong consistency
• Built-in compression
• Mirror between two sites by RTO policy
Part 1: What's past is prologue (HBase is really good except when it isn't, but it has a heart of gold)
Part 2: An implementation tour (with many tricks and clever ploys)
Part 3: Results
Part 1: What's past is prologue
NoSQL (word cloud: DynamoDB, ZopeDB, Shoal, CloudKit, VertexDB, FlockDB, …)
HBase Table Architecture
Tables are divided into key ranges (regions)
Regions are served by nodes (RegionServers)
Columns are divided into access groups (column families)
(Diagram: a table grid with column families CF1-CF5 across regions R1-R4.)
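The key-range mapping above is plain range partitioning: finding the region for a row key is a binary search over sorted region start keys. A minimal sketch in Python (the region names and start keys are invented for illustration):

```python
import bisect

# Hypothetical region map: each region serves row keys from its start
# key (inclusive) up to the next region's start key (exclusive).
region_starts = ["", "g", "n", "t"]   # start keys for R1..R4
regions = ["R1", "R2", "R3", "R4"]

def region_for(row_key):
    # Rightmost region whose start key is <= row_key.
    return regions[bisect.bisect_right(region_starts, row_key) - 1]

print(region_for("apple"))   # R1
print(region_for("hadoop"))  # R2
print(region_for("zebra"))   # R4
```

Because regions are contiguous key ranges, a scan over a key range only touches the few regions that overlap it, which is why scans do not need to broadcast to every node.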
HBase Architecture is Better
Strong consistency model
– when a write returns, all readers will see the same value
– "eventually consistent" is often "eventually inconsistent"
Scan works
– does not broadcast
– ring-based NoSQL databases (e.g., Cassandra, Riak) suffer on scans
Scales automatically
– splits when regions become too large
– uses HDFS to spread data and manage space
Integrated with Hadoop
– map-reduce on HBase is straightforward
But ... how well do you know HBCK? (a.k.a. HBase recovery)
HBASE-5843: Improve HBase MTTR (Mean Time To Recover)
HBASE-6401: HBase may lose edits after a crash with Hadoop 1.0.3
– uses appends
HBASE-3809: .META. may not come back online if …
etc.; roughly 40-50 JIRAs on this topic
Very complex algorithm to assign a region
– and it still does not get it right on reboot
HBase Issues
Reliability
• Compactions disrupt operations
• Very slow crash recovery
• Unreliable splitting
Business continuity
• Common hardware/software issues cause downtime
• Administration requires downtime
• No point-in-time recovery
• Complex backup process
Performance
• Many bottlenecks result in low throughput
• Limited data locality
• Limited # of tables
Manageability
• Compactions, splits and merges must be done manually (in reality)
• Basic operations like backup or table rename are complex
Examples: Performance Issues
Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)
Limited data locality: HBase does not take into account block locations when assigning regions. After a reboot, RegionServers are often reading data over the network rather than the local drives. (HBASE-4755, HBASE-4491)
Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer so a commodity server can only handle about 1TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)
Limited # of tables: A single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)
Examples: Manageability Issues
Manual major compactions: HBase major compactions are disruptive so production clusters keep them disabled and rely on the administrator to manually trigger compactions. (http://hbase.apache.org/book.html#compaction)
Manual splitting: HBase auto-splitting does not work properly in a busy cluster so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.
Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)
Examples: Reliability Issues
Compactions disrupt HBase operations: I/O bursts overwhelm nodes (http://hbase.apache.org/book.html#compaction)
Very slow crash recovery: RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for impacted regions. (HBASE-1111)
Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
No client throttling: HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
One Issue – Crash Recovery Too Slow
HBASE-1111 superseded by HBASE-5843 which is blocked by
HDFS-3912 HBASE-6736 HBASE-6970 HBASE-7989 HBASE-6315HBASE-7815 HBASE-6737 HBASE-6738 HBASE-7271 HBASE-7590HBASE-7756 HBASE-8204 HBASE-5992 HBASE-6156 HBASE-6878HBASE-6364 HBASE-6713 HBASE-5902 HBASE-4755 HBASE-7006HDFS-2576 HBASE-6309 HBASE-6751 HBASE-6752 HBASE-6772HBASE-6773 HBASE-6774 HBASE-7246 HBASE-7334 HBASE-5859HBASE-6058 HBASE-6290 HBASE-7213 HBASE-5844 HBASE-5924HBASE-6435 HBASE-6783 HBASE-7247 HBASE-7327 HDFS-4721HBASE-5877 HBASE-5926 HBASE-5939 HBASE-5998 HBASE-6109HBASE-6870 HBASE-5930 HDFS-4754 HDFS-3705
What is the source of these problems?
RegionServers are problematic
Coordinating 3 separate distributed systems is very hard
– HBase, HDFS, ZK
– each of these systems has multiple internal systems
– too many races, too many undefined properties
Distributed transaction framework not available
– too many failures to deal with
Java GC wipes out the RS from time to time
– cannot use -Xmx20g for a RS
Hence all the bugs
– HBCK is your "friend"
Region Assignment in Apache HBase
HDFS Architecture Review
Files are sharded into blocks, distributed across DataNodes
NameNode holds (in DRAM): directories, files, block replica locations
DataNodes serve blocks; no idea about files/dirs; all namespace ops go to the NN
(Diagram: files sharded into blocks; DataNodes store the blocks.)
A File at the NameNode
NameNode holds in-memory: dir hierarchy ("names"), file attrs ("inode"), composite file structure (array of block-ids)
A 1-byte file in HDFS = 1 HDFS "block" on 3 DNs = 3 entries in the NN, totaling ~1K of DRAM
NN scalability problems
DN reports blocks to NN
– 128M blocks
– 12T of disk => DN sends ~100K blocks/report
– RPC on wire is 4M
– causes extreme load, at both DN and NN
With NN-HA, DNs do dual block-reports
– one to primary, one to secondary
– doubles the load on the DN
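The block-report arithmetic above is easy to check; a back-of-envelope calculation using the numbers from the slide:

```python
# Back-of-envelope for DataNode block-report size.
disk_per_datanode = 12 * 10**12   # 12 TB of disk per DataNode
block_size = 128 * 2**20          # 128 MB HDFS blocks

blocks_per_report = disk_per_datanode // block_size
print(blocks_per_report)          # roughly 89,000 blocks per report
```

About 90K blocks in every periodic report, consistent with the ~100K figure above.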
Scaling Parameters
Unit of I/O
– 4K/8K (8K in MapR)
Unit of Chunking (a map-reduce split)
– 10-100's of megabytes
– the HDFS 'block'
Unit of Resync (a replica)
– 10-100's of gigabytes
– container in MapR
Unit of Administration (snap, repl, mirror, quota, backup)
– 1 gigabyte - 1000's of terabytes
– volume in MapR
– what data is affected by my missing blocks?
(Scale axis: i/o ~10^3, map-red ~10^6, resync ~10^9, admin beyond.)
MapR's No-NameNode Architecture
HDFS Federation:
• Multiple single points of failure
• Limited to 50-200 million files
• Performance bottleneck
• Commercial NAS required
MapR (distributed metadata):
• HA w/ automatic failover
• Instant cluster restart
• Up to 1 trillion files
• 20x higher performance
• 100% commodity hardware
(Diagram: federated NameNodes backed by a NAS appliance, each owning a slice of the namespace, vs. MapR metadata distributed across the DataNodes themselves.)
MapR's Distributed NameNode
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
Containers are 16-32 GB segments of disk, placed on nodes
Each container contains: directories & files, data blocks
Replicated on servers; millions of containers in a typical cluster
Patent pending
M7 Containers
Container holds many files
– regular, dir, symlink, btree, chunk-map, region-map, …
– all random-write capable
– each can hold 100's of millions of files
Container is replicated to servers
– unit of resynchronization
Region lives entirely inside 1 container
– all files + WALs + B-trees + bloom filters + range maps
Read-write Replication
Writes are synchronous
– all copies have the same data
Data is replicated in a "chain" fashion
– better bandwidth; utilizes full-duplex network links well
Metadata is replicated in a "star" manner
– better response time; bandwidth not a concern
– data can also be done this way
(Diagram: clients 1…N writing through replication chains.)
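The chain vs. star trade-off can be sketched in a few lines. Replicas here are plain dicts standing in for servers, and the "hop count" return values model serial latency; all of this is invented for illustration, not M7's actual protocol:

```python
# Sketch: synchronous replication in "chain" vs "star" form. Every
# copy must hold the data before the write returns (strong consistency).

def chain_write(replicas, key, value):
    # Data flows head -> ... -> tail; hops are serial, but each hop
    # uses a different pair of links, so full-duplex bandwidth is used.
    hops = 0
    for replica in replicas:
        replica[key] = value
        hops += 1
    return hops                     # serial hops == chain length

def star_write(replicas, key, value):
    # Primary fans out in parallel: serial depth is 2 regardless of N,
    # but the primary's uplink carries every copy.
    primary, others = replicas[0], replicas[1:]
    primary[key] = value
    for replica in others:
        replica[key] = value        # conceptually concurrent
    return 2 if others else 1       # serial depth

replicas = [{}, {}, {}]
assert chain_write(replicas, "row1", "v1") == 3
assert all(r["row1"] == "v1" for r in replicas)
```

The star's serial depth stays at 2 as the replica count grows (better response time, good for small metadata), while the chain spreads the bandwidth cost across distinct links (better for bulk data).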
Failure Handling
Containers are managed at the CLDB, the Container Location DataBase (heartbeats, container-reports)
HB loss + upstream entity reports failure => server dead
Increment epoch at CLDB
Rearrange replication
Exact same code for files and M7 tables
No ZK needed at this level
Benchmark: file creates (100-byte files)
Hardware: 10 nodes, 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM

                    MapR      Other     Advantage
Rate (creates/s)    14-16K    335-360   40x
Scale (files)       6B        1.3M      4615x

(Charts: file creates/s vs. millions of files created, MapR distribution vs. other distribution; same 10 nodes, with 3X replication.)
Recap
HBase has a good basis
– but is handicapped by HDFS
– yet can't do without HDFS
– so HBase can't be fixed in isolation
Separating the storage scaling parameters is key
– allows an additional layer of storage indirection
– results in huge scaling and performance improvements
Low-level transactions are hard
– but allow an R/W file system and decentralized metadata
– also allow non-file implementations
Part 2: An implementation tour
An Outline of Important Factors
• Start with MapR FS (mutability, transactions, real snapshots)
• C++ not Java (data never moves, better control)
• Lockless design, custom queue executive (3 ns switch)
• New RPC layer (> 1M RPC/s)
• Cut out the middle man (single hop to data)
• Hybridize log-structured merge trees and B-trees
• Adjust sizes and fanouts
• Don't be silly
We get these all for free by putting tables into MapR FS
M7: Tables Integrated into Storage
No extra daemons to manage
One hop to data
Superior caching policies
No JVM problems
Lesson 0: Implement tables in the file system
Why Not Java?
Disclaimer: I am a pro-Java bigot. But that only goes so far …
Consider the memory size of struct {x, y}[] a;
Consider also interpreting data as it has arrived from the wire
Consider the problem of writing a micro-stack queue executive with hundreds of thousands of threads and a 3 ns context switch
Consider the problem of a core-locked process running a cache-aware, lock-free, zero-copy queue of tasks
Consider the GC-free lifestyle
At What Cost
But writing performant C++ is hard
Managing low-level threads is hard
Implementing very fast failure recovery is hard
Doing manual memory allocation is hard (and dangerous)
Benefits outweigh costs with the right dev team
Benefits dwarfed by the costs with the wrong dev team
Lesson 1: With great speed comes great responsibility
M7 Table Architecture
This structure is internal and not user-visible
Multi-level Design
Fixed number of levels, like HBase
Specialized fanout to match sizes to device physics
Mutable file system allows a chimeric LSM-tree / B-tree
Sized to match the container structure
Guaranteed locality
– if the data moves, the new node will handle it
– if the node fails, the new node will handle it
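The LSM side of that hybrid can be illustrated with a toy write path: absorb random writes in a small in-memory buffer, flush it as an immutable sorted run (one sequential write), and serve reads newest-run-first. This is a generic LSM sketch under invented parameters, not M7's internal structure:

```python
# Toy LSM write path: memtable absorbs random writes; flushes become
# sorted runs; reads check the memtable, then runs from newest to oldest.

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                  # newest run first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: one sequential write instead of many random ones.
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:           # newest run wins
            if key in run:
                return run[key]
        return None

db = TinyLSM()
for i in range(6):
    db.put(f"k{i}", i)
assert db.get("k0") == 0 and db.get("k5") == 5
```

A mutable file system makes the B-tree side possible too: runs can be updated in place rather than only rewritten wholesale, which is what the "chimeric" design exploits.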
Lesson 2: Physics. Not just a good idea. It's the law.
RPC Reimplementation
At very high data rates, protobuf is too slow
– not good as an envelope; still a great schema definition language
– most systems never hit this limit
Alternative 1
– lazy parsing allows deferral of content parsing
– naïve implementation imposes (yet another) extra copy
Alternative 2
– bespoke parsing of the envelope from the wire
– content packages can land fully aligned and ready for battle directly from the wire
Let's use BOTH ideas
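Combining the two ideas might look like this sketch: a tiny hand-parsed envelope plus a body that is handed over as raw bytes, never copied and only decoded if someone actually needs it. The wire format here (4-byte length + 2-byte type) is invented, not M7's actual RPC format:

```python
import struct

# Bespoke envelope + lazy body: parse only a fixed header off the wire
# and keep the payload as an uncopied view, deferring any expensive decode.
HEADER = struct.Struct(">IH")   # 4-byte big-endian length, 2-byte type

def encode(msg_type, payload):
    return HEADER.pack(len(payload), msg_type) + payload

def decode_envelope(buf):
    # Cheap: fixed-offset reads, no per-field parsing, no copy of body.
    length, msg_type = HEADER.unpack_from(buf, 0)
    body = memoryview(buf)[HEADER.size:HEADER.size + length]
    return msg_type, body       # body is decoded only if/when needed

frame = encode(7, b"opaque-row-data")
msg_type, body = decode_envelope(frame)
assert msg_type == 7 and bytes(body) == b"opaque-row-data"
```

The schema language (protobuf) can still describe the body; the hot path just never pays for a general-purpose envelope parse or an extra copy.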
Lesson 3: Hacking and abstraction can co-exist
Don’t Be Silly
Detailed review of the code revealed an extra copy– It was subtle. Really.
Performance increased when this was stopped
Not as easy to spot as it sounds– But absolutely still worth finding and fixing
Part 3: Results
Server Reboot
Full container-reports are tiny
– CLDB needs 2G of DRAM for a 1000-node cluster
Volumes come online very fast
– each volume is independent of the others
– online as soon as the min-repl # of containers is ready
– no need to wait for the whole cluster (e.g., HDFS waits for 99.9% of blocks reporting)
1000-node cluster restart < 5 mins
M7 provides Instant Recovery
0-40 microWALs per region
– idle WALs go to zero quickly, so most are empty
– region is up before all microWALs are recovered
– recovers the region in the background, in parallel
– when a key is accessed, that microWAL is recovered inline
– 1000-10000x faster recovery
Why doesn't HBase do this?
– M7 leverages unique MapR-FS capabilities and is not impacted by HDFS limitations
– no limit to # of files on disk
– no limit to # of open files
– I/O path translates random writes to sequential writes on disk
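A toy illustration of the inline-recovery idea: WAL entries are sharded into many small logs, and a shard is replayed only when a key belonging to it is first read (or by a background sweep). Everything here, including the shard count and shard function, is invented for illustration:

```python
# Lazy microWAL recovery sketch: the region is "up" immediately;
# each shard of pending WAL entries is replayed on first touch.

class Region:
    def __init__(self, micro_wals):
        self.store = {}
        self.pending = dict(micro_wals)  # shard -> [(key, value), ...]

    def _shard(self, key):
        return ord(key[0]) % 8           # hypothetical 8 microWALs

    def get(self, key):
        shard = self._shard(key)
        if shard in self.pending:        # replay just this shard, inline
            for k, v in self.pending.pop(shard):
                self.store.setdefault(k, v)
        return self.store.get(key)

region = Region({1: [("a", 1)], 2: [("b", 2)]})
assert region.get("a") == 1              # only shard 1 replayed
assert 2 in region.pending               # shard 2 still deferred
```

Recovery cost is paid per shard on demand instead of all up front, which is why a region can serve requests long before every microWAL has been replayed.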
Other M7 Features
Smaller disk footprint
– M7 never repeats the key or column name
Columnar layout
– M7 supports 64 column families
– in-memory column families
Online admin
– M7 schema changes on the fly
– delete/rename/redistribute tables
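Why not repeating the key or column name matters can be seen with simple byte counting. The two layouts below are invented purely to illustrate the effect; they are not M7's on-disk format:

```python
# Footprint of repeating vs. sharing the row key and column names.
cells = [("user42", "cf:name", "Ted"),
         ("user42", "cf:city", "SJ"),
         ("user42", "cf:lang", "en")]

# Naive layout: every cell carries its own copy of the row key.
naive = sum(len(k) + len(c) + len(v) for k, c, v in cells)

# Grouped layout: the row key is stored once for all its cells.
row = cells[0][0]
grouped = len(row) + sum(len(c) + len(v) for _, c, v in cells)

print(naive, grouped)   # grouped is smaller; the gap grows with more cells
```

With long row keys, wide rows, and many versions, the repeated-key overhead dominates, so eliminating it directly shrinks the disk footprint.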
Binary Compatible
HBase applications work "as is" with M7
– no need to recompile (binary compatible)
Can run M7 and HBase side-by-side on the same cluster
– e.g., during a migration
– can access both an M7 table and an HBase table in the same program
Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa, viz.,
% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
M7 vs CDH - Mixed Load 50-50
Recap
HBase has some excellent core ideas
– but is burdened by years of technical debt
– much of the debt was charged on the HDFS credit cards
MapR FS provides an ideal substrate for an HBase-like service
– one hop from client to data
– many problems never arise in the first place
– other problems have relatively simple solutions on a better foundation
Practical results bear out the theory
Me, Us
Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG
MapR
Distributes more open source components for Hadoop
Adds major technology for performance and HA
Adds industry standard API's
Tonight
Hash tag: #nosqlnow #mapr #fast
See also: @ApacheMahout @ApacheDrill
@ted_dunning and @mapR