Inside MapR's M7
DESCRIPTION
This talk takes a technological deep dive into MapR M7, including some of the key challenges that were solved during its implementation. MapR's M7 is a clean-room reimplementation of the HBase API, written in C++ and fully integrated into the MapR platform. In the process of implementing M7, we learned some lessons and solved some interesting challenges. Ted Dunning shares some of these experiences and lessons. Many of them apply to high-performance query systems in general and can be applied much more widely. Some of the resulting techniques have already been adopted by the Apache Drill project, but there are many more places where they can be used.
TRANSCRIPT
1©MapR Technologies - Confidential
Inside MapR's M7: How to get a million ops per second on 10 nodes
Me, Us
Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG
MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, industry-standard APIs
Tonight
Hash tag: #mapr #fast
See also: @ApacheMahout @ApacheDrill
@ted_dunning and @mapR
MapR does MapReduce (fast)
TeraSort record: 1 TB in 54 seconds (1003 nodes)
MinuteSort record: 1.5 TB in 59 seconds (2103 nodes)
MapR: Lights Out Data Center Ready
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• Five 9's (99.999%) of uptime
Reliable Compute / Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end checksumming
• Strong consistency
• Built-in compression
• Mirror between two sites by RTO policy
Part 1: What's past is prologue (HBase is really good except when it isn't, but it has a heart of gold)
Part 2: An implementation tour (with many tricks and clever ploys)
Part 3: Results
Part 1: What's past is prologue
NoSQL (word cloud: DynamoDB, ZopeDB, Shoal, CloudKit, VertexDB, FlockDB, …)
HBase Table Architecture
Tables are divided into key ranges (regions)
Regions are served by nodes (RegionServers)
Columns are divided into access groups (column families)
(Diagram: a table grid with column families CF1-CF5 across regions R1-R4.)
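The key-range mapping above is plain range partitioning: finding the region for a row key is a binary search over sorted region start keys. A minimal sketch in Python (the region names and start keys are invented for illustration):

```python
import bisect

# Hypothetical region map: each region serves row keys from its start
# key (inclusive) up to the next region's start key (exclusive).
region_starts = ["", "g", "n", "t"]   # start keys for R1..R4
regions = ["R1", "R2", "R3", "R4"]

def region_for(row_key):
    # Rightmost region whose start key is <= row_key.
    return regions[bisect.bisect_right(region_starts, row_key) - 1]

print(region_for("apple"))   # R1
print(region_for("hadoop"))  # R2
print(region_for("zebra"))   # R4
```

Because regions are contiguous key ranges, a scan over a key range only touches the few regions that overlap it, which is why scans do not need to broadcast to every node.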
HBase Architecture is Better
Strong consistency model
– when a write returns, all readers will see the same value
– "eventually consistent" is often "eventually inconsistent"
Scan works
– does not broadcast
– ring-based NoSQL databases (e.g., Cassandra, Riak) suffer on scans
Scales automatically
– splits when regions become too large
– uses HDFS to spread data and manage space
Integrated with Hadoop
– map-reduce on HBase is straightforward
But ... how well do you know HBCK? (a.k.a. HBase recovery)
HBASE-5843: Improve HBase MTTR (Mean Time To Recover)
HBASE-6401: HBase may lose edits after a crash with Hadoop 1.0.3
– uses appends
HBASE-3809: .META. may not come back online if …
etc.; roughly 40-50 JIRAs on this topic
Very complex algorithm to assign a region
– and it still does not get it right on reboot
HBase Issues
Reliability
• Compactions disrupt operations
• Very slow crash recovery
• Unreliable splitting
Business continuity
• Common hardware/software issues cause downtime
• Administration requires downtime
• No point-in-time recovery
• Complex backup process
Performance
• Many bottlenecks result in low throughput
• Limited data locality
• Limited # of tables
Manageability
• Compactions, splits and merges must be done manually (in reality)
• Basic operations like backup or table rename are complex
Examples: Performance Issues
Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)
Limited data locality: HBase does not take into account block locations when assigning regions. After a reboot, RegionServers are often reading data over the network rather than the local drives. (HBASE-4755, HBASE-4491)
Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer so a commodity server can only handle about 1TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)
Limited # of tables: A single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)
Examples: Manageability Issues
Manual major compactions: HBase major compactions are disruptive so production clusters keep them disabled and rely on the administrator to manually trigger compactions. (http://hbase.apache.org/book.html#compaction)
Manual splitting: HBase auto-splitting does not work properly in a busy cluster so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.
Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)
Examples: Reliability Issues
Compactions disrupt HBase operations: I/O bursts overwhelm nodes (http://hbase.apache.org/book.html#compaction)
Very slow crash recovery: RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for impacted regions. (HBASE-1111)
Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)
No client throttling: HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
One Issue – Crash Recovery Too Slow
HBASE-1111 superseded by HBASE-5843 which is blocked by
HDFS-3912 HBASE-6736 HBASE-6970 HBASE-7989 HBASE-6315HBASE-7815 HBASE-6737 HBASE-6738 HBASE-7271 HBASE-7590HBASE-7756 HBASE-8204 HBASE-5992 HBASE-6156 HBASE-6878HBASE-6364 HBASE-6713 HBASE-5902 HBASE-4755 HBASE-7006HDFS-2576 HBASE-6309 HBASE-6751 HBASE-6752 HBASE-6772HBASE-6773 HBASE-6774 HBASE-7246 HBASE-7334 HBASE-5859HBASE-6058 HBASE-6290 HBASE-7213 HBASE-5844 HBASE-5924HBASE-6435 HBASE-6783 HBASE-7247 HBASE-7327 HDFS-4721HBASE-5877 HBASE-5926 HBASE-5939 HBASE-5998 HBASE-6109HBASE-6870 HBASE-5930 HDFS-4754 HDFS-3705
What is the source of these problems?
RegionServers are problematic
Coordinating 3 separate distributed systems is very hard
– HBase, HDFS, ZK
– each of these systems has multiple internal systems
– too many races, too many undefined properties
Distributed transaction framework not available
– too many failures to deal with
Java GC wipes out the RS from time to time
– cannot use -Xmx20g for a RS
Hence all the bugs
– HBCK is your "friend"
Region Assignment in Apache HBase
HDFS Architecture Review
Files are sharded into blocks, distributed across DataNodes
NameNode holds (in DRAM): directories, files, block replica locations
DataNodes serve blocks; no idea about files/dirs; all namespace ops go to the NN
(Diagram: files sharded into blocks; DataNodes store the blocks.)
A File at the NameNode
NameNode holds in-memory: dir hierarchy ("names"), file attrs ("inode"), composite file structure (array of block-ids)
A 1-byte file in HDFS = 1 HDFS "block" on 3 DNs = 3 entries in the NN, totaling ~1K of DRAM
NN scalability problems
DN reports blocks to NN
– 128M blocks
– 12T of disk => DN sends ~100K blocks/report
– RPC on wire is 4M
– causes extreme load, at both DN and NN
With NN-HA, DNs do dual block-reports
– one to primary, one to secondary
– doubles the load on the DN
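The block-report arithmetic above is easy to check; a back-of-envelope calculation using the numbers from the slide:

```python
# Back-of-envelope for DataNode block-report size.
disk_per_datanode = 12 * 10**12   # 12 TB of disk per DataNode
block_size = 128 * 2**20          # 128 MB HDFS blocks

blocks_per_report = disk_per_datanode // block_size
print(blocks_per_report)          # roughly 89,000 blocks per report
```

About 90K blocks in every periodic report, consistent with the ~100K figure above.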
Scaling Parameters
Unit of I/O
– 4K/8K (8K in MapR)
Unit of Chunking (a map-reduce split)
– 10-100's of megabytes
– the HDFS 'block'
Unit of Resync (a replica)
– 10-100's of gigabytes
– container in MapR
Unit of Administration (snap, repl, mirror, quota, backup)
– 1 gigabyte - 1000's of terabytes
– volume in MapR
– what data is affected by my missing blocks?
(Scale axis: i/o ~10^3, map-red ~10^6, resync ~10^9, admin beyond.)
MapR's No-NameNode Architecture
HDFS Federation:
• Multiple single points of failure
• Limited to 50-200 million files
• Performance bottleneck
• Commercial NAS required
MapR (distributed metadata):
• HA w/ automatic failover
• Instant cluster restart
• Up to 1 trillion files
• 20x higher performance
• 100% commodity hardware
(Diagram: federated NameNodes backed by a NAS appliance, each owning a slice of the namespace, vs. MapR metadata distributed across the DataNodes themselves.)
MapR's Distributed NameNode
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
Containers are 16-32 GB segments of disk, placed on nodes
Each container contains: directories & files, data blocks
Replicated on servers; millions of containers in a typical cluster
Patent pending
M7 Containers
Container holds many files
– regular, dir, symlink, btree, chunk-map, region-map, …
– all random-write capable
– each can hold 100's of millions of files
Container is replicated to servers
– unit of resynchronization
Region lives entirely inside 1 container
– all files + WALs + B-trees + bloom filters + range maps
Read-write Replication
Writes are synchronous
– all copies have the same data
Data is replicated in a "chain" fashion
– better bandwidth; utilizes full-duplex network links well
Metadata is replicated in a "star" manner
– better response time; bandwidth not a concern
– data can also be done this way
(Diagram: clients 1…N writing through replication chains.)
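The chain vs. star trade-off can be sketched in a few lines. Replicas here are plain dicts standing in for servers, and the "hop count" return values model serial latency; all of this is invented for illustration, not M7's actual protocol:

```python
# Sketch: synchronous replication in "chain" vs "star" form. Every
# copy must hold the data before the write returns (strong consistency).

def chain_write(replicas, key, value):
    # Data flows head -> ... -> tail; hops are serial, but each hop
    # uses a different pair of links, so full-duplex bandwidth is used.
    hops = 0
    for replica in replicas:
        replica[key] = value
        hops += 1
    return hops                     # serial hops == chain length

def star_write(replicas, key, value):
    # Primary fans out in parallel: serial depth is 2 regardless of N,
    # but the primary's uplink carries every copy.
    primary, others = replicas[0], replicas[1:]
    primary[key] = value
    for replica in others:
        replica[key] = value        # conceptually concurrent
    return 2 if others else 1       # serial depth

replicas = [{}, {}, {}]
assert chain_write(replicas, "row1", "v1") == 3
assert all(r["row1"] == "v1" for r in replicas)
```

The star's serial depth stays at 2 as the replica count grows (better response time, good for small metadata), while the chain spreads the bandwidth cost across distinct links (better for bulk data).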
Failure Handling
Containers are managed at the CLDB, the Container Location DataBase (heartbeats, container-reports)
HB loss + upstream entity reports failure => server dead
Increment epoch at CLDB
Rearrange replication
Exact same code for files and M7 tables
No ZK needed at this level
Benchmark: file creates (100-byte files)
Hardware: 10 nodes, 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM

                    MapR      Other     Advantage
Rate (creates/s)    14-16K    335-360   40x
Scale (files)       6B        1.3M      4615x

(Charts: file creates/s vs. millions of files created, MapR distribution vs. other distribution; same 10 nodes, with 3X replication.)
Recap
HBase has a good basis
– but is handicapped by HDFS
– yet can't do without HDFS
– so HBase can't be fixed in isolation
Separating the storage scaling parameters is key
– allows an additional layer of storage indirection
– results in huge scaling and performance improvements
Low-level transactions are hard
– but allow an R/W file system and decentralized metadata
– also allow non-file implementations
Part 2: An implementation tour
An Outline of Important Factors
• Start with MapR FS (mutability, transactions, real snapshots)
• C++ not Java (data never moves, better control)
• Lockless design, custom queue executive (3 ns switch)
• New RPC layer (> 1M RPC/s)
• Cut out the middle man (single hop to data)
• Hybridize log-structured merge trees and B-trees
• Adjust sizes and fanouts
• Don't be silly
We get these all for free by putting tables into MapR FS
M7: Tables Integrated into Storage
No extra daemons to manage
One hop to data
Superior caching policies
No JVM problems
Lesson 0: Implement tables in the file system
Why Not Java?
Disclaimer: I am a pro-Java bigot. But that only goes so far …
Consider the memory size of struct {x, y}[] a;
Consider also interpreting data as it has arrived from the wire
Consider the problem of writing a micro-stack queue executive with hundreds of thousands of threads and a 3 ns context switch
Consider the problem of a core-locked process running a cache-aware, lock-free, zero-copy queue of tasks
Consider the GC-free lifestyle
At What Cost
But writing performant C++ is hard
Managing low-level threads is hard
Implementing very fast failure recovery is hard
Doing manual memory allocation is hard (and dangerous)
Benefits outweigh costs with the right dev team
Benefits dwarfed by the costs with the wrong dev team
Lesson 1: With great speed comes great responsibility
M7 Table Architecture
This structure is internal and not user-visible
Multi-level Design
Fixed number of levels, like HBase
Specialized fanout to match sizes to device physics
Mutable file system allows a chimeric LSM-tree / B-tree
Sized to match the container structure
Guaranteed locality
– if the data moves, the new node will handle it
– if the node fails, the new node will handle it
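The LSM side of that hybrid can be illustrated with a toy write path: absorb random writes in a small in-memory buffer, flush it as an immutable sorted run (one sequential write), and serve reads newest-run-first. This is a generic LSM sketch under invented parameters, not M7's internal structure:

```python
# Toy LSM write path: memtable absorbs random writes; flushes become
# sorted runs; reads check the memtable, then runs from newest to oldest.

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                  # newest run first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: one sequential write instead of many random ones.
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:           # newest run wins
            if key in run:
                return run[key]
        return None

db = TinyLSM()
for i in range(6):
    db.put(f"k{i}", i)
assert db.get("k0") == 0 and db.get("k5") == 5
```

A mutable file system makes the B-tree side possible too: runs can be updated in place rather than only rewritten wholesale, which is what the "chimeric" design exploits.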
Lesson 2: Physics. Not just a good idea. It's the law.
RPC Reimplementation
At very high data rates, protobuf is too slow
– not good as an envelope; still a great schema definition language
– most systems never hit this limit
Alternative 1
– lazy parsing allows deferral of content parsing
– naïve implementation imposes (yet another) extra copy
Alternative 2
– bespoke parsing of the envelope from the wire
– content packages can land fully aligned and ready for battle directly from the wire
Let's use BOTH ideas
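Combining the two ideas might look like this sketch: a tiny hand-parsed envelope plus a body that is handed over as raw bytes, never copied and only decoded if someone actually needs it. The wire format here (4-byte length + 2-byte type) is invented, not M7's actual RPC format:

```python
import struct

# Bespoke envelope + lazy body: parse only a fixed header off the wire
# and keep the payload as an uncopied view, deferring any expensive decode.
HEADER = struct.Struct(">IH")   # 4-byte big-endian length, 2-byte type

def encode(msg_type, payload):
    return HEADER.pack(len(payload), msg_type) + payload

def decode_envelope(buf):
    # Cheap: fixed-offset reads, no per-field parsing, no copy of body.
    length, msg_type = HEADER.unpack_from(buf, 0)
    body = memoryview(buf)[HEADER.size:HEADER.size + length]
    return msg_type, body       # body is decoded only if/when needed

frame = encode(7, b"opaque-row-data")
msg_type, body = decode_envelope(frame)
assert msg_type == 7 and bytes(body) == b"opaque-row-data"
```

The schema language (protobuf) can still describe the body; the hot path just never pays for a general-purpose envelope parse or an extra copy.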
Lesson 3: Hacking and abstraction can co-exist
Don’t Be Silly
Detailed review of the code revealed an extra copy– It was subtle. Really.
Performance increased when this was stopped
Not as easy to spot as it sounds– But absolutely still worth finding and fixing
Part 3: Results
Server Reboot
Full container-reports are tiny
– CLDB needs 2G of DRAM for a 1000-node cluster
Volumes come online very fast
– each volume is independent of the others
– online as soon as the min-repl # of containers is ready
– no need to wait for the whole cluster (e.g., HDFS waits for 99.9% of blocks reporting)
1000-node cluster restart < 5 mins
M7 provides Instant Recovery
0-40 microWALs per region
– idle WALs go to zero quickly, so most are empty
– region is up before all microWALs are recovered
– recovers the region in the background, in parallel
– when a key is accessed, that microWAL is recovered inline
– 1000-10000x faster recovery
Why doesn't HBase do this?
– M7 leverages unique MapR-FS capabilities and is not impacted by HDFS limitations
– no limit to # of files on disk
– no limit to # of open files
– I/O path translates random writes to sequential writes on disk
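A toy illustration of the inline-recovery idea: WAL entries are sharded into many small logs, and a shard is replayed only when a key belonging to it is first read (or by a background sweep). Everything here, including the shard count and shard function, is invented for illustration:

```python
# Lazy microWAL recovery sketch: the region is "up" immediately;
# each shard of pending WAL entries is replayed on first touch.

class Region:
    def __init__(self, micro_wals):
        self.store = {}
        self.pending = dict(micro_wals)  # shard -> [(key, value), ...]

    def _shard(self, key):
        return ord(key[0]) % 8           # hypothetical 8 microWALs

    def get(self, key):
        shard = self._shard(key)
        if shard in self.pending:        # replay just this shard, inline
            for k, v in self.pending.pop(shard):
                self.store.setdefault(k, v)
        return self.store.get(key)

region = Region({1: [("a", 1)], 2: [("b", 2)]})
assert region.get("a") == 1              # only shard 1 replayed
assert 2 in region.pending               # shard 2 still deferred
```

Recovery cost is paid per shard on demand instead of all up front, which is why a region can serve requests long before every microWAL has been replayed.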
Other M7 Features
Smaller disk footprint
– M7 never repeats the key or column name
Columnar layout
– M7 supports 64 column families
– in-memory column families
Online admin
– M7 schema changes on the fly
– delete/rename/redistribute tables
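Why not repeating the key or column name matters can be seen with simple byte counting. The two layouts below are invented purely to illustrate the effect; they are not M7's on-disk format:

```python
# Footprint of repeating vs. sharing the row key and column names.
cells = [("user42", "cf:name", "Ted"),
         ("user42", "cf:city", "SJ"),
         ("user42", "cf:lang", "en")]

# Naive layout: every cell carries its own copy of the row key.
naive = sum(len(k) + len(c) + len(v) for k, c, v in cells)

# Grouped layout: the row key is stored once for all its cells.
row = cells[0][0]
grouped = len(row) + sum(len(c) + len(v) for _, c, v in cells)

print(naive, grouped)   # grouped is smaller; the gap grows with more cells
```

With long row keys, wide rows, and many versions, the repeated-key overhead dominates, so eliminating it directly shrinks the disk footprint.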
Binary Compatible
HBase applications work "as is" with M7
– no need to recompile (binary compatible)
Can run M7 and HBase side-by-side on the same cluster
– e.g., during a migration
– can access both an M7 table and an HBase table in the same program
Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa, viz.,
% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
M7 vs CDH - Mixed Load 50-50
Recap
HBase has some excellent core ideas
– but is burdened by years of technical debt
– much of the debt was charged on the HDFS credit cards
MapR FS provides an ideal substrate for an HBase-like service
– one hop from client to data
– many problems never arise in the first place
– other problems have relatively simple solutions on a better foundation
Practical results bear out the theory
Me, Us
Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG
MapR
Distributes more open source components for Hadoop
Adds major technology for performance and HA
Adds industry standard API's
Tonight
Hash tag: #nosqlnow #mapr #fast
See also: @ApacheMahout @ApacheDrill
@ted_dunning and @mapR