Inside MapR's M7: How to get a million ops per second on 10 nodes


DESCRIPTION

This talk takes a technological deep dive into MapR M7, including some of the key challenges that were solved during its implementation. MapR's M7 is a clean-room reimplementation of the HBase API, written in C++ and fully integrated into the MapR platform. In the process of implementing M7, we learned some lessons and solved some interesting challenges. Ted Dunning shares some of these experiences and lessons. Many of these lessons apply across the board to high-performance query systems and can be applied much more widely. Some of the resulting techniques have already been adopted by the Apache Drill project, but there are many more places where these techniques can be used.

TRANSCRIPT

Page 1: Inside MapR's M7

Inside MapR's M7: How to get a million ops per second on 10 nodes

Page 2: Inside MapR's M7

Me, Us

Ted Dunning, Chief Application Architect, MapR
– Committer and PMC member: Mahout, ZooKeeper, Drill
– Bought the beer at the first HUG

MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance, HA, industry-standard APIs

Tonight
– Hash tag: #mapr #fast
– See also: @ApacheMahout @ApacheDrill

@ted_dunning and @mapR

Page 3: Inside MapR's M7

MapR does MapReduce (fast)

TeraSort record: 1 TB in 54 seconds (1003 nodes)

MinuteSort record: 1.5 TB in 59 seconds (2103 nodes)

Page 4: Inside MapR's M7

MapR: Lights Out Data Center Ready

Reliable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• Five 9's (99.999%) of uptime

Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end checksumming
• Strong consistency
• Built-in compression
• Mirror between two sites by RTO policy

Page 5: Inside MapR's M7

Part 1: What's past is prologue

Page 6: Inside MapR's M7

Part 1: What's past is prologue
HBase is really good, except when it isn't, but it has a heart of gold

Page 7: Inside MapR's M7

Part 2: An implementation tour

Page 8: Inside MapR's M7

Part 2: An implementation tour
with many tricks and clever ploys

Page 9: Inside MapR's M7

Part 3: Results

Page 10: Inside MapR's M7


Page 11: Inside MapR's M7

Part 1: What's past is prologue

Page 12: Inside MapR's M7

NoSQL

[Slide: logo cloud of NoSQL databases, including DynamoDB, ZopeDB, Shoal, CloudKit, VertexDB, and FlockDB]

Page 13: Inside MapR's M7

HBase Table Architecture

– Tables are divided into key ranges (regions)
– Regions are served by nodes (RegionServers)
– Columns are divided into access groups (column families)

[Diagram: a table split into regions R1-R4 (rows) and column families CF1-CF5 (columns)]

Page 14: Inside MapR's M7

HBase Architecture is Better

Strong consistency model
– when a write returns, all readers will see the same value
– "eventually consistent" is often "eventually inconsistent"

Scan works
– does not broadcast
– ring-based NoSQL databases (e.g., Cassandra, Riak) suffer on scans

Scales automatically
– splits when regions become too large
– uses HDFS to spread data, manage space

Integrated with Hadoop
– map-reduce on HBase is straightforward

Page 15: Inside MapR's M7

But ... how well do you know HBCK? (a.k.a. HBase recovery)

– HBASE-5843: Improve HBase MTTR (Mean Time To Recover)
– HBASE-6401: HBase may lose edits after a crash with Hadoop 1.0.3 (uses appends)
– HBASE-3809: .META. may not come back online if ...
– etc.: about 40-50 JIRAs on this topic

Very complex algorithm to assign a region
– and it still does not get it right on reboot

Page 16: Inside MapR's M7

HBase Issues

Reliability
• Compactions disrupt operations
• Very slow crash recovery
• Unreliable splitting

Business continuity
• Common hardware/software issues cause downtime
• Administration requires downtime
• No point-in-time recovery
• Complex backup process

Performance
• Many bottlenecks result in low throughput
• Limited data locality
• Limited # of tables

Manageability
• Compactions, splits and merges must be done manually (in reality)
• Basic operations like backup or table rename are complex

Page 17: Inside MapR's M7

Examples: Performance Issues

Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)

Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from local drives. (HBASE-4755, HBASE-4491)

Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1 TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)

Limited # of tables: A single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)

Page 18: Inside MapR's M7


Examples: Manageability Issues

Manual major compactions: HBase major compactions are disruptive so production clusters keep them disabled and rely on the administrator to manually trigger compactions. (http://hbase.apache.org/book.html#compaction)

Manual splitting: HBase auto-splitting does not work properly in a busy cluster so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)

Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.

Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)

Page 19: Inside MapR's M7


Examples: Reliability Issues

Compactions disrupt HBase operations: I/O bursts overwhelm nodes (http://hbase.apache.org/book.html#compaction)

Very slow crash recovery: RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for impacted regions. (HBASE-1111)

Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)

No client throttling: HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)

Page 20: Inside MapR's M7

One Issue – Crash Recovery Too Slow

HBASE-1111, superseded by HBASE-5843, which is blocked by:

HDFS-3912 HBASE-6736 HBASE-6970 HBASE-7989 HBASE-6315 HBASE-7815 HBASE-6737 HBASE-6738 HBASE-7271 HBASE-7590 HBASE-7756 HBASE-8204 HBASE-5992 HBASE-6156 HBASE-6878 HBASE-6364 HBASE-6713 HBASE-5902 HBASE-4755 HBASE-7006 HDFS-2576 HBASE-6309 HBASE-6751 HBASE-6752 HBASE-6772 HBASE-6773 HBASE-6774 HBASE-7246 HBASE-7334 HBASE-5859 HBASE-6058 HBASE-6290 HBASE-7213 HBASE-5844 HBASE-5924 HBASE-6435 HBASE-6783 HBASE-7247 HBASE-7327 HDFS-4721 HBASE-5877 HBASE-5926 HBASE-5939 HBASE-5998 HBASE-6109 HBASE-6870 HBASE-5930 HDFS-4754 HDFS-3705

Page 21: Inside MapR's M7

What is the source of these problems?

Page 22: Inside MapR's M7

RegionServers are problematic

Coordinating 3 separate distributed systems is very hard
– HBase, HDFS, ZK
– each of these systems has multiple internal systems
– too many races, too many undefined properties

Distributed transaction framework not available
– too many failures to deal with

Java GC wipes out the RS from time to time
– cannot use -Xmx20g for a RS

Hence all the bugs
– HBCK is your "friend"

Page 23: Inside MapR's M7


Region Assignment in Apache HBase

Page 24: Inside MapR's M7

HDFS Architecture Review

Files are broken into blocks, distributed across DataNodes

NameNode holds (in DRAM)
– directories, files
– block replica locations

DataNodes
– serve blocks
– no idea about files/dirs
– all ops go to NN

[Diagram: files sharded into blocks; DataNodes store the blocks]

Page 25: Inside MapR's M7

A File at the NameNode

NameNode holds in-memory
– dir hierarchy ("names")
– file attrs ("inode")
– composite file structure: an array of block-ids

A 1-byte file in HDFS occupies 1 HDFS "block" on 3 DNs and 3 entries in the NN, totaling ~1K of DRAM

Page 26: Inside MapR's M7

NN Scalability Problems

DN reports blocks to NN
– with 128M blocks and 12T of disk, a DN sends ~100K blocks per report
– the RPC on the wire is 4M
– causes extreme load at both DN and NN

With NN-HA, DNs do dual block-reports
– one to primary, one to secondary
– doubles the load on the DN

Page 27: Inside MapR's M7

Scaling Parameters

Unit of I/O
– 4K/8K (8K in MapR)

Unit of chunking (a map-reduce split)
– 10-100's of megabytes
– the HDFS 'block'

Unit of resync (a replica)
– 10-100's of gigabytes
– container in MapR

Unit of administration (snap, repl, mirror, quota, backup)
– 1 gigabyte to 1000's of terabytes
– volume in MapR
– answers "what data is affected by my missing blocks?"

[Diagram: scale axis running from i/o (~10^3) through map-red (~10^6) and resync (~10^9) to admin]

Page 28: Inside MapR's M7

MapR's No-NameNode Architecture

HDFS Federation
• Multiple single points of failure
• Limited to 50-200 million files
• Performance bottleneck
• Commercial NAS required

MapR (distributed metadata)
• HA w/ automatic failover
• Instant cluster restart
• Up to 1 trillion files
• 20x higher performance
• 100% commodity hardware

[Diagram: HDFS federation with multiple NameNodes backed by a NAS appliance, vs. MapR with metadata containers spread across ordinary DataNodes]

Page 29: Inside MapR's M7

MapR's Distributed NameNode

Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
– containers are 16-32 GB segments of disk, placed on nodes
– each container contains directories & files, and data blocks
– containers are replicated on servers
– millions of containers in a typical cluster

Patent pending

Page 30: Inside MapR's M7

M7 Containers

Container holds many files
– regular, dir, symlink, btree, chunk-map, region-map, ...
– all random-write capable
– each can hold 100's of millions of files

Container is replicated to servers
– unit of resynchronization

Region lives entirely inside 1 container
– all files + WALs + btrees + bloom filters + range maps

Page 31: Inside MapR's M7

Read-write Replication

Writes are synchronous
– all copies have the same data

Data is replicated in a "chain" fashion
– better bandwidth, utilizes full-duplex network links well (toy sketch below)

Meta-data is replicated in a "star" manner
– better response time; bandwidth is not a concern
– data can also be done this way

[Diagram: clients 1..N writing to replicated containers]
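
To make the chain/star distinction concrete, here is a toy C++ sketch (my illustration, not MapR's replication code). The point is bandwidth: a chain sends one copy per link, so full-duplex links carry inbound and outbound traffic at once, while a star makes the sender's uplink carry one copy per replica, which is fine for small meta-data but costly for bulk data.

#include <string>
#include <vector>

// Each node persists locally; the topology determines who forwards to whom.
struct Node {
    Node* next = nullptr;                        // successor in the chain
    void persist(const std::string& data) { /* write to local container */ }

    // Chain replication: persist, then forward one copy to the successor.
    // Every link carries the data exactly once, so per-link bandwidth
    // stays flat no matter how many replicas there are.
    void chain_write(const std::string& data) {
        persist(data);
        if (next) next->chain_write(data);
    }
};

// Star replication: the sender ships to every replica directly. One hop
// of latency, but the sender's uplink carries (#replicas) copies --
// acceptable for small meta-data updates, wasteful for bulk data.
void star_write(Node& master, std::vector<Node*>& replicas,
                const std::string& data) {
    master.persist(data);
    for (Node* r : replicas) r->persist(data);
}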

Page 32: Inside MapR's M7

Failure Handling

Containers are managed at the CLDB, the Container Location DataBase (heartbeats, container-reports)

HB loss + upstream entity reports failure => server dead
– increment the epoch at the CLDB
– rearrange replication

Exact same code for files and M7 tables

No ZK needed at this level

Page 33: Inside MapR's M7

Benchmark: File creates (100B)
Hardware: 10 nodes, 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM. Same 10 nodes, but with 3X replication.

                     MapR      Other      Advantage
Rate (creates/s)     14-16K    335-360    40x
Scale (files)        6B        1.3M       4615x

[Charts: file creates/s vs. files created (millions). The MapR distribution sustains 14-16K creates/s out to billions of files; the other distribution peaks around 350 creates/s and stops near 1.3M files.]

Page 34: Inside MapR's M7

Recap

HBase has a good basis
– but is handicapped by HDFS
– but can't do without HDFS
– HBase can't be fixed in isolation

Separating key storage scaling parameters is key
– allows an additional layer of storage indirection
– results in huge scaling and performance improvements

Low-level transactions are hard
– but they allow a R/W file system and decentralized meta-data
– and also allow non-file implementations

Page 35: Inside MapR's M7

Part 2: An implementation tour

Page 36: Inside MapR's M7

An Outline of Important Factors

– Start with MapR FS (mutability, transactions, real snapshots)
– C++ not Java (data never moves, better control)
– Lockless design, custom queue executive (3 ns switch)
– New RPC layer (> 1M RPC/s)
– Cut out the middle man (single hop to data)
– Hybridize log-structured merge trees and B-trees
– Adjust sizes and fanouts
– Don't be silly

Page 37: Inside MapR's M7

An Outline of Important Factors

– Start with MapR FS (mutability, transactions, real snapshots)
– C++ not Java (data never moves, better control)
– Lockless design, custom queue executive (3 ns switch)
– New RPC layer (> 1M RPC/s)
– Cut out the middle man (single hop to data)
– Hybridize log-structured merge trees and B-trees
– Adjust sizes and fanouts
– Don't be silly

We get these all for free by putting tables into MapR FS

Page 38: Inside MapR's M7

M7: Tables Integrated into Storage

– No extra daemons to manage
– One hop to data
– Superior caching policies
– No JVM problems

Page 39: Inside MapR's M7

Lesson 0: Implement tables in the file system

Page 40: Inside MapR's M7

Why Not Java?

Disclaimer: I am a pro-Java bigot. But that only goes so far ...

– Consider the memory size of struct {x, y}[] a; (see the sketch below)
– Consider also interpreting data as it arrives from the wire
– Consider the problem of writing a micro-stack queue executive with hundreds of thousands of threads and a 3 ns context switch
– Consider the problem of core-locked processes running a cache-aware, lock-free, zero-copy queue of tasks
– Consider the GC-free life-style
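
As a minimal sketch of the first bullet (illustrative, not MapR code): in C++ an array of structs is one contiguous allocation with no per-element overhead, whereas the closest idiomatic Java form is an array of references to separately allocated, headered objects.

#include <cstdio>

// Two doubles, nothing else: in C++ a Point[] is a single contiguous
// block of 16-byte elements, friendly to caches and sequential scans.
struct Point {
    double x;
    double y;
};

int main() {
    Point a[1000];   // one allocation: 1000 * 16 = 16,000 bytes
    std::printf("element: %zu bytes, array: %zu bytes\n",
                sizeof(a[0]), sizeof(a));
    // In Java, Point[1000] holds 1000 references (4-8 bytes each), each
    // pointing at a heap object with a ~12-16 byte header plus the two
    // doubles: roughly 2-3x the memory, plus a pointer chase per element.
    return 0;
}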

Page 41: Inside MapR's M7

At What Cost?

– Writing performant C++ is hard
– Managing low-level threads is hard
– Implementing very fast failure recovery is hard
– Doing manual memory allocation is hard (and dangerous)

Benefits outweigh the costs with the right dev team.
Benefits are dwarfed by the costs with the wrong dev team.

Page 42: Inside MapR's M7

Lesson 1: With great speed comes great responsibility

Page 43: Inside MapR's M7


M7 Table Architecture

Page 44: Inside MapR's M7


M7 Table Architecture

This structure is internal and not user-visible

Page 45: Inside MapR's M7

Multi-level Design

– Fixed number of levels, like HBase
– Specialized fanout to match sizes to device physics
– Mutable file system allows a chimeric LSM-tree / B-tree (toy sketch below)
– Sized to match the container structure
– Guaranteed locality
  – if the data moves, the new node will handle it
  – if the node fails, the new node will handle it
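
A toy sketch of the chimeric idea (my reading, not the M7 data structure): writes are absorbed at a small top level and spill down a fixed number of levels, LSM-style, but because MapR's files are mutable the lower levels can be merged in place like a B-tree rather than rewritten wholesale.

#include <array>
#include <cstddef>
#include <map>
#include <string>

// Fixed-level hybrid: upper levels act like LSM runs that spill downward
// when full; the bottom level stands in for a mutable on-disk B-tree that
// accepts in-place merges instead of full rewrites.
class HybridTree {
    static constexpr std::size_t kLevels = 3;
    static constexpr std::size_t kSpillAt = 1024;   // toy size/fanout knob
    std::array<std::map<std::string, std::string>, kLevels> levels_;

public:
    void put(const std::string& key, const std::string& val) {
        levels_[0][key] = val;                      // absorb write at the top
        for (std::size_t i = 0; i + 1 < kLevels; ++i) {
            if (levels_[i].size() < kSpillAt) break;
            for (auto& kv : levels_[i])             // in-place merge downward
                levels_[i + 1][kv.first] = kv.second;
            levels_[i].clear();
        }
    }

    const std::string* get(const std::string& key) const {
        for (const auto& level : levels_) {         // newer levels win
            auto it = level.find(key);
            if (it != level.end()) return &it->second;
        }
        return nullptr;
    }
};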

Page 46: Inside MapR's M7

Lesson 2: Physics. Not just a good idea. It's the law.

Page 47: Inside MapR's M7

RPC Reimplementation

At very high data rates, protobuf is too slow
– not good as an envelope; still a great schema definition language
– most systems never hit this limit

Alternative 1
– lazy parsing allows deferral of content parsing
– a naive implementation imposes (yet another) extra copy

Alternative 2
– bespoke parsing of the envelope from the wire
– content packages can land fully aligned and ready for battle directly from the wire

Let's use BOTH ideas (sketch below)
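
Here is a minimal sketch of the combined approach (the envelope layout and magic value are invented for illustration; the talk does not give M7's actual wire format): a tiny hand-parsed fixed-size envelope, with the payload left untouched in the receive buffer and handed to the handler as a pointer, to be decoded lazily only if needed.

#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fixed 12-byte envelope, parsed by hand on the hot path
// instead of with protobuf. The payload is neither copied nor decoded
// here; handlers get a pointer into the receive buffer (zero copy) and
// can still use protobuf as a schema language for the content.
struct Envelope {
    uint32_t magic;        // frame marker (value below is made up)
    uint32_t msg_type;     // dispatch key
    uint32_t payload_len;  // bytes following the envelope
};

bool parse_frame(const uint8_t* buf, std::size_t len,
                 Envelope* env, const uint8_t** payload) {
    if (len < sizeof(Envelope)) return false;         // short read
    std::memcpy(env, buf, sizeof(Envelope));          // 12 cheap bytes
    if (env->magic != 0x4D375250u) return false;      // bad frame
    if (len < sizeof(Envelope) + env->payload_len) return false;
    *payload = buf + sizeof(Envelope);                // in place, no copy
    return true;
}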

Page 48: Inside MapR's M7


Lesson 3: Hacking and abstraction can co-exist

Page 49: Inside MapR's M7

Don't Be Silly

A detailed review of the code revealed an extra copy
– it was subtle. Really.

Performance increased when this copy was eliminated

Not as easy to spot as it sounds
– but absolutely still worth finding and fixing

Page 50: Inside MapR's M7

Part 3: Results

Page 51: Inside MapR's M7

Server Reboot

Full container-reports are tiny
– the CLDB needs 2G of DRAM for a 1000-node cluster

Volumes come online very fast
– each volume is independent of the others
– online as soon as the min-repl number of containers is ready
– no need to wait for the whole cluster (e.g., HDFS waits for 99.9% of blocks to report)

1000-node cluster restart < 5 mins

Page 52: Inside MapR's M7

M7 Provides Instant Recovery

0-40 microWALs per region
– idle WALs go to zero quickly, so most are empty
– region is up before all microWALs are recovered
– recovers regions in the background, in parallel
– when a key is accessed, that microWAL is recovered inline (see the sketch below)
– 1000-10000x faster recovery

Why doesn't HBase do this?
– M7 leverages unique MapR-FS capabilities and is not impacted by HDFS limitations
– no limit to # of files on disk
– no limit to # of open files
– I/O path translates random writes to sequential writes on disk
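
A sketch of the inline-recovery idea (names and structure are illustrative, not the M7 implementation): the region opens immediately, a background worker replays microWALs in parallel, and a read that needs a not-yet-recovered key range replays just its one microWAL first.

#include <array>
#include <mutex>

constexpr int kMicroWals = 40;   // slide says 0-40 microWALs per region

struct Region {
    std::array<std::once_flag, kMicroWals> recovered_;

    void replay(int w) { /* read microWAL w, apply its edits */ }

    // Replays microWAL w exactly once; concurrent callers block until the
    // replay finishes, whether triggered by a reader or the background job.
    void ensure_recovered(int w) {
        std::call_once(recovered_[w], [this, w] { replay(w); });
    }

    void get(int wal_for_key) {
        ensure_recovered(wal_for_key);  // inline recovery of just this WAL
        // ... serve the read from the now-consistent key range ...
    }

    void background_recovery() {        // runs in parallel with reads
        for (int w = 0; w < kMicroWals; ++w) ensure_recovered(w);
    }
};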

Page 53: Inside MapR's M7

Other M7 Features

Smaller disk footprint
– M7 never repeats the key or column name

Columnar layout
– M7 supports 64 column families
– in-memory column families

Online admin
– M7 schema changes on the fly
– delete/rename/redistribute tables

Page 54: Inside MapR's M7

Binary Compatible

HBase applications work "as is" with M7
– no need to recompile (binary compatible)

Can run M7 and HBase side-by-side on the same cluster
– e.g., during a migration
– can access both an M7 table and an HBase table in the same program

Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa:

% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable

Page 55: Inside MapR's M7


M7 vs CDH - Mixed Load 50-50

Page 56: Inside MapR's M7


M7 vs CDH - Mixed Load 50-50

Page 57: Inside MapR's M7


M7 vs CDH - Mixed Load 50-50

Page 58: Inside MapR's M7

59©MapR Technologies - Confidential

Recap

HBase has some excellent core ideas– But is burdened by years of technical debt– Much of the debt was charged on the HDFS credit cards

MapR FS provides ideal substrate for HBase-like service– One hop from client to data– Many problems never even exist in the first place– Other problems have relatively simple solutions with better foundation

Practical results bear out the theory

Page 59: Inside MapR's M7

Me, Us

Ted Dunning, Chief Application Architect, MapR
– Committer and PMC member: Mahout, ZooKeeper, Drill
– Bought the beer at the first HUG

MapR
– Distributes more open source components for Hadoop
– Adds major technology for performance and HA
– Adds industry-standard APIs

Tonight
– Hash tag: #nosqlnow #mapr #fast
– See also: @ApacheMahout @ApacheDrill

@ted_dunning and @mapR

Page 60: Inside MapR's M7
