keys for success from streams to queries

56
© 2014 MapR Technologies 1 © 2014 MapR Technologies Keys for Success from Streams to Queries Ted Dunning

Upload: hadoop-summit

Post on 07-Jan-2017

105 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Keys for Success from Streams to Queries

© 2014 MapR Technologies 1© 2014 MapR Technologies

Keys for Success from Streams to QueriesTed Dunning

Page 2: Keys for Success from Streams to Queries

© 2014 MapR Technologies 2

Contact Information

Ted DunningChief Applications Architect at MapR Technologies

Committer & PMC for Apache’s Drill, Zookeeper & othersVP of Incubator at Apache Foundation

Email [email protected] [email protected]

Twitter @ted_dunning

Page 3: Keys for Success from Streams to Queries

© 2014 MapR Technologies 3

Goals• Real-time or near-time

– Includes situations with deadlines– Also includes situations where delay is simply undesirable– Even includes situations where delay is just fine

• Micro-services– Streaming is a convenient idiom for design– Micro-services … you know we wanted it– Service isolation is a key requirement

Page 4: Keys for Success from Streams to Queries

© 2014 MapR Technologies 4

Real-time or Near-time?• The real point is flow versus state (catch me later today)

• One consequence of flow-based computing is real-time and near-time become relatively easy

• Life may be a bitch, but it doesn’t happen in batches!

Page 5: Keys for Success from Streams to Queries

© 2014 MapR Technologies 6

Agenda• Background / micro-services

• Global requirements

• Scale

Page 6: Keys for Success from Streams to Queries

© 2014 MapR Technologies 7

A microservice is

loosely coupledwith bounded context

Page 7: Keys for Success from Streams to Queries

© 2014 MapR Technologies 8

How to Couple Services and Break micro-ness• Shared schemas, relational stores• Ad hoc communication between services• Enterprise service busses• Brittle protocols• Poor protocol versioning• Single PoF schema repositories

Don’t do this!

Page 8: Keys for Success from Streams to Queries

© 2014 MapR Technologies 9

How to Decouple Services• Use self-describing data • Private database tables• Infrastructural communication between services• Use modern protocols• Adopt future-proof protocol practices

• Use shared storage only where necessary due to scale

Page 9: Keys for Success from Streams to Queries

© 2014 MapR Technologies 11

What is the Right Structure for Flow Compute?• Traditional message queues?

– Message queues are classic answer– Key feature/bug is out-of-order acknowledgement– Many implementations– You pay a huge performance hit for persistence

• Kafka-esque Logs?– Logs are like queues, but with ordering– Out of order consumption is possible, acknowledgement not so much– Canonical base implementation is Kafka– Performance plus persistence

Page 10: Keys for Success from Streams to Queries

© 2014 MapR Technologies 12

ScenariosProfile Database

Page 11: Keys for Success from Streams to Queries

© 2014 MapR Technologies 13

The task

Page 12: Keys for Success from Streams to Queries

© 2014 MapR Technologies 14

Traditional Solution

Page 13: Keys for Success from Streams to Queries

© 2014 MapR Technologies 15

What Happens Next?

Page 14: Keys for Success from Streams to Queries

© 2014 MapR Technologies 16

What Happens Next?

Page 15: Keys for Success from Streams to Queries

© 2014 MapR Technologies 17

How to Get Service Isolation

Page 16: Keys for Success from Streams to Queries

© 2014 MapR Technologies 18

New Uses of Data

Page 17: Keys for Success from Streams to Queries

© 2014 MapR Technologies 19

Scaling Through Isolation

Page 18: Keys for Success from Streams to Queries

© 2014 MapR Technologies 20

Lessons• De-coupling and isolation are key• Private data stores/tables are important,

– But node local storage of private data is a bug• Propagate events, not table updates

• Note that tables and streams should be as easy as files– It should not be necessary to provision a cluster to get them

Page 19: Keys for Success from Streams to Queries

© 2014 MapR Technologies 21

ScenariosIoT Data Aggregation

Page 20: Keys for Success from Streams to Queries

© 2014 MapR Technologies 22

Basic Situation

Each location has many

pumps

Multiple locations

Page 21: Keys for Success from Streams to Queries

© 2014 MapR Technologies 23

What Does a Pump Look Like

TemperaturePressure

Flow

TemperaturePressureFlow

Winding temperature

VoltageCurrent

Page 22: Keys for Success from Streams to Queries

© 2014 MapR Technologies 24

Basic Situation

Each location has many

pumps

Multiple locations

Page 23: Keys for Success from Streams to Queries

© 2014 MapR Technologies 25

Basic Architecture Reflects Business Structure

Page 24: Keys for Success from Streams to Queries

© 2014 MapR Technologies 26

Lessons• Data architecture should reflect business structure

• Even very modest designs involve multiple data centers

• Schemas cannot be frozen in the real world

• Security must follow data ownership

Page 25: Keys for Success from Streams to Queries

© 2014 MapR Technologies 27

ScenariosGlobal Data Recovery

Page 26: Keys for Success from Streams to Queries

© 2014 MapR Technologies 28

Page 27: Keys for Success from Streams to Queries

© 2014 MapR Technologies 29

Page 28: Keys for Success from Streams to Queries

© 2014 MapR Technologies 30

Page 29: Keys for Success from Streams to Queries

© 2014 MapR Technologies 31

Page 30: Keys for Success from Streams to Queries

© 2014 MapR Technologies 32

Lessons• Streams and tables are primitive building blocks

• Arbitrary number of topics important for simplicity + performance

• Updates happen in many places

• Mobility implies change in replication patterns

• Multi-master updates simplify design massively

Page 31: Keys for Success from Streams to Queries

© 2014 MapR Technologies 33

Converged Requirements

Page 32: Keys for Success from Streams to Queries

© 2014 MapR Technologies 34

What Have We Learned?• Need persistence and performance

– Possibly for years and to 100’s of millions t/s• Must have convergence

– Need files, tables AND streams– Need volumes, snapshots, mirrors, permissions and …

• Must have platform security– Cannot depend on perimeter– Must follow business structure

• Must have global scale and scope– Millions of topics for natural designs– Multi-master replication and update

Page 33: Keys for Success from Streams to Queries

© 2014 MapR Technologies 35

The Importance of Common API’s• Commonality and interoperability are critical

– Compare Hadoop eco-system and the noSQL world• Table stakes

– Persistence– Performance– Polymorphism

• Major trend so far is to adopt Kafka API– 0.9 API and beyond remove major abstraction leaks– Kafka API supported by all major Hadoop vendors

Page 34: Keys for Success from Streams to Queries

© 2014 MapR Technologies 36

What we do at MapR

Page 35: Keys for Success from Streams to Queries

© 2014 MapR Technologies 37

Evolution of Data Storage

FunctionalityCompatibility

Scalability

LinuxPOSIX

Over decades of progress,Unix-based systems have set the standard for compatibility and functionality

Page 36: Keys for Success from Streams to Queries

© 2014 MapR Technologies 38

FunctionalityCompatibility

Scalability

LinuxPOSIX

HadoopHadoop achieves much higher scalability by trading away essentially all of this compatibility

Evolution of Data Storage

Page 37: Keys for Success from Streams to Queries

© 2014 MapR Technologies 39

Evolution of Data Storage

FunctionalityCompatibility

Scalability

LinuxPOSIX

Hadoop

MapR enhanced Apache Hadoop by restoring the compatibility while increasing scalability and performance

FunctionalityCompatibility

Scalability

POSIX

Page 38: Keys for Success from Streams to Queries

© 2014 MapR Technologies 40

FunctionalityCompatibility

Scalability

LinuxPOSIX

Hadoop

Evolution of Data Storage

Adding converged tables and streams enhances the functionality of the base file system

Page 39: Keys for Success from Streams to Queries

© 2014 MapR Technologies 41

http://bit.ly/fastest-big-data

Page 40: Keys for Success from Streams to Queries

© 2014 MapR Technologies 42

How we do this with MapR• MapR Streams is a C++ reimplementation of Kafka API

– Advantages in predictability, performance, scale– Common security and permissions with entire MapR converged data

platform• Semantic extensions

– A cluster contains volumes, files, tables … and now streams– Streams contain topics– Can have default stream or can name stream by path name

• Core MapR capabilities preserved– Consistent snapshots, mirrors, multi-master replication

Page 41: Keys for Success from Streams to Queries

© 2014 MapR Technologies 43

MapR core Innovations• Volumes

– Distributed management– Data placement

• Read/write random access file system– Allows distributed meta-data– Improved scaling– Enables NFS access

• Application-level NIC bonding• Transactionally correct snapshots and mirrors

Page 42: Keys for Success from Streams to Queries

© 2014 MapR Technologies 44

MapR's Containers

Each container contains Directories & files Data blocks

Replicated on servers No need to manage

directly

Files/directories are sharded into blocks, whichare placed into containers on disks

Containers are 16-32 GB segments of disk, placed on nodes

Page 43: Keys for Success from Streams to Queries

© 2014 MapR Technologies 45

MapR's Containers

Each container has a replication chain

Updates are transactional Failures are handled by

rearranging replication

Page 44: Keys for Success from Streams to Queries

© 2014 MapR Technologies 46

Container locations and replication

CLDB

N1, N2N3, N2N1, N2N1, N3N3, N2

N1

N2

N3Container location database (CLDB) keeps track of nodes hosting each container and replication chain order

Page 45: Keys for Success from Streams to Queries

© 2014 MapR Technologies 47

MapR ScalingContainers represent 16 - 32GB of data

Each can hold up to 1 Billion files and directories 100M containers = ~ 2 Exabytes (a very large cluster)

250 bytes DRAM to cache a container 25GB to cache all containers for 2EB cluster

But not necessary, can page to disk Typical large 10PB cluster needs 2GB

Container-reports are 100x - 1000x < HDFS block-reports Serve 100x more data-nodes Increase container size to 64G to serve 4EB cluster

Page 46: Keys for Success from Streams to Queries

© 2014 MapR Technologies 48

But Wait, There’s More• Directories and files are implemented in terms of B-trees

– Key is offset, value is data blob– Internal transactional semantics guarantees safety and consistency– Layout algorithms give very high layout linearization

• Tables are implemented in terms of B-trees– Twisted B-tree implementation allows virtues of log-structured merge tree

without the compaction delays– Tablet splitting without pausing, integration with file system transactions

• Common security and permissions scheme

Page 47: Keys for Success from Streams to Queries

© 2014 MapR Technologies 49

And More …• Streams are implemented in terms of B-trees as well

– Topics and consumer offsets are kept in stream, not ZK– Similar splitting technology as MapR DB tables – Consistent permissions, security, data replication

• Standard Kafka 0.9 API• Plans to add OJAI for high-level structuring

• Performance is very high

Page 48: Keys for Success from Streams to Queries

© 2014 MapR Technologies 50

Example

Page 49: Keys for Success from Streams to Queries

© 2014 MapR Technologies 51

Page 50: Keys for Success from Streams to Queries

© 2014 MapR Technologies 52

Lessons• API’s matter more than implementations

• There is plenty of room to innovate ahead of the community

• Posix, HDFS, HBASE all define useful API’s

• Kafka 0.9+ also defines a useful and broadly adopted API

Page 51: Keys for Success from Streams to Queries

© 2014 MapR Technologies 53

Call to action:

Require convergence

Page 52: Keys for Success from Streams to Queries

© 2014 MapR Technologies 54

Page 53: Keys for Success from Streams to Queries

© 2014 MapR Technologies 55

Short Books by Ted Dunning & Ellen Friedman• Published by O’Reilly in 2014 - 2016• For sale from Amazon or O’Reilly• Free e-books currently available courtesy of MapR

http://bit.ly/ebook-real-world-hadoop

http://bit.ly/mapr-tsdb-ebook

http://bit.ly/ebook-anomaly

http://bit.ly/recommendation-ebook

Page 54: Keys for Success from Streams to Queries

© 2014 MapR Technologies 56

Streaming Architectureby Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)

Free copies at book signing today

http://bit.ly/mapr-ebook-streams

Page 55: Keys for Success from Streams to Queries

© 2014 MapR Technologies 57

Thank You!

Page 56: Keys for Success from Streams to Queries

© 2014 MapR Technologies 58

Q & A@mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies