brian bulkowski. aerospike

37
© 2014 Aerospike. All rights reserved. Confidential 1 1M++ queries per second with 100TB+ How, and why, to achieve Operational Big Data Brian Bulkowski CTO and co-founder Aerospike

Upload: volha-banadyseva

Post on 10-May-2015

2.813 views

Category:

Software


3 download

DESCRIPTION

#BigDataBY

TRANSCRIPT

Page 1: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 1

1M++ queries per second with 100TB+

How, and why, to achieve Operational Big Data

Brian Bulkowski CTO and co-founder

Aerospike

Page 2: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 2

REQUIREMENTS FOR INTERNET ENTERPRISES

Page 3: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 3

Introduction to Advertising: Real-time Bidding

Page 4: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 4

North American RTB speeds & feeds

■  1 to 6 billion cookies tracked ■ Some companies track 200M, some track 20B

■  Each bidder has their own data pool ■ Data is your weapon ■ Recent searches, behavior, IP addresses ■ Audience clusters (K-cluster, K-means) from offline Hadoop

■  “Remnant” from Google, Yahoo is about 0.6 million / sec ■  Facebook exchange: about 0.6 million / sec ■  “other” is 0.5 million / sec

Currently about 3.0M / sec in North American

Page 5: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 5

Advertising requirements

■  100 millisecond or 150 millisecond ad delivery

■ De-facto standard set in 2004 by Washington Post and others

■  North America is 70 to 90 milliseconds wide ■ Two or three data centers

■  Auction is limited to 30 milliseconds ■ Typically closes in 5 milliseconds

■  Winners have more data, better models – in 5 milliseconds

Page 6: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 6

MILLIONS OF CONSUMERS BILLIONS OF DEVICES

APP SERVERS

DATA WAREHOUSE INSIGHTS

Advertising Technology Stack

WRITE CONTEXT

OPERATIONAL DB

WRITE REAL-TIME CONTEXT READ RECENT CONTENT PROFILE STORE Cookies, email, deviceID, IP address, location, segments, clicks, likes, tweets, search terms... REAL-TIME ANALYTICS Best sellers, top scores, trending tweets

BATCH ANALYTICS Discover patterns, segment data: location patterns, audience affinity

Page 7: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 7

Financial Services – Intraday Positions

LEGACY DATABASE (MAINFRAME)

Read/Write

Start of Day Data Loading

End of Day Reconciliation

Query REAL-TIME DATA FEED

ACCOUNT POSITIONS

XDR

10M+ user records Primary key access 1M+ TPS planned

Finance App

Records App

RT Reporting App

Page 8: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 8

Social Media

MYSQL or POSTGRES (ROTATIONAL DISK)

Recent user generated content

Java application tier

Data abstraction and sharding

MODIFIED REDIS (SSD ENABLED)

Content and Historical data

Page 9: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 9

Travel Portal

PRICING DATABASE (RATE LIMITED)

Poll for Pricing Changes

PRICING DATA

Store Latest Price

SESSION MANAGEMENT

Session Data

Read Price

XDR

Airlines forced interstate banking Legacy mainframe technology Multi-company reservation and pricing Requirement: 1M TPS allowing overhead

Travel App

Page 10: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 10

SOURCE DEVICE/ USER

QOS & Real-Time Billing for Telcos

■  In-switch Per HTTP request Billing ■ US Telcos: 200M subscribers, 50 metros

■  In-memory use case

Hot Standby

Execute Request

Real-time Checks

DESTINATION

Update Device User Settings

Request

XDR

Real-time Auth. QoS Billing

Config Module App

Page 11: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 11

MILLIONS OF CONSUMERS BILLIONS OF DEVICES

APP SERVERS

BATCH ANALYTICS INSIGHTS

BATCH ANALYTICS

The New Architecture

WRITE CONTEXT

TRANSACTIONS & HOT ANALYTICS

Page 12: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 12

Old Architecture ( scale out in 2000 )

Request routing and sharding

APP SERVERS

CACHE

DATABASE

STORAGE

CONTENT DELIVERY NETWORK

LOAD BALANCER

Page 13: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 13

Modern Scale Out Architecture

Load balancer Simple stateless APP SERVERS

IN-MEMORY NoSQL

RESEARCH WAREHOUSE

CONTENT DELIVERY NETWORK

LOAD BALANCER

Long term cold storage Fast stateless

Page 14: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 14

Modern Scale Out Architecture

Load balancer Simple stateless APP SERVERS

IN-MEMORY NoSQL

RESEARCH WAREHOUSE

CONTENT DELIVERY NETWORK

LOAD BALANCER

Long term cold storage Fast stateless

HDFS BASED

Page 15: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 15

Build a data layer

Use open source

Focus on Key Value

Use In-memory NoSQL

Use Flash

Page 16: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 16

How fast can you go

How to do it

Page 17: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 17

SHARED-NOTHING SYSTEM:100% DATA AVAILABILITY ■  Every node in a cluster is identical,

handles both transactions and long running tasks

■  Data is replicated synchronously with immediate consistency within the cluster

■  Data is replicated asynchronously across data centers

OHIO Data Center

© 2013 Aerospike. All rights reserved Pg. 17

Page 18: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 18

LESSONS 1.  Optimize key-value code paths

■ No hot spots (e.g., robust DHT) ■ Scales up easily (e.g., easy to size) ■ Avoids points of failure (e.g., single node type) ■ Binary protocol

2.  Code in C ■ Read() / Write() / Linux AIO ( don’t trust a library ) ■ Multithread ! ■ Direct device ■ C++ could work, leads to structural complexity

3.  Memory allocation matters ■ Stack based allocators ■ Own stack allocator ■  JEMalloc for pools

© 2013 Aerospike. All rights reserved Pg. 18

Page 19: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 19

LESSONS (cont’d)

4.  Innovation: masters in a shared nothing system ■ Fast cluster organization ■ Fast transaction capabilities ■ Can be CP or AP - and resolve data accurately

5.  Clients are hard ■ Fast stable connection pools are hard ■ API design matters ■ Slow languages need Aerospike more

6.  Aggregations / queries required ■ Row oriented ■ Secondary indexes as filters, and MR style ■ Data transformation hurts

© 2013 Aerospike. All rights reserved Pg. 19

Page 20: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 20

LESSONS (cont’d) 7.  Network interrupts are painful

■ TCP is still better than UDP ■ Some great hardware out there (solarflare) ■ Network queues, RRD ■ New PCI-e, other interfaces are not there yet ■ Larger clusters solve interface issues

8.  “Large data types” (better documents) ■ Ordered List, Map, Set, Stack ■ Time series and documents ■ Beyond per-row storage layout, more optimal than document

9.  UDFs for flexibility

© 2013 Aerospike. All rights reserved Pg. 20

Page 21: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 21

ROBUST DHT TO ELIMINATE HOT SPOTS How Data Is Distributed (Replication Factor 2)

■  Every key is hashed into a 20 byte (fixed length) string using the RIPEMD160 hash function

■  This hash + additional data (fixed 64 bytes) are stored in RAM in the index

■  Some bits from this hash value are used to compute the partition id

■  There are 4096 partitions

■  Partition id maps to node id based on cluster membership

cookie-abcdefg-12345678

182023kh15hh3kahdjsh

Partition ID

Master node

Replica node

… 1 4

1820 2 3

1821 3 2

4096 4 1

© 2013 Aerospike. All rights reserved Pg. 21

Page 22: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 22

REAL-TIME PRIORITIZATION TO MEET SLA

1.  Write sent to row master

2.  Latch against simultaneous writes

3.  Apply write to master memory and replica memory synchronously

4.  Queue operations to disk

5.  Signal completed transaction (optional storage commit wait)

6.  Master applies conflict resolution policy (rollback/ rollforward)

master replica

1.  Cluster discovers new node via gossip protocol

2.  Paxos vote determines new data organization

3.  Partition migrations scheduled

4.  When a partition migration starts, write journal starts on destination

5.  Partition moves atomically

6.  Journal is applied and source data deleted

transactions continue Writing with Immediate Consistency Adding a Node

Page 23: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 23

SQL & NoSQL

➤ Secondary index §  Equality, Range, IN (,,,), Compound §  e.g. WHERE group_id = 1234,

WHERE last_activity > 1349293398, WHERE branch_id IN (5,6,7,8)

➤ Filters

§  SQL: Where clause with non-indexed “AND”s (e.g. “AND gender=‘M’ ”)

§  NOSQL: Map step

➤ Aggregation §  SQL: GROUP BY, ORDER BY, LIMIT, OFFSET §  NOSQL: Reduce step

Secondary Key

Primary Key

Record

Filter Map

Aggregate

DRAM

SSD

Aggregate

Client

Client

Server

Reduce

Aggregate

Query

Page 24: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 24

Key Value Store + Lists, Maps ■  Namespaces (policy containers)

■ Determine storage - DRAM or Flash ■ Determine replication factor ■ Contain records and sets

■  Sets (tables) of records ■ Arbitrary grouping

■  Records (rows) of key/bins ■ Block size (128k – 2MB)

■ Bin with same name can contain values of different types

■  String, integer, bytes (raw, blob, etc) ■  list ( an ordered collection of values ) ■  map ( a collection of keys and values )

■ Bins can be added anytime

■ Meta data ■  Generation counter so apps can ensure that a

record was not modified since last read ■  Time-to-live value for auto expiration, keeping most

recent context or "hot" data, aging out historical context

Page 25: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 25

Flash storage

The Power of Flash Storage

Page 26: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 26

DATABASE

OS FILE SYSTEM

PAGE CACHE

BLOCK INTERFACE

SSD HDD

BLOCK INTERFACE

SSD SSD

OPEN NVM

SSD

Ask me and I’ll tell you the answer. Ask me. I’ll look up the answer and then tell it to you.

DATABASE

HYBRID MEMORY SYSTEM™

•  Direct device access •  Large Block Writes •  Indexes in DRAM •  Highly Parallelized •  Log-structured FS “copy-on-write” •  Fast restart with shared memory

FLASH OPTIMIZED HIGH PERFORMANCE

Page 27: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 27 © 2012 Aerospike. All rights reserved. Pg. 27

Measure your drives! Aerospike Certification Tool (ACT) http://github.com/aerospike/act Transactional database workload Reads: 1.5KB

(can’t batch / cache reads, random) Writes: 128K blocks

(log based layout) (plus defragmentation)

Turn up the load until latency is over required SLA

Page 28: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 28

Aerospike’s Flash Experience

■  Know your Flash ■ ACT benchmark http://github.com/aerospike/act ■ Read-write benchmark results back to 2011

■  All clouds support flash now ■ New EC2 instances ■ Google Compute ■ Internap, Softlayer, GoGrid…

■  Write durability usually not a problem with modern flash

■ Durability is high (5 “drive writes per day” for 5 years, etc) ■ Read performance suffers under write load anyway

Page 29: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 29

Aerospike’s Flash Experience

■  Densities increasing ■ 100G 2 years ago à 800G today ■ SATA vs PCI-E ■ Appliances: 50T per 1U this year

■  Prices still dropping: perhaps $1/G next year

■  Intel P3700 results ■ 250K per device @ $2.5 / G ■ Old standard: Micron P320h 500K @ $8 / G

■  “Wide SATA” ■ 20 SATA drives ■ LSI “pass through mode” ■ 250K+ per server

Page 30: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 30

Use key-value with In-memory NoSQL

■  Primary key access has always been the path to database performance

■  Creates sanitized application logic

■  De-normalize your data, or “join” in the application layer ■  Largest scale systems built on sharded key-value operations

■  Use developer-friendly features: native lists, maps, documents

Amazon EC2 performance with Aerospike

Page 31: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 31

Amazon EC2 results

Page 32: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 32

Amazon EC2 results

Page 33: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 33

Amazon EC2 results

Page 34: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 34

Amazon EC2 results

Page 35: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 35

Use Open Source

Page 36: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 36

Aerospike: the trusted In-Memory NoSQL

Performance • Over ten trillion transactions per month • 99% of transactions < 2 ms • 150K TPS per server

Scalability • Billions of Internet users • Clustered Software • Maintenance without downtime • Scale up & scale out

Reliability • 50 customers; zero down-time • Immediate Consistency • Rapid Failover; Data Center Replication

Price/Performance • Makes impossible projects affordable • Flash-optimized • 1/10 the servers required

Page 37: Brian Bulkowski. Aerospike

© 2014 Aerospike. All rights reserved. Confidential 37