Aerospike Architecture
© 2014 Aerospike. All rights reserved. Confidential 1
Aerospike aer·o·spike [air-oh-spahyk]
noun: 1. the tip of a rocket that enhances speed and stability
DEVELOPING WITH AEROSPIKE
ARCHITECTURE OVERVIEW
IN-MEMORY + NOSQL + ACID
Objectives
This module provides an overview of the Aerospike architecture. At the end
of this module you will have a high-level understanding of:
■ Client
■ Cluster
■ Storage
■Primary & Secondary indexes
■RAM
■Flash
■ Cross Datacenter Replication (XDR)
The Big Picture
Aerospike Goals
Aerospike technology goals are to meet these challenges:
■ Handle extremely high rates of read/write transactions over persistent
data
■ Avoid hot spots to maintain tight latency SLAs
■ Provide immediate consistency with replication
■ Ensure long running tasks do not slow down transactions
■ Scale linearly as data sizes and workloads increase
■ Add capacity with no service interruption
Architecture – The Big Picture
1) No Hotspots – DHT simplifies data partitioning
2) Smart Client – 1 hop to data, no load balancers
3) Shared Nothing Architecture, every node identical
4) Single row ACID – synchronous replication within the cluster
5) Smart Cluster, Zero Touch – auto-failover, rebalancing, rack aware, rolling upgrades
6) Transactions and long-running tasks prioritized in real time
7) XDR – asynchronous replication across data centers for Zero Downtime
What Aerospike offers
■ Row-oriented
■Key-value store
■ Fast
■Like Redis and Memcached
■Whole key space
■ Complex types
■Like Redis and MongoDB
■List/Map/JSON
■Large Data types
■ Queries and Aggregations
■Secondary index
■Sub-second MapReduce
■ High performance
■Linux Daemon
■Multi-core aware
■Multi-socket aware
■No garbage collection issues
■ Run anywhere
■Cloud friendly
■TCP networking
■ Flash optimized
■Near-DRAM Flash performance
Client
Smart Client™
■ The Aerospike Client is
implemented as a library, JAR or
DLL, and consists of 2 parts:
■Operation APIs – These are the
operations that you can execute on the
cluster – CRUD+ etc.
■First-class observer of the Cluster –
monitors the state of each node and
is aware of new nodes or node failures.
From Key to Node
■ Distributed Hash Table with No Hotspots
■Every key is hashed with RIPEMD160
into an ultra-efficient 20-byte (fixed-length) digest
■The hash plus additional data forms a fixed
64-byte index entry in RAM
■12 bits of the hash value (2^12 = 4096) are used to
calculate the Partition ID (4096 partitions)
■Partition ID maps to Node ID in the cluster
■ 1 Hop to data
■Smart Client simply calculates Partition
ID to determine Node ID
■No Load Balancers required
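The key-to-node path above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the client's actual code: the digest input (set name plus key bytes) is simplified, the partition ID is assumed to come from the low 12 bits of the first two digest bytes read little-endian, and `partition_map` is a hypothetical list of node IDs indexed by partition ID. Because some OpenSSL builds disable RIPEMD-160 in `hashlib`, the sketch falls back to SHA-1 (also 20 bytes) purely as a stand-in.

```python
import hashlib

N_PARTITIONS = 4096  # 2**12 partitions, as on the slide


def key_digest(set_name: str, key: str) -> bytes:
    """Return a 20-byte digest of the key (hypothetical, simplified input)."""
    data = set_name.encode() + key.encode()
    try:
        return hashlib.new("ripemd160", data).digest()
    except ValueError:
        # Stand-in only: SHA-1 also yields 20 bytes when RIPEMD-160 is unavailable.
        return hashlib.sha1(data).digest()


def partition_id(digest: bytes) -> int:
    """Assumed rule: first two digest bytes, little-endian, low 12 bits."""
    return int.from_bytes(digest[0:2], "little") & (N_PARTITIONS - 1)


def node_for_key(set_name: str, key: str, partition_map: list[str]) -> str:
    """One hop: digest -> partition ID -> node ID, no load balancer involved."""
    return partition_map[partition_id(key_digest(set_name, key))]


if __name__ == "__main__":
    nodes = ["node-A", "node-B", "node-C"]
    # Hypothetical partition map: partition i is mastered by nodes[i % 3].
    pmap = [nodes[i % len(nodes)] for i in range(N_PARTITIONS)]
    d = key_digest("users", "user:42")
    print(len(d), partition_id(d), node_for_key("users", "user:42", pmap))
```

Because the partition map is all the client needs, routing is a local computation followed by a single network hop.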
Cluster
The Cluster (servers)
■ Local servers
■XDR to remote cluster
■ Automatic Load balancing
■ Quick fail over
■ Detects new nodes (multicast)
■ Rebalances data (measured rate)
■ Add nodes under load
■ Rack awareness
■ “Proxy” requests to the correct node
■ Locally attached storage
Data Distribution
Data is distributed evenly across nodes in a cluster using the Aerospike
Smart Partitions™ algorithm.
■ RIPEMD160 (no collisions yet found)
■ 4096 Data Partitions
■ Even distribution of
■Partitions across nodes
■Records across Partitions
■Data across Flash devices
■ Primary and Replica
Partitions
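The slide does not say how partitions are assigned to nodes. As one illustration of an even, deterministic assignment, the sketch below ranks nodes per partition with rendezvous (highest-random-weight) hashing and takes the first node as master and the next as replica. Rendezvous hashing, the SHA-1 scoring function, and the replication factor of 2 are assumptions for illustration, not necessarily Aerospike's algorithm.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 4096


def _score(node_id: str, pid: int) -> int:
    # Deterministic pseudo-random weight per (node, partition) pair.
    h = hashlib.sha1(f"{node_id}:{pid}".encode()).digest()
    return int.from_bytes(h[:8], "big")


def build_partition_table(node_ids: list[str], replication_factor: int = 2) -> list[list[str]]:
    """For each partition: highest-scoring node is master, next ones are replicas."""
    table = []
    for pid in range(N_PARTITIONS):
        ranked = sorted(node_ids, key=lambda n: _score(n, pid), reverse=True)
        table.append(ranked[:replication_factor])
    return table


if __name__ == "__main__":
    nodes = ["node-A", "node-B", "node-C", "node-D"]
    table = build_partition_table(nodes)
    # Each node should master roughly 4096 / 4 partitions.
    print(Counter(row[0] for row in table))
```

A useful property of this scheme is that removing a node only reassigns the partitions in which that node appeared, which keeps data movement proportional to the change.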
Single Row ACID
Writing with Immediate Consistency
1. Write sent to record master
2. Latch against simultaneous writes
3. Apply write synchronously to master memory and replica memory
4. Queue operations to storage
5. Signal completed transaction
(optional storage commit wait)
6. Master applies conflict resolution policy
(rollback / roll forward)
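The write steps above can be sketched as a single-process model. It is a simplification under stated assumptions: replicas are local objects rather than remote nodes, the storage queue is a `queue.Queue` that a hypothetical writer thread would drain, there is no optional storage-commit wait, and step 6 (conflict resolution) is omitted.

```python
import threading
from queue import Queue


class Replica:
    """In-memory copy of a partition on a replica node (simplified)."""

    def __init__(self) -> None:
        self.memory: dict[str, dict] = {}

    def apply(self, key: str, record: dict) -> None:
        self.memory[key] = dict(record)


class Master:
    def __init__(self, replicas: list[Replica]) -> None:
        self.memory: dict[str, dict] = {}
        self.replicas = replicas
        self.latches: dict[str, threading.Lock] = {}
        self.latch_table_lock = threading.Lock()
        self.storage_queue: Queue = Queue()  # drained by a hypothetical storage writer

    def _latch(self, key: str) -> threading.Lock:
        # One lock per record key, created on first use.
        with self.latch_table_lock:
            return self.latches.setdefault(key, threading.Lock())

    def write(self, key: str, bins: dict) -> str:
        # Step 1: the client has already routed this write to the record master.
        with self._latch(key):  # Step 2: latch against simultaneous writes
            record = {**self.memory.get(key, {}), **bins}
            self.memory[key] = record          # Step 3: master memory...
            for replica in self.replicas:      # ...and replica memory, synchronously
                replica.apply(key, record)
            self.storage_queue.put((key, dict(record)))  # Step 4: queue to storage
        return "OK"  # Step 5: signal completion
```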
Automatic rebalancing
When a node is added or removed, the Cluster automatically
rebalances:
1. Cluster discovers new node via gossip protocol
2. Paxos vote determines new data organization
3. Partition migrations scheduled
4. When a partition migration starts,
write journal starts on destination
5. Partition moves atomically
6. Journal is applied and source data deleted
After migration is complete, the Cluster is evenly
balanced.
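Steps 4 to 6 of a partition migration can be modelled as follows. This is a single-threaded sketch, so the "atomic" move is trivially atomic here; `writes_during_move` is a hypothetical stand-in for client writes that arrive while the partition is in flight, and all names are illustrative.

```python
from typing import Optional


class Partition:
    """One of the 4096 partitions, holding its records (simplified)."""

    def __init__(self, records: Optional[dict] = None) -> None:
        self.records: dict[str, dict] = dict(records or {})


def migrate(source: Partition, writes_during_move: list[tuple[str, dict]]) -> Partition:
    destination = Partition()
    journal: list[tuple[str, dict]] = []  # Step 4: write journal starts on destination
    snapshot = dict(source.records)       # Step 5: the partition moves as one unit
    for key, bins in writes_during_move:  # writes arriving mid-move are journaled
        journal.append((key, bins))
    destination.records.update(snapshot)
    for key, bins in journal:             # Step 6: apply the journal...
        destination.records[key] = {**destination.records.get(key, {}), **bins}
    source.records.clear()                # ...then delete the source data
    return destination
```

Journaling rather than blocking writes is what lets nodes be added under load: the partition keeps accepting writes while its bulk data is copied.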
Cluster formation
Individual nodes go in… and the Cluster forms (heartbeat, gossip protocol, Paxos).
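The heartbeat part of cluster formation can be illustrated with a simple timeout-based failure detector. The interval and the number of missed heartbeats tolerated are hypothetical values, and the Paxos round that agrees on the new membership is not shown; the sketch only decides which nodes still look alive.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    interval_ms: int = 150   # hypothetical heartbeat interval
    timeout_count: int = 10  # hypothetical: missed beats before a node is presumed dead
    last_seen: dict[str, int] = field(default_factory=dict)

    def on_heartbeat(self, node_id: str, now_ms: int) -> None:
        self.last_seen[node_id] = now_ms

    def live_nodes(self, now_ms: int) -> set[str]:
        # A node is live if it was heard from within interval * timeout_count.
        limit = self.interval_ms * self.timeout_count
        return {n for n, t in self.last_seen.items() if now_ms - t <= limit}
```

A change in `live_nodes` is the kind of event that would then trigger a membership vote and partition rebalancing.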
Aerospike Management and Monitoring
Aerospike provides a sophisticated management console: Aerospike
Management Console (AMC).
Plugins are also available for:
■ Graphite
■ Nagios
APIs for monitoring
■ Roll your own tool
■ Hundreds of monitoring parameters
■ Fine-grained latency monitoring
Primary Index
■Hash of hashes of red-black (rb) trees
■ Index entry (64 bytes), containing:
■Write generation
■Time To Live
■Storage address
■Uses shared memory for
Fast Restart
■ Single bin
■Optimization for minimal data volume
■ If using integers, store data in index
rather than in storage
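To make the "64 bytes" concrete, the sketch below packs an index entry with Python's `struct` module. The field order and widths are a hypothetical layout chosen to total 64 bytes; the real entry also holds tree pointers and other metadata, represented here only by reserved padding.

```python
import struct

# Hypothetical layout (little-endian, no alignment padding):
# 20-byte digest | generation (u16) | void time (u32) | device id (u16)
# | block offset (u64) | record size (u32) | 24 reserved bytes
INDEX_ENTRY = struct.Struct("<20sHIHQI24x")
assert INDEX_ENTRY.size == 64


def pack_entry(digest: bytes, generation: int, void_time: int,
               device_id: int, offset: int, size: int) -> bytes:
    return INDEX_ENTRY.pack(digest, generation, void_time, device_id, offset, size)


def unpack_entry(raw: bytes) -> dict:
    digest, gen, void_time, dev, offset, size = INDEX_ENTRY.unpack(raw)
    return {"digest": digest, "generation": gen, "void_time": void_time,
            "device_id": dev, "offset": offset, "size": size}
```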
Primary Index – Single Bin
Storing a key-value entry where the value is a single integer can be very
useful. A namespace can be configured, and optimized, for a single Bin
record.
■ No Bin management overhead
■ Integer – stored in Index for free (data-in-index)
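Data-in-index for a single-bin integer namespace can be sketched as an index entry that carries the value itself instead of a storage address, so reads and writes never touch the storage device. The `IndexEntry` fields and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntry:
    generation: int
    storage_address: Optional[int] = None  # normal case: where the record lives on disk
    int_value: Optional[int] = None        # single-bin integer kept in the entry itself


def write_single_bin(index: dict, key: str, value: int) -> None:
    """Single-bin, data-in-index write: no storage I/O, no bin bookkeeping."""
    old = index.get(key)
    gen = old.generation + 1 if old else 1
    index[key] = IndexEntry(generation=gen, int_value=value)


def read_single_bin(index: dict, key: str) -> Optional[int]:
    entry = index.get(key)
    return None if entry is None else entry.int_value
```

This is why the slide calls the integer "free": it occupies space the index entry already has.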
Secondary Indexes
■ Bin (Column) indexes
■ Low selectivity (hundreds of rows per value)
■ DDL – which Bins to index
■String (equality) or Integer (range)
■ In RAM – fast
■ Multi-node
■Co-located with primary index
■ Reference local data only
■ Index creation
■Tools: AQL, ascli
■Client API – developer only
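A secondary index that references local data only can be modelled as a per-node map from bin value to primary keys, with a sorted list of integer values for range lookups; a query is scattered to every node and the local results are merged. Class and function names are hypothetical, and in this sketch string values support equality lookups only.

```python
import bisect
from collections import defaultdict


class LocalSecondaryIndex:
    """Bin value -> primary keys, for records stored on this node only."""

    def __init__(self, bin_name: str) -> None:
        self.bin_name = bin_name
        self.by_value: defaultdict = defaultdict(set)
        self.sorted_ints: list[int] = []  # distinct integer values, for ranges

    def insert(self, key: str, record: dict) -> None:
        value = record.get(self.bin_name)
        if value is None:
            return
        if isinstance(value, int) and value not in self.by_value:
            bisect.insort(self.sorted_ints, value)
        self.by_value[value].add(key)

    def equal(self, value) -> set:
        return set(self.by_value.get(value, ()))

    def range(self, low: int, high: int) -> set:
        lo = bisect.bisect_left(self.sorted_ints, low)
        hi = bisect.bisect_right(self.sorted_ints, high)
        out: set = set()
        for v in self.sorted_ints[lo:hi]:
            out |= self.by_value[v]
        return out


def query_cluster(node_indexes: list, low: int, high: int) -> set:
    """Scatter the range query to every node, then gather local results."""
    result: set = set()
    for idx in node_indexes:
        result |= idx.range(low, high)
    return result
```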
User Defined Functions
User Defined Functions (UDFs) are an extensibility mechanism:
a UDF is a function that is evaluated in the Cluster, local to the data.
■ UDFs are common in many databases:
■MySQL, SQL Server, Oracle, DB2, Redis, PostgreSQL, and others
■ UDFs move the compute close to the data
■ Lua (and C) today
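Aerospike UDFs are written in Lua; to keep all examples in one language, the sketch below uses Python to show the idea of moving compute to the data: each node runs a hypothetical aggregation UDF over its local records and returns a small partial result, and the client only merges those partials.

```python
from typing import Callable, Iterable


def run_on_node(records: Iterable[dict], udf: Callable[[Iterable[dict]], dict]) -> dict:
    """Server side: the UDF runs next to the data and returns a small result."""
    return udf(records)


def count_and_sum(records: Iterable[dict]) -> dict:
    # Hypothetical UDF: aggregates the "amount" bin of the node's local records.
    n, total = 0, 0
    for r in records:
        if "amount" in r:
            n += 1
            total += r["amount"]
    return {"count": n, "sum": total}


def merge(partials: list[dict]) -> dict:
    """Client side: combine per-node partial results."""
    return {"count": sum(p["count"] for p in partials),
            "sum": sum(p["sum"] for p in partials)}
```

Only the per-node partials cross the network, not the records themselves.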
Storage
Data Storage Layer
Data on Flash / SSD
■ Indexes in RAM (64 bytes per record)
■Low wear
■ Data in Flash (SSD)
■Record data stored contiguously
■ 1 read per record (multithreaded)
■Automatic continuous defragmentation and eviction
■Log-structured file system, “copy on write”
■O_DIRECT, O_SYNC
■Data written in flash-optimal large blocks
■Automatic distribution (no RAID)
■Writes cached
(Figure: Aerospike Hybrid Memory System™, with Aerospike accessing multiple SSDs through the block interface.)
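Since every record costs one 64-byte index entry per copy in RAM, index memory is simple arithmetic. The sketch assumes a replication factor of 2 and counts only the primary index, not secondary indexes or any RAM-resident data; function names are illustrative.

```python
def index_ram_bytes(records: int, replication_factor: int = 2, entry_bytes: int = 64) -> int:
    """Cluster-wide primary-index RAM: one 64-byte entry per record copy."""
    return records * replication_factor * entry_bytes


def per_node_gib(records: int, nodes: int, replication_factor: int = 2) -> float:
    """Index RAM per node in GiB, assuming partitions are spread evenly."""
    return index_ram_bytes(records, replication_factor) / nodes / 2**30
```

For example, 1 billion records with 2 copies need 1e9 × 2 × 64 = 128 × 10^9 bytes (about 119.2 GiB) of index RAM across the cluster, or about 29.8 GiB per node on a 4-node cluster.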
Data in RAM
Data in RAM is very fast – at a price
■ Indexes and Data
■ $$$ (great for < 100 GB, and in the Cloud)
■ More servers
■ Super fast
■ Optional HDD as backing store
Every cluster should have a RAM namespace
■ High frequency and low latency keys (hot keys)
Reading a record (select)
The entire record is read from storage into server RAM; only the
requested Bins are returned to the client.
Writing a new record (insert)
Entries are added to the primary index and to any secondary indexes. The record
is placed in a write buffer to be written to the next available block.
Updating a record (update)
The entire record is read into server RAM, updated as needed, then written
to a new block (copy on write).
The index(es) then point to the new location.
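The insert, read, and update flows above can be sketched with a toy log-structured store: records are appended to a write buffer that is flushed as one large block; an update reads the whole record, modifies it in RAM, and appends a new copy; and the index is repointed to the new location. Block size, names, and the in-memory "device" are hypothetical, and secondary index maintenance is omitted.

```python
from typing import Optional


class LogStructuredStore:
    """Toy device: records are appended to large blocks, never overwritten."""

    def __init__(self, block_records: int = 4) -> None:
        self.block_records = block_records  # hypothetical: records per write block
        self.blocks: list[list[dict]] = []  # flushed blocks on the "device"
        self.write_buffer: list[dict] = []  # current block being filled (cached)
        self.index: dict[str, tuple[int, int]] = {}  # key -> (block, slot)

    def _append(self, key: str, record: dict) -> None:
        # The buffer will become block number len(self.blocks) when flushed.
        self.index[key] = (len(self.blocks), len(self.write_buffer))
        self.write_buffer.append(dict(record))
        if len(self.write_buffer) == self.block_records:
            self.flush()

    def flush(self) -> None:
        if self.write_buffer:
            self.blocks.append(self.write_buffer)  # one large sequential write
            self.write_buffer = []

    def _load(self, key: str) -> dict:
        block, slot = self.index[key]
        src = self.blocks[block] if block < len(self.blocks) else self.write_buffer
        return dict(src[slot])  # the whole record is read into RAM

    def insert(self, key: str, record: dict) -> None:
        self._append(key, record)

    def update(self, key: str, bins: dict) -> None:
        record = self._load(key)   # read the entire record
        record.update(bins)        # modify it in RAM
        self._append(key, record)  # copy on write: new location, index repointed

    def read(self, key: str, bins: Optional[list[str]] = None) -> dict:
        record = self._load(key)
        return record if bins is None else {b: record[b] for b in bins if b in record}
```

The old copy of an updated record stays in its block until defragmentation reclaims it, which is what makes writes purely sequential.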
Deleting a record (delete)
Deleting a record removes its entries
from the indexes only, which is very fast.
Background defragmentation
physically deletes the record at a later
time.
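Delete-then-defragment can be sketched as follows: `delete` touches only the index, and a later defragmentation pass copies the remaining live records out of sparse blocks and frees those blocks. The 50% live-data threshold and all names are hypothetical.

```python
class Device:
    """Toy log-structured device with an in-RAM primary index."""

    def __init__(self) -> None:
        self.blocks: dict[int, dict[str, dict]] = {}  # block id -> {key: record}
        self.index: dict[str, int] = {}               # key -> block id
        self.next_block = 0

    def write_block(self, records: dict[str, dict]) -> None:
        bid = self.next_block
        self.next_block += 1
        self.blocks[bid] = dict(records)
        for key in records:
            self.index[key] = bid

    def delete(self, key: str) -> None:
        self.index.pop(key, None)  # index only: fast, the bytes stay on the device

    def live_fraction(self, bid: int) -> float:
        block = self.blocks[bid]
        live = sum(1 for k in block if self.index.get(k) == bid)
        return live / len(block) if block else 0.0

    def defragment(self, threshold: float = 0.5) -> None:
        """Rewrite live records from sparse blocks, then free those blocks."""
        sparse = [b for b in self.blocks if self.live_fraction(b) < threshold]
        for bid in sparse:
            block = self.blocks.pop(bid)
            live = {k: r for k, r in block.items() if self.index.get(k) == bid}
            if live:
                self.write_block(live)
```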
Cross Datacenter Replication – XDR
XDR Architecture
(Figure labels: each node in the cluster; distributed clusters.)
XDR Topologies
■ Simple Active-Passive
■ Simple Active-Active
■ Star Replication
■ More Complex Topology
Failure Handling
Node failure within a cluster – nodes with replica data continue shipping in place of the failed node.
Link failure – XDR keeps track of link failures and of the data still to be shipped over
that link, and recovers when the link comes back up.
(Figure panels: Node failure in a Cluster; Link failure between Clusters.)
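Link-failure handling can be sketched as a per-destination backlog: locally written keys are queued, shipping stops at the first failed send, and the backlog is retried once the link is back. `send` and `read_record` are hypothetical callables, and shipping the latest version of each record at ship time is an assumption of this sketch.

```python
from collections import deque
from typing import Callable


class XdrShipper:
    """Backlog of records still to ship to one remote cluster (simplified)."""

    def __init__(self, send: Callable[[str, object], None]) -> None:
        self.send = send         # hypothetical: raises ConnectionError when the link is down
        self.pending = deque()   # keys written locally but not yet shipped

    def on_local_write(self, key: str) -> None:
        self.pending.append(key)

    def ship(self, read_record: Callable[[str], object]) -> int:
        """Ship pending keys in order; stop at the first link failure."""
        shipped = 0
        while self.pending:
            key = self.pending[0]
            try:
                self.send(key, read_record(key))
            except ConnectionError:
                break  # link is down: keep the backlog and retry later
            self.pending.popleft()
            shipped += 1
        return shipped
```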
Shipping
XDR provides fine-grained control over shipping:
■ Namespaces
■ Sets
■ Compression
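Namespace/set filtering and compression can be sketched like this. The filter parameters are hypothetical and are not XDR configuration names; the batch is JSON-encoded and compressed with `zlib` purely to illustrate trading CPU for WAN bandwidth.

```python
import json
import zlib
from typing import Optional


def should_ship(namespace: str, set_name: str,
                ship_namespaces: set, ship_sets: Optional[set]) -> bool:
    """Hypothetical filter: ship configured namespaces, and sets if a set list is given."""
    if namespace not in ship_namespaces:
        return False
    return ship_sets is None or set_name in ship_sets


def encode_batch(records: list[dict], compress: bool = True) -> bytes:
    payload = json.dumps(records).encode()
    return zlib.compress(payload) if compress else payload


def decode_batch(blob: bytes, compressed: bool = True) -> list[dict]:
    return json.loads(zlib.decompress(blob) if compressed else blob)
```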
Summary
You have learned about the Aerospike architecture
■ Client
■ Cluster
■ Storage
■Primary & Secondary indexes
■RAM
■Flash
■ Cross Datacenter Replication (XDR)