Scaling up with Aerospike!

Scaling up (and easing) operations at 1 Million TPS @ <1 ms latency. LSPE, Jun 14, 2014

Upload: anshu-prateek

Post on 13-Jul-2015


TRANSCRIPT

Page 1: Scaling up with Aerospike!

Scaling up (and easing) operations at 1 Million TPS @ <1 ms latency.

LSPE, Jun 14, 2014

Page 2: Scaling up with Aerospike!

Agenda of this talk

● Some types of Big Data?
● What are the problems that come with scale?
● What is the solution? (Or how Aerospike tackles these problems and how it is the solution to them.)

Page 3: Scaling up with Aerospike!

● Anshu Prateek
● Aerospike Devops Lead
● Ex - Yahoo! Search Operations
● http://about.me/anshuprateek
● [email protected]

Page 4: Scaling up with Aerospike!

Big Data Type

● Volume – Hadoop – PB / Hrs of jobs
● Variety – ETL – Many data sources, mashup, analyze
● Velocity – Do it fast, do it now!

→ Volume and Variety need Velocity to be useful.

Page 5: Scaling up with Aerospike!

What starts failing at scale?

● Machines / hardware
● Network
● Unplanned load
● Operator error

Page 6: Scaling up with Aerospike!

Big Data..

● Volume – Hadoop – PB / Hrs of jobs
● Variety – ETL – Many data sources, mashup, analyze
● Velocity – Do it fast, do it now!

→ Volume and Variety need Velocity to be useful.

Page 7: Scaling up with Aerospike!

Velocity in Aerospike

● Latency
  – Page SLA 700 ms, Ads SLA 50 ms → Data store <5 ms
  – Hybrid DRAM + SSD optimized storage
● Throughput
  – Horizontal scalability (Linear is desirable)

Page 8: Scaling up with Aerospike!

Prod example:

● 20 Nodes
● 1.6TB per node
● 50GB DRAM usage
● 14 Billion objects
● 70k TPS (r+w) per node peak
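The per-node figures on this slide imply cluster-wide totals. A rough sketch of that arithmetic follows; it assumes every node carries the same peak load and the same storage, which the slide does not state, and the function name is illustrative only.

```python
# Back-of-the-envelope totals for the production example above.
# Assumption (not from the slide): load and storage are spread evenly
# across all nodes, so per-node figures can simply be multiplied out.

NODES = 20
STORAGE_PER_NODE_TB = 1.6
PEAK_TPS_PER_NODE = 70_000  # reads + writes


def cluster_totals(nodes, storage_per_node_tb, peak_tps_per_node):
    """Return aggregate peak TPS and aggregate storage (TB) for the cluster."""
    return {
        "peak_tps": nodes * peak_tps_per_node,
        "storage_tb": nodes * storage_per_node_tb,
    }


totals = cluster_totals(NODES, STORAGE_PER_NODE_TB, PEAK_TPS_PER_NODE)
print(f"peak TPS across cluster: {totals['peak_tps']:,}")
print(f"storage across cluster:  {totals['storage_tb']:.1f} TB")
```

Under that even-spread assumption the cluster peaks at about 1.4 million TPS, which is consistent with the "1 Million TPS" figure in the talk's title.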

Page 9: Scaling up with Aerospike!
Page 10: Scaling up with Aerospike!

● 98% of queries < 1 ms

Page 11: Scaling up with Aerospike!

Yet another prod graph...

Page 12: Scaling up with Aerospike!

What starts failing at scale?

● Machines / hardware
● Network
● Unplanned load
● Operator error

Page 13: Scaling up with Aerospike!

Start scaling with Aerospike..

● Machines / hardware
  – Replication / auto-balancing
● Network
  – Availability of islands
  – Auto balancing with eventual consistency
● Unplanned load
  – Have a lot of headroom
● Operator error
  – What if the system reduces operational needs?
  – Tools
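The "headroom" point can be made concrete with a small calculation: when nodes drop out and the cluster rebalances, the surviving nodes absorb their traffic. The sketch below assumes traffic is spread evenly across nodes; the function names and the example numbers are illustrative and do not come from the talk.

```python
def load_after_failures(total_tps, nodes, failed):
    """Per-node TPS once `failed` nodes are gone and load is spread evenly."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return total_tps / survivors


def required_headroom(nodes, failed):
    """Fraction of extra capacity each node needs to absorb `failed` losses."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return nodes / survivors - 1


# Hypothetical example: a 20-node cluster serving 1,000,000 TPS loses 2 nodes.
per_node = load_after_failures(1_000_000, nodes=20, failed=2)
extra = required_headroom(nodes=20, failed=2)
print(f"per-node load after failure: {per_node:,.0f} TPS")
print(f"headroom needed per node:    {extra:.1%}")
```

The smaller the cluster, the larger the share each survivor has to pick up, so small clusters need proportionally more headroom per node.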

Page 14: Scaling up with Aerospike!

Operational Ease

● Reducing initial setup time
  – Auto sharding
  – Auto cluster discovery
● Configuration
  – People don't read documents
    ● RTFM!
  – Good default values
  – Retain the power to control when needed
    ● Static configs
    ● Dynamic configs

Page 15: Scaling up with Aerospike!

Tools

● Do all nodes have the same config?
  – asmonitor -e 'compareconfig'
● What's the cluster status?
  – asmonitor -e 'info'
● Oops, this needs to be changed!
  – asinfo -v 'set-config:context=service;letschangethis=value'
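When dynamic config changes are scripted rather than typed by hand, it helps to build the command in one place. The following is a minimal sketch that only assembles and prints the `asinfo` invocation shown on the slide; the helper name `set_config_command` is hypothetical, and `letschangethis` / `value` are the slide's own placeholders, not real parameter names.

```python
import shlex


def set_config_command(context, param, value):
    """Build the argv for a dynamic config change via `asinfo -v`.

    Only the command is constructed here; nothing is executed.
    """
    info_cmd = f"set-config:context={context};{param}={value}"
    return ["asinfo", "-v", info_cmd]


# Placeholder parameter name and value, exactly as on the slide.
argv = set_config_command("service", "letschangethis", "value")
print(shlex.join(argv))
```

Passing the list form to something like `subprocess.run` avoids shell-quoting problems with the `;` inside the info command, but that step would need `asinfo` installed on the machine.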

Page 16: Scaling up with Aerospike!

Tools

● Nagios
● Graphite
● AMC

Page 17: Scaling up with Aerospike!

Capacity Planning

Page 18: Scaling up with Aerospike!

Managing with AMC

Page 19: Scaling up with Aerospike!

Managing with AMC

Page 20: Scaling up with Aerospike!

Managing with AMC

Page 21: Scaling up with Aerospike!

Headroom!

● How many TPS can we do ?

Page 22: Scaling up with Aerospike!
Page 23: Scaling up with Aerospike!

● 330 GCE
● 300 x 1TB
● Debian, Cassandra 2.2
● Median Latency – 10.3 ms
● 95% < 23 ms

Page 24: Scaling up with Aerospike!

Aerospike

Page 25: Scaling up with Aerospike!
Page 26: Scaling up with Aerospike!
Page 27: Scaling up with Aerospike!