Running Cassandra on Amazon EC2

Upload: dave-gardner

Post on 13-Dec-2014

DESCRIPTION

What are the challenges of running Apache Cassandra on Amazon EC2? Is it a good idea? In this presentation, we explore reasons for and against running the distributed database Cassandra on EC2. We look at the I/O performance of EC2 and

TRANSCRIPT

Page 1: Running Cassandra on Amazon EC2

@cassandralondon

Page 2: Running Cassandra on Amazon EC2

Thanks

Page 3: Running Cassandra on Amazon EC2

Reminder

Next meetup Wednesday 8th December

Jake Luciani will be giving a talk on "Lucandra" (a Cassandra backend for Lucene open source search software)

Page 4: Running Cassandra on Amazon EC2

Quick intro to Cassandra

• Decentralized

• Fault-tolerant

• Tunable consistency

• Elasticity
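
"Tunable consistency" is usually explained with three numbers: N replicas per key, W replicas that must acknowledge a write, and R replicas that must answer a read. A minimal sketch of the standard overlap rule (a read is guaranteed to include the latest acknowledged write when R + W > N) follows. The function names are made up for illustration and are not part of any Cassandra API.

```python
def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when any r replicas must overlap any w replicas out of n."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


if __name__ == "__main__":
    n = 3  # illustrative replication factor
    for r, w in [(1, 1), (1, n), (quorum(n), quorum(n))]:
        print(f"N={n} R={r} W={w} overlap={reads_see_latest_write(n, r, w)}")
```

Lower R and W trade that guarantee for latency and availability, which is the "tunable" part.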

Page 5: Running Cassandra on Amazon EC2

This talk

Why consider EC2?

What are the challenges of running Cassandra on EC2?

Is it a good idea?

Page 6: Running Cassandra on Amazon EC2

Cassandra design decisions

Cassandra designed to run on many commodity servers

It is designed to deal with unreliable hardware and networks

Page 7: Running Cassandra on Amazon EC2

Why consider EC2?

On demand instances

“frees you from the costs and complexities of planning, purchasing, and maintaining hardware and transforms what are commonly large fixed costs into much smaller variable costs”

http://aws.amazon.com/ec2/pricing/

Page 8: Running Cassandra on Amazon EC2

Why consider EC2?

Multiple “Availability Zones” in multiple regions (US East, US West, Ireland and Singapore)

http://aws.amazon.com/ec2/

Page 9: Running Cassandra on Amazon EC2

Writing to Cassandra

1. Write added to local log on target machine

2. Memtable updated

3. Memtable flushed to disk as data files (SSTable plus SSTable Index)

4. Eventually data files are compacted

http://wiki.apache.org/cassandra/ArchitectureOverview#Write_path

IO

IO

IO
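
The four write steps above can be sketched as a toy, single-node model. This is an illustration under assumptions, not Cassandra's implementation: the class name, the JSON file format and the row-count flush threshold are all hypothetical, and the SSTable index is left out. It does show where disk I/O happens: the log append, the flush and the compaction.

```python
import json
import os
import tempfile


class TinyNode:
    """Toy write path: local log, memtable, data files, compaction."""

    def __init__(self, data_dir, memtable_limit=2):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (rows)
        self.memtable = {}                    # key -> (value, timestamp), memory only
        self.sstables = []                    # data file paths, oldest first
        self.generation = 0                   # gives each data file a unique name
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")

    def write(self, key, value, timestamp):
        # Step 1: append to the local log (sequential disk I/O).
        self.commit_log.write(json.dumps([key, value, timestamp]) + "\n")
        self.commit_log.flush()
        # Step 2: update the memtable (no disk I/O).
        self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def _write_data_file(self, rows):
        path = os.path.join(self.data_dir, f"sstable-{self.generation}.json")
        self.generation += 1
        with open(path, "w") as f:
            json.dump(sorted(rows.items()), f)  # rows are stored sorted by key
        return path

    def flush(self):
        # Step 3: write the memtable out as an immutable, sorted data file.
        if self.memtable:
            self.sstables.append(self._write_data_file(self.memtable))
            self.memtable = {}

    def compact(self):
        # Step 4: merge all data files into one, keeping the newest value per key.
        merged = {}
        for path in self.sstables:
            with open(path) as f:
                for key, (value, ts) in json.load(f):
                    if key not in merged or ts >= merged[key][1]:
                        merged[key] = (value, ts)
        old_files = self.sstables
        self.sstables = [self._write_data_file(merged)]
        for path in old_files:
            os.remove(path)

    def close(self):
        self.commit_log.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        node = TinyNode(d)
        node.write("a", "1", 1)
        node.write("b", "2", 2)  # reaches the threshold, so the memtable is flushed
        node.write("a", "3", 3)
        node.write("c", "4", 4)  # second flush
        print("data files before compaction:", len(node.sstables))
        node.compact()
        print("data files after compaction:", len(node.sstables))
        node.close()
```

Note that a write never seeks: the log append and the flush are both sequential, which is why write-heavy workloads are less sensitive to seek performance than random reads.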

Page 10: Running Cassandra on Amazon EC2

Reading from Cassandra

1. Read from any node

2. Partitioner

3. Wait for R responses

4. Wait for N – R responses in the background and perform read repair

http://wiki.apache.org/cassandra/ArchitectureOverview#Read_path

IO

IO
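
The read steps above can be sketched as follows. This is a simplified, synchronous model under assumptions: the Replica class and coordinated_read function are hypothetical, the list of replicas is taken to be whatever the partitioner already selected for the key, and the "background" repair simply runs before the function returns.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    data: dict = field(default_factory=dict)  # key -> (value, timestamp)

    def read(self, key):
        return self.data.get(key)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[1]:
            self.data[key] = (value, timestamp)


def coordinated_read(replicas, key, r):
    """Answer from the first r responses, then repair stale replicas."""
    responses = [(rep, rep.read(key)) for rep in replicas]

    # Step 3: the caller only waits for r responses; the newest timestamp wins.
    waited_for = [resp for _, resp in responses[:r] if resp is not None]
    result = max(waited_for, key=lambda vt: vt[1], default=None)

    # Step 4: compare the remaining N - R responses and push the newest
    # version to any replica that is missing it (read repair).
    everything = [resp for _, resp in responses if resp is not None]
    newest = max(everything, key=lambda vt: vt[1], default=None)
    if newest is not None:
        for rep, resp in responses:
            if resp != newest:
                rep.write(key, *newest)
    return result


if __name__ == "__main__":
    a = Replica("a", {"k": ("old", 1)})
    b = Replica("b", {"k": ("new", 2)})
    c = Replica("c")
    print(coordinated_read([a, b, c], "k", r=1))  # may be stale: only "a" was waited for
    print(coordinated_read([a, b, c], "k", r=1))  # "a" has been repaired by now
```

With r=1 the first read can return a stale value, but the repair means later reads do not. Raising r narrows that window at the cost of waiting on more disk reads.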

Page 11: Running Cassandra on Amazon EC2

Reading from Cassandra

Reads from multiple SSTables

The application use-case will affect performance and what the bottleneck is (totally random reads being worst case)

IO
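
A rough sketch of why reads touch several SSTables: until compaction runs, versions of the same key can sit in more than one data file, so a read has to check each candidate file and keep the newest version. The function below is hypothetical and models each file as an in-memory dict; the probe count stands in for disk seeks.

```python
def read_key(sstables, key):
    """sstables: list of dicts mapping key -> (value, timestamp).

    Returns (value or None, number of data files probed).
    """
    newest = None
    probes = 0
    for table in sstables:
        probes += 1  # each probe stands in for at least one disk seek
        hit = table.get(key)
        if hit is not None and (newest is None or hit[1] > newest[1]):
            newest = hit
    value = newest[0] if newest is not None else None
    return value, probes


if __name__ == "__main__":
    files = [
        {"user:1": ("alice", 1)},
        {"user:2": ("bob", 2)},
        {"user:1": ("alice-renamed", 3)},
    ]
    print(read_key(files, "user:1"))
```

Totally random reads are the worst case because almost every probe misses the OS cache and becomes a real seek.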

Page 12: Running Cassandra on Amazon EC2

The challenges

Getting good enough I/O performance

Not a huge number of resources on the Internet (new and shiny)

Some minor setup and monitoring challenges (documentation is available)

Page 13: Running Cassandra on Amazon EC2

EC2 I/O performance

Ephemeral or EBS; low, moderate or high I/O performance indicators

“other resources like the network and the disk subsystem are shared among instances… when a resource is under-utilized you will often be able to consume a higher share of that resource”

http://aws.amazon.com/ec2/instance-types/

Page 14: Running Cassandra on Amazon EC2

EBS or ephemeral?

Jonathan Ellis recently on mailing list:

“we recommend using raid0 ephemeral disks on EC2 with L or XL instance sizes for better i/o performance.”

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cold-boot-performance-problems-tp5615829p5615889.html

http://www.coreyhulen.org/?p=326

Page 15: Running Cassandra on Amazon EC2

EBS or ephemeral?

Amazon suggest EBS is better:

“Amazon EBS is particularly suited for applications that require a database, file system, or access to raw block level storage”

http://aws.amazon.com/ebs/

Page 16: Running Cassandra on Amazon EC2

“The latency and throughput of Amazon EBS volumes is designed to be significantly better than the Amazon EC2 instance stores in nearly all cases. You can also attach multiple volumes to an instance and stripe across the volumes. This is one way to improve I/O rates, especially if your application performs a lot of random access across your data set.”

http://aws.amazon.com/ebs/
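
To show what "stripe across the volumes" means, here is a minimal sketch of the usual RAID 0 address mapping: consecutive chunks go to consecutive volumes in round-robin order, so random accesses spread over all of them. The function name and the 64 KiB chunk size are illustrative assumptions, not values from the talk.

```python
def stripe_location(offset: int, chunk_size: int, n_volumes: int):
    """Map a logical byte offset to (volume index, byte offset on that volume)."""
    chunk = offset // chunk_size        # which chunk of the logical device
    volume = chunk % n_volumes          # chunks are dealt out round-robin
    stripe_row = chunk // n_volumes     # how many full stripes precede this chunk
    return volume, stripe_row * chunk_size + offset % chunk_size


if __name__ == "__main__":
    chunk = 64 * 1024  # illustrative chunk size
    for logical in (0, chunk, 4 * chunk, 5 * chunk + 10):
        print(logical, "->", stripe_location(logical, chunk, n_volumes=4))
```

Striping adds no redundancy: losing any one volume loses the array, which matters less when Cassandra already replicates data across nodes.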

Page 17: Running Cassandra on Amazon EC2

EC2 I/O benchmark

Throughput measured using dd

Seek measured using seeker.c

Software RAID uses mdadm

http://www.linuxinsight.com/how_fast_is_your_disk.html

http://en.wikipedia.org/wiki/Mdadm
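
The benchmark in the talk used dd and seeker.c directly. As a rough stand-in, the sketch below measures the same two quantities, sequential throughput and random reads per second, against an ordinary file. The function names and sizes are assumptions, and the random-read figure will be inflated by the OS page cache unless the file is much larger than RAM, so treat the numbers as indicative only.

```python
import os
import random
import tempfile
import time


def sequential_write_mb_per_s(path, total_mb=64, block_kb=1024):
    """Write total_mb of zeros in block_kb blocks and return MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = total_mb * 1024 // block_kb
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the time to reach the disk
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_mb / elapsed


def random_reads_per_s(path, reads=2000, block=4096):
    """Read `block` bytes at random offsets and return reads per second."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(reads):
            os.lseek(fd, random.randrange(0, size - block), os.SEEK_SET)
            os.read(fd, block)
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
    return reads / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "bench.dat")  # place this on the volume under test
        print(f"sequential write: {sequential_write_mb_per_s(target):.1f} MB/s")
        print(f"random reads:     {random_reads_per_s(target):.0f} reads/s")
```

To compare EBS with ephemeral storage, point the target path at a file system on each volume (or on the mdadm array) and run it several times.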

Page 18: Running Cassandra on Amazon EC2
Page 19: Running Cassandra on Amazon EC2
Page 20: Running Cassandra on Amazon EC2

Which is better?

EBS has better throughput, ephemeral better for random seeks

Generic benchmarks aren’t great – depends on your use case

Warning: EC2 performance not consistent
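
Because results vary from run to run, a single measurement says little. One hedged way to quantify the inconsistency is to repeat a benchmark and report the spread as well as the mean. The helper below is a sketch, and the sample values in the demo are invented placeholders, not measurements from the talk.

```python
import statistics


def summarize(samples):
    """Return mean, standard deviation, coefficient of variation, min and max."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean if mean else 0.0,  # spread relative to the mean
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    made_up_runs_mb_per_s = [8.0, 12.0, 10.0, 9.0, 11.0]  # placeholder numbers only
    print(summarize(made_up_runs_mb_per_s))
```

A high coefficient of variation suggests differences between configurations may be within the noise, which is the same caveat Corey Hulen raises about "normal EC2 fluctuations".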

Page 21: Running Cassandra on Amazon EC2

EC2 Cassandra benchmark

Read and write TPS

Benchmarks carried out by Corey Hulen

http://www.coreyhulen.org/?p=326

Page 22: Running Cassandra on Amazon EC2
Page 23: Running Cassandra on Amazon EC2
Page 24: Running Cassandra on Amazon EC2

Which is better?

Corey suggests:

“Raid 0 EBS drives are the way to go”

“We didn’t notice a difference above the normal EC2 fluctuations when testing for 2 vs 4 drives”

Page 25: Running Cassandra on Amazon EC2

Conclusions

Cassandra will run acceptably on EC2, but real HW is better

It will depend on your use case – particularly the types of read that you do

Real HW may work out cheaper

Page 26: Running Cassandra on Amazon EC2

Conclusions

Ephemeral I/O seems to be better than EBS, although EBS has other advantages (doesn’t disappear if you stop the node)

Again, it will depend on use case

Page 27: Running Cassandra on Amazon EC2

Conclusions

Large nodes are the best bet

Small nodes have poor I/O

Extra large nodes are probably not worth it (better to have more nodes)

http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Nodes-dropping-out-of-cluster-due-to-GC-tp5128481p5131568.html

Page 28: Running Cassandra on Amazon EC2

Questions?

Please leave feedback on meetup.com

Follow @cassandralondon on Twitter