Scaling Cassandra up and down into containers with ZFS
Chris Burroughs
AddThis
2015-09-24
Chris Burroughs (AddThis) #CassandraSummit 2015-09-24 1 / 52
Hello!
Chris Burroughs [email protected] @csby54
Engineer at AddThis
Co-organizer of the Cassandra DC Meetup
Occasional contributor
interrupt me!
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
Table of Contents
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
AddThis by the Numbers
80,000 request/second (3 billion views/day)
Tools on over 14 million domains
Mostly Java on Linux
Moving towards SOA / microservices
Multiple engineering “squads” with significant discretion
Cassandra at AddThis
Cassandra in production since 0.6
About a dozen clusters, new one created per use-case or SLA
Primarily used for latency sensitive, read-mostly storage
Every cluster is multi-DC
Virtually every page load with AddThis tools results in at least one read to Cassandra
Table of Contents
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
On Abstractions
Typical Storage:
Many moving parts: block devices, partitions, raid, volume manager.
Big Plan Up Front: changing partitions is either not done or painful.
Data integrity not warm and fuzzy: hope fsck works.
Typical Memory:
Virtual memory and malloc/free. Add more DRAM if needed.
Maybe worry about NUMA, at runtime.
ZFS
A storage sub-system (fs, raid, volume manager)
Always consistent on disk (no fsck)
End-to-end data integrity
Universal: file-system, block, NFS, SMB
Concise, simple administrative tools
Scalable data structures (2^78 byte max pool size)
Started by Jeff Bonwick and Matthew Ahrens at Sun around 2001.
Available for: Illumos, Solaris, FreeBSD, Linux, Mac OS X
Does for storage what VM did for memory.
Timeline
2001: development started at Sun by Jeff Bonwick and Matthew Ahrens
2005: ZFS source code released
2008: ZFS released in FreeBSD 7.0
2010: Oracle proprietary fork, illumos project continues open-source development
2013: ZFS on (native) Linux GA
2013: Open-source ZFS bands together to form OpenZFS
2014: (new) OpenZFS for Mac OS X launch
Universal Storage
Compression, snapshots, etc. are common features of all datasets.
COW Bonus: Snapshots
Snapshots: read-only copy of a dataset at a point in time

                           create  delete  incremental
Traditional (rsync-esque)   O(n)    O(n)     O(n)
ZFS                         O(1)    O(∆)     O(∆)
Clones
“Clone” a snapshot to create a writeable dataset
Only pay for the difference in accumulated changes
Clones with no changes take up no space!
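A minimal sketch of the snapshot → clone workflow (the dataset names are hypothetical; this is an admin-only config fragment that assumes root on a host with an existing `tank` pool):

```shell
# Read-only, point-in-time snapshot of a dataset.
zfs snapshot tank/sstables@before-upgrade

# Clone the snapshot into a writable dataset. The clone shares
# all blocks with the snapshot, so it initially uses ~no space.
zfs clone tank/sstables@before-upgrade tank/sstables-test

# USED grows only as the clone diverges from the snapshot.
zfs list -t all -o name,used,refer
```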
Wait, there is more!
End-To-End data integrity
Online Everything: Expansion, scrubbing, resilvering, “fsck”, etc.
Copy On Write with linearized writes
Transparent compression
Dataset send/recv
Flexible mount points
Nested property based configuration
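To illustrate the dataset send/recv item above, a hedged sketch (the hostname `backuphost` and dataset names are assumptions; requires root and ZFS on both ends):

```shell
# Initial full replication of a snapshot to another host.
zfs send tank/sstables@snap1 | ssh backuphost zfs recv tank/backup/sstables

# Incremental: stream only the blocks changed between snap1 and snap2.
zfs send -i tank/sstables@snap1 tank/sstables@snap2 \
    | ssh backuphost zfs recv tank/backup/sstables
```

Because ZFS already knows the delta between two snapshots, the incremental stream is O(∆) rather than an rsync-style O(n) tree walk.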
zpool A collection of devices that provides physical storage and data replication.
vdev A device (or collection of devices) with certain performance or fault-tolerance characteristics. The building blocks of a zpool.
dataset The “data things” (usually filesystems) created on your zpool. Nested in a hierarchy.
property Key-value pairs used for configuration or reporting of datasets. Inherited in the hierarchy.
ARC Adaptive Replacement Cache. Like a ZFS-specific page cache.
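Property inheritance down the dataset hierarchy can be seen directly (pool and dataset names hypothetical; an admin-only config fragment):

```shell
# Set compression once at the top of the hierarchy...
zfs set compression=lz4 tank

# ...and every child dataset inherits it. The SOURCE column shows
# "local" on tank and "inherited from tank" on its children.
zfs get -r -o name,value,source compression tank
```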
The ARC: your new best friend
$ arcstat.py -f time,read,miss,hit% 1
time read miss hit%
16:26:04 16 7 56
16:26:05 1.4K 671 52
16:26:06 1.9K 900 53
16:26:07 2.1K 972 53
16:26:08 1.5K 697 54
16:26:09 2.1K 906 56
16:26:10 1.9K 844 54
$ arcstat.py -f time,read,miss,hit% 1
time read miss hit%
16:25:47 5.1K 0 100
16:25:48 413K 0 100
16:25:49 403K 0 100
16:25:50 402K 0 100
16:25:51 452K 1 99
16:25:52 553K 0 100
16:25:53 354K 0 100
dstat plugin
$ dstat --cpu --zfs-arc --zfs-l2arc
----total-cpu-usage---- -----------ZFS-ARC----------- -------------ZFS-L2ARC-------------
usr sys idl wai hiq siq| mem hit miss reads hit%| size hit miss hit% read write
1 1 94 4 0 0|15.0G 796B 372B 1167B 68.2B| 206G 343B 28.8B 92.3B 37.8M 137k
1 1 95 4 0 0|15.0G 557B 395B 952B 58.5B| 206G 376B 19.0B 95.2B 41.5M 0
1 1 95 3 0 0|15.0G 553B 358B 911B 60.7B| 206G 344B 14.0B 96.1B 38.2M 0
1 1 95 4 0 0|15.0G 686B 412B 1098B 62.5B| 206G 396B 16.0B 96.1B 43.7M 0
1 1 95 4 0 0|15.0G 712B 409B 1121B 63.5B| 206G 386B 23.0B 94.4B 42.5M 0
1 1 96 3 0 0|15.0G 446B 331B 777B 57.4B| 206G 307B 24.0B 92.7B 34.0M 0
1 1 86 13 0 0|15.0G 708B 332B 1040B 68.1B| 206G 310B 22.0B 93.4B 33.7M 0
2 1 93 4 0 0|15.0G 1094B 294B 1388B 78.8B| 206G 280B 14.0B 95.2B 31.0M 4608B
hat tip: @AlTobey
Intelligent Prefetch
Cassandra test while running cat */*.junk > /dev/null:
page cache: 98th percentile reads reported by Cassandra increased 4-6x
ARC: 98th percentile reads reported by Cassandra increased 2x
LSM trees (Cassandra, HBase, LevelDB, RocksDB, BerkeleyDB) mean linear scans are common on modern storage systems.
Commands
Only two you are ever likely to use:
zpool Configure pools
zfs Configure file systems
zdb (Detailed debugging dump)
production-esque example
# zpool create -f tank mirror /dev/sdb /dev/sdc \
mirror /dev/sdd /dev/sde \
cache /dev/sdf
# zfs create tank/sstables
# zfs set mountpoint=/data/sstables tank/sstables
# zfs set compression=lz4 tank
# zfs set atime=off tank
# chown cassandra:cassandra /data/sstables/
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 374G 1.42T 30K /tank
tank/sstables 374G 1.42T 374G /data/sstables
# zfs get compressratio tank/sstables
NAME PROPERTY VALUE SOURCE
tank/sstables compressratio 1.08x -
ZFS on Linux History
In days gone by there was a FUSE project.
Port started by LLNL as a backend for their supercomputer.
Early 2013: “Ready for wide scale deployment on everything from desktops to supercomputers.”
Late 2014: Illumos/FreeBSD/Linux developers form OpenZFS group to coordinate development.
Today
0.6.5 released September 2015
Native packages for most distributions
Active user community
http://zfsonlinux.org
#zfsonlinux on freenode
[email protected]
Zero in front of the version number?
clusterhq.com/blog/state-zfs-on-linux/
Close to feature parity with Illumos and FreeBSD.
Key end-to-end data integrity features work on Linux like other platforms.
Performance is workload dependent.
ZFS on Linux may be better than other options today for your use cases. It is not better in all cases.
Table of Contents
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
Initial Problem Statement
Large-ish Cassandra cluster serving ML-derived data about URLs using AddThis tools.
Internal DC storage SLA: 98th percentile of 35 ms
Data size and request volume growing; failing to meet the SLA even while throwing hardware at it.
Multiple revenue lines & products affected, or launches blocked, by cluster performance.
Zipfian web traffic with a very long tail.
Setup
# zpool create -f tank mirror /dev/sdb /dev/sdc \
mirror /dev/sdd /dev/sde \
cache /dev/sdf
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
cache
sdf ONLINE 0 0 0
Results
Twice the performance with half the physical nodes.
(Mileage will vary with workload and DRAM:SSD:working-set ratios.)
Table of Contents
1 Cassandra at AddThis
2 ZFS
3 Scaling up
4 Scaling down
Cute Little Clusters
Datacenter: IAD
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN x.xx.xxx.125 154.14 MB 256 ? 87d41c52-2b25-466b-93c1-d65c72f5fc61 NOP
UN x.xx.xxx.124 154.17 MB 256 ? c1a44486-2133-40fc-ba9d-0e671c5b2fc7 NOP
UN x.xx.xxx.126 154.17 MB 256 ? 824f6018-eba6-4b44-b716-3b4eeaf69228 NOP
Constraints
Need more efficient hardware allocation for small clusters (multi-tenancy)
Among the most latency sensitive services
Non-trivial legacy network requirements
Application transparency
Infrastructure transparency (inventory, dns, dhcp, config management)
ZFS Enabling Containerization
Create from base image → zfs clone
Backup running applications → zfs snapshot
Migrate to new host → zfs send/recv
Manage quotas → zfs properties
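A hedged sketch of how those mappings combine when provisioning and moving a container (the base-image dataset tank/images/base, the container dataset name, and the host newhost are all illustrative assumptions):

```shell
# Snapshot a golden base image once...
zfs snapshot tank/images/base@gold

# ...and clone it per container: near-instant, near-zero space.
zfs clone tank/images/base@gold tank/lxc/T55e9928f940e0155/rootfs

# Quotas and other per-container settings are just dataset properties.
zfs set quota=150G tank/lxc/T55e9928f940e0155

# Migration: replicate a snapshot stream to the new physical host.
zfs snapshot tank/lxc/T55e9928f940e0155@move
zfs send tank/lxc/T55e9928f940e0155@move \
    | ssh newhost zfs recv tank/lxc/T55e9928f940e0155
```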
Glue
$ ./bin/port register-container --hostname=HOSTNAME
$ ./bin/port build-container --tag=CONTAINER_TAG \
--resources=standard-small \
--host-tag=PHYSICAL_HOST_TAG
$ ./bin/cobbling-time add-chef-roles --tag=CONTAINER_TAG \
--roles='ROLES'
$ ./bin/cobbling-time signoff --tag=CONTAINER_TAG \
--next-status=Allocated
# lxc-ls -l
drwxrwx--- 3 root root 5 Apr 17 15:23 drydock-2015-04-17T15:23:02
drwxrwx--- 3 root root 5 Mar 12 2015 T5501a951541e62d5
drwxrwx--- 3 root root 5 Mar 17 2015 T55081d472f641f10
drwxrwx--- 3 root root 5 Apr 14 10:27 T552d238170fc63be
drwxrwx--- 3 root root 5 Apr 16 11:38 T552fd746af5be6e9
drwxrwx--- 3 root root 5 Sep 4 08:46 T55e9928f940e0155
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank/lxc 52.8G 2.60T 79K /lxc
tank/lxc/T5501a951541e62d5 483M 150G 957M /lxc/T5501a951541e62d5/rootfs
tank/lxc/T55081d472f641f10 15.4G 585G 15.8G /lxc/T55081d472f641f10/rootfs
tank/lxc/T552d238170fc63be 22.1G 278G 22.5G /lxc/T552d238170fc63be/rootfs
tank/lxc/T552fd746af5be6e9 8.53G 591G 8.51G /lxc/T552fd746af5be6e9/rootfs
tank/lxc/T55e9928f940e0155 1.45G 124G 1.87G /lxc/T55e9928f940e0155/rootfs
tank/lxc/drydock-2015-04-17T15:23:02 734M 2.60T 734M /lxc/drydock-2015-04-17T15:23:02/rootfs
Results
> 2x consolidation
. . . able to defer hardware purchase for a year
Clean method for multiple Cassandra clusters per physical host. Can continue to break apart clusters by use case and SLA!
Virtually every view (3 billion/day) of AddThis tools involves Cassandra, in a container, on ZFS.
Summary & Future Work
ZFS allows us to significantly improve the efficiency of both large and small clusters
ZFS is fundamental to container storage
Future: Continued performance investigations (align block sizes?)
Future: Is this going to evolve into writing our own IaaS?
We Are Hiring
http://www.addthis.com/careers