Cassandra at Glogster
Roman Komkov – [email protected]
System Engineer at Glogster
Prague Cassandra Meetup, 03.09.2015
About me
2 years at Glogster EDU as System Engineer
5+ years of Linux administration
5+ years of Python development
Cluster, HA, Orchestration
CI, CD…
Twitter - @alkoengineering
GitHub, Freenode - decayofmind
About Glogster EDU
Started in 2009
A platform for presentations and interactive learning, used mainly by educators and students
19 million users
Over 45 million glogs
40,000 new glogs daily
Web service, mobile applications
http://edu.glogster.com
Cassandra at Glogster
Since 2011: primary DB for the original Glogster.com
Since 2012: backend (storage) DB for Glogster EDU
Started on 0.6… or 0.8, I guess…
10 nodes
RF=5, QUORUM
SATA disks
OrderPreservingPartitioner ¯\_(ツ)_/¯
Architecture
Cassandra now
5-node cluster
~600 GB average node size
RF=5, QUORUM
SSD disks
Vnodes
OrderPreservingPartitioner…
pycassa + datastax-driver
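As a side note, here is a minimal sketch of what RF=5 with QUORUM looks like from the client side, using the datastax Python driver; the contact points, keyspace, and table names are made-up examples, not our real schema:

    # Minimal sketch: QUORUM reads/writes with the DataStax Python driver.
    # Contact points, keyspace and table names are hypothetical.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['10.0.0.1', '10.0.0.2'])
    session = cluster.connect('glogs')  # hypothetical keyspace

    # With RF=5, QUORUM waits for 3 of 5 replicas, so reads and writes
    # keep working with up to two nodes down per token range.
    write = SimpleStatement(
        "INSERT INTO glog_data (glog_id, body) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write, (42, 'hello'))

    read = SimpleStatement(
        "SELECT body FROM glog_data WHERE glog_id = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    for row in session.execute(read, (42,)):
        print(row.body)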
0.8 problems
Migration required downtime (transferring a full copy of the data)
HintedHandoff hell
No repairs, no cleanups
Enormous heap size (20 GB)
Unsynchronized clocks across servers
SOLUTION!
Upgrade to 1.0
1.1 problems
The Cassandra guy left Glogster
Don’t touch it while it works
BUT…
Load averages like 14.0-16.0
2 disks failed
Everything is slow
Repairs? Never heard of them!
1.1 solutions
Replace disks, rebuild nodes. Don't try to run repair on a new node instead of using ReplaceToken (see the sketch after this list)
Move the old Glogster.com keyspace to another cluster
Load gone
https://glogster.github.io/posts/2015/03/23/cassandra-migration.html
Nodes are fast again
Regular repairs and cleanups? Still never did them!
OpsCenter installed
Cluster upgraded to 1.2
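A quick sketch of the ReplaceToken path we should have taken, assuming the packaged cassandra launcher accepts -D options on the command line; the token value is a placeholder:

    # Sketch: bring up a replacement node for a dead one via
    # cassandra.replace_token (Cassandra 1.x), instead of bootstrapping
    # a fresh node and hoping repair fills it in.
    # The token value below is a placeholder.
    import subprocess

    DEAD_NODE_TOKEN = '85070591730234615865843651857942052864'

    subprocess.check_call([
        'cassandra', '-f',
        '-Dcassandra.replace_token=' + DEAD_NODE_TOKEN,
    ])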
1.2 and migration
Cluster migrated to the new servers without downtime
http://www.planetcassandra.org/blog/cassandra-migration-to-ec2/
Vnodes
…
The old datacenter, still serving production, was disconnected from the new datacenter
Forgot about the hints TTL (max_hint_window_in_ms, ~3 hours)
Forgot to run repair on the cluster afterwards (the right sequence is sketched below)
Old DC was decommissioned
Application switched to the new one
…
DATA GONE
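In hindsight, the safe order of operations is easy to write down: repair the new datacenter while the old one is still alive, and only then decommission. A rough sketch driving nodetool over SSH; host names are placeholders:

    # Sketch: repair first, decommission second. Host names are placeholders.
    import subprocess

    NEW_DC_NODES = ['cass-new-1', 'cass-new-2', 'cass-new-3']
    OLD_DC_NODES = ['cass-old-1', 'cass-old-2']

    # Make sure every node in the new DC holds a complete copy of its
    # ranges while the old DC can still stream data to it.
    for host in NEW_DC_NODES:
        subprocess.check_call(['ssh', host, 'nodetool', 'repair'])

    # Only now is it safe to retire the old nodes, one at a time.
    for host in OLD_DC_NODES:
        subprocess.check_call(['ssh', host, 'nodetool', 'decommission'])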
Here the hell begins
~1,200 glogs remained only on the old, decommissioned datacenter
Thank God we had RF=<N of nodes>
Transfer data from one old node to the new server
Run Cassandra on it, add node to the cluster
Run repair on entire cluster
Increase the read-repair probability with read_repair_chance (see the sketch below)
Peacefully wait until it's done…
Run complicated repairs through OpsCenter, because it can resume a repair that failed
Full repair?
10 DAYS!!!
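For reference, a sketch of the read_repair_chance bump through the datastax Python driver; the keyspace and table names are hypothetical. At 1.0, every read also checks all replicas of the row and repairs stale ones in the background:

    # Sketch: temporarily force read repair on every read.
    # Keyspace/table names are hypothetical.
    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])  # placeholder contact point
    session = cluster.connect()

    # Every read now compares all replicas and repairs stale ones.
    session.execute(
        "ALTER TABLE glogs.glog_data WITH read_repair_chance = 1.0")

    # ... after the cluster has converged, drop it back down:
    session.execute(
        "ALTER TABLE glogs.glog_data WITH read_repair_chance = 0.1")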
Conclusions and Improvements
Increase the max_hint_window_in_ms value to something like 3 days (259200000 ms)
Make use of parallel operations
CQL3 and datastax-driver
Upgrade to Cassandra 2.2: faster repairs and other operations
New OpsCenter
Schedule regular backups and repairs (a sketch follows this list)
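And a rough sketch of the kind of scheduled maintenance we mean: a cron-launched Python script walking the ring with nodetool; host names and timing are placeholders:

    # Sketch: periodic maintenance pass. Host names are placeholders.
    import subprocess

    NODES = ['cass-1', 'cass-2', 'cass-3', 'cass-4', 'cass-5']

    for host in NODES:
        # -pr repairs only the node's primary ranges, so one pass over
        # all nodes covers the whole ring exactly once.
        subprocess.check_call(['ssh', host, 'nodetool', 'repair', '-pr'])
        # A snapshot doubles as a cheap on-node backup.
        subprocess.check_call(['ssh', host, 'nodetool', 'snapshot'])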
We still love Cassandra!
Questions?