Cassandra at Glogster
Roman Komkov – [email protected]
System Engineer at Glogster
Prague Cassandra Meetup, 03.09.2015
About me
2 years at Glogster EDU as System Engineer
5+ years of Linux administration
5+ years of Python development
Cluster, HA, Orchestration
CI, CD…
Twitter - @alkoengineering
GitHub, Freenode - decayofmind
About Glogster EDU
Started in 2009
A platform for presentations and interactive learning, used mainly by educators and students
19 million users
Over 45 million glogs
40,000 new glogs daily
Web service, mobile applications
http://edu.glogster.com
Cassandra at Glogster
Since 2011: primary DB for the original Glogster.com
Since 2012: backend (storage) DB for Glogster EDU
Started on 0.6… or 0.8, I guess…
10 nodes
RF=5, QUORUM
SATA disks
OrderPreservingPartitioner ¯\_(ツ)_/¯
Architecture
Cassandra now
5-node cluster
~600 GB average node size
RF=5, QUORUM
SSD disks
Vnodes
OrderPreservingPartitioner…
pycassa + datastax-driver
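As a side note, here is a minimal sketch of what RF=5 with QUORUM looks like from the client side, using the datastax Python driver; the contact points, keyspace, and table names are made-up examples, not our real schema:

    # Minimal sketch: QUORUM reads/writes with the DataStax Python driver.
    # Contact points, keyspace and table names are hypothetical.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['10.0.0.1', '10.0.0.2'])
    session = cluster.connect('glogs')  # hypothetical keyspace

    # With RF=5, QUORUM waits for 3 of 5 replicas, so reads and writes
    # keep working with up to two nodes down per token range.
    write = SimpleStatement(
        "INSERT INTO glog_data (glog_id, body) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write, (42, 'hello'))

    read = SimpleStatement(
        "SELECT body FROM glog_data WHERE glog_id = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    for row in session.execute(read, (42,)):
        print(row.body)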
0.8 problems
Migration required downtime (transferring a full copy of the data)
HintedHandoff hell
No repairs, no cleanups
Enormous heap size (20 GB)
Unsynchronized clocks across servers
SOLUTION!
Upgrade to 1.0
1.1 problems
The Cassandra guy left Glogster
Don’t touch it while it works
BUT…
Load averages like 14.0-16.0
2 disks failed
Everything is slow
Repairs? Never heard of them!
1.1 solutions
Replace disks, rebuild nodes. Don't try to run repair on a new node instead of using ReplaceToken (see the sketch after this list)
Move the old Glogster.com keyspace to another cluster
Load gone
https://glogster.github.io/posts/2015/03/23/cassandra-migration.html
Nodes are fast again
Regular repairs and cleanups? Still never did them!
OpsCenter installed
Cluster upgraded to 1.2
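A quick sketch of the ReplaceToken path we should have taken, assuming the packaged cassandra launcher accepts -D options on the command line; the token value is a placeholder:

    # Sketch: bring up a replacement node for a dead one via
    # cassandra.replace_token (Cassandra 1.x), instead of bootstrapping
    # a fresh node and hoping repair fills it in.
    # The token value below is a placeholder.
    import subprocess

    DEAD_NODE_TOKEN = '85070591730234615865843651857942052864'

    subprocess.check_call([
        'cassandra', '-f',
        '-Dcassandra.replace_token=' + DEAD_NODE_TOKEN,
    ])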
1.2 and migration
Cluster migrated to the new servers without downtime
http://www.planetcassandra.org/blog/cassandra-migration-to-ec2/
Vnodes
…
The old datacenter, still serving production, was disconnected from the new datacenter
Forgot about the hints TTL (max_hint_window_in_ms, ~3 hours)
Forgot to run repair on the cluster afterwards (the right sequence is sketched below)
Old DC was decommissioned
Application switched to the new one
…
DATA GONE
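In hindsight, the safe order of operations is easy to write down: repair the new datacenter while the old one is still alive, and only then decommission. A rough sketch driving nodetool over SSH; host names are placeholders:

    # Sketch: repair first, decommission second. Host names are placeholders.
    import subprocess

    NEW_DC_NODES = ['cass-new-1', 'cass-new-2', 'cass-new-3']
    OLD_DC_NODES = ['cass-old-1', 'cass-old-2']

    # Make sure every node in the new DC holds a complete copy of its
    # ranges while the old DC can still stream data to it.
    for host in NEW_DC_NODES:
        subprocess.check_call(['ssh', host, 'nodetool', 'repair'])

    # Only now is it safe to retire the old nodes, one at a time.
    for host in OLD_DC_NODES:
        subprocess.check_call(['ssh', host, 'nodetool', 'decommission'])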
Here the hell begins
~1,200 glogs remained only on the old, decommissioned datacenter
Thank God we had RF=<N of nodes>
Transfer data from one old node to the new server
Run Cassandra on it, add node to the cluster
Run repair on entire cluster
Increase the read-repair probability with read_repair_chance (see the sketch below)
Peacefully wait until it's done…
Run complicated repairs through OpsCenter, because it can resume a repair that failed
Full repair?
10 DAYS!!!
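For reference, a sketch of the read_repair_chance bump through the datastax Python driver; the keyspace and table names are hypothetical. At 1.0, every read also checks all replicas of the row and repairs stale ones in the background:

    # Sketch: temporarily force read repair on every read.
    # Keyspace/table names are hypothetical.
    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])  # placeholder contact point
    session = cluster.connect()

    # Every read now compares all replicas and repairs stale ones.
    session.execute(
        "ALTER TABLE glogs.glog_data WITH read_repair_chance = 1.0")

    # ... after the cluster has converged, drop it back down:
    session.execute(
        "ALTER TABLE glogs.glog_data WITH read_repair_chance = 0.1")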
Conclusions and Improvements
Increase the max_hint_window_in_ms value to something like 3 days (259200000 ms)
Make use of parallel operations
CQL3 and datastax-driver
Upgrade to Cassandra 2.2: faster repairs and other operations
New OpsCenter
Schedule regular backups and repairs (a sketch follows this list)
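And a rough sketch of the kind of scheduled maintenance we mean: a cron-launched Python script walking the ring with nodetool; host names and timing are placeholders:

    # Sketch: periodic maintenance pass. Host names are placeholders.
    import subprocess

    NODES = ['cass-1', 'cass-2', 'cass-3', 'cass-4', 'cass-5']

    for host in NODES:
        # -pr repairs only the node's primary ranges, so one pass over
        # all nodes covers the whole ring exactly once.
        subprocess.check_call(['ssh', host, 'nodetool', 'repair', '-pr'])
        # A snapshot doubles as a cheap on-node backup.
        subprocess.check_call(['ssh', host, 'nodetool', 'snapshot'])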
We still love Cassandra!
Questions?