Building an Activity Feed with Cassandra
Mark Dunphy, Software Engineer Behance/Adobe @dunphtastic
Disclaimer
Not an operations person.
Will pretend to be one for the purpose of this talk.
Quick Overview
What is the Behance Activity Feed?
• Actions
  • Comments, Appreciations, etc.
• Entities
  • Projects, Works in Progress
• Actors
  • Users
Project Entity
Actions taken by actors
Activity Fan Out
User A publishes a new project
Write to Follower A’s feed
Write to Follower B’s feed
Write to Follower C’s feed
Write to Follower D’s feed
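The fan-out above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Behance implementation: the `followers` and `feeds` dictionaries and the `fan_out` function are hypothetical in-memory stand-ins for the follower graph and the feed store.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the follower graph and per-user feeds.
followers = {"user_a": ["follower_a", "follower_b", "follower_c", "follower_d"]}
feeds = defaultdict(list)


def fan_out(actor, verb, entity_id):
    """Write one copy of the activity into each follower's feed."""
    activity = {"actor": actor, "verb": verb, "entity_id": entity_id}
    targets = followers.get(actor, [])
    for follower in targets:
        feeds[follower].append(activity)
    # Number of feed writes caused by this single action.
    return len(targets)


# User A publishes a new project (entity id 42 is made up for the example).
writes = fan_out("user_a", "published", 42)
```

One action by one actor turns into one write per follower, which is why write volume grows with the size of a user's network.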
Now that that’s over…
MongoDB 2011
• Smaller user base (~340,000).
• Built very quickly. Worked well at the time.
• Not well researched.
Fast forward to 2014
• Frequent node failures
• Heavy disk fragmentation caused by deletes
• Slow reads from disk. Started storing in RAM.
• Primary -> Secondary failovers caused downtime for some users.
• Scaled both vertically and horizontally.
Why Cassandra?
• Riak
  • Very close. Community seemed lacking.
• Redis
  • No native cluster. Too much maintenance.
• Memcached/MySQL
  • Too much complex app logic.
Cassandra Wins.
• Fantastic community. #cassandra on Twitter
• Easy to read documentation
• Linearly scalable. Easy to grow cluster.
• Low maintenance overhead for ops team.
• Handles time series data very well.
Learning
• Cassandra Summit 2014
• Other team in Adobe
• Long nights reading documentation
Our Data
• Ephemeral
• “Source of truth” lives in a MySQL database
• Okay with *some* data loss
Our Rules
• A user’s feed is composed of entities, each with one set of actions
• User’s feed only contains one of any given entity
• An entity’s set of actions contains up to seven of the most recent actions taken by that user’s network
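A minimal sketch of these rules in Python, assuming a feed is modeled as a dictionary keyed by entity id. The `push_action` function and the dictionary layout are illustrative assumptions, not the production code.

```python
MAX_ACTIONS = 7  # "up to seven of the most recent actions"


def push_action(feed, entity_id, action):
    """Add an action to an entity in a feed.

    `feed` maps entity_id -> list of actions, newest first. Keying by
    entity_id means the feed can only ever hold one of any given entity.
    """
    actions = [action] + feed.get(entity_id, [])
    feed[entity_id] = actions[:MAX_ACTIONS]  # drop the oldest beyond seven
    return feed


feed = {}
# Ten different actors act on the same project (ids are made up).
for n in range(10):
    push_action(feed, 1001, {"actor_id": n, "verb_id": 1})
```

After the loop the feed still contains a single entity, holding only the seven newest actions.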
Planning
Language Support
• Most services on Behance are PHP
• No official DataStax PHP driver
–Mark Dunphy, 2014
“Looks like I’m learning Python.”
Go to Production
No, nothing is working yet. I didn’t skip a slide.
• App/cluster in production before anything works
• Test real life load
• Fail spectacularly without anybody noticing
• Deploy risky changes without fear
• Run alongside MongoDB
January 19th, 2015
Query Patterns
• “Create your data models based on the queries you want to run” - Basically Everybody
• Wanted to…
  • Read a user’s feed entities by type and time of most recent action…separately.
  • Write/Update a user’s feed entities with new actions while knowing only user id and entity id
Data Models
–Mark Dunphy, January 2015
“An UPDATE in Cassandra works like an UPSERT! Let’s store the user’s entire feed in a single row in a table! It’s so simple!”
First Data Model
CREATE TYPE activity.action ( created_on timestamp, secondary_entity_id int, actor_id int, verb_id int);
CREATE TYPE activity.entity ( entity_type_id int, entity_id int);
CREATE TABLE activity.project_actions ( modified_on timestamp, entity_id int, user_id int, actions list<frozen<action>>, PRIMARY KEY(user_id, entity_id));
CREATE TABLE activity.feeds ( modified_entities list<frozen<entity>>, modified_on timestamp, project_ids list<int>, user_id int, wip_revision_ids list<int>, PRIMARY KEY(user_id));
Moments Before Everything Exploded
–Mark Dunphy, January 2015
“Okay let’s keep nearly the same model, but use INSERT and DELETE instead of always UPDATE. Just use batch statements.”
Second Data Model
This was also a very very bad idea.
• Lose the benefit of Cassandra being distributed
• All queries in a batch go through the same coordinator, which puts a lot of stress and responsibility on one node.
• Use concurrency and prepared statements instead. DataStax drivers make this easy.
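As a rough sketch of "concurrency instead of batches", the example below issues each write as its own statement through a thread pool, using only the standard library. The `execute` function is a hypothetical stand-in for running a prepared statement, and `INSERT_CQL`, `write_concurrently`, and the row values are made up for illustration. With the DataStax Python driver, the corresponding pieces would be `Session.prepare` and `execute_concurrent_with_args` from `cassandra.concurrent`.

```python
from concurrent.futures import ThreadPoolExecutor

# Statement text that would be prepared once and reused for every write.
INSERT_CQL = "INSERT INTO projects (user_id, created_on, entity_id) VALUES (?, ?, ?)"


def execute(statement, params):
    """Hypothetical stand-in for executing a prepared statement."""
    return (statement, params)


def write_concurrently(rows, concurrency=8):
    """Send each row as an independent statement instead of one big batch."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(execute, INSERT_CQL, row) for row in rows]
        # Results come back in the same order the rows were submitted.
        return [future.result() for future in futures]


# One feed write per follower; ids and the timestamp are arbitrary.
rows = [(user_id, 1421625600000, 42) for user_id in range(1, 6)]
results = write_concurrently(rows)
```

Because every statement is sent separately, no single coordinator has to handle the whole set of writes on its own.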
Oops
Okay…
Now we’ve got it.
Winning Data Model
CREATE TYPE activity.action ( created_on timestamp, secondary_entity_id int, actor_id int, verb_id int);
CREATE TABLE activity.projects ( created_on timestamp, user_id int, entity_id int, actions list<frozen<action>>, PRIMARY KEY(user_id, created_on, entity_id));
CREATE TABLE activity.project_actions ( modified_on timestamp, entity_id int, user_id int, actions list<frozen<action>>, PRIMARY KEY(user_id, entity_id));
Much Nicer
Write Strategy
• “User A comments on Project A. User B follows User A.”
• Request out to add the comment action to User B’s feed
• Read existing actions for that entity (Project A) in B’s feed. Push new action on top.
• Write new actions list into new “row” in projects table
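The write steps above might look roughly like this when modeled with two in-memory dictionaries standing in for the `projects` and `project_actions` tables. Several details are assumptions rather than facts from the talk: that `project_actions` serves as the lookup by user id and entity id, that its `modified_on` value points at the current `projects` row, and that the previous `projects` row is deleted so the feed keeps one row per entity.

```python
MAX_ACTIONS = 7

# Stand-ins for the two tables, keyed the way their primary keys are defined.
project_actions = {}  # (user_id, entity_id) -> {"modified_on", "actions"}
projects = {}         # (user_id, created_on, entity_id) -> actions


def add_action(user_id, entity_id, action, now):
    """Read existing actions, push the new one on top, write a new row."""
    key = (user_id, entity_id)
    existing = project_actions.get(key)
    old_actions = existing["actions"] if existing else []
    actions = ([action] + old_actions)[:MAX_ACTIONS]
    if existing:
        # Assumption: remove the old time-ordered row for this entity.
        projects.pop((user_id, existing["modified_on"], entity_id), None)
    projects[(user_id, now, entity_id)] = actions
    project_actions[key] = {"modified_on": now, "actions": actions}


# User B (id 2) receives two actions on Project A (id 10); values are made up.
add_action(2, 10, {"actor_id": 1, "verb_id": 5}, now=100)
add_action(2, 10, {"actor_id": 7, "verb_id": 3}, now=200)
```

The write path only needs the user id and entity id, while the feed row ends up keyed by the time of the most recent action.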
Read Strategy
• SELECT * FROM projects WHERE user_id = 123 AND created_on > 123214373
• Optimized for quick/easy reads. It is more important that a user’s feed loads quickly than that it updates quickly.
• Use timestamp to “page” through data.
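Paging by timestamp can be sketched as follows, with a plain list standing in for the `projects` table. The `read_page` function, the `limit` parameter, the ascending order, and the sample rows are assumptions for illustration. The filter mirrors the `created_on >` comparison in the query above.

```python
def read_page(rows, user_id, after, limit=2):
    """Mimic: SELECT * FROM projects WHERE user_id = ? AND created_on > ?

    Returns one page of rows plus the timestamp to use as the next cursor.
    """
    matches = sorted(
        (r for r in rows if r["user_id"] == user_id and r["created_on"] > after),
        key=lambda r: r["created_on"],
    )
    page = matches[:limit]
    # The last timestamp on this page becomes the cursor for the next call.
    next_cursor = page[-1]["created_on"] if page else after
    return page, next_cursor


rows = [
    {"user_id": 123, "created_on": t, "entity_id": e}
    for t, e in [(100, 1), (200, 2), (300, 3)]
]
rows.append({"user_id": 456, "created_on": 150, "entity_id": 9})

first, cursor = read_page(rows, 123, after=0)
second, cursor2 = read_page(rows, 123, after=cursor)
```

Each call passes the last timestamp it saw, so the reader never needs an offset or any server-side paging state.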
Lessons Learned
• Duplicate your data to achieve desired queries. Storage is cheap. Writes are cheap.
• Think outside the box. Cassandra is not relational.
• Never ever ever ignore inserts/deletes in favor of an update-only workflow. Never. It is literally insane.
Final Specs
• 16 node cluster on AWS EC2 c3.8xlarge
• Mix of SizeTieredCompactionStrategy and DateTieredCompactionStrategy
• NetworkTopologyStrategy
• Replication factor 3
• ConsistencyLevel = ONE for most requests
• Bursty write volume. Consistent read volume.
• 5k to 80k writes per second
• 2k to 4k reads per second
Questions?
I might have answers.
Thank you!