Building an Activity Feed with Cassandra

Mark Dunphy, Software Engineer Behance/Adobe @dunphtastic

Disclaimer

Not an operations person.

Will pretend to be one for the purpose of this talk.

Quick Overview

What is the Behance Activity Feed?

• Actions

• Comments, Appreciations, Etc

• Entities

• Projects, Works in Progress

• Actors

• Users

Project Entity

Actions taken by actors

Activity Fan Out

User A publishes a new project

Write to Follower A’s feed

Write to Follower B’s feed

Write to Follower C’s feed

Write to Follower D’s feed
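The fan-out above can be sketched in a few lines of Python. This is only an in-memory illustration: the dict-based `followers_of` and `feeds` structures and the `fan_out` name are assumptions made for the example, not the Behance implementation.

```python
from collections import defaultdict


def fan_out(actor_id, activity, followers_of, feeds):
    """Append one activity to the feed of every follower of actor_id."""
    for follower_id in followers_of.get(actor_id, ()):
        feeds[follower_id].append(activity)
    return feeds


# Hypothetical data: User A has four followers, as on the slide.
followers_of = {"user_a": ["follower_a", "follower_b", "follower_c", "follower_d"]}
feeds = defaultdict(list)
fan_out("user_a", {"verb": "published", "entity": "project_1"}, followers_of, feeds)
```

One publish turns into one write per follower, which is why write volume grows with the size of a user's network.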

Now that that’s over…

MongoDB 2011

• Smaller user base (~340,000).

• Built very quickly. Worked well at the time.

• Not well researched.

Fast forward to 2014

• Frequent node failures

• Heavy disk fragmentation caused by deletes

• Slow reads from disk. Started storing in RAM.

• Primary -> Secondary failover caused downtime for some users.

• Scaled both vertically and horizontally.

Why Cassandra?

• Riak

• Very close. Community seemed lacking.

• Redis

• No native cluster. Too much maintenance.

• Memcached/MySQL

• Too much complex app logic.

Cassandra Wins.

• Fantastic community. #cassandra on Twitter

• Easy-to-read documentation

• Linearly scalable. Easy to grow cluster.

• Low maintenance overhead for ops team.

• Handles time series data very well.

Learning

• Cassandra Summit 2014

• Other team in Adobe

• Long nights reading documentation

Our Data

• Ephemeral

• “Source of truth” lives in a MySQL database

• Okay with *some* data loss

Our Rules

• A user’s feed is made up of entities, each with one set of actions

• User’s feed only contains one of any given entity

• An entity’s set of actions contains up to seven of the most recent actions taken by that user’s network
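A minimal sketch of these rules, assuming a feed is held as a plain dict keyed by entity id. The `record_action` helper and the shape of an action are hypothetical; only the cap of seven comes from the slide.

```python
MAX_ACTIONS = 7  # the cap stated in the rules above


def record_action(feed, entity_id, action):
    """Add an action to a feed kept as {entity_id: [actions, newest first]}."""
    actions = feed.get(entity_id, [])
    # Keying by entity_id means the feed can hold each entity only once.
    feed[entity_id] = ([action] + actions)[:MAX_ACTIONS]
    return feed


feed = {}
for n in range(9):
    record_action(feed, entity_id=42, action={"actor_id": n, "verb": "comment"})
```

After nine actions on the same entity, the feed still has a single entry for it, holding the seven newest actions.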

Planning

Language Support

• Most services on Behance are PHP

• No official DataStax PHP driver

“Looks like I’m learning Python.”

–Mark Dunphy, 2014

Go to Production

No, nothing is working yet. I didn’t skip a slide.

• App/cluster in production before anything works

• Test real life load

• Fail spectacularly without anybody noticing

• Deploy risky changes without fear

• Run alongside MongoDB

January 19th, 2015

Query Patterns

• “Create your data models based on the queries you want to run” - Basically Everybody

• Wanted to…

• Read a user’s feed entities by type and time of most recent action…separately.

• Write/Update a user’s feed entities with new actions while knowing only user id and entity id

Data Models

“An UPDATE in Cassandra works like an UPSERT! Let’s store the user’s entire feed in a single row in a table! It’s so simple!”

–Mark Dunphy, January 2015

First Data Model

CREATE TYPE activity.action (
    created_on timestamp,
    secondary_entity_id int,
    actor_id int,
    verb_id int
);

CREATE TYPE activity.entity (
    entity_type_id int,
    entity_id int
);

CREATE TABLE activity.project_actions (
    modified_on timestamp,
    entity_id int,
    user_id int,
    actions list<frozen<action>>,
    PRIMARY KEY (user_id, entity_id)
);

CREATE TABLE activity.feeds (
    modified_entities list<frozen<entity>>,
    modified_on timestamp,
    project_ids list<int>,
    user_id int,
    wip_revision_ids list<int>,
    PRIMARY KEY (user_id)
);

Moments Before Everything Exploded

“Okay let’s keep nearly the same model, but use INSERT and DELETE instead of always UPDATE. Just use batch statements.”

–Mark Dunphy, January 2015

Second Data Model

This was also a very very bad idea.

• Lose the benefit of Cassandra being distributed

• All statements in a batch go through the same coordinator, which puts a lot of stress and responsibility on one node.

• Use concurrency and prepared statements instead. DataStax drivers make this easy.
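A rough sketch of "concurrency and prepared statements" with the DataStax Python driver. It assumes `session` is an already connected `cassandra.cluster.Session`; the function name, the `window` parameter, and the target table and columns are illustrative assumptions, and how `actions` is encoded for the driver is left out.

```python
from itertools import islice

# Illustrative statement; the table and column names are assumptions.
INSERT_CQL = (
    "INSERT INTO activity.projects (user_id, created_on, entity_id, actions) "
    "VALUES (?, ?, ?, ?)"
)


def fan_out_writes(session, follower_ids, created_on, entity_id, actions, window=100):
    """Write one feed row per follower with async prepared statements, not a BATCH."""
    prepared = session.prepare(INSERT_CQL)  # prepared once, reused for every write
    ids = iter(follower_ids)
    written = 0
    while True:
        chunk = list(islice(ids, window))  # at most `window` requests in flight
        if not chunk:
            return written
        # Each statement is sent on its own, so the driver can route it to a
        # suitable coordinator instead of sending everything through one node.
        futures = [
            session.execute_async(prepared, (uid, created_on, entity_id, actions))
            for uid in chunk
        ]
        for future in futures:
            future.result()  # wait for completion; raises if the write failed
        written += len(chunk)
```

The driver also ships a helper, `cassandra.concurrent.execute_concurrent_with_args`, that covers the same pattern; the hand-rolled loop here is only meant to show the idea.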

Oops

Okay…

Now we’ve got it.

Winning Data Model

CREATE TYPE activity.action (
    created_on timestamp,
    secondary_entity_id int,
    actor_id int,
    verb_id int
);

CREATE TABLE activity.projects (
    created_on timestamp,
    user_id int,
    entity_id int,
    actions list<frozen<action>>,
    PRIMARY KEY (user_id, created_on, entity_id)
);

CREATE TABLE activity.project_actions (
    modified_on timestamp,
    entity_id int,
    user_id int,
    actions list<frozen<action>>,
    PRIMARY KEY (user_id, entity_id)
);

Much Nicer

Write Strategy

• “User A comments on Project A. User B follows User A.”

• Request out to add the comment action to User B’s feed

• Read existing actions for that entity (Project A) in B’s feed. Push new action on top.

• Write new actions list into new “row” in projects table
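The write steps above, sketched with two dicts standing in for the `project_actions` and `projects` tables. The `write_action` helper is hypothetical, and removing the previous feed row is an assumption (the slides do not say how the old row is handled); it is included so the feed keeps only one row per entity.

```python
MAX_ACTIONS = 7


def write_action(project_actions, projects, user_id, entity_id, action):
    """Apply one action to a follower's feed using two dict 'tables'.

    project_actions: {(user_id, entity_id): (modified_on, actions)}
    projects:        {(user_id, created_on, entity_id): actions}
    """
    previous = project_actions.get((user_id, entity_id))
    old_actions = previous[1] if previous else []
    # Push the new action on top and keep only the most recent ones.
    actions = ([action] + old_actions)[:MAX_ACTIONS]
    now = action["created_on"]
    if previous:
        # Assumption: drop the stale feed row so the entity appears only once.
        projects.pop((user_id, previous[0], entity_id), None)
    projects[(user_id, now, entity_id)] = actions  # new "row" under a new timestamp
    project_actions[(user_id, entity_id)] = (now, actions)
    return actions


project_actions, projects = {}, {}
write_action(project_actions, projects, "user_b", "project_a",
             {"created_on": 100, "actor_id": "user_a", "verb": "comment"})
write_action(project_actions, projects, "user_b", "project_a",
             {"created_on": 200, "actor_id": "user_c", "verb": "appreciate"})
```

After both calls, User B's feed holds a single row for Project A at timestamp 200, with the newer action first.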

Read Strategy

• SELECT * FROM projects WHERE user_id = 123 AND created_on > 123214373

• Optimized for quick/easy reads. It is more important that a user’s feed loads quickly than that it updates quickly.

• Use timestamp to “page” through data.
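Timestamp paging can be illustrated without a cluster. The sketch below assumes rows arrive sorted by ascending `created_on`, mirroring the `created_on >` predicate in the query above; `read_page` and the sample rows are made up for the example.

```python
def read_page(rows, after, page_size):
    """Return (page, next_cursor) for rows sorted by ascending created_on."""
    page = [row for row in rows if row["created_on"] > after][:page_size]
    # The last timestamp on the page becomes the cursor for the next request.
    next_cursor = page[-1]["created_on"] if page else after
    return page, next_cursor


rows = [{"created_on": ts, "entity_id": i}
        for i, ts in enumerate([100, 200, 300, 400, 500])]
first, cursor = read_page(rows, after=0, page_size=2)
second, cursor = read_page(rows, after=cursor, page_size=2)
```

One caveat of this approach: if two rows share the same timestamp at a page boundary, a strict `>` comparison would skip the second one, so a real implementation would need a tie-breaker.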

Lessons Learned

• Duplicate your data to achieve desired queries. Storage is cheap. Writes are cheap.

• Think outside the box. Cassandra is not relational.

• Never ever ever ignore inserts/deletes in favor of an update-only workflow. Never. It is literally insane.

Final Specs

• 16-node cluster on AWS EC2 c3.8xlarge

• Mix of SizeTieredCompactionStrategy and DateTieredCompactionStrategy

• NetworkTopologyStrategy

• Replication factor 3

• ConsistencyLevel = ONE for most requests

Final Specs

• Bursty write volume. Consistent read volume.

• 5k to 80k writes per second

• 2k to 4k reads per second

Questions?

I might have answers.

Thank you!

Mark Dunphy, Software Engineer Behance/Adobe @dunphtastic
