stop worrying & love the sql - a case study

Stop Worrying & Love the SQL! (the Quepid story)

OpenSource Connections

Me/Us

Doug Turnbull

@softwaredoug

Likes: Solr, Elasticsearch, Cassandra, Postgres

OpenSource Connections

@o19s

Search, Discovery and Analytics

Let us introduce you to freelancing!

https://twitter.com/softwaredoug

http://twitter.com/o19s


Most importantly we do...

Make my search results more relevant!

“Search Relevancy”

What database works best for problem X?

“(No)SQL Architect/Trusted Advisor”


How products actually get built

Rena: Doug, John can you come by this afternoon?

One of our Solr-based products needs some urgent relevancy work

It’s Friday, it needs to get done today!

Us: Sure!

The Client (Rena!)

Smart cookie!


A few hours later

Us: we’ve made a bit of progress!

image frustration-1081 by jseliger2

Rena: but every time we fix something, we break an existing search!

Us: yeah! we’re stuck in a whack-a-mole-game

other image: whack a mole by jencu

http://www.flickr.com/photos/91262622@N02/

http://www.flickr.com/photos/jennycu/


Whack-a-Mole

What search relevancy work actually looks like

http://www.youtube.com/watch?v=9Lvv7TX2Gyc


I HAVE AN IDEA

● Middle of the afternoon, I stop doing search work and start throwing together some Python

from flask import Flask
app = Flask(__name__)

Everyone: Doug, stop that, you have important search work to do!

Me: We’re not making any progress!

WE NEED A WAY TO REGRESSION TEST OUR RELEVANCY AS WE TUNE!

Everyone: You’re nuts!


What did I make?

A tool focused on gathering stakeholder (i.e. Rena) feedback on search results, coupled with a workbench for tuning relevancy against that feedback

Today we have customers...

… forget that, tell me about your failures!


Our war story

My mistakes:

● Building a product

● Selling a product

● As a user experience engineer

● As an Angular developer

● At choosing databases


Quepid 0.0.0.0.0.0.1

Track multiple user searches

For this query (hdmi cables), Rena rates this document as a good/bad search result

Need to store:

<search> -> <id for search result> -> <rating 1-10>

“hdmi cables” -> “doc1234” -> “10”

*Actual UI may have been much uglier


Data structure selection under duress

● What’s simple, easy, and will persist our data?

● What plays well with python?

● What can I get working now in Rena’s office?


Redis

● In-memory “Data Structure Server”

○ hashes, lists, simple key -> value storage

● Persistent -- write to disk every X minutes


Redis

Easy to install and go!

$ pip install redis

from redis import Redis
redis = Redis()
redis.set("foo", "bar")
redis.get("foo")  # gets 'bar' (returned as bytes, b'bar', on Python 3)

Specific to our problem:

from redis import Redis
redis = Redis()

ratings = {"doc1234": "10", "doc532": "5"}
searchQuery = "hdmi cables"

# redis-py has no hsetall(); hset() with mapping= stores the whole dict as one hash
redis.hset(searchQuery, mapping=ratings)

Store a hash table at “hdmi cables” with:

“doc1234” -> “10”

“doc532” -> “5”


Success!

● My insanity paid off that afternoon

● Now we’re left with a pile of hacked-together (terrible) code -- now what?


Adding some features

● Would like to add multiple “cases” (different search projects that solve different problems)

● Would like to add user accounts

● Still a one-off for Silverchair


Cases

Tuning a cable shopping site... vs. state laws


Cases in Redis?

Recall our existing implementation:

from redis import Redis
redis = Redis()
redis.hset(searchQuery, mapping=ratings)

“Data model”: out of the box, Redis can deal with 2 levels deep:

{
    "hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "ethernet cables": ...
}

Can’t add an extra layer (a Redis hash is only one layer deep):

{
    "cable site": {
        "hdmi cables": {...},
        "ethernet cables": {...}
    },
    "laws site": {...}
}


Time to give up Redis?

“All problems in computer science can be solved by another level of indirection” -- David Wheeler

Crazy idea: add a dynamic prefix to query keys to indicate the case, i.e.:

{
    "case_cablestore_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "case_cablestore_ethernet cables": {...},
    "case_statelaws_car tax": {...}
}

The first two keys are queries for the “Cable Store” case; the last is a query for the “State Laws” case.

To fetch:

redis.keys("case_cablestore*")
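A minimal sketch of the prefix trick, assuming a plain dict stands in for Redis so it runs without a server. The helpers `query_key` and `keys`, and the ratings for the second and third queries, are made up for illustration; only the key naming scheme comes from the slide.

```python
from fnmatch import fnmatchcase

# A plain dict stands in for Redis (assumption: no server needed for the sketch).
store = {}

def query_key(case_name, search_query):
    """Hypothetical helper: encode the case as a prefix of the flat key."""
    return "case_%s_%s" % (case_name, search_query)

def keys(pattern):
    """Mimic a glob-style lookup such as redis.keys(pattern)."""
    return sorted(k for k in store if fnmatchcase(k, pattern))

# Illustrative ratings; only the first hash appears in the slides.
store[query_key("cablestore", "hdmi cables")] = {"doc1234": "10", "doc532": "5"}
store[query_key("cablestore", "ethernet cables")] = {"doc77": "8"}
store[query_key("statelaws", "car tax")] = {"doc9": "3"}

# Fetch every query key that belongs to the "cablestore" case.
cable_keys = keys("case_cablestore*")
```

One thing the sketch makes visible: a pattern like `case_cablestore*` would also match a case named `cablestore2`, so the hierarchy lives entirely in a naming convention.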


Store other info about cases?

New problem: we need to store some information about cases: case name, etc.

{
    "case_cablestore_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "case_cablestore_ethernet cables": {...},
    "case_statelaws_car tax": {...}
}

Where would it go here? A case-level hash, plus a “query” marker in the query keys:

{
    "case_cablestore": {
        "name": "cablestore",
        "created": "20140101"
    },
    "case_cablestore_query_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "case_cablestore_query_ethernet cables": {...},
    "case_statelaws_query_car tax": {...}
}


Oh but let’s add users

Extrapolating on past patterns:

{
    "user_doug": {
        "name": "Doug",
        "created_date": "20140101"
    },
    "user_doug_case_cablestore": {
        "name": "cablestore",
        "created_date": "20140101"
    },
    "user_doug_case_cablestore_query_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_doug_case_cablestore_query_ethernet cables": {...},
    "user_tom_case_statelaws_query_car tax": {...}
}

You right now!

image: Rage Wallpaper from Flickr user Thoth God of Knowledge

http://www.flickr.com/photos/thoth-god/


Step Back

We ask ourselves: Is this tool a product? Is it useful outside of this customer?

What level of software engineering helps us move forward?

● Migrate to an RDBMS?

● “NoSQL” options?

● Clean up use of Redis somehow?


SubRedis

Operationalizes hierarchy inside of redis

https://github.com/softwaredoug/subredis

from redis import Redis
from subredis import SubRedis
redis = Redis()

# Create a redis sandbox for this case
sr = SubRedis("case_%s" % caseId, redis)

# Interact with this case's queries through the sandbox specific to that case
sr.hsetall(searchQuery, ratings)

Behind the scenes, subredis adds the case_1 prefix to every key it reads or writes




SubRedis == composable

# SubRedis takes any Redis-like thing, and works safely in that sandbox
userSr = SubRedis("user_%s" % userId, redis)

# Now working on a sandbox within a sandbox
caseSr = SubRedis("case_%s" % caseId, userSr)

# Sandbox redis for queries about user
ratings = {"doc1234": "10", "doc532": "5"}
searchQuery = "hdmi cables"

caseSr.hsetall(searchQuery, ratings)
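To make the composition idea concrete, here is a minimal sketch of a prefixing wrapper. It is not the SubRedis implementation: `DictBackend` and `PrefixSandbox` are hypothetical classes, and the backend is an in-memory dict exposing only `set`/`get`. The point is just that a wrapper with the same interface as its backend can be nested.

```python
class DictBackend:
    """In-memory stand-in for a Redis client (assumption: not redis-py)."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class PrefixSandbox:
    """Hypothetical wrapper: forwards calls to a backend, prefixing every key."""
    def __init__(self, prefix, backend):
        self.prefix = prefix
        self.backend = backend

    def _key(self, key):
        return "%s_%s" % (self.prefix, key)

    def set(self, key, value):
        self.backend.set(self._key(key), value)

    def get(self, key):
        return self.backend.get(self._key(key))


backend = DictBackend()
user_sb = PrefixSandbox("user_1", backend)
case_sb = PrefixSandbox("case_1", user_sb)  # a sandbox inside a sandbox

user_sb.set("name", "Doug")
case_sb.set("hdmi cables", {"doc1234": "10", "doc532": "5"})

# The backend only ever sees flat, fully prefixed keys.
flat_keys = sorted(backend.data)
```

Because `PrefixSandbox` has the same `set`/`get` shape as the backend it wraps, the inner sandbox never needs to know how many layers sit above it.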


Does something reasonable under the hood

{
    "user_1_name": "Doug",
    "user_1_created_date": "20140101",
    "user_1_case_1_name": "cablestore",
    "user_1_case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_2_name": "Rena",
    ...
}

(Diagram: all of Redis contains the user_1 subredis, which in turn contains the case_1 subredis.)


We reflect again

● OK, we tried this out as a product. Launched.

● Paid off *some* tech debt, but wtf are we doing?

● Works well enough, we’ve got a bunch of new features, forge ahead

https://www.youtube.com/watch?v=uPN4bq5jDMI


We reflect again

● We have real customers

● Our backend is evolving away from simple key-value storage

○ User accounts? Users that share cases? Stored search snapshots? etc.


Attack of the relational

Given our current set of tools, how would we solve the problem “case X can be shared between multiple users”?

Duplicate the data?

{
    "user_1_name": "Doug",
    "user_1_created_date": "20140101",
    "user_1_case_1_name": "cablestore",
    "user_1_case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_2_name": "Rena",
    "user_2_case_1_name": "cablestore",
    "user_2_case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    }
}

Could duplicate the data? This stinks!

● Updates require visiting many (every?) user, looking for this case

● Bloated database


Attack of the relational

Given our current set of tools, how would we solve the problem “case X can be shared between multiple users”?

Break out cases to a top-level record? Each user then stores a list of owned cases:

{
    "user_1_name": "Doug",
    "user_1_created_date": "20140101",
    "user_1_cases": [1, ...],
    "case_1_name": "cablestore",
    "case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_2_name": "Rena",
    "user_2_cases": [1, ...],
    ...
}

(Diagram: User 1 and User 2 both point to Case 1.)
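A small in-memory sketch of why the top-level layout helps, assuming plain dicts instead of Redis. The `users`/`cases` dict shapes and the helper `cases_for` are hypothetical, chosen only to mirror the keys on the slide.

```python
# Users hold only case ids; the case record itself lives once, at the top level.
users = {
    1: {"name": "Doug", "cases": [1]},
    2: {"name": "Rena", "cases": [1]},
}
cases = {
    1: {
        "name": "cablestore",
        "queries": {"hdmi cables": {"doc1234": "10", "doc532": "5"}},
    },
}

def cases_for(user_id):
    """Resolve a user's case ids to the shared case records."""
    return [cases[case_id] for case_id in users[user_id]["cases"]]

# One update to the shared record...
cases[1]["queries"]["hdmi cables"]["doc532"] = "7"

# ...is visible through every user that references the case.
doug_view = cases_for(1)[0]["queries"]["hdmi cables"]["doc532"]
rena_view = cases_for(2)[0]["queries"]["hdmi cables"]["doc532"]
```

With the duplicated layout, the same change would have meant finding and rewriting a copy under every user.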


SubRedis Relational?

{
    "user_1_name": "Doug",
    "user_1_created_date": "20140101",
    "user_1_cases": [1, ...],
    "case_1_name": "cablestore",
    "case_1_hdmi cables": {
        "doc1234": "10",
        "doc532": "5"
    },
    "user_2_name": "Rena",
    "user_2_cases": [1, ...],
    ...
}

We’ve actually just normalized our data.

Why was this good?

● We want to update case 1 in isolation without anomalies

● We don’t want to visit every user to update case 1!

● We want to avoid duplication

We just made our “NoSQL” database a bit relational


Other Problems

● Simple CRUD tasks like “delete a case” need to be coded up

● We’re managing our own record ids

● Is any of this atomic? Does it occur in isolation?
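As a rough illustration of the first and third bullets, here is what a hand-coded “delete a case” might look like over flat keys. This is a sketch under assumptions: a dict stands in for Redis, and the key names and the `delete_case` helper are hypothetical. Note that it is several separate writes with nothing grouping them.

```python
# Flat key space, following the normalized layout (dict stands in for Redis).
store = {
    "user_1_cases": [1],
    "user_2_cases": [1, 2],
    "case_1_name": "cablestore",
    "case_1_hdmi cables": {"doc1234": "10"},
    "case_2_name": "statelaws",
}

def delete_case(store, case_id):
    """Hand-rolled delete: several independent writes, not one atomic unit."""
    prefix = "case_%s_" % case_id
    # Step 1: drop every key that belongs to the case.
    for key in [k for k in store if k.startswith(prefix)]:
        del store[key]
    # Step 2: scrub the case id from every user's list of cases.
    for key, value in store.items():
        if key.startswith("user_") and key.endswith("_cases"):
            store[key] = [c for c in value if c != case_id]

delete_case(store, 1)
```

If the process died between step 1 and step 2, users would be left pointing at a case that no longer exists, which is exactly the kind of anomaly the bullets worry about.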


What’s our next DB?

● These problems are hard, we need a new DB

● We also need better tooling!


Irony

● This is the exact situation we warn clients about in our (No)SQL Architect roles.

○ Relational == general purpose

○ Many-many, many-one, one-many, etc.

○ Relational == consistent tooling

○ NoSQL == solve specific problems well


So we went relational!

● Took advantage of great tooling: MySQL, SQLAlchemy (ORM), Alembic (migrations)

● Modeled our data relationships exactly like we needed them to be modeled


Map db to Python classes

Can model my domain in coder-friendly classes

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class SearchQuery(Base):
    __tablename__ = 'query'
    id = Column(Integer, primary_key=True)
    search_string = Column(String)
    ratings = relationship("QueryRating")

class QueryRating(Base):
    __tablename__ = 'rating'
    id = Column(Integer, primary_key=True)
    # foreign key back to the owning query; relationship() needs it to join
    query_id = Column(Integer, ForeignKey('query.id'))
    doc_id = Column(String)
    rating = Column(Integer)


Easy CRUD

Create!

q = SearchQuery(search_string="hdmi cable")
db.session.add(q)
db.session.commit()

Delete!

del q.ratings[0]
db.session.add(q)
db.session.commit()

Update!

# filter_by() takes keyword arguments; filter() expects column expressions
q = SearchQuery.query.filter_by(id=1).one()
q.search_string = "foo"
db.session.add(q)
db.session.commit()


Migrations are good

How do you upgrade your database to add/move/reorganize data?

● With Redis this was always done manually/scripted

● Migrations with an RDBMS are a very robust/well-understood way to handle this

SQLAlchemy has “alembic” to help:

alembic revision --autogenerate -m "name for tries"
alembic upgrade head
alembic downgrade 0ab51c25c


Modeling Users ←→ Cases

Can model many-many relationships

# the table name must be a string
association_table = Table('case2users', Base.metadata,
    Column('case_id', Integer, ForeignKey('case.id')),
    Column('user_id', Integer, ForeignKey('user.id')))

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    cases = relationship("Case", secondary=association_table)

class Case(Base):
    __tablename__ = 'case'
    id = Column(Integer, primary_key=True)


Ultimate Query Flexibility

Print all cases:

for user in User.query.all():
    for case in user.cases:
        print(case.caseName)

Cases from paying members:

for user in User.query.filter(User.isPaying == True):
    for case in user.cases:
        print(case.caseName)
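For a feel of what the ORM is doing underneath, here is a rough sketch of the same many-to-many query in plain SQL. It uses Python's built-in sqlite3 instead of MySQL/SQLAlchemy so it runs anywhere. The table names `users` and `cases`, the column names, and the sample rows (including who is paying) are assumptions made up for illustration; only `case2users` comes from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, is_paying INTEGER);
    CREATE TABLE cases (id INTEGER PRIMARY KEY, case_name TEXT);
    -- association table: one row per (case, user) sharing link
    CREATE TABLE case2users (
        case_id INTEGER REFERENCES cases(id),
        user_id INTEGER REFERENCES users(id)
    );
    INSERT INTO users VALUES (1, 'Doug', 1), (2, 'Rena', 0);
    INSERT INTO cases VALUES (1, 'cablestore'), (2, 'statelaws');
    INSERT INTO case2users VALUES (1, 1), (2, 1), (1, 2);
""")

# Cases from paying members: join users to cases through the association table.
rows = conn.execute("""
    SELECT users.name, cases.case_name
    FROM users
    JOIN case2users ON case2users.user_id = users.id
    JOIN cases ON cases.id = case2users.case_id
    WHERE users.is_paying = 1
    ORDER BY cases.case_name
""").fetchall()
```

The “cablestore” case is shared by both users yet stored once; sharing is just another row in `case2users`.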


Lots of things easier

● backups

● robust hosting services (RDS)

● industrial strength ACID with flexible querying

● 3rd-party tooling (e.g. VividCortex for MySQL)

http://vividcortex.com
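A quick sketch of what the ACID bullet buys you, again using the built-in sqlite3 as a stand-in for MySQL. The schema, the `delete_case` helper, and the `fail_midway` switch are hypothetical; the idea is that a multi-step delete either fully happens or does not happen at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (id INTEGER PRIMARY KEY, case_name TEXT);
    CREATE TABLE case2users (case_id INTEGER, user_id INTEGER);
    INSERT INTO cases VALUES (1, 'cablestore');
    INSERT INTO case2users VALUES (1, 1), (1, 2);
""")

def delete_case(conn, case_id, fail_midway=False):
    """Delete a case and its sharing rows as one all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM case2users WHERE case_id = ?", (case_id,))
            if fail_midway:
                raise RuntimeError("simulated crash between the two deletes")
            conn.execute("DELETE FROM cases WHERE id = ?", (case_id,))
    except RuntimeError:
        pass  # the rollback has already restored the sharing rows

# A failure halfway through leaves the data untouched.
delete_case(conn, 1, fail_midway=True)
links_after_failure = conn.execute("SELECT COUNT(*) FROM case2users").fetchone()[0]

# A clean run removes the case and its links together.
delete_case(conn, 1)
cases_after_success = conn.execute("SELECT COUNT(*) FROM cases").fetchone()[0]
links_after_success = conn.execute("SELECT COUNT(*) FROM case2users").fetchone()[0]
```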


When NoSQL?

● Solve specific problems well

○ Optimize for specific query patterns

○ Full-Text Search (Elasticsearch, Solr)

○ Caching, shared data structure (Redis)

● Optimize for specific scaling problems

○ Provide a denormalized “view” of your data for a specific task


Final Thoughts

Sometimes an RDBMS has a harder initial hurdle: setup, figuring out migrations, data modeling, etc.

Why isn’t the easy path the wise path?


In conclusion

stop worrying & love the sql - a case study
