growing data analytics at etsy (cristopher bohn)

23
Growing Data Analytics at Etsy Chris Bohn (“CB”)

Upload: ontico

Post on 19-May-2015

3.103 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Growing Data Analytics at Etsy (Cristopher Bohn)

Growing Data Analytics at Etsy

Chris Bohn (“CB”)

Page 2: Growing Data Analytics at Etsy (Cristopher Bohn)
Page 3: Growing Data Analytics at Etsy (Cristopher Bohn)
Page 4: Growing Data Analytics at Etsy (Cristopher Bohn)

Fort Ross (Rossia) – Established 1812 as Csarist outpost

Fort Ross

Page 5: Growing Data Analytics at Etsy (Cristopher Bohn)

Fort Ross

– Close to Russian River– City of Sebastopol is nearby

Page 6: Growing Data Analytics at Etsy (Cristopher Bohn)

Russian Hill, San Francisco

Page 7: Growing Data Analytics at Etsy (Cristopher Bohn)

Big Data Analytics At Etsy.com

History of Etsy

Data architecture of Etsy through the years

Need for Data Analytics

Growth of “Big Data” service needs

Hadoop vs. other solutions

Vertica

Schlep and Autoschlep tools for data replication

Page 8: Growing Data Analytics at Etsy (Cristopher Bohn)

About Etsy

Founded in 2005 in New York City by three NYU students

Etsy is the leading marketplace for handcrafted goods

More than 20 million users and one million sellers

Etsy has will have sales of $1 Billion this year

400 employees, 200 engineers

50 remote employees all over the world

Now has mobile and iPad applications

Page 9: Growing Data Analytics at Etsy (Cristopher Bohn)

Etsy Stack

PHP front end (Rasmus Lerdorf, creator of PHP, is on staff)

Originally had PHP-->Python (Twisted Framework) middle layer-->PostgreSQL

Now PHP-->ORM (written in PHP)-->MySQL/PostgreSQL

Page 10: Growing Data Analytics at Etsy (Cristopher Bohn)

Original Etsy Stack

PHP front end servers

Middle layer written in Python which bound Python functions to PostgreSQL stored procedures; business logic in stored procedures and views

Started with one master PostgreSQL database (users, listings, sales, forums, conversations, “showcases”)

When business increased, physically separated database components into separate PostgreSQL servers:

Master (users, listings, transactions) Forums (community message boards) Convos (messaging between buyers and sellers) Showcases (advertising slots purchased by sellers in various categories)

Listing view counts kept in memory – problem when system had to be restarted, sellers lost the view counts for listings! Solved by creating another PostgreSQL server just for recording view counts for listings.

Search originally accomplished with Full Text Indexing on PostgreSQL master database. Did not scale, so introduced SOLR search in 2009 (inverted search lookup).

Page 11: Growing Data Analytics at Etsy (Cristopher Bohn)

Problems with the original Etsy stack

Data architecture did not scale – one master database is not a scalable architecture

Stored procedures contained a lot of business logic, which resulted in higher database CPU load than necessary

Difficult to hire engineers who could code both PHP front end and stored procedure logic in PostgreSQL – thus hard to change business logic and difficult to deploy changes

Even hard to find engineers who could code Twisted framework in Python

Python middle layer was a “stored procedure routing system” (called “sprouter”) and would bind PostgreSQL stored procedures and views to Python function calls

Worked on OID (Object ID) level – would bind PG objects to Python object by using OIDs

When PG objects were recreated, they would acquire a new OID, and the sprouter binding would break

Any change required complete reboot of sprouter layer to acquire new OIDs

This was “Version 2” architecture, which replaced previous one that was even worse. The development of Version 2 took 1.5 years and almost killed the company! WATCH OUT FOR VERSION 2!!!

Page 12: Growing Data Analytics at Etsy (Cristopher Bohn)

The NEW Etsy Stack and Architecture

Started with key senior position hires from Flickr who had scaled that site with sharding

Redesigned data architecture: Remove business logic – no more stored procedures Database now just for “dumb storage” of facts and dimensions Get rid of sprouter layer Create new PHP business logic layer, called EtsyORM Use better PHP templating AJAX interactive client side logic

Generally replace PostgreSQL with MySQL Because Flickr people were more familiar with it Sharded MySQL databases Plays well with PHP (part of LAMP stack, mature technology)

Denormalized data; PostgreSQL data was fairly normalized to reduce data footprint Shard along lines that make sense Use universal dedicated ticket server to issue unique sequence values Keep related data as close together as possible, and closeness of data is more

important than data footprint. Redundancy is OK!

Page 13: Growing Data Analytics at Etsy (Cristopher Bohn)

EtsyORM

MySQL shardsPostgreSQL master

MySQL Ticker Server

Client Requests

Lookup Index

Page 14: Growing Data Analytics at Etsy (Cristopher Bohn)

New Data Architecture – Good and Not So Good

Good:

Sharded data is great for single record lookup Load is spread out over many servers Lookup is fast Scaling data capability is easy because it is horizontal – just add more servers No business logic in database, all logic instead is in ORM layer

Not so good:

Sharded data, while good for single record lookup, is not good for aggregation Data is now spread out over several machines instead of concentrated in one Data has to be periodically “rebalanced” across shards Querying aggregated data now is harder – can't issue simple SQL commands, the data

has to be mapped and reduced Writing the queries is now an engineer job because query needs to go through ORM It is hard to do analysis on the business, because that requires a lot of aggregated data

Page 15: Growing Data Analytics at Etsy (Cristopher Bohn)

EtsyORM

MySQL shardsPostgreSQL master

MySQL Ticker Server

Client Requests

Lookup Index

Page 16: Growing Data Analytics at Etsy (Cristopher Bohn)

MySQL shards

PostgreSQL master

PostgreSQL BI database

Business Intelligence (BI) Data Architecture

Page 17: Growing Data Analytics at Etsy (Cristopher Bohn)

Current BI Architecture Problems

Problems with BI Server:

BI server has become like another master database, but with data from all sources including shards

BI server is very overloaded with data replication tasks and report queries

BI server is PostgreSQL and not well suited for aggregation and analytics

BI Server is often very slow

Problems with Hadoop:

Hadoop is batch oriented – not good for “ad hoc” queries

Programming Hadoop jobs is tedious and often takes a lot of trial and error to get right

Creating Hadoop jobs is a specialized programming task that business analysts are not able to perform

Page 18: Growing Data Analytics at Etsy (Cristopher Bohn)

Etsy's solution to the BI analytics problem: Vertica

Etsy licensed Vertica to be its new BI server platform

Vertica is a licensed product, and was bought by Hewlett-Packard last year

Was designed at MIT by Prof. Michael Stonebraker, who is known as the “Father of Postgres”

Vertica shares many internals with PostgreSQL. SQL parser and vsql command line client are derived from PostgreSQL

Vertica is a “columnar store” database which is optimized for data analytics – it excels at data aggregation

Vertica is a licensed, but there is a free version (one node, 1 TB data) which is very useful

Vertica is multiple peer architecture; typical installation has several nodes, each equal to the other. It “shards” out of the box and handles all distribution of data

Vertica puts a copy of small tables on each node, but segments large tables across nodes with internal hash algorithm. It does this seamlessly, so very easy to set up and manage

Has very rich SQL language, with good analytic queries and such things as windowing. Most queries that run on PostgreSQL BI run unchanged on Vertica.

Page 19: Growing Data Analytics at Etsy (Cristopher Bohn)

What is a Columnar Store database?

Traditional Relation Database stores rows, and those rows are indexed for fast retrieval of records; Columnar store instead stores ordered columns, not rows

Vertica has no traditional indexes, although it has primary key and foreign key constraints; uses encoding (preferable run length encoding “RLE”)

Relational database: Columnar database (Vertica):

RLE

id

user_id

charge_type

amount

id

user_id

charge_type

amount

1 101

sales fee

2.33

3 56

listing fee

0.202 101

sales fee

1.22

4 23

listing fee

0.203 56

listing fee

0.20

1 101

sales fee

2.33

4 23

listing fee

0.20

2 101

sales fee

1.225 128

sales fee

3.56

5 128

sales fee

3.56

3rows

Page 20: Growing Data Analytics at Etsy (Cristopher Bohn)

Getting Data To Vertica was a problem, but Etsy wrote a solution

There are no ETL (Extract, Transform, Load) tools for Vertica, except some support for moving data from HDFS

No ETL for getting data from relational databases over to Vertica

Etsy had the requirement that we need to get all data from MySQL shards and PostgreSQL databases into Vertica for it to be usefu to the business analysts

Etsy created two tool, schlep and autoschlep, to accomplish ETL from relational databases, and we are going to open source them

Page 21: Growing Data Analytics at Etsy (Cristopher Bohn)

About schlep

Schlep: Yiddish word meaning, “To carry a heavy load a long distance”

We built into Vertica as a SQL function, so that it is easy for the analysts to use. Schlep is overloaded and has 5 variants to allow additional options. It is simple to use:

> SELECT * FROM schlep(user, 'table_name');

This will move the table the BI PostgreSQL database into Vertica. It does the following:

Connects to PostgreSQL BI and obtains the DDL for the table

Maps the data types to Vertica types

Creates the table with correct permissions on Vertica

Copies data over to Vertica by creating a psql COPY process and piping that into vsql (Vertica) COPY process

Is very fast because Vertica does not check constraints by default when data is loaded

Schlep is a “one shot” static snapshot of the data. Once copied to Vertica, there is no further update

Works with PostgreSQL right now, MySQL replication release in early November

Page 22: Growing Data Analytics at Etsy (Cristopher Bohn)

About autoschlep

Autoschlep is a system that allows incremental replication(trickle load) of data from source toVertica.

Currently works with PostgreSQL, MySQL coming soon Works by putting an “after” trigger on source data table. Any CRUD (create, update, delete) is recorded by the the trigger in a staging table on the source database The autoschlep process is scheduled by cron or the autoschlep scheduler, or whatever scheduling system of choice Autoschlep then uses schlep to move the data from the staging table on the source database and puts it in an identical staging table on Vertica. It then does a MERGE of that data into the target table Etsy has used schlep and autoschlep to move billions of records to Vertica and keep it synchronized within 15 minutes of the source data

Autoschlep is called this way:

/autoschlep.py schema_name table_name primary_key

Page 23: Growing Data Analytics at Etsy (Cristopher Bohn)

Where to Get Vertica and the Schlep Tools

Vertica has a FREE version that is quite powerful. It is limited to one node and maximum 1 terabyte of data, but this can be very useful.