DOs and DON'Ts on the way to 10M users

On the way to over 10M users… DOs and DON'Ts, IGT July 2011 Event

Upload: yoav-avrahami

Post on 12-Jan-2015


TRANSCRIPT

Page 1: DOs and DONTs on the way to 10M users

on the way to over 10M users…

DOs and DON'Ts

IGT – July 2011 Event

Page 2: DOs and DONTs on the way to 10M users

Wix Initial Architecture

Server
• Public Sites Serving
• Sites Editing API
• User Authentication
• User Media upload, searches
• Public Media upload, management, searches
• Template Galleries management

Database
• Wix Sites
• WixML Pages
• User Authentication
• Users Media
• Public Media
• Template Galleries

Page 3: DOs and DONTs on the way to 10M users

Wix Initial Architecture

• Tomcat, Hibernate, custom web framework
– Everything generated from HBM files
– Built for fast development
– Stateful login (Tomcat session), EHCache, file uploads
– Not considering performance, scalability, fast feature rollout, testing
– It reflected the fact that we didn’t really know what our business was
– Yoav A. - “it is great for a first stage start-up, but you will have to replace it within 2 years”
– Nadav A., after two years - “you were right, however, you failed to mention how hard it’s gonna be”

Page 4: DOs and DONTs on the way to 10M users

Wix Initial Architecture

What we have learned
• Don’t worry about ‘building it right from the start’ – you won’t
• You are going to replace stuff you are building in the initial stages of a startup or any project
• Be ready to do it
• Get it up to customers as fast as you can. Get feedback. Evolve.
• Our mistake was not planning for gradual re-write
• Build for gradual re-write as you learn the problems and find the right solutions

Page 5: DOs and DONTs on the way to 10M users

Distributed Cache

• Next we added EHCache as Hibernate 2nd-level cache
• Why?
– Because it was in the design
• How was it?
– Black-box cache
– How do we know what the state of our system is?
– How to invalidate the cache?
– When to invalidate it?
– How does Operations manage the cache?
• Did we really need it?
– No
• We eventually dropped it

Page 6: DOs and DONTs on the way to 10M users

Distributed Cache

What we have learned
• You don’t need a cache
• Really, you don’t
• Cache is not part of an architecture. It is a means to make something more efficient
• Architect while ignoring the caching. Introduce caching only as needed to solve real performance problems.
• When introducing a cache, think about
– Cache management – how do you know what is in the cache? How do you find invalid data in the cache?
– Invalidation – who invalidates the cache? When? Why?
– Cache reset – can your architecture stand a cache restart?
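The three questions above (management, invalidation, reset) can be made concrete with a small sketch. This is not the cache Wix used. `ManagedCache`, its method names and the TTL default are hypothetical, chosen only to show a cache whose contents can be inspected, invalidated by the writer, and wiped without breaking reads.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ManagedCache:
    """Read-through cache that exposes its contents, invalidation and reset."""

    def __init__(self, loader: Callable[[Hashable], Any],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # the real data source behind the cache
        self._ttl = ttl_seconds    # assumed default, tune per use case
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = self._loader(key)  # miss or expired: reload from the source
        self._entries[key] = (now, value)
        return value

    def snapshot(self) -> Dict[Hashable, float]:
        """Cache management: which keys are cached and how old each one is."""
        now = self._clock()
        return {key: now - stored_at
                for key, (stored_at, _) in self._entries.items()}

    def invalidate(self, key: Hashable) -> bool:
        """Invalidation: called by whoever changes the underlying record."""
        return self._entries.pop(key, None) is not None

    def reset(self) -> None:
        """Cache reset: reads must still work afterwards, only slower."""
        self._entries.clear()


# Usage: the code path that writes a record also invalidates it.
database = {"site-1": "<site version='1'/>"}
cache = ManagedCache(loader=database.__getitem__)
cache.get("site-1")
database["site-1"] = "<site version='2'/>"
cache.invalidate("site-1")
```

Because every read falls back to the loader, dropping the cache entirely only costs performance, which is the property the "cache reset" question asks about.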

Page 7: DOs and DONTs on the way to 10M users

Two years passed

• We learned what our business is – building websites
• We started selling premium websites

Page 8: DOs and DONTs on the way to 10M users

Two years passed

• Our architecture evolved
– We added a separate Billing segment
– We moved static file storage and serving to a separate instance
• But we started seeing problems
– Updates to our server imposed complete Wix downtime
– Our static storage reached 500 GByte of small files, the limit of Bash scripts
– The server codebase became large and entangled. Feature rollout became harder over time, requiring longer and longer manual regression
  • Strange full-table-scan queries generated by Hibernate; we still have no idea what code is responsible for them…
– Stateful user sessions required a lot of memory and a stateful load balancer

Diagram: Server (Tomcat), DB, statics (Lighttpd), Media storage, Billing (Tomcat), Billing DB

Page 9: DOs and DONTs on the way to 10M users

Editor & Public Segments

• The Challenge - Updates to our Server imposed downtime for our customers’ websites
– Any Server or Database update has the potential of bringing down all Wix sites
– This is a symptom of a larger issue
• The Server served two different concerns
– Wix Users editing websites
– Viewing Wix Sites, the sites created by the Wix editor
• The two concerns require a different SLA
– Wix Sites should never ever have a downtime!
– Wix Sites should work as fast as possible, always!
– However, an editing system does not require this level of SLA.
• The two concerns evolve independently
– Releases of Editing features should have no impact on existing Wix sites operations!

Page 10: DOs and DONTs on the way to 10M users

Editor & Public Segments

• Our Solution
– Split the Server into two Segments – Public and Editor
• The Public segment targets serving websites for Wix Users
– Has a mostly read-only usage pattern – only updated when a site is published
– Simple publishing system
– Simple means it is easier to have higher SLA
– Simple and read-only means it is easier to make DRP
– Built using Spring MVC 2.5 and JDBC (without Hibernate)
– MySQL used as NoSQL – single large table with XML text fields
• The Editor segment
– Exposes the Wix Editing APIs, as well as user account and galleries management APIs.
– Has a different release schedule compared to the Public segment

Diagram: Public (Tomcat), Public DB, Editor (Tomcat), Editor DB

Page 11: DOs and DONTs on the way to 10M users

Editor & Public Segments

What we have learned
• Architecture is inspired by aspects such as
– SLA
– Release Cycles – deployment flexibility
• Separate Segments for discrete concerns
– Editing (Editor Segment)
– Publishing (Public Segment)
• Modularity – in terms of breaking the architecture to services / modules / sub-systems (or any other name)
– Enabler for gradual re-write
– Enabler for continuous delivery
– Simplifies QA, Operations & Release Cycles


Page 12: DOs and DONTs on the way to 10M users

Editor & Public Segments

What we have learned
• MySQL is a damn good NoSQL engine
– Our public DB is (mainly) one table with 10M entries
– Queries are by primary key
– Updates are by primary key
– Instead of relations, we use text/xml columns
• Immutable Data
– Architect your data to be immutable
• MySQL sequential keys cause problems
– Locks on key generation
– Require a single instance to generate keys
• We use GUID keys
– Can be generated by any client
– No locks in key value generation
– Enabler for Master-Master replication
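A minimal sketch of the "one table, primary-key access, client-generated GUID" pattern described above. Several things here are assumptions: `sqlite3` stands in for MySQL so that the sketch runs without a server, and the table name `sites`, the column names and the helper functions are hypothetical, not the Wix schema.

```python
import sqlite3
import uuid
from typing import Optional

# Hypothetical schema: one wide table, no relations, document stored as text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sites (
    id  TEXT PRIMARY KEY,  -- GUID created by the client, not by the database
    doc TEXT NOT NULL      -- the whole site document as XML text
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def new_site_id() -> str:
    # Any client can mint a key; no sequence, no lock, no single key generator.
    return str(uuid.uuid4())


def put_site(conn: sqlite3.Connection, site_id: str, xml_doc: str) -> None:
    # Write by primary key. "INSERT OR REPLACE" is SQLite syntax; MySQL would
    # use REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE instead.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO sites (id, doc) VALUES (?, ?)",
            (site_id, xml_doc),
        )


def get_site(conn: sqlite3.Connection, site_id: str) -> Optional[str]:
    # Read by primary key only; no joins and no secondary lookups.
    row = conn.execute(
        "SELECT doc FROM sites WHERE id = ?", (site_id,)
    ).fetchone()
    return row[0] if row else None


# Usage
store = open_store()
site_id = new_site_id()
put_site(store, site_id, "<site><page id='home'/></site>")
document = get_site(store, site_id)
```

Since keys never come from a database sequence, two database masters can accept inserts at the same time without coordinating key ranges, which is why the slide calls GUID keys an enabler for Master-Master replication.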


Page 13: DOs and DONTs on the way to 10M users

Authentication Segment

• The Challenge – Stateful user sessions
– Makes it harder to separate software into different segments
  • Requires shared library or single sign-on
– Requires stateful load balancer
– Takes more memory
• Our solution
– Dedicated authentication segment – authentication is a specific concern
– Cookie based login – solves all the above problems
– Implement stronger security solution only where really required

What we have learned
• Do not use stateful sessions – they don’t scale (easily)
• Use cookie based login – this is the standard on the web today
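The slides do not say how the login cookie is built. One common way to make cookie based login stateless is an HMAC-signed cookie that any segment can verify with a shared secret, without a server-side session or a sticky load balancer. The sketch below assumes that approach. `SECRET_KEY`, the `user_id|expires` payload layout and both function names are hypothetical.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical shared secret; every segment that verifies cookies needs it.
SECRET_KEY = b"replace-with-a-real-secret"


def _sign(payload: bytes) -> str:
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def issue_cookie(user_id: str, ttl_seconds: int = 3600,
                 now: Optional[float] = None) -> str:
    """Build the cookie value handed out by the authentication segment."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at + ttl_seconds)
    payload = f"{user_id}|{expires}".encode("utf-8")
    encoded = base64.urlsafe_b64encode(payload).decode("ascii")
    return f"{encoded}.{_sign(payload)}"


def verify_cookie(cookie: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the cookie is authentic and unexpired, else None."""
    try:
        encoded, signature = cookie.split(".", 1)
        payload = base64.urlsafe_b64decode(encoded.encode("ascii"))
    except ValueError:
        return None  # malformed cookie
    expected = _sign(payload).encode("ascii")
    # Constant-time comparison, done before the payload is trusted or parsed.
    if not hmac.compare_digest(signature.encode("utf-8"), expected):
        return None
    user_id, expires = payload.decode("utf-8").rsplit("|", 1)
    current = time.time() if now is None else now
    if current >= int(expires):
        return None
    return user_id


# Usage: any app server can check the cookie without shared session state.
cookie = issue_cookie("user-42")
user = verify_cookie(cookie)
```

A signed cookie proves identity but cannot be revoked before it expires, which fits the slide's advice to add stronger security only where it is really required.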

Page 14: DOs and DONTs on the way to 10M users

• The Challenge
– Our static storage reached over 500 GByte of small files
– The “upload to app server, copy to static server, serve by file server” pattern proved inefficient, slow and error prone
– Disk IO became slow and inefficient as the number of files increased
– We needed a solution we can grow with
  • HTTP connections
  • number of files
– We needed control over caching and HTTP headers
• Our Solution
– Prospero
– You’ve heard about it in a previous presentation

Page 15: DOs and DONTs on the way to 10M users

What we have learned
• It was hard to make Prospero work right
– But it was the right choice for us
• Caching headers for media files are critical
– max-age, ETag, If-Modified-Since, etc.
– Think how to tune those parameters for media files, as per your specific needs
• We tried using Amazon S3 as secondary storage
– However, it proved unreliable
  • The S3 service had connection reliability issues
  • S3 “lost” a volume of 200M files, 30 TByte
• We found that using a CDN in front of Prospero is very effective
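To illustrate the caching headers named above (max-age, ETag, If-Modified-Since), here is a simplified sketch of how a media server might answer conditional requests. It is not Prospero's implementation. The function names and the one-year `max_age` default are assumptions, and real `If-None-Match` handling also covers lists and weak validators, which this sketch skips.

```python
import hashlib
from datetime import timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Dict, Mapping, Tuple


def media_headers(body: bytes, last_modified: float, max_age: int) -> Dict[str, str]:
    """Caching headers for one media file."""
    return {
        "Cache-Control": f"public, max-age={max_age}",
        "ETag": '"' + hashlib.sha1(body).hexdigest() + '"',
        "Last-Modified": formatdate(last_modified, usegmt=True),
    }


def _not_modified_since(header_value: str, last_modified: float) -> bool:
    try:
        since = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return False  # unparseable date: serve the full body
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution.
    return int(last_modified) <= int(since.timestamp())


def respond(request_headers: Mapping[str, str], body: bytes,
            last_modified: float,
            max_age: int = 31536000) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body): 304 with no body when the client copy is current."""
    headers = media_headers(body, last_modified, max_age)
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When an ETag is sent, it takes precedence over the date check.
        not_modified = if_none_match == headers["ETag"]
    else:
        since = request_headers.get("If-Modified-Since")
        not_modified = since is not None and _not_modified_since(since, last_modified)
    if not_modified:
        return 304, headers, b""
    return 200, headers, body


# Usage: first request gets the file, the revalidation gets a 304.
first = respond({}, b"image-bytes", last_modified=1300000000.0)
again = respond({"If-None-Match": first[1]["ETag"]}, b"image-bytes", 1300000000.0)
```

A long `max-age` keeps browsers and CDNs from asking at all, while ETag and Last-Modified make the occasional revalidation cheap.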

Page 16: DOs and DONTs on the way to 10M users

CDN

What we have learned
• Use a CDN!
• CDN acts as a great connection manager
– We have CDN hit ratios of over 99.9%
• Use the “Cache Killer” pattern
– http://static.wix.com/client/css/viewer.css?v=327
– Makes flushing files from the CDN redundant
– Enabler for longer caching periods
• There are many vendors
– We started with 1 CDN vendor
– We are now “playing” with multiple CDN vendors
– Different CDN vendors have advantages in different geographies
• Tune HTTP Headers per CDN Vendor
– CDN Vendors interpret HTTP headers differently
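The “Cache Killer” pattern above amounts to putting a release version in the URL, so a new release produces a new cache key and the old entry never needs flushing. A small sketch, assuming the version is a build number passed in by the caller. The helper name `cache_killer_url` is hypothetical, and the parameter name `v` follows the example URL on the slide.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_killer_url(url: str, version: str, param: str = "v") -> str:
    """Return url with the version parameter set, replacing any earlier one."""
    parts = urlsplit(url)
    # Keep all other query parameters, drop a previous version if present.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != param]
    query.append((param, version))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Usage: bumping the version yields a URL the CDN has never seen.
old = cache_killer_url("http://static.wix.com/client/css/viewer.css", "327")
new = cache_killer_url(old, "328")
# new == "http://static.wix.com/client/css/viewer.css?v=328"
```

Because each version has its own URL, the files can be served with very long caching periods without risking stale content after a release.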

Page 17: DOs and DONTs on the way to 10M users

Wix-Framework / TDD / DevOps

• The Challenge
– The server codebase became large and entangled
– Feature rollout became harder over time, requiring longer and longer manual regression
• The Solution
– The Wix Framework + TDD / CI / CD / DevOps
• TDD – Developer responsible for all automated tests
– With QA Developers
• CI / CD – automate build & deployment
– Solve the build / test / deployment / rollback issues first
– Get the project to production with minimal functionality
• DevOps - Developer is responsible from Product to 10,000 active users

Page 18: DOs and DONTs on the way to 10M users

Wix-Framework

• The Wix Framework
– Java, Spring, Jetty, MySQL, MongoDB
– Spring MVC based
• Adjustments for Flash
– Flash imposes some restrictions on HTTP which require special handling
• DevOps support
– Built-in support for monitoring dashboard, configuration, usage tracking, profiling and Self-Test in every app server
– Automated Deployment – using Chef and SourceChef
• TDD Support
– Unit-Testing helpers
– Multi-browser JavaScript Unit-Testing support, integrated with IDE
– Integration Testing framework, allows running the same tests in embedded mode (for developer) and deployed in a production-like env
– Embedded MySQL & Embedded MongoDB
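The slides do not show how the Wix Framework's Self-Test works, so the following is only a guess at the general idea: each app server registers a few checks of its own dependencies, and a deployment step runs them before the server takes traffic. The class `SelfTest`, its methods and the check names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A check passes by returning normally and fails by raising an exception.
Check = Callable[[], None]


class SelfTest:
    """Registry of startup checks an app server can run against itself."""

    def __init__(self) -> None:
        self._checks: List[Tuple[str, Check]] = []

    def register(self, name: str, check: Check) -> None:
        self._checks.append((name, check))

    def run(self) -> Dict[str, object]:
        """Run every check and report per-check status plus an overall flag."""
        results: Dict[str, str] = {}
        for name, check in self._checks:
            try:
                check()
                results[name] = "ok"
            except Exception as error:  # any failure is reported, not raised
                results[name] = f"failed: {error}"
        healthy = all(status == "ok" for status in results.values())
        return {"healthy": healthy, "checks": results}


def _database_check() -> None:
    # Placeholder: a real check would run a trivial query against the DB.
    pass


def _config_check() -> None:
    # Placeholder failure to show how a broken dependency is reported.
    raise RuntimeError("missing setting 'static.base.url'")


# Usage: a deployment script would call run() right after starting the server.
self_test = SelfTest()
self_test.register("database", _database_check)
self_test.register("config", _config_check)
report = self_test.run()
```

Running the same report after every deployment gives the pipeline a cheap signal on whether a freshly started server is wired correctly before integration tests or traffic reach it.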

Page 19: DOs and DONTs on the way to 10M users

TDD / DevOps

• Our development process
– Developer writes code and Unit-Tests
– Developer and QA Developer write integration tests (IT)
• Build on developer machine
– Compile | Unit Tests | Embedded ITs
• On check-in
– Compile | Unit Tests | Self Tests | Deploy – IT Env | Run ITs
• Mark RC, deploy to QA env
– RC | QA | Self Tests | Deploy – QA Env | Monitor
• Mark GA, deploy to production
– GA | Self Tests | Deploy – Production | Monitor

Page 20: DOs and DONTs on the way to 10M users

Wix-Framework / TDD / DevOps

What we have learned
• TDD / CI / CD / DevOps works
• It enables us to release frequently
– We release 2-3 times a day
– With higher confidence in release quality
– With reduced risk
• It takes time and effort
– Requires management commitment
– Transforms the way R&D, Product and Operations work
– Requires investment in supporting tools – the Wix Framework
– Infrastructure – Build Server, Artifactory, Maven (+Plugins), Chef, SourceChef, Testing environments (virtual boxes), Server Dashboard, New Relic.
• It is worth the effort!
– A Wix developer – “never again shall I work without automated testing”

Page 21: DOs and DONTs on the way to 10M users

Some of our Tools

Page 22: DOs and DONTs on the way to 10M users