mixi.jp - scaling out with open source
Batara Kesuma mixi, [email protected]
Introduction
•Batara Kesuma
•CTO of mixi, Inc.
What is mixi?
•Social networking service
• Diary, community, message, review, photo album, etc.
• Invitation only
•Largest and fastest growing SNS in Japan
•Latest information
• Friends' new diaries
• Comment history
• Community topics
• Friends' new reviews
• Friends' new albums
•My latest diaries and reviews
•User testimonials
•Friends
•Community listing
History of mixi
•Development started in December 2003
• Only 1 engineer (me)
• 4 months of coding
•Opened in February 2004
Two months later
•10,000 users
•600,000 PV/day
The “Oh crap!” factor
•This model works
•But how do we scale out?
The first year
•The mixi user population grew significantly
•From 600 users to 210,000 users
The second year
•210,000 users to 2 million users
And now?
More than 3.7 million users
15,000 new users/day
Population of Japan: 127 million; Internet users: 86.7 million (source: CIA World Factbook)
70% of users are active (last login within the past 72 hours)
The average user spends 3 hours 20 minutes per week on mixi
Ranked 35th on Alexa worldwide, and 3rd in Japan
PV growth in 2 years
[Chart: page views over 2 years, comparing mixi with Google Japan and Amazon Japan]
Users growth in 2 years
[Chart: users over 2 years, from 04/03 to 06/03, growing to about 3.5 million]
Our technology solutions
The technology behind
•Linux 2.6
•Apache 2.0
•MySQL
•Perl 5.8
•memcached
•Squid
[Diagram: requests pass through mod_proxy to mod_perl; data lives in the diary cluster, message cluster, image servers, and other clusters; hot objects are served from memcached]
Powered by MySQL
•More than 100 MySQL servers
•Adding more than 10 servers/month
•Non-persistent connections
•Mostly InnoDB
•Heavily rely on DB partitioning (our own solution)
DB replication
•MySQL server load gets heavy
•Add more slaves
[Diagram: mod_perl sends write queries to the master DB and read queries to the slave DBs; the master replicates to the slaves]
DB replication
•Classic problem with DB replication
[Diagram: with 100 reads/s and 50 writes/s at the master, 2 slaves each carry 50 reads/s plus the full 50 writes/s; with 4 slaves, reads drop to 25 reads/s per slave, but every slave still replays all 50 writes/s]
Some statistics
•Diary related tables
• Read 85%
• Write 15%
•Message related tables
• Read 75%
• Write 25%
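The replication ceiling implied by these read/write ratios can be made concrete in a few lines of Perl. The absolute query rates below are illustrative assumptions, not mixi's numbers; the point is that every slave replays the full write stream, so per-slave load has a floor:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Why adding slaves stops helping: reads are divided among the
# slaves, but every slave still replays all of the writes.
# Workload numbers are illustrative (85/15 split, as for diaries).
my $writes = 150;   # writes/s (15% of a hypothetical 1000 q/s workload)
my $reads  = 850;   # reads/s  (85% of the same workload)

sub load_per_slave {
    my ($num_slaves) = @_;
    return $reads / $num_slaves + $writes;
}

for my $n (1, 2, 4, 8) {
    printf "%d slaves: %.1f queries/s per slave\n", $n, load_per_slave($n);
}
# Per-slave load never drops below the 150 writes/s that
# replication delivers, no matter how many slaves are added.
```

Doubling the slaves from 2 to 4 only takes a slave from 575 to 512.5 queries/s, because the write share is untouched.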
DB partitioning
•Replication couldn’t keep up anymore
•Try to split the DB
How to split?
[Diagram: a single DB holding message tables, diary tables, and other tables for users A, B, and C]
Splitting vertically by users or splitting horizontally by table types
Vertical partition
[Diagram: the tables are split across DB 1 and DB 2 by user, so each DB holds the message, diary, and other tables for a subset of users]
Vertical partition
•Too many tables to deal with at one time
•The transition in splitting gets complex and difficult
Horizontal partition
[Diagram: the message tables and the diary tables each move to a new DB; the other tables stay in the old DB]
Also called level 1 partitioning within mixi
$dbh = $db->load_dbh(type => "message");  # message tables, new DB
$dbh = $db->load_dbh(type => "diary");    # diary tables, new DB
$dbh = $db->load_dbh();                   # other tables, old DB
Partition map for level 1
•Small and static
•Just put it in a configuration file
•For example:
$DB_DIARY   = 'DBI:mysql:host=db1;database=diary';
$DB_MESSAGE = 'DBI:mysql:host=db2;database=message';
...
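A dispatcher over such a config might look like the sketch below. The `dsn_for` function and the fallback DSN are assumptions for illustration, not mixi's actual `load_dbh`:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Level 1 partition map: small and static, so it lives in config.
my %DB_MAP = (
    diary   => 'DBI:mysql:host=db1;database=diary',
    message => 'DBI:mysql:host=db2;database=message',
);
my $DB_DEFAULT = 'DBI:mysql:host=db0;database=mixi';  # other tables (assumed)

# Returns the DSN for a table type; real code would then call
# DBI->connect($dsn, $user, $password) to obtain the $dbh.
sub dsn_for {
    my (%args) = @_;
    my $type = $args{type};
    return $DB_MAP{$type} if defined $type and exists $DB_MAP{$type};
    return $DB_DEFAULT;
}

print dsn_for(type => 'diary'), "\n";   # host=db1
print dsn_for(), "\n";                  # falls back to the old DB
```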
Easy transition
[Diagram: mod_perl writes to both the old and new DBs while reads still go to the old DB; rows are copied in the background with SELECT plus INSERT IGNORE; reads then shift to the new DB]
1. Writes to both DBs
2. Copies in background (SELECT / INSERT IGNORE)
3. Shifts reads
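The three transition steps can be sketched as application logic; plain hashes stand in for the old and new DBs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (%old_db, %new_db);
my $reads_shifted = 0;   # step 3 flips this once the copy is done

# Step 1: during the transition, every write goes to both DBs.
sub write_row {
    my ($id, $row) = @_;
    $old_db{$id} = $row;
    $new_db{$id} = $row;
}

# Step 2: background copy; like INSERT IGNORE, it never overwrites
# a row that the dual writes already placed in the new DB.
sub copy_in_background {
    for my $id (keys %old_db) {
        $new_db{$id} = $old_db{$id} unless exists $new_db{$id};
    }
}

# Step 3: once copied, reads shift to the new DB.
sub read_row {
    my ($id) = @_;
    return $reads_shifted ? $new_db{$id} : $old_db{$id};
}

$old_db{1} = 'row written before the transition';
write_row(2, 'row written during the transition');
copy_in_background();
$reads_shifted = 1;
print read_row(1), "\n";   # now served from the new DB
```

Because the copy skips rows that already exist, it cannot clobber fresher data written by step 1 while it runs.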
Problems with level 1
•Cannot use JOIN anymore
• Use FEDERATED tables from MySQL 5
• Or do two SELECTs, which is faster than using FEDERATED tables
• If the table is small, just duplicate it
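Doing the JOIN in application code means two SELECTs and a merge by key. A sketch with illustrative tables and columns (the row data here simulates what the two DBs would return):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First SELECT, against the diary DB:
#   SELECT diary_id, user_id, title FROM diaries ...
my @diaries = (
    { diary_id => 1, user_id => 10, title => 'hello' },
    { diary_id => 2, user_id => 11, title => 'scaling' },
);

# Second SELECT, against the user DB, for the ids just seen:
#   SELECT user_id, nickname FROM users WHERE user_id IN (10, 11)
my %users = (
    10 => { nickname => 'alice' },
    11 => { nickname => 'bob'   },
);

# The "JOIN" happens here, in Perl, keyed on user_id.
for my $d (@diaries) {
    my $u = $users{ $d->{user_id} };
    printf "%s by %s\n", $d->{title}, $u->{nickname};
}
```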
Next step
•When the new DB gets overloaded
•We split the DB, yet again
•Get ready for level 2
Partitioning key
•user id, message id
•Choose wisely!
[Diagram: the message tables can be split either by user id (user A vs user B) or by message id]
Level 2 partition
[Diagram: the level 1 message DB, holding users A through D, is split into new DBs: node 1 and node 2, each holding the message tables for a subset of users]
Partition map for level 2
•Big and dynamic
•Cannot put it all in a configuration file
Partition map for level 2
•Manager based
• Use another DB to do the partition mapping
•Algorithm based
• Partition map is computed inside the application
• node_id = member_id % TOTAL_NODES
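The modulo rule is all the application needs for the algorithm-based map; a minimal sketch (node numbering from 1 matches the diagrams that follow):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Algorithm-based partition map: the application computes the
# node from the partitioning key; no manager DB lookup needed.
my $TOTAL_NODES = 3;

sub node_id {
    my ($member_id) = @_;
    return ($member_id % $TOTAL_NODES) + 1;   # nodes numbered from 1
}

print node_id(14), "\n";   # 14 % 3 = 2, so node 3
```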
Manager based
[Diagram: mod_perl asks the manager DB for the node of user_id=14; the manager returns node_id=2; mod_perl then connects to that message-table node (one of nodes 1 to 3)]
1. Asks for node_id
2. Returns node_id
3. Connects to node
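The manager-based flow in miniature; a hash stands in for the mapping table that the manager DB would serve with an extra SELECT per request:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The manager DB maps user_id to node_id; a hash stands in here.
# In production this is one extra query against the manager DB.
my %manager_db = ( 14 => 2, 27 => 1 );
my %node_dsn = (
    1 => 'DBI:mysql:host=node1;database=message',
    2 => 'DBI:mysql:host=node2;database=message',
);

# Steps 1-2: ask the manager for the node; step 3: connect to it.
sub dsn_for_user {
    my ($user_id) = @_;
    my $node_id = $manager_db{$user_id};
    die "no mapping for user $user_id" unless defined $node_id;
    return $node_dsn{$node_id};
}

print dsn_for_user(14), "\n";   # node2
```

The upside of paying for that lookup is that moving a user between nodes is just an UPDATE of the mapping row.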
Algorithm based
[Diagram: with 3 nodes, mod_perl computes node_id = (user_id % 3) + 1 = 3 and connects directly to message-table node 3]
1. Computes node_id
2. Connects to node
Manager based
•Pros:
• Easy to manage
• Add a new node, move data between nodes
•Cons:
• Adds one extra query per request to look up the partition map
• Needs to send a request to the manager
Algorithm based
•Cons:
• Difficult to manage
• Adding new nodes is tricky
•Pros:
• Application servers can compute the node id by themselves
• Bypasses the connection to the manager
Adding nodes is tricky
[Diagram: going from 2 nodes to 4, mod_perl computes both old_node_id = (member_id % 2) + 1 and new_node_id = (member_id % 4) + 1]
1. Adds new application logic
2. Writes to both DBs if the node_ids differ
3. Copies in background
4. Shifts reads
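The dual-write rule from the diagram, as a sketch: write once when both formulas agree on the node, twice while they disagree:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Going from 2 nodes to 4: during the transition, writes go to
# both the old and the new node whenever the formulas disagree.
sub old_node_id { my ($id) = @_; return ($id % 2) + 1 }
sub new_node_id { my ($id) = @_; return ($id % 4) + 1 }

sub nodes_to_write {
    my ($member_id) = @_;
    my ($old, $new) = (old_node_id($member_id), new_node_id($member_id));
    return $old == $new ? ($old) : ($old, $new);   # dual write if different
}

# member 5: old node (5%2)+1 = 2, new node (5%4)+1 = 2 -> one write
# member 6: old node 1, new node 3 -> writes go to both
print join(',', nodes_to_write(5)), "\n";   # 2
print join(',', nodes_to_write(6)), "\n";   # 1,3
```

Once the background copy catches up, reads shift to `new_node_id` and the old formula can be dropped.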
Problems with level 2
[Diagram: member tables spread across nodes 1 to 3, community tables across nodes 1 and 2]
•Too many connections to different DBs
•Fortunately, most of mixi's data sets are small
•Cache them all using distributed memory caching
•We rarely hit the DB
•Average page load time is about 0.02 sec*
* average load time varies with the data set
Caching
•memcached
• Also used in LiveJournal, Slashdot, etc.
•A memcached server runs on each mod_perl machine
•39 machines x 2 GB memory
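The cache-aside pattern this enables, sketched with a plain hash standing in for memcached; real code would use a memcached client's get/set against the daemons on the mod_perl machines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %cache;        # stands in for memcached in this sketch
my $db_hits = 0;  # counts how often we fall through to MySQL

# Placeholder for the real query against the partitioned DBs.
sub fetch_from_db {
    my ($user_id) = @_;
    $db_hits++;
    return "profile of user $user_id";
}

# Cache-aside: try the cache first, hit the DB only on a miss,
# then populate the cache for subsequent requests.
sub get_profile {
    my ($user_id) = @_;
    my $key = "profile:$user_id";
    return $cache{$key} if exists $cache{$key};
    my $row = fetch_from_db($user_id);
    $cache{$key} = $row;
    return $row;
}

get_profile(14) for 1 .. 100;
print "DB hits for 100 requests: $db_hits\n";   # only the first misses
```

This is why "we rarely hit the DB" holds: for small, hot data sets the miss rate after warm-up is close to zero.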
Summary of DB partitioning
•Level 1 partition (split by table types)
•Level 2 partition (split by partitioning key)
• Manager based
• Algorithm based
[Diagram: the old DB's message, diary, and other tables are first split by table type into a level 1 message DB (step 1); the level 1 message tables are then split by partitioning key into level 2 DBs for users A, B, and C (step 2)]
Image Servers
Statistics
•Total size is more than 8 TB of storage
•Growth rate is about 23 GB/day
•We use MySQL to store metadata only
Two types of images
•Frequently accessed images
• Number of image files is relatively small (a few million)
• For example, user profile photos, community logos
•Rarely accessed images
• About a hundred million image files
• Diary photos, album photos, etc.
Frequently accessed images
•A few hundred GB of files
•Distributed via FTP and Squid
•Third-party Content Delivery Network
Frequently accessed images
[Diagram: mod_perl uploads to storage (sto1.mixi.jp, sto2.mixi.jp); Squid and the CDN pull images from storage]
1. Uploads to storage
2. Pulls images from storage
Rarely accessed images
•A few TB of files
•Newer files get accessed more often
•Cache hit ratio is very bad
•Distributed directly from storage
Uploading rarely accessed images
[Diagram: mod_perl consults the manager DB and uploads abc.gif to a pair of storage servers among sto1.mixi.jp through sto4.mixi.jp]
1. Assigns an id to the image file
2. Arranges a pair of area_ids (e.g. area_id = 1, 2)
3. Uploads the image to both storage areas
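The upload bookkeeping, sketched below. The id counter and the consecutive-pair rule for picking two storage areas are assumptions for illustration; the source only says each image gets an id and a pair of area_ids:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Manager DB state: the next image id and the storage areas
# available (sto1.mixi.jp .. sto4.mixi.jp).
my $next_image_id = 1;
my @areas = (1, 2, 3, 4);

# Assign an id plus a pair of area_ids so the file is stored on
# two machines. The consecutive pairing here is illustrative.
sub assign_upload {
    my $id = $next_image_id++;
    my $first  = $areas[ $id % @areas ];
    my $second = $areas[ ($id + 1) % @areas ];
    return ($id, $first, $second);
}

my ($id, @pair) = assign_upload();
print "image $id -> upload to areas @pair\n";   # image 1 -> areas 2 3
```

Storing every file in two areas means a single storage machine can die without losing images.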
Viewing rarely accessed images
[Diagram: user, mod_perl, the manager DB, and storage servers sto1.mixi.jp through sto4.mixi.jp]
1. User asks for view_diary.pl
2. mod_perl detects abc.gif in view_diary.pl
3. Asks the manager DB for the area_id of abc.gif
4. Manager returns area_id = 1
5. Creates the image URL
6. Returns view_diary.pl and the URL for abc.gif
7. User asks the storage server for abc.gif
8. Storage returns abc.gif
To do
•Try MySQL Cluster
•Try to implement a better algorithm
• Consistent hashing?
• Linear hashing?
•Level 3 partitioning?
• Split again by timestamp?
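Of the to-do items, consistent hashing would let nodes be added while moving only nearby keys, instead of reshuffling everything the way plain modulo does. A sketch using core Perl's Digest::MD5; the ring layout and replica count are illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Consistent hashing: nodes hash to points on a ring, and a key
# maps to the next point clockwise. Adding a node only relocates
# the keys that fall between its points and their predecessors.
my @nodes = ('node1', 'node2', 'node3');
my $REPLICAS = 100;   # virtual points per node, smooths the spread

sub hash_point { hex(substr(md5_hex($_[0]), 0, 8)) }

sub build_ring {
    my (@nodes) = @_;
    my %ring;
    for my $n (@nodes) {
        $ring{ hash_point("$n:$_") } = $n for 1 .. $REPLICAS;
    }
    return \%ring;
}

sub node_for {
    my ($ring, $key) = @_;
    my $h = hash_point($key);
    for my $point (sort { $a <=> $b } keys %$ring) {
        return $ring->{$point} if $point >= $h;
    }
    my ($first) = sort { $a <=> $b } keys %$ring;   # wrap around
    return $ring->{$first};
}

my $ring = build_ring(@nodes);
print node_for($ring, "user:14"), "\n";
```

The linear scan over sorted points keeps the sketch short; a production version would binary-search a sorted array of points instead.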
Questions?