scalable wikipedia with erlang - onscale
TRANSCRIPT
Scalable Wikipedia with Erlang
Thorsten Schütt 1
Thorsten Schütt, Florian Schintke , Alexander Reinefeld
Zuse Institute Berlin (ZIB)
onScale solutions
Scaling
Thorsten Schütt 2
Web 2.0 Hosting
1. Step
• Single Server
– Webserver
– DB Server
Clients
Thorsten Schütt 3
– DB Server
2. Step
• Single Webserver
Clients
Thorsten Schütt 4
• Single DB server
3. Step
• Load Balancer
• n Webservers
Clients
. . .
Thorsten Schütt 5
• Single DB server
. . .
4. Step
• Load Balancer
• n Webservers
Clients
Thorsten Schütt 6
• Master DB server
• Slave DB servers
. . .
5. Step
• Load Balancer
• n Webservers
Clients
Thorsten Schütt 7
• DB partitioning
• DB clusters
• n DB servers
• memcached
DB Directory
. . .
What did others do?
• SimpleDB
• BigTable
• Dynamo
• MySQL Cluster
Thorsten Schütt 8
• MySQL Cluster
• MapReduce
• Google FS
Our Approach: P2P in the Data Center
Clients
Key/Value Store
(simple DBMS)
Thorsten Schütt 9
P2P OverlayP2P Overlay
Transactions Transactions
ReplicationReplication
ACID
Scalable Wikipedia with Erlang
• Chord#
• Transactional Layer
• Load Balancing
Thorsten Schütt 10
• Wikipedia
Chord# = Chord \ Hash
Thorsten Schütt 11
Chord# = Chord \ Hash
Chord#
• A dictionary has 3 ops:
• insert(key, value)
• delete(key)
• lookup(key)
• Chord# implements a distributed dictionary
Thorsten Schütt 12
• Chord implements a distributed dictionary
Key Value
Clarke 2007
Allen 2006
Bachman 1973
Thompson 1983
Knuth 1974
Codd 1981
node A
node D
node B
node C
Each node only
stores part of the
data
Backus
Codd
Dijkstra
FlyodRivest
Wirth
Yao
Chord#
• Key Space is an arbitrary set
with a total order, e.g.
strings
• Every node is assigned a
Thorsten Schütt 13
Gray
Karp
Knuth
Milner
Minsky
Naur
Nygaard
Perlis
Ritchie
• Every node is assigned a
random key
• Nodes form logical ring
over the key space
Backus
Distributed Dictionary
• Items are stored on their
successor, i.e. the first node
encountered in clockwise
direction
Thorsten Schütt 14
Backus
Codd
Ritchie
Rivest
Wirth
Yao
(Thompson, 1983)
(Clarke, 2007)
(Allen, 2006)
direction
– Thomson on Wirth
– Allen on Backus
– Bachman on Backus
– Clarke on Codd
– …
(Bachman, 1973)
• Each node puts the successor and log2N exponentially spaced fingers in its
routing table
With each routing step using greedy routing,
Distributed Dictionary
Backus
CoddYao
Thorsten Schütt 15
• With each routing step using greedy routing,
the distance between the currrent node
and the destination is halved.
=> Yields O(log2N) hops.
Dijkstra
Flyod
Gray
Karp
Knuth
Milner
Minsky
Naur
Nygaard
Perlis
Ritchie
Rivest
Wirth
Summary: Chord#
• DHT is a fully decentralized data structure
• DHTs self-organize as nodes
• Only one hop for calculating a routing pointer, Chord needs log(N) hops.
DHTs/Chord Chord#
Thorsten Schütt 16
• DHTs self-organize as nodes join, leave, and fail
• All operations only require local knowledge
• max. log (N) hops, while Chord guarantees this only with “high probability”.
• Can adapt to imbalances in the query load; Chord can’t.
• Supports range queries.
Scalable Wikipedia with Erlang
• Chord#
• Transactional Layer
• Load Balancing
Thorsten Schütt 17
• Wikipedia
ACID
Transactions on DHTs are Challenging
• high churn rate
• nodes may leave, join, or crash at any time
� changing responsibilities
• “crash stop” fault model
Thorsten Schütt 18
• “crash stop” fault model
• no perfect “failure detector”
• never know whether a node crashed or just slow network
Transactions + Replicas
START
debit (a, 100);
START
debit (a1, 100);
debit (a2, 100);
debit (a3, 100);
Thorsten Schütt 19
deposit (b, 100);
COMMIT
3
deposit (b1, 100);
deposit (b2, 100);
deposit (b3, 100);
COMMIT
Adapted Paxos Commit
Leader ItemsTransaction
Manager
1. Step
2. Step
GetTPInitTM
RegisterRTM
RegisterTP
TMs and TPs
• Optimistic CC with fallback option
• Fast Read Operation
– just 1 round
– reads a majority (of the last versions)
of all replicates
Thorsten Schütt 20
3. Step
4. Step
5. Step
6. Step
TMs and TPs
Prepare
Prepared
Ack Prepared
Commit
of all replicates
• Write Operation
– 3 rounds
– nonblocking (fallback)
• succeeds when > f/2 nodes alive
Summary “Transactions“
• Consistent way to update several items and their replicas
Thorsten Schütt 21
• Mitigates some of the Overlay Oddities
• Node Failures
• Asynchronous programming model
Scalable Wikipedia with Erlang
• Chord#
• Transactional Layer
• Load Balancing
Thorsten Schütt 22
• Wikipedia
Backus
Codd
Dijkstra
FlyodRivest
Wirth
Yao1. Pick 2 random nodes („Naur“,
„Wirth“)
2. IF Load(„Wirth“) >> Load(„Naur“)
1. „Naur“ leaves system
2. „Naur“ joins as „Tarjan“
Load-Balancing
Overloaded
Thorsten Schütt 23
Gray
Karp
Knuth
Milner
Minsky
Naur
Nygaard
Perlis
RitchieLoad can be any metric
D. Karger, M. Ruhl. „Simple Efficient Load Balancing
Algorithms for Peer-to-Peer Systems“. IPTPS 2004.
Backus
Codd
Dijkstra
FlyodTarjan
Wirth
Yao
1. Pick 2 random nodes („Naur“, „Wirth“)
2. IF Load(„Wirth“) >> Load(„Naur“)
1. „Naur“ leaves system
2. „Naur“ joins as „Tarjan“
Load-Balancing
Thorsten Schütt 24
Gray
Karp
Knuth
Milner
Minsky
Naur
Nygaard
Perlis
Ritchie
RivestLoad can be any metric
D. Karger, M. Ruhl. „Simple Efficient Load Balancing
Algorithms for Peer-to-Peer Systems“. IPTPS 2004.
Multi Data-Center Scenario
• Multi-Data-Center Scenarios– Optimize for Latency
– Increase Availability
• Prefix Articles with– Language
Thorsten Schütt 25
– Language
– Replicas Number
• E.g. „de:Main Page“– 5 replicas
– 2 in Germany
– 1 in UK
– 1 in USA
– 1 in Asia
„0de:Main Page“, „1de:Main Page“, „2de:Main Page“, …
Scalable Wikipedia with Erlang
• Chord#
• Transactional Layer
• Load Balancing
Thorsten Schütt 26
• Wikipedia
Wikipedia
Top 10 Web sites
1. Yahoo!
2. Google
3. YouTube
4. Windows Live
Wikipedia is the top1 open Web Site
– source code is open
– architecture is open
Thorsten Schütt 27
4. Windows Live
5. MSN
6. Myspace
7. Wikipedia
8. Facebook
9. Blogger.com
10. Yahoo!カテゴリ
– architecture is open
– dumps available
Source: alexa.com
Wikipedia
Top 10 Web sites
1. Yahoo!
2. Google
3. YouTube
4. Windows Live
50.000 requests/sec
– 95% are answered by squid proxies
– 2,000 req./sec hit the backend
Thorsten Schütt 28
4. Windows Live
5. MSN
6. Myspace
7. Wikipedia
8. Facebook
9. Blogger.com
10. Yahoo!カテゴリ
The Wikipedia System Architecture
other
Thorsten Schütt 29
web servers
search servers
NFS
Erlang Implementation
Thorsten Schütt 30
Erlang Implementation
Overlay
• Implemented in Erlang
– Chord#
– Load-Balancing
– Transaction Framework
• OTP behaviours
– Supervisor
Thorsten Schütt 31
– Supervisor
– gen_server
• Distributed Erlang
– Security Problems
– Scalability Problems
-> Own Transport-Layer on top of TCP
Data Model
Wikipedia
– SQL DB
Chord#
– Key-Value Store
Thorsten Schütt 32
Map Relations to Key-Value Pairs
– (Title, List of Versions)
– (CategoryName, List of Titles)
– (Title, List of Titles) //Backlinks
CREATE TABLE /*$wgDBprefix*/page (
page_id int unsigned NOT
NULL auto_increment,
page_namespace int NOT NULL,
...
Data Model
void updatePage(string title, int oldVersion, string newText)
{
//new transaction
Transaction t = new Transaction();
//read old version
Page p = t.read(title);
//check for concurrent update
if(p.currentVersion != oldVersion)
Thorsten Schütt 33
t.abort();
else{
//write new text
t.write(p.add(newText));
//update categories
foreach(Category c in p)
t.write(t.read(c.name).add(title));
//commit
t.commit();
}
}
Wiki
Database:
• Chord#
• Mapping
– Wiki -> Key-Value Store
Thorsten Schütt 34
Renderer:
• Java
– Tomcat
– Plog4u
• Jinterface
– Interface to Erlang
IEEE Scale Challenge 2008
Live Demos
• Bavarian
• Simple English
• Browsing
• Deployments
– Planet-Lab
• 20 nodes
– Cluster
• 320 nodes in Berlin
Thorsten Schütt 35
• Browsing
• Editing with full History
• Category-Pages
• 320 nodes in Berlin
Summary
• Previously, P2P was mainly used for file sharing (read only).
• We support consistent, distributed write operations.
DHT + Transactions = Scalable, Reliable, Efficient Key/Value Store
Thorsten Schütt 36
• We support consistent, distributed write operations.
• Numerous applications
• Internet databases, transactional online-services, …
Team
• Thorsten Schütt,
• Florian Schintke,
• Monika Moser,
• Stefan Plantikow,
• Alexander Reinefeld,
Thorsten Schütt 37
• Alexander Reinefeld,
• Nico Kruber,
• Christian von Prollius,
• Seif Haridi (SICS),
• Ali Ghodsi (SICS),
• Tallat Shafaat (SICS)
Detailed Descriptions
WikiS. Plantikow, A. Reinefeld, F. Schintke.
Transactions for Distributed Wikis on Structured Overlays.
DSOM, October 2007.
TransactionsM. Moser, S. Haridi.
Atomic Commitment in Transactional DHTs.
1st CoreGRID Symposium, August 2007.
Thorsten Schütt 38
1st CoreGRID Symposium, August 2007.
T. Shafaat, M. Moser, A. Ghodsi , S. Haridi, T. Schütt, A. Reinefeld.
Key-Based Consistency and Availability in Structured Overlay Networks.
Infoscale, June 2008.
DHTT. Schütt, F. Schintke, A. Reinefeld.
A Structured Overlay for Multi-dimensional Range Queries.
Euro-Par, August 2007.
T. Schütt, F. Schintke, A. Reinefeld.
Structured Overlay without Consistent Hashing: Empirical Results.
GP2PC, May 2006.
Questions
Thorsten Schütt 39
Questions