![Page 1: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/1.jpg)
Sinfonia: a new paradigm
for building scalable distributed systems
(SOSP 2007)
Marcos Aguilera, Arif Merchant, Mehul Shah,
Alistair Veitch, Christos Karamanolis
![Page 3: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/3.jpg)
Memory Nodes
App & Tx Coordinator Nodes
Another distributed transactional store?
On the face of it, we have transaction coordinators alongside the application, and memory nodes to store the data. Is it just another ACID store that forces 2PC on you?
What I found interesting about this paper is that other approaches avoid 2PC at all costs; instead, they relax consistency requirements (many use memcached) or avoid multi-word transactions (Bigtable uses single-row transactions).
![Page 5: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/5.jpg)
insert
delete
read
update
DB
r data requests to m nodes
read
A brief recap of database operation and 2PC. The app begins a transaction and makes r requests (touching m nodes in the process). The DB often locks the data.
![Page 6: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/6.jpg)
DB
prepare
prepare
prepare
2PC starts once the data transfer is over. The DB nodes log their data and their prepare decision, then the coordinator logs its commit decision.
![Page 9: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/9.jpg)
DB
commit
commit
commit
r + 2m round trips
m + 1 disk writes
![Page 11: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/11.jpg)
sinfonia
m data requests to m nodes
if_cmp_rw_prepare
if_cmp_rw_prepare
if_cmp_rw_prepare
Sinfonia also has exactly two phases, but the data transfer is combined with prepare. Items are try-locked, then the comparison is done, then the reads and writes are performed. If the lock fails, the request fails.
It can do this because exactly one request is allowed to read and write data. Clearly, this influences the way one writes applications. The important takeaway is that such an API is useful in the real world. Multi-stage read-writes just have to be written as higher-order transactions (as with multi-word CAS).
To be fair, one can do this with stored procedures, but current databases don't allow you to execute a stored procedure and prepare in one network hop.
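The combined "try-lock, compare, read/write, prepare" step described in these notes can be sketched roughly as follows. This is a minimal single-process illustration, not the paper's implementation: the `MemoryNode` class, its method names, the item layout, and the `BAD_CMP` vote name are all assumptions made for the example (`BAD_LOCK` is the name used later in this deck).

```python
class MemoryNode:
    """Hypothetical in-memory stand-in for a memory node (illustration only)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.locked = {}    # address -> txid currently holding it
        self.pending = {}   # txid -> writes staged during prepare

    def prepare(self, txid, compares, reads, writes):
        """Phase 1: try-lock, compare, read, and stage writes in one request.

        compares/writes are lists of (offset, payload); reads are (offset, length).
        """
        touched = set()
        for off, payload in compares + writes:
            touched.update(range(off, off + len(payload)))
        for off, length in reads:
            touched.update(range(off, off + length))

        # Try-lock: never wait. Any conflict fails the request immediately.
        if any(addr in self.locked for addr in touched):
            return "BAD_LOCK", []
        for addr in touched:
            self.locked[addr] = txid

        # Compare: every compare item must match the stored bytes.
        for off, expected in compares:
            if bytes(self.data[off:off + len(expected)]) != expected:
                self.finish(txid, commit=False)
                return "BAD_CMP", []

        results = [bytes(self.data[off:off + n]) for off, n in reads]
        self.pending[txid] = list(writes)
        return "OK", results

    def finish(self, txid, commit):
        """Phase 2: apply staged writes on commit, then release the locks."""
        staged = self.pending.pop(txid, [])
        if commit:
            for off, payload in staged:
                self.data[off:off + len(payload)] = payload
        self.locked = {a: t for a, t in self.locked.items() if t != txid}


node = MemoryNode(16)
vote1, _ = node.prepare("t1", compares=[(0, b"\x00")], reads=[(4, 2)], writes=[(0, b"\x07")])
vote2, _ = node.prepare("t2", compares=[], reads=[(0, 1)], writes=[])  # conflicts with t1
node.finish("t1", commit=True)
vote3, seen = node.prepare("t3", compares=[(0, b"\x07")], reads=[(0, 1)], writes=[])
print(vote1, vote2, vote3, seen)  # OK BAD_LOCK OK [b'\x07']
```

Because the whole read/compare/write set arrives in one request, the node can vote without ever holding a lock while waiting for the application, which is what keeps the locking interval short.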
![Page 14: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/14.jpg)
sinfonia
1.5m round trips
m disk writes
commit
commit
commit
No coordinator disk write, and the commit write is lazy.
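To put the two sets of slide figures side by side, here is a small arithmetic sketch. It simply encodes the formulas as they appear on the slides (r + 2m round trips and m + 1 disk writes for classic 2PC; 1.5m round trips and m disk writes for Sinfonia); the function names and the sample values of r and m are made up for illustration.

```python
def classic_2pc_cost(r, m):
    """Cost per the slides: r data requests, then prepare and commit to m nodes."""
    return {"round_trips": r + 2 * m, "disk_writes": m + 1}


def sinfonia_cost(m):
    """Cost per the slides: data transfer is folded into prepare."""
    return {"round_trips": 1.5 * m, "disk_writes": m}


# Hypothetical workload: 4 data requests touching 3 nodes.
print(classic_2pc_cost(r=4, m=3))  # {'round_trips': 10, 'disk_writes': 4}
print(sinfonia_cost(m=3))          # {'round_trips': 4.5, 'disk_writes': 3}
```

The gap grows with r, since the number of data requests no longer adds to the round-trip count.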
![Page 15: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/15.jpg)

| | DB | Sinfonia |
|---|---|---|
| Storage | structured | linear range |
| Locking | isolation levels, duration, deadlocks | brief, deterministic locking interval |
| Blocking | blocking | non-blocking |
| Participants | DB nodes don't know about each other | mem nodes know about the others, for each tx |

Sinfonia: much lower level; the app may have to worry about garbage-collecting space.
Sinfonia: no blocking. If a lock is not acquired, the node does not prepare.
2PC: the coordinator is a bottleneck for recovery because only it knows the participants.
![Page 19: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/19.jpg)
for every in-doubt tx:
    abort unless you remember voting commit
    return vote
coordinator crash
recovery coordinator
Management infrastructure periodically polls the mem nodes about in-doubt transactions, and a recovery coordinator kicks in when a coordinator crashes. It tells each node, for each in-doubt tx, to abort the tx unless it has voted commit. If, for a particular tx, all participating nodes say they voted to commit, then the recovery coordinator drives the tx to commit.
This scheme is safe even if it runs concurrently with the original coordinator, which may have come back (maybe it was stuck in a GC pause or a network hiccup), because no node ever changes its vote.
To me, it is not clear from the paper how the management infrastructure knows which nodes the crashed coordinator was responsible for (unless there is a hint in the transaction ID). Otherwise, it is reckless to start aborting all transactions currently in progress at all nodes.
![Page 20: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/20.jpg)
coordinator crash
commit
commit
commit
If all of them voted to commit, a commit is sent to all; otherwise, an abort is sent.
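The recovery rule on these slides can be sketched as below. This is a simplified, single-process illustration under stated assumptions: the `Participant` class, `vote_or_abort`, `decide`, and `recover` are hypothetical names, and real memory nodes would persist their votes rather than keep them in a dictionary.

```python
class Participant:
    """Hypothetical memory node that remembers its vote per transaction."""

    def __init__(self, votes=None):
        self.votes = dict(votes or {})  # txid -> "commit" or "abort"
        self.outcome = {}               # txid -> final decision received

    def vote_or_abort(self, txid):
        # Abort unless a commit vote was already recorded; a recorded
        # vote is never changed afterwards.
        return self.votes.setdefault(txid, "abort")

    def decide(self, txid, outcome):
        self.outcome[txid] = outcome


def recover(txid, participants):
    """Recovery coordinator: commit only if every participant voted commit."""
    votes = [p.vote_or_abort(txid) for p in participants]
    outcome = "commit" if all(v == "commit" for v in votes) else "abort"
    for p in participants:
        p.decide(txid, outcome)
    return outcome


a = Participant({"t1": "commit", "t2": "commit"})
b = Participant({"t1": "commit"})  # never voted on t2
print(recover("t1", [a, b]))  # commit
print(recover("t2", [a, b]))  # abort
print(b.votes["t2"])          # abort
```

Since `b` now remembers an abort vote for `t2`, a late prepare from the original coordinator cannot turn that transaction into a commit, which is the "no one changes their vote" argument from the notes.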
![Page 21: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/21.jpg)
memory node crash
No recovery coordinator is used when a mem node crashes. Nodes know which other nodes were involved in each transaction, so they ask each other while recovering. I'm not a big fan of this architecture; I'd have preferred a recovery coordinator in all cases: it is less complex, and recovery from a system-wide crash (like a power failure) wouldn't swamp the network.
![Page 24: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/24.jpg)
enhancements
structured storage
smarter mem nodes
if cmp_read/write/prepare else read
temporary blocking instead of BAD_LOCK
Temporary blocking: I'd like a separate option that blocks on locking for a limited amount of time instead of returning immediately. It does introduce the possibility of limited-duration deadlocks, but may improve throughput.
Structured storage: different addressing options, not just (offset, count). Ranges are prone to “off-by-one” errors that could result in livelocks and corrupted data. Key/value storage keeps one key's space logically separate from another's.
The app will also have to worry about portability. Fig. 7 in the paper writes &newAttributes. This is tied to the current structure of attributes and to the machine that used it.
Smarter mem nodes: the compare could be any predicate (e.g. field2 > field3). Actions could be increments, arithmetic, insertions, etc.
Unnecessary round-trips on contention.
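The "temporary blocking" idea could look something like the following sketch, which waits for a bounded total time before giving up with BAD_LOCK. It is only an illustration of the proposal in these notes, not something the paper describes; `lock_items` and `max_wait` are hypothetical names, and Python's `threading.Lock` stands in for per-item locks on a memory node.

```python
import threading
import time


def lock_items(locks, max_wait):
    """Acquire every lock within max_wait seconds in total, else release and fail."""
    deadline = time.monotonic() + max_wait
    held = []
    for lock in locks:
        remaining = deadline - time.monotonic()
        if remaining > 0:
            ok = lock.acquire(timeout=remaining)  # bounded wait
        else:
            ok = lock.acquire(blocking=False)     # budget spent: plain try-lock
        if not ok:
            for h in held:
                h.release()                       # give back what we took
            return "BAD_LOCK"
        held.append(lock)
    return "OK"


a, b = threading.Lock(), threading.Lock()
b.acquire()                          # simulate another transaction holding b
first = lock_items([a, b], 0.05)     # waits briefly, then gives up
b.release()
second = lock_items([a, b], 0.05)    # succeeds once b is free
print(first, second)
```

The shared deadline is what keeps any deadlock limited in duration: two transactions waiting on each other both time out and release their locks.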
![Page 25: sinfonia - University of Cambridge · Sinfonia a new paradigm for building scalable distributed systems (SOSP 2007) Marcos Aguilera Arif Merchant Mehul Shah Alistair Veitch Christos](https://reader030.vdocuments.us/reader030/viewer/2022020214/5b0b41197f8b9ac7678dcb4c/html5/thumbnails/25.jpg)
related reading
Google: Chubby, BigTable, TaskMaster
YouTube architecture
Yahoo: , PNuts
Microsoft: Partitioning and Recovery Service
Apache/Yahoo Hadoop Project: Zookeeper