c* summit 2013: cmb: an open message bus for the cloud by boris wolf
DESCRIPTION
The Comcast Silicon Valley Innovation Center has developed a general purpose message bus for the cloud. The service is API compatible with Amazon's SQS/SNS and is built on Cassandra and Redis with the goal of linear horizontal scalability. This presentation offers and in-depth look at the architecture of the system and how they employ Cassandra as a central component to meet key requirements. Latest feature enhancements and performance data will also be covered.TRANSCRIPT
![Page 1: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/1.jpg)
CMB – A Message Bus for the Cloud
![Page 2: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/2.jpg)
CMB – A Message Bus for the Cloud
CQS – Queuing Service CNS – Topic based Pub Sub Service
![Page 3: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/3.jpg)
Why did we build our own? • General purpose message bus to replace project driven one-off
solutions • Smooth data center failover, maybe even “active-active” queues • Must scale to millions of queues and 1000s of messages/sec (for
example 1 queue per STB) • Tight latency requirements (“10ms response time 95th pct”) • Evaluated other options to arrive at AWS SQS/SNS
![Page 4: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/4.jpg)
AWS SQS Primer “Simple Queuing Service” • Focus on guaranteed delivery • Best effort on orderly delivery, duplicates • Few simple core APIs:
SendMessage() ReceiveMessage() DeleteMessage() • Do not trust message recipients
![Page 5: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/5.jpg)
Advantages of adopting an API If you do it on your own: • API design typically biased towards first use case • Almost guaranteed: You won’t get it right the first time (iterations) • Difficult for new users to adopt: Documentation, tools, community,…
![Page 6: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/6.jpg)
Why did we build our own? AWS SQS
Guaranteed Delivery + Simple, Robust API + Scalability + Ac;ve-‐Ac;ve ? DC Failover ? Latency & Throughput ? Limita;ons (Msg Size, # Ar;facts, …) ?
![Page 7: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/7.jpg)
“Build a horizontally scalable queuing service on top of Cassandra (and Redis) which is API compatible with AWS SQS / SNS API”
![Page 8: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/8.jpg)
CQS over Cassandra and Redis Cassandra • Cross-DC persistence and replication • Proven horizontal scalability Redis • Meet latency requirements • Help with best effort ordering • Handle Visibility Timeout (VTO)
![Page 9: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/9.jpg)
Cassandra Data Modeling How to represent queued messages in Cassandra? • Single Column Queue • Single Row Queue • Multi-Row Queue
![Page 10: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/10.jpg)
Single Column Queue
![Page 11: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/11.jpg)
Single Row Queue
![Page 12: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/12.jpg)
Multi-Row Queue
![Page 13: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/13.jpg)
CQS Data Flow Example 1. SendMessage(MSG1) 2. SendMessage(MSG2) 3. SendMessage(MSG3) 4. MSG1 = ReceiveMessage() 5. DeleteMessage(MSG1)
![Page 14: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/14.jpg)
![Page 15: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/15.jpg)
![Page 16: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/16.jpg)
![Page 17: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/17.jpg)
![Page 18: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/18.jpg)
![Page 19: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/19.jpg)
![Page 20: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/20.jpg)
CQS Architecture Recap Cassandra Persistence Layer • Messages sharded across 100 rows per queue • Avoid wide rows (> 500K) • Minimize churn (Tombstones) • Distribute queue among Cassandra nodes Redis Caching Layer • To meet latency requirements • Payload cache (kicks in after first miss, pre-load next 10k) • Improve FIFOness by storing Msg IDs in Redis List • Handle message visibility entirely in Redis (Hashtable)
![Page 21: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/21.jpg)
CQS Key Cassandra Features Persistence and failover • Cross-DC replication in combination with Local Quorum Reads/Writes
(tunable consistency) Millions of queues, spiky traffic patterns • Massive horizontal scalability Message order (FIFOness) / future dated messages • Wide rows, composite column keys / TimeUUID and column sort order Message retention period (expiration) • TTL Fast lookup of static metadata (Queues, Users etc.) • Row Cache, Secondary Indexes
![Page 22: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/22.jpg)
CQS Scalability • Send(), Receive() and Delete() scale with Cassandra
Ring, API Servers (stateless) and Redis Shards • Are constant time operations • Queues not sharded across Redis servers!
![Page 23: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/23.jpg)
CQS Availability • Depends on availability of Cassandra • Service functions without Redis!
![Page 24: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/24.jpg)
CQS DC Failover
![Page 25: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/25.jpg)
AWS SNS API “Simple Notification Service”
• Topic based Publish/Subscribe Service • Supported protocols: HTTP/CQS/SQS • Few simple core APIs
CreateTopic() / DeleteTopic() Subscribe() / Unsubscribe() ConfirmSubscription() Publish()
• Do not trust message recipients (redelivery policy)
![Page 26: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/26.jpg)
CNS Data Flow Example • Single operation: Publish message MSG1 to a topic
T with four Subscribers S1, S2, S5, S6. • S1, S2 are HTTP endpoints • S5, S6 are CQS queues
![Page 27: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/27.jpg)
![Page 28: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/28.jpg)
![Page 29: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/29.jpg)
![Page 30: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/30.jpg)
![Page 31: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/31.jpg)
![Page 32: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/32.jpg)
![Page 33: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/33.jpg)
CNS Architecture Recap • CQS Queue preserves messages when Publish
Workers are down or overloaded • CQS Visibility Timeout takes care of guaranteed
delivery • Retry policy improves guaranteed delivery for
temporarily unavailable endpoints (http) • Publish Workers hardened for rogue endpoints (failing
endpoints, slow endpoints, …)
![Page 34: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/34.jpg)
Differences SQS/SNS and CQS/CNS Goal: Full API compatibility Current state: • All APIs implemented, most parameters supported • Can use AWS Java SDK and others Limitations: • AWS4 signatures not supported (V1 and V2 ok) • SMS endpoints not supported, limited email support Enhancements: • Additional APIs for monitoring and management • Unlimited number of queues, topics and subscriptions • Adjustable message size and other parameters (MSG <= 64KB, LP <= 20 sec, DS <= 900 sec, RP, …)
![Page 35: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/35.jpg)
CMB Ready for Production Use? • Code of CMB Core is stable • Extensive testing done (including throughput
scalability testing) • In use at Comcast (Sports, DVR, …)
![Page 36: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/36.jpg)
Testing Goals • Functional testing (unit tests, good code coverage) • Stress testing (simulate Redis outage, data center failover) • Endurance testing • Load testing: Verify linear horizontal scalability (CQS / CNS
throughput scalability)
![Page 37: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/37.jpg)
CQS Throughput Scalability • Throughput as a function of Cassandra Ring size • Increase load until throughput (msg/sec) reaches a maximum • Increase ring size and re-test • Ensure sufficient API and Redis capacity to support largest ring • Deployment: 10 API Servers, 5 Redis Shards, 4-16 Node Cassandra
Ring
![Page 38: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/38.jpg)
![Page 39: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/39.jpg)
![Page 40: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/40.jpg)
CQS Throughput Scalability # Load Gen # API Servers # Redis Shards Ring Size API / Sec P99
5 10 5 4 2832 <= 100 ms
5 10 5 8 6072 <= 100 ms
5 10 5 12 9472 <= 100 ms
6 12 6 16 11667 <= 100 ms
8 15 7 20 13514 <= 100 ms
8 15 7 24 15365 <= 100 ms
![Page 41: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/41.jpg)
0 2000 4000 6000 8000 10000 12000 14000 16000 18000
4 8 12 16 20 24
API/sec
Ring Size
![Page 42: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/42.jpg)
CNS Throughput Scalability • Most important metric: End-to-end latency • Fixed number of subscribers, gradually increase #msg/sec published
until system is “overwhelmed” • Increase number of Publish Workers and re-test • Deployment: 8 node Cassandra Ring, 2 API Servers, 2 Redis Shards,
3-6 Publish Workers • Test setup: Single topic with 100 HTTP subscribers, 10 min test
duration
![Page 43: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/43.jpg)
CNS Throughput Scalability 3 Publish Workers
#PUB/SEC
#MSG/SEC
AVG(LAT)
API AVG(RT)
API P95(RT)
API AVG(CQ
S)
API P95(CQ
S)
PROD AVG(RT
)
PROD P95(RT)
PROD AVG(CQ
S)
PROD P95(CQS)
CONS AVG(RT)
CONS P95(RT)
CONS AVG(CQS)
CONS P95(CQS
)
HTTP AVG(RT)
HTTP P95(CQS
)
5 500 198 21 44 10 26 77 150 77 155 47 96 38 85 12 18
10 1000 177 15 30 8 16 68 119 68 118 39 70 31 59 12 17
20 2000 160 16 29 9 17 69 120 69 120 40 69 31 59 9 16
40 4000 209 18 37 10 22 78 138 78 139 41 74 32 62 11 18
80 8000 75656 28 61 14 27 237 1020 237 1020 143 790 131 770 14 21
![Page 44: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/44.jpg)
CNS Throughput Scalability 6 Publish Workers
#PUB/SEC #MSG/SEC AVG(LAT)
API AVG(RT)
API P95(RT)
API AVG(CQ
S)
API P95(CQ
S)
PROD AVG(RT
)
PROD P95(RT)
PROD AVG(CQ
S)
PROD P95(CQS)
CONS AVG(RT)
CONS P95(RT)
CONS AVG(CQS)
CONS P95(CQS
)
HTTP AVG(RT)
HTTP P95(CQS
)
5 500 247 37 117 19 66 141 290 139 290 77 200 57 160 18 34
10 1000 226 41 130 21 79 163 380 162 370 79 200 58 160 17 39
20 2000 199 37 118 20 74 133 280 133 280 68 180 50 150 11 21
40 4000 225 45 140 25 110 148 320 148 320 76 210 53 170 18 25
80 8000 267 48 126 25 80 149 300 149 300 77 180 58 150 22 38
160 16000 145135 76 180 41 120 228 460 228 460 115 280 97 250 28 70
![Page 45: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/45.jpg)
0
200
400
600
800
1000
1200
500 1000 2000 4000 8000
Lat
ency
(ms)
Throughput (Msgs/Sec)
6 workers
3 workers
CNS Throughput Scalability
![Page 46: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/46.jpg)
Use Case: X1 Sports App
![Page 47: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/47.jpg)
Use Case: X1 Sports App
![Page 48: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/48.jpg)
Use Case: X1 Sports App
![Page 49: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/49.jpg)
Moving Forward • Follow SNS / SQS APIs • More load and stress testing • Ease of deployment and scale up • More in-house production deployments (currently isolated
by application) • CQS as a Service
![Page 50: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/50.jpg)
Thank You!
http://github.com/Comcast/cmb
http://groups.google.com/forum/#!forum/cmb-user-forum
![Page 51: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/51.jpg)
BACKUP
![Page 52: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/52.jpg)
CNS Endurance Test
Single topic with 5 HTTP subscribers, 65 msg/sec published 14 mio messages published over 12hrs
![Page 53: C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf](https://reader034.vdocuments.us/reader034/viewer/2022051323/5483149fb07959600c8b495c/html5/thumbnails/53.jpg)
Use Case: EAS