MongoDB San Francisco 2013: Geo Searches for Healthcare Pricing Data presented by Robert Stewart,...
DESCRIPTION
This talk covers the MongoDB deployment architecture used at Castlight Health to support very low latency spatial searches against our database of hundreds of millions of healthcare prices. The Geo haystack index in MongoDB and SSDs turned out to be the perfect solution for our problem. A strategy of replica set flipping also enables Castlight to swap in very large changes to the pricing data with no impact to the running application.
TRANSCRIPT
![Page 1: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/1.jpg)
CONFIDENTIAL
Geo Searches for Health Care Pricing Data
Robert Stewart
Senior Architect, Castlight Health
@wombatnation
![Page 2: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/2.jpg)
Castlight Health
The Business and Technical Problems
Initial Solution
MongoDB, Geo Haystack Index and SSDs
Replica Set Flipping
![Page 3: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/3.jpg)
Castlight Health
Hosted web and mobile applications providing unbiased information on health care cost and quality
Customers are employers and health plans
Founded in 2008, raised $181 million in VC funding
#1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011
Hiring!
![Page 4: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/4.jpg)
Home Page
![Page 5: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/5.jpg)
Search Results
![Page 6: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/6.jpg)
Business Problem
Support searches for:
Prices for a procedure performed by any in-network provider in a geographical area
Prices for all procedures performed by a single provider
Sub-second response, even if returning data on thousands of prices
![Page 7: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/7.jpg)
Technical Problems
Need a very fast geo index
Rate count doubled in last 3 months to 600 million
Major rate updates monthly
Difficult to index data to ensure sequential reads
Sometimes lots of random reads
![Page 8: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/8.jpg)
Pricing Retrieval Architecture
![Page 9: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/9.jpg)
Initial Solution
Store pricing data in MySQL
When Pricing Service starts, create two in-memory indexes and cache most of the rates
55 GB JVM Heap with lots of GC tuning
20-minute service startup time to build indexes
3 hours for background caching of most rates
Trouble Brewing:
Total rates growing quickly
Rolling restart becoming unacceptably slow
If rates not in Java or MySQL cache, retrieval was very slow
![Page 10: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/10.jpg)
Enter the Mongo
![Page 11: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/11.jpg)
Geo Indexes
Tried standard geo 2D indexes in MongoDB
Too slow for my use case
Geo Haystack index
Conceptually similar
From docs.mongodb.org: “A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”
![Page 12: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/12.jpg)
Mercator Projection with 10 degree grid
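To make the grid picture concrete, here is a minimal sketch of how a point can be assigned to a fixed-size grid cell, and how a small-radius search only has to visit the cells around it. This illustrates the general bucketing idea only; it is not MongoDB's internal haystack implementation, and the names `cellKey` and `cellsToScan` are hypothetical.

```javascript
// Illustrative grid bucketing; NOT MongoDB's internal haystack code.
// cellKey and cellsToScan are hypothetical names used only in this sketch.

// Map a longitude/latitude pair to the integer grid cell that contains it.
function cellKey(lon, lat, cellSizeDeg) {
  const col = Math.floor((lon + 180) / cellSizeDeg);
  const row = Math.floor((lat + 90) / cellSizeDeg);
  return col + ":" + row;
}

// List every cell that overlaps the bounding box around a search circle.
function cellsToScan(lon, lat, radiusDeg, cellSizeDeg) {
  const minCol = Math.floor((lon - radiusDeg + 180) / cellSizeDeg);
  const maxCol = Math.floor((lon + radiusDeg + 180) / cellSizeDeg);
  const minRow = Math.floor((lat - radiusDeg + 90) / cellSizeDeg);
  const maxRow = Math.floor((lat + radiusDeg + 90) / cellSizeDeg);
  const keys = [];
  for (let col = minCol; col <= maxCol; col++) {
    for (let row = minRow; row <= maxRow; row++) {
      keys.push(col + ":" + row);
    }
  }
  return keys;
}

// San Francisco on the 10 degree grid from the slide.
console.log(cellKey(-122.4, 37.79, 10));
// Cells touched by a 0.5 degree search when the cells are also 0.5 degrees wide.
console.log(cellsToScan(-122.4, 37.79, 0.5, 0.5).length);
```

With a cell size close to the search radius, a query touches only a handful of cells, which is why a small, fixed search area suits this kind of index.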
![Page 13: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/13.jpg)
Geo Haystack
We chose degrees long-lat for x-y coordinate system
25 miles is our default search radius
Roughly 0.5 degrees in middle of the US
db.priceables_1.ensureIndex(
    { loc: "geoHaystack", pm: 1 },
    { bucketSize: 0.5 })
db.runCommand(
    { geoSearch: "priceables_1",
      near: [-122.4, 37.79],
      maxDistance: 0.5,
      search: { pm: 6757 },
      limit: 50000 })
maxDistance calculated using great circle algorithm
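The "roughly 0.5 degrees" figure can be reproduced with a short calculation. The sketch below converts a radius in miles to degrees using the length of a great-circle arc, then builds a command document with the same fields as the `geoSearch` call on the slide. The talk does not say exactly how Castlight derives `maxDistance`, so the choice of longitude degrees at the search latitude is an assumption, and `MEAN_EARTH_RADIUS_MILES`, `milesToLatDegrees`, `milesToLonDegrees` and `buildGeoSearch` are hypothetical names.

```javascript
// Sketch: convert a search radius in miles to degrees for a flat-geometry query.
// All names here are hypothetical; they are not from the talk or from MongoDB.
const MEAN_EARTH_RADIUS_MILES = 3958.8;

// Degrees of latitude spanned by a great-circle arc of the given length.
function milesToLatDegrees(miles) {
  return (miles / MEAN_EARTH_RADIUS_MILES) * (180 / Math.PI);
}

// Degrees of longitude covering the same distance at a given latitude.
// Circles of latitude shrink by cos(latitude), so more degrees are needed.
function milesToLonDegrees(miles, latDeg) {
  const latRad = (latDeg * Math.PI) / 180;
  return milesToLatDegrees(miles) / Math.cos(latRad);
}

// Build a command document with the same fields as the slide's geoSearch call.
// ASSUMPTION: the wider longitude figure is used for maxDistance.
function buildGeoSearch(collection, lon, lat, radiusMiles, pm) {
  return {
    geoSearch: collection,
    near: [lon, lat],
    maxDistance: milesToLonDegrees(radiusMiles, lat),
    search: { pm: pm },
    limit: 50000,
  };
}

const cmd = buildGeoSearch("priceables_1", -122.4, 37.79, 25, 6757);
console.log(cmd.maxDistance.toFixed(2));
```

At San Francisco's latitude, 25 miles works out to a little under 0.5 degrees of longitude and somewhat less in latitude, which fits the rounded value used for `bucketSize` and `maxDistance` on the slide.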
![Page 14: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/14.jpg)
Geo Haystack Pros
Very fast when retrieving many documents in a relatively small search radius
Great when you also need to apply a secondary filter
Compound 2dsphere index in MongoDB 2.4 has even better support
![Page 15: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/15.jpg)
Geo Haystack Cons
Supports only one extra filter in the index (SERVER-2979)
Bug when running an unindexed query on only the second part of the key (SERVER-8645)
> db.priceables_1.find({pm: 6757})
error: { "$err" : "assertion src/mongo/db/geo/haystack.cpp:178" }
Second part of index can’t have an array value
Location part of key can’t be null
![Page 16: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/16.jpg)
SSDs
For uncached data on HDD, Geo Haystack was twice as fast as custom Java geo index and MySQL
Still close to 1 minute for big queries with full data set
Death by random read
Tested with a $200 Samsung SSD
Typical query dropped to 20 ms
Big query only about 150 ms
![Page 17: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/17.jpg)
Mongoperf on SSDs
Random 4k block reads, 5 GB file, 16 threads

| Env | SSD | Read Ops/s | Read MB/s |
| --- | --- | --- | --- |
| Prod | Samsung 200GB SLC | 74k | 288 |
| QA VM | Samsung 200GB SLC | 30k | 117 |
| Dev | Samsung 830 256GB SATA MLC | 47k | 183 |

Sequential write of the 5 GB file

| Env | SSD | Write Ops/s | Write MB/s |
| --- | --- | --- | --- |
| Prod | Samsung 200GB SLC | 1074 | 289 |
| QA VM | Samsung 200GB SLC | 405 | 196 |
| Dev | Samsung 830 256GB SATA MLC | 438 | 210 |
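The read figures can be cross-checked with simple arithmetic: operations per second multiplied by the 4k block size should give the throughput. The sketch below assumes the "k" values are rounded to the nearest thousand and that MB here means 1024 × 1024 bytes; `opsToMiBPerSec` is a hypothetical helper name.

```javascript
// Sketch: convert random-read operations per second into throughput.
// opsToMiBPerSec is a hypothetical helper; 4096 bytes is the 4k block size
// quoted on the slide.
function opsToMiBPerSec(opsPerSec, blockBytes) {
  return (opsPerSec * blockBytes) / (1024 * 1024);
}

// Read ops/s from the table, with "74k" taken as 74000 and so on (assumption).
const readOps = { Prod: 74000, "QA VM": 30000, Dev: 47000 };
for (const [env, ops] of Object.entries(readOps)) {
  console.log(env, opsToMiBPerSec(ops, 4096).toFixed(1), "MB/s");
}
```

The computed values land within about 1 MB/s of the reported read throughput, so the two columns of the read table are consistent with each other.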
![Page 18: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/18.jpg)
Low Impact Pricing Updates
Requirements:
Major price updates monthly
Minor updates more frequently
Huge bulk loads with no impact on active replica set
I/O bound, not CPU bound
![Page 19: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/19.jpg)
Replica Set Flipping Solution
Two replica sets
Lowered cost with two SSDs on each pricing server
scp compressed files from QA to passive replica set
Protip: to compress and uncompress
    tar cvf - pricing | pigz > ~/pricing.tgz
    pigz -dc pricing.tgz | tar xvf -
Page in index and data
    db.runCommand({ touch: "priceables_1", index: true, data: true })
Pricing Service operation to atomically flip
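The flip itself can be pictured as swapping a single reference that every request reads. The sketch below is only an illustration of that idea: the talk does not show the Pricing Service code (which runs on the JVM), so `ReplicaSetFlipper` and its methods are hypothetical, and JavaScript is used to match the shell snippets in the deck. The replica set names are the ones shown in the architecture slide.

```javascript
// Sketch of a flip operation; ReplicaSetFlipper is a hypothetical class,
// not the Pricing Service's actual implementation.
class ReplicaSetFlipper {
  constructor(activeName, passiveName) {
    // Both names live in one frozen object so a flip replaces the pair
    // with a single assignment.
    this.state = Object.freeze({ active: activeName, passive: passiveName });
  }

  // Readers call this on every request and never see a half-updated pair.
  activeSet() {
    return this.state.active;
  }

  // The passive set is the one that receives the next bulk load.
  passiveSet() {
    return this.state.passive;
  }

  // Swap roles by publishing a new immutable state object.
  flip() {
    const previous = this.state;
    this.state = Object.freeze({
      active: previous.passive,
      passive: previous.active,
    });
    return this.state.active;
  }
}

const flipper = new ReplicaSetFlipper("prodpricing1", "prodpricing2");
console.log(flipper.activeSet());
flipper.flip();
console.log(flipper.activeSet());
```

Because requests read only the active name, a bulk load into the passive set stays invisible to them until `flip()` is called.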
![Page 20: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/20.jpg)
Replica Set Architecture
Replica sets: prodpricing1, prodpricing2
Physical servers:
Server pricing1: mongod 28001 (primary), mongod 28002 (secondary)
Server pricing2: mongod 28001 (secondary), mongod 28002 (primary)
Server db1: mongod 28001 (arbiter)
Server db2: mongod 28002 (arbiter)
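As a rough illustration of this layout, the sketch below builds replica set configuration documents with two data-bearing members and one arbiter each. Two things are assumptions rather than facts from the slide: that the hostnames equal the server labels, and that prodpricing1 runs on port 28001 while prodpricing2 runs on port 28002. `buildReplicaSetConfig` is a hypothetical helper; `_id`, `members`, `host` and `arbiterOnly` are standard replica set configuration fields.

```javascript
// Sketch of replica set configuration documents for the layout above.
// ASSUMPTIONS: hostnames equal the server labels on the slide, and
// prodpricing1 listens on 28001 while prodpricing2 listens on 28002.
function buildReplicaSetConfig(setName, port, dataHosts, arbiterHost) {
  // Data-bearing members come first, numbered from 0.
  const members = dataHosts.map((host, i) => ({
    _id: i,
    host: host + ":" + port,
  }));
  // The arbiter votes in elections but holds no data.
  members.push({
    _id: dataHosts.length,
    host: arbiterHost + ":" + port,
    arbiterOnly: true,
  });
  return { _id: setName, members: members };
}

const prodpricing1 = buildReplicaSetConfig("prodpricing1", 28001, ["pricing1", "pricing2"], "db1");
const prodpricing2 = buildReplicaSetConfig("prodpricing2", 28002, ["pricing2", "pricing1"], "db2");
console.log(JSON.stringify(prodpricing1, null, 2));
console.log(JSON.stringify(prodpricing2, null, 2));
```

Member order does not decide which node becomes primary; the sketch leaves out any priority settings the real deployment may use.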
![Page 21: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/21.jpg)
Replica Set Flipping Drawbacks
Obviously, increased cost, but only for SSDs
Recently added caching of remote pricing lookups
TTL collections
Cache is lost during a flip
But, usually flip late at night
Cache eviction time is only a few hours
![Page 22: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/22.jpg)
Overall Results
Geo search speed with cold cache acceptable
Geo search speed with warm cache awesome
Pricing Service startup down to a few seconds
No production impact for major rate updates
Lowered risk for minor rate updates
![Page 23: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/23.jpg)
Summary
Geo Haystack Index great for …
Retrieving lots of documents in a constrained search area
Geo searches with a secondary filter
SSDs great for …
Random reads
Reducing need for lots of complex indexes
Replica set flipping great for …
Instant swap of large amounts of data
Primarily, if not solely, read only
Trading cost for operational flexibility
![Page 24: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health](https://reader035.vdocuments.us/reader035/viewer/2022062615/548c95c0b479593d1f8b4998/html5/thumbnails/24.jpg)
Q & A