Building a search data store for the world's largest biometric identity system

DESCRIPTION
Presentation contents for the upcoming talk at MongoDB Bangalore 2012. Event details are here: http://www.10gen.com/events/mongodb-bangalore

TRANSCRIPT
Slide 1
Search data store for the world's largest biometric identity system
CONFIDENTIAL: For limited circulation only
Regunath [email protected]
twitter @regunathb
Shashikant [email protected]
Slide 2
●1.2 billion residents
●640,000 villages; ~60% live under $2/day
●~75% literacy, <3% pay income tax, <20% have banking access
●~800 million mobile connections, ~200-300 mn migrant workers
●Govt. spends about $25-40B on direct subsidies
●Residents have no standard identity document
●Most programs plagued by ghost and multiple identities, causing leakage of 30-40%
India
Slide 3
●Create a common ‘national identity’ for every ‘resident’
●Biometric backed identity to eliminate duplicates
●‘Verifiable online identity’ for portability
●Applications ecosystem using open APIs
●Aadhaar enabled bank account and payment platform
●Aadhaar enabled electronics, paperless KYC (Know Your Customer)
Aadhaar
Slide 4
●Multi-attribute queries like: name contains ‘regunath’ AND city = ‘bangalore’ AND address contains ‘J P Nagar’ AND YearOfBirth = ……
●Search 1.2B resident data with photo, history
●~35 KB average record size
●Response times in milliseconds
●Open scale out
Search Requirements
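A multi-attribute query like the one on this slide can be sketched as a MongoDB-style filter document. This is an illustrative sketch only: the field names (`name`, `city`, `address`, `yearOfBirth`) are assumptions, not the actual Aadhaar schema.

```python
import re

def build_search_filter(name=None, city=None, address=None, year_of_birth=None):
    """Build a MongoDB-style filter document for a multi-attribute search.

    'contains' clauses become case-insensitive regexes; exact attributes
    become equality matches. In a live system this dict would be passed
    to a driver call such as pymongo's collection.find(filter).
    """
    clauses = {}
    if name is not None:
        clauses["name"] = re.compile(re.escape(name), re.IGNORECASE)
    if city is not None:
        clauses["city"] = city
    if address is not None:
        clauses["address"] = re.compile(re.escape(address), re.IGNORECASE)
    if year_of_birth is not None:
        clauses["yearOfBirth"] = year_of_birth
    return clauses

# Top-level keys in one filter document are ANDed together.
f = build_search_filter(name="regunath", city="bangalore",
                        address="J P Nagar", year_of_birth=1980)
```

Note that unanchored regex "contains" matches cannot use a B-tree index efficiently, which is one motivation for offloading free-text search to Solr in the design that follows.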
Slide 5
●Auto-sharding
●Replication
●Failover… essentially an AP (slaveOk) data store, in CAP parlance
●Evolving schema
●Map-Reduce for analysis
●Full text search
●Compound (or) multi-keys
Why MongoDB
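Two of the bullets above (evolving schema, map-reduce for analysis) can be sketched briefly. The documents and the map-reduce job below are illustrative, not from the actual system; the map/reduce functions are JavaScript strings because MongoDB executes them server-side.

```python
# Evolving schema: documents in one collection need not share fields,
# so a new attribute can appear without any migration step.
resident_v1 = {"_id": 1, "name": "abcde", "year": 1980}
resident_v2 = {"_id": 2, "name": "fghij", "year": 1975,
               "mobile": "9800000000"}  # new field, no schema change

# Map-reduce for analysis (e.g. residents per birth year): this is the
# command document a driver would send (collection name is illustrative).
map_reduce_cmd = {
    "mapReduce": "residents",
    "map": "function() { emit(this.year, 1); }",
    "reduce": "function(key, values) { return Array.sum(values); }",
    "out": {"inline": 1},
}
```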
Slide 6
Design
●Read/Search
●Sharded Solr indexes for search
●Keyed document read from MongoDB
●Write
●Eventual consistency (across data sources) driven by application
●Composite MongoDB-Solr app persistence handler
[Diagram: the Search API client app (1) queries the sharded Solr indexes with Name=‘abcde’, Address=‘some place’, Year=1980, then (2) does a keyed document read from MongoDB, returning { _id:123456789, name: ‘abcde’, year:1980, ….. }]
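The two-step read path on this slide can be sketched with in-memory stand-ins for Solr and MongoDB (no real services involved; the field values are the illustrative ones from the diagram).

```python
# Step 1 store: Solr holds only the searchable fields plus the document key.
solr_index = [
    {"_id": 123456789, "name": "abcde", "address": "some place", "year": 1980},
]

# Step 2 store: MongoDB holds the full (~35 KB) resident document, keyed by _id.
mongo_collection = {
    123456789: {"_id": 123456789, "name": "abcde", "year": 1980,
                "photo": b"...", "history": ["..."]},
}

def search(name, address, year):
    """Mirror arrows 1 and 2 in the diagram: attribute match against the
    Solr index, then a keyed read from MongoDB for each hit."""
    hits = [d["_id"] for d in solr_index
            if d["name"] == name and d["address"] == address
            and d["year"] == year]
    return [mongo_collection[_id] for _id in hits]

results = search("abcde", "some place", 1980)
```

The design choice this illustrates: the search engine answers "which _ids match?", while the document store answers "give me the full record for this _id" — each system doing only what it is good at.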
Slide 7
●Start: 4M records in 2 shards; Current: 250M records in 8 shards (8 x ~2 TB x 3 replicas)
●Performance, Reliability & Durability
●SlaveOk
●getLastError, Write Concern: availability vs durability
· j = journaling
· w = nodes-to-write
●Replica-sets / Shards – how?
Implementation and Deployment
[Diagram: three servers, each running a mongos Router and a config server (Config 1-3); replica sets RS 1 and RS 2 have their members spread across the servers, each set with a Primary, a Secondary and an Arbiter]
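The availability-vs-durability knobs from the bullets above can be sketched as the getLastError command documents a driver of that era would send after a write (modern drivers express the same settings as a write concern on the operation itself). The specific w/j values are illustrative.

```python
# w = how many replica-set members must acknowledge the write
# j = whether the write must reach the on-disk journal first

# Fast, available, but a crash can lose the write:
fast_but_risky = {"getLastError": 1, "w": 1, "j": False}

# Durable: primary plus one secondary, journaled, at the cost of latency:
durable = {"getLastError": 1, "w": 2, "j": True}
```

On the read side, slaveOk plays the complementary role: allowing reads from secondaries trades strict consistency for availability — the AP choice noted on the "Why MongoDB" slide.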
Slide 8
●Monitoring tools evaluated
●MMS
●munin
●Manual approach - daily ritual
●RS, DB, config, router - health and stats
●Problem analysis stats
●mongostat, iostat, currentOps, logs
●Client connections
●Stats for storage, shards addition
●Data file size
●Shard data distribution
●Replication
Monitoring and Troubleshooting
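The "daily ritual" health checks above can be scripted. Each entry below is the command document behind the corresponding shell helper (`db.serverStatus()`, `rs.status()`, `db.stats()`); mapping and wording are mine, and `currentOp` was a shell helper rather than a server command in 2012-era MongoDB, so it is noted only in a comment.

```python
# Command documents for the recurring health checks; in a live script each
# would be sent with a driver call such as pymongo's db.command(doc).
daily_checks = {
    "server health": {"serverStatus": 1},        # db.serverStatus()
    "replica set state": {"replSetGetStatus": 1},  # rs.status(), on admin db
    "db / data file sizes": {"dbStats": 1},      # db.stats()
    # in-flight operations: db.currentOp() in the shell; `mongostat` and
    # iostat cover throughput and disk I/O from the OS side.
}
```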
Slide 9
●Indexing 32 fields
●Compound indexes
●Multi-key indexes
· { … "indexes" : [ { "email" : "[email protected]", "phone" : "123456789" } ] }
· db.coll.find({ "indexes.email" : "[email protected]" })
●Indexes use b-tree
●Many fields to index
●Performs well up to 1-2M documents
●Best if index fits in memory
●Data replication, RS failover
●Rollback when an RS goes out of sync
· Manual restore (physical data copy)
· Restarting a very stale node
Key Learnings on MongoDB
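The multi-key index behaviour from the bullets above can be illustrated in memory: indexing a dotted path into an array (e.g. "indexes.email") creates one index entry per array element, so a find() on that path matches any element. The email/phone values here are hypothetical placeholders.

```python
doc = {
    "_id": 1,
    "indexes": [
        {"email": "a@example.com", "phone": "123456789"},
        {"email": "b@example.com", "phone": "987654321"},
    ],
}

def matches(document, dotted_path, value):
    """Mimic find({"indexes.email": value}) over an embedded array:
    the query matches if ANY array element has the value at the sub-path."""
    field, sub = dotted_path.split(".")
    return any(elem.get(sub) == value for elem in document.get(field, []))

# Both elements are reachable through the same dotted path:
assert matches(doc, "indexes.email", "b@example.com")
```

Packing the searchable attributes into one array and multi-key-indexing it is one way to cope with the "many fields to index" problem noted above — one index covers what would otherwise need many.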
Slide 10
Questions?