Building a search data store for the world's largest biometric identity system

DESCRIPTION
Presentation contents for the upcoming talk at MongoDB Bangalore 2012. Event details are here: http://www.10gen.com/events/mongodb-bangalore

TRANSCRIPT
Slide 1
Search data store for the world's largest biometric identity system
CONFIDENTIAL: For limited circulation only
Regunath [email protected]
twitter @regunathb
Shashikant [email protected]
Slide 2
●1.2 billion residents
●640,000 villages; ~60% live under $2/day
●~75% literacy, <3% pay income tax, <20% have banking access
●~800 million mobile connections, ~200-300 mn migrant workers
●Govt. spends about $25-40B on direct subsidies
●Residents have no standard identity document
●Most programs plagued by ghost and multiple identities, causing leakage of 30-40%
India
Slide 3
●Create a common ‘national identity’ for every ‘resident’
●Biometric backed identity to eliminate duplicates
●‘Verifiable online identity’ for portability
●Applications ecosystem using open APIs
●Aadhaar enabled bank account and payment platform
●Aadhaar enabled electronics, paperless KYC (Know Your Customer)
Aadhaar
Slide 4
●Multi-attribute queries like: name contains ‘regunath’ AND city = ‘bangalore’ AND address contains ‘J P Nagar’ AND YearOfBirth = ……
●Search 1.2B resident data with photo, history
●~35 KB average record size
●Response times in milliseconds
●Open scale out
Search Requirements
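A multi-attribute query like the one on this slide can be sketched as a MongoDB-style filter document. This is an illustrative sketch only: the field names (`name`, `city`, `address`, `yearOfBirth`) are assumptions, not the actual Aadhaar schema.

```python
import re

def build_search_filter(name=None, city=None, address=None, year_of_birth=None):
    """Build a MongoDB-style filter document for a multi-attribute search.

    'contains' clauses become case-insensitive regexes; exact attributes
    become equality matches. In a live system this dict would be passed
    to a driver call such as pymongo's collection.find(filter).
    """
    clauses = {}
    if name is not None:
        clauses["name"] = re.compile(re.escape(name), re.IGNORECASE)
    if city is not None:
        clauses["city"] = city
    if address is not None:
        clauses["address"] = re.compile(re.escape(address), re.IGNORECASE)
    if year_of_birth is not None:
        clauses["yearOfBirth"] = year_of_birth
    return clauses

# Top-level keys in one filter document are ANDed together.
f = build_search_filter(name="regunath", city="bangalore",
                        address="J P Nagar", year_of_birth=1980)
```

Note that unanchored regex "contains" matches cannot use a B-tree index efficiently, which is one motivation for offloading free-text search to Solr in the design that follows.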
Slide 5
●Auto-sharding
●Replication
●Failover… essentially an AP (slaveOk) data store, in CAP parlance
●Evolving schema
●Map-Reduce for analysis
●Full text search
●Compound (or) multi-keys
Why MongoDB
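Two of the bullets above (evolving schema, map-reduce for analysis) can be sketched briefly. The documents and the map-reduce job below are illustrative, not from the actual system; the map/reduce functions are JavaScript strings because MongoDB executes them server-side.

```python
# Evolving schema: documents in one collection need not share fields,
# so a new attribute can appear without any migration step.
resident_v1 = {"_id": 1, "name": "abcde", "year": 1980}
resident_v2 = {"_id": 2, "name": "fghij", "year": 1975,
               "mobile": "9800000000"}  # new field, no schema change

# Map-reduce for analysis (e.g. residents per birth year): this is the
# command document a driver would send (collection name is illustrative).
map_reduce_cmd = {
    "mapReduce": "residents",
    "map": "function() { emit(this.year, 1); }",
    "reduce": "function(key, values) { return Array.sum(values); }",
    "out": {"inline": 1},
}
```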
Slide 6
Design
●Read/Search
●Sharded Solr indexes for search
●Keyed document read from MongoDB
●Write
●Eventual consistency (across data sources) driven by application
●Composite MongoDB-Solr app persistence handler
[Diagram: the Search API client app (1) queries the sharded Solr indexes with Name=‘abcde’, Address=‘some place’, Year=1980, then (2) does a keyed document read from MongoDB, returning { _id:123456789, name: ‘abcde’, year:1980, ….. }]
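The two-step read path on this slide can be sketched with in-memory stand-ins for Solr and MongoDB (no real services involved; the field values are the illustrative ones from the diagram).

```python
# Step 1 store: Solr holds only the searchable fields plus the document key.
solr_index = [
    {"_id": 123456789, "name": "abcde", "address": "some place", "year": 1980},
]

# Step 2 store: MongoDB holds the full (~35 KB) resident document, keyed by _id.
mongo_collection = {
    123456789: {"_id": 123456789, "name": "abcde", "year": 1980,
                "photo": b"...", "history": ["..."]},
}

def search(name, address, year):
    """Mirror arrows 1 and 2 in the diagram: attribute match against the
    Solr index, then a keyed read from MongoDB for each hit."""
    hits = [d["_id"] for d in solr_index
            if d["name"] == name and d["address"] == address
            and d["year"] == year]
    return [mongo_collection[_id] for _id in hits]

results = search("abcde", "some place", 1980)
```

The design choice this illustrates: the search engine answers "which _ids match?", while the document store answers "give me the full record for this _id" — each system doing only what it is good at.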
Slide 7
●Start: 4M records in 2 shards; Current: 250M records in 8 shards (8 x ~2 TB x 3 replicas)
●Performance, Reliability & Durability
●SlaveOk
●getLastError, Write Concern: availability vs durability
· j = journaling
· w = nodes-to-write
●Replica-sets / Shards – how?
Implementation and Deployment
[Diagram: three servers, each running a mongos Router and a config server (Config 1-3); replica sets RS 1 and RS 2 have their members spread across the servers, each set with a Primary, a Secondary and an Arbiter]
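The availability-vs-durability knobs from the bullets above can be sketched as the getLastError command documents a driver of that era would send after a write (modern drivers express the same settings as a write concern on the operation itself). The specific w/j values are illustrative.

```python
# w = how many replica-set members must acknowledge the write
# j = whether the write must reach the on-disk journal first

# Fast, available, but a crash can lose the write:
fast_but_risky = {"getLastError": 1, "w": 1, "j": False}

# Durable: primary plus one secondary, journaled, at the cost of latency:
durable = {"getLastError": 1, "w": 2, "j": True}
```

On the read side, slaveOk plays the complementary role: allowing reads from secondaries trades strict consistency for availability — the AP choice noted on the "Why MongoDB" slide.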
Slide 8
●Monitoring tools evaluated
●MMS
●munin
●Manual approach - daily ritual
●RS, DB, config, router - health and stats
●Problem analysis stats
●mongostat, iostat, currentOps, logs
●Client connections
●Stats for storage, shards addition
●Data file size
●Shard data distribution
●Replication
Monitoring and Troubleshooting
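The "daily ritual" health checks above can be scripted. Each entry below is the command document behind the corresponding shell helper (`db.serverStatus()`, `rs.status()`, `db.stats()`); mapping and wording are mine, and `currentOp` was a shell helper rather than a server command in 2012-era MongoDB, so it is noted only in a comment.

```python
# Command documents for the recurring health checks; in a live script each
# would be sent with a driver call such as pymongo's db.command(doc).
daily_checks = {
    "server health": {"serverStatus": 1},        # db.serverStatus()
    "replica set state": {"replSetGetStatus": 1},  # rs.status(), on admin db
    "db / data file sizes": {"dbStats": 1},      # db.stats()
    # in-flight operations: db.currentOp() in the shell; `mongostat` and
    # iostat cover throughput and disk I/O from the OS side.
}
```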
Slide 9
●Indexing 32 fields
●Compound indexes
●Multi-key indexes
· { … "indexes" : [ { "email" : "[email protected]", "phone" : "123456789" } ] }
· db.coll.find({ "indexes.email" : "[email protected]" })
●Indexes use b-tree
●Many fields to index
●Performs well up to 1-2M documents
●Best if index fits in memory
●Data replication, RS failover
●Rollback when an RS goes out of sync
· Manual restore (physical data copy)
· Restarting a very stale node
Key Learnings on MongoDB
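The multi-key index behaviour from the bullets above can be illustrated in memory: indexing a dotted path into an array (e.g. "indexes.email") creates one index entry per array element, so a find() on that path matches any element. The email/phone values here are hypothetical placeholders.

```python
doc = {
    "_id": 1,
    "indexes": [
        {"email": "a@example.com", "phone": "123456789"},
        {"email": "b@example.com", "phone": "987654321"},
    ],
}

def matches(document, dotted_path, value):
    """Mimic find({"indexes.email": value}) over an embedded array:
    the query matches if ANY array element has the value at the sub-path."""
    field, sub = dotted_path.split(".")
    return any(elem.get(sub) == value for elem in document.get(field, []))

# Both elements are reachable through the same dotted path:
assert matches(doc, "indexes.email", "b@example.com")
```

Packing the searchable attributes into one array and multi-key-indexing it is one way to cope with the "many fields to index" problem noted above — one index covers what would otherwise need many.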
Slide 10
Questions?