Intelligent Stream Filtering Using MongoDB
Mihnea Giurgea
CONTENTS
• Who Are We?
• MongoDB On Amazon
• How We Do Stream-Filtering
• First Approach
• Second Approach
• Questions
UBERVU AT A GLANCE
• ~50K updates and inserts per minute
• ~30 Amazon instances
• 32 TB worth of EBS volumes
The force is strong at uberVU
DB INFRASTRUCTURE
• NO single points of failure
• SCALABLE horizontally & vertically
MULTIPLE DB ENVIRONMENTS
• 4 different mongo environments
• each with its own shards, config servers, etc.
• Why?
• isolate problems & bad behavior
• ++reliability
• better resource (hardware) distribution
• different number of shards per database
• some databases need more replica nodes, others fewer
MULTIPLE ENVIRONMENTS
• each application server runs 4 mongos processes, instead of just 1
• each of the 3 config servers runs 4 mongod processes (one per environment)
MONGOD
• run only one mongod process per replica node
• each shard resides on an mdadm RAID 10 array
• consisting of 16 HDDs x 250 GB each
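As a quick worked example of what this layout yields, the sketch below computes the usable capacity of one such array. It assumes standard two-way mirroring for RAID 10, which the slides do not spell out; the function name `raid10UsableGb` is only illustrative.

```javascript
// Usable capacity of a RAID 10 array: half of the raw capacity,
// assuming every disk is mirrored exactly once (two-way mirroring).
function raid10UsableGb(diskCount, diskSizeGb) {
  if (diskCount % 2 !== 0) {
    throw new Error("RAID 10 needs an even number of disks");
  }
  return (diskCount * diskSizeGb) / 2;
}

const rawGb = 16 * 250;                   // 4000 GB of raw disk per array
const usableGb = raid10UsableGb(16, 250); // 2000 GB usable per array
console.log(`raw: ${rawGb} GB, usable: ${usableGb} GB`);
```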
AMAZON EC2 INSTANCES
• mongod primary: High-Memory Double Extra Large (34.2 GB of memory)
• mongod secondary: High-Memory Extra Large (17.1 GB of memory)
• config servers: Large Instance (cheapest 64-bit machine)
• expensive for its purpose :(
THE PROBLEM
Gather mentions from the web (Twitter, Facebook, etc.)
Data Stream = mentions around a certain term
• mentions are annotated (language, location, sentiment, etc.)
• data stream is indexed in MongoDB
FILTERING
• filter data stream by time (since & until)
• filter by other attributes:
• platform: Twitter, Facebook
• language: English, French
• location: UK, US, Romania
• sentiment
• gender
• etc.
FILTERING
“MongoDB”
filtered by:
• United States
• gender: female
• sentiment: positive
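A filtered request like the one above could translate into a query document roughly as follows. This is a minimal sketch: the field names (`stream`, `time`, `location`, and so on), the half-open time window and the sample dates are assumptions, not the schema actually used at uberVU.

```javascript
// Build a MongoDB query document for one stream, a time window
// (since inclusive, until exclusive) and optional attribute filters.
function buildFilterQuery(stream, since, until, filters = {}) {
  const query = { stream: stream, time: { $gte: since, $lt: until } };
  for (const [attribute, value] of Object.entries(filters)) {
    query[attribute] = value; // plain equality match per attribute
  }
  return query;
}

const query = buildFilterQuery(
  "MongoDB",
  new Date("2012-01-01T00:00:00Z"), // hypothetical "since"
  new Date("2012-02-01T00:00:00Z"), // hypothetical "until"
  { location: "US", gender: "female", sentiment: "positive" }
);
console.log(JSON.stringify(query, null, 2));
// In the mongo shell this document could be passed to:
//   db.mentions.find(query).sort({ time: -1 })
```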
FIRST APPROACH
• if no filters are needed, 1 index will suffice:
1. stream, time
• 1 filter => 2 indexes:
1. stream, time
2. stream, platform, time
• sort attribute must be last in index
FIRST APPROACH
• 2 filters => 4 indexes
1. stream, time
2. stream, platform, time
3. stream, language, time
4. stream, platform, language, time
• ...etc... (F filters => 2^F indexes, one per subset of filters)
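To see how fast this grows, the sketch below enumerates one index per subset of filters, always starting with `stream` and ending with `time`. The helper name `allFilterIndexes` is illustrative only, and the order of filter fields inside each index is an arbitrary choice.

```javascript
// Enumerate the index key lists needed when every filter combination
// gets its own index: (stream, <subset of filters>, time).
function allFilterIndexes(filters) {
  const indexes = [];
  const subsetCount = 1 << filters.length; // number of subsets of the filters
  for (let mask = 0; mask < subsetCount; mask++) {
    const fields = ["stream"];
    filters.forEach((name, bit) => {
      if (mask & (1 << bit)) fields.push(name);
    });
    fields.push("time"); // the sort attribute must come last
    indexes.push(fields);
  }
  return indexes;
}

const twoFilters = allFilterIndexes(["platform", "language"]);
console.log(twoFilters.map((fields) => fields.join(", ")));

const fiveFilters = allFilterIndexes(
  ["platform", "language", "location", "sentiment", "gender"]
);
console.log(fiveFilters.length); // 32
```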
IMPROVEMENTS
• don’t really need (stream, platform, language, time)
• when filtering for platform & language, use:
• stream, platform, time OR
• stream, language, time
• which one?
• the one with the smallest cardinality
IMPROVEMENTS
• saves index space
• but increases query scanning time
• finding the right indexes is a trade-off between:
• indexing space
• query scanning time
IMPROVEMENTS
• Question: when filtering by platform & language, what index should we use?
• stream, platform, time
• stream, language, time
• Answer: smallest cardinality
• we need to know the size of each attribute:
• platform: twitter - 90%
• language: English - 60%
• location: France - 8%
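The choice can be sketched as a lookup over known selection rates. The three rates below come from the slide; the `selectionRate` table layout and the `chooseIndex` helper are assumptions made for illustration.

```javascript
// Share of a stream's mentions matched by each filter value (from the slide).
const selectionRate = {
  "platform:twitter": 0.90,
  "language:English": 0.60,
  "location:France": 0.08,
};

// Pick the index whose filter matches the fewest mentions,
// so the remaining filters scan as few documents as possible.
function chooseIndex(filters) {
  let best = null;
  for (const [attribute, value] of Object.entries(filters)) {
    // Unknown values are treated as matching everything (worst case).
    const rate = selectionRate[`${attribute}:${value}`] ?? 1;
    if (best === null || rate < best.rate) best = { attribute, rate };
  }
  return best === null ? ["stream", "time"] : ["stream", best.attribute, "time"];
}

const index = chooseIndex({ platform: "twitter", language: "English" });
console.log(index.join(", ")); // stream, language, time
```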
ATTRIBUTES
• normalize each attribute
• language: English => 13
• gender: male => 2038, etc.
• numbers use less space & are faster
• each mention now has several attributes:
{ 'platform': 'twitter', 'language': 'English', 'location': 'UK', 'gender': 'female' }
--->
{ 'attributes': [13, 213, 2039, 1] }
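A minimal sketch of this normalization step is shown below. The IDs for English (13) and male (2038) are stated on the slides; the others are inferred from the slides' examples, and the `attributeIds` table and `normalize` helper are hypothetical.

```javascript
// Lookup table from "attribute:value" to a small integer ID.
const attributeIds = {
  "platform:twitter": 1,
  "language:English": 13,
  "language:romanian": 58,
  "location:UK": 213,
  "gender:male": 2038,
  "gender:female": 2039,
};

// Replace a mention's named attributes with a flat array of IDs.
// Values without a known ID are skipped in this sketch.
function normalize(mention) {
  const ids = [];
  for (const [attribute, value] of Object.entries(mention)) {
    const id = attributeIds[`${attribute}:${value}`];
    if (id !== undefined) ids.push(id);
  }
  return { attributes: ids };
}

const doc = normalize({
  platform: "twitter", language: "English", location: "UK", gender: "female",
});
console.log(JSON.stringify(doc)); // {"attributes":[1,13,213,2039]}
```

The order of IDs inside the array does not matter for matching, only for which one the query lists first.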
MULTIKEY INDEX
• use a multikey index for attributes:
• stream, attributes, time
• use $all to query for multiple filters
db.mentions.find( { 'stream': 'mongo', 'platform': 'twitter', 'gender': 'male', 'language': 'romanian' } )
--->
db.mentions.find( { 'stream': 'mongo', 'attributes': { '$all': [1, 2038, 58] } } )
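Put together, the index and the query could be built as in the sketch below. The ascending direction of each index key and the `buildAllQuery` helper are assumptions; the collection name `mentions` follows the query on the slide.

```javascript
// Key pattern for the multikey index. In the mongo shell it could be created with
//   db.mentions.createIndex(attributesIndex)
// (older shells use ensureIndex instead).
const attributesIndex = { stream: 1, attributes: 1, time: 1 };

// Build a query that requires every given attribute ID to be present.
function buildAllQuery(stream, attributeIds) {
  return { stream: stream, attributes: { $all: attributeIds } };
}

const query = buildAllQuery("mongo", [1, 2038, 58]);
console.log(JSON.stringify(query));
// {"stream":"mongo","attributes":{"$all":[1,2038,58]}}
```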
SORT BY FILTER
• $all: only the first item uses the index!
• the rest are scanned through
• ensure the first item has the smallest cardinality
• for the smallest query scanning time
{ location: france } < { gender: male } < { platform: twitter }
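The ordering rule can be sketched as a sort on estimated shares before the `$all` array is built. IDs 1 (twitter) and 2038 (male) follow the slides; the ID 700 for France and the 50% share for male are hypothetical values chosen only to make the example concrete.

```javascript
// Estimated share of mentions carrying each normalized attribute ID.
const shareById = new Map([
  [1, 0.90],    // platform: twitter (from the slides)
  [2038, 0.50], // gender: male (hypothetical share)
  [700, 0.08],  // location: france (hypothetical ID)
]);

// Put the rarest attribute first, since only the first $all item uses the index.
function orderForAll(ids) {
  return [...ids].sort(
    (a, b) => (shareById.get(a) ?? 1) - (shareById.get(b) ?? 1)
  );
}

const ordered = orderForAll([1, 2038, 700]);
console.log(ordered); // [ 700, 2038, 1 ]
```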
SECOND APPROACH
• now we only need 2 indexes!
• stream, time
• stream, attributes, time
• works for any number of filters
• is far from perfect
• but gets the job done
• with few resources
MORE IMPROVEMENTS
• don’t store all normalized attributes in index
• skip the very big ones:
• platform: twitter - 90%
• language: English - 60%
• 90% selection rate: no index needed
• decreases index size
• no noticeable performance loss
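One way to picture this pruning is a threshold on each attribute's share, as in the sketch below. The shares for twitter and English come from the slides; the other two shares and the 0.5 cut-off are hypothetical, and the slides do not say how the skipped filters are applied at query time.

```javascript
// Share of mentions carrying each normalized attribute ID.
const shareById = new Map([
  [1, 0.90],    // platform: twitter (from the slides)
  [13, 0.60],   // language: English (from the slides)
  [213, 0.05],  // location: UK (hypothetical share)
  [2039, 0.45], // gender: female (hypothetical share)
]);

// Keep only attributes rare enough to be worth an index entry.
function indexedAttributes(ids, threshold = 0.5) {
  return ids.filter((id) => (shareById.get(id) ?? 0) < threshold);
}

const kept = indexedAttributes([13, 213, 2039, 1]);
console.log(kept); // [ 213, 2039 ]
```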
USE _ID!
• use _id index instead of (stream, time)
• saves memory!
• Problem: _id must be unique
• (stream, time) index is not!
• Question: how to make (stream, time) unique?
• Answer: add some random number
USE _ID!
• pack stream, time & random into a number
• why: number look-ups are faster
• use all 64 bits available!
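A possible bit layout is sketched below using BigInt. The widths (20 bits for the stream, 32 for the time in seconds, 12 random bits) and the use of a numeric stream ID are assumptions; the slides only say that all 64 bits are used. Placing the stream in the highest bits and the time next keeps IDs ordered by (stream, time).

```javascript
// Assumed layout, highest bits first: stream | time | random
const STREAM_BITS = 20n;
const TIME_BITS = 32n;
const RANDOM_BITS = 12n;

function checkRange(value, bits, name) {
  if (value < 0n || value >= (1n << bits)) {
    throw new RangeError(`${name} does not fit in ${bits} bits`);
  }
}

// Pack the three parts into one unsigned 64-bit value.
function packId(streamId, unixSeconds, random) {
  const s = BigInt(streamId), t = BigInt(unixSeconds), r = BigInt(random);
  checkRange(s, STREAM_BITS, "streamId");
  checkRange(t, TIME_BITS, "unixSeconds");
  checkRange(r, RANDOM_BITS, "random");
  return (s << (TIME_BITS + RANDOM_BITS)) | (t << RANDOM_BITS) | r;
}

// Recover the parts from a packed ID.
function unpackId(id) {
  return {
    streamId: Number(id >> (TIME_BITS + RANDOM_BITS)),
    unixSeconds: Number((id >> RANDOM_BITS) & ((1n << TIME_BITS) - 1n)),
    random: Number(id & ((1n << RANDOM_BITS) - 1n)),
  };
}

const id = packId(7, 1325376000, Math.floor(Math.random() * 4096));
console.log(id.toString(), unpackId(id));
```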
DUPLICATES
• we need to detect duplicates
• modify bit packing
• use mention.url
• instead of random bits
• uniquely identifies a mention
• for fastest index lookup use:
• db.mentions.find({ _id: docid }).count()
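A rough sketch of the idea follows: the low bits of the ID come from a hash of the mention's URL, so the same mention always maps to the same `_id`. The hash (an FNV-1a style function over UTF-16 code units), the 16/32/16 bit layout and the in-memory `Set` standing in for the `_id` index are all assumptions. With so few hash bits, two different URLs in the same second can collide, so a real layout would need care.

```javascript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

const TIME_BITS = 32n;
const HASH_BITS = 16n;

// Deterministic ID, highest bits first: stream | time | URL hash
function mentionId(streamId, unixSeconds, url) {
  const urlBits = BigInt(fnv1a32(url)) & ((1n << HASH_BITS) - 1n);
  return (BigInt(streamId) << (TIME_BITS + HASH_BITS)) |
         (BigInt(unixSeconds) << HASH_BITS) |
         urlBits;
}

// Stand-in for the collection's _id index. Against MongoDB the check would be
//   db.mentions.find({ _id: docid }).count() > 0
const seenIds = new Set();

function insertIfNew(streamId, unixSeconds, url) {
  const id = mentionId(streamId, unixSeconds, url);
  if (seenIds.has(id)) return false; // duplicate mention
  seenIds.add(id);
  return true;
}

const url = "https://example.com/status/1"; // hypothetical mention URL
console.log(insertIfNew(7, 1325376000, url)); // true
console.log(insertIfNew(7, 1325376000, url)); // false
```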
QUESTIONS?