Search at Tumblr (NYC Search Meetup)
Yufei Pan, Director of Search, Tumblr
16 January 2013
Tumblr - Follow the World’s Creators
Founded
● David Karp
● February 2007
Publishing Platform
● 163 million blogs
● 72 billion posts
Social Network
● Follow, Mention
● Like, Reblog
About search@tumblr
● Most important way to discover great content
  ○ 50M searches a day
● Limited search for a long time (2007-2012)
  ○ Tagged page
    ■ MySQL lookup of a single tag id
    ■ sorted in reverse chronological order
  ○ Finding blogs
    ■ navigate through curated directories
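The legacy tagged page described above amounts to a single-key lookup ordered by time. A minimal sketch, assuming a hypothetical `post_tags` table and using `sqlite3` as a stand-in for MySQL (the table and column names are illustrative, not Tumblr's schema):

```python
import sqlite3

def tagged_page(conn, tag_id, limit=10):
    """Return post ids carrying one tag id, newest first."""
    rows = conn.execute(
        "SELECT post_id FROM post_tags WHERE tag_id = ? "
        "ORDER BY created_at DESC LIMIT ?",
        (tag_id, limit),
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical toy data: (tag_id, post_id, created_at)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post_tags (tag_id INTEGER, post_id INTEGER, created_at INTEGER)"
)
conn.executemany(
    "INSERT INTO post_tags VALUES (?, ?, ?)",
    [(7, 101, 1000), (7, 102, 2000), (8, 103, 1500), (7, 104, 1800)],
)
print(tagged_page(conn, 7))
```

A lookup like this supports only one tag at a time and has no notion of relevance, which is the limitation the new search features address.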
About search@tumblr
● Search Team
  ○ July 2012: Jak joined as the first search engineer!
● Features launched in 2013
  ○ Post search, Blog search, Theme search
  ○ Typeaheads, Recommendation, Trends
Jak, Yufei, Bennett, Beitao, Patrick, Adam
Whole New Search
Post search
● full text search
● top and recent
● post type filtering
Blog search
● name & title
● top tags in posts
● blog highlights
Related search
● term co-occurrence
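Related search by term co-occurrence can be illustrated as counting how often two tags appear on the same post. This is a minimal sketch with made-up data; the function names and the tie-breaking rule are assumptions, not the production method:

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(posts):
    """Count, for every pair of tags, how many posts carry both."""
    co = defaultdict(Counter)
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co

def related(co, term, k=3):
    """Top-k co-occurring terms; ties are broken alphabetically."""
    counts = co.get(term, Counter())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ordered[:k]]

# Hypothetical tag lists, one per post
posts = [
    ["cats", "gif", "funny"],
    ["cats", "gif"],
    ["cats", "art"],
    ["dogs", "gif"],
]
co = build_cooccurrence(posts)
print(related(co, "cats"))
```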
Typeahead Autocompletes
Search Autocomplete, Mention Autocomplete, Tag Suggest
● Interactive guide of tumblr content
● High volume of traffic
● Low latency
Recommendations
Personalized Recommendation, Weekly Dashboard Digest
Trends
Trending Tags, Trending Blogs
Theme Search
Search Architecture
Recent Post Index
Global Tag Index
In-Blog Tag Index
Blog Full Index
Blog Top-K Index
Personalized Blog Index
Trending Blogs
Trending Posts
Trending Tags
Related Tag Index
Typeahead Indices
Blog Top Posts
Blog Top Tags
Like Root
Search Offline Framework
Post Notecount
PostModel
Blog Model
UserModel
Activity Streams (Fire Geyser)
Scribe logs, Sqoop tables (HDFS)
Follower Counts
Blog Global Rank
Blog Feedback
Two Degree
Data
Offline
Post Search
Typeahead
Blog Recommend
Related Tags
Blog Highlights
Blog Top Tags
Search Online Framework
Blog Search
Trending Tags
Trending Blogs
Trending Posts
Online
Rediscover
Solr
MySQL
TopPost Index
Theme Index
Nginx
Linux
Software Stack
● Search Online
  ○ HAProxy, Nginx, PHP
  ○ Memcache
  ○ Icinga, Scribe, OpenTSDB
● Search Data
  ○ Solr, Redis, MySQL
● Search Offline
  ○ Sqoop, Hadoop
  ○ Java, Hive, Pig, Scalding, Python
Search Online Framework
SearchBase
QueryIF RetrieverIF SignalFetcherIF RankerIF DocFetcherIF FilterIF
SimpleQuery
PersonalizedQuery
AdvancedPostQuery
NotecountFetcher
FollowercountFetcher
SolrPostRetriever
MysqlPostRetriever
SMPostRetriever
TumblelogRetriever
TagTypeaheadRetriever
TopPostRanker
TumblelogRanker
RelatedPostRanker
PostFetcher
TumblelogFetcher
PostFilter
TumblelogFilter
Search Flow Execution
Multi-level Caching
Search Logging
Async Execution
Search Services
TimeSliceQuery
TrendTagQuery
TumblelogGlobalRankFetcher
RecommendationSignalFetcher
BlogTopTagFetcher
TumblelogMixingRanker
TagFetcher
TagFilter
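The interface names above suggest a staged flow: a query is retrieved against an index, signals are fetched for the candidates, the candidates are ranked, documents are loaded, and a filter is applied. The sketch below is one way such a composition could look. The class name `SearchFlowSketch`, the stage order, and all data are assumptions for illustration; the slides do not show the real PHP implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SearchFlowSketch:
    """Hypothetical pipeline wiring one component per interface."""
    retrieve: Callable[[str], List[int]]                       # RetrieverIF
    fetch_signals: Callable[[List[int]], Dict[int, float]]     # SignalFetcherIF
    rank: Callable[[List[int], Dict[int, float]], List[int]]   # RankerIF
    fetch_docs: Callable[[List[int]], List[dict]]              # DocFetcherIF
    keep: Callable[[dict], bool]                               # FilterIF

    def search(self, query: str, limit: int = 10) -> List[dict]:
        ids = self.retrieve(query)
        signals = self.fetch_signals(ids)
        ranked = self.rank(ids, signals)
        docs = self.fetch_docs(ranked)
        return [d for d in docs if self.keep(d)][:limit]

# Toy in-memory stand-ins for Solr, Redis and MySQL (all made up)
INDEX = {"cats": [1, 2, 3]}
NOTECOUNT = {1: 5, 2: 50, 3: 20}
DOCS = {
    1: {"id": 1, "hidden": False},
    2: {"id": 2, "hidden": True},
    3: {"id": 3, "hidden": False},
}

flow = SearchFlowSketch(
    retrieve=lambda q: INDEX.get(q, []),
    fetch_signals=lambda ids: {i: NOTECOUNT.get(i, 0) for i in ids},
    rank=lambda ids, sig: sorted(ids, key=lambda i: -sig[i]),
    fetch_docs=lambda ids: [DOCS[i] for i in ids],
    keep=lambda d: not d["hidden"],
)
print([d["id"] for d in flow.search("cats")])
```

Keeping each stage behind its own interface is what lets variants such as a Solr retriever and a MySQL retriever be swapped without touching the rest of the flow.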
Search Editorial
Search Task Base
Search Batch Processing
Hive Jobs
Streaming Jobs
Scalding Jobs
Pig Jobs
Scribe Logs, Sqoop Tables (HDFS)
Search Data (Redis)
Search Workflow Engine
Workflow Composition
Dependency Resolution
Automatic Versioning
Data Verification
Failure Detection/Alert
Execution Logging
TermGenerators
Top-K Indexer
Lucille2 Classes
DeltaPropagator
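Dependency resolution, execution logging and failure detection in a workflow engine can be sketched with a topological sort over task dependencies. The task names and the `run_workflow` helper below are hypothetical; only the standard-library `graphlib.TopologicalSorter` (Python 3.9+) is real.

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, deps):
    """Run tasks in dependency order; stop at the first failure.

    tasks: name -> zero-argument callable
    deps:  name -> set of task names that must finish first
    Returns an execution log of (task name, status) pairs.
    """
    log = []
    for name in TopologicalSorter(deps).static_order():
        try:
            tasks[name]()
            log.append((name, "ok"))
        except Exception as exc:  # failure detection: record and halt
            log.append((name, f"failed: {exc}"))
            break
    return log

# Hypothetical linear batch workflow
executed = []
tasks = {
    "import_tables": lambda: executed.append("import_tables"),
    "generate_terms": lambda: executed.append("generate_terms"),
    "build_topk_index": lambda: executed.append("build_topk_index"),
    "push_to_redis": lambda: executed.append("push_to_redis"),
}
deps = {
    "import_tables": set(),
    "generate_terms": {"import_tables"},
    "build_topk_index": {"generate_terms"},
    "push_to_redis": {"build_topk_index"},
}
log = run_workflow(tasks, deps)
print(log)
```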
Indexing
● 3-Tier indices
  ○ Index all posts
    ■ 600+ machines
  ○ Recent (6 weeks) + Popular (4 years) + Existing tag table
    ■ Down to 40 machines
    ■ Minor loss in coverage
    ■ Serves up to 4K qps (non-cached)
● Lean index
  ○ Separate signals from index
    ■ Eliminates high-volume re-indexing
    ■ Signal engineering independent of indexing
  ○ Separate document text from index
    ■ Reduces the memory footprint
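The benefit of separating signals from the index can be shown in a few lines: when a signal such as a note count changes, only the signal store is updated and the inverted index is left alone. A toy sketch with made-up in-memory stores (plain dicts standing in for the index and the signal store):

```python
# Inverted index: term -> post ids. Built once; holds no ranking signals.
index = {"cats": [1, 2]}

# Signal store: post id -> note count. Updated independently of the index.
notecounts = {1: 3, 2: 10}

def search(term):
    """Retrieve ids from the index, then order them by the current signal."""
    ids = index.get(term, [])
    return sorted(ids, key=lambda i: notecounts.get(i, 0), reverse=True)

before = search("cats")
notecounts[1] = 25          # signal update only; no re-indexing needed
after = search("cats")
print(before, after)
```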
Ranking
● Quickly evolving!
● Major ranking signals in production
  ○ Global popularity
    ■ likes, reblogs, follows
  ○ Local popularity
    ■ popularity projected on <user, query>
      ● blog search: aggregated likes on query term
      ● blog recommendation: follow counts among friends
  ○ Textual relevancy
    ■ how: exact match, query proximity
    ■ where: name, title, tag, mention, body, etc.
  ○ Recency
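One simple way to combine the four signal families is a weighted sum with a time-decayed recency term. The weights, the log damping, the half-life and the field names below are all assumptions chosen for illustration; the slides do not state how the production ranker combines its signals.

```python
import math

# Hypothetical weights per signal family
WEIGHTS = {"global": 1.0, "local": 2.0, "text": 3.0, "recency": 1.0}

def score(doc, now, half_life_days=7.0):
    """Weighted sum of global popularity, local popularity, text match, recency."""
    global_pop = math.log1p(doc["likes"] + doc["reblogs"])
    local_pop = math.log1p(doc["friend_likes"])
    age_days = (now - doc["ts"]) / 86400.0
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    return (
        WEIGHTS["global"] * global_pop
        + WEIGHTS["local"] * local_pop
        + WEIGHTS["text"] * doc["text_match"]
        + WEIGHTS["recency"] * recency
    )

now = 1_700_000_000
docs = {
    # Globally popular but a month old, no activity among the user's friends
    "a": {"likes": 100, "reblogs": 0, "friend_likes": 0,
          "text_match": 1.0, "ts": now - 30 * 86400},
    # Less popular overall, but fresh and liked by the user's friends
    "b": {"likes": 10, "reblogs": 0, "friend_likes": 5,
          "text_match": 1.0, "ts": now},
}
ranked = sorted(docs, key=lambda k: score(docs[k], now), reverse=True)
print(ranked)
```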
Duplicate Elimination (DE)
● Index-time DE
  ○ post signature
    ■ number of tags > N1
    ■ md5 hash of normalized tag list
● Search-time DE
  ○ Media DE
    ■ posts with same media hashes
  ○ Near DE
    ■ posts with tags > N2
    ■ mark as near duplicate if diff <= N3 tags
    ■ older posts selected as original
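The two tag-based rules above can be sketched as follows. The threshold values, the normalization (trim, lowercase, sort) and the reading of "diff" as the size of the symmetric difference between tag sets are assumptions; the slides give only the outline.

```python
import hashlib

N1, N2, N3 = 3, 3, 1  # hypothetical thresholds

def normalize(tags):
    return {t.strip().lower() for t in tags}

def post_signature(tags, n1=N1):
    """md5 of the normalized tag list, only for posts with more than n1 tags."""
    norm = sorted(normalize(tags))
    if len(norm) <= n1:
        return None
    return hashlib.md5(",".join(norm).encode("utf-8")).hexdigest()

def near_duplicates(posts, n2=N2, n3=N3):
    """Map each near-duplicate post id to the id of its older original."""
    originals = []  # (post id, tag set), oldest first
    dups = {}
    for post in sorted(posts, key=lambda p: p["ts"]):
        tags = normalize(post["tags"])
        if len(tags) <= n2:
            continue  # too few tags to compare reliably
        for orig_id, orig_tags in originals:
            if len(tags ^ orig_tags) <= n3:
                dups[post["id"]] = orig_id
                break
        else:
            originals.append((post["id"], tags))
    return dups

posts = [
    {"id": 2, "ts": 200, "tags": ["a", "b", "c", "d", "e"]},
    {"id": 1, "ts": 100, "tags": ["a", "b", "c", "d"]},
    {"id": 3, "ts": 300, "tags": ["w", "x", "y", "z"]},
]
print(near_duplicates(posts))
```

Exact signatures are cheap enough to compute at index time, while the pairwise comparison is only practical over the small candidate set available at search time.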
Search Platform
● A curvy road
  ○ Started with ElasticSearch
  ○ Switched to SolrCloud due to reliability
  ○ Ended up with Solr + Customized Clustering
● Our takes
  ○ ElasticSearch and SolrCloud have great functionality
    ■ distributed indexing and search
    ■ easy cluster management
  ○ Solr still seems much more reliable under high indexing load and search traffic.
Offline Precomputation
● Benefits
  ○ Minimize online search latency
  ○ More sophisticated/expensive computation
● Limitations
  ○ Loss of freshness
  ○ Expensive for long-tail queries and results
● Precomputed
  ○ Typeaheads
  ○ Related search
  ○ Blog recommendation
  ○ Top posts of Blog / User
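Typeahead is a natural fit for precomputation: an offline job can build a prefix-to-top-K table so that the online path is a single key lookup. A minimal sketch, assuming made-up query counts and a hypothetical prefix length cap (in production the table would live in a store such as Redis rather than a dict):

```python
from collections import defaultdict

def build_typeahead(term_counts, k=3, max_prefix=5):
    """Offline step: map each prefix to its k most frequent completions."""
    buckets = defaultdict(list)
    for term, count in term_counts.items():
        for n in range(1, min(len(term), max_prefix) + 1):
            buckets[term[:n]].append((count, term))
    table = {}
    for prefix, items in buckets.items():
        # Highest count first; ties broken alphabetically for determinism
        items.sort(key=lambda ct: (-ct[0], ct[1]))
        table[prefix] = [term for _, term in items[:k]]
    return table

def suggest(table, prefix):
    """Online step: one dictionary lookup, no computation."""
    return table.get(prefix, [])

# Hypothetical search counts
table = build_typeahead({"cats": 90, "cat gifs": 50, "cars": 70, "dogs": 10})
print(suggest(table, "ca"))
```

The trade-off matches the limitations listed above: the table is only as fresh as the last batch run, and covering long-tail prefixes grows the table quickly.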
What’s Next
● In-blog search
  ○ full text search on all posts in a blog
  ○ original posts, reblogs, likes
● Ranking
  ○ more effective and spam-resilient signals
  ○ learning to rank
● Topical interest modeling
  ○ supervised and unsupervised
  ○ blog content and user activities
  ○ interest-based blog recommendation
● Content discovery
  ○ trending content in various categories
Question: Are you hiring?
Answer: Yeah! Check it out at http://www.tumblr.com/jobs
More questions please, :-)
Q & A