Search at Tumblr (NYC Search Meetup)

Yufei Pan, Director of Search, Tumblr

16 January 2013

Tumblr - Follow the World’s Creators

Founded
● David Karp
● February 2007

Publishing Platform
● 163 million blogs
● 72 billion posts

Social Network
● Follow, Mention
● Like, Reblog

About search@tumblr

● Most important way to discover great content
  ○ 50M searches a day
● Limited search for a long time (2007-2012)
  ○ Tagged page
    ■ MySQL lookup of a single tag id
    ■ sorted in reverse chronological order
  ○ Finding blogs
    ■ navigate through curated directories

About search@tumblr

● Search Team
  ○ July 2012: Jak joined as first search engineer!
● Features launched in 2013
  ○ Post search, Blog search, Theme search
  ○ Typeaheads, Recommendation, Trends

Jak, Yufei, Bennett, Beitao, Patrick, Adam

Whole New Search

Post search
● full text search
● top and recent
● post type filtering

Blog search
● name & title
● top tags in posts
● blog highlights

Related search
● term co-occurrence
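To make "term co-occurrence" concrete, here is a minimal sketch of how related searches could be derived from tags that appear together on the same post. The slides do not describe the actual computation, so the function names, the pair counting, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(posts_tags):
    """Count how often each pair of tags appears on the same post."""
    cooc = defaultdict(Counter)
    for tags in posts_tags:
        # Normalize and de-duplicate tags within a single post.
        unique = sorted({t.strip().lower() for t in tags if t.strip()})
        for a, b in combinations(unique, 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def related_terms(cooc, term, k=3):
    """Return up to k terms that co-occur most often with `term`."""
    counts = cooc.get(term.lower(), Counter())
    # Highest count first; alphabetical order breaks ties (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


# Hypothetical sample data.
posts = [
    ["cats", "cute", "gif"],
    ["cats", "cute"],
    ["cats", "gif"],
    ["dogs", "cute"],
]
cooc = build_cooccurrence(posts)
print(related_terms(cooc, "cats"))
```

Raw counts favor very common tags; a production system would likely normalize by tag frequency, but the slides do not say which measure is used.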

Typeahead Autocompletes

Search Autocomplete, Mention Autocomplete, Tag Suggest

● Interactive guide to Tumblr content
● High volume of traffic
● Low latency

Recommendations

Personalized Recommendation, Weekly Dashboard Digest

Trends

Trending Tags Trending Blogs

Theme Search

Search Architecture

Recent Post Index

Global Tag Index

In-Blog Tag Index

Blog Full Index

Blog Top-K Index

Personalized Blog Index

Trending Blogs

Trending Posts

Trending Tags

Related Tag Index

Typeahead Indices

Blog Top Posts

Blog Top Tags Like Root

Search Offline Framework

Post Notecount

PostModel

Blog Model

UserModel

Activity Streams (Fire Geyser) Scribe logs, Sqoop tables (HDFS)

Follower Counts

Blog Global Rank

Blog Feedback

Two Degree

Data

Offline

Post Search Typeahead Blog

Recommend

Related Tags

Blog Highlights

Blog Top Tags

Search Online Framework

Blog Search

Trending Tags

Trending Blogs

Trending Posts

Online

Rediscover

Solr

MySQL

TopPost Index

Theme Index

Nginx

Linux

Software Stack

● Search Online
  ○ HAProxy, Nginx, PHP
  ○ Memcache
  ○ Icinga, Scribe, OpenTSDB
● Search Data
  ○ Solr, Redis, MySQL
● Search Offline
  ○ Sqoop, Hadoop
  ○ Java, Hive, Pig, Scalding, Python

Search Online Framework

SearchBase

QueryIF RetrieverIF SignalFetcherIF RankerIF DocFetcherIF FilterIF

SimpleQuery

PersonalizedQuery

AdvancedPostQuery

NotecountFetcher

FollowercountFetcher

SolrPostRetriever

MysqlPostRetriever

SMPostRetriever

TumblelogRetriever

TagTypeaheadRetriever

TopPostRanker

TumblelogRanker

RelatedPostRanker

PostFetcher

TumblelogFetcher

PostFilter

TumblelogFilter

Search Flow Execution

Multi-level Caching

Search Logging

Async Execution

Search Services

TimeSliceQuery

TrendTagQuery

TumblelogGlobalRankFetcher

RecommendationSignalFetcher

BlogTopTagFetcher

TumblelogMixingRanker

TagFetcher TagFilter

Search Editorial

Search Task Base
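The interface names on this slide (RetrieverIF, SignalFetcherIF, RankerIF, DocFetcherIF, FilterIF) suggest a pluggable pipeline. The sketch below shows one way such stages could be chained. It is written in Python for illustration, although the slides list PHP for the online stack. The stage order, the method names, the in-memory data, and the `nsfw` field are all assumptions, not Tumblr's actual design.

```python
from typing import Dict, List, Protocol


class RetrieverIF(Protocol):
    def retrieve(self, query: str) -> List[int]: ...


class SignalFetcherIF(Protocol):
    def fetch(self, doc_ids: List[int]) -> Dict[int, float]: ...


class RankerIF(Protocol):
    def rank(self, doc_ids: List[int], signals: Dict[int, float]) -> List[int]: ...


class DocFetcherIF(Protocol):
    def fetch_docs(self, doc_ids: List[int]) -> List[dict]: ...


class FilterIF(Protocol):
    def keep(self, doc: dict) -> bool: ...


def run_search(query, retriever, signal_fetcher, ranker, doc_fetcher, doc_filter, limit=10):
    """Assumed flow: retrieve ids -> fetch signals -> rank -> fetch docs -> filter."""
    doc_ids = retriever.retrieve(query)
    signals = signal_fetcher.fetch(doc_ids)
    ranked = ranker.rank(doc_ids, signals)
    docs = doc_fetcher.fetch_docs(ranked)
    return [d for d in docs if doc_filter.keep(d)][:limit]


# Hypothetical in-memory stand-ins for Solr, Redis, and MySQL.
POSTS = {
    1: {"id": 1, "text": "cat gif", "notes": 120, "nsfw": False},
    2: {"id": 2, "text": "cat photo", "notes": 900, "nsfw": False},
    3: {"id": 3, "text": "cat video", "notes": 500, "nsfw": True},
    4: {"id": 4, "text": "dog photo", "notes": 50, "nsfw": False},
}


class TermRetriever:
    def retrieve(self, query):
        return [pid for pid, p in POSTS.items() if query in p["text"].split()]


class NotecountSignals:
    def fetch(self, doc_ids):
        return {pid: float(POSTS[pid]["notes"]) for pid in doc_ids}


class SignalRanker:
    def rank(self, doc_ids, signals):
        return sorted(doc_ids, key=lambda pid: signals.get(pid, 0.0), reverse=True)


class StoreFetcher:
    def fetch_docs(self, doc_ids):
        return [POSTS[pid] for pid in doc_ids]


class SafeFilter:
    def keep(self, doc):
        return not doc["nsfw"]


results = run_search(
    "cat", TermRetriever(), NotecountSignals(), SignalRanker(), StoreFetcher(), SafeFilter()
)
print([d["id"] for d in results])  # [2, 1]
```

Because each stage only depends on its interface, a different retriever or ranker can be swapped in per search service without touching the flow itself.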

Search Batch Processing

Hive Jobs Streaming Jobs

Scalding Jobs

Pig Jobs

Scribe Logs, Sqoop Tables (HDFS)

Search Data (Redis)

Search Workflow Engine

Workflow Composition

Dependency Resolution

Automatic Versioning

Data Verification

Failure Detection/Alert

Execution Logging

TermGenerators

Top-K Indexer

Lucille2 Classes

DeltaPropagator
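The "Dependency Resolution" box can be illustrated with a topological sort over job dependencies. This is only a sketch using Python's standard `graphlib` module (Python 3.9+). The job names below are hypothetical and the slides do not say how the workflow engine actually resolves dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical batch jobs: each key maps to the set of jobs it depends on.
deps = {
    "sqoop_import": set(),
    "notecount_job": {"sqoop_import"},
    "related_tags_job": {"sqoop_import"},
    "load_redis": {"notecount_job", "related_tags_job"},
}


def execution_order(dependencies):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(deps)
print(order)
```

The two middle jobs have no dependency on each other, so their relative order may vary; only the constraint that dependencies run first is guaranteed.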

Indexing

● 3-Tier indices
  ○ Index all posts
    ■ 600+ machines
  ○ Recent (6W) + Popular (4Y) + Existing tag table
    ■ Down to 40 machines
    ■ Minor loss in coverage
    ■ Serve up to 4K qps (non-cached)
● Lean index
  ○ Separate signals from index
    ■ Eliminate high volume re-indexing
    ■ Independent signal engineering from indexing
  ○ Separate document text from index
    ■ Dropping the memory footprint
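The "lean index" idea can be sketched as an inverted index that stores only term-to-post-id postings, with fast-changing signals kept in a separate store. The class names, the `index_writes` counter, and the ranking by note count are assumptions for illustration. The point is that a signal update changes the ranking without any index write.

```python
from collections import defaultdict


class LeanIndex:
    """Inverted index holding only term -> post ids (no signals, no stored text)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.index_writes = 0  # counts (re-)indexing operations

    def add(self, post_id, text):
        for term in set(text.lower().split()):
            self.postings[term].add(post_id)
        self.index_writes += 1

    def lookup(self, term):
        return set(self.postings.get(term.lower(), ()))


class SignalStore:
    """Separate key-value store for signals such as note counts."""

    def __init__(self):
        self.notes = defaultdict(int)

    def incr(self, post_id, by=1):
        self.notes[post_id] += by

    def get(self, post_id):
        return self.notes.get(post_id, 0)


def search(index, signals, term):
    """Retrieve from the index, then order by signals fetched at query time."""
    return sorted(index.lookup(term), key=lambda pid: (-signals.get(pid), pid))


index, signals = LeanIndex(), SignalStore()
index.add(1, "sunset photo")
index.add(2, "city photo")
signals.incr(1, 5)
before = search(index, signals, "photo")  # post 1 leads
signals.incr(2, 20)                       # a burst of likes and reblogs on post 2
after = search(index, signals, "photo")   # post 2 leads, index untouched
print(before, after, index.index_writes)
```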

Ranking

● Quickly evolving!
● Major ranking signals in production
  ○ Global popularity
    ■ likes, reblogs, follows
  ○ Local popularity
    ■ popularity projected on <user, query>
      ● blog search: aggregated likes on query term
      ● blog recommendation: follow counts among friends
  ○ Textual relevancy
    ■ how: exact match, query proximity
    ■ where: name, title, tag, mention, body, etc.
  ○ Recency
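One common way to combine signals like these is a weighted sum. The sketch below is an assumption throughout: the slides name the signal families but give no formula, so the weights, the log damping, and the half-life decay are all hypothetical.

```python
import math

# Hypothetical weights; the slides do not publish any.
WEIGHTS = {"global": 1.0, "local": 2.0, "text": 3.0, "recency": 1.0}


def score(global_notes, local_likes, text_match, age_days, half_life_days=7.0):
    """Combine the four signal families into a single ranking score.

    global_notes: likes + reblogs across all users (global popularity)
    local_likes:  likes relevant to this <user, query> pair (local popularity)
    text_match:   textual relevancy in [0, 1], e.g. 1.0 for an exact match
    age_days:     age of the post, used for recency decay
    """
    global_pop = math.log1p(global_notes)          # damp very large counts
    local_pop = math.log1p(local_likes)
    recency = 0.5 ** (age_days / half_life_days)   # halves every half_life_days
    return (
        WEIGHTS["global"] * global_pop
        + WEIGHTS["local"] * local_pop
        + WEIGHTS["text"] * text_match
        + WEIGHTS["recency"] * recency
    )


fresh = score(global_notes=100, local_likes=0, text_match=1.0, age_days=0)
stale = score(global_notes=100, local_likes=0, text_match=1.0, age_days=30)
personal = score(global_notes=100, local_likes=10, text_match=1.0, age_days=30)
print(fresh > stale, personal > stale)
```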

Duplicate Elimination (DE)

● Index-time DE
  ○ post signature
    ■ number of tags > N1
    ■ md5 hash of normalized tag list
● Search-time DE
  ○ Media DE
    ■ posts with same media hashes
  ○ Near DE
    ■ posts with tags > N2
    ■ mark as near duplicate if diff <= N3 tags
    ■ older posts selected as original
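The tag-based rules above can be sketched as follows. The threshold values, the normalization (lowercase, trim, de-duplicate, sort), and the reading of "diff" as the symmetric difference of the two tag sets are assumptions; the slides only name N1, N2, and N3 without giving values.

```python
import hashlib

# Hypothetical thresholds; the slides do not give values for N1, N2, N3.
N1 = 3
N2 = 3
N3 = 1


def normalize_tags(tags):
    """Lowercase, trim, de-duplicate, and sort tags (assumed normalization)."""
    return sorted({t.strip().lower() for t in tags if t.strip()})


def post_signature(tags):
    """Index-time DE: md5 of the normalized tag list, only when tags > N1."""
    norm = normalize_tags(tags)
    if len(norm) <= N1:
        return None  # too few tags for a meaningful signature
    return hashlib.md5(",".join(norm).encode("utf-8")).hexdigest()


def is_near_duplicate(tags_a, tags_b):
    """Search-time near DE: both posts have > N2 tags and differ by <= N3 tags."""
    a, b = set(normalize_tags(tags_a)), set(normalize_tags(tags_b))
    if len(a) <= N2 or len(b) <= N2:
        return False
    return len(a ^ b) <= N3


def dedupe(posts):
    """Keep the oldest post of each near-duplicate group as the original."""
    kept = []
    for post in sorted(posts, key=lambda p: p["created"]):
        if not any(is_near_duplicate(post["tags"], k["tags"]) for k in kept):
            kept.append(post)
    return kept


# Hypothetical posts; 'created' is any sortable timestamp.
posts = [
    {"id": 1, "created": 10, "tags": ["a", "b", "c", "d"]},
    {"id": 2, "created": 20, "tags": ["a", "b", "c", "d", "e"]},
    {"id": 3, "created": 5, "tags": ["x", "y", "z", "w"]},
]
print([p["id"] for p in dedupe(posts)])  # [3, 1]
```

Post 2 differs from the older post 1 by a single tag, so it is dropped as a near duplicate, while post 3 shares no tags and is kept.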

Search Platform

● A curvy road
  ○ Started with ElasticSearch
  ○ Switched to SolrCloud due to reliability
  ○ Ended up with Solr + Customized Clustering
● Our takes
  ○ ElasticSearch and SolrCloud have great functionality
    ■ distributed indexing and search
    ■ easy cluster management
  ○ Solr still seems much more reliable under high indexing load and search traffic.

Offline Precomputation

● Benefits
  ○ Minimize online search latency
  ○ More sophisticated/expensive computation
● Limitations
  ○ Loss of freshness
  ○ Expensive for long-tail queries and results
● Precomputed
  ○ Typeaheads
  ○ Related search
  ○ Blog recommendation
  ○ Top posts of Blog / User
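As an illustration of the trade-off, the sketch below precomputes a prefix-to-suggestions table offline so that the online path is a single key lookup. A plain dict stands in for the data store (the slides list Redis for search data). The function names, `k`, `max_prefix`, and the sample counts are hypothetical.

```python
from collections import defaultdict


def precompute_typeahead(term_counts, k=3, max_prefix=10):
    """Offline step: for every prefix, store the k most popular terms."""
    buckets = defaultdict(list)
    for term, count in term_counts.items():
        for i in range(1, min(len(term), max_prefix) + 1):
            buckets[term[:i]].append((count, term))
    table = {}
    for prefix, candidates in buckets.items():
        # Most popular first; alphabetical order breaks ties.
        candidates.sort(key=lambda ct: (-ct[0], ct[1]))
        table[prefix] = [term for _, term in candidates[:k]]
    return table


def suggest(table, typed):
    """Online step: one lookup, no ranking work at request time."""
    return table.get(typed.lower(), [])


# Hypothetical search counts.
counts = {"cat": 900, "cats": 700, "cartoon": 400, "car": 300, "dog": 800}
table = precompute_typeahead(counts, k=3)
print(suggest(table, "ca"))   # ['cat', 'cats', 'cartoon']
print(suggest(table, "car"))  # ['cartoon', 'car']
```

The table reflects the counts at build time, which is the loss of freshness noted above, and it grows with every distinct prefix, which is why long-tail queries are expensive to cover.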

What’s Next

● In-blog search
  ○ full text search on all posts in a blog
  ○ original posts, reblogs, likes
● Ranking
  ○ more effective and spam-resilient signals
  ○ learning to rank
● Topical interest modeling
  ○ supervised and unsupervised
  ○ blog content and user activities
  ○ interest based blog recommendation
● Content discovery
  ○ trending content in various categories

Question: Are you hiring?

Answer: Yeah! Check it out at http://www.tumblr.com/jobs

More questions please, :-)

Q & A

http://www.tumblr.com/jobs
