Practical Search with Solr: Beyond Just Looking It Up
TRANSCRIPT
Lucid Imagination, Inc. – http://www.lucidimagination.com
Big, Bigger, Biggest
Large scale issues: Phrase queries and common words, OCR
Tom Burton-West
Hathi Trust Project
Hathi Trust Large Scale Search Challenges
Goal: Design a system for full-text search that will scale to 5 million to 20 million volumes (at a reasonable cost).
Challenges:
Must scale to 20 million full-text volumes
Very long documents compared to most large-scale search applications
Multilingual collection
OCR quality varies
Index Size, Caching, and Memory
Our documents average about 300 pages, which is about 700 KB of OCR.
Our 5 million document index is between 2 and 3 terabytes. About 300 GB per million documents.
Large index means disk I/O is bottleneck
Tradeoff JVM vs OS memory
Solr uses OS memory (disk I/O caching) for caching of postings
Memory available for disk I/O caching has most impact on response time (assuming adequate cache warming)
Fitting entire index in memory not feasible with terabyte size index
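As a rough illustration of the scale involved, the sketch below multiplies the average OCR size quoted above (about 700 KB per volume) by the collection sizes in the project goal. It estimates raw OCR text only, not index size; the function name and the use of decimal units are assumptions made for this example.

```python
# Back-of-the-envelope estimate of raw OCR text volume (not index size).
KB_PER_VOLUME = 700          # average OCR size per volume, from the slide
KB_PER_TB = 1_000_000_000    # decimal units: 1 TB = 10^9 KB (assumption)


def raw_ocr_tb(million_volumes, kb_per_volume=KB_PER_VOLUME):
    """Return the estimated raw OCR text size in terabytes."""
    return million_volumes * 1_000_000 * kb_per_volume / KB_PER_TB


for volumes in (5, 20):
    print(f"{volumes} million volumes -> about {raw_ocr_tb(volumes):.1f} TB of OCR text")
```

Even before indexing, the text alone is far larger than the memory of a typical server, which is why the amount of memory left for OS disk caching matters so much.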
Response time varies with query
Average: 673
Median: 91
90th percentile: 328
99th percentile: 7,504
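Summary statistics like these can be computed from a query log along the following lines. This is a minimal sketch: the sample response times are invented for illustration, and the nearest-rank percentile definition is an assumption (the slide does not say which definition was used).

```python
import statistics


def summarize(times):
    """Return average, median, 90th and 99th percentile of response times."""
    ordered = sorted(times)

    def percentile(p):
        # Nearest-rank percentile, computed with integer ceiling division.
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {
        "average": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": percentile(90),
        "p99": percentile(99),
    }


# Hypothetical log: most queries are fast, a few are very slow.
sample_times = [40] * 60 + [200] * 30 + [2000] * 9 + [60000]
stats = summarize(sample_times)
print(stats)
```

With a long tail of slow queries the average lands far above the median, which is the same pattern the figures above show.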
Slowest 5% of queries
[Chart: Response Time 95th percentile (seconds). X axis: Query number, 940 to 1,000. Y axis: Response Time (seconds), ticks at 0, 1, 10, 100, 1,000.]
The slowest 5% of queries took about 1 second or longer.
The slowest 1% of queries took between 10 seconds and 2 minutes.
The slowest 0.5% of queries took between 30 seconds and 2 minutes.
These queries affect response time of other queries
Cache pollution
Contention for resources
Slowest queries are phrase queries containing common words
Query processing
Phrase queries use position index (Boolean queries do not).
Position index accounts for 85% of index size
Position list for common words such as “the” can be many GB in size
This causes lots of disk I/O.
Solr depends on the operating system's disk cache to reduce disk I/O requirements for words that occur in more than one query
I/O from Phrase queries containing common words pollutes the cache
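The difference between the two query types can be sketched with a toy positional index. This is not Lucene's data structure or API, only an illustration under simplified assumptions: a Boolean AND needs just the document ids for each term, while a phrase query must also read each term's position list.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Docs containing every term; only document ids are consulted."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def phrase(index, terms):
    """Docs containing the terms consecutively; position lists must be read."""
    hits = set()
    for doc_id in boolean_and(index, terms):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {1: "the man in the moon", 2: "the moon and a man"}
index = build_index(docs)
print(boolean_and(index, ["the", "man"]))  # both documents contain both words
print(phrase(index, ["the", "man"]))       # only document 1 has them adjacent
```

For a word like “the”, the position list has one entry per occurrence in the whole corpus, so the phrase path touches far more data than the Boolean path.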
Slow Queries
Slowest test query: “the lives and literature of the beat generation” took 2 minutes.
4 MB of data read for Boolean query. 9,000+ MB read for phrase query.
WORD         NUMBER OF    POSTINGS LIST   TOTAL TERM OCCURRENCES   POSITION LIST
             DOCUMENTS    (SIZE MB)       (MILLIONS)               (SIZE MB)
the          800,000      0.8             4,351                    4,351
of           892,000      0.89            2,795                    2,795
and          769,000      0.77            1,870                    1,870
literature   435,000      0.44            9                        9
generation   414,000      0.41            5                        5
lives        432,000      0.43            5                        5
beat         278,000      0.28            1                        1
TOTAL                     4.02                                     9,036
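The totals in the table can be checked with a few lines of arithmetic. The per-term sizes below are copied from the table; the variable names are only for this example.

```python
# (postings list MB, position list MB) for each query term, from the table.
sizes_mb = {
    "the": (0.80, 4351),
    "of": (0.89, 2795),
    "and": (0.77, 1870),
    "literature": (0.44, 9),
    "generation": (0.41, 5),
    "lives": (0.43, 5),
    "beat": (0.28, 1),
}

postings_total_mb = round(sum(p for p, _ in sizes_mb.values()), 2)
positions_total_mb = sum(pos for _, pos in sizes_mb.values())

print(f"Postings lists (Boolean query): {postings_total_mb} MB")
print(f"Position lists (phrase query):  {positions_total_mb:,} MB")
```

Three common words (“the”, “of”, “and”) account for nearly all of the position data read.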
Why not use Stop Words?
The word “the” occurs more than 4 billion times in our 1 million document index.
Removing “stop” words (“the”, “of” etc.) not desirable for our use cases.
Couldn’t search for many phrases
“to be or not to be”
“the who”
“man in the moon” vs. “man on the moon”
Stop words in one language are content words in another language
German stop words “war” and “die” are content words in English
English stop words “is” and “by” are content words (“ice” and “village”) in Swedish
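A small sketch shows why stop word removal breaks these searches. The stop list here is a hypothetical English list chosen for the example, not the one any particular Solr configuration ships with.

```python
# Hypothetical English stop list, for illustration only.
STOP_WORDS = {"the", "of", "and", "to", "be", "or", "not", "in", "on", "a"}


def remove_stop_words(phrase):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]


print(remove_stop_words("to be or not to be"))  # nothing left to search for
print(remove_stop_words("the who"))
print(remove_stop_words("man in the moon"))
print(remove_stop_words("man on the moon"))     # same tokens as the line above
```

After filtering, “to be or not to be” has no terms at all, and the two “moon” phrases become indistinguishable.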
“CommonGrams”
Ported Nutch “CommonGrams” algorithm to Solr
Create Bi-Grams selectively for any two word sequence containing common terms
Slowest query: “The lives and literature of the beat generation”
Query tokens with CommonGrams: “the-lives” “lives-and” “and-literature” “literature-of” “of-the” “the-beat” “generation”
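The query-side tokenization shown above can be sketched as follows. This is a simplified illustration, not the Nutch or Solr source: the common word list, the “-” separator (chosen to match the slide), and the function name are assumptions. A bigram is formed whenever either word of an adjacent pair is common, and a single word is kept only when it is not already covered by a bigram.

```python
def common_grams_query(tokens, common_words, sep="-"):
    """Return query tokens, replacing words next to common words with bigrams."""
    out = []
    prev_bigram = False  # was a bigram formed from (previous token, this token)?
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        next_bigram = nxt is not None and (tok in common_words or nxt in common_words)
        if next_bigram:
            out.append(f"{tok}{sep}{nxt}")
        elif not prev_bigram:
            # Not part of any bigram, so the plain word is kept.
            out.append(tok)
        prev_bigram = next_bigram
    return out


COMMON_WORDS = {"the", "of", "and"}  # hypothetical list for this example
query = "the lives and literature of the beat generation".split()
grams = common_grams_query(query, COMMON_WORDS)
print(grams)
```

At index time the same bigrams would need to be written alongside the ordinary single-word tokens so that non-phrase queries still work; that side is not shown here.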
Standard index vs. CommonGrams
Standard Index
WORD             TOTAL OCCURRENCES IN CORPUS (MILLIONS)   NUMBER OF DOCS (THOUSANDS)
the              2,013                                    386
of               1,299                                    440
and              855                                      376
literature       4                                        210
lives            2                                        194
generation       2                                        199
beat             0.6                                      130
TOTAL            4,176

Common Grams
TERM             TOTAL OCCURRENCES IN CORPUS (MILLIONS)   NUMBER OF DOCS (THOUSANDS)
of-the           446                                      396
generation       2.42                                     262
the-lives        0.36                                     128
literature-of    0.35                                     103
lives-and        0.25                                     115
and-literature   0.24                                     77
the-beat         0.06                                     26
TOTAL            450
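To make the comparison concrete, the occurrence counts from the two tables can be summed and compared. The numbers are copied from the tables; only the variable names are introduced here.

```python
# Total occurrences in corpus (millions), copied from the two tables.
standard_index = {"the": 2013, "of": 1299, "and": 855, "literature": 4,
                  "lives": 2, "generation": 2, "beat": 0.6}
common_grams = {"of-the": 446, "generation": 2.42, "the-lives": 0.36,
                "literature-of": 0.35, "lives-and": 0.25,
                "and-literature": 0.24, "the-beat": 0.06}

standard_total = round(sum(standard_index.values()))
grams_total = round(sum(common_grams.values()))
reduction = standard_total / grams_total

print(f"Standard index: {standard_total:,} million occurrences")
print(f"CommonGrams:    {grams_total:,} million occurrences")
print(f"Reduction:      about {reduction:.1f}x fewer positions to read")
```

Almost all of the remaining cost comes from the single bigram “of-the”.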
Comparison of Response time (ms)
                 AVERAGE   MEDIAN   90th   99th    SLOWEST QUERY
Standard Index   459       32       146    6,784   120,595
Common Grams     68        3        71     2,226   7,800
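The improvement per column can be expressed as a speedup factor. The response times below are taken from the table; the dictionary layout is just one convenient way to hold them.

```python
# Response times in ms, copied from the comparison table.
standard = {"average": 459, "median": 32, "90th": 146, "99th": 6784, "slowest": 120595}
common_grams = {"average": 68, "median": 3, "90th": 71, "99th": 2226, "slowest": 7800}

speedup = {name: standard[name] / common_grams[name] for name in standard}

for name, factor in speedup.items():
    print(f"{name:>8}: {factor:.1f}x faster")
```

The largest gain is on the slowest query, which is the case CommonGrams was introduced to address.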
Other issues
Analyze your slowest queries
We analyzed the slowest queries from our query logs and discovered additional “common words” to be added to our list.
We used Solr Admin panel to run our slowest queries from our logs with the “debug” flag checked.
We discovered that words such as “l’art” were being split into two-token phrase queries.
We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
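The same kind of check can be scripted. The sketch below builds a request URL with Solr's `debugQuery=true` parameter, which asks Solr to include the parsed query in its response, and uses a deliberately naive tokenizer to show how an analyzer that splits on punctuation turns “l'art” into two tokens. The base URL is a placeholder, and the tokenizer is a stand-in written for this example, not the analyzer used in the project.

```python
import re
from urllib.parse import urlencode


def debug_query_url(base_url, query):
    """Build a select URL that asks Solr to return query debugging output."""
    params = {"q": query, "debugQuery": "true"}
    return f"{base_url}/select?{urlencode(params)}"


def naive_tokenize(text):
    """Stand-in analyzer: lowercase and split on any non-word character."""
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]


# Placeholder host and path; adjust to the actual Solr instance.
url = debug_query_url("http://localhost:8983/solr", "l'art")
print(url)
print(naive_tokenize("l'art"))  # two tokens, so the query becomes a phrase query
```

Because “l” is extremely common in French text, the resulting two-token phrase query behaves like the slow common-word phrase queries described earlier.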
Other issues
We broke Solr … temporarily
Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms
Solr/Lucene index size was limited to 2.1 Billion unique terms
Patched: Now it’s 274 Billion
Dirty OCR is difficult to remove without removing “good” words.
Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact of so many unique terms is minimal compared to disk I/O demands, but we will be testing soon.
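One plausible reading of the two limits, offered as an assumption rather than something stated on the slide: 2.1 billion matches the largest signed 32-bit integer, and 274 billion matches that range multiplied by a term index interval of 128. The arithmetic is easy to check.

```python
# Assumption: the old limit is the maximum signed 32-bit integer.
OLD_LIMIT = 2**31 - 1

# Assumption: the new limit is 2^31 index entries times an interval of 128.
TERM_INDEX_INTERVAL = 128
NEW_LIMIT = 2**31 * TERM_INDEX_INTERVAL

UNIQUE_TERMS = 2_400_000_000  # "over 2.4 billion unique terms" from the slide

print(f"Old limit: {OLD_LIMIT:,}")
print(f"New limit: {NEW_LIMIT:,}")
print("Exceeds old limit:", UNIQUE_TERMS > OLD_LIMIT)
print("Fits new limit:   ", UNIQUE_TERMS < NEW_LIMIT)
```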