Data-Intensive Distributed Computing
Part 3: Analyzing Text (2/2)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

CS 451/651 (Winter 2019)
Adam Roegiest, Kira Systems
January 31, 2019

These slides are available at http://roegiest.com/bigdata-2019w/
Source: http://www.flickr.com/photos/guvnah/7861418602/
Search!
Abstract IR Architecture

online:  Query → Representation Function → Query Representation → Comparison Function → Hits
offline: Documents → Representation Function → Document Representation → Index → Comparison Function
Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Term-document matrix (rows are terms, columns are Docs 1-4; 1 marks an occurrence):

        1  2  3  4
blue       1
cat           1
egg              1
fish    1  1
green            1
ham              1
hat           1
one     1
red        1
two     1

What goes in each cell? boolean / count / positions
(the same toy term-document matrix for Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat", Doc 4 "green eggs and ham")

Indexing: building this structure
Retrieval: manipulating this structure
From the term-document matrix to an inverted index: each term now points at the list of documents containing it.

blue  → 2
cat   → 3
egg   → 4
fish  → 1, 2
green → 4
ham   → 4
hat   → 3
one   → 1
red   → 2
two   → 1

(Doc 1: one fish, two fish; Doc 2: red fish, blue fish; Doc 3: cat in the hat; Doc 4: green eggs and ham)
Adding term frequencies (tf) and document frequencies (df) to the postings:

term   df   postings (docid, tf)
blue   1    (2, 1)
cat    1    (3, 1)
egg    1    (4, 1)
fish   2    (1, 2), (2, 2)
green  1    (4, 1)
ham    1    (4, 1)
hat    1    (3, 1)
one    1    (1, 1)
red    1    (2, 1)
two    1    (1, 1)

(Doc 1: one fish, two fish; Doc 2: red fish, blue fish; Doc 3: cat in the hat; Doc 4: green eggs and ham)
Adding term positions (counted after stopword removal):

term   df   postings (docid, tf, [positions])
blue   1    (2, 1, [3])
cat    1    (3, 1, [1])
egg    1    (4, 1, [2])
fish   2    (1, 2, [2,4]), (2, 2, [2,4])
green  1    (4, 1, [1])
ham    1    (4, 1, [3])
hat    1    (3, 1, [2])
one    1    (1, 1, [1])
red    1    (2, 1, [1])
two    1    (1, 1, [3])

(Doc 1: one fish, two fish; Doc 2: red fish, blue fish; Doc 3: cat in the hat; Doc 4: green eggs and ham)
Inverted Indexing with MapReduce

Map (one call per document, emitting (term, (docid, tf))):
  Doc 1 "one fish, two fish"  → fish: (1, 2), one: (1, 1), two: (1, 1)
  Doc 2 "red fish, blue fish" → fish: (2, 2), red: (2, 1), blue: (2, 1)
  Doc 3 "cat in the hat"      → cat: (3, 1), hat: (3, 1)

Shuffle and Sort: aggregate values by keys

Reduce:
  blue → (2, 1)
  cat  → (3, 1)
  fish → (1, 2), (2, 2)
  hat  → (3, 1)
  one  → (1, 1)
  red  → (2, 1)
  two  → (1, 1)
Inverted Indexing: Pseudo-Code

class Mapper {
  def map(docid: Long, doc: String) = {
    val counts = new Map()
    for (term <- tokenize(doc)) {
      counts(term) += 1
    }
    for ((term, tf) <- counts) {
      emit(term, (docid, tf))
    }
  }
}

class Reducer {
  def reduce(term: String, postings: Iterable[(docid, tf)]) = {
    val p = new List()
    for ((docid, tf) <- postings) {
      p.append((docid, tf))
    }
    p.sort()
    emit(term, p)
  }
}
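The pseudo-code above can be simulated end-to-end on a single machine. A minimal Python sketch follows; the tokenizer (lowercase, strip commas, split on whitespace) is an assumption, since the slides leave tokenize() unspecified:

```python
from collections import Counter, defaultdict

def map_phase(docid, doc):
    """The Mapper: emit (term, (docid, tf)) for each term in the document."""
    counts = Counter(doc.lower().replace(",", " ").split())  # assumed tokenizer
    for term, tf in counts.items():
        yield term, (docid, tf)

def build_index(docs):
    """Simulate shuffle-and-sort plus the Reducer: group pairs by term,
    then sort each postings list by docid."""
    groups = defaultdict(list)
    for docid, doc in docs.items():
        for term, posting in map_phase(docid, doc):
            groups[term].append(posting)          # shuffle: aggregate by key
    return {term: sorted(ps) for term, ps in groups.items()}  # reduce

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
index = build_index(docs)
assert index["fish"] == [(1, 2), (2, 2)]
```

On a real cluster the grouping and sorting are done by the framework's shuffle; the dictionaries above just stand in for it.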
Positional Indexes

Map:
  Doc 1 "one fish, two fish"  → fish: (1, 2, [2,4]), one: (1, 1, [1]), two: (1, 1, [3])
  Doc 2 "red fish, blue fish" → fish: (2, 2, [2,4]), red: (2, 1, [1]), blue: (2, 1, [3])
  Doc 3 "cat in the hat"      → cat: (3, 1, [1]), hat: (3, 1, [2])

Shuffle and Sort: aggregate values by keys

Reduce:
  blue → (2, 1, [3])
  cat  → (3, 1, [1])
  fish → (1, 2, [2,4]), (2, 2, [2,4])
  hat  → (3, 1, [2])
  one  → (1, 1, [1])
  red  → (2, 1, [1])
  two  → (1, 1, [3])
Inverted Indexing: Pseudo-Code
(the same Mapper and Reducer as above, repeated on the original slide)
Another Try…

Move the docid out of the value and into the key:

(key)   (values)            (keys)        (value)
fish    (1, 2)              (fish, 1)     2
        (9, 1)              (fish, 9)     1
        (21, 3)             (fish, 21)    3
        (34, 1)             (fish, 34)    1
        (35, 2)             (fish, 35)    2
        (80, 3)             (fish, 80)    3

How is this different? Let the framework do the sorting!
Where have we seen this before?
Inverted Indexing: Pseudo-Code

class Mapper {
  def map(docid: Long, doc: String) = {
    val counts = new Map()
    for (term <- tokenize(doc)) {
      counts(term) += 1
    }
    for ((term, tf) <- counts) {
      emit((term, docid), tf)
    }
  }
}

class Reducer {
  var prev = null
  val postings = new PostingsList()

  def reduce(key: Pair, tf: Iterable[Int]) = {
    if (key.term != prev and prev != null) {
      emit(prev, postings)
      postings.reset()
    }
    postings.append(key.docid, tf.first)
    prev = key.term
  }

  def cleanup() = {
    emit(prev, postings)
  }
}

What else do we need to do?
Postings Encoding

Conceptually:
  fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …

In practice:
  fish → (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …

Don't encode docids, encode gaps (or d-gaps). But it's not obvious that this saves space…
= delta encoding, delta compression, gap compression
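The gap transform itself is a one-liner in both directions; a small Python sketch of d-gap encoding and decoding, using the "fish" docids from the slide:

```python
def to_gaps(docids):
    """Delta-encode a sorted docid list: first id, then successive differences."""
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    """Invert the encoding with a running sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

docids = [1, 9, 21, 34, 35, 80]    # the "fish" docids from the slide
assert to_gaps(docids) == [1, 8, 12, 13, 1, 45]
assert from_gaps(to_gaps(docids)) == docids
```

Gaps are small where docids cluster, which is what makes the variable-length codes on the following slides pay off.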
Overview of Integer Compression

Byte-aligned techniques: VByte
Bit-aligned: unary codes, γ codes
  Golomb codes (local Bernoulli model)
Word-aligned: Simple family
  Bit-packing family (PForDelta, etc.)
VByte

Simple idea: use only as many bytes as needed.
  Reserve one bit per byte as the "continuation bit"; use the remaining bits for encoding the value.

  continuation bits 0       → 1 byte:  7 payload bits
  continuation bits 1, 0    → 2 bytes: 14 payload bits
  continuation bits 1, 1, 0 → 3 bytes: 21 payload bits

Works okay, easy to implement…
Beware of branch mispredicts!
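A VByte sketch in Python, using one common convention (low 7 bits of the value per byte, high bit set on the final byte); the slide's 0/10/110 prefixes describe a length-prefix cousin of the same idea:

```python
def vbyte_encode(n):
    """Encode one non-negative integer: 7 value bits per byte,
    high bit set on the final byte (one common VByte convention)."""
    out = []
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of VByte-encoded integers."""
    nums, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        shift += 7
        if b & 0x80:                 # final byte of this number
            nums.append(n)
            n, shift = 0, 0
    return nums

gaps = [1, 8, 12, 13, 1, 45]
encoded = b"".join(vbyte_encode(g) for g in gaps)
assert len(encoded) == 6             # every gap here fits in a single byte
assert vbyte_decode(encoded) == gaps
```

The per-byte branch on the continuation bit is exactly the branch the slide warns about.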
Simple-9

How many different ways can we divide up 28 bits?
  28 1-bit numbers
  14 2-bit numbers
   9 3-bit numbers
   7 4-bit numbers
  … (9 total ways, identified by "selectors")

Efficient decompression with hard-coded decoders.
Simple family: the general idea applies to 64-bit words, etc.
Beware of branch mispredicts?
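A Simple-9 sketch in Python: each 32-bit word spends 4 bits on a selector and packs the remaining 28 bits using one of the nine layouts listed above. The greedy selector choice here is one common strategy, not the only one:

```python
# (bit width, how many numbers fit in the 28 payload bits) per selector
SELECTORS = [(1, 28), (2, 14), (3, 9), (4, 7), (5, 5), (7, 4), (9, 3), (14, 2), (28, 1)]

def simple9_encode_word(nums):
    """Pack a prefix of nums into one 32-bit word (4-bit selector plus
    28-bit payload); returns (word, how_many_packed)."""
    for sel, (width, count) in enumerate(SELECTORS):
        chunk = nums[:count]
        if len(chunk) == count and all(0 <= x < (1 << width) for x in chunk):
            word = sel << 28
            for i, x in enumerate(chunk):
                word |= x << (i * width)
            return word, count
    raise ValueError("value does not fit in 28 bits")

def simple9_decode_word(word):
    """Unpack one word: the selector picks the layout, the rest is masking."""
    width, count = SELECTORS[word >> 28]
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]

gaps = [1, 8, 12, 13, 1, 45]
word, packed = simple9_encode_word(gaps)   # first 5 gaps fit at 5 bits each
assert packed == 5
assert simple9_decode_word(word) == gaps[:5]
```

Decoding branches only once per word, on the selector, so each of the nine cases can be a hard-coded, unrolled decoder.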
Bit Packing

What's the smallest number of bits we need to code a block (= 128) of integers?
  (pack the whole block at a fixed width: 3 bits, 4 bits, 5 bits, …)

Efficient decompression with hard-coded decoders.
PForDelta: bit packing + separate storage of "overflow" bits.
Beware of branch mispredicts?
Golomb Codes

For x ≥ 1 and parameter b:
  q + 1 in unary, where q = ⌊(x − 1) / b⌋
  r in binary, where r = x − qb − 1, in ⌊log₂ b⌋ or ⌈log₂ b⌉ bits

Example:
  b = 3: r ∈ {0, 1, 2}, coded as 0, 10, 11
  b = 6: r ∈ {0, 1, 2, 3, 4, 5}, coded as 00, 01, 100, 101, 110, 111
  x = 9, b = 3: q = 2, r = 2, code = 110:11
  x = 9, b = 6: q = 1, r = 2, code = 10:100

Punch line: optimal b ≈ 0.69 (N/df). Different b for every term!
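The scheme above in Python, with the truncated-binary remainder; the two worked examples from the slide are checked at the bottom:

```python
from math import floor, log2

def golomb_encode(x, b):
    """Encode x >= 1 with parameter b: q + 1 in unary (q ones, then a zero),
    then the remainder r in truncated binary. Returns a bit-string."""
    q, r = (x - 1) // b, (x - 1) % b
    unary = "1" * q + "0"
    k = floor(log2(b)) if b > 1 else 0
    short_codes = (1 << (k + 1)) - b      # remainders that fit in k bits
    if r < short_codes:
        binary = format(r, f"0{k}b") if k else ""
    else:
        binary = format(r + short_codes, f"0{k + 1}b")
    return unary + binary

# the two worked examples from the slide:
assert golomb_encode(9, 3) == "11011"    # 110 : 11
assert golomb_encode(9, 6) == "10100"    # 10 : 100
```

With b set from 0.69 (N/df), frequent terms have small gaps and get a small b, so their codes stay short.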
Inverted Indexing: Pseudo-Code
(the same (term, docid)-keyed Mapper and Reducer as above, repeated on the original slide)
Chicken and Egg?

(key)         (value)
(fish, 1)     2
(fish, 9)     1
(fish, 21)    3
(fish, 34)    1
(fish, 35)    2
(fish, 80)    3
…

Write postings compressed. Sound familiar?

But wait! How do we set the Golomb parameter b?
Recall: optimal b ≈ 0.69 (N/df)
We need the df to set b… but we don't know the df until we've seen all postings!
Getting the df

In the mapper: emit "special" key-value pairs to keep track of the df.
In the reducer: make sure the "special" key-value pairs come first, and process them to determine the df.
Remember: proper partitioning!
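A single-machine Python sketch of the trick. The sentinel docid SPECIAL = -1 is an assumption standing in for the slide's "special" marker, chosen so that sorting puts the df pairs ahead of the real postings, just as the framework's sort order would:

```python
from collections import Counter, defaultdict

SPECIAL = -1   # assumed sentinel docid: sorts before every real docid

def map_doc(docid, doc):
    counts = Counter(doc.lower().replace(",", " ").split())  # assumed tokenizer
    for term, tf in counts.items():
        yield (term, SPECIAL), 1     # "special" pair: one more document for term
        yield (term, docid), tf      # normal posting

def reduce_all(pairs):
    """sorted() stands in for the framework's shuffle; SPECIAL pairs arrive
    first, so each term's df is known before any of its postings."""
    by_term = defaultdict(list)
    for (term, docid), value in sorted(pairs):
        by_term[term].append((docid, value))
    index = {}
    for term, vals in by_term.items():
        df = sum(v for d, v in vals if d == SPECIAL)
        plist = [(d, v) for d, v in vals if d != SPECIAL]
        index[term] = (df, plist)
    return index

pairs = [p for docid, doc in [(1, "one fish, two fish"), (2, "red fish, blue fish")]
         for p in map_doc(docid, doc)]
index = reduce_all(pairs)
assert index["fish"] == (2, [(1, 2), (2, 2)])
```

In real MapReduce the same effect needs a custom sort comparator plus a partitioner that sends (term, SPECIAL) and (term, docid) to the same reducer.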
Getting the df: Modified Mapper

Input document: Doc 1, "one fish, two fish"

(key)        (value)
(fish, 1)    2
(one, 1)     1
(two, 1)     1        ← emit normal key-value pairs
(fish, ★)    1
(one, ★)     1
(two, ★)     1        ← emit "special" key-value pairs to keep track of the df

(★ stands for a special docid that sorts before all real docids)
Getting the df: Modified Reducer

(key)         (value)
(fish, ★)     1, 1, 1     ← first, compute the df by summing contributions from
                            all "special" key-value pairs; compute b from the df
(fish, 1)     2
(fish, 9)     1
(fish, 21)    3
(fish, 34)    1
(fish, 35)    2
(fish, 80)    3            ← then write the postings, compressed
…

Important: properly define the sort order to make sure "special" key-value pairs come first!
Where have we seen this before?
But I don't care about Golomb Codes!
(the toy inverted index with tf and df, repeated from an earlier slide)
Basic Inverted Indexer: Reducer

(key)         (value)
(fish, ★)     1, 1, 1     ← compute the df by summing contributions from all
                            "special" key-value pairs; write the df
(fish, 1)     2
(fish, 9)     1
(fish, 21)    3
(fish, 34)    1
(fish, 35)    2
(fish, 80)    3            ← write the postings, compressed
…
Inverted Indexing: IP (~Pairs)

class Mapper {
  def map(docid: Long, doc: String) = {
    val counts = new Map()
    for (term <- tokenize(doc)) {
      counts(term) += 1
    }
    for ((term, tf) <- counts) {
      emit((term, docid), tf)
    }
  }
}

class Reducer {
  var prev = null
  val postings = new PostingsList()

  def reduce(key: Pair, tf: Iterable[Int]) = {
    if (key.term != prev and prev != null) {
      emit(prev, postings)   // flush the previous term's postings
      postings.reset()
    }
    postings.append(key.docid, tf.first)
    prev = key.term
  }

  def cleanup() = {
    emit(prev, postings)
  }
}
Merging Postings

Let's define an operation ⊕ on postings lists:

  Postings(1, 15, 22, 39, 54) ⊕ Postings(2, 46) = Postings(1, 2, 15, 22, 39, 46, 54)

Then we can rewrite our indexing algorithm!
  flatMap: emit singleton postings
  reduceByKey: ⊕
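Since postings lists are kept sorted by docid, ⊕ is just a sorted merge; a Python sketch using the example from the slide:

```python
from functools import reduce
import heapq

def merge(p1, p2):
    """⊕ on postings lists: merge two docid-sorted lists into one."""
    return list(heapq.merge(p1, p2))

# the example from the slide:
assert merge([1, 15, 22, 39, 54], [2, 46]) == [1, 2, 15, 22, 39, 46, 54]

# reduceByKey-style: fold ⊕ over any number of singleton or partial lists
parts = [[15], [1], [22, 39], [2, 46], [54]]
assert reduce(merge, parts) == [1, 2, 15, 22, 39, 46, 54]
```

Because the merge is associative, the framework is free to combine partial lists in any order, which is exactly what reduceByKey exploits.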
Postings1 ⊕ Postings2 = PostingsM

What's the issue?
Solution: apply compression as needed!
Inverted Indexing: LP (~Stripes)

class Mapper {
  val m = new Map()

  def map(docid: Long, doc: String) = {
    val counts = new Map()
    for (term <- tokenize(doc)) {
      counts(term) += 1
    }
    for ((term, tf) <- counts) {
      m(term).append((docid, tf))
    }
    if (memoryFull())
      flush()
  }

  def cleanup() = {
    flush()
  }

  def flush() = {
    for (term <- m.keys) {
      emit(term, new PostingsList(m(term)))
    }
    m.clear()
  }
}

Slightly less elegant implementation… but uses the same idea.
Inverted Indexing: LP (~Stripes)

class Reducer {
  def reduce(term: String, lists: Iterable[PostingsList]) = {
    var f = new PostingsList()
    for (list <- lists) {
      f = f + list
    }
    emit(term, f)
  }
}
LP vs. IP?

Experiments on the ClueWeb09 collection, segments 1 + 2: 101.8m documents (472 GB compressed, 2.97 TB uncompressed).

From: Elsayed et al., Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother? (2010)
Another Look at LP

(the LP Mapper and Reducer from the previous two slides)

flatMap: emit singleton postings
reduceByKey: ⊕

RDD[(K, V)]  →  aggregateByKey(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)  →  RDD[(K, U)]
Algorithm design in a nutshell…

Exploit associativity and commutativity via commutative monoids (if you can).
Exploit framework-based sorting to sequence computations (if you can't).

Source: Wikipedia (Walnut)
Abstract IR Architecture

online:  Query → Representation Function → Query Representation → Comparison Function → Hits
offline: Documents → Representation Function → Document Representation → Index → Comparison Function
MapReduce it?

The indexing problem:
  Scalability is critical
  Must be relatively fast, but need not be real time
  Fundamentally a batch operation
  Incremental updates may or may not be important
  For the web, crawling is a challenge in itself

The retrieval problem:
  Must have sub-second response time
  For the web, only need relatively few results
Assume everything fits in memory on a single machine… (For now)
Boolean Retrieval

Users express queries as a Boolean expression:
  AND, OR, NOT; can be arbitrarily nested

Retrieval is based on the notion of sets:
  Any query divides the collection into two sets: retrieved and not-retrieved
  Pure Boolean systems do not define an ordering of the results
Boolean Retrieval

To execute a Boolean query:
  Build the query syntax tree
  For each clause, look up the postings
  Traverse the postings and apply the Boolean operator

Example: ( blue AND fish ) OR ham

Syntax tree:
        OR
       /  \
     AND   ham
    /   \
  blue   fish

Postings:
  blue → 2, 5, 9
  fish → 1, 2, 3, 5, 6, 7, 8, 9
  ham  → 1, 3, 4, 5
Term-at-a-Time

Evaluate ( blue AND fish ) OR ham one operator at a time:

  blue AND fish            → 2, 5, 9
  (blue AND fish) OR ham   → 1, 2, 3, 4, 5, 9

(postings: blue → 2, 5, 9; fish → 1, 2, 3, 5, 6, 7, 8, 9; ham → 1, 3, 4, 5)

What's RPN?
Efficiency analysis?
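Term-at-a-time evaluation can be sketched with Python sets, using the toy postings from the slide; the RPN form of the query (blue fish AND ham OR) drives a small stack machine:

```python
postings = {  # the toy postings from the slide
    "blue": {2, 5, 9},
    "fish": {1, 2, 3, 5, 6, 7, 8, 9},
    "ham":  {1, 3, 4, 5},
}

def evaluate_rpn(query):
    """Evaluate a Boolean query given in RPN, one term/operator at a time."""
    stack = []
    for tok in query:
        if tok == "AND":
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)          # set intersection
        elif tok == "OR":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)          # set union
        else:
            stack.append(postings[tok])  # leaf: look up the postings
    return stack.pop()

# ( blue AND fish ) OR ham  ->  blue fish AND ham OR
result = evaluate_rpn(["blue", "fish", "AND", "ham", "OR"])
assert sorted(result) == [1, 2, 3, 4, 5, 9]
```

Each operator materializes an intermediate set, which is exactly the cost that the efficiency-analysis question on the slide is probing.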
Document-at-a-Time

Evaluate ( blue AND fish ) OR ham by walking all three postings lists in parallel, document by document:

  blue → 2, 5, 9
  fish → 1, 2, 3, 5, 6, 7, 8, 9
  ham  → 1, 3, 4, 5

Tradeoffs?
Efficiency analysis?
Boolean Retrieval

Users express queries as a Boolean expression:
  AND, OR, NOT; can be arbitrarily nested

Retrieval is based on the notion of sets:
  Any query divides the collection into two sets: retrieved and not-retrieved
  Pure Boolean systems do not define an ordering of the results
Ranked Retrieval

Order documents by how likely they are to be relevant:
  Estimate relevance(q, d_i)
  Sort documents by relevance

How do we estimate relevance? Take "similarity" as a proxy for relevance.
Vector Space Model

[Figure: documents d1-d5 as vectors over term axes t1, t2, t3, with angles θ and φ between them.]

Assumption: documents that are "close together" in vector space "talk about" the same things.
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ≈ "closeness").
Similarity Metric

Use the "angle" between the vectors (cosine similarity), or, more generally, inner products.
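As a sketch, cosine similarity over sparse term-weight vectors; representing vectors as dicts is an assumed convenience, and the slide's formula images express the same angle/inner-product idea:

```python
from math import sqrt

def cosine(d, q):
    """Cosine of the angle between two sparse vectors (term -> weight)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = sqrt(sum(w * w for w in d.values())) * sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

doc = {"fish": 2.0, "blue": 1.0}   # hypothetical term weights
query = {"fish": 1.0}
score = cosine(doc, query)         # 2 / sqrt(5)
```

Dropping the normalization leaves a plain inner product, the "more generally" case on the slide.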
Term Weighting

Term weights consist of two components:
  Local: how important is the term in this document?
  Global: how important is the term in the collection?

Here's the intuition:
  Terms that appear often in a document should get high weights
  Terms that appear in many documents should get low weights

How do we capture this mathematically?
  Term frequency (local)
  Inverse document frequency (global)
TF.IDF Term Weighting

  w_{i,j} = tf_{i,j} · log(N / n_i)

where
  w_{i,j}  = weight assigned to term i in document j
  tf_{i,j} = number of occurrences of term i in document j
  N        = number of documents in the entire collection
  n_i      = number of documents with term i
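The weight in code; the natural log is an assumed base, since the slide does not fix one (changing the base only rescales all weights by a constant):

```python
from math import log

def tfidf(tf_ij, N, n_i):
    """w_{i,j} = tf_{i,j} * log(N / n_i), with natural log as an assumed base."""
    return tf_ij * log(N / n_i)

# "fish" occurs twice in Doc 1 and appears in 2 of the 4 toy documents:
w = tfidf(2, 4, 2)           # 2 * log 2
# a term that appears in every document gets weight 0: log(4/4) = 0
assert tfidf(5, 4, 4) == 0.0
```

The two factors are exactly the local (tf) and global (idf) components from the previous slide.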
Retrieval in a Nutshell

Look up the postings lists corresponding to the query terms
Traverse the postings for each query term
Store partial query-document scores in accumulators
Select the top k results to return
Retrieval: Document-at-a-Time

Evaluate documents one at a time (score all query terms):

  fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
  blue → (9, 2), (21, 1), (35, 1), …

Accumulators (e.g., a min heap): is the document score in the top k?
  Yes: insert the document score, extract-min if the heap is too large
  No: do nothing

Tradeoffs:
  Small memory footprint (good)
  Skipping possible to avoid reading all postings (good)
  More seeks and irregular data accesses (bad)
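A document-at-a-time sketch in Python over the toy postings above, finishing each document's score before moving on and keeping the top k in a min-heap; summing tf as the score is a placeholder for a real scoring function:

```python
import heapq

postings = {  # term -> postings sorted by docid, (docid, tf), from the slide
    "fish": [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)],
    "blue": [(9, 2), (21, 1), (35, 1)],
}

def daat_topk(query, k):
    """Walk all query-term postings in docid order; score one document
    completely before the next; a size-k min-heap keeps the best."""
    merged = heapq.merge(*(postings[t] for t in query))  # (docid, tf) by docid
    heap, cur_doc, cur_score = [], None, 0.0

    def push(doc, score):
        heapq.heappush(heap, (score, doc))
        if len(heap) > k:
            heapq.heappop(heap)          # extract-min if the heap is too large

    for docid, tf in merged:
        if docid != cur_doc:
            if cur_doc is not None:
                push(cur_doc, cur_score)
            cur_doc, cur_score = docid, 0.0
        cur_score += tf                  # placeholder score: sum of tf
    if cur_doc is not None:
        push(cur_doc, cur_score)
    return sorted(heap, reverse=True)    # best first

top = daat_topk(["fish", "blue"], k=3)   # doc 21 scores 3 + 1 = 4
```

The heap never holds more than k entries, which is the small-memory-footprint point on the slide.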
Retrieval: Term-at-a-Time

Evaluate documents one query term at a time, usually starting from the rarest term (often with tf-sorted postings):

  fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …
  blue → (9, 2), (21, 1), (35, 1), …

Accumulators (e.g., a hash): Score_{q=x}(doc n) = s

Tradeoffs:
  Early termination heuristics (good)
  Large memory footprint (bad), but filtering heuristics are possible
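The term-at-a-time counterpart: one postings list at a time, rarest term first, with a hash of accumulators; again, summing tf stands in for a real scoring function:

```python
from collections import defaultdict

postings = {  # term -> postings sorted by docid, (docid, tf), from the slide
    "fish": [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)],
    "blue": [(9, 2), (21, 1), (35, 1)],
}

def taat(query):
    """Process one query term at a time, rarest (shortest postings) first,
    accumulating partial document scores in a hash."""
    accumulators = defaultdict(float)
    for term in sorted(query, key=lambda t: len(postings[t])):
        for docid, tf in postings[term]:
            accumulators[docid] += tf    # placeholder score: sum of tf
    return dict(accumulators)

scores = taat(["fish", "blue"])          # doc 21: 3 + 1 = 4
```

The accumulator table can grow to one entry per candidate document, which is the large-memory-footprint tradeoff on the slide.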
Why store df as part of postings?
(the toy inverted index with tf and df, repeated from an earlier slide)
Assume everything fits in memory on a single machine…
Okay, let’s relax this assumption now
Important Ideas

The rest is just details!
  Partitioning (for scalability)
  Replication (for redundancy)
  Caching (for speed)
  Routing (for load balancing)
Term vs. Document Partitioning

[Figure: the T × D term-document matrix can be split horizontally into term partitions T1, T2, T3 (each server holds all postings for a subset of terms) or vertically into document partitions D1, D2, D3 (each server holds a full index over a subset of documents).]
[Figure: a single search site: a front end (FE) with a cache sends queries through brokers to an index spread over partitions, each partition served by several replicas.]
[Figure: the same architecture scaled out: multiple datacenters, each with several tiers; every tier has its own cache, brokers, and partitioned, replicated index servers.]
Important Ideas

  Partitioning (for scalability)
  Replication (for redundancy)
  Caching (for speed)
  Routing (for load balancing)
Source: Wikipedia (Japanese rock garden)