![Page 1: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/1.jpg)
Index ConstructionIntroduction to Information RetrievalCS 150Donald J. Patterson
Content adapted from Hinrich Schützehttp://www.informationretrieval.org
![Page 2: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/2.jpg)
Crawl System
(1998, www.cnn.com)(every, www.cnn.com)
(I, www.cnn.com)(Jensen's, www.cnn.com)
(kite, www.hobby.com)
(1998, www.hobby.com)(her, news.bbc.co.uk)
(I, news.bbc.co.uk)(lion, news.bbc.co.uk)
(zebra, news.bbc.co.uk)
Disk
Block that fits in memory Block that fits in memory
BSBI - Block sort-based indexingDifferent way to sort index
![Page 3: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/3.jpg)
(1998, www.cnn.com, www.hobby.com)(every, www.cnn.com)(her, news.bbc.co.uk)
(I, www.cnn.com, news.bbc.co.uk(Jensen's, www.cnn.com)
(kite, www.hobby.com)(lion, news.bbc.co.uk)
(zebra, news.bbc.co.uk)
Merged Postings
(1998, www.cnn.com)(every, www.cnn.com)
(I, www.cnn.com)(Jensen's, www.cnn.com)
(kite, www.hobby.com)
(1998, www.hobby.com)(her, news.bbc.co.uk)
(I, news.bbc.co.uk)(lion, news.bbc.co.uk)
(zebra, news.bbc.co.uk)
.......
Different way to sort indexBSBI - Block sort-based indexing
![Page 4: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/4.jpg)
BlockSortBasedIndexConstruction()1 n� 02 while (all documents not processed)3 do block � ParseNextBlock()4 BSBI-Invert(block)5 WriteBlockToDisk(block, fn)6 MergeBlocks(f1, f2..., fn, fmerged)
BSBI - Block sort-based indexing
![Page 5: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/5.jpg)
Block merge indexing• Parse documents into (TermID, DocID) pairs until “block” is full
• Invert the block
• Sort the (TermID,DocID) pairs
• Compile into TermID posting lists
• Write the block to disk
• Then merge all blocks into one large postings file
• Need 2 copies of the data on disk (input then output)
BSBI - Block sort-based indexing
![Page 6: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/6.jpg)
Analysis of BSBI• The dominant term is O(TlogT)
• T is the number of (TermID,DocID) pairs
• But in practice ParseNextBlock takes the most time
• Then MergingBlocks
• Again, disk seeks times versus memory access times
BSBI - Block sort-based indexing
![Page 7: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/7.jpg)
Analysis of BSBI• 12-byte records (term, doc, meta-data)
• Need to sort T= 100,000,000 such 12-byte records by term
• Define a block to have 1,600,000 such records
• can easily fit a couple blocks in memory
• we will be working with 64 such blocks
• 64 blocks * 1,600,000 records * 12 bytes = 1,228,800,000
bytes
• Nlog2N comparisons is 5,584,577,250.93
• 2 touches per comparison at memory speeds (10e-6 sec) =
• 55,845.77 seconds = 930.76 min = 15.5 hours
BSBI - Block sort-based indexing
![Page 8: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/8.jpg)
Overview• Introduction
• Hardware
• BSBI - Block sort-based indexing
• SPIMI - Single Pass in-memory indexing
• Distributed indexing
• Dynamic indexing
• Miscellaneous topics
Index Construction
![Page 9: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/9.jpg)
SPIMI• BSBI is good but,
• it needs a data structure for mapping terms to termIDs
• this won’t fit in memory for big corpora
• Straightforward solution
• dynamically create dictionaries
• store the dictionaries with the blocks
• integrate sorting and merging
Single-Pass In-Memory Indexing
![Page 10: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/10.jpg)
SPIMI-Invert(tokenStream)1 outputF ile� NewFile()2 dictionary � NewHash()3 while (free memory available)4 do token� next(tokenStream)5 if term(token) /⇥ dictionary6 then postingsList� AddToDictionary(dictionary, term(token))7 else postingsList� GetPostingsList(dictionary, term(token))8 if full(postingsList)9 then postingsList� DoublePostingsList(dictionary, term(token))
10 AddToPostingsList(postingsList, docID(token))11 sortedTerms� SortTerms(dictionary)12 WriteBlockToDisk(sortedTerms, dictionary, outputF ile)13 return outputF ile
Single-Pass In-Memory Indexing
![Page 11: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/11.jpg)
SPIMI• So what is different here?
• SPIMI adds postings directly to a posting list.
• BSBI first collected (TermID,DocID pairs)
• then sorted them
• then aggregated the postings
• Each posting list is dynamic so there is no term sorting
• Saves memory because a term is only stored once
• Complexity is O(T)
• Compression enables bigger effective blocks
Single-Pass In-Memory Indexing
![Page 12: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/12.jpg)
Large Scale Indexing• Key decision in block merge indexing is block size
• In practice, spidering often interlaced with indexing
• Spidering bottlenecked by WAN speed and other factors
Single-Pass In-Memory Indexing
![Page 13: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/13.jpg)
• Deal with storage size / speed tradeoff
Why we use these algorithms
Cloud Disk Drive SSD CPU Cache
Storage Size ~ Infinite ~ PB ~ TB ~ MB
Access Speed ~ 1s .005 s .00001 s 40 clock cycle
RAM
~ GB
0.00000002 s
![Page 14: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/14.jpg)
Overview• Introduction
• Hardware
• BSBI - Block sort-based indexing
• SPIMI - Single Pass in-memory indexing
• Distributed indexing
• Dynamic indexing
• Miscellaneous topics
Index Construction
![Page 15: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/15.jpg)
Review• termID is an index given to a vocabulary word
• e.g., “house” = 57820
• docID is an index given to a document
• e.g., “news.bbc.co.uk” = 74291
• posting list is a data structure for the term-document matrix
• posting list is an inverted data structure
Index Construction
Term DocID DocID DocID DocID
![Page 16: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/16.jpg)
Review• BSBI and SPIMI
• are single pass indexing algorithms
• leverage fast memory vs slow disk speeds
• for data sets that won’t fit in entirely in memory
• for data sets that will fit on a single disk
Index Construction
![Page 17: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/17.jpg)
Review• BSBI
• builds (termID, docID) pairs until a block is filled
• builds a posting list in the final merge
• requires a vocabulary mapping word to termID
• SPMI
• builds posting lists until a block is filled
• combines posting lists in the final merge
• uses terms directly (not termIDs)
Index Construction
![Page 18: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/18.jpg)
• What if your documents don’t fit on a single disk?
• Web-scale indexing
• Use a distributed computing cluster
• supported by “Cloud computing” companies
Index Construction
![Page 19: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/19.jpg)
• Other benefits of distributed processing
• Individual machines are fault-prone
• They slow down unpredictably or fail
• Automatic maintenance
• Software bugs
• Transient network conditions
• A truck crashing into the pole outside
• Hardware fatigue and then failure
Distributed Indexing
![Page 20: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/20.jpg)
• The design of Google’s indexing as of 2004
• trajectory of big data since then
Distributed Indexing - Architecture
![Page 21: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/21.jpg)
• Think of our task as two types of parallel tasks
• Parsing
• A Parser will read a document and output (t,d) pairs
• Inverting
• An Inverter will sort and write posting lists
Distributed Indexing - Architecture
![Page 22: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/22.jpg)
• Use an instance of MapReduce
• A general architecture for distributed computing jobs
• Manages interactions among clusters of
• cheap commodity compute servers
• aka nodes
• Uses Key-Value pairs as primary object of computation
• An open-source implementation is “Hadoop” by
apache.org
Distributed Indexing - Architecture
![Page 23: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/23.jpg)
• Generally speaking in MapReduce
• There is a map phase
• This takes input and makes key-value pairs
• this corresponds to the “parse” phase of BSBI and
SPIMI
• The map phase writes intermediate files
• Results are bucketed into R buckets
• There is a reduce phase
• This is the “invert” phase of BSBI and SPIMI
• There are R inverters
Distributed Indexing - Architecture
![Page 24: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/24.jpg)
Postings
A-F
Corpus
...
Master
A-F G-P Q-Z
A-F G-P Q-Z
A-F G-P Q-Z
Parsers
...
...
A-F G-P Q-Z
A-F G-P Q-Z
A-F G-P Q-Z
Inverters
G-P
Q-Z
Distributed Indexing - Architecture
![Page 25: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/25.jpg)
Distributed Indexing - Architecture
![Page 26: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/26.jpg)
• Parsers and Inverters are not separate machines
• They are both assigned from a pool
• It is different code that gets executed
• Intermediate files are stored on a local disk
• For efficiency
• Part of the “invert” task is to talk to the parser machine
and get the data.
Distributed Indexing - Architecture
![Page 27: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/27.jpg)
• Hadoop/MapReduce does
• Hadoop manages fault tolerance
• Hadoop manages job assignment
• Hadoop manages a distributed file system
• Hadoop provides a pipeline for data
• Hadoop/MapReduce does not
• define data types
• manipulate data
Distributed Indexing - Hadoop
![Page 28: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/28.jpg)
• InputFormat
• Creates splits
• One split is assigned to one mapper
• A split is a collection of <K1,V1> pairs
(K2,list<V2>)
(K2,V2)(K2,V2)
(K2,V2)(K2,V2)
(K1,V1)(K1,V1)
(K1,V1)(K1,V1)
(K1,V1)splits
Corpus
InputFormat Mapper <K1,V1,K2,V2>
Partitioner
ReduceReduce
ReduceReducer<K2,V2,K3,V3> OutputFormat
Output
Distributed Indexing - Hadoop
![Page 29: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/29.jpg)
• InputFormat
• Hadoop comes with NLineInputFormat which breaks text input
into splits with N lines each
• K1 = line number
• V1 = text of line
(K2,list<V2>)
(K2,V2)(K2,V2)
(K2,V2)(K2,V2)
(K1,V1)(K1,V1)
(K1,V1)(K1,V1)
(K1,V1)splits
Corpus
InputFormat Mapper <K1,V1,K2,V2>
Partitioner
ReduceReduce
ReduceReducer<K2,V2,K3,V3> OutputFormat
Output
Distributed Indexing - Hadoop
![Page 30: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/30.jpg)
• Mapper<K1,V1,K2,V2>
• Takes a <K1,V1> pair as input
• Produces 0, 1 or more <K2,V2> pairs as output
• Optionally it can report progress with a Reporter
(K2,list<V2>)
(K2,V2)(K2,V2)
(K2,V2)(K2,V2)
(K1,V1)(K1,V1)
(K1,V1)(K1,V1)
(K1,V1)splits
Corpus
InputFormat Mapper <K1,V1,K2,V2>
Partitioner
ReduceReduce
ReduceReducer<K2,V2,K3,V3> OutputFormat
Output
Distributed Indexing - Hadoop
![Page 31: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/31.jpg)
• Partitioner<K2,V2>
• Takes a <K2,V2> pair as input
• Produces a bucket number as output
• Default is HashPartitioner
(K2,list<V2>)
(K2,V2)(K2,V2)
(K2,V2)(K2,V2)
(K1,V1)(K1,V1)
(K1,V1)(K1,V1)
(K1,V1)splits
Corpus
InputFormat Mapper <K1,V1,K2,V2>
Partitioner
ReduceReduce
ReduceReducer<K2,V2,K3,V3> OutputFormat
Output
Distributed Indexing - Hadoop
![Page 32: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/32.jpg)
• Reducer<K2,V2,K3,V3>
• Takes a <K2,list<V2>> pair as input
• Produces <K3,V3> as output
• Output is not resorted
(K2,list<V2>)
(K2,V2)(K2,V2)
(K2,V2)(K2,V2)
(K1,V1)(K1,V1)
(K1,V1)(K1,V1)
(K1,V1)splits
Corpus
InputFormat Mapper <K1,V1,K2,V2>
Partitioner
ReduceReduce
ReduceReducer<K2,V2,K3,V3> OutputFormat
Output
Distributed Indexing - Hadoop
![Page 33: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/33.jpg)
• OutputFormat
• Does something with the output (like write it to disk)
• TextOutputFormat<K3,V3> comes with Hadoop
(K2,list<V2>)
(K2,V2)(K2,V2)
(K2,V2)(K2,V2)
(K1,V1)(K1,V1)
(K1,V1)(K1,V1)
(K1,V1)splits
Corpus
InputFormat Mapper <K1,V1,K2,V2>
Partitioner
ReduceReduce
ReduceReducer<K2,V2,K3,V3> OutputFormat
Output
Distributed Indexing - Hadoop
![Page 34: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/34.jpg)
Hadoop example: WordCount
• Example: count the words in an input corpus
• InputFormat = TextInputFormat
• Mapper: separates words, outputs <Word, 1>
• Partitioner = HashPartitioner
• Reducer: counts the length of list<V2>, outputs <Word,count>
• OutputFormat = TextOutputFormat
(K2,list<V2>)
(K2,V2)(K2,V2)
(K2,V2)(K2,V2)
(K1,V1)(K1,V1)
(K1,V1)(K1,V1)
(K1,V1)splits
Corpus
InputFormat Mapper <K1,V1,K2,V2>
Partitioner
ReduceReduce
ReduceReducer<K2,V2,K3,V3> OutputFormat
Output
![Page 35: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/35.jpg)
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
Hadoop example: WordCount
![Page 36: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/36.jpg)
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setInputFormat(TextInputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setOutputFormat(TextOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
Hadoop example: WordCount
![Page 37: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/37.jpg)
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
Hadoop example: WordCount
![Page 38: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/38.jpg)
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Hadoop example: WordCount
![Page 39: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/39.jpg)
• The JobTracker
• manages assignment of code to nodes
• makes sure nodes are responding
• makes sure data is flowing
• the “master”
• The TaskTracker
• runs on nodes
• accepts jobs from the JobTracker
• one of many “slaves”
Hadoop execution
![Page 40: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/40.jpg)
Hadoop example: WordCount
> mkdir wordcount_classes
> javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes
WordCount.java
>jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
> bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
>bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
>bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
![Page 41: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/41.jpg)
>bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/
joe/wordcount/output
>bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Hadoop example: WordCount
![Page 42: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/42.jpg)
Hadoop example: WordCount
![Page 43: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/43.jpg)
Overview• Introduction
• Hardware
• BSBI - Block sort-based indexing
• SPIMI - Single Pass in-memory indexing
• Distributed indexing
• Dynamic indexing
• Miscellaneous topics
Index Construction
![Page 44: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/44.jpg)
• Documents come in over time
• Postings need to be updated for terms already in
dictionary
• New terms need to get added to dictionary
• Documents go away
• Get deleted, etc.
Dynamic Indexing
![Page 45: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/45.jpg)
• Overview of solution
• Maintain your “big” main index on disk
• (or distributed disk)
• Continuous crawling creates “small” indices in memory
• Search queries are applied to both
• Results merged
Dynamic Indexing
![Page 46: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/46.jpg)
• Overview of solution
• Document deletions
• Invalidation bit for deleted documents
• Just like contextual filtering,
• results are filtered to remove invalidated docs
• according to bit vector.
• Periodically merge “small” index into “big” index.
Dynamic Indexing
![Page 47: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/47.jpg)
Small Indices
A-F
Inverters
G-P
Q-Z
Big Indices
A-F
G-P
Q-ZInvalidation
Bits
Query
Filter
Result
Dynamic Indexing
![Page 48: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/48.jpg)
• Issues with big *and* small indexes
• Corpus wide statistics are hard to maintain
• Typical solution is to ignore small indices when
computing stats
• Frequent merges required
• Poor performance during merge
• unless well engineered
• Logarithmic merging
Dynamic Indexing
![Page 49: Introduction to Information Retrieval Donald J. Pattersondjp3.westmont.edu/classes/2015_09_CS150/Lectures/Lecture_09.pdfBlock merge indexing • Parse documents into (TermID, DocID)](https://reader030.vdocuments.us/reader030/viewer/2022040513/5e69402cfbae0b7c6a79aa44/html5/thumbnails/49.jpg)