Information Retrieval - Indian Institute of Technology Bombay
Information Retrieval
Shehzaad Dhuliawala
Maulik Vachhani
Presentation Outline
• Introduction
• Boolean Retrieval
• Indexing
Term Vocabulary
Postings List
Index Creation
• Retrieval Models and Scoring
Vector Space Model
Probabilistic Model
• Web Crawling
• Cross Lingual Information Retrieval
Content we will refer to
1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/
2. Coursera Natural Language Processing Course by Dan Jurafsky and Christopher Manning https://class.coursera.org/nlp/
3. NPTEL course on Natural Language Processing by Pushpak Bhattacharyya http://nptel.ac.in/courses/106101007/
What is Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). [1]
Unstructured Text
• What differentiates an IR system from a database? IR operates over unstructured text, whereas a database queries structured records.
IR Models
• An IR model is a quadruple [D, Q, F, R(di, qi)]
D: collection of documents
Q: collection of queries
F: framework for modelling the documents, the queries, and their relationship
R: a ranking/scoring function which returns a real number expressing the relevance of di to qi
Boolean Retrieval
• A simple model based on set theory
• It checks whether terms are present in a document or not
Example
• We have a collection of scientific papers in the field of computer science
• The information need: papers which are about information retrieval using machine learning
• Query: information ∧ retrieval ∧ machine ∧ learning
• Answer set: Set(information) ∩ Set(retrieval) ∩ Set(machine) ∩ Set(learning)
[Figure: Venn diagram of the four term sets: information, retrieval, machine, learning]
Grepping
• The Unix grep command lets you search for the presence of a term in a document
• Why does this approach pose a problem?
The term-document matrix

        compiler  machine  learning  deep  information  retrieval  translation
Doc 1      1        0         0       0        0           0           0
Doc 2      0        1         1       0        1           1           0
Doc 3      0        0         1       1        1           0           0
Doc 4      0        0         0       0        0           0           1
Doc 5      1        1         1       1        1           1           0
Doc 6      1        1         0       0        1           0           0

Query: machine ∧ learning ∧ ¬(compiler)
machine = (010011), learning = (011010), ¬compiler = (011100)
(010011) ∧ (011010) ∧ (011100) = (010000)
So the relevant set is Doc 2.
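The bitwise evaluation above can be sketched with Java's BitSet, one bit per document. This is a toy illustration of the incidence vectors from the matrix; the class and method names are mine, not part of any library.

```java
import java.util.BitSet;

// Sketch: Boolean retrieval over the term-document matrix above.
// Bit i of a term's BitSet is set when the term occurs in document i+1.
public class BooleanRetrieval {
    static BitSet bits(String pattern) {           // e.g. "010011"
        BitSet b = new BitSet(pattern.length());
        for (int i = 0; i < pattern.length(); i++)
            if (pattern.charAt(i) == '1') b.set(i);
        return b;
    }

    // machine AND learning AND NOT compiler
    public static BitSet machineLearningNotCompiler() {
        BitSet machine  = bits("010011");
        BitSet learning = bits("011010");
        BitSet compiler = bits("100011");
        BitSet result = (BitSet) machine.clone();
        result.and(learning);
        result.andNot(compiler);                   // AND with the complement
        return result;
    }

    public static void main(String[] args) {
        System.out.println(machineLearningNotCompiler()); // bit 1 set => Doc 2
    }
}
```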
TDM: Sparseness
• Space complexity: |V| · |D|
• |V| = vocabulary size
• |D| = number of documents
• With |V| = 500,000 and |D| = 1 million, space required: ~500 GB
Inverted Index
[Figure: inverted index. Each term (compilers, machine, learning, information, retrieval) points to a sorted postings list of the IDs of the documents containing it, e.g. compilers -> 1, 5, 6.]
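Conjunctive (AND) queries on an inverted index come down to intersecting sorted postings lists, which takes O(m + n) time with two pointers. A minimal sketch (hypothetical helper, not Nutch or Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: linear-time intersection of two sorted postings lists,
// the core operation behind conjunctive queries on an inverted index.
public class PostingsIntersect {
    public static List<Integer> intersect(int[] p1, int[] p2) {
        List<Integer> answer = new ArrayList<>();
        int i = 0, j = 0;
        while (i < p1.length && j < p2.length) {
            if (p1[i] == p2[j]) { answer.add(p1[i]); i++; j++; }
            else if (p1[i] < p2[j]) i++;   // advance the pointer at the smaller docID
            else j++;
        }
        return answer;
    }
}
```

Because each pointer only moves forward, the cost is proportional to the combined list length rather than the product.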
NLP and IR
How NLP helps IR
• Tokenization
• Stemming/Lemmatization
• Stopword removal
• Normalization
• Named entities
• Multi-word expressions
Tokenization
• Text is a sequence of characters
• For term-based indexing we need to decide how to tokenize the text
• Where does this become a problem?
• O'Neal, knock-out: how do we tokenize these?
Stemming
• Q: The best cars
• D: The best car in 2016 is the Honda...
• Matters even more in morphologically richer languages (e.g., Marathi)
• मुंबईहून पुण्याला जाणाऱ्या बसची वेळ ("the time of the bus going from Mumbai to Pune")
• Is stemming always beneficial?
Stopword Removal
• Which words actually convey the meaning of the text?
• Taj Mahal is situated in Agra which is close to Delhi
• After stopword removal: Taj Mahal situated Agra close Delhi
• It has been shown that removing stopwords often boosts the performance of an IR system and lowers index size
• Is it always beneficial to remove stopwords?
Normalization
• Text often contains stylistic features, and usage may not be consistent
• For example, one document may contain the term USA, while another has U.S.A
• Should both be indexed separately?
Named Entities and Multiword Expressions
• Often a group of words may be more relevant together than individually
• Q: machine learning
• D: …the machine was used by several students and this was a good learning experience for them…
• Such terms are called Multiword expressions
• Should they be indexed together?
Retrieval Models
Problems with Boolean search
• Boolean queries often result in either too few (= 0) or too many (1000s) results.
• Query 1: "standard user dlink 650" -> many results
• Query 2: "standard user dlink 650 no card found" -> 0 hits
• It takes a lot of skill to come up with a query that produces a manageable number of hits: AND gives too few, OR gives too many.
• Retrieved documents are not in any order of relevance.
Ranked retrieval models
• Rather than a set of documents satisfying a query expression, in ranked retrieval the system returns an ordering over the (top) documents in the collection for a query
• Ranked retrieval models:
1. Vector space model
2. Probabilistic model
Scoring as the basis of ranked retrieval
• We wish to return, in order, the documents most likely to be useful to the searcher
• Assign a score, say in [0, 1], to each document
• We need a way of assigning a score to a query/document pair
• The more frequent the query term in the document, the higher the score (should be)
• Rare terms are more informative than frequent terms
Term frequency tf
• The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d.
• Raw term frequency is not what we want: relevance does not increase proportionally with term frequency.
• So we use log frequency. The log-frequency weight of term t in d is
    w(t,d) = 1 + log10 tf(t,d)   if tf(t,d) > 0
           = 0                   otherwise
idf weight
• Frequent terms are less informative than rare terms, so we want to give higher weight to rare terms.
• We will use document frequency (df) to capture this.
• df(t) is the document frequency of t: the number of documents that contain t.
• We define the idf (inverse document frequency) of t by
    idf(t) = log10(N / df(t))
  where N is the total number of documents.
tf-idf weighting
• The tf-idf weight of a term is the product of its tf weight and its idf weight.
• Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection
    w(t,d) = (1 + log10 tf(t,d)) · log10(N / df(t))
    Score(q,d) = Σ_{t ∈ q∩d} tf-idf(t,d)
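The log-tf times idf weight above can be written directly as a small helper. An illustrative sketch only; the class name is mine:

```java
// Sketch of the weighting defined above:
//   w(t,d) = (1 + log10 tf) * log10(N / df), and 0 when tf == 0.
public class TfIdf {
    public static double weight(int tf, int df, int N) {
        if (tf == 0) return 0.0;                 // term absent from the document
        return (1 + Math.log10(tf)) * Math.log10((double) N / df);
    }
}
```

For example, a term occurring once in a document, and in 10 of 100 documents overall, gets weight (1 + 0) * log10(10) = 1.0.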
Documents and query as vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents and the query are points or vectors in this space
• Find the cosine similarity between each document and the query:
    cos(q,d) = (q · d) / (|q| |d|) = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_{i=1..|V|} q_i²) · sqrt(Σ_{i=1..|V|} d_i²) )
• We can drop the query norm |q| from the denominator: it is the same for every document, and we are interested in relative values only.
Summary – vector space model• Represent the query as a weighted tf-idf vector
• Represent each document as a weighted tf-idf vector
• Compute the cosine similarity score for the query vector and each document vector
• Rank documents with respect to the query by score
• Return the top K (e.g., K = 10) to the user
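The cosine score in the summary above can be sketched as follows. This is a toy implementation over dense vectors, not Lucene's scorer:

```java
// Sketch: cosine similarity between two tf-idf vectors,
// cos(q,d) = (q · d) / (|q| |d|), as used to rank documents above.
public class Cosine {
    public static double similarity(double[] q, double[] d) {
        double dot = 0, qNorm = 0, dNorm = 0;
        for (int i = 0; i < q.length; i++) {
            dot   += q[i] * d[i];   // numerator: inner product
            qNorm += q[i] * q[i];   // |q|^2
            dNorm += d[i] * d[i];   // |d|^2
        }
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }
}
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0, so ranking by this value orders documents by angular closeness to the query.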
Probabilistic Model
• Probability Ranking Principle: for a query q, the document collection is split into relevant documents R and non-relevant documents NR.
• In a probabilistic model, the obvious way to produce output is to rank documents by the estimated probability of their relevance to the information need.
• That is, we order documents d by P(R | d, q), where q is the query.
• Examples are BM25, the Binary Independence Model, etc.
BM25
• Ranks documents based on the query terms appearing in each document.
• Given a query Q containing keywords q1, ..., qn, the BM25 score of a document D is
    score(D,Q) = Σ_{i=1..n} IDF(q_i) · TF(q_i, D) · (k1 + 1) / ( TF(q_i, D) + k1 · (1 - b + b · |D| / avgDL) )
  where |D| is the length of document D, avgDL is the average document length in the collection, and k1 and b are free parameters.
• The inverse document frequency is
    IDF(q_i) = log( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) )
  where N is the total number of documents and n(q_i) is the number of documents containing q_i.
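A direct transcription of the formula, with the commonly used defaults k1 = 1.2 and b = 0.75. Those defaults are my assumption for illustration; the slide leaves the parameters free:

```java
// Sketch of the BM25 score defined above.
// tf[i] = TF(q_i, D), df[i] = n(q_i); N documents, docLen = |D|.
public class Bm25 {
    static final double K1 = 1.2, B = 0.75;   // common defaults (assumed here)

    public static double score(double[] tf, double[] df, int N,
                               double docLen, double avgDL) {
        double s = 0;
        for (int i = 0; i < tf.length; i++) {
            double idf  = Math.log((N - df[i] + 0.5) / (df[i] + 0.5));
            double norm = tf[i] + K1 * (1 - B + B * docLen / avgDL);
            s += idf * tf[i] * (K1 + 1) / norm;   // per-term contribution
        }
        return s;
    }
}
```

Note the saturation effect: as tf[i] grows, the term's contribution approaches idf * (k1 + 1) rather than growing without bound.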
Link-based Model
Link Structure of the Web
• Intuitively, a webpage is important if it has a lot of backlinks.
• In-links and out-links: A and B are C's in-links; C is an out-link of A and of B.
PageRank
    PR(p_i) = (1 - d)/N + d · Σ_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)
• p1, p2, ..., pN are the pages under consideration.
• M(p_i) is the set of pages that link to p_i.
• L(p_j) is the number of outbound links on page p_j.
• N is the total number of pages; d is the damping factor.
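The formula above can be evaluated by simple power iteration. A toy sketch on a hypothetical three-page graph (A links to B and C, B links to C, C links to A), with damping d = 0.85:

```java
// Sketch: iterative PageRank following the formula above.
// links[j] lists the pages that page j links to.
public class PageRank {
    public static double[] rank(int[][] links, int iterations) {
        int n = links.length;
        double d = 0.85;                          // damping factor
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);       // uniform start
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - d) / n);
            for (int j = 0; j < n; j++)
                for (int target : links[j])       // distribute PR(p_j)/L(p_j)
                    next[target] += d * pr[j] / links[j].length;
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        int[][] links = { {1, 2}, {2}, {0} };     // A->B,C  B->C  C->A
        double[] pr = rank(links, 50);
        System.out.printf("A=%.3f B=%.3f C=%.3f%n", pr[0], pr[1], pr[2]);
    }
}
```

Since every page here has at least one out-link, the scores keep summing to 1; page C, with two in-links, ends up ranked highest.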
An example of Simplified PageRank
PageRank Calculation: first iteration
Evaluation
Set-based effectiveness measures
[Figure: Venn diagram of the Retrieved and Relevant sets; their overlap is the set of documents both relevant and retrieved.]
Precision and recall
Precision (P) is the fraction of retrieved documents that are relevant
Recall (R) is the fraction of relevant documents that are retrieved
Precision/recall tradeoff
• You can increase recall by returning more docs: recall is a non-decreasing function of the number of docs retrieved.
• A system that returns all docs has 100% recall!
• The converse is also true (usually): it's easy to get high precision at very low recall.
• So we use the harmonic mean of the two:
    F = 2PR / (P + R)
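These set-based measures are one-liners over three counts. An illustrative helper class (names are mine):

```java
// Sketch: precision, recall and their harmonic mean F, from counts of
// retrieved documents, relevant documents, and relevant-and-retrieved documents.
public class PrF {
    public static double precision(int relRet, int retrieved) {
        return (double) relRet / retrieved;   // fraction of retrieved that are relevant
    }
    public static double recall(int relRet, int relevant) {
        return (double) relRet / relevant;    // fraction of relevant that are retrieved
    }
    public static double f(double p, double r) {
        return 2 * p * r / (p + r);           // harmonic mean
    }
}
```

For instance, retrieving 10 documents of which 8 are relevant, out of 16 relevant overall, gives P = 0.8, R = 0.5, F = 8/13 ≈ 0.615; the harmonic mean punishes the weaker of the two.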
Measures
Average precision is the average of P@K over the ranks K at which a relevant document appears.
Advantage of average precision: no need to select any particular K.
Mean Average Precision (MAP) is average precision averaged across a set of queries.
Advantage of MAP: the result reflects the relevance of the whole system.
NDCG
Normalized Discounted Cumulative Gain (NDCG) is used when relevance judgments are not binary.
Suppose there are five levels of relevance judgment: Perfect, Excellent, Good, Fair, Bad. We assign a relevance score to each level, say Perfect = 4, Excellent = 3, Good = 2, Fair = 1 and Bad = 0.
    NDCG(Q,k) = (1/|Q|) Σ_{j=1..|Q|} Z_kj Σ_{m=1..k} (2^{R(j,m)} - 1) / log2(1 + m)
NDCG is measured at rank k. Here Q is the set of queries, R(j,m) is the relevance score for query j and document m, and Z_kj is a normalizing factor chosen so that a perfect ranking gets NDCG = 1.
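The per-query inner sum can be sketched as below, with the normalizer Z taken as the reciprocal of the ideal (relevance-sorted) ranking's gain, which is the usual reading of the definition above:

```java
// Sketch: DCG at rank k with gain (2^R - 1)/log2(1+m); NDCG divides by the
// DCG of the ideal ordering (the Z normalization from the formula above).
public class Ndcg {
    public static double dcg(int[] rel, int k) {
        double s = 0;
        for (int m = 1; m <= k && m <= rel.length; m++)
            s += (Math.pow(2, rel[m - 1]) - 1) / (Math.log(1 + m) / Math.log(2));
        return s;
    }

    public static double ndcg(int[] ranking, int[] ideal, int k) {
        return dcg(ranking, k) / dcg(ideal, k);   // Z = 1 / ideal DCG
    }
}
```

A ranking identical to the ideal one scores exactly 1.0; swapping a highly relevant document below a less relevant one lowers the score because of the log-rank discount.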
Evaluation Fora
1960 – Cranfield experiments / Cranfield collection: initial experiments on text retrieval were started by Cyril Cleverdon in the 60s at Cranfield University. Cleverdon's retrieval test collection formed the blueprint for TREC.
1992 – TREC collections (1-4): the Text Retrieval Conference was started in 1992 by NIST. TREC focuses on several tracks ranging from question answering to cross-lingual information retrieval.
1999 – NTCIR Asian-language collections: NTCIR, the Japanese counterpart of TREC, was launched in 1999 and focuses largely on datasets for Asian languages (Japanese, Korean, Chinese).
2000 – CLEF European-language collections: CLEF (Cross-Language Evaluation Forum) started out as an evaluation forum focused on cross-lingual IR; today it has become a fully peer-reviewed conference and focuses largely on European languages.
2007 – FIRE Indian-language collections: FIRE (Forum for IR Evaluation) started as a spin-off of a CLEF 2007 task on retrieval for Indian languages. FIRE has released collections for 10 Indian languages.
Web Crawling
Web Crawler
A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing.
A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter.
Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indexes of other sites' web content.
List of web crawlers• Apache Nutch
• WebCrawler
• DataparkSearch
• HTTrack
• MnoGoSearch
Web Crawler Architecture
Crawl cycle
• Create a URL seed list (one-time process)
• Generate: the list of URLs to be fetched in this cycle is generated.
• Fetch: the generated URLs are fetched from the internet.
• Parse: each fetched document is parsed and its out-links are extracted.
• UpdateDb: the out-links are recorded in the database.
Cross Lingual Information Retrieval
The Problem
• You have a collection of documents in language L1
• The user gives a query in language L2
Possible pipelines: Document translation
Document
collection
Translation
system
IR system
Index
Query
Ranked list of
documents
Possible pipelines: Query translation
Document
collection
IR system
Index
Query
Ranked list of
documents
Translation
system
Sandhan: A Case Study (http://www.sandhansearch.in)
[Figure: the Sandhan CLIR engine. Crawled and indexed web pages feed a target-language index in English; language resources support query processing. Example: the Hindi query तिरूपति यात्रा ("Tirupati pilgrimage") retrieves target information in English and returns a ranked list of results with snippets rendered in Hindi, e.g. a snippet on rail options for reaching Tirupati recommending the Mumbai-Chennai Express for travellers from Mumbai.]
Sandhan – Consortium Project
• IIT Bombay (co-ordinator)
• CDAC Noida (co-coordinator)
• CDAC Pune
• IIT Kharagpur
• Jadavpur University
• ISI Kolkata
• IIIT Hyderabad
• AU KBC
• AU CEG
• Gauhati University
• DAIICT Gujarat
• IIIT Bhubaneswar
• TDIL
Problem definition
• Cross Lingual Information Retrieval (CLIR) engine for Indian languages
Input: query in one of nine Indian languages (six from the first phase: Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi; plus the three second-phase languages)
Output: in Hindi, English and the query language
• Currently in the second phase of the project
• Three new languages were added in the second phase: Assamese, Gujarati, Oriya
• Built on top of the Nutch framework
Software Used• Nutch v0.9 – Framework
• Hadoop – Distributed Crawling
• Lucene – Indexing
• Moses/GIZA++ - Training models
• Tomcat – Deployment
[Figure: Sandhan system architecture. Offline side: a Fetcher crawls the Web; an Analyzer with MWE lookup, NE lookup, a domain identifier, a language identifier and a font transcoder feeds the Indexer and CMLifier, which build the UNL index. Online side: the query passes through an Analyzer with MWE/NE lookup, translation/transliteration and query formulation against the index; results go through information extraction, snippet generation, snippet translation and summary generation.]
Resources Developed• Language specific analyzers
• Stop word List
• Bilingual Dictionary ( X-English, X-Hindi)
• NE List
• MWE List
• Transliteration Models
Nutch and Lucene Framework: Demo
Arjun Atreya V, RS-IITB
Outline
• Introduction
• Behavior of Nutch (Offline and Online)
• Lucene Features
Resources Used
• Gospodnetic, Otis; Hatcher, Erik (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. 456 pp. ISBN 978-1-932394-28-3.
• Nutch Wiki http://wiki.apache.org/nutch/
Introduction
• Nutch is an open-source search engine
• Implemented in Java
• Nutch comprises Lucene, Solr, Hadoop, etc.
• Lucene provides the indexing and searching of crawled data
• Both Nutch and Lucene are developed using a plugin framework
• Easy to customize
Where do they fit in IR?
Nutch – complete search engine
Nutch – offline processing
• Crawling
Starts with a set of seed URLs
Goes deeper into the web and starts fetching the content
Content needs to be analyzed before storing
Storing the content makes it suitable for searching
• Issues
Time-consuming process
Freshness of the crawl (how often should I crawl?)
Coverage of content
Nutch – online processing
• Searching
Analysis of the query
Processing of the few words (tokens) in the query
Query tokens are matched against the stored tokens (the index)
• Fast and accurate
• Involves ordering the matching results
• Ranking affects the user's satisfaction directly
• Supports distributed searching
Nutch – Data structures
• Web Database (WebDB): mirrors the properties/structure of the web graph being crawled
• Segment: an intermediate index; contains the pages fetched in a single run
• Index: the final inverted index obtained by "merging" segments (Lucene)
Nutch – Crawling
• Inject: initial creation of CrawlDB
Insert seed URLs
Initial LinkDB is empty
• Generate new shard's fetchlist
• Fetch raw content
• Parse content (discovers outlinks)
• Update CrawlDB from shards
• Update LinkDB from shards
• Index shards
Wide Crawling vs. Focused Crawling
• Differences:
Little technical difference in configuration
Big difference in operations, maintenance and quality
• Wide crawling:
(Almost) Unlimited crawling frontier
High risk of spamming and junk content
“Politeness” a very important limiting factor
Bandwidth & DNS considerations
• Focused (vertical or enterprise) crawling:
Limited crawling frontier
Bandwidth or politeness is often not an issue
Low risk of spamming and junk content
Crawling Architecture
Step 1: The Injector injects the list of seed URLs into the CrawlDB.
Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms the fetch list, and adds a crawl_generate folder to the segments.
Step 3: These fetch lists are used by fetchers to fetch the raw content of the documents, which is then stored in the segments.
Step 4: The Parser is called to parse the content of the document, and the parsed content is stored back in the segments.
Step 5: The links are inverted in the link graph and stored in the LinkDB.
Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.
Step 7: Information on the newly fetched documents is updated in the CrawlDB.
Crawling: 10 stage process
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
1. admin db –create: Create a new WebDB.
2. inject: Inject root URLs into the WebDB.
3. generate: Generate a fetchlist from the WebDB in a new segment.
4. fetch: Fetch content from URLs in the fetchlist.
5. updatedb: Update the WebDB with links from fetched pages.
6. Repeat steps 3-5 until the required depth is reached.
7. updatesegs: Update segments with scores and links from the WebDB.
8. index: Index the fetched pages.
9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
10. merge: Merge the indexes into a single index for searching.
De-duplication Algorithm
A tuple (MD5 hash, float score, int indexID, int docID, int urlLen) is kept for each page.
To eliminate URL duplicates from a segmentsDir:
    open a temporary file
    for each segment:
        for each document in its index:
            append a tuple for the document to the temporary file, with hash = MD5(URL)
    close the temporary file
    sort the temporary file by hash
    for each group of tuples with the same hash:
        for each tuple but the first:
            delete the specified document from the index
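The core of this algorithm, reduced to in-memory URL de-duplication, can be sketched as follows. This is an illustration only: real Nutch operates over segment files on disk, and the class here is hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the de-duplication idea above: hash each URL with MD5,
// group documents by hash, and keep only the first document per group.
public class Dedup {
    static String md5(String s) {
        try {
            byte[] h = MessageDigest.getInstance("MD5")
                                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : h) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);    // MD5 is always available
        }
    }

    // Returns the doc IDs to keep: the first document seen for each URL hash.
    public static List<Integer> keep(String[] urls) {
        Map<String, Integer> firstByHash = new LinkedHashMap<>();
        for (int docId = 0; docId < urls.length; docId++)
            firstByHash.putIfAbsent(md5(urls[docId]), docId);
        return new ArrayList<>(firstByHash.values());
    }
}
```

Hashing the URLs rather than comparing them directly is what lets the real implementation sort fixed-size tuples instead of variable-length strings.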
URL Filtering
URL Filters (text file: conf/crawl-urlfilter.txt)
Regular expressions to filter URLs during crawling, e.g.:
To ignore files with certain suffixes:
-\.(gif|exe|zip|ico)$
To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/
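A sketch of how such filter rules behave, with the two example patterns above hard-coded. Real Nutch reads the rules from conf/crawl-urlfilter.txt and applies them in file order; this toy class just mimics the reject-then-accept logic:

```java
import java.util.regex.Pattern;

// Sketch: crawl-urlfilter-style rules. A '-' rule rejects matching URLs,
// a '+' rule accepts them; here the reject rule is checked first.
public class UrlFilter {
    static final Pattern SKIP =
        Pattern.compile("\\.(gif|exe|zip|ico)$");                 // -\.(gif|exe|zip|ico)$
    static final Pattern ACCEPT =
        Pattern.compile("^http://([a-z0-9]*\\.)*apache\\.org/");  // +^http://([a-z0-9]*\.)*apache.org/

    public static boolean accept(String url) {
        if (SKIP.matcher(url).find()) return false;   // binary/media suffixes: skip
        return ACCEPT.matcher(url).find();            // only apache.org hosts pass
    }
}
```

So an HTML page under apache.org is crawled, a .gif under the same host is skipped, and anything outside the domain is rejected.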
A Few APIs
• Site we would crawl: http://www.iitb.ac.in bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
• Analyze the database: bin/nutch readdb <db dir> –stats
bin/nutch readdb <db dir> –dumppageurl
bin/nutch readdb <db dir> –dumplinks
s=`ls -d <segment dir> /* | head -1` ; bin/nutch segread -dump $s
Map-Reduce Function
• Works in a distributed environment
• map() and reduce() functions are implemented in most of the modules
• Both map() and reduce() operate on <key, value> pairs
• Useful for processing large data (e.g., indexing)
• Some applications need a sequence of map-reduce jobs:
Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
Map-Reduce Architecture
Nutch – Map-Reduce Indexing
• Map() just assembles all parts of documents
• Reduce() performs text analysis + indexing:
Adds to a local Lucene index
Other possible MR indexing models:
• Hadoop contrib/indexing model:
analysis and indexing on map() side
Index merging on reduce() side
• Modified Nutch model:
Analysis on map() side
Indexing on reduce() side
Nutch - Ranking
• Nutch Ranking
queryNorm() : indicates the normalization factor for the query
coord() : indicates how many query terms are present in the given document
norm() : score indicating field based normalization factor
tf : term frequency and idf : inverse document frequency
t.boost() : score indicating the importance of terms occurrence in a particular field
Lucene - Features
• Field based indexing and searching
• Different fields of a webpage are
Title
URL
Anchor text
Content, etc..
• Different boost factors to give importance to fields
• Uses inverted index to store content of crawled documents
• Open source Apache project
Lucene - Index
• Concepts
Index: sequence of documents (a.k.a. Directory)
Document: sequence of fields
Field: named sequence of terms
Term: a text string (e.g., a word)
• Statistics
Term frequencies and positions
Writing to Index
IndexWriter writer =
new IndexWriter(directory, analyzer, true);
Document doc = new Document();
// add fields to document (next slide)
writer.addDocument(doc);
writer.close();
Adding Fields
doc.add(Field.Keyword("isbn", isbn));
doc.add(Field.Keyword("category", category));
doc.add(Field.Text("title", title));
doc.add(Field.Text("author", author));
doc.add(Field.UnIndexed("url", url));
doc.add(Field.UnStored("subjects", subjects, true));
doc.add(Field.Keyword("pubmonth", pubmonth));
doc.add(Field.UnStored("contents",author + " " + subjects));
doc.add(Field.Keyword("modified", DateField.timeToString(file.lastModified())));
Fields Description
• Attributes
Stored: original content retrievable
Indexed: inverted, searchable
Tokenized: analyzed, split into tokens
• Factory methods
Keyword: stored and indexed as single term
Text: indexed, tokenized, and stored if String
UnIndexed: stored
UnStored: indexed, tokenized
• Terms are what matters for searching
Searching an Index
IndexSearcher searcher =
new IndexSearcher(directory);
Query query = QueryParser.parse(queryExpression,
    "contents", analyzer);
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
Document doc = hits.doc(i);
System.out.println(doc.get("title"));
}
Analyzer
• Analysis occurs
For each tokenized field during indexing
For each term or phrase in QueryParser
• Several analyzers built-in
Many more in the sandbox
Straightforward to create your own
• Choosing the right analyzer is important!
WhiteSpace Analyzer
The quick brown fox jumps over the lazy dog.
[The] [quick] [brown] [fox] [jumps] [over] [the]
[lazy] [dog.]
Simple Analyzer
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jumps] [over] [the]
[lazy] [dog]
Stop Analyzer
The quick brown fox jumps over the lazy dog.
[quick] [brown] [fox] [jumps] [over] [lazy] [dog]
Snowball Analyzer
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jump] [over] [the]
[lazy] [dog]
Query Creation
• Searching by a term – TermQuery
• Searching within a range – RangeQuery
• Searching by prefix – PrefixQuery
• Combining queries – BooleanQuery
• Searching by phrase – PhraseQuery
• Searching by wildcard – WildcardQuery
• Searching for similar terms - FuzzyQuery
Lucene Queries
Conclusions
• Nutch as a starting point
• Crawling in Nutch
• Detailed map-reduce architecture
• Different query formats in Lucene
• Built-in analyzers in Lucene
• The same analyzer needs to be used both while indexing and searching
Thanks
• Questions ??