Design a full-text search engine for a website based on Lucene
Presented by: Lijia Li, Yingyu Wu, Xiao Zhu
Outline
• Introduction
• Our goal
• System architecture
• Conclusion and future work
• Show demo
Introduction
• As networks have developed, the amount of information on the Internet has grown explosively, which makes it harder to find the information one is looking for. Search engines bring great convenience to people looking for information, and the Internet has become an indispensable tool.
Our goal
• In this project, our goal is to implement a full-text retrieval engine based on Lucene.
Full-text retrieval engine
• A full-text search engine uses full-text retrieval technology to index and search the entire text of documents.
• Features:
(1) An unstructured index file database
(2) Flexible retrieval methods
(3) Support for natural language retrieval
(4) Efficient retrieval
System Architecture
• The search engine provides a search service to users. It has two main parts: an online part and an offline part.
[Figure: system architecture. Online part: Users, User Interface (enter keyword, request), Search module (analyzer, result sorting), search against the Index File. Offline part: website, crawler (webpage), Website database, Index module, Index File.]
Why Lucene
• The index file format is independent of the application platform
• Inverted index
• Object-oriented system architecture
• Chinese analyzers (SmartChineseAnalyzer, IKAnalyzer)
• Provides a powerful set of query types (RangeQuery, FuzzyQuery, ...)
• Open source
Web Crawler
[Figure: architecture of the web crawler. Components: collection of start URLs, URL analysis, get robots.txt, analyze robots.txt, unprocessed URL queue, URL page fetch module, Internet, page database, page analysis module, extract links.]
Work flow of web crawler
1. Put the initial URLs into the unprocessed URL queue
2. Take a URL from the head of the queue
3. Download the page at that URL
4. Extract the hyperlinks from the downloaded page
5. Add the extracted hyperlinks to the unprocessed URL queue
6. Check whether the unprocessed URL queue is empty: if yes, the program terminates; otherwise return to step 2 and repeat the loop
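The crawl loop above can be sketched in plain Java. This is a minimal sketch, not the presenters' crawler: the "web" is a hypothetical in-memory link graph, `fetchAndExtractLinks` stands in for the real download and link extraction, and the `seen` set is an added assumption (the slide does not mention it) so that pages linking to each other do not keep the queue from ever emptying.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Sketch of the crawl loop; all names here are hypothetical.
class CrawlLoopSketch {

    // Stand-in for "download the page and extract its hyperlinks" (steps 3 and 4).
    static List<String> fetchAndExtractLinks(Map<String, List<String>> web, String url) {
        return web.getOrDefault(url, List.of());
    }

    // Returns the URLs in the order they were fetched.
    static List<String> crawl(Map<String, List<String>> web, List<String> startUrls) {
        Queue<String> unprocessed = new ArrayDeque<>(startUrls);  // step 1
        Set<String> seen = new HashSet<>(startUrls);              // assumption: skip repeats
        List<String> fetched = new ArrayList<>();

        while (!unprocessed.isEmpty()) {                          // step 6
            String url = unprocessed.poll();                      // step 2
            fetched.add(url);                                     // record the visit
            for (String link : fetchAndExtractLinks(web, url)) {  // steps 3 and 4
                if (seen.add(link)) {                             // true only for new URLs
                    unprocessed.add(link);                        // step 5
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "/index", List.of("/a", "/b"),
            "/a", List.of("/b", "/index"),
            "/b", List.of("/c"));
        System.out.println(crawl(web, List.of("/index")));
    }
}
```

Because the queue is first-in, first-out, pages are visited breadth-first from the start URLs. A real crawler would also consult robots.txt before fetching, as the architecture slide indicates.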
Index
[Figure: work flow of indexing. Boxes: a set of documents to be indexed; read and analyze document; determine the type of document; call the corresponding document parser to parse the document; parse document; build index file. Decisions (yes/no): whether indexed?; date of index earlier than the creation date?; whether the same type exists?]
Document indexing steps
1. Create an IndexWriter instance
IndexWriter writer = new IndexWriter(indexPath, analyzer, create, maxFieldLength);  // create is a boolean flag
2. Create a Document record
Document doc = new Document();
3. Add Field objects to the Document record
doc.add(new Field(name, tokenStream));
4. Write the Document record to the index
writer.addDocument(doc);
5. Close the IndexWriter object to end indexing
writer.close();
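The steps above use Lucene's own classes. To show what such an index holds without requiring the Lucene library, here is a minimal standard-library sketch of an inverted index, the structure listed among the reasons for choosing Lucene. The class `TinyInvertedIndex`, its methods, and the naive ASCII-only tokenizer are assumptions for illustration, not Lucene's API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an index: maps each term to the documents containing it.
class TinyInvertedIndex {
    // term -> ids of the documents that contain the term (kept sorted)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Plays the role of writer.addDocument(doc): tokenize the text, record each term.
    void addDocument(int docId, String text) {
        // Naive analyzer (assumption): lowercase, split on non-word characters, ASCII only.
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of the documents containing the term, or an empty set.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(1, "Ohio is a state in America");
        index.addDocument(2, "America has fifty states");
        System.out.println(index.lookup("america"));
        System.out.println(index.lookup("ohio"));
    }
}
```

Looking up a term is a single map access regardless of how many documents were indexed, which is why an inverted index suits keyword search. Chinese text would need a proper analyzer such as the ones named earlier; this tokenizer would drop it.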
Flow chart of searching
Example: User input: “大连理工 计算机” (“Dalian University of Technology” and “computer”), “america ohio”
After QueryParser: “大连理工” AND “计算机”, “america” AND “ohio”
1. Start: accept the search string from the user
2. QueryParser analyzes the search string and outputs a Query object
3. Set up the Searcher
4. The IndexSearcher object searches the Index File for related documents
5. Output the related documents: end
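The example on this slide, where a space-separated input becomes terms joined by AND, can be sketched without Lucene as follows. This is an illustration under stated assumptions: `AndQuerySketch` and its methods are hypothetical, the index is a plain map from term to document ids, and the parsing is simple whitespace splitting rather than what QueryParser actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: turn "america ohio" into "america" AND "ohio" and evaluate it.
class AndQuerySketch {

    // Splits the raw input on whitespace; each piece becomes a required term.
    static List<String> parseTerms(String userInput) {
        List<String> terms = new ArrayList<>();
        for (String piece : userInput.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // Renders the terms the way the slide writes the parsed query.
    static String describe(List<String> terms) {
        List<String> quoted = new ArrayList<>();
        for (String term : terms) {
            quoted.add("\"" + term + "\"");
        }
        return String.join(" AND ", quoted);
    }

    // AND semantics: keep only the documents that appear under every term.
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = index.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = Map.of(
            "america", Set.of(1, 2),
            "ohio", Set.of(1));
        List<String> terms = parseTerms("america ohio");
        System.out.println(describe(terms));
        System.out.println(search(index, terms));
    }
}
```

Intersecting the document sets is what makes the query an AND: a document that matches only one of the terms is dropped.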
Highlight search keywords
1. Get the position of the search keyword in the text
2. Extract the fragment around the keyword, according to that position
3. Use HTML and CSS attributes to highlight the keyword
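The three highlighting steps can be sketched in plain Java. This is a minimal sketch under assumptions, not the presenters' code: `HighlightSketch`, the `context` parameter (how many characters to keep on each side), and the CSS class name `hit` are all hypothetical, and only the first match is highlighted.

```java
// Hypothetical sketch of keyword highlighting: locate, cut a fragment, wrap in HTML.
class HighlightSketch {

    // Step 1: position of the first case-insensitive match, or -1 if there is none.
    static int indexOfIgnoreCase(String text, String keyword) {
        for (int i = 0; i + keyword.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, keyword, 0, keyword.length())) {
                return i;
            }
        }
        return -1;
    }

    // Escapes the characters that would otherwise be read as HTML markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Returns an HTML fragment around the keyword, or "" when the keyword is absent.
    static String highlight(String text, String keyword, int context) {
        if (keyword.isEmpty()) {
            return "";
        }
        int pos = indexOfIgnoreCase(text, keyword);
        if (pos < 0) {
            return "";
        }
        // Step 2: fragment boundaries, clamped to the text.
        int matchEnd = pos + keyword.length();
        int start = Math.max(0, pos - context);
        int end = Math.min(text.length(), matchEnd + context);
        // Step 3: wrap the match in a span that a stylesheet can style.
        return escape(text.substring(start, pos))
            + "<span class=\"hit\">" + escape(text.substring(pos, matchEnd)) + "</span>"
            + escape(text.substring(matchEnd, end));
    }

    public static void main(String[] args) {
        String text = "Columbus is the capital of Ohio, a state in America.";
        System.out.println(highlight(text, "ohio", 10));
    }
}
```

A stylesheet rule such as `.hit { background: yellow; }` would then make the match stand out on the results page.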
Conclusion and future work
• Through this project, we learned how to use a web crawler and Lucene to implement a full-text search engine.
• Future work: working on Hadoop
• Thank you!