fast phrase querying with combined indexes
Fast Phrase Querying With Combined Indexes. HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE, School of Computer Science & Information Technology, RMIT University, Australia. TOIS, Volume 22, Issue 4 (October 2004). 2003 SCI Journal. ACM Transactions on Information Systems (TOIS).
![Page 1: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/1.jpg)
Fast Phrase Querying With Combined Indexes
HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE
School of Computer Science & Information Technology, RMIT University, Australia
TOIS, Volume 22, Issue 4 (October 2004)
![Page 2: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/2.jpg)
2003 SCI Journal

| Abbreviated Journal Title | 2003 Total Cites | Impact Factor | Immediacy Index | 2003 Articles | Cited Half-Life |
|---|---|---|---|---|---|
| ACM COMPUT SURV | 1347 | 7.5 | 0.154 | 13 | 7.4 |
| IEEE INTELL SYST | 784 | 3.725 | 0.386 | 44 | 3.3 |
| ACM T INFORM SYST | 843 | 3.533 | 0.667 | 15 | 7.1 |
| COMPUT LINGUIST | 513 | 1.515 | 0.071 | 14 | 8.6 |
| J AM SOC INF SCI TEC | 2060 | 1.473 | 0.447 | 103 | 6.7 |
![Page 3: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/3.jpg)
ACM Transactions on Information Systems (TOIS)
Volume 22, Issue 4 (October 2004)
• Qualitative decision making in adaptive presentation of structured information
  Ronen I. Brafman, Carmel Domshlak, Solomon E. Shimony; Pages: 503–539
• Analysis of lexical signatures for improving information persistence on the World Wide Web
  Seung-Taek Park, David M. Pennock, C. Lee Giles, Robert Krovetz; Pages: 540–572
• Fast phrase querying with combined indexes
  Hugh E. Williams, Justin Zobel, Dirk Bahle; Pages: 573–594
• Information systems interoperability: What lies beneath?
  Jinsoo Park, Sudha Ram; Pages: 595–632
![Page 4: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/4.jpg)
Abstract
• Search engines need to evaluate queries extremely fast (low disk overheads)
• A significant proportion of the queries are phrases, indicating that some of the query terms must be ordered and adjacent
  – nextword indexes (indexes are twice as large)
  – special-purpose phrase indexes
  – combined version with inverted files, additional space overhead is only 26%
![Page 5: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/5.jpg)
• inverted list - no practical alternatives
• phrase queries
• If the inverted lists involve common words, phrase search becomes very slow
• Computer architecture: "make the common case fast", a fundamental law called Amdahl's Law
![Page 6: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/6.jpg)
PROPERTIES OF QUERIES
• gather large numbers of queries and see how users are choosing to express their information needs
  – 8.3% were phrase queries “xx oo”
  – 41% of the remaining queries matched a phrase
  – 8.4% included one of {the, to, of}
  – 14.4% included one of the top-20 common terms
• Structural common terms
• common but played an important role
• back to stopping issues
  – originally, the 122,438 queries matched 309×10^6 documents
  – stopping 3 words raises this to 390×10^6, stopping 20 words to 490×10^6, and stopping 254 words to 1693×10^6
• median number of words – 2; average 2.
• 34% have 3 words or more; 1.3% have 6 words or more
• 0.4% of phrase queries have {the, to, of} at the end
  – no queries of 4+ words terminate with a common term
  – in a short query ending in a common term, the other terms are usually common too
![Page 7: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/7.jpg)
Phrase query evaluation Test Data
![Page 8: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/8.jpg)
inverted index
• no practical alternatives
• term indexing, 2-level, a list of postings
  – document identifier
  – in-document frequency
  – and a list of offsets
$\{d, f_{d,t}, [o_1, \ldots, o_{f_{d,t}}]\}$
• stopping
• complete phrase indexes
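The posting layout on this slide (document identifier, in-document frequency, list of offsets) can be illustrated with a small in-memory sketch. This is only an illustration under simplifying assumptions: whitespace tokenization, word offsets counted from 0, and no compression or on-disk layout. The function name `build_inverted_index` and the sample documents are hypothetical, not taken from the paper.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to postings of the form (d, f_dt, [o_1, ..., o_f_dt])."""
    # term -> doc id -> word offsets of the term in that document
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for offset, term in enumerate(text.lower().split()):
            positions[term][doc_id].append(offset)
    # Flatten to the posting triple shown on the slide.
    return {
        term: [(d, len(offs), offs) for d, offs in sorted(by_doc.items())]
        for term, by_doc in positions.items()
    }

sample_docs = ["the cat sat on the mat", "the dog sat"]
inverted = build_inverted_index(sample_docs)
print(inverted["the"])
```

A common word such as "the" ends up with a long posting list even in this toy collection, which is the cost the later slides try to avoid.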
![Page 9: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/9.jpg)
![Page 10: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/10.jpg)
Sorted Phrase Algorithm
• start from a superset of candidate matches that is progressively pruned; a query of n terms needs n fetching and n-1 merging steps
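The fetch-then-merge idea can be sketched as follows. This is a minimal reading of the slide, assuming the lists are processed rarest term first and that the index is a plain dictionary of term -> {document: offsets}; the function name `phrase_query` and the toy index are hypothetical.

```python
def phrase_query(index, phrase):
    """Return {doc: [start offsets]} for documents containing the phrase."""
    terms = phrase.lower().split()
    if not terms:
        return {}
    # n fetching steps: one list per query term, remembering its position.
    fetched = []
    for pos, term in enumerate(terms):
        if term not in index:
            return {}
        fetched.append((pos, index[term]))
    # Process the rarest term first so the candidate set starts small.
    fetched.sort(key=lambda item: len(item[1]))
    first_pos, first_list = fetched[0]
    # Candidates are possible phrase start offsets (a superset of the answer).
    candidates = {
        d: {o - first_pos for o in offs if o >= first_pos}
        for d, offs in first_list.items()
    }
    # n-1 merging steps: each one prunes the candidates.
    for pos, postings in fetched[1:]:
        pruned = {}
        for d, starts in candidates.items():
            offs = set(postings.get(d, ()))
            keep = {s for s in starts if s + pos in offs}
            if keep:
                pruned[d] = keep
        candidates = pruned
    return {d: sorted(s) for d, s in candidates.items()}

toy_index = {
    "the": {0: [0, 4], 1: [0]}, "cat": {0: [1]}, "sat": {0: [2], 1: [2]},
    "on": {0: [3]}, "mat": {0: [5]}, "dog": {1: [1]},
}
print(phrase_query(toy_index, "the cat sat"))
```

Starting from the rarest term keeps the intermediate candidate set small, but every list, including the long list of a common word, still has to be fetched.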
![Page 11: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/11.jpg)
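Considerations
• Put more additional information into the inverted lists; letting the CPU decode it is fine
• In the past, CPU cycles were the more precious resource
• Today, disk access needs to be more efficient, so once the useful information has been read in one go, the CPU can do the rest quickly
• Where is the new tradeoff?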
![Page 12: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/12.jpg)
Phrase Indexes
• Partial Phrase Indexes
  – phrases that were frequently searched in the past can be used as index entries
![Page 13: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/13.jpg)
Nextword Indexes
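The figure for this slide is not part of the transcript. As a rough sketch of the structure: a nextword index keeps, for each first word, the words that follow it, and for each such pair the positions where the pair occurs. The code below assumes whitespace tokenization and an in-memory dictionary; the names `build_nextword_index` and `lookup_pair` are hypothetical.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """firstword -> nextword -> {doc id: [offsets of the firstword]}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for offset in range(len(words) - 1):
            index[words[offset]][words[offset + 1]][doc_id].append(offset)
    return index

def lookup_pair(index, first, nxt):
    """Fetch the positions of the word pair (first, nxt) in one lookup."""
    return {d: list(offs) for d, offs in index.get(first, {}).get(nxt, {}).items()}

pair_docs = ["the cat sat on the mat", "the dog sat"]
nextword = build_nextword_index(pair_docs)
print(lookup_pair(nextword, "the", "cat"))
```

A two-word phrase is answered by a single, usually short, list instead of two term lists, at the price of storing one list per distinct word pair.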
![Page 14: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/14.jpg)
Combined Inverted and Nextword
![Page 15: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/15.jpg)
• Combined Inverted and Nextword Indexes
• Combined Inverted and Phrase Indexes
• Three-Way Index Combination
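The slide only names the combinations. One plausible way to picture the combined inverted and nextword scheme is a planner that sends pairs starting with a common word to the nextword index and every term left uncovered to the ordinary inverted index. This is a simplified sketch, not the paper's exact procedure; the function `plan_lookups` and the set of common first words are assumptions for illustration.

```python
def plan_lookups(terms, common_firstwords):
    """Decide, per query term, which index would serve it."""
    plan = []
    covered = set()
    # A common word followed by another word is served as a pair.
    for i in range(len(terms) - 1):
        if terms[i] in common_firstwords:
            plan.append(("nextword", i, terms[i], terms[i + 1]))
            covered.update((i, i + 1))
    # Terms not covered by any pair fall back to the inverted index.
    for i, term in enumerate(terms):
        if i not in covered:
            plan.append(("inverted", i, term))
    return plan

common = {"the", "to", "of"}
print(plan_lookups(["the", "lord", "of", "the", "rings"], common))
print(plan_lookups(["new", "york", "times"], common))
```

Only a common word in the final position still needs its long inverted list, and the query statistics on the earlier slide suggest that phrase queries ending in a common term are rare.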
![Page 16: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/16.jpg)
![Page 17: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/17.jpg)
![Page 18: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/18.jpg)
ACM Computing Surveys (CSUR)
Volume 36, Issue 1 (March 2004)
• Advances in dataflow programming languages
  Wesley M. Johnston, J. R. Paul Hanna, Richard J. Millar; Pages: 1–34
• Image Retrieval from the World Wide Web: Issues, Techniques, and Systems
  M. L. Kherfi, D. Ziou, A. Bernardi; Pages: 35–67
• Line drawing, leap years, and Euclid
  Mitchell A. Harris, Edward M. Reingold; Pages: 68–80
![Page 19: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/19.jpg)
ACM Computing Surveys (CSUR)
Volume 35, Issue 4 (December 2003)
• An analysis of XML database solutions for the management of MPEG-7 media descriptions
  Utz Westermann, Wolfgang Klas; Pages: 331–373
• A survey of Web cache replacement strategies
  Stefan Podlipnig, Laszlo Böszörmenyi; Pages: 374–398
• Face recognition: A literature survey
  W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld; Pages: 399–458
![Page 20: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/20.jpg)
Intelligent Systems, IEEE
Volume: 19, Issue: 4, Year: July-Aug. 2004
• Ontology versioning in an ontology management framework
  Noy, N.F.; Musen, M.A.; Page(s): 6–13
• Guest Editors' Introduction: Semantic Web Services
  Payne, T.; Lassila, O.; Page(s): 14–15
• Automatically composed workflows for grid environments
• ODE SWS: a framework for designing and composing semantic Web services
• KAoS policy management for semantic Web services
• Filtering and selecting semantic Web services with interactive composition techniques
• Authorization and privacy for semantic Web services
• Value Webs: using ontologies to bundle real-world services
…
![Page 21: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/21.jpg)
ACM Transactions on Information Systems (TOIS)
Volume 22, Issue 3 (July 2004)
• Relevance models to help estimate document and query parameters
  David Bodoff; Pages: 357–380
• Efficient mining of both positive and negative association rules
  Xindong Wu, Chengqi Zhang, Shichao Zhang; Pages: 381–405
• Trustworthy 100-year digital objects: Evidence after every witness is dead
  Henry M. Gladney; Pages: 406–436
• PocketLens: Toward a personal recommender system
  Bradley N. Miller, Joseph A. Konstan, John Riedl; Pages: 437–476
• Distributed content-based visual information retrieval system on peer-to-peer networks
  Irwin King, Cheuk Hang Ng, Ka Cheung Sia; Pages: 477–501
![Page 22: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/22.jpg)
• Cleaning takes about 15 to 20 minutes per person
• Plan: two people per week, assigned at random by computer
![Page 23: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/23.jpg)
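Using Search Engines to Assist Typo Detection
• 姍姍來遲: Google 3130, Openfind 7100
• 珊珊來遲: Google 2410, Openfind 737
• Features of the classifier
  – Naïve: simply use the page count directly
  – Complex: check whether the top returned URLs and summaries already contain other typos
• Successfully makes use of local context information
• Application
  – correcting typos
  – building a database of miswritten characters
  – finding out which websites misspell words, and from that judging a site as not reliable
• Issues:
  – How to detect the typo candidates?
  – How to get the initial gold standard for evaluation?
• Academic websites should use words more precisely ... trust them to begin with
  – some variants are in general use and are not errors
  – can the approach be extended to other languages?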
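The naïve feature on this slide (using the page count directly) can be sketched with the counts quoted above. No search engine is queried here: the counts are hard-coded from the slide, and summing across engines is an assumption made only for illustration. The function name `pick_by_page_count` is hypothetical.

```python
def pick_by_page_count(page_counts):
    """Prefer the spelling variant with the largest total page count."""
    totals = {variant: sum(counts.values()) for variant, counts in page_counts.items()}
    best = max(totals, key=totals.get)
    return best, totals

# Page counts as reported on the slide (not fetched live).
slide_counts = {
    "姍姍來遲": {"google": 3130, "openfind": 7100},
    "珊珊來遲": {"google": 2410, "openfind": 737},
}
preferred, totals = pick_by_page_count(slide_counts)
print(preferred, totals)
```

As the slide notes, some variants are in common use rather than wrong, so a raw count comparison of this kind is at most a first-pass signal.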
![Page 24: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/24.jpg)
(LDC) Chinese Gigaword
  – Authors: David Graff, Ke Chen
  – Data Source(s): newswire
  – Project(s): EARS, TIDES
  – Distribution: 1 DVD(s).
  – Membership Year(s): 2003
  – Non-member Price: US$2500
  – Central News Agency of Taiwan (cna)
  – Xinhua News Agency of Beijing
• Mandarin Chinese News Text
• TREC Mandarin
• TDT Multilanguage Text corpora
• UTF-8 character encoding

| Source | #Files | Gzip-MB | Totl-MB | K-wrds | #DOCs |
|---|---|---|---|---|---|
| CNA | 144 | 1018 | 2606 | 735499 | 1649492 |
| XIE | 142 | 548 | 1331 | 382881 | 817348 |
| TOTAL | 286 | 1566 | 3937 | 1118380 | 2466840 |
![Page 25: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/25.jpg)
• The Stanford NLP group includes:
  – Professors
    • Chris Manning, Computer Science and Linguistics
    • Dan Jurafsky, Linguistics
  – 2 linguistics postdocs (1 Chinese)
  – 8 PhD students (3 visiting from other schools)
  – 4 master's students, 1 assistant, 10 graduates
![Page 26: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/26.jpg)
Topics
• Computational Semantics
  – Named Entity Recognition (NER) and Information Extraction (IE)
    • The Stanford Edinburgh Entity Recognition (SEER) Project
  – Shallow Semantic Parsing
  – Question Answering (QA)
  – Knowledge Representation from Text
  – The NLKR Project: solving natural-language logic puzzles
  – Thesaurus Induction
  – Word Sense Disambiguation (WSD)
• Parsing & Tagging
  – Probabilistic Parsing, Part-of-speech (POS) tagging
• Multilingual NLP
  – Chinese NLP, Arabic NLP, German NLP
• Unsupervised Induction of Linguistic Structure
  – Grammar Induction
  – Morphology & Phonology Induction
  – Thesaurus Induction
• Other
  – Personalized PageRank algorithms
  – Clustering Models
  – Computational Lexicography
  – Text Categorization
  – Discriminative Models
  – NER
• Topic Detecting, Summary
  – QA (林川傑)
  – X
  – 林其青
  – 林其青
• 楊宸彥, NTU tagger
• Multilingual IR
• X
• Transliteration
• IR with Image, Media, Web
• Bio info
• Opinion
![Page 27: Fast Phrase Querying With Combined Indexes](https://reader035.vdocuments.us/reader035/viewer/2022062314/56814524550346895db1eb13/html5/thumbnails/27.jpg)
• The Stanford Parser
  – Java implementations of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser
• The Stanford Classifier
  – A Java implementation of conditional loglinear model classification (a.k.a. maximum entropy models)
• The Stanford POS Tagger
  – A Java implementation of a maximum-entropy part-of-speech (POS) tagger
• QuASI Software