Fast Phrase Querying With Combined Indexes

HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE

School of Computer Science & Information Technology, RMIT University, Australia

TOIS, Volume 22, Issue 4 (October 2004)

http://www.cs.rmit.edu.au/

2003 SCI Journal

Abbreviated Journal Title   2003 Total Cites   Impact Factor   Immediacy Index   2003 Articles   Cited Half-Life
ACM COMPUT SURV             1347               7.5             0.154             13              7.4
IEEE INTELL SYST            784                3.725           0.386             44              3.3
ACM T INFORM SYST           843                3.533           0.667             15              7.1
COMPUT LINGUIST             513                1.515           0.071             14              8.6
J AM SOC INF SCI TEC        2060               1.473           0.447             103             6.7

ACM Transactions on Information Systems (TOIS)

Volume 22, Issue 4 (October 2004)

• Qualitative decision making in adaptive presentation of structured information
  Ronen I. Brafman, Carmel Domshlak, Solomon E. Shimony. Pages: 503 – 539

• Analysis of lexical signatures for improving information persistence on the World Wide Web
  Seung-Taek Park, David M. Pennock, C. Lee Giles, Robert Krovetz. Pages: 540 – 572

• Fast phrase querying with combined indexes
  Hugh E. Williams, Justin Zobel, Dirk Bahle. Pages: 573 – 594

• Information systems interoperability: What lies beneath?
  Jinsoo Park, Sudha Ram. Pages: 595 – 632

http://portal.acm.org/citation.cfm?id=1028099.1028100&coll=ACM&dl=ACM&idx=1028099&part=periodical&WantType=periodical&title=ACM%20Transactions%20on%20Information%20Systems%20%28TOIS%29&CFID=30755455&CFTOKEN=38482782






Abstract

• Search engines need to evaluate queries extremely fast, with low disk overheads

• A significant proportion of queries are phrases, indicating that some of the query terms must be ordered and adjacent
– nextword indexes (indexes are twice as large)
– special-purpose phrase indexes
– a combined version with inverted files, where the additional space overhead is only 26%

• inverted list - no practical alternatives

• phrase queries

• With an inverted list, phrase search becomes very slow when the phrase contains a common word

• Computer architecture: "make the common case fast", a fundamental law called Amdahl's Law
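For reference, Amdahl's Law is usually written as below. The symbols are our own choice and do not come from the slides: p is the fraction of the work that the improvement applies to, and s is the speedup achieved on that fraction.

```latex
\[
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}
\]
```

Read against the bullets above, the suggestion seems to be that queries containing common words are the frequent, slow case, so speeding them up is where the overall gain is largest.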

PROPERTIES OF QUERIES

• Gather large numbers of queries and see how users choose to express their information needs
– 8.3% were phrase queries ("xx oo")
– 41% of the remaining queries matched a phrase
– 8.4% included one of {the, to, of}
– 14.4% included one of the top-20 common terms

• Structural common terms
• common but played an important role

• Back to the stopping issue
– originally, the 122,438 queries matched 309 × 10^6 documents
– with 3 stopped terms they match 390 × 10^6; with 20 stopped terms, 490 × 10^6; with 254 stopped terms, 1693 × 10^6

• median number of words: 2; average 2.
• 34% have 3 words or more; 1.3% have 6 words or more
• 0.4% of phrase queries have {the, to, of} at the end
– no queries of four or more words end with a common term
– in a short query ending in a common term, the other terms are usually common too

Phrase query evaluation Test Data

inverted index

• no practical alternatives

• term indexing, 2-level, a list of postings
– document identifier
– in-document frequency
– and a list of offsets

$\{d, f_{d,t}, [o_1, \ldots, o_{f_{d,t}}]\}$

• stopping

• complete phrase indexes
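A minimal sketch of the posting layout described above, assuming whitespace tokenization and an in-memory dictionary. The function name `build_inverted_index` and the sample documents are hypothetical and are not taken from the paper. Each term maps to a list of postings (d, f_dt, offsets), where f_dt is simply the length of the offset list.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of postings (d, f_dt, [o_1, ..., o_f_dt])."""
    # term -> doc_id -> list of word offsets (illustrative in-memory layout)
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for offset, term in enumerate(text.lower().split()):
            positions[term][doc_id].append(offset)

    index = {}
    for term, per_doc in positions.items():
        # The in-document frequency f_dt is the number of stored offsets.
        index[term] = [(d, len(offs), offs) for d, offs in sorted(per_doc.items())]
    return index


# Hypothetical sample documents, for illustration only.
docs = ["the cat sat on the mat", "the dog sat"]
index = build_inverted_index(docs)
print(index["the"])  # [(0, 2, [0, 4]), (1, 1, [0])]
```

A real index would compress these lists and keep them on disk. The sketch only shows what each posting holds.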

Sorted Phrase Algorithm

• starts from a superset of candidates that is progressively pruned; an n-term phrase needs n fetching and n-1 merging steps
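The fetch-and-merge pattern can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the index is a hypothetical in-memory mapping `term -> {doc_id: [offsets]}`, and terms are processed from rarest to most common by document count. The first fetch produces the candidate superset, and each later fetch is followed by one merge that prunes it.

```python
def phrase_query(index, phrase):
    """Return sorted (doc_id, start_offset) pairs where the phrase occurs.

    `index` maps term -> {doc_id: [offsets]} (an assumed in-memory layout).
    """
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []

    # Process query terms from rarest to most common (fewest documents first).
    order = sorted(range(len(terms)), key=lambda i: len(index[terms[i]]))
    first = order[0]

    # Fetch 1: every occurrence of the rarest term implies a candidate start.
    candidates = {
        (d, o - first)
        for d, offs in index[terms[first]].items()
        for o in offs
        if o - first >= 0
    }

    # Fetches 2..n, each followed by a merge that prunes the candidate set.
    for i in order[1:]:
        postings = index[terms[i]]
        candidates = {
            (d, start)
            for d, start in candidates
            if (start + i) in postings.get(d, ())
        }
        if not candidates:
            break
    return sorted(candidates)


# Hypothetical toy index, for illustration only.
index = {
    "new": {0: [0, 5], 1: [2]},
    "york": {0: [1], 1: [7]},
    "times": {0: [2], 2: [0]},
}
print(phrase_query(index, "new york times"))  # [(0, 0)]
```

For an n-term phrase the loop performs n list fetches and n-1 merges, matching the count on the slide.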

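Considerations
• Add some additional information to the inverted lists; leaving the CPU to decode it is acceptable
• In the past, CPU cycles were the more precious resource
• Today, disk access needs to be more efficient, so once the useful information has been read in a single pass, the CPU can do the rest quickly

• Where does the new tradeoff lie?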

Phrase Indexes
• Partial Phrase Indexes
– phrases that were frequently searched in the past can be used as index entries

Nextword Indexes
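The slide gives only the heading here. As a rough illustration of the general idea, a nextword index maps each first word to the words that follow it, together with where each pair occurs. The sketch below assumes an in-memory dictionary layout of our own, not the paper's on-disk format, and the names `build_nextword_index` and `lookup_pair` are hypothetical.

```python
from collections import defaultdict


def build_nextword_index(docs):
    """firstword -> nextword -> list of (doc_id, offset of firstword)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for offset in range(len(words) - 1):
            index[words[offset]][words[offset + 1]].append((doc_id, offset))
    return index


def lookup_pair(index, first, nxt):
    """Return occurrences of the two-word phrase `first nxt` (empty if absent)."""
    return list(index.get(first, {}).get(nxt, []))


# Hypothetical sample documents, for illustration only.
docs = ["the who played in the city", "who is in the band"]
nextword = build_nextword_index(docs)
print(lookup_pair(nextword, "the", "who"))  # [(0, 0)]
print(lookup_pair(nextword, "in", "the"))   # [(0, 3), (1, 2)]
```

A two-word phrase is answered with a single lookup even when the first word is very common. The cost is one list per word pair, which fits the abstract's remark that such indexes are larger than inverted files.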

Combined Inverted and Nextword

• Combined Inverted and Nextword Indexes

• Combined Inverted and Phrase Indexes

• Three-Way Index Combination

ACM Computing Surveys (CSUR)

Volume 36, Issue 1 (March 2004)

• Advances in dataflow programming languages
  Wesley M. Johnston, J. R. Paul Hanna, Richard J. Millar. Pages: 1 – 34

• Image Retrieval from the World Wide Web: Issues, Techniques, and Systems
  M. L. Kherfi, D. Ziou, A. Bernardi. Pages: 35 – 67

• Line drawing, leap years, and Euclid
  Mitchell A. Harris, Edward M. Reingold. Pages: 68 – 80

http://portal.acm.org/citation.cfm?id=1013208.1013209&coll=ACM&dl=ACM&idx=1013208&part=periodical&WantType=periodical&title=ACM%20Computing%20Surveys%20%28CSUR%29&CFID=30753954&CFTOKEN=51002656




ACM Computing Surveys (CSUR)

Volume 35, Issue 4 (December 2003)

• An analysis of XML database solutions for the management of MPEG-7 media descriptions
  Utz Westermann, Wolfgang Klas. Pages: 331 – 373

• A survey of Web cache replacement strategies
  Stefan Podlipnig, Laszlo Böszörmenyi. Pages: 374 – 398

• Face recognition: A literature survey
  W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld. Pages: 399 – 458

http://portal.acm.org/citation.cfm?id=954339.954340&dl=ACM&dl=ACM&idx=954339&part=periodical&WantType=periodical&title=ACM%20Computing%20Surveys%20%28CSUR%29&CFID=30753954&CFTOKEN=51002656




Intelligent Systems, IEEE

Volume: 19, Issue: 4, Year: July-Aug. 2004

• Ontology versioning in an ontology management framework
  Noy, N.F.; Musen, M.A. Page(s): 6-13

• Guest Editors' Introduction: Semantic Web Services
  Payne, T.; Lassila, O. Page(s): 14-15

• Automatically composed workflows for grid environments
• ODE SWS: a framework for designing and composing semantic Web services
• KAoS policy management for semantic Web services
• Filtering and selecting semantic Web services with interactive composition techniques
• Authorization and privacy for semantic Web services
• Value Webs: using ontologies to bundle real-world services…

ACM Transactions on Information Systems (TOIS)

Volume 22, Issue 3 (July 2004)

• Relevance models to help estimate document and query parameters
  David Bodoff. Pages: 357 – 380

• Efficient mining of both positive and negative association rules
  Xindong Wu, Chengqi Zhang, Shichao Zhang. Pages: 381 – 405

• Trustworthy 100-year digital objects: Evidence after every witness is dead
  Henry M. Gladney. Pages: 406 – 436

• PocketLens: Toward a personal recommender system
  Bradley N. Miller, Joseph A. Konstan, John Riedl. Pages: 437 – 476

• Distributed content-based visual information retrieval system on peer-to-peer networks
  Irwin King, Cheuk Hang Ng, Ka Cheung Sia. Pages: 477 – 501








• Each person skims for about 15~20 minutes

• Plan: two people per week, scheduled randomly by computer

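An Application of Search-Engine-Assisted Typo Detection
• 姍姍來遲: google 3130, openfind 7100
• 珊珊來遲: google 2410, openfind 737

• Features of the classifier
– Naïve: simply use the page count directly
– Complex: check whether the top returned URLs and summaries themselves contain other typos

• Successfully makes use of local context information

• Application
– correcting typos
– building a database of typos
– finding out which websites misspell words, and from that judging a website to be unreliable

• Issues:
– How to detect the typo candidates?
– How to get the initial gold standard for evaluation?

• Academic websites should use words more precisely ... trust them first
– some variants are in common use and are not errors
– can this be extended to other languages?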
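The naïve page-count feature listed under "Features of the classifier" can be sketched as below, using the counts quoted on the slide. Summing the counts across the two engines is our own assumption, and the function name `pick_likely_spelling` is hypothetical. The slide does not say how the counts are combined.

```python
def pick_likely_spelling(page_counts):
    """Naive feature: return the variant with the largest total page count."""
    # Assumption: counts from different engines are simply added together.
    totals = {variant: sum(counts.values()) for variant, counts in page_counts.items()}
    best = max(totals, key=totals.get)
    return best, totals


# Page counts as quoted on the slide.
page_counts = {
    "姍姍來遲": {"google": 3130, "openfind": 7100},
    "珊珊來遲": {"google": 2410, "openfind": 737},
}
best, totals = pick_likely_spelling(page_counts)
print(best, totals)
```

The Google counts for the two variants are close (3130 versus 2410). That closeness may be why the slide also lists a more complex feature based on the returned URLs and summaries.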

(LDC) Chinese Gigaword
– Authors: David Graff, Ke Chen
– Data Source(s): newswire
– Project(s): EARS, TIDES
– Distribution: 1 DVD(s)
– Membership Year(s): 2003
– Non-member Price: US$2500
– Central News Agency of Taiwan (cna)
– Xinhua News Agency of Beijing

• Mandarin Chinese News Text
• TREC Mandarin
• TDT Multilanguage Text corpora

• UTF-8 character encoding

Source   #Files   Gzip-MB   Totl-MB   K-wrds    #DOCs
CNA      144      1018      2606      735499    1649492
XIE      142      548       1331      382881    817348
TOTAL    286      1566      3937      1118380   2466840

• The Stanford NLP group includes:
– Professors
  • Chris Manning, Computer Science and Linguistics
  • Dan Jurafsky, Linguistics
– 2 linguistics postdocs (1 Chinese)
– 8 PhD students (3 visiting from other schools)
– 4 master's students, 1 assistant, 10 graduates

http://www-nlp.stanford.edu/~manning/

http://www.stanford.edu/~jurafsky/


http://nlp.stanford.edu/~manning/


Topics

• Computational Semantics
– Named Entity Recognition (NER) and Information Extraction (IE)
  • The Stanford Edinburgh Entity Recognition (SEER) Project
– Shallow Semantic Parsing
– Question Answering (QA)
– Knowledge Representation from Text
– The NLKR Project: solving natural-language logic puzzles
– Thesaurus Induction
– Word Sense Disambiguation (WSD)

• Parsing & Tagging
– Probabilistic Parsing, Part-of-speech (POS) tagging

• Multilingual NLP
– Chinese NLP, Arabic NLP, German NLP

• Unsupervised Induction of Linguistic Structure
– Grammar Induction
– Morphology & Phonology Induction
– Thesaurus Induction

• Other
– Personalized PageRank algorithms
– Clustering Models
– Computational Lexicography
– Text Categorization
– Discriminative Models

– NER

• Topic Detecting, Summary– QA ( 林川傑 )

– X– 林其青– 林其青

• 楊宸彥 , NTU tagger

• Multilingual IR

• X• Transliteration

• IR with Image, Media, Web• Bio info• Opinion

http://www.ltg.ed.ac.uk/seer/

http://nlp.stanford.edu/nlkr/

• The Stanford Parser
– Java implementations of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser

• The Stanford Classifier
– A Java implementation of conditional loglinear model classification (a.k.a. maximum entropy models)

• The Stanford POS Tagger
– A Java implementation of a maximum-entropy part-of-speech (POS) tagger

• QuASI Software
