fast phrase querying with combined indexes

Fast Phrase Querying With Combined Indexes HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE School of Computer Science & Information Technology RMIT University, Australia TOIS, Volume 22 , Issue 4 (October 2004)


Page 1: Fast Phrase Querying With Combined Indexes

Fast Phrase Querying With Combined Indexes

HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE

School of Computer Science & Information Technology, RMIT University, Australia

TOIS, Volume 22, Issue 4 (October 2004)

Page 2: Fast Phrase Querying With Combined Indexes

2003 SCI Journal

Abbreviated Journal Title   2003 Total Cites   Impact Factor   Immediacy Index   2003 Articles   Cited Half-Life
ACM COMPUT SURV             1347               7.5             0.154             13              7.4
IEEE INTELL SYST            784                3.725           0.386             44              3.3
ACM T INFORM SYST           843                3.533           0.667             15              7.1
COMPUT LINGUIST             513                1.515           0.071             14              8.6
J AM SOC INF SCI TEC        2060               1.473           0.447             103             6.7

Page 3: Fast Phrase Querying With Combined Indexes

ACM Transactions on Information Systems (TOIS)

Volume 22, Issue 4 (October 2004)

• Qualitative decision making in adaptive presentation of structured information
  Ronen I. Brafman, Carmel Domshlak, Solomon E. Shimony
  Pages: 503 – 539

• Analysis of lexical signatures for improving information persistence on the World Wide Web
  Seung-Taek Park, David M. Pennock, C. Lee Giles, Robert Krovetz
  Pages: 540 – 572

• Fast phrase querying with combined indexes
  Hugh E. Williams, Justin Zobel, Dirk Bahle
  Pages: 573 – 594

• Information systems interoperability: What lies beneath?
  Jinsoo Park, Sudha Ram
  Pages: 595 – 632

Page 4: Fast Phrase Querying With Combined Indexes

Abstract

• Search engines need to evaluate queries extremely fast (low disk overheads)

• A significant proportion of queries are phrases, indicating that some of the query terms must be ordered and adjacent
– nextword indexes (indexes are twice as large)
– special-purpose phrase indexes
– combined version with inverted files, additional space overhead is only 26%

Page 5: Fast Phrase Querying With Combined Indexes

• inverted list - no practical alternatives

• phrase queries

• If an inverted list includes a common word, phrase search over it becomes very slow

• Computer architecture: make the common case fast; the fundamental law is called Amdahl's Law
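As a reminder, Amdahl's Law in its usual textbook form can be written as below. The symbols f (the fraction of running time that is improved) and s (the speedup of that fraction) are notation chosen here for illustration, not taken from the slides.

```latex
\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}
\]
```

For example, with f = 0.5 and s = 2 the overall speedup is 1 / (0.5 + 0.25) ≈ 1.33, which is why improving the common case pays off most.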

Page 6: Fast Phrase Querying With Combined Indexes

PROPERTIES OF QUERIES

• gather large numbers of queries and see how users choose to express their information needs
– 8.3% were phrase queries, written in quotes as “xx oo”
– 41% of the remaining queries matched a phrase
– 8.4% included one of {the, to, of}
– 14.4% included one of the top-20 common terms

• Structural common terms: common, but they play an important role

• back to stopping issues
– originally, the 122,438 queries matched 309 × 10^6 documents
– with 3 words stopped they match 390 × 10^6; with 20 stopped, 490 × 10^6; with 254 stopped, 1693 × 10^6

• median number of words: 2; average 2.
• 34% have 3 words or more; 1.3% have 6 words or more
• 0.4% of phrase queries have {the, to, of} at the end
– no queries of 4 or more words end with a common term
– in a short query ending in a common term, the other terms are usually common too

Page 7: Fast Phrase Querying With Combined Indexes

Phrase query evaluation Test Data

Page 8: Fast Phrase Querying With Combined Indexes

inverted index

• no practical alternatives

• term indexing, 2-level, a list of postings
– document identifier
– in-document frequency
– and a list of offsets

$\{d, f_{d,t}, [o_1, \ldots, o_{f_{d,t}}]\}$

• stopping

• complete phrase indexes

Page 9: Fast Phrase Querying With Combined Indexes
Page 10: Fast Phrase Querying With Combined Indexes

Sorted Phrase Algorithm

• start from a superset of candidate matches and prune it term by term; a query of n terms needs n list fetches and n-1 merge steps
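The pruning strategy can be sketched as follows, assuming an in-memory positional index and processing terms from rarest to most common. The names `build_index` and `phrase_query` are hypothetical, and a real system would read each list from disk rather than from a dictionary.

```python
from collections import defaultdict


def build_index(docs):
    """Positional inverted index: term -> {doc_id: [offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for offset, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index


def phrase_query(index, phrase):
    """Return {doc_id: [phrase start offsets]} for an exact phrase."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Rarest term first, so the candidate superset starts small.
    # (A real vocabulary stores these counts without fetching the lists.)
    order = sorted(range(len(terms)), key=lambda i: len(index[terms[i]]))
    first = order[0]
    # Fetch 1: candidate phrase starts implied by the rarest term.
    candidates = {
        d: {o - first for o in offs if o - first >= 0}
        for d, offs in index[terms[first]].items()
    }
    # Fetches 2..n, each followed by one merge that prunes candidates.
    for i in order[1:]:
        postings = index[terms[i]]
        merged = {}
        for d, starts in candidates.items():
            if d in postings:
                keep = starts & {o - i for o in postings[d]}
                if keep:
                    merged[d] = keep
        candidates = merged
    return {d: sorted(s) for d, s in candidates.items() if s}
```

For a query of n terms the loop performs n-1 merges after the initial fetch, matching the count on the slide.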

Page 11: Fast Phrase Querying With Combined Indexes

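Considerations

• Add more additional information to the inverted lists; letting the CPU decode it is acceptable
• In the past, CPU cycles were the more precious resource
• Today disk access needs to be more efficient, so once the useful information has been read in a single pass, the CPU can do the rest quickly

• Where does the new tradeoff lie?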

Page 12: Fast Phrase Querying With Combined Indexes

Phrase Indexes

• Partial Phrase Indexes
– phrases that were frequently searched in the past can be used as index entries
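A minimal sketch of that idea, assuming a query log given as a list of phrase strings and whitespace tokenization. The function `build_partial_phrase_index` and the parameter `top_k` are hypothetical names introduced here; how many phrases to index is a tuning choice this sketch does not settle.

```python
from collections import Counter


def build_partial_phrase_index(docs, query_log, top_k):
    """Index only the top_k most frequent phrases from a query log.

    Returns {phrase: {doc_id: [start offsets]}}.
    """
    counts = Counter(q.lower() for q in query_log)
    popular = [phrase for phrase, _ in counts.most_common(top_k)]
    phrase_index = {}
    for phrase in popular:
        words = phrase.split()
        postings = {}
        for doc_id, text in enumerate(docs):
            tokens = text.lower().split()
            # Slide a window of len(words) tokens over the document.
            starts = [
                i
                for i in range(len(tokens) - len(words) + 1)
                if tokens[i:i + len(words)] == words
            ]
            if starts:
                postings[doc_id] = starts
        phrase_index[phrase] = postings
    return phrase_index
```

A query that exactly matches an indexed phrase can then be answered with a single lookup; any other query falls back to another index.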

Page 13: Fast Phrase Querying With Combined Indexes

Nextword Indexes
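A minimal sketch of the nextword data structure, under the assumption that each firstword maps to the nextwords that follow it and each (firstword, nextword) pair has its own positional postings. The name `build_nextword_index` is hypothetical.

```python
from collections import defaultdict


def build_nextword_index(docs):
    """firstword -> nextword -> {doc_id: [offsets of the firstword]}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        # Each adjacent pair of tokens becomes one indexed word pair.
        for offset, (first, nxt) in enumerate(zip(tokens, tokens[1:])):
            index[first][nxt].setdefault(doc_id, []).append(offset)
    return index
```

Because every adjacent word pair is stored, a two-word phrase is answered with one lookup, at the cost of a noticeably larger index.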

Page 14: Fast Phrase Querying With Combined Indexes

Combined Inverted and Nextword

Page 15: Fast Phrase Querying With Combined Indexes

• Combined Inverted and Nextword Indexes

• Combined Inverted and Phrase Indexes

• Three-Way Index Combination
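A simplified sketch of the first combination, inverted plus nextword, assuming nextword entries are kept only for a chosen set of common firstwords and every other term is resolved through the inverted index. All names here (`build_combined`, `plan`, `phrase_query`, the `common` set) are hypothetical. This illustrates the idea only; it is not the authors' implementation.

```python
from collections import defaultdict


def build_combined(docs, common):
    """Inverted index for all terms, nextword entries for common firstwords."""
    inverted = defaultdict(dict)   # term -> {doc_id: [offsets]}
    nextword = defaultdict(dict)   # (first, next) -> {doc_id: [offsets of first]}
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        for i, term in enumerate(tokens):
            inverted[term].setdefault(doc_id, []).append(i)
            if term in common and i + 1 < len(tokens):
                pair = (term, tokens[i + 1])
                nextword[pair].setdefault(doc_id, []).append(i)
    return inverted, nextword


def plan(terms, common):
    """Cover the query with lookups: (query offset, index kind, key)."""
    lookups, i = [], 0
    while i < len(terms):
        if terms[i] in common and i + 1 < len(terms):
            lookups.append((i, "nextword", (terms[i], terms[i + 1])))
            i += 2  # the pair covers two query terms
        else:
            lookups.append((i, "inverted", terms[i]))
            i += 1
    return lookups


def phrase_query(inverted, nextword, common, phrase):
    """Return {doc_id: [phrase start offsets]}."""
    terms = phrase.lower().split()
    result = None
    for pos, kind, key in plan(terms, common):
        source = nextword if kind == "nextword" else inverted
        postings = source.get(key, {})
        # Shift offsets so every lookup votes for a phrase start position.
        starts = {d: {o - pos for o in offs} for d, offs in postings.items()}
        if result is None:
            result = starts
        else:
            shared = result.keys() & starts.keys()
            result = {d: result[d] & starts[d] for d in shared}
    return {d: sorted(s) for d, s in (result or {}).items() if s}
```

The point of the design is that the long inverted list of a common word such as "the" is never fetched when a much shorter word-pair list can be read instead.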

Page 16: Fast Phrase Querying With Combined Indexes
Page 17: Fast Phrase Querying With Combined Indexes
Page 20: Fast Phrase Querying With Combined Indexes

Intelligent Systems, IEEE, Volume: 19, Issue: 4, Year: July-Aug. 2004

• Ontology versioning in an ontology management framework
  Noy, N.F.; Musen, M.A.
  Page(s): 6-13

• Guest Editors' Introduction: Semantic Web Services
  Payne, T.; Lassila, O.
  Page(s): 14-15

• Automatically composed workflows for grid environments
• ODE SWS: a framework for designing and composing semantic Web services
• KAoS policy management for semantic Web services
• Filtering and selecting semantic Web services with interactive composition techniques
• Authorization and privacy for semantic Web services
• Value Webs: using ontologies to bundle real-world services
…

Page 21: Fast Phrase Querying With Combined Indexes

ACM Transactions on Information Systems (TOIS)

Volume 22, Issue 3 (July 2004)

• Relevance models to help estimate document and query parameters
  David Bodoff
  Pages: 357 – 380

• Efficient mining of both positive and negative association rules
  Xindong Wu, Chengqi Zhang, Shichao Zhang
  Pages: 381 – 405

• Trustworthy 100-year digital objects: Evidence after every witness is dead
  Henry M. Gladney
  Pages: 406 – 436

• PocketLens: Toward a personal recommender system
  Bradley N. Miller, Joseph A. Konstan, John Riedl
  Pages: 437 – 476

• Distributed content-based visual information retrieval system on peer-to-peer networks
  Irwin King, Cheuk Hang Ng, Ka Cheung Sia
  Pages: 477 – 501

Page 22: Fast Phrase Querying With Combined Indexes

• Each person scans for about 15 to 20 minutes

• Plan: two people per week, scheduled randomly by computer

Page 23: Fast Phrase Querying With Combined Indexes

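Using search engines to assist wrong-character (typo) detection

• 姍姍來遲: google 3130, openfind 7100
• 珊珊來遲: google 2410, openfind 737

• Features of the classifier
– Naïve: just use the page count directly
– Complex: check whether the top returned URLs and summaries themselves contain other wrong characters

• Successfully exploits local context information

• Application
– correcting wrong characters
– building a database of wrong characters
– finding out which websites misspell, and from that judging a website to be unreliable

• Issues:
– How to detect the wrong-character candidates?
– How to get the initial gold standard for evaluation?

• Academic websites should use words more precisely ... trust them first
– some variants are in common use and are not errors
– can this be extended to other languages?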
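The naive classifier can be sketched as picking, per search engine, the variant with the larger page count. The function name `pick_by_page_count` is hypothetical; the counts are the ones quoted on the slide, hard-coded here rather than fetched from any search API.

```python
def pick_by_page_count(counts):
    """counts: {variant: {engine: page_count}}.

    Returns {engine: variant with the highest page count on that engine}.
    """
    engines = sorted({e for per_engine in counts.values() for e in per_engine})
    return {
        engine: max(counts, key=lambda v: counts[v].get(engine, 0))
        for engine in engines
    }


counts = {
    "姍姍來遲": {"google": 3130, "openfind": 7100},
    "珊珊來遲": {"google": 2410, "openfind": 737},
}
print(pick_by_page_count(counts))
```

On these numbers both engines favor 姍姍來遲, although the margin on google is small, which is one motivation for the more complex features listed above.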

Page 24: Fast Phrase Querying With Combined Indexes

(LDC) Chinese Gigaword
– Authors: David Graff, Ke Chen
– Data Source(s): newswire
– Project(s): EARS, TIDES
– Distribution: 1 DVD(s).
– Membership Year(s): 2003
– Non-member Price: US$2500
– Central News Agency of Taiwan (cna)
– Xinhua News Agency of Beijing

• Mandarin Chinese News Text
• TREC Mandarin
• TDT Multilanguage Text corpora

• UTF-8 character encoding

Source   #Files   Gzip-MB   Totl-MB   K-wrds    #DOCs
CNA      144      1018      2606      735499    1649492
XIE      142      548       1331      382881    817348
TOTAL    286      1566      3937      1118380   2466840

Page 25: Fast Phrase Querying With Combined Indexes

• The Stanford NLP group includes:
– Professors
  • Chris Manning, Computer Science and Linguistics
  • Dan Jurafsky, Linguistics
– 2 linguistics postdocs (1 Chinese)
– 8 PhD students (3 visiting from other schools)
– 4 master's students, 1 assistant, 10 graduates

Page 26: Fast Phrase Querying With Combined Indexes

Topics

• Computational Semantics
– Named Entity Recognition (NER) and Information Extraction (IE)
  • The Stanford Edinburgh Entity Recognition (SEER) Project
– Shallow Semantic Parsing
– Question Answering (QA)
– Knowledge Representation from Text
– The NLKR Project: solving natural-language logic puzzles
– Thesaurus Induction
– Word Sense Disambiguation (WSD)

• Parsing & Tagging
– Probabilistic Parsing, Part-of-speech (POS) tagging

• Multilingual NLP
– Chinese NLP, Arabic NLP, German NLP

• Unsupervised Induction of Linguistic Structure
– Grammar Induction
– Morphology & Phonology Induction
– Thesaurus Induction

• Other
– Personalized PageRank algorithms
– Clustering Models
– Computational Lexicography
– Text Categorization
– Discriminative Models

– NER

• Topic Detecting, Summary
– QA (林川傑)
– X
– 林其青
– 林其青

• 楊宸彥, NTU tagger

• Multilingual IR

• X
• Transliteration

• IR with Image, Media, Web
• Bio info
• Opinion

Page 27: Fast Phrase Querying With Combined Indexes

• The Stanford Parser
– Java implementations of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser

• The Stanford Classifier
– A Java implementation of conditional loglinear model classification (a.k.a. maximum entropy models)

• The Stanford POS Tagger
– A Java implementation of a maximum-entropy part-of-speech (POS) tagger

• QuASI Software