overview of information retrieval (cs598-cxz advanced topics in ir presentation) jan. 18, 2005...
Overview of Information Retrieval
(CS598-CXZ Advanced Topics in IR Presentation)
Jan. 18, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
What is Information Retrieval (IR)?
•Narrow-sense:
– IR = Search Engine Technologies (IR = Google, library info system)
– IR = Text matching/classification
•Broad-sense: IR = Text Information Management
– General problem: how to manage text information?
– How to find useful information? (info. retrieval) (e.g., Google)
– How to organize information? (text classification) (e.g., automatically assign email to different folders)
– How to discover knowledge from text? (text mining) (e.g., discover correlation of events)
Why is IR Important?
•More and more online information in general (Information Overload)
•Many tasks rely on effective management and exploitation of information
•Textual information plays an important role in our lives
•Effective text management directly improves productivity
Elements of Text Info Management Technologies
[Diagram: Natural Language Content Analysis applied to Text supports Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, and Visualization. These are grouped under Information Access, Information Organization, and Knowledge Acquisition, and serve Retrieval Applications and Mining Applications.]
A Quick Tour of the State of the Art….
Component Technology 1: Natural Language Processing
What is NLP?
[The slide shows a sample string of Arabic text; its characters were garbled in extraction.]
How can a computer make sense out of this string?
- What are the basic units of meaning (words)?
- What is the meaning of each word?
- How are words related with each other?
- What is the “combined meaning” of words?
- What is the “meta-meaning”? (speech act)
- Handling a large chunk of text
- Making sense of everything
Levels of analysis: Morphology, Syntax, Semantics, Pragmatics, Discourse, Inference
An Example of NLP
Sentence: A dog is chasing a boy on the playground
Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun
Syntactic analysis (parsing): Noun Phrase (A dog), Complex Verb (is chasing), Noun Phrase (a boy), Prep Phrase (on the playground); these combine into Verb Phrases and then the Sentence
Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
Inference: Scared(x) if Chasing(_,x,_). Combined with the facts above, this gives Scared(b1).
Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
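The inference step on this slide can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of facts and the function name `infer_scared` are assumptions made here, not part of the original slides.

```python
# Facts from the semantic analysis, encoded as (predicate, arguments) tuples.
# This encoding is an assumption made for illustration.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}


def infer_scared(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) to every matching fact."""
    derived = set()
    for predicate, args in facts:
        if predicate == "Chasing" and len(args) == 3:
            _, chased, _ = args  # only the second argument matters to the rule
            derived.add(("Scared", (chased,)))
    return derived


new_facts = infer_scared(facts)
print(new_facts)
```

A general inference engine would handle arbitrary rules and chain them; the sketch hard-codes the single rule from the slide to keep the idea visible.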
What we can do in NLP
[Parse tree for “A dog is chasing a boy on the playground”: Det Noun Aux Verb Det Noun Prep Det Noun; Noun Phrase, Complex Verb, Noun Phrase, Prep Phrase, Verb Phrase, Sentence]
•POS tagging: 97%
•Parsing: partial >90%(?)
•Semantics: some aspects
– Entity/relation extraction
– Word sense disambiguation
– Anaphora resolution
•Speech act analysis: ???
•Inference: ???
What We Can’t Do in NLP
•100% POS tagging
– “He turned off the highway.” vs “He turned off the fan.”
•General complete parsing
– “A man saw a boy with a telescope.”
•Deep semantic analysis
– Will we ever be able to precisely define the meaning of “own” in “John owns a restaurant.”?
Robust & general NLP tends to be “shallow” …
“Deep” understanding doesn’t scale up …
Component Technology 2: Search (ad hoc retrieval)
What is Search (Ad hoc IR)?
[Diagram: a User sends the query “robotics applications” to a Retrieval System, which searches a database/collection of text docs and separates relevant docs (Robotics) from non-relevant docs (others).]
What we can do in Search
•Search in a pure text collection is well studied
– Many different methods
– Equally effective when optimized
•Basic search techniques (e.g., vector space, prob. models) are good enough for commercialization
– All implementing TF-IDF style heuristics
– Some new models have more potential for further optimization
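To make “TF-IDF style heuristics” concrete, here is a minimal vector-space ranking sketch. The smoothed IDF formula, the whitespace tokenizer, the toy documents, and all function names are assumptions chosen for illustration; real systems use more refined weighting and length normalization.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; an assumption made to keep the sketch short.
    return text.lower().split()


def build_idf(tokenized_docs):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return {t: math.log((n + 1) / (d + 1)) + 1.0 for t, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are dropped."""
    tf = Counter(tokens)
    return {t: count * idf[t] for t, count in tf.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    q = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q, tfidf_vector(toks, idf)), i)
              for i, toks in enumerate(tokenized)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


docs = [
    "robotics applications in manufacturing",
    "stock market news",
    "applications of robotics and robot control",
]
ranking = search("robotics applications", docs)
for score, index in ranking:
    print(f"{score:.3f}  {docs[index]}")
```

Both robotics documents score above zero, the shorter one ranks first because of cosine normalization, and the unrelated document scores zero.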
What we can’t do in Search
•Basic retrieval models
– No single model is the best on all test collections
– Automatic parameter optimization
•Lack of interactive search support
•Lack of personalization
•Search context modeling
•Retrieval with more than pure text
– With structures
– Multi-media
Component Technology 3: Information Filtering
What is Information Filtering?
•Stable & long term interest, dynamic info source
•System must make a delivery decision immediately as a document “arrives”
[Diagram: a stream of incoming documents passes through a Filtering System that holds a description of “my interest”.]
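The immediate delivery decision can be sketched as a thresholded match between each arriving document and a stored interest profile. The class below is a hypothetical illustration: the term-overlap score, the threshold value, and the sample stream are assumptions, and a real adaptive filter would also update the profile and threshold from user feedback.

```python
class FilteringSystem:
    """Toy content-based filter; scoring rule and threshold are assumptions."""

    def __init__(self, interest_terms, threshold=0.3):
        self.profile = {t.lower() for t in interest_terms}
        self.threshold = threshold

    def score(self, doc):
        """Fraction of the document's tokens that appear in the profile."""
        tokens = doc.lower().split()
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t in self.profile) / len(tokens)

    def deliver(self, doc):
        """Decide at once, using only the profile and this one document."""
        return self.score(doc) >= self.threshold


system = FilteringSystem(["robotics", "robot", "applications"])
stream = [
    "new robotics applications announced",
    "stock market falls again",
    "robot arm applications in surgery",
]
# Each document is judged as it "arrives", without seeing later ones.
decisions = [system.deliver(doc) for doc in stream]
print(decisions)
```

Unlike ad hoc search, there is no ranked list to compare against: the system must commit to a yes/no decision per document, which is why the threshold matters so much.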
State of the Art: Filtering
•Content-based adaptive filtering
– Basic techniques, though not perfect, are there
– We haven’t seen many (any?) filtering applications
•Collaborative filtering (recommender systems)
– Simple methods can be (are being) commercialized
– Real applications exist
– More applications are possible
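One of the “simple methods” for collaborative filtering is user-based prediction: estimate a user's rating of an item from the ratings of similar users. The sketch below is one such variant; the rating data, user names, and the choice of cosine similarity are assumptions made for illustration.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"doc_a": 5, "doc_b": 4, "doc_c": 1},
    "bob": {"doc_a": 5, "doc_b": 5, "doc_d": 4},
    "carol": {"doc_c": 5, "doc_d": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    if weight_total == 0.0:
        return None  # no similar user has rated this item
    return weighted_sum / weight_total


prediction = predict("alice", "doc_d", ratings)
print(prediction)
```

Here alice's predicted rating for `doc_d` is pulled toward bob's rating because their past ratings agree far more than alice's and carol's do.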
Component Technology 4: Text Categorization
What is Text Categorization?
•Pre-given categories and labeled document examples (Categories may form hierarchy)
•Classify new documents
•A standard supervised learning problem
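[Diagram: a Categorization System, given labeled example documents for categories such as Sports, Business, Education, Science, …, assigns new documents to categories (e.g., Sports, Business, Education).]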
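The supervised setup can be illustrated with a small multinomial Naive Bayes classifier using add-one smoothing. Naive Bayes is chosen here only because it fits in a few lines; it is not claimed to be the method used in the course. The training sentences and function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-category document counts, term counts, and the vocabulary."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, term_counts, vocab


def classify(text, model):
    """Return the category with the highest log posterior for the text."""
    class_doc_counts, term_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_doc_counts):
        score = math.log(class_doc_counts[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed term likelihood.
            score += math.log((term_counts[label][token] + 1)
                              / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("the team won the match", "Sports"),
    ("the player scored a goal", "Sports"),
    ("the company reported strong profits", "Business"),
    ("the market and stock prices fell", "Business"),
    ("students attend the school", "Education"),
    ("the teacher graded the exam", "Education"),
]
model = train(training)
print(classify("the team scored", model))
print(classify("stock prices fell", model))
```

The pre-given categories and labeled examples are the whole input; classifying a new document is then just scoring it against each category.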
State of the Art: Categorization
•Many supervised learning methods have been developed
– SVM is often the best in performance
– Other methods are also competitive
– Commercial applications exist, but not at a large-scale
– More applications can be developed
•Feature selection/extraction is often more important than the choice of the learning algorithm
•Applications have been developed
•Relatively well explored
Component Technology 5: Clustering
The Clustering Problem
•Discover “natural structure”
•Group similar objects together
•Objects can be documents, terms, or passages
•Example
State of the Art: Clustering
•Many methods have been developed, applicable in different situations
•Difficult to predict which method is the best
•When patterns are clear, most methods work well
•In difficult situations
– Special clustering bias must be incorporated
– Properties of clustering methods need to be considered
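As a rough illustration of grouping similar objects, the sketch below clusters short documents with single-link agglomerative merging over Jaccard word overlap. The similarity measure, the linkage rule, and the stopping threshold are all assumptions, and each is an instance of the “clustering bias” mentioned above: changing any of them can change the resulting groups.

```python
def jaccard(a, b):
    """Word-set overlap between two documents, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_link_clusters(docs, min_similarity=0.2):
    """Merge the two closest clusters until none reach min_similarity."""
    term_sets = [set(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while True:
        best = None  # (similarity, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(jaccard(term_sets[i], term_sets[j])
                          for i in clusters[a] for j in clusters[b])
                if sim >= min_similarity and (best is None or sim > best[0]):
                    best = (sim, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]


docs = [
    "robot arm control",
    "robot control software",
    "stock market prices",
    "stock prices fall",
]
clusters = single_link_clusters(docs)
print(clusters)
```

In this toy collection the pattern is clear, so almost any method would find the same two groups; the choice of bias starts to matter when documents overlap only partially.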
End of State of the Art Tour…
Where is IR Going?
•IR and related areas
•Current trends
•How would this course fit into the picture?
Related Areas
[Diagram: Information Retrieval and its related areas: Databases; Library & Info Science; Machine Learning, Pattern Recognition, Data Mining; Natural Language Processing; Applications (Web, Bioinformatics…); Statistics, Optimization; Software engineering, Computer systems. The diagram is organized by what each area contributes: Models, Algorithms, Applications, Systems.]
Current Trends
[Diagram: the same map as on the Related Areas slide, annotated with current trends:]
•Web/Bioinformatics/…
•Literature/Digital Library
•Structured + Unstructured Data
•Human-Computer Interactions
•High-Performance Computing
•More Powerful Content Analysis
•More Principled Models/Algorithms
Publications/Societies
[Diagram: research areas mapped to their publication venues and societies.]
Areas: Info Retrieval; Databases; Info. Science; Learning/Mining; NLP; Applications; Statistics??; Software/systems??
Venues and societies: ACM SIGIR, ACM CIKM, TREC, ACM SIGMOD, VLDB, PODS, ICDE, ASIS, JCDL, ICML, NIPS, UAI, AAAI, ACM SIGKDD, ACL, COLING, EMNLP, ANLP, HLT, RECOMB, PSB, ISMB, WWW
Let Users Lead the Way…
•The underlying driving force has always been real world applications
•The ultimate impact of research in IR is to benefit people in accessing and using information in the real world
•Research on many component technologies is reaching a stage of “diminishing return”; the challenge is how to make use of such imperfect techniques
•Think more about complete solutions (as opposed to component technologies) as well as new applications
How Would This Course Fit into the Picture?
•Identify novel application problems
•Identify new research topics
•Examine existing research work in these directions
•Design and carry out new projects in some of the directions
•We will broadly look at 3 application domains: Web, Email, and Literature