IR in a Nutshell: Applications, Research, and Challenges. Session 1, Feb 21st 2013, Tamer Elsayed

Page 1

IR in a Nutshell: Applications, Research, and Challenges

Session 1

Feb 21st 2013

Tamer Elsayed

Page 2

Roadmap

What is Information Retrieval (IR)?
● Overview and applications
Overview of my research interests
● Large-scale problems
● MapReduce extensions
● Twitter analysis
The future of IR research
● SWIRL 2012

Page 3

WHAT IS IR? OVERVIEW & APPLICATIONS/RESEARCH TOPICS

Page 4

Information Retrieval (IR) …

[Figure labels: information need, Unstructured, Query, Hits]

Page 5

Who and Where?

*Source: Matt Lease (IR Course at UTexas)

Page 6

IR is not just “Web Page” Ranking

or Document or Retrieval

Page 7

Web Search: Google

Search suggestions
Vertical search
Query-biased summarization
Sponsored search
Search shortcuts
Vertical search (news, blog, image)

Page 8

Web Search: Google II

Spelling correction

Personalized search / social ranking

Vertical search (local)

Page 9

Cross-Lingual IR

1/3 of the Web is in languages other than English
About 50% of Web users do not use English as their primary language
Many (maybe most) search applications have to deal with multiple languages
● monolingual search: search in one language, but with many possible languages
● cross-language search: search in multiple languages at the same time

Page 10

Routing / Filtering

Given a standing query, analyze new information as it arrives
● Input: all email, RSS feed or listserv, …
● Typically classification rather than ranking
● Simple example: ham vs. spam

*Source: Matt Lease (IR Course at UTexas)
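
The filtering idea above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `StandingQuery` class, the `matches` and `route` helpers, the trigger terms, and the `min_matches` threshold are all hypothetical, and a real filter would normally use a trained classifier rather than term overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandingQuery:
    """A long-lived information need, reduced here to a set of trigger terms."""
    name: str
    terms: frozenset
    min_matches: int = 1


def matches(query: StandingQuery, text: str) -> bool:
    """Binary decision (classification, not ranking): accept or reject one item."""
    tokens = set(text.lower().split())
    return len(tokens & query.terms) >= query.min_matches


def route(queries, stream):
    """Yield (query name, item) for every arriving item that a query accepts."""
    for item in stream:
        for query in queries:
            if matches(query, item):
                yield query.name, item


# Toy ham vs. spam example; the terms and messages are made up.
spam = StandingQuery("spam", frozenset({"winner", "prize", "free"}), min_matches=2)
inbox = [
    "you are a winner claim your free prize",
    "agenda for the project meeting",
]
routed = list(route([spam], inbox))
```

The query stays fixed while documents stream past it, which is the reverse of ad hoc search, where the collection is fixed and queries arrive.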

Page 11

Content-based Music Search

*Source: Matt Lease (IR Course at UTexas)

Page 12

Speech Retrieval

*Source: Matt Lease (IR Course at UTexas)

Page 13

Entity Search

*Source: Matt Lease (IR Course at UTexas)

Page 14

Question Answering & Focused Retrieval

*Source: Matt Lease (IR Course at UTexas)

Page 15

Expert Search

*Source: Matt Lease (IR Course at UTexas)

Page 16

Blog Search

*Source: Matt Lease (IR Course at UTexas)

Page 17

μ-Blog Search (e.g. Twitter)

*Source: Matt Lease (IR Course at UTexas)

Page 18

e-Discovery

*Source: Matt Lease (IR Course at UTexas)

Page 19

Book Search

Find books or more focused results
Detect / generate / link table of contents
Classification: detect genre (e.g. for browsing)
Detect related books, revised editions
Challenges: variable scan quality, OCR accuracy, copyright, etc.

Page 20

Other Visual Interfaces

*Source: Matt Lease (IR Course at UTexas)

Page 21

MY RESEARCH

Page 22

My Research …

[Figure: large-scale text processing driven by user applications. Emails: Enron collection (~500,000), used for identity resolution. Web pages: ClueWeb collection (~1,000,000,000), used for web search.]

Page 23

Back in 2009 …

Before 2009, only small text collections were available
● Largest: ~1M documents
ClueWeb09
● Crawled by CMU in 2009
● ~1B documents!
● Need to move to cluster environments
MapReduce/Hadoop seemed like a promising framework

Page 24

MapReduce Framework

[Figure: four inputs feed four map tasks; a shuffling stage groups values by key; three reduce tasks write three outputs.]

(a) Map: (k1, v1) → [(k2, v2)]
(b) Shuffle: group values by key
(c) Reduce: (k2, [v2]) → [(k3, v3)]

Framework handles “everything else”!
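
To make the three stages concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names chosen for this illustration, word counting is just the customary toy job, and everything a real framework handles (distribution, fault tolerance, I/O) is left out.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: (k1, v1) -> [(k2, v2)]. Emit (word, 1) for every token."""
    return [(word, 1) for word in text.lower().split()]


def reduce_fn(word, counts):
    """Reduce: (k2, [v2]) -> [(k3, v3)]. Sum the counts of one word."""
    return [(word, sum(counts))]


def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    # (a) Map: apply the mapper to every input record.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(mapper(key, value))
    # (b) Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # (c) Reduce: apply the reducer to each key and its list of values.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


docs = [("d1", "to be or not to be"), ("d2", "to do")]
word_counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` are job-specific; the driver is generic, which mirrors the division of labour the slide describes.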

Page 25

Ivory (http://ivory.cc)

E2E search toolkit using MapReduce
Completely designed for the Hadoop environment
Experimental platform for research
Supports common text collections
● + ClueWeb09
Open source release
Implements state-of-the-art retrieval models

Page 26

(1) Pairwise Similarity in Large Collections

[Figure: a collection of documents, each paired with a list of similarity scores against the other documents (e.g. 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74).]

Applications:
● Clustering
● “more-like-that” queries

Page 27

Decomposition

[Equation lost in the transcript; the slide labels one part of it “map” and another “reduce”.]

Each term contributes only if it appears in …
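
One common way to realise such a decomposition is sketched below. It assumes, since the slide's equation did not survive, that similarity is the inner product of term weights, so that a term adds to a pair's score only when both documents contain it. Raw term frequency is used as the weight, and `build_index`, `map_postings`, and `pairwise_similarity` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import combinations


def build_index(docs):
    """Inverted index: term -> sorted list of (doc_id, weight).

    The weight is raw term frequency, an assumption made for this sketch.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def map_postings(term, postings):
    """Map: one partial product per pair of documents sharing this term."""
    for (d1, w1), (d2, w2) in combinations(postings, 2):
        yield (d1, d2), w1 * w2


def pairwise_similarity(docs):
    """Reduce: sum the partial products of each document pair."""
    scores = defaultdict(int)
    for term, postings in build_index(docs).items():
        for pair, partial in map_postings(term, postings):
            scores[pair] += partial
    return dict(scores)


# Toy collection, made up for illustration.
docs = {
    "d1": "nobel prize prize",
    "d2": "nobel prize book",
    "d3": "book review",
}
sims = pairwise_similarity(docs)
```

Pairs that share no term, such as d1 and d3 here, never appear in the output, so the quadratic comparison of all document pairs is avoided.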

Page 28

(2) Cross-Lingual Pairwise Similarity

Find similar document pairs in different languages
● More difficult than monolingual!
Multilingual text mining, machine translation
Application: automatic generation of potential “interwiki” language links
Locality-sensitive hashing
● Vectors close to each other are likely to have similar signatures

Page 29

Solution Overview

[Figure: processing pipeline]
1. Inputs: Nf German articles (mapped into English by CLIR projection) and Ne English articles
2. Preprocess: Ne+Nf English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …>
3. Signature generation (Random Projection/Minhash/Simhash): Ne+Nf signatures, e.g. 0111000010111100001010
4. Sliding window algorithm: similar article pairs
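
The last two pipeline stages can be sketched as follows, assuming random-projection signatures and a single sorted pass. All function names, the 16-bit signature length, the window size, and the toy vectors are assumptions made for this illustration; the slides do not give these details, and a fuller implementation would typically repeat the sort under several bit permutations.

```python
import random


def make_planes(vocab, num_bits, seed=0):
    """One random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [{term: rng.gauss(0.0, 1.0) for term in vocab} for _ in range(num_bits)]


def signature(vec, planes):
    """Bit i is 1 when the vector lies on the non-negative side of hyperplane i."""
    bits = []
    for plane in planes:
        dot = sum(weight * plane[term] for term, weight in vec.items())
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def hamming(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def sliding_window_pairs(signatures, window, max_distance):
    """Sort signatures, then compare each one only with its next `window` neighbours."""
    ordered = sorted(signatures.items(), key=lambda item: (item[1], item[0]))
    pairs = set()
    for i, (id_a, sig_a) in enumerate(ordered):
        for id_b, sig_b in ordered[i + 1 : i + 1 + window]:
            if hamming(sig_a, sig_b) <= max_distance:
                pairs.add(tuple(sorted((id_a, id_b))))
    return pairs


# Toy vectors: "de_1" stands for a German article already projected into English.
vectors = {
    "en_1": {"nobel": 0.324, "prize": 0.227, "book": 0.01},
    "de_1": {"nobel": 0.324, "prize": 0.227, "book": 0.01},
    "en_2": {"football": 0.5, "league": 0.4},
}
vocab = sorted({term for vec in vectors.values() for term in vec})
planes = make_planes(vocab, num_bits=16)
sigs = {doc_id: signature(vec, planes) for doc_id, vec in vectors.items()}
pairs = sliding_window_pairs(sigs, window=2, max_distance=0)
```

Identical vectors always receive identical signatures, so the projected German article and its English counterpart land next to each other in the sorted order and are reported as a candidate pair.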

Page 30

(3) Approximate Positional Indexes

[Figure: term positions yield proximity features, from which “Learning to Rank” models learn effective ranking functions. Exact positions: large index, slow query evaluation (✗ ✗). Approximate positions: smaller index, faster query evaluation (√ √).]

Close Enough is Good Enough?

Page 31

Fixed-Width Buckets

Buckets of length W

[Figure: document d1 split into buckets 1 to 5 and document d2 split into buckets 1 to 3.]
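
A minimal sketch of the bucketing idea, assuming 0-based token positions and 1-based bucket numbers as in the figure. The helpers `to_buckets` and `bucket_distance`, the toy positions, and the choice W = 10 are illustrative assumptions, not the author's implementation.

```python
def to_buckets(positions, width):
    """Replace exact 0-based term positions by 1-based bucket numbers of length `width`."""
    return sorted({pos // width + 1 for pos in positions})


def bucket_distance(buckets_a, buckets_b):
    """Approximate proximity: the smallest gap, in buckets, between two terms."""
    return min(abs(a - b) for a in buckets_a for b in buckets_b)


# Exact positions of two terms in one document (toy data).
exact = {"nobel": [0, 3, 27], "prize": [4, 41]}
W = 10
approx = {term: to_buckets(pos, W) for term, pos in exact.items()}
gap = bucket_distance(approx["nobel"], approx["prize"])
```

Positions 0 and 3 collapse into a single bucket entry, which is where the index shrinks; the price is that proximity is only known to within W tokens.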

Page 32

(4) Pseudo Training Data for Web Rankers

Documents, queries, and relevance judgments
Important driving force behind IR innovation
● In industry, easy to get
● In academia, hard and really expensive

Page 33

Web Graph

[Figure: web graph of pages P1 to P7; the links between them carry anchor text such as “web search”, “SIGIR 2012”, and “Google”.]

Page 34

Queries and Judgments?

[Figure: the web graph of pages P1 to P7 with anchor text “SIGIR 2012”, “web search”, “Bing”, and “Google”.]

anchor text lines ≈ pseudo queries
target pages ≈ relevant candidates
noise reduction?
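
The mapping from links to training data might look like the sketch below. The link triples are invented (the edges of the slide's graph are not recoverable), `pseudo_judgments` is a hypothetical helper, and the `min_links` threshold is only one possible answer to the noise-reduction question, not the method used in the work.

```python
from collections import Counter, defaultdict


def pseudo_judgments(links, min_links=1):
    """Treat each anchor text as a pseudo query and its link targets as relevant candidates."""
    by_anchor = defaultdict(Counter)
    for source, anchor, target in links:
        by_anchor[anchor.strip().lower()][target] += 1
    judgments = {}
    for query, targets in by_anchor.items():
        # Hypothetical noise filter: keep targets supported by at least `min_links` links.
        kept = {page: n for page, n in targets.items() if n >= min_links}
        if kept:
            judgments[query] = kept
    return judgments


# (source page, anchor text, target page); made-up edges for illustration.
links = [
    ("P1", "web search", "P5"),
    ("P2", "Web Search", "P5"),
    ("P3", "web search", "P7"),
    ("P4", "SIGIR 2012", "P6"),
]
judgments = pseudo_judgments(links, min_links=2)
```

With the threshold at 2, only P5 survives as a relevant candidate for “web search”; lowering it to 1 keeps every anchor and target at the cost of more noise.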

Page 35

(5) Extending MapReduce Framework

Iterative computations (iHadoop)
Concurrent jobs with shared data
m maps - r reduces instead of 1 map - 1 reduce

Page 36

(6) Twitter Analysis

Real-time search in Twitter
● TREC 2011 (6th out of 59 teams)
● TREC 2013?
Answering Real-time Questions from Arabic Social Media
● NPRP-submitted

Page 37

FUTURE RESEARCH DIRECTIONS

Page 38

SWIRL 2012

Page 39

Goal of Report

Inspire researchers and graduate students to address the questions raised
Provide funding agencies data to focus and coordinate support for information retrieval research
Participants were asked to focus on efforts that could be handled in an academic setting, without the requirement of large-scale commercial data.

Page 40

Key Themes (across Topics)

Not just a ranked list
● Move beyond the classic “single ad hoc query and ranked list” approach
Help for users
● Support users more broadly, including ways to bring IR to inexperienced, illiterate, and disabled users
Capturing context
● Treat people using search systems, their context, and their information needs as critical aspects needing exploration
Information, not documents
● Move beyond document retrieval and into more complex types of data and more complicated results
New domains
● Data with restricted access, collections of “apps,” and richly connected workplace data
Evaluation
● Suggest new techniques for evaluation

Page 41

“Most Interesting” Topics

Page 42

[1] Conversational Answer Retrieval

IR: provides ranked lists of documents in response to a wide range of keyword queries
QA: provides more specific answers to a very limited range of natural language questions
Goal: combine the advantages of both to provide effective retrieval of appropriate answers to a wide range of questions expressed in natural language, with rich user-system dialogue

Page 43

Proposed Research

Questions: open-domain, natural language text questions
Answers: develop more general approaches to identifying as many constraints as possible on the answers for questions
Dialogue would be initiated by the searcher and proactively by the system, for:
● refining the understanding of questions
● improving the quality of answers
Answers: short answers, text passages, clustered groups of passages, documents, or even groups of documents may be appropriate answers, as may tables, figures, images, or videos

Page 44

Challenges

Definitions of question and answer for open domain searching
Techniques for representing questions and answers
Techniques for reasoning about and ranking answers
Techniques for representing a mixed-initiative CAR dialogue
Effective dialogue actions for improving question understanding
Effective dialogue actions for refining answers

Page 45

[2] Finding What You Need with Zero Query Terms (or Less)

Function without an explicit query, depending on context and personalization in order to understand user needs

Anticipate user needs and respond with information appropriate to the current context without the user having to enter a query (zero query terms) or even initiate an interaction with the system (or less).

In a mobile context: take the form of an app that recommends interesting places and activities based on the user’s location, personal preferences, past history, and environmental factors such as weather and time.

In a traditional desktop environment: might monitor ongoing activities and suggest related information, or track news, blogs, and social media for interesting updates.

Imagine a system that automatically gathers information related to an upcoming task.

Page 46

Proposed Research

New representations of information and user needs, along with methods for matching the two
Modeling person, task, and context
Methods for finding “objects of interest”, including content, people, objects and actions
Methods for determining what, how and when to show material of interest

Page 47

Challenges

Time- and geo-sensitivity; trust, transparency, privacy; determining interruptibility; summarization
Power management in mobile contexts
Evaluation

Page 48

[3] Mobile Information Retrieval Analytics (MIRA)

No company or researcher has an understanding of mobile information access across a variety of tasks, modes of interaction, or software applications.

For example, a search service provider might know that a query was issued, but not know whether the results it provided resulted in consequent action.

The identification of common types of web search queries led to query classification and algorithms tuned for different purposes, which improved web search accuracy. A similar understanding for mobile information seeking would focus research on the problems of highest value to mobile users.

Goal: study what information, what kind of information, and what granularity of information to deliver for different tasks and contexts.

Page 49

Proposed Research

Methodology and tools for doing large-scale collection of data about mobile information access.
Research on incentive mechanisms is required to understand situations in which people are willing to allow their behavior to be monitored.
Research on privacy is required to understand what can be protected by dataset licenses alone, what must be anonymized, and tradeoffs between anonymization and data utility.
Development of well-defined information seeking tasks.
Support quantitative evaluation in well-defined evaluation frameworks that lead to repeatable scientific research.

Page 50

Challenges

Developing incentive mechanisms
Developing data collections that are sufficiently detailed to be useful while still protecting people’s privacy
Collection of data in a manner that university internal review boards will consider ethically acceptable
Collection of data in a manner that does not violate the Terms of Use restrictions of commercial service providers

Page 51

[4] Empowering Users to Search and Learn

Search engines are currently optimized for look-up tasks and not tasks that require more sustained interactions with information
People have been conditioned by current search engines to interact in particular ways that prevent them from achieving higher levels of learning.
We seek to empower users to be more proactive and critical thinkers during the information search process.

Page 52

[5] The Structure Dimension

Better integration of structured and unstructured information to seamlessly meet a user’s information needs is a promising but underdeveloped area of exploration.
Named entities, user profiles, contextual annotations, as well as (typed) links between information objects ranging from web pages to social media messages.

Page 53

[6] Understanding People in Order to Improve Information (Retrieval) Systems

Development of a research resource for the IR community:
1. from which hypotheses about how to support people in information interactions can be developed
2. in which IR system designs can be appropriately evaluated.
Conducting studies of people
● before, during, and after engagement with information systems,
● at a variety of levels,
● using a variety of methods.
  • ethnography
  • in situ observation
  • controlled observation
  • large-scale logging

Page 55

Thank You!