what to do when one size does not fit all?!

47
What to do when one size does not fit all?! Arjen P. de Vries [email protected] Centrum Wiskunde & Informatica Delft University of Technology Spinque B.V.

Upload: arjen-de-vries

Post on 18-Dec-2014

562 views

Category:

Technology


5 download

DESCRIPTION

Keynote talk about "Search by Strategy" at the ESAIR 2011 workshop, held at CIKM 2011.

TRANSCRIPT

Page 1: What to do when one size does not fit all?!

What to do when one size does not fit all?!

Arjen P. de [email protected]

Centrum Wiskunde & InformaticaDelft University of Technology

Spinque B.V.

Page 2: What to do when one size does not fit all?!

Core Questions

How to represent information? The information need and search requests The objects to be shown in response to an

information request

How to match information representations (Deductive) data retrieval, (inductive)

information retrieval, or a mix?!

Page 3: What to do when one size does not fit all?!

Complications

Heterogeneous data sources WWW, wikipedia, news, e-mail, patents,

twitter, personal information, …

Varying result types “Documents”, tweets, courses, people,

experts, gene expressions, temperatures, …

Multiple dimensions of relevance Topicality, recency, reading level, …

Page 4: What to do when one size does not fit all?!

Complications

Many search tasks require a mix within these dimensions: News and patents Companies and their CEOs Recent and on topic

Many search tasks also require a mix across these dimensions: Patents assigned to our top 3 competitors in

market segments mentioned in the recent press releases issued by our top 10 clients

Page 5: What to do when one size does not fit all?!

Complications

System’s internal information representation Linguistic annotations

Named entities, sentiment, dependencies, …

Knowledge resources Wikipedia, Freebase, IDC9, IPTC, …

Links to related documents Citations, urls

Page 6: What to do when one size does not fit all?!

Complications

Anchors that describe the URI Anchor text

Queries that lead to clicks on the URI Session, user, dwell-time, …

Tweets that mention the URI Time, location, user, …

Other social media that describe the URI User, rating Tag, organisation of `folksonomy’

Page 7: What to do when one size does not fit all?!

Tweets about blip.tv

E.g.: http://blip.tv/file/2168377 Amazing Watching “World’s most realistic 3D city

models?” Google Earth/Maps killer Ludvig Emgard shows how maps/satellite pics

on web is done (learn Google and MS!) and ~120 more Tweets

Page 8: What to do when one size does not fit all?!

Even More Complications

Uncertainty in matching process Vocabulary mismatch Incomplete relevance information

Imperfect and noisy representations of both documents and information need OCR, multimedia analysis, NE taggers, HTML

table extraction, …

Page 9: What to do when one size does not fit all?!

The one size fits all "semantically enhanced retrieval

model“?

BM25BM25F

LMRM

VSMDFR

QIR?Learning to rank?

Document Collection:AnchorsEntity typesSentimentTweetsCited documents

Context

User

Ranked list of

answers

Page 10: What to do when one size does not fit all?!

http://www.hellokids.com/c_19938/coloring-page/holiday-coloring-pages/easter-coloring-pages/jesus-coloring-pages/the-holy-grail-coloring-page

Page 11: What to do when one size does not fit all?!

Parameterised Search System

Cannot we ‘remove’ this IR engineer

from the loop, like DBMS software

removes the data engineer from the

loop?

Cornacchia, De Vries, ECIR 2007A Parametrised Search System

Page 12: What to do when one size does not fit all?!

Search by Strategy

Visually construct search strategies by connecting building blocks

Page 13: What to do when one size does not fit all?!
Page 14: What to do when one size does not fit all?!
Page 15: What to do when one size does not fit all?!

Generate Search Engine!

Page 16: What to do when one size does not fit all?!

Search by Strategy

Visually construct search strategies by connecting building blocks

Each block describes either data or actions upon that data

Page 17: What to do when one size does not fit all?!

Strategy Builder

Page 18: What to do when one size does not fit all?!

From Patent to Inventor

Page 19: What to do when one size does not fit all?!

Reports

Visits

Page 20: What to do when one size does not fit all?!
Page 21: What to do when one size does not fit all?!

BBs and typed pins

N input pins, 1 output pin Pins represent data / result sets

M user-parameters (u) instantiated at query-time

A BB can be viewed as a function

out = BB(in1,..,inN, u1,..,uM)

Pins are typed doc / sec / term / ne (named entity) / tuple only pins of the same type can be connected

No assumption on the underlying data store / data API

The content of the BB respects the type contract.

BB1(in1,in2,in3, u1,u2)

in1 in2 in3

out

BB2(in1)

in1

out

Page 22: What to do when one size does not fit all?!

From Strategies to DB Queries

DatabaseSpinque: RDBMS (MonetDB)

BB1(in1,in2,in3, u1,u2)

in1 in2 in3

out

BB2(in1)

in1

out

• Data flowSpinque: strategy

• Query: strategy made operationalSpinque: PRA

CREATE VIEW a AS SELECT ..

CREATE VIEW b AS SELECT ..

CREATE VIEW c AS SELECT ..

Strategy

Relational DB

Page 23: What to do when one size does not fit all?!

Probabilistic Relational AlgebraStrategy

Relational DB

• SQLexplicit probabilities

CREATE VIEW x AS SELECT a1, a3, 1-prod(1-prob) AS prob FROM yGROUP BY a1, a3;

• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)

x = Project DISTINCT [$1,$3](y);

Page 24: What to do when one size does not fit all?!

SpinQL, the sneak preview

PRA still too low-level; who writes algebraic plans?!

SpinQL: “See objects, generate SQL under the hood”

Understands Spinque data types (doc, sec, term, named-entity, tuple)

Allows to build levels of abstractions, to describe:

1. access to probabilistic relations

2. domain-unaware typed data streams (e.g. person.name() )

3. domain aware data streams (e.g. person.inventor_of())

4. building blocks (building blocks are functions)

5. strategies (if building blocks are functions, strategies are as well)

SpinQL not contained in a strategy, SpinQL is a strategy specification

SpinQL describes all, the editor shows desired granularity / expertise level

E.g. show patent-strategy, zoom in on “inventors”, then “persons”, “ne”, raw data access

Page 25: What to do when one size does not fit all?!

What’s in the DB?

Text-based ranking term-doc-freq relations (inverted file)

One per language, stemming, section Domain-independent, click and index

Entity ranking Probabilistic triples Domain-aware

Needs supervised indexing

Content-based (MM) retrieval Feature vectors, click and index

T D f

t0 d3 3

t0 d5 10

t1 d2 4

subj pred/attr obj/value p

Arjen speaks_to you 0.95

you follow Arjen 0.5

speech minutes 45 0.8

Img_id f1 … fN

0 0.12 … 0.84

1 0.54 … 0.31

2 0.23 … 0.1

Page 26: What to do when one size does not fit all?!

VIEWS and TABLES

BB content: sequence of VIEW definitions

A VIEW is pre-computable when All the relations addressed are pre-computable / stored No dependency on user parameters

Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs) Query-independent computations are performed only once, then

read from TABLEs at each query Recognition of these patterns is fully automatic Extends MonetDB’s per-session caching to across-sessions caching

CREATE VIEW a AS SELECT … FROM term-doc … ;CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;CREATE VIEW c AS SELECT … FROM a WHERE a.x = 42 ;CREATE VIEW d AS SELECT … FROM b … ;

CREATE TABLE a AS SELECT … FROM term-doc … ;CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;CREATE TABLE c AS SELECT … FROM a WHERE a.x = 42 ;CREATE VIEW d AS SELECT … FROM b … ;

User parameterStored relation

No user parameter

Pre-computable relation

Page 27: What to do when one size does not fit all?!

Exploratory Search

Search & (Faceted) Browsing Help discover schema, ontology, etc. Help discover the relevant sources

Within-collection (by year/location, by type, …) Across multiple collections (by source)

Page 28: What to do when one size does not fit all?!

Probabilistic faceted browsing

Traditional (boolean filters) Probabilistic

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• Good when user knows exactly which filters to apply

• Will see perfect-match results• Won’t see “interesting” results

• Good for exploratory search• Will see perfect-match results• Will also see “interesting” results

Page 29: What to do when one size does not fit all?!

Dynamic facets

Pre-indexed Dynamic

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• Pre-defined ad-hoc indices intersected with result set

• Challenge: many indices to maintain

• Facets decided from result set• Challenge: dynamically adapt granularity

• Different price ranges for villa/garage!• Challenge: heavy concurrent queries to DB

Page 30: What to do when one size does not fit all?!

Probabilistic facets and strategies(current)

Originalstrategy

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• Filter on Size (in/out)• no ranking!

• Re-rank on • Rooms (up/down)• Price (up/down)• no filter!

• BAD: Order of Rooms/Price matters!

• BAD: Ranking function of Rooms/Price internally smoothed with previous ranking (done for efficiency)

• BAD: possible to weight each facet, but not consistently with others

Page 31: What to do when one size does not fit all?!

Probabilistic facets and strategies(better)

Originalstrategy

20% 50% 30%Mix

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• Filter on Size (in/out)• no ranking!

• Re-rank on Rooms (up/down)• no filter!

• Re-rank on Price (up/down)• no filter!

• Mix 3 different rankings:1. Rooms2. original (always present)3. Price

• Change coefficients to explore• Challenge: use algebraic

SpinQL representation for rewritings• e.g. push filters up

Page 32: What to do when one size does not fit all?!

Mixing probabilistic data streams

N inputs streams S1,…,SN, 1 output All streams: (id, p)

Same id type (e.g. docs) Linear combination:

p1…pN must be comparable; on same scale Take care in scripting the blocks

Expensive operation in relational algebra

20% 50% 30%Mix

N

jjj

NNNN

pidpid

ididSpidSpid

11

1111

),(),(

|),(,,),(

Page 33: What to do when one size does not fit all?!

Mixing probabilistic data streams in RA

Sum(p, GroupBy(id, (Union(α1S1,…,αnSN))) GroupBy optim. for few large groups – opposite here Example shows 3 ids, 2 streams:

3 groups (could be millions), each group large max 2 (usually < 5).

N

jjj

NNNN

pidpid

ididSpidSpid

11

1111

),(),(

|),(,,),(

id p

id0 0.2*0.1 = 0.02

id1 0.2*0.7 = 0.14

id2 0.2*0.9 = 0.18

id0 0.8*0.2 = 0.16

id2 0.8*1.0 = 0.8

id p

id0 0.02 + 0.16 = 0.18

id1 0.14

id2 0.18 + 0.8 = 0.26

α1S1

α2S220% 80%Mix

id p

id0 0.1

id1 0.7

id2 0.9S1 S2

id p

id0 0.2

id2 1.0

Inputs Union(α1S1,…,αnSN))) Sum(p, GroupBy(id))

Page 34: What to do when one size does not fit all?!

Mixing probabilistic data streams in RA

Project(α1p1+…+αNpN, OuterJoin(id1=…idn, (S1,…,SN)))

Explicit summation of few values more efficient than aggregation Example omits handling NULLs from OuterJoin – not free Super-fast if streams ordered on id – impossible with Union/Group

20% 80%Mix

N

jjj

NNNN

pidpid

ididSpidSpid

11

1111

),(),(

|),(,,),(

id p

id0 0.1

id1 0.7

id2 0.9

id p

id0 0.2*0.1 + 0.8*0.2 = 0.18

id1 0.2*0.7 = 0.14

id2 0.2*0.9 + 0.8*1.0 = 0.26S1 S2

id p

id0 0.2

id2 1.0

id p1 p2

id0 0.1 0.2

id1 0.7

id2 0.9 1.0

Inputs OuterJoin(id1=…idn, (S1,…,SN)) Project(α1p1+…+αNpN)

Page 35: What to do when one size does not fit all?!

Limitations Search & Browse

Faceted exploration does not include joins Cannot construct new data sources from

existing ones! Only the pre-defined paths through the

information space can actually be traversed

Page 36: What to do when one size does not fit all?!

Who needs a Join?

You!!!… whenever ‘relevance cues’ are typed: People (e.g., inventors) Companies (e.g., assignees) Categories (e.g., IPTC) Time (e.g., expiry date) Location (e.g., country)

… or whenever multiple sources are to be combined E.g., patents & news, patents & Wikipedia, …

Page 37: What to do when one size does not fit all?!

Patents on X by Y(y)

by Y(y)

Page 38: What to do when one size does not fit all?!

1. Which universities/colleges

hold patents?

2. Who are the inventors named in those patents?

3. Which inventors are active in the area of our company?

Real-life patent search example:

Which researchers associated to universities and colleges should our Human Resources manager know to hire the right people on time?

Page 39: What to do when one size does not fit all?!

How Strategies Help

Strategies improve communication between search intermediary and user Encapsulate domain expert knowledge Abstract representation of search expert knowledge Analyze information seeking process at any stage

Strategies facilitate knowledge management Store / share / publish / refine

Strategies mix exact (DB) and ranked (IR) searches Avoid the need for “human (probabilistic) joins”

Page 40: What to do when one size does not fit all?!
Page 41: What to do when one size does not fit all?!

Conclusion

“No idealized one-shot search engine” Empower the user!

Page 42: What to do when one size does not fit all?!

Search Intermediaries

Travel agency Real estate agents Recruiters Librarians Archivists Digital forensics detectives Patent information specialists

Task

complexity

Page 43: What to do when one size does not fit all?!
Page 44: What to do when one size does not fit all?!

Research Opportunities

Assist the user make the best out of their increased level of control Integrate usage data from live system to help

improve or adapt strategies Handle “even larger” scale data

Patent demo fine on ~17GB semi-structured data (i.e., Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies

Formalism Score normalization

Close the loop!

Page 45: What to do when one size does not fit all?!

Current Situation

index ; repeat { specify ; retrieve } until

Search & explore

Schema definition

Page 46: What to do when one size does not fit all?!

Desirable Situation

repeat { index ; specify ; retrieve } until

Mixed InitiativeSchema definitionSearch & explore

Page 47: What to do when one size does not fit all?!

Interactive Information Access

Feedback: Interaction improves information

representation

Faceted Browsing: Interaction can let user take over where

machine would fail

Search by Strategy: Interaction can let user take over where

system designer would fail