what to do when one size does not fit all?!

What to do when one size does not fit all?!

Arjen P. de [email protected]

Centrum Wiskunde & InformaticaDelft University of Technology

Spinque B.V.

mailto:[email protected]

Core Questions

How to represent information? The information need and search requests The objects to be shown in response to an

information request

How to match information representations (Deductive) data retrieval, (inductive)

information retrieval, or a mix?!

Complications

Heterogeneous data sources WWW, wikipedia, news, e-mail, patents,

twitter, personal information, …

Varying result types “Documents”, tweets, courses, people,

experts, gene expressions, temperatures, …

Multiple dimensions of relevance Topicality, recency, reading level, …

Complications

Many search tasks require a mix within these dimensions: News and patents Companies and their CEOs Recent and on topic

Many search tasks also require a mix across these dimensions: Patents assigned to our top 3 competitors in

market segments mentioned in the recent press releases issued by our top 10 clients

Complications

System’s internal information representation Linguistic annotations

Named entities, sentiment, dependencies, …

Knowledge resources Wikipedia, Freebase, IDC9, IPTC, …

Links to related documents Citations, urls

Complications

Anchors that describe the URI Anchor text

Queries that lead to clicks on the URI Session, user, dwell-time, …

Tweets that mention the URI Time, location, user, …

Other social media that describe the URI User, rating Tag, organisation of `folksonomy’

Tweets about blip.tv

E.g.: http://blip.tv/file/2168377 Amazing Watching “World’s most realistic 3D city

models?” Google Earth/Maps killer Ludvig Emgard shows how maps/satellite pics

on web is done (learn Google and MS!) and ~120 more Tweets

http://blip.tv/file/2168377

Even More Complications

Uncertainty in matching process Vocabulary mismatch Incomplete relevance information

Imperfect and noisy representations of both documents and information need OCR, multimedia analysis, NE taggers, HTML

table extraction, …

The one size fits all "semantically enhanced retrieval

model“?

BM25BM25F

LMRM

VSMDFR

QIR?Learning to rank?

Document Collection:AnchorsEntity typesSentimentTweetsCited documents

…

Context

User

Ranked list of

answers

http://www.hellokids.com/c_19938/coloring-page/holiday-coloring-pages/easter-coloring-pages/jesus-coloring-pages/the-holy-grail-coloring-page

Parameterised Search System

Cannot we ‘remove’ this IR engineer

from the loop, like DBMS software

removes the data engineer from the

loop?

Cornacchia, De Vries, ECIR 2007A Parametrised Search System

Search by Strategy

Visually construct search strategies by connecting building blocks

Generate Search Engine!

Search by Strategy

Visually construct search strategies by connecting building blocks

Each block describes either data or actions upon that data

Strategy Builder

From Patent to Inventor

Reports

Visits

BBs and typed pins

N input pins, 1 output pin Pins represent data / result sets

M user-parameters (u) instantiated at query-time

A BB can be viewed as a function

out = BB(in1,..,inN, u1,..,uM)

Pins are typed doc / sec / term / ne (named entity) / tuple only pins of the same type can be connected

No assumption on the underlying data store / data API

The content of the BB respects the type contract.

BB1(in1,in2,in3, u1,u2)

in1 in2 in3

out

BB2(in1)

in1

out

From Strategies to DB Queries

DatabaseSpinque: RDBMS (MonetDB)

BB1(in1,in2,in3, u1,u2)

in1 in2 in3

out

BB2(in1)

in1

out

• Data flowSpinque: strategy

• Query: strategy made operationalSpinque: PRA

CREATE VIEW a AS SELECT ..

CREATE VIEW b AS SELECT ..

CREATE VIEW c AS SELECT ..

Strategy

Relational DB

Probabilistic Relational AlgebraStrategy

Relational DB

• SQLexplicit probabilities

CREATE VIEW x AS SELECT a1, a3, 1-prod(1-prob) AS prob FROM yGROUP BY a1, a3;

• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)

x = Project DISTINCT [$1,$3](y);

SpinQL, the sneak preview

PRA still too low-level; who writes algebraic plans?!

SpinQL: “See objects, generate SQL under the hood”

Understands Spinque data types (doc, sec, term, named-entity, tuple)

Allows to build levels of abstractions, to describe:

1. access to probabilistic relations

2. domain-unaware typed data streams (e.g. person.name() )

3. domain aware data streams (e.g. person.inventor_of())

4. building blocks (building blocks are functions)

5. strategies (if building blocks are functions, strategies are as well)

SpinQL not contained in a strategy, SpinQL is a strategy specification

SpinQL describes all, the editor shows desired granularity / expertise level

E.g. show patent-strategy, zoom in on “inventors”, then “persons”, “ne”, raw data access

What’s in the DB?

Text-based ranking term-doc-freq relations (inverted file)

One per language, stemming, section Domain-independent, click and index

Entity ranking Probabilistic triples Domain-aware

Needs supervised indexing

Content-based (MM) retrieval Feature vectors, click and index

T D f

t0 d3 3

t0 d5 10

t1 d2 4

subj pred/attr obj/value p

Arjen speaks_to you 0.95

you follow Arjen 0.5

speech minutes 45 0.8

Img_id f1 … fN

0 0.12 … 0.84

1 0.54 … 0.31

2 0.23 … 0.1

VIEWS and TABLES

BB content: sequence of VIEW definitions

A VIEW is pre-computable when All the relations addressed are pre-computable / stored No dependency on user parameters

Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs) Query-independent computations are performed only once, then

read from TABLEs at each query Recognition of these patterns is fully automatic Extends MonetDB’s per-session caching to across-sessions caching

CREATE VIEW a AS SELECT … FROM term-doc … ;CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;CREATE VIEW c AS SELECT … FROM a WHERE a.x = 42 ;CREATE VIEW d AS SELECT … FROM b … ;

CREATE TABLE a AS SELECT … FROM term-doc … ;CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;CREATE TABLE c AS SELECT … FROM a WHERE a.x = 42 ;CREATE VIEW d AS SELECT … FROM b … ;

User parameterStored relation

No user parameter

Pre-computable relation

Exploratory Search

Search & (Faceted) Browsing Help discover schema, ontology, etc. Help discover the relevant sources

Within-collection (by year/location, by type, …) Across multiple collections (by source)

Probabilistic faceted browsing

Traditional (boolean filters) Probabilistic

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• Good when user knows exactly which filters to apply

• Will see perfect-match results• Won’t see “interesting” results

• Good for exploratory search• Will see perfect-match results• Will also see “interesting” results

Dynamic facets

Pre-indexed Dynamic

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• Pre-defined ad-hoc indices intersected with result set

• Challenge: many indices to maintain

• Facets decided from result set• Challenge: dynamically adapt granularity

• Different price ranges for villa/garage!• Challenge: heavy concurrent queries to DB

Probabilistic facets and strategies(current)

Originalstrategy

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• Filter on Size (in/out)• no ranking!

• Re-rank on • Rooms (up/down)• Price (up/down)• no filter!

• BAD: Order of Rooms/Price matters!

• BAD: Ranking function of Rooms/Price internally smoothed with previous ranking (done for efficiency)

• BAD: possible to weight each facet, but not consistently with others

Probabilistic facets and strategies(better)

Originalstrategy

20% 50% 30%Mix

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100 - 150 m2

• 150 - 200 m2

• 200 - 250 m2

Size

• 100K - 200K• 200K - 300K• 300K - 400K

Price

• 3• 4• 5

Rooms

• Filter on Size (in/out)• no ranking!

• Re-rank on Rooms (up/down)• no filter!

• Re-rank on Price (up/down)• no filter!

• Mix 3 different rankings:1. Rooms2. original (always present)3. Price

• Change coefficients to explore• Challenge: use algebraic

SpinQL representation for rewritings• e.g. push filters up

Mixing probabilistic data streams

N inputs streams S1,…,SN, 1 output All streams: (id, p)

Same id type (e.g. docs) Linear combination:

p1…pN must be comparable; on same scale Take care in scripting the blocks

Expensive operation in relational algebra

20% 50% 30%Mix

N

jjj

NNNN

pidpid

ididSpidSpid

11

1111

),(),(

|),(,,),(

Mixing probabilistic data streams in RA

Sum(p, GroupBy(id, (Union(α1S1,…,αnSN))) GroupBy optim. for few large groups – opposite here Example shows 3 ids, 2 streams:

3 groups (could be millions), each group large max 2 (usually < 5).

N

jjj

NNNN

pidpid

ididSpidSpid

11

1111

),(),(

|),(,,),(

id p

id0 0.2*0.1 = 0.02

id1 0.2*0.7 = 0.14

id2 0.2*0.9 = 0.18

id0 0.8*0.2 = 0.16

id2 0.8*1.0 = 0.8

id p

id0 0.02 + 0.16 = 0.18

id1 0.14

id2 0.18 + 0.8 = 0.26

α1S1

α2S220% 80%Mix

id p

id0 0.1

id1 0.7

id2 0.9S1 S2

id p

id0 0.2

id2 1.0

Inputs Union(α1S1,…,αnSN))) Sum(p, GroupBy(id))

Mixing probabilistic data streams in RA

Project(α1p1+…+αNpN, OuterJoin(id1=…idn, (S1,…,SN)))

Explicit summation of few values more efficient than aggregation Example omits handling NULLs from OuterJoin – not free Super-fast if streams ordered on id – impossible with Union/Group

20% 80%Mix

N

jjj

NNNN

pidpid

ididSpidSpid

11

1111

),(),(

|),(,,),(

id p

id0 0.1

id1 0.7

id2 0.9

id p

id0 0.2*0.1 + 0.8*0.2 = 0.18

id1 0.2*0.7 = 0.14

id2 0.2*0.9 + 0.8*1.0 = 0.26S1 S2

id p

id0 0.2

id2 1.0

id p1 p2

id0 0.1 0.2

id1 0.7

id2 0.9 1.0

Inputs OuterJoin(id1=…idn, (S1,…,SN)) Project(α1p1+…+αNpN)

Limitations Search & Browse

Faceted exploration does not include joins Cannot construct new data sources from

existing ones! Only the pre-defined paths through the

information space can actually be traversed

Who needs a Join?

You!!!… whenever ‘relevance cues’ are typed: People (e.g., inventors) Companies (e.g., assignees) Categories (e.g., IPTC) Time (e.g., expiry date) Location (e.g., country)

… or whenever multiple sources are to be combined E.g., patents & news, patents & Wikipedia, …

Patents on X by Y(y)

by Y(y)

1. Which universities/colleges

hold patents?

2. Who are the inventors named in those patents?

3. Which inventors are active in the area of our company?

Real-life patent search example:

Which researchers associated to universities and colleges should our Human Resources manager know to hire the right people on time?

How Strategies Help

Strategies improve communication between search intermediary and user Encapsulate domain expert knowledge Abstract representation of search expert knowledge Analyze information seeking process at any stage

Strategies facilitate knowledge management Store / share / publish / refine

Strategies mix exact (DB) and ranked (IR) searches Avoid the need for “human (probabilistic) joins”

Conclusion

“No idealized one-shot search engine” Empower the user!

Search Intermediaries

Travel agency Real estate agents Recruiters Librarians Archivists Digital forensics detectives Patent information specialists

Task

complexity

Research Opportunities

Assist the user make the best out of their increased level of control Integrate usage data from live system to help

improve or adapt strategies Handle “even larger” scale data

Patent demo fine on ~17GB semi-structured data (i.e., Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies

Formalism Score normalization

Close the loop!

Current Situation

index ; repeat { specify ; retrieve } until

Search & explore

Schema definition

Desirable Situation

repeat { index ; specify ; retrieve } until

Mixed InitiativeSchema definitionSearch & explore

Interactive Information Access

Feedback: Interaction improves information

representation

Faceted Browsing: Interaction can let user take over where

machine would fail

Search by Strategy: Interaction can let user take over where

system designer would fail

what to do when one size does not fit all?!

Technology

data engineer

spinque data typesdoc

strategy query

strategy relational

underlying data store

strategy builder

data result sets

raw data access