What to do when one size does not fit all?!
DESCRIPTION
Keynote talk about "Search by Strategy" at the ESAIR 2011 workshop, held at CIKM 2011.
TRANSCRIPT
What to do when one size does not fit all?!
Arjen P. de Vries
Centrum Wiskunde & Informatica
Delft University of Technology
Spinque B.V.
Core Questions
• How to represent information?
  • The information need and search requests
  • The objects to be shown in response to an information request
• How to match information representations?
  • (Deductive) data retrieval, (inductive) information retrieval, or a mix?!
Complications
• Heterogeneous data sources
  • WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
• Varying result types
  • “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
• Multiple dimensions of relevance
  • Topicality, recency, reading level, …
Complications
• Many search tasks require a mix within these dimensions:
  • News and patents
  • Companies and their CEOs
  • Recent and on topic
• Many search tasks also require a mix across these dimensions:
  • Patents assigned to our top 3 competitors in market segments mentioned in the recent press releases issued by our top 10 clients
Complications
System’s internal information representation
• Linguistic annotations
  • Named entities, sentiment, dependencies, …
• Knowledge resources
  • Wikipedia, Freebase, IDC9, IPTC, …
• Links to related documents
  • Citations, URLs
Complications
• Anchors that describe the URI
  • Anchor text
• Queries that lead to clicks on the URI
  • Session, user, dwell-time, …
• Tweets that mention the URI
  • Time, location, user, …
• Other social media that describe the URI
  • User, rating
  • Tag, organisation of ‘folksonomy’
Tweets about blip.tv
E.g.: http://blip.tv/file/2168377
• Amazing
• Watching “World’s most realistic 3D city models?”
• Google Earth/Maps killer
• Ludvig Emgard shows how maps/satellite pics on web is done (learn Google and MS!)
• and ~120 more Tweets
Even More Complications
• Uncertainty in matching process
  • Vocabulary mismatch
  • Incomplete relevance information
• Imperfect and noisy representations of both documents and information need
  • OCR, multimedia analysis, NE taggers, HTML table extraction, …
The one-size-fits-all “semantically enhanced retrieval model”?
[Diagram labels: Document Collection (anchors, entity types, sentiment, tweets, cited documents, …); Context; User; candidate models BM25, BM25F, LM, RM, VSM, DFR, QIR?, Learning to rank?; Ranked list of answers]
http://www.hellokids.com/c_19938/coloring-page/holiday-coloring-pages/easter-coloring-pages/jesus-coloring-pages/the-holy-grail-coloring-page
Parameterised Search System
Can we not ‘remove’ this IR engineer from the loop, like DBMS software removes the data engineer from the loop?
Cornacchia, De Vries, ECIR 2007: A Parametrised Search System
Search by Strategy
Visually construct search strategies by connecting building blocks
Generate Search Engine!
Search by Strategy
Visually construct search strategies by connecting building blocks
Each block describes either data or actions upon that data
Strategy Builder
From Patent to Inventor
Reports
Visits
BBs and typed pins
• N input pins, 1 output pin
  • Pins represent data / result sets
• M user-parameters (u), instantiated at query-time
• A BB can be viewed as a function
  • out = BB(in1,..,inN, u1,..,uM)
• Pins are typed
  • doc / sec / term / ne (named entity) / tuple
  • Only pins of the same type can be connected
• No assumption on the underlying data store / data API
• The content of the BB respects the type contract.
[Diagram: building blocks BB1(in1,in2,in3, u1,u2) and BB2(in1), each drawn with its input pins and a single output pin]
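The "BB as a function with typed pins" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `BuildingBlock`, the helper `connect`, the `(id, probability)` layout of a result set and the three example blocks are hypothetical and are not Spinque's API. Only the pin type names and the rule that pins of different types cannot be connected come from the slide.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

PIN_TYPES = {"doc", "sec", "term", "ne", "tuple"}  # pin types named on the slide

# Assumed layout of the data on a pin: a list of (id, probability) pairs.
ResultSet = List[Tuple[str, float]]


@dataclass
class BuildingBlock:
    name: str
    in_types: Sequence[str]        # one type per input pin (N pins)
    out_type: str                  # type of the single output pin
    fn: Callable[..., ResultSet]   # out = fn(in1, .., inN, **user_params)

    def __post_init__(self):
        for t in (*self.in_types, self.out_type):
            if t not in PIN_TYPES:
                raise ValueError(f"unknown pin type: {t}")


def connect(producer: BuildingBlock, consumer: BuildingBlock, pin: int) -> None:
    """Enforce the type contract: only pins of the same type can be connected."""
    if producer.out_type != consumer.in_types[pin]:
        raise TypeError(
            f"{producer.name}.out is '{producer.out_type}' but "
            f"{consumer.name}.in{pin + 1} expects '{consumer.in_types[pin]}'"
        )


def top_k(docs: ResultSet, k: int = 10) -> ResultSet:
    """Hypothetical block body: keep the k highest-scoring documents."""
    return sorted(docs, key=lambda r: r[1], reverse=True)[:k]


# Hypothetical blocks; k plays the role of a user parameter set at query time.
rank = BuildingBlock("rank", ["doc"], "doc", lambda docs: docs)
cut = BuildingBlock("top_k", ["doc"], "doc", top_k)
ne_filter = BuildingBlock("ne_filter", ["ne"], "ne", lambda nes: nes)

connect(rank, cut, 0)  # accepted: doc -> doc
```

Calling `connect(rank, ne_filter, 0)` raises a `TypeError`, because a doc pin cannot feed an ne pin.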
From Strategies to DB Queries
• Database (Spinque: RDBMS, MonetDB)
• Data flow (Spinque: strategy)
• Query: strategy made operational (Spinque: PRA)
[Diagram: strategy built from BB1(in1,in2,in3, u1,u2) and BB2(in1), shown above the relational DB]
CREATE VIEW a AS SELECT ..
CREATE VIEW b AS SELECT ..
CREATE VIEW c AS SELECT ..
Probabilistic Relational Algebra
[Diagram labels: Strategy, Relational DB]
• SQL: explicit probabilities
CREATE VIEW x AS SELECT a1, a3, 1-prod(1-prob) AS prob FROM y GROUP BY a1, a3;
• PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)
x = Project DISTINCT [$1,$3](y);
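The aggregate `1-prod(1-prob)` in the SQL view can be illustrated with a small Python sketch. It assumes that duplicate tuples after projection are independent events, so that the probability of the projected tuple is one minus the probability that none of its sources holds. The function name `project_distinct`, the 0-based positions and the sample relation `y` are hypothetical.

```python
from collections import defaultdict


def project_distinct(rows, keep):
    """Probabilistic projection with duplicate elimination.

    rows: iterable of (attrs, prob), where attrs is a tuple of attribute values.
    keep: 0-based positions of the attributes to keep.
    Returns {projected_attrs: 1 - prod(1 - prob)} per distinct projected tuple.
    """
    none_holds = defaultdict(lambda: 1.0)  # running prod(1 - prob) per group
    for attrs, prob in rows:
        key = tuple(attrs[i] for i in keep)
        none_holds[key] *= 1.0 - prob
    return {key: 1.0 - q for key, q in none_holds.items()}


# Hypothetical relation y(a1, a2, a3) with a probability per tuple.
y = [
    (("a", "x", "k"), 0.5),
    (("a", "z", "k"), 0.5),
    (("b", "x", "m"), 0.3),
]

# Keep the first and third attribute, assuming $1,$3 are 1-based positions.
x = project_distinct(y, keep=(0, 2))
```

Here the two tuples that collapse onto `("a", "k")` combine to 1 - 0.5 * 0.5 = 0.75, while `("b", "m")` keeps its single probability of 0.3.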
SpinQL, the sneak preview
PRA still too low-level; who writes algebraic plans?!
SpinQL: “See objects, generate SQL under the hood”
Understands Spinque data types (doc, sec, term, named-entity, tuple)
Allows building levels of abstraction, to describe:
1. access to probabilistic relations
2. domain-unaware typed data streams (e.g. person.name() )
3. domain aware data streams (e.g. person.inventor_of())
4. building blocks (building blocks are functions)
5. strategies (if building blocks are functions, strategies are as well)
SpinQL is not contained in a strategy; SpinQL is a strategy specification
SpinQL describes all, the editor shows desired granularity / expertise level
E.g. show patent-strategy, zoom in on “inventors”, then “persons”, “ne”, raw data access
What’s in the DB?
• Text-based ranking
  • term-doc-freq relations (inverted file)
  • One per language, stemming, section
  • Domain-independent, click and index
• Entity ranking
  • Probabilistic triples
  • Domain-aware
  • Needs supervised indexing
• Content-based (MM) retrieval
  • Feature vectors, click and index

T    D    f
t0   d3   3
t0   d5   10
t1   d2   4

subj     pred/attr   obj/value   p
Arjen    speaks_to   you         0.95
you      follow      Arjen       0.5
speech   minutes     45          0.8

Img_id   f1     …   fN
0        0.12   …   0.84
1        0.54   …   0.31
2        0.23   …   0.1
VIEWS and TABLES
• BB content: sequence of VIEW definitions
• A VIEW is pre-computable when
  • All the relations addressed are pre-computable / stored
  • No dependency on user parameters
• Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs)
  • Query-independent computations are performed only once, then read from TABLEs at each query
  • Recognition of these patterns is fully automatic
  • Extends MonetDB’s per-session caching to across-sessions caching
Before:
CREATE VIEW a AS SELECT … FROM term-doc … ;
CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;
CREATE VIEW c AS SELECT … FROM a WHERE a.x = 42 ;
CREATE VIEW d AS SELECT … FROM b … ;

After:
CREATE TABLE a AS SELECT … FROM term-doc … ;
CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;
CREATE TABLE c AS SELECT … FROM a WHERE a.x = 42 ;
CREATE VIEW d AS SELECT … FROM b … ;

[Legend: user parameter; no user parameter; stored relation; pre-computable relation]
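The pre-computability rule can be sketched as a single pass over the view definitions. This is a minimal illustration, not the actual Spinque implementation: the function `precomputable` and the dictionary encoding of the views are assumptions, and the four views mirror the a, b, c, d example on the slide.

```python
def precomputable(views, stored):
    """Decide which views can be materialised as tables.

    views: dict name -> (relations_read, uses_user_param), in definition order.
    stored: names of relations that are already stored.
    A view is pre-computable when everything it reads is stored or
    pre-computable and it does not depend on a user parameter.
    """
    available = set(stored)
    decision = {}
    for name, (reads, uses_user_param) in views.items():
        ok = (not uses_user_param) and all(r in available for r in reads)
        decision[name] = ok
        if ok:
            available.add(name)  # later views may build on this one
    return decision


views = {
    "a": ({"term-doc"}, False),
    "b": ({"a"}, True),    # WHERE a.x = u1 (user parameter)
    "c": ({"a"}, False),   # WHERE a.x = 42 (constant)
    "d": ({"b"}, False),   # reads b, which is not pre-computable
}

plan = precomputable(views, stored={"term-doc"})
for name, ok in plan.items():
    print("CREATE", "TABLE" if ok else "VIEW", name)
```

The sketch relies on views being listed in definition order, so that each view only reads relations that were already classified.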
Exploratory Search
• Search & (Faceted) Browsing
• Help discover schema, ontology, etc.
• Help discover the relevant sources
  • Within-collection (by year/location, by type, …)
  • Across multiple collections (by source)
Probabilistic faceted browsing
Traditional (boolean filters):
• Good when user knows exactly which filters to apply
• Will see perfect-match results
• Won’t see “interesting” results
Probabilistic:
• Good for exploratory search
• Will see perfect-match results
• Will also see “interesting” results
[Figure: both panels show the same example facets: Price (100K - 200K, 200K - 300K, 300K - 400K), Rooms (3, 4, 5), Size (100 - 150 m2, 150 - 200 m2, 200 - 250 m2)]
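The contrast between a boolean facet and a probabilistic one can be made concrete with a toy example. The slides do not say how the probabilistic score is computed, so the linear decay used in `soft_facet`, the `softness` parameter and the house data below are purely illustrative assumptions.

```python
def boolean_facet(items, key, lo, hi):
    """Traditional facet: keep only items whose value lies in [lo, hi]."""
    return [it["id"] for it in items if lo <= it[key] <= hi]


def soft_facet(items, key, lo, hi, softness):
    """Assumed probabilistic facet: score 1.0 inside [lo, hi], decaying
    linearly to 0 over `softness` units outside the range."""
    scored = []
    for it in items:
        distance = max(lo - it[key], 0, it[key] - hi)
        score = max(0.0, 1.0 - distance / softness)
        if score > 0:
            scored.append((it["id"], score))
    return sorted(scored, key=lambda r: r[1], reverse=True)


# Hypothetical listings.
houses = [
    {"id": "h1", "price": 250_000},
    {"id": "h2", "price": 310_000},
    {"id": "h3", "price": 600_000},
]

exact = boolean_facet(houses, "price", 200_000, 300_000)
ranked = soft_facet(houses, "price", 200_000, 300_000, softness=100_000)
```

With these numbers the boolean facet returns only `h1`, while the soft facet still ranks `h1` first and also surfaces `h2`, which is just above the price range, as an "interesting" result; `h3` is too far away and drops out.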
Dynamic facets
Pre-indexed:
• Pre-defined ad-hoc indices intersected with result set
• Challenge: many indices to maintain
Dynamic:
• Facets decided from result set
• Challenge: dynamically adapt granularity
  • Different price ranges for villa/garage!
• Challenge: heavy concurrent queries to DB
[Figure: both panels show the same example facets: Price (100K - 200K, 200K - 300K, 300K - 400K), Rooms (3, 4, 5), Size (100 - 150 m2, 150 - 200 m2, 200 - 250 m2)]
Probabilistic facets and strategies (current)
[Diagram: original strategy followed by the facets Size (100 - 150 m2, 150 - 200 m2, 200 - 250 m2), Rooms (3, 4, 5) and Price (100K - 200K, 200K - 300K, 300K - 400K)]
• Filter on Size (in/out)
  • no ranking!
• Re-rank on
  • Rooms (up/down)
  • Price (up/down)
  • no filter!
• BAD: Order of Rooms/Price matters!
• BAD: Ranking function of Rooms/Price internally smoothed with previous ranking (done for efficiency)
• BAD: possible to weight each facet, but not consistently with others
Probabilistic facets and strategies (better)
[Diagram: original strategy, facets Size (100 - 150 m2, 150 - 200 m2, 200 - 250 m2), Rooms (3, 4, 5) and Price (100K - 200K, 200K - 300K, 300K - 400K), and a Mix block with weights 20%, 50%, 30%]
• Filter on Size (in/out)
  • no ranking!
• Re-rank on Rooms (up/down)
  • no filter!
• Re-rank on Price (up/down)
  • no filter!
• Mix 3 different rankings:
  1. Rooms
  2. original (always present)
  3. Price
• Change coefficients to explore
• Challenge: use algebraic SpinQL representation for rewritings
  • e.g. push filters up
Mixing probabilistic data streams
• N input streams S1,…,SN, 1 output
• All streams: (id, p)
  • Same id type (e.g. docs)
• Linear combination:
  $$\Bigl\{\, (id, p) \;\Bigm|\; (id_1, p_1) \in S_1, \dots, (id_N, p_N) \in S_N,\ id = id_1 = \dots = id_N,\ p = \sum_{j=1}^{N} \alpha_j p_j \,\Bigr\}$$
• p1…pN must be comparable; on same scale
  • Take care in scripting the blocks
• Expensive operation in relational algebra
[Diagram: Mix block with weights 20%, 50%, 30%]
Mixing probabilistic data streams in RA
• Sum(p, GroupBy(id, Union(α1S1,…,αNSN)))
• GroupBy is optimised for few large groups; here the situation is the opposite
• Example shows 3 ids, 2 streams: 3 groups (could be millions), each group with at most 2 members (usually < 5)
[Diagram: Mix block with weights 20% (α1, S1) and 80% (α2, S2)]

Inputs:
S1           S2
id    p      id    p
id0   0.1    id0   0.2
id1   0.7    id2   1.0
id2   0.9

Union(α1S1,…,αNSN):
id    p
id0   0.2*0.1 = 0.02   (α1S1)
id1   0.2*0.7 = 0.14   (α1S1)
id2   0.2*0.9 = 0.18   (α1S1)
id0   0.8*0.2 = 0.16   (α2S2)
id2   0.8*1.0 = 0.8    (α2S2)

Sum(p, GroupBy(id)):
id    p
id0   0.02 + 0.16 = 0.18
id1   0.14
id2   0.18 + 0.8 = 0.98
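The Union / GroupBy / Sum plan can be reproduced with the slide's two example streams. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the RDBMS (the talk uses MonetDB), and the table names `s1` and `s2` are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE s1 (id TEXT, p REAL);
CREATE TABLE s2 (id TEXT, p REAL);
INSERT INTO s1 VALUES ('id0', 0.1), ('id1', 0.7), ('id2', 0.9);
INSERT INTO s2 VALUES ('id0', 0.2), ('id2', 1.0);
""")

# Scale each stream by its weight, union them, then sum per id.
# UNION ALL is required: plain UNION would drop rows that happen to be equal.
mixed = con.execute("""
SELECT id, SUM(p) AS p
FROM (SELECT id, 0.2 * p AS p FROM s1
      UNION ALL
      SELECT id, 0.8 * p AS p FROM s2)
GROUP BY id
ORDER BY id
""").fetchall()

for row_id, p in mixed:
    print(row_id, round(p, 2))
```

The query produces one row per id, and `id1`, which occurs only in S1, simply ends up in a group of size one.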
Mixing probabilistic data streams in RA
• Project(α1p1+…+αNpN, OuterJoin(id1=…=idN, (S1,…,SN)))
• Explicit summation of few values is more efficient than aggregation
• Example omits handling NULLs from OuterJoin (not free)
• Super-fast if streams are ordered on id (impossible with Union/GroupBy)
[Diagram: Mix block with weights 20% (S1) and 80% (S2)]

Inputs:
S1           S2
id    p      id    p
id0   0.1    id0   0.2
id1   0.7    id2   1.0
id2   0.9

OuterJoin(id1=…=idN, (S1,…,SN)):
id    p1    p2
id0   0.1   0.2
id1   0.7
id2   0.9   1.0

Project(α1p1+…+αNpN):
id    p
id0   0.2*0.1 + 0.8*0.2 = 0.18
id1   0.2*0.7 = 0.14
id2   0.2*0.9 + 0.8*1.0 = 0.98
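The remark that the outer-join plan is fast when streams are ordered on id can be illustrated with a merge over sorted lists. This is a sketch under assumptions, not the MonetDB execution plan: `mix_sorted` is a hypothetical helper, each stream is assumed to be sorted by id with unique ids, and an id missing from a stream is treated as contributing 0, which is how the slide's example handles `id1`.

```python
def mix_sorted(streams, weights):
    """Weighted sum over id-sorted streams of (id, p) pairs in one merge pass."""
    pos = [0] * len(streams)  # read position per stream
    out = []
    while True:
        heads = [s[i][0] for s, i in zip(streams, pos) if i < len(s)]
        if not heads:
            return out
        current = min(heads)  # smallest id not yet emitted
        p = 0.0
        for j, (s, w) in enumerate(zip(streams, weights)):
            if pos[j] < len(s) and s[pos[j]][0] == current:
                p += w * s[pos[j]][1]  # explicit summation, no aggregation
                pos[j] += 1
        out.append((current, p))


s1 = [("id0", 0.1), ("id1", 0.7), ("id2", 0.9)]
s2 = [("id0", 0.2), ("id2", 1.0)]

mixed = mix_sorted([s1, s2], weights=[0.2, 0.8])
```

Each stream is read exactly once and no grouping structure is built, which is the property the Union/GroupBy formulation cannot exploit.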
Limitations Search & Browse
• Faceted exploration does not include joins
  • Cannot construct new data sources from existing ones!
  • Only the pre-defined paths through the information space can actually be traversed
Who needs a Join?
You!!!
• … whenever ‘relevance cues’ are typed:
  • People (e.g., inventors)
  • Companies (e.g., assignees)
  • Categories (e.g., IPTC)
  • Time (e.g., expiry date)
  • Location (e.g., country)
• … or whenever multiple sources are to be combined
  • E.g., patents & news, patents & Wikipedia, …
Patents on X by Y(y)
1. Which universities/colleges hold patents?
2. Who are the inventors named in those patents?
3. Which inventors are active in the area of our company?
Real-life patent search example:
Which researchers associated with universities and colleges should our Human Resources manager know to hire the right people on time?
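The three questions chain into one join over probabilistic triples. The sketch below uses Python's `sqlite3` module with entirely made-up data: the `triple` table follows the (subj, pred, obj, p) layout shown earlier in the talk, while the predicate names `assignee`, `inventor`, `is_a` and `about`, and all rows, are hypothetical. Multiplying the probabilities along the join path assumes the triples are independent.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE triple (subj TEXT, pred TEXT, obj TEXT, p REAL);
INSERT INTO triple VALUES
  ('pat1',   'assignee', 'Univ A',     1.0),
  ('pat2',   'assignee', 'Corp B',     1.0),
  ('pat1',   'inventor', 'Alice',      0.9),
  ('pat2',   'inventor', 'Bob',        0.9),
  ('Univ A', 'is_a',     'university', 0.8),
  ('pat1',   'about',    'our area',   0.5),
  ('pat2',   'about',    'our area',   0.7);
""")

rows = con.execute("""
SELECT inv.obj AS inventor,
       uni.p * asg.p * inv.p * tpc.p AS score
FROM triple AS uni
JOIN triple AS asg ON asg.pred = 'assignee' AND asg.obj  = uni.subj  -- step 1
JOIN triple AS inv ON inv.pred = 'inventor' AND inv.subj = asg.subj  -- step 2
JOIN triple AS tpc ON tpc.pred = 'about'    AND tpc.subj = asg.subj  -- step 3
WHERE uni.pred = 'is_a' AND uni.obj = 'university'
  AND tpc.obj = 'our area'
ORDER BY score DESC
""").fetchall()

for inventor, score in rows:
    print(inventor, round(score, 2))
```

Only Alice is returned, since Bob's patent is held by a company rather than a university. If an inventor appeared on several matching patents, the rows would still need to be combined per inventor, for example with the `1-prod(1-prob)` aggregate shown earlier.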
How Strategies Help
• Strategies improve communication between search intermediary and user
  • Encapsulate domain expert knowledge
  • Abstract representation of search expert knowledge
  • Analyze information seeking process at any stage
• Strategies facilitate knowledge management
  • Store / share / publish / refine
• Strategies mix exact (DB) and ranked (IR) searches
  • Avoid the need for “human (probabilistic) joins”
Conclusion
“No idealized one-shot search engine”
Empower the user!
Search Intermediaries
• Travel agency
• Real estate agents
• Recruiters
• Librarians
• Archivists
• Digital forensics detectives
• Patent information specialists
[Diagram axis: task complexity]
Research Opportunities
• Assist users in making the best of their increased level of control
• Integrate usage data from live system to help improve or adapt strategies
• Handle “even larger” scale data
  • Patent demo fine on ~17GB semi-structured data (i.e., Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies
• Formalism
  • Score normalization
Close the loop!
Current Situation
index ; repeat { specify ; retrieve } until
Search & explore
Schema definition
Desirable Situation
repeat { index ; specify ; retrieve } until
Mixed Initiative
Schema definition
Search & explore
Interactive Information Access
• Feedback: Interaction improves information representation
• Faceted Browsing: Interaction can let user take over where machine would fail
• Search by Strategy: Interaction can let user take over where system designer would fail