search by strategy

28
Search by strategy Arjen P. de Vries [email protected] CWI, Spinque, Delft University of Technology

Upload: chet

Post on 14-Jan-2016

37 views

Category:

Documents


2 download

DESCRIPTION

Search by strategy. Arjen P. de Vries [email protected] CWI, Spinque, Delft University of Technology. Interactive Information Access. Feedback: Interaction improves information representation Faceted Browsing: Interaction can let user take over where machine would fail Search by Strategy: - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Search by strategy

Search by strategy

Arjen P. de [email protected]

CWI, Spinque, Delft University of Technology

Page 2: Search by strategy

Interactive Information Access

Feedback: Interaction improves information

representation

Faceted Browsing: Interaction can let user take over where

machine would fail

Search by Strategy: Interaction can let user take over where

system designer would fail

Page 3: Search by strategy

Two extreme views on ‘search’

What is DB? From business applications Deductive reasoning Precise and efficient query

processing Users with technical skills

(SQL, XQuery, etc) and precise information needs

SelectionBooks where category=‘CS’

What is IR? From digital libraries,

patent collections, etc Inductive reasoning Best-effort processing Users with low technical

skills and imprecise information needs

RankingBooks about CS

Note: SemWeb more DB than IR!!!

Page 4: Search by strategy

Complex search tasks(or, “not-so-simple” search tasks, if you like)

I want to buy a house in Amsterdam and I want it with ‘sfeer’ but still in good shape

I can afford about €350K. I need 3 bedrooms, the size should be about 80m2. It should have a balcony or a backyard

The closer to the station and an AH, the better. BUT… I do not want to live in Amsterdam-Noord, unless there is a quick bus connection to the ferry

I may be willing to drop some of these constraints, but I’m not sure which

Page 5: Search by strategy

Search = IR + DB

Complex search tasks mix exact (DB) and ranked (IR) searches, on structured (DB) and unstructured (IR) data

Current technical solutions support either/or SQL or Excel for structured data (DB) Google or Lucene for unstructured data (IR)

Combining results requires significant effort copy & paste result sets between interfaces,

“human (probabilistic) joins”

Page 6: Search by strategy

Search = IR on-top-of DB ?

IR on-top-of DB: let exact and ranked operations both be processed by the same engine, so they can be mixed freely

IR responsible for ranking models, using DB as a data-access layer; no physical details necessary

DB responsible for reliable, dynamically optimised, data access; no logical details necessary

Page 7: Search by strategy

IR on-top-of DB???

Traditional, general-purpose DB technology cannot compete with custom IR search tools Working assumption: using column stores

should solve the efficiency problem

Page 8: Search by strategy

IR on-top-of DB – part I

let $c := doc("papers.xml")//DOC[author = "John Smith"]

let $q := "//text[about(.//Abstract, IR DB integration)]"

for $res in tijah-query($c, $q)

return $res/title/text()

+ +

I am an IR and DB expert

Hiemstra, Rode, Van Os, Flokstra, OSIR 2006PF/Tijah: text search in an XML database system

Page 9: Search by strategy

IR on-top-of DB

DATA AND QUERIES

DATA MANAGEMENT

Bla bla

Bla bla bl

a, bla bl

a. Bla

.

Bla bla

Bla bla bla, bla bla. Bla, bla bla

Bla

bla

Bla

bla

ENGINEERINGData Model Mismatch!!!

To implement IR taskson top of DB

is a tough job!

FORMALISM

MAPPING

Rölleke, Tsikrika, KazaiA general matrix frameworkfor modelling Information Retrieval

“IR concepts to matrix spaces”

Cornacchia, De Vries, Van Ballegooij, KerstenSRAM (Sparse Relational Array Mapping)

“matrix spaces to relational tables”

Page 10: Search by strategy

SRAM – query lifecycle

Page 11: Search by strategy

IR on-top-of DB – part II

let $c := doc("papers.xml")//DOC[author = "John Smith"]

let $q := "//text[about(.//Abstract, IR DB integration)]"

for $res in tijah-query($c, $q)

return $res/title/text()

+++

I am an IR expert(needn’t be a DB expert as well)

Cornacchia, De Vries, ECIR 2007A Parametrised Search System

Page 12: Search by strategy

SRAM for IR# Language Modellinglangmod(Q,d,λ) = sum( [ lm_t(d,t,λ) * Q(t) | t ] )lm_t(d,t,λ) = log( λ * DTf(d,t) + (1-λ) * Tf(t) )

# Okapi BM25bm25(Q,d,k1,b) = sum( [ w_t(d,t,k1,b) * Q(t) | t ] )w_t(d,t,k1,b) = Tidf(t) * (k1+1) * DTf(d,t) / ( DTf(d,t) + k1 * ((1-b) + b * Dnl(d)/avgDl) )

DTnl = mxMult(mxTrnsp(LT), LD) # doc-term freq.Dnl = mvMult(mxTrnsp(LD), L) # doc lenghtTnl = mvMult(mxTrnsp(LT), L) # term freq.DTf = [ DTnl(d,t)/Dnl(d) | d,t ] # doc-term norm. freq.Tf = [ Tnl(t)/nLocs | t ] # term norm. freq.

mxTrnsp(A) = [ A(j,i) | i,j ]mxMult(A,B) = [ sum([ A(i,k) * B(k,j) | k ]) | i,j ]mvMult(A,V) = [ sum([ A(i,j) * V(j) | j ]) | i ]

Indexing.ram (excerpt)

Linear_Algebra.ram (excerpt)

Retrieval_Models.ram (excerpt)

Page 13: Search by strategy

Parameterised Search System (PSS)

Cannot we ‘remove’ this IR engineer

from the loop, like DBMS software

removes the data engineer from the

loop?

Cornacchia, De Vries, ECIR 2007A Parametrised Search System

Page 14: Search by strategy

Search by Strategy

Visually construct search strategies by connecting building blocks

Page 15: Search by strategy

Strategy Builder

Page 16: Search by strategy

From Patent to Inventor

Page 17: Search by strategy

Search by Strategy

Visually construct search strategies by connecting building blocks

Each block describes either data or actions upon that data A search strategy may include multiple data

sources Data sources are internally represented as

quadruples, triples extended with an additional probability value

Actions are actually scripts expressed in Fuhr and Roelleke’s PRA (TOIS 1997)

Page 18: Search by strategy

1. Which universities/colleges

hold patents?

2. Who are the inventors named in those patents?

3. Which inventors are active in the area of our company?

Real-life patent search example:

Which researchers associated to universities and colleges should our Human Resources manager know to hire the right people on time?

Page 19: Search by strategy
Page 20: Search by strategy
Page 21: Search by strategy

Generate Search Engine

Page 22: Search by strategy

Work In Progress…

Page 23: Search by strategy

How Strategies Help

Strategies improve communication between search intermediary and user Encapsulate domain expert knowledge Abstract representation of search expert knowledge Analyze information seeking process at any stage

Strategies facilitate knowledge management Store / share / publish / refine

Strategies mix exact (DB) and ranked (IR) searches Avoid the need for “human (probabilistic) joins”

Page 24: Search by strategy

Implementation

PRA translates into SQL (!) Current system setup using CWI’s

MonetDB column-store Strategies are dynamically transformed

into a REST API

Page 25: Search by strategy

Conclusion

Hand over control to the user (or, most likely, the search intermediary) Patent information specialists Digital forensics detectives Librarians / archivists Real estate agents Travel agency

Page 26: Search by strategy

Open Issues

Assist the user make the best out of their increased level of control Integrate usage data live system to help

improve or adapt strategies

Handle “even larger” scale data Patent demo fine on ~17GB semi-structured

data (i.e., Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies

Close the loop!

Page 27: Search by strategy

Current Situation

index ; repeat { specify ; retrieve } until

IR + DB

IE

Page 28: Search by strategy

Desirable Situation

repeat { index ; specify ; retrieve } until

IR + DB + IE