Post on 22-Dec-2015
2007.1.25 - SLIDE 1IS 240 – Spring 2007
Prof. Ray Larson
University of California, Berkeley
School of Information
Tuesday and Thursday 10:30 am - 12:00 pm
Spring 2007
http://courses.ischool.berkeley.edu/i240/s07
Principles of Information Retrieval
Lecture 4: Boolean Logic
2007.1.25 - SLIDE 3IS 240 – Spring 2007
Paradox
• The “Fundamental paradox of Information Retrieval” as stated by Roland Hjerppe
– The need to describe that which you do not know in order to find it
2007.1.25 - SLIDE 4IS 240 – Spring 2007
What is Needed?
• What software components are needed to construct an IR system?
• One way to approach this question is to look at the information and data, and see what needs to be done to allow us to do IR
2007.1.25 - SLIDE 5IS 240 – Spring 2007
What, again, is the goal?
• Goal of IR is to retrieve all and only the “relevant” documents in a collection for a particular user with a particular need for information
– Relevance is a central concept in IR theory
• OR
• The goal is to search large document collections (millions of documents) to retrieve small subsets relevant to the user’s information need
2007.1.25 - SLIDE 6IS 240 – Spring 2007
Collections of Documents…
• Documents
– A document is a representation of some aggregation of information, treated as a unit.
• Collection
– A collection is some physical or logical aggregation of documents
• Let’s take the simplest case, and say we are dealing with a computer file of plain ASCII text, where each line represents the “UNIT” or document.
2007.1.25 - SLIDE 7IS 240 – Spring 2007
How to search that collection?
• Manually?
– cat, more
• Scan for strings?
– grep
• Extract individual words to search???
– “tokenize” (a unix pipeline)
• Put it in a DBMS and use pattern matching there…
2007.1.25 - SLIDE 8IS 240 – Spring 2007
What about VERY big files?
• Scanning becomes a problem
• The nature of the problem starts to change as the scale of the collection increases
• A variant of Parkinson’s Law that applies to databases is:
– Data expands to fill the space available to store it
2007.1.25 - SLIDE 9IS 240 – Spring 2007
The IR Approach
• Extract the words (or tokens) along with references to the record they come from
– I.e. build an inverted file of words or tokens
• Is this enough?
2007.1.25 - SLIDE 11IS 240 – Spring 2007
What about …
• The structure information, POS info, etc.?
• Where and how to store this information?
– DBMS?
– Special file structures
• DBMS File types (ISAM, VSAM, B-Tree, etc.)
• PAT trees
• Hashed files (Minimal, Perfect and Both)
• Inverted files
• How to get it back out of the storage
– And how to map to the original document location?
2007.1.25 - SLIDE 12IS 240 – Spring 2007
Structure of an IR System
(Figure: Information Storage and Retrieval System. Search line: interest profiles & queries, formulating query in terms of descriptors, storage of profiles, Store 1: profiles/search requests. Storage line: documents & data, indexing (descriptive and subject), storage of documents, Store 2: document representations. Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language). Comparison/matching of the two stores yields potentially relevant documents.)
Adapted from Soergel, p. 19
2007.1.25 - SLIDE 13IS 240 – Spring 2007
What next?
• User queries
– How do we handle them?
– What sort of interface do we need?
– What processing steps once a query is submitted?
• Matching
– How (and what) do we match?
2007.1.25 - SLIDE 14IS 240 – Spring 2007
From the Text…
(Figure: IR system architecture from the text, with components User Interface, Text operations, Query operations, Indexing, DB Manager, Index, Text database, Searching, and Ranking)
2007.1.25 - SLIDE 15IS 240 – Spring 2007
Today
• IR Models
• The Boolean Model
• Boolean implementation issues
• Note: we will get back to the Text Processing parts later in looking more at implementation. For now we focus on the theoretical foundations of the search process.
2007.1.25 - SLIDE 16IS 240 – Spring 2007
IR Models
• Set Theoretic Models
– Boolean
– Fuzzy
– Extended Boolean
• Vector Models (Algebraic)
• Probabilistic Models (“pure” probabilistic models, Logistic regression, language modeling)
• Others (e.g., neural networks, etc.)
2007.1.25 - SLIDE 17IS 240 – Spring 2007
Boolean Model for IR
• Based on Boolean Logic (Algebra of Sets).
• Fundamental principles established by George Boole in the 1850s
• Deals with set membership and operations on sets
• Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term)
2007.1.25 - SLIDE 18IS 240 – Spring 2007
Boolean Operations on Sets
• Intersection – Boolean ‘AND’ ($A \cap B$)
• Union – Boolean ‘OR’ ($A \cup B$)
• Negation – Boolean ‘NOT’ ($\overline{A}$)
– Usually means “AND NOT” in IR
• Exclusive OR – ‘XOR’ – seldom used
– Instead: $A \text{ xor } B = (A \cup B) \cap \overline{(A \cap B)}$
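The set operations on this slide map directly onto Python's built-in `set` type. The following is a minimal sketch; the document IDs in `A` and `B` are made-up illustration data, not taken from the lecture.

```python
# Hypothetical data: the set of document IDs that contain each term.
A = {1, 2, 3, 5}  # documents containing term A
B = {2, 3, 4}     # documents containing term B

and_result = A & B              # intersection: Boolean AND
or_result = A | B               # union: Boolean OR
and_not_result = A - B          # difference: "A AND NOT B"
xor_result = (A | B) - (A & B)  # XOR written with the other operators

# Python's own symmetric-difference operator gives the same set.
assert xor_result == A ^ B
print(sorted(and_result), sorted(or_result), sorted(and_not_result), sorted(xor_result))
```

Note that NOT appears here only as a set difference against another set, which matches the slide's remark that NOT usually means “AND NOT” in IR.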
2007.1.25 - SLIDE 20IS 240 – Spring 2007
Query Languages
• A way to express the query (formal expression of the information need)
• Types:
– Boolean
– Natural Language
– Stylized Natural Language
– Form-Based (GUI)
2007.1.25 - SLIDE 21IS 240 – Spring 2007
Simple query language: Boolean
• Terms + Connectors
– terms
• words
• normalized (stemmed) words
• phrases
• thesaurus terms
– connectors
• AND
• OR
• NOT
– parentheses (for grouping operations)
2007.1.25 - SLIDE 22IS 240 – Spring 2007
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
2007.1.25 - SLIDE 23IS 240 – Spring 2007
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
– Each of the following combinations works:
Doc # 1 2 3 4 5 6 7
CAT X X X X X
DOG X X X X X
COLLAR X X X X X
LEASH X X X X
2007.1.25 - SLIDE 24IS 240 – Spring 2007
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
– None of the following combinations works:
Doc # 1 2 3 4 5 6 7
CAT X X
DOG X X
COLLAR X X
LEASH X X
2007.1.25 - SLIDE 25IS 240 – Spring 2007
Boolean Queries
• Usually expressed as INFIX operators in IR
– ((a AND b) OR (c AND b))
• NOT is UNARY PREFIX operator
– ((a AND b) OR (c AND (NOT b)))
• AND and OR can be n-ary operators
– (a AND b AND c AND d)
• Some rules - (De Morgan revisited)
– NOT(a) AND NOT(b) = NOT(a OR b)
– NOT(a) OR NOT(b) = NOT(a AND b)
– NOT(NOT(a)) = a
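The three rules above can be checked exhaustively, since two Boolean variables have only four combinations of values. This is a small sketch; the function name `check_rules` is ours.

```python
from itertools import product

def check_rules():
    """Verify De Morgan's laws and double negation over every truth assignment."""
    for a, b in product([False, True], repeat=2):
        assert ((not a) and (not b)) == (not (a or b))
        assert ((not a) or (not b)) == (not (a and b))
        assert (not (not a)) == a
    return True

print(check_rules())
```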
2007.1.25 - SLIDE 26IS 240 – Spring 2007
Boolean Searching
Formal Query:
cracks AND beams AND Width_measurement AND Prestressed_concrete
(Figure: Venn diagram of four overlapping sets: Cracks, Beams, Width measurement, Prestressed concrete)
Relaxed Query:
(C AND B AND P) OR
(C AND B AND W) OR
(C AND W AND P) OR
(B AND W AND P)
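The relaxed query above amounts to "any three of the four terms". A sketch of how such an expansion could be generated, and how the same test could be applied to a single document, follows; the function names and the single-letter term labels are illustrative assumptions.

```python
from itertools import combinations

def relaxed_query(terms, min_match):
    """Build the OR-of-ANDs expansion that requires any min_match of the terms."""
    groups = ["(" + " AND ".join(combo) + ")" for combo in combinations(terms, min_match)]
    return " OR ".join(groups)

def relaxed_match(doc_terms, terms, min_match):
    """True if the document contains at least min_match of the query terms."""
    return sum(1 for term in terms if term in doc_terms) >= min_match

TERMS = ["C", "B", "W", "P"]  # Cracks, Beams, Width measurement, Prestressed concrete
print(relaxed_query(TERMS, 3))
```

Counting matched terms gives the same answer as evaluating the expanded expression, without building it.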
2007.1.25 - SLIDE 27IS 240 – Spring 2007
Boolean Logic
(Figure: Venn diagram of three terms t1, t2, t3 containing documents D1 to D11. The regions are labelled m1 to m8, each a conjunction of t1, t2, and t3 in plain or negated form.)
2007.1.25 - SLIDE 28IS 240 – Spring 2007
Precedence Ordering
• In what order do we evaluate the components of the Boolean expression?
– Parentheses get done first
• (a or b) and (c or d)
• (a or (b and c) or d)
– Usually start from the left and work right (in case of ties)
– Usually (if there are no parentheses)
• NOT before AND
• AND before OR
– Different systems have different rules
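One common way to encode these precedence rules is a small recursive-descent parser with one function per precedence level. The sketch below assumes the "usual" ordering from this slide (parentheses, then NOT, then AND, then OR, left to right); the tuple-based tree format and the function names are our own choices, and real systems may use different rules.

```python
import re

def parse(query):
    """Parse an infix Boolean query into nested tuples such as ("AND", "a", "b")."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos].upper() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():  # lowest precedence
        node = parse_and()
        while peek() == "OR":
            advance()
            node = ("OR", node, parse_and())
        return node

    def parse_and():  # binds tighter than OR
        node = parse_not()
        while peek() == "AND":
            advance()
            node = ("AND", node, parse_not())
        return node

    def parse_not():  # unary prefix NOT, parentheses, or a single term
        if peek() == "NOT":
            advance()
            return ("NOT", parse_not())
        if peek() == "(":
            advance()
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        if peek() in (None, ")", "AND", "OR"):
            raise ValueError("expected a term")
        return advance().lower()

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree

def matches(tree, doc_terms):
    """Evaluate a parsed query against the set of terms in one document."""
    if isinstance(tree, str):
        return tree in doc_terms
    if tree[0] == "NOT":
        return not matches(tree[1], doc_terms)
    left, right = matches(tree[1], doc_terms), matches(tree[2], doc_terms)
    return (left and right) if tree[0] == "AND" else (left or right)

print(parse("a OR b AND NOT c"))
```

Because `parse_and` is called from inside `parse_or`, an AND group is completed before OR is considered, so `a OR b AND c` groups as `a OR (b AND c)` unless parentheses say otherwise.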
2007.1.25 - SLIDE 29IS 240 – Spring 2007
Pseudo-Boolean Queries
• A new notation, from web search
– +cat dog +collar leash
– These are prefix operators
• Does not mean the same thing as AND/OR!
– + means “mandatory, must be in document”
– - means “cannot be in the document”
• Phrases:
– “stray cat” AND “frayed collar”
– is equivalent to
– +“stray cat” +“frayed collar”
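A sketch of how the `+` and `-` prefixes could be interpreted follows. It handles single-word terms only (no quoted phrases), and it assumes that unprefixed terms are optional: they do not affect whether a document is accepted, though a real engine would typically use them for ranking. The function names are illustrative.

```python
def parse_pseudo_boolean(query):
    """Split a query into required (+), excluded (-), and optional term sets."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:])
        else:
            optional.add(token)
    return required, excluded, optional

def accepts(doc_terms, query):
    """True if the document has every + term and none of the - terms."""
    required, excluded, _optional = parse_pseudo_boolean(query)
    return required <= doc_terms and not (excluded & doc_terms)

print(parse_pseudo_boolean("+cat dog +collar leash"))
```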
2007.1.25 - SLIDE 30IS 240 – Spring 2007
Result Sets
• Run a query, get a result set
• Two choices
– Reformulate query, run on entire collection
– Reformulate query, run on result set
• Example: Dialog query
• (Redford AND Newman)
• -> S1 1450 documents
• (S1 AND Sundance)
• -> S2 898 documents
2007.1.25 - SLIDE 31IS 240 – Spring 2007
Faceted Boolean Query
• Strategy: break query into facets (polysemous with earlier meaning of facets)
– conjunction of disjunctions
• (a1 OR a2 OR a3) AND
• (b1 OR b2) AND
• (c1 OR c2 OR c3 OR c4)
– each facet expresses a topic
• (“rain forest” OR jungle OR amazon) AND
• (medicine OR remedy OR cure) AND
• (Smith OR Zhou)
2007.1.25 - SLIDE 32IS 240 – Spring 2007
Faceted Boolean Query
• Query still fails if one facet missing
• Alternative:
– Coordination level ranking
– Order results in terms of how many facets (disjuncts) are satisfied
– Also called Quorum ranking, Overlap ranking, and Best Match
• Problem: Facets still undifferentiated
• Alternative:
– Assign weights to facets
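Coordination level ranking, with the optional facet weights mentioned above, can be sketched as follows. Each facet is a set of alternative terms, and a document scores one point (or the facet's weight) for every facet it satisfies. The documents, weights, and function name are made-up illustration values, not part of the lecture.

```python
def coordination_rank(docs, facets, weights=None):
    """Rank documents by the (weighted) number of facets they satisfy."""
    weights = weights or [1.0] * len(facets)
    scored = []
    for doc_id, terms in docs.items():
        # A facet is satisfied if the document contains any one of its terms.
        score = sum(w for facet, w in zip(facets, weights) if facet & terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document ID for a stable order.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]
docs = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"weather"},
}
print(coordination_rank(docs, facets))
print(coordination_rank(docs, facets, weights=[1, 1, 5]))
```

Unlike the strict faceted query, a document missing one facet is still returned, just lower in the list.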
2007.1.25 - SLIDE 33IS 240 – Spring 2007
Proximity Searches
• Proximity: terms occur within K positions of one another
– pen w/5 paper
• A “Near” function can be more vague
– near(pen, paper)
• Sometimes order can be specified
• Also, Phrases and Collocations
– “United Nations” “Bill Clinton”
• Phrase Variants
– “retrieval of information” “information retrieval”
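A proximity test needs word positions, not just document IDs. The sketch below records positions by whitespace tokenization and checks whether two terms fall within `k` positions of each other, optionally in a fixed order. The exact meaning of an operator such as `w/5` differs between systems, so the distance convention here is an assumption.

```python
def positions(text):
    """Map each lowercased word to the list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        index.setdefault(word, []).append(pos)
    return index

def within(text, term_a, term_b, k, ordered=False):
    """True if term_a and term_b occur within k positions of one another."""
    index = positions(text)
    for pos_a in index.get(term_a, []):
        for pos_b in index.get(term_b, []):
            distance = pos_b - pos_a
            if ordered and distance <= 0:
                continue  # term_b must come after term_a
            if 0 < abs(distance) <= k:
                return True
    return False

print(within("the pen lay on top of the paper", "pen", "paper", 6))
```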
2007.1.25 - SLIDE 34IS 240 – Spring 2007
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
– order chronologically
– order by total number of “hits” on query terms
• What if one term has more hits than others?
• Is it better to have one of each term or many of one term?
• Fancier methods have been investigated
– p-norm is most famous
• usually impractical to implement
• usually hard for user to understand
2007.1.25 - SLIDE 35IS 240 – Spring 2007
Boolean Implementation: Inverted Files
• We will look at “Vector files” in detail later. But conceptually, an Inverted File is a vector file “inverted” so that rows become columns and columns become rows
docs t1 t2 t3
D1 1 0 1
D2 1 0 0
D3 0 1 1
D4 1 0 0
D5 1 1 1
D6 1 1 0
D7 0 1 0
D8 0 1 0
D9 0 0 1
D10 0 1 1
Terms D1 D2 D3 D4 D5 D6 D7 …
t1 1 1 0 1 1 1 0
t2 0 0 1 0 1 1 1
t3 1 0 1 0 1 0 0
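The inversion described on this slide is a matrix transpose. A minimal sketch using the document-by-term values from the slide follows; the function names are ours.

```python
TERMS = ["t1", "t2", "t3"]
DOC_TERM = {  # one row per document, one column per term
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}

def invert(doc_term, terms):
    """Turn document rows into term rows (rows become columns)."""
    doc_ids = list(doc_term)
    return {term: [doc_term[d][i] for d in doc_ids] for i, term in enumerate(terms)}

def docs_for(term, doc_term, terms):
    """List the documents whose entry for the given term is 1."""
    column = terms.index(term)
    return [d for d, row in doc_term.items() if row[column] == 1]

inverted = invert(DOC_TERM, TERMS)
print(inverted["t1"][:7])
print(docs_for("t2", DOC_TERM, TERMS))
```

In practice only the non-zero entries are stored, which is what `docs_for` returns: a list of document IDs per term.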
2007.1.25 - SLIDE 36IS 240 – Spring 2007
How Are Inverted Files Created
• Documents are parsed to extract words (or stems) and these are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
2007.1.25 - SLIDE 37IS 240 – Spring 2007
How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
2007.1.25 - SLIDE 38IS 240 – Spring 2007
How Inverted Files are Created
• Multiple term entries for a single document are merged and frequency information added
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
2007.1.25 - SLIDE 39IS 240 – Spring 2007
Inverted Files
• The file is commonly split into a Dictionary and a Postings file
Dictionary:
Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2
Postings:
Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
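The construction steps from the last few slides (parse, sort, merge with frequencies, split into dictionary and postings) can be sketched in a few lines using the two example documents. The regular-expression tokenizer and the in-memory dictionaries are simplifying assumptions; a real system would write these structures to disk.

```python
import re
from collections import Counter

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def build_inverted_file(docs):
    # Step 1: parse each document into (term, doc_id) pairs.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in re.findall(r"[a-z]+", text.lower())]
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: merge repeated pairs, keeping the within-document frequency.
    merged = Counter(pairs)  # {(term, doc_id): freq}
    # Step 4: split into a postings file and a dictionary.
    postings = {}
    for (term, doc_id), freq in sorted(merged.items()):
        postings.setdefault(term, []).append((doc_id, freq))
    dictionary = {term: (len(plist), sum(freq for _, freq in plist))
                  for term, plist in postings.items()}
    return dictionary, postings

dictionary, postings = build_inverted_file(DOCS)
print(dictionary["the"], postings["the"])
```

Here `dictionary[term]` holds (number of documents, total frequency) and `postings[term]` holds the (document ID, frequency) pairs for that term.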
2007.1.25 - SLIDE 40IS 240 – Spring 2007
Inverted files
• Permit fast search for individual terms
• The search result for each term is a list of document IDs (and optionally, frequency and/or positional information)
• These lists can be used to solve Boolean queries:
– country: d1, d2
– manor: d2
– country and manor: d2
2007.1.25 - SLIDE 41IS 240 – Spring 2007
Inverted Files
• Lots of alternative implementations
– E.g.: Cheshire builds within-document frequency using a hash table during document parsing. Then Document IDs and frequency info are stored in a BerkeleyDB B-tree index keyed by the term.
2007.1.25 - SLIDE 42IS 240 – Spring 2007
Btree (conceptual)
(Figure: conceptual B-tree. Index nodes hold the keys (F, P, Z), (B, D, F), (H, L, P), and (R, S, Z); the leaf entries are Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, and Seminoles.)
2007.1.25 - SLIDE 43IS 240 – Spring 2007
Btree with Postings
(Figure: the same B-tree with a postings list of document IDs attached to each leaf entry, for example 2,4,8,12; 5,7,200; 8,120.)
2007.1.25 - SLIDE 45IS 240 – Spring 2007
Boolean Processing
• Boolean Processing (classic Boolean)
– Data structures for Query representation and Boolean Operations
• Boolean processing logic and algorithms
• Extended Boolean Models
– Fuzzy Logic
– Others
2007.1.25 - SLIDE 46IS 240 – Spring 2007
Boolean Processing
• All processing takes place on postings lists
• Different methods can be used for sorted or unsorted postings lists
2007.1.25 - SLIDE 47IS 240 – Spring 2007
Boolean Query Processing
• The query must be parsed to determine what the
– Search Words
– Optional field or index qualifications
– Boolean Operators
• are, and how they relate to one another
• Typical parsing uses lexical analysers (like lex or flex) along with parser generators like YACC, BISON or LLgen
– These produce code to be compiled into programs.
– Example…
2007.1.25 - SLIDE 48IS 240 – Spring 2007
Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
type-0 [0] ANY,
type-1 [1] IMPLICIT RPNQuery,
type-2 [2] OCTET STRING,
type-100 [100] OCTET STRING,
type-101 [101] IMPLICIT RPNQuery,
type-102 [102] OCTET STRING}
2007.1.25 - SLIDE 49IS 240 – Spring 2007
Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
attributeSet AttributeSetId,
rpn RPNStructure}
2007.1.25 - SLIDE 50IS 240 – Spring 2007
RPN Structure
RPNStructure ::= CHOICE{
op [0] Operand,
rpnRpnOp [1] IMPLICIT SEQUENCE{
rpn1 RPNStructure,
rpn2 RPNStructure,
op Operator }
}
2007.1.25 - SLIDE 51IS 240 – Spring 2007
Operand
Operand ::= CHOICE{
attrTerm AttributesPlusTerm,
resultSet ResultSetId,
-- If version 2 is in force:
-- - If query type is 1, one of the above two must be chosen;
-- - resultAttr (below) may be used only if query type is 101.
resultAttr ResultSetPlusAttributes}
2007.1.25 - SLIDE 52IS 240 – Spring 2007
Operator
Operator ::= [46] CHOICE{
and [0] IMPLICIT NULL,
or [1] IMPLICIT NULL,
and-not [2] IMPLICIT NULL,
-- If version 2 is in force:
-- - For query type 1, one of the above three must be chosen;
-- - prox (below) may be used only if query type is 101.
prox [3] IMPLICIT ProximityOperator}
2007.1.25 - SLIDE 53IS 240 – Spring 2007
Parse Result (Query Tree)
• Z39.50 queries…
Title XXX and Subject YYY
Operator: AND
  left: Operand: Index = Title, Value = XXX
  right: Operand: Index = Subject, Value = YYY
2007.1.25 - SLIDE 54IS 240 – Spring 2007
Parse Results
• Subject XXX and (title yyy and author zzz)
Op: AND
  Oper: Index: Subject, Value: XXX
  Op: AND
    Oper: Index: Title, Value: YYY
    Oper: Index: Author, Value: ZZZ
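A parsed query tree like the ones above can be evaluated bottom-up: each operand looks up a set of document IDs, and each operator combines the sets from its two children. The sketch below uses simple dataclasses rather than the Z39.50 ASN.1 structures; the class names, the `indexes` layout, and the document IDs are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Operand:
    index: str  # e.g. "subject", "title", "author"
    value: str  # the search term

@dataclass
class Operator:
    op: str     # "AND", "OR", or "AND-NOT"
    left: "Node"
    right: "Node"

Node = Union[Operand, Operator]

def evaluate(node, indexes):
    """Return the set of document IDs matching the query tree."""
    if isinstance(node, Operand):
        return set(indexes.get(node.index, {}).get(node.value, ()))
    left, right = evaluate(node.left, indexes), evaluate(node.right, indexes)
    if node.op == "AND":
        return left & right
    if node.op == "OR":
        return left | right
    if node.op == "AND-NOT":
        return left - right
    raise ValueError("unknown operator: " + node.op)

# Hypothetical per-index postings: index name -> term -> document IDs.
indexes = {
    "subject": {"xxx": [1, 2, 3]},
    "title": {"yyy": [2, 3, 4]},
    "author": {"zzz": [3, 5]},
}
tree = Operator("AND", Operand("subject", "xxx"),
                Operator("AND", Operand("title", "yyy"), Operand("author", "zzz")))
print(evaluate(tree, indexes))
```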
2007.1.25 - SLIDE 55IS 240 – Spring 2007
Boolean AND (Sorted) Algorithm
• Choose the shortest list
• Create a new list with room for as many items as the short list
– For each item in the short list
• Compare it with the next item in the longer list
– If the short-list item is greater - go to next item in longer list
– If equal - add to new list and go to next item in both lists
– If the short-list item is less - go to next item in short list
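The sorted AND procedure above is a two-pointer merge. A minimal sketch follows, assuming both postings lists are sorted in ascending order and contain no duplicates; the function name is ours.

```python
def intersect_sorted(list_a, list_b):
    """Boolean AND of two sorted postings lists."""
    short, long_ = sorted((list_a, list_b), key=len)
    result = []
    i = j = 0
    while i < len(short) and j < len(long_):
        if short[i] == long_[j]:
            result.append(short[i])  # in both lists: keep it
            i += 1
            j += 1
        elif short[i] > long_[j]:
            j += 1                   # advance the longer list
        else:
            i += 1                   # advance the short list
    return result

print(intersect_sorted([1, 3, 5, 7], [3, 4, 5, 6, 7, 8]))
```

Each list is scanned at most once, so the cost is linear in the combined length of the two lists.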
2007.1.25 - SLIDE 56IS 240 – Spring 2007
Boolean AND Algorithm
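List 1: 2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198
AND
List 2: 2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128, 135, 138, 141, 150, 155, 188, 189, 195
=
Result: 2, 8, 15, 100, 135, 155, 189, 195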
2007.1.25 - SLIDE 57IS 240 – Spring 2007
Boolean OR (Sorted) Algorithm
• Choose the longer list
• Create a new list with room for both lists combined
– For each item in the longer list
• If it is less than the current item in the short list
– Add it to the new list and go to the next long item
• If it is equal to the current item in the short list
– Add it once and go to the next item in both lists
• Otherwise
– Add the item from the short list and go to the next short item
– Once the short list runs out, add the remaining items in the long list
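The sorted OR procedure is the merge step of merge sort, with one extra rule: an ID that appears in both lists is emitted once. A minimal sketch, again assuming sorted, duplicate-free input lists:

```python
def union_sorted(list_a, list_b):
    """Boolean OR of two sorted postings lists, without duplicate IDs."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])  # present in both: add once
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            result.append(list_a[i])
            i += 1
        else:
            result.append(list_b[j])
            j += 1
    # One list has run out; copy whatever remains of the other.
    result.extend(list_a[i:])
    result.extend(list_b[j:])
    return result

print(union_sorted([1, 3, 5], [2, 3, 6, 7]))
```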
2007.1.25 - SLIDE 58IS 240 – Spring 2007
Boolean OR Algorithm
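List 1: 2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198
OR
List 2: 2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128, 135, 138, 141, 150, 155, 188, 189, 195
=
Result: 2, 5, 7, 8, 9, 12, 15, 22, 28, 29, 35, 50, 68, 77, 84, 100, 120, 128, 135, 138, 140, 141, 150, 155, 188, 189, 190, 195, 198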
2007.1.25 - SLIDE 59IS 240 – Spring 2007
Boolean AND NOT(Sorted) Algorithm
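• Create a new list with room for as many items as the left-hand list
– For each item in the left-hand list
• Compare it with the next item in the NOT list
– If the left-hand item is less - add it to the new list and go to next item in left-hand list
– If equal - go to next item in both lists
– If the left-hand item is greater - go to next item in NOT list
– Once the NOT list runs out, add the remaining items in the left-hand list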
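The sorted AND NOT procedure keeps a left-hand item only when the NOT list has moved past it without finding a match. A minimal sketch, assuming both lists are sorted in ascending order:

```python
def and_not_sorted(left, not_list):
    """Boolean AND NOT: items of the sorted left list absent from the sorted NOT list."""
    result = []
    j = 0
    for item in left:
        # Skip NOT-list entries that are smaller than the current item.
        while j < len(not_list) and not_list[j] < item:
            j += 1
        # Keep the item unless the NOT list contains it.
        if j == len(not_list) or not_list[j] != item:
            result.append(item)
    return result

print(and_not_sorted([1, 2, 3, 4, 5], [2, 4, 6]))
```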
2007.1.25 - SLIDE 60IS 240 – Spring 2007
Boolean AND NOTAlgorithm
List 1: 2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198
AND NOT
List 2: 2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128, 135, 138, 141, 150, 155, 188, 189, 195
=
Result: 5, 7, 29, 35, 140, 190, 198
2007.1.25 - SLIDE 61IS 240 – Spring 2007
Hashed Boolean AND (unsorted)
• Put each item in shortest list into hash table
– For each item in other lists
• If hash entry exists, set flag in hash table entry (or increment counter)
• Scan hash table contents
– If flag set (or counter == number of lists) add to new list
2007.1.25 - SLIDE 62IS 240 – Spring 2007
Hashed Boolean OR (unsorted)
• Put each item in EACH list into hash table
– If match increment counter (optional)
• Scan hash table contents and add to new list
2007.1.25 - SLIDE 63IS 240 – Spring 2007
Hashed Boolean AND NOT (unsorted)
• Put each item in left-hand list into hash table
– For each item in NOT list
• If hash entry exists, remove it
• Scan hash table contents and add to new list
2007.1.25 - SLIDE 64IS 240 – Spring 2007
Boolean Summary
• Advantages
– simple queries are easy to understand
– relatively easy to implement
• Disadvantages
– difficult to specify what is wanted, particularly in complex situations
– too much returned, or too little
– ordering not well determined
• Dominant IR model in commercial systems until the WWW