sims 296a-3: ui background marti hearst fall ‘98

SIMS 296a-3: UI Background

Marti Hearst

Fall ‘98

Marti Hearst, UCB SIMS, Fall 98

Interface Topics Today

(Other topics will be covered later)

Supporting the Dynamic Continuing Process of Search

Search Starting Points


Human Information Seeking Behavior


Standard Model

Assumptions:
  Maximizing precision and recall simultaneously
  The information need remains static
  The value is in the resulting document set

[Figure: diagram of the standard IR model. Labels: User’s Information Need, text input, Parse, Query, Collections, Pre-process, Index, Rank or Match, Query Reformulation]
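
The first assumption of the standard model is stated in terms of precision and recall. As a reminder of what those two numbers measure, here is a minimal sketch for a single query. The document ids and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of the retrieved documents that are relevant
    recall    = fraction of the relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Returning more documents can only keep recall the same or raise it, but it usually lowers precision, which is why maximizing both at once is a strong assumption.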


“Berry-Picking” as an Information Seeking Strategy (Bates 90)

Standard IR model
  The information need remains the same throughout the search session.
  Goal is to produce a perfect set of relevant docs.
Berry-picking model
  The query is continually shifting.
  Users may move through a variety of sources.
  New information may yield new ideas and new directions.
  The value of search is on the bits and pieces picked up along the way.


A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 90)

[Figure: a search path passing through successive queries Q0, Q1, Q2, Q3, Q4, Q5]


Implications

Interfaces should make it easy to store intermediate results

Interfaces should make it easy to follow trails with unanticipated results

Difficulties with evaluation


Supporting the Information Seeking Process

Two recent similar approaches that focus on supporting the process
  SketchTrieve (Hendry & Harper 97)
  DLITE (Cousins 97)


Informal Interface

Informal does not mean less useful
Show how the search is
  unfolding or evolving
  expanding or contracting
Prompt the user to
  reformulate and abandon plans
  backtrack to points of task deferral
  make side-by-side comparisons
  define and discuss problems


SketchTrieve: An Informal Interface (Hendry & Harper 96, 97)

A “spreadsheet” for information access
Make use of layout, space, and locality
  comprehension and explanation
  search planning
A data-flow notation for information seeking
  link sources to queries
  link both to retrieved documents
  align results in space for comparison


SketchTrieve: Connecting Results with Next Query


DLITE (Cousins 97)

Drag and Drop interface
Reify queries, sources, retrieval results
Animation to keep track of activity


Starting Points for Search

Faced with a prompt or an empty entry form … how to start?
  Lists of sources
  Overviews
    Clusters
    Category Hierarchies/Subject Codes
    Co-citation Links
  Examples
  Automatic source selection


List of Sources

Have to guess based on the name
Requires prior exposure/experience


Overviews in the User Interface

Unsupervised Groupings
  Clustering
  Kohonen Feature Maps
Supervised Categories
  Yahoo!
  Superbook
  HiBrowse
  Cat-a-Cone
Combinations
  DynaCat
  SONIA


Text Clustering

Finds overall similarities among groups of documents

Finds overall similarities among groups of tokens

Picks out some themes, ignores others


Text Clustering

Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw

[Figure: scatter plot with axes Term 1 and Term 2]




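Document/Document Matrix

$$
\begin{array}{c|cccc}
       & D_1    & D_2    & \cdots & D_n    \\ \hline
D_1    & d_{11} & d_{12} & \cdots & d_{1n} \\
D_2    & d_{21} & d_{22} & \cdots & d_{2n} \\
\vdots & \vdots & \vdots &        & \vdots \\
D_n    & d_{n1} & d_{n2} & \cdots & d_{nn}
\end{array}
$$

$d_{ij}$ = similarity of $D_i$ to $D_j$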
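
One way to fill in such a matrix is sketched below. The slide does not say which similarity measure is used. This sketch assumes cosine similarity over raw term counts, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(docs):
    """Return the n x n matrix whose entry [i][j] is the similarity of doc i to doc j."""
    vectors = [Counter(doc.lower().split()) for doc in docs]
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Toy documents (assumed, not from the slides).
docs = ["star galaxy telescope", "galaxy star cluster", "film star actor"]
d = similarity_matrix(docs)
for row in d:
    print(["%.2f" % value for value in row])
```

With a symmetric measure like this one, only the entries above the diagonal need to be computed and stored.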


Agglomerative Clustering

[Figure: items A B C D E F G H I grouped bottom-up, shown in two animation steps]
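
The bottom-up grouping in the figure can be sketched as follows. The slides do not specify how the similarity between two groups is defined. This sketch assumes group-average linkage over a precomputed similarity matrix, and the 4 x 4 matrix is invented for illustration.

```python
def agglomerate(sim, k=1):
    """Merge clusters bottom-up until k clusters remain.

    sim is an n x n similarity matrix (higher means more similar).
    Returns (clusters, merges), where clusters holds lists of item
    indices and merges records each merge as (left, right, similarity).
    """
    clusters = [[i] for i in range(len(sim))]
    merges = []

    def average_similarity(a, b):
        # Group-average linkage: mean similarity over all cross pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = average_similarity(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merges.append((clusters[x], clusters[y], s))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical similarities: items 0 and 1 are close, as are items 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters, merges = agglomerate(sim, k=2)
print(clusters)
print(merges)
```

Running to `k=1` and reading `merges` in order gives the full tree. Each round scans every pair of clusters, which is why the K-means slide applies this step only to a small sample.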


K-Means Clustering

1 Create a pair-wise similarity measure
2 Find K centers using agglomerative clustering
    take a small sample
    group bottom up until K groups found
3 Assign each document to nearest center, forming new clusters
4 Repeat 3 as necessary
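
The four steps above can be sketched as follows. Several details are assumptions, since the slide leaves them open: documents are represented as points (tuples of numbers), similarity is negative Euclidean distance, groups are compared by their centroids during seeding, and centers are recomputed as cluster centroids before each repeat of step 3.

```python
import math
import random


def similarity(a, b):
    # Step 1: pair-wise similarity (assumed: negative Euclidean distance).
    return -math.dist(a, b)


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def seed_centers(points, k, sample_size, rng):
    # Step 2: take a small sample and group bottom-up until k groups remain.
    sample = rng.sample(points, min(sample_size, len(points)))
    groups = [[p] for p in sample]
    while len(groups) > k:
        best = None
        for x in range(len(groups)):
            for y in range(x + 1, len(groups)):
                s = similarity(centroid(groups[x]), centroid(groups[y]))
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        groups[x] = groups[x] + groups[y]
        del groups[y]
    return [centroid(g) for g in groups]


def assign(points, centers):
    # Step 3: assign each point to its nearest (most similar) center.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = max(range(len(centers)), key=lambda c: similarity(p, centers[c]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, sample_size=10, max_rounds=20, seed=0):
    rng = random.Random(seed)
    centers = seed_centers(points, k, sample_size, rng)
    clusters = assign(points, centers)
    for _ in range(max_rounds):
        # Step 4: recompute centers (assumption) and repeat step 3
        # until the centers stop moving.
        new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
        clusters = assign(points, centers)
    return clusters, centers


# Toy data: two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centers = k_means(points, k=2)
print(clusters)
```

Seeding from a sample keeps the expensive agglomerative step small, while the assignment step touches every document only once per round.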


The Cluster Hypothesis

“Closely associated documents tend to be relevant to the same requests.”

van Rijsbergen 1979

“… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.”

van Rijsbergen 1979


Clustering as Categorization

“In a traditional library environment … the items are classified first into subject areas, and a search is restricted to items within a few chosen subject classes. The same device can also be used … [to construct] groups of related documents and confining the search to certain groups only.”

Salton 71


Clustering as Categorization

“… In experiments we often want to vary the cluster representatives at search time. … Of course, were we to design an operational classification, the cluster representatives would be constructed once and for all at cluster time.”

van Rijsbergen 79


Scatter/Gather

Cutting, Pedersen, Tukey & Karger 92, 93

Hearst & Pedersen 95

Cluster sets of documents into general “themes”, like a table of contents

Display the contents of the clusters by showing topical terms and typical titles

User chooses subsets of the clusters and re-clusters the documents within

Resulting new groups have different “themes”

[Figure labels: query, Collection, Cluster, Rank]
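
A rough sketch of the display step (topical terms and typical titles for one cluster) is given below. It is not the published Scatter/Gather procedure. It assumes that topical terms are the most frequent terms in the cluster and that typical titles belong to the documents closest to the cluster centroid by cosine similarity. The toy cluster is invented, and a real system would at least remove stopwords first.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(cluster, n_terms=3, n_titles=2):
    """Summarize a cluster given as a list of (title, text) pairs.

    Returns (topical_terms, typical_titles).
    """
    vectors = [Counter(text.lower().split()) for _, text in cluster]
    centroid = Counter()
    for vector in vectors:
        centroid.update(vector)  # summed term counts stand in for the centroid
    topical_terms = [term for term, _ in centroid.most_common(n_terms)]
    ranked = sorted(
        range(len(cluster)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    typical_titles = [cluster[i][0] for i in ranked[:n_titles]]
    return topical_terms, typical_titles


# Hypothetical cluster of encyclopedia entries.
cluster = [
    ("Supernova", "star explosion star"),
    ("Galaxy", "star galaxy star"),
    ("Comet", "ice tail"),
]
terms, titles = summarize(cluster)
print(terms, titles)
```

The gather step then amounts to concatenating the documents of the clusters the user selects and clustering that smaller set again.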


S/G Example: query on “star”

Encyclopedia text
  14 sports
  8 symbols
  47 film, tv
  68 film, tv (p)
  7 music
  97 astrophysics
  67 astronomy (p)
  12 stellar phenomena
  10 flora/fauna
  49 galaxies, stars
  29 constellations
  7 miscellaneous

Clustering and re-clustering is entirely automated


Two Queries: Two Clusterings

AUTO, CAR, ELECTRIC
AUTO, CAR, SAFETY

The main differences are the clusters that are central to the query

8 control drive accident …

25 battery california technology …

48 import j. rate honda toyota …

16 export international unit japan

3 service employee automatic …

6 control inventory integrate …

10 investigation washington …

12 study fuel death bag air …

61 sale domestic truck import …

11 japan export defect unite …


Publication History of Scatter/Gather

1991       Patents Filed
SIGIR 92   Initial Algorithm Introduced
SIGIR 93   Optimizations Presented
AAAIFS 95  Examples of Use on Retrieval Results
TREC 95    Use in Interactive Track Experiments
CHI 96     Experiments providing evidence that users learn collection structure
SIGIR 96   Evidence that clustering can improve ranking for TREC-like scenario

(Publication timing may lag significantly behind when the work was done)


Another use of clustering

Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.

“Project” these onto a 2D graphical representation:


Clustering Multi-Dimensional Document Space
(image from Wise et al 95)


Concept “Landscapes”

[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]

Built using Kohonen Feature Maps
Xia Lin, H.C. Chen


Visualization of Clusters

Huge 2D maps may be an inappropriate focus for information retrieval
  Can’t see what documents are about
  Documents forced into one position in semantic space
  Space is difficult to use for IR purposes
  Hard to view titles
Perhaps more suited for pattern discovery
  problem: often only one view on the space


Using Clustering in Document Ranking

Cluster entire collection
Find cluster centroid that best matches the query
This has been explored extensively
  it is expensive
  it doesn’t work well
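
The centroid-matching step can be sketched as follows. It assumes the collection has already been clustered, that a centroid is the sum of its documents' term counts, and that the match is scored by cosine similarity. The two toy clusters and the query are invented for illustration.

```python
import math
from collections import Counter


def to_vector(text):
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(docs):
    total = Counter()
    for doc in docs:
        total.update(to_vector(doc))
    return total


def best_cluster(query, clusters):
    """Return (index, scores): the cluster whose centroid best matches the query."""
    q = to_vector(query)
    scores = [cosine(q, centroid(docs)) for docs in clusters]
    index = max(range(len(clusters)), key=lambda i: scores[i])
    return index, scores


# Hypothetical pre-clustered collection.
clusters = [
    ["star galaxy telescope", "galaxy cluster star"],
    ["film star actor", "actor film award"],
]
index, scores = best_cluster("film actor", clusters)
print(index, scores)
```

A relevant document that lands in a different cluster is never scored against the query in this scheme, which is one way the approach can go wrong.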


Using Clustering in Interfaces

Alternative (scatter/gather):
  cluster top-ranked documents
  show cluster summaries to user
Seems useful
  experiments show relevant docs tend to end up in the same cluster
  users seem able to interpret and use the cluster summaries some of the time
More computationally feasible

Clustering

Advantages:
  Sometimes discover meaningful themes
  Data-driven, so reflect emphases present in the collection of documents
  Can differentiate heterogeneous collections
  Domain independent
Disadvantages:
  Variability in quality of results
  Only one view on documents’ themes
  Not good at differentiating homogeneous collections
  Require interpretation
  May mis-match users’ interests


Incorporating Categories into the Interface

Yahoo is the standard method
Problems:
  Hard to search, meant to be navigated.
  Only one category per document (usually)


Integrated Browsing & Search

Search for category labels
Browse category labels
Search within document collection
Browse resulting documents in book


Example: MeSH and MedLine

MeSH Category Hierarchy
  ~18,000 labels
  manually assigned
  ~8 labels/article on average
  avg depth: 4.5, max depth 9
Top Level Categories:
  anatomy   diagnosis   related disc
  animals   psych       technology
  disease   biology     humanities
  drugs     physics


Large Category Sets

Problems for User Interfaces
  Too many categories to browse
  Too many docs per category
  Docs belong to multiple categories
  Need to integrate search
  Need to show the documents

We’ll discuss this more next week.


Category Labels

Advantages:
  Interpretable
  Capture summary information
  Describe multiple facets of content
  Domain dependent, and so descriptive
Disadvantages:
  Do not scale well (for organizing documents)
  Domain dependent, so costly to acquire
  May mis-match users’ interests


Other Starting Points Approaches

Co-citation Links
Examples, Guided Tours


Next Week

Interfaces for Subject Codes/Category Hierarchies

Leader: Alison Brandt
