dynamic search using semantics & statistics

Text Mining - Bayesian Topic Modeling for Interactive Retrieval at SAP and Cisco Ram Akella University of California and Stanford With Karla Caballero, Maria Daltayanni, Chunye Wang - UCSC and Paul Hofmann SAP Labs October 6, 2011 SAP

Upload: paul-hofmann

Post on 20-Jun-2015


DESCRIPTION

This presentation shows 3 applications of successfully combining semantics and statistics for text mining and interactive search. 1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies). 2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse. 3) We discover user intent for interactive retrieval. User intent is defined as a latent state. The observations of this latent state are the reformulated query sequence, and the retrieved documents, together with the positive or negative feedback provided by the user. Demo shows recognizing user’s intent for health care search.

TRANSCRIPT

Page 1: Dynamic Search Using Semantics & Statistics

Text Mining - Bayesian Topic Modeling for Interactive Retrieval

at SAP and Cisco

Ram Akella, University of California and Stanford

With Karla Caballero, Maria Daltayanni, Chunye Wang (UCSC) and Paul Hofmann (SAP Labs)

October 6, 2011 SAP

Page 2: Dynamic Search Using Semantics & Statistics

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo


Page 4: Dynamic Search Using Semantics & Statistics

Motivation

[Figure: two users issue the same SEARCH with different needs.]

• DOCTOR: Depression treatment of patients…
• SOCIAL SCIENTIST: Depression influence on family relationships…

Query sequence:
– q1: elderly depression
– q2: depression symptoms
– q3: symptoms and treatment

The user expects to find more relevant results each time she interacts with the system.

The relevance of the presented documents depends on the user context.

Page 5: Dynamic Search Using Semantics & Statistics

Interactive Retrieval Model

[Diagram components: Query; User Feedback; Feedback and propagation to similar documents; Information need update; Document Collection; Metadata Generation System; Interactive Retrieval System.]

Page 6: Dynamic Search Using Semantics & Statistics

Interactive Retrieval Model

[Diagram components: Query; User Feedback; Feedback and propagation to similar documents; Information need update; Document Collection; Interactive Retrieval System; Metadata Generation System.]

Metadata Generation System: adds to each document metadata that facilitates the retrieval process. This metadata consists of:

1. Statistical topic mixture
2. Knowledge extraction based on the business process (problem, cause, solution)

Page 7: Dynamic Search Using Semantics & Statistics

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
– Motivation
– Related Work
– Proposed Approach
– Topic Modeling and Entity Association
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo

Page 8: Dynamic Search Using Semantics & Statistics

Topic Modeling: Motivation

• Given a set of documents, we want to identify the main areas or topics discussed, in an unsupervised manner. We take advantage of the semantic associations between words across the documents: if two words appear in the same document, they should be related.

• Each topic has its own distribution over words, and each document might contain material about a variety of topics.

[Figure: a document shown as a mixture of topics: Topic 1 (80%) Sports, Topic 2 (5%), Topic 3 (20%) Common Words. Labels: Play, Music, Sports. Example words: ball, net, racquet, game; notes, instrument.]
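The idea that a document mixes several topics, each with its own word distribution, can be sketched in a few lines. The topic names, word probabilities, and mixture weights below are illustrative assumptions, not values from the talk.

```python
# Hypothetical topic-word distributions (illustrative numbers only).
topics = {
    "sports": {"ball": 0.4, "net": 0.3, "game": 0.2, "play": 0.1},
    "music": {"notes": 0.5, "instrument": 0.4, "play": 0.1},
}

# Hypothetical topic mixture for one document; the weights sum to 1.
mixture = {"sports": 0.8, "music": 0.2}


def word_probability(word, mixture, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0)
               for name, weight in mixture.items())


for word in ("ball", "play", "notes"):
    print(word, round(word_probability(word, mixture, topics), 3))
```

A word such as "play" receives probability from both topics, which is why a mixture model can handle words that are shared across subject areas.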

Page 9: Dynamic Search Using Semantics & Statistics

Related Work

Models compared: LDA [2003], Correlated Topics [2005], Pachinko Allocation Model [2006], Our Model (GD-LDA)

Comparison criteria:
• Complexity based on # of topics K: K, 2K
• Speed
• Scalable
• Handles topic correlations
• Effective topic selection and truncation

Page 10: Dynamic Search Using Semantics & Statistics

Our Approach

– The higher probability mass is accommodated in the upper part of the tree (this facilitates the truncation and reduction in the number of topics).

– We can define a method to determine the number of topics suitable for a particular dataset without training the model several times (each time for a given number of specified topics).

[Figure: topic tree with example topics and probability masses. Topics: (bush, campaign, mccain, bradley, republican, candidate); (film, show, music, movie, story, play); (company, percent, stock, market, price, rate); (patient, disease, people, study, medic, health); (peace, talk, syrian, clinton, syria, golan). Masses shown: 0.0096, 0.0146, 0.0310, 0.0660, 0.0851.]
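One way to picture truncation: if most of the probability mass sits in a few topics, the remaining ones can be dropped after a single training run. The sketch below is only an illustration of that idea, not the GD-LDA procedure; the function name `topics_to_keep`, the coverage threshold, and the masses are assumptions.

```python
def topics_to_keep(masses, coverage=0.9):
    """Return how many of the largest topics cover `coverage` of the total mass."""
    ordered = sorted(masses, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, mass in enumerate(ordered, start=1):
        running += mass
        if running >= coverage * total:
            return count
    return len(ordered)


# Hypothetical per-topic probability masses from one trained model.
masses = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]
print(topics_to_keep(masses, coverage=0.85))
```

The threshold is a design choice: a higher coverage keeps more topics and loses less mass, a lower one gives a smaller model.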

Page 11: Dynamic Search Using Semantics & Statistics

Experimental Setup

The datasets are of two types:

• Scientific articles (NIPS)
– Longer documents
• News data (NYT, APW, XIE)
– Shorter documents
– More diverse vocabulary

• We compare the performance of the algorithm against three approaches in the literature: LDA, CTM and Pachinko.

• We test our model using Empirical Likelihood.
– This method estimates how likely it is that a test document will be generated from the estimated model.
– We want this value to be high (better generalization and applicability to unseen documents).

Dataset | NIPS | NYT | APW | XIE
# documents | 1840 | 5553 | 4954 | 5275
# unique terms | 13649 | 11229 | 6955 | 3890
Doc length | 1322 | 274 | 170 | 81
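As a rough illustration of the Empirical Likelihood measure, the sketch below samples topic mixtures from the model's prior and averages the probability each one assigns to a test document. It assumes a plain LDA-style model with a Dirichlet prior; the function names, the toy topics, and the sample count are assumptions, and the evaluation code used in the talk is not shown in the slides.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one topic mixture from a Dirichlet prior using gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def empirical_log_likelihood(doc, topics, alpha, n_samples=1000, seed=0):
    """Log of the average probability of `doc` over sampled topic mixtures."""
    rng = random.Random(seed)
    log_probs = []
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        log_p = 0.0
        for word in doc:
            p = sum(t * topic.get(word, 0.0) for t, topic in zip(theta, topics))
            log_p += math.log(p) if p > 0 else float("-inf")
        log_probs.append(log_p)
    peak = max(log_probs)
    if peak == float("-inf"):
        return peak
    # Log-sum-exp keeps the average numerically stable for long documents.
    return peak + math.log(sum(math.exp(lp - peak) for lp in log_probs) / n_samples)


# Hypothetical two-topic model and a short test document.
toy_topics = [{"ball": 0.9, "notes": 0.1}, {"ball": 0.1, "notes": 0.9}]
print(empirical_log_likelihood(["ball", "ball", "notes"], toy_topics, [1.0, 1.0]))
```

Higher (less negative) values mean the model assigns more probability to unseen documents, which is the sense in which the slide asks for this value to be high.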

Page 12: Dynamic Search Using Semantics & Statistics

Results: NYT Dataset

We obtain the topic mixture for the NYT dataset using K=20 topics.

[Figure: example topics, with links between topics marked + or -.]

• year, live, century, museum, people, music, time, star, book
• story, p, journal, constitution, time, editor, budget, york
• military, war, nuclear, president, politic, chechnya, power, soviet
• internet, information, technology, service, i, people, e, busy, work, mail
• computer, make, hand, system, tv, people, program, network, dont, drive, call
• study, patient, people, univers, disease, medic, increase, care, state
• bank, problem, economy, system, investor, percent, price, investment, economist, financial
• drug, state, united, talk, nato, clinton, american

Page 13: Dynamic Search Using Semantics & Statistics

Results: Empirical Likelihood

[Figure: empirical likelihood plots for the APW, NIPS, NYT, and XIE datasets; one curve is labeled "Our Model".]

Page 14: Dynamic Search Using Semantics & Statistics

Results: Running Time

[Figure: running time in minutes for the APW, NIPS, NYT, and XIE datasets; one curve is labeled "Our Model".]

Page 15: Dynamic Search Using Semantics & Statistics

Illustrative Example: NYT Dataset

NORTHRIDGE TAUGHT A LESSON

LOS ANGELES _ School has been out at Cal State Northridge since the week before Christmas, but since you can learn something everyday, Mississippi State's women's basketball team gave a lesson. Northridge has talked about taking its game to the next level. The 21st-ranked Bulldogs _ the first nationally ranked team to play here in Northridge's Division I era _ gave a glimpse of that level in a 98-64 nonconference victory before a crowd of 165 Friday night.


Page 19: Dynamic Search Using Semantics & Statistics

Topic Modeling & Entity Association

This work was presented at SAPPHIRE NOW 2010.

[Diagram components:]

• Base knowledge source: Valukas Report about why Lehman Brothers failed (6 volumes)
• UCSC Topic Mining System: Topics
• SAP Business Objects Entity Extractor: Entities
• Saffron Associative Memory Base: Saffron Associative Memory creates associations among entities and topics
• Text data to be monitored
• Query: we would like to know who the actors are that were involved in a particular action that led to the failure of Lehman Brothers
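To make "associations among entities and topics" concrete, the sketch below counts how often an entity and a topic are observed in the same passage. This is a plain co-occurrence count used purely as an illustration; it is not how Saffron's associative memory is implemented, and the entity names, topic labels, and the `associate` function are placeholders.

```python
from collections import Counter
from itertools import product


def associate(passages):
    """Count how often each (entity, topic) pair is observed in the same passage."""
    counts = Counter()
    for entities, topics in passages:
        counts.update(product(sorted(set(entities)), sorted(set(topics))))
    return counts


# Hypothetical extractor and topic-model outputs for three passages.
passages = [
    (["Lehman Brothers", "Actor A"], ["liquidity"]),
    (["Actor A"], ["liquidity", "accounting"]),
    (["Actor B"], ["accounting"]),
]
counts = associate(passages)
print(counts.most_common(1))
```

A query such as "which actors are tied to a given topic" then becomes a lookup over the pairs with the highest counts.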

Page 20: Dynamic Search Using Semantics & Statistics

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
– Knowledge Extraction System
– System Architecture
– Domain Knowledge
– Improving Productivity
– Performance of Service Request Recommender
• Interactive Retrieval
• Interactive Retrieval Demo

Page 21: Dynamic Search Using Semantics & Statistics

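Knowledge Extraction System at Cisco

[Diagram: Service Request Database; Service Request (unstructured text); Text Mining System; Knowledge; Knowledge Database; Applications such as retrieval.]

The Text Mining System answers three questions about each service request and labels the text accordingly:

• What was the problem? (Problem)
• Why did it occur? (Cause)
• How was it solved? (Solution)
• Everything else: Irrelevant Content

Finding different solutions to the same problem: Document 1 and Document 2 are each split into Problem, Cause, and Solution, and similarity is measured per field (values shown in the diagram: high, high, low).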
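Field-level comparison can be sketched as follows: two service requests describe the same problem with different solutions when their Problem fields are similar and their Solution fields are not. Jaccard word overlap, the thresholds, and the example texts are assumptions chosen for brevity; the similarity measure used in the Cisco system is not given on the slide.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def alternative_solution(doc1, doc2, high=0.5, low=0.2):
    """True when two requests share a problem but describe different solutions."""
    return (jaccard(doc1["problem"], doc2["problem"]) >= high
            and jaccard(doc1["solution"], doc2["solution"]) <= low)


# Hypothetical labeled service requests.
doc1 = {"problem": "router does not respond to arp request",
        "solution": "replace the line card"}
doc2 = {"problem": "router does not respond to arp request on vlan",
        "solution": "upgrade software image"}
print(alternative_solution(doc1, doc2))
```

Comparing whole documents would blur this distinction, which is one reason to label paragraphs before measuring similarity.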

Page 22: Dynamic Search Using Semantics & Statistics

System Architecture

[Diagram components: Service Request; Preprocessor; Bag-of-words; Domain Knowledge; Expertise; Feature Generator; Hierarchical Classifier; Labeled Paragraphs; Service Request Recommender; User. Legend: data flow of Analyzer; data flow of Recommender; data output for User.]

Features from Expertise (feature: class and motivation)

• Statistical features
– Length of paragraph: short paragraphs are usually irrelevant.
– Relative position of a paragraph in a service request: service requests have the hidden process "problem → cause → solution".
– Number of "%": error codes (relevant) begin with "%".

• Contextual features
– Contain "Hi", "Hello", "my name", or "I'm": introduction, irrelevant.
– Contain "feel free", "to contact", or "have a ... day"; begin with "Best" or "Thank": salutation, irrelevant.
– Telephone number, zip code, or affiliation: contact information, irrelevant.

• Hint words
– Contain "problem", "error message" or "symptom": Problem.
– Contain "suspect", "seem", "looks like", "indicate", "try", "test", or "check": Troubleshooting.
– Contain "recommend", "suggest", "replace", "reseat", "RMA", or "workaround": Solution.

• Lexical features
– Number of words from domain dictionary: usually relevant.
– Product name: usually relevant.
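A few of the features listed above can be computed with very little code. The sketch below is a simplified stand-in for the Feature Generator: the function name, the feature keys, and the crude substring matching for hint words are assumptions, and it covers only part of the table.

```python
import re

# Hint words taken from the slide, grouped by the class they suggest.
HINTS = {
    "problem": ("problem", "error message", "symptom"),
    "troubleshooting": ("suspect", "seem", "looks like", "indicate",
                        "try", "test", "check"),
    "solution": ("recommend", "suggest", "replace", "reseat", "rma",
                 "workaround"),
}
GREETINGS = ("hi", "hello", "my name", "i'm")


def paragraph_features(text, index, total):
    """Features for the paragraph at position `index` out of `total` paragraphs."""
    lowered = text.lower()
    features = {
        "length": len(text.split()),
        # 0.0 for the first paragraph, 1.0 for the last one.
        "relative_position": index / max(total - 1, 1),
        "percent_count": text.count("%"),
        "has_greeting": any(
            re.search(r"\b" + re.escape(g) + r"\b", lowered) for g in GREETINGS
        ),
    }
    for label, words in HINTS.items():
        # Substring matching is crude ("try" also matches "country").
        features["hint_" + label] = any(w in lowered for w in words)
    return features


print(paragraph_features("We recommend an RMA of the card.", 4, 5))
```

The resulting dictionary could feed any standard classifier; the slide names a hierarchical classifier, whose details are not shown.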

Page 23: Dynamic Search Using Semantics & Statistics

Domain Knowledge

Measuring similarity

• Internetworking Terms and Acronyms Dictionary (ITAD)
• Benefits: (1) the expansion of acronyms and terminology; (2) the enhancement of concept dependencies.
• Example:

Snippet from Doc1: The phone boots up and it does a DHCP [Dynamic Host Configuration Protocol. Provides a mechanism for allocating IP addresses dynamically so that addresses can be reused when hosts no longer need them] request in the native VLAN [virtual LAN]. There it gets an IP address [32-bit address assigned to hosts using TCP/IP] and an option that it needs to boot up in the VLAN 40 and that it need to go in trunking [physical and logical connection between two switches across which network traffic travels] mode.

Snippet from Doc2: Host Server with 2 interfaces [connection between two systems or devices] and one default gateway. When ping Vlan-B [virtual LAN] interface an ARP [Address Resolution Protocol. Internet protocol used to map an IP address to a MAC address] request with a source IP of Vlan-B is sent to Default Router [network layer device that uses one or more metrics to determine the optimal path along which network traffic should be forwarded. Routers forward packets from one network to another based on network layer information] on Vlan-A, but Router does not respond to ARP request.

Legend: […] = explanation from ITAD. Blue: overlapping words between unexpanded excerpts. Red: overlapping words introduced by ITAD.
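The effect of dictionary expansion on word overlap can be sketched with a tiny stand-in glossary. The glosses below are shortened from the slide's ITAD explanations; the `GLOSSARY` structure, the helper names, and the shortened example sentences are assumptions.

```python
# Tiny stand-in for a domain dictionary; glosses abbreviated from the slide.
GLOSSARY = {
    "dhcp": "dynamic host configuration protocol allocating ip addresses",
    "arp": "address resolution protocol map ip address to mac address",
    "vlan": "virtual lan",
}


def tokens(text):
    """Lowercased word set with basic punctuation removed."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def expand(text):
    """Add the glossary definition of every known term found in the text."""
    words = tokens(text)
    for term, gloss in GLOSSARY.items():
        if term in words:
            words |= tokens(gloss)
    return words


doc1 = "The phone does a DHCP request in the native VLAN."
doc2 = "Router does not respond to ARP request."
plain = tokens(doc1) & tokens(doc2)
expanded = expand(doc1) & expand(doc2)
print(sorted(plain), sorted(expanded - plain))
```

The second list holds the overlap that appears only after expansion, which corresponds to the words marked in red on the slide.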

Page 24: Dynamic Search Using Semantics & Statistics

Improving Productivity

Compare the time spent by engineers in reading service requests before and after using our system.

[Flowchart steps: Browse a service request; Relevant? (Y/N); Read and understand thoroughly; Read enough? (Y/N); Create knowledge article. The first stage is the time to assess relevance; the second is the time to extract knowledge.]

 | Time to assess relevance | Time to extract knowledge
Before using system | 27 minutes | 97 minutes
After using system | 11 minutes | 67 minutes
Productivity improved by | 145% | 45%
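The percentages in the last row follow from treating productivity as work done per unit time, so the gain is the ratio of the two times minus one. The function name below is an assumption; the inputs are the minutes reported on the slide.

```python
def productivity_gain(before_minutes, after_minutes):
    """Percentage increase in tasks completed per unit time."""
    return (before_minutes / after_minutes - 1) * 100


print(round(productivity_gain(27, 11)))  # relevance stage: 145
print(round(productivity_gain(97, 67)))  # knowledge extraction stage: 45
```

Note that this is not the same as the reduction in time: going from 27 to 11 minutes is a 59% time saving but a 145% productivity gain.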

Page 25: Dynamic Search Using Semantics & Statistics

Performance of Service Request Recommender

Result 1: Both the deterministic and the probabilistic models achieved much better results when labeled paragraphs were used; this validates our hypothesis of the inherent diagnostic business process.

Result 2: Using domain knowledge further improves retrieval results.

Result 3: The probabilistic recommender outperformed the deterministic recommender.

 | Baseline | Our Method
Retrieval models | Deterministic model | Probabilistic model
Information | The whole document | The semantically labeled paragraphs
Domain knowledge | None | Dictionary

[Figure: results by retrieval scheme; "Our Method" is labeled.]

Page 26: Dynamic Search Using Semantics & Statistics

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
– Problem
– Reinforcement Learning Formulation
– How many interaction steps needed
– How much feedback is needed
– Interactive Retrieval Using Topic Modeling
• Interactive Retrieval Demo

Page 27: Dynamic Search Using Semantics & Statistics

Interactive Retrieval

• Model the user intent to retrieve relevant documents
• Identify the trade-off between
– Retrieval accuracy (how accurate are the results required to be by the user?)
– Interaction time (how much time is the user willing to spend on interaction?)
• Applied to
– Medical document retrieval, e.g., search for past patient cases with similar symptoms
– Resume retrieval in a labor marketplace, e.g., search for Python developers who work in machine learning

[Figure: trade-off scale from MORE IMPORTANT to LESS IMPORTANT.]

Page 28: Dynamic Search Using Semantics & Statistics

Problem

What is the best path to choose?

[Diagram: paths from User Intent to the Set of Relevant Documents over steps t1, t2, t3, …, tn, shown for three strategies: Static, Myopic, and Dynamic. The Dynamic case is labeled with Dynamic Programming and Reinforcement Learning.]

Page 29: Dynamic Search Using Semantics & Statistics

Reinforcement Learning Formulation of IIR

• Agent: IIR system
• Environment: User
• Intent: best guess for user intent or need (expressed in query terms)
• Action: ranking $R_k$
• Reward: improvement $v(R_k) - v(R_{k-1})$ (as observed from user feedback)
• Objective: maximize the sum of rewards
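The reward definition can be checked with a small sketch. Because each reward is a difference of consecutive values, the sum of rewards over a session telescopes to the value of the final ranking minus the value of the initial one. The values below are hypothetical, and the function name is an assumption.

```python
def rewards(values):
    """Reward at step k is v(R_k) - v(R_{k-1})."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Hypothetical values v(R_0), ..., v(R_3), e.g. precision@10 after each step.
values = [0.25, 0.5, 0.5, 0.75]
step_rewards = rewards(values)
print(step_rewards, sum(step_rewards))
```

Maximizing the sum of rewards therefore favors the sequence of rankings that ends with the most valuable result list, even if an intermediate step shows no gain.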

Page 30: Dynamic Search Using Semantics & Statistics

Experimental Setup

• Dataset: TREC-9 OHSUMED, 348,566 medical documents
– with a list of relevance judgments
• 65 user queries
– query title: 2 to 5 words
– query description: 5 to 10 words
• Interactive sessions of 3 to 5 steps
• Relevance function is binary
• Value of results (with appropriate weights wi)
– Precision @10: percentage of relevant documents in the top-10 results
– We compare our results with pseudo-relevance feedback
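Precision @10 with binary relevance is simple to compute. The sketch below uses hypothetical document ids; the function name and signature are assumptions.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are judged relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


# Hypothetical ranking of 20 documents and a set of relevance judgments.
ranked = list(range(1, 21))
relevant = {2, 3, 5, 7, 11, 13}
print(precision_at_k(ranked, relevant))
```

Only documents 2, 3, 5, and 7 fall inside the top 10 here, so relevant documents ranked 11th or lower do not count toward the score.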

Page 31: Dynamic Search Using Semantics & Statistics

How many interaction steps needed?


Page 32: Dynamic Search Using Semantics & Statistics

How much feedback is needed?

[Figure: precision @10 (y-axis, 0.60 to 0.85) versus # of documents on which feedback is provided per step (x-axis, 1 to 7).]

Experiments tested on the 348,566-document OHSUMED medical dataset, TREC 2002.

Page 33: Dynamic Search Using Semantics & Statistics

Interactive Retrieval with Topic Modeling

• Topics help us to reduce the search
– They add context to the query
– Some important terms to describe the user's intent may not be included in the query
– Topics are calculated a priori and added to each document as metadata

[Diagram: the meta-query (combination of user inputs) is a combination of terms and topic relevance scores, drawing on the topic mixture of relevant docs and the topic mixture of non-relevant docs. It is updated each time the user provides feedback (clicks) or additional information to the system (query redefinition).]
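One way to combine term matching with topic relevance is sketched below: a document scores higher when it shares query terms and when its topic mixture is closer to the documents marked relevant than to those marked non-relevant. The linear combination, the cosine measure, the weight, and all names are assumptions; the slides do not give the scoring formula. The sketch assumes at least one relevant and one non-relevant document have already been marked.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-mixture vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_mixture(mixtures):
    """Average topic mixture of a non-empty list of documents."""
    return [sum(col) / len(mixtures) for col in zip(*mixtures)]


def score(doc, query_terms, relevant, non_relevant, weight=0.5):
    """Blend term overlap with topic closeness to the feedback documents."""
    term_score = len(query_terms & doc["terms"]) / len(query_terms)
    topic_score = (cosine(doc["topics"], mean_mixture(relevant))
                   - cosine(doc["topics"], mean_mixture(non_relevant)))
    return weight * term_score + (1 - weight) * topic_score


# Hypothetical feedback: topic mixtures of clicked and skipped documents.
relevant = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
non_relevant = [[0.1, 0.8, 0.1]]
query = {"fever", "cpk"}
doc_a = {"terms": {"fever", "cpk", "treatment"}, "topics": [0.9, 0.05, 0.05]}
doc_b = {"terms": {"fever", "cpk", "study"}, "topics": [0.1, 0.85, 0.05]}
print(score(doc_a, query, relevant, non_relevant) > score(doc_b, query, relevant, non_relevant))
```

Both documents match the query terms equally, so the topic component is what separates them; this is the sense in which topics add context that the query terms alone do not carry.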

Page 34: Dynamic Search Using Semantics & Statistics

Proposed Dataset

• We test our approach using the HARD TREC queries, whose collection consists of:
– 851,018 news documents from the NYT, APW and XIE agencies
– Each document has an average length of 305 terms
– There are 496,779 unique terms
– We infer the topic information of the corpus using 75 topics
– For testing purposes we use m=3 interactions
– We use 30 test queries
– We compare our algorithm with mixture relevance feedback

Page 35: Dynamic Search Using Semantics & Statistics

Preliminary Results

[Figure: precision (y-axis, 0.1 to 1) versus number of interactions (x-axis, 1 to 3); series: Mixture, State Based.]

Page 36: Dynamic Search Using Semantics & Statistics

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo

Page 37: Dynamic Search Using Semantics & Statistics

Example User Intent

• young female with fevers and increased CPK (creatine phosphokinase)
– CPK: enzyme; elevated levels may indicate a heart attack or severe muscle breakdown
• neuroleptic malignant syndrome (life-threatening neurological disorder)
– Associated with CPK
– Symptoms: muscular cramps, fever, unstable blood pressure, changes in cognition, including agitation, delirium and coma
• differential diagnosis
– List symptoms
– List causes of the symptoms
– Prioritize by the most dangerous
– Treat
• treatment

Page 38: Dynamic Search Using Semantics & Statistics

Relevant Documents

• Non-relevant documents:

Doc 1: Significance of elevated levels of CPK in febrile diseases: a prospective study. The incidence and significance of elevated serum levels of (CPK) in febrile diseases were studied prospectively in all patients admitted with fever to a department of medicine during 1 year.

Doc 2: Metoclopramide-induced neuroleptic malignant syndrome… Symptoms of NMS include rigidity, hyperpyrexia, altered consciousness, and autonomic instability. This syndrome is generally associated with neuroleptic medications used to treat psychotic and major depressive illnesses…

• Relevant document:

Doc 3: Neuroleptic malignant syndrome: guidelines for treatment and reinstitution of neuroleptics… Cardinal symptoms include fever, muscular rigidity, an elevated serum level of creatine phosphokinase, changes in mental status, and autonomic dysfunction…

Page 39: Dynamic Search Using Semantics & Statistics

Interactive Demo

• InteractiveDemo_MedicalData

• Sub-queries
– young female with fevers and increased CPK
– neuroleptic malignant syndrome
– differential diagnosis
– treatment