Machine Learning for Information Retrieval

Hang Li

Noah’s Ark Lab

Huawei Technologies

The Third Asian Summer School in Information Access

Kyoto, Japan, Aug 5, 2016

Outline of Tutorial

• Introduction

• Part 1: Learning to Rank

• Part 2: Learning to Match

• Part 3: Deep Learning for Information Retrieval

• Summary

Overview of Information Retrieval

Information and Knowledge Base

Information Retrieval System

Relevant Result

Intent: Key Words, Question

Content: Documents, Images, Relational Tables

Machine Learning can play an important role

• Key questions: how to represent intent and content, how to match intent and content

• Ranking, indexing, etc. are less essential

• Interactive IR is not particularly considered here

Approach in Traditional IR

Query: star wars the force awakens reviews

Document: Star Wars: Episode VII. Three decades after the defeat of the Galactic Empire, a new threat arises.

$f_{VSM}(q, d) = \dfrac{\langle q, d \rangle}{\|q\| \cdot \|d\|}$

• Representing query and document as tf-idf vectors

• Calculating cosine similarity between them

• BM25, LM4IR, etc. can be considered as non-linear variants

[Diagram: query and document are fed to a single matching function $f(q, d)$]
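The vector space model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document toy corpus, the whitespace tokenization, and the smoothed idf `log((1 + N) / (1 + df)) + 1` are choices made here, not taken from the slides.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Map a token list to a sparse tf-idf vector (dict: term -> weight)."""
    counts = Counter(tokens)
    return {
        # Smoothed idf (an assumption of this sketch) keeps every weight positive.
        term: tf * (math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, tf in counts.items()
    }


def cosine(u, v):
    """Cosine similarity <u, v> / (||u|| * ||v||) between sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus for illustration only.
docs = [
    "star wars episode vii a new threat arises".split(),
    "three decades after the defeat of the galactic empire".split(),
    "reviews of the force awakens".split(),
]
query = "star wars the force awakens reviews".split()

doc_freq = Counter(term for doc in docs for term in set(doc))
doc_vecs = [tfidf_vector(d, doc_freq, len(docs)) for d in docs]
query_vec = tfidf_vector(query, doc_freq, len(docs))
scores = [cosine(query_vec, dv) for dv in doc_vecs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Sorting documents by `scores` is exactly the "ranking = matching" setting of traditional IR: one matching function produces the final order.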

Approach in Modern IR

• Conducting query and document understanding

• Representing query and document as feature vectors

• Calculating multiple matching scores between query and document using learning to match

• Training ranker with matching scores as features using learning to rank

Query: star wars the force awakens reviews

Document: Star Wars: Episode VII. Three decades after the defeat of the Galactic Empire, a new threat arises.

[Diagram: query and document are fed to several matching functions, starting with $f_1(q, d)$]

“Easy” Problems in IR

• Search

– Matching between query and document

• Question Answering from Documents

– Matching between question and answer

• Learning to match & learning to rank: well studied so far

• Deep Learning may not help so much

“Hard” Problems in IR

• Image Retrieval

– Matching between text and image

– Not the same as traditional setting

• Question Answering from Knowledge Base

– Complicated matching between question and fact in knowledge base

• Generation-based Question Answering

– Generating answer to question based on facts in knowledge base

• Not well studied so far

• Deep Learning for Information Retrieval can make a big difference

Part 1. Learning to Rank

1.1. Overview of Learning to Rank

Ranking Plays Key Role in Many Applications

Ranking Problem: Example = Document Retrieval

documents

ranking of documents

$D = \{d_1, d_2, \dots, d_N\}$

$f(q, d)$

ranking based on relevance, importance,

preference

Ranking Problem: Example = Recommender System

Item1 Item2 Item3 ... ItemN

User1 5 4

User2 1 2 2

... ? ? ?

UserM 4 3

Ranking Problem: Example = Machine Translation

[Diagram: sentence in source language → Generative Model → ranked sentence candidates in target language → Re-Ranking Model → re-ranked top sentence in target language]

1.2. Problem and Approaches of Learning to Rank

Ranking Problem: Example = Document Search

documents

$D = \{d_1, d_2, \dots, d_N\}$

$f(q, d)$

ranking based on relevance, importance,

preference

Traditional Approach = Probabilistic Model

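Query $q$ and documents are given; the ranking function orders documents by probability of relevance, $d_n \sim P(r \mid q, d_n)$.

BM25 [Robertson & Walker 94], written here in its standard form, with $tf(w)$ the frequency of word $w$ in $d$, $dl$ the document length, and $avgdl$ the average document length:

$f(q, d) = \sum_{w \in q \cap d} idf(w) \cdot \dfrac{(k_1 + 1)\, tf(w)}{k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right) + tf(w)}$

PageRank [Page et al., 1999], written here in its standard form, with $M(d_i)$ the set of pages linking to $d_i$ and $L(d_j)$ the number of out-links of $d_j$:

$P(d_i) = \dfrac{1 - \alpha}{N} + \alpha \sum_{d_j \in M(d_i)} \dfrac{P(d_j)}{L(d_j)}$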

New Approach = Learning to Rank

Learning System

Ranking System

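[Figure: training data, consisting of queries $q_1, \dots, q_m$ with their associated documents $d_{i,1}, \dots, d_{i,n_i}$, is fed to the Learning System, which outputs the model $f(q, d)$; the Ranking System applies $f(q, d)$ to the documents $D = \{d_1, d_2, \dots, d_N\}$ of a new query]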

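Training Process

1. Data Labeling (rank)

2. Feature Extraction

3. Learning $f(x)$

[Figure: for each query $q_i$, $i = 1, \dots, m$, labeled documents $(d_{i,1}, y_{i,1}), \dots, (d_{i,n_i}, y_{i,n_i})$ are converted to labeled feature vectors $(x_{i,1}, y_{i,1}), \dots, (x_{i,n_i}, y_{i,n_i})$]

Testing Process

1. Data Labeling (rank)

2. Feature Extraction

3. Ranking with $f(x)$

4. Evaluation

[Figure: test documents are labeled, converted to feature vectors, scored with $f(x)$, and the resulting ranking is compared with the labels to produce the Evaluation Result]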

• Features are functions of query and document

• Query and associated documents form a group

• Groups are i.i.d. data

• Feature vectors within group are not i.i.d. data

• Ranking model is function of features

• Several data labeling methods (here labeling of grade)

Issues in Learning to Rank

• Data Labeling

• Feature Extraction

• Evaluation Measure

• Learning Method (Model, Loss Function, Algorithm)

Data Labeling Problem

• E.g., relevance of documents w.r.t. query

Data Labeling Methods

• Labeling of Grades

– Multiple levels (e.g., relevant, partially relevant, irrelevant)

– Widely used in IR

• Labeling of Ordered Pairs

– Ordered pairs between documents (e.g. A>B, B>C)

– Implicit relevance judgment: derived from click-through data

• Creation of List

– List (or permutation) of documents is given

– Ideal but difficult to implement

Implicit Relevance Judgment

[Figure: ranking of documents at search system (Doc A, Doc B, Doc C); users often clicked on Doc B; derived ordered pair: B > A]

Feature Extraction

PageRank

.............

Query-document feature

Document feature

Feature Vectors

Example Features

Evaluation Measures

• Important to rank top results correctly

• Measures

– NDCG (Normalized Discounted Cumulative Gain)

– MAP (Mean Average Precision)

– MRR (Mean Reciprocal Rank)

– WTA (Winners Take All)

– Kendall’s Tau

• Evaluating ranking using labeled grades

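• NDCG at position $j$: $\mathrm{NDCG}(j) = n_j \sum_{i=1}^{j} \dfrac{2^{r(i)} - 1}{\log(1 + i)}$, where $r(i)$ is the grade of the document at position $i$ and $n_j$ is the normalizing factor (the inverse of the DCG of the perfect ranking at position $j$)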

NDCG (cont’)

• Example: perfect ranking

– (3, 3, 2, 2, 1, 1, 1) grade r=3,2,1

– (7, 7, 3, 3, 1, 1, 1) gain

– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) position discount

– (7, 11.41, 12.91, …) DCG

– (1/7, 1/11.41, 1/12.91, …) normalizing factor

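– (1, 1, 1, 1, 1, 1, 1) NDCG for perfect ranking

Gain at position $j$: $2^{r(j)} - 1$

Position discount: $1 / \log(1 + j)$ (logarithm base 2 in this example)

DCG at position $j$: $\sum_{i=1}^{j} (2^{r(i)} - 1) / \log(1 + i)$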

NDCG (cont’)

• Example: imperfect ranking

– (2, 3, 2, 3, 1, 1, 1)

– (3, 7, 3, 7, 1, 1, 1) Gain

– (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) Position discount

– (3, 7.41, 8.91, … ) DCG

– (1/7, 1/11.41, 1/12.91, …) normalizing factor

– (0.43, 0.65, 0.69, ….) NDCG

• Imperfect ranking decreases NDCG
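The two worked examples can be reproduced with a short Python sketch. The function names are illustrative, and the base-2 logarithm is inferred from the position discounts listed above (0.63 = 1 / log2(3)).

```python
import math


def dcg_at(grades, j):
    """DCG at position j: sum of gain (2^r - 1) times discount 1 / log2(1 + i)."""
    return sum(
        (2 ** r - 1) / math.log2(1 + i)
        for i, r in enumerate(grades[:j], start=1)
    )


def ndcg_at(grades, j):
    """NDCG at position j: DCG divided by the DCG of the perfect ranking."""
    ideal = dcg_at(sorted(grades, reverse=True), j)
    return dcg_at(grades, j) / ideal if ideal > 0 else 0.0


perfect = [3, 3, 2, 2, 1, 1, 1]
imperfect = [2, 3, 2, 3, 1, 1, 1]

print([round(dcg_at(perfect, j), 2) for j in (1, 2, 3)])     # [7.0, 11.42, 12.92]
print([round(ndcg_at(imperfect, j), 2) for j in (1, 2, 3)])  # [0.43, 0.65, 0.69]
```

The DCG values differ from the slide's 11.41 and 12.91 only in the last digit, because the slide accumulates rounded discounts.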

Relations with Other Learning Tasks

• vs Classification: no need to predict category

• vs Regression: no need to predict value of $f(q, d)$

• vs Ordinal regression: relative ranking order is more important

• Learning to rank can be approximated by classification, regression, ordinal regression

Ordinal Regression (Ordinal Classification)

• Categories are ordered

– 5, 4, 3, 2, 1

– e.g., rating restaurants

• Prediction

– Map to ordered categories

Three Major Approaches

• Pointwise approach

• Pairwise approach

• Listwise approach

• SVM based

• Boosting based

• Neural Network based

• Others

Categorization of Learning to Rank Methods

Pointwise Approach

• Transforming ranking to regression, classification, or ordinal classification

• Query-document group structure is ignored

Pointwise Approach

Pairwise Approach

• Transforming ranking to pairwise classification

• Query-document group structure is ignored

Pairwise Approach

Listwise Approach

• List as instance

• Query-document group structure is used

• Straightforwardly represents learning to rank problem

Listwise Approach

Evaluation Results

• Pairwise approach and listwise approach perform better than pointwise approach

• LambdaMART performs best in Yahoo! Learning to Rank Challenge

• No significant difference among pairwise and listwise methods

1.3. Methods of Learning to Rank

Ranking SVM

Herbrich et al., 1999

Pairwise Classification

• Converting document list to document pairs

Doc B Doc C

Doc A Doc B

grade 1

grade 2

grade 3

Transforming Ranking to Pairwise Classification

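• Input space: $X$

• Ranking function: $f: X \to \mathbb{R}$

• Ranking: $x_i \succ x_j \Leftrightarrow f(x_i; w) > f(x_j; w)$

• Linear ranking function: $f(x; w) = \langle w, x \rangle$, so that $\langle w, x_i - x_j \rangle > 0 \Leftrightarrow f(x_i; w) > f(x_j; w)$

• Transforming to pairwise classification: $(x_i - x_j, z)$, with $z = +1$ if $x_i \succ x_j$ and $z = -1$ otherwise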

Ranking Problem

grade 1

grade 2

grade 3

Transformed Pairwise Classification Problem

[Figure labels: ranking function $f(x; w)$; difference vector $x_1 - x_3$]

Positive Examples

Negative Examples

Redundant

Ranking SVM

• Pairwise classification on differences of feature vectors

• Corresponding positive and negative examples

• Negative examples are redundant and can be discarded

• Hyper plane passes the origin

• Soft margin and kernel can be used

• Ranking SVM = pairwise classification SVM
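As a minimal sketch of the transformation above, the following Python function turns the graded documents of one query into difference vectors. The feature values are made up for illustration, and discarding the redundant negative examples by flipping their sign is the design choice described in the bullets, not code from the original tutorial.

```python
from itertools import combinations


def to_pairwise(features, grades):
    """Turn one query's graded documents into positive difference vectors."""
    examples = []
    for i, j in combinations(range(len(grades)), 2):
        if grades[i] == grades[j]:
            continue  # documents with equal grades give no ordered pair
        diff = [a - b for a, b in zip(features[i], features[j])]
        if grades[i] < grades[j]:
            # (x_i - x_j, -1) mirrors (x_j - x_i, +1), so keep only the positive one.
            diff = [-d for d in diff]
        examples.append(diff)
    return examples


# Hypothetical 2-dimensional feature vectors for three documents of one query.
features = [[0.9, 0.2], [0.5, 0.4], [0.1, 0.3]]
grades = [3, 2, 1]
pairs = to_pairwise(features, grades)
```

Every returned vector is "higher grade minus lower grade", so a linear ranker is correct on a pair exactly when its inner product with the vector is positive.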

Learning of Ranking SVM

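Unconstrained form, with $N$ the number of document pairs:

$\min_{w} \sum_{i=1}^{N} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_+ + \lambda \|w\|^2$

where $[s]_+ = \max(0, s)$ and $\lambda = \frac{1}{2C}$.

Equivalent quadratic program:

$\min_{w, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$

s.t. $y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N$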
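The unconstrained hinge-loss form lends itself to a small subgradient-descent sketch. This is an illustration only: the training pairs, learning rate, and number of epochs are arbitrary assumptions, and a practical system would use a dedicated SVM solver instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(w, pairs, lam):
    """Sum of hinge losses [1 - y <w, x1 - x2>]_+ plus lam * ||w||^2."""
    hinge = sum(
        max(0.0, 1.0 - y * dot(w, [a - b for a, b in zip(x1, x2)]))
        for x1, x2, y in pairs
    )
    return hinge + lam * dot(w, w)


def train_ranking_svm(pairs, dim, lam=0.01, lr=0.1, epochs=200):
    """Plain batch subgradient descent on the objective above."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [2.0 * lam * wk for wk in w]
        for x1, x2, y in pairs:
            diff = [a - b for a, b in zip(x1, x2)]
            if 1.0 - y * dot(w, diff) > 0.0:  # pair violates the margin
                grad = [g - y * d for g, d in zip(grad, diff)]
        w = [wk - lr * g for wk, g in zip(w, grad)]
    return w


# Hypothetical pairs (x1, x2, y): y = +1 means x1 should rank above x2.
pairs = [
    ([2.0, 1.0], [1.0, 0.8], +1),
    ([3.0, 0.0], [1.0, 0.5], +1),
    ([1.0, 0.6], [2.0, 0.5], -1),
]
w = train_ranking_svm(pairs, dim=2)
```

After training, `dot(w, x)` plays the role of the ranking function $f(x; w) = \langle w, x \rangle$.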

IR SVM

Cao et al., 2006

Cost-sensitive Pairwise Classification

• Converting to document pairs

Doc B Doc C

Doc A Doc B

grade 1

grade 2

grade 3

Critical / Not Critical

Problems with Ranking SVM

• Not sufficient emphasis on correct ranking on top

– grades: 3, 2, 1

– ranking 1: 2 3 2 1 1 1 1

– ranking 2: 3 2 1 2 1 1 1

– ranking 2 should be better than ranking 1

– Ranking SVM views them as the same

• Numbers of pairs vary according to queries

– q1: 3 2 2 1 1 1 1

– q2: 3 3 2 2 2 1 1 1 1 1

– number of pairs for q1: 2*(3-2) + 4*(3-1) + 8*(2-1) = 14, where n*(a-b) denotes n pairs between grade a and grade b

– number of pairs for q2: 6*(3-2) + 10*(3-1) + 15*(2-1) = 31

– Ranking SVM is biased toward q2
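The pair counts for q1 and q2 can be checked with a few lines of Python; the function name is illustrative.

```python
from collections import Counter
from itertools import combinations


def count_pairs(grades):
    """Count document pairs with different grades, broken down by grade pair."""
    by_grade = Counter(grades)
    breakdown = {
        (max(a, b), min(a, b)): by_grade[a] * by_grade[b]
        for a, b in combinations(by_grade, 2)
    }
    return sum(breakdown.values()), breakdown


q1 = [3, 2, 2, 1, 1, 1, 1]
q2 = [3, 3, 2, 2, 2, 1, 1, 1, 1, 1]

print(count_pairs(q1))  # (14, {(3, 2): 2, (3, 1): 4, (2, 1): 8})
print(count_pairs(q2))  # (31, {(3, 2): 6, (3, 1): 10, (2, 1): 15})
```

Because q2 contributes more than twice as many pairs as q1, an unweighted pairwise loss gives q2 more influence, which is the bias that per-query normalization is meant to remove.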

IR SVM

• Solving the two problems of Ranking SVM

• Higher weight on important grade pairs

• Normalization weight on pairs in query

• IR SVM = Ranking SVM using modified hinge loss

Modified Hinge Loss function

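$\min_{w} \sum_{i=1}^{l} \tau_{k(i)}\, \mu_{q(i)} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_+ + \lambda \|w\|^2$

where $l$ is the number of document pairs, $\tau_{k(i)}$ is the weight of the grade pair that pair $i$ belongs to, and $\mu_{q(i)}$ is the normalization weight of the query that pair $i$ belongs to.

[Plot: loss as a function of $y\, f(x^{(1)} - x^{(2)})$, with the hinge at 1]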

Learning of IR SVM

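$\min_{w} \sum_{i=1}^{l} \tau_{k(i)}\, \mu_{q(i)} \left[ 1 - y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \right]_+ + \lambda \|w\|^2$

Equivalent quadratic program:

$\min_{w, \xi} \ \frac{1}{2} \|w\|^2 + \sum_{i=1}^{l} C_i\, \xi_i, \quad C_i = \dfrac{\tau_{k(i)}\, \mu_{q(i)}}{2 \lambda}$

s.t. $y_i \langle w, x_i^{(1)} - x_i^{(2)} \rangle \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l$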

AdaRank

Xu and Li, 2007

Listwise Loss

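[Figure: for each query $q_i$, $i = 1, \dots, m$, the whole list of feature vectors $x_{i,1}, \dots, x_{i,n_i}$ with labels $y_{i,1}, \dots, y_{i,n_i}$ is treated as one instance when computing the loss]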

AdaRank

• Optimizing exponential loss function

• Algorithm: AdaBoost-like algorithm for ranking

Loss Function of AdaRank

Any evaluation measure taking value in [-1, +1]

AdaRank Algorithm

Part 2. Learning to Match

2-1. Semantic Matching in Search

Query Document Mismatch is Biggest Challenge in Search

Query Document Mismatch

• Same intent can be represented by different queries (representations)

• Search is still mainly based on term level matching

• Query document mismatch occurs, when searcher and author use different representations

Examples of Query Document Mismatch

Query | Document | Term Matching | Semantic Matching

seattle best hotel | seattle best hotels | no | yes

pool schedule | swimmingpoolschedule | no | yes

natural logarithm transformation | logarithm transformation | partial | yes

china kong | china hong kong | partial | no

why are windows so expensive | why are macs so expensive | partial | no

Matching at Different Levels

Levels, from term matching toward semantic matching: Form, Phrase, Sense, Topic, Structure

Query Understanding

Query: michael jordan berkele

• Spelling Error Correction: query form: michael jordan berkeley

• Phrase Identification: phrase: michael jordan; phrase: berkeley

• Similar Query Finding: similar query: michael i. jordan

• Topic Identification: topic: machine learning, berkeley

• Structure Identification: main phrase: michael jordan

Document Understanding

Document: Homepage of Michael Jordan. "Michael Jordan is Professor in the Department of Electrical Engineering ……"

• Phrase Identification: phrase: michael jordan, professor, department, electrical engineering

• Key Phrase Identification: key phrase: michael jordan, professor, electrical engineering

• Topic Identification: topic: machine learning, berkeley

• Title Structure Identification: main phrase in title: michael jordan

Semantic Matching

Query Representation

– Query form: michael jordan berkeley

– Similar query: michael i jordan

– Main phrase: michael jordan

– Phrase: michael jordan, berkeley

– Topic: machine learning

Document Representation

– Document: michael jordan homepage

– Main phrase in title: michael jordan

– Key phrase: michael jordan, berkeley

– Phrase: michael jordan, professor, department of electrical engineering

– Topic: machine learning, berkeley

Query and document representations are matched, and the matching results are used for relevance ranking.

Matching in Different Ways

Query Reformulation

Document transformation

Query and document transformation

No transformation

Web Search System

[Figure: components of a web search system: Crawling, Document Understanding, Indexing, User Interface, Understanding, Retrieving, Matching, Ranking]

Machine Learning for Query Document Matching in Web Search

Learning for Matching between Query and Document

• Learning matching function $f_M(q, d)$ or $p_M(r \mid q, d)$

• Using training data $(q_1, d_1, r_1), \dots, (q_N, d_N, r_N)$

• $q_1, q_2, \dots, q_N$ and $d_1, d_2, \dots, d_N$ can be id's or feature vectors

• $r_1, r_2, \dots, r_N$ can be binary or numerical values

• Using relations in data and/or prior knowledge

Long Tail Challenge

• Head pages have rich anchor texts and click data

• Tail queries and pages suffer more from mismatch

• Problem of propagating information and knowledge from head to tail

Relation between Matching and Ranking

• In traditional IR:

– Ranking = matching: $f(q, d) = f_{BM25}(q, d)$ or $f(q, d) = P_{LMIR}(d \mid q)$

• Web search:

– Ranking and matching become separated: $f(q, d) = f_{BM25}(q, d) + g_{PageRank}(d)$

– Learning to rank becomes state-of-the-art

– Matching = feature learning for ranking

Matching vs Ranking

 | Matching | Ranking

Prediction | Matching degree between query and document | Ranking list of documents

Model | f(q, d) | f(q, d1), f(q, d2), …, f(q, dn)

Challenge | Mismatch | Correct ranking on top

In search, first matching and then ranking

Matching Functions as Features in Learning to Rank

• Term level matching

• Phrase level matching

• Sense level matching

• Topic level matching

• Structure level matching

• Term level matching (spelling, stemming)

Matching functions shown on the slide: $f_{BM25}(q, d)$, $f_P(q, d)$, $f_T(q, d)$, $f_C(q, d)$, $f_S(q, d)$, $f_{BM25_n}(q, d)$

2-2. Overview of Learning to Match

Matching between Heterogeneous Data is Everywhere

• Matching between user and product (collaborative filtering)

• Matching between text and image (image annotation)

• Matching between people (dating)

• Matching between languages (machine translation)

• Matching between receptor and ligand (drug design)

Formulation of Learning Problem

• Learning matching function $f(x, y)$

• Training data $(x_1, y_1, r_1), \dots, (x_N, y_N, r_N)$

• Generated according to $x \sim P(X)$, $y \sim P(Y \mid X)$, $r \sim P(R \mid X, Y)$

Formulation of Learning Problem

• Loss Function: $L(r, f(x, y))$

• Risk Function: $R(f) = \int_{X \times Y \times R} L(r, f(x, y))\, dP(x, y, r)$

• Objective Function in Learning: $\min_{f \in F} \sum_{i=1}^{N} L(r_i, f(x_i, y_i)) + \Omega(f)$

Matching Problem: Instance Matching

y1 y2 y3 … yn

Instances

Can be represented as matching between nodes in bipartite graph

Matching Problem: Feature Matching

y1 y2 y3 … yn

Features

Can be represented as matching between objects in two spaces

Matching Problem: Structure Matching

y1 y2 y3 … yn

Structures

2-3. Methods of Learning to Match

Regularized Latent Semantic Indexing

Wang et al. 2011

• Motivation

– Matching between query and document at topic level

– Scale up to large datasets (vs. existing methods)

• Approach

– Matrix Factorization

– Regularization on topics and documents (vs. Sparse Coding)

– Learning problem can be easily decomposed

• Results

– l1 on topics leads to sparse topics and l2 on documents leads to accurate matching

– Comparable with existing methods in topic discovery and search relevance

– But can easily scale up to large document sets

Query and Document Matching in Topic Space

[Figure: the term-document matrix (document space) is approximated by the product of a term-topic matrix and the topic-document matrix $V$ (topic space). Each doc $n$ has a term representation and a topic representation; topics are sparse, documents are smooth.]

Optimization Strategy

Coordinate Descent

Analytic Solution

RLSI Algorithm

terms processed in parallel

docs processed in parallel

• Single machine multi core version

• Multiple machine version (MapReduce and MPI)

Regularized Matching in Latent Space (RMLS)

Wu et al., 2013

Matching in Latent Space

• Motivation

– Matching between query and document in latent space

• Assumption

– Queries have similarity

– Document have similarity

– Click-through data represent “similarity” relations between queries and documents

• Approach

– Projection to latent space

– Regularization or constraints

• Results

– Significantly enhance accuracy of query document matching

Matching in Latent Space

Query Space Document Space

q2 Latent Space

• Matching between Heterogeneous Data

• Example: Image Annotation

fishing

singer

soldier

warrior

microphone

Projecting Keywords and Images into Latent Space

Partial Least Square (PLS)

• Setting

– Two spaces: $q \in Q$, $d \in D$

• Input

– Training data: $\{(q_i, d_i, c_i)\}_{1 \le i \le N}$

• Output

– Matching function $f(q, d)$

• Assumption

– Two linear (and orthonormal) transformations $L_q$, $L_d$

– Dot product as similarity function: $\langle L_q q, L_d d \rangle$

• Optimization

$\arg\max_{L_q, L_d} \sum_{(q_i, d_i)} c_i \langle L_q q_i, L_d d_i \rangle \quad \text{s.t. } L_q L_q^T = I, \ L_d L_d^T = I$

Solution of Partial Least Square

• Non-convex optimization

• Global optimal solution exists

• Global optimum can be found by solving SVD (Singular Value Decomposition)

Regularized Mapping to Latent Space (RMLS)

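• Setting

– Two spaces: $q \in Q$, $d \in D$

• Input

– Training data: $\{(q_i, d_i, c_i)\}_{1 \le i \le N}$

• Output

– Matching function $f(q, d)$

• Assumption

– Two linear (and sparse) transformations $L_q$, $L_d$

– Dot product as similarity function: $\langle L_q q, L_d d \rangle$

• Optimization

$\arg\max_{L_q, L_d} \sum_{(q_i, d_i)} c_i \langle L_q q_i, L_d d_i \rangle \quad \text{s.t. } |l_q| \le \lambda_q, \ |l_d| \le \lambda_d, \ \|l_q\| \le \theta_q, \ \|l_d\| \le \theta_d$

where $l_q$ and $l_d$ denote rows of $L_q$ and $L_d$, $|\cdot|$ is the l1 norm, $\|\cdot\|$ is the l2 norm, and $\lambda_q, \lambda_d, \theta_q, \theta_d$ are constant bounds.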

Solution of Regularized Mapping to Latent Space (RMLS)

• Coordinate Descent

• Repeat

– Fix $L_q$, update $L_d$

– Fix $L_d$, update $L_q$

• No guarantee to find global optimum

• Updates can be parallelized by rows

Comparison between PLS and RMLS

 | PLS | RMLS

Assumption | Orthogonal | L1 and L2 Regularization

Optimization Method | Singular Value Decomposition | Coordinate Descent

Optimality | Global optimum | Local optimum

Efficiency | Low | High

Scalability | Low | High

Part 3. Deep Learning for Information Retrieval

3-1: Overview of Deep Learning for Information Retrieval

Hard Problems in IR

Q: How tall is Yao Ming?

Q: A dog catching a ball

Name | Height | Weight

Yao Ming | 2.29m | 134kg

Liu Xiang | 1.89m | 85kg

Key Questions: How to Represent Intent and Content, How to Match Intent and Content

Image Retrieval

Question Answering from Knowledge Base

Q: How far is sun from earth?

The average distance between the Sun and the Earth is about 92,935,700 miles.

Generation-based Question Answering

A: It is about 93 million miles

(No tag on images)

Deep Learning and IR

Intent / Matching / Content

Deep Learning

Recent Progress: Deep Learning Is Particularly Effective for Hard IR Problems

3-2: Basics of Deep Learning

Word Embedding

• Motivation: representing words with low-dimensional real-valued vectors, utilizing them as input to deep learning methods, vs one-hot vectors

• Method: SGNS (Skip-Gram with Negative Sampling)

• Tool: Word2Vec

• Input: words and their contexts in documents

• Output: embeddings of words

• Assumption: similar words occur in similar contexts

• Interpretation: factorization of mutual information matrix

• Advantage: compact representations (usually on the order of 100 dimensions)

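Skip-Gram with Negative Sampling (Mikolov et al., 2013)

• Input: occurrences between words and contexts

[Figure: word-context co-occurrence matrix $M$ with context columns $c_1, c_2, c_3, c_4, c_5$]

• Probability model:

$P(D = 1 \mid w, c) = \sigma(\vec{w} \cdot \vec{c}) = \dfrac{1}{1 + e^{-\vec{w} \cdot \vec{c}}}$

$P(D = 0 \mid w, c) = \sigma(-\vec{w} \cdot \vec{c}) = \dfrac{1}{1 + e^{\vec{w} \cdot \vec{c}}}$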

Skip-Gram with Negative Sampling

• Word vector $\vec{w}$ and context vector $\vec{c}$: lower dimensional (parameter) vectors

• Goal: learning of the probability model from data

• Take co-occurrence data as positive examples

• Negative sampling: randomly sample $k$ unobserved pairs $(w, c_N)$ as negative examples

• Objective function in learning:

$L = \sum_{w} \sum_{c} \#(w, c) \left( \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot E_{c_N \sim P} \left[ \log \sigma(-\vec{w} \cdot \vec{c}_N) \right] \right)$

• Algorithm: stochastic gradient descent
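A minimal sketch of one stochastic gradient step for this objective, written as a loss to minimize (the negative of the log-likelihood term for one observed pair and its $k$ sampled negatives). The vector values, learning rate, and function names are illustrative assumptions; this is not the Word2Vec implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sgns_loss(w, c, negatives):
    """-log sigma(w.c) - sum over negatives of log sigma(-w.c_N)."""
    loss = -math.log(sigmoid(dot(w, c)))
    for c_neg in negatives:
        loss -= math.log(sigmoid(-dot(w, c_neg)))
    return loss


def sgns_step(w, c, negatives, lr=0.1):
    """One gradient step; all gradients use the vectors from before the update."""
    g_pos = sigmoid(dot(w, c)) - 1.0               # d loss / d (w.c)
    g_negs = [sigmoid(dot(w, n)) for n in negatives]  # d loss / d (w.c_N)
    grad_w = [g_pos * ck for ck in c]
    for g, n in zip(g_negs, negatives):
        grad_w = [gw + g * nk for gw, nk in zip(grad_w, n)]
    new_c = [ck - lr * g_pos * wk for ck, wk in zip(c, w)]
    new_negs = [
        [nk - lr * g * wk for nk, wk in zip(n, w)]
        for g, n in zip(g_negs, negatives)
    ]
    new_w = [wk - lr * gw for wk, gw in zip(w, grad_w)]
    return new_w, new_c, new_negs


# Hypothetical 2-dimensional vectors: one observed pair and k = 2 negatives.
w0, c0 = [0.1, -0.2], [0.3, 0.1]
negs0 = [[-0.1, 0.2], [0.2, 0.2]]
w1, c1, negs1 = sgns_step(w0, c0, negs0)
```

Each step pulls the word vector toward its observed context and pushes it away from the sampled contexts.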

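Interpretation as Matrix Factorization (Levy & Goldberg 2014)

• Pointwise Mutual Information Matrix: $M_{ij} = PMI(w_i, c_j) = \log \dfrac{P(w_i, c_j)}{P(w_i)\, P(c_j)}$

[Figure: example matrix $M$ with word rows and context columns $c_1, c_2, c_3, c_4, c_5$; surviving entries include 3, -0.5, 2, 1, -0.5]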

Interpretation as Matrix Factorization

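$M = W C^T$: matrix factorization, equivalent to SGNS

[Figure: the matrix $M$ (context columns $c_1, \dots, c_5$) is factorized into a word matrix $W$ with latent dimensions $t_1, t_2, t_3$ (surviving entries include 7, 0.5, 1 and 1, 1.5, 1) and a context matrix $C$; each row of $W$ is a word embedding]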

Word embedding
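To make the matrix being factorized concrete, here is a small Python sketch that builds a PMI matrix from co-occurrence counts. The counts are invented for illustration. The optional `shift` argument reflects the result of Levy & Goldberg that SGNS with $k$ negative samples implicitly factorizes PMI shifted by $\log k$; leaving unobserved pairs as `None` is a choice of this sketch.

```python
import math


def pmi_matrix(counts, shift=0.0):
    """counts[i][j] = #(w_i, c_j); returns PMI(w_i, c_j) - shift, None if unobserved."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    return [
        [
            math.log(n * total / (row_sums[i] * col_sums[j])) - shift if n > 0 else None
            for j, n in enumerate(row)
        ]
        for i, row in enumerate(counts)
    ]


# Hypothetical word-context counts: 2 words, 2 contexts.
counts = [[2, 0], [0, 2]]
pmi = pmi_matrix(counts)
shifted = pmi_matrix(counts, shift=math.log(5))  # k = 5 negative samples
```

A low-rank factorization of such a matrix, by SVD or by SGNS training, yields the word and context embeddings.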

Recurrent Neural Network

• Motivation: representing sequence of words and utilizing the representation in deep learning methods

• Input: sequence of word embeddings, denoting sequence of words (e.g., sentence)

• Output: sequence of internal representations (hidden states)

• Variants: LSTM and GRU, to deal with long distance dependency

• Learning of model: stochastic gradient descent

• Advantage: handling arbitrarily long sequence; can be used as part of deep model for sequence processing (e.g., language modeling)

Recurrent Neural Network (RNN) (Mikolov et al. 2010)

[Figure: the RNN reads "the cat sat on the mat" word by word, producing one hidden state per word]

$h_t = f(h_{t-1}, x_t)$

Recurrent Neural Network

$h_t = f(h_{t-1}, x_t) = \tanh(W_h h_{t-1} + W_x x_t + b_{hx})$

[Figure: unit with inputs $h_{t-1}$ and $x_t$]

……
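The recurrence can be written out directly in plain Python. The weights and the two-dimensional "embeddings" below are arbitrary illustrative values, and the initial hidden state is assumed to be all zeros.

```python
import math


def rnn_step(h_prev, x, W_h, W_x, b):
    """h_t = tanh(W_h h_{t-1} + W_x x_t + b), with matrices as lists of rows."""
    return [
        math.tanh(
            sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
            + sum(W_x[i][k] * x[k] for k in range(len(x)))
            + b[i]
        )
        for i in range(len(b))
    ]


def run_rnn(xs, W_h, W_x, b):
    """Return the sequence of hidden states for an input sequence xs."""
    h = [0.0] * len(b)  # assumed initial state
    states = []
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    return states


# Hypothetical parameters: hidden size 2, embedding size 2.
W_h = [[0.5, -0.1], [0.2, 0.3]]
W_x = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.1]
xs = [[0.2, -0.4], [0.1, 0.3], [-0.5, 0.0]]  # three "word embeddings"
states = run_rnn(xs, W_h, W_x, b)
```

The same parameters are reused at every position, which is why the model can handle arbitrarily long sequences.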

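Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997)

$i_t = \sigma(W_{ih} h_{t-1} + W_{ix} x_t + b_i)$

$f_t = \sigma(W_{fh} h_{t-1} + W_{fx} x_t + b_f)$

$o_t = \sigma(W_{oh} h_{t-1} + W_{ox} x_t + b_o)$

$g_t = \tanh(W_{gh} h_{t-1} + W_{gx} x_t + b_g)$

Standard LSTM state update, with memory $c_t$ and element-wise product $\otimes$:

$c_t = i_t \otimes g_t + f_t \otimes c_{t-1}, \quad h_t = o_t \otimes \tanh(c_t)$

• A memory (vector) to store values of previous state

• Input gate, output gate, and forget gate to control information flow

• Gate: element-wise product with vector of values in [0, 1]

[Figure: Input Gate, Forget Gate, and Output Gate, each computed from $(h_{t-1}, x_t)$]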

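Gated Recurrent Unit (GRU) (Cho et al., 2014)

$r_t = \sigma(W_{rh} h_{t-1} + W_{rx} x_t + b_r)$

$z_t = \sigma(W_{zh} h_{t-1} + W_{zx} x_t + b_z)$

$g_t = \tanh(W_{gh} (r_t \otimes h_{t-1}) + W_{gx} x_t + b_g)$

Standard GRU state update, following Cho et al., 2014:

$h_t = z_t \otimes h_{t-1} + (1 - z_t) \otimes g_t$

• A memory (vector) to store values of previous state

• Reset gate and update gate to control information flow

[Figure: Reset Gate and Update Gate, each computed from $(x_t, h_{t-1})$]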

Recurrent Neural Network Language Model

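Model

$h_t = \tanh(W_h h_{t-1} + W_x x_t + b_{hx})$

$p_{t+1} = P(x_{t+1} \mid x_1 \cdots x_t) = \mathrm{softmax}(W h_t + b)$

Objective of Learning

$\dfrac{1}{T} \sum_{t=1}^{T} \log \hat{p}(x_t \mid x_1 \cdots x_{t-1})$

• Input one sequence and output another

• In training, input sequence is same as output sequence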

Convolutional Neural Network

• Motivation: representing sequence of words and utilizing the representation in deep learning methods

• Input: sequence of word embeddings, denoting sequence of words (e.g., sentence)

• Output: representation of input sequence

• Learning of model: stochastic gradient descent

• Advantage: robust extraction of n-gram features; can be used as part of deep model for sequence processing (e.g., sentence classification)

Convolutional Neural Network (CNN) (Kim 2014, Blunsom et al. 2014, Hu et al., 2014)

[Figure: CNN over the sentence "the cat sat on the mat". Convolution is applied to sliding windows of words ("the cat sat", "cat sat on", "sat on the", "on the mat"), max pooling selects between adjacent units, further convolution and pooling layers follow, and the final units are concatenated into the sentence representation.]

• Shared parameters on same level

• Fixed length, zero padding

Example: Image Convolution

Dark Pixel Value = 1, Light Pixel Value = 0

Dot in Filter = 1, Others = 0

Filter

Leow Wee Kheng

Example: Image Convolution

0 0 0 0 0

0 0 1 1 0

0 1 3 2 0

0 1 3 1 0

0 1 1 0 0

Feature Map

• Scanning image with filter having 3*3 cells, among them 3 are dot cells

• Counting number of dark pixels overlapping with dot cells at each position

• Creating feature map (matrix), each element represents similarity between filter pattern and pixel pattern at one position

• Equivalent to extracting feature using the filter

• Translation-invariant
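The scanning procedure in the bullets can be sketched as follows. The 5*5 image and the vertical 3-dot filter are made-up examples, not the ones pictured on the slide.

```python
def feature_map(image, filt):
    """Slide filt over image; each output cell counts dark pixels under dot cells."""
    fh, fw = len(filt), len(filt[0])
    out_h = len(image) - fh + 1
    out_w = len(image[0]) - fw + 1
    return [
        [
            sum(
                image[r + dr][c + dc] * filt[dr][dc]
                for dr in range(fh)
                for dc in range(fw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical filter: three dot cells forming a vertical bar.
vertical = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]

# Hypothetical image: dark pixels (1) form a vertical line in column 2.
image = [[0, 0, 1, 0, 0] for _ in range(5)]
# Same line shifted one column to the right.
shifted = [[0, 0, 0, 1, 0] for _ in range(5)]

print(feature_map(image, vertical))    # [[0, 3, 0], [0, 3, 0], [0, 3, 0]]
print(feature_map(shifted, vertical))  # [[0, 0, 3], [0, 0, 3], [0, 0, 3]]
```

The maximum response (3) is the same for both images and only its location moves, which is what makes the extracted feature robust to translation once pooling is applied.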

Convolution Operation

Convolution

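Filter / Feature Map / Neuron

$z_i^{(l,f)} = \sigma\left(w^{(l,f)} \hat{z}_i^{(l-1)} + b^{(l,f)}\right), \quad f = 1, 2, \dots, F_l$

– $z_i^{(l,f)}$ is output of neuron of type $f$ for location $i$ in layer $l$

– $w^{(l,f)}$, $b^{(l,f)}$ are parameters of neuron of type $f$ in layer $l$

– $\sigma$ is sigmoid function

– $\hat{z}_i^{(l-1)}$ is input of neuron for location $i$ from layer $l-1$

– $\hat{z}_i^{(0)}$ is input from concatenated word vectors for location $i$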

Convolution

Equivalent to n-gram feature extraction at each position

Max Pooling

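$z_i^{(l,f)} = \max\left(z_{2i-1}^{(l-1,f)}, z_{2i}^{(l-1,f)}\right)$

– $z_i^{(l,f)}$ is output of pooling of type $f$ for location $i$ in layer $l$

– $z_{2i-1}^{(l-1,f)}$, $z_{2i}^{(l-1,f)}$ are input of pooling of type $f$ for location $i$ in layer $l$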

Max Pooling

Equivalent to n-gram feature selection
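A plain-Python sketch of one convolution layer followed by one max-pooling layer over a sentence. The embeddings and filter weights are arbitrary illustrative numbers, the window size of 3 is an assumption, and an unpaired last unit is simply dropped here instead of being zero-padded.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def convolve(word_vecs, filters, biases, window=3):
    """For each location, concatenate `window` word vectors and apply every filter."""
    outputs = []
    for i in range(len(word_vecs) - window + 1):
        z_hat = [v for vec in word_vecs[i:i + window] for v in vec]
        outputs.append([
            sigmoid(sum(wk * zk for wk, zk in zip(w, z_hat)) + b)
            for w, b in zip(filters, biases)
        ])
    return outputs


def max_pool(units):
    """Element-wise max over adjacent pairs of locations (2i-1, 2i)."""
    return [
        [max(a, b) for a, b in zip(units[i], units[i + 1])]
        for i in range(0, len(units) - 1, 2)
    ]


# Hypothetical 2-dimensional embeddings for "the cat sat on the mat".
sentence = [[0.1, 0.0], [0.7, 0.2], [0.3, 0.9], [0.0, 0.1], [0.1, 0.0], [0.6, 0.4]]
# Two hypothetical filter types, each over 3 words * 2 dimensions = 6 inputs.
filters = [[0.5, -0.2, 0.1, 0.4, -0.3, 0.2], [-0.1, 0.3, 0.6, -0.4, 0.2, 0.1]]
biases = [0.0, -0.1]

conv = convolve(sentence, filters, biases)  # 4 locations, 2 feature types each
pooled = max_pool(conv)                     # 2 locations, 2 feature types each
```

Convolution scores every trigram against every filter type, and pooling keeps the stronger of two neighboring trigram responses, matching the "n-gram feature extraction" and "n-gram feature selection" readings above.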

Sentence Classification Using Convolutional Neural Network

$y = f(x) = \mathrm{softmax}(W z + b)$

Concatenation

……

Convolution

Max Pooling

3-3. Methods of Deep Learning for Information Retrieval

Retrieval based Question Answering

Hu et al. 2014

Ji et al. 2014

Retrieval-based Question Answering

Q: What is the population of Hong Kong?

A: It is 7.18 million as in 2013.

Q: How many people are there in Hong Kong?

A: There are about 7 million.

Question Answering System

Q: Do you know Hong Kong’s population?

A: There are about 7 million.

Learning System

Retrieval based Question Answering System

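[Figure, online part: Question → Retrieval (over the Index of Questions and Answers) → Retrieved Questions and Answers → Matching → Matched Answers → Ranking → Ranked Answers → Best Answer]

[Figure, offline part: Matching Models and Ranking Model are learned and supplied to the Matching and Ranking steps]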

Deep Match CNN - Architecture I

…… ……

• First represent two sentences as vectors, and then match the vectors

Sentence X Sentence Y

Deep Match CNN - Architecture II

• Represent and match two sentences simultaneously

• Two dimensional model

Matching Degree

2D Convolution

More 2D Convolution & Pooling

Max-Pooling

1D Convolution

Sentence X

Generation based Question Answering

Shang et al. 2015

Generation-based Question Answering

Q: What is the population of Hong Kong?

A: It is 7.18 million as in 2013.

Q: How many people are there in Hong Kong?

A: There are about 7 million.

Q: Do you know Hong Kong’s population?

A: It is 7 million

Learning System

Neural Responding Machine

• Encoding questions to internal representations

• Decoding internal representations to answers

• Using GRU

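[Figure: Question $\mathbf{x} = (x_1, x_2, \dots, x_T)$ → Encoder → Context Generator → Decoder → Answer $\mathbf{y} = (y_1, y_2, \dots, y_t)$]

Decoder

$P(y_t \mid y_{t-1}, \dots, y_1, \mathbf{x}) = g(y_{t-1}, s_t, c_t)$

– $y_t$ is one-hot vector

– $s_t$ is hidden state of decoder

– $c_t$ is context vector

– $g(\cdot)$ is softmax function; the hidden state is updated by $f(\cdot)$, which is GRU

– Similar to attention mechanism in RNN Encoder-Decoder

Encoder

$h_t = f(x_t, h_{t-1})$

– $f(\cdot)$ is GRU

– $h_t$ is hidden state of encoder

– $x_t$ is word embedding

Combination of global and local encoders

$c_t = \sum_{j=1}^{T} \alpha_{tj} \left[ h_j^{l} : h_T^{g} \right], \quad \alpha_{tj} = q(h_j, s_{t-1})$

– $c_t$ is context vector, $\alpha_{tj}$ is weight

– $[h_j^{l} : h_T^{g}]$ is concatenation of local and global hidden states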

Question Answering from Relational Database

Yin et al. 2016

Question Answering from Relational Database

Relational Database

Q: How many people participated in the game in Beijing?

A: 4,200

SQL: select #_participants, where city=beijing

Q: When was the latest game hosted?

A: 2012

SQL: argmax(city, year)

Q: Which city hosted the longest Olympic game before the game in Beijing?

A: Athens

Learning System

year city #_days #_medals

2000 Sydney 20 2,000

2004 Athens 35 1,500

2008 Beijing 30 2,500

2012 London 40 2,300

Neural Enquirer

• Query Encoder: encoding query

• Table Encoder: encoding entries in table

• Five Executors: executing query against table

Conducting matching between question and database entries multiple times

Query Encoder and Table Encoder

• Creating query embedding using RNN

• Creating table embedding for each entry using DNN

Query Representation

Entry Representation

Field Value

Table Representation

Query Encoder Table Encoder

Executors

• Five layers; each layer except the last has a reader, an annotator, and a memory

• Reader fetches important representation for each row, e.g., city=beijing

• Annotator encodes result representation for each row, e.g., row where city=beijing

Select #_participants where city = beijing

Question Answering from Knowledge Graph

Yin et al. 2016

Question Answering from Knowledge Graph

(Yao-Ming, spouse, Ye-Li)

(Yao-Ming, born, Shanghai)

(Yao-Ming, height, 2.29m)

… …

(Ludwig van Beethoven, place of birth, Germany)

… …

Knowledge Graph

Q: How tall is Yao Ming?

A: He is 2.29m tall and is visible from space.

(Yao Ming, height, 2.29m)

Q: Which country was Beethoven from?

A: He was born in what is now Germany.

(Ludwig van Beethoven, place of birth, Germany)

Q: How tall is Liu Xiang?

A: He is 1.89m tall

Learning System

Answer is generated

• Interpreter: creates representation of question using RNN

• Enquirer: retrieves top k triples with highest matching scores using CNN model

• Generator: generates answer based on question and retrieved triples using attention-based RNN

• Attention model: controls generation of answer

Short Term Memory

Long Term Memory

(Knowledge Base)

How tall is Yao Ming?

Interpreter

Enquirer

Generator

He is 2.29m tall

Attention Model

Key idea:

• Generation of answer based on question and retrieved result

• Combination of neural processing and symbolic processing

Enquirer: Retrieval and Matching

• Retaining both symbolic representations and vector representations

• Using question words to retrieve top k triples

• Calculating matching scores between question and triples using CNN model

• Finding best matched triples

(how, tall, is, liu, xiang)

< liu xiang, height, 1.90m>

< yao ming, height, 2.26m>

… …

Retrieved Top k Triples and Embeddings

Question and Embedding Matching

Generator: Answer Generation

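• Generating answer using attention mechanism

• At each position, a variable decides whether to generate a word or use the object of top triple

[Figure: at position 3, $z_3 = 0$ generates a word from the vocabulary (e.g., "tall") and $z_3 = 1$ uses the object of the top triple ("2.29m")]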

< yao ming, height, 2.29m>

How tall is Yao Ming ?

Image Retrieval

Ma et al. 2015

Image Retrieval

a lady in a car

a man holds a cell phone

two ladies are chatting

Learning System

Retrieval System

Having dinner with friends in restaurant

Image Index

Multimodal CNN

• Represent text and image as vectors and then match the two vectors

• Word-level matching, phrase-level matching, sentence-level matching

• CNN model works better than RNN models (state of the art) for text

A dog is catching a ball

CNN CNN

Sentence-level Matching

• Combining image vector and sentence vector

……

a dog is catching a ball

Word-level Matching Model

• Adding image vector to word vectors

……

a dog is catching a ball

Summary

• Learning to Rank

– Approaches: pointwise, pairwise, listwise approaches

– Methods: Ranking SVM, IR SVM, AdaRank

• Learning to Match

– Semantic Matching

– Methods: RLSI, RMLS

• Deep Learning for Information Retrieval

– Deep learning is effective for hard IR problems

– Basic tools: word embedding, CNN, RNN

– Methods: Deep Match CNN, NRM, GenQA, Neural Enquirer, Multimodal CNN

References

• Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, Xiaoming Li. Neural Generative Question Answering. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16), 2972-2978, 2016.

• Pengcheng Yin, Zhengdong Lu, Hang Li, Ben Kao. Neural Enquirer: Learning to Query Tables with Natural Language. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16), 2308-2314, 2016.

• Lin Ma, Zhengdong Lu, Lifeng Shang, Hang Li, Multimodal Convolutional Neural Networks for Matching Image and Sentence. Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV), 2623-2631, 2015.

• Lifeng Shang, Zhengdong Lu, Hang Li. Neural Responding Machine for Short Text Conversation. Proceedings of the 53rd Annual Meeting of Association for Computational Linguistics and the 7th International Conference on Natural Language Processing (ACL-IJCNLP'15), 2015.

• Baotian Hu, Zhengdong Lu, Hang Li, Qingcai Chen. Convolutional Neural Network Architectures for Matching Natural Language Sentences. Proceedings of Advances in Neural Information Processing Systems 27 (NIPS'14), 2042-2050, 2014.

• Hang Li, Jun Xu. Semantic Matching in Search. Foundations and Trends in Information Retrieval, Now Publishers, 2014.

• Wei Wu, Zhengdong Lu, Hang Li. Learning Bilinear Model for Matching Queries and Documents. Journal of Machine Learning Research (JMLR), 14: 2519-2548, 2013.

• Quan Wang, Jun Xu, Hang Li, Nick Craswell. Regularized Latent Semantic Indexing: A New Approach to Large Scale Topic Modeling. ACM Transactions on Information Systems (TOIS), 31(1): 5, 2013.

• Hang Li. A Short Introduction to Learning to Rank. IEICE Transactions on Information and Systems, E94-(10), 2011.

• Quan Wang, Jun Xu, Hang Li, Nick Craswell. Regularized Latent Semantic Indexing. Proceedings of the 34th Annual International ACM SIGIR Conference (SIGIR’11), 685-694, 2011.

• Hang Li. Learning to Rank for Information Retrieval and Natural Language Processing. Synthesis Lectures on Human Language Technology, Lecture 12, Morgan & Claypool Publishers, 2011.

• Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, Hang Li. Learning to Rank: From Pairwise Approach to Listwise Approach. Proceedings of the 24th International Conference on Machine Learning (ICML’07), 129-136, 2007.

• Jun Xu and Hang Li. AdaRank: A Boosting Algorithm for Information Retrieval. Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR’07), 391-398, 2007.

• Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon. Adapting Ranking SVM to Document Retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference (SIGIR’06), 186-193, 2006.

Thank you!
