
Contextual Advertising by Combining Relevance with Click Feedback

D. Chakrabarti, D. Agarwal, V. Josifovski

Motivation

Match ads to queries
Sponsored Search: the query is a short piece of text input by the user
Content Match: the query is a webpage on which ads can be displayed

Motivation

Relevance-based
1. Uses IR measures of match (cosine similarity, BM25)
2. Uses domain knowledge
3. Gives a score

Click-based
1. Uses ML methods to learn a good matching function (e.g., Maximum Entropy)
2. Uses existing data; improves over time
3. Typically gives a probability of click

Motivation

Relevance-based
4. Very low training cost: at most one or two parameters, which can be set by cross-validation
5. Simple computations at testing time, using the Weighted AND (WAND) algorithm

Click-based
4. Training is complicated: scalability concerns, extremely imbalanced class sizes, problems interpreting non-clicks, and sampling methods that heavily affect accuracy
5. All features must be computed at test time; good feature engineering is critical

Motivation

Relevance-based
Uses domain knowledge
Very low training cost
Simple computations at testing time

Click-based
Uses existing data; improves over time
Training is complicated
Efficiency concerns during testing

Combine the two: keep the benefits of both, while controlling the training and testing costs.

Motivation

We want a system for computing matches over all ads (~millions), NOT a re-ranking of filtered results from some other matching algorithm.
Training: can be done offline; should be parallelizable (for scalability).
Testing: must be as fast and scalable as WAND, with accurate results.

Outline

Motivation

WAND Background
Proposed Method
Experiments
Conclusions

WAND Background

[Diagram: word posting lists with cursors]
Query = Red Ball
Posting list for "Red":  Ad 1, Ad 5, Ad 8
Posting list for "Ball": Ad 7, Ad 8, Ad 9
Cursors skip over ads that cannot match; Candidate results = Ad 8, ...

More generally, queries are weighted, and upper bounds on the score are computed to decide the skips.

WAND Background

Efficiency comes from cursor skipping, so upper bounds must be computable quickly.
The match scoring formula should not use features of the form ("word X in query AND word Y in ad").
Such pairwise ("cross-product") checks can become very costly.
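The transcript has no code for this step; the following is a minimal Python sketch of WAND-style candidate generation under the assumptions spelled out in the comments (posting lists sorted by ad id, a per-word upper bound on its score contribution, and a fixed threshold). It is an illustration of the skipping idea, not the authors' production retrieval code.

    # Minimal sketch of WAND-style retrieval over sorted posting lists.
    # postings:     dict word -> sorted list of ad ids containing the word
    # upper_bounds: dict word -> max per-ad score contribution of the word
    # weights:      dict word -> query-side weight of the word
    def wand_candidates(postings, upper_bounds, weights, threshold):
        """Yield ad ids whose maximum possible score can exceed `threshold`."""
        cursors = {w: 0 for w in postings}            # current position in each list
        while True:
            # Words ordered by the ad id their cursor currently points to.
            live = [(postings[w][cursors[w]], w) for w in cursors
                    if cursors[w] < len(postings[w])]
            if not live:
                return
            live.sort()
            # Accumulate optimistic score until it can exceed the threshold;
            # the ad id reached at that point is the "pivot".
            acc, pivot_id = 0.0, None
            for ad_id, w in live:
                acc += weights[w] * upper_bounds[w]
                if acc > threshold:
                    pivot_id = ad_id
                    break
            if pivot_id is None:
                return                                # no remaining ad can qualify
            if live[0][0] == pivot_id:
                yield pivot_id                        # candidate: score it fully
                advance_to = pivot_id + 1
            else:
                advance_to = pivot_id                 # skip: jump lagging cursors
            for ad_id, w in live:
                if ad_id >= advance_to:
                    break
                lst = postings[w]
                # Real systems use skip pointers here; a linear scan keeps the sketch short.
                while cursors[w] < len(lst) and lst[cursors[w]] < advance_to:
                    cursors[w] += 1

In a real system the threshold is the score of the current k-th best candidate and each yielded ad is fully scored; the sketch only shows how per-word upper bounds let lagging cursors jump forward, which is why per-word (rather than cross-product) score contributions are what keeps the bounds cheap.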

Outline

Motivation
WAND Background
Proposed Method
Experiments
Conclusions

Proposed Method

Only use features of the form ("word X in both query AND ad").
Learn to predict the click data using such features.
Add in some function of the IR scores as extra features. What function?

Proposed Method

A logistic regression model for CTR.

CTR is modeled as the combination of:
a main effect for the page (how good is the page),
a main effect for the ad (how good is the ad), and
an interaction effect (words shared by the page and the ad),
each with its own model parameters.

Proposed Method

M_{p,w} = tf_{p,w}

M_{a,w} = tf_{a,w}

I_{p,a,w} = tf_{p,w} * tf_{a,w}

So, IR-based term frequency measures are taken into account
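The deck's equation itself is not reproduced in the transcript; schematically, with parameter symbols chosen here only for illustration (the paper's exact notation may differ), the model combines the three effects as

    logit(p_{ij}) = \mu + \sum_w \phi_w M_{p_i,w} + \sum_w \psi_w M_{a_j,w} + \sum_w \delta_w I_{p_i,a_j,w}

where p_{ij} is the click probability for ad a_j on page p_i, \mu is an intercept, and \phi_w, \psi_w, \delta_w are the learned per-word parameters for the page main effect, the ad main effect, and the interaction effect.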

Proposed Method

Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing

Proposed Method

How can IR scores fit into the model?
What is the relationship between logit(p_{ij}) and the cosine score?
[Plot: logit(p_{ij}) vs. cosine score, showing a quadratic relationship]

Proposed Method

How can IR scores fit into the model?
This quadratic relationship can be used in two ways:
1. Put in cosine and cosine^2 as features
2. Use it as a prior
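For the first option, written schematically in the same illustrative notation (c_1 and c_2 are simply two more learned coefficients), the cosine score and its square are appended as extra features:

    logit(p_{ij}) = \mu + \sum_w \phi_w M_{p_i,w} + \sum_w \psi_w M_{a_j,w} + \sum_w \delta_w I_{p_i,a_j,w} + c_1 \cos(p_i, a_j) + c_2 \cos^2(p_i, a_j)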

Proposed Method

How can IR scores fit into the model?
This quadratic relationship can be used in two ways.
We tried both ways, and they give very similar results.

Proposed Method

Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing

Proposed Method

Word selection
Overall, there are nearly 110k words in the corpus. Learning parameters for every word would be very expensive, would require a huge amount of data, and would suffer from diminishing returns.
So we want to select the ~1k top words that will have the most impact.

Proposed Method

Word selection: two methods.
Data based: define an interaction measure for each word, with higher values for words that have higher-than-expected CTR when they occur on both the page and the ad.
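The slide does not spell out the measure; below is one plausible lift-style statistic, written as a Python sketch over a hypothetical click log. The event tuples, the word-set dictionaries, and the use of overall CTR as the baseline are assumptions for illustration, not necessarily the authors' definition.

    from collections import defaultdict

    def interaction_scores(events, page_words, ad_words):
        """Hypothetical lift-style interaction measure per word.

        events:     iterable of (page_id, ad_id, clicked) impression records
        page_words: dict page_id -> set of words on the page
        ad_words:   dict ad_id -> set of words in the ad

        Returns dict word -> CTR on impressions where the word appears on both
        page and ad, divided by the overall CTR; values well above 1 suggest
        the shared word is associated with extra clicks.
        """
        views, clicks = defaultdict(int), defaultdict(int)
        total_views = total_clicks = 0
        for page_id, ad_id, clicked in events:
            total_views += 1
            total_clicks += clicked
            shared = page_words[page_id] & ad_words[ad_id]
            for w in shared:
                views[w] += 1
                clicks[w] += clicked
        overall_ctr = total_clicks / max(total_views, 1)
        return {w: (clicks[w] / views[w]) / overall_ctr
                for w in views if overall_ctr > 0}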

Proposed Method

Word selection: two methods, data based and relevance based.
Relevance based: compute the average tf-idf score of each word over all pages and ads; higher values imply higher relevance.

Proposed Method

Word selection: two methods, data based and relevance based.
We picked the top 1000 words by each measure.
Data-based selection gives better results.
[Precision vs. recall plot comparing the two word-selection methods]

Proposed Method

Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing

Proposed Method

Finer resolutions than page-level or ad-level
The data has finer granularity: words appear in "regions", such as the title, headers, boldface text, and metadata.
Word matches in the title can be more important than matches in the body.
This is a simple extension of the model to region-specific features.
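One natural way to write such an extension (the notation here is illustrative, not taken from the paper) is to index the interaction features by region r as well as word w:

    I_{p,a,w,r} = tf_{p,w,r} \cdot tf_{a,w}, each with its own parameter \delta_{w,r},

so that a match on word w in the page title can receive a different learned weight than a match in the body.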

Proposed Method

Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing

Proposed Method

Fast Implementation: Training
Hadoop implementation of logistic regression.
[Diagram: data → random data splits → iterative Newton-Raphson on each split → per-split mean and variance estimates → combine estimates → learned model parameters]
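A minimal single-machine sketch of this split/fit/combine pattern follows (NumPy only; the per-split Newton-Raphson and the inverse-variance combination are one plausible reading of the diagram, not the authors' Hadoop job).

    import numpy as np

    def fit_logistic_newton(X, y, n_iter=25, ridge=1e-6):
        """Fit logistic regression by Newton-Raphson (IRLS).

        Returns (beta, var), where var holds per-coefficient variance
        estimates from the diagonal of the inverse Hessian at the optimum.
        """
        n, d = X.shape
        beta = np.zeros(d)
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-X @ beta))
            W = p * (1.0 - p)                          # IRLS weights
            grad = X.T @ (y - p) - ridge * beta
            hess = (X * W[:, None]).T @ X + ridge * np.eye(d)
            step = np.linalg.solve(hess, grad)
            beta += step
            if np.max(np.abs(step)) < 1e-8:
                break
        var = np.diag(np.linalg.inv(hess))
        return beta, var

    def fit_on_random_splits(X, y, n_splits=8, seed=0):
        """Split the data at random, fit each split independently (the part
        that can run in parallel, e.g. as map tasks), then combine the
        per-split estimates by inverse-variance weighting."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        betas, precisions = [], []
        for part in np.array_split(idx, n_splits):
            b, v = fit_logistic_newton(X[part], y[part])
            betas.append(b)
            precisions.append(1.0 / np.maximum(v, 1e-12))
        betas, precisions = np.array(betas), np.array(precisions)
        return (betas * precisions).sum(axis=0) / precisions.sum(axis=0)

Each per-split fit is independent, which is what makes the training embarrassingly parallel; only the small per-split (mean, variance) summaries need to be brought together at the end.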

Proposed Method

Fast Implementation: Testing
Main effect for ads is used in the ordering of ads in the postings lists (static).
Interaction effect is used to modify the idf-table of words (static).
Main effect for pages does not play a role in ad serving (the page is given).
All of this happens while building the postings lists.
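Schematically, reusing the illustrative notation from the model sketch above, the serving-time score for ad a_j on a fixed page p_i reduces to

    score(a_j | p_i) = \beta_{a_j} + \sum_{w \in p_i \cap a_j} \delta_w \, tf_{p_i,w} \, tf_{a_j,w}, where \beta_{a_j} = \sum_w \psi_w \, tf_{a_j,w}.

The per-ad constant \beta_{a_j} can be precomputed for the static ordering of ads in the postings lists, the per-word factors \delta_w fold into the static word (idf) table, and the page main effect drops out because the page is given. The score is then a plain sum of per-word contributions over shared words, exactly the form WAND can evaluate with cursor skipping.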

Proposed Method

Fast Implementation Testing

The model can be integrated into existing code.
No loss of performance or scalability of the existing system.

Proposed Method

Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing

Outline

Motivation
WAND Background
Proposed Method
Experiments
Conclusions

Experiments

[Precision vs. recall plot]
25% lift in precision at 10% recall

Experiments

[Precision vs. recall plot, magnified for the low-recall region]
25% lift in precision at 10% recall

Experiments

Increasing the number of words from 1000 to 3400 led to only marginal improvement: diminishing returns. The system already performs close to its limit, without needing more training.

Outline

Motivation
WAND Background
Proposed Method
Experiments
Conclusions

Conclusions

Relevance-based
Uses domain knowledge
Very low training cost
Simple computations at testing time

Click-based
Uses existing data; improves over time
Training is complicated
Efficiency concerns during testing

Combine the two
Parallel code for parameter fitting
Uses the existing serving system: no code changes or efficiency bottlenecks