Online Advertising Overview
[Diagram: Advertisers supply ads to an Ad Network, which picks ads to show alongside content from a Content Provider to the User. Examples: Yahoo, Google, MSN, RightMedia, …]
Advertising Setting
- Graphical display ads
- Mostly for brand awareness
- Revenue based on the number of impressions (not clicks)
(Display · Content Match · Sponsored Search)
Advertising Setting
- Pick ads: text ads
- Match ads to the content
(Display · Content Match · Sponsored Search)
Advertising Setting
- The user intent is unclear
- Revenue depends on the number of clicks
- The query (here, the webpage) is long and noisy
(Display · Content Match · Sponsored Search)
This presentation
1) Content Match [KDD '07]: How can we estimate the click-through rate (CTR) of an ad on a page?
[Diagram: ~10^6 ads × ~10^9 pages; estimate the CTR of ad j on page i]
This presentation
1) Estimating CTR for Content Match [KDD ‘07]
2) Traffic Shaping for Display Advertising [EC ‘12]
[Diagram: a homepage shows an article summary (with alternates); a click leads to the article page, where display ads are shown]
This presentation
1) Estimating CTR for Content Match [KDD ‘07]
2) Traffic Shaping for Display Advertising [EC ‘12]
Recommend articles (not ads): article summaries need a high CTR, and we prefer articles on which under-delivering ads can be shown.
This presentation
1) Estimating CTR for Content Match [KDD ‘07]
2) Traffic Shaping for Display Advertising [EC ‘12]
3) Theoretical underpinnings [COLT '10 best student paper]
- Represent relationships as a graph
- Recommendation = link prediction
- Many useful heuristics exist; why do these heuristics work?
Goal: Suggest friends
Estimating CTR for Content Match: Contextual Advertising
Show an ad on a webpage (an "impression"); revenue is generated if a user clicks.
Problem: Estimate the click-through rate (CTR) of an ad on a page.
[Diagram: ~10^6 ads × ~10^9 pages; CTR for ad j on page i]
Estimating CTR for Content Match: Why not use the MLE?
1. Few (page, ad) pairs have impressions (N > 0)
2. Very few have clicks (c > 0) as well
3. The MLE does not differentiate between 0/10 and 0/100
We have additional information: hierarchies.
Estimating CTR for Content Match: Use an existing, well-understood hierarchy.
Categorize ads and webpages to leaves of the hierarchy.
CTR estimates of siblings are correlated; the hierarchy allows us to aggregate data.
Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions.
Estimating CTR for Content Match: Region Hierarchy
A region = (page node, ad node). The region hierarchy is the cross-product of the page hierarchy and the ad hierarchy.
[Diagram: page hierarchy × ad hierarchy across levels 0 … i, with page classes and ad classes defining each region]
Data Transformation
Problem: raw CTR estimates cannot distinguish regions with 0 clicks, and their variance depends on N_r.
Solution: the Freeman-Tukey transform
- Differentiates regions with 0 clicks
- Variance stabilization
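As a concrete illustration, here is the Freeman-Tukey double-arcsine form for binomial counts; this is one standard variant of the transform, and the paper's exact version may differ in details:

```python
import math

def freeman_tukey(c, N):
    """Freeman-Tukey double-arcsine transform of c clicks in N impressions.
    Unlike the raw MLE c/N, it separates 0/10 from 0/100, and its variance
    is approximately constant (about 1/(N + 1/2)) regardless of the rate."""
    return math.asin(math.sqrt(c / (N + 1))) + math.asin(math.sqrt((c + 1) / (N + 1)))

# 0 clicks out of 10 and 0 clicks out of 100 now map to different values:
y10, y100 = freeman_tukey(0, 10), freeman_tukey(0, 100)
```

Note how more impressions with zero clicks push the transformed value closer to 0, which is exactly the distinction the MLE loses.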
Model
Goal: smoothing across siblings in the hierarchy [Huang & Cressie, 2000]
1. Each region r has a latent state S_r
2. y_r is independent of the hierarchy given S_r
3. S_r is drawn from its parent's state S_pa(r)
[Diagram: latent state S_parent at level i and S_1 … S_4 at level i+1, with observables y_1, y_2, y_4]
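A generative sketch of assumptions 1–3 above (the Gaussian forms and the parameter values here are illustrative assumptions, not the paper's exact parameterization):

```python
import random

def sample_tree(tree, root, N, W=0.05, V=1.0, seed=0):
    """Sample latent states S_r down the hierarchy, then observations y_r.
    Assumed forms: S_child ~ Normal(S_parent, W); y_r ~ Normal(S_r, V / N_r),
    so observation noise shrinks as a region accumulates impressions N_r."""
    rng = random.Random(seed)
    S = {root: rng.gauss(0.0, W ** 0.5)}  # root state
    y = {}
    stack = [root]
    while stack:
        r = stack.pop()
        y[r] = rng.gauss(S[r], (V / N[r]) ** 0.5)  # y_r depends only on S_r
        for child in tree.get(r, []):
            S[child] = rng.gauss(S[r], W ** 0.5)   # S_r drawn from its parent
            stack.append(child)
    return S, y
```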
Model
However, learning W_r, V_r, and β_r for each region is clearly infeasible. Assumptions:
- All regions at the same level ℓ share the same W(ℓ) and β(ℓ)
- V_r = V/N_r for some constant V, since the variance of the transformed response scales as 1/N_r
[Graphical model: S_pa(r) → S_r → y_r, with parameters w_r, V_r, β_r and features u_r]
Model
Implications: W(ℓ) determines the degree of smoothing.
- Large W(ℓ): S_r varies greatly from S_pa(r); each region learns its own S_r; no smoothing.
- W(ℓ) → 0: all S_r are identical; a regression model on the features u_r is learnt; maximum smoothing.
Model
Implications: Var(S_r) increases from root to leaf, so estimates are better at coarser resolutions.
Implications: determines degree of smoothing Var(Sr) increases from root to leaf Correlations among siblings at
level ℓ: Depends only on level of least common
ancestor
Model
25
Sr
yr
Vr βr
wr
ur
Spa(r)
Corr( , ) > Corr( , )
Estimating CTR for Content Match: Our Approach
1. Data transformation (Freeman-Tukey)
2. Model (tree-structured Markov chain)
3. Model fitting
Model Fitting
Fitting uses a Kalman filtering algorithm:
- Filtering: recursively aggregate data from leaves to root
- Smoothing: propagate information from root to leaves
Complexity: linear in the number of regions, in both time and space.
[Diagram: filtering up the tree, smoothing back down]
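The linear-time, two-pass structure can be sketched with a much simpler shrinkage rule in place of the exact Kalman recursions (the `strength` prior weight and the toy hierarchy are illustrative assumptions):

```python
# Toy hierarchy: internal node -> children; leaves carry (clicks, views).
tree = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
clicks = {"a1": 0, "a2": 3, "b1": 1}
views = {"a1": 50, "a2": 1000, "b1": 200}
est = {}

def filter_up(node):
    """Filtering pass: recursively aggregate counts from leaves to root."""
    if node not in tree:  # leaf
        return clicks[node], views[node]
    c = n = 0
    for child in tree[node]:
        cc, nn = filter_up(child)
        c, n = c + cc, n + nn
    clicks[node], views[node] = c, n
    return c, n

def smooth_down(node, parent_rate, strength=100.0):
    """Smoothing pass: shrink each node's rate toward its parent's estimate."""
    rate = (clicks[node] + strength * parent_rate) / (views[node] + strength)
    est[node] = rate
    for child in tree.get(node, []):
        smooth_down(child, rate, strength)

c, n = filter_up("root")
smooth_down("root", c / n)
# Each node is visited once per pass, so time and space are linear in the tree size.
```

Even the all-zero leaf `a1` receives a positive estimate, borrowed from its ancestors, while heavily observed leaves stay close to their own MLE.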
Model Fitting
The Kalman filter requires knowledge of β, V, and W; these are estimated via EM wrapped around the Kalman filter.
Experiments
- 503M impressions
- 7-level hierarchy, of which the top 3 levels were used
- Zero clicks in 76% of regions at level 2, and in 95% of regions at level 3
- Full dataset D_FULL, and a 2/3 sample D_SAMPLE
Experiments
- Estimate CTRs for all level-3 regions R with zero clicks in D_SAMPLE
- Some of these regions (the set R_>0) get clicks in D_FULL
- A good model should predict higher CTRs for R_>0 than for the other regions in R
Experiments
We compared 4 models:
- TS: our tree-structured model
- LM (level-mean): each level smoothed independently
- NS (no smoothing): CTR proportional to 1/N_r
- Random: assuming |R_>0| is given, randomly predict the membership of R_>0 out of R
The MLE is 0 everywhere, since 0 clicks were observed. What about the estimated CTRs?
[Plots: estimated CTR vs. impressions for No Smoothing (NS) and Our Model (TS). TS inherits variability from coarser resolutions and is close to the MLE for large N.]
Estimating CTR for Content Match: Summary
We presented a method to estimate the rates of extremely rare events at multiple resolutions, under severe sparsity.
Key points: a tree-structured generative model, and extremely fast parameter fitting.
Traffic Shaping
1) Estimating CTR for Content Match [KDD ‘07]
2) Traffic Shaping for Display Advertising [EC ‘12]
3) Theoretical underpinnings [COLT ‘10 best student paper]
Traffic Shaping
- Which article summary should be picked? Ans: the one with the highest expected CTR.
- Which ad should be displayed? Ans: the ad that minimizes underdelivery.
[Diagram: an article pool feeding the homepage]
Underdelivery
Advertisers are guaranteed some impressions (say, 1M) over some time period (say, 2 months):
- only to users matching their specs
- only when those users visit certain types of pages
- only on certain positions on the page
An underdelivering ad is one that is likely to miss its guarantee.
Underdelivery
How can underdelivery be computed? It requires user traffic forecasts, and it depends on the other ads in the system.
An ad-serving system will try to minimize underdelivery on this graph.
[Bipartite graph: forecasted impressions ℓ = (user, article, position) with supply s_ℓ, linked to the ad inventory j with demand d_j]
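As a concrete check on the definition, underdelivery over such a bipartite graph can be computed as each ad's shortfall against its guarantee (the data below is made up):

```python
def underdelivery(demand, alloc):
    """Total underdelivery: sum over ads j of max(0, d_j - delivered_j),
    where alloc[(l, j)] is the impressions supply node l sends to ad j."""
    delivered = {j: 0 for j in demand}
    for (l, j), x in alloc.items():
        delivered[j] += x
    return sum(max(0, demand[j] - delivered[j]) for j in demand)

# Ad j1 is guaranteed 80 impressions and receives them; j2 is guaranteed 50
# but receives only 20, so it underdelivers by 30.
shortfall = underdelivery({"j1": 80, "j2": 50}, {("l1", "j1"): 80, ("l1", "j2"): 20})
```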
Traffic Shaping
- Which article summary should be picked? Ans: the one with the highest expected CTR.
- Which ad should be displayed? Ans: the ad that minimizes underdelivery.
Goal: Combine the two.
Traffic Shaping
Goal: Bias the article summary selection to reduce underdelivery, with only an insignificant drop in CTR, and do this in real-time.
Formulation
[Traffic-shaping graph: k (user) → i (user, article) → ℓ (user, article, position), a "fully qualified impression", → j (ads). Edge quantities: supply s_k, traffic shaping fraction w_ki, CTR c_ki, ad delivery fraction φ_ℓj, demand d_j.]
Goal: Infer the traffic shaping fractions w_ki.
Formulation
Full traffic shaping graph: all forecasted user traffic × all available articles, arriving at the homepage or directly on the article page.
Goal: Infer w_ki, but we are forced to infer φ_ℓj as well.
[Full traffic-shaping graph, annotated with traffic shaping fractions w_ki, CTRs c_ki, and ad delivery fractions φ_ℓj]
Formulation
Objective: minimize underdelivery, subject to the demand constraints. The total user traffic flowing to ad j (accounting for CTR loss) is Σ s_k · w_ki · c_ki · φ_ℓj over the paths into j, and the demand constraints require it to cover d_j.
Formulation
Constraints:
- Bounds on the traffic shaping fractions
- Shape only available traffic
- Satisfy the demand constraints
- Valid ad delivery fractions
Key Transformation
This allows a reformulation solely in terms of new variables z_ℓj, where z_ℓj is the fraction of supply that is shown ad j, assuming the user always clicks the article.
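One natural reading of this substitution (an assumption on our part; the paper's exact definition may differ) is the product of the two decision variables, which removes the bilinear term from the traffic expression:

```latex
z_{\ell j} = w_{ki}\,\phi_{\ell j}
\quad\Rightarrow\quad
\sum_{\ell} s_k\, c_{ki}\, w_{ki}\, \phi_{\ell j}
= \sum_{\ell} s_k\, c_{ki}\, z_{\ell j},
```

which is linear in the new variables z_ℓj.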
Formulation
But we have another problem: at runtime, we must shape every incoming user without looking at the entire graph.
Solution:
- Periodically solve the convex problem offline
- Store a cache derived from this solution
- Reconstruct the optimal solution for each user at runtime, using only the cache
Real-time solution
[Quantities from the offline solution are cached; each user's solution is reconstructed from them at runtime.] All constraints can be expressed as constraints on σ_ℓ.
There are 3 KKT conditions, and the shape of the solution depends on the cached duals α_j:
1. [expresses z_ℓj in terms of the cached duals α_j and σ_ℓ; formula lost in extraction]
2. σ_ℓ = 0 unless Σ_j z_ℓj = max_ℓ Σ_j z_ℓj
3. Σ_ℓ σ_ℓ = constant, for all i connected to k
[Diagram: graph k → i → ℓ → j, with bounds L_i ≤ Σ_j z_ℓj ≤ U_i]
Real-time solution
Algorithm:
- Initialize σ_ℓ = 0
- Compute Σ_j z_ℓj from condition (1)
- If constraints are unsatisfied, increase σ_ℓ while satisfying conditions (2) and (3)
- Repeat
- Extract w_ki from z_ℓj
Results
Data: historical traffic logs from April 2011
- 25K user nodes
- Total supply weight > 50B impressions
- 100K ads
We compare our model against a scheme that picks articles to maximize expected CTR, and picks ads to display via a separate greedy method.
[Plot: lift in impressions delivered to underperforming ads vs. the fraction of traffic that is not shaped. Traffic shaping yields a nearly threefold improvement.]
[Plot: average CTR (as a percentage of the maximum CTR) vs. the fraction of traffic that is not shaped. The CTR drop is under 10%.]
Summary
- 3x underdelivery reduction with a < 10% CTR drop
- 2.6x reduction with a 4% CTR drop
- The runtime application needs only a small cache
Traffic Shaping
1) Estimating CTR for Content Match [KDD ‘07]
2) Traffic Shaping for Display Advertising [EC ‘12]
3) Theoretical underpinnings [COLT ‘10 best student paper]
Link Prediction
Which pair of nodes {i,j} should be connected?
[Example graph with users Alice, Bob, and Charlie]
Goal: Recommend a movie
Previous Empirical Studies*
[Chart: link prediction accuracy of Random, Shortest Path, Common Neighbors, Adamic/Adar, and an ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
How do we justify these observations, especially when the graph is sparse?
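For concreteness, two of the heuristics from this chart can be sketched on a toy undirected graph (the graph itself is made up; Adamic/Adar down-weights popular common neighbors by 1/log(degree)):

```python
import math

# Toy undirected graph as adjacency sets.
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4, 5},
    4: {1, 3},
    5: {3},
}

def common_neighbors(i, j):
    """Score the pair (i, j) by the number of shared neighbors."""
    return len(adj[i] & adj[j])

def adamic_adar(i, j):
    """Like common neighbors, but a popular (high-degree) shared neighbor
    counts for less: each contributes 1 / log(degree)."""
    return sum(1.0 / math.log(len(adj[k])) for k in adj[i] & adj[j])
```

Here nodes 2 and 4 share neighbors {1, 3}; Adamic/Adar scores node 1 (degree 3) higher than the busier node 3 (degree 4).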
Link Prediction – Generative Model
Model (in a unit-volume universe):
1. Nodes are uniformly distributed points in a latent space
2. This space has a distance metric
3. Points close to each other are likely to be connected in the graph
Logistic distance function (Raftery et al., 2002)
[Plot: link probability vs. distance, falling from 1 through ½ at radius r; α determines the steepness]
Link prediction ≈ finding the nearest neighbor who is not currently linked to the node. This is equivalent to inferring distances in the latent space.
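The logistic distance function on this slide can be written directly (the values of r and α below are illustrative):

```python
import math

def link_prob(d, r=0.3, alpha=10.0):
    """Logistic link probability in latent distance d (Raftery-style):
    near 1 well inside radius r, exactly 1/2 at d == r, and dropping
    toward 0 beyond it. alpha sets the steepness of the drop."""
    return 1.0 / (1.0 + math.exp(alpha * (d - r)))
```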
Common Neighbors
Pr²(i, j) = Pr(common neighbor | d_ij):
Pr²(i, j) = ∫∫ Pr(i ~ k | d_ik) · Pr(k ~ j | d_jk) · P(d_ik, d_jk | d_ij) dd_ik dd_jk
A product of two logistic probabilities, integrated over a volume determined by d_ij.
[Diagram: nodes i and j with a common neighbor k]
Common Neighbors
OPT = the node closest to i; MAX = the node with the most common neighbors with i.
Theorem: w.h.p., d_OPT ≤ d_MAX ≤ d_OPT + 2·[ε/V(1)]^(1/D)
So link prediction by common neighbors is asymptotically optimal.
Common Neighbors: Distinct Radii
Node k has radius r_k: i → k if d_ik ≤ r_k (a directed graph). r_k captures the popularity of node k.
"Weighted" common neighbors: predict the (i, j) pairs with the highest Σ_r w(r) η(r), where w(r) is the weight for nodes of radius r and η(r) is the number of common neighbors of radius r.
[Diagram: i and j with common neighbors k and m of differing radii]
Type 2 Common Neighbors
[Chart: the weight w(r) as a function of the radius r, ranging from a constant weight to an Adamic/Adar-style 1/r weight as r approaches the maximum radius. Real-world graphs generally fall in this range. The presence of a common neighbor is very informative at one extreme; its absence is very informative at the other.]
ℓ-hop Paths
Common neighbors = 2-hop paths. For longer paths the bounds are weaker: for ℓ' ≥ ℓ we need η_ℓ' ≫ η_ℓ to obtain similar bounds. This justifies the exponentially decaying weight given to longer paths by the Katz measure.
Summary
Three key ingredients:
1. Closer points are likelier to be linked (small-world models: Watts & Strogatz, 1998; Kleinberg, 2001)
2. The triangle inequality holds (necessary to extend to ℓ-hop paths)
3. Points are spread uniformly at random (otherwise properties depend on location as well as distance)
Summary
[Chart: link prediction accuracy of Random, Shortest Path, Common Neighbors, Adamic/Adar, and an ensemble of short paths]
*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
- The number of paths matters, not their length
- For large dense graphs, common neighbors are enough
- Differentiating between different degrees is important
- In sparse graphs, paths of length 3 or more help in prediction
Conclusions
Discussed three problems:
1. Estimating CTR for Content Match: combat sparsity by hierarchical smoothing
2. Traffic Shaping for Display Advertising: joint optimization of CTR and underdelivery reduction; optimal traffic shaping at runtime using cached duals
3. Theoretical underpinnings: a latent space model; link prediction ≈ finding nearest neighbors in this space
Other Work
Web Search
Finding Quicklinks
Titles for Quicklinks
Incorporating tweets into search results
Website clustering
Webpage segmentation
Template detection
Finding hidden query aspects
Computational Advertising
Combining IR with click feedback
Multi-armed bandits using hierarchies
Online learning under finite ad lifetimes
Graph Mining
Epidemic thresholds
Non-parametric prediction in dynamic graphs
Graph sampling
Graph generation models
Community detection