Post on 23-Feb-2016
Graphs, Algorithms and Big Data: the Google AdWords Case study
GDG DevFest Central Italy 2013
Alessandro Epasto
Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.
The AdWords Problem
Soccer Shoes
Google Advertisement in Numbers
Over a billion queries a day. A lot of advertisers.
www.google.com/competition/howgooglesearchworks.html
Challenges
Several scientific and technological challenges.
How to find the best ads in real time?
How to price each ad?
How to suggest new queries to advertisers?
The solution to these problems involves some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism).
Google Advertisement in Numbers
2012 revenues: 46 billion USD.
95% from advertising: 43 billion USD.
http://investor.google.com/financial/tables.html
Goals of the Project
Mining AdWords data to automatically identify each advertiser's main competitors and to suggest relevant queries to each advertiser.
Goals:
Useful business information.
Improve advertisement.
More relevant performance benchmarks.
Information Deluge
Large advertisers (e.g. Amazon, Ask.com) compete in several market segments with very different advertisers.
Query Information
Nike store New York: Market Segment: Retailer, Geo: NY (USA), Stats: 10 clicks
Soccer shoes: Market Segment: Apparel, Geo: London, UK, Stats: 4 clicks
Soccer ball: Market Segment: Equipment, Geo: San Francisco, CA, Stats: 5 clicks
… millions of other queries …
Representing the data
How to represent the salient features of the data?
Relationships between advertisers and queries.
Statistics: clicks, costs, etc.
Take into account the categories.
Efficient algorithms.
Graphs: the lingua franca of Big Data
Mathematical objects studied well before the advent of computers.
Königsberg’s bridges problem. Euler, 1735.
Graphs are everywhere!
Social networks, technological networks, natural networks.
Formal definition
A set of nodes: A, B, C, D.
A set of edges connecting pairs of nodes.
The edges might have a weight (1, 4, 2, 3 in the example).
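The definition above can be sketched in a few lines of Python. This is only an illustration: the slide shows nodes A to D and weights 1, 4, 2, 3, but which weight belongs to which edge is not recoverable, so the pairing below is an assumption, as is the helper name `build_adjacency`.

```python
# A small undirected weighted graph stored as an adjacency map.
# The node names come from the slide; the assignment of weights
# to edges is hypothetical.
nodes = {"A", "B", "C", "D"}
weighted_edges = [("A", "B", 1), ("B", "C", 4), ("C", "D", 2), ("D", "A", 3)]


def build_adjacency(nodes, weighted_edges):
    """Return node -> {neighbor: weight} for an undirected graph."""
    adjacency = {node: {} for node in nodes}
    for u, v, weight in weighted_edges:
        adjacency[u][v] = weight
        adjacency[v][u] = weight  # undirected: store both directions
    return adjacency


adjacency = build_adjacency(nodes, weighted_edges)
print(adjacency["A"])
```

An adjacency map makes it cheap to list the neighbors of a node, which is the basic operation behind random walks on the graph.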
AdWords data as a (Bipartite) Graph
A lot of advertisers.
Billions of queries.
Hundreds of labels.
Semi-Formal Problem Definition
[Figure: bipartite graph of Advertisers and Queries, with a highlighted advertiser node A and a set of Labels.]
Goal: find the nodes most “similar” to A.
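One way to picture the input is a list of labeled, weighted advertiser-query edges that can be restricted to a chosen set of labels. The sketch below is an assumption-laden illustration: the queries and labels are taken from the earlier example slide, while the advertiser ids (`adv_1`, ...), the per-edge click counts and the function name `restrict_to_labels` are hypothetical.

```python
from collections import defaultdict

# (advertiser, query, label, clicks). Advertiser ids and the per-edge
# click counts are made up for illustration.
EDGES = [
    ("adv_1", "nike store new york", "Retailer", 10),
    ("adv_1", "soccer shoes", "Apparel", 4),
    ("adv_2", "soccer shoes", "Apparel", 4),
    ("adv_2", "soccer ball", "Equipment", 5),
    ("adv_3", "soccer ball", "Equipment", 5),
]


def restrict_to_labels(edges, labels):
    """Keep only edges whose label is in `labels`; return an adjacency map."""
    graph = defaultdict(dict)
    for advertiser, query, label, clicks in edges:
        if label in labels:
            graph[advertiser][query] = clicks
            graph[query][advertiser] = clicks
    return dict(graph)


apparel = restrict_to_labels(EDGES, {"Apparel"})
print(sorted(apparel))
```

The similarity of other nodes to A would then be computed on the restricted graph for the labels of interest.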
How to Define Similarity?
Several node similarity measures exist in the literature, based on the graph structure, random walks, etc.
What is the accuracy?
Can it scale to graphs with billions of nodes?
Can it be computed in real time?
The three ingredients of Big Data
A lot of data…
A sophisticated infrastructure: MapReduce
Efficient algorithms: Graph mining
MapReduce
The work is spread in parallel across several machines connected by fast links.
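The programming model can be sketched as a single-process simulation of its three phases (map, shuffle, reduce). This is not Google's MapReduce API; all function names and the sample records are hypothetical, and a real system would run the map and reduce calls on different machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def clicks_mapper(record):
    advertiser, _query, clicks = record
    yield advertiser, clicks


def sum_reducer(_key, values):
    return sum(values)


# Hypothetical (advertiser, query, clicks) records.
records = [
    ("adv_1", "soccer shoes", 4),
    ("adv_2", "soccer shoes", 3),
    ("adv_1", "soccer ball", 5),
]
totals = reduce_phase(shuffle(map_phase(records, clicks_mapper)), sum_reducer)
print(totals)
```

Because each mapper call sees one record and each reducer call sees one key, both phases can be split across machines without coordination.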
Algorithms
Personalized PageRank:
Random walks on the graph.
Closely related to the celebrated Google PageRank™.
Personalized PageRank
Idea: perform a very long random walk (starting from v).
Ranking nodes by probability of visit assigns a similarity score to each node w.r.t. node v.
Strong community bias (this can be formalized).
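A minimal sketch of this idea, assuming the standard formulation in which the walk restarts at v with a fixed probability at every step (the restart probability `alpha = 0.15`, the function name and the toy graph are assumptions, not taken from the talk). It uses plain power iteration, not the approximation used in the project.

```python
def personalized_pagerank(adjacency, source, alpha=0.15, iterations=100):
    """Power iteration for Personalized PageRank.

    adjacency: node -> {neighbor: weight}; every node needs a neighbor.
    Returns node -> probability that the restarting walk visits the node.
    """
    scores = {node: 0.0 for node in adjacency}
    scores[source] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in adjacency}
        updated[source] += alpha  # restart: jump back to the source
        for node, neighbors in adjacency.items():
            total = sum(neighbors.values())
            for neighbor, weight in neighbors.items():
                # Otherwise follow an edge with probability proportional to its weight.
                updated[neighbor] += (1 - alpha) * scores[node] * weight / total
        scores = updated
    return scores


# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
graph = {}
for u, v in edges:
    graph.setdefault(u, {})[v] = 1.0
    graph.setdefault(v, {})[u] = 1.0

scores = personalized_pagerank(graph, "a")
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

On this toy graph every node in the source's own triangle scores higher than every node in the other triangle, which is a small instance of the community bias mentioned above.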
Personalized PageRank
Exact computation is infeasible (O(n^3)), but it can be approximated very well.
Very efficient MapReduce algorithm scaling to large graphs (hundreds of millions of nodes).
However…
Algorithmic Bottleneck
Our graphs are simply too big (billions of nodes) even for large-scale systems.
MapReduce is not real-time.
We cannot precompute the results for all subsets of categories (exponential time!).
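The exponential blow-up is easy to quantify: k categories have 2^k - 1 non-empty subsets. A short illustration (the function name is hypothetical, and the values of k are chosen only as examples):

```python
def count_category_subsets(num_categories):
    """Number of non-empty subsets of a set of categories."""
    return 2 ** num_categories - 1


for k in (10, 20, 100):
    print(k, count_category_subsets(k))
```

With only 20 categories there are already more than a million subsets, and with 100 categories there are more than 10^30, so one precomputed ranking per subset is out of reach.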
1st idea: Tackling Real Graph Structure
Data size is the main bottleneck. Compressing the graph would speed up the computation.
[Figure: step 1 reduces the graph of advertisers (A, B) and queries (a to g) to a graph of only advertisers; step 2 produces a ranking of the entire graph.]
Theorem: the ranking computed is the correct Personalized PageRank on the entire graph.
Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, ’61; Meyer, ’89, etc.).
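The slides do not give the construction behind the theorem. As a rough illustration of what removing the query nodes can look like, the sketch below collapses every advertiser-query-advertiser path into a single advertiser-to-advertiser transition probability. The function name, the unit weights and the choice of which queries A and B share are all assumptions.

```python
from collections import defaultdict


def collapse_to_advertisers(advertiser_to_queries):
    """Two-step transition probabilities between advertisers.

    advertiser_to_queries: advertiser -> {query: weight}.
    Returns advertiser -> {advertiser: probability}, where each probability
    is that of walking advertiser -> query -> advertiser.
    """
    # Total weight per query, to normalise the query -> advertiser step.
    query_totals = defaultdict(float)
    for queries in advertiser_to_queries.values():
        for query, weight in queries.items():
            query_totals[query] += weight

    two_step = {}
    for source, queries in advertiser_to_queries.items():
        source_total = sum(queries.values())
        row = defaultdict(float)
        for query, weight in queries.items():
            to_query = weight / source_total
            for target, target_queries in advertiser_to_queries.items():
                if query in target_queries:
                    back = target_queries[query] / query_totals[query]
                    row[target] += to_query * back
        two_step[source] = dict(row)
    return two_step


# Hypothetical wiring: A and B share only the query d.
bipartite = {
    "A": {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0},
    "B": {"d": 1.0, "e": 1.0, "f": 1.0, "g": 1.0},
}
compressed = collapse_to_advertisers(bipartite)
print(compressed["A"])
```

The compressed graph has one node per advertiser, so a walk on it touches far fewer nodes than a walk on the full bipartite graph.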
Two-stage Approach
First stage: large-scale (but feasible) MapReduce pre-computation.
Second stage: fast iterative algorithm.
First Stage: Individual Category Rankings
[Figure: the graph of Advertisers and Queries, with one precomputed ranking per category.]
Second Stage: Rank aggregation
[Figure: two precomputed rankings are combined into the ranking of Red + Yellow.]
A real-time iterative algorithm aggregates the rankings of a given node for a subset of the categories.
Experimental evaluation shows the accuracy of the results.
Fully implemented and currently under evaluation for integration in production systems.
Ongoing research project for future scientific publications.
Conclusions
Thank you for your attention