Graphs, Algorithms and Big Data: the Google AdWords Case Study

GDG DevFest Central Italy 2013

Alessandro Epasto

Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.

The AdWords Problem

Soccer Shoes

Google Advertisement in Numbers

Over a billion queries a day. A lot of advertisers.

www.google.com/competition/howgooglesearchworks.html

Challenges

Several scientific and technological challenges.

How to find the best ads in real time?

How to price each ad?

How to suggest new queries to advertisers?

The solution to these problems involves some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism)

Google Advertisement in Numbers

2012 revenues: 46 billion USD.

95% from advertising: 43 billion USD.

http://investor.google.com/financial/tables.html

Goals of the Project

Mining AdWords data to automatically identify each advertiser's main competitors and to suggest relevant queries to each advertiser.

Goals:

Useful business information.

Improve advertising.

More relevant performance benchmarks.

Information Deluge

Large advertisers (e.g. Amazon, Ask.com) compete in several market segments with very different advertisers.

Query Information

Nike store New York. Market Segment: Retailer, Geo: NY (USA), Stats: 10 clicks

Soccer shoes. Market Segment: Apparel, Geo: London, UK, Stats: 4 clicks

Soccer ball. Market Segment: Equipment, Geo: San Francisco, CA, Stats: 5 clicks

… millions of other queries …

Representing the data

How to represent the salient features of the data?

Relationships between advertisers and queries.

Statistics: clicks, costs, etc.

Take into account the categories.

Efficient algorithms.

Graphs: the lingua franca of Big Data

Mathematical objects studied long before computers existed.

Königsberg’s bridges problem. Euler, 1735.

Graphs are everywhere!

Social Networks

Technological Networks

Natural Networks

Formal definition

A set of Nodes

A set of Edges

The edges might have a weight
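As a minimal sketch of this definition, a weighted graph can be stored as a set of nodes plus a weight for every edge. The class name `WeightedGraph` and the example nodes are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class WeightedGraph:
    """Undirected graph: a set of nodes and a set of weighted edges."""

    def __init__(self):
        self.nodes = set()
        # adjacency[u][v] holds the weight of the edge between u and v.
        self.adjacency = defaultdict(dict)

    def add_edge(self, u, v, weight=1.0):
        """Add an undirected edge; both endpoints become nodes."""
        self.nodes.update((u, v))
        self.adjacency[u][v] = weight
        self.adjacency[v][u] = weight

    def neighbors(self, u):
        """Return {neighbor: weight} for node u (empty if u has no edges)."""
        return dict(self.adjacency.get(u, {}))


g = WeightedGraph()
g.add_edge("a", "b", weight=2.0)
g.add_edge("b", "c")  # unweighted edges default to weight 1.0
```

An adjacency mapping keeps neighbor lookups cheap, which is what random-walk algorithms need most.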

AdWords data as a (Bipartite) Graph

A lot of Advertisers

Billions of Queries

Hundreds of Labels
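A small sketch of how such data could be loaded into a bipartite structure, with advertisers on one side, queries on the other, and clicks as edge weights. The advertiser names, the pairing of advertisers with queries, and the click counts are invented for illustration; only the query strings and labels echo the Query Information examples.

```python
from collections import defaultdict

# Hypothetical records: (advertiser, query, label, clicks).
records = [
    ("adv_1", "nike store new york", "Retailer", 10),
    ("adv_1", "soccer shoes", "Apparel", 4),
    ("adv_2", "soccer shoes", "Apparel", 3),
    ("adv_2", "soccer ball", "Equipment", 5),
]

adv_to_queries = defaultdict(dict)  # advertiser -> {query: clicks}
query_to_advs = defaultdict(dict)   # query -> {advertiser: clicks}
query_label = {}                    # query -> label (category)

for advertiser, query, label, clicks in records:
    adv_to_queries[advertiser][query] = clicks
    query_to_advs[query][advertiser] = clicks
    query_label[query] = label


def shared_queries(a, b):
    """Queries on which advertisers a and b both appear."""
    return set(adv_to_queries[a]) & set(adv_to_queries[b])


def queries_with_label(label):
    """Restrict the query side of the graph to a single label."""
    return {q for q, l in query_label.items() if l == label}
```

Edges only ever join an advertiser to a query, never two nodes on the same side, which is what makes the graph bipartite.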

Semi-Formal Problem Definition

[Figure: bipartite graph of Advertisers and Queries, with Labels]

Goal: Find the nodes most “similar” to A.

How to Define Similarity?

Several node similarity measures exist in the literature, based on the graph structure, random walks, etc.

What is the accuracy?

Can it scale to graphs with billions of nodes?

Can it be computed in real time?

The three ingredients of Big Data

A lot of data…

A sophisticated infrastructure: MapReduce

Efficient algorithms: Graph mining

MapReduce

The work is spread across many machines working in parallel, connected by fast links.
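To make the programming model concrete, here is a single-process simulation of the map, shuffle and reduce phases that totals clicks per query. It is not the API of any real MapReduce system; the function names and the records are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    advertiser, query, clicks = record
    yield query, clicks


def shuffle(pairs):
    """Group all values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values that share a key."""
    return key, sum(values)


# Hypothetical (advertiser, query, clicks) records.
records = [
    ("adv_1", "soccer shoes", 4),
    ("adv_2", "soccer shoes", 3),
    ("adv_2", "soccer ball", 5),
]

pairs = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real deployment each call to `map_phase` and `reduce_phase` could run on a different machine, since neither depends on shared state.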

Algorithms

Personalized PageRank:

Random walks on the graph.

Closely related to the celebrated Google PageRank™.

Personalized PageRank

Idea: perform a very long random walk (starting from v).

Ranking nodes by probability of visit assigns each node a similarity score w.r.t. node v.

Strong community bias (this can be formalized).
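A minimal power-iteration sketch of Personalized PageRank in its standard formulation, where the walk jumps back to v with probability `alpha` at every step. The value `alpha=0.15`, the iteration count, and the toy graph (two triangles joined by one edge) are assumptions for illustration, not parameters from the talk.

```python
def personalized_pagerank(adjacency, source, alpha=0.15, iterations=100):
    """adjacency: {node: {neighbor: weight}}. Returns {node: visit probability}."""
    scores = {node: 0.0 for node in adjacency}
    scores[source] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in adjacency}
        updated[source] += alpha  # restart: jump back to the source
        for node, neighbors in adjacency.items():
            total = sum(neighbors.values())
            if total == 0:
                # Node without edges: send its mass back to the source.
                updated[source] += (1 - alpha) * scores[node]
                continue
            for neighbor, weight in neighbors.items():
                updated[neighbor] += (1 - alpha) * scores[node] * weight / total
        scores = updated
    return scores


# Toy graph: community {a, b, c} and community {d, e, f}, linked by c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, {})[v] = 1.0
    adjacency.setdefault(v, {})[u] = 1.0

scores = personalized_pagerank(adjacency, "a")
ranking = sorted(scores, key=scores.get, reverse=True)
```

Starting from `a`, nodes in the same triangle score higher than nodes in the other one, which is the community bias in miniature.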

Personalized PageRank

Exact computation is infeasible (O(n^3)), but it can be approximated very well.

A very efficient MapReduce algorithm scales to large graphs (hundreds of millions of nodes).
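One well-known way to approximate Personalized PageRank is Monte Carlo simulation: run many short random walks from v, stop each walk with probability `alpha` at every step, and count where the walks end. The sketch below shows that technique on one machine; it is not necessarily the approximation used in this project, and `alpha`, `walks`, `seed` and the toy graph are assumptions.

```python
import random
from collections import Counter


def monte_carlo_ppr(adjacency, source, alpha=0.15, walks=20000, seed=0):
    """Estimate Personalized PageRank from the endpoints of random walks."""
    rng = random.Random(seed)
    endpoints = Counter()
    for _ in range(walks):
        node = source
        # Keep walking until the stop event, which has probability alpha.
        while rng.random() >= alpha:
            neighbors = adjacency[node]
            if not neighbors:
                node = source  # dead end: restart from the source
                continue
            node = rng.choices(list(neighbors),
                               weights=list(neighbors.values()))[0]
        endpoints[node] += 1
    return {node: endpoints[node] / walks for node in adjacency}


# Toy graph: two triangles {a, b, c} and {d, e, f} joined by the edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, {})[v] = 1.0
    adjacency.setdefault(v, {})[u] = 1.0

estimates = monte_carlo_ppr(adjacency, "a")
```

Each walk is independent of the others, so the simulation splits naturally across machines.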

However…

Algorithmic Bottleneck

Our graphs are simply too big (billions of nodes) even for large-scale systems.

MapReduce is not real-time.

We cannot precompute the results for all subsets of categories (exponential time!).

1st idea: Tackling Real Graph Structure

Data size is the main bottleneck. Compressing the graph would speed up the computation.

[Figure: panels “Advertisers and queries” and “Only advertisers”, with nodes a to g and A, B; “Ranking of the entire graph” versus “Only advertisers”]

Theorem: the ranking computed is the correct Personalized PageRank on the entire graph.

Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, ’61; Meyer, ’89, etc.).
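To give a feel for how the query side can be eliminated, the sketch below collapses a bipartite graph into two-step transition probabilities between advertisers (advertiser, then query, then advertiser). This only illustrates the general idea of working on a smaller chain; it is not the compression scheme or the theorem from the talk, whose details are not given here. All names and weights are hypothetical.

```python
from collections import defaultdict


def two_step_transitions(adv_to_queries, query_to_advs):
    """Probability of moving advertiser -> query -> advertiser in two steps."""
    transitions = {}
    for a, queries in adv_to_queries.items():
        total_a = sum(queries.values())
        row = defaultdict(float)
        for q, w_aq in queries.items():
            total_q = sum(query_to_advs[q].values())
            for b, w_qb in query_to_advs[q].items():
                # Step 1: a -> q, step 2: q -> b.
                row[b] += (w_aq / total_a) * (w_qb / total_q)
        transitions[a] = dict(row)
    return transitions


# Hypothetical bipartite graph: A bids on q1 and q2, B bids on q2 only.
adv_to_queries = {"A": {"q1": 1.0, "q2": 1.0}, "B": {"q2": 1.0}}
query_to_advs = {"q1": {"A": 1.0}, "q2": {"A": 1.0, "B": 1.0}}

transitions = two_step_transitions(adv_to_queries, query_to_advs)
```

Every row of `transitions` sums to 1, so the result is itself a random walk, defined on advertisers alone.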

Two-stage Approach

First Stage: Large-scale (but feasible) MapReduce pre-computation.

Second Stage: Fast iterative algorithm.

First Stage: Individual Category Rankings

[Figure: graph of Advertisers and Queries, with Precomputed Rankings]

Second Stage: Rank aggregation

[Figure: Precomputed Rankings and the Ranking of Red + Yellow]

A real-time iterative algorithm aggregates the rankings of a given node for a subset of the categories.

Experimental evaluation shows the accuracy of the results.

Fully implemented and currently under evaluation for integration in production systems.

Ongoing research project for future scientific publications.

Conclusions

Thank you for your attention
