graphs, algorithms and big data: the google adwords case study

Post on 23-Feb-2016

DESCRIPTION

Graphs, Algorithms and Big Data: the Google AdWords Case study. GDG DevFest Central Italy 2013. Alessandro Epasto. Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.

TRANSCRIPT

Graphs, Algorithms and Big Data: the Google AdWords Case study

GDG DevFest Central Italy 2013

Alessandro Epasto

Joint work with J. Feldman, S. Lattanzi, V. Mirrokni (Google Research), S. Leonardi (Sapienza U. Rome), H. Lynch (Google) and the AdWords team.

The AdWords Problem

Soccer Shoes

Google Advertisement in Numbers

Over a billion queries a day. A lot of advertisers.

www.google.com/competition/howgooglesearchworks.html

Challenges

Several scientific and technological challenges.
How to find the best ads in real time?
How to price each ad?
How to suggest new queries to advertisers?

The solution to these problems involves some fundamental scientific results (e.g. a Nobel Prize-winning auction mechanism)
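The slide does not name the auction mechanism. Assuming it refers to the second-price sealed-bid (Vickrey) auction, the pricing rule can be sketched as follows. The function name and the bids are illustrative only, and this is not a description of the production AdWords auction.

```python
def second_price_auction(bids):
    """Return (winner, price) for a sealed-bid second-price auction.

    `bids` maps a bidder name to its bid. The highest bidder wins,
    but pays the second-highest bid rather than its own.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    # Sort bidders from highest to lowest bid.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


# Hypothetical bids for a single ad slot.
WINNER, PRICE = second_price_auction({"x": 3.0, "y": 5.0, "z": 4.0})
```

Because the price does not depend on the winner's own bid, a bidder gains nothing by bidding below its true value.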

Google Advertisement in Numbers

2012 revenues: 46 billion USD.
95% from advertising: 43 billion USD.

http://investor.google.com/financial/tables.html

Goals of the Project

Mining AdWords data to automatically identify each advertiser's main competitors and to suggest relevant queries to each advertiser.

Goals:
Useful business information.
Improve advertisement.
More relevant performance benchmarks.

Information Deluge

Large advertisers (e.g. Amazon, Ask.com) compete in several market segments with very different advertisers.

Query Information

Nike store New York. Market Segment: Retailer, Geo: NY (USA), Stats: 10 clicks

Soccer shoes. Market Segment: Apparel, Geo: London, UK, Stats: 4 clicks

Soccer ball. Market Segment: Equipment, Geo: San Francisco, CA, Stats: 5 clicks

… millions of other queries …

Representing the data

How to represent the salient features of the data?
Relationships between advertisers and queries.
Statistics: clicks, costs, etc.
Take into account the categories.
Efficient algorithms.

Graphs: the lingua franca of Big Data

Mathematical objects studied well before the history of computers.

Königsberg’s bridges problem. Euler, 1735.

Graphs are everywhere!

Social networks, technological networks, natural networks.

Formal definition

A set of nodes: A, B, C, D.

A set of edges between pairs of nodes.

The edges might have a weight (in the figure: 1, 4, 2, 3).
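A minimal sketch of this definition as a data structure. The node names and the weights 1, 4, 2, 3 come from the slide, but the transcript does not say which edge carries which weight, so the endpoints below are an assumption.

```python
def build_adjacency(nodes, weighted_edges):
    """Return {node: {neighbor: weight}} for an undirected weighted graph."""
    adjacency = {node: {} for node in nodes}
    for u, v, weight in weighted_edges:
        adjacency[u][v] = weight
        adjacency[v][u] = weight  # undirected: store both directions
    return adjacency


NODES = ["A", "B", "C", "D"]
# Weights from the slide; the choice of endpoints is assumed.
EDGES = [("A", "B", 1), ("A", "C", 4), ("B", "D", 2), ("C", "D", 3)]
GRAPH = build_adjacency(NODES, EDGES)
```

An adjacency dictionary keeps neighbor lookups cheap, which is what random-walk algorithms need most.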

AdWords data as a (Bipartite) Graph

A lot of advertisers. Billions of queries.

Hundreds of labels.
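A small sketch of how such a bipartite graph could be assembled from query records. The queries and market segments echo the earlier example slide; the advertiser names, the advertiser-to-query assignment and the per-edge click counts are hypothetical.

```python
from collections import defaultdict

# (advertiser, query, label, clicks); advertisers and click splits are made up.
RECORDS = [
    ("advertiser_1", "nike store new york", "Retailer", 10),
    ("advertiser_1", "soccer shoes", "Apparel", 3),
    ("advertiser_2", "soccer shoes", "Apparel", 1),
    ("advertiser_2", "soccer ball", "Equipment", 5),
]


def build_bipartite_graph(records):
    """Build click-weighted edges in both directions plus a label per query."""
    adv_to_query = defaultdict(dict)
    query_to_adv = defaultdict(dict)
    query_label = {}
    for adv, query, label, clicks in records:
        # Accumulate clicks in case the same pair appears more than once.
        adv_to_query[adv][query] = adv_to_query[adv].get(query, 0) + clicks
        query_to_adv[query][adv] = query_to_adv[query].get(adv, 0) + clicks
        query_label[query] = label
    return dict(adv_to_query), dict(query_to_adv), query_label


ADV_TO_QUERY, QUERY_TO_ADV, QUERY_LABEL = build_bipartite_graph(RECORDS)
```

Edges only ever connect an advertiser to a query, which is what makes the graph bipartite; two advertisers are related only through the queries they share.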

Semi-Formal Problem Definition

[Figure: bipartite graph of advertisers and queries, with a highlighted advertiser A and a set of labels.]

Goal: find the nodes most “similar” to A.

How to Define Similarity?

The literature offers several node similarity measures based on the graph structure, random walks, etc.
What is their accuracy?
Can they scale to graphs with billions of nodes?
Can they be computed in real time?

The three ingredients of Big Data

A lot of data…

A sophisticated infrastructure: MapReduce

Efficient algorithms: Graph mining

MapReduce

The work is spread in parallel across several machines connected by fast links.
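To make the programming model concrete, here is a single-process sketch of the map, shuffle and reduce phases that sums clicks per market segment. It only imitates the model; a real MapReduce runs each phase on many machines. The helper names are illustrative, and the "running shoes" record is made up.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def clicks_mapper(record):
    query, segment, clicks = record
    yield segment, clicks  # key: market segment, value: clicks


def sum_reducer(key, values):
    return sum(values)


RECORDS = [
    ("nike store new york", "Retailer", 10),
    ("soccer shoes", "Apparel", 4),
    ("soccer ball", "Equipment", 5),
    ("running shoes", "Apparel", 7),  # hypothetical extra record
]
CLICKS_PER_SEGMENT = reduce_phase(shuffle(map_phase(RECORDS, clicks_mapper)), sum_reducer)
```

Mappers and reducers never share state, which is what lets the framework spread them across machines.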

Algorithms

Personalized PageRank:
Random walks on the graph.
Closely related to the celebrated Google PageRank™.

Personalized PageRank

Idea: perform a very long random walk (starting from v).

Ranking nodes by their probability of being visited assigns each node a similarity score w.r.t. node v.

Strong community bias (this can be formalized).
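The idea above can be sketched with a plain power iteration for a random walk that, at every step, restarts at v with probability alpha. The restart probability of 0.15 and the toy graph (two triangles joined by one edge) are assumptions chosen for illustration; this is a textbook formulation, not the talk's production algorithm.

```python
def personalized_pagerank(graph, source, alpha=0.15, iterations=100):
    """Approximate Personalized PageRank scores w.r.t. `source`.

    `graph` maps node -> {neighbor: weight}; every node needs at least one edge.
    At each step the walk restarts at `source` with probability `alpha`,
    otherwise it moves to a neighbor chosen proportionally to edge weight.
    """
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        nxt[source] = alpha  # restart mass (total mass is always 1)
        for node, mass in scores.items():
            total = sum(graph[node].values())
            for nbr, weight in graph[node].items():
                nxt[nbr] += (1 - alpha) * mass * weight / total
        scores = nxt
    return scores


# Two communities {a, b, c} and {d, e, f} joined by the edge c-d.
GRAPH = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "e": 1, "f": 1},
    "e": {"d": 1, "f": 1},
    "f": {"d": 1, "e": 1},
}
SCORES = personalized_pagerank(GRAPH, "a")
RANKING = sorted(SCORES, key=SCORES.get, reverse=True)
```

Nodes in the same triangle as the start node score higher than nodes in the other triangle, which is the community bias mentioned on the slide.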

Exact computation is infeasible, O(n^3), but it can be approximated very well.

A very efficient MapReduce algorithm scales to large graphs (hundreds of millions of nodes).
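The slides do not say which approximation is used. One common option is Monte Carlo simulation: run many short random walks from v, stop each walk with probability alpha at every step, and count where the walks end. The sketch below assumes that approach; the graph, alpha, the number of walks and the seed are all illustrative.

```python
import random


def monte_carlo_ppr(graph, source, alpha=0.15, walks=20000, seed=0):
    """Estimate Personalized PageRank from the endpoints of random walks.

    Each walk starts at `source`; at every step it stops with probability
    `alpha`, otherwise it moves to a neighbor chosen proportionally to weight.
    The fraction of walks ending at a node estimates that node's score.
    """
    rng = random.Random(seed)
    counts = {node: 0 for node in graph}
    for _ in range(walks):
        node = source
        while rng.random() >= alpha:
            neighbors = list(graph[node])
            weights = [graph[node][n] for n in neighbors]
            node = rng.choices(neighbors, weights=weights, k=1)[0]
        counts[node] += 1
    return {node: count / walks for node, count in counts.items()}


# Two communities {a, b, c} and {d, e, f} joined by the edge c-d.
GRAPH = {
    "a": {"b": 1, "c": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "e": 1, "f": 1},
    "e": {"d": 1, "f": 1},
    "f": {"d": 1, "e": 1},
}
ESTIMATE = monte_carlo_ppr(GRAPH, "a")
```

Each walk is independent of the others, so the walks can be distributed across machines, and accuracy improves as the number of walks grows.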

However…

Algorithmic Bottleneck

Our graphs are simply too big (billions of nodes) even for large-scale systems.

MapReduce is not real-time.
We cannot precompute the results for all subsets of categories (exponential time!).
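The exponential blow-up is easy to quantify: k categories have 2^k - 1 non-empty subsets. The sketch below compares that with one precomputation per category; the value k = 100 is only an example consistent with "hundreds of labels".

```python
def nonempty_subsets(num_categories):
    """Number of non-empty subsets of a set with `num_categories` elements."""
    return 2 ** num_categories - 1


K = 100  # illustrative number of categories
PER_CATEGORY_JOBS = K                     # one ranking per single category
ALL_SUBSET_JOBS = nonempty_subsets(K)     # one ranking per subset of categories
```

With 100 categories the per-subset count has 31 decimal digits, while the per-category count stays at 100.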

1st idea: Tackling Real Graph Structure

Data size is the main bottleneck. Compressing the graph would speed up the computation.

[Figure: step 1 compresses the graph of advertisers (A, B) and queries (a to g) into a graph of advertisers only; step 2 turns the result computed on the advertisers-only graph into a ranking of the entire graph.]

Theorem: the ranking computed is the correct Personalized PageRank on the entire graph.

Based on results from the mathematical theory of Markov chain state aggregation (Simon and Ando, ’61; Meyer ’89, etc.).
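The talk does not give the construction behind the theorem. The sketch below shows one simple way in which such a compression can be exact on a bipartite graph: a walk that starts at an advertiser is back on an advertiser every two steps, so the queries can be folded into two-step transitions between advertisers. With restart probability 1 - (1 - alpha)^2 on the small graph, the full scores are recovered by a rescaling and one extra step. The toy graph and all names are hypothetical.

```python
def normalize(weights):
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def ppr(transitions, source, alpha, iterations=200):
    """Power iteration for Personalized PageRank with restart probability alpha."""
    scores = {node: 0.0 for node in transitions}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in transitions}
        nxt[source] = alpha
        for node, mass in scores.items():
            for nbr, prob in transitions[node].items():
                nxt[nbr] += (1 - alpha) * mass * prob
        scores = nxt
    return scores


def full_transitions(adv_to_query):
    """Transition probabilities on the whole bipartite graph."""
    query_to_adv = {}
    for adv, queries in adv_to_query.items():
        for query, weight in queries.items():
            query_to_adv.setdefault(query, {})[adv] = weight
    trans = {adv: normalize(qs) for adv, qs in adv_to_query.items()}
    trans.update({q: normalize(advs) for q, advs in query_to_adv.items()})
    return trans


def project(trans, advertisers):
    """Two-step transitions advertiser -> query -> advertiser."""
    projected = {}
    for adv in advertisers:
        row = {}
        for query, p1 in trans[adv].items():
            for adv2, p2 in trans[query].items():
                row[adv2] = row.get(adv2, 0.0) + p1 * p2
        projected[adv] = row
    return projected


# Hypothetical click-weighted graph: advertisers A, B and queries a to d.
ADV_TO_QUERY = {"A": {"a": 1, "b": 2, "c": 1}, "B": {"c": 3, "d": 1}}
ALPHA = 0.15
FULL = full_transitions(ADV_TO_QUERY)
EXACT = ppr(FULL, "A", ALPHA)

# Solve on advertisers only, then recover the scores of the entire graph.
SMALL = ppr(project(FULL, ADV_TO_QUERY), "A", 1 - (1 - ALPHA) ** 2)
RECOVERED = {adv: score / (2 - ALPHA) for adv, score in SMALL.items()}
for adv, score in list(RECOVERED.items()):
    for query, prob in FULL[adv].items():
        RECOVERED[query] = RECOVERED.get(query, 0.0) + (1 - ALPHA) * score * prob
```

The expensive iteration runs only on the advertiser nodes, which are far fewer than the queries.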

Two-stage Approach

First stage: Large-scale (but feasible) MapReduce pre-computation.
Second stage: Fast iterative algorithm.

First Stage: Individual Category Rankings

[Figure: on the graph of advertisers and queries, one ranking is precomputed for each individual category.]

Second Stage: Rank aggregation

[Figure: the precomputed rankings for Red and for Yellow are combined into the ranking of Red + Yellow.]

A real-time iterative algorithm aggregates the rankings of a given node for a subset of the categories.

Experimental evaluation shows the accuracy of the results.

Fully implemented and currently under evaluation for integration in production systems.

Ongoing research project for future scientific publications.

Conclusions

Thank you for your attention
