Hadoop Graph Processing with Apache Giraph

June, 2013 Jay Tang GRAPH MINING WITH APACHE GIRAPH

Post on 22-Sep-2014

DESCRIPTION

PayPal provides an online money transfer network. Each payment flow connects senders and receivers into a giant network where each sender/receiver is a node and each transaction is an edge. Traditionally, the risk score of a transaction is computed based on the characteristics of the involved sender/receiver/transaction. In this talk, we will describe a novel network inference approach, built on Apache Giraph, that calculates a transaction risk score which also includes the risk profile of neighboring senders and receivers. The approach reveals additional risk insights not possible with the traditional method. We leverage Hadoop to support a graph computation involving hundreds of millions of nodes and edges.

TRANSCRIPT

Page 1: Hadoop Graph Processing with Apache Giraph

June, 2013

Jay Tang

GRAPH MINING WITH APACHE GIRAPH

Page 2: Hadoop Graph Processing with Apache Giraph

Confidential and Proprietary

AGENDA

• Introduction

• Big Data problem

• Graph mining platform

• Use case

• Lessons

• Future work

Page 3: Hadoop Graph Processing with Apache Giraph

ABOUT ME

• Director of Big Data Platform & Analytics, PayPal

− Hadoop, Graph mining, Real-time analytics, ML, text mining

• 20 years of software experience in the valley focused on data

• Member of original Hadoop team @Yahoo

• Built data warehouse, relational database, OLAP product @Yahoo, Oracle/Hyperion, IBM Informix, DB2

Page 4: Hadoop Graph Processing with Apache Giraph

BIG DATA PROBLEM

Page 5: Hadoop Graph Processing with Apache Giraph

BIG DATA PROBLEM @ PAYPAL

• Enable Online, Offline, and Mobile payment

• 128M customers worldwide

• $160B payment volume processed annually

• Major retail locations accepting PayPal

− 20K today, 2M by end of 2013

• PayPal Here launching in US and international markets

Petabyte Data Problem & Growing

Page 6: Hadoop Graph Processing with Apache Giraph

BIG DATA POWERS PAYPAL ANALYTICS

• Detect and prevent fraud

• Assess credit risk

• Deliver relevant offers to our customers

• Improve user experience

• Provide better insights to our merchants

Page 7: Hadoop Graph Processing with Apache Giraph

GRAPH MINING PLATFORM

Page 8: Hadoop Graph Processing with Apache Giraph

BIG DATA STACK

DataCloud

Page 9: Hadoop Graph Processing with Apache Giraph

DATA ABSTRACTION

Traditional data processing abstraction -- TABLE

• Rows

• Columns

• Data Types

Page 10: Hadoop Graph Processing with Apache Giraph

GRAPH IS EVERYWHERE

• Internet & WWW

• Social network

• PayPal payment network – accounts & transactions

Page 11: Hadoop Graph Processing with Apache Giraph

GRAPH COMPUTING

• Think like a vertex

• Two basic operations

− Fusion: aggregate information from neighbors to a set of entities

− Diffusion: propagate information from a vertex to neighbors
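
The two operations named on this slide can be illustrated with a minimal Python sketch. This is not Giraph code: the toy graph, the `values` table, and the helper names `fuse` and `diffuse` are assumptions made up for illustration, and summation is just one possible aggregation.

```python
from collections import defaultdict

# Hypothetical toy graph: vertex -> list of neighbor vertices.
graph = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
}

# Hypothetical per-vertex values to aggregate or spread.
values = {"A": 1.0, "B": 2.0, "C": 3.0}


def fuse(graph, values, vertex):
    """Fusion: aggregate (here, sum) the values held by a vertex's neighbors."""
    return sum(values[n] for n in graph[vertex])


def diffuse(graph, values, vertex):
    """Diffusion: deliver a vertex's value to the inbox of each neighbor."""
    inbox = defaultdict(list)
    for n in graph[vertex]:
        inbox[n].append(values[vertex])
    return dict(inbox)


fused_a = fuse(graph, values, "A")        # combines the values of B and C
diffused_a = diffuse(graph, values, "A")  # A's value lands in B's and C's inboxes
```

Both helpers only look at one vertex and its direct neighbors, which is the sense in which the computation "thinks like a vertex".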

Page 12: Hadoop Graph Processing with Apache Giraph

THINK LIKE A VERTEX - FUSION

Page 13: Hadoop Graph Processing with Apache Giraph

THINK LIKE A VERTEX - DIFFUSION

Page 14: Hadoop Graph Processing with Apache Giraph

GRAPH MINING ENGINE

• Which graph mining engine to use?

− GraphLab

− Apache Giraph

− Apache Hama

• Hadoop compatible

− Data is on Hadoop

− Leverage existing cluster infrastructure

− Integration with Hadoop

• Ease of deployment and update

• Community

Page 15: Hadoop Graph Processing with Apache Giraph

BSP & GIRAPH

• Apache open source implementation of Google Pregel on Hadoop

• Send a message from a vertex to any other vertex

• In-memory scalable system

− Map-only jobs, ZooKeeper, Netty
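
The BSP (Bulk Synchronous Parallel) model behind Pregel and Giraph runs in supersteps: every vertex reads the messages sent to it in the previous superstep, updates its value, sends new messages, and then all vertices wait at a barrier. The following is a single-process Python sketch of that loop, not the Giraph API. The function name `run_bsp`, the toy graph, and the choice of "propagate the maximum value" as the vertex program are assumptions for illustration; unlike Giraph, this toy only sends along edges.

```python
def run_bsp(graph, values, max_supersteps=30):
    """Propagate the maximum vertex value through the graph in supersteps."""
    values = dict(values)
    inbox = {v: [] for v in graph}
    superstep = 0
    while superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        any_sent = False
        for v in graph:
            if superstep == 0:
                changed = True  # every vertex announces its value once
            else:
                best = max(inbox[v], default=values[v])
                changed = best > values[v]
                if changed:
                    values[v] = best
            if changed:
                for n in graph[v]:
                    outbox[n].append(values[v])
                    any_sent = True
        # Barrier: messages sent now become visible only in the next superstep.
        inbox = outbox
        superstep += 1
        if not any_sent:
            break  # no messages in flight, so every vertex has halted
    return values


# Hypothetical chain graph 1 - 2 - 3 - 4 with arbitrary starting values.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
final_values = run_bsp(chain, {1: 3, 2: 6, 3: 2, 4: 1})
```

The run stops when a superstep produces no messages, which mirrors the Pregel rule that a job ends once all vertices are inactive and no messages remain.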

Page 16: Hadoop Graph Processing with Apache Giraph

GRAPH MINING USE CASE

Page 17: Hadoop Graph Processing with Apache Giraph

RISK DETECTION & MITIGATION

• Stop fraudsters from stealing money from PayPal payment network

• Sophisticated risk models running in real-time based on

− Online data

− Offline data

• Risk profile traditionally based on a variety of data

− Account

− Transaction -- frequency, amount, history

− IP

− Email domain

Page 18: Hadoop Graph Processing with Apache Giraph

RISK COMPUTATION

[Diagram labels: Current TX Details, History Data, Risk Models, Approve, Decline]

Page 19: Hadoop Graph Processing with Apache Giraph

GRAPH MINING CONNECTED DATA

• PayPal data are connected

• Form multiple communities that have hidden inferences

• Discover the inferences via a graph approach

• Build a system to extract the inferences

Page 20: Hadoop Graph Processing with Apache Giraph

GRAPH VIEW OF DATA

[Diagram: nodes User1, User2, Merchant; edge labels BUY, BUY, P2P Money Transfer]

Page 21: Hadoop Graph Processing with Apache Giraph

GRAPH VIEW OF DATA

[Diagram: nodes Account 1, Account 2, IP1, IP2, IP3]

Page 22: Hadoop Graph Processing with Apache Giraph

GRAPH MINING DATA PIPELINE

Pre Processing (MapReduce) → Graph Processing (Giraph) → Post Processing (MapReduce)

Page 23: Hadoop Graph Processing with Apache Giraph

GRAPH DATA PIPELINE

• Input data is raw transaction data

• Custom MapReduce jobs to pre-process data into graph model

• Output is an adjacency list in JSON format

− Easy to consume in Java and by humans

− Use gson library

• Post processing – output format conversion
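
The pre-processing step above turns transaction records into one JSON adjacency-list line per vertex. The talk does this with custom MapReduce jobs and gson in Java; the Python sketch below only illustrates the shape of the transformation. The record fields, the account IDs, the line layout `[vertex_id, vertex_value, [[neighbor_id, edge_value], ...]]`, and the function name `to_adjacency_lines` are all assumptions, not the schema used at PayPal.

```python
import json
from collections import defaultdict

# Hypothetical raw transaction records: (sender, receiver, amount).
transactions = [
    ("acct_1", "acct_2", 25.0),
    ("acct_1", "acct_3", 10.0),
    ("acct_2", "acct_3", 5.0),
]


def to_adjacency_lines(transactions):
    """Group transactions by sender and emit one JSON line per vertex."""
    adjacency = defaultdict(list)
    for sender, receiver, amount in transactions:
        adjacency[sender].append([receiver, amount])
        # Make sure receive-only accounts still appear as vertices.
        adjacency.setdefault(receiver, [])
    # Assumed layout: [vertex_id, initial_vertex_value, [[neighbor, weight], ...]]
    return [
        json.dumps([vertex, 0.0, edges])
        for vertex, edges in sorted(adjacency.items())
    ]


lines = to_adjacency_lines(transactions)
for line in lines:
    print(line)
```

One vertex per line keeps the file splittable, so each worker can parse its share of the input independently.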

Page 24: Hadoop Graph Processing with Apache Giraph

GRAPH PROCESSING

• Customers/Accounts linked via transactions

• Compute risk = intrinsic risk + risk propagated from peers

• Send risk message to peers

• Iterate until convergence

[Diagram: nodes Cus1, Cus2; edges Transaction T0, T1, T2, T3]
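
The loop described on this slide (compute risk, send it to peers, repeat until the values stop changing) can be sketched in a few lines of Python. The slide does not give the propagation formula, so the one used here is an assumption: each vertex adds a damped average of the risk values received from its peers to its own intrinsic risk. The names `propagate_risk`, `damping`, and `tol`, and the sample scores, are hypothetical.

```python
def propagate_risk(peers, intrinsic, damping=0.5, tol=1e-6, max_iters=100):
    """Iterate risk = intrinsic + damping * mean(peer risk) until stable."""
    risk = dict(intrinsic)
    for iteration in range(1, max_iters + 1):
        # Each vertex sends its current risk to every peer.
        inbox = {v: [] for v in peers}
        for v, neighbors in peers.items():
            for n in neighbors:
                inbox[n].append(risk[v])
        # Each vertex recomputes its risk from the messages it received.
        new_risk = {}
        for v in peers:
            msgs = inbox[v]
            propagated = damping * sum(msgs) / len(msgs) if msgs else 0.0
            new_risk[v] = intrinsic[v] + propagated
        delta = max(abs(new_risk[v] - risk[v]) for v in peers)
        risk = new_risk
        if delta < tol:
            return risk, iteration  # converged
    return risk, max_iters


# Hypothetical two-customer graph linked by transactions in both directions.
peers = {"cus1": ["cus2"], "cus2": ["cus1"]}
intrinsic = {"cus1": 0.8, "cus2": 0.1}
risk, iterations = propagate_risk(peers, intrinsic)
```

With a damping factor below 1 the update shrinks differences on every pass, so the loop settles; in this toy case the low-risk customer ends up with a noticeably higher score because of its high-risk peer.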

Page 25: Hadoop Graph Processing with Apache Giraph

GRAPH PROCESSING

[Diagram: nodes Account 1, Account 2, IP1, IP2, IP3]

Page 26: Hadoop Graph Processing with Apache Giraph

LESSONS LEARNED

Page 27: Hadoop Graph Processing with Apache Giraph

GIRAPH

• Giraph is an emerging technology

− Incubation in 2012

− Rapidly evolving

− Versions 0.1 and 0.2 are not compatible

− Lack of knowledge and documentation

• Build internal git repo

• Read code and join mailing list

• Port code from 0.1 to 0.2

• Use Giraph 1.0, released on May 6, 2013

Page 28: Hadoop Graph Processing with Apache Giraph

HADOOP ENVIRONMENT INTEGRATION

• Must guarantee a minimum number of mappers

• Capacity scheduler

− Set the queue's minimum mapper count above what the Giraph job needs

• Fair scheduler

− Set the queue's minimum mapper count above what the Giraph job needs

− Turn on pre-emption

− Set the pre-emption wait time to a small interval (20 sec)

Page 29: Hadoop Graph Processing with Apache Giraph

MEMORY SCALABILITY

• Memory constraint in a shared Hadoop environment

− 1.2B edges and 300M nodes

− Single-purpose POC cluster: mapper memory = 10 GB

− Shared R&D cluster: mapper memory = 3 GB

• Reducing memory consumption is key

− Convert String to long for graph processing

− Convert back to String in post-processing for downstream applications

− Cap the number of messages passed

− distance from current vertex

− message payload data values
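
The String-to-long conversion mentioned above amounts to dictionary encoding: assign each distinct string ID a compact integer before graph processing and map it back afterwards. The Python sketch below shows the idea on a toy edge list. The account IDs and the helper name `build_id_maps` are hypothetical, and a single in-memory dictionary is an assumption that only suits small examples; at the scale quoted on the slide the mapping itself would have to be built as a distributed job.

```python
def build_id_maps(string_ids):
    """Assign each distinct string ID the next unused integer, in first-seen order."""
    to_long = {}
    to_string = []
    for s in string_ids:
        if s not in to_long:
            to_long[s] = len(to_string)
            to_string.append(s)
    return to_long, to_string


# Hypothetical edges keyed by string account IDs.
edges = [
    ("acct_A9F3", "acct_77B1"),
    ("acct_77B1", "acct_C210"),
    ("acct_A9F3", "acct_C210"),
]

to_long, to_string = build_id_maps(v for edge in edges for v in edge)

# Pre-processing: replace strings with integers for the graph job.
encoded = [(to_long[src], to_long[dst]) for src, dst in edges]

# Post-processing: restore the original strings for downstream applications.
decoded = [(to_string[src], to_string[dst]) for src, dst in encoded]
```

In a JVM-based engine such as Giraph the saving comes from storing a fixed 8-byte primitive per vertex ID instead of a String object with its own header and character array.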

Page 30: Hadoop Graph Processing with Apache Giraph

FUTURE WORK

• Giraph-based data engine to produce enriched data sets

• Leverage Giraph on YARN

• Scalability with the number of workers

Page 31: Hadoop Graph Processing with Apache Giraph

Q&A

WE ARE HIRING