Transcript
Page 1: Hadoop Graph Processing with Apache Giraph

June, 2013

Jay Tang

GRAPH MINING WITH APACHE GIRAPH

Page 2: Hadoop Graph Processing with Apache Giraph

Confidential and Proprietary

• Introduction

• Big Data problem

• Graph mining platform

• Use case

• Lessons

• Future work

AGENDA

Page 3: Hadoop Graph Processing with Apache Giraph

• Director of Big Data Platform & Analytics, PayPal

− Hadoop, Graph mining, Real-time analytics, ML, text mining

• 20 years of software experience in the valley focused on data

• Member of original Hadoop team @Yahoo

• Built data warehouse, relational database, and OLAP products @Yahoo, Oracle/Hyperion, IBM Informix, DB2

ABOUT ME

Page 4: Hadoop Graph Processing with Apache Giraph

BIG DATA PROBLEM

Page 5: Hadoop Graph Processing with Apache Giraph

• Enable Online, Offline, and Mobile payment

• 128M customers worldwide

• $160B payment volume processed annually

• Major retail locations accepting PayPal

− 20K today, 2M by end of 2013

• PayPal Here launching in US and international markets

Petabyte Data Problem & Growing

BIG DATA PROBLEM @ PAYPAL

Page 6: Hadoop Graph Processing with Apache Giraph

• Detect and prevent fraud

• Assess credit risk

• Deliver relevant offers to our customers

• Improve user experience

• Provide better insights to our merchants

BIG DATA POWERS PAYPAL ANALYTICS

Page 7: Hadoop Graph Processing with Apache Giraph

GRAPH MINING PLATFORM

Page 8: Hadoop Graph Processing with Apache Giraph

BIG DATA STACK

DataCloud

Page 9: Hadoop Graph Processing with Apache Giraph

Traditional data processing abstraction -- TABLE

• Rows

• Columns

• Data Types

DATA ABSTRACTION

Page 10: Hadoop Graph Processing with Apache Giraph

• Internet & WWW

• Social network

• PayPal payment network – accounts & transactions

GRAPH IS EVERYWHERE

Page 11: Hadoop Graph Processing with Apache Giraph

• Think like a vertex

• Two basic operations

− Fusion: aggregate information from neighbors to a set of entities

− Diffusion: propagate information from a vertex to neighbors

GRAPH COMPUTING
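
The two operations above can be sketched in a few lines of plain Python. This is only an illustration, not code from the talk and not the Giraph API: the toy graph, the `diffuse` and `fuse` function names, and the use of a sum as the aggregate are all assumptions.

```python
from collections import defaultdict

# Toy undirected graph as an adjacency list (hypothetical data).
graph = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a"],
}
values = {"a": 1.0, "b": 2.0, "c": 4.0}


def diffuse(graph, values):
    """Diffusion: every vertex sends its own value to each neighbor."""
    inbox = defaultdict(list)
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])
    return inbox


def fuse(graph, inbox, combine=sum):
    """Fusion: every vertex aggregates what its neighbors sent it."""
    return {vertex: combine(inbox.get(vertex, [])) for vertex in graph}


inbox = diffuse(graph, values)
fused = fuse(graph, inbox)
print(fused)
```

Each function only looks at one vertex and its immediate neighbors, which is the point of "think like a vertex": no step needs a global view of the graph.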

Page 12: Hadoop Graph Processing with Apache Giraph

THINK LIKE A VERTEX - FUSION

Page 13: Hadoop Graph Processing with Apache Giraph

THINK LIKE A VERTEX - DIFFUSION

Page 14: Hadoop Graph Processing with Apache Giraph

• Which graph mining engine to use?

− GraphLab

− Apache Giraph

− Apache Hama

• Hadoop compatible

− Data is on Hadoop

− Leverage existing cluster infrastructure

− Integration with Hadoop

• Ease of deployment and update

• Community

GRAPH MINING ENGINE

Page 15: Hadoop Graph Processing with Apache Giraph

• Apache open source implementation of Google Pregel on Hadoop

• Send messages from a vertex to any other vertex

• In-memory scalable system

− Map-only jobs, Zookeeper, Netty

BSP & GIRAPH
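
In the Pregel model, computation proceeds in supersteps: each vertex reads the messages sent to it in the previous superstep, updates its value, and sends new messages that only become visible after a barrier. The sketch below simulates that loop in plain Python on one machine. It is not the Giraph API; the function name `run_bsp`, the toy graph, and the "propagate the maximum value" task are illustrative assumptions.

```python
def run_bsp(graph, values, max_supersteps=30):
    """Propagate the maximum vertex value through the graph, superstep by superstep."""
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])
    supersteps = 1

    # Keep going while any vertex still has messages to process.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if not messages:
                continue  # no messages: this vertex stays halted
            best = max(messages)
            if best > values[v]:
                values[v] = best
                for n in graph[v]:
                    outbox[n].append(best)
        inbox = outbox  # barrier: new messages are delivered next superstep
        supersteps += 1
    return values, supersteps


# Hypothetical chain graph 1 - 2 - 3 - 4.
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
result, steps = run_bsp(graph, {1: 3, 2: 6, 3: 2, 4: 1})
print(result, steps)
```

The job ends when no vertex has pending messages, which is how a run terminates without a fixed iteration count.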

Page 16: Hadoop Graph Processing with Apache Giraph

GRAPH MINING USE CASE

Page 17: Hadoop Graph Processing with Apache Giraph

• Stop fraudsters from stealing money from PayPal payment network

• Sophisticated risk models running in real time, based on

− Online data

− Offline data

• Risk profile traditionally based on a variety of data

− Account

− Transaction -- frequency, amount, history

− IP

− Email domain

RISK DETECTION & MITIGATION

Page 18: Hadoop Graph Processing with Apache Giraph

RISK COMPUTATION

Current TX Details

Risk Models

Approve

Decline

History Data

Page 19: Hadoop Graph Processing with Apache Giraph

• PayPal data are connected

• Form multiple communities that have hidden inferences

• Discover the inferences via a graph approach

• Build a system to extract the inferences

GRAPH MINING CONNECTED DATA

Page 20: Hadoop Graph Processing with Apache Giraph

GRAPH VIEW OF DATA

User1

User2

Merchant

BUY

BUY

P2P Money Transfer

Page 21: Hadoop Graph Processing with Apache Giraph

GRAPH VIEW OF DATA

Account 1

IP1 IP2

Account 2

IP3

Page 22: Hadoop Graph Processing with Apache Giraph

GRAPH MINING DATA PIPELINE

Pre Processing

Graph Processing

Post Processing

Giraph

MapReduce

MapReduce

Page 23: Hadoop Graph Processing with Apache Giraph

• Input data is raw transaction data

• Custom MapReduce jobs to pre-process data into graph model

• Output is an adjacency list in JSON format

− Easy to consume in Java and by humans

− Use gson library

• Post processing – output format conversion

GRAPH DATA PIPELINE
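
As an illustration of the pre-processing step, the sketch below turns raw transactions into one JSON adjacency-list line per vertex. It is plain Python standing in for the MapReduce job, and the record layout `[vertex_id, vertex_value, [[target_id, edge_value], ...]]`, the sample transactions, and the choice to sum amounts per edge are all assumptions rather than the format used in the talk.

```python
import json
from collections import defaultdict

# Hypothetical raw transactions: (sender_id, receiver_id, amount).
transactions = [
    (1, 2, 25.0),
    (1, 3, 10.0),
    (2, 3, 5.0),
    (1, 2, 15.0),
]


def to_adjacency_lines(transactions):
    """Group transactions by sender and emit one JSON line per vertex."""
    edges = defaultdict(lambda: defaultdict(float))
    vertices = set()
    for sender, receiver, amount in transactions:
        edges[sender][receiver] += amount  # sum repeated transfers into one edge
        vertices.update((sender, receiver))

    lines = []
    for vertex in sorted(vertices):
        neighbors = [[target, total] for target, total in sorted(edges[vertex].items())]
        # Assumed layout: [vertex_id, initial_vertex_value, [[target_id, edge_value], ...]]
        lines.append(json.dumps([vertex, 0.0, neighbors]))
    return lines


lines = to_adjacency_lines(transactions)
for line in lines:
    print(line)
```

Vertices that only receive money still get a line with an empty neighbor list, so every vertex exists when graph processing starts.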

Page 24: Hadoop Graph Processing with Apache Giraph

• Customers/Accounts linked via transactions

• Compute risk = intrinsic risk + risk propagated from peers

• Send risk message to peers

• Iterate until convergence

GRAPH PROCESSING

Cus1

Cus2

Transaction T1

Transaction T0

Transaction T2

Transaction T3
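
The risk computation on this slide can be sketched as a fixed-point iteration. The talk does not give the actual formula, so everything specific here is an assumption: the propagated term is taken as a damped average of the peers' current risk, and the `damping` factor, the `tolerance`, and the sample accounts are hypothetical.

```python
def propagate_risk(peers, intrinsic, damping=0.5, tolerance=1e-6, max_iterations=100):
    """Iterate risk = intrinsic risk + risk propagated from peers until it settles."""
    risk = dict(intrinsic)
    for iteration in range(1, max_iterations + 1):
        updated = {}
        for account, neighbors in peers.items():
            # "Messages" received from peers: their risk from the previous iteration.
            incoming = [risk[n] for n in neighbors]
            propagated = damping * sum(incoming) / len(incoming) if incoming else 0.0
            updated[account] = intrinsic[account] + propagated
        change = max(abs(updated[a] - risk[a]) for a in peers)
        risk = updated
        if change < tolerance:
            return risk, iteration  # converged
    return risk, max_iterations


# Hypothetical accounts linked by transactions.
peers = {"cus1": ["cus2"], "cus2": ["cus1", "cus3"], "cus3": ["cus2"]}
intrinsic = {"cus1": 0.9, "cus2": 0.1, "cus3": 0.0}

risk, iterations = propagate_risk(peers, intrinsic)
print(risk, iterations)
```

With a damping factor below 1 each round shrinks the remaining change, so the loop settles; an account with low intrinsic risk still ends up with a higher score when it transacts with a risky peer.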

Page 25: Hadoop Graph Processing with Apache Giraph

IP3

IP2

GRAPH PROCESSING

Account 1

IP1 IP2

Account 2

IP3 IP1

Page 26: Hadoop Graph Processing with Apache Giraph

LESSONS LEARNED

Page 27: Hadoop Graph Processing with Apache Giraph

• Giraph is an emerging technology

− Incubation in 2012

− Rapidly evolving

− 0.1 and 0.2 are not compatible

− Lack of knowledge & documentation

• Build internal git repo

• Read code and join mailing list

• Port code from 0.1 to 0.2

• Use Giraph 1.0 released on May 6 2013

GIRAPH

Page 28: Hadoop Graph Processing with Apache Giraph

• Must guarantee minimum number of Mappers

• Capacity scheduler

− Set the queue's minimum mapper count above what the Giraph job needs

• Fair scheduler

− Set the queue's minimum mapper count above what the Giraph job needs

− Turn on pre-emption

− Set pre-emption wait time to a small interval – 20 sec

HADOOP ENVIRONMENT INTEGRATION
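
A Giraph job runs as map tasks that exchange messages with each other, so all of them have to be running at the same time; if the queue cannot hand out enough map slots, the job waits. The check below is a minimal sketch of the sizing rule on this slide. The function names are hypothetical, and counting the job's need as "workers plus one master task" is an assumption to verify against your Giraph configuration.

```python
def required_map_slots(workers, masters=1):
    """Map slots a Giraph job needs at once (assumed: one task per worker plus the master)."""
    return workers + masters


def queue_can_host(min_guaranteed_slots, workers):
    """Follow the slide's rule: the queue's guaranteed minimum must exceed the job's need."""
    return min_guaranteed_slots > required_map_slots(workers)


print(required_map_slots(10))      # slots needed for a 10-worker job
print(queue_can_host(12, 10))      # guaranteed minimum above the need
print(queue_can_host(11, 10))      # guaranteed minimum only equal to the need
```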

Page 29: Hadoop Graph Processing with Apache Giraph

• Memory constraint in a shared Hadoop environment

− 1.2B edges and 300M nodes

− Single purpose POC cluster mapper memory = 10 GB

− Shared R&D cluster mapper memory = 3 GB

• Reducing memory consumption is key

− Convert String to long for graph processing

− Convert back to String in post-processing for downstream application

− Cap the number of messages passed

− distance from current vertex

− message payload data values

MEMORY SCALABILITY
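
Two of the memory savings above can be sketched in plain Python: replacing string identifiers with integers for graph processing and mapping them back afterwards, and limiting how far a message travels from its source vertex. This is an illustration only. The `IdEncoder` class, the `reachable_within` function, and the sample data are hypothetical, and a single in-memory dictionary like this would not hold 300M nodes; a real pipeline would build the mapping in the pre-processing job.

```python
class IdEncoder:
    """Assign a compact integer to each string ID and map it back later."""

    def __init__(self):
        self._to_long = {}
        self._to_string = []

    def encode(self, key):
        if key not in self._to_long:
            self._to_long[key] = len(self._to_string)
            self._to_string.append(key)
        return self._to_long[key]

    def decode(self, number):
        return self._to_string[number]


def reachable_within(graph, source, max_hops):
    """Vertices a message from `source` reaches if forwarding stops after `max_hops`."""
    frontier = {source}
    seen = {source}
    for _ in range(max_hops):
        frontier = {n for v in frontier for n in graph[v]} - seen
        seen |= frontier
    return seen


encoder = IdEncoder()
ids = [encoder.encode(name) for name in ["acct-A", "acct-B", "acct-A"]]
print(ids, encoder.decode(ids[1]))

# Hypothetical chain graph 0 - 1 - 2 - 3 using the integer IDs.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(reachable_within(chain, 0, 2))
```

Capping the hop count bounds the number of messages a single vertex can trigger, which matters more on a shared cluster with small mapper heaps.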

Page 30: Hadoop Graph Processing with Apache Giraph

• Giraph-based data engine to produce enriched data set

• Leverage Giraph on YARN

• Scalability in the number of workers

FUTURE WORK

Page 31: Hadoop Graph Processing with Apache Giraph

Q&A

WE ARE HIRING

