distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

Post on 03-Jan-2016

38 Views

Category:

Documents

2 Downloads

Preview:

Click to see full reader

DESCRIPTION

Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce. Chao Liu, Hung- chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang Internet Services Research Center (ISRC) Microsoft Research Redmond. Internet Services Research Center (ISRC). - PowerPoint PPT Presentation

TRANSCRIPT

Distributed Nonnegative Matrix Factorization for Web-Scale DyadicData Analysis on MapReduce

Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang

Internet Services Research Center (ISRC)Microsoft Research Redmond

Internet Services Research Center (ISRC)• Advancing the state of the art in online services• Dedicated to accelerating innovations in search and ad

technologies• Representing a new model for moving technologies quickly

from research projects to improved products and servicesThursday, 04/29/2010 Friday, 04/30/201010:30~12:00pm: Data Analysis & Efficiency• Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce

11:00~12:30pm: Query Analysis• Exploring Web Scale Language Models for Search Query Processing • Building Taxonomy of Web Search Intents for Name Entity Queries• Optimal Rare Query Suggestion With Implicit User Feedback

1:30~3:00pm: Information Extraction• Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries

1:30~3:00pm: Infrastructure 2• Large-scale Bot Detection for Search Engines

Dyadic Data on the Web

• Web abounds with dyadic data– Web search: term by document,

query by clickedURL, web linkage, …– Advertising: query by ad, bid term by ad,

user by ad, …– Social media: tag by image, user by community,

friendship graph, …• Common characteristics– Good source for discovering latent relationships– High dimensionality, sparse, nonnegative, dynamic

Nonnegative Matrix Factorization (NMF)

• Effective tool to uncover latent relationships in nonnegative matrices with many applications [Berry et al., 2007, Sra & Dhillon, 2006]

– Interpretable dimensionality reduction [Lee & Seung, 1999]

– Document clustering [Shahnaz et al., 2006, Xu et al, 2006]

• Challenge: Can we scale NMF to million-by-million matrices

Am

n

WH

m

nkk

0,0,0 HWA

NMF Algorithm [Lee & Seung, 2000]

Am

n

WH

m

nkk

0,0,0 HWA

Parallel NMF [Robila & Maciak, 2006]

• Parallelism on multi-core machines– Partition along the long dimension for parallelism– Assuming all matrices can be held in shared memory

Distributed NMF

• Data Partition: A, W and H across machines

A…

),,( , jiAji

W. . . . .

),( iwi

H

. . . . .

),( jhj

Copmuting DNMF: The Big Picture

WAW

AWH

Y

XHH

T

T

*.*.

… … …

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

Map-IIIMap-IV

),0( WW T

Map-V

),0( iTi ww

),,,( jjj yxhj

…),( jyj

),(: iwiW ),(: jhjH

… ),( newjhj

Reduce-III

Reduce-V

AWX T

… …

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

),(: iwiW

WHWY T

… …

Map-IIIMap-IV

),0( WW T

),0( iTi ww …),( jyj

),(: iwiW ),(: jhjH

Reduce-III WHWY T

m

ii

Ti

T wwWWC1

W

. . . . .

),( iwi

. . .

. . .

YXHH *.

),( jxj

Map-V

),,,( jjj yxhj

…),( jyj

),(: jhjH

… ),( newjhj

Reduce-V

… … …

),,(: , jiAjiA

),,,( , iji wAji

Map-I

Reduce-I

),( , iji wAj

Map-II

),( , iji wAj

Reduce-II

),( jxj

Map-IIIMap-IV

),0( WW T

Map-V

),0( iTi ww

),,,( jjj yxhj

…),( jyj

),(: iwiW ),(: jhjH

… ),( newjhj

Reduce-III

Reduce-V

Experimental Evaluation

• Synthesized data on a sandbox cluster– No interference from other jobs– Performance with various parameters

• Real-world data on a commercial cluster– Real-world scalability

Synthesized Data on Sandbox Cluster

• A Hadoop cluster with 8 workers in total– Worker: Pentium-IV CPU, 1 or 2 cores, 1~2 GB

memory, 150G hard drive– V: Number of workers in cluster

• Matrix simulator– Generate m-by-n matrix with sparsity δ– k: factorization dimensionality– Defaults:

371617 2,2,2,2 knm

Computation Breakdown

• dominates the computation• is lightweight• The sparser, the faster

AWX TWHWY T

Performance w.r.t. Parameters

• Linear to m×n×δ• Linear to factorization dimension k• Sub-ideal speedup w.r.t. cluster

size V

Scalability on Real-world Data

• User-by-Website matrix– Browsed URLs of opt-in users, represented by UID– URLs trimmed to site level

• http://www.cnn.com/breakingnews --> www.cnn.com

• Experiments on Microsoft SCOPE– SCOPE: Structure Computations Optimized for Parallel

Execution [Chaiken et al., VLDB’08]

Executions w.r.t. Iterations

• Observations– Longer total elapse time– Shorter time per iteration

• Reason– Overlapped computation

across iterations

0 1 2 3 4 5 6 70

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

f(x) = 0.721508828250402 x + 0.422552166934188R² = 0.993424501613606

Iterations

Nor

mal

ized

Elap

se T

ime

Scalability w.r.t. Matrix Size

3 hours per iteration, 20 iterations take around 20*3*0.72 ≈ 43 hours

Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values

Conclusion

• NMF is an effective tool to uncover latent structures in dyadic data that is abundant on the Web

• NMF is admissible to MapReduce • Distributed NMF solves the scalability

challenge• Applications down the road

Q&A

Thank You!

top related