Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce

Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang

Internet Services Research Center (ISRC), Microsoft Research Redmond

Internet Services Research Center (ISRC)

• Advancing the state of the art in online services
• Dedicated to accelerating innovations in search and ad technologies
• Representing a new model for moving technologies quickly from research projects to improved products and services

Thursday, 04/29/2010 and Friday, 04/30/2010

• 10:30~12:00pm: Data Analysis & Efficiency
– Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
• 11:00~12:30pm: Query Analysis
– Exploring Web Scale Language Models for Search Query Processing
– Building Taxonomy of Web Search Intents for Name Entity Queries
– Optimal Rare Query Suggestion With Implicit User Feedback
• 1:30~3:00pm: Information Extraction
– Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries
• 1:30~3:00pm: Infrastructure 2
– Large-scale Bot Detection for Search Engines

Dyadic Data on the Web

• Web abounds with dyadic data
– Web search: term by document, query by clicked URL, web linkage, …
– Advertising: query by ad, bid term by ad, user by ad, …
– Social media: tag by image, user by community, friendship graph, …
• Common characteristics
– Good source for discovering latent relationships
– High dimensionality, sparse, nonnegative, dynamic

Nonnegative Matrix Factorization (NMF)

• Effective tool to uncover latent relationships in nonnegative matrices with many applications [Berry et al., 2007, Sra & Dhillon, 2006]

– Interpretable dimensionality reduction [Lee & Seung, 1999]

– Document clustering [Shahnaz et al., 2006, Xu et al., 2006]

• Challenge: Can we scale NMF to million-by-million matrices?

$A \approx WH$, with $A \ge 0$, $W \ge 0$, $H \ge 0$

NMF Algorithm [Lee & Seung, 2000]

$A \approx WH$, with $A \ge 0$, $W \ge 0$, $H \ge 0$
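The multiplicative update rules of Lee & Seung can be sketched in a few lines. The following is a minimal single-machine sketch in pure Python, assuming the Frobenius-norm objective with the updates H ← H .* (WᵀA) ./ (WᵀWH) and W ← W .* (AHᵀ) ./ (WHHᵀ). The helper names (`matmul`, `transpose`, `nmf`, `frobenius_error`), the random initialization, and the small `eps` guard against division by zero are illustrative assumptions, not taken from the slides.

```python
import random

def matmul(P, Q):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def nmf(A, k, iterations=200, eps=1e-9, seed=0):
    """Factor a nonnegative m x n matrix A into W (m x k) and H (k x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Strictly positive random start (an assumed initialization).
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H .* (W^T A) ./ (W^T W H)
        Wt = transpose(W)
        X = matmul(Wt, A)
        Y = matmul(matmul(Wt, W), H)
        H = [[h * x / (y + eps) for h, x, y in zip(hr, xr, yr)]
             for hr, xr, yr in zip(H, X, Y)]
        # W <- W .* (A H^T) ./ (W H H^T)
        Ht = transpose(H)
        U = matmul(A, Ht)
        V = matmul(W, matmul(H, Ht))
        W = [[w * u / (v + eps) for w, u, v in zip(wr, ur, vr)]
             for wr, ur, vr in zip(W, U, V)]
    return W, H

def frobenius_error(A, W, H):
    """Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ar, br in zip(A, WH) for a, b in zip(ar, br)) ** 0.5
```

Because every factor in the updates is nonnegative, W and H stay nonnegative throughout, and the reconstruction error does not increase from one iteration to the next.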

Parallel NMF [Robila & Maciak, 2006]

• Parallelism on multi-core machines
– Partition along the long dimension for parallelism
– Assuming all matrices can be held in shared memory

Distributed NMF

• Data Partition: A, W and H across machines

– $A$: stored as tuples $(i, j, A_{i,j})$
– $W$: stored as tuples $(i, w_i)$
– $H$: stored as tuples $(j, h_j)$
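The keyed-record layout can be illustrated on a small dense example. This is a minimal sketch, assuming that $w_i$ is row $i$ of $W$ and $h_j$ is column $j$ of $H$ (consistent with the keys $i$ and $j$), and that only nonzero cells of $A$ are stored. The function name `partition` is hypothetical.

```python
def partition(A, W, H):
    """Turn dense lists-of-rows into keyed records that can be spread across machines."""
    # (i, j, A_ij): one record per nonzero cell of A.
    a_records = [(i, j, v) for i, row in enumerate(A)
                 for j, v in enumerate(row) if v != 0]
    # (i, w_i): row i of W (assumed layout).
    w_records = [(i, tuple(row)) for i, row in enumerate(W)]
    # (j, h_j): column j of H (assumed layout).
    h_records = [(j, tuple(col)) for j, col in enumerate(zip(*H))]
    return a_records, w_records, h_records
```

Storing $A$ cell by cell keeps the footprint proportional to the number of nonzeros, while $W$ and $H$ are split along their long dimension so that each record holds only $k$ values.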

Computing DNMF: The Big Picture

• Input: $A$ as $(i, j, A_{i,j})$, $W$ as $(i, w_i)$, $H$ as $(j, h_j)$
• Computing $X = W^T A$
– Join $A$ and $W$ on $i$: $(i, j, A_{i,j}, w_i)$
– Reduce-I: emit $(j, A_{i,j} w_i)$
– Map-II: pass $(j, A_{i,j} w_i)$ through, keyed by $j$
– Reduce-II: emit $(j, x_j)$
• Computing $Y = W^T W H$
– Map-III: emit $(0, w_i^T w_i)$
– Reduce-III: emit $(0, W^T W)$, where $C = W^T W = \sum_{i=1}^{m} w_i^T w_i$
– Map-IV: emit $(j, y_j)$ for each $(j, h_j)$
• Updating $H \leftarrow H .* X ./ Y$ (element-wise)
– Join on $j$: $(j, h_j, x_j, y_j)$
– Reduce-V: emit $(j, h_j^{new})$
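The phases of the H update can be simulated on a single machine to check the data flow. The sketch below is a toy in-memory simulation, not a Hadoop or SCOPE job: it computes $x_j$ by summing $A_{i,j} w_i$ per column, $C = W^T W$ by summing outer products under a single key, $y_j = C h_j$, and finally the element-wise update. The names `group_by_key` and `update_h`, the `eps` guard, and the use of a dictionary lookup in place of a distributed join are all assumptions made for illustration.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Simulate the shuffle step: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def update_h(a_records, w_records, h_records, k, eps=1e-9):
    """One update H <- H .* X ./ Y over keyed records.

    a_records: (i, j, A_ij); w_records: (i, w_i); h_records: (j, h_j).
    """
    # Phase I: join A with W on row index i, emit (j, A_ij * w_i).
    # A real job would use a reduce-side join; a dict lookup stands in here.
    w_by_i = dict(w_records)
    partial = [(j, [a * w for w in w_by_i[i]]) for i, j, a in a_records]

    # Phase II: sum the partial vectors per column j to obtain x_j.
    x = {j: [sum(col) for col in zip(*vecs)]
         for j, vecs in group_by_key(partial).items()}

    # Phase III: C = W^T W = sum_i w_i^T w_i, all emitted under key 0.
    outer = [(0, [[p * q for q in w] for p in w]) for _, w in w_records]
    C = [[0.0] * k for _ in range(k)]
    for mat in group_by_key(outer)[0]:
        for r in range(k):
            for c in range(k):
                C[r][c] += mat[r][c]

    # Phase IV: y_j = C h_j (map-only; C is just k x k).
    y = {j: [sum(C[r][c] * h[c] for c in range(k)) for r in range(k)]
         for j, h in h_records}

    # Phase V: join (j, h_j, x_j, y_j) and apply the element-wise update.
    zero = [0.0] * k  # x_j for a column of A with no nonzeros
    return [(j, [hv * xv / (yv + eps)
                 for hv, xv, yv in zip(h, x.get(j, zero), y[j])])
            for j, h in h_records]
```

The only quantity shared across all columns is the $k \times k$ matrix $C$, which is why the update decomposes into per-key work.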

Experimental Evaluation

• Synthesized data on a sandbox cluster
– No interference from other jobs
– Performance with various parameters
• Real-world data on a commercial cluster
– Real-world scalability

Synthesized Data on Sandbox Cluster

• A Hadoop cluster with 8 workers in total
– Worker: Pentium-IV CPU, 1 or 2 cores, 1~2 GB memory, 150 GB hard drive
– V: number of workers in cluster
• Matrix simulator
– Generate m-by-n matrix with sparsity δ
– k: factorization dimensionality
– Defaults: $m = 2^{17}$, $n = 2^{16}$, $\delta = 2^{-7}$, $k = 2^{3}$
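A matrix simulator of this kind can be sketched as follows. This is a minimal sketch, assuming that the sparsity δ is the fraction of nonzero cells (consistent with running time being reported against m×n×δ) and that nonzero values are drawn uniformly; the slides do not state the value distribution. The function name `simulate_matrix` and the `seed` parameter are hypothetical.

```python
import random

def simulate_matrix(m, n, delta, seed=0):
    """Return sparse (i, j, value) records for an m-by-n nonnegative matrix.

    Each cell is nonzero with probability delta (assumed meaning of sparsity).
    """
    rng = random.Random(seed)
    return [(i, j, rng.random() + 1e-3)  # strictly positive value (assumed distribution)
            for i in range(m) for j in range(n)
            if rng.random() < delta]
```

The loop visits every cell, so this version only suits small test matrices; a generator for large sizes would sample nonzero positions directly instead.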

Computation Breakdown

• $X = W^T A$ dominates the computation
• $Y = W^T W H$ is lightweight
• The sparser, the faster

Performance w.r.t. Parameters

• Linear in m×n×δ
• Linear in factorization dimension k
• Sub-ideal speedup w.r.t. cluster size V

Scalability on Real-world Data

• User-by-Website matrix
– Browsed URLs of opt-in users, represented by UID
– URLs trimmed to site level, e.g. http://www.cnn.com/breakingnews --> www.cnn.com
• Experiments on Microsoft SCOPE
– SCOPE: Structured Computations Optimized for Parallel Execution [Chaiken et al., VLDB'08]
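Trimming a URL to site level, as in the cnn.com example, amounts to keeping only the host part. The sketch below uses Python's standard `urllib.parse.urlparse` for that step. The helper `user_by_site` and the choice of visit counts as matrix values are assumptions for illustration; the slides do not say what the cell values are.

```python
from collections import Counter
from urllib.parse import urlparse

def trim_to_site(url):
    """Keep only the host part of a URL."""
    return urlparse(url).netloc

def user_by_site(visits):
    """Aggregate (uid, url) visits into (uid, site, count) records.

    Using visit counts as cell values is an assumption, not stated in the slides.
    """
    counts = Counter((uid, trim_to_site(url)) for uid, url in visits)
    return sorted((uid, site, c) for (uid, site), c in counts.items())
```

For example, `trim_to_site("http://www.cnn.com/breakingnews")` returns `"www.cnn.com"`.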

Executions w.r.t. Iterations

• Observations
– Longer total elapsed time
– Shorter time per iteration
• Reason
– Overlapped computation across iterations

[Figure: execution time against number of iterations (x-axis: Iterations), with linear fit f(x) = 0.7215 x + 0.4226, R² = 0.9934]

Scalability w.r.t. Matrix Size

3 hours per iteration, 20 iterations take around 20*3*0.72 ≈ 43 hours

Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values
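The 43-hour figure follows from simple arithmetic. The sketch below reproduces it, reading 0.72 as the slope of the fitted line from the iteration experiment, that is, each additional iteration in a multi-iteration run costs about 0.72 of a standalone iteration because computation overlaps. That reading, and the function name `estimated_hours`, are assumptions.

```python
def estimated_hours(iterations, hours_per_iteration, slope=0.72):
    """Estimate total run time when iterations overlap.

    slope: cost of one extra iteration relative to a standalone one (assumed reading).
    """
    return iterations * hours_per_iteration * slope
```

With 20 iterations at 3 hours each, this gives 20 × 3 × 0.72 = 43.2 hours, compared with 60 hours if the iterations did not overlap.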

Conclusion

• NMF is an effective tool to uncover latent structures in dyadic data, which is abundant on the Web
• NMF is amenable to MapReduce
• Distributed NMF solves the scalability challenge
• Applications down the road

Thank You!
