![Page 1: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/1.jpg)
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang
Internet Services Research Center (ISRC)
Microsoft Research Redmond
![Page 2: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/2.jpg)
Internet Services Research Center (ISRC)
• Advancing the state of the art in online services
• Dedicated to accelerating innovations in search and ad technologies
• Representing a new model for moving technologies quickly from research projects to improved products and services

Thursday, 04/29/2010 Friday, 04/30/2010
10:30~12:00pm: Data Analysis & Efficiency
• Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
11:00~12:30pm: Query Analysis
• Exploring Web Scale Language Models for Search Query Processing
• Building Taxonomy of Web Search Intents for Name Entity Queries
• Optimal Rare Query Suggestion With Implicit User Feedback
1:30~3:00pm: Information Extraction
• Automatic Extraction of Clickable Structured Web Contents for Name Entity Queries
1:30~3:00pm: Infrastructure 2
• Large-scale Bot Detection for Search Engines
![Page 3: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/3.jpg)
Dyadic Data on the Web
• Web abounds with dyadic data
– Web search: term by document, query by clicked URL, web linkage, …
– Advertising: query by ad, bid term by ad, user by ad, …
– Social media: tag by image, user by community, friendship graph, …
• Common characteristics
– Good source for discovering latent relationships
– High dimensionality, sparse, nonnegative, dynamic
![Page 4: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/4.jpg)
Nonnegative Matrix Factorization (NMF)
• Effective tool to uncover latent relationships in nonnegative matrices with many applications [Berry et al., 2007, Sra & Dhillon, 2006]
– Interpretable dimensionality reduction [Lee & Seung, 1999]
– Document clustering [Shahnaz et al., 2006, Xu et al., 2006]
• Challenge: Can we scale NMF to million-by-million matrices?
$A_{m \times n} \approx W_{m \times k} H_{k \times n}$, with $A \ge 0,\ W \ge 0,\ H \ge 0$
![Page 5: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/5.jpg)
NMF Algorithm [Lee & Seung, 2000]
$A_{m \times n} \approx W_{m \times k} H_{k \times n}$, with $A \ge 0,\ W \ge 0,\ H \ge 0$
![Page 6: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/6.jpg)
Parallel NMF [Robila & Maciak, 2006]
• Parallelism on multi-core machines
– Partition along the long dimension for parallelism
– Assuming all matrices can be held in shared memory
![Page 7: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/7.jpg)
Distributed NMF
• Data Partition: A, W and H across machines
– A: stored as tuples $(i, j, A_{i,j})$
– W: stored as tuples $(i, w_i)$
– H: stored as tuples $(j, h_j)$
![Page 8: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/8.jpg)
Computing DNMF: The Big Picture
$H \leftarrow H \mathbin{.*} X \mathbin{./} Y$, where $X = W^T A$ and $Y = W^T W H$
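As a single-machine reference point, the update $H \leftarrow H \mathbin{.*} X \mathbin{./} Y$ with $X = W^T A$ and $Y = W^T W H$ can be sketched on small dense matrices. This is only an illustrative sketch, not the distributed implementation: the helper names (`transpose`, `matmul`, `update_H`, `frobenius_error`), the toy matrices, and the small `eps` added to the denominator are all assumptions introduced here.

```python
# Dense, in-memory sketch of one multiplicative update of H with W held fixed.
# All names and the toy data are hypothetical; eps only guards against 0/0.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(P, Q):
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]

def update_H(A, W, H, eps=1e-12):
    Wt = transpose(W)
    X = matmul(Wt, A)                # X = W^T A, shape k x n
    Y = matmul(matmul(Wt, W), H)     # Y = (W^T W) H, shape k x n
    # Element-wise multiply by X and divide by Y.
    return [[h * x / (y + eps) for h, x, y in zip(hr, xr, yr)]
            for hr, xr, yr in zip(H, X, Y)]

def frobenius_error(A, W, H):
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [4.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]                          # m = 4, n = 3
W = [[0.5, 0.1], [0.2, 0.9], [0.8, 0.3], [0.1, 0.4]]   # m x k, k = 2
H = [[0.6, 0.2, 0.7], [0.3, 0.8, 0.1]]                 # k x n

err_before = frobenius_error(A, W, H)
H_new = update_H(A, W, H)
err_after = frobenius_error(A, W, H_new)
```

Grouping the product as $(W^T W) H$ keeps the intermediate at $k \times k$, which is why the $Y$ term stays cheap when $k$ is small.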
![Page 9: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/9.jpg)
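[Figure: complete DNMF dataflow for updating H. Inputs $A: (i, j, A_{i,j})$, $W: (i, w_i)$, $H: (j, h_j)$. Map-I / Reduce-I: $(i, j, A_{i,j}, w_i)$ to $(j, A_{i,j} w_i)$. Map-II / Reduce-II: $(j, A_{i,j} w_i)$ to $(j, x_j)$. Map-III / Reduce-III: $(0, w_i^T w_i)$ to $(0, W^T W)$. Map-IV: $(j, y_j)$. Map-V / Reduce-V: $(j, h_j, x_j, y_j)$ to $(j, h_j^{new})$.]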
![Page 10: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/10.jpg)
$X = W^T A$
• Inputs: $A: (i, j, A_{i,j})$ and $W: (i, w_i)$
• Map-I / Reduce-I: join on $i$ into $(i, j, A_{i,j}, w_i)$, emit $(j, A_{i,j} w_i)$
• Map-II / Reduce-II: group $(j, A_{i,j} w_i)$ by $j$, emit $(j, x_j)$
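The two MapReduce rounds for $X = W^T A$ can be mimicked in memory to see how the tuples flow: join $A$ and $W$ on the row index $i$, emit $(j, A_{i,j} w_i)$, then sum per column $j$ to get $x_j$. The sketch below is a toy simulation, not Hadoop code; the function names `map_reduce_1` and `map_reduce_2` and the sample tuples are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the distributed inputs.
A = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0), (2, 0, 4.0)]   # (i, j, A_ij), nonzeros only
W = [(0, [0.5, 0.1]), (1, [0.2, 0.9]), (2, [0.8, 0.3])]    # (i, w_i), each w_i has length k

def map_reduce_1(A, W):
    """Join A and W on row index i, then emit (j, A_ij * w_i)."""
    groups = defaultdict(lambda: {"w": None, "cells": []})
    for i, j, a in A:                  # Map-I: key every A tuple by i
        groups[i]["cells"].append((j, a))
    for i, w in W:                     # Map-I: key every W tuple by i
        groups[i]["w"] = w
    out = []
    for i, g in groups.items():        # Reduce-I: one group per row index i
        for j, a in g["cells"]:
            out.append((j, [a * v for v in g["w"]]))
    return out

def map_reduce_2(pairs, k):
    """Group by column index j and sum the partial vectors into x_j."""
    x = defaultdict(lambda: [0.0] * k)
    for j, vec in pairs:               # Map-II passes pairs through; Reduce-II sums
        x[j] = [s + v for s, v in zip(x[j], vec)]
    return dict(x)

X = map_reduce_2(map_reduce_1(A, W), k=2)
```

Only nonzero cells of $A$ generate work in this scheme, which matches the later observation that sparser matrices run faster.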
![Page 11: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/11.jpg)
$Y = W^T W H$
• $C = W^T W = \sum_{i=1}^{m} w_i^T w_i$
• Map-III: $W: (i, w_i)$ to $(0, w_i^T w_i)$
• Reduce-III: $(0, W^T W)$
• Map-IV: $H: (j, h_j)$ to $(j, y_j)$
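Because $C = W^T W = \sum_{i=1}^{m} w_i^T w_i$ is only $k \times k$, it can be accumulated from per-row outer products and then applied to each column of $H$, since $Y = C H$ implies $y_j = C h_j$. The following toy sketch illustrates that idea under stated assumptions: `map_3`, `reduce_3`, `map_4` and the sample data are hypothetical names and values, and everything runs in memory.

```python
# Hypothetical in-memory stand-ins for the distributed inputs.
W = [(0, [0.5, 0.1]), (1, [0.2, 0.9]), (2, [0.8, 0.3])]   # (i, w_i)
H = [(0, [0.6, 0.3]), (1, [0.2, 0.8]), (2, [0.7, 0.1])]   # (j, h_j)

def map_3(W):
    """Emit (0, w_i^T w_i): the k x k outer product of each row, under one key."""
    return [(0, [[a * b for b in w] for a in w]) for _, w in W]

def reduce_3(pairs, k):
    """Sum all outer products into C = W^T W."""
    C = [[0.0] * k for _ in range(k)]
    for _, outer in pairs:
        for r in range(k):
            for c in range(k):
                C[r][c] += outer[r][c]
    return C

def map_4(H, C):
    """Map-only step: y_j = C h_j for every column j."""
    return {j: [sum(c * v for c, v in zip(row, h)) for row in C] for j, h in H}

C = reduce_3(map_3(W), k=2)
Y = map_4(H, C)
```

The single key `0` sends every outer product to one reducer, which is acceptable here because each value is only $k \times k$.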
![Page 12: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/12.jpg)
$H \leftarrow H \mathbin{.*} X \mathbin{./} Y$
• Inputs: $H: (j, h_j)$, $(j, x_j)$ and $(j, y_j)$
• Map-V: join on $j$ into $(j, h_j, x_j, y_j)$
• Reduce-V: emit $(j, h_j^{new})$
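The final step joins $h_j$, $x_j$ and $y_j$ on the column index $j$ and applies $h_j^{new} = h_j \mathbin{.*} x_j \mathbin{./} y_j$ element by element. A minimal in-memory sketch follows; `update_columns`, the sample vectors, and the `eps` guard are assumptions made for illustration, and the sketch assumes every column has all three vectors present.

```python
from collections import defaultdict

def update_columns(H, X, Y, eps=1e-12):
    """Join (j, h_j), (j, x_j), (j, y_j) on j and emit (j, h_j_new)."""
    joined = defaultdict(dict)
    for name, relation in (("h", H), ("x", X), ("y", Y)):
        for j, vec in relation:        # Map-V: key every tuple by j
            joined[j][name] = vec
    new_H = {}
    for j, g in joined.items():        # Reduce-V: element-wise update per column
        new_H[j] = [h * x / (y + eps)
                    for h, x, y in zip(g["h"], g["x"], g["y"])]
    return new_H

# Hypothetical toy inputs, keyed by column index j.
H = [(0, [0.6, 0.3]), (1, [0.2, 0.8])]
X = [(0, [3.7, 1.3]), (1, [0.6, 2.7])]
Y = [(0, [0.699, 0.555]), (1, [0.562, 0.822])]

H_new = update_columns(H, X, Y)
```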
![Page 13: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/13.jpg)
[Figure: the complete DNMF dataflow for updating H, from Map-I through Reduce-V, shown again in full.]
![Page 14: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/14.jpg)
Experimental Evaluation
• Synthesized data on a sandbox cluster
– No interference from other jobs
– Performance with various parameters
• Real-world data on a commercial cluster
– Real-world scalability
![Page 15: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/15.jpg)
Synthesized Data on Sandbox Cluster
• A Hadoop cluster with 8 workers in total
– Worker: Pentium-IV CPU, 1 or 2 cores, 1~2 GB memory, 150 GB hard drive
– V: Number of workers in cluster
• Matrix simulator
– Generate m-by-n matrix with sparsity δ
– k: factorization dimensionality
– Defaults: $m = 2^{17},\ n = 2^{16},\ \delta = 2^{-7},\ k = 2^{3}$
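A matrix simulator of this kind can be sketched in a few lines. The version below is an assumption-laden illustration, not the simulator used in the experiments: it interprets the sparsity δ as the probability that a cell is nonzero (consistent with cost scaling as m×n×δ), draws nonzero values uniformly from (0, 1], and uses much smaller dimensions than the defaults so it runs instantly. The name `simulate_matrix` is hypothetical.

```python
import random

def simulate_matrix(m, n, delta, seed=0):
    """Return the nonzero cells of a random m-by-n nonnegative matrix
    as (i, j, value) tuples; each cell is nonzero with probability delta."""
    rng = random.Random(seed)
    cells = []
    for i in range(m):
        for j in range(n):
            if rng.random() < delta:
                # 1.0 - random() lies in (0, 1], so stored values are never zero.
                cells.append((i, j, 1.0 - rng.random()))
    return cells

# Toy sizes for illustration only; the slide's defaults are far larger.
m, n, delta = 2**7, 2**6, 2**-3
cells = simulate_matrix(m, n, delta)
density = len(cells) / (m * n)
```

Emitting `(i, j, value)` tuples directly matches the layout in which A is partitioned across machines.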
![Page 16: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/16.jpg)
Computation Breakdown
• $X = W^T A$ dominates the computation
• $Y = W^T W H$ is lightweight
• The sparser, the faster
![Page 17: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/17.jpg)
Performance w.r.t. Parameters
• Linear in m×n×δ
• Linear in factorization dimension k
• Sub-ideal speedup w.r.t. cluster size V
![Page 18: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/18.jpg)
Scalability on Real-world Data
• User-by-Website matrix
– Browsed URLs of opt-in users, represented by UID
– URLs trimmed to site level, e.g. http://www.cnn.com/breakingnews --> www.cnn.com
• Experiments on Microsoft SCOPE
– SCOPE: Structured Computations Optimized for Parallel Execution [Chaiken et al., VLDB’08]
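Trimming a URL to site level, as in http://www.cnn.com/breakingnews --> www.cnn.com, amounts to keeping only the host part. A minimal sketch using Python's standard library is shown below; the function name `trim_to_site` is hypothetical, and the actual preprocessing used for the experiments is not described on the slide.

```python
from urllib.parse import urlsplit

def trim_to_site(url):
    """Keep only the host name of a URL, dropping scheme, path and query."""
    return urlsplit(url).hostname

site = trim_to_site("http://www.cnn.com/breakingnews")
```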
![Page 19: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/19.jpg)
Executions w.r.t. Iterations
• Observations
– Longer total elapsed time
– Shorter time per iteration
• Reason
– Overlapped computation across iterations
[Figure: Normalized Elapse Time vs. Iterations, with linear fit f(x) = 0.721508828250402 x + 0.422552166934188, R² = 0.993424501613606]
![Page 20: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/20.jpg)
Scalability w.r.t. Matrix Size
3 hours per iteration, 20 iterations take around 20*3*0.72 ≈ 43 hours
Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values
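The 43-hour figure is plain arithmetic: 20 iterations, 3 hours for a stand-alone iteration, scaled by the fitted slope of about 0.72 normalized time units per additional iteration. The snippet below reproduces it; the variable names are illustrative, and it assumes, as the slide's estimate does, that the intercept of the fit is ignored.

```python
iterations = 20
hours_per_iteration = 3          # stand-alone cost of one iteration, from the slide
slope_rounded = 0.72             # rounded slope used on the slide
slope_fitted = 0.721508828250402 # slope of the linear fit f(x)

# Estimated total hours when iterations overlap.
estimate_rounded = iterations * hours_per_iteration * slope_rounded
estimate_fitted = iterations * hours_per_iteration * slope_fitted

# Cost if the 20 iterations did not overlap at all.
no_overlap = iterations * hours_per_iteration
```

Both estimates round to 43 hours, against 60 hours if iterations did not overlap.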
![Page 21: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/21.jpg)
Conclusion
• NMF is an effective tool to uncover latent structures in dyadic data, which is abundant on the Web
• NMF is amenable to MapReduce
• Distributed NMF solves the scalability challenge
• Applications down the road
![Page 22: Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce](https://reader035.vdocuments.us/reader035/viewer/2022062422/568136d3550346895d9e7198/html5/thumbnails/22.jpg)
Q&A
Thank You!