data mining meets systems: tools and case studies
![Page 1: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/1.jpg)
CMU SCS
Data Mining Meets Systems: Tools and Case Studies
Christos Faloutsos
SCS CMU
![Page 2: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/2.jpg)
PDL 2008 C. Faloutsos #2
Thanks
Spiros Papadimitriou (CMU->IBM)
Mengzhi Wang (CMU->Google)
Jimeng Sun (CMU -> IBM)
![Page 3: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/3.jpg)
Outline
• Problem 1: workload characterization
• Problem 2: self-* monitoring
• Problem 3: BGP mining
• (Problem 4: sensor mining)
• (Problem 5: Large graphs & hadoop)
[Tools: fractals; SVD; wavelets; tensors; PageRank]
![Page 4: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/4.jpg)
Problem #1:
Goal: given a signal (e.g., #bytes over time)
Find: patterns, periodicities, and/or compress
[Figure: #bytes vs. time; bytes per 30 minutes (similarly: packets per day; earthquakes per year)]
![Page 5: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/5.jpg)
Problem #1
• model bursty traffic
• generate realistic traces
• (Poisson does not work)
[Figure: # bytes vs. time for Poisson traffic]
![Page 6: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/6.jpg)
Motivation
• predict queue length distributions (e.g., to give probabilistic guarantees)
• “learn” traffic, for buffering, prefetching, ‘active disks’, web servers
![Page 7: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/7.jpg)
Q: any ‘pattern’?
• Not Poisson
• spike; silence; more spikes; more silence…
• any rules?
[Figure: # bytes vs. time]
![Page 8: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/8.jpg)
solution: self-similarity
[Figure: two plots of # bytes vs. time]
![Page 9: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/9.jpg)
But:
• Q1: How to generate realistic traces; extrapolate?
• Q2: How to estimate the model parameters?
![Page 10: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/10.jpg)
Approach
• Q1: How to generate a sequence that is
  – bursty
  – self-similar
  – and has similar queue length distributions
![Page 11: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/11.jpg)
Approach
• A: ‘binomial multifractal’ [Wang+02]
• ~ 80-20 ‘law’:
  – 80% of bytes/queries etc. on first half
  – repeat recursively
• b: bias factor (e.g., 80%)
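The recursive 80-20 split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from [Wang+02]: the function name `b_model`, its parameters, and the optional `randomize` flag (which flips the side that receives the larger share) are hypothetical choices made for this example.

```python
import random


def b_model(total, levels, b=0.8, randomize=False, seed=0):
    """Spread `total` units over 2**levels time slots with bias factor b.

    Every interval hands a fraction b of its volume to one half and
    1 - b to the other; the split is then repeated inside each half.
    """
    if not 0.0 < b < 1.0:
        raise ValueError("b must lie strictly between 0 and 1")
    rng = random.Random(seed)
    trace = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in trace:
            first, second = volume * b, volume * (1.0 - b)
            # Assumption: optionally flip which half receives the larger share.
            if randomize and rng.random() < 0.5:
                first, second = second, first
            finer.extend((first, second))
        trace = finer
    return trace


# Ten levels of splitting give 2**10 = 1024 time slots.
trace = b_model(total=1_000_000, levels=10, b=0.8)
print(len(trace), round(sum(trace)))
```

With `randomize=False` the larger share always goes to the first half, as on the slide; the total volume is preserved at every level.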
![Page 12: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/12.jpg)
binary multifractals
[Figure: 20 / 80 split]
![Page 14: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/14.jpg)
Parameter estimation
• Q2: How to estimate the bias factor b?
![Page 15: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/15.jpg)
Parameter estimation
• Q2: How to estimate the bias factor b?
• A: MANY ways [Crovella+96]
  – Hurst exponent
  – variance plot
  – even DFT amplitude spectrum! (‘periodogram’)
  – More robust: ‘entropy plot’ [Wang+02]
Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002
![Page 16: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/16.jpg)
Entropy plot
• Rationale:
  – burstiness: inverse of uniformity
  – entropy measures uniformity of a distribution
  – find entropy at several granularities, to see whether/how our distribution is close to uniform.
![Page 17: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/17.jpg)
Entropy plot
• Entropy E(n) after n levels of splits
• n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
[Figure: interval split into two halves; p1, p2 = % of bytes in each half]
![Page 18: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/18.jpg)
Entropy plot
• Entropy E(n) after n levels of splits
• n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
• n=2: $E(2) = -\sum_{i=1}^{4} p_{2,i} \log_2 p_{2,i}$
[Figure: interval split into four quarters: p2,1, p2,2, p2,3, p2,4]
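The entropy plot can be computed by aggregating the trace into coarser and coarser intervals and taking the entropy at each level. The following is a minimal sketch, assuming the trace length is a power of two; `entropy_plot` and `least_squares_slope` are hypothetical helper names for this example, not functions from the cited paper.

```python
def entropy_plot(trace):
    """Return [E(0), E(1), ..., E(n)] for a trace with 2**n time slots."""
    size = len(trace)
    if size == 0 or size & (size - 1):
        raise ValueError("trace length must be a power of two")
    total = float(sum(trace))
    if total <= 0:
        raise ValueError("trace must contain some volume")

    def entropy(level):
        # Shannon entropy (base 2) of the volume fractions; empty slots add 0.
        fractions = [v / total for v in level if v > 0]
        return -sum(p * math.log2(p) for p in fractions)

    level = list(trace)
    entropies = []
    while True:
        entropies.append(entropy(level))
        if len(level) == 1:
            break
        # Merge adjacent pairs to move one level coarser.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    entropies.reverse()  # index k now holds E(k), the entropy after k splits
    return entropies


def least_squares_slope(values):
    """Slope of the least-squares line through the points (k, values[k])."""
    count = len(values)
    mean_x = (count - 1) / 2.0
    mean_y = sum(values) / count
    num = sum((k - mean_x) * (y - mean_y) for k, y in enumerate(values))
    den = sum((k - mean_x) ** 2 for k in range(count))
    return num / den


import math  # used inside entropy(); imported here to keep the sketch self-contained

uniform = [1.0] * 16
print(least_squares_slope(entropy_plot(uniform)))
```

A perfectly uniform trace gains one bit of entropy per split, while a trace with all its volume in a single slot stays at zero entropy at every level.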
![Page 19: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/19.jpg)
Real traffic
• Has linear entropy plot (-> self-similar)
[Figure: entropy E(n) vs. # of levels (n); line labeled 0.73]
![Page 20: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/20.jpg)
Observation - intuition:
• slope = intrinsic dimensionality =~ ‘degrees of freedom’, or info-bits per coordinate-bit
  – uniform dataset: slope = 1
  – multi-point: slope = 0
[Figure: entropy E(n) vs. # of levels (n); line labeled 0.73]
![Page 21: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/21.jpg)
Some more entropy plots:
• Poisson vs real
Poisson: slope = ~1 -> uniformly distributed
[Figure: entropy plots, Poisson vs. real; lines labeled 1 and 0.73]
![Page 22: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/22.jpg)
B-model
• b-model traffic gives perfectly linear plot
• Lemma: its slope is $-b \log_2 b - (1-b) \log_2 (1-b)$
• Fitting: do entropy plot; get slope; solve for b
[Figure: E(n) vs. n]
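The "solve for b" step has no closed form, but the lemma's expression decreases monotonically from 1 to 0 as b goes from 0.5 to 1, so a numeric search suffices. Below is a minimal sketch using bisection; the function names and the choice of bisection are assumptions for this example, not the fitting code of the cited paper.

```python
import math


def binary_entropy(b):
    """Slope predicted by the lemma: -b log2(b) - (1-b) log2(1-b)."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)


def bias_from_slope(slope, tol=1e-12):
    """Invert the lemma on [0.5, 1], where the slope falls from 1 to 0."""
    if not 0.0 <= slope <= 1.0:
        raise ValueError("slope must lie in [0, 1]")
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) > slope:
            lo = mid  # predicted slope still too high: b must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


# Example: the measured entropy-plot slope of 0.73 from the earlier slide.
print(bias_from_slope(0.73))
```

Restricting the search to b >= 0.5 reflects the convention that b names the larger share; the mirror value 1 - b gives the same slope.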
![Page 23: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/23.jpg)
Experimental setup
• Disk traces (from HP [Wilkes 93])
• web traces from LBL: http://repository.cs.vt.edu/lbl-conn-7.tar.Z
![Page 24: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/24.jpg)
Model validation
• Linear entropy plots
• Bias factors b: 0.6-0.8
• smallest b / smoothest: nntp traffic
![Page 25: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/25.jpg)
Web traffic - results
• LBL, NCDF of queue lengths (log-log scales)
[Figure: Prob(> l) vs. queue length l]
![Page 26: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/26.jpg)
Conclusions
• Multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic
![Page 27: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/27.jpg)
Books
• Fractals: Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)
![Page 28: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/28.jpg)
Outline
• Problem 1: workload characterization
• Problem 2: self-* monitoring
• Problem 3: BGP mining
• (Problem 4: sensor mining)
• (Problem 5: Large graphs & hadoop)
![Page 29: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/29.jpg)
Clusters/data center monitoring
• Monitor correlations of multiple measurements
• Automatically flag anomalous behavior
• InteMon: intelligent monitoring system
  – warsteiner.db.cs.cmu.edu/demo/intemon.jsp
![Page 30: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/30.jpg)
Publication
Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006
![Page 31: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/31.jpg)
Under the hood: SVD
• Singular Value Decomposition
• Done incrementally
Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos. Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005, Trondheim, Norway.
![Page 32: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/32.jpg)
Singular Value Decomposition (SVD)
• SVD (~LSI ~ KL ~ PCA ~ spectral analysis...)
LSI: S. Dumais; M. Berry
KL: eg, Duda+Hart
PCA: eg., Jolliffe
Details: [Press+]
[Figure: u of CPU1 vs. u of CPU2; points labeled t=1, t=2]
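For the CPU1/CPU2 picture, the first singular vector is the direction along which the measurements vary together. The sketch below finds it with batch power iteration in plain Python. This is a swapped-in textbook technique chosen for brevity, not the incremental method of the VLDB 2005 paper; the function name and the sample readings are hypothetical.

```python
import math


def first_singular_direction(rows, iterations=100):
    """Approximate the first right singular vector of the matrix `rows`.

    Each row holds the measurements at one time tick, e.g. (CPU1, CPU2).
    Power iteration repeatedly applies A^T A and renormalizes.
    """
    dims = len(rows[0])
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        # Project every time tick onto the current direction: A v
        projections = [sum(r[j] * v[j] for j in range(dims)) for r in rows]
        # Map back to measurement space: A^T (A v)
        w = [sum(p * r[j] for p, r in zip(projections, rows)) for j in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero data or an unlucky start; keep the last direction
        v = [x / norm for x in w]
    return v


# Hypothetical readings in which CPU2 is always twice as busy as CPU1.
readings = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(first_singular_direction(readings))
```

Once such a direction is known, a new reading that falls far from it is a candidate anomaly, which is the kind of correlation monitoring the earlier slide describes.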
![Page 36: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/36.jpg)
Outline
• Problem 1: workload characterization
• Problem 2: self-* monitoring
• Problem 3: BGP mining
• (Problem 4: sensor mining)
• (Problem 5: Large graphs & hadoop)
![Page 37: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/37.jpg)
BGP updates
With
• Aditya Prakash (CMU)
• Michalis Faloutsos (UC Riverside)
• Nicholas Valler (UC Riverside)
• Dave Andersen (CMU)
![Page 38: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/38.jpg)
Time Series: #Updates per 600s, Washington Router 09/2004-09/2006
Tool #0: Time plot
![Page 39: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/39.jpg)
Tool #0: Time plot
• Observation #1: Missing values
• Observation #2: Bursty
![Page 40: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/40.jpg)
Tool #1: Wavelets
![Page 41: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/41.jpg)
Wavelets - DWT
• Short window Fourier transform (SWFT)
• But: how short should the window be?
[Figure: two plots, freq vs. time and value vs. time]
![Page 42: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/42.jpg)
Wavelets - DWT
• Answer: multiple window sizes! -> DWT
[Figure: time-frequency panels (axes: time, freq) for time domain, DFT, SWFT, DWT]
![Page 43: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/43.jpg)
Haar Wavelets
• subtract sum of left half from sum of right half
• repeat recursively for quarters, eighths, ...
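The two bullets translate almost directly into code. Below is a minimal sketch of an unnormalized Haar transform, assuming the signal length is a power of two; it works bottom-up (pairs first, then quarters, then halves), which yields the same differences as the top-down description. The name `haar_transform` and the omission of the usual scaling factors are choices made for this example.

```python
def haar_transform(signal):
    """Return (total, details) for a signal whose length is a power of two.

    details[0] holds the finest-scale differences (adjacent pairs) and
    details[-1] holds the single coarsest one:
    sum(right half) - sum(left half) of the whole signal.
    """
    size = len(signal)
    if size == 0 or size & (size - 1):
        raise ValueError("signal length must be a power of two")
    sums = list(signal)
    details = []
    while len(sums) > 1:
        left, right = sums[0::2], sums[1::2]
        # Difference of each right block and the left block next to it.
        details.append([r - l for l, r in zip(left, right)])
        # Block sums feed the next, coarser level.
        sums = [l + r for l, r in zip(left, right)]
    return sums[0], details


total, details = haar_transform([1, 2, 3, 4, 5, 6, 7, 8])
print(total, details)
```

A large coefficient at a fine scale marks an abrupt local change, while large coefficients only at coarse scales indicate a slower, sustained shift.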
![Page 44: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/44.jpg)
‘Tornado Plot’ for Washington Router: Dark areas correspond to high energy
[Figure: tornado plot; axes: time, frequency (low freq. to high freq.)]
![Page 45: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/45.jpg)
Tornado Plot: Wavelet Transform for Washington Router 09/2004-09/2006, all coefficients and detail levels 1-12
Observations:
1. Obvious Spikes (E1): tornados that “touch down”
2. Prolonged Spikes (E2 and E3): when coarser scales have high values but finer scales do not
3. Intermittent Waves (E4 and E5): High-energy entries at nearby scales correspond to local periodic motion
![Page 46: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/46.jpg)
E2: Prolonged Spike
Sustained period of relatively high activity
Magnification of updates on 28th Aug. 2005
[Figure: # updates vs. time]
![Page 47: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/47.jpg)
Tool #2: logarithms
![Page 48: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/48.jpg)
Tool #2: logarithms
Prominent ‘clothesline’ at ~50 updates per 600 secs.
Culprit IP addresses:
• 192.211.42.0/24
• 216.109.38.0/24
• 207.157.115.0/24
All from Alabama (Supercomputing Center)!
![Page 49: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/49.jpg)
Outline
• Problem 1: workload characterization
• Problem 2: self-* monitoring
• Problem 3: BGP mining
• (Problem 4: sensor mining)
• (Problem 5: Large graphs & hadoop)
[Tools: fractals; SVD; wavelets; tensors; PageRank]
![Page 50: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/50.jpg)
Main point
Two-way street:
<- DM can use such infrastructures to find patterns
-> DM can help such systems/networks etc to become self-healing, self-adjusting, ‘self-*’
Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes
![Page 51: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/51.jpg)
Additional resources
• Machine learning classes at SCS/MLD
• Tom Mitchell’s book on Machine Learning
  – Classification
  – Clustering/Anomaly detection
  – Support vector machines
  – Graphical models
  – Bayesian networks
  – <etc etc>
![Page 52: Data Mining Meets Systems: Tools and Case Studies](https://reader035.vdocuments.us/reader035/viewer/2022062423/56814971550346895db6c0a7/html5/thumbnails/52.jpg)
www.cs.cmu.edu/~christos
For code, papers etc
WeH 7107 christos <at> cs