Proactivity = Observation + Analysis + Knowledge Extraction + Action Planning?



Proactivity = Observation + Analysis + Knowledge extraction + Action planning?
András Pataricza, Budapest University of Technology and Economics

Budapest University of Technology and Economics, Department of Measurement and Information Systems

Contributors
Prof. G. Horváth (BME)
I. Kocsis (BME)
Z. Micskei (BME)
K. Gáti (BME)
Zs. Kocsis (IBM)
I. Szombath (BME)
And many others

There will be nothing new in this lecture

I learned the basics when I was so young


But old professors are happy to have a new audience


How can traditional signal processing help proactivity?
Proactive stance: builds on foreknowledge (intelligence) and creativity to anticipate the situation as an opportunity, regardless of how threatening or bad it looks; it influences the system constructively instead of merely reacting.

Reactivity vs. proactivity
Reactive control: acting in response to a situation rather than creating or controlling it.
Proactive control: controlling a situation rather than just responding to it after it has happened.

Test environment

Test configuration
Virtual desktop infrastructure
~ a few tens of VMs per host
~ a few tens of hosts per cluster
vSphere monitoring and supervisory control
Objective: VM-level SLA control
Capacity planning, proactive migration
CPU-ready metric: the VM is ready to run, but lacks the resources to start (see the sketch below)
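As an illustration, a minimal Python sketch of how the raw CPU-ready counter can be normalized into a percentage. It assumes the usual vSphere convention that the counter accumulates milliseconds of ready time over a 20-second sampling interval; the sample values are invented.

```python
# A minimal sketch: derive the CPU-ready percentage from a raw
# "cpu.ready"-style counter (milliseconds of ready time accumulated
# over one sampling interval). The 20 s interval matches the sampling
# rate quoted later in the deck; the sample values are made up.

SAMPLE_INTERVAL_S = 20  # assumed vSphere real-time sampling period

def cpu_ready_percent(ready_ms: float, interval_s: int = SAMPLE_INTERVAL_S) -> float:
    """Fraction of the interval the VM spent ready-but-not-running."""
    return 100.0 * ready_ms / (interval_s * 1000.0)

if __name__ == "__main__":
    for ready_ms in (200.0, 1100.0, 2400.0):  # hypothetical raw samples
        print(f"{ready_ms:7.1f} ms -> {cpu_ready_percent(ready_ms):5.2f}% CPU ready")
```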

Performance monitoring

Detecting a possible problem at the VM or host level; a failure indicator as well.
This document was created using the official VMware icon and diagram library. Copyright 2010 VMware, Inc. See http://communities.vmware.com/docs/DOC-13702

Actions to prevent performance issues

Add limits to neighbouring VMs

Actions to prevent performance issues

Live-migrate the VM to another (underutilized) host

Measured data (at 20 sec sampling rate)

Aggregation over the population
Statistical cluster behavior versus QoS over the VM population

Mean of the goal VM metric (VM_CPU_READY)
VM application: ready to run, but lacking resources
-> Performance bottleneck
-> Availability problem
VMware recommended thresholds: 5% is worth watching; at 10% action is typically needed (see the sketch below)
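A minimal sketch of the population-level aggregation described above: compute the mean VM_CPU_READY per VM and flag it against the 5%/10% thresholds quoted from VMware. The data layout and values are invented for illustration.

```python
# Flag each VM's mean CPU-ready percentage against the VMware
# thresholds cited above (5% watch, 10% act). Values are hypothetical.

import statistics

vm_cpu_ready = {  # hypothetical per-VM series of CPU-ready percentages
    "vm-01": [0.8, 1.2, 0.9, 1.1],
    "vm-02": [4.9, 6.3, 5.8, 7.0],
    "vm-03": [9.5, 12.1, 11.4, 10.8],
}

for vm, series in vm_cpu_ready.items():
    mean_ready = statistics.mean(series)
    if mean_ready >= 10.0:
        level = "ACT (>= 10%)"
    elif mean_ready >= 5.0:
        level = "WATCH (>= 5%)"
    else:
        level = "ok"
    print(f"{vm}: mean CPU ready = {mean_ready:.1f}% -> {level}")
```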

The two traps
Visual processing: you believe your eyes.
Automated processing: you believe your computer.

Statistics:
Mean: 0.007 -> a good system
Only 2/3 of the samples are error-free -> a bad system
After eliminating failure-free cases below the threshold, mean: 0.023 -> a good system
(A worked example follows below.)
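A worked example of the statistical trap above: one and the same sample set yields three different verdicts depending on how the mean is taken. The threshold and values are invented to roughly reproduce the deck's figures, not taken from its data.

```python
# Three verdicts from one dataset: overall mean, fraction of error-free
# samples, and conditional mean of the bad samples. Values are synthetic.

samples = [0.0] * 660 + [0.02] * 300 + [0.05] * 40  # hypothetical CPU-ready ratios
THRESHOLD = 0.01

overall_mean = sum(samples) / len(samples)
error_free = sum(1 for s in samples if s == 0.0) / len(samples)
above = [s for s in samples if s >= THRESHOLD]
conditional_mean = sum(above) / len(above)

print(f"overall mean        = {overall_mean:.3f}  (looks good)")
print(f"error-free fraction = {error_free:.2f}   (only ~2/3 -> looks bad)")
print(f"mean of bad samples = {conditional_mean:.3f}  (looks good again)")
```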

Mean of the goal VM metric
Visual inspection: lots of bad values -> this is a bad system

Host shared and used memory over time
Noisy; high-frequency components dominate
But they correlate (93%!)
YOU DON'T SEE IT

...and a host of more mundane observations
Computing power use = CPU use × CPU clock rate (const.)
Should be purely proportional
Correlation coefficient: 0.99998477434137
Well visible, but numerically suppressed. Origin?
(A correlation sketch follows below.)
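A minimal sketch of the point made on the last two slides: two series that look erratic to the eye can still have a correlation coefficient near 1. The signals are synthetic stand-ins for the memory metrics, not the deck's data.

```python
# Two noisy, high-frequency-dominated series that nonetheless correlate
# strongly, illustrating "you don't see it". Signals are synthetic.

import math
import random

random.seed(0)
n = 500
base = [math.sin(0.3 * t) + 0.5 * math.sin(7.0 * t) for t in range(n)]
used_mem = [b + random.gauss(0, 0.05) for b in base]                 # hypothetical "used"
shared_mem = [0.8 * b + 0.1 + random.gauss(0, 0.05) for b in base]   # hypothetical "shared"

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"Pearson correlation: {pearson(used_mem, shared_mem):.3f}")
```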


Host CPU usage vs. VM ratio: bad vCPU-ready

Most important factor: host CPU usage mean

The battle plan
Impacts of temporal resolution
Nyquist–Shannon sampling theorem: sampling frequency = 2 × bandwidth
Sampling period = 20 sec -> sampling frequency = 0.05 Hz -> observable bandwidth = 0.025 Hz
Additionally:
Sampling clock jitter (SW sampling)
Clock skew (distributed system)
Precision Time Protocol (PTP) (IEEE 1588-2008)
No fine-grained prediction (a worked calculation follows below)
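A worked version of the Nyquist arithmetic above: a 20 s sampling period bounds the observable bandwidth at 1/(2 × 20 s) = 0.025 Hz, so phenomena faster than roughly 40 seconds cannot be reconstructed from these samples.

```python
# Nyquist arithmetic for the 20 s monitoring samples quoted above.

sampling_period_s = 20.0
sampling_frequency_hz = 1.0 / sampling_period_s        # 0.05 Hz
nyquist_bandwidth_hz = sampling_frequency_hz / 2.0     # 0.025 Hz

print(f"sampling frequency : {sampling_frequency_hz:.3f} Hz")
print(f"usable bandwidth   : {nyquist_bandwidth_hz:.3f} Hz "
      f"(fastest resolvable period ~ {1.0 / nyquist_bandwidth_hz:.0f} s)")
```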

Proactivity
Proactivity needs:
Situation recognition based on historical experience: what is to be expected?
Identification of the principal factors: single factor / multiple factors
Operation domains leading to failures: boundaries
Predictor design: high failure coverage, temporal lookahead sufficient for reaction
Design of the reaction

Situations to be covered
Single VM: application demand > resources allocated
VM–host: overcommissioning, overload due to other VMs
VM–host–cluster


Data preparation
Data cleaning
Data reduction

Data reduction
Huge initial set of samples
Reduction:
Object sampling: representative measurement objects
Parameter selection/reduction: aggregation, relevance, redundancy
Temporal sampling: relevance

Object sampling
In pursuit of discovering fine-grained behavior and the reasons for outliers

For presentation purposes only: reduction of the sample size to 400 for manageability.
Real-life analysis: keep enough data to maintain a proper correlation with the operation.

Subsample: ratio > 0 + random subsampling (see the sketch below)
Demo: visual data discovery with parallel coordinates
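A minimal sketch of the subsampling step described above: keep only records with a positive ratio, then randomly subsample down to 400 rows for presentation. The column name "ratio" and the record layout are assumptions for illustration.

```python
# Filter to ratio > 0, then draw a random subsample of 400 records,
# matching the "presentation purposes only" reduction described above.

import random

def subsample(records, key="ratio", target=400, seed=42):
    """Filter to key > 0, then draw a random subsample of size target."""
    positive = [r for r in records if r.get(key, 0) > 0]
    if len(positive) <= target:
        return positive
    rng = random.Random(seed)
    return rng.sample(positive, target)

# hypothetical usage with synthetic records:
records = [{"vm": f"vm-{i}", "ratio": random.random() - 0.3} for i in range(5000)]
print(len(subsample(records)))  # -> 400
```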

Visual multifactor analysis
Visual analytics for an arbitrary number of factors

Inselberg, A.: Parallel Coordinates: Visual Multidimensional Geometry and Its Applications. Springer, 2009.
You can do much, much more:
Redundancy reduction
Correlation analysis
Clustering
Data mining
Approximation
Optimization
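A minimal sketch of a parallel-coordinates view over a few VM metrics, using pandas' built-in plotting helper. The metric names and values are invented; the deck's demo used real monitoring data.

```python
# Parallel-coordinates plot of a small, synthetic VM-metric table,
# colored by an "ok"/"alarm" class column.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "cpu_usage": [0.2, 0.7, 0.9, 0.3, 0.8],
    "mem_usage": [0.4, 0.6, 0.9, 0.2, 0.7],
    "cpu_ready": [0.01, 0.04, 0.12, 0.02, 0.09],
    "class":     ["ok", "ok", "alarm", "ok", "alarm"],
})

parallel_coordinates(df, class_column="class", colormap="coolwarm")
plt.title("VM metrics on parallel coordinates")
plt.show()
```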

Prediction at the cluster level
What ratio of the VMs will become problematic?
Pinpointed interval for one VM

Situation of interest
Training time > prediction time
One-minute prediction based on all data sources

One-minute prediction and classification

Predicted \ Real    Alarm          Normal
Alarm               77 (67.54%)    56 (0.3%)
Normal              37 (32.46%)    18269 (99.7%)

Red -> missed alarm (type I error)
Yellow -> false alarm (type II error)

One-minute prediction with selected variables

Classification error (simplest predictor)

Factors             Prediction time   Uncovered failure rate   False alarm rate
All                 1 min             73%                      0.2%
Proper feature set  1 min             32%                      0.3%
Wrong feature set   1 min             97%                      0.04%
All                 5 min             87%                      0.1%

False alarm rate is low (dominant pattern)
Feature set selection is critical to detection
More is less (PROPER selection is needed, cf. PFARM 2010)
Case separation for different situations
Long-term prediction is hard (automated reactions)
(A classifier sketch follows below.)
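A minimal sketch of the kind of classifier behind the error table above: train a simple model on metric features and report the confusion matrix. The features, labels, and model choice (a decision tree) are assumptions for illustration; the deck does not name the predictor it used.

```python
# Train a simple alarm classifier on synthetic metric features and
# print the confusion matrix (rows: real class, cols: predicted class).

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
n = 5000
X = rng.random((n, 4))                           # hypothetical metric features
y = (X[:, 0] + 0.5 * X[:, 1] > 1.2).astype(int)  # synthetic "alarm" rule
flip = rng.random(n) < 0.02                      # 2% label noise
y = np.where(flip, 1 - y, y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
```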

Case study: Connectivity testing in large networks
In dynamic infrastructures, the active inter-node topology has to be discovered as well.

Large networks:
not known explicitly
too complex for conventional algorithms

Social network graph
Yahoo! Instant Messenger friend-connectivity graph*
1.8M nodes, ~4M edges

Serves as a model of large infrastructures

Typical power-law network: 75% of the friendships are related to 35% of the users

* Yahoo! Research Alliance Webscope program, ydata-yim-friends-graph-v1_0, http://research.yahoo.com/Academic_Relations

Typical model: random graphs
Yahoo! Instant Messenger dataset: adjacency matrix

Preferential attachment graph

Graphon
Ordered by degree / random order
Preferential attachment graph: at each step, m new edges are created; they attach to randomly chosen nodes with probability proportional to their degrees (a generation sketch follows below)
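A minimal sketch of generating such a graph with networkx's Barabási–Albert model, which implements exactly this rule: each new node attaches m edges to existing nodes with probability proportional to their degree. The parameter values are illustrative.

```python
# Generate a preferential-attachment (Barabási–Albert) graph and show
# its heavy-tailed degree distribution, typical of power-law networks.

import networkx as nx

G = nx.barabasi_albert_graph(n=1000, m=3, seed=42)

degrees = sorted((d for _, d in G.degree()), reverse=True)
print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("top-5 degrees:", degrees[:5])
```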

Approximating edge density by subgraph sampling
[Heatmap: sample size (k) vs. number of samples (n), showing relative error; white: error < 5%]

Random k = 4 sample
Sample size k = 35, repeated n = 20 times: 2% error with 4% of the graph examined
The decrease of the error is exponential while k or n is small, and linear after that.
Colors darker than purple mean the error is less than 5%. (An estimator sketch follows below.)

Neighborhood sampling: fault-tolerant services
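A minimal sketch of the estimator described above: draw n random k-node subsets, measure the edge density of each induced subgraph, and average. The parameters mirror the slide (k = 35, n = 20); the input graph is a synthetic stand-in, not the Yahoo! dataset.

```python
# Estimate a graph's edge density from n random k-node induced subgraphs.

import random
import networkx as nx

def estimate_edge_density(G, k=35, n=20, seed=0):
    rng = random.Random(seed)
    nodes = list(G.nodes())
    max_edges = k * (k - 1) / 2
    densities = []
    for _ in range(n):
        sample = rng.sample(nodes, k)
        sub = G.subgraph(sample)
        densities.append(sub.number_of_edges() / max_edges)
    return sum(densities) / n

G = nx.barabasi_albert_graph(5000, 3, seed=1)   # stand-in for a large network
print(f"true density: {nx.density(G):.5f}, estimate: {estimate_edge_density(G):.5f}")
```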

Number of 3- and 4-cycles = possible redundancy
A high-degree node has many substitute nodes (e.g. a load balancer)

The distribution approximated from samples is very close!
Root node: redundancy?

Neighborhood sampling:
take random nodes
explore the neighborhood to a given depth (m) (a sketch follows below)
Fault-tolerant domain
Trends

Summary: proactivity needs
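A minimal sketch of neighborhood sampling: pick random root nodes and explore each one's neighborhood to depth m via breadth-first search (networkx's bfs_tree accepts a depth limit). The graph and parameter values are illustrative stand-ins.

```python
# Sample random roots and explore each neighborhood to depth m via BFS.

import random
import networkx as nx

def sample_neighborhoods(G, roots=5, m=2, seed=0):
    rng = random.Random(seed)
    for root in rng.sample(list(G.nodes()), roots):
        ball = nx.bfs_tree(G, root, depth_limit=m)
        yield root, ball.number_of_nodes()

G = nx.barabasi_albert_graph(2000, 3, seed=1)   # stand-in for a large network
for root, size in sample_neighborhoods(G):
    print(f"root {root}: {size} nodes within depth 2")
```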

Observations: all relevant cases (stress test)
Analysis: check of the input data; visual analysis for UNDERSTANDING; automated methods for calculation
Knowledge extraction: clustering (situation recognition); predictor (generalization)
Action planning: the principal factors defining a situation are indicative

Thank you for your attention