Anomaly Detection Using Spark MLlib and Spark Streaming
Anomaly Detection
Offline training using Spark MLlib; online testing using Spark Streaming
Details: https://github.com/keiraqz/anomaly-detection
Keira Zhou, Dec 2015
The Model
The model is trained using KMeans (Spark MLlib K-means), on the "normal" dataset only.
After the model is trained, the centroid of the "normal" dataset is returned, along with a threshold.
During the validation stage, any data point that lies further than the threshold from the centroid is considered an "anomaly".
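A minimal sketch of that decision rule, assuming Euclidean distance over Spark MLlib vectors (the distance metric is an assumption; the slides only say "further than the threshold"):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A point is flagged as an anomaly when its Euclidean distance
// from the trained centroid exceeds the threshold.
def isAnomaly(point: Vector, centroid: Vector, threshold: Double): Boolean =
  math.sqrt(Vectors.sqdist(point, centroid)) > threshold
```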
Dataset
The dataset is downloaded from KDD Cup 1999 Data for Anomaly Detection [1].
Training set: separated from the whole dataset, keeping only the data points labeled "normal".
Validation set: the whole dataset; all data points NOT labeled "normal" are considered "anomalies".
[1] KDD Cup 1999 Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
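A minimal sketch of that split, assuming a SparkContext `sc` (e.g. from spark-shell) and a hypothetical local path; KDD labels end with a dot, e.g. "normal.":

```scala
// Assumes a SparkContext `sc`; the path is hypothetical
val rawData = sc.textFile("kddcup.data")

// Training set: only the rows whose label (the last field) is "normal."
val trainingSet = rawData.filter(_.endsWith("normal."))

// Validation set: the whole dataset; anything not labeled "normal."
// will count as an anomaly when scored
val validationSet = rawData
```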
Offline Training
The training code largely follows the tutorial from Sean Owen, Cloudera:
Video: https://www.youtube.com/watch?v=TC5cKYBZAeI
Slides-1: http://www.slideshare.net/CIGTR/anomaly-detection-with-apache-spark
Slides-2: http://www.slideshare.net/cloudera/anomaly-detection-with-apache-spark-2
A couple of modifications have been made to fit personal interest:
Instead of training multiple clusters, the code trains on "normal" data points only.
Only one cluster center is recorded, and the threshold is set to the distance of the last of the furthest 2000 data points.
During the later validation stage, all points further than the threshold are labeled "anomaly" (see the sketch after this list).
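A sketch of that training step, assuming a SparkContext `sc` and a hypothetical file of pre-parsed, comma-separated numeric features taken from the "normal" rows (the path and iteration count are assumptions):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumes a SparkContext `sc`; the path is hypothetical
val data = sc.textFile("normal_features.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Train a single cluster on the "normal" points
// (k = 1; the iteration count of 20 is an assumption)
val model = KMeans.train(data, 1, 20)
val centroid = model.clusterCenters(0)

// Threshold: the distance of the last of the 2000 furthest training points
val distances = data.map(p => math.sqrt(Vectors.sqdist(p, centroid)))
val threshold = distances.top(2000).last
```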
Online Testing
Validation is run as a streaming job using Spark Streaming.
Spark Streaming context: process a batch every 3 seconds.
Load the trained model: the centroid and threshold are loaded from a local file, and the input data is put into a queueStream.
The streaming task: calculate the distance between each data point and the centroid, then compare it to the threshold (see the sketch below).
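Putting those pieces together, a hedged sketch of the streaming job (the centroid and threshold values are placeholders standing in for what the training step saved):

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes a SparkContext `sc`; in the real job these two values are
// loaded from the file the training step wrote -- placeholders here
val centroid: Vector = Vectors.dense(0.0, 0.0, 0.0)
val threshold: Double = 10.0

val ssc = new StreamingContext(sc, Seconds(3))  // process every 3 seconds

// The validation data is fed in through a queueStream
val queue = new mutable.Queue[RDD[Vector]]()
val points = ssc.queueStream(queue)

// The streaming task: distance to the centroid vs. the threshold
points.map { p =>
  val dist = math.sqrt(Vectors.sqdist(p, centroid))
  (p, if (dist > threshold) "anomaly" else "normal")
}.print()

ssc.start()
ssc.awaitTermination()
```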
Notes
Currently the application reads the input data from a local file. In an ideal situation, the program would read the data from an ingestion tool such as Kafka (a sketch follows below).
Also, the trained model (centroid and threshold) is saved in a local file. In production, that information should be saved into a database.
The output of the testing can be saved into a database for visualization.
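For the Kafka note above, a hedged sketch using the Spark 1.x direct-stream Kafka API (spark-streaming-kafka; the broker address and topic name are assumptions):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes a SparkContext `sc`; broker and topic names are hypothetical
val ssc = new StreamingContext(sc, Seconds(3))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("kdd-events")

// Each message value is one raw record, to be parsed into a feature
// vector and scored against the centroid and threshold as above
val lines = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  .map(_._2)
```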
More Details
https://github.com/keiraqz/anomaly-detection