Anomaly Detection Using Spark MLlib and Spark Streaming
Anomaly Detection
Offline training using Spark MLlib; online testing using Spark Streaming
Details: https://github.com/keiraqz/anomaly-detection
Keira Zhou, Dec 2015
The Model
The model is trained using KMeans (Spark MLlib K-means), on the "normal" dataset only.
After the model is trained, the centroid of the "normal" dataset is returned, along with a threshold.
During the validation stage, any data point that lies further than the threshold from the centroid is considered an "anomaly".
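A minimal sketch of that decision rule, assuming Euclidean distance over Spark MLlib vectors (the distance metric is an assumption; the slides only say "further than the threshold"):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A point is flagged as an anomaly when its Euclidean distance
// from the trained centroid exceeds the threshold.
def isAnomaly(point: Vector, centroid: Vector, threshold: Double): Boolean =
  math.sqrt(Vectors.sqdist(point, centroid)) > threshold
```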
Dataset
The dataset is downloaded from KDD Cup 1999 Data for Anomaly Detection [1].
Training set: separated from the whole dataset, keeping only the data points labeled "normal".
Validation set: the whole dataset; all data points NOT labeled "normal" are considered "anomalies".
[1] KDD Cup 1999 Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
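A minimal sketch of that split, assuming a SparkContext `sc` (e.g. from spark-shell) and a hypothetical local path; KDD labels end with a dot, e.g. "normal.":

```scala
// Assumes a SparkContext `sc`; the path is hypothetical
val rawData = sc.textFile("kddcup.data")

// Training set: only the rows whose label (the last field) is "normal."
val trainingSet = rawData.filter(_.endsWith("normal."))

// Validation set: the whole dataset; anything not labeled "normal."
// will count as an anomaly when scored
val validationSet = rawData
```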
Offline Training
The training code largely follows the tutorial from Sean Owen, Cloudera:
Video: https://www.youtube.com/watch?v=TC5cKYBZAeI
Slides-1: http://www.slideshare.net/CIGTR/anomaly-detection-with-apache-spark
Slides-2: http://www.slideshare.net/cloudera/anomaly-detection-with-apache-spark-2
A couple of modifications have been made to fit personal interest:
Instead of training multiple clusters, the code trains on "normal" data points only.
Only one cluster center is recorded, and the threshold is set to the distance of the last of the furthest 2000 data points.
During the later validation stage, all points further than the threshold are labeled "anomaly" (see the sketch after this list).
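A sketch of that training step, assuming a SparkContext `sc` and a hypothetical file of pre-parsed, comma-separated numeric features taken from the "normal" rows (the path and iteration count are assumptions):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumes a SparkContext `sc`; the path is hypothetical
val data = sc.textFile("normal_features.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Train a single cluster on the "normal" points
// (k = 1; the iteration count of 20 is an assumption)
val model = KMeans.train(data, 1, 20)
val centroid = model.clusterCenters(0)

// Threshold: the distance of the last of the 2000 furthest training points
val distances = data.map(p => math.sqrt(Vectors.sqdist(p, centroid)))
val threshold = distances.top(2000).last
```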
Online Testing
Validation is run as a streaming job using Spark Streaming.
Spark Streaming context: process a batch every 3 seconds.
Load the trained model: the centroid and threshold are loaded from a local file, and the input data is put into a queueStream.
The streaming task: calculate the distance between each data point and the centroid, then compare it to the threshold (see the sketch below).
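Putting those pieces together, a hedged sketch of the streaming job (the centroid and threshold values are placeholders standing in for what the training step saved):

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes a SparkContext `sc`; in the real job these two values are
// loaded from the file the training step wrote -- placeholders here
val centroid: Vector = Vectors.dense(0.0, 0.0, 0.0)
val threshold: Double = 10.0

val ssc = new StreamingContext(sc, Seconds(3))  // process every 3 seconds

// The validation data is fed in through a queueStream
val queue = new mutable.Queue[RDD[Vector]]()
val points = ssc.queueStream(queue)

// The streaming task: distance to the centroid vs. the threshold
points.map { p =>
  val dist = math.sqrt(Vectors.sqdist(p, centroid))
  (p, if (dist > threshold) "anomaly" else "normal")
}.print()

ssc.start()
ssc.awaitTermination()
```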
Notes
Currently the application reads the input data from a local file. In an ideal situation, the program would read the data from an ingestion tool such as Kafka (a sketch follows below).
Also, the trained model (centroid and threshold) is saved in a local file. In production, that information should be saved into a database.
The output of the testing can be saved into a database for visualization.
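For the Kafka note above, a hedged sketch using the Spark 1.x direct-stream Kafka API (spark-streaming-kafka; the broker address and topic name are assumptions):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes a SparkContext `sc`; broker and topic names are hypothetical
val ssc = new StreamingContext(sc, Seconds(3))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("kdd-events")

// Each message value is one raw record, to be parsed into a feature
// vector and scored against the centroid and threshold as above
val lines = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  .map(_._2)
```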
More Details
https://github.com/keiraqz/anomaly-detection