Apache Eagle @ IEEE International Conference
EAGLE: User Profile-based Anomaly Detection for Securing Hadoop Clusters
01 NOV, 2015
CHAITALI GUPTA, RANJAN SINHA, YONG ZHANG
Outline
Why EAGLE?
Architecture of EAGLE
User Profiles in EAGLE
Experiments
Performance Results
Future Work
Big Data @ eBay
800M Listings *
159M Global Active Buyers *
*Q3 2015 data
7 Hadoop Clusters*
800M HDFS operations (single cluster)*
120 PB Data*
Motivation
Who is accessing the data?
What data are they accessing?
Is someone trying to access data that they don’t have access to?
Are there any anomalous access patterns?
Is there a security threat?
How can we monitor and get notified during, or prior to, an anomalous event?
ARCHITECTURE
[Figure: architecture diagram]
• Data Collector (HDFS, Audit, Security)
• Kafka
• Stream Processing Engine
• Metadata Manager
• Data Stores
• Remediation Engine (Apache Ranger, Custom module)
• Machine Learning Module
• Alerts, Activities
• Policy, Thresholds, User properties, ML Thresholds
• Real Time Alert Dashboard
• HDFS Archive
• Admin Console
• Insights, Metadata Management
• Security Analyst, Security Engineer
MACHINE LEARNING TRAINING MODULE
USER PROFILE ALGORITHMS
Density Estimation
• Compute mean and standard deviation
• Compute probability density estimation
• Detect an anomaly if the probability density is below the minimum probability density seen so far in the training set
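The three steps above can be sketched as follows. This is a minimal illustration, not Eagle's implementation: it assumes an independent Gaussian per feature, and the names `fit_density_profile`, `log_density`, and `is_anomaly` are hypothetical. Log densities are compared instead of raw densities, which gives the same decision because the logarithm is monotonic.

```python
import math

def log_density(profile, row):
    """Log of the product of independent per-feature Gaussian densities (assumed model)."""
    total = 0.0
    for x, mu, sigma in zip(row, profile["means"], profile["stds"]):
        z = (x - mu) / sigma
        total += -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))
    return total

def fit_density_profile(training_rows):
    """Compute per-feature mean and standard deviation, then the minimum training density."""
    n = len(training_rows)
    dims = len(training_rows[0])
    means = [sum(row[j] for row in training_rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((row[j] - means[j]) ** 2 for row in training_rows) / n
        stds.append(max(math.sqrt(var), 1e-6))  # floor avoids division by zero
    profile = {"means": means, "stds": stds}
    # Threshold: the lowest (log) density observed on the training set itself.
    profile["min_log_density"] = min(log_density(profile, r) for r in training_rows)
    return profile

def is_anomaly(profile, row):
    """Flag a point whose density falls below the minimum seen in training."""
    return log_density(profile, row) < profile["min_log_density"]
```

For example, a profile fitted on per-minute counts such as `[[10, 2], [12, 3], [11, 2]]` would flag a row like `[500, 90]`, while any row from the training set is never flagged.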
USER PROFILE ALGORITHMS…
Eigen Value Decomposition
• Compute mean and variance
• Compute Eigen Vectors and determine Principal Components
• Normal data points lie near first few principal components
• Abnormal data points lie further from first few principal components and closer to later components
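A rough sketch of the idea above, under simplifying assumptions: only the first principal component is kept, it is found by power iteration on the covariance matrix, and a point's distance from that component is used as its abnormality score. The function names are hypothetical and this is not Eagle's code. Power iteration can fail to converge if the starting vector happens to be orthogonal to the top eigenvector.

```python
import math

def top_principal_component(rows, iterations=200):
    """Return (mean, unit eigenvector for the largest eigenvalue of the covariance)."""
    n, dims = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - mean[j] for j in range(dims)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dims)]
           for i in range(dims)]
    vec = [1.0 / math.sqrt(dims)] * dims  # arbitrary unit starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break  # zero covariance: keep the current vector
        vec = [v / norm for v in nxt]
    return mean, vec

def residual_distance(mean, component, row):
    """Distance from a point to the line spanned by the principal component."""
    centered = [x - m for x, m in zip(row, mean)]
    proj = sum(c * v for c, v in zip(centered, component))
    residual = [c - proj * v for c, v in zip(centered, component)]
    return math.sqrt(sum(r * r for r in residual))
```

Points that follow the dominant pattern in the training data get a small residual; points that break the pattern get a large one and can be compared against a threshold chosen from the training residuals.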
USER PROFILE ARCHITECTURE
EXPERIMENTAL METHODOLOGY
User Population
• 1,500 eBay users accessing Hadoop clusters
Features
• HDFS operation frequencies aggregated across one-minute intervals
• Examples
• Command frequencies
• Time of the job
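One way to picture the one-minute aggregation is the sketch below. The event layout `(epoch_seconds, user, command)` and the function names are assumptions made for illustration; the slides do not describe the actual record format.

```python
from collections import Counter, defaultdict

def aggregate_per_minute(events):
    """Group (epoch_seconds, user, command) events into per-user, per-minute command counts."""
    buckets = defaultdict(Counter)
    for epoch_seconds, user, command in events:
        minute = epoch_seconds // 60  # index of the one-minute window
        buckets[(user, minute)][command] += 1
    return buckets

def to_feature_vector(counter, commands):
    """Turn one window's counts into a fixed-order frequency vector."""
    return [counter.get(c, 0) for c in commands]

# Hypothetical sample events for illustration only.
events = [
    (0, "alice", "open"),
    (30, "alice", "open"),
    (59, "alice", "delete"),
    (60, "alice", "open"),
    (10, "bob", "rename"),
]
buckets = aggregate_per_minute(events)
vector = to_feature_vector(buckets[("alice", 0)], ["open", "delete", "rename"])
```

Each resulting vector is one training or scoring point for a user's profile.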
EXPERIMENTAL METHODOLOGY…
Determine users who are behaviorally different
• Compute the Mahalanobis distance between users' data, using the mean and standard deviation
• Compute clusters
• For each user, use the behaviorally different users as the cross-validation set
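The selection of behaviorally different users might look like the sketch below. It assumes a diagonal covariance, so the Mahalanobis distance reduces to a per-feature distance scaled by the standard deviation, and it replaces the clustering step with a simple distance threshold. The profile layout, function names, and threshold are all hypothetical.

```python
import math

def diagonal_mahalanobis(point, mean, std):
    """Mahalanobis distance under an assumed diagonal covariance."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(point, mean, std)))

def behaviorally_different_users(target, profiles, threshold):
    """Users whose mean vector lies farther than `threshold` from the target's profile."""
    t_mean, t_std = profiles[target]
    return sorted(
        user for user, (mean, _std) in profiles.items()
        if user != target and diagonal_mahalanobis(mean, t_mean, t_std) > threshold
    )

# Hypothetical profiles: user -> (mean vector, standard deviation vector).
profiles = {
    "alice": ([10, 2], [2, 1]),
    "bob": ([11, 2], [2, 1]),
    "carol": ([40, 20], [5, 4]),
}
different = behaviorally_different_users("alice", profiles, threshold=3.0)
```

Data from the users returned here could then serve as the anomalous examples when validating the target user's profile.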
PERFORMANCE RESULTS
Sensitivity
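For reference, sensitivity is the fraction of actual anomalies that the detector flags, TP / (TP + FN). A small helper, with hypothetical names, makes the definition concrete:

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of actual anomalies that were flagged: TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no positive cases to evaluate")
    return true_positives / total
```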
FUTURE WORK
• Apache incubation releases
• Twitter feed: https://twitter.com/theapacheeagle
• Extend to Hive, HBase, Pig and other Big Data technologies
• Explore alternative algorithms
• Consider more features
APACHE EAGLE - OPEN SOURCE
Eagle Site: http://goeagle.io
Tech Blog: http://www.ebaytechblog.com
GitHub Repo: https://github.com/eBay/Eagle
Apache Incubator Project: Oct 26, 2015
Thank You!