ELK for Sysadmins
ELK: Not just for InfoSec any more
Who are we?
Russel Havens
● Monitoring Architect for Adobe Digital Marketing Business Unit
● Adjunct Professor, BYU
● 25 years in IT Operations, 2 years consulting, 5 years in software development
Hayden Panike
● Former log analytics undergraduate researcher
● Storage Engineer
Tanner Lund
● Former log analytics undergraduate researcher
● Service Engineer
ELK
● As a developer, I write logs
● As an operations engineer, I live in the log files while troubleshooting
● However: in most organizations where I’ve worked, formal log aggregation and analysis is owned by InfoSec, and access to those tools (Splunk or Syslog) was limited
ELK background and approach
● Monitoring is a broad area
  ○ Up/Down
  ○ Historical trending
  ○ Log Analysis
● Using ELK for managing logs of 60+ Nagios servers, web servers, etc.
● Worked with a BYU Capstone team 2013-2014
ELK approach
Hayden & Tanner
- Took over log management project from previous team
- Expanded partnership with BYU OIT to gather logs from servers, SANs, and network equipment (including Wifi hotspots)
- Expanded cluster size massively
Our ELK History
We do what we must...
Partnership with BYU OIT
Production ELK Deployment
Every 60 seconds at BYU we ingest event logs:*
- 2,431 network
- 430,000 IDS
- 59,000 wireless
62,300,000 total a day*
* This represents peak numbers.
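To put the ingest figures in per-second terms, here is a rough back-of-the-envelope sketch. It reads the three per-minute figures as peak rates and takes the daily total as stated; the variable names are illustrative and not part of the deployment.

```python
# Figures copied from the slide; treating the per-minute numbers as peaks
# is an assumption based on the footnote.
PEAK_PER_MINUTE = {"network": 2_431, "IDS": 430_000, "wireless": 59_000}
DAILY_TOTAL = 62_300_000

SECONDS_PER_DAY = 24 * 60 * 60

# Combined peak rate across the three sources.
peak_per_minute = sum(PEAK_PER_MINUTE.values())
peak_per_second = peak_per_minute / 60

# Average rate implied by the daily total.
avg_per_second = DAILY_TOTAL / SECONDS_PER_DAY

print(f"peak:    {peak_per_minute:,} events/min, about {peak_per_second:,.0f} events/s")
print(f"average: about {avg_per_second:,.0f} events/s over a day")
```

The gap between the peak and average rates is the kind of headroom a cluster has to be sized for.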
Simplest Architecture
Common Architecture
Highly Available Enterprise Architecture
This process ought to be formalized
Proactive monitoring
Logs and Enterprise Monitoring
Monitoring as a Discipline
SOURCES OF DATA
- SNMP
- stdout
- /proc *stat
- Logs
SYSTEM PIECES
- Collector agents
- Aggregator nodes
- Analysis platform
PRINCIPLES
- Aggregation
- Cause Analysis (reactive)
- Behavior Analysis (proactive)
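The three system pieces can be pictured as a small pipeline. The sketch below is a toy illustration only, not how Logstash or Elasticsearch work internally: the `host LEVEL message` line format, the function names, and the sample lines are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical line format: "<host> <LEVEL> <message>"
LINE_RE = re.compile(r"^(?P<host>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")

def collect(lines):
    """Collector agent: turn raw lines into structured events, skipping unparseable ones."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            yield match.groupdict()

def aggregate(*streams):
    """Aggregator node: merge the events from several collectors into one list."""
    merged = []
    for stream in streams:
        merged.extend(stream)
    return merged

def errors_per_host(events):
    """Analysis platform: count ERROR events per host."""
    return Counter(e["host"] for e in events if e["level"] == "ERROR")

# Made-up sample input from two hosts.
web_lines = ["web01 ERROR out of memory", "web01 INFO request served"]
nagios_lines = ["nagios01 ERROR check timed out", "nagios01 ERROR check timed out", "garbage"]

events = aggregate(collect(web_lines), collect(nagios_lines))
error_counts = errors_per_host(events)
print(error_counts)
```

Counting errors per host is cause analysis in the reactive sense; tracking how those counts change over time is a first step toward the proactive kind.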
Queries: Simple Yet Elegant
● As operations engineers, we already know lots of keywords. These can be easily leveraged for simple queries.
● error
● failure
● root
● port flapping
● memory
Simple Yet Elegant
● A simple search for the word “memory” over a 30-day period.
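A keyword search like this one can also be written as an Elasticsearch query body. The sketch below only builds the JSON and does not contact a cluster. It assumes the default Logstash `@timestamp` field and uses the `bool`/`filter` form found in current Elasticsearch releases; older releases from the era of this talk used a different `filtered` query syntax. The `keyword_query` helper is a hypothetical name.

```python
import json

def keyword_query(keyword, days=30, time_field="@timestamp"):
    """Build a search body: match `keyword`, restricted to the last `days` days."""
    return {
        "query": {
            "bool": {
                # Free-text match on the keyword across the default fields.
                "must": [{"query_string": {"query": keyword}}],
                # Date math: everything newer than "now minus N days".
                "filter": [{"range": {time_field: {"gte": f"now-{days}d"}}}],
            }
        }
    }

body = keyword_query("memory")
print(json.dumps(body, indent=2))
```

A multi-word keyword such as port flapping would normally be quoted as a phrase, for example `keyword_query('"port flapping"')`.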
Simple search
● Nagios 4 can be set for a maximum number of concurrent processes. Here we see 2 overloaded monitoring servers.
Simple Bin and Count
● Campus wide phone OS stats
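Bin and count means grouping events by a field value or a time bucket and counting each group. The sketch below does this in plain Python to show the idea; in ELK the same result would typically come from a Kibana visualization or an Elasticsearch aggregation. The field names and sample records are invented for illustration and are not BYU data.

```python
from collections import Counter
from datetime import datetime

def count_by_field(events, field):
    """Bin events by the value of `field` and count each bin."""
    return Counter(e[field] for e in events if field in e)

def count_per_hour(events, time_field="timestamp"):
    """Bin events into one-hour buckets and count each bucket."""
    return Counter(
        datetime.fromisoformat(e[time_field]).strftime("%Y-%m-%d %H:00")
        for e in events
    )

# Made-up sample records.
events = [
    {"timestamp": "2014-03-01T09:15:00", "os": "iOS"},
    {"timestamp": "2014-03-01T09:40:00", "os": "Android"},
    {"timestamp": "2014-03-01T10:05:00", "os": "iOS"},
]

by_os = count_by_field(events, "os")
by_hour = count_per_hour(events)
print(by_os)
print(by_hour)
```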
Simple Bin and Count
Business School
Administrative Building
- We are at the tip of a transformative iceberg
- Machine Learning and Statisticians needed
A Call to Arms!