TRANSCRIPT
Copyright © 2014 Splunk Inc.
Kate Lawrence-Gupta, Principal Engineer, Comcast; Joe Cramasta, Sr. Engineer, Comcast
Taming the Beast: Managing Splunk for X1 @ Comcast
Copyright © 2014 Comcast
Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Lineup
! Who are Kate and Joe?
! Outline how Splunk operations have grown over the past 2 years to support the X1 platform national customer ramp-up
! Touch on some of the operational best practices we've used to accommodate a large installation in terms of both volume and high search load
! Deep dive into 2 operational problems and the technical solutions designed and deployed
Kate
! Principal Engineer responsible for multiple Splunk installations at Comcast
! Manages a dedicated team providing Splunk as an operational service to hundreds of internal customers, including developers, execs, and other operational support teams
! 13 years of experience in operations, monitoring, and systems administration
! Splunk user since ~2006, initially for Nagios reporting – Revolution Award winner at Splunk .conf2013!
Joe
! Senior Engineer @ Comcast Cable
! Has been with Comcast for 7 years, working on multiple projects
! Kate's my boss :)
! Started using Splunk in 2009 to get visibility into a Microsoft Hosted Exchange environment
Comcast Overview
! Global media and technology company consisting of Comcast Cable and NBCUniversal
! Comcast Cable: Nation's largest video, high-speed internet, and phone provider under the XFINITY brand
! NBCUniversal: One of the world's leading media and entertainment companies
! Company facts:
– $64 billion revenue (2013 financial report)
– 125,000+ employees
Then…
! Splunk was initially deployed for statistical analysis of application logs, mainly as a development solution
! Volume was low and performance slow
! Supported roughly 50 internal users
! 1 staffer dedicated about 75% of the time
! Ran Splunk on virtualized indexers and search heads w/NFS-backed storage
Now…
! Splunk is deployed globally across ALL services and is considered "critical path" for both monitoring and development
! Indexed volume has jumped 12x
! Supporting roughly 250+ users and dozens of automations
! 5 staffers dedicated 100% of the time
! Splunk runs on dedicated hardware and storage across multiple datacenters
! 99.99% uptime and less than 5 seconds of indexing latency
Best Practices
! Use source control to enable tracking, roll-back plans, and change management for all of our deployments
! Standardize rules around using Splunk by setting limits and quotas
! Normalize alerts for easy organization and tracking
! Use a central search head for "management" of your Splunk environment
– Peer this search head to your other search heads and your indexers
– Track and report on your license usage and alert on "hosts gone wild"
– Measure your installation – set Key Performance Indicators (KPIs) and track your growth and capacity
Big Wins
Organized forwarders across production
Increased uptime to 99.99%
Support from Management
Got Creative!
Made Splunk a consistently reliable method to get data fast and efficiently
"The only way to track all logs is with Splunk"
Encouraged internal departments to add data into Splunk through:
• Training
• Brown-bag sessions
• Evangelism
• Begging
We've encountered some unique problems with the sheer size and complexity of our installation, which forced us to come up with our own solutions
Problems of Scale:
Joe and I have chosen 2 to deep dive into…
Problem #1 – Collecting New Logs
! No timestamp/line-breaking consistency in the logs that are created
! Almost every time we need to collect a new log file, an indexer configuration change is needed to accommodate the timestamp/line-breaking settings
! Results in us having to restart the Splunk service, which impacts search and alerting
! This downtime requires change control approval and can only be performed once a week, on Saturdays
Normally…

inputs.conf on forwarder

[monitor:///opt/eddie/log.txt]
sourcetype = eddie

[monitor:///opt/clark/log.txt]
sourcetype = clark

[monitor:///opt/rusty/log.txt]
sourcetype = rusty

props.conf on indexer

[source::/opt/eddie/log.txt]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false

[source::/opt/clark/log.txt]
TIME_PREFIX = ^\[
TIME_FORMAT = %F %H:%M:%S,%3N
LINE_BREAKER = ([\r\n]+)\[
SHOULD_LINEMERGE = false

[source::/opt/rusty/log.txt]
SHOULD_LINEMERGE = false
TIME_FORMAT = %a %b %d %H:%M:%S
TIME_PREFIX = ^
Wouldn't it be nice…
What we noticed is that many of the new props.conf stanzas we create are for log formats that already exist but are matched under some other source file.
What if we could leverage a pre-existing props.conf stanza to handle the timestamp/line-breaking recognition?
We still need to be able to define unique sourcetypes for all of our logs, and we can't modify the source of where the logs came from.
The Solution: Imposter
First we needed to look at all of our current props.conf stanzas and de-dupe them, leaving us with a list of stanzas that each handle parsing a unique log format. With this list, we then changed all of the stanza names to match on a wildcarded sourcetype name.

Before:

[source::/opt/eddie/log.txt]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false

[source::/opt/clark/log.txt]
TIME_PREFIX = ^\[
TIME_FORMAT = %F %H:%M:%S,%3N
LINE_BREAKER = ([\r\n]+)\[
SHOULD_LINEMERGE = false

[source::/opt/rusty/log.txt]
SHOULD_LINEMERGE = false
TIME_FORMAT = %a %b %d %H:%M:%S
TIME_PREFIX = ^

After:

[(?::){0}*-STANZA_1]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false

[(?::){0}*-STANZA_2]
TIME_PREFIX = ^\[
TIME_FORMAT = %F %H:%M:%S,%3N
LINE_BREAKER = ([\r\n]+)\[
SHOULD_LINEMERGE = false

[(?::){0}*-STANZA_3]
SHOULD_LINEMERGE = false
TIME_FORMAT = %a %b %d %H:%M:%S
TIME_PREFIX = ^
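Splunk treats a stanza name containing `(?::){0}` as a pattern, so the trailing `*-STANZA_n` part matches any sourcetype ending in that suffix, roughly like a filename glob. A minimal Python sketch of that matching behavior (the sourcetype names are the hypothetical examples from these slides, and `fnmatch` only approximates Splunk's stanza matching):

```python
from fnmatch import fnmatch

def matching_stanza(sourcetype, patterns):
    """Return the first wildcard pattern the sourcetype matches, if any.

    This approximates how a wildcarded props.conf stanza like
    [(?::){0}*-STANZA_1] selects sourcetypes ending in "-STANZA_1".
    """
    for pattern in patterns:
        if fnmatch(sourcetype, pattern):
            return pattern
    return None

# The suffixed sourcetypes assigned on the forwarders:
for st in ["eddie-STANZA_3", "clark-STANZA_1", "rusty-STANZA_2"]:
    print(st, "->", matching_stanza(st, ["*-STANZA_1", "*-STANZA_2", "*-STANZA_3"]))
```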
The Solution: Imposter
Now all we needed to do was append to the end of our sourcetype name the stanza name that matched our log format.

props.conf

[(?::){0}*-STANZA_1]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false

[(?::){0}*-STANZA_2]
TIME_PREFIX = ^\[
TIME_FORMAT = %F %H:%M:%S,%3N
LINE_BREAKER = ([\r\n]+)\[
SHOULD_LINEMERGE = false

[(?::){0}*-STANZA_3]
SHOULD_LINEMERGE = false
TIME_FORMAT = %a %b %d %H:%M:%S
TIME_PREFIX = ^

inputs.conf

[monitor:///opt/eddie/log.txt]
sourcetype = eddie-STANZA_3

[monitor:///opt/clark/log.txt]
sourcetype = clark-STANZA_1

[monitor:///opt/rusty/log.txt]
sourcetype = rusty-STANZA_2
The Solution: Imposter
Tomorrow, if Eddie, Clark, and Rusty all decide they are going to start using some other log format, all we need to do is change the matching stanza name on the forwarder.

props.conf

[(?::){0}*-STANZA_1]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false

[(?::){0}*-STANZA_2]
TIME_PREFIX = ^\[
TIME_FORMAT = %F %H:%M:%S,%3N
LINE_BREAKER = ([\r\n]+)\[
SHOULD_LINEMERGE = false

[(?::){0}*-STANZA_3]
SHOULD_LINEMERGE = false
TIME_FORMAT = %a %b %d %H:%M:%S
TIME_PREFIX = ^

inputs.conf

[monitor:///opt/eddie/log.txt]
sourcetype = eddie-STANZA_3

[monitor:///opt/clark/log.txt]
sourcetype = clark-STANZA_1

[monitor:///opt/rusty/log.txt]
sourcetype = rusty-STANZA_2
But Now My Sourcetypes Are Named Funny :(
With a little transform magic, we are able to strip off the assigned stanza name that we added to the sourcetype field, leaving us with the original sourcetype name as if no funny business ever happened.

props.conf

[(?::){0}*-STANZA_1]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false
TRANSFORMS-changeSourceType = stripitSourcetype

transforms.conf

[stripitSourcetype]
SOURCE_KEY = MetaData:Sourcetype
REGEX = (.*)-STANZA_\w+
FORMAT = $1
DEST_KEY = MetaData:Sourcetype
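The transform's capture-and-replace logic can be sanity-checked outside Splunk. This small Python sketch only mimics what the `stripitSourcetype` regex does to the sourcetype name, not Splunk's actual index-time pipeline or its internal metadata key format:

```python
import re

# The same pattern the transforms.conf stanza uses: capture everything
# before the "-STANZA_n" suffix and keep only the captured part.
STRIP = re.compile(r"(.*)-STANZA_\w+")

def strip_stanza_suffix(sourcetype):
    """Restore the original sourcetype name; leave unsuffixed names alone."""
    m = STRIP.match(sourcetype)
    return m.group(1) if m else sourcetype

print(strip_stanza_suffix("eddie-STANZA_3"))  # -> eddie
```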
Problem #2 – Distributed Deployments
Problem:
! One of our Splunk implementations required deploying Splunk into data centers where our Operations group wouldn't manage the forwarders, network infrastructure, or Access Control Lists
! Traffic would be restricted to within the datacenter in many cases
! This left us with a scenario where forwarders and indexers would only be able to communicate with a local deployment server
Features Needed
Key features needed:
! A way to keep local deployment servers in sync with a master copy
! Could be integrated easily with the previous source-control process (Git)
! Regions could be managed together and separately, based on different change windows
! Would minimize human error and manual copying of files
! Would have reporting and logging
! Have a cool code-name… Jenga!
Overview
! 1st, we store all of our config files in a Git repo
! Then these configs sync down to the deployment servers every 10 minutes
! Through a web UI we can:
– View all the serverclasses
– Check that a region/deployment server has been updated to match the Git production copy
– Reload 1 or multiple classes at a time
Jenga… Let's Go a Little Deeper
! A Git repo stores the files approved for production in a master "golden" branch
! Jenga Agent
– Pulls down the production config files from the master branch with a git fetch
– Rsyncs them to the default deployment directory
– Updates and reloads the serverclass.conf
– Updates a text file on the Deployment Server with the latest HASH from the production Git repo
– Publishes that HASH through a custom REST API endpoint that can be read by the Jenga UI
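One sync cycle of the agent could be sketched as below. This is only an illustration of the sequence described in the slide; the repo path, deployment directory, remote/branch names, and the exact `splunk reload deploy-server` invocation are assumptions for the sketch, not the actual Jenga code:

```python
import subprocess

# Illustrative locations -- the real Jenga agent's paths are not shown
# in the talk, so these are assumptions.
REPO_DIR = "/opt/jenga/repo"
DEPLOY_DIR = "/opt/splunk/etc/deployment-apps"

def build_sync_commands(repo_dir=REPO_DIR, deploy_dir=DEPLOY_DIR):
    """Return the shell commands one sync cycle would run, in order."""
    return [
        # 1. Pull the approved configs from the master ("golden") branch.
        ["git", "-C", repo_dir, "fetch", "origin", "master"],
        ["git", "-C", repo_dir, "checkout", "origin/master"],
        # 2. Rsync them into the deployment server's default directory.
        ["rsync", "-a", "--delete", repo_dir + "/", deploy_dir + "/"],
        # 3. Reload serverclass.conf so deployment clients pick up the change.
        ["/opt/splunk/bin/splunk", "reload", "deploy-server"],
        # 4. Record the production HASH so the UI can compare versions.
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
    ]

def run_sync(dry_run=True):
    """Execute one cycle; with dry_run=True just return the planned commands."""
    cmds = build_sync_commands()
    for cmd in cmds:
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds
```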
Jenga UI
! The UI lists all the classes available per region by looping through the REST API on the Deployment Server and publishing them as options to the user
! Calls the Deployment Server's custom REST API endpoint for the latest known Local HASH
! Calls the production Git repo for the Current Production HASH
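The UI's sync check reduces to comparing those two hashes. A trivial sketch of that decision (the function name and status labels are illustrative, not taken from Jenga):

```python
def region_status(local_hash, production_hash):
    """Label a region the way a UI like Jenga's might, next to its reload button."""
    return "in sync" if local_hash == production_hash else "out of date"
```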
Jenga Workflow
! Joe
– Modifies an inputs.conf that is already deployed
– Checks the changes into a new Git branch
– And pushes that branch to our Git repo
! I review the changes and, if they are OK, merge them into the master branch
! A new HASH is created for this revision and is published to the Jenga UI as the CURRENT version of production
! Joe, with an approved window, goes to the Jenga UI
– Chooses the region(s) he wants
– Reloads the class(es) once he sees that the Local and Current HASHes are in sync
Q&A / Contact Info
Kate_Lawrence-Gupta (splunk.com) | Email – Kate_Lawrence-[email protected]
Joe Cramasta (splunk.com) | Email – [email protected]
Special Offer: Try Splunk MINT Express for Free!
Splunk MINT offers a fast path to mobile intelligence. How fast? Find out with a 6-month trial*
• Register for your free trial: http://mint.splunk.com/conf2014offer
• Download the Splunk MINT SDKs
• Add the Splunk MINT line of SDK code and publish**
• Start getting digital intelligence at your fingertips!
*Offer valid for .conf2014 attendees and coworkers of attendees only.
**Trial allows monitoring of up to 750,000 monthly active users (MAUs).
THANK YOU