data science ops in practice...the role of data science in cyber operations 3 •the rate of data...
TRANSCRIPT
DATA SCIENCE OPS IN PRACTICELearn How Splunk Enables Fast Science for Cybersecurity Operations
OLISASTEPHENSBAILEYDAVIDBRENMANSEPTEMBER2017
Innovationcenter,Washington,D.C.
SECTION1:UNDERSTANDINGTHECORENEED
SECTION2:CROSSINGTHEANALYSISCHASM
SECTION3:ANALYSISWORKFLOWDEMONSTRATION
SECTION4:ACTIONITEMSFORYOURPROJECTS
AGENDA
1
LEARN HOW TO:
ADDRESS CULTURAL CHALLENGES
ENSURE YOUR DATA SCIENCE SOLUTIONS GET USED
HARNESS THE FULL POWER OF PYTHON WITHIN SPLUNK
DATA SCIENCE OPS IN PRACTICE
UNDERSTANDING THE CORE NEED
2
THE ROLE OF DATA SCIENCE IN CYBER OPERATIONS
3
• Therateofdatagrowthisoutpacinghumancapabilities
• Wemustoptimizeimpactofthepeoplewedohave
• DataScienceisapowerfultooltoreducethescaleoftheproblem
• Inresponsetotheseneeds,BoozAllenHamiltonwastaskedwithintegratingDataScienceintotheWatchfloor
[1]
CYBER OPERATIONS ANALYSTS & DATA SCIENTISTS POINTS OF VIEW
• Areevaluatedonquantityofoutput• HaveaclearlydefinedSOP• Willloseproductivityeverytimetheyinvestinlearninganewtool• Donotneednewtoolstobeeffective• Areleeryofbuggyprototypecode• HaveadistrustoftheblackboxMachineLearningalgorithm
• LiketounderstandwhattheAnalystistryingtodoratherthanfitexistingsolutiontoproblem• Areevaluatedondevelopmentofnovelmethods• Gainhonorandreputationfromimplementingcuttingedgealgorithms• Donotlikesupportinglegacysoftware• Haveanunwaveringtrustinmathematics
4
CyberOperationsAnalysts DataScientists
Imustmeetmyquota,Idon’thavetimefortoys
Theoldwayisoutofdate,wemustimprove
APPRECIATING YOUR ROLE FOUNDATIONAL KEY TO SUCCESS
• Themostimportantlessonlearned
• Analystsareinapowerposition:-Theyareneeded-Theyownthedomainknowledge-Theyownthetradecraft-Theyowntheaccesses-Theyownthedata
• ItistheresponsibilityoftheDataScientisttoshowrespectandlearn-TheDataScientistisintrudingintotheAnalyst'sdomain
5
AnalystsarefullycapableofmeetingtheircurrentobjectiveswithoutDataScience
[2]
CROSSING THE ANALYSIS CHASM
6
BRIDGING THE GAP BETWEEN ANALYSTS & DATA SCIENTISTS IN OPERATIONS
• ManyAnalystsdonotunderstandappliedstatisticsormachinelearninganddonotunderstandhowitcanbeappliedtotheirdomain
• DataScientistswishingtomakeanimpactshould:-Minimizethenumberofnewwidgetsananalystneedstolearn-Provideallresultswithmeaningfulsupportingevidence-Weightclarityasmuchasperformanceinalgorithmselection-Appreciatethatreportingtherearenoresultsisfarbetterthanfalsepositives
• Hostyourend-solutionsinthetoolenvironmenttheyuse
7
MinimizeNumberofTools
ProvideEvidence
EnsureInterpretability
SilenceIsaVirtue
IfAnalystsUseSplunk,YouUseSplunk
LEVERAGING THE POWER & FLEXIBILITY WITH PYTHON & SPLUNK
8
• Pros-Providesdeveloperswithaccesstowidearrayofdataprocessinglibraries-Object-Orientedprogramdesign-Rapidprototypescriptinglanguage• Cons-Mustbeabletocode-Developedprojectstendtobeindividualobjects- Steeplearninggapforusers
• Pros- Singleunifiedsystemforcollecting,digestingandqueryingdata-Attractive2Dplotting-Usersabletoseamlesslynavigatetorawdatabehindplots
• Cons-Querylanguagenarrowsfindings- Lacksflexibilityofprograminglanguage- LimitedpythonlibrarywithinSDK
Python Splunk
CombinethedevelopmentflexibilityofPythonwiththeconsistencyofSplunktobenefitAnalysts
STEP #1 - WORK DIRECTLY WITH ANALYSTS TO SOURCE A USE CASE
• OurDataScienceteamworksdirectlywithAnalyststoworktogetheronanalyticobjectives-Toidentifymaliciousoraberrantbehaviorwithinanewbatchoflogdata-TodetectsuspiciousURLs
• Theirworkflowconsistedof:1. DigestlogfilesintoSplunk2. Labelfields3. ExplorethedatawithSMEsandviaSplunkqueries4. ReportanynewSplunkqueriesofvalue
9
WeexpediteAnalysts’Splunkingby• Groupingsimilarobservations• Highlightingsuspiciousoutliers• Unlockingnewfeatures [4]
STEP #2 – SELECT METHOD FOR INTEGRATING DATA SCIENCE CAPABILITIES
10
METHOD1• Thismethodhasprovencapableinrapiddeliverysituations• IdentifyalinkingfieldandexportthedataoutofSplunk• Processthedatawithany DataScienceSoftware• CreateanewCSVandusepreviouslinkingfieldtoenrichoriginaldata
RawData
DataFormatted&Indexed
DataExportedto
CSV
RunAnySoftwareApplication
PrintCSVWithLinking
Field
IdentifyLinkingFiled
ImportCSVasLookupTable
RunSplunkProcessingQuery
EnrichedDataInReadyFor
Use
ExternalSoftware
Splunk
ImportAny
Libraries
STEP #2 – SELECT METHOD FOR INTEGRATING DATA SCIENCE CAPABILITIES
11
METHOD2• Slowertosetupfirsttime,buthighlyeffectiveafterthat• UseyourownPythonenvironment• Abletoleverageanylibrary;Scikit-Learn,TensorFlow,Theano,Scrapy,etc.
RawData
DataFormatted&Indexed
RunAnySoftwareApplication
YourAppReturnsResultstoSplunk
RunStandardSplunkQueries
ExternalPython
Splunk
ImportAny
Libraries
CallYourSplunk/Python
App
YourAppStartsExternalPython
Session
STEP #3 – EXECUTE MACHINE LEARNING ALGORITHM DEVELOPMENT PROCESS
12
DataCollection&Aggregation
Splunkmakesiteasy!
RawData
RawData
RawData
Pre-Processing&Cleaning
FeatureExtraction&Vectorization
Externalsoftwareneededfor
advancedfeaturecalculations
ApplyMLAlgorithm
PostAnalysisofResults
Splunkreallyshineswhenitcomestimetopresentyour
results
• SplunkisapowerfulassetinmanystagesoftheMachineLearningprocess
ANALYSIS WORKFLOW DEMONSTRATION
13
LOOK FAMILIAR?
14
STEP #4 – SHOW EVIDENCE TO SUPPORT ANALYSIS RESULTS
15
THENOTORIOUSBLACKBOX
JUSTBELIEVEME‘CAUSEI’MAWESOME!
BEFORE BETTER APPS…
16
ClassicWireshark Good‘Ol Excel
OUR NEW FEATURE EXTRACTION APPLICATION BRINGS NEW INSIGHTS TO ANALYSIS
New Stream AppFeatureExamples– AvoidBasicSummaryTableOverhead
Avg IP,port,time
Statistical sum(bytes),sum(bytes_in),sum(bytes_out),sum(packets_in),sum(packets_out),sum(response_time),sum(time_taken)
17
OurNewFeature Examples- MakeBetterUseofMLToolkit
Numeric duration
Statistical num_bytes_cli2srv,num_bytes_srv2cli,num_packets_cli2srv,num_packets_srv2cli,packet_deltat_avg_cli2srv,packet_deltat_avg_srv2cli,packet_deltat_entropy_2way,packet_deltat_entropy_cli2srv,packet_deltat_entropy_srv2cli
Weadded46new
features!!!!
NEW STREAM APP ENABLES DIRECT ACCESS TO RAW PCAP IN SPLUNK
18
NEW STREAM APP GIVE ANALYSTS MORE INFORMATION
19
ML TOOLKIT ENABLES EXPLORATORY DATA ANALYSIS IN SPLUNK
20
STOCK SPLUNK ML TOOLKIT HAS LIMITED FEATURES AVAILABLE FOR ANALYSIS
21
90%ofMLisPre-Processing&FeatureExtractionCraftingFeaturesisNecessaryBeforeFeedingTheMLTK
DATA SCIENTISTS CAN ADD NEW FEATURES DIRECTLY INTO SPLUNK FOR EDA
22
USER EXPERIENCE AND SUPPORTING EVIDENCE FOR DATA SCIENTISTS
23
USER EXPERIENCE AND SUPPORTING EVIDENCE FOR ANALYSTS
24
LIVE DEMO
25
ACTION ITEMS FOR YOUR PROJECTS
26
CULTURAL HURDLES & SUCCESSES
• Tacticsusedtoovercomeculturalbarriers- Youmustgototheanalyst;theywillshowyoutheiranalysisprocessANDgrantyoukeystotheirdatatroves
- Youmustbewillingtoexplainwhatanalysistechniquesyouareusingsimplyusingtheirterminologyasmuchaspossible
- Someoneonyourteamhastobewillingtotalktothecustomersandtheircustomers- thishelpsestablishanew,collaborativetribe
- Yourworkmustroleupintoastorythattellsthewhyandsowhatofthework- sometimesthisistheclosestonegetstoROI
- Marketing&brandingextremelyimportantforbreakingentrenchedthinkingandcoaxingparticipationtosomethingnew&shiny
• Buildaninterdisciplinaryteam- Unicornsarehardtofindandthebestsolutionsoftenareaproductofdivergentthought
- Dataanalysisisapipeline,journeyofsorts…ittakesdomainexpertsfromfieldsotherthanjustcomputerscienceormathematics
- HavingdatascientiststhathaveexpertiseinCyberOperationsmissionspacewillacceleratesuccess
27
FOUR STEPS TO APPLYING DATA SCIENCE WITHIN CYBER OPERATIONS
• STEP#1- WORKDIRECTLYWITHANALYSTSTOSOURCEAUSECASE
• STEP#2– SELECTMETHODFORINTEGRATINGDATASCIENCECAPABILITIES
• STEP#3– EXECUTEMACHINELEARNINGALGORITHMDEVELOPMENTPROCESS
• STEP#4– SHOWEVIDENCETOSUPPORTANALYSISRESULTS
28
TAKE AWAYS
1) Yourdatascienceteammustgototheanalyst
2) Populateyourresultswheretheuserchecks
3) Developself-containedlimitedsizeproductsthatcanbeiterativelyupdatedanddelivered
4) DataScientistsmustbeconcernedwithjustifyingtheirclaims
5) Splunkcanbeenhancedbyleveragingexternalscripting
29
INNOVATING THE CYBER DOMAIN THROUGH THE APPLICATION OF DATA SCIENCE
30