data science ops in practice...the role of data science in cyber operations 3 •the rate of data...

Post on 04-Sep-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

DATA SCIENCE OPS IN PRACTICELearn How Splunk Enables Fast Science for Cybersecurity Operations

OLISASTEPHENSBAILEYDAVIDBRENMANSEPTEMBER2017

Innovationcenter,Washington,D.C.

SECTION1:UNDERSTANDINGTHECORENEED

SECTION2:CROSSINGTHEANALYSISCHASM

SECTION3:ANALYSISWORKFLOWDEMONSTRATION

SECTION4:ACTIONITEMSFORYOURPROJECTS

AGENDA

1

LEARN HOW TO:

ADDRESS CULTURAL CHALLENGES

ENSURE YOUR DATA SCIENCE SOLUTIONS GET USED

HARNESS THE FULL POWER OF PYTHON WITHIN SPLUNK

DATA SCIENCE OPS IN PRACTICE

UNDERSTANDING THE CORE NEED

2

THE ROLE OF DATA SCIENCE IN CYBER OPERATIONS

3

• Therateofdatagrowthisoutpacinghumancapabilities

• Wemustoptimizeimpactofthepeoplewedohave

• DataScienceisapowerfultooltoreducethescaleoftheproblem

• Inresponsetotheseneeds,BoozAllenHamiltonwastaskedwithintegratingDataScienceintotheWatchfloor

[1]

CYBER OPERATIONS ANALYSTS & DATA SCIENTISTS POINTS OF VIEW

• Areevaluatedonquantityofoutput• HaveaclearlydefinedSOP• Willloseproductivityeverytimetheyinvestinlearninganewtool• Donotneednewtoolstobeeffective• Areleeryofbuggyprototypecode• HaveadistrustoftheblackboxMachineLearningalgorithm

• LiketounderstandwhattheAnalystistryingtodoratherthanfitexistingsolutiontoproblem• Areevaluatedondevelopmentofnovelmethods• Gainhonorandreputationfromimplementingcuttingedgealgorithms• Donotlikesupportinglegacysoftware• Haveanunwaveringtrustinmathematics

4

CyberOperationsAnalysts DataScientists

Imustmeetmyquota,Idon’thavetimefortoys

Theoldwayisoutofdate,wemustimprove

APPRECIATING YOUR ROLE FOUNDATIONAL KEY TO SUCCESS

• Themostimportantlessonlearned

• Analystsareinapowerposition:-Theyareneeded-Theyownthedomainknowledge-Theyownthetradecraft-Theyowntheaccesses-Theyownthedata

• ItistheresponsibilityoftheDataScientisttoshowrespectandlearn-TheDataScientistisintrudingintotheAnalyst'sdomain

5

AnalystsarefullycapableofmeetingtheircurrentobjectiveswithoutDataScience

[2]

CROSSING THE ANALYSIS CHASM

6

BRIDGING THE GAP BETWEEN ANALYSTS & DATA SCIENTISTS IN OPERATIONS

• ManyAnalystsdonotunderstandappliedstatisticsormachinelearninganddonotunderstandhowitcanbeappliedtotheirdomain

• DataScientistswishingtomakeanimpactshould:-Minimizethenumberofnewwidgetsananalystneedstolearn-Provideallresultswithmeaningfulsupportingevidence-Weightclarityasmuchasperformanceinalgorithmselection-Appreciatethatreportingtherearenoresultsisfarbetterthanfalsepositives

• Hostyourend-solutionsinthetoolenvironmenttheyuse

7

MinimizeNumberofTools

ProvideEvidence

EnsureInterpretability

SilenceIsaVirtue

IfAnalystsUseSplunk,YouUseSplunk

LEVERAGING THE POWER & FLEXIBILITY WITH PYTHON & SPLUNK

8

• Pros-Providesdeveloperswithaccesstowidearrayofdataprocessinglibraries-Object-Orientedprogramdesign-Rapidprototypescriptinglanguage• Cons-Mustbeabletocode-Developedprojectstendtobeindividualobjects- Steeplearninggapforusers

• Pros- Singleunifiedsystemforcollecting,digestingandqueryingdata-Attractive2Dplotting-Usersabletoseamlesslynavigatetorawdatabehindplots

• Cons-Querylanguagenarrowsfindings- Lacksflexibilityofprograminglanguage- LimitedpythonlibrarywithinSDK

Python Splunk

CombinethedevelopmentflexibilityofPythonwiththeconsistencyofSplunktobenefitAnalysts

STEP #1 - WORK DIRECTLY WITH ANALYSTS TO SOURCE A USE CASE

• OurDataScienceteamworksdirectlywithAnalyststoworktogetheronanalyticobjectives-Toidentifymaliciousoraberrantbehaviorwithinanewbatchoflogdata-TodetectsuspiciousURLs

• Theirworkflowconsistedof:1. DigestlogfilesintoSplunk2. Labelfields3. ExplorethedatawithSMEsandviaSplunkqueries4. ReportanynewSplunkqueriesofvalue

9

WeexpediteAnalysts’Splunkingby• Groupingsimilarobservations• Highlightingsuspiciousoutliers• Unlockingnewfeatures [4]

STEP #2 – SELECT METHOD FOR INTEGRATING DATA SCIENCE CAPABILITIES

10

METHOD1• Thismethodhasprovencapableinrapiddeliverysituations• IdentifyalinkingfieldandexportthedataoutofSplunk• Processthedatawithany DataScienceSoftware• CreateanewCSVandusepreviouslinkingfieldtoenrichoriginaldata

RawData

DataFormatted&Indexed

DataExportedto

CSV

RunAnySoftwareApplication

PrintCSVWithLinking

Field

IdentifyLinkingFiled

ImportCSVasLookupTable

RunSplunkProcessingQuery

EnrichedDataInReadyFor

Use

ExternalSoftware

Splunk

ImportAny

Libraries

STEP #2 – SELECT METHOD FOR INTEGRATING DATA SCIENCE CAPABILITIES

11

METHOD2• Slowertosetupfirsttime,buthighlyeffectiveafterthat• UseyourownPythonenvironment• Abletoleverageanylibrary;Scikit-Learn,TensorFlow,Theano,Scrapy,etc.

RawData

DataFormatted&Indexed

RunAnySoftwareApplication

YourAppReturnsResultstoSplunk

RunStandardSplunkQueries

ExternalPython

Splunk

ImportAny

Libraries

CallYourSplunk/Python

App

YourAppStartsExternalPython

Session

STEP #3 – EXECUTE MACHINE LEARNING ALGORITHM DEVELOPMENT PROCESS

12

DataCollection&Aggregation

Splunkmakesiteasy!

RawData

RawData

RawData

Pre-Processing&Cleaning

FeatureExtraction&Vectorization

Externalsoftwareneededfor

advancedfeaturecalculations

ApplyMLAlgorithm

PostAnalysisofResults

Splunkreallyshineswhenitcomestimetopresentyour

results

• SplunkisapowerfulassetinmanystagesoftheMachineLearningprocess

ANALYSIS WORKFLOW DEMONSTRATION

13

LOOK FAMILIAR?

14

STEP #4 – SHOW EVIDENCE TO SUPPORT ANALYSIS RESULTS

15

THENOTORIOUSBLACKBOX

JUSTBELIEVEME‘CAUSEI’MAWESOME!

BEFORE BETTER APPS…

16

ClassicWireshark Good‘Ol Excel

OUR NEW FEATURE EXTRACTION APPLICATION BRINGS NEW INSIGHTS TO ANALYSIS

New Stream AppFeatureExamples– AvoidBasicSummaryTableOverhead

Avg IP,port,time

Statistical sum(bytes),sum(bytes_in),sum(bytes_out),sum(packets_in),sum(packets_out),sum(response_time),sum(time_taken)

17

OurNewFeature Examples- MakeBetterUseofMLToolkit

Numeric duration

Statistical num_bytes_cli2srv,num_bytes_srv2cli,num_packets_cli2srv,num_packets_srv2cli,packet_deltat_avg_cli2srv,packet_deltat_avg_srv2cli,packet_deltat_entropy_2way,packet_deltat_entropy_cli2srv,packet_deltat_entropy_srv2cli

Weadded46new

features!!!!

NEW STREAM APP ENABLES DIRECT ACCESS TO RAW PCAP IN SPLUNK

18

NEW STREAM APP GIVE ANALYSTS MORE INFORMATION

19

ML TOOLKIT ENABLES EXPLORATORY DATA ANALYSIS IN SPLUNK

20

STOCK SPLUNK ML TOOLKIT HAS LIMITED FEATURES AVAILABLE FOR ANALYSIS

21

90%ofMLisPre-Processing&FeatureExtractionCraftingFeaturesisNecessaryBeforeFeedingTheMLTK

DATA SCIENTISTS CAN ADD NEW FEATURES DIRECTLY INTO SPLUNK FOR EDA

22

USER EXPERIENCE AND SUPPORTING EVIDENCE FOR DATA SCIENTISTS

23

USER EXPERIENCE AND SUPPORTING EVIDENCE FOR ANALYSTS

24

LIVE DEMO

25

ACTION ITEMS FOR YOUR PROJECTS

26

CULTURAL HURDLES & SUCCESSES

• Tacticsusedtoovercomeculturalbarriers- Youmustgototheanalyst;theywillshowyoutheiranalysisprocessANDgrantyoukeystotheirdatatroves

- Youmustbewillingtoexplainwhatanalysistechniquesyouareusingsimplyusingtheirterminologyasmuchaspossible

- Someoneonyourteamhastobewillingtotalktothecustomersandtheircustomers- thishelpsestablishanew,collaborativetribe

- Yourworkmustroleupintoastorythattellsthewhyandsowhatofthework- sometimesthisistheclosestonegetstoROI

- Marketing&brandingextremelyimportantforbreakingentrenchedthinkingandcoaxingparticipationtosomethingnew&shiny

• Buildaninterdisciplinaryteam- Unicornsarehardtofindandthebestsolutionsoftenareaproductofdivergentthought

- Dataanalysisisapipeline,journeyofsorts…ittakesdomainexpertsfromfieldsotherthanjustcomputerscienceormathematics

- HavingdatascientiststhathaveexpertiseinCyberOperationsmissionspacewillacceleratesuccess

27

FOUR STEPS TO APPLYING DATA SCIENCE WITHIN CYBER OPERATIONS

• STEP#1- WORKDIRECTLYWITHANALYSTSTOSOURCEAUSECASE

• STEP#2– SELECTMETHODFORINTEGRATINGDATASCIENCECAPABILITIES

• STEP#3– EXECUTEMACHINELEARNINGALGORITHMDEVELOPMENTPROCESS

• STEP#4– SHOWEVIDENCETOSUPPORTANALYSISRESULTS

28

TAKE AWAYS

1) Yourdatascienceteammustgototheanalyst

2) Populateyourresultswheretheuserchecks

3) Developself-containedlimitedsizeproductsthatcanbeiterativelyupdatedanddelivered

4) DataScientistsmustbeconcernedwithjustifyingtheirclaims

5) Splunkcanbeenhancedbyleveragingexternalscripting

29

INNOVATING THE CYBER DOMAIN THROUGH THE APPLICATION OF DATA SCIENCE

30

top related