data science tutorial - sei digital library · pdf filedata science tutorial ... the call...

71
Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 2017 SEI Data Science in Cybersecurity Symposium Approved for Public Release; Distribution is Unlimited Data Science Tutorial Eliezer Kanal – Technical Manager, CERT Daniel DeCapria – Data Scientist, ETC

Upload: lykhue

Post on 06-Mar-2018

227 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

12017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 1

Software Engineering InstituteCarnegie Mellon UniversityPittsburgh, PA 15213

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited

Data Science TutorialEliezer Kanal – Technical Manager, CERT

Daniel DeCapria – Data Scientist, ETC

Page 2: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

2Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 2

About us

Eliezer KanalTechnical Manager, CERT

Recent projects:

• ML-based Malware Classifier

• Network traffic analysis

• Cybersecurity questionnaireoptimization

Daniel DeCapriaData Scientist, ETC

Recent projects:

• Cyber risk situationaldashboard

• Big Learning benchmarks

Page 3: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

3Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 3

Today’s presentation – a tale of two roles

The call center managerIntroduction to

data science capabilities

The master carpenterOverview of the

data science toolkit

Page 4: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

4Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 4

Call center manager

First day on job… welcome!

Goal: Reduce costs

Task: Keep calls short!

Data:

Average call time: 5.14 minutes (5:08)… very long!

Number of employees: 300

Average calls per day: ~28,000

Page 5: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

5Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 5

Call center manager – Gather data

Get the data!

• Where is it?

• What will you use to analyze it?

• How accurate it is?

• How complete is it?

• Is it too big to easily read?

Page 6: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

6Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 6

Data cleaning = 90% of the work

2 weeks (10 days) = 9 cleaning, 1 analyzing

Page 7: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

7Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 7

MorestructureLessstructure

Cleaning the Data – Structuring the Data

Goal: Organize data in a table, where…

Columns = descriptor (age, weight, height)

Row = individual, complete records

How can you get data out of these documents?

Page 8: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

8Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 8

Cleaning the Data

Even when you think your data should be clean, it might not be…

Page 9: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

9Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 9

Cleaning the Data – Call Center Example

Name Mgr Dir CallLength

PhoneLine

Problemsolved?

Comment

BethJones DanThomas AnneKim 1:30 1 Y …

Beth Jones DanThomas AnneKim 1:52 3 Y …

Jones,Beth DanThomas AnneKim 90 2 Y …

TomKeane MarkRyan TimPike 88 2 N …

TomKeane MarkRyan TimPike 144 3 No …

Tom Keane KevinWood TimPike 200 Yes …

Tom Keane KevinWood TimPike 94511 2 No …

Tom Keane KevinWood TimPike 421 2 Yes …

String Integer “Nominal” Unstructured

Page 10: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

10Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 10

Call Center manager – Exploring data

Page 11: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

11Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 11

Exploratory Data Analysis (EDA)

• Mean

• Median

• Standard deviation

• Histograms!

Page 12: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

12Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 12

Distributions

• The majority of data will follow SOME distributiono Weight of all Americans:

Gaussian

o phone call length:Exponential

• Determining distribution is a common Data Science task

• Multidimensional outliers: Insider Threat example

ImageCopyright2001-2016TheApacheSoftwareFoundation.SeeCopyrightslideformoredetails.

Page 13: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

13Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 13

EDA – Smart visualizations

Page 14: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

14Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 14

Page 15: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

15Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 15

Page 16: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

16Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 16

Page 17: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

17Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 17

Brief interruption

Skeptics in the audience

Page 18: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

18Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 18

Brief interruption

DataSciencehelpsyouusedatatogetresults.

Thisisit.

Page 19: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

19Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 19

Call center manager – call duration histogram

Average (5:08)

Page 20: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

20Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 20

Call Center manager – Insights!

Strategy update:

• Goodbye “reduce call time”

• Hello “reduce callbacks”

How to measure?

“callbacks” isn’t currently captured

Page 21: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

21Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 21

Feature Engineering

Need more useful data?Create it yourself!

Page 22: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

22Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 22

Feature Engineering

• Feature Engineering: coming up with new, useful (i.e.,informative) datao mean, sums, medians, etc.

o x2, xy, sqrt(xy), etc.

• Our case:o # of callbacks

o Call during peak time?

o Overall agent performance? (combination of factors)

Page 23: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

23Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 23

The role of Listening in Data Science

Data science finds hidden patterns in data

Experts know what data & patterns are important

Talk to subject matter experts

Page 24: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

24Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 24

Call Center manager – Predictive analytics

Can we predict staffing levels…

• …one day ahead?

• …one week ahead?

• …one month ahead?

Can we determine what types of calls to expect…

• …for a product we haven’t had before?

• …for a market we’ve never seen before?

Page 25: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

25Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 25

Example Predictive Analytics Questions

Predicting Current UnknownsOnline: Which ads are malicious?

Security: Is the bank transaction fraudulent?

IC: Which names map to the same person (entity resolution)?

Predicting Future EventsRetail: What will be the new trend of merchandise that a company should stock?

Security: Where will a hacker next attack our network?

IC: Who will become the next insider threat?

Determining Future ActionsSales: How can a company increase sales revenues?

Health: What actions can be taken to prevent the spread of flu?

IC: How will a vulnerability patch affect our knowledge/preparedness for future attacks?

Page 26: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

26Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 26

Call Center manager – Predictive analytics

Many techniques available, explored in next section

Page 27: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

27Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 27

Call Center manager – Review

Prediction

Action!

Interpretation

EDA,Visualization

Cleandata

Getdata

Page 28: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

28Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 28

Becauseweknowourdata,wecanask…

• …moreintelligentquestions

• …action-orientedquestions

• …questionsthatcanbeanswered

Page 29: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

29Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 29

This slide intentionally left blank

Page 30: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

30Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 30

The master carpenter

“Therighttoolforthejob”

Page 31: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

31Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 31

Feature Engineering – Part 2

The fuel of data science is data

Data preparation is critical

Data quality ≫ algorithm choice

That will come up…

”Withthewrongwood,Icanmakenothing”

Page 32: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

32Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 32

Types of Machine Learning Algorithms

Classification• Naïve Bayes• Logistic Regression• Decision Trees• K-Nearest Neighbors• Support Vector Machines

Regression• Linear Regression• Support Vector Machines

Clustering• K-Means Clustering

Page 33: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

33Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 33

Types of Machine Learning Algorithms

Applications: Everywhere

• Banking

• Weather

• Sports scores

• Economics

• Environmental science

• Cybersecurity

Page 34: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

34Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 34

Linear Regression – Prediction

Problem:

If I have examples of X and Y, when I learn a new X, can I predict Y?

Page 35: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

35Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 35

Linear Regression – Prediction

Solution: Find the line that is closest to every point

Said differently: Find the line that the SUM of all errors is smallest

Page 36: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

36Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 36

Linear Regression – Prediction

Three dimensions,same concept

HUNDREDS of dimensions,same concept

Page 37: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

37Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 37

Linear Regression

Very widely used

• Simple to implement

• Quick to run

• Easy to interpret

• Works for many problems

• First identified in early 1800’s; very well studied

When applicable:

• Works best with numeric data (usually)

• Works for predicting specific numeric outcome

Page 38: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

38Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 38

Logistic Regression – Classification

Idea: Classification using a discriminative model

• Predict future behavior based on existing labeled data

• Draws a line to assign labels

Mainly used for binary classification: either “red” or “blue”

Page 39: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

39Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 39

Logistic Regression – Classification

Look at distribution, what’s likely based on current data

Probably red

Probably blue

Page 40: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

40Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 40

Logistic Regression

Three dimensions,same concept

HUNDREDS of dimensions,same concept

Page 41: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

41Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 41

Classification: Support Vector Machine

Idea: The optimal classifier is the one that is the farthest from both classes

DewPoint

Barometric

Pressure

Page 42: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

42Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 42

Classification: Support Vector Machine

Idea: The optimal classifier is the one that is the farthest from both classes

DewPoint

Barometric

Pressure

Page 43: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

43Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 43

DewPoint

Barometric

Pressure

Classification: Support Vector Machine

Algorithm:

• Find lines like before

• Assign a cost to misclassified data points based on distance from the classification line

𝝃𝟏

𝝃𝟐

Page 44: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

44Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 44

Classification: Decision Trees

Idea: Instead of drawing a single complicated line through the data, draw many simpler lines.

Algorithm:

• Scan through all values of all features to find the one that “helps the most” to determine what data gets what label.

• Divide the data based on that value, and then repeat recursively on each part.

DewPoint

Barometric

Pressure

Page 45: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

45Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 45

Classification: Decision Trees

Idea: Instead of drawing a single complicated line through the data, draw many simpler lines.

Algorithm:

• Scan through all values of all features to find the one that “helps the most” to determine what data gets what label.

• Divide the data based on that value, and then repeat recursively on each part.

DewPoint

Barometric

Pressure

𝐷𝑃 > 60

Page 46: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

46Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 46

Classification: Decision Trees

Idea: Instead of drawing a single complicated line through the data, draw many simpler lines.

Algorithm:

• Scan through all values of all features to find the one that “helps the most” to determine what data gets what label (“information gain”).

• Divide the data based on that value, and then repeat recursively on each part.

DewPoint

Barometric

Pressure

𝐷𝑃 > 60

𝐵 > 760

Page 47: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

47Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 47

Classification: Decision Trees

Benefits:

• Works well when small.

• Very easy to understand!

Challenges:

• Trees overfit easily

• Very sensitive to data; Random Forests

DewPoint

Barometric

Pressure

𝐷𝑃 > 60

𝐵 > 760

𝐷𝑃 > 55

Page 48: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

48Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 48

Classification: K-Nearest Neighbors

Idea: A new point is likely to share the same label as points around it.

Algorithm:

• Pick constant k as number of neighbors to look at.

• For each new point, vote on new label using the k neighbor labels.

DewPoint

Barometric

Pressure

Page 49: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

49Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 49

Classification: K-Nearest Neighbors

Idea: A new point is likely to share the same label as points around it.

Algorithm:

• Pick constant k as number of neighbors to look at.

• For each new point, vote on new label using the k neighbor labels.

DewPoint

Barometric

Pressure

Page 50: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

50Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 50

Classification: K-Nearest Neighbors

Idea: A new point is likely to share the same label as points around it.

Algorithm:

• Pick constant k as number of neighbors to look at.

• For each new point, vote on new label using the k neighbor labels.

DewPoint

Barometric

Pressure

Page 51: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

51Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 51

Classification: K-Nearest Neighbors

Works well when

• there is a good distance metric and weighting function to vote on classification

Challenges:

• Not a smooth classifier; points near each other may get classified differently

• Must search all your data every time you want to classify a new point

• When k is small (1,2,3,4), essentially it is overfitting to the data points

Page 52: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

52Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 52

Clustering

• Unsupervised learning

• Structure of un-labled data

• Organize records into groups based on some similarity measure

• Cluster is the collection of records which are similar

Page 53: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

53Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 53

Clustering: K-means

Idea: Find the clusters by minimizing distances of cluster centers to data.

Algorithm:

• Instantiate k distinct random guesses 𝜇-of the cluster centers

• Each data point classifies itself as the 𝜇- it is closest to it

• Each 𝜇- finds the centroid of the points that were closest to it and jumps there

• Repeat until centroids don’t move

DewPoint

Barometric

Pressure

Page 54: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

54Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 54

Clustering: K-means

Idea: Find the clusters by minimizing distances of cluster centers to data.

Algorithm:

• Instantiate k distinct random guesses 𝜇-of the cluster centers

• Each data point classifies itself as the 𝜇- it is closest to it

• Each 𝜇- finds the centroid of the points that were closest to it and jumps there

• Repeat until centroids don’t move

DewPoint

Barometric

Pressure

Page 55: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

55Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 55

Clustering: K-means

Idea: Find the clusters by minimizing distances of cluster centers to data.

Algorithm:

• Instantiate k distinct random guesses 𝜇-of the cluster centers

• Each data point classifies itself as the 𝜇- it is closest to it

• Each 𝜇- finds the centroid of the points that were closest to it and jumps there

• Repeat until centroids don’t move

DewPoint

Barometric

Pressure

Page 56: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

56Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 56

Clustering: K-means

Idea: Find the clusters by minimizing distances of cluster centers to data.

Algorithm:

• Instantiate k distinct random guesses 𝜇-of the cluster centers

• Each data point classifies itself as the 𝜇- it is closest to it

• Each 𝜇- finds the centroid of the points that were closest to it andjumps there

• Repeat until centroids don’t move

DewPoint

Barometric

Pressure

Page 57: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

57Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 57

Clustering: K-means

Idea: Find the clusters by minimizing distances of cluster centers to data.

Algorithm:

• Instantiate k random guesses 𝜇-of the clusters

• Each data point classifies itself as the 𝜇- it is closest to it

• Each 𝜇- finds the centroid of the points that were closest to it and jumps there

• Repeat until centroids don’t move

DewPoint

Barometric

Pressure

Page 58: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

58Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 58

Clustering: K-means

Works well when

• there is a good distance metric between the points

• the number of clusters is known in advance

Challenges:

• Clusters that overlap or are not separable are difficult to cluster correctly.

DewPoint

Barometric

Pressure

Page 59: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

59Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 59

Influencers

Goal: Detect the people who control or distribute information through a network.

Page 60: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

60Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 60

Influencers: Degree Centrality

Idea: Influential people have a lot of people watching them.

Equation

• Degree centrality = number of directed edges to the node - High degree centrality people are those with large numbers of

followers.

• If undirected graph, transform to bi-directional and compute

Page 61: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

61Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 61

Influencers: Degree Centrality

Idea: Influential people have a lot of people watching them.

Equation

• Degree centrality = number of directed edges to the node - High degree centrality people are those with large numbers of

followers.

• If undirected graph, transform to bi-directional and compute

1

1

1 3

23

1

1

Page 62: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

62Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 62

Influencers: Betweenness Centrality

Idea: Influential people are “information brokers” who connect different groups of people.

Algorithm

• Find all shortest paths from all nodes to all other nodes in the graph.

• Betweenness centrality for a node = sum over all start and end nodes of the number of shortest paths in the graph that include the node

Page 63: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

63Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 63

Influencers: Betweenness Centrality

Idea: Influential people are “information brokers” who connect different groups of people.

Algorithm

• Find all shortest paths from all nodes to all other nodes in thegraph.

• Betweenness centrality for a node = sum over all start and endnodes of the number of shortest paths in the graph that includethe node

5

7

14

3

2 11

2133

Page 64: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

64Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 64

Indicator communitiesButwhatifwearen’tstartingwithareferenceindicator?Weassumethatindicatorsgeneratedbyacoherentrealworldprocesswillbemorelikelytoco-occurinticketsthanarbitrarypairsofindicators.Findgroupsofhighlysimilarindicatorsincompleteindicator-ticketgraph.

Page 65: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

65Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 65

Indicator-ticket graph

Asubsetoftheticket-indicatorgraph(forasmallsetofselectedindicators)• Ticketsaregreytriangles• Indicatorsareblackcircles• Edgesconnectticketstotheindicatorsthey

contain

Page 66: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

66Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 66

Machine Learning Is Growing

Preferred approach for many problems

• Speech recognition

• Natural language processing

• Medical diagnosis

• Robot control

• Sensor networks

• Computer vision

• Weather prediction

• Social network analysis

• AlphaGO, Watson Jeopardy!

Page 67: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

67Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 67

This slide also intentionally left blank, just like the earlier one

Page 68: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

68Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 68

What we did today

Name Mgr Dir Length Line Solved? Comment

BethJones DanThomas AnneKim 1:30 1 Y …

Beth Jones DanThomas AnneKim 1:52 3 Y …

Jones,Beth DanThomas AnneKim 90 2 Y …

TomKeane MarkRyan TimPike 88 2 N …

Average (5:08)

❶ ❶

Page 69: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

69Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 69

What we did today

Page 70: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

70Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 70

DataSciencehelpsyouusedatatogetresults.

Page 71: Data Science Tutorial - SEI Digital Library · PDF fileData Science Tutorial ... The call center manager Introduction to data science capabilities ... closest to every point

71Data Science TutorialAugust 10, 2017© 2017 Carnegie Mellon University

2017 SEI Data Science in Cybersecurity SymposiumApproved for Public Release; Distribution is Unlimited 71

Eliezer KanalTechnical Manager(412) [email protected]

Daniel DeCapriaData Scientist(412) [email protected]