Future Forward with Data Science: How to Predict (and to Change) the Future

Kirk Borne (@KirkDBorne), Principal Data Scientist, Booz Allen Hamilton
http://www.boozallen.com/datascience

Slides: kirkborne.net/phuse2018/KirkBorne-PhUSE-Nov2018.pdf



Ever since we first explored our world…

http://www.livescience.com/27663-seven-seas.html


…We have asked questions about everything around us.

https://jefflynchdev.wordpress.com/tag/adobe-photoshop-lightroom-3/page/5/


So, we have collected evidence (data) to answer our questions,

which leads to more questions, which leads to more data collection,

which leads to more questions, which leads to BIG DATA!

y ~ 2 * x (linear growth)

y ~ 2 ^ x (exponential growth)

https://www.linkedin.com/pulse/exponential-growth-isnt-cool-combinatorial-tor-bair

y ~ x! ≈ x ^ x → Combinatorial Growth! (all possible interconnections, linkages, and interactions)
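As a rough illustration of how quickly these three growth regimes pull apart, the sketch below tabulates 2x, 2^x, and x! for a few values of x. The function name `growth_table` and the sample values are illustrative choices, not part of the slides. Note that x! ≈ x^x is only a loose order-of-magnitude statement; Stirling's approximation gives x! ≈ sqrt(2πx)·(x/e)^x.

```python
import math

def growth_table(xs):
    """Return (x, linear, exponential, combinatorial) rows for each x."""
    return [(x, 2 * x, 2 ** x, math.factorial(x)) for x in xs]

# Print the three growth curves side by side for a few sample sizes.
for x, linear, exponential, combinatorial in growth_table([1, 5, 10, 20]):
    print(f"x={x:>2}  2*x={linear:>3}  2^x={exponential:>9,}  x!={combinatorial:,}")
```

Even at x = 20 the factorial column is many orders of magnitude larger than the exponential one, which is the point of the slide: interconnections grow far faster than the data itself.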


Defining Big Data

• The 3 V’s of Big Data are not just hype…

• They represent really big challenges:

1. Volume

2. Velocity

3. Variety

Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/


✓ VALUE!


Our mission as data scientists is to discover Value in Big Data (especially in high-Variety data) through Data Science and Machine Learning.


What is the Big Data Variety Challenge?


Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/

1. We collect many different sources of data.

2. But we usually store diverse data in separate silos.

3. Therefore, we cannot easily integrate the data to combine them for unified insight.

Consider the Blind Men and the Elephant…


Adding more data doesn’t necessarily help…

https://paulmead.com.au/blog/understand-perceptions/

Unless we can combine and integrate the different signals into a “single view” of the thing, there will continue to be many possible interpretations of what the source is!

Combining, connecting, and linking diverse data makes data “smart”!

Think of data not as information, but as measurements that encode knowledge.


Feature Selection and Projection

Feature Selection is important in order to disambiguate different classes. More importantly, Class Discovery depends on choosing the right projection and selecting the right features!


Projection Matters

Your chosen data attributes represent a low-dimension projection of the full truth – the feature space (dimensions) in which you explore your data is a form of bias – it matters!
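A minimal sketch of why the projection matters, using made-up two-feature data: the two classes are indistinguishable when projected onto feature x, but cleanly separated along feature y. The `separation` score (difference of class means divided by the pooled standard deviation) is just one simple measure, assumed here for illustration.

```python
import math
import statistics

def separation(a_values, b_values):
    """Gap between class means in units of the pooled standard deviation."""
    pooled = math.sqrt((statistics.pvariance(a_values) + statistics.pvariance(b_values)) / 2)
    return abs(statistics.mean(a_values) - statistics.mean(b_values)) / pooled

# Made-up (x, y) measurements for two classes; illustrative only.
class_a = [(1, 0.1), (2, -0.2), (3, 0.0), (4, 0.2)]
class_b = [(1, 5.1), (2, 4.9), (3, 5.0), (4, 5.2)]

# Score each one-dimensional projection of the same data.
for name, index in [("x", 0), ("y", 1)]:
    score = separation([p[index] for p in class_a], [p[index] for p in class_b])
    print(f"projection onto {name}: separation = {score:.2f}")
```

An analyst who only recorded feature x would conclude there is a single class; the second class is discoverable only in a projection that includes y.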


Feature Selection and Model Bias: choosing features in the dark

I picked out two socks from my sock drawer this morning!

It was still dark, but that shouldn’t matter, right? After all, they are the same size … THE SAME ?!?

The Era of Big Data represents the END OF DEMOGRAPHICS (i.e., our models should no longer be based on and biased by a limited selection of attributes and features)


An “Easy Button” for Extracting Value from Data through Machine Learning

• Pattern Discovery (Detection)
– D2D: data-to-discovery
• Pattern Recognition
– D2D: data-to-decisions
• Pattern Exploration
– D2D: data-to-dollars (innovation)
• Pattern Exploitation
– D2V: Data-to-Value (action)
– D2A: Data-to-Action (value)


Pattern Discovery is easy, but Pattern Exploitation requires more data science…

Source for graphic: http://www.holehouse.org/mlclass/10_Advice_for_applying_machine_learning.html


Generalization is key!

(The Goldilocks model)

The most generally useful model captures the fundamental pattern in the data and takes into account the natural variance in the data.
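The "Goldilocks" idea can be sketched with three toy models on synthetic data: one too simple (always predict the mean), one that matches the underlying pattern (a straight line), and one too complex (a lookup that memorizes every training point, noise included). The data-generating rule y = 2x + noise, the model names, and the sample sizes are all assumptions made for this illustration.

```python
import random
import statistics

def make_data(n, rng):
    """Noisy samples of the (assumed) true pattern y = 2x."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]

def fit_mean(train):
    """Too simple: ignores x and always predicts the average y."""
    avg = statistics.mean(y for _, y in train)
    return lambda x: avg

def fit_line(train):
    """Matches the pattern: ordinary least-squares straight line."""
    mx = statistics.mean(x for x, _ in train)
    my = statistics.mean(y for _, y in train)
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x, _ in train))
    return lambda x: my + slope * (x - mx)

def fit_memorizer(train):
    """Too complex: 1-nearest-neighbour lookup that memorizes the noise."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return statistics.mean((model(x) - y) ** 2 for x, y in data)

rng = random.Random(0)
train, test = make_data(200, rng), make_data(1000, rng)
results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:<10} train MSE={results[name][0]:6.2f}  test MSE={results[name][1]:6.2f}")
```

The memorizer scores perfectly on the data it has seen but does worse than the line on unseen data, while the mean model is poor on both. Only the comparison on held-out data reveals which model generalizes.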


The Goal of Machine Learning

“…is to use algorithms to learn from data, in order to build generalizable models that give accurate classifications or predictions, or to find (useful) patterns, particularly with new and previously unseen data.”

(the key is GENERALIZATION!)

https://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide/


4 Flavors of Machine Learning for Pattern Detection and Discovery

1) Class Discovery (Clustering): Find the categories of objects (population segments), events, and behaviors in your data. + Learn the rules that constrain the class boundaries (that uniquely distinguish them).
2) Correlation (Predictive and Prescriptive Power) Discovery: Find trends, patterns, and dependencies in data that reveal new governing principles or behavioral patterns (the object’s “DNA”).
3) Novelty (Surprise!) Discovery: Find the new, surprising, unexpected one-in-a-[million / billion / trillion] object, event, or behavior.
4) Association (or Link) Discovery (Graph and Network Analytics): Find the unusual (interesting) data associations / links / connections across entities in your domain.


5 Levels of Analytics Maturity in Data-Driven Applications

1) Descriptive Analytics
– Hindsight (What happened?)
2) Diagnostic Analytics
– Oversight (real-time / What is happening? Why did it happen?)
3) Predictive Analytics
– Foresight (What will happen?)
4) Prescriptive Analytics
– Insight (How can we optimize what happens?) (Follow the dots / connections in the graph!)
5) Cognitive Analytics
– Right Sight (the 360 view: what is the right question to ask for this set of data in this context = Game of Jeopardy)
– Finds the right insight, the right action, the right decision, … right now!
– Moves beyond simply providing answers, to generating new questions and hypotheses.


Predictive vs Prescriptive: What’s the Difference?

PREDICTIVE Analytics

Find a function (i.e., the model) f(d,t) that predicts the value of some predicted variable y = f(d,t) at a future time t, given the set of conditions found in the training data {d}.

=> Given {d}, find y.

PRESCRIPTIVE Analytics

Find the conditions {d’} that will produce a prescribed (desired, optimum) value y at a future time t, using the previously learned conditional dependencies among the variables in the predictive function f(d,t).

=> Given y, find {d’}.
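The two directions can be sketched with the simplest possible model. This is only an illustration under strong assumptions: a single condition d, a linear dependence y = intercept + slope·d, no explicit time dependence, and made-up training values. The function names `fit_linear`, `predict`, and `prescribe` are hypothetical.

```python
def fit_linear(conditions, outcomes):
    """Learn y = intercept + slope * d from training data by least squares."""
    n = len(conditions)
    mean_d = sum(conditions) / n
    mean_y = sum(outcomes) / n
    slope = (sum((d - mean_d) * (y - mean_y) for d, y in zip(conditions, outcomes))
             / sum((d - mean_d) ** 2 for d in conditions))
    return mean_y - slope * mean_d, slope

def predict(model, d):
    """Predictive: given a condition d, find y."""
    intercept, slope = model
    return intercept + slope * d

def prescribe(model, target_y):
    """Prescriptive: given a desired y, find the condition d' the model says produces it."""
    intercept, slope = model
    if slope == 0:
        raise ValueError("y does not depend on d, so no d' can be prescribed")
    return (target_y - intercept) / slope

# Hypothetical training data: conditions d and the outcomes y observed for them.
model = fit_linear([1.0, 2.0, 3.0, 4.0], [5.0, 7.0, 9.0, 11.0])
print(predict(model, 5.0))     # given d, find y
print(prescribe(model, 15.0))  # given y, find d'
```

Both functions reuse the same learned dependency; the prescriptive step simply inverts it. With many interacting conditions the inversion becomes an optimization problem rather than a one-line formula.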


Confucius says…

“Study your past to know your future”


Baseball philosopher Yogi Berra says…

“The future ain’t what it used to be.”


© Copyright 2016 Booz Allen Hamilton – http://www.boozallen.com/datascience

Data Analytics in Medicine & Health Administration

1. Benefits Administration improvement (“ACO = HIE + Analytics”: process mining, best practices, cost-efficiency, success metrics validation)
2. Do Not Pay initiatives (payment error / fraud analytics)
3. Beneficiary Recommendations ("Amazon-style" predictive analytics, prescriptive modeling)
4. Consumer Engagement (personalized online web experience, "marketing analytics")
5. Health Information Exchange (HIE) Exploitation (population health discovery, link analysis, ICD-10 mining)
6. Personalized Healthcare and Patient Wellness (wearables data-sharing/mining, health baselining)
7. Personalized/Precision Medicine and Care Coordination (EHR, HIE monitoring / mining)
8. Predictive Medicine (readmissions, complications, adverse interactions)
9. At-Risk Precursor Analytics (early warning signals of cancer, diabetes, heart disease, suicidal / mental health issues, ...)
10. Patient Trajectories Analysis (mining / segmentation of whole population EHR histories, pathways, outcomes, outliers)
11. Learning Health System Decision Support (advanced analytics embedded in health system data feeds)
12. What Question Should I Be Asking of My Data? (Cognitive Analytics)


Source for graphic: https://data-flair.training/blogs/machine-learning-applications/

Predictive Analytics is currently the most significant application of Machine Learning (*)

(*) The set of mathematical algorithms that learn (patterns) from experience (data)


Source for graphic: https://www.altexsoft.com/blog/datascience/machine-learning-strategy-7-steps/

Predictive Analytics is everywhere in Business Data and Machine Learning (AI) Strategy Discussions


Traditional Time Series Forecasting: Prediction based on historical patterns

Source: https://medium.com/99xtechnology/time-series-forecasting-in-machine-learning-3972f7a7a467


Traditional Time Series Forecasting: Autoregressive (uncertainty in prediction can be large)

Source: https://peltiertech.com/excel-fan-chart-showing-uncertainty-in-projections/

Uncertainty!


Traditional Time Series Forecasting: Autoregressive (assumes future time series values depend on the past values from the same series)

Source: http://ucanalytics.com/blogs/step-by-step-graphic-guide-to-forecasting-through-arima-modeling-in-r-manufacturing-case-study-example/
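To make the autoregressive assumption concrete, here is a minimal sketch of the simplest case, a first-order model y[t] = c + phi·y[t-1] + noise, fitted by least squares to a synthetic series. The series, the function names `fit_ar1` and `forecast_ar1`, and the choice of order 1 are assumptions for illustration; the uncertainty band below also ignores the error in the estimated parameters, so real intervals would be somewhat wider.

```python
import math
import random

def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1] + noise; returns (c, phi, sigma)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mp, mc = sum(prev) / n, sum(curr) / n
    phi = (sum((p - mp) * (q - mc) for p, q in zip(prev, curr))
           / sum((p - mp) ** 2 for p in prev))
    c = mc - phi * mp
    residuals = [q - (c + phi * p) for p, q in zip(prev, curr)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return c, phi, sigma

def forecast_ar1(last_value, c, phi, sigma, steps):
    """Point forecasts plus the forecast-error standard deviation at each horizon."""
    out, value, var = [], last_value, 0.0
    for h in range(steps):
        value = c + phi * value
        var += sigma ** 2 * phi ** (2 * h)  # error variance accumulates with horizon
        out.append((value, math.sqrt(var)))
    return out

# Synthetic series generated from a known AR(1) process (an assumption for the demo).
rng = random.Random(42)
series = [0.0]
for _ in range(499):
    series.append(1.0 + 0.8 * series[-1] + rng.gauss(0, 1))

c, phi, sigma = fit_ar1(series)
forecasts = forecast_ar1(series[-1], c, phi, sigma, steps=5)
for h, (value, sd) in enumerate(forecasts, start=1):
    print(f"h={h}: forecast={value:.2f} +/- {1.96 * sd:.2f}")
```

The widening "+/-" band at each step is the fan shape seen in forecast charts: every additional step ahead compounds the noise of the steps before it.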


Traditional Time Series Forecasting: Even with very high-fidelity physics-based models, uncertainty in prediction can be large!

Source: https://www.reddit.com/r/weather/comments/6xecax/tracking_hurricane_irma/


Data Science provides insights into the future: to predict it and to change it!


Source for image: https://www.hausmanmarketingletter.com/translating-analytics-to-action/

Advances in Predictive, Prescriptive, and Cognitive Analytics provide us with More Ways to See Around Corners


Examples of Forecasting (seeing around corners)

1) Cognitive

2) Associations

3) Graphs

4) Clustering


“You can see a lot by just looking”

(and you can see around corners!)

Cognitive, Contextual, Insightful, Forecastful

https://www.speedcafe.com/2017/07/12/f1-demo-take-place-london-streets/


Association Discovery Example #1

◼ Classic Textbook Example of Data Mining (Legend?): Data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time.
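Association discovery of this kind is usually quantified with three standard measures for a rule "A implies B": support (how often A and B occur together), confidence (how often B occurs when A does), and lift (how much more often than chance). The sketch below computes them on a handful of made-up baskets; the data and the function name `rule_metrics` are illustrative, not real store logs.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_c = sum(1 for t in transactions if consequent in t)
    has_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = has_both / n
    confidence = has_both / has_a
    lift = confidence / (has_c / n)
    return support, confidence, lift

# Made-up baskets, for illustration only (not real store data).
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"bread", "chips"},
    {"diapers", "beer", "milk"},
    {"bread", "milk"},
]
support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items appear together more often than they would if purchases were independent, which is what makes a rule "interesting" rather than merely frequent.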


Association Discovery Example #2

◼ Amazon.com mines its customers’ purchase logs to recommend books to you: “People who bought this book also bought this other one.”


Association Discovery Example #3

◼ Netflix mines its video rental history database to recommend rentals to you based upon other customers who rented movies similar to yours.


Association Discovery Example #4

◼ Wal-Mart studied product sales in their Florida stores in 2004 when several hurricanes passed through Florida.
◼ Wal-Mart found that, before the hurricanes arrived, people purchased strawberry Pop-Tarts at about 7 times the normal sales rate.


Strawberry pop tarts???

http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html

http://www.hurricaneville.com/pop_tarts.html

http://bit.ly/1gHZddA


Association Rule Discovery for Hurricane Intensification Forecasting

• Research by GMU geoscientists

• Predict the final strength of a hurricane at landfall.

• Find co-occurrence of final hurricane strength with specific values of measured physical properties of the hurricane while it is still over the ocean.

• Result: the association rule discovery prediction is better than National Hurricane Center prediction!

• Research Paper by GMU scientists: https://ams.confex.com/ams/pdfpapers/84949.pdf


(Graphic by Cray, for Cray Graph Engine CGE)

http://www.cray.com/products/analytics/cray-graph-engine

“All the World is a Graph” – Shakespeare?

The natural data structure of the world is not rows and columns, but a Graph!


Simple Example of the Power of Graph: Semi-Metric Space

[Diagram: Entity {1} linked to Entity {2}, and Entity {2} linked to Entity {3}]

• Entity {1} is linked to Entity {2} (small distance A)
• Entity {2} is linked to Entity {3} (small distance B)
• Entity {1} is *not* linked directly to Entity {3} (Similarity Distance C = infinite)
• Similarity Distances between A, B, and C violate the triangle inequality!
• The connection between black hat entities {1} and {3} never appears explicitly within a transactional database.
• Examples: (a) Medical Research Discoveries across disconnected journals, through linked semantic assertions; (b) Customer Journey modeling; (c) Safety Incident Causal Factor Analysis; (d) Marketing Attribution Analysis; (e) Fraud networks, Illegal goods trafficking networks, Money-Laundering networks.
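The three-entity example can be sketched in a few lines: a lookup over the direct links finds nothing connecting {1} and {3}, while a graph traversal recovers the hidden path through {2}. The link list is hypothetical, and breadth-first search is simply one standard traversal chosen for this illustration.

```python
from collections import deque

def shortest_path(links, start, goal):
    """Breadth-first search over an undirected link list; returns a path or None."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Only these two direct links exist in the (hypothetical) transaction records.
links = [(1, 2), (2, 3)]
directly_linked = (1, 3) in links or (3, 1) in links
print(directly_linked)             # is there an explicit record connecting 1 and 3?
print(shortest_path(links, 1, 3))  # path through the graph, if any
```

A row-by-row query sees only the direct links; representing the same records as a graph is what makes the indirect relationship visible.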


Customer Journey Science by Clickfox.com – The Journey Graph predicts Customer outcomes with high accuracy!

https://www.slideshare.net/Qualtrics/how-to-leverage-analytics-design-and-development-to-transform-customer-journeys


Clustering = the process of partitioning a set of data into subsets (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging to that cluster than to data elements belonging to other clusters.

= Group together similar items + separate the dissimilar items

= Identify similar characteristics, patterns, or behaviors among subsets of the data elements.

Challenges:
#1) No prior knowledge of the number of clusters.
#2) No prior knowledge of semantic meaning of the clusters.
#3) Different clusters are possible from the same data set!
#4) Different clusters are possible using different similarity metrics.


How to know if your clusters are good enough

Reference: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-S2-S5

R code for validation algorithms: https://cran.r-project.org/web/packages/clValid/clValid.pdf

◼ You know the clusters are good …
◼ … if the clusters are compact relative to their separation
◼ … if the clusters are well separated from one another
◼ … if the “within cluster” errors are small (low variance within)
◼ … if the number of clusters is small relative to the number of data points
◼ Various measures of cluster quality (compactness and separation) exist, including the Dunn index, the C-index, Silhouette analysis, and the DBI (Davies-Bouldin Index)


Application of Davies-Bouldin Index

◼ Assume K (the number of clusters) and assume other things (choice of clustering algorithm; the choice of clustering feature attributes; etc.)

◼ Measure DBI

◼ Test another set of values for the cluster input parameters (K, feature attributes, etc.)

◼ Measure DBI

◼ … continue iterating like this until you find the set of cluster input parameters that yields the best (minimum) value for DBI.
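The iteration described above can be sketched as follows. This is a toy version under stated assumptions: one-dimensional made-up data, a tiny deterministic k-means written for the example (`kmeans_1d`), and only K varied while the other inputs stay fixed. The DBI itself follows the usual definition: for each cluster, take the worst ratio of combined within-cluster scatter to between-centroid separation, then average over clusters.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means with a deterministic, evenly spread initialization."""
    data = sorted(values)
    n = len(data)
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def davies_bouldin(centroids, clusters):
    """Average over clusters of the worst (scatter_i + scatter_j) / separation_ij."""
    scatter = [sum(abs(x - c) for x in pts) / len(pts) if pts else 0.0
               for c, pts in zip(centroids, clusters)]
    k = len(centroids)
    worst = [max((scatter[i] + scatter[j]) / abs(centroids[i] - centroids[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

# Made-up 1-D measurements with three visible groups (illustrative only).
values = [0.8, 1.0, 1.1, 1.2, 3.9, 4.0, 4.1, 4.2, 8.8, 9.0, 9.1, 9.2]
scores = {k: davies_bouldin(*kmeans_1d(values, k)) for k in (2, 3, 4)}
best_k = min(scores, key=scores.get)
for k, dbi in scores.items():
    print(f"K={k}: DBI={dbi:.3f}")
print("best K:", best_k)
```

The same loop extends naturally to other inputs (feature subsets, algorithms, distance metrics): evaluate the DBI for each candidate configuration and keep the minimum.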


Scientific Discovery from Cluster Analysis of data parameters from events on the Sun and around the Earth


Cluster Analysis: Find the clusters, then Evaluate them

[Figure: "DBI for Dst_Vsw_Bz". Y-axis: D-B Index (0.8 to 1.2). X-axis: Time Shift, i.e., Delay (hr) of Dst from Vsw and Bz (0 to 12). Series: 2C DBI, 3C DBI, 4C DBI, Average.]

Figure 10. Davies-Bouldin index for various time delays of Dst from Vsw and Bz for cases of 2 (blue), 3 (red), 4 (yellow) clusters, and the overall average (purple), indicating an optimal delay of ~2-3 hours for Dst.

Good Clusters = Small Size relative to Cluster Separation.

DISCOVERY! ... Solar wind events have the strongest association (i.e., the tightest clusters) with the space plasma events within the Earth’s magnetosphere about 2-4 hours after a major plasma outburst occurs on the Sun.


Next Steps…


Welcome to the new Hype 2018!

https://marketoonist.com/2018/01/blockchain.html


https://datasciencebowl.com

Harness your Data Science Passion. Unleash your Curiosity.

Focus on a larger Purpose using #Data4Good and #AI4socialgood in #DataSciBowl.

75% of rare diseases affect children. **

** https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3150084/


Data Science Bowl – largest global competition in DS (summary statistics for Data Science Bowls 2015-2018)