Principal Data Scientist
Booz Allen Hamilton
http://www.boozallen.com/datascience
Kirk Borne (@KirkDBorne)
Future Forward with Data Science: How to Predict (and to Change) the Future
Ever since we first explored our world…
http://www.livescience.com/27663-seven-seas.html
…We have asked questions about everything around us.
https://jefflynchdev.wordpress.com/tag/adobe-photoshop-lightroom-3/page/5/
So, we have collected evidence (data) to answer our questions,
which leads to more questions, which leads to more data collection,
which leads to more questions, which leads to BIG DATA!
y ~ 2 * x (linear growth)
y ~ 2 ^ x (exponential growth)
https://www.linkedin.com/pulse/exponential-growth-isnt-cool-combinatorial-tor-bair
y ~ x! ≈ x ^ x → Combinatorial Growth! (all possible interconnections, linkages, and interactions)
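A quick way to get a feel for these three growth rates is to tabulate them side by side. The sketch below is only illustrative: the function name `growth_table` and the sample values of x are assumptions, not part of the talk. It also lists x ^ x, which is an upper bound on x! rather than an equal.

```python
import math

def growth_table(xs):
    """Return (x, 2*x, 2^x, x!, x^x) for each x (illustrative helper)."""
    return [(x, 2 * x, 2 ** x, math.factorial(x), x ** x) for x in xs]

# Sample points chosen arbitrarily for illustration.
for x, linear, exponential, factorial, power in growth_table([1, 5, 10, 20]):
    print(f"x={x:>2}  2*x={linear:>3}  2^x={exponential:>9,}  "
          f"x!={factorial:>26,}  x^x={power:,}")
```

Even at x = 20 the factorial column is already many orders of magnitude past the exponential one.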
Defining Big Data
• The 3 V’s of Big Data are not just hype…
• They represent really big challenges:
1. Volume
2. Velocity
3. Variety
Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/
✓VALUE!
Our mission as data scientists is to discover Value in Big Data
(especially in high-Variety data) through Data Science and Machine Learning.
What is the Big Data Variety Challenge?
Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/
1. We collect many different sources of data.
2. But we usually store diverse data in separate silos.
3. Therefore, we cannot easily integrate the data to
combine them for unified insight.
Consider the Blind Men
and the Elephant…
Adding more data doesn’t necessarily help…
https://paulmead.com.au/blog/understand-perceptions/
Unless we can combine and integrate the different signals
into a “single view” of the thing, there will continue to be
many possible interpretations of what the source is!
Combining, connecting, and linking diverse data makes data “smart”!
Think of data not as information, but as measurements that encode knowledge.
Feature Selection and Projection
Feature Selection is important in order to disambiguate different classes. More importantly, Class Discovery depends on choosing the right projection and selecting the right features!
Projection Matters
Your chosen data attributes represent a low-dimensional projection of the full truth – the feature space (dimensions) in which you explore your data is a form of bias – it matters!
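To make the projection point concrete, here is a minimal sketch with two synthetic classes that differ only in their second feature. Everything in it is an assumption made for illustration: the Gaussian data, the seed, and the simple separation score (gap between class means divided by the sum of the class standard deviations).

```python
import random
import statistics

rng = random.Random(0)
# Two hypothetical classes: identical in feature 0, well apart in feature 1.
class_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
class_b = [(rng.gauss(0, 1), rng.gauss(6, 1)) for _ in range(200)]

def project(points, axis):
    """Keep a single feature, i.e. project the points onto one axis."""
    return [p[axis] for p in points]

def separation(a_vals, b_vals):
    """Gap between class means relative to the combined spread."""
    gap = abs(statistics.mean(a_vals) - statistics.mean(b_vals))
    return gap / (statistics.stdev(a_vals) + statistics.stdev(b_vals))

score_x = separation(project(class_a, 0), project(class_b, 0))
score_y = separation(project(class_a, 1), project(class_b, 1))
print(f"separation using feature 0: {score_x:.2f}")
print(f"separation using feature 1: {score_y:.2f}")
```

An analyst who only ever looks at feature 0 would conclude there is one population; the second class is visible only in the other projection.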
Feature Selection and Model Bias: choosing features in the dark
I picked out two socks from my sock drawer this morning!
It was still dark, but that shouldn’t matter, right? After all, they are the same size … THE SAME ?!?
The Era of Big Data represents the END OF DEMOGRAPHICS (i.e., our models should no longer be based on and biased by a limited selection of attributes and features)
An “Easy Button” for Extracting Value from Data through Machine Learning
• Pattern Discovery (Detection)
– D2D: data-to-discovery
• Pattern Recognition
– D2D: data-to-decisions
• Pattern Exploration
– D2D: data-to-dollars (innovation)
• Pattern Exploitation
– D2V: Data-to-Value (action)
– D2A: Data-to-Action (value)
Pattern Discovery is easy, but Pattern Exploitation requires more data science…
Source for graphic: http://www.holehouse.org/mlclass/10_Advice_for_applying_machine_learning.html
Generalization is key!
(The Goldilocks model)
The most generally useful model captures the fundamental pattern in the data and takes into account the natural variance in the data.
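The Goldilocks idea can be sketched with a k-nearest-neighbour regressor on noisy synthetic data. This is one possible illustration, not the model shown in the slide graphic: the sine signal, the noise level, the seed, and the three values of k are all assumptions. With k = 1 the model memorises the training noise, with k equal to the whole training set it ignores the pattern, and a moderate k sits in between.

```python
import math
import random

def make_data(n, rng, offset=0.0):
    """Noisy samples of sin(x) on an even grid (synthetic, for illustration)."""
    xs = [offset + 2 * math.pi * i / n for i in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(1)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng, offset=math.pi / 200)

results = {}
for k in (1, 15, 200):  # too flexible, moderate, too rigid
    results[k] = (mse(train_x, train_y, train_x, train_y, k),
                  mse(train_x, train_y, test_x, test_y, k))
    print(f"k={k:>3}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

The training error alone would favour k = 1; only the error on unseen points reveals which model generalizes.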
The Goal of Machine Learning
“…is to use algorithms to learn from data,
in order to build generalizable models that
give accurate classifications or predictions,
or to find (useful) patterns, particularly
with new and previously unseen data.”
(the key is GENERALIZATION!)
https://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide/
4 Flavors of Machine Learning for Pattern Detection and Discovery
1) Class Discovery (Clustering): Find the categories of objects (population segments), events, and behaviors in your data. + Learn the rules that constrain the class boundaries (that uniquely distinguish them).
2) Correlation (Predictive and Prescriptive Power) Discovery: Find trends, patterns, and dependencies in data that reveal new governing principles or behavioral patterns (the object’s “DNA”).
3) Novelty (Surprise!) Discovery: Find the new, surprising, unexpected one-in-a-[million / billion / trillion] object, event, or behavior.
4) Association (or Link) Discovery: (Graph and Network Analytics) – Find the unusual (interesting) data associations / links / connections across entities in your domain.
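As one small illustration of Novelty Discovery, the sketch below flags values that sit far from the bulk of a sample using a robust score based on the median and the median absolute deviation (MAD). The technique is a common choice but is not named in the talk; the readings, the function name `robust_outliers`, and the 3.5 cut-off are assumptions.

```python
import statistics

def robust_outliers(values, threshold=3.5):
    """Return values whose MAD-based score exceeds the threshold."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical sensor readings with one surprising value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 55.0, 9.7]
surprises = robust_outliers(readings)
print(surprises)
```

Using the median rather than the mean keeps the surprising value from hiding itself by inflating the spread estimate.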
5 Levels of Analytics Maturity in Data-Driven Applications
1) Descriptive Analytics
– Hindsight (What happened?)
2) Diagnostic Analytics
– Oversight (real-time / What is happening? Why did it happen?)
3) Predictive Analytics
– Foresight (What will happen?)
4) Prescriptive Analytics
– Insight (How can we optimize what happens?) (Follow the dots / connections in the graph!)
5) Cognitive Analytics
– Right Sight (the 360 view, what is the right question to ask for this set of data in this context = Game of Jeopardy)
– Finds the right insight, the right action, the right decision,… right now!
– Moves beyond simply providing answers, to generating new questions and hypotheses.
Predictive vs Prescriptive: What’s the Difference?
PREDICTIVE Analytics
Find a function (i.e., the model) f(d,t) that predicts the value of some predicted variable y = f(d,t) at a future time t, given the set of conditions found in the training data {d}.
=> Given {d}, find y.
PRESCRIPTIVE Analytics
Find the conditions {d’} that will produce a prescribed (desired, optimum) value y at a future time t, using the previously learned conditional dependencies among the variables in the predictive function f(d,t).
=> Given y, find {d’}.
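The two directions can be sketched with the simplest possible model. The example below assumes a single condition variable d, drops the time argument t, and uses a straight-line fit as the learned function f; the data and the helper names (`fit_line`, `predict`, `prescribe`) are made up for illustration. Predictive use runs the model forward, prescriptive use inverts it.

```python
def fit_line(ds, ys):
    """Ordinary least-squares fit of y = intercept + slope * d."""
    n = len(ds)
    mean_d = sum(ds) / n
    mean_y = sum(ys) / n
    slope = (sum((d - mean_d) * (y - mean_y) for d, y in zip(ds, ys))
             / sum((d - mean_d) ** 2 for d in ds))
    return slope, mean_y - slope * mean_d

def predict(model, d):
    """Predictive: given a condition d, find y."""
    slope, intercept = model
    return intercept + slope * d

def prescribe(model, target_y):
    """Prescriptive: given a desired y, find the condition d' that yields it."""
    slope, intercept = model
    return (target_y - intercept) / slope

# Hypothetical training data {d} with observed outcomes y.
model = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print("predicted y for d = 6:", predict(model, 6))
print("prescribed d' for y = 21:", prescribe(model, 21))
```

With a real multi-variable model the inversion is usually an optimization or search over {d’} rather than a one-line formula.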
Confucius says…
“Study your past to know
your future”
Baseball philosopher Yogi Berra says…
“The future ain’t what it
used to be.”
© Copyright 2016 Booz Allen Hamilton – http://www.boozallen.com/datascience
Data Analytics in Medicine & Health Administration
1. Benefits Administration improvement (“ACO = HIE + Analytics”: process mining, best practices, cost-efficiency, success metrics validation)
2. Do Not Pay initiatives (payment error / fraud analytics)
3. Beneficiary Recommendations ("Amazon-style" predictive analytics, prescriptive modeling)
4. Consumer Engagement (personalized online web experience, "marketing analytics")
5. Health Information Exchange (HIE) Exploitation (population health discovery, link analysis, ICD-10 mining)
6. Personalized Healthcare and Patient Wellness (wearables data-sharing/mining, health baselining)
7. Personalized/Precision Medicine and Care Coordination (EHR, HIE monitoring / mining)
8. Predictive Medicine (readmissions, complications, adverse interactions)
9. At-Risk Precursor Analytics (early warning signals of cancer, diabetes, heart disease, suicidal / mental health issues, ...)
10. Patient Trajectories Analysis (mining / segmentation of whole population EHR histories, pathways, outcomes, outliers)
11. Learning Health System Decision Support (advanced analytics embedded in health system data feeds)
12. What Question Should I Be Asking of My Data? (Cognitive Analytics)
Source for graphic: https://data-flair.training/blogs/machine-learning-applications/
Predictive Analytics is currently the most significant application of Machine Learning (*)
(*) The set of mathematical algorithms that learn (patterns) from experience (data)
Source for graphic: https://www.altexsoft.com/blog/datascience/machine-learning-strategy-7-steps/
Predictive Analytics is everywhere in Business Data and Machine Learning (AI) Strategy Discussions
Traditional Time Series Forecasting: Prediction based on historical patterns
Source: https://medium.com/99xtechnology/time-series-forecasting-in-machine-learning-3972f7a7a467
Traditional Time Series Forecasting: Autoregressive (uncertainty in prediction can be large)
Source: https://peltiertech.com/excel-fan-chart-showing-uncertainty-in-projections/
Uncertainty!
Traditional Time Series Forecasting: Autoregressive (assumes future time series values depend on the past values from the same series)
Source: http://ucanalytics.com/blogs/step-by-step-graphic-guide-to-forecasting-through-arima-modeling-in-r-manufacturing-case-study-example/
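A first-order autoregressive model, AR(1), is about the smallest example of this assumption: each value is a multiple of the previous one plus noise. The sketch below simulates such a series, estimates the coefficient by least squares, and shows how the forecast variance grows with the horizon. It assumes a zero-mean series; the true coefficient 0.8, the seed, and the helper names are illustrative and not taken from the cited sources.

```python
import random

def fit_ar1(series):
    """Least-squares estimate of phi and noise variance for x[t] = phi * x[t-1] + noise."""
    prev, curr = series[:-1], series[1:]
    phi = sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)
    residuals = [c - phi * p for p, c in zip(prev, curr)]
    sigma2 = sum(r * r for r in residuals) / len(residuals)
    return phi, sigma2

def forecast(last_value, phi, sigma2, horizon):
    """Return (point forecast, forecast variance) for each step ahead."""
    steps, value, variance = [], last_value, 0.0
    for _ in range(horizon):
        value = phi * value                       # forecast decays toward the mean
        variance = phi * phi * variance + sigma2  # uncertainty accumulates
        steps.append((value, variance))
    return steps

# Simulated series with a known coefficient (assumption for the demo).
rng = random.Random(7)
series = [0.0]
for _ in range(500):
    series.append(0.8 * series[-1] + rng.gauss(0, 1))

phi, sigma2 = fit_ar1(series)
steps = forecast(series[-1], phi, sigma2, horizon=10)
for h, (value, variance) in enumerate(steps, start=1):
    print(f"h={h:>2}  forecast={value:+.3f}  std={variance ** 0.5:.3f}")
```

The widening standard deviation is the same effect that fan charts display: the further ahead the forecast, the less the past constrains it.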
Traditional Time Series Forecasting: Even with very high-fidelity physics-based models, uncertainty in prediction can be large!
Source: https://www.reddit.com/r/weather/comments/6xecax/tracking_hurricane_irma/
Data Science provides insights into the future: to predict it and to change it!
Source for image: https://www.hausmanmarketingletter.com/translating-analytics-to-action/
Advances in Predictive, Prescriptive, and Cognitive Analytics provide us with More Ways to See Around Corners
Examples of Forecasting (seeing around corners)
1) Cognitive
2) Associations
3) Graphs
4) Clustering
“You can see a lot by just looking”
(and you can see around corners!)
Cognitive, Contextual, Insightful, Forecastful
https://www.speedcafe.com/2017/07/12/f1-demo-take-place-london-streets/
◼ Classic Textbook Example of Data Mining (Legend?): Data
mining of grocery store logs indicated that men who buy
diapers also tend to buy beer at the same time.
Association Discovery Example #1
◼ Amazon.com mines its customers’ purchase logs to
recommend books to you: “People who bought this book also
bought this other one.”
Association Discovery Example #2
◼ Netflix mines its video rental history database to recommend
rentals to you based upon other customers who rented similar
movies as you.
Association Discovery Example #3
◼ Wal-Mart studied product sales in their Florida stores in 2004
when several hurricanes passed through Florida.
◼ Wal-Mart found that, before the hurricanes arrived, people
purchased 7 times as many of {one particular product}
as they normally would.
Association Discovery Example #4
◼ Wal-Mart found that, before the hurricanes arrived, people
purchased 7 times as many strawberry Pop-Tarts
as they normally would.
Strawberry pop tarts???
http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html
http://www.hurricaneville.com/pop_tarts.html
http://bit.ly/1gHZddA
Association Rule Discovery for Hurricane Intensification Forecasting
• Research by GMU geoscientists
• Predict the final strength of hurricane at landfall.
• Find co-occurrence of final hurricane strength with specific values of measured physical properties of the hurricane while it is still over the ocean.
• Result: the association rule discovery prediction is better than National Hurricane Center prediction!
• Research Paper by GMU scientists: https://ams.confex.com/ams/pdfpapers/84949.pdf
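The association examples above all rest on the same three numbers for a rule such as "diapers => beer": support, confidence, and lift. Below is a minimal sketch on a handful of made-up shopping baskets. The baskets and the function name `rule_metrics` are assumptions for illustration and do not come from any of the cited studies.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent => consequent."""
    n = len(baskets)
    has_a = sum(1 for b in baskets if antecedent <= b)
    has_c = sum(1 for b in baskets if consequent <= b)
    has_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = has_both / n            # how often the items co-occur
    confidence = has_both / has_a     # how often the rule holds when it applies
    lift = confidence / (has_c / n)   # > 1 means more often than chance
    return support, confidence, lift

# Hypothetical transaction log: one set of items per basket.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"bread", "chips"},
    {"beer", "chips"},
    {"milk", "bread", "chips"},
]

support, confidence, lift = rule_metrics(baskets, {"diapers"}, {"beer"})
print(f"support={support:.3f}  confidence={confidence:.2f}  lift={lift:.2f}")
```

A lift above 1 is what makes a co-occurrence interesting rather than merely frequent.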
(Graphic by Cray, for Cray Graph Engine CGE)
http://www.cray.com/products/analytics/cray-graph-engine
“All the World is a Graph” – Shakespeare?
The natural data structure of the world is not rows and columns, but a Graph!
Simple Example of the Power of Graph: Semi-Metric Space
• Entity {1} is linked to Entity {2} (small distance A)
• Entity {2} is linked to Entity {3} (small distance B)
• Entity {1} is *not* linked directly to Entity {3} (Similarity Distance C = infinite)
• Similarity Distances between A, B, and C violate the triangle inequality!
{1} -- {2} -- {3}
• The connection between black hat entities {1} and {3} never appears explicitly
within a transactional database.
• Examples: (a) Medical Research Discoveries across disconnected journals,
through linked semantic assertions; (b) Customer Journey modeling; (c) Safety
Incident Causal Factor Analysis; (d) Marketing Attribution Analysis; (e) Fraud
networks, Illegal goods trafficking networks, Money-Laundering networks.
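The hidden connection between {1} and {3} is easy to surface once the links are stored as a graph. The sketch below uses a breadth-first search over a tiny adjacency map; the link list (including the unrelated pair 4 and 5) and the function name `shortest_path` are hypothetical.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search: a shortest chain of links from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical links: only pairs that appear together in some record.
links = [(1, 2), (2, 3), (4, 5)]
graph = {}
for a, b in links:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

directly_linked = 3 in graph[1]           # no record ever pairs 1 with 3
hidden_path = shortest_path(graph, 1, 3)  # but a two-hop chain exists
print(directly_linked, hidden_path)
```

A query over the original records would report no relationship between 1 and 3; the two-hop path is visible only after traversal.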
Customer Journey Science by Clickfox.com – The Journey Graph predicts Customer outcomes with high accuracy!
https://www.slideshare.net/Qualtrics/how-to-leverage-analytics-design-and-development-to-transform-customer-journeys
Clustering = the process of partitioning a set of data into subsets
(segments or clusters) such that a data element belonging to any
chosen cluster is more similar to data elements belonging to
that cluster than to data elements belonging to other clusters.
= Group together similar items + separate the dissimilar items
= Identify similar characteristics, patterns, or behaviors among
subsets of the data elements.
Challenges:
#1) No prior knowledge of the number of clusters.
#2) No prior knowledge of semantic meaning of the clusters.
#3) Different clusters are possible from the same data set!
#4) Different clusters are possible using different similarity metrics.
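Challenge #4 can be seen with just three points. In the sketch below, the same query point has a different nearest neighbour depending on whether similarity is measured by Euclidean distance or by cosine distance. The coordinates and helper names are invented for illustration.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    """1 - cosine similarity: compares direction and ignores magnitude."""
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1 - dot / norms

def nearest(query, candidates, metric):
    """Name of the candidate closest to the query under the given metric."""
    return min(candidates, key=lambda name: metric(query, candidates[name]))

query = (1.0, 1.0)
candidates = {
    "a": (1.5, 0.2),  # close in space, different direction
    "b": (5.0, 5.0),  # far in space, same direction
}

print("Euclidean nearest:", nearest(query, candidates, euclidean))
print("Cosine nearest:   ", nearest(query, candidates, cosine_distance))
```

Since clustering groups points by nearness, changing the metric can change which points end up together.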
How to know if your clusters are good enough
Reference: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-S2-S5
R code for validation algorithms: https://cran.r-project.org/web/packages/clValid/clValid.pdf
◼ You know the clusters are good …
◼ … if the clusters are compact relative to their separation
◼ … if the clusters are well separated from one another
◼ … if the “within cluster” errors are small (low variance within)
◼ … if the number of clusters is small relative to the number of data points
◼ Various measures of cluster compactness exist, including the Dunn index, the C-index, Silhouette analysis, and the DBI (Davies-Bouldin Index)
Application of Davies-Bouldin Index
◼ Assume K (the number of clusters) and assume other things (choice of clustering algorithm; the choice of clustering feature attributes; etc.)
◼ Measure DBI
◼ Test another set of values for the cluster input parameters (K, feature attributes, etc.)
◼ Measure DBI
◼ … continue iterating like this until you find the set of cluster input parameters that yields the best (minimum) value for DBI.
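The iteration described above can be sketched end to end in a few dozen lines. The example below is an assumption-laden illustration, not the analysis from the talk: it generates three synthetic 2-D blobs, runs a basic k-means with a deterministic farthest-point start, computes the Davies-Bouldin index for K = 2 to 5, and keeps the K with the smallest value. Only K is varied here; feature choice and algorithm are held fixed.

```python
import math
import random

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def centroid(points):
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def kmeans(points, k, iterations=20):
    """Basic k-means; farthest-point start keeps the sketch deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

def davies_bouldin(centers, clusters):
    """Mean over clusters of the worst (scatter_i + scatter_j) / separation_ij."""
    scatter = [sum(dist(p, c) for p in pts) / len(pts) if pts else 0.0
               for c, pts in zip(centers, clusters)]
    k = len(centers)
    worst = [max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

# Synthetic data: three well-separated blobs (assumption for the demo).
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(50)]

dbi_by_k = {k: davies_bouldin(*kmeans(points, k)) for k in range(2, 6)}
best_k = min(dbi_by_k, key=dbi_by_k.get)
for k, dbi in dbi_by_k.items():
    print(f"K={k}  DBI={dbi:.3f}")
print("best K:", best_k)
```

In practice the same loop would also range over candidate feature sets and clustering algorithms, as the slide suggests.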
Scientific Discovery from
Cluster Analysis of data
parameters from events on
the Sun and around the Earth
Cluster Analysis: Find the clusters, then Evaluate them
[Figure: “DBI for Dst_Vsw_Bz”. Davies-Bouldin index (y-axis, 0.8 to 1.2) versus time shift in hours (x-axis, 0 to 12; delay of Dst from Vsw and Bz), with curves for 2, 3, and 4 clusters (2C, 3C, 4C DBI) and their average.]
Figure 10. Davies-Bouldin index for various time delays of Dst from Vsw and Bz for cases of 2 (blue), 3 (red), 4 (yellow) clusters, and the overall average (purple), indicating an optimal delay of ~2-3 hours for Dst.
Good Clusters =
Small Size relative to
Cluster Separation.
DISCOVERY! ...
Solar wind events
have the strongest
association (i.e., the
tightest clusters) with
the space plasma
events within the
Earth’s magnetosphere
about 2-4 hours after
a major plasma outburst
occurs on the Sun.
Next Steps…
Welcome to the new Hype 2018!
https://marketoonist.com/2018/01/blockchain.html
https://datasciencebowl.com
Harness your Data Science Passion. Unleash your Curiosity.
Focus on a larger Purpose using #Data4Good and #AI4socialgood in #DataSciBowl.
75% of rare diseases affect children. (Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3150084/)
Data Science Bowl – largest global competition in DS (summary statistics for Data Science Bowls 2015-2018)
Thank you!
Contact information, for further questions or inquiries:
Dr. Kirk Borne, Principal Data Scientist, Booz Allen Hamilton
Twitter: @KirkDBorne or Email: [email protected]
Get slides here: http://www.kirkborne.net/phuse2018/
Booz | Allen | Hamilton