future forward with data sciencekirkborne.net/phuse2018/kirkborne-phuse-nov2018.pdf · feature...

Principal Data Scientist

Booz Allen Hamilton

http://www.boozallen.com/datascience

Kirk Borne @KirkDBorne

Future Forward with Data Science: How to Predict (and to Change) the Future

Ever since we first explored our world…

http://www.livescience.com/27663-seven-seas.html

…We have asked questions about everything around us.

https://jefflynchdev.wordpress.com/tag/adobe-photoshop-lightroom-3/page/5/

So, we have collected evidence (data) to answer our questions,

which leads to more questions, which leads to more data collection,

which leads to more questions, which leads to BIG DATA!

y ~ 2 * x (linear growth)

y ~ 2 ^ x (exponential growth)

https://www.linkedin.com/pulse/exponential-growth-isnt-cool-combinatorial-tor-bair

y ~ x! ≈ x ^ x → Combinatorial Growth! (all possible interconnections, linkages, and interactions)
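To make the three growth rates above concrete, here is a minimal sketch that tabulates them side by side. The function name `growth_table` and the sample values of x are illustrative choices, not part of the original slides.

```python
import math

def growth_table(xs):
    """Return (x, linear, exponential, factorial) rows for each x."""
    return [(x, 2 * x, 2 ** x, math.factorial(x)) for x in xs]

# Illustrative sample points (an assumption, chosen only to show the trend).
rows = growth_table([1, 2, 5, 10, 20])
for x, lin, exp, fact in rows:
    print(f"x={x:>2}  2*x={lin:>3}  2^x={exp:>8}  x!={fact}")
```

By x = 20 the factorial column is already many orders of magnitude beyond the exponential one, which is the point of the "combinatorial growth" slide.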

Defining Big Data

• The 3 V’s of Big Data are not just hype…

• They represent really big challenges:

1. Volume

2. Velocity

3. Variety

Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/

✓ VALUE!

Our mission as data scientists is to discover Value in Big Data (especially in high-Variety data) through Data Science and Machine Learning.

What is the Big Data Variety Challenge?

Source for graphic: http://www.vitria.com/blog/Big-Data-Analytics-Challenges-Facing-All-Communications-Service-Providers/

1. We collect many different sources of data.

2. But we usually store diverse data in separate silos.

3. Therefore, we cannot easily integrate the data to combine them for unified insight.

Consider the Blind Men and the Elephant…

Adding more data doesn’t necessarily help…

https://paulmead.com.au/blog/understand-perceptions/

Unless we can combine and integrate the different signals

into a “single view” of the thing, there will continue to be

many possible interpretations of what the source is!

Combining, connecting, and linking diverse data makes data “smart”!

Think of data not as information, but as measurements that encode knowledge.

Feature Selection is important in order to disambiguate different classes. More importantly, Class Discovery depends on choosing the right projection and selecting the right features!

Feature Selection and Projection

Your chosen data attributes represent a low-dimension projection of the full truth – the feature space (dimensions) in which you explore your data is a form of bias – it matters!
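A small sketch of why the projection matters: two classes that overlap completely along one feature can be cleanly separated along another. The toy points and the `separation` score (difference of class means divided by the summed spreads) are assumptions made for illustration only.

```python
from statistics import mean, pstdev

# Hypothetical toy data: the two classes differ only in the second feature.
class_a = [(1.0, 0.1), (2.0, -0.2), (3.0, 0.3), (4.0, 0.0)]
class_b = [(1.5, 5.1), (2.5, 4.8), (3.5, 5.2), (4.5, 5.0)]

def separation(a, b, feature):
    """Gap between class means along one feature, in units of the summed spreads."""
    va = [p[feature] for p in a]
    vb = [p[feature] for p in b]
    return abs(mean(va) - mean(vb)) / (pstdev(va) + pstdev(vb))

sep_x = separation(class_a, class_b, feature=0)  # projection onto feature 0
sep_y = separation(class_a, class_b, feature=1)  # projection onto feature 1
print(f"feature 0: {sep_x:.2f}   feature 1: {sep_y:.2f}")
```

Viewed only through feature 0, the classes look like one population; viewed through feature 1, they are obviously distinct.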

Projection Matters

Feature Selection and Model Bias: choosing features in the dark

I picked out two socks from my sock drawer this morning!

It was still dark, but that shouldn’t matter, right? After all, they are the same size … THE SAME ?!?

The Era of Big Data represents the END OF DEMOGRAPHICS (i.e., our models should no longer be based on and biased by a limited selection of attributes and features)

An “Easy Button” for Extracting Value from Data through Machine Learning

• Pattern Discovery (Detection)

– D2D: data-to-discovery

• Pattern Recognition

– D2D: data-to-decisions

• Pattern Exploration

– D2D: data-to-dollars (innovation)

• Pattern Exploitation

– D2V: Data-to-Value (action)

– D2A: Data-to-Action (value)

Pattern Discovery is easy, but Pattern Exploitation requires more data science…

Source for graphic: http://www.holehouse.org/mlclass/10_Advice_for_applying_machine_learning.html

Generalization is key!

(The Goldilocks model)

The most generally useful model captures the fundamental pattern in the data and takes into account the natural variance in the data.
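The Goldilocks idea can be sketched with a deliberately simple stand-in model: k-nearest-neighbor regression, where k controls model complexity. This is not the model shown in the slide graphic; the sine-shaped "true pattern", the noise level, and the values of k are all assumptions chosen for illustration.

```python
import math
import random

def make_data(n, rng, noise=0.2):
    """Noisy samples of a hypothetical true pattern y = sin(2*pi*x)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, noise)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model (fit on train) evaluated on data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(60, rng)
test = make_data(200, rng)

# k=1 memorizes the noise, k=len(train) ignores the pattern, k=5 sits in between.
errors = {k: (mse(train, train, k), mse(train, test, k)) for k in (1, 5, len(train))}
for k, (train_err, test_err) in errors.items():
    print(f"k={k:>2}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The k=1 model has zero training error but a clear gap on unseen data, while the k=60 model predicts the same global average everywhere. The intermediate model is the one that generalizes.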

The Goal of Machine Learning

“…is to use algorithms to learn from data,

in order to build generalizable models that

give accurate classifications or predictions,

or to find (useful) patterns, particularly

with new and previously unseen data.”

(the key is GENERALIZATION!)

https://www.innoarchitech.com/machine-learning-an-in-depth-non-technical-guide/

4 Flavors of Machine Learning for Pattern Detection and Discovery

1) Class Discovery (Clustering): Find the categories of objects (population segments), events, and behaviors in your data. + Learn the rules that constrain the class boundaries (that uniquely distinguish them).

2) Correlation (Predictive and Prescriptive Power) Discovery: Find trends, patterns, and dependencies in data that reveal new governing principles or behavioral patterns (the object’s “DNA”).

3) Novelty (Surprise!) Discovery: Find the new, surprising, unexpected one-in-a-[million / billion / trillion] object, event, or behavior.

4) Association (or Link) Discovery: (Graph and Network Analytics) – Find the unusual (interesting) data associations / links / connections across entities in your domain.
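As one possible illustration of Novelty Discovery (flavor 3), the sketch below flags values that sit far from the bulk of the data using a median-based "modified z-score". The function name, the sample readings, and the conventional constants 0.6745 and 3.5 are assumptions for this example; the slides do not prescribe a specific method.

```python
from statistics import median

def robust_outliers(values, threshold=3.5):
    """Return values whose modified z-score (median / MAD based) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical sensor readings with one surprising value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 42.0, 10.1, 9.7]
print(robust_outliers(readings))
```

A median-based score is used here because a single extreme value would inflate an ordinary mean and standard deviation and could hide itself.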

5 Levels of Analytics Maturity in Data-Driven Applications

1) Descriptive Analytics

– Hindsight (What happened?)

2) Diagnostic Analytics

– Oversight (real-time / What is happening? Why did it happen?)

3) Predictive Analytics

– Foresight (What will happen?)

4) Prescriptive Analytics

– Insight (How can we optimize what happens?) (Follow the dots / connections in the graph!)

5) Cognitive Analytics

– Right Sight (the 360 view; what is the right question to ask for this set of data in this context = Game of Jeopardy)

– Finds the right insight, the right action, the right decision, … right now!

– Moves beyond simply providing answers, to generating new questions and hypotheses.

PREDICTIVE Analytics

Find a function (i.e., the model) f(d,t) that predicts the value of some predicted variable y = f(d,t) at a future time t, given the set of conditions found in the training data {d}.

=> Given {d}, find y.

PRESCRIPTIVE Analytics

Find the conditions {d’} that will produce a prescribed (desired, optimum) value y at a future time t, using the previously learned conditional dependencies among the variables in the predictive function f(d,t).

=> Given y, find {d’}.
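The two directions ("given {d}, find y" versus "given y, find {d’}") can be sketched with the simplest possible learned model, a straight line fit to one condition. The training values, the names `predict` and `prescribe`, and the omission of the time variable t are simplifying assumptions for illustration.

```python
def fit_line(ds, ys):
    """Ordinary least-squares fit of y = a + b*d for a single condition d."""
    n = len(ds)
    mean_d = sum(ds) / n
    mean_y = sum(ys) / n
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(ds, ys))
    sxx = sum((d - mean_d) ** 2 for d in ds)
    b = sxy / sxx
    return mean_y - b * mean_d, b

# Hypothetical training data {d}: one controllable condition and its outcome.
conditions = [1.0, 2.0, 3.0, 4.0, 5.0]
outcomes = [2.1, 3.9, 6.2, 7.8, 10.0]
a, b = fit_line(conditions, outcomes)

def predict(d):
    """Predictive: given a condition d, estimate the outcome y."""
    return a + b * d

def prescribe(y_target):
    """Prescriptive: given a desired outcome y, solve the learned model for d'."""
    return (y_target - a) / b

print(f"predict(6.0)   = {predict(6.0):.2f}")
print(f"prescribe(8.0) = {prescribe(8.0):.2f}")
```

With a real multi-variable model the prescriptive step is usually an optimization or search over {d’} rather than a closed-form inversion, but the same learned function f is reused in both directions.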

Predictive vs Prescriptive Analytics: What’s the Difference?

PREDICTIVE Analytics

Confucius says… “Study your past to know your future”

PRESCRIPTIVE Analytics

Baseball philosopher Yogi Berra says… “The future ain’t what it used to be.”

Data Analytics in Medicine & Health Administration

1. Benefits Administration improvement (“ACO = HIE + Analytics”: process mining, best practices, cost-efficiency, success metrics validation)

2. Do Not Pay initiatives (payment error / fraud analytics)

3. Beneficiary Recommendations ("Amazon-style" predictive analytics, prescriptive modeling)

4. Consumer Engagement (personalized online web experience, "marketing analytics")

5. Health Information Exchange (HIE) Exploitation (population health discovery, link analysis, ICD-10 mining)

6. Personalized Healthcare and Patient Wellness (wearables data-sharing/mining, health baselining)

7. Personalized/Precision Medicine and Care Coordination (EHR, HIE monitoring / mining)

8. Predictive Medicine (readmissions, complications, adverse interactions)

9. At-Risk Precursor Analytics (early warning signals of cancer, diabetes, heart disease, suicidal / mental health issues, ...)

10. Patient Trajectories Analysis (mining / segmentation of whole population EHR histories, pathways, outcomes, outliers)

11. Learning Health System Decision Support (advanced analytics embedded in health system data feeds)

12. What Question Should I Be Asking of My Data? (Cognitive Analytics)

Source for graphic: https://data-flair.training/blogs/machine-learning-applications/

Predictive Analytics is currently the most significant application of Machine Learning (*)

(*) The set of mathematical algorithms that learn (patterns) from experience (data)

Source for graphic: https://www.altexsoft.com/blog/datascience/machine-learning-strategy-7-steps/

Predictive Analytics is everywhere in Business Data and Machine Learning (AI) Strategy Discussions

Traditional Time Series Forecasting: Prediction based on historical patterns

Source: https://medium.com/99xtechnology/time-series-forecasting-in-machine-learning-3972f7a7a467

Traditional Time Series Forecasting: Autoregressive (uncertainty in prediction can be large)

Source: https://peltiertech.com/excel-fan-chart-showing-uncertainty-in-projections/

Traditional Time Series Forecasting: Autoregressive (assumes future time series values depend on the past values from the same series)

Source: http://ucanalytics.com/blogs/step-by-step-graphic-guide-to-forecasting-through-arima-modeling-in-r-manufacturing-case-study-example/
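The autoregressive assumption (future values depend on past values of the same series) can be sketched with the simplest case, a first-order AR(1) model. The simulated series, the true coefficient 0.8, and the function names are assumptions for illustration; the ARIMA workflow in the linked source is more elaborate.

```python
import random

def simulate_ar1(phi, n, rng, noise=1.0):
    """Generate x[t] = phi * x[t-1] + Gaussian noise, starting from 0."""
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0, noise))
    return series

def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    prev, curr = series[:-1], series[1:]
    return sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)

def forecast(series, phi, steps):
    """Point forecasts: each step multiplies the last value by phi."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = phi * last
        out.append(last)
    return out

rng = random.Random(42)
series = simulate_ar1(phi=0.8, n=2000, rng=rng)
phi_hat = fit_ar1(series)
print(f"estimated phi = {phi_hat:.3f}")
print("next 5 point forecasts:", [round(v, 2) for v in forecast(series, phi_hat, 5)])
```

The point forecasts decay toward zero while the noise keeps accumulating, which is one way to see why forecast uncertainty widens with the horizon.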

Traditional Time Series Forecasting: Even with very high-fidelity physics-based models, uncertainty in prediction can be large!

Source: https://www.reddit.com/r/weather/comments/6xecax/tracking_hurricane_irma/

Data Science provides insights into the future: to predict it and to change it!

Source for image: https://www.hausmanmarketingletter.com/translating-analytics-to-action/

Advances in Predictive, Prescriptive, and Cognitive Analytics provide us with More Ways to See Around Corners

Examples of Forecasting (seeing around corners)

1) Cognitive

2) Associations

3) Graphs

4) Clustering

1) Cognitive

“You can see a lot by just looking”

(and you can see around corners!)

Cognitive, Contextual, Insightful, Forecastful

https://www.speedcafe.com/2017/07/12/f1-demo-take-place-london-streets/

2) Associations

Association Discovery Example #1

◼ Classic Textbook Example of Data Mining (Legend?): Data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time.

◼ Amazon.com mines its customers’ purchase logs to recommend books to you: “People who bought this book also bought this other one.”

◼ Netflix mines its video rental history database to recommend rentals to you based upon other customers who rented movies similar to yours.
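The kind of co-purchase mining described in these examples is usually scored with support, confidence, and lift. Below is a minimal sketch over a made-up transaction log; the baskets and the function name `pair_rules` are hypothetical and are not data from any of the retailers named above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction log: each basket is the set of items bought together.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]

def pair_rules(baskets):
    """Return {(a, b): (support, confidence, lift)} for every rule a -> b."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        for x, y in ((a, b), (b, a)):
            support = count / n                       # P(x and y)
            confidence = count / item_counts[x]       # P(y | x)
            lift = confidence / (item_counts[y] / n)  # P(y | x) / P(y)
            rules[(x, y)] = (support, confidence, lift)
    return rules

rules = pair_rules(baskets)
support, confidence, lift = rules[("diapers", "beer")]
print(f"diapers -> beer: support={support:.2f} confidence={confidence:.2f} lift={lift:.3f}")
```

A lift above 1 means the two items appear together more often than independent purchasing would predict, which is the signal an association miner looks for.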

◼ Wal-Mart studied product sales in their Florida stores in 2004 when several hurricanes passed through Florida.

◼ Wal-Mart found that, before the hurricanes arrived, people purchased strawberry Pop-Tarts at about 7 times the normal sales rate.

Strawberry Pop-Tarts???

http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html

http://www.hurricaneville.com/pop_tarts.html

http://bit.ly/1gHZddA

Association Rule Discovery for Hurricane Intensification Forecasting

• Research by GMU geoscientists

• Predict the final strength of hurricane at landfall.

• Find co-occurrence of final hurricane strength with specific values of measured physical properties of the hurricane while it is still over the ocean.

• Result: the association rule discovery prediction is better than National Hurricane Center prediction!

• Research Paper by GMU scientists: https://ams.confex.com/ams/pdfpapers/84949.pdf

3) Graphs

(Graphic by Cray, for Cray Graph Engine CGE)

http://www.cray.com/products/analytics/cray-graph-engine

“All the World is a Graph” – Shakespeare?

The natural data structure of the world is not rows and columns, but a Graph!

Simple Example of the Power of Graph: Semi-Metric Space

• Entity {1} is linked to Entity {2} (small distance A)

• Entity {2} is linked to Entity {3} (small distance B)

• Entity {1} is *not* linked directly to Entity {3} (Similarity Distance C = infinite)

• Similarity Distances A, B, and C violate the triangle inequality (C > A + B)!

{1} – {2} – {3}

• The connection between black hat entities {1} and {3} never appears explicitly within a transactional database.

• Examples: (a) Medical Research Discoveries across disconnected journals, through linked semantic assertions; (b) Customer Journey modeling; (c) Safety Incident Causal Factor Analysis; (d) Marketing Attribution Analysis; (e) Fraud networks, Illegal goods trafficking networks, Money-Laundering networks.
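The hidden {1} to {3} connection can be surfaced by treating the link table as a graph and searching for paths. This is a minimal sketch using breadth-first search; the edge list and function names are hypothetical, and a production system would use a graph database or graph engine instead.

```python
from collections import deque

# Hypothetical link table: only direct relationships are ever recorded.
edges = [(1, 2), (2, 3), (4, 5)]

def build_graph(edges):
    """Build an undirected adjacency map from (a, b) link records."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the nodes on a shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

graph = build_graph(edges)
print("direct link 1-3 recorded?", 3 in graph[1])
print("path from 1 to 3:", shortest_path(graph, 1, 3))
```

No single row of the link table mentions both {1} and {3}, yet the path through {2} is found immediately once the rows are connected into a graph.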

Customer Journey Science by Clickfox.com – The Journey Graph predicts Customer outcomes with high accuracy!

https://www.slideshare.net/Qualtrics/how-to-leverage-analytics-design-and-development-to-transform-customer-journeys

4) Clustering

Clustering = the process of partitioning a set of data into subsets (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging to that cluster than to data elements belonging to other clusters.

= Group together similar items + separate the dissimilar items

= Identify similar characteristics, patterns, or behaviors among subsets of the data elements.

Challenges:

#1) No prior knowledge of the number of clusters.

#2) No prior knowledge of semantic meaning of the clusters.

#3) Different clusters are possible from the same data set!

#4) Different clusters are possible using different similarity metrics.

How to know if your clusters are good enough

Reference: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-S2-S5

R code for validation algorithms: https://cran.r-project.org/web/packages/clValid/clValid.pdf

◼ You know the clusters are good …

◼ … if the clusters are compact relative to their separation

◼ … if the clusters are well separated from one another

◼ … if the “within cluster” errors are small (low variance within)

◼ … if the number of clusters is small relative to the number of data points

◼ Various measures of cluster compactness exist, including the Dunn index, the C-index, Silhouette analysis, and the DBI (Davies-Bouldin Index)

Application of Davies-Bouldin Index

◼ Assume K (the number of clusters) and assume other things (choice of clustering algorithm; the choice of clustering feature attributes; etc.)

◼ Measure DBI

◼ Test another set of values for the cluster input parameters (K, feature attributes, etc.)

◼ Measure DBI

◼ … continue iterating like this until you find the set of cluster input parameters that yields the best (minimum) value for DBI.
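The iteration described above can be sketched end to end: cluster with several values of K, measure the DBI each time, and keep the K with the minimum. The k-means routine, the farthest-point initialization, and the three synthetic blobs are assumptions made so the example is self-contained; they are not the setup used in the study discussed on the following slides.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, iterations=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def davies_bouldin(clusters):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    centers = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centers)]
    worst = []
    for i in range(len(clusters)):
        worst.append(max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters)) if j != i
        ))
    return sum(worst) / len(worst)

# Hypothetical data: three well-separated 2-D blobs.
rng = random.Random(1)
blob_centers = [(0, 0), (10, 0), (0, 10)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

# Try each K, measure DBI, keep the minimum.
scores = {k: davies_bouldin(kmeans(points, k)) for k in range(2, 6)}
best_k = min(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}  DBI={score:.3f}")
print("best K:", best_k)
```

The same loop extends to the other input parameters mentioned above (feature attributes, clustering algorithm): vary one, recompute the DBI, and compare.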

Scientific Discovery from Cluster Analysis of data parameters from events on the Sun and around the Earth

Cluster Analysis: Find the clusters, then Evaluate them

[Figure: "Delay (hr) of Dst from Vsw and Bz". DBI for Dst_Vsw_Bz plotted against Time Shift (0 to 12), with series 2C DBI, 3C DBI, 4C DBI, and Average.]

Figure 10. Davies-Bouldin index for various time delays of Dst from Vsw and Bz for cases of 2 (blue), 3 (red), 4 (yellow) clusters, and the overall average (purple), indicating an optimal delay of ~2-3 hours for Dst.

Good Clusters = Small Size relative to Cluster Separation.

DISCOVERY! ... Solar wind events have the strongest association (i.e., the tightest clusters) with the space plasma events within the Earth’s magnetosphere about 2-4 hours after a major plasma outburst occurs on the Sun.

Next Steps…

Welcome to the new Hype 2018!

https://marketoonist.com/2018/01/blockchain.html

https://datasciencebowl.com

Harness your Data Science Passion. Unleash your Curiosity.

Focus on a larger Purpose using #Data4Good and #AI4socialgood in #DataSciBowl.

75% of rare diseases affect children.

** https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3150084/

Data Science Bowl – largest global competition in DS (summary statistics for Data Science Bowls 2015-2018)

Thank you!

Contact information, for further questions or inquiries:

Dr. Kirk Borne, Principal Data Scientist, Booz Allen Hamilton

Twitter: @KirkDBorne or Email: kirk.borne@gmail.com

Get slides here: http://www.kirkborne.net/phuse2018/

Booz | Allen | Hamilton
