What will you know at the end of 60 minutes?
• Data Science Driven Disruptions
• Data Science Demystified (in 8 minutes)
• Data Science Opportunities & Career Paths
• Hands-on Lab: Clustering
• Data Science Roadmap
Apart from that, we will also cover …
• An overview of the shift to Data Science Platforms
• The 3 critical components of a Data Science platform
• Industries that are most likely to get disrupted and shift to Data Science
• Characteristics of firms that get left behind by the Data Science wave
• Factors that push an industry towards Data Science
• A brief overview of aspects of platform architecture beyond technology
5 Disruptions
1. Japanese dating app
2. Heart implants
3. MOOC
4. Sensored cows in the Netherlands
5. Google's autonomous car
What's common to these game-changing solutions?
At the core there is a deep embedded DATA PRODUCT !
Created by DATA SCIENCE !
The world around us is changing… Our lives are intimately surrounded by data products (an intimate fabric of our lives)
• How our health gets cared for
• How we learn
• How we fall in love
• How we do farming
• How we drive
How did the following players disrupt the marketplace?
• Amazon defeated Borders (Books)
• Netflix defeated Blockbuster (Video)
• iTunes defeated Tower Records (Music)
• Google defeated Yahoo (Search) – PageRank algorithm
Analytical Models disrupting Business Models
If Data Science is not integral you are no longer in the game
What's the secret sauce ?
Ability to “see” patterns FASTER than competition is key to SURVIVAL !!!
2. Demystifying Data Science (in simple, plain, everyday English)
Data Science vs. BI
• BI: Known Unknowns
• Data Science: Unknown Unknowns
• Lots of $-impacting patterns lie unnoticed, waiting to be discovered!
“As is” state in most organizations
Data (Sales, Finance) → Reports (BO, Cognos, MSAS)
“As is” state with leading game changers
Data repository → Analytics cell + Modeling processes (Segment, Score, Text mine) → Insights
Move from Reports to Insightful Actions that Impact
What are the 4 core differences between Data Science & Dashboards?
Dashboards: Data repository → Dashboards
Data Science:
• Data repository (Purchase habits)
• Signal (Similar people discovery)
• ML process (Collaborative filtering)
• Actions (Recommend a product)
• Outcomes (Improve cross sell)
ML + Signals + Actions = Game Changing Outcomes
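As a rough illustration of the chain from purchase habits to "similar people" to a recommended product, here is a minimal user-based collaborative filtering sketch in Python. The customer names, products, and the choice of cosine similarity over purchase sets are all assumptions made for this example; the slides do not specify an implementation.

```python
from math import sqrt

# Hypothetical purchase history (data repository): customer -> products bought.
purchases = {
    "asha": {"tent", "stove", "lantern"},
    "ben": {"tent", "stove", "boots"},
    "chen": {"novel", "bookmark"},
}


def similarity(a, b):
    """Signal: cosine similarity between two sets of purchased items."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(customer, purchases, top_n=3):
    """Action: suggest items bought by similar customers but not by this one."""
    own = purchases[customer]
    scores = {}
    for other, items in purchases.items():
        if other == customer:
            continue
        sim = similarity(own, items)
        if sim == 0:
            continue  # customers with nothing in common carry no signal
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("asha", purchases))  # ['boots']
```

The outcome (improved cross sell) would then be measured by whether recommended products are actually bought, which this sketch does not cover.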
Data Science processes can work on 2 types of big data
• Human generated
• Machine generated
What exactly is a model?
• A mathematical definition of a real-world phenomenon
• Representative of the real world
• For example, a cross-sell model
What are 3 things predictive models and caricatures have in common?
• It's an approximation, not perfection
• It's better than not having anything
• It gets the job done
REAL WORLD
ANALYTICAL MODEL
Demystifying Machine Learning
It's all about DETECTING PATTERNS !
What's the goal of Data Science?
Use data to discover signals (patterns) that cause changes that impact $.
1. Segmentation
2. Unstructured Text Mining
Real world Unstructured text mining in health care
Inputs: Doctors' transcripts, Lab diagnostics, Nurses' observations
Step-1 : SPLIT (split sentences into words/tokens)
Step-2 : FILTER (filter “noise” words, e.g. I, the, is, was)
Step-3 : STEMMING (‘Pulmonary’ = ‘pulmonar’; ‘Insomnia’ = ‘Sleep’ = ‘Sleeplessness’)
Step-4 : THEME EXTRACTION (keyword extraction & theme generation)
Step-5 : THEME / KEYWORD ANALYSIS
Watch lists: Cardiac, Oncology, Pulmonary, Diabetic, Schizophrenia
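The split, filter, stem, and theme steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word set, the synonym map, the crude suffix-stripping "stemmer", and the watch-list keywords are all hypothetical, and a real clinical pipeline would use a proper stemmer and a curated vocabulary.

```python
import re
from collections import Counter

# Hypothetical noise words, synonym map and watch lists, for illustration only.
STOP_WORDS = {"i", "the", "is", "was", "a", "of", "and", "has", "with"}
SYNONYMS = {"insomnia": "sleep", "sleeplessness": "sleep"}
WATCH_LISTS = {
    "pulmonary": {"pulmonar", "cough"},
    "cardiac": {"cardiac", "chest"},
}


def split(text):
    """Step 1: split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_noise(tokens):
    """Step 2: drop noise words."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Step 3: fold synonyms, then strip a trailing 'y' or 's' (very crude)."""
    token = SYNONYMS.get(token, token)
    if len(token) > 4 and token[-1] in ("y", "s"):
        return token[:-1]
    return token


def analyse(text):
    """Steps 4 and 5: keyword frequencies plus the watch lists they trigger."""
    stems = [stem(t) for t in filter_noise(split(text))]
    keywords = Counter(stems)
    flagged = sorted(
        name for name, words in WATCH_LISTS.items() if words & set(stems)
    )
    return {"keywords": keywords, "watch_lists": flagged}


note = "The patient has pulmonary congestion and insomnia. Sleeplessness was reported."
result = analyse(note)
print(result["keywords"].most_common(3))
print(result["watch_lists"])
```

Folding "insomnia" and "sleeplessness" into one token is what lets the keyword count surface "sleep" as a theme even though the note never uses that word.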
3. Scoring Models
4. Forecasting !
5. Recommenders
Data Science Reference Architecture – Key components
Hadoop
Hive
Hana
Infobright
Clustering
Text mining
Mobile
Digital
Data Ingestion Pipeline
Machine Learning Reference Architecture
1. STORE (Hadoop, Hive, HANA, Cloudera, Splunk, Hortonworks)
2. SENSE (signal extraction: text mining, scoring models)
3. RESPOND (front-line actions through website, call centre)
Polyglot persistence architecture
Logical Business Model: Asset, Sensor, Parameters, Location, Sensor tags, Events
• Column family (HBase/Cassandra): insert-heavy workloads
• Document db (Mongo): XML messages
• Graph db (Neo4j): inter-relationships
• RDBMS (Oracle): low-velocity self service
Snapshot of Machine Learning Techniques
1. Segmentation
• Customer behavior segmentation
• Defect segmentation
• Employee segmentation model
• Supplier segmentation model
• “Chunking” groups
• Discovered by algorithm
2. Text mining
• Convert messy unstructured text into actionable signals
• Keyword frequencies
• Sentiment ratios
• Blogs
• Call center transcripts
• Emails
• Multi-channel sentiment analysis
3. Forecasting
• Predict CLTV
• Predict sales at a neighborhood outlet
• Predict salary based on experience, qualification, rating, market demand
• Identify drivers of behavior
• Weights processing
4. Visual Analytics
• Beyond line, bar, pie charts
• Geospatial modeling to see geo correlation
• Spread analysis
• Outlier detection
5. Scoring models
• Churn propensity
• Cross sell
• Attrition modeling in HR
• Risk scoring models in Banking
• Logistic regression
• Neural networks
• Decision trees
• Support Vector Machines
6. Optimisation
• Constraint modeling
• Maximize an outcome
• Maximize sales without cannibalizing sister brands
Why is Data Science HOT ?
DRIP: Data Rich, Insight Poor !
Data sources: POS data, Campaign, Trade promotions, Returns, Competitive info, Call center, Loyalty, Survey, Supplier, Claims, Policy, Payments, Fraud, Channel data, Broker, Warranty claims, Media reach data, Payment info, Event info
Organisations are drowning in too much data !
Data Science jobs are Exploding!
Data Science Jobs exploding in India too !
“By 2018, the United States alone could face a shortage of 140,000 to 190,000
people with deep analytical skills as well as 1.5 million managers and analysts
with the know-how to use the analysis of big data to make effective decisions”
McKinsey & Company: Big data: The next frontier for innovation, competition, and productivity
Data Science = PASSPORT to Global Market !
So what does a Data Scientist really do? The 3 Hats
1. Data Hat
2. Math Hat
3. Business Hat
Clustering Deep Dive !
3.Segmentation – The idea in brief
Break data into various “chunks”
The analyst picks the number of clusters through an iterative process,
looking for uniqueness between the segments
Types of segmentation
Demographic segmentation
Need based segmentation
Behavior based segmentation
Statistical techniques
K Means
Hierarchical clustering
Discriminant analysis
Segmentation – Business questions answered
1. What are the behavioral personas of customers that lie buried in my raw customer transactions in the database?
2. Which specific customer behavior discriminates a high-value segment from a low-value segment?
3. How do customer behavior segments migrate across time, and what does it reveal to us?
A real life customer segmentation case study
• Customer Context
• A large owner of fleets in the US
• Each truck driver given a fuel card
• Driver info + Mileage + Refuelling behaviour + Location
• Customer Challenge
• Aligning Service Models to Customer Segments
• Drive Growth & Ability to Cross-sell & Up-sell
• Data Science Technique
• K-means clustering
• Analysed over 120,000,000 Customer Records & Profiles
• Analysed over 110,000,000 Customer Service Rep Comments
Behavioral components considered for fleet card segmentation
Fleet related master data
1. Fleet id
2. SIC Code
3. No of trucks in fleet
4. No of drivers/cards
Fleet spend data
1. Avg_Gallons_Per_Month
2. Avg_Spend_on_non_fuel
3. Avg_Transaction_Per_Month
4. Total_Active_Cards
5. MOM (3 months) growth (gallons)
6. Avg_Credit_utilization (3 months)
Current product holdings flag
1. Has_OPIS_Suite_of_Reports_flag
2. Has_EFPS_Discount
3. Has_Smart
4. Has_Rewards
5. Has_Screen_Now_Report
6. Has_Volume_or_Service_Discount
7. Has_Exception_Reports
Touch point data
1. Avg No of inbound calls per month
2. Recency of last call
3. Total no of phone calls per year
Dimensions of fleet behavior measured and segmented
Segment | Av. Gallons pcpm | Av. No. of Transactions pcpm | Av. Ancillary Revenue | Av. No. of Retention Calls | Av. Late Fees | Av. Activation Rate | Av. MOM Growth
Population | 106.4 | 5.5 | 248.8 | 0.7 | 324.1 | 0.7 | 1.03
Stable Underdogs | 62.5 | 3.6 | 83.4 | 0.3 | 87.7 | 0.5 | 1.05
Miniature Laggards | 77.3 | 4.6 | 90.0 | 0.2 | 118.3 | 0.9 | 1.03
Cash Cows | 179.9 | 8.0 | 1098.6 | 0.6 | 2965.3 | 0.7 | 0.96
Dark Horses | 122.3 | 6.1 | 1196.2 | 0.4 | 534.6 | 0.7 | 0.93
Sulking Mediocres | 101.5 | 5.2 | 63.2 | 4.6 | 279.2 | 0.6 | 1.07
Front-runners | 276.4 | 12.4 | 163.8 | 0.7 | 371.0 | 0.7 | 1.04
Note: legend categories are Undesirable Behavior, Average, Desirable Behavior
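One way to read the segment table is to divide each segment average by the population average and label the result. The Python sketch below does that for two of the segments, using the figures from the table; the metric key names and the 10% tolerance band for "near average" are assumptions chosen for this example, not part of the original analysis.

```python
# Metric names are shorthand for the table columns (hypothetical identifiers).
METRICS = [
    "gallons_pcpm", "transactions_pcpm", "ancillary_revenue",
    "retention_calls", "late_fees", "activation_rate", "mom_growth",
]
POPULATION = [106.4, 5.5, 248.8, 0.7, 324.1, 0.7, 1.03]
SEGMENTS = {
    "Cash Cows": [179.9, 8.0, 1098.6, 0.6, 2965.3, 0.7, 0.96],
    "Front-runners": [276.4, 12.4, 163.8, 0.7, 371.0, 0.7, 1.04],
}


def index_vs_population(segment_values, tolerance=0.10):
    """Return {metric: (ratio, label)} comparing a segment to the population.

    tolerance is an assumed band: ratios within +/-10% count as "near".
    """
    result = {}
    for name, seg, pop in zip(METRICS, segment_values, POPULATION):
        ratio = seg / pop
        if ratio > 1 + tolerance:
            label = "above"
        elif ratio < 1 - tolerance:
            label = "below"
        else:
            label = "near"
        result[name] = (round(ratio, 2), label)
    return result


for metric, (ratio, label) in index_vs_population(SEGMENTS["Cash Cows"]).items():
    print(f"{metric:20s} {ratio:6.2f}x  {label}")
```

For Cash Cows the late-fee ratio comes out at roughly nine times the population average, which is the kind of standout dimension that gives a segment its persona.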
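Segment-3 : Cash Cows … Segment Profile
Definition: The large fleets, mostly medium-tenure customers, with very high spend but also a high incidence of late fees.
Constitutes 5% of total fleets and contributes 22% of total spend.
Ancillary product penetration
Product | % Buyers
SmartGPS | 17%
Price Info | 18%
Exception Report | 24%
Driver screen | 27%
Reward | 12%
ECS Discount | 3%
Service Discount | 0%
Segment average vs. population average
Metric | Segment Average | Population Average
(unlabeled) | 71.5 | 76.0
(unlabeled) | 30.8 | 9.0
(unlabeled) | 21.0 | 5.8
(unlabeled) | 0.7 | 0.7
Av. Gallons pcpm | 179.9 | 106.4
Av. No. of Transactions pcpm | 8.0 | 5.5
(unlabeled) | 1.3 | 1.0
(unlabeled) | 1.9 | 1.3
Av. No. of Retention Calls | 0.6 | 0.7
Av. Ancillary Revenue | 1098.6 | 248.8
Av. Late Fees | 2965.3 | 324.1
(unlabeled) | 1.0 | 1.0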
Cash Cow Behavioral Portrait & Targeted Actions
This segment is extremely valuable, so the focus should be on retention.
The Cash Cows members have the highest fleet size, the highest number of active cards and very high gallon usage.
Their late fees and ancillary revenue are higher than average.
At the same time, they have the highest percentage of terminated cards.
Interventions for Cash Cows
• Preferential Treatment (575 Fleets)
• Fleets with more than 30 vehicles and more than 200 gallons per card should be considered for preferential treatment, e.g. Relationship Manager, out-of-turn call handling and premium fleet services
• Service Network Up-sell (2078 Fleets)
• Consider targeting fleets with higher than average (30) size and lower than average non-fuel and S&M spend for a cross-sell campaign
• Cross-sell (1853 Fleets)
• Fleets with high fleet size (greater than 30) and spend (greater than 200) must be targeted for cross-selling SmartGPS, Exception Reports, Oil Price Info and Driver Screen
• Reactivation of terminated fleets (2593 Fleets)
• 2593 fleets among Cash Cows are terminated. A reactivation campaign must be run, targeted at the voluntary terminations.
• Drive Timely Payments (1485 Fleets)
• Fleets with very high late fees (greater than 6000) can be targeted for discounts in order to ensure timely payments.
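Once a segment has been profiled, interventions like the ones listed for Cash Cows reduce to simple threshold rules over each fleet's attributes. The Python sketch below encodes three of them using the thresholds quoted in the slides (more than 30 vehicles, more than 200 gallons per card, late fees above 6000). The `Fleet` record, its field names, and the action labels are hypothetical, and the up-sell and cross-sell rules are left out because they depend on spend figures not shown here.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    """Hypothetical record for one fleet in the Cash Cows segment."""
    fleet_id: str
    vehicles: int
    gallons_per_card: float
    late_fees: float
    voluntarily_terminated: bool = False


def interventions(fleet):
    """Map one fleet to the targeted actions its attributes qualify it for."""
    if fleet.voluntarily_terminated:
        # Terminated fleets only qualify for the reactivation campaign.
        return ["reactivation campaign"]
    actions = []
    if fleet.vehicles > 30 and fleet.gallons_per_card > 200:
        actions.append("preferential treatment")
    if fleet.late_fees > 6000:
        actions.append("timely-payment discount")
    return actions


# Made-up fleets, for illustration only.
fleets = [
    Fleet("F1", vehicles=45, gallons_per_card=250.0, late_fees=7000.0),
    Fleet("F2", vehicles=10, gallons_per_card=90.0, late_fees=100.0),
    Fleet("F3", vehicles=60, gallons_per_card=300.0, late_fees=0.0,
          voluntarily_terminated=True),
]
for f in fleets:
    print(f.fleet_id, interventions(f))
```

Counting how many fleets fall under each rule is how headline figures such as "575 fleets" for preferential treatment would be produced.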
Segmentation in Banking industry
Key cluster observations
• Cluster observation-1 : Low balance, low risk, reach credit limit often
• Possible treatment strategy : Extend line of credit and possibly charge a fixed fee depending on # of times they reach the credit limit
• Cluster observation-2 : Low balance, moderate risk, reach credit limit often
• Possible treatment strategy : Possibly charge a fixed fee depending on # of times they reach the credit limit
• Cluster observation-3 : High balance, moderate risk, do not reach credit limit often
• Possible treatment strategy : Possibly run a focused outbound campaign to sell short-term fixed deposits
• Cluster observation-4 & 5 : Moderate balance, high risk, moderate usage
• Possible treatment strategy : Since risk is high, focus on interest rates and pricing strategy
5 segments, with LOC, pricing and campaign interventions for each customer segment
The Mathematics behind Clustering
• K means algorithm
• Specify K the number of clusters to create
• Choose K points at Random as Cluster centroids
• Assign each observation to the cluster centroid it is closest to
• Calculate centroid mean for each cluster
• Use it as the new centroid of the cluster
• Iterate till cluster centre does not change
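The steps above translate almost line for line into code. Below is a minimal K-means sketch in plain Python (3.8+ for `math.dist`), kept free of third-party libraries. The function name, the fixed random seed, the iteration cap, and the toy points are choices made for this example; a production analysis would normally use a library implementation and scaled input variables.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Returns (centroids, clusters), where clusters[i] lists the points
    assigned to centroids[i].
    """
    rng = random.Random(seed)
    # Choose K points at random as the initial cluster centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each observation to the centroid it is closest to.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Use the mean of each cluster as its new centroid.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Iterate until the cluster centres no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Toy data: two visibly separate groups.
pts = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centres, groups = kmeans(pts, k=2)
print(sorted(centres))
print(sorted(len(g) for g in groups))
```

Because the starting centroids are random, different seeds can lead to different results on less clearly separated data, which is one reason the analyst iterates over several runs and values of K.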
Segmentation – The process
• Segment-1 : Cash cow
• Segment-2 : Cautious tryers
• Segment-3 : Fast movers
“POW” : Pearls of Wisdom
10 Customer Segmentation Best Practices
1. CLARITY ON BIZ QUESTIONS ANSWERED BY SEGMENTATION
2. TOO MANY BEHAVIORAL DIMENSIONS VS TOO FEW DIMENSIONS
3. METHODOLOGY TO ISOLATE RIGHT BEHAVIORAL VARIABLE
4. CHOOSING THE RIGHT CLUSTERING TECHNIQUE
5. ITERATE ! ITERATE ! ITERATE !
6. EVOLVING SEGMENT PERSONAS TO MAKE IT REAL TO BIZ
7. BEHAVIORAL OVERLAY ON GEOSPATIAL MAP
8. SENTIMENTS OVERLAY WITH BEHAVIORAL CLUSTERS
9. EVOLVING THE SEGMENT ACTIONABILITY FRAMEWORK
10. SEGMENT MIGRATION & ROI TRACKER
Real-life example in the Insurance industry: What drives a policy non-renewal?
1. Recency of a claims denial
2. Tenure of agent with Bharti AXA
3. Overall experience of agent ( total experience )
4. Automated deduction or cash/cheque based ( payment mode )
5. No of unanswered call center calls in last 8 weeks
6. Frequency of outbound triggers for renewal
7. Recency of phone bound renewal trigger
8. % Change in renewal commissions to the agent ( driven by policy )
9. 3 month ratio of inbound calls to outbound calls
10. Spread of multi-channel interaction – Agent / Internet / Mobile / Call center
11. Outbound watch list : Frequency of occurrence of specific keywords in outbound call interaction
12. Inbound watch list : Frequency of occurrence of specific keywords in inbound call interaction
13. Recency of last payment
14. Policy attributes : Type of policyholder / Location / Type of coverage / Policy cost / Sum assured / Issue age / Policy tenure /
15. Range of products covered
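Drivers like these typically feed a scoring model that turns them into a single propensity between 0 and 1. The Python sketch below shows the shape of a logistic score over three of the drivers. Every number in it is hypothetical: the feature names, weights, and intercept are invented for illustration, whereas a real model would estimate them from historical renewal data.

```python
from math import exp

# Hypothetical weights, for illustration only. A real model learns these from data.
WEIGHTS = {
    "recent_claim_denial": 1.2,            # 1 if a claim was denied recently, else 0
    "unanswered_calls_last_8_weeks": 0.4,  # count of unanswered call center calls
    "automated_deduction": -0.9,           # 1 if premium is auto-deducted, else 0
}
INTERCEPT = -1.5  # hypothetical baseline


def non_renewal_score(features):
    """Logistic score: weighted sum of drivers squashed into the range (0, 1)."""
    z = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1 / (1 + exp(-z))


at_risk = {
    "recent_claim_denial": 1,
    "unanswered_calls_last_8_weeks": 3,
    "automated_deduction": 0,
}
stable = {
    "recent_claim_denial": 0,
    "unanswered_calls_last_8_weeks": 0,
    "automated_deduction": 1,
}
print(round(non_renewal_score(at_risk), 2))
print(round(non_renewal_score(stable), 2))
```

The sign of each weight is what "identifying drivers" means in practice: a positive weight pushes the policy towards non-renewal, a negative one towards renewal.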
Hands on Data Science Lab Sessions
Run the session
Reading Segmentation output
(What to look for in Segmentation output ?)
Hands-on Segmentation… Using K-means to find clusters
1. Execute clustering algorithm
2. Shows cluster statistics
3. Segment average: is it above or below the population average? How does this help characterise the segment?
4. Shows membership of each segment
Segmentation Business Narrative template
How to express clusters discovered
• “A segmentation analysis was conducted to examine the behavioural clusters.
• 4 < clustering vectors > variables were simultaneously entered into the model: Humidity, Solids, Viscosity, temperature and past defect density count. < outcome >
• Together, these 4 < predictor count > vectors resulted in 5 clusters < cluster count >
• The 5 clusters discovered were labelled as follows
• Cluster-1
• Cluster-2
• Cluster-3
• Cluster-4
• Cluster-5
• Cluster-1 characteristics
• Actions recommended would be …”
1. ggplot() – How to draw a quick scatter plot? Visual relationship
2. 3 D Visualisation
3. boxplot() – How to draw a quick box plot to analyse spread?
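To make the "analyse spread" idea behind a box plot concrete, the sketch below computes the numbers a box plot draws: quartiles, the interquartile range, and the points beyond the usual 1.5 × IQR fences. It is written in Python with only the standard library rather than the R used in the labs, the sample values are made up, and quartile conventions differ slightly between tools, so the figures may not match R's `boxplot()` exactly.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Five-number summary plus outliers by the 1.5 * IQR rule."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }


# Hypothetical monthly transaction counts with one extreme value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
summary = box_plot_summary(sample)
print(summary)
```

The single value far above the upper fence is what would appear as an isolated dot on the plot, and it is usually the first thing to investigate before clustering, since K-means centroids are sensitive to such points.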
6 key points regarding our UNIQUE LEARNING MODEL
Principle-1 : Humanize Machine Learning
Principle-2 : 60 % Doing + 40 % Listening
Principle-3 : Biz Backward , instead of Technology forward !
Principle-4 : Playbooks + Checklists + Worksheets
Principle-5 : Outcome triumphs Output , ROI is key !
Segmentation ROI from customers Moving to high value segments
Principle-6 : Repeat top 10 R commands 5 times
What will you learn at the end of 4 weeks? 15 core foundational building blocks for the next-generation job market
• PREDICTIVE SCORING MODELING
• DEMYSTIFYING MACHINE LEARNING
• CORRELATION DETECTION
• ADVANCED VISUALISATION
• VOLATILITY ANALYTICS
• CLUSTERING
• FEATURE EXTRACTION
• OUTLIER EXPLORATION
• BOX PLOTS
• SCATTER PLOTS
• UNIVARIATE ANALYSIS
• EXPLORATORY DATA ANALYSIS
• REGRESSION MODELING
• BUSINESS USE CASES OF ML
• REFERENCE ARCHITECTURE
4 Week Data Science Boot camp Week by week plan
Week-1
Week-2
Week-3
Week-4
Demystifying Data Science
Introduction to Machine learning techniques
Step by Step methodology for converting noise to signal
12 tools of a Data Scientist
Descriptive vs Prescriptive statistics
How to do EDA (Exploratory Data Analysis) – Univariate / Bivariate / Correlations
Advanced Visualisation techniques
Data Science Lab Session-2 : Hands on Univariate + Bivariate + Correlation Analytics
Data Science Lab Session-1 : Getting feet wet in Data Science tools
Introduction to segmentation and clustering techniques
Segmentation in Retail Industry
Segmentation in Telecom industry
Segmentation in Healthcare industry
How to present for maximising Segmentation Business Impact
Data Science Lab Session-3 : Hands on SEGMENTATION on live data
Demystifying Predictive Analytical Models ( PAM )
Predictive Analytical Models in Retail Industry
Predictive Analytical Models in Telecom industry
Predictive Analytical Models in Healthcare industry
Mapping Impact of Predictive models on Business Outcomes
Summary of Key Data Science concepts
Data Science Lab Session-4 : Hands on PREDICTIVE ANALYTICS on live data
END 2 END MACHINE LEARNING PROJECT on live data ( Telecom or Retail or Banking )
You today ( IT specialist )
You tomorrow (the Data Scientist)
Cross the Chasm… Alter your LIFE !
4 week Data Science Boot Camp
To summarize, 3 key takeaways …
1. DATA SCIENCE IS THE FUTURE !
2. So, REINVENT YOURSELF
3. Take that first step to becoming a DATA SCIENTIST & change the game !
FAQ
FAQ-1 : “I have worked on SAP BW/BOBJ, How do I transition to becoming a Data Scientist ?”
• Execute your first Data Science pilot
• Step-1 : Learn R
• Step-2 : Zero in on a business problem to solve
• Step-3 : Setup RSAP BW connector …Get access to data from SAP BW
• Step-4 : Apply an Analytical construct ( VEDA ML )
• Step-5 : Discover the pattern which impacts the outcome
• Step-6 : Present final results to executive business team
• Explore setting up a Data science project within existing organisation
• Meetups to explore the outside world
FAQ-2: “Should I know probability and advanced statistics ?”
• Not really
• We are focussed on APPLICATION and not THEORY underpinning it
• We will teach you
• Business problem to solve
• How to execute the command on a platform
• What to look for in the output
• What happens within the black box can be seen later
FAQ-3: “I am confused between Hadoop and Data Science … What's the difference between Hadoop and Data Science?”
• Hadoop = Data Infrastructure layer
• Data Science = Sensing patterns from data to impact business outcome
FAQ-4: “This is a big shift for me … In your experience how long does it take to make the transition from IT to Data Science ?”
• We have seen people make the transition in anywhere from 4 weeks to about 6 months
• It depends upon the time + passion + drive you have
FAQ-5: “How are we going to prepare you for the data science job market ?”
1. Mock preparatory sessions
2. Worksheets + Modelling Checklists + Data Science Playbooks
3. Live projects on clustering , scoring which can be put in resume
4. Our strategic tie-ups with Organisations looking for data science skills
5. Top 30 Practitioner generated Data Science questions
6. Watch out for our exclusive app which gives real time data science job alerts
FAQ-6: “After I take the basic intro to data science course how can I specialize further and deepen my skills?”
1. Advanced Data Science
2. Data science in Digital industry
3. Data science reference architectures
4. Data science in Telecom industry
5. Data science in Health care industry
6. Data science in Banking industry
7. Data science in Manufacturing industry
FAQ-7: “I am not an IT professional but a domain person. How can I get started ?”
1. Option-1 : Focus on Industry use cases
2. Option-2 : Take basic introduction to data sciences