barclays bank presenting at the chief data officer forum europe - london
TRANSCRIPT
1 CDO Leadership Forum – Copyright Usama Fayyad © 2014
The New CDO Challenge:Taming the Big Data Beast for Value & Insights
Usama Fayyad, Ph.D.
Chief Data Officer – Barclays
Twitter: @usamaf
Feb 25th, 2014
CDO Leadership Forum
London
2 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Outline• The CDO role• Big Data all around us• Some of the issues in BigData• Introduction to Data Mining and Predictive
Analytics Over BigData• Case studies • Summary and conclusions
3 CDO Leadership Forum – Copyright Usama Fayyad © 2014
What Matters in the Age of Analytics?
1.Being Able to exploit all the data that is available • not just what you've got available • what you can acquire and use to enhance your actions
2. Proliferating analytics throughout the organization• make every part of your business smarter
3. Driving significant business value • embedding analytics into every area of your business can
help you drive top line revenues and/or bottom line cost efficiencies
4 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Why a Chief Data Officer?• There is a fundamental realisation that Data needs to
become a primary value driver at organizations• We have lots of Data• We spend much on it: in tech and resources• We are not realising the expected value we could get from it
• A strong business need to create the CDO role:• New generation companies are not following, but adopting
the model that actually works in other data-intensive industries
• CDO has a seat at executive table: the voice of Data• Data done right is an essential element to unify large
enterprises to unlock value form business synergies
4 | DSI Town Hall l February 2014
5 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Why a Chief Data Officer?
• CDO has a governance, architectural, and advisory role,
but also…
• CDO has execution and development groups to deliver on the Data agenda
5 | DSI Town Hall l February 2014
6 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Fundamental Data Principles1. Data gains value exponentially when integrated and
coalesced. When fragmented: dramatic value loss; increased costs; reduced utility/integrity; and increased security risks
2. Fusing Data together from disparate /independent sources is difficult to achieve and impossible to maintain: hence fixing at the source and controlling lifecycle and flow is the only viable approach
3. Standardisation is essential: for sustained ability to integrate data sources and hence growing value; for simplifying down-stream systems and apps
4. Data governance and policy must be centralised and need to be enforced strongly else we slip into chaos and a Babylon of terms/languages
– An Enterprise Data Architecture spanning structured and unstructured data
6 | DSI Town Hall l February 2014
7 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Fundamental Data Principles
5. Encryption and Masking: Persisting unencrypted confidential and secret data (even within our firewalls) is an invitation for problems and risks
6. Data infrastructure needs renewal & modernization:
the pace of change and development of technology are very rapid:
– Design for migration and infrastructure replacement via abstraction layers that remove tech dependencies
7. Data is a primary competency and not a side-activity supporting other processes – hence specialized skills and know-how are a must
7 | DSI Town Hall l February 2014
8 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Why Big Data?A new term, with associated “Data Scientist” positions:
• Big Data: is a mix of structured, semi-structured, and unstructured data:– Typically breaks barriers for traditional RDB storage
– Typically breaks limits of indexing by “rows”
– Typically requires intensive pre-processing before each query to extract “some structure” – usually using Map-Reduce type operations
• Above leads to “messy” situations with no standard recipes or architecture: hence the need for “data scientists” – conduct “Data Expeditions”
– Discovery and learning on the spot
9 CDO Leadership Forum – Copyright Usama Fayyad © 2014
The 4-V’s of “Big Data”
• Big Data is Characterized by the 3-V’s:
– Volume: larger than “normal” – challenging to load/process• Expensive to do ETL
• Expensive to figure out how to index and retrieve
• Multiple dimensions that are “key”
– Velocity: Rate of arrival poses real-time constraints on what are typically “batch ETL” operations
• If you fall behind catching up is extremely expensive (replicate very expensive systems)
• Must keep up with rate and service queries on-the-fly
– Variety: Mix of data types and varying degrees of structure• Non-standard schema
• Lots of BLOB’s and CLOB’s
• DB queries don’t know what to do with semi-structured and unstructured data.
10 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Male, age 32
Lives in SFLawyer
Searched on from London last week
Searched on:“Italian restaurantPalo Alto”
Checks Yahoo! Mail daily via PC & Phone
Has 25 IM Buddies, Moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people
Searched on:“Hillary Clinton”
Clicked on Sony Plasma TV
SS ad
Registration Campaign Behavior Unknown
Spends 10 hour/week
On the internet Purchased Da Vinci Codefrom Amazon
Today’s Data: e.g. Yahoo! User DNA
11 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Male, age 32
Lives in SFLawyer
Searched on from London last week
Searched on:“Italian restaurantPalo Alto”
Checks Yahoo! Mail daily via PC & Phone
Has 25 IM Buddies, Moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people
Searched on:“Hillary Clinton”
Clicked on Sony Plasma TV
SS ad
Spends 10 hour/week
On the internet Purchased Da Vinci Code from Amazon
How Data Explodes: really big
Social Graph (FB)
Likes &
friends likes
Professional netwk
- reputation
Web searches on
this person,
hobbies, work,
locationMetaData on everything
Blogs, publications,
news, local papers,
job info, accidents
12 CDO Leadership Forum – Copyright Usama Fayyad © 2014
The Distinction between “Data” and “Big Data” is fast disappearing
• Most real data sets nowadays come with a serious mix of semi-structured and unstructured components:– Images– Video– Text descriptions and news, blogs, etc…– User and customer commentary– Reactions on social media: e.g. Twitter is a mix of data
anyway
• Using standard transforms, entity extraction, and new generation tools to transform unstructured raw data into semi-structured analyzable data
13 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Text Data: The Big Driver
• We speak of “big data” and the “Variety” in 3-V’s• Reality: biggest driver of growth of Big Data has been
text data– Most work on analysis of “images” and “video” data has
really been reduced to analysis of surrounding text
Nowhere more so than on the internet
• Map-Reduce popularized by Google to address the problem of processing large amounts of text data: – Many operations with each being a simple operation but
done at large scale– Indexing a full copy of the web– Frequent re-indexing
14 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Reality Check
So what do technology people worry about these days?
15 CDO Leadership Forum – Copyright Usama Fayyad © 2014
To Hadoop or not to Hadoop?
when to use techniques requiring Map-Reduce and grid computing?• Typically organizations try to use Map-Reduce
for everything to do with Big Data– This is actually very inefficient and often irrational– Certain operations require specialized storage
• Updating segment memberships over large numbers of users
• Defining new segments on user or usage data
16 CDO Leadership Forum – Copyright Usama Fayyad © 2014
To Hadoop or not to Hadoop?
when to use techniques requiring Map-Reduce and grid computing?• Map-Reduce is useful when a very simple operation is
to be applied on a large body of unstructured data– Typically this is during entity and attribute extraction– Still need Big Data analysis post Hadoop
• Map-Reduce is not efficient or effective for tasks involving deeper statistical modeling– good for gathering counts and simple (sufficient) statistics
• E.g. how many times a keyword occurs, quick aggregation of simple facts in unstructured data, estimates of variances, density, etc…
– Mostly pre-processing for Data Mining
17 CDO Leadership Forum – Copyright Usama Fayyad © 2014
ERP Financial Data1%
Supply Chain Data2%
Sensor Data2%
Financial Trading Data4%
CRM Data4%
Science Data7%
Advertising Data10%
Social Data11%
Text and Language Data16%
IT Log Data19%
Content and Preference Data24%
Hadoop Use Cases by Data Type
18 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Analysis & Programming Software
PIG
HIPI
19 CDO Leadership Forum – Copyright Usama Fayyad © 2014
20 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Reality: What is Biggest Driver for Hadoop Demand in the Enterprise?
Storage• Good performance stores at “commodity” prices
– High end: $100,000/Terabyte– Low end: $1000/Terabyte
• Hadoop typical cost:
$3000/Terabyte• Plus decent compression for unstructured data• Plus suite of really cool tools…
21 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Turning the three Vs of Big Data into ValueUnderstand context and content• What are appropriate actions?• Is it Ok to associate my brand with this content?• Is content sad?, happy?, serious?, informative?
Understand community sentiment• What is the emotion?• Is it negative or positive?• What is the health of my brand online?
Understand customer intent?• What is each individual trying to achieve?• Can we predict what to do next?• Critical in cross-sell, personalization, monetization,
advertising, etc…
22 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Many Business Uses of Predictive AnalyticsAnalytic technique Uses in business
Marketing and sales Identify potential customers; establish the
effectiveness of a campaign
Understanding customer behavior model churn, affinities, propensities, …
Web analytics & metrics model user preferences from data, collaborative filtering, targeting, etc.
Fraud detection Identify fraudulent transactions
Credit scoring Establish credit worthiness of a customer requesting a loan
Manufacturing process analysis Identify the causes of manufacturing problems
Portfolio trading optimize a portfolio of financial instruments by maximizing returns & minimizing risks
Healthcare Application fraud detection, cost optimization, detection of events like epidemics, etc...
Insurance fraudulent claim detection, risk assessment
Security and Surveillance intrusion detection, sensor data analysis, remote sensing, object/person detection, link analysis, etc...
Case Studies:
1. Context Analysis (unstructured data)
2. Predictive Analysis w/ RapidMiner
24 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Understanding Context
25 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Reality Check
So who is the company we think is best at handling BigData?
26 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Biggest BigData in Advertising?
Understanding Context for Ads
27 CDO Leadership Forum – Copyright Usama Fayyad © 2014
The Display Ads Challenge Today
What Ad would you place here?
28 CDO Leadership Forum – Copyright Usama Fayyad © 2014
The Display Ads Challenge TodayDamaging to Brand?
29 CDO Leadership Forum – Copyright Usama Fayyad © 2014
The Display Ads Challenge Today
What Ad would you place here?
30 CDO Leadership Forum – Copyright Usama Fayyad © 2014
The Display Ads Challenge Today
Irrelevant and Damaging to Brand
Completely Irrelevant
31 CDO Leadership Forum – Copyright Usama Fayyad © 2014
NetSeer: Intent for Display
• Currently Processing 4 Billion Impressions per Day
32 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Problem: Hard to Understand User Intent
Contextual Ad served by Google What NetSeer Sees:
Case Studies:
1. Context Analysis (unstructured data)
2. Predictive Analysis w/ RapidMiner
RapidMiner’s Strengths
3434
• Open Source Community & Marketplace – Crowd-sourced innovation, quality assurance, market awareness.
• Fully-integrated Platform – Integrated, process-based business analytics platform with focus on predictive analytics.
• No Programming Required – Easy-to-use, low maintenance costs, standard platform for business analysts.
• Advanced Analytics at Every Scale – In-memory, in-database and in-Hadoop analytics offer best option for every size of database.
• Connectivity – More than 60 connectors (incl. SAP & Hadoop), allowing easy access to structured and unstructured data.
10,000+ Downloads per Month
CONFIDENTIAL
SELECT LIST OF RECIPIENT ORGANIZATIONS
3535
Government & DefensePharma & Healthcare
Consulting
Oil & Gas, Chemicals
Financial ServicesSoftware & Analytics
Retail
Manufacturing
Business Services
Consumer ProductsAerospace
Technology
Entertainment Academia
Leader in Advanced Analytics
36
Source: Gartner
Magic Quadrant for Advanced
Analytics Platforms
(February 2014)
Full report at:
http://www.rapidminer.com/gartn
er2014
37CONFIDENTIAL
PayPal
Who > world leading
online payment services
provider
Solution > Customer
feedback and voice of the
customer analysis, churn
prediction and prevention,
text mining and sentiment
analysis
SmartSoft
Who > provider of
solutions for preventing
fraud, money laundering,
and risks in financial
institutions
Solution > Integration of
Rapid-I’s predictive
analytics engine into their
solutions for fraud
detection and fraud
prevention for the
financial and telecom
sectors
Select Customer Stories
EXAMPLE:
FRAUD DETECTION
Information about fraud detection in the context of Medicaid
and Medicare and how this can used for financial
transactions.
Fraud in Medicaid: ProblemProblem:
– Cost for Medicaid and Medicare constantly increasing
– Significant proportion of the amount paid is due to fraud / waste
– Waste and fraud rate officially estimated to be about 2-3%, unofficial
internal estimates assume 10-20%
– Example: State of Massachusetts
$5-6b per year, probably more than $1b waste and fraud every year!
Task for Data Mining and Predictive Analytics:
– Identify patients, doctors, and pharmacies with suspicious behavior
– Identify patterns within the claims that show potential systemic issues
– Identify best alternatives to deploy State Auditor resources to maximize
transparency and make state government work more effectively and
reduce systemic issues
CONFIDENTIAL 39
Data Sources
External Data Available
Data Targeted
RapidMiner
CONFIDENTIAL
Example Results May/June 2013: State Auditor of Massachusetts published that 1164 Social Security Numbers of
dead persons were identified still receiving Medicaid
– Some of these persons had deceased more than a year ago
– The risk analytics engine combined the Medicaid claims data with the SSN death records to
identify these cases
Similarly external data is used to identify lottery winners still receiving Medicaid payments
High risk patients, pharmacies, and prescribing doctors are identified automatically
– Example: A pharmacy with only 1117 customers in the last 8 years; 1087 of which have had at
least 1 prescription over 500 pills
– Those 1087 got over 17 Million pills and charged over 17 Million dollars
– This is over 2000 pills per patient per month
– Each Patient has filled on average over 16K worth of pills there
– Almost all patients happen to be disabled
6 patients have been prescribed over 1 million drug units, 1 patient 4,742,171pills. Nearly 4.8
Million pills at a cost of almost 6.5 M. That is nearly 1650 pills a day. Is that even possible?
CONFIDENTIAL 41
Pharmacy Claims
Filter for Paid and Current Transactions
Filter Narcotics
Calculate Risk Score with Analytics Engine
Filter Paid > $250/Year
Select Members with Risk Score greater than 25
Claims Patients Paid
170M 2M 5B
Filter for year 201225M 0.9M $577M
14M 0.7M $541M
0.67M 0.16M $9.5M
114K 6K $7.4M
5K 130 $1.4M
1
2
4
3
6
5ca. 15% fraud detectedfor narcotics
Concluding Remarks
44 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Experience summary at Yahoo!
• Dealing with one of the largest data sources (25
Terabyte per day)
• BT business was grown from $20M to about $500M
in 3 years of investment!
• BigData critical to operations
– Ad targeting creates huge value
– Right teams to build technology (3 years of recruiting)
– Search is a BigData proble,
• Big demands for grid computing (Hadoop)
– Not all BigData can be handled via Hadoop
– Spunoff BigData Segmentation data platfrom: nPario
45 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Lessons LearnedA lot more data than qualified talent
– Finding talent in BigData is very difficult
– Retaining talent in BigData is even harder
• At Yahoo! we created central group that drove huge value to company
• Data people need to feel like they have critical mass
– Makes it easier to attract
– Makes it easier to retain
• Drive data efforts by business need, not by technology priorities
– Chief Data Officer role at Yahoo! – now popular
46 CDO Leadership Forum – Copyright Usama Fayyad © 2014
BigData Analytics for Organizations
• Key to competitive Intelligence:
– Understand context
– Understand intent
• Key to understanding consumer trends through social media analysis
– Brand issues
– Trend issues
– Anticipating the next shift
47 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Threats & Opportunities• Data world is changing, especially in on-line businesses
• Major shifts from relational DB to NoSQL, document-oriented stores
• Connecting new world to “old”world?– Convenience of execution – integration with data platforms
– Appropriateness of algorithms to BigData
– Unstructured data algorithms:
• Text, Semi-structured and Unstructured data
• Entity extraction a must
• Appropriate theory and probability distributions (power laws, fat tails)
• Sparse Data
– Model management and proper aging of models
– Getting to basics so we can decide what models to use:
• Understanding noise and distributions
• Data tours
48 CDO Leadership Forum – Copyright Usama Fayyad © 2014
Usama Fayyad - [email protected][email protected]
Twitter – @Usamaf
+1-206-529-5123
www.Oasis500.com
www.open-insights.com
Thank You! & Questions