rm world 2014: big data vs. classic data

80
1 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014 BigData Vs. “Classic” Data: Predictive Analytics in a Changing Data Landscape Usama Fayyad, Ph.D. Chief Data Officer Barclays Twitter: @usamaf August 19, 2014 RapidMiner World Boston, MA

Upload: rapidminer

Post on 14-Jan-2015

553 views

Category:

Documents


2 download

DESCRIPTION

 

TRANSCRIPT

Page 1: RM World 2014: Big data vs. classic data

1 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

BigData Vs. “Classic” Data: Predictive Analytics in a Changing Data Landscape

Usama Fayyad, Ph.D.

Chief Data Officer – Barclays

Twitter: @usamaf

August 19, 2014

RapidMiner World

Boston, MA

Page 2: RM World 2014: Big data vs. classic data

2 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Outline• Big Data all around us• The CDO role• Some of the issues in BigData• Introduction to Data Mining and Predictive

Analytics Over BigData• Case studies • Summary and conclusions

Page 3: RM World 2014: Big data vs. classic data

3 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

What Matters in the Age of Analytics?

1.Being Able to exploit all the data that is available • not just what you've got available • what you can acquire and use to enhance your actions

2. Proliferating analytics throughout the organization• make every part of your business smarter• Actions and not just insights

3. Driving significant business value • embedding analytics into every area of your business can

significantly drive top line revenues and/or bottom line cost efficiencies

Page 4: RM World 2014: Big data vs. classic data

4 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Why Big Data?A new term, with associated “Data Scientist” positions:

• Big Data: is a mix of structured, semi-structured, and unstructured data:– Typically breaks barriers for traditional RDB storage

– Typically breaks limits of indexing by “rows”

– Typically requires intensive pre-processing before each query to extract “some structure” – usually using Map-Reduce type operations

• Above leads to “messy” situations with no standard recipes or architecture: hence the need for “data scientists” – conduct “Data Expeditions”

– Discovery and learning on the spot

Page 5: RM World 2014: Big data vs. classic data

5 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

The 4-V’s of “Big Data”

• Big Data is Characterized by the 3-V’s:

– Volume: larger than “normal” – challenging to load/process• Expensive to do ETL

• Expensive to figure out how to index and retrieve

• Multiple dimensions that are “key”

– Velocity: Rate of arrival poses real-time constraints on what are typically “batch ETL” operations

• If you fall behind catching up is extremely expensive (replicate very expensive systems)

• Must keep up with rate and service queries on-the-fly

– Variety: Mix of data types and varying degrees of structure• Non-standard schema

• Lots of BLOB’s and CLOB’s

• DB queries don’t know what to do with semi-structured and unstructured data.

Page 6: RM World 2014: Big data vs. classic data

6 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Male, age 32

Lives in SFLawyer

Searched on from London last week

Searched on:“Italian restaurantPalo Alto”

Checks Yahoo! Mail daily via PC & Phone

Has 25 IM Buddies, Moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people

Searched on:“Hillary Clinton”

Clicked on Sony Plasma TV

SS ad

Registration Campaign Behavior Unknown

Spends 10 hour/week

On the internet Purchased Da Vinci Codefrom Amazon

“Classic” Data: e.g. Yahoo! User DNA

Page 7: RM World 2014: Big data vs. classic data

7 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Male, age 32

Lives in SFLawyer

Searched on from London last week

Searched on:“Italian restaurantPalo Alto”

Checks Yahoo! Mail daily via PC & Phone

Has 25 IM Buddies, Moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people

Searched on:“Hillary Clinton”

Clicked on Sony Plasma TV

SS ad

Spends 10 hour/week

On the internet Purchased Da Vinci Code from Amazon

How Data Explodes: really big

Social Graph (FB)

Likes &

friends likes

Professional netwk

- reputation

Web searches on

this person,

hobbies, work,

locationMetaData on everything

Blogs, publications,

news, local papers,

job info, accidents

Page 8: RM World 2014: Big data vs. classic data

8 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

The Distinction between “Classic Data” and “Big Data” is fast disappearing

• Most real data sets nowadays come with a serious mix of semi-structured and unstructured components:– Images– Video– Text descriptions and news, blogs, etc…– User and customer commentary– Reactions on social media: e.g. Twitter is a mix of data

anyway

• Using standard transforms, entity extraction, and new generation tools to transform unstructured raw data into semi-structured analyzable data

Page 9: RM World 2014: Big data vs. classic data

9 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Text Data: The Big Driver

• We speak of “big data” and the “Variety” in 3-V’s• Reality: biggest driver of growth of Big Data has been

text data– Most work on analysis of “images” and “video” data has

really been reduced to analysis of surrounding text

Nowhere more so than on the internet

• Map-Reduce popularized by Google to address the problem of processing large amounts of text data: – Many operations with each being a simple operation but

done at large scale– Indexing a full copy of the web– Frequent re-indexing

Page 10: RM World 2014: Big data vs. classic data

10 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

IT Log & Security Forensics & Analytics

Automated Device DataAnalytics

Failure Analysis

Proactive Fixes

Product Planning

AdvertisingAnalytics

Segmentation

Recommendation

Social Media

Big DataWarehouse Analytics

Cost Reduction Ad Hoc Insight

Predictive Analytics

Hadoop + MPP + EDW

Find New Signal Predict Events

100% Capture

Big Data Applications and Uses

Page 11: RM World 2014: Big data vs. classic data

11 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

A few words on:The Chief Data Officer

Why are companies creating this position?

Page 12: RM World 2014: Big data vs. classic data

12 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Why a Chief Data Officer?• There is a fundamental realisation that Data needs to

become a primary value driver at organizations• We have lots of Data• We spend much on it: in technology and people• We are not realising the value we expect from it

• A strong business need to create the CDO role:• Traditional companies are not following, but adopting the

model that actually works in other data-intensive industries

• CDO has a seat at executive table: the voice of Data• Data done right is an essential element to unify large

enterprises to unlock value form business synergies

Page 13: RM World 2014: Big data vs. classic data

13 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

What about traditional IT?• Does your IT department provide you with the data you

need?• Or is it just another stumbling block to get at Data?

• Does your IT department understand what you need for analytics?

• Or is it just about ETL, SQL databases, and restricted access?

• Does your IT department have people who understand data modeling? Data Warehousing? BI?• Or is it just about capturing transactions into standard

normalized databases? • Is BI more than just a tool for MI reporting?

Page 14: RM World 2014: Big data vs. classic data

14 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Fundamental Data Principles to Support Analytics

Usama’s Obvious Data Axioms

Page 15: RM World 2014: Big data vs. classic data

15 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Data Axioms

1. Data gains value exponentially when integrated and coalesced.

– When fragmented: dramatic value loss takes place;– increased costs; – reduced utility/integrity; – and increased security risks

Page 16: RM World 2014: Big data vs. classic data

16 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Data Axioms

2. Fusing Data together from disparate/independent sources is difficult

to achieve and impossible to maintainHence only viable approach is:• Intercepting and documenting at the source• fusing at the source • controlling lifecycle and flow

Page 17: RM World 2014: Big data vs. classic data

17 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Data Axioms

3. Standardisation is essential• for sustained ability to integrate data sources

and hence growing value; • for simplifying down-stream systems and apps• For enforcing discipline as a firm increases its

data sources

Page 18: RM World 2014: Big data vs. classic data

18 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Data Axioms

4. Data governance and policy must be centralised

• needs to be enforced strongly else we slip into chaos and a Babylon of terms/languages

• An Enterprise Data Architecture spanning structured and unstructured data

Page 19: RM World 2014: Big data vs. classic data

19 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Data Axioms

5. Recency Matters data streaming in modelling and scoring

• Often, accuracy of prediction drops quickly with time (e.g. consumer shopping)

• Value of alerts drop exponentially with time…• Ability to trigger responses based on real-

time scoring critical• Streaming, real-time model updates, real-

time scoring

Page 20: RM World 2014: Big data vs. classic data

20 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Data Axioms

6. Data Infrastructure Needs: • rapid renewal & modernization: the pace of

change and development of technology are very rapid– Design for migration and infrastructure replacement

via abstraction layers that remove tech dependencies

• Encryption and Masking: Persisting unencrypted confidential and secret data (even within secure firewalls) is an invitation for problems and risks

Page 21: RM World 2014: Big data vs. classic data

21 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Data Axioms

7. Data is a primary competency and not a side-activity supporting other

processes• Hence specialized skills and know-how are a

must• Generalists will create a hopeless mess• Data is difficult: modelling, architecture, and

design to support analytics

Page 22: RM World 2014: Big data vs. classic data

22 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Reality Check:Brand/Reputation on-line

What are people saying about my brand on Social Media?

Page 23: RM World 2014: Big data vs. classic data

23 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Page 24: RM World 2014: Big data vs. classic data

24 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Reality Check

Surely there are companies I can work with that can help me make this

practical?

Page 25: RM World 2014: Big data vs. classic data

25 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Page 26: RM World 2014: Big data vs. classic data

26 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Reality Check

So what do technology people worry about these days?

Page 27: RM World 2014: Big data vs. classic data

27 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

To Hadoop or not to Hadoop?

when to use techniques requiring Map-Reduce and grid computing?• Typically organizations try to use Map-Reduce

for everything to do with Big Data– This is actually very inefficient and often irrational– Certain operations require specialized storage

• Updating segment memberships over large numbers of users

• Defining new segments on user or usage data

Page 28: RM World 2014: Big data vs. classic data

28 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Drivers of Hadoop in Large Enterprises

Cost of Storage• Fastest growing demand is more storage• Data in Data Warehouses have traditionally required

expensive storage technology:

–$100K per terabyte per year – cost of Teradata storage

– $2.5K per terabyte – much lower per year –cost of Hadoop on commodity storage

Page 29: RM World 2014: Big data vs. classic data

29 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

ERP Financial Data1%

Supply Chain Data2%

Sensor Data2%

Financial Trading Data4%

CRM Data4%

Science Data7%

Advertising Data10%

Social Data11%

Text and Language Data16%

IT Log Data19%

Content and Preference Data24%

Hadoop Use Cases by Data Type

Page 30: RM World 2014: Big data vs. classic data

30 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Analysis & Programming Software

PIG

HIPI

Page 31: RM World 2014: Big data vs. classic data

31 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Page 32: RM World 2014: Big data vs. classic data

32 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Reality: If Storage is Biggest Driver of Hadoop Adoption; What is the next biggest?

ETL• Replaces expensive licenses• Much higher performance with lower infrastructure

costs (processors, memory)• Flexibility in changing schema and representation• Flexibility on taking on unstructured and semi-

structured data• Plus suite of really cool tools…

Page 33: RM World 2014: Big data vs. classic data

33 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Turning the three Vs of Big Data into ValueUnderstand context and content• What are appropriate actions?• Is it Ok to associate my brand with this content?• Is content sad?, happy?, serious?, informative?

Understand community sentiment• What is the emotion?• Is it negative or positive?• What is the health of my brand online?

Understand customer intent?• What is each individual trying to achieve?• Can we predict what to do next?• Critical in cross-sell, personalization, monetization,

advertising, etc…

Page 34: RM World 2014: Big data vs. classic data

34 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Many Business Uses of Predictive AnalyticsAnalytic technique Uses in business

Marketing and sales Identify potential customers; establish the

effectiveness of a campaign

Understanding customer behavior model churn, affinities, propensities, …

Web analytics & metrics model user preferences from data, collaborative filtering, targeting, etc.

Fraud detection Identify fraudulent transactions

Credit scoring Establish credit worthiness of a customer requesting a loan

Manufacturing process analysis Identify the causes of manufacturing problems

Portfolio trading optimize a portfolio of financial instruments by maximizing returns & minimizing risks

Healthcare Application fraud detection, cost optimization, detection of events like epidemics, etc...

Insurance fraudulent claim detection, risk assessment

Security and Surveillance intrusion detection, sensor data analysis, remote sensing, object/person detection, link analysis, etc...

Page 35: RM World 2014: Big data vs. classic data

Case Studies:

1. Context Analysis (unstructured data)

2. Yahoo! predictive modeling

Page 36: RM World 2014: Big data vs. classic data

36 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Understanding Context

Page 37: RM World 2014: Big data vs. classic data

37 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Reality Check

So who is the company we think is best at handling BigData?

Page 38: RM World 2014: Big data vs. classic data

38 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Biggest BigData in Advertising?

Understanding Context for Ads

Page 39: RM World 2014: Big data vs. classic data

39 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

The Display Ads Challenge Today

What Ad would you place here?

Page 40: RM World 2014: Big data vs. classic data

40 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

The Display Ads Challenge TodayDamaging to Brand?

Page 41: RM World 2014: Big data vs. classic data

41 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

The Display Ads Challenge Today

What Ad would you place here?

Page 42: RM World 2014: Big data vs. classic data

42 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

The Display Ads Challenge Today

Irrelevant and Damaging to Brand

Completely Irrelevant

Page 43: RM World 2014: Big data vs. classic data

43 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

NetSeer: Intent for Display

• Currently Processing 4 Billion Impressions per Day

Page 44: RM World 2014: Big data vs. classic data

44 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Problem: Hard to Understand User Intent

Contextual Ad served by Google What NetSeer Sees:

Page 45: RM World 2014: Big data vs. classic data

Case Studies:

1. Context Analysis (unstructured data)

2. Yahoo! Predictive Modeling

Page 46: RM World 2014: Big data vs. classic data

Yahoo! – One of Largest Destinations on the Web

80% of the U.S. Internet population uses Yahoo!

– Over 600 million users per month globally!

Global network of content, commerce, media, search and

access products

100+ properties including mail, TV, news, shopping, finance,

autos, travel, games, movies, health, etc.

25+ terabytes of data collected each day

• Representing 1000’s of cataloged consumer behaviors

More people visited

Yahoo! in the past

month than:

• Use coupons

• Vote

• Recycle

• Exercise regularly

• Have children

living at home

• Wear sunscreen

regularly

Sources: Mediamark Research, Spring 2004 and comScore Media Metrix, February 2005.

Data is used to develop content, consumer, category and campaign

insights for our key content partners and large advertisers

Page 47: RM World 2014: Big data vs. classic data

Yahoo! Big Data – A league of its

own…Terrabytes of Warehoused Data

25 49 94 100500

1,000

5,000

Amaz

on

Kore

a

Tele

com

AT&T

Y! L

iveS

tor

Y! P

anam

a

War

ehou

se

Wal

mar

t

Y! M

ain

war

ehou

se

GRAND CHALLENGE PROBLEMS OF DATA PROCESSING

TRAVEL, CREDIT CARD PROCESSING, STOCK EXCHANGE, RETAIL, INTERNET

Y! Data Challenge Exceeds others by 2 orders of magnitude

Millions of Events Processed Per Day

50 120 225

2,000

14,000

SABRE VISA NYSE YSM Y! Global

Page 48: RM World 2014: Big data vs. classic data

Behavioral Targeting (BT)

Search

Ad Clicks

Content

Search Clicks

BT

Targeting ads to

consumers whose recent

behaviors online indicate

which product category is

relevant to them

Page 49: RM World 2014: Big data vs. classic data

Male, age 32

Lives in SFLawyer

Searched on from London last week

Searched on:“Italian restaurantPalo Alto”

Checks Yahoo! Mail daily via PC & Phone

Has 25 IM Buddies, Moderates 3 Y! Groups, and hosts a 360 page viewed by 10k people

Searched on:“Hillary Clinton”

Clicked on Sony Plasma TV

SS ad

Registration Campaign Behavior Unknown

Spends 10 hour/week

On the internet Purchased Da Vinci Codefrom Amazon

Yahoo! User DNA

• On a per consumer basis: maintain a behavioral/interests profile and profitability (user value and LTV) metrics

Page 50: RM World 2014: Big data vs. classic data

How it works | Network + Interests +

Modelling

Analyze predictive patterns for purchase

cycles in over 100 product categories

In each category, build models to describe

behaviour most likely to lead to an ad

response (i.e. click).

Score each user for fit with every

category…daily.

Target ads to users who get highest

‘relevance’ scores in the targeting

categories

Varying Product Purchase CyclesMatch Users to the ModelsRewarding Good BehaviourIdentify Most Relevant Users

Page 51: RM World 2014: Big data vs. classic data

Recency Matters, So Does Intensity

Active now… …and with feeling

Page 52: RM World 2014: Big data vs. classic data

Differentiation | Category specific

modelling

time

inte

nsity s

core

time

inte

nsity

score

Inte

nse

Clic

k Z

on

e

Example 1: Category Automotive Example 2: Category Travel/Last Minute

Different models allow us to weight and determine intensity and recency

Alt Behaviour 1: 5 pages, 2 search keywords, 1 search click, 1 ad click Alt Behaviour 1: 5 pages, 2 search keywords, 1 search click, 1 ad click

Inte

nse

Clic

k Z

on

e

Page 53: RM World 2014: Big data vs. classic data

Differentiation | Category specific

modelling

time

inte

nsity s

core

Intense Click Zone

Example 1: Category Automotive

Different models allow us to weight and determine intensity and recency

with no further activity, decay takes effect

Alt Behaviour 1: 5 pages, 2 search keywords, 1 search click, 1 ad click

user is in the Intense Click Zone

Page 54: RM World 2014: Big data vs. classic data

Automobile Purchase Intender Example

A test ad-campaign with a major Euro automobile manufacturer Designed a test that served the same ad creative to test and control groups

on Yahoo

Success metric: performing specific actions on Jaguar website

Test results: 900% conversion lift vs. control group Purchase Intenders were 9 times more likely to configure a vehicle, request

a price quote or locate a dealer than consumers in the control group

~3x higher click through rates vs. control group

Page 55: RM World 2014: Big data vs. classic data

Mortgage Intender Example

We found:

1,900,000 people looking

for mortgage loans.

+122%

CTR Lift

Mortgages Home Loans Refinancing Ditech

Financing section in Real Estate

Mortgage Loans area in Finance

Real Estate section in Yellow Pages

+626%

Conv Lift

Example search terms qualified for this target:

Example Yahoo! Pages visited:

Source: Campaign Click thru Rate lift is determined by Yahoo! Internal

research. Conversion is the number of qualified leads from clicks over number of impressions served. Audience size represents the audience within this behavioral interest category that has the highest propensity to engage with a brand or product and to

click on an offer.Date: March 2006

Results from a client campaign on Yahoo!

NetworkExample: Mortgages

Page 56: RM World 2014: Big data vs. classic data

Experience summary at Yahoo!

• Dealing with one of the largest data sources (25

Terabyte per day)

• Behavioral Targeting business was grown from $20M

to > $400M in 3 years of investment!

• Yahoo! Specific? -- BigData critical to operations

– Ad targeting creates huge value

– Right teams to build technology (3 years of recruiting)

– Search is a BigData problem (but this has moved to

mainstream)

Page 57: RM World 2014: Big data vs. classic data

Lessons Learned

A lot more data than qualified talent

Finding talent in BigData is very difficult

Retaining talent in BigData is even harder

At Yahoo! we created central group that drove huge value to

company

Data people need to feel like they have critical mass

Makes it easier to attract the right people

Makes it easier to retain

Drive data efforts by business need, not by technology

priorities

Chief Data Officer role at Yahoo! – now popular

Page 58: RM World 2014: Big data vs. classic data

What About RapidMiner

And Big Data?

What does this all mean to RapidMiner and its use in

enterprise analytics?

Page 59: RM World 2014: Big data vs. classic data

RapidMiner’s Strengths

5959

• Open Source Community & Marketplace – Crowd-sourced innovation, quality assurance, market awareness.

• Fully-integrated Platform – Integrated, process-based business analytics platform with focus on predictive analytics.

• No Programming Required – Easy-to-use, low maintenance costs, standard platform for business analysts.

• Advanced Analytics at Every Scale – In-memory, in-database and in-Hadoop analytics offer best option for every size of database.

• Connectivity – More than 60 connectors (incl. SAP & Hadoop), allowing easy access to structured and unstructured data.

Page 60: RM World 2014: Big data vs. classic data

30,000+ Downloads per Month

CONFIDENTIAL

SELECT LIST OF RECIPIENT ORGANIZATIONS

6060

Government & DefensePharma & Healthcare

Consulting

Oil & Gas, Chemicals

Financial ServicesSoftware & Analytics

Retail

Manufacturing

Business Services

Consumer ProductsAerospace

Technology

Entertainment Academia

Page 61: RM World 2014: Big data vs. classic data

Leader in Advanced Analytics

61

Source: Gartner

Magic Quadrant for Advanced

Analytics Platforms

(February 2014)

Full report at:

http://www.rapidminer.com/gartner

2014

Page 62: RM World 2014: Big data vs. classic data

62

PayPal

Who > world leading

online payment services

provider

Solution > Customer

feedback and voice of the

customer analysis, churn

prediction and prevention,

text mining and sentiment

analysis

SmartSoft

Who > provider of

solutions for preventing

fraud, money laundering,

and risks in financial

institutions

Solution > Integration of

Rapid-I’s predictive

analytics engine into their

solutions for fraud

detection and fraud

prevention for the

financial and telecom

sectors

Select Customer Stories

Page 63: RM World 2014: Big data vs. classic data

SO WHAT IS THE

BIG DATA STORY?

Page 64: RM World 2014: Big data vs. classic data

So the data is naturally moving to

Hadoop...

Situation:

–The data is moving to Hadoop for Cost (storage) and Convenience (ETL) forces

–How do we get the value of predictive analytics to the data?

Rather than move the data out, move the analytics to

the data!

–Can we minimize the need for data movement?

–Data copies can become a management nightmare

–Analytics on a “Business As Usual” manner require convenience

64

Page 65: RM World 2014: Big data vs. classic data

Radoop – RapidMiner on HadoopOpportunity:

–Avoid expensive data movement

–Leverage convenient data transformation

–Thousands of data connectors, many over semi-structured and unstructured data

Why is this big news?

–Leverages a naturally occurring wave

–Analytics over a richer variety requires much more processing

–The energy placed on data extraction and loading moves to energy applied on actual analysis and modelling

65

Page 66: RM World 2014: Big data vs. classic data

BigData Analytics & ThreatsOld data analysis regimes are breaking down

– They cannot accommodate all the new data sources from on-line and mobile

– Any data set can be enriched with semi-structured and unstructured data

–News articles

–User commentary and feedback

–Contextual awareness:

–Semantics of content–Semantics of location

– Data-driven marketing is key to innovation/platform success

Page 67: RM World 2014: Big data vs. classic data

Big Picture on Big Data Analytics

Observations and

Concluding Remarks

Page 68: RM World 2014: Big data vs. classic data

Retaining New Yahoo! Mail Registrants

Often,

Simple is Very Powerful!

Page 69: RM World 2014: Big data vs. classic data

Integrating Mail and News

Data showed that users often check their mail

and news in the same session

–But no easy way to navigate to Y! News from Y! Mail

Mail users who also visit Y! News are 3X more

active on Yahoo

–Higher retention, repeat visits and time-spent on Yahoo

Page 70: RM World 2014: Big data vs. classic data

“In the news” Module on Mail Welcome Page

Increased retention on Mail for light users by 40%!

– Est. Incremental revenue of $16m a year on Y! Mail alone

Page 71: RM World 2014: Big data vs. classic data

Nordstrom: Queries with No

Matches

Julie Bornstein, Web Marketing Director

–What are my customers looking for and not finding?

–June 2002: queries for “belly button rings”

–returned no matches in store

–Why the sudden interest?

Page 72: RM World 2014: Big data vs. classic data

Nordstrom: Queries with No

Matches

Print Ad Campaign

Models happen to be sporting a

navel ring

Nordstrom does not sell navel

rings

What to do???

Page 73: RM World 2014: Big data vs. classic data

Concluding Remarks

Page 74: RM World 2014: Big data vs. classic data

The early days of

mass auto

production

Page 75: RM World 2014: Big data vs. classic data

Today’s Auto:

It just works!

No need to understand what happens

when you turn on ignition

Very complex inside, but all simplicity

on the outside

Page 76: RM World 2014: Big data vs. classic data

“Gotcha”s in Big Data

Discounting the need for near-real-time data

processing and analytics

– Evils of “batch” thinking

No grid story – go build separate infrastructure to

learn

– Expensive

– Bad price-performance as utilization insufficient

Data Mining & Predictive analytics teams expect data

in their own stores

– Good luck using the analytic tools on your own DW store

– Build a “replica” or “extract” DW

Scoring and integration of results is a whole separate

story

Page 77: RM World 2014: Big data vs. classic data

Data Platform Considerations

Eliminate the evils of data extracts/movement out of

the main data store

Do you need map-reduce over a grid?

– Can you set up instant-on grid (Hadoop) as needed for the map-reduce tasks/demand without moving data?

How do you integrate with analytic functions?

– scoring, data transformations

– plug-in 3rd party analytics scoring apps

– Advanced dashboard and scorecard support

Built-in integration with statistical packages? SAS, R,

data mining algorithms, advanced analytics algorithms

libraries

Can you avoid data movement and additional copies?

Page 78: RM World 2014: Big data vs. classic data

78 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

BigData Analytics for Organizations

• Key to competitive Intelligence:

– Understand context

– Understand intent

• Key to understanding consumer trends through social media analysis

– Brand issues

– Trend issues

– Anticipating the next shift

Page 79: RM World 2014: Big data vs. classic data

79 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Threats & Opportunities• Data world is changing, especially in on-line businesses

• Major shifts from relational DB to NoSQL, document-oriented stores

• Connecting new world to “old” world?– Convenience of execution – integration with data platforms

– Appropriateness of algorithms to BigData

– Unstructured data algorithms:

• Text, Semi-structured and Unstructured data

• Entity extraction a must

• Appropriate theory and probability distributions (power laws, fat tails)

• Sparse Data

– Model management and proper aging of models

– Getting to basics so we can decide what models to use:

• Understanding noise and distributions

• Data tours

Page 80: RM World 2014: Big data vs. classic data

80 RapidMiner World Keynote talk – Copyright Usama Fayyad © 2014

Usama Fayyad - [email protected][email protected]

Twitter – @Usamaf

+1-206-529-5123

www.Oasis500.com

www.open-insights.com

Thank You! & Questions