industrial data science

INDUSTRIAL DATA SCIENCE Tuesday 20 August 2013

Upload: niko-vuokko

Post on 24-Jan-2015


Category: Technology



DESCRIPTION

A three-hour lecture I gave at the Jyväskylä Summer School. The talk goes through important details about the use of data science in real businesses. These include data deployment, data processing, practical issues with data solutions and emerging trends in data science. See also Part 1 of the lecture: Introduction Data Science. You can find it in my profile.

TRANSCRIPT

Page 1: Industrial Data Science

INDUSTRIAL DATA SCIENCE

Tuesday 20 August 2013

Page 2: Industrial Data Science

WHY DO WE NEED BIG DATA?

Page 3: Industrial Data Science

SIMPLE MODELS AND A LOT OF DATA TRUMP MORE ELABORATE MODELS BASED ON LESS DATA

Peter Norvig

Page 4: Industrial Data Science

BIG DATA CHANGES PEOPLE AND TECHNOLOGY

• Data changes the management mindset to expect having supporting data available for all decisions

• Decision making then creates its own data stream that can be analyzed

• Data is an asset: What is its return? Net value? Depreciation? Future investment plan?

Page 5: Industrial Data Science

IN GOD WE TRUST, ALL OTHERS BRING DATA

W.E. Deming

Page 6: Industrial Data Science

DEPLOYING DATA

Page 7: Industrial Data Science

DATA STORAGE FOR LEARNING

• Efficient storage is critical for modeling feasibility

• What is efficient storage depends on data, algorithms and environment

• Memory: working sets, small data, online learning, fast iterations needed

• Disk: M-estimation, local context sufficient

• Data warehouse: simple models in enterprises, complex input generation

• Distributed: stochastic/ensemble methods, large and complex production models

• Cloud: variable workloads, very massive data

Page 8: Industrial Data Science

UNSUPERVISED LEARNING IN USE

[Diagram labels: Modeling; Significance testing; Decision making; As input into other modeling; Know-how; Selection of useful pattern types]

Page 9: Industrial Data Science

DEPLOYMENT OF A COMMON MODEL

[Diagram components: Modeling tool; Database; Service. Flow labels: Prediction request and answer; Datasets periodically for learning; Predictions written to DB]

Page 10: Industrial Data Science

DEPLOYMENT OF A LOCALIZED MODEL

[Diagram components: Modeling tool; Database; Service; Data builder. Flow labels: Prediction request and answer; Datasets periodically for learning; Predictions; Input construction; Query input]

Page 11: Industrial Data Science

DEPLOYMENT OF ONLINE LEARNING

[Diagram components: Modeling tool; Database; Incoming data stream; Service. Flow labels: Data and/or labels; Requests with data; Predictions; Data and/or labels]
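
The online learning loop on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not the deployment used in the lecture: the class name `OnlineLogisticModel`, the learning rate and the toy stream are all hypothetical, and logistic regression with stochastic gradient descent stands in for whatever model the modeling tool would actually host.

```python
import math


class OnlineLogisticModel:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Return the estimated probability that the label is 1."""
        score = self.bias + sum(w * x for w, x in zip(self.weights, features))
        score = max(-30.0, min(30.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        """Take one gradient step on a single labeled example."""
        error = self.predict(features) - label
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias -= self.learning_rate * error


# Hypothetical stream: the label is 1 when the first feature is active.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200

model = OnlineLogisticModel(n_features=2)
for features, label in stream:
    prediction = model.predict(features)  # answer the request with the current model
    model.update(features, label)         # learn once the label becomes available
```

The model never sees a stored dataset; it only keeps its weights, which is why this style fits data that arrives continuously.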

Page 12: Industrial Data Science

EVALUATING RESULTS AND QUALITY

• Properly evaluating the quality of modeling results depends on project objectives, error costs and data specifics

• Classification error makes no sense for skewed class sizes, ranks and ROC curves do

• Operational improvements evaluated as lift and incremental $$$ over previous

• Uneven error costs:

• earthquake risk estimation

• medical research, molecule potential VS patient safety

• Upsetting recommendations to an e-commerce customer
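
The point about skewed class sizes can be illustrated with a small sketch. The dataset below is hypothetical (990 negatives and 10 positives), and the helper names `accuracy` and `roc_auc` are introduced here only for illustration. The area under the ROC curve is computed as the probability that a random positive is scored above a random negative, with ties counted as one half.

```python
def accuracy(labels, predicted_labels):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for y, p in zip(labels, predicted_labels) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical skewed dataset: 990 negatives followed by 10 positives.
labels = [0] * 990 + [1] * 10

# A useless model that always predicts the majority class.
always_negative = [0] * 1000
constant_scores = [0.0] * 1000

# A model whose scores rank every positive above every negative.
ranking_scores = [0.1] * 990 + [0.9] * 10

print(accuracy(labels, always_negative))  # 0.99
print(roc_auc(labels, constant_scores))   # 0.5
print(roc_auc(labels, ranking_scores))    # 1.0
```

The useless model reaches 99% accuracy simply because positives are rare, while its AUC of 0.5 shows that it ranks no better than chance.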

Page 13: Industrial Data Science

WHAT IS REAL-TIME?

• Real-time can mean very different things to different people

• Analyst: “What’s the user count today? By source? Now? From France?”

• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”

• Google: “Make a bid for these placements. You have 50 ms”

Page 14: Industrial Data Science

PROCESSING LARGE DATA

Page 15: Industrial Data Science

EXAMPLES OF DATA SIZE

Human-generated

• 5K tweets/s

• 25K events/s from a mobile game (that’s 200 GB / day)

• 40K Google searches/s

Machine-generated

• 5M quotes/s in the US options market

• 120 MB/s of diagnostics from a single gas turbine

• 1 PB/s peaking from CERN LHC
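
As a rough back-of-the-envelope check of two of these figures, assuming decimal units (1 GB = 10^9 bytes) and a constant rate over the whole day, which the slide does not state:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Mobile game: 25K events/s amounting to 200 GB/day.
events_per_day = 25_000 * SECONDS_PER_DAY
bytes_per_event = 200e9 / events_per_day

# Gas turbine: 120 MB/s of diagnostics, expressed in terabytes per day.
turbine_tb_per_day = 120e6 * SECONDS_PER_DAY / 1e12

print(events_per_day)                # 2160000000
print(round(bytes_per_event, 1))     # 92.6
print(round(turbine_tb_per_day, 2))  # 10.37
```

Under these assumptions the game's events average roughly 93 bytes each, and a single turbine produces about 10 TB of diagnostics per day.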

Page 16: Industrial Data Science

HUMAN AND MACHINE GENERATED DATA

• Human-generated data will get more detailed

• … but won’t grow much faster than the underlying userbase

• It will become small eventually

• Machine-generated data will grow according to Moore’s law

• … and it’s already massive

Page 17: Industrial Data Science

PROCESSING DATA THE OLD WAY

• User actions modify the current state in a transaction DB

• Single events go to an offline audit log for re-running

• Snapshots of data are exported for modeling

• Production models take exports of snapshots, write back snapshot-versioned results

[Diagram labels: Events; Snapshot; Snapshot; Snapshot]

Page 18: Industrial Data Science

PROCESSING DATA IN STATUS QUO

• Data from operational databases is constantly copied over to a data warehouse or an analytic database

• This is ideally a one-stop-shop for all analytics and data science

• Production models preferably work inside the database, providing high performance and data integrity

• Model learning can try pushing back some operations to the database, but complex models will need an external tool

• Expensive modeling may require a separate testing database

Page 19: Industrial Data Science

PROCESSING DATA IN THE CLOUD

• Cloud allows endless scale

• No fixed limits on CPU and data usage, but everything is I/O-bound

• Enterprise hybrid clouds allow testing environments and “cloud bursting”

• Large datasets may require specialized algorithms or retrofits to MapReduce

• Combining stochastic learning, online learning and ensemble methods has proven itself for the task

Page 20: Industrial Data Science

PRACTICAL ISSUES

Page 21: Industrial Data Science

REAL WORLD DATA IS RIDDLED WITH PROBLEMS

• Corrupted incoming data

• Corrupted IDs

• Transient IDs

• Multiple transient IDs without match

• Crazy timestamps

• Data types mixed up

• New variables emerge

• Old variables disappear

• Changes in variable definitions

• And much, much more …

[Diagram labels: Garbage; You; Great insights]
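
A few of the problems listed on this slide (corrupted IDs, crazy timestamps, mixed-up data types) can be caught with simple per-record checks. The sketch below is only an illustration: the field names `user_id`, `timestamp` and `amount`, the `EARLIEST` bound and the function `validate_event` are all hypothetical and would need to match the real data.

```python
from datetime import datetime, timezone

# Hypothetical lower bound for plausible event times; adjust to the real system.
EARLIEST = datetime(2000, 1, 1, tzinfo=timezone.utc)


def validate_event(event, now):
    """Return a list of problems found in one incoming event (empty if clean)."""
    problems = []

    user_id = event.get("user_id")
    if not isinstance(user_id, str) or not user_id.strip():
        problems.append("missing or corrupted ID")

    timestamp = event.get("timestamp")
    if not isinstance(timestamp, datetime):
        problems.append("timestamp has wrong type")
    elif timestamp.tzinfo is None:
        problems.append("timestamp lacks a time zone")
    elif not (EARLIEST <= timestamp <= now):
        problems.append("timestamp out of plausible range")

    amount = event.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount has wrong type")

    return problems


now = datetime(2013, 8, 20, tzinfo=timezone.utc)
good = {"user_id": "u123",
        "timestamp": datetime(2013, 8, 19, tzinfo=timezone.utc),
        "amount": 4.99}
bad = {"user_id": "",
       "timestamp": datetime(2099, 1, 1, tzinfo=timezone.utc),
       "amount": "4.99"}

print(validate_event(good, now))  # []
print(validate_event(bad, now))
```

Returning a list of problems rather than raising on the first one makes it easier to count how often each kind of corruption occurs.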

Page 22: Industrial Data Science

AND WITH MORE PROBLEMS

• Collected data is enriched with many operationally attainable sources

⇒ varying schemas and complicated ID soup

• Analytic data often developed by frontline instead of IT waterfall

⇒ faster process, but volatile data definition

• Data scientists asking for more data ⇒ temporary kludges

• Data is big and growing ⇒ risks of unnoticed discontinuity

Page 23: Industrial Data Science

NO, I’M NOT FINISHED YET

• The data is not a CSV file sitting on your disk

• It’s coming in every second of the year, often gigabytes per hour

• Availability of this data is a business critical issue

• Availability of modeling results is a business critical issue

• Robustness of modeling results is a business critical issue

Page 24: Industrial Data Science

DATA DRIFT

• Real-world data is rarely stationary

• Equipment ages, people’s preferences change

• Quality of old data models decays

• Training and testing data may need to be specially designed

• Prefer recent data with weights or online learning
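
One common way to prefer recent data with weights is exponential decay by age. The sketch below assumes a half-life of 30 days and made-up readings; both are illustrative choices, not values from the lecture, and `recency_weights` and `weighted_mean` are hypothetical helper names.

```python
def recency_weights(ages_in_days, half_life_days):
    """Weight 1.0 for brand-new data, halving every `half_life_days`."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


def weighted_mean(values, weights):
    """Average of `values` where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical readings that drifted upward over the last three months.
ages_in_days = [90, 60, 30, 0]
readings = [10.0, 10.0, 12.0, 14.0]

weights = recency_weights(ages_in_days, half_life_days=30)
print(weights)                                     # [0.125, 0.25, 0.5, 1.0]
print(sum(readings) / len(readings))               # 11.5
print(round(weighted_mean(readings, weights), 2))  # 12.67
```

The weighted estimate sits closer to the most recent readings than the plain mean does. The same weights could be passed to any learning algorithm that accepts per-example weights.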

Page 25: Industrial Data Science

ROBUST RESULTS?

• Inputs to a decision making process must be assessed for significance

“Can I trust these numbers? Is my decision justified?”

• Ad-hoc analyses can freely employ complex and bleeding edge modeling

• In operations, stability and robustness override everything else

• Sanity checks and fallbacks can be used to avoid failures and errors
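
A sanity check with a fallback might look like the following sketch. The function `predict_with_fallback`, the bounds and the three toy models are hypothetical. The idea is only that a production service returns a safe default whenever the model crashes or produces an implausible value.

```python
import math


def predict_with_fallback(model, features, fallback, lower, upper):
    """Use the model's prediction only if it is a number within [lower, upper]."""
    try:
        prediction = model(features)
    except Exception:
        return fallback  # model failure: serve the fallback instead of crashing
    if isinstance(prediction, bool) or not isinstance(prediction, (int, float)):
        return fallback  # wrong type
    if math.isnan(prediction) or not lower <= prediction <= upper:
        return fallback  # NaN or outside the plausible range
    return prediction


def healthy_model(features):
    return 0.3


def crashing_model(features):
    return 1 / 0  # raises ZeroDivisionError


def wild_model(features):
    return 1e9  # far outside any plausible probability


for model in (healthy_model, crashing_model, wild_model):
    print(predict_with_fallback(model, [1.0], fallback=0.05, lower=0.0, upper=1.0))
# prints 0.3, then 0.05, then 0.05
```

In practice the fallback could be a historical average or the output of a simpler, older model.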

Page 26: Industrial Data Science

POWER LAWS

Number of users

Revenue per user

Page 27: Industrial Data Science

POWER LAWS

• Power laws are ubiquitous in the real world

• They follow from the principle: “Whoever has will be given more”

• Example: new links emerge to web pages in proportion to their popularity

• Product improvements can be tracked through changes in the power law curve

• Power laws often have a cut-off in the beginning, not enough mass to fill the lowest ranks

• Examples

• User engagement and value

• Social network activity

• Brain activity

• Wealth distribution
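
The "whoever has will be given more" principle can be simulated in a few lines. The sketch below is a simplified preferential attachment process with assumed parameters (20,000 links, a 10% chance that a link goes to a brand-new page); the function name and the numbers are illustrative only.

```python
import random


def simulate_preferential_attachment(n_links, new_page_probability, seed=0):
    """Return link counts per page, largest first, after adding links one by one."""
    rng = random.Random(seed)
    link_targets = [0]  # one entry per link; the value is the page it points to
    n_pages = 1
    for _ in range(n_links - 1):
        if rng.random() < new_page_probability:
            page = n_pages  # a brand-new page receives the link
            n_pages += 1
        else:
            # Copying the target of a uniformly random existing link selects
            # a page with probability proportional to its current link count.
            page = rng.choice(link_targets)
        link_targets.append(page)
    counts = [0] * n_pages
    for page in link_targets:
        counts[page] += 1
    return sorted(counts, reverse=True)


counts = simulate_preferential_attachment(n_links=20_000, new_page_probability=0.1)
median_links = counts[len(counts) // 2]
print("pages:", len(counts))
print("links to the most popular page:", counts[0])
print("links to the median page:", median_links)
```

A handful of pages end up with a large share of all links while the typical page has only one or two, which is the skewed shape the slide describes.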

Page 28: Industrial Data Science

CONSEQUENCES OF POWER LAWS

• Power laws imply extremely skewed distributions

⇒ most models assume a Gaussian or generally more balanced distribution

• Huge mass at the bottom of the ladder breaks most traditional analyses

• Different parts of the curve have complex real world interaction

• On the other hand it is relatively easy to segment power laws

⇒ separately designed treatment for different target groups

• Bringing new users as part of the power law lifts the whole curve as new entries slowly diffuse along the curve
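
Two of these consequences, the skew and the ease of segmentation, can be seen on a toy example. The revenue curve below (revenue proportional to 1/rank for 1,000 users) and the 50% head threshold are assumptions made for illustration, and `segment_by_revenue_share` is a hypothetical helper.

```python
def segment_by_revenue_share(revenues, head_share=0.5):
    """Split users into a head generating `head_share` of revenue and a tail."""
    ranked = sorted(revenues, reverse=True)
    total = sum(ranked)
    running = 0.0
    head = []
    for revenue in ranked:
        if running >= head_share * total:
            break
        head.append(revenue)
        running += revenue
    tail = ranked[len(head):]
    return head, tail


# Hypothetical revenue per user following a rough power law: 1000 / rank.
revenues = [1000.0 / rank for rank in range(1, 1001)]

mean_revenue = sum(revenues) / len(revenues)
median_revenue = sorted(revenues)[len(revenues) // 2]
head, tail = segment_by_revenue_share(revenues)

print("mean:", round(mean_revenue, 2), "median:", median_revenue)
print("users in head:", len(head), "users in tail:", len(tail))
```

The mean is several times the median, so summaries that assume a balanced distribution mislead. At the same time a few dozen users account for half the revenue, which makes a separately designed treatment for the head practical.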

Page 29: Industrial Data Science

THE IMPORTANCE OF PRESENTATION

• Operations or not, visualization is critical for acceptance

• Challenger shuttle disaster linked to poor visualization of O-ring failure risks

• Requires attention from business concept to implementation

• What information do these users want to see?

• How does this information support decision making?

• How to visualize it with clarity yet powerfully?

Page 30: Industrial Data Science

DATA SCIENCE IN BUSINESS

• Data analysis in business is not the sole task of the data scientist

• The whole organization must gradually mature and engage data

• This is not a technical barrier, it is a human barrier

• How to design business and social processes to employ data?

• Average business has tons of low-hanging data fruit

• Developing and automating all that takes years (and years)

• No use for advanced modeling without visibility to the underlying

Page 31: Industrial Data Science

WHAT’S COMING UP

Page 32: Industrial Data Science

PROCESSING DATA IN THE FUTURE

• The event stream itself is increasingly becoming the master input data for analytics and data solutions

• This is a big sea change, requiring new designs of storage and processing

• Seeing the full timeline and interactions of each object is a mixed blessing

PROS: Huge opportunity for discovering significant value

CONS: A very complex haystack, needs additional processing, how can a human focus on the essential?

Page 33: Industrial Data Science

STREAM PROCESSING

• Instead of handling static states of the data, the data is processed as it enters the system

• The tables turn: the internal state of the stream, persisted to a database, now becomes the backup in case of failure

• Obvious fit for quickly reactive online learning solutions

• The whole domain was spearheaded by computer trading

• Another example: credit card transaction processing and fraud prevention
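
The idea of processing events as they arrive, with the persisted state serving only as a backup, can be sketched as follows. Everything here is a simplified assumption: the class `RunningCounts`, the `card_id` field, the checkpoint interval, and a plain string standing in for the database.

```python
import json


class RunningCounts:
    """Per-card transaction counts maintained as events arrive."""

    def __init__(self, state=None):
        self.counts = dict(state or {})
        self.events_seen = sum(self.counts.values())

    def process(self, event):
        """Update the state with one event and return the card's new count."""
        key = event["card_id"]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.events_seen += 1
        return self.counts[key]

    def checkpoint(self):
        """Serialize the internal state so it can be persisted as a backup."""
        return json.dumps(self.counts)

    @classmethod
    def restore(cls, serialized):
        """Rebuild a processor from a previously persisted checkpoint."""
        return cls(json.loads(serialized))


events = [{"card_id": "A"}, {"card_id": "B"}, {"card_id": "A"}, {"card_id": "A"}]

processor = RunningCounts()
backup = None  # stands in for a database row
for i, event in enumerate(events, start=1):
    processor.process(event)
    if i % 2 == 0:  # persist the state every two events
        backup = processor.checkpoint()

recovered = RunningCounts.restore(backup)  # simulate recovery after a crash
print(recovered.counts)  # {'A': 3, 'B': 1}
```

Results are served from the in-memory state; the database copy is read only when the processor has to be restarted.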

Page 34: Industrial Data Science

HADOOP AND DATA SCIENCE

• Hadoop is a general service platform, not just a MapReduce engine

• HBase is already becoming a hugely popular service backend

• In the long run Hadoop will also host a successful analytic database

• A wide selection of very different approaches to analytics and data science exists already: Hive and Pig, Impala, Mahout, Vowpal Wabbit, DataFu, Cloudera ML, Giraph, RHadoop, …

Page 35: Industrial Data Science

REARRANGING THE MAP

• Change is not driven by replacing current bad solutions, but by innovating around their shortcomings

• Stream processing of data will capture a large corner, driven by a sweeping push closer to real-time

• High-level functional interfaces to data are another winner

• Examples: Cascading for batch processing, Trident for stream processing

• Further innovation in fixing MapReduce shortcomings

• Examples: Spark and Shark for iterative tasks, Impala for analytics

Page 36: Industrial Data Science

THE END