industrial data science

INDUSTRIAL DATA SCIENCE

Tuesday 20 August 2013

WHY DO WE NEED BIG DATA?

“SIMPLE MODELS AND A LOT OF DATA TRUMP MORE ELABORATE MODELS BASED ON LESS DATA”

Peter Norvig

BIG DATA CHANGES PEOPLE AND TECHNOLOGY

• Data changes the management mindset to expect supporting data to be available for all decisions

• Decision making then creates its own data stream that can be analyzed

• Data is an asset: What is its return? Net value? Depreciation? Future investment plan?

“IN GOD WE TRUST, ALL OTHERS BRING DATA”

W.E. Deming

DEPLOYING DATA

DATA STORAGE FOR LEARNING

• Efficient storage is critical for modeling feasibility

• What is efficient storage depends on data, algorithms and environment

• Memory: working sets, small data, online learning, fast iterations needed

• Disk: M-estimation, local context sufficient

• Data warehouse: simple models in enterprises, complex input generation

• Distributed: stochastic/ensemble methods, large and complex production models

• Cloud: variable workloads, very massive data

UNSUPERVISED LEARNING IN USE

[Diagram labels: Modeling; Significance testing; Decision making; As input into other modeling; Know-how; Selection of useful pattern types]

DEPLOYMENT OF A COMMON MODEL

[Diagram labels: Modeling tool; Database; Service; Prediction request and answer; Datasets periodically for learning; Predictions written to DB]

DEPLOYMENT OF A LOCALIZED MODEL

[Diagram labels: Modeling tool; Database; Service; Prediction request and answer; Datasets periodically for learning; Predictions; Data builder; Input construction; Query input]

DEPLOYMENT OF ONLINE LEARNING

[Diagram labels: Modeling tool; Database; Incoming data stream; Service; Data and/or labels; Requests with data; Predictions]
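The online-learning deployment can be sketched as a service object that answers prediction requests and updates itself whenever data and labels arrive. This is a minimal sketch, assuming a logistic-regression model trained by stochastic gradient descent; the class name `OnlineModel`, the learning rate and the toy stream are illustrative assumptions, not taken from the slides.

```python
import math


class OnlineModel:
    """Logistic regression updated one example at a time (hypothetical sketch)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        """Answer a prediction request: probability that the label is 1."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, label):
        """Learn from one labeled example arriving from the data stream."""
        error = self.predict(x) - label  # gradient of the log loss w.r.t. z
        self.bias -= self.learning_rate * error
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]


# Toy stream: the label is 1 exactly when the single feature is positive.
stream = [([1.0], 1), ([-1.0], 0), ([2.0], 1), ([-2.0], 0)] * 50

model = OnlineModel(n_features=1)
for features, label in stream:
    model.update(features, label)

p_pos = model.predict([1.5])   # should lean towards 1
p_neg = model.predict([-1.5])  # should lean towards 0
```

In a real deployment the `update` calls would be driven by the incoming data stream and the `predict` calls by service requests, possibly interleaved.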

EVALUATING RESULTS AND QUALITY

• Properly evaluating the quality of modeling results depends on project objectives, error costs and data specifics

• Classification error makes no sense for skewed class sizes; ranks and ROC curves do

• Operational improvements evaluated as lift and incremental $$$ over previous

• Uneven error costs:

• earthquake risk estimation

• medical research: molecule potential vs. patient safety

• upsetting recommendations to an e-commerce customer
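The point about skewed class sizes can be illustrated with a small calculation. The sketch below assumes a made-up dataset with 1% positives and a useless model that always predicts the majority class; the function names are illustrative. Accuracy looks excellent, while the area under the ROC curve shows that the model ranks no better than chance.

```python
def accuracy(labels, predictions):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half); this equals the area under the ROC curve."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# 990 negatives and 10 positives: a heavily skewed class distribution.
labels = [0] * 990 + [1] * 10

# A useless model: always the majority class, always the same score.
constant_predictions = [0] * 1000
constant_scores = [0.0] * 1000

acc = accuracy(labels, constant_predictions)  # 0.99
auc = roc_auc(labels, constant_scores)        # 0.5
```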

WHAT IS REAL-TIME?

• Real-time can mean very different things to different people

• Analyst: “What’s the user count today? By source? Now? From France?”

• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”

• Google: “Make a bid for these placements. You have 50 ms”

PROCESSING LARGE DATA

EXAMPLES OF DATA SIZE

Human-generated

• 5K tweets/s

• 25K events/s from a mobile game (that’s 200 GB / day)

• 40K Google searches/s

Machine-generated

• 5M quotes/s in the US options market

• 120 MB/s of diagnostics from a single gas turbine

• 1 PB/s peaking from CERN LHC
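A quick back-of-the-envelope check puts these rates on a common scale. The sketch below uses only the figures quoted above and assumes decimal units (1 GB = 10^9 bytes) and rates sustained over a full day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Mobile game: 25K events/s and 200 GB/day are the figures from the slide.
events_per_day = 25_000 * SECONDS_PER_DAY
bytes_per_event = 200e9 / events_per_day  # assumes 1 GB = 10**9 bytes

# Gas turbine: 120 MB/s of diagnostics, assuming the rate holds all day.
turbine_tb_per_day = 120e6 * SECONDS_PER_DAY / 1e12

print(f"{events_per_day:,} events/day")      # 2,160,000,000 events/day
print(f"{bytes_per_event:.1f} bytes/event")  # 92.6 bytes/event
print(f"{turbine_tb_per_day:.2f} TB/day")    # 10.37 TB/day
```

Under these assumptions a single turbine produces roughly fifty times the daily volume of the whole mobile game.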

HUMAN AND MACHINE GENERATED DATA

• Human-generated data will get more detailed

• … but won’t grow much faster than the underlying userbase

• It will become small eventually

• Machine-generated data will grow according to Moore’s law

• … and it’s already massive

PROCESSING DATA THE OLD WAY

• User actions modify the current state in a transaction DB

• Single events go to an offline audit log for re-running

• Snapshots of data are exported for modeling

• Production models take exports of snapshots and write back snapshot-versioned results

[Diagram labels: Events; Snapshot; Snapshot; Snapshot]
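The snapshot-based flow can be sketched in a few lines. This is a toy illustration only: a dictionary stands in for the transaction DB, a list for the offline audit log, and the event format (user, amount) is a made-up assumption.

```python
import copy

state = {}       # stands in for the transaction DB (current state only)
audit_log = []   # append-only record of single events
snapshots = []   # periodic exports used for modeling


def apply_event(target, event):
    """Apply one user action: add an amount to a user's balance."""
    user, amount = event
    target[user] = target.get(user, 0) + amount


def handle(event):
    apply_event(state, event)  # modify the current state in place
    audit_log.append(event)    # keep the event for re-running later


def export_snapshot():
    snapshots.append(copy.deepcopy(state))


for event in [("ann", 5), ("bob", 3)]:
    handle(event)
export_snapshot()
for event in [("ann", -2), ("cat", 7)]:
    handle(event)
export_snapshot()

# Re-running the audit log from scratch reproduces the current state.
replayed = {}
for event in audit_log:
    apply_event(replayed, event)
```

Models built this way only ever see the exported snapshots; what happened between two snapshots is visible only in the audit log.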

PROCESSING DATA IN STATUS QUO

• Data from operational databases is constantly copied over to a data warehouse or an analytic database

• This is ideally a one-stop shop for all analytics and data science

• Production models preferably work inside the database, providing high performance and data integrity

• Model learning can try pushing some operations back to the database, but complex models will need an external tool

• Expensive modeling may require a separate testing database

PROCESSING DATA IN THE CLOUD

• The cloud allows practically unlimited scale

• No fixed limits on CPU and data usage, but everything is I/O-bound

• Enterprise hybrid clouds allow testing environments and “cloud bursting”

• Large datasets may require specialized algorithms or retrofits to MapReduce

• Combining stochastic learning, online learning and ensemble methods has proven itself for the task

PRACTICAL ISSUES

REAL WORLD DATA IS RIDDLED WITH PROBLEMS

• Corrupted incoming data

• Corrupted IDs

• Transient IDs

• Multiple transient IDs without match

• Crazy timestamps

• Data types mixed up

• New variables emerge

• Old variables disappear

• Changes in variable definitions

• And much, much more …

[Diagram labels: Garbage; You; Great insights]
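Several of the problems listed above can be caught by a validation step before records reach a model. The sketch below is illustrative only: the schema in `EXPECTED_FIELDS`, the field names and the plausible timestamp range are assumptions made up for the example.

```python
from datetime import datetime, timezone

# Hypothetical schema: field name -> expected Python type.
EXPECTED_FIELDS = {"user_id": str, "timestamp": float, "amount": float}
EARLIEST = datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp()


def validate(record, now):
    """Return a list of problems found in one incoming record."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing variable: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"wrong type for {name}: {type(record[name]).__name__}")
    for name in record:
        if name not in EXPECTED_FIELDS:
            problems.append(f"new variable: {name}")
    ts = record.get("timestamp")
    if isinstance(ts, float) and not (EARLIEST <= ts <= now):
        problems.append(f"timestamp out of range: {ts}")
    if record.get("user_id") == "":
        problems.append("empty user_id")
    return problems


now = datetime(2013, 8, 20, tzinfo=timezone.utc).timestamp()
good = {"user_id": "u1", "timestamp": now - 60.0, "amount": 9.99}
bad = {"user_id": "", "timestamp": now + 1e9, "amount": "9.99", "campaign": "x"}

good_problems = validate(good, now)  # no problems
bad_problems = validate(bad, now)    # type, new variable, timestamp, empty ID
```

Counting the problems per hour or per source, rather than just dropping bad records, also helps to notice when a variable definition has silently changed.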

AND WITH MORE PROBLEMS

• Collected data is enriched with many operationally attainable sources

⇒ varying schemas and complicated ID soup

• Analytic data is often developed by the frontline instead of an IT waterfall process

⇒ faster process, but volatile data definition

• Data scientists asking for more data ⇒ temporary kludges

• Data is big and growing ⇒ risks of unnoticed discontinuity

NO, I’M NOT FINISHED YET

• The data is not a CSV file sitting on your disk

• It’s coming in every second of the year, often gigabytes per hour

• Availability of this data is a business-critical issue

• Availability of modeling results is a business-critical issue

• Robustness of modeling results is a business-critical issue

DATA DRIFT

• Real-world data is rarely stationary

• Equipment ages, people’s preferences change

• The quality of old data models decays

• Training and testing data may need to be specially designed

• Prefer recent data with weights or online learning
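One simple way to prefer recent data is to weight each observation by its age. The sketch below assumes exponential decay with a chosen half-life; the function names, the 30-day half-life and the toy values are illustrative assumptions.

```python
def recency_weights(ages_in_days, half_life_days):
    """Weight of each observation halves every `half_life_days` of age."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy drift: the measured quantity moved from about 10 to about 20 over time.
ages = [90, 60, 30, 0]             # days since each observation
values = [10.0, 10.0, 20.0, 20.0]

weights = recency_weights(ages, half_life_days=30)  # [0.125, 0.25, 0.5, 1.0]
plain = sum(values) / len(values)                    # 15.0
recent = weighted_mean(values, weights)              # 18.0
```

The same weights can be passed as sample weights to any learner that accepts them; the half-life then controls how quickly old behavior is forgotten.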

ROBUST RESULTS?

• Inputs to a decision making process must be assessed for significance

“Can I trust these numbers? Is my decision justified?”

• Ad-hoc analyses can freely employ complex and bleeding edge modeling

• In operations, stability and robustness override everything else

• Sanity checks and fallbacks can be used to avoid failures and errors
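A sanity check with a fallback can be as simple as a wrapper around the model call. This is a minimal sketch, assuming the fallback is a fixed default value and the sanity check is a plausible range; `predict_with_fallback` and `fragile_model` are hypothetical names for the example.

```python
import math


def predict_with_fallback(model, features, fallback, lower, upper):
    """Return the model's prediction only if it passes basic sanity checks.

    Returns a pair (value, used_fallback).
    """
    try:
        value = model(features)
    except Exception:
        return fallback, True
    if not isinstance(value, (int, float)) or math.isnan(value):
        return fallback, True
    if not (lower <= value <= upper):
        return fallback, True
    return value, False


def fragile_model(features):
    """Hypothetical model: fails on zero input, explodes on tiny input."""
    return 100.0 / features["x"]


ok = predict_with_fallback(fragile_model, {"x": 4.0}, 20.0, 0.0, 1000.0)
too_large = predict_with_fallback(fragile_model, {"x": 1e-9}, 20.0, 0.0, 1000.0)
crashed = predict_with_fallback(fragile_model, {"x": 0.0}, 20.0, 0.0, 1000.0)
```

Logging how often the fallback is used gives an early warning that the model or its inputs have degraded.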

POWER LAWS

[Chart axes: Number of users; Revenue per user]

POWER LAWS

• Power laws are ubiquitous in the real world

• They follow from the principle: “Whoever has will be given more”

• Example: new links emerge to web pages in proportion to their popularity

• Product improvements can be tracked through changes in the power law curve

• Power laws often have a cut-off in the beginning: not enough mass to fill the lowest ranks

• Examples

• User engagement and value

• Social network activity

• Brain activity

• Wealth distribution
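The “whoever has will be given more” mechanism is easy to simulate. The sketch below is a Simon-style rich-get-richer process for web links; the 10% rate of new pages, the step count and the seed are assumptions chosen for illustration, and the resulting numbers depend on them.

```python
import random


def simulate_links(n_steps, new_page_prob=0.1, seed=0):
    """Simon-style rich-get-richer process.

    Each step creates one link. With probability `new_page_prob` it points
    to a brand-new page; otherwise it points to an existing page chosen with
    probability proportional to the links that page already has.
    """
    rng = random.Random(seed)
    links = [1]    # links[i] = inbound links of page i; start with one page
    targets = [0]  # one entry per link, so a uniform draw from this list
                   # is a popularity-proportional draw over pages
    for _ in range(n_steps):
        if rng.random() < new_page_prob:
            links.append(1)
            page = len(links) - 1
        else:
            page = rng.choice(targets)
            links[page] += 1
        targets.append(page)
    return links


links = simulate_links(n_steps=20_000)
ranked = sorted(links, reverse=True)
top_tenth = max(1, len(ranked) // 10)
top_share = sum(ranked[:top_tenth]) / sum(ranked)  # share of the top 10% of pages
```

Plotting `ranked` on log-log axes is the usual way to inspect the shape of the curve.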

CONSEQUENCES OF POWER LAWS

• Power laws imply extremely skewed distributions

⇒ most models assume a Gaussian or generally more balanced distribution

• Huge mass at the bottom of the ladder breaks most traditional analyses

• Different parts of the curve have complex real world interaction

• On the other hand it is relatively easy to segment power laws

⇒ separately designed treatment for different target groups

• Bringing new users as part of the power law lifts the whole curve as new entries slowly diffuse along the curve
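One simple way to segment a power-law distribution is by order of magnitude. The sketch below groups users by the base-10 logarithm of their revenue; the segmentation rule and the revenue figures are assumptions made up for illustration, not the method used in the slides.

```python
import math
from collections import defaultdict


def segment_by_magnitude(revenue_per_user):
    """Group users by the order of magnitude of their revenue.

    Segment k holds users with revenue in [10**k, 10**(k+1)); users with
    no revenue go to the segment named "none".
    """
    segments = defaultdict(list)
    for user, revenue in revenue_per_user.items():
        key = "none" if revenue <= 0 else math.floor(math.log10(revenue))
        segments[key].append(user)
    return dict(segments)


# Hypothetical revenue figures, for illustration only.
revenue = {"a": 0.0, "b": 0.0, "c": 2.0, "d": 5.0, "e": 40.0, "f": 1200.0}
segments = segment_by_magnitude(revenue)
# {"none": ["a", "b"], 0: ["c", "d"], 1: ["e"], 3: ["f"]}
```

Each segment can then get its own separately designed treatment, as suggested above.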

THE IMPORTANCE OF PRESENTATION

• Operations or not, visualization is critical for acceptance

• Challenger shuttle disaster linked to poor visualization of O-ring failure risks

• Requires attention from business concept to implementation

• What information do these users want to see?

• How does this information support decision making?

• How to visualize it with clarity yet powerfully?

DATA SCIENCE IN BUSINESS

• Data analysis in business is not the sole task of the data scientist

• The whole organization must gradually mature and engage data

• This is not a technical barrier, it is a human barrier

• How to design business and social processes to employ data?

• The average business has tons of low-hanging data fruit

• Developing and automating all that takes years (and years)

• No use for advanced modeling without visibility to the underlying

WHAT’S COMING UP

PROCESSING DATA IN THE FUTURE

• The event stream itself is increasingly becoming the master input data for analytics and data solutions

• This is a big sea change, requiring new designs of storage and processing

• Seeing the full timeline and interactions of each object is a mixed blessing

PROS: Huge opportunity for discovering significant value

CONS: A very complex haystack that needs additional processing; how can a human focus on the essential?

STREAM PROCESSING

• Instead of handling static states of the data, the data is processed as it enters the system

• The tables turn: the internal state of the stream, persisted to a database, now becomes the backup for failures

• Obvious fit for quickly reactive online learning solutions

• The whole domain was spearheaded by computer trading

• Another example: credit card transaction processing and fraud prevention
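The idea of processing events on arrival and persisting state only as a backup can be sketched as follows. This is a toy example: the class `SpendTracker`, the "five times the running mean" rule and the JSON checkpoint are assumptions for illustration, not a real fraud-prevention method.

```python
import json


class SpendTracker:
    """Keeps running totals per card and flags unusually large amounts."""

    def __init__(self, threshold_ratio=5.0):
        self.threshold_ratio = threshold_ratio
        self.count = {}  # card -> number of transactions seen
        self.total = {}  # card -> sum of amounts seen

    def process(self, card, amount):
        """Handle one event as it arrives; return True if it looks suspicious."""
        n = self.count.get(card, 0)
        mean = self.total.get(card, 0.0) / n if n else None
        suspicious = mean is not None and amount > self.threshold_ratio * mean
        self.count[card] = n + 1
        self.total[card] = self.total.get(card, 0.0) + amount
        return suspicious

    def checkpoint(self):
        """Serialize the internal state; a database write would go here."""
        return json.dumps({"count": self.count, "total": self.total})

    @classmethod
    def restore(cls, blob, threshold_ratio=5.0):
        """Rebuild a tracker from a checkpoint, e.g. after a crash."""
        tracker = cls(threshold_ratio)
        state = json.loads(blob)
        tracker.count, tracker.total = state["count"], state["total"]
        return tracker


tracker = SpendTracker()
events = [("card-1", 10.0), ("card-1", 12.0), ("card-1", 11.0), ("card-1", 500.0)]
flags = [tracker.process(card, amount) for card, amount in events[:3]]

saved = tracker.checkpoint()             # persisted state is only a backup
recovered = SpendTracker.restore(saved)  # simulate recovery after a failure
flags.append(recovered.process(*events[3]))
# flags == [False, False, False, True]
```

The decision is made from in-memory state at the moment the event arrives; the checkpoint is consulted only when the process has to be restarted.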

HADOOP AND DATA SCIENCE

• Hadoop is a general service platform, not just a MapReduce engine

• HBase is already becoming a hugely popular service backend

• In the long run Hadoop will also host a successful analytic database

• A wide selection of very different approaches to analytics and data science exists already: Hive and Pig, Impala, Mahout, Vowpal Wabbit, DataFu, Cloudera ML, Giraph, RHadoop, …

REARRANGING THE MAP

• Change is not driven by replacing current bad solutions, but by innovating around their shortcomings

• Stream processing of data will capture a large corner, driven by a sweeping push closer to real-time

• High-level functional interfaces to data are another winner

• Examples: Cascading for batch processing, Trident for stream processing

• Further innovation in fixing MapReduce shortcomings

• Examples: Spark and Shark for iterative tasks, Impala for analytics

THE END
