industrial data science
Post on 24-Jan-2015
INDUSTRIAL DATA SCIENCE
Tuesday 20 August 2013
WHY DO WE NEED BIG DATA?
“SIMPLE MODELS AND A LOT OF DATA TRUMP MORE ELABORATE MODELS BASED ON LESS DATA”
Peter Norvig
BIG DATA CHANGES PEOPLE AND TECHNOLOGY
• Data changes the management mindset to expect having supporting data
available for all decisions
• Decision making then creates its own data stream that can be analyzed
• Data is an asset: What is its return? Net value? Depreciation? Future
investment plan?
“IN GOD WE TRUST, ALL OTHERS BRING DATA”
W.E. Deming
DEPLOYING DATA
DATA STORAGE FOR LEARNING
• Efficient storage is critical for modeling feasibility
• What counts as efficient storage depends on the data, algorithms and environment
• Memory: working sets, small data, online learning, fast iterations needed
• Disk: M-estimation, local context sufficient
• Data warehouse: simple models in enterprises, complex input generation
• Distributed: stochastic/ensemble methods, large and complex production models
• Cloud: variable workloads, very massive data
UNSUPERVISED LEARNING IN USE
[Diagram labels: Modeling; Significance testing; Decision making; As input into other modeling; Know-how; Selection of useful pattern types]
DEPLOYMENT OF A COMMON MODEL
[Diagram labels: Modeling tool; Database; Service; Prediction request and answer; Datasets periodically for learning; Predictions written to DB]
DEPLOYMENT OF A LOCALIZED MODEL
[Diagram labels: Modeling tool; Database; Service; Data builder; Prediction request and answer; Datasets periodically for learning; Predictions; Input construction; Query input]
DEPLOYMENT OF ONLINE LEARNING
[Diagram labels: Modeling tool; Database; Incoming data stream; Service; Data and/or labels (shown twice); Requests with data; Predictions]
EVALUATING RESULTS AND QUALITY
• Properly evaluating the quality of modeling results depends on project
objectives, error costs and data specifics
• Classification error is misleading when class sizes are skewed;
ranks and ROC curves remain informative
• Operational improvements evaluated as lift and incremental $$$ over previous
• Uneven error costs:
• earthquake risk estimation
• medical research, molecule potential vs. patient safety
• Upsetting recommendations to an e-commerce customer
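The point about skewed class sizes can be illustrated with a minimal sketch. The data below is hypothetical (990 negatives, 10 positives), and `accuracy` and `roc_auc` are small helper functions written for this illustration, not part of any library. A model that always predicts the majority class looks excellent by classification accuracy, while its ROC AUC shows that it ranks no better than chance.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the data size, which is
    acceptable for an illustration but not for production use.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical skewed data set: 990 negatives, 10 positives.
labels = [0] * 990 + [1] * 10

# A useless model: always predicts the majority class with the same score.
constant_predictions = [0] * 1000
constant_scores = [0.0] * 1000

majority_accuracy = accuracy(labels, constant_predictions)
majority_auc = roc_auc(labels, constant_scores)
print(majority_accuracy, majority_auc)  # 0.99 0.5
```

An accuracy of 0.99 says nothing here because the model never finds a single positive case; the AUC of 0.5 makes that visible.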
WHAT IS REAL-TIME?
• Real-time can mean very different things to different people
• Analyst: “What’s the user count today? By source? Now? From France?”
• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
• Google: “Make a bid for these placements. You have 50 ms”
PROCESSING LARGE DATA
EXAMPLES OF DATA SIZE
Human-generated
• 5K tweets/s
• 25K events/s from a mobile game (that’s 200 GB / day)
• 40K Google searches/s
Machine-generated
• 5M quotes/s in the US options market
• 120 MB/s of diagnostics from a single gas turbine
• 1 PB/s peaking from CERN LHC
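As a rough back-of-the-envelope check of these rates, the sketch below converts two of the figures to daily volumes. It assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and a constant rate over the whole day, neither of which the slide states.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Mobile game: 25K events/s and 200 GB/day (figures from the slide).
events_per_second = 25_000
bytes_per_day = 200 * 10**9            # assumption: decimal gigabytes

events_per_day = events_per_second * SECONDS_PER_DAY
bytes_per_event = bytes_per_day / events_per_day

# Gas turbine: 120 MB/s of diagnostics (figure from the slide).
turbine_bytes_per_second = 120 * 10**6  # assumption: decimal megabytes
turbine_tb_per_day = turbine_bytes_per_second * SECONDS_PER_DAY / 10**12

print(events_per_day)                  # 2160000000
print(round(bytes_per_event, 1))       # 92.6
print(round(turbine_tb_per_day, 3))    # 10.368
```

Under these assumptions each game event averages a little under 100 bytes, and a single turbine produces about 10 TB of diagnostics per day.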
HUMAN AND MACHINE GENERATED DATA
• Human-generated data will get more detailed
• … but won’t grow much faster than the underlying userbase
• It will become small eventually
• Machine-generated data will grow according to Moore’s law
• … and it’s already massive
PROCESSING DATA THE OLD WAY
• User actions modify the current state in a transaction DB
• Single events go to an offline audit log for re-running
• Snapshots of data are exported for modeling
• Production models take exports of snapshots,
write back snapshot versioned results
[Diagram labels: Events; Snapshot; Snapshot; Snapshot]
PROCESSING DATA IN STATUS QUO
• Data from operational databases is constantly copied over to a data
warehouse or an analytic database
• This is ideally a one-stop shop for all analytics and data science
• Production models preferably work inside the database,
providing high performance and data integrity
• Model learning can try pushing back some operations to the database, but
complex models will need an external tool
• Expensive modeling may require a separate testing database
PROCESSING DATA IN THE CLOUD
• The cloud allows practically unlimited scale
• No fixed limits on CPU and data usage, but everything is I/O-bound
• Enterprise hybrid clouds allow testing environments and “cloud bursting”
• Large datasets may require specialized algorithms or retrofits to MapReduce
• Combining stochastic learning, online learning and ensemble methods has
proven itself for the task
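One way the combination of stochastic learning, online learning and ensemble methods might look is sketched below. This is an illustration under stated assumptions, not the method used in any particular system: the data is synthetic, the model is a one-feature logistic regression without an intercept (to keep the sketch short), and the names `sgd_pass` and `ensemble_predict` are hypothetical. Each shard is learned independently in a single online pass of stochastic gradient descent, as separate workers could do, and the ensemble averages the members' predicted probabilities.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_pass(weight, shard, learning_rate=0.5):
    """One online pass: update the weight after every single example."""
    for x, y in shard:
        error = y - sigmoid(weight * x)
        weight += learning_rate * error * x
    return weight


def ensemble_predict(weights, x):
    """Average the predicted probabilities of all ensemble members."""
    return sum(sigmoid(w * x) for w in weights) / len(weights)


rng = random.Random(0)

# Synthetic data: the label is 1 exactly when the feature is positive.
data = []
for _ in range(600):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 1 if x > 0 else 0))

# Split into three shards; each could live on a different machine.
shards = [data[i::3] for i in range(3)]
weights = [sgd_pass(0.0, shard) for shard in shards]

print(weights)
print(ensemble_predict(weights, 2.0), ensemble_predict(weights, -2.0))
```

No member ever sees the full data set or holds more than one example in memory at a time, which is what makes this family of methods attractive when everything is I/O-bound.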
PRACTICAL ISSUES
REAL WORLD DATA IS RIDDLED WITH PROBLEMS
• Corrupted incoming data
• Corrupted IDs
• Transient IDs
• Multiple transient IDs without match
• Crazy timestamps
• Data types mixed up
• New variables emerge
• Old variables disappear
• Changes in variable definitions
• And much, much more …
[Diagram labels: Garbage; You; Great insights]
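A first line of defense against several of these problems is to validate each incoming record before it reaches a model. The sketch below is a minimal example; the field names (`user_id`, `timestamp`, `amount`) and the plausibility window are hypothetical and would need to be replaced by whatever the actual data looks like.

```python
from datetime import datetime, timezone

# Hypothetical plausibility window for event timestamps.
EARLIEST = datetime(2010, 1, 1, tzinfo=timezone.utc)
LATEST = datetime(2030, 1, 1, tzinfo=timezone.utc)


def is_number(value):
    """True for int/float but not bool (bool is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_event(record):
    """Return a list of problems found in one raw record (empty if clean)."""
    problems = []

    user_id = record.get("user_id")
    if not isinstance(user_id, str) or not user_id.strip():
        problems.append("missing or corrupted user_id")

    timestamp = record.get("timestamp")
    if not is_number(timestamp):
        problems.append("timestamp is not numeric")
    else:
        try:
            when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            problems.append("timestamp out of range")
        else:
            if not EARLIEST <= when <= LATEST:
                problems.append("timestamp outside plausible window")

    if not is_number(record.get("amount")):
        problems.append("amount has wrong type")

    return problems


clean = {"user_id": "u42", "timestamp": 1376956800, "amount": 9.99}
dirty = {"user_id": "", "timestamp": 0, "amount": "12"}

print(validate_event(clean))  # []
print(validate_event(dirty))
```

Returning a list of problems instead of raising on the first one makes it possible to count failure types over time, which helps notice when a new kind of corruption appears.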
AND WITH MORE PROBLEMS
• Collected data is enriched with many operationally attainable sources
⇒ varying schemas and complicated ID soup
• Analytic data is often developed by the front line instead of an IT waterfall process
⇒ faster process, but volatile data definitions
• Data scientists asking for more data ⇒ temporary kludges
• Data is big and growing ⇒ risks of unnoticed discontinuity
NO, I’M NOT FINISHED YET
• The data is not a CSV file sitting on your disk
• It’s coming in every second of the year, often gigabytes per hour
• Availability of this data is a business-critical issue
• Availability of modeling results is a business-critical issue
• Robustness of modeling results is a business-critical issue
DATA DRIFT
• Real-world data is rarely stationary
• Equipment ages, people’s preferences change
• The quality of old data models decays
• Training and testing data may need to be specially designed
• Prefer recent data with weights or online learning
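Preferring recent data with weights can be done, for example, with exponential decay. The sketch below is one possible scheme, not the only one; the half-life of 30 days and the example values are assumptions chosen for illustration.

```python
def recency_weights(ages_in_days, half_life_days=30.0):
    """Exponential decay: an observation loses half its weight per half-life."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical drifting metric: old observations near 10, recent ones near 20.
ages = [90, 60, 30, 0]
values = [10.0, 10.0, 20.0, 20.0]

weights = recency_weights(ages)
plain = sum(values) / len(values)
weighted = weighted_mean(values, weights)

print(weights)   # [0.125, 0.25, 0.5, 1.0]
print(plain)     # 15.0
print(weighted)  # 18.0
```

The same weights can be passed as per-sample weights to most learning algorithms, so that the model tracks the current state of the data instead of its long-run average.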
ROBUST RESULTS?
• Inputs to a decision making process must be assessed for significance
“Can I trust these numbers? Is my decision justified?”
• Ad-hoc analyses can freely employ complex and bleeding edge modeling
• In operations, stability and robustness override everything else
• Sanity checks and fallbacks can be used to avoid failures and errors
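A sanity check with a fallback can be as simple as a wrapper around the model call. The sketch below assumes a model that is a plain callable returning a number; the function name `predict_with_fallback`, the bounds and the example models are all hypothetical.

```python
import math


def predict_with_fallback(model, features, fallback, lower, upper):
    """Return the model's prediction, or the fallback if it fails a sanity check."""
    try:
        prediction = float(model(features))
    except Exception:
        # The model crashed or returned something that is not a number.
        return fallback
    if math.isnan(prediction) or not lower <= prediction <= upper:
        # The prediction is outside the range that makes business sense.
        return fallback
    return prediction


# Hypothetical models for illustration.
def healthy_model(features):
    return 0.3


def crashing_model(features):
    return 1 / 0


def wild_model(features):
    return 1e9


for model in (healthy_model, crashing_model, wild_model):
    print(predict_with_fallback(model, {}, fallback=0.1, lower=0.0, upper=1.0))
# 0.3, then 0.1 twice
```

A sensible fallback is often the output of the previous, simpler production model or a long-run average, so that a failure degrades quality instead of stopping the service.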
POWER LAWS
[Chart: axes labeled "Number of users" and "Revenue per user"]
POWER LAWS
• Power laws are ubiquitous in the real world
• They follow from the principle: “Whoever has will be given more”
• Example: new links emerge to web pages in proportion to their popularity
• Product improvements can be tracked through changes in the power law curve
• Power laws often have a cut-off in the beginning,
not enough mass to fill the lowest ranks
• Examples
• User engagement and value
• Social network activity
• Brain activity
• Wealth distribution
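The "whoever has will be given more" mechanism can be simulated in a few lines. The sketch below is a simplified model under assumptions not taken from the slides: at each step a new page appears with a fixed probability (`new_page_probability`, a hypothetical parameter), and otherwise one new link goes to an existing page chosen in proportion to the links it already has. Steady arrival of new pages is part of the assumption; it is what lets early pages accumulate a long head start.

```python
import random


def simulate_links(num_steps, new_page_probability=0.1, seed=0):
    """Simulate link growth where popular pages attract links in proportion to popularity."""
    rng = random.Random(seed)
    link_counts = [1]   # link_counts[i] = number of links pointing to page i
    targets = [0]       # one entry per link, so a uniform draw is popularity-weighted
    for _ in range(num_steps):
        if rng.random() < new_page_probability:
            # A new page appears with a single link.
            link_counts.append(1)
            targets.append(len(link_counts) - 1)
        else:
            # A new link goes to an existing page, proportionally to its links.
            page = rng.choice(targets)
            link_counts[page] += 1
            targets.append(page)
    return link_counts


counts = sorted(simulate_links(num_steps=20000), reverse=True)
print("pages:", len(counts))
print("most linked pages:", counts[:5])
print("median page:", counts[len(counts) // 2])
```

Plotting `counts` against rank on log-log axes is a quick way to inspect how skewed the resulting distribution is.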
CONSEQUENCES OF POWER LAWS
• Power laws imply extremely skewed distributions
⇒ most models assume Gaussian or generally more balanced distribution
• Huge mass at the bottom of the ladder breaks most traditional analyses
• Different parts of the curve have complex real world interaction
• On the other hand it is relatively easy to segment power laws
⇒ separately designed treatment for different target groups
• Bringing new users as part of the power law lifts the whole curve as new
entries slowly diffuse along the curve
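One simple way to segment a power-law distribution is by order of magnitude, so that each target group spans one decade of the curve. The sketch below is a minimal illustration; the user names and revenue figures are made up, and non-paying users are placed in their own group because the logarithm is undefined at zero.

```python
import math
from collections import defaultdict


def segment_by_magnitude(revenue_by_user):
    """Group users into tiers by the order of magnitude of their revenue.

    Tier k holds users with revenue in [10**k, 10**(k+1)).
    Users with no revenue go into the tier labeled None.
    """
    tiers = defaultdict(list)
    for user, revenue in revenue_by_user.items():
        tier = math.floor(math.log10(revenue)) if revenue > 0 else None
        tiers[tier].append(user)
    return dict(tiers)


# Hypothetical revenue per user.
revenue_by_user = {
    "a": 0.0,
    "b": 2.5,
    "c": 40.0,
    "d": 75.0,
    "e": 900.0,
    "f": 12000.0,
}

segments = segment_by_magnitude(revenue_by_user)
print(segments)
# {None: ['a'], 0: ['b'], 1: ['c', 'd'], 2: ['e'], 4: ['f']}
```

Each tier can then get its own treatment, and movement of users between tiers over time gives a rough view of how new entries diffuse along the curve.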
THE IMPORTANCE OF PRESENTATION
• Operations or not, visualization is critical for acceptance
• Challenger shuttle disaster linked to poor visualization of O-ring failure risks
• Requires attention from business concept to implementation
• What information do these users want to see?
• How does this information support decision making?
• How to visualize it with clarity yet powerfully?
DATA SCIENCE IN BUSINESS
• Data analysis in business is not the sole task of the data scientist
• The whole organization must gradually mature and engage with data
• This is not a technical barrier, it is a human barrier
• How to design business and social processes to employ data?
• Average business has tons of low-hanging data fruit
• Developing and automating all that takes years (and years)
• No use for advanced modeling without visibility to the underlying
WHAT’S COMING UP
PROCESSING DATA IN THE FUTURE
• The event stream itself is increasingly becoming the master input data for
analytics and data solutions
• This is a big sea change, requiring new designs of storage and processing
• Seeing the full timeline and interactions of each object is a mixed blessing
PROS Huge opportunity for discovering significant value
CONS A very complex haystack, needs additional processing, how can a human
focus on the essential?
STREAM PROCESSING
• Instead of handling static states of the data, the data is processed
as it enters the system
• The tables turn: the internal state of the stream, persisted to a database,
now becomes the backup in case of failure
• Obvious fit for quickly reactive online learning solutions
• The whole domain was spearheaded by computer trading
• Another example: credit card transaction processing and fraud prevention
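The reversal of roles described above, where the stream holds the live state and storage is only a backup, can be sketched as follows. This is a toy example under stated assumptions: a JSON file stands in for the database, the class name `RunningCounter` and the event format are hypothetical, and events that arrive after the last checkpoint would have to be replayed from the source after a failure.

```python
import json
import os
import tempfile


class RunningCounter:
    """Per-key event counts updated one event at a time, with periodic checkpoints."""

    def __init__(self, checkpoint_path, checkpoint_every=100):
        self.checkpoint_path = checkpoint_path
        self.checkpoint_every = checkpoint_every
        self.counts = {}
        self.events_seen = 0
        if os.path.exists(checkpoint_path):
            # Recover the state persisted before the last failure or restart.
            with open(checkpoint_path) as handle:
                state = json.load(handle)
            self.counts = state["counts"]
            self.events_seen = state["events_seen"]

    def process(self, event):
        """Update the in-memory state as the event enters the system."""
        key = event["key"]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.events_seen += 1
        if self.events_seen % self.checkpoint_every == 0:
            self.checkpoint()

    def checkpoint(self):
        # Write to a temporary file and rename it, so that a crash
        # never leaves a half-written checkpoint behind.
        temp_path = self.checkpoint_path + ".tmp"
        with open(temp_path, "w") as handle:
            json.dump({"counts": self.counts, "events_seen": self.events_seen}, handle)
        os.replace(temp_path, self.checkpoint_path)


path = os.path.join(tempfile.mkdtemp(), "state.json")
counter = RunningCounter(path, checkpoint_every=2)
for key in ["fi", "fr", "fr", "us"]:
    counter.process({"key": key})

# Simulate a restart: a new instance resumes from the last checkpoint.
restored = RunningCounter(path, checkpoint_every=2)
print(restored.counts, restored.events_seen)  # {'fi': 1, 'fr': 2, 'us': 1} 4
```

Queries are answered from the in-memory state, and the checkpoint is read only on startup, which is the opposite of the snapshot-based flow where the database is the master copy.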
HADOOP AND DATA SCIENCE
• Hadoop is a general service platform, not just a MapReduce engine
• HBase is already becoming a hugely popular service backend
• In the long run Hadoop will also host a successful analytic database
• A wide selection of very different approaches to analytics and data science
exists already:
Hive and Pig, Impala, Mahout, Vowpal Wabbit, DataFu, Cloudera ML,
Giraph, RHadoop, …
REARRANGING THE MAP
• Change is not driven by replacing current bad solutions, but by innovating
around their shortcomings
• Stream processing of data will capture a large corner, driven by a sweeping
push closer to real-time
• High-level functional interfaces to data are another winner
• Examples: Cascading for batch processing, Trident for stream processing
• Further innovation in fixing MapReduce shortcomings
• Examples: Spark and Shark for iterative tasks, Impala for analytics
THE END