BigML's Take on Big Data
DESCRIPTION
BigML's take on Big Data. University of Geneva, October 12, 2012. In the "Big Data" era, rapidly and easily getting insights from your data or creating data-driven applications does not have to be painful. BigML shows how business managers, application developers, and data scientists can start building their own predictive models in a matter of minutes.
TRANSCRIPT
Geneva, October 12, 2012 - BigML Inc, 2012
Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
Francisco J Martin
BigML:
• Co-founder and CEO
• Joined: January 2011
• Tasks: Product conceptualization, design, and architecture
• Develops: BigML middle-end and public API
• 1202 (19%) of commits to total BigML code base
Background:
• 5-year degree in Computer Science, UPV
• Ph.D. in Artificial Intelligence, UPC
• Postdoc (Machine Learning), Oregon State University
• Founder and CEO at iSOCO
• Founder and CEO at Strands
• Co-authored 6 patents acquired by Apple Inc
• Directly raised $75+MM in venture capital and cashed out additional $18+MM for early investors
• Directly sold and negotiated $30+MM in licenses
Neo, sooner or later you're going to realize, just as I did, that there's a difference between knowing the path, and walking the path
Academia vs the Real-world
Walking the data path
[Timeline figure]
Years: 1996, 1999, 2002, 2004, 2011, 2012
Affiliations: Academia, iSOCO, Academia, Strands Inc, BigML Inc
Topics: 8-queen problem; Multi-agent Learning; Personalization; E-commerce; Recommender Systems; Music, video, fitness, finance; Intrusion Detection; Machine Learning; Large-scale Machine Learning
Labels: Everything, Data
BigML Status
• Founded in Jan 2011
• 9 FTE, 1 PT
• 5 Ph.Ds
• 4 patent applications:
  US Patent Application No. 61/557,826. For: METHODS FOR BUILDING AND USING DECISION TREES IN A DISTRIBUTED ENVIRONMENT. Filed: November, 2011
  US Patent Application No. 61/555,615. For: VISUALIZATION AND INTERACTION WITH COMPACT REPRESENTATION OF DECISION TREES. Filed: November, 2011
  US Patent Application No. 61/557,539. For: EVOLVING PARALLEL SYSTEM TO AUTOMATICALLY IMPROVE THE PERFORMANCE OF DISTRIBUTED SYSTEMS. Filed: November, 2011
  US Patent Application No. 61/710,175. For: SYSTEM AND METHODS TO EXCHANGE ACTIONABLE PREDICTIVE MODELS IN A VIRTUAL MARKETPLACE. Filed: October, 2012
• Advisors and BA:
Beneath Hill 60
From the trenches
BigML Team
Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
Big Data
• What is Big Data?
• What is a Data Scientist?
• How not to start with Big Data?
• What is Data-driven Decision Making?
Trends
http://strata.oreilly.com/2011/08/building-data-startups.html
What’s Big Data?
Big Data means way too many different things to many different people
“when the human cost of making the decision of throwing something away became higher than the machine cost of continuing to store it” George Dyson
What’s Big Data?
The 3 V’s
• Volume (big, enormous, huge, vast, immense, very large, etc.)
• Variety (heterogeneous, diverse, complex, multiple sources, sensors, etc.)
• Velocity (speed, dynamic, real-time, streamed, etc.)
The 3 I’s
• Immediate: in the sense that you need to do something about it
• Intimidating: what if you do not?
• Ill-defined: what is it, anyway?
Data matters!!!
Machine Learning
Even if we, human beings, are learning machines, we are really bad at processing even small amounts of data
Machines are good at quickly processing huge amounts of data. Machine Learning can make them learn from data
It’s all about machine learning
It's as if the machines have been in training all their lives to adapt and make use of the Big Data now being thrown at them - a combination of Moore's Law and the cloud mixed in with Machine Learning finally makes it all possible. --- Jeff Bussgang
Forget plastics. It’s all about machine learning
http://www.youtube.com/watch?v=PSxihhBzCjk
Learning from Data
Based on Learning from Data by Y. Abu-Mostafa, M. Magdon-Ismail and H. Lin
• Unknown Model, f : X -> Y. Example: ideal credit approval formula
• Training Examples, (x1, l1), (x2, l2), ..., (xN, lN). Example: historical records of credit customers
• Models, M. Example: set of candidate credit approval formulas
• Learning Algorithm
• Final Model, g ~ f. Example: learned credit approval formula
[Figure: table of training examples with rows x1 ... xN and columns f1, f2, ..., fn, label]
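The learning setup on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the candidate models are hypothetical income thresholds, the training examples are made-up credit records, and the "learning algorithm" is simply picking the candidate with the lowest training error.

```python
from typing import Callable, List, Tuple

# (x, label): hypothetical yearly income in thousands, 1 = credit approved
Example = Tuple[float, int]


def make_threshold_model(threshold: float) -> Callable[[float], int]:
    """One candidate formula from the model set M: approve if income >= threshold."""
    return lambda x: 1 if x >= threshold else 0


def training_error(model: Callable[[float], int], examples: List[Example]) -> float:
    """Fraction of training examples the model gets wrong."""
    wrong = sum(1 for x, label in examples if model(x) != label)
    return wrong / len(examples)


def learn(candidates: List[float], examples: List[Example]) -> float:
    """Toy learning algorithm: return the candidate threshold with the lowest training error."""
    return min(candidates, key=lambda t: training_error(make_threshold_model(t), examples))


# Made-up historical records of credit customers (the training examples).
examples = [(20.0, 0), (35.0, 0), (50.0, 1), (65.0, 1), (80.0, 1)]
# Made-up set of candidate credit approval formulas (the model set M).
candidates = [10.0, 30.0, 45.0, 70.0]

best_threshold = learn(candidates, examples)
g = make_threshold_model(best_threshold)  # final model g, meant to approximate the unknown f
```

Here `g` plays the role of the final model: it is whatever member of the candidate set best agrees with the historical records.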
What’s Big Machine Learning?
• Volume: Large-scale machine learning. What to do when data is too big to fit within the system memory of a single computer?
• Variety: Clean, refine, update, join, merge, aggregate, structure or deconstruct data until it matches the required input format or (why not) just generate/store data in the right format
• Velocity: Stream Algorithms
...or you can deal with that!
Machine Learning
More features
More examples
Does More Data beat Better Algorithms?
More Data or Better Models. Xavier Amatriain
The Unreasonable Effectiveness of Data
Global realization that learning from data (i.e., Machine Learning) can help us better analyze our past, understand our present, and predict our future. --- Francisco J Martin
Data Past Present Future
What’s Big Data?
Big Data
• What is Big Data?
• What is a Data Scientist?
• How not to start with Big Data?
• What is Data-driven Decision Making?
Is Wikipedia right?
Really? Seriously?? Are you kidding me???
Data can’t be wrong?
McKinsey can’t be wrong
Critical Shortage Of “Data Scientist” Talent Predicted By 2018
HBR can’t be wrong
Wikipedia is right!
If Data Scientists don’t exist, can they be created?
The first Data Scientist
Computer Scientist
Mathematician
Statistician
Hans’ brain, the first Data Scientist
The magic formula
A data scientist is “part analyst, part artist.”
Anjul Bhambhri, Vice President of Big Data Products at IBM
Are Data Scientists super heroes?
http://photos.oregonlive.com/photo-essay/2012/06/ashton_eaton_sets_decathlon_wo.html
The most powerful human super hero
Events           Decathlon World Record   High school World Record   World Record
100 m            10.21                    10.08                      9.58
Long Jump        8.23 m                   8.16 m                     8.95 m
Shot Put         14.20 m                  20.65 m                    23.12 m
High Jump        2.05 m                   2.31 m                     2.45 m
400 m            46.70                    44.69                      43.18
110 m hurdles    13.70                    13.74                      12.80
Discus throw     42.81 m                  61.38 m                    74.08 m
Pole Vault       5.30 m                   5.56 m                     6.14 m
Javelin Throw    58.87 m                  73.74 m                    98.48 m
1500 m           4:14.48                  3:38.26                    3:26.00
Are Data Scientists super heroes?
The Wikipedia is always right!
BigML’s Data Science Team
Areas:
• Machine Learning Research
• Large-scale and learning algorithm implementation
• Architecture, Software Design, Distributed Systems
• Infrastructure, Cloud-based Computing
• Design, Visualization, UI
• Business and Common Sense
• Product Design
Team:
• Tom Dietterich, PhD
• Charles Parker, PhD
• Adam Ashenfelter, MSc
• Jao, PhD
• Bea Garcia, BSc
• Poul Petersen, MSc
• Justin Donaldson, Ph.D.
• Francisco J Martin, PhD
• Oscar Rovira, MSc*
• Jos Verwoerd, MSc
Take Away
So instead of trying to quickly create “mediocre data scientists”, universities should focus on creating excellent mathematicians, statisticians, computer scientists, software architects, designers, etc. who are fabulous team players
Big Data
• What is Big Data?
• What is a Data Scientist?
• How not to start with Big Data?
• What is Data-driven Decision Making?
Iris Dataset
http://en.wikipedia.org/wiki/Iris_flower_data_set
Digesting Big Data
• Ingestion (capturing and storing)
• Digestion (processing)
• Absorption (deriving insights)
• Assimilation (making insights actionable)
• Egestion (reject bad data, wrong insights)
Too much attention!!!
Almost no attention!!!
Big Data meets Hadoop
• Hadoop has been excessively promoted as the way to make Big Data problems easy.
• There are quite a few vendors pushing different Hadoop flavors to the market.
However, Hadoop is complex, slow, expensive and batch
Running Hadoop on a cluster - The New IT sport of 2012
Big Data and Hadoop
Real-Time Hadoop?
Really? Seriously?? Are you kidding me???
Why not Hadoop?
• Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB).
• Iterative machine learning algorithms do not map trivially to MapReduce.
• Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s of GB of DRAM.
• In terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.
Rowstron, A. et al., Nobody ever got fired for using Hadoop on a cluster, Microsoft Research, Cambridge, 2012
• Hadoop is bad at iterative algorithms: high job startup costs, and it is awkward to retain state across iterations.
• High sensitivity to skew: iteration speed is bounded by the slowest task.
• Potentially poor cluster utilization: must shuffle all data to a single reducer.
Large-Scale Machine Learning at Twitter, Jimmy Lin
Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger
Making Big Data Small
Noel Welsh, Strata conference, London, October 2012
Hadoop
• Complex
• Slow
• Batch
• Expensive
Streaming Algorithms
• Simple
• Fast
• Real-time
• Cheap
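To make the "simple, fast, real-time, cheap" claim concrete, here is a minimal sketch of one classic streaming algorithm, Welford's online mean and variance. It is not taken from the talk; it is just a well-known example of a computation that sees each value once and keeps only three numbers in memory. The class name and the sample values are illustrative.

```python
class RunningStats:
    """Welford's online algorithm: one pass over the data, constant memory."""

    def __init__(self) -> None:
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new value into the summary without storing it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance of everything seen so far (0.0 with fewer than two values)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stands in for an unbounded stream
    stats.update(value)
```

The summary is available at any point in the stream, which is what makes this style of algorithm a fit for real-time use.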
Self-imposed Shackles
Tackling Big Data with Hadoop on a cluster is like self-imposing shackles on your own project
Once a baby elephant accepts the limitation imposed on him it becomes a permanent belief, or in his case, a conditioned reaction. Now as the elephant grows into adulthood, he has the power to easily pull the stake out of the ground, but his conditioning has taught him that the effort will not only be futile, it will be painful as well.
http://www.selfgrowth.com/articles/Martinez1.html
Starting with Big Data
How not to start:
• Buy a few machines and set up a cluster.
• Install and run any flavor of Hadoop.
• Figure out how to implement complex map-reduce algorithms to compute a few analytics.
How to start:
• Start with a very small data sample.
• Use free or cloud-based tools to build a first predictive model that you can understand.
• Check if the model gives you any practical insight.
• Use the model to generate predictions and see if it can improve your performance.
• Check how more data can improve the model.
• Check if more sophisticated models can beat your model.
• Iterate.
• Check if the volume, variety, and velocity of your data require a behind-the-firewall/cloud solution or a batch/stream solution.
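The "start small, then check how more data improves the model" steps can be sketched as a tiny learning-curve experiment. Everything here is an assumption made for illustration: the data is synthetic, the model is a one-threshold decision stump, and the sample sizes are arbitrary.

```python
import random
from typing import List, Tuple

Example = Tuple[float, int]  # (feature value, label)


def make_data(n: int, rng: random.Random) -> List[Example]:
    """Synthetic data: the label is 1 exactly when the feature is at least 0.5."""
    return [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(n))]


def fit_stump(train: List[Example]) -> float:
    """A first, understandable model: the threshold with the fewest training errors."""
    def errors(threshold: float) -> int:
        return sum(1 for x, y in train if (1 if x >= threshold else 0) != y)
    return min((x for x, _ in train), key=errors)


def accuracy(threshold: float, test: List[Example]) -> float:
    """Fraction of held-out examples the stump classifies correctly."""
    return sum(1 for x, y in test if (1 if x >= threshold else 0) == y) / len(test)


rng = random.Random(0)
test_set = make_data(2000, rng)   # held-out data used to evaluate every model

results = []
for sample_size in (10, 100, 1000):   # very small sample first, then more data
    sample = make_data(sample_size, rng)
    threshold = fit_stump(sample)
    results.append((sample_size, accuracy(threshold, test_set)))
```

Comparing the accuracies in `results` shows whether extra data is still buying anything before investing in heavier infrastructure.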
Big Data
• What is Big Data?
• What is a Data Scientist?
• How not to deal with Big Data?
• What is Data-driven Decision Making?
Data-Driven Decisions
http://www.nytimes.com/2011/04/24/business/24unboxed.html
Automated, data-driven decisions will significantly impact more industries than any other information system since “computers” were people
The “HiPPO” (Highest Paid Person’s Opinion) is dead
Predictive Analytics
• Descriptive Analytics: traditional, backward-looking business analytics
• Predictive Analytics: Machine Learning
“The goal of a predictive model is not to predict the future but to help you make a better decision in the present”
Taken from Paul Saffo, HBR
Predictive Model
Analytics and Predictive Analytics combined with Experience&Intuition
Data-Driven Decision Making
It’s time to switch the attention
• Ingestion (capturing and storing)
• Digestion (processing)
• Absorption (deriving insights)
• Assimilation (making insights actionable)
• Egestion (reject bad data, wrong insights)
Less attention!!!
More attention!!!
More focus on the models and how to operationalize them than on the infrastructure to generate them
Take aways
• Big Data is just data
• It’s all about machine learning
• Try to excel in one of the data science disciplines
• Don’t shackle yourself to the wrong platform
• Trying to predict the future can help you make the right decision in the present
• Focus on evaluation and actionability of models and not on how they are built
Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
BigML Goal
Highly Scalable, Cloud-based Machine Learning Service
Simple, Easy-to-Use and Seamless-to-Integrate
BigML vs ML
BigML 1-click model
You can deal with this...
...or you can deal with that!
BigML vs Big Data
BigML 1-click model
You can deal with this...
...or you can deal with that!
How it Works
True
Machine Learning Made Easy
“Any fool can make something complicated. It takes a genius to make it simple.”
― Woody Guthrie
Simple is not easy
Fully Web based
RESTful API
Agenda
• Short intro
• The Big Data Revolution
• What is BigML? - Demo
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
BigML’s Software Architecture
• Front-end [Neutronia]
• Middle-end [Apian]
• Backend [Wintermute]
• Infrastructure [Sauron]: Boto, Fabric
• [Sky]
• [CuriousYellow]
• [Medusa]
BigML’s AWS-based Architecture
Why Tree Models?
• Highly scalable
• Graphically representable and interactive
• Easily understandable
• Easily translatable into rules, PMML, and code
• Easily upgradable with ensembles: boosting, bagging, random forests, etc.
• Top performers! http://www.niculescu-mizil.org/papers/empirical.icml06.pdf
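As a rough illustration of "easily translatable into rules", the sketch below walks a small tree and prints one rule per leaf. The nested-dict tree format, the field names, and the thresholds are hypothetical and chosen only for the example; they are not BigML's model format.

```python
from typing import Dict, List, Tuple, Union

# A leaf is a predicted label; an internal node is a dict with a numeric test.
Tree = Union[str, Dict]


def tree_to_rules(tree: Tree, conditions: Tuple[str, ...] = ()) -> List[str]:
    """Return one IF ... THEN ... rule per leaf, following every root-to-leaf path."""
    if isinstance(tree, str):
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {tree}"]
    field, threshold = tree["field"], tree["threshold"]
    left_rules = tree_to_rules(tree["left"], conditions + (f"{field} <= {threshold}",))
    right_rules = tree_to_rules(tree["right"], conditions + (f"{field} > {threshold}",))
    return left_rules + right_rules


# Hypothetical tree over Iris-like fields; thresholds are illustrative only.
iris_tree = {
    "field": "petal length",
    "threshold": 2.45,
    "left": "Iris-setosa",
    "right": {
        "field": "petal width",
        "threshold": 1.75,
        "left": "Iris-versicolor",
        "right": "Iris-virginica",
    },
}

rules = tree_to_rules(iris_tree)
for rule in rules:
    print(rule)
```

The same traversal could emit code or a markup format instead of text rules, since each leaf is fully described by the tests on its path.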
BigML Histograms
BigML's trees and dataset summaries use histograms with the following traits:
• Streaming: data is never kept in memory; only one pass over the data is needed to capture the distribution.
• Memory constrained: the less memory allocated, the lossier the compressed distribution.
• Dynamic: the histogram bins adjust themselves as they observe the data.
• Robust to ordered data: so it works even if the data stream is non-stationary.
• Merge friendly: for parallelization and distribution.
More... http://blog.bigml.com/2012/06/18/bigmls-fancy-histograms/
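A minimal sketch of a histogram with these traits, written in the style of the streaming histograms described by Ben-Haim and Tom-Tov. Whether BigML's implementation works this way is an assumption; the class and method names are illustrative. Each value arrives as its own bin, and when the bin budget is exceeded the two closest bins are merged into their weighted mean.

```python
from bisect import insort
from typing import List, Tuple


class StreamingHistogram:
    """Fixed-size histogram built in one pass; bins are (centroid, count) pairs."""

    def __init__(self, max_bins: int) -> None:
        self.max_bins = max_bins
        self.bins: List[Tuple[float, int]] = []  # kept sorted by centroid

    def insert(self, value: float) -> None:
        """Add one observation as its own bin, then compress if over budget."""
        insort(self.bins, (value, 1))
        self._shrink()

    def _shrink(self) -> None:
        """Merge the two closest adjacent bins until the bin budget is met."""
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (m1, c1), (m2, c2) = self.bins[i], self.bins[i + 1]
            merged = ((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2)
            self.bins[i:i + 2] = [merged]

    def merge(self, other: "StreamingHistogram") -> "StreamingHistogram":
        """Combine two histograms, e.g. built on different machines or cores."""
        result = StreamingHistogram(self.max_bins)
        result.bins = sorted(self.bins + other.bins)
        result._shrink()
        return result

    def total(self) -> int:
        """Number of observations summarized so far."""
        return sum(count for _, count in self.bins)


left, right = StreamingHistogram(4), StreamingHistogram(4)
for v in range(100):
    left.insert(float(v))        # ordered input is fine: bins adapt as data arrives
    right.insert(float(99 - v))  # same values in reverse order
combined = left.merge(right)
```

Fewer bins mean less memory and a lossier summary, which is the trade-off the "memory constrained" trait describes.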
BigML Streaming Trees
BigML's trees are:
• CART: Classification & Regression Trees
• Grown breadth first: so partial trees are meaningful
• Built Hoeffding-style: so they consume streaming data and can split "early"
• Friendly for parallelization: can work over multiple cores or multiple computers
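"Hoeffding-style" early splitting is usually explained with the Hoeffding bound: after n observations, the true mean of a quantity with range R is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the observed mean, with probability 1 - delta. A hedged sketch of the resulting split test follows; the function names and the example numbers are illustrative, and the exact rule BigML applies is not stated in the slides.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the observed mean is within epsilon of the true mean
    with probability at least 1 - delta, after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split_early(best_gain: float, second_gain: float,
                    value_range: float, delta: float, n: int) -> bool:
    """Split as soon as the best candidate beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative numbers: a gain gap of 0.20, gains in [0, 1], 95% confidence.
enough_data = can_split_early(0.30, 0.10, value_range=1.0, delta=0.05, n=100)
too_little_data = can_split_early(0.30, 0.10, value_range=1.0, delta=0.05, n=10)
```

Because epsilon shrinks as n grows, a clear winner can be accepted after seeing only part of the stream, while close candidates wait for more data.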
Growing a Streaming Tree
• Each split breaks the data into subsets.
• The split should make the subsets as distinct from one another as possible.
• Subsets are chosen to maximize information gain (classification) or minimize squared error (regression).
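For the classification case, the split criterion above can be sketched as follows: compute the entropy of the labels before and after each candidate threshold and keep the threshold with the largest information gain. This is a plain in-memory illustration with made-up data, not the streaming, histogram-based procedure the talk describes.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent: Sequence[int], left: Sequence[int], right: Sequence[int]) -> float:
    """Entropy of the parent minus the size-weighted entropy of the two subsets."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)


def best_split(xs: List[float], labels: List[int]) -> Tuple[float, Optional[float]]:
    """Try a threshold between each pair of distinct sorted values; return (gain, threshold)."""
    pairs = sorted(zip(xs, labels))
    all_labels = [label for _, label in pairs]
    best: Tuple[float, Optional[float]] = (0.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        gain = information_gain(all_labels, all_labels[:i], all_labels[i:])
        if gain > best[0]:
            best = (gain, threshold)
    return best


# Made-up data: the two classes separate cleanly between 2.0 and 3.0.
gain, threshold = best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

For regression the same search would score each threshold by the squared error of the two subsets instead of by entropy.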
Distributed Streaming Trees
Streaming Trees - Early Splits
Agenda
• Short intro
• The Big Data Revolution
• What is BigML?
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
Automatic Evaluations
A marketplace for predictive models
“Any fool can make something complicated. It takes a genius to make it simple.”
― Woody Guthrie
Simple is not easy
True
Machine Learning Made Easy
Agenda
• Short intro
• The Big Data Revolution
• Demo
• Behind the scenes
• Coming down the pike
• Hacking with the BigML API
Back to the trenches
Gallipoli
Good Reading
Big Data Trends. David Feinleib. http://www.slideshare.net/bigdatalandscape/big-data-trends
Hey Graduates: Forget Plastics - It's All About Machine Learning. Jeff Bussgang. http://bostonvcblog.typepad.com/vc/2012/05/forget-plastics-its-all-about-machine-learning.html
More Data or Better Models. Xavier Amatriain. http://technocalifornia.blogspot.ch/2012/07/more-data-or-better-models.html
Making Big Data Small. Noel Welsh. http://strataconf.com/strataeu/public/schedule/detail/25984
Data Killed the HiPPO star. Jeff Jordan, Andreessen Horowitz. http://gigaom.com/2012/02/18/data-killed-the-hippo-star/
When There’s No Such Thing as Too Much Information. Steve Lohr. http://www.nytimes.com/2011/04/24/business/24unboxed.html
Nobody ever got fired for using Hadoop on a cluster. Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O’Shea, Andrew Douglas. http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
Six Rules for Effective Forecasting. Paul Saffo. http://www.usc.edu/schools/annenberg/asc/projects/wkc/pdf/200912digitalleadership_saffo.pdf
Large-scale Machine Learning at Twitter. Jimmy Lin and Alek Kolcz. http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf