A New Year in Data Science: ML Unpaused

Data Day Texas Austin, 2015-01-10

Paco Nathan, @pacoid

http://databricks.com/spark-training/

http://datadaytexas.com/

https://twitter.com/pacoid

http://liber118.com/

http://www.amplifypartners.com/

Observations about Machine Learning, Data Science, Big Data, Open Source, Cluster Computing, Notebooks, etc., over the past year … plus, a look ahead

Backstory

Backstory: The Sun Also Rises

Some wake early in the morning and go build buildings


Some gaze into the heavens, sit back, and explain the process…



Clearly, provably, our Sun revolves around the Earth at an observable rate


Others create and evaluate models to predict the Earth’s orbit of the Sun


Sometimes, when the sky gods become angry and obscure the Sun as our due punishment…

We grow scared and react: sacrifices must be offered, our plans must change, etc.


These points are what I’d like to discuss today

Whither Data Science?


https://twitter.com/josh_wills/status/198093512149958656

Feel free to disagree, but I find that definition to be flawed…



1. That ignores DevOps (how’s that working out?) and Visualization/Design (ditto)



2. When the CEO asks you to help explain why revenue nose-dived over the past month… neither field has a clue about how to model business phenomena


Software Engineering: implement and test a model that somebody selected

…has almost ignored the matter of modeling itself, at least since old-school types like Dijkstra

Statistics: measure and justify a model that somebody selected

…was never particularly good at teaching how to model problems, as two renowned statisticians, William Cleveland and Leo Breiman, noted

http://www.goodreads.com/quotes/32629-the-question-of-whether-a-computer-can-think-is-no

http://cm.bell-labs.com/stat/doc/datascience.ps

http://projecteuclid.org/euclid.ss/1009213726

Both fields are necessary, but not sufficient

The Thorn in the Side of Big Data: too few artists, Christopher Ré, Stanford

https://www.safaribooksonline.com/library/view/strata-conference-santa/9781491900321/part92.html

“You should think about features and not algorithms”

Remember EJBs?

Floyd Marinescu observed about the aftermath of EJBs in Brief History…

Intended for building framework components, e.g., for IBM, Oracle, Sun, but not many others

Based on RMI, prior to notions like RESTful web services

Enterprise Java Beans: Lessons from hate-watch reality television

https://web.archive.org/web/20050727000010/http://dynamicsemantics.blog-city.com/abriefhistoryofejb.htm

http://shop.oreilly.com/product/9780596009786.do

Maybe a handful of people in the world would ever actually need to use EJBs, but those few people wanted a spec

Then, for tragic political reasons (MSFT envy), Sun Microsystems made EJBs prominent in their Java APIs



Fortunately, we evolved: Spring, JBoss, etc., came along as relatively saner tech

Now we see the Docker thing soar, with notions such as microservices displacing legacy cruft

(BTW, if you haven’t yet, check out Weave)


http://www.apple.com


I mention this because, to me, EJB represented a convoluted form of template thinking:


developing complex web apps for the sake of developing complex web apps


IRL developers and template thinking don’t determine public policy… right?


To paraphrase Dean Wampler, consider WordCount, a simple app written for MapReduce in Hadoop: ~50 lines of unapologetic Java that feel hella like writing EJBs:


Compare that with functional programming, where the same WC app is three lines of easily-read Scala when run in Apache Spark:
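
A rough sketch of that functional shape, in plain Python rather than Spark (and not the Scala from the talk): split lines into words, then tally per word. The name `word_count` and the sample lines are hypothetical, for illustration only.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    # Like flatMap: split each line into words and flatten into one stream
    words = chain.from_iterable(line.split() for line in lines)
    # Like map + reduceByKey: tally one per occurrence, summed per word
    return Counter(words)


counts = word_count(["to be or not", "to be"])
print(counts)
```

The point is the same as on the slide: the whole job is a short chain of transformations on a collection, with no framework scaffolding around it.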

http://spark.apache.org/


Check out Dean’s talk at 11:00, “Why Scala is Taking Over the Big Data World”



Hadoop suffers because, IMHO, that convoluted EJB style of developer-centric template thinking staged a coup

Perhaps we could “donate” some OSS talent…

Send a pull request…

Or something.

https://twitter.com/acmurthy/status/371744218546331648

Lies, Damn Lies, Statistics, and Data Science

Probability got going, formally, in the 16th c. – although interesting mathematical estimations trace back to classical times

Arabs in the 9th c. used frequency analysis – later rediscovered by Europeans during the early Italian Renaissance

Statistics followed, originally more about what we might call demographics – through 18th c.

Laplace, Gauss, et al., bridged the fields in the late 18th c. using distributions (what we studied in Stats 101) to infer the probability of errors in estimates

Much of the 19th/20th c. work was about using goodness-of-fit tests, etc., to justify some distribution

• generally speaking, that requires samples

• that, in turn, implies batch windows


https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss

https://en.wikipedia.org/wiki/Pierre-Simon_Laplace


That kind of template thinking in action really lurvs it some batch windows

While 19th/20th c. stats work focused on defensibility

21st c. work, w.r.t. Big Data apps, focuses more on predictability – plus there’s a shift in how we make estimates…


BTW, doesn’t it seem weird to crunch through piles of data in large batch jobs, at large expense, when the results get used to approximate features ultimately? Why not perform that in stream?

A fascinating, relatively new area pioneered by relatively few people – e.g., Philippe Flajolet

Provides approximation with error bounds using far fewer resources (RAM, CPU, etc.)


http://algo.inria.fr/flajolet/

http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

http://algo.inria.fr/pfac/PFAC/PFAC.html
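
To make "in stream, with far fewer resources" concrete, here is a minimal sketch of a frugal-style streaming median estimate: it keeps a single integer of state and never stores samples. The function name `frugal_median`, the seed, and the synthetic uniform stream are assumptions for illustration; this is a toy, not a production estimator.

```python
import random


def frugal_median(stream, start=0):
    """Estimate the median of a numeric stream using one integer of state."""
    estimate = start
    for value in stream:
        if value > estimate:
            estimate += 1  # nudge up when the item sits above the estimate
        elif value < estimate:
            estimate -= 1  # nudge down when the item sits below it
    return estimate


# Hypothetical stream: 20,000 integers drawn uniformly from 0..1000
rng = random.Random(7)
stream = (rng.randint(0, 1000) for _ in range(20000))
approx = frugal_median(stream)
print(approx)
```

The estimate wanders around the true median (about 500 here) instead of landing on it exactly, which is the trade: a small, bounded error in exchange for constant memory and no batch window.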

algorithm: use case (reference; example code)

Bloom Filter: set membership (http://en.wikipedia.org/wiki/Bloom_filter; https://github.com/sameeragarwal/blinkdb)

MinHash: set similarity (http://en.wikipedia.org/wiki/MinHash; https://github.com/mrsqueeze/spark-hash)

HyperLogLog: set cardinality (http://algo.inria.fr/flajolet/Publications/DuFl03.pdf; https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdHLL.scala)

Count-Min Sketch: frequency summaries (https://sites.google.com/site/countminsketch/home; https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdCMS.scala)

DSQ: streaming quantiles (http://research.neustar.biz/2013/09/16/sketch-of-the-day-frugal-streaming/; https://github.com/laserson/dsq)

SkipList: ordered sequence search (ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf; http://redis.io/download)
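
As one example from the list above, a Bloom filter answers "have I seen this?" in fixed memory, with no false negatives and a tunable rate of false positives. A minimal sketch follows, assuming salted SHA-256 as the hash family; the class name, default sizes, and sample items are hypothetical and not taken from any of the linked projects.

```python
import hashlib


class BloomFilter:
    """Approximate set membership in fixed memory."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive k bit positions by salting a single hash with the index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for user in ["alice", "bob", "carol"]:
    seen.add(user)
print("alice" in seen)
```

Memory stays at `num_bits` no matter how many items arrive; only the false-positive rate grows as the filter fills.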


E.g., ±4% could buy you a two-orders-of-magnitude reduction in the required memory footprint for an analytics app

OSS projects such as Algebird and BlinkDB provide for this newer approach to the math of approximations at scale

https://github.com/twitter/algebird/

http://blinkdb.org/
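
A Count-Min Sketch shows the same trade for frequency counts: a small fixed table stands in for an exact per-key counter, and estimates can only err on the high side. This is a minimal sketch under assumed parameters (`width`, `depth`, salted SHA-256 hashing); it is not the Algebird or BlinkDB API, and the class and sample stream are hypothetical.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in a fixed depth x width table."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One column per row, chosen by a hash salted with the row index
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch()
for word in ["spark", "hadoop", "spark", "storm", "spark"]:
    cms.add(word)
print(cms.estimate("spark"))
```

Widening the table tightens the error bound and adding rows lowers the chance of a bad estimate, so the memory budget is set by the error you can tolerate, not by the number of distinct keys.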


Oscar Boykin at 14:00, “Aggregators: Modeling Data Queries Functionally”

co-author of Algebird, Scalding

The Interzone

Data Science is inherently interdisciplinary

To paraphrase Chris Ré, emphasis on algorithms is relatively minor in the grand scheme –

Especially when compared to needs for modeling business problems effectively

To wit: beyond phenomenology, leading into quantitative analysis and repeatable results

On the one hand, CS + Stats do not quite address those needs…

http://radar.oreilly.com/2011/09/building-data-science-teams.html

https://en.wikipedia.org/wiki/Phenomenology_(science)

On the other hand, Physics does well to teach modeling –

I like to hire physicists to work on Data teams…

They tend to get the interdisciplinary aspects: they have the math background, coding experience, are generally good at systems engineering, etc.

Not saying we should all rush out to get Physics degrees, but there’s something to be learned there, vital for the work and priorities ahead

https://en.wikipedia.org/wiki/Wilhelm_R%C3%B6ntgen

I mention this because we are at a crossroads, which has more to do with the physical world – some talks here at DDTx15 help illustrate that

Vast implications for Health Care, Transportation, Agriculture, Energy, Gov, Manufacturing in general…

More about that in a bit –


http://www.netflix.com/WiMovie/70305883

The Libraries

Most of the ML libraries that one encounters today focus on two general kinds of solutions:

• convex optimization

• matrix factorization

The Libraries: Alexandria Redux

http://stanford.edu/~boyd/admm.html

https://www.cs.purdue.edu/homes/dgleich/

http://www.knime.org/

http://www.rstudio.com/ide/docs/r_markdown

http://julialang.org/

http://pandas.pydata.org/

https://github.com/cloudera/oryx

http://spark.incubator.apache.org/

http://mahout.apache.org/

http://scikit-learn.org/stable/

https://github.com/JohnLangford/vowpal_wabbit

http://www.mathworks.com/products/matlab/

https://rapidminer.com/

http://www.wolfram.com/mathematica/new-in-10/highly-automated-machine-learning/

http://0xdata.com/product/

https://dato.com/

One might think of the convex optimization in this case as a kind of curve fitting – generally with some regularization term to avoid overfitting, which is not good
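
A minimal sketch of that idea: fit a line by gradient descent on a convex objective (mean squared error plus an L2 penalty on the slope). The function name `fit_ridge_line`, the learning rate, the step count, and the five data points are all assumptions for illustration.

```python
def fit_ridge_line(xs, ys, l2=0.1, lr=0.01, steps=5000):
    """Fit y ~ w*x + b by gradient descent on MSE + l2 * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the objective with respect to w and b
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errs, xs)) + 2.0 * l2 * w
        grad_b = (2.0 / n) * sum(errs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data, roughly y = 2x + 1 with a little noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]
w_plain, b_plain = fit_ridge_line(xs, ys, l2=0.0)
w_ridge, b_ridge = fit_ridge_line(xs, ys, l2=1.0)
print(w_plain, w_ridge)
```

With `l2=0.0` this is ordinary least squares; raising `l2` shrinks the slope toward zero, which is the regularization term pulling the fit away from chasing noise.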

[Figure: two fits, labeled Good and Bad]


For supervised learning, used to create classifiers:

1. categorize the expected data into N classes

2. split a sample of the data into train/test sets

3. use learners to optimize classifiers based on the training set, to label the data into N classes

4. evaluate the classifiers against the test set, measuring error in predicted vs. expected labels
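
The four steps above can be sketched end to end in plain Python. Everything here is an assumption for illustration: the two synthetic classes, the 75/25 split, and a nearest-centroid rule standing in for the "learner".

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Step 2: shuffle a copy of the sample and cut it into train/test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def fit_centroids(train):
    """Step 3: a very simple learner, one mean point per class label."""
    sums = {}
    for (x, y), label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Label a point with the class whose centroid is nearest."""
    x, y = point
    return min(
        centroids,
        key=lambda c: (centroids[c][0] - x) ** 2 + (centroids[c][1] - y) ** 2,
    )


# Step 1: two classes (N = 2), with hypothetical, well-separated data
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "good") for _ in range(100)]
data += [((rng.gauss(5, 1), rng.gauss(5, 1)), "bad") for _ in range(100)]

train, test = train_test_split(data)
model = fit_centroids(train)

# Step 4: measure error in predicted vs. expected labels on the test set
errors = sum(predict(model, point) != label for point, label in test)
error_rate = errors / len(test)
print(error_rate)
```

Note how little of the code is the learner itself; deciding on the classes and preparing the data take up the rest.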


Bokay, great for security problems with just two classes: good guys vs. bad guys

How do you decide what the classes are for more complex problems in business?

That’s where the matrix factorization parts come in handy…


http://faculty.cse.tamu.edu/davis/welcome.html

For unsupervised learning, which is often used to reduce dimension:

1. create a covariance matrix of the data

2. solve for the eigenvectors and eigenvalues of the matrix

3. select the top N eigenvectors, based on diminishing returns for how they explain variance in the data

4. those eigenvectors define your N classes
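
A minimal sketch of steps 1 to 3 in plain Python, assuming power iteration as the eigen-solver and finding only the single top eigenvector. The function names and the five sample rows are hypothetical; a real workload would use a linear algebra library.

```python
import math


def covariance_matrix(rows):
    """Step 1: sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenpair(matrix, iterations=200):
    """Step 2: dominant eigenvalue and unit eigenvector via power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector gives the eigenvalue
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * mv[i] for i in range(d)), v


# Hypothetical data lying close to the line y = 2x
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
cov = covariance_matrix(rows)
eigenvalue, direction = top_eigenpair(cov)

# Step 3: share of total variance explained by the top eigenvector
explained = eigenvalue / (cov[0][0] + cov[1][1])
print(direction, explained)
```

Here one direction explains nearly all of the variance, so two columns collapse into one feature; the "diminishing returns" test in step 3 is that ratio, computed for each successive eigenvector.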


An excellent overview of ML definitions (up to this point) is given in:


To wit: Generalization = Representation + Optimization + Evaluation

A Few Useful Things to Know about Machine Learning Pedro Domingos CACM 55:10 (Oct 2012) http://dl.acm.org/citation.cfm?id=2347755

http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

[Figure: ML workflow diagram, circa 2010, annotated with representation, optimization, evaluation. Labels: ETL into cluster/cloud; data; visualize, reporting; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; actionable results; decisions, feedback; bar developers; foo algorithms]

Algorithms and developer-centric template thinking only go so far in a workflow…

Results are shown in blue, and the real work is highlighted in red





1. focus on features not algorithms

2. learn how to model business problems by leveraging data

3. notice the workflows needed?

4. leave the dev-centric thinking for odd city council meetings






Matthew Kirk 12:00 “Lessons Learned: Machine Learning and Technical Debt”

Ted Dunning 13:00 “Computing with Chaos”

Julia Evans 15:00 “Data Pipelines. They're a lot of work!”

Christopher Johnson 16:00 “Scala Data Pipelines for Music Recommendations”

Even so, business demands extend far beyond what classifiers and labels alone can give us…

Businesses lurv Optimization, gobs of it; in that context ML libraries today merely scratch the surface

Round hole, square peg


http://web.stanford.edu/group/SOL/

http://www.club-mst3k.com/

Imagine that you compete with FedEx… how do you optimize delivery routes for airplanes, trucks, trains, nanodrones, hoverboards, etc.?

Which do you optimize: fuel cost, delivery time, maintenance schedules, minimizing lost packages?

Doesn’t sound much like online advertising, social networks, or any episode of Silicon Valley


http://www.instructables.com/id/Real-Hoverboard-Hovercraft/

http://www.imdb.com/title/tt2575988/

ML, Unpaused

What were the origins of machine learning?

• Marvin Minsky @MIT, 1950s

• Support Vector Machines @Bell Labs, 1990s

• Google @Stanford, 1990s

• Ray Kurzweil, 2000s

Nope…

http://www.media.mit.edu/people/minsky

http://link.springer.com/article/10.1007%2FBF00994018

http://archive.wired.com/wired/archive/13.08/battelle.html

http://www.kurzweilai.net/

ML has been an aspect of AI research for a long while, through several different vectors

A good early history (up to 1980s) is given in:

Machine Learning: A Historical and Methodological Analysis Jaime Carbonell, Ryszard Michalski, Tom Mitchell AI Magazine 4:3 (1983) http://dx.doi.org/10.1609/aimag.v4i3.406

To wit:

task-oriented studies, knowledge acquisition, cognitive simulation, theoretical exploration … overall, a much broader class of optimization problems

https://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/406/342

Circa early 1980s:

An era of anticipation: AI was making inroads…

• emphasis on capturing/representing knowledge and expertise, with production use cases in medicine

• Fifth Generation Computing (parallel h/w) in Japan ⇒ MCC, etc.

However:

• few outside academia had enough cluster compute power, aside from 3-letter agencies and AT&T

• meanwhile ML was not yet considered “academic” enough within academia

https://en.wikipedia.org/wiki/Fifth_generation_computer

Stock market “corrected” in 1987:

But…

https://www.google.com/finance?chdnp=1&chdd=1&chds=1&chdv=1&chvs=Linear&chdeh=0&chfdeh=0&chdet=1420313823555&chddm=493442&chls=IntervalBasedLine&q=INDEXSP:.INX&ntsp=0&ei=2kSoVMG4AoL5sQev6YHIDg

Circa early 1990s:

Some fundamental tech platforms emerge…

• Hubble Space Telescope, Human Genome Project, WWW, electric cars relaunched

And throughout that decade:

• Linux, Java @Sun, JavaScript @Netscape

• Firefly, an initial commercial ML app on teh interwebs @MIT Media Lab

• Rise of e-commerce leveraging horizontal scale-out with commodity hardware

http://www.businessweek.com/1996/41/b349690.htm

Stock market “tumbled” in 2000:

But…


Circa early 2000s:

GOOG AMZN EBAY YHOO LNKD NFLX FB TWTR emerged out of the dust…

• web apps dominated for search, e-commerce, social networks, etc.

• did we mention EJBs and template thinking?

• mobile picked up traction

• recommender systems went mainstream

• AI picked up with semantic web efforts…

http://doriantaylor.com/a-brief-and-fuzzy-history-of-web-application-development

http://www.crunchbase.com/category/semantic-web/ac34c4c6e430f66f44aeef8ca45b52bb

Stock market “went free-fall” in 2008:

But…


Circa mid 2010s:

Successful e-commerce firms have IPO’ed and are now busy building skyscrapers in downtown SF…

LinkedIn, 350 Bush

Transbay Transit

Salesforce, 415 Mission

http://www.bizjournals.com/sanfrancisco/stories/2008/03/31/focus32.html

http://transbaycenter.org/

http://techcrunch.com/2014/04/11/salesforce-will-pay-680m-to-expand-its-sf-hq-into-the-new-salesforce-tower/

An odd truism about the hubris of the uber-wealthy and the timing of their skyscraper projects…

But…

Sears Tower, Chicago

Lehman Brothers, London

Fontainebleau, Las Vegas

https://en.wikipedia.org/wiki/Willis_Tower

http://www.bloomberg.com/news/2014-06-27/lehman-haunted-by-780-million-london-skyscraper-claim.html

http://www.reuters.com/article/2009/06/10/fontainebleaulasvegas-idUSBNG42415320090610

Circa mid 2010s: Back to the Future

Businesses lurv Optimization, lots of it…

• ML circa 1985 focused on those needs, but got knocked back to something inevitably more Aristotelian and predictable

• Outside of Silicon Valley, we’ve made big strides

• One danger: next downturn cycle, VCs might reshape tech industry, reverting to “safe bets”

However, a few extremely interesting aspects have emerged…

http://web.stanford.edu/group/SOL/




We have approximation, deep learning and symbolic regression to assist on “Features”




Or, maybe, cognitive computing will help on several of the more difficult aspects of this…

Circa mid 2010s: Extremely Interesting Emerging Aspects

http://youtu.be/T0FEZYZqqV8

http://faculty.hampshire.edu/lspector/

http://www.research.ibm.com/cognitive-computing/


DeepDive @Stanford http://deepdive.stanford.edu/

Knowledge Graph @Google http://www.google.com/insidesearch/features/search/knowledge.html

IBM Watson http://www.ibm.com/smarterplanet/us/en/ibmwatson/

Scaled Inference https://scaledinference.com/


Rhetorical postures: “Is AI a good idea, or potentially harmful?” – per Elon Musk, et al.

http://mashable.com/2014/08/03/elon-musk-artificial-intelligence/

Clearly: good idea

http://www.brewbot.io/


Speaking of which, a highly recommended podcast by actual data scientists drinking really good beers: partiallyderivative.com

http://www.partiallyderivative.com/


2015: Notebooks in Containers in the Cloud

“Keep simple things simple and complex things possible.” databricks.com/product

Publishing Workflows for Jupyter Andrew Odewahn, Kyle Kelley, Rune Madsen odewahn.github.io/publishing-workflows-for-jupyter

IPython Interactive Demo Nature Magazine + Rackspace nature.com/news/ipython-interactive-demo-7.21492

http://databricks.com/product


http://odewahn.github.io/publishing-workflows-for-jupyter/#1

http://www.nature.com/news/ipython-interactive-demo-7.21492


Makes me wonder about the “data engineer” role … notebooks simplify ops needs, while ultimately the domain experts wield the real power with data



Frontstory

Frontstory: The Sun Also Rises

Some wake early in the morning and go build buildings: dev-centric templates

Some gaze into the heavens, sit back, and explain the process: 20th c. stats

Sometimes, when the sky gods become angry and obscure the Sun as our due punishment: VCs during recessions

Others create and evaluate models to predict the Earth’s orbit of the Sun: what’s needed most


Forward Motion:

SV trend: early data scientists displace old-school product managers

Because there are hard problems to be solved…

Because we need new eyes on target…

Because use cases…

Because Use Cases

Because Use Cases: Health Care

“In fact, using our Topological Data Analysis system, they were able to discover multiple types of Type 2 diabetes … huge impact on all the hundreds of millions of people” – Ayasdi

“Nobody knows what to do with those archives … They’re just sitting there, costing money. This is just seen as a big opportunity. It’s like, ‘Oh, this is what we were saving this up for!’” – Enlitic

“Sloan-Kettering is also training Watson on 1,500 real-world lung cancer cases, helping it to decipher physician notes and learn from the hospital’s expertise in treating cancer.” – IBM Watson

Employing tech such as deep learning and cognitive computing for vital use cases in health care:

http://lumiata.com/

http://www.enlitic.com/

http://www.ayasdi.com/

Because Use Cases: Transportation

http://automatic.com/

Detects events like hard braking and acceleration, uploaded in real time with geolocation to a Spark Streaming pipeline … data trends indicate road hazards, blind intersections, bad signal placement, and other input to improve traffic planning. Also detects inefficient vehicle operation, under-inflated tires, poor driving behaviors, aggressive acceleration, etc.

Because Use Cases: Education

https://databricks.com/blog/2014/12/08/pearson…

Integrates Kafka + Spark Streaming + Cassandra + Blur, running within a YARN cluster on AWS to provide a scalable, reliable, cloud-based platform for services that analyze student performance across product and institution boundaries.

Delivers immersive learning experiences designed for how students read, think, and learn, as well as efficacy insights to both learners and institutions which were not possible before.

Reliability features handle Kafka node failures, receiver failures, leader changes, committed offset in ZK, plus adjustable data-rate throughput.

http://www.pearsoned.com/

https://databricks.com/blog/2014/12/08/pearson-uses-spark-streaming-for-next-generation-adaptive-learning-platform.html

Because Use Cases: Language, everywhere

http://idibon.com/

http://digitalreasoning.com/

Our social fabric is encoded as text documents, and similarly it gets tested, deployed, maintained, and monitored there; it’s the launch point for cognitive computing.

Robert Munro, 12:00 “Building Better Experts: co-optimization of human and machine intelligence at Idibon”

Andrew Trask, David Gilmore 11:00 “Deep Learning for Natural Language Processing”

Because Use Cases: Geospatial

Advanced geo use cases throughout all levels of gov and industry for Big Data, machine learning, graph algorithms, approximations, etc.

If you roll trucks you probably use licenses from ESRI.

Also consider the IoT sensor data, e.g., from National Instruments' customers – where does it go, what do organizations use to analyze it?

These are the large-scale optimization problems you were looking for…

http://www.esri.com/

http://esri.github.io/gis-tools-for-hadoop/ (and Spark)

http://thunderheadxpler.blogspot.com/

http://geotrellis.io/

http://www.oculusinfo.com/tiles/

https://databricks.com/blog/2014/12/03/application-spotlight-technicolor-virdata-internet-of-things-platform.html

http://www.azavea.com/products/

http://virdata.com/

Because Use Cases: Telecom, Travel, Banking, etc.

http://spark-summit.org/2014/talk/stratio-streaming…

Stratio represents one of the most sophisticated integrations for Spark Streaming – the union of a real-time messaging bus with a complex event processing engine: Kafka, Spark Streaming, Cassandra, along with the Siddhi CEP engine

Telecom, in particular, is leveraging this new streaming technology as a big win near-term

BTW if you’re in Madrid next fall check out Big Data Hispano

http://spark-summit.org/2014/talk/stratio-streaming-a-new-approach-to-spark-streaming

http://www.openstratio.org/

https://github.com/stratio

https://github.com/Stratio/streaming-cep-engine

http://www.stratio.com/

http://www.bigdatahispano.org/

Because Use Cases…

Common theme: many of those use cases are powered by Apache Spark –

Especially notice Spark Streaming, which is a big game-changer for analytics across industry

Taylor Goetz 11:00 “Beyond the Tweeting Toaster: IoT Streaming Analytics With Apache Storm, Kafka, and Arduino”

Hari Shreedharan 12:00 “Real Time Data Processing Using Spark Streaming”

Because Use Cases: Agriculture

Ag+Data Issues http://radar.oreilly.com/2014/04/agdata.html

Data Guild whitepaper: Ag Systems + Data Outlook http://goo.gl/OK8RFf

• livelihood for 40% of world population

• $15T/year GDP globally

• data-intensive issues, much legal impasse

Over a half billion small farms worldwide, and most are family-run farms that rely on rain-fed agriculture

Nudge, and I just might propose DWave clusters into cold craters on the Lunar South Pole with routers @L5 and an LLO skyhook… to handle the vector quantization demands. Or something.

Layered Sensing Networks

microsats, e.g., Planet Labs, 400 km

airships, e.g., JP Aerospace, 40 km

atmosats, e.g., Titan Aerospace, 20 km

drones, e.g., HoneyComb, 120 m

robots, e.g., Blue River, 1 m

sensors, e.g., Hortau, -0.3 m

http://radar.oreilly.com/2014/04/agdata.html

http://goo.gl/OK8RFf

http://www.dwavesys.com/quantum-computing

http://meche.mit.edu/people/index.html?id=55

https://en.wikipedia.org/wiki/Lunar_south_pole

Resources

certification:

Apache Spark developer certificate program

• http://oreilly.com/go/sparkcert

• defined by Spark experts @Databricks

• assessed by O’Reilly Media

• establishes the bar for Spark expertise

MOOCs:

Anthony Joseph, UC Berkeley, begins 2015-02-23: https://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181

Ameet Talwalkar, UCLA, begins 2015-04-14: https://www.edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066

community:

spark.apache.org/community.html

events worldwide: goo.gl/2YqJZK

video+preso archives: spark-summit.org

resources: databricks.com/spark-training-resources

workshops: databricks.com/spark-training



confs:

Strata CA, San Jose, Feb 18-20: strataconf.com/strata2015

Spark Summit East, NYC, Mar 18-19: spark-summit.org/east

Big Data Tech Con, Boston, Apr 26-28: bigdatatechcon.com

Strata EU, London, May 5-7: strataconf.com/big-data-conference-uk-2015

Spark Summit 2015, SF, Jun 15-17: spark-summit.org


books:

Fast Data Processing with Spark, Holden Karau, Packt (2013): shop.oreilly.com/product/9781782167068.do

Spark in Action, Chris Fregly, Manning (2015*): sparkinaction.com/

Learning Spark, Holden Karau, Andy Konwinski, Matei Zaharia, O’Reilly (2015*): shop.oreilly.com/product/0636920028512.do



presenter:

Just Enough Math, O’Reilly, 2014: justenoughmath.com (preview: youtu.be/TQ58cWgdCpA)

Enterprise Data Workflows with Cascading, O’Reilly, 2013: shop.oreilly.com/product/0636920028536.do

monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/


