Data Science in 2016: Moving Up

2015-10-15 • Madrid • http://bigdataspain.org/

Paco Nathan, @pacoid, O’Reilly Media


• general patterns

• trends and analysis: the discipline, the jobs

• some good examples: moving up into use cases

• glimpses ahead: an emerging context

• a proposed theme

Data Science 2016: Moving Up

Design Patterns

Methodology for cloud-computing architecture (2008-06-29) http://ceteri.blogspot.com/2008/06/methodology-for-cloud-computing.html

[Figure: design pattern for "some cloud". Components: cluster scheduler, data pipes, containers, analytics, search/index, elastic compute, elastic storage]

• DataStax: $189.7M

• Confluent: $30.9M

• Databricks: $47M

• Jupyter: $6M

• Elastic: $104M

• Docker: $162M

• Mesosphere: $48.75M

Design Patterns: Issues

• integration could be better

• that implies sharing markets

• VCs in Silicon Valley dislike that

• customers need integration

Design Patterns: Where?

• that playing field becomes overly crowded, soon…

• what happens at that point?

• so much emphasis on plumbing: `data engineering`

• not enough on domain expertise, which trumps all

Much activity in Big Data seems awkwardly focused at the bottom of the tech stack: infrastructure, not domain

However, that may be changing…

Design Patterns: Opinion

Interesting Trends

There are many possible trends to discuss, but let’s concentrate on four of these going into 2016:

• leveraging multicore and large memory spaces

• generalized libraries for frequently repeated work

• workflows blend the best of people and computing

• framework for a big leap ahead, not just incremental

The original definition of what became relational databases had less to do with dedicated SQL products and more in common with something like Spark SQL

Interesting Trend #1: Contemporary Hardware

A relational model of data for large shared data banks Edgar Codd Communications of the ACM (1970) dl.acm.org/citation.cfm?id=362685
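As a rough illustration of that point, the relational model can be read as an algebra over sets of tuples, independent of any SQL product. The sketch below is plain Python under stated assumptions: the sample relations and the helper names (`select`, `project`, `natural_join`) are made up for this example, and it does not reflect how Spark SQL is implemented.

```python
# Relations are modeled as lists of dicts (one dict per tuple).
# All sample data below is hypothetical.

def select(relation, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Keep only the named columns, dropping duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out

def natural_join(left, right):
    """Pair rows that agree on every column name they share."""
    out = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[c] == r[c] for c in shared):
                out.append({**l, **r})
    return out

talks = [
    {"speaker": "ana", "event": "conf_a"},
    {"speaker": "luis", "event": "conf_b"},
]
events = [
    {"event": "conf_a", "city": "Madrid"},
    {"event": "conf_b", "city": "Lisbon"},
]

# "Which speakers present in Madrid?" as a composition of operators.
madrid_speakers = project(
    select(natural_join(talks, events), lambda row: row["city"] == "Madrid"),
    ["speaker"],
)
```

The query is a composition of operators over data; a DataFrame API with a logical plan follows the same shape, with an optimizer free to reorder the operators.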

[Figure, from Databricks: "Unified API, One Engine, Automatically Optimized". Language frontend: Python, Java/Scala, R, SQL, … • DataFrame Logical Plan • Tungsten backend: LLVM, JVM, GPU, NVRAM]

Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal Josh Rosen spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/

Physical Execution: CPU Efficient Data Structures

Keep data closer to CPU cache
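A loose analogy for that idea, sketched in plain Python rather than in Spark: rows stored as separate objects are scattered across the heap, while a packed column is one contiguous buffer that a scan can stream through. The data and variable names are assumptions for illustration only; Tungsten's actual binary formats are not shown here.

```python
from array import array

# Row-oriented: each row is its own tuple object, holding references to
# Python objects that live elsewhere on the heap.
rows = [(i, float(i)) for i in range(1000)]

# Column-oriented: each column is a single contiguous buffer of
# fixed-width values (typically 8 bytes each for "q" and "d").
ids = array("q", (i for i, _ in rows))
values = array("d", (v for _, v in rows))

# Scanning one column touches one dense block of memory.
total = sum(values)
column_bytes = values.itemsize * len(values)
```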

Interesting Trend #2: Generalized Libraries

Tensors are a good way to handle time-series geo-spatially distributed linked data with lots of N-dimensional attributes

In other words, nearly a general case for handling much of the data that we’re likely to encounter

That’s better than attempting to shoehorn data into matrix representation, then writing lots of custom code to support it
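A minimal sketch of what that can look like, assuming a sparse three-mode tensor indexed by (time step, region, attribute). The readings and function names are hypothetical, and plain dictionaries stand in for a real tensor library.

```python
from collections import defaultdict

# Sparse 3-mode tensor: (time_step, region, attribute) -> value.
# The values are made up for illustration.
readings = {
    (0, "madrid", "temp"): 21.0,
    (0, "madrid", "humidity"): 0.40,
    (1, "madrid", "temp"): 23.0,
    (0, "sevilla", "temp"): 27.0,
    (1, "sevilla", "temp"): 29.0,
}

def fiber(tensor, region, attribute):
    """Time series for one region and attribute, ordered by time step."""
    return [
        value
        for (t, r, a), value in sorted(tensor.items())
        if r == region and a == attribute
    ]

def mean_over_time(tensor):
    """Collapse the time mode, leaving a (region, attribute) -> mean table."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (t, r, a), value in tensor.items():
        sums[(r, a)] += value
        counts[(r, a)] += 1
    return {key: sums[key] / counts[key] for key in sums}

madrid_temps = fiber(readings, "madrid", "temp")
averages = mean_over_time(readings)
```

Adding another attribute or another index (say, a sensor id) only extends the key; nothing has to be reshaped to fit a fixed two-dimensional layout.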

Tensor factorization may be problematic, but probabilistic solutions seem to provide relatively general case solutions:

The Tensor Renaissance in Data Science Anima Anandkumar @UC Irvine radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html

Spacey Random Walks and Higher Order Markov Chains David Gleich @Purdue slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains

Interesting Trend #3: Leveraging Workflows

[Figure: data science workflow, circa 2010, spanning representation, optimization, evaluation. Stages shown: ETL into cluster/cloud; Data Prep; Features; Explore (Unsupervised Learning; visualize, reporting); train set, test set; Learners, Parameters; models; Evaluate; Optimize; Scoring in production; data pipelines; use cases; actionable results, decisions, feedback. Callouts: developers, algorithms]

APIs, algorithms, developer-centric template thinking – these only go so far; the overall context is a workflow…

look beyond an API, beyond a code repo … think of people and machines working together
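The stages named in the workflow figure can be sketched end to end in a few lines. This is only a toy under stated assumptions: the data is synthetic, the "learner" is a one-dimensional threshold, and every function name is made up for the example.

```python
import random

def prepare(raw):
    """Data prep: drop records with a missing feature."""
    return [(x, y) for x, y in raw if x is not None]

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then cut into train and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def train(train_set):
    """Learner: a threshold halfway between the two class means."""
    pos = [x for x, y in train_set if y == 1]
    neg = [x for x, y in train_set if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def evaluate(threshold, test_set):
    """Fraction of held-out rows classified correctly."""
    hits = sum((x > threshold) == (y == 1) for x, y in test_set)
    return hits / len(test_set)

# Synthetic, well-separated data plus one incomplete record.
raw = [(i, 0) for i in range(10)] + [(i, 1) for i in range(20, 30)] + [(None, 1)]

train_set, test_set = split(prepare(raw))
threshold = train(train_set)
accuracy = evaluate(threshold, test_set)
```

Each function is easy; the choices around them (which records to drop, what counts as good enough, what to do with the result) are where people enter the workflow.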

Interesting Trend #3: Leveraging Workflows

Chris Ré, @Stanford https://www.macfound.org/fellows/943/

Drugs, DNA, and Dinosaurs: Building High Quality Knowledge Bases with DeepDive Strata CA (2015)

The Thorn in the Side of Big Data: too few artists Strata CA (2014)

Interesting Trend #4: A Leap Ahead

cognitive computing “flywheel”: probabilistic reasoning about complex data and predictions together

Data Scientists

William Cleveland “Data Science: an Action Plan for Expanding the Technical Areas of the Field of Statistics,” International Statistical Review (2001), 69, 21-26 http://www.stat.purdue.edu/~wsc/papers/datascience.pdf

Leo Breiman “Statistical modeling: the two cultures”, Statistical Science (2001), 16:199-231 http://projecteuclid.org/euclid.ss/1009213726

…also good to mention John Tukey

Data Scientists: Primary Sources

Data Scientists: Five Years of Strata Conference

One 2015 report (RJMetrics) tallied a minimum of 11,400 data scientists worldwide by scraping LinkedIn

So many suddenly, really? Perhaps that’s doubtful…

Comparing surveys: O’Reilly Media conducts salary surveys of data scientists, which also explore the tools they use

2013 – tools, trends, not all data is “Big”, coding scripts!

2014 – correlation of tools and skills, rapid evolution

2015 – divide blurring between open source and proprietary

Data Scientists: Everywhere, all the time?

http://radar.oreilly.com/2015/09/2015-data-science-salary-survey.html John King, Roger Magoulas

Data Scientists: 2015 Survey

Moving Up

Enlitic http://www.enlitic.com/ deep learning to assist doctors treating cancer

Moving Up: Medicine

“Whatever the models might discover or predict, Howard isn’t suggesting they’ll do away with a doctor’s judgment. Rather, artificially intelligent computers could provide strong, unbiased second opinions, or perhaps lead a doctor down a path of investigation she otherwise wouldn’t have considered.”

With Enlitic, a veteran data scientist plans to fight disease using deep learning GigaOM (2014-08-22) https://gigaom.com/2014/08/22/with-enlitic-a-veteran-data-scientist-plans-to-fight-disease-using-deep-learning/

Moving Up: Political Platform

http://www.predikon.ch/en/voting-patterns/residents

Mining Democracy Matthias Grossglauser @EPFL ICT Labs (2015) http://ictlabs-summer-school.sics.se/slides/mining%20democracy.pdf

What if a political candidate could cluster political positions in a multi-dimensional data space, to optimize for being recommended to voters?
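To make the question concrete, here is a small k-means sketch in plain Python. It is an assumption-laden toy, not the method used by Predikon or in the cited talk: the voter positions are invented, each point is a pair of scores on two hypothetical issues, and the initial centroids are simply the first `k` points.

```python
import math

def kmeans(points, k, iterations=20):
    """Naive k-means with deterministic initialization (first k points)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

# Hypothetical voter positions on two issues, each scored from 0 to 1.
voters = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9), (0.15, 0.15), (0.85, 0.85)]

centroids, assignment = kmeans(voters, k=2)

# A candidate "optimizing for recommendation" might aim at the centroid
# of the largest cluster (ties broken by cluster index here).
largest = max(range(2), key=lambda c: assignment.count(c))
target_position = centroids[largest]
```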

Moving Up: Government Ethics

The White House has a plan to help society through data analysis Fortune (2015-09-30) http://fortune.com/2015/09/30/dj-patil-white-house-data/
“Opening up government data about child labor to concerned data scientists; recruiting folks to help analyze data about suicide prevention, social injustice and incarceration; a call for mandatory and `intrinsic` ethics instruction in every course teaching students data science; and an effort to help the transgender community create its own census of sorts, so that members and society can get a better grasp on the issues that matter to the group.”

Moving Up: Neuroscience

Analytics + Visualization for Neuroscience: Spark, Thunder, Lightning Jeremy Freeman 2015-01-29 youtu.be/cBQm4LhHn9g?t=28m55s

For excellent examples of Science and Data together see CodeNeuro, particularly for use of Jupyter notebooks + Apache Spark
Learning

Learning: What About MOOCs?

Massive Open Online Courses – a seven-year trend, beginning with:

Connectivism and Connective Knowledge George Siemens, Stephen Downes, University of PEI (2008) http://cck11.mooc.ca/
Adios Ed Tech. Hola something else George Siemens (2015-09-09) http://www.elearnspace.org/blog/2015/09/09/adios-ed-tech-hola-something-else/

Online education: MOOCs taken by educated few Ezekiel Emanuel, Nature 503, 342 (2013-11-21)

• 80% of students already have an advanced degree

• 80% come from the richest 6% of the population

Michael Shanks @Stanford: “retrenchment around traditional disciplines will make disparities even more pronounced”

An Early Report Card on Massive Open Online Courses Geoffrey Fowler, WSJ (2013-10-08)

Amherst, Duke, etc., have rejected edX

So then, what else works better?

How to Flip a Class CTL @UT/Austin http://ctl.utexas.edu/teaching/flipping-a-class/how

1. identify where the flipped classroom model makes the most sense for your course

2. spend class time engaging students in application activities with feedback

3. clarify connections between inside and outside of class learning

4. adapt your materials for students to acquire course content in preparation for class

5. extend learning beyond class through individual and collaborative practice

Learning: Inverted Classroom

Scalable Learning David Black-Schaffer @Uppsala, Sverker Janson @KTH SICS https://www.scalable-learning.com/

• active learning: Flipped Classroom and Just-in-time Teaching

• exams built directly into specific diagrams within videos

• metrics for where in video+code that students get stuck

• instructor can customize subsequent classroom discussions (active teaching phase) based on stuck/unstuck metrics
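One way such stuck/unstuck metrics could be computed is sketched below. Everything here is an assumption made for illustration: the event format (a student id plus the video second at which they flagged confusion or rewound), the 30-second bucket size, and the function name. It does not describe the Scalable Learning platform's own implementation.

```python
from collections import Counter

def stuck_hotspots(events, bucket_seconds=30, top=3):
    """Rank video segments by how many distinct students got stuck there.

    events: iterable of (student_id, video_second) pairs.
    Returns a list of (segment_start_second, student_count) pairs.
    """
    buckets = Counter()
    seen = set()
    for student, second in events:
        bucket = int(second // bucket_seconds) * bucket_seconds
        # Count each student at most once per segment.
        if (student, bucket) not in seen:
            seen.add((student, bucket))
            buckets[bucket] += 1
    return buckets.most_common(top)

# Hypothetical event log.
events = [("a", 65), ("a", 70), ("b", 62), ("c", 200), ("b", 215), ("a", 10)]

worst_segment = stuck_hotspots(events, top=1)
```

An instructor could then open the next class session with the segment that tops the list.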

Learning: Inverted Classroom

Learning programming at scale Philip Guo, O’Reilly Radar (2015-08-13) http://radar.oreilly.com/2015/08/learning-programming-at-scale.html

• PythonTutor

• Codechella

Tutors could keep an eye on around 50 learners during a 30-minute session, start 12 chat conversations, and concurrently help 3 learners at once

Learning: Collaborative Learning

Data-driven Education and the Quantified Student Lorena Barba @GWU, PyData Seattle (2015) https://youtu.be/2YIZ2SY9mW4

• keynote talk: abstract, slides

• homepage

• Open edX Universities Symposium, DC 2015-11-11

Learning: If you study just one link from this talk…

If by some bizarre chance you haven’t used it already, go to https://jupyter.org/

• 50+ different language kernels

• new funding 2015-07

• UC Berkeley, Cal Poly

• nbgrader autograder by Jess Hamrick

• jupyterhub multi-user server

• curating a list of examples

• repeatable science!

see also: Teaching with Jupyter Notebooks http://tinyurl.com/scipy2015-education

Learning: Jupyter Project

Learning: O’Reilly Media

https://beta.oreilly.com/

[Figure: audience patterns. Labels: in-person, blended, on-demand; Mostly Synchronous, Mostly Asynch, Inverted Classroom; Subscription, Free, Content]

Learning: Audience Patterns

Is it possible to measure “distance” between a learner and a subject community?

From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews Julian McAuley, Jure Leskovec http://i.stanford.edu/~julian/pdfs/www13.pdf
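One simple way to put a number on that question is sketched below: the cosine distance between the vocabulary a learner uses and the vocabulary of a community. This is a swapped-in, much simpler technique than the model in the cited paper, and the word lists are invented for the example.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 minus the cosine similarity of two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) | set(b))  # missing terms count as 0
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0

# Hypothetical vocabularies drawn from posts or reviews.
community = Counter("tensor gradient kernel tensor model".split())
newcomer = Counter("spreadsheet chart report".split())
regular = Counter("tensor model gradient".split())

far = cosine_distance(newcomer, community)
near = cosine_distance(regular, community)
```

Tracking that distance over time would give a rough trajectory of a learner moving toward, or away from, a subject community.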

Learning: Machine Learning about People Learning

Learning, Assessment, Team Building, Diversity – these can be accomplished together, in situ

Collective Intelligence in Human Groups Anita Williams Woolley @CMU https://youtu.be/Bz1dDiW2mvM

• balance of participation (no one dominates)

• 2+ women engaging within the group

• group size < 9

• diversity of formal backgrounds

Learning: Machine Learning about People Learning

People + Automation

Data Science teams apply machine learning (automation) to help arrive at key insights, to learn what is important in data sets – finding the proverbial needle in the haystack

Cognitive Computing exhibits people + automation as a process, in a learning context

That’s also a basic tenet of workflows in general: people + automation

And a key aspect of the emerging gig economy too…
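A minimal sketch of people + automation as a workflow step: the machine handles the cases it is confident about and routes the rest to a person. The scoring function, the 0.8 threshold, and all names are hypothetical, chosen only to show the shape of the pattern.

```python
def route(items, score, threshold=0.8):
    """Split items into machine-handled and human-review queues.

    score(item) must return a (label, confidence) pair.
    """
    automated, needs_review = [], []
    for item in items:
        label, confidence = score(item)
        queue = automated if confidence >= threshold else needs_review
        queue.append((item, label))
    return automated, needs_review

def toy_score(value):
    """Hypothetical model: confidence grows with distance from 0.5."""
    label = value >= 0.5
    confidence = 0.5 + abs(value - 0.5)
    return label, confidence

automated, needs_review = route([0.05, 0.45, 0.55, 0.95], toy_score)
```

Labels that people assign to the review queue can be fed back as training data, which is where the "learning together" part comes in.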

People + Automation: Gig Economy

http://orchestra.unlimitedlabs.com/

“Workflows with humans and machines”

Workers in a World of Continuous Partial Employment Tim O’Reilly, Medium (2015-08-31) https://medium.com/the-wtf-economy/workers-in-a-world-of-continuous-partial-employment-4d7b53f18f96

http://conferences.oreilly.com/next-economy

Learning is key. Effective use of Data Science in these new economic conditions requires people + automation, learning together – albeit in different ways. Plus, there’s an excellent framework for that:

Autopoiesis and Cognition Humberto Maturana, Francisco Varela, Springer (1973) https://books.google.es/books?id=nVmcN9Ja68kC

I’d like to leave this as a theme for you to consider about Data Science 2016, Moving Up into use cases…

We see an intersection of key points in both the emerging Cognitive Computing context and the Gig Economy in general:

systems of people + automation, learning together

It posits an interesting duality for us to leverage

With that I wish you a great conference here at Big Data Spain!

Gracias

contact:

Just Enough Math O’Reilly (2014)

justenoughmath.com

preview: youtu.be/TQ58cWgdCpA

monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/

Intro to Apache Spark O’Reilly (2015) shop.oreilly.com/product/0636920036807.do