data science - bosatsu · data scientist: n. person who is better at statistics than any software...

Data Science, Brian Sletten, @bsletten, 09/29/2014

Post on 02-Aug-2020


Page 1: Data Science - Bosatsu · Data scientist: n. person who is better at statistics than any software engineer and better at software engineering than any statistician. “ ” Josh Wills

Data Science

Brian Sletten

@bsletten 09/29/2014


Speaker Qualifications
· Specialize in next-generation technologies
· Author of "Resource-Oriented Architecture Patterns for Webs of Data"
· Speaks internationally about REST, Semantic Web, Security, Visualization, Architecture
· Worked in Defense, Finance, Retail, Hospitality, Video Game, Health Care and Publishing Industries
· One of Top 100 Semantic Web People


Agenda
· Introduction
· Data Science Techniques
· Programming
· Visualization
· Machine Learning
· Data Mining
· Big Data
· Linked Data


Introduction

http://www.delphianalytics.net/wp-content/uploads/2013/04/GrowthOfDataVsDataAnalysts.png


We’re witnessing the beginning of a massive, culturally saturated feedback loop where our behavior changes the product and the product changes our behavior. Technology makes this possible: infrastructure for large-scale data processing, increased memory, and bandwidth, as well as a cultural acceptance of technology in the fabric of our lives. This wasn’t true a decade ago.

Cathy O'Neil and Rachel Schutt


Correlation is not causation.

“Empirically observed covariation is a necessary but not sufficient condition for causality.”

Edward Tufte


“Correlation is not causation but it sure is a hint.”

Edward Tufte

So What?
· Unrelated (Pirates and Climate Change)
· Reverse Causation (Windmills Cause Wind)
· Bi-Directional Causation (Temperature/Pressure)
· Common Causal Variable (Sleeping with your shoes on causes headaches)
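The "common causal variable" case above can be sketched with a small simulation. Everything in it is an illustrative assumption rather than something from the slides: the hidden factor, the noise levels and the variable names are invented. A hidden cause drives both observed variables, which never influence each other, yet they come out strongly correlated until the hidden cause is subtracted.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical hidden cause (not named in the slides), never observed directly.
hidden = [rng.random() for _ in range(1000)]

# Two observed variables; each depends only on the hidden cause plus noise.
shoes_on = [h + rng.gauss(0, 0.1) for h in hidden]
headache = [h + rng.gauss(0, 0.1) for h in hidden]

raw_corr = pearson(shoes_on, headache)

# Subtract the hidden cause: what remains is two independent noise series.
resid_corr = pearson(
    [s - h for s, h in zip(shoes_on, hidden)],
    [a - h for a, h in zip(headache, hidden)],
)

print(f"correlation of the observed variables: {raw_corr:.2f}")
print(f"correlation after removing the common cause: {resid_corr:.2f}")
```

In real data the hidden cause is usually not available to subtract, which is why a strong correlation is a hint and not a conclusion.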


Long before worrying about how to convince others, you first have to understand what's happening yourself.

Andrew Gelman

Naïve realism, also known as direct realism or common sense realism, is a philosophy of mind rooted in a theory of perception that claims that the senses provide us with direct awareness of the external world.

Wikipedia, http://en.wikipedia.org/wiki/Naïve_realism


1951 Princeton/Dartmouth Game
· Storied rivalry
· Princeton's star player had his nose broken
· Princeton player snapped a Dartmouth player's leg
· Princeton won 13-0
· Editorials from both schools blamed the other
· Two versions of Truth


They Saw a Game
· Albert Hastorf (Dartmouth) and Hadley Cantril (Princeton) showed the game again to students from both schools
· Asked them to notice infractions and penalties, and to fill out a questionnaire
· Princeton students 'saw' twice as many infractions by Dartmouth players as Dartmouth students did
· Dartmouth students saw a 'rough but fair' game


In brief, the data here indicate that there is no such 'thing' as a 'game' existing 'out there' in its own right which people merely 'observe.' The game 'exists' for a person and is experienced by him only insofar as certain happenings have significances in terms of his purpose.

Hastorf and Cantril, They Saw a Game


Everything that has ever happened to you has happened inside your skull.

David McRaney, You Are Now Less Dumb

Comparing the Students
· All male
· Ethnically and socioeconomically similar
· Same part of the country
· Same age
· Same basic culture and religious beliefs
· Different schools


It’s a real problem, though, when politicians, CEOs, and other people with the power to change the way the world works start bungling their arguments for or against things based on self-delusion generated by imperfect minds and senses.

David McRaney, You Are Now Less Dumb

Real World Issues
· Foods and Cancer
· Vaccination and Autism
· Global Warming
· GMOs


Data scientist: n. person who is better at statistics than any software engineer and better at software engineering than any statistician.

Josh Wills

There's a distinct lack of respect for researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers and scientists of all types.

Cathy O'Neil and Rachel Schutt


http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram


Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what is possible.

Cathy O'Neil and Rachel Schutt


Data Science Strategy
· Engineering and Infrastructure for collection and logging
· Privacy Access policies
· Role in Decision Making Process


Narratives are meaning transmitters. They are history-preservation devices. They create and maintain cultures, and they forge identities that emerge out of the malleable, imperfect memories of life events.

David McRaney, You Are Now Less Dumb


Your narrative bias makes it nearly impossible for you to really absorb the information from the outside world without arranging it into causes and effects.

David McRaney, You Are Now Less Dumb

Your ancestors invented the scientific method because the common belief fallacy renders your default strategies for making sense of the world generally awful and prone to error.

David McRaney, You Are Now Less Dumb


Your natural tendency is to start from a conclusion and work backward to confirm your assumptions, but the scientific method drives down the wrong side of the road and tries to disconfirm your assumptions.

David McRaney, You Are Now Less Dumb

Data Science Techniques


Doing Data Science (O'Neil and Schutt)


Math
· Statistics
· Linear Algebra
· Numerical Analysis
· Calculus


Statistics
· Observations and Samples
· Bias
· Modeling
· Distributions
· Fitting a model
· Overfitting
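"Fitting a model" and "Overfitting" from the list above can be illustrated with a deliberately tiny sketch. The data are invented for the example: the true relationship is assumed to be y = 2x, and the training points carry noise of alternating sign. A straight-line fit is compared with a model that simply memorizes the training points, standing in here for any overly flexible model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(predict, xs, ys):
    """Average squared difference between predictions and observed values."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training data (invented): true relationship y = 2x plus alternating noise.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]

# Fresh, noise-free points that fall between the training points.
test_x = [x + 0.25 for x in range(9)]
test_y = [2 * x for x in test_x]

slope, intercept = fit_line(train_x, train_y)


def line_model(x):
    return slope * x + intercept


def memorizing_model(x):
    # Return the y value of the closest training point, noise included.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [("line", line_model), ("memorizer", memorizing_model)]:
    train_err = mean_squared_error(model, train_x, train_y)
    test_err = mean_squared_error(model, test_x, test_y)
    print(f"{name}: train error {train_err:.3f}, test error {test_err:.3f}")
```

The memorizing model reproduces the training data perfectly because it has also learned the noise, and that is exactly what hurts it on the new points.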


Techniques
· Linear Regression
· k Nearest Neighbor
· k Means clustering
· Decision Trees
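As one concrete instance of the techniques listed above, here is a minimal k Nearest Neighbor classifier in plain Python. The points, labels and the helper name `knn_predict` are made up for illustration; a real project would more likely reach for a library implementation.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label a query point by majority vote among its k closest training points.

    training is a list of ((x, y), label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Two invented, well-separated groups of labelled points.
training = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.5), "a"),
    ((6.0, 6.0), "b"), ((6.5, 7.0), "b"), ((7.0, 6.5), "b"),
]

near_a = knn_predict(training, (1.2, 1.4))
near_b = knn_predict(training, (6.8, 6.1))
print(near_a, near_b)
```

There is no training step: all the work happens at query time, so the cost of a prediction grows with the size of the stored data.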


Programming

Programming Languages
· C/C++
· Fortran
· Python
· Julia
· R


Python
· General Purpose, High-Level Programming Language
· Emphasis on readability
· Expressive syntax
· Supports OO, FP, Procedural Programming
· Dynamic
· CPython/Jython


SciPy
· Ecosystem of open-source packages for science, math and engineering
· NumPy
· SciPy
· Matplotlib
· IPython
· SymPy
· pandas


SciPy/NumPy
· Numerical analysis
· Optimization problems
· N-Dimensional Arrays
· Linear Algebra
· Fourier transformations
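A short sketch of three items from the list above (N-dimensional arrays, linear algebra and Fourier transforms) using NumPy only. The array contents, the small linear system and the test signal are arbitrary values chosen for illustration; optimization, which lives in SciPy, is left out to keep the example small.

```python
import numpy as np

# N-dimensional array: 24 values arranged as 2 layers of 3 x 4.
cube = np.arange(24).reshape(2, 3, 4)
layer_sum = cube.sum(axis=0)  # add the two layers element-wise, shape (3, 4)

# Linear algebra: solve the system a @ x = b.
a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(a, b)

# Fourier transform: a sine wave with 3 cycles across 32 samples.
n = np.arange(32)
signal = np.sin(2 * np.pi * 3 * n / 32)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))  # index of the dominant frequency

print(layer_sum.shape, x, peak_bin)
```

Each step operates on whole arrays at once, which is the main reason NumPy code tends to be both shorter and faster than the equivalent Python loops.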


Julia
· An attempt to create a high-performance, general purpose numerical language
· Think: MATLAB meets Fortran, Python and Lisp
· LLVM-based JIT Compiler
· Growing base of packages
· Impressive performance benchmarks
· MIT License


Nerdy Language Features
· Multiple dispatch
· Dynamic type system
· Built-in package manager
· Lisp-like macros and other metaprogramming fu
· Ability to call C/Python functions
· Supports parallel and distributed processing
· Coroutines


R
· Created by Ross Ihaka and Robert Gentleman
· Maintained by the R Development Core Team
· Part of the GNU Project
· Commercial support
· Implemented in C, Fortran and R


Supports
· Basic Math
· Statistical Analysis
· Optimization Problems
· Signal Processing
· Graphics and Visualization
· Data Mining
· Machine Learning


Visualization


Techniques
· Graphical Analysis
· Presentation Graphics


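ipython --pylab

# pylab mode preloads the NumPy and matplotlib names used below
x = linspace(0, 10, 100)        # 100 evenly spaced points from 0 to 10
plot(x, sin(x))                 # first curve
plot(x, 0.5 * cos(2 * x))       # second curve on the same axes
title("A matplotlib plot")
text(1, -0.8, "A text label")   # label placed at data coordinates (1, -0.8)
ylim(-1.1, 1.1)
savefig("matplotlib.png")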

PYTHON


d3.js
· Data-Driven Documents
· JavaScript library that uses HTML, SVG and CSS
· Bind data to the DOM
· Data-driven transformations


https://github.com/mbostock/d3/wiki/Gallery

http://mbostock.github.io/d3/talk/20111116/iris-splom.html

http://bost.ocks.org/mike/uberdata/

http://exposedata.com/parallel/

http://mbostock.github.io/d3/tutorial/circle.html


Machine Learning


Data Mining

Big Data


First, it is a bundle of technologies. Second, it is a potential revolution in

measurement. And third, it is a point of view, or philosophy, about how decisions

will be— and perhaps should be— made in the future.

Steve Lohr, New York Times (2013-10-09)

Everything You Know About Something

ID      Col1    Col2    Col3    Col4    Col5    Col6    ....  ColN
Thing1  Value1  Value2  Value3  Value4  Value5  Value6  ....  ValueN


Everything You Know About Everything

ID      Col1    Col2    Col3    Col4    Col5    Col6    ....  ColN
Thing1  Value1  Value2  Value3          Value5          ....  ValueN
Thing2  Value1          Value3  Value4  Value5  Value6  ....  ValueN
Thing3          Value2  Value3          Value5  Value6  ....  ValueN
Thing4  Value1  Value2  Value3  Value4  Value5  Value6  ....  ValueN
...     ...     ...     ...     ...     ...     ...     ....  ...


Distribute Rows in their Entirety

ID      Col1    Col2    Col3    Col4    Col5    Col6    ....  ColN
Thing1  Value1  Value2  Value3          Value5          ....  ValueN
Thing3          Value2  Value3          Value5  Value6  ....  ValueN


Distribute Columns in their Entirety

ID      Col2    Col3    Col5    ColN
Thing1  Value2  Value3          ValueN
Thing3  Value2  Value3  Value5  ValueN
Thing4  Value2  Value3  Value5  ValueN
...     ...     ...     ...     ...
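The two layouts above, whole rows versus whole columns, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table is a cut-down version of the one on the slides, the helper names `partition_rows` and `partition_columns` are invented, and round-robin placement is a simplification of whatever scheme a real store would use.

```python
# A sparse table: each row only lists the columns it actually has.
table = {
    "Thing1": {"Col1": "Value1", "Col2": "Value2", "Col3": "Value3"},
    "Thing2": {"Col1": "Value1", "Col3": "Value3"},
    "Thing3": {"Col2": "Value2", "Col3": "Value3"},
    "Thing4": {"Col1": "Value1", "Col2": "Value2", "Col3": "Value3"},
}


def partition_rows(rows, n_nodes):
    """Row-oriented layout: every node holds complete records."""
    nodes = [{} for _ in range(n_nodes)]
    for index, (row_id, row) in enumerate(sorted(rows.items())):
        nodes[index % n_nodes][row_id] = dict(row)  # round-robin placement
    return nodes


def partition_columns(rows, columns_per_node):
    """Column-oriented layout: every node holds some columns for all rows."""
    nodes = []
    for columns in columns_per_node:
        node = {}
        for row_id, row in rows.items():
            node[row_id] = {c: row[c] for c in columns if c in row}
        nodes.append(node)
    return nodes


row_nodes = partition_rows(table, 2)
col_nodes = partition_columns(table, [["Col1"], ["Col2", "Col3"]])

print(sorted(row_nodes[0]))     # ids of the complete rows held by node 0
print(col_nodes[1]["Thing2"])   # node 1's slice of the Thing2 record
```

With rows kept whole, reading one complete record touches a single node; with columns kept whole, scanning one attribute across every record touches a single node.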


Linked Data


Distribute Arbitrary Cells

ID      Col2    Col3    Col5    ColN
Thing1          Value3          ValueN
Thing3                  Value5  ValueN
Thing4  Value2  Value3          ValueN
...     ...     ...     ...     ...
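One way to read "arbitrary cells" is that every populated cell becomes a self-describing statement of the form (thing, column, value), which is close in spirit to the subject, predicate, object triples used by Linked Data. The sketch below takes that reading as an assumption; the split across two nodes and the helper name `rebuild_rows` are invented for illustration.

```python
# Every populated cell from the table above, as a (thing, column, value) triple.
cells = [
    ("Thing1", "Col3", "Value3"),
    ("Thing1", "ColN", "ValueN"),
    ("Thing3", "Col5", "Value5"),
    ("Thing3", "ColN", "ValueN"),
    ("Thing4", "Col2", "Value2"),
    ("Thing4", "Col3", "Value3"),
    ("Thing4", "ColN", "ValueN"),
]


def rebuild_rows(*sources):
    """Merge triples from any number of sources back into per-thing rows."""
    rows = {}
    for source in sources:
        for thing, column, value in source:
            rows.setdefault(thing, {})[column] = value
    return rows


# Cells can live anywhere; here they are split arbitrarily across two nodes.
node_a = cells[:4]
node_b = cells[4:]

merged = rebuild_rows(node_a, node_b)
print(merged["Thing4"])
```

Because each triple names its own row and column, no node needs to know the full schema, and new columns can appear without changing the ones already stored.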


Linking Open Data Project
· Started in 2007 by W3C Semantic Web Education and Outreach (SWEO) Interest Group
· Make data freely available
· Doubled in size every 10 months
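To get a feel for what "doubled in size every 10 months" implies, the growth can be written as a factor of 2 raised to (months / 10). The function below is just that arithmetic; the name `growth_factor` and the sample durations are illustrative, and it assumes the doubling rate stays constant, which the slide does not claim beyond the period it describes.

```python
def growth_factor(months, doubling_period=10):
    """How many times larger something gets if it doubles every doubling_period months."""
    return 2 ** (months / doubling_period)


for months in (10, 20, 30, 60):
    print(f"after {months} months: {growth_factor(months):g}x the starting size")
```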


Domain                  # Datasets  # Triples       # Links
Media                   25          1,800,000,000   50,000,000
Geographic              31          6,000,000,000   35,000,000
Government              49          13,000,000,000  19,000,000
Publications            87          2,900,000,000   140,000,000
Cross-Domain            41          4,100,000,000   63,000,000
Life Sciences           41          3,000,000,000   191,000,000
User-Generated Content  20          134,000,000     3,400,000
Total                   295         31,000,000,000  504,000,000

http://lod-cloud.net/state/


DBPedia
· Linked Dataset derived from Wikipedia
· Creative Commons Attribution-ShareAlike 3.0 License
· GNU Free Documentation License
· Multi-domain
· Consensus-based
· Kept current by Wikipedia activity
· Multi-lingual


DBPedia Numbers (English Version)
http://dbpedia.org/About
· Describes 4 million things
· 3.22 million are classified by an ontology
· 832,000 people
· 639,000 places
· 372,000 creative works
· 209,000 organizations
· 226,000 species
· 5,600 diseases


DBPedia Numbers (Non-English Version)
http://dbpedia.org/About
· 119 Localized Language Versions
· Describe 24.9 million things (w/ repetition)
· 16.8 million are connected to English DBPedia


DBPedia Summary
http://wiki.dbpedia.org/Datasets39/DatasetStatistics?v=dqp
· Overall 12.6 million unique things
· 24.6 million links to images
· 27.6 million links to pages
· 45 million links to other RDF datasets
· 67 million links to Wikipedia categories
· 41.2 million links to YAGO categories
· 2.46 billion RDF triples
· 470 million (English), 1.98 billion (Non-English)


Use Cases
http://wiki.dbpedia.org/UseCases?v=ene
· Improve Wikipedia Search
· Include DBPedia data in your documents
· Support for Geographic Data
· Documentation Classification, Annotation
· Multi-Domain Ontology


DBPedia
http://dbpedia.org


Most Important Query Ever Run
http://tinyurl.com/n9hhs68


http://www.r-bloggers.com/sparql-with-r-in-less-than-5-minutes/

http://linkedscience.org/tools/sparql-package-for-r/tutorial-on-sparql-package-for-r/
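The tutorials linked above query SPARQL endpoints from R. For comparison, here is a hedged Python sketch that only builds the request URL for a simple query and does not send it. The endpoint address `http://dbpedia.org/sparql`, the example query and the helper name `build_request_url` are assumptions for illustration; they are not taken from the slides or from the tutorials.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed address of DBpedia's public SPARQL endpoint.
ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: a handful of resources typed as dbo:Person.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person WHERE {
  ?person a dbo:Person .
} LIMIT 5
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)

# Decode the URL again to confirm the query survives the round trip.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(url)
```

Fetching the URL, for example with `urllib.request.urlopen`, would return the result set; the response format depends on the endpoint and on the Accept header sent, so that step is left out here.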


Books


Data Science Books
· Data Analysis w/ Open Source Tools, Philipp K. Janert (ORA)
· Data Smart: Using Data Science to Transform Information into Insight, John W. Foreman (Wiley)
· Doing Data Science: Straight Talk from the Frontline, Cathy O'Neil, Rachel Schutt (ORA)
· The R Book, Michael J. Crawley (Wiley)
· R Tutorial w/ Bayesian Statistics Using OpenBUGS, Chi Yau
· Applied Predictive Modeling, Max Kuhn and Kjell Johnson (Springer)
· Introductory Statistics w/ R, Peter Dalgaard (Wiley)
· Think Stats, Allen B. Downey (ORA)
· R Cookbook, Paul Teetor (ORA)
· R Graphics Cookbook, Winston Chang (ORA)


http://www.quora.com/Data-Science/How-do-I-become-a-data-scientist

http://cm.bell-labs.com/cm/ms/departments/sia/doc/datascience.pdf


Questions?

[email protected]

@bsletten

http://tinyurl.com/bjs-gplus

bsletten