data science in the wild › lectures › lec1... · a definition of data science wikipedia data...

27
Data Science in the Wild Giri Iyengar Cornell University [email protected] Jan 24, 2018 Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 1 / 27

Upload: others

Post on 27-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Data Science in the Wild

Giri Iyengar

Cornell University

[email protected]

Jan 24, 2018

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 1 / 27

Page 2: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Overview

1 IntroductionAbout your InstructorWhat is Data Science?What we will cover in this course?

2 Class MechanicsSoftware tools

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 2 / 27

Page 3: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Overview

1 IntroductionAbout your InstructorWhat is Data Science?What we will cover in this course?

2 Class MechanicsSoftware tools

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 3 / 27

Page 4: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

About me

EE from IIT Mumbai, India. PhD from MIT (Media Lab)Researcher at IBM Research doing Audio-Visual Speech Recognitionand Multimedia MiningStartup No. 1 - Mobile and Social Media appsStartup No. 2 - Big Data Machine Learning as a Service. Acquired byAOL/Verizon in 2015Engineering Director of Merchandising, currently Head of ComputerVision, eBay

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 4 / 27

Page 5: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

What is the excitement all about?

Harvard Business Review called it the sexiest job of the 21st century(https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/)

HBR

Mashable called it the best job in America(http://mashable.com/2016/01/20/the-best-jobs-in-america-2016/),based on a recently concluded Glassdoor annual survey Mashable

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 5 / 27

Page 6: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Elections Projections

Figure: Nate Silver Elections projections

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 6 / 27

Page 7: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

A Definition of Data Science

WikipediaData Science is an interdisciplinary field about processes and systems toextract knowledge or insights from large volumes of data in various forms,either structured or unstructured, which is a continuation of some of thedata analysis fields such as statistics, data mining and predictive analytics,as well as Knowledge Discovery in Databases (KDD).Data scientists use their data and analytical ability to find and interpretrich data sources; manage large amounts of data despite hardware,software, and bandwidth constraints; merge data sources; ensureconsistency of datasets; create visualizations to aid in understanding data;build mathematical models using the data; and present andcommunicate the data insights and findings.

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 7 / 27

Page 8: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Data Science

Figure: Drew Conway Venn DiagramGiri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 8 / 27

Page 9: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Who is a Data Scientist?

Josh Wills, Slack Data Scientist, Open Source CommitterData Scientist (n.): Person who is better at statistics than any softwareengineer and better at software engineering than any statistician.

IBM Developer WorksPart Scientist, Part Artist

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 9 / 27

Page 10: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

A broader perspective

Figure: Rob Hyndman Venn Diagram

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 10 / 27

Page 11: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

The goal of this course

Teach skills beyond Machine Learning and Database management systems

Figure: Kernel Machine by Alisneaky, RDBMS by Scifipete

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 11 / 27

Page 12: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

What is Machine Learning?

Machine learning is a subfield of computer science that evolved from thestudy of pattern recognition and computational learning theory in artificialintelligence. Machine learning explores the study and construction ofalgorithms that can learn from and make predictions on data. Suchalgorithms operate by building a model from example inputs in order tomake data-driven predictions or decisions, rather than following strictlystatic program instructions.

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 12 / 27

Page 13: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Machine Learning as per HBR

How Machines Learn (and you win) HBR

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 13 / 27

Page 14: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

ML compared with DS

Machine Learning1 Develop new models2 Prove mathematical

properties3 Validate on relatively clean

(possibly small) datasets4 Publish paper

Data Science1 Explore many models, focus on

tuning2 Understand empirical properties

of models3 Handle messy, massive datasets4 Actionable systems

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 14 / 27

Page 15: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

DBMS compared with DS

Database Systems1 Individual records valuable2 Modest data volumes3 Structured, Consistent,

Auditable4 ACID compliance

Data Science1 Individual rows ”cheap”2 Massive data volumes3 Structured, Unstructured, and

everything in between4 Lots of ad-hoc

querying/transformations

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 15 / 27

Page 16: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

The Data Science Process

Figure: Data Science Process by Farcaster

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 16 / 27

Page 17: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Data provides valuable insights

Figure: Seven Countries Study - Cholesterol vs Mortality Go

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 17 / 27

Page 18: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Good Data Visualization is Invaluable

Figure: Gapminder - Wealth vs Health Go

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 18 / 27

Page 19: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Good Data Visualization is Invaluable

Figure: Facebook World Connections

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 19 / 27

Page 20: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Good Data Visualization is Invaluable

Figure: World Bank OpenData World Bank

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 20 / 27

Page 21: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

What makes Data Science hard?

Insufficient domain knowledgeIncorrect assumptionsAd-hoc explanations of data patternsOverreachValidation/Data integrityComplex data and modeling pipelinesGoing from prototype to productionCommunicating the implications

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 21 / 27

Page 22: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Topics Covered

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 22 / 27

Page 23: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Overview

1 IntroductionAbout your InstructorWhat is Data Science?What we will cover in this course?

2 Class MechanicsSoftware tools

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 23 / 27

Page 24: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Class Mechanics

Meet twice a week. Mondays, Wednesdays 4:45-6:00 PM6 Programming assignments1 Course Project

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 24 / 27

Page 25: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Software tools we’ll be using

PyTorch PyTorch

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 25 / 27

Page 26: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Weekly Reading

Forrester Analyst Video Play

Hilary Mason Video Play

Hans Rosling TED Talk TED

Short History of Data Science Blog

O’Reilly Definition of Data Science OReilly

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 26 / 27

Page 27: Data Science in the Wild › lectures › lec1... · A Definition of Data Science Wikipedia Data Science is an interdisciplinary field about processes and systems to extract knowledge

Giri Iyengar (Cornell Tech) Data Science Introduction Jan 24, 2018 27 / 27