Musings on Data Science and Students Experiencing Data Analytics New England SENCER Center for Innovation Prof. Randy Paffenroth Data Science Program Department of Mathematical Sciences Worcester Polytechnic Institute [email protected] 2014



Page 1

Musings on Data Science and Students Experiencing Data Analytics

New England SENCER Center for Innovation

Prof. Randy Paffenroth

Data Science Program

Department of Mathematical Sciences

Worcester Polytechnic Institute

[email protected]

2014

Page 2

My Research

"Internet Connectivity Access layer" by User:Ludovic.ferre - Internet_Connectivity_Overview2_Access.svg. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Internet_Connectivity_Access_layer.svg#mediaviewer/File:Internet_Connectivity_Access_layer.svg

Page 3

This is a panel, so I want to be provocative!

Provocative

Adjective

1. tending or serving to provoke; inciting, stimulating, irritating, or vexing.

So, I will be a little sad if I don’t end up irritating anyone

Page 4

The first war: Terminology

• Analyzing data has a long history!

• There have been many terms that have been used to describe such endeavors:

• Statistics

• Artificial Intelligence

• Machine learning

• Data analytics

• Since I happen to work in a “Data Science” program perhaps I may be allowed the indulgence of using that terminology…

Page 5

Whatever we call it, what makes things different now?

Page 6

Experiments, observations, and numerical simulations in many areas of science and business are currently generating terabytes of data, and in some cases are on the verge of generating petabytes and beyond. Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high-energy physics and to the development of new information-based industries. - Frontiers in Massive Data Analysis, National Research Council of the National Academies

Given a large mass of data, we can by judicious selection construct perfectly plausible unassailable theories—all of which, some of which, or none of which may be right. - Paul Arnold Srere

Page 7

The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it. - Hal Varian, Google's Chief Economist, http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers

My personal goal: Getting students to be able to think critically about data.

Page 8

What is Big Data? There are many examples of "data", but what makes some of it “big”? The classic definition revolves around the three Vs.

Volume, velocity, and variety.

Volume: There is just a lot of it being generated all the time. Things get interesting, and “big”, when you can’t fit it all on one computer anymore. Why? There are many ideas here, such as MapReduce, Hadoop, etc., that all revolve around being able to process data that goes from terabytes, to petabytes, to exabytes.

Velocity: Data is being generated very quickly. Can you even store it all? If not, then what do you get rid of and what do you keep?

Variety: Different types of data all take different shapes. What does it mean to store them so that you can play with or compare them?
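To make the MapReduce idea mentioned under Volume a little more concrete, here is a minimal single-machine sketch in Python of its map, shuffle, and reduce steps, applied to word counting. This is only an illustration: the function names and the toy `chunks` data are hypothetical, and it does not use Hadoop or any real distributed framework. Each string in `chunks` stands in for a block of data that would live on a different machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different machine;
    # here the chunks are simply processed one after another.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


# Hypothetical toy data, not taken from the talk.
chunks = ["big data is big", "data science is not big data"]
counts = word_count(chunks)
print(counts)
```

The point of the pattern is that the map step looks at one chunk at a time and the reduce step looks at one key at a time, so neither step ever needs the whole data set on one computer.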

http://pl.wikipedia.org/wiki/Green_Giant#mediaviewer/Plik:Jolly_green_giant.jpg


Page 10

Is Big Data the same as Data Science?

Are Big Data and Data Science the same thing? I wouldn't say so... Data Science can be done on small data sets. And not everything done using Big Data would necessarily be called Data Science. But there certainly is a substantial overlap!

[Figure: diagram with two regions labeled "Big Data" and "Data Science"]

Page 11

Can you even be certain?

For real world problems, I claim that you will never be certain of any inferences from data.

I mean, what happens to your carefully thought-out marketing plan for some rocking slacks when the Martians land?

What is unacceptable is when the data you actually have does not support the conclusion you report.

Public domain image

Page 12

It can be easy to fool yourself!

Human beings are really good at pattern detection...

http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)

Perhaps a bit too good!
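The same trap is easy to reproduce with numbers. The sketch below, in Python, is an illustration that is not from the talk: it draws a purely random target and many purely random "features", then reports the strongest correlation found by searching through them. All names and parameter values (`n_samples`, `n_features`, `seed`) are hypothetical choices. Because everything is noise, any strong correlation it finds is a pattern that is not really there.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_samples=20, n_features=1000, seed=0):
    """Largest |correlation| between a random target and random features."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is independent noise, unrelated to the target.
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


one = best_spurious_correlation(n_features=1)
many = best_spurious_correlation(n_features=1000)
print("best |r| with 1 feature:    ", round(one, 2))
print("best |r| with 1000 features:", round(many, 2))
```

With only a few samples and many candidate features, the best correlation found by searching tends to look impressive even though none of the features has anything to do with the target.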


Page 14

Skills for Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 15

Which is most important?

http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 16

WPI Data Science Program: A Collaboration

• Business School

• Computer Science Department

• Mathematical Sciences Department

Page 17

M.S. in Data Science Program

• INTEGRATIVE DATA SCIENCE (3 CREDITS)

• GRADUATE QUALIFYING PROJECT OR MS THESIS (3 TO 9 CREDITS)

• MATHEMATICAL ANALYTICS (3 CREDITS)

• DATA ACCESS & MANAGEMENT (3 CREDITS)

• DATA ANALYTICS & MINING (3 CREDITS)

• BUSINESS INTELLIGENCE & CASE STUDIES (3 CREDITS)

• CONCENTRATION AND ELECTIVES (9 TO 15 CREDITS)

Page 18

Data Science Core

INTEGRATIVE DATA SCIENCE:

• DS 501 INTRODUCTION TO DATA SCIENCE (NEW COURSE)

MATHEMATICAL ANALYTICS (SELECT ONE):

• MA 543/DS 502 STATISTICAL METHODS FOR DATA SCIENCE (NEW COURSE)
• MA 542 REGRESSION ANALYSIS
• MA 554 APPLIED MULTIVARIATE ANALYSIS

DATA ACCESS AND MANAGEMENT (SELECT ONE):

• CS 542 DATABASE MANAGEMENT SYSTEMS
• MIS 571 DATABASE APPLICATIONS DEVELOPMENT
• CS 561 ADVANCED TOPICS IN DATABASE SYSTEMS
• CS 585/DS 503 BIG DATA MANAGEMENT (NEW COURSE)

DATA ANALYTICS AND MINING (SELECT ONE):

• CS 548 KNOWLEDGE DISCOVERY AND DATA MINING
• CS 539 MACHINE LEARNING
• CS 586/DS 504 BIG DATA ANALYTICS (NEW COURSE)

BUSINESS INTELLIGENCE AND CASE STUDIES (SELECT ONE):

• MIS 584 BUSINESS INTELLIGENCE
• MKT 568 DATA MINING BUSINESS APPLICATIONS

Data Science Certificate Program (18 credits):

• 15-CREDIT DATA SCIENCE CORE
plus
• 3-CREDIT ELECTIVE

Page 19

2014 Data Science Cohort

NATIONALITY

CAMBODIA

INDIA

CHINA

PAKISTAN

TAIWAN

IRAN

U.S.A.

BRAZIL

NEPAL

AFGHANISTAN

INDONESIA

EDUCATIONAL FOUNDATION

• QUANTITATIVE/COMPUTATIONAL BACKGROUNDS
• PROGRAMMING WITH DATA STRUCTURES AND ALGORITHMS FOR COMPUTATIONAL SKILLS
• QUANTITATIVE SKILLS: CALCULUS, LINEAR ALGEBRA AND STATISTICS

EMPLOYMENT HISTORIES

• SENIOR RESEARCH ANALYST
• SENIOR BUSINESS ANALYST
• PATIENT FINANCIAL SERVICES
• DATA BASE ANALYST-ARCHITECT
• DECISION SCIENTIST
• MINISTRY OF FINANCE
• LAHEY HEALTH
• TECHNICAL PROGRAM MANAGEMENT
• U.S. DEPARTMENT OF STATE

GENDER: 66.7% Male, 33.3% Female

10% FULBRIGHT SCHOLARS

Page 20

2014 Data Science Cohort

FALL 2014

• Total Applicants: 126
• Total Acceptances: 33
• Fulbright Scholars: 3
• Brazil Science Mobility Student: 1
• Countries Represented: 9
• Domestic Students: 5
• International Students: 28

Many hold more than one earned Bachelor’s Degree.

US universities include Columbia, UNH and WPI.

Dean Oates gave two awards of $5K to outstanding students. These awards help attract top students.

Page 21

Skills Acquired by Our Students

Fundamental/Technical:

SQL/ Data Modeling / Cleaning

Data Integration / Warehousing

Statistical Learning / Machine Learning

Distributed Computing

Big Data Management

Classification / Regression / Decision Trees

Business Intelligence

Distributed Mining Algorithms

Professional Skills:

Business Use Cases / Entrepreneurship

Interdisciplinary Teams / Leadership

Tools:

Oracle / MySQL / DB2 / SQL Server

R / SAS / SciKit

Weka / RapidMiner / MATLAB

IBM Cognos / SPSS Modeler

Hadoop / Mahout / Cassandra

Python / Java / Cloud Computing

Storm / Spark / InfoSphere Streams

Spotfire / Tableau

Professional Skills:

Story Telling / Visualization

Presentations / Reports

Page 22

Data Science Tools for Students: Free!

Software:

•Python

• http://www.python.org/

• IPython: http://ipython.org/

• Numpy: http://www.numpy.org/

• Pandas: http://pandas.pydata.org/

• Matplotlib: http://matplotlib.org/

• Mayavi: http://mayavi.sourceforge.net/

• Scikit-learn: http://scikit-learn.org/stable/

Data:

•UCI Machine learning repository

• http://archive.ics.uci.edu/ml/

•Kaggle

• https://www.kaggle.com/

•U.S. Government

• https://www.data.gov/