CSCI 3022 Intro to Data Science
with Probability and Statistics
What is Data Science?
Seriously. What do YOU think it is?
What is Data Science?
Probability and Statistics, Linear Algebra, Numerical Optimization, Computing Tools
Data Analysis and Inferential Statistics
Modeling and Machine Learning
Data Mining and Pattern Recognition
Course Topics
o Exploratory Data Analysis
o Cleaning, Munging, and Wrangling Data
o Probability Theory and Simulation
o Hypothesis Testing and Inferential Statistics
o Modeling and Classification
Exploratory Data Analysis
Dig into your data, compute simple statistics, draw pictures and look for INSIGHT
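As a small illustration of "compute simple statistics, draw pictures," here is a sketch that uses only Python's standard library. The commute-time numbers are invented for this example, and in practice the same summaries would more likely come from Pandas and a plotting library.

```python
import statistics

# Hypothetical data: commute times in minutes (invented for illustration).
commute_minutes = [12, 15, 15, 18, 21, 22, 25, 30, 34, 95]

mean = statistics.mean(commute_minutes)
median = statistics.median(commute_minutes)
stdev = statistics.stdev(commute_minutes)  # sample standard deviation

print(f"mean   = {mean:.1f}")
print(f"median = {median:.1f}")
print(f"stdev  = {stdev:.1f}")

# A crude text "picture": one '*' per observation in each 20-minute bin.
bin_width = 20
counts = {}
for x in commute_minutes:
    start = (x // bin_width) * bin_width
    counts[start] = counts.get(start, 0) + 1

for start in sorted(counts):
    print(f"{start:3d}-{start + bin_width - 1:3d} | {'*' * counts[start]}")
```

The mean sits well above the median here, which is the kind of hint (a single large value pulling the average up) that a quick look at the data is meant to surface.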
Data Cleaning and Wrangling
Some data scientists report that they spend up to 80% of their time cleaning messy data sets
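To give a feel for what "messy" means, the sketch below cleans a few made-up records: it trims whitespace, normalizes capitalization, converts types, marks missing values, and drops duplicates. The field names, the records, and the `MISSING` markers are all assumptions chosen for illustration; it uses plain Python so it runs anywhere.

```python
# Hypothetical raw records, invented to show typical messiness.
raw_rows = [
    {"name": "  Alice ", "age": "34", "city": "boulder"},
    {"name": "BOB", "age": "n/a", "city": "Denver "},
    {"name": "carol", "age": " 29", "city": ""},
    {"name": "  Alice ", "age": "34", "city": "boulder"},  # duplicate row
]

# Strings we choose to treat as "missing" (an assumption for this example).
MISSING = {"", "n/a", "na", "none"}


def clean_row(row):
    """Trim whitespace, normalize case, and convert types for one record."""
    name = row["name"].strip().title()
    age_text = row["age"].strip().lower()
    city_text = row["city"].strip()
    return {
        "name": name,
        "age": None if age_text in MISSING else int(age_text),
        "city": None if city_text.lower() in MISSING else city_text.title(),
    }


cleaned = []
seen = set()
for row in raw_rows:
    rec = clean_row(row)
    key = (rec["name"], rec["age"], rec["city"])
    if key not in seen:  # drop rows that are identical after normalization
        seen.add(key)
        cleaned.append(rec)

for rec in cleaned:
    print(rec)
```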
Probability Theory and Simulation
Different probability distributions model different types of events
The Binomial Distribution: The number of customers that will unsubscribe from your company’s email-list as a function of how many advertisements you send them
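A minimal sketch of the binomial idea, under the assumption that each of `n` subscribers unsubscribes independently with the same probability `p`. The values `n = 20` and `p = 0.1` are made up. It compares the exact probability formula with a quick simulation.

```python
import math
import random


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed values: 20 subscribers, each with a 10% chance of unsubscribing.
n, p = 20, 0.1

# Exact probability that exactly 2 of the 20 unsubscribe.
print(f"P(X = 2) = {binomial_pmf(2, n, p):.4f}")

# Simulation: repeat the 20 independent yes/no trials many times.
rng = random.Random(3022)  # fixed seed so the run is repeatable
trials = 10_000
draws = [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]
sim_mean = sum(draws) / trials

print(f"simulated mean = {sim_mean:.3f} (theory: n * p = {n * p:.1f})")
```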
The Poisson Distribution: The number of online customers that will visit your website over a particular time period, or the daily number of car crashes on a particular stretch of road
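For the website example, a Poisson model gives the probability of seeing `k` visitors in a time window when the average is `lam`. The rate of 3 visitors per minute below is an assumed number for illustration.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam = 3.0  # assumed average number of visitors per minute

# Probability of each small count in a given minute.
for k in range(7):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.4f}")

# Probability of an unusually busy minute: more than 5 visitors.
p_more_than_5 = 1 - sum(poisson_pmf(k, lam) for k in range(6))
print(f"P(X > 5) = {p_more_than_5:.4f}")
```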
The Exponential Distribution: The amount of time you can expect a compute node in a large-scale cluster to function before failure
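A sketch of the compute-node example, assuming node lifetimes are exponentially distributed with a mean of 400 days (a made-up figure). It computes the chance a node survives a year and checks it against simulated lifetimes drawn with `random.expovariate`.

```python
import math
import random

mean_lifetime_days = 400.0      # assumed mean time to failure
rate = 1.0 / mean_lifetime_days  # exponential rate parameter (lambda)


def survival(t, rate):
    """P(T > t) for T ~ Exponential(rate)."""
    return math.exp(-rate * t)


p_one_year = survival(365, rate)
print(f"P(node lasts more than 365 days) = {p_one_year:.3f}")

# Simulate many node lifetimes and compare with the formula.
rng = random.Random(3022)  # fixed seed so the run is repeatable
lifetimes = [rng.expovariate(rate) for _ in range(10_000)]
sim_mean = sum(lifetimes) / len(lifetimes)
sim_frac = sum(t > 365 for t in lifetimes) / len(lifetimes)

print(f"simulated mean lifetime   = {sim_mean:.1f} days")
print(f"simulated P(T > 365 days) = {sim_frac:.3f}")
```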
Hypothesis Testing and Inferential Statistics
What is the data really telling us, and how confident should we be in our conclusions?
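One way to make "how confident should we be" concrete is a permutation test, sketched below. It is only one of several possible tests and is not necessarily the one used in the course. The two groups of page-load times are invented for illustration.

```python
import random
import statistics

# Hypothetical page-load times in seconds for two site designs.
group_a = [1.9, 2.1, 2.4, 2.2, 2.0, 2.3, 2.5, 2.1]
group_b = [2.6, 2.8, 2.4, 2.9, 2.7, 2.5, 3.0, 2.6]

# Observed difference in means between the two groups.
observed = statistics.mean(group_b) - statistics.mean(group_a)

# If the group labels did not matter, shuffling them should often produce
# a difference at least as large as the observed one.
rng = random.Random(3022)  # fixed seed so the run is repeatable
pooled = group_a + group_b
n_a = len(group_a)
n_perm = 5000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = statistics.mean(pooled[n_a:]) - statistics.mean(pooled[:n_a])
    if abs(diff) >= abs(observed):
        count += 1

# Two-sided p-value: fraction of shuffles at least as extreme as observed.
p_value = count / n_perm
print(f"observed difference = {observed:.3f} s")
print(f"permutation p-value = {p_value:.4f}")
```

A small p-value says a difference this large would rarely arise from random labeling alone; it does not by itself say the difference matters in practice.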
Modeling and Classification
A gentle foray into Machine Learning.
o Linear Regression
o Multiple Linear Regression
o Logistic Regression
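As a taste of the first item in the list, here is a sketch of simple (one-predictor) linear regression fit by ordinary least squares using the closed-form formulas. The hours-versus-score data and the `predict` helper are hypothetical; multiple and logistic regression are not shown.

```python
# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 58, 61, 67, 73, 76]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Least-squares estimates: slope = S_xy / S_xx, intercept = y_bar - slope * x_bar.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
s_xx = sum((x - x_bar) ** 2 for x in hours)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def predict(x):
    """Predicted score for x hours of study under the fitted line."""
    return intercept + slope * x


print(f"fitted line: score = {intercept:.2f} + {slope:.2f} * hours")
print(f"predicted score for 7 hours: {predict(7):.1f}")
```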
The Plan
Goal: Fluency in the theoretical and computational aspects of data analysis
At the end of this course you’ll be able to
1. Clean, munge, and wrangle data in Python and perform Exploratory Data Analysis
2. Draw insight from data by computing and interpreting classic summary statistics
3. Know the ins-and-outs of probability and how to use it to solve real-world problems
4. Perform statistical tests to determine if your conclusions are real or due to chance
5. Construct and analyze simple models to make predictions and inferences about data
6. Tell compelling stories about data using modern visualization and presentation tools
Course Logistics
Keep track of course webpages (Piazza and GitHub)
o Piazza: https://piazza.com/colorado/fall2017/csci3022
§ Send me private messages on Piazza, rather than emails
§ Address message specifically to me if necessary
o GitHub: https://github.com/chrisketelsen/csci3022
§ In-class work will be posted here, as well as homework
§ Good idea to clone repo and do a pull every day before class
§ Good Git tutorial if you’re unfamiliar: http://rogerdudler.github.io/git-guide/
Course Logistics
Course Work:
o Homework assignments every two weeks (35%)
o Class Participation through tutorial problems and short Moodle Quizzes (5%)
o Midterm Exam (20%)
o Practicum (15%)
o Final Exam (25%)
§ Lowest homework score dropped
§ 3 total late days (1min – 23hr 59min late = 1 late day)
Course Logistics
Collaboration Policy:
o Data Science is a collaborative field. Discuss problems with classmates and instructors.
o But do your own work. Write solutions and CODE on your own.
o Give hints, not solutions, on Piazza.
o Make repositories containing your homework private (GitHub, Azure)
o More info about collaboration on syllabus
Course Reading
o Good book with useful examples and exercises
o Doesn’t force you to use R!
o Free PDF through CU library! (link on syllabus)
o Overly mathy sometimes
o Only responsible for what we cover in class
o Does things in slightly different order than us
Course Reading
o Supplemental Text on Data Analysis with Python
o Beware Python 2 vs Python 3 differences!
o Free PDF through publisher! (link on syllabus)
o Not mathy enough most of the time
o Won’t really refer to it in class.
o Use for extra Python help.
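For readers wondering what the Python 2 vs Python 3 warning means in practice, the sketch below shows a few well-known differences as they behave under Python 3. The comments describe what Python 2 did instead. This is a small sample, not a complete list.

```python
# Run under Python 3. Comments note how Python 2 behaved differently.

# Division: in Python 3, / is true division even for two integers.
# In Python 2, 7 / 2 gave 3 (integer division).
half = 7 / 2
floor_half = 7 // 2  # floor division, same result in both versions
print(half)
print(floor_half)

# print is a function in Python 3; in Python 2 it was a statement,
# so `print "hello"` (no parentheses) was valid there and is an error here.
print("hello")

# range() returns a lazy range object in Python 3; in Python 2 it built a list.
numbers = range(3)
print(list(numbers))
```

Code copied from an older Python 2 text may therefore need small edits (parentheses on `print`, care with `/`) before it runs in a Python 3 notebook.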
Computing
o We will use Python 3, and in particular NumPy and Pandas
o Lots of great data science libraries and decent plotting
o We’ll work exclusively in Jupyter Notebooks
o Jupyter is a ubiquitous data science collaboration and communication tool
o Easiest way to get both is Anaconda Python 3.6
o We strongly recommend you install a local copy
o If not, you can use Microsoft Azure Notebooks
o We will often work on problems in groups in class
o Bring a laptop or have a buddy with a laptop
About Me
o Starting 5th year as an instructor at CU (first 3 in APPM, last year in CS)
o Specialize in the Mathy courses (Discrete, Lin. Alg., Data Science, Machine Learning)
o Before CU, at Lawrence Livermore National Lab
o Before that, PhD in Applied Math at CU
o Before that, taught Philosophy at Washington State
o Research: Numerical Linear Algebra and Stochastic Simulation
o Please call me Chris or Dr. Ketelsen
o Office Hours: MW 2-3:30 in ECOT 731, or F 11-12pm by appointment
Let’s Go to Work!
Let’s start exploring some computing tools
Get out your laptop, or better yet, pair up with someone else with a laptop