ORIE 4741: Learning with Big Messy Data
Introduction
Professor Udell
Operations Research and Information Engineering, Cornell
September 8, 2020
Outline
Logistics
Stories
Definitions
Kinds of learning
Syllabus
ORIE 4741: Learning with Big Messy Data
want to take this class?
- ASAP:
  - enroll (or drop) (or get on wait list)
  - fill out course survey (provides access to Campuswire Q&A)
  - sign up for iClicker REEF
- Thursday 9/10/2020: homework 0
links on course website: https://people.orie.cornell.edu/mru8/orie4741/
Course staff
- Prof. Madeleine Udell
- TA: Chengrun Yang (ECE PhD)
- TA: Yuxuan Chen (Statistics PhD)
- TA: Anusha Avyukt (Statistics MPS)
- TA: Juliet Zhong (ORIE MEng + Undergraduate)
- TA: Allison Grimsted (ORIE Undergraduate)
- TA: Carrie Rucker (ORIE Undergraduate)
Tech stack
- Zoom for lectures
- Course website for course materials (syllabus, schedule, homework, project, etc.)
- iClicker REEF for polls
- Campuswire for Q&A and announcements
- Gradescope for quizzes, submitting homework, grades, solutions
- GitHub for code (demos, projects, and hw starter code)
Zoom contingencies
- If I get logged off (e.g., due to connectivity issues), your TAs will stay on and provide further instructions
- If the Zoom platform fails (e.g., Zoom-bombing or Zoom outage), look on Campuswire for further instructions
Course requirements and grading
course website (grading, course requirements, lectures, homework, etc.): https://people.orie.cornell.edu/mru8/orie4741/
- (15%) Participation: for every lecture (after this one), use
  - iClicker REEF for sync lectures
  - participation form for async lectures
- (30%) Homework
  - due every two weeks or so
  - first one due next Thursday
- (15%) Quizzes
  - 30 min quiz every week or so
- (40%) Project
FAQ:
- yes, you can take the class async in any timezone
- yes, you can take section async, or not take the section
Questions
during lecture:
- ask out loud
- Zoom chat to TA Carrie Rucker
outside of lecture:
- ask at office hours
- ask on Campuswire
- don’t send email
Oh, you work with big messy data? Maybe you could help us out...?
My career in big data
academic
- B.S. in Mathematics and Physics at Yale
- Ph.D. in Computational and Mathematical Engineering at Stanford
- postdoctoral fellow at the Center for the Mathematics of Information at Caltech
- professor in ORIE at Cornell
applied work
- finance: Goldman Sachs, BlackRock, Capital One, Schonfeld, Two Sigma, ...
- cybersecurity: DARPA, Expanse (formerly Qadium)
- healthcare: Apixio, Ontario
- clean energy: Aurora
- commerce: Retina.ai, Marketing Attribution
- politics: Obama 2012
Data table: politics
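| age | gender | state | income | education | voted? | support | ... |
|-----|--------|-------|--------|-----------|--------|---------|-----|
| 29 | F | CT | $53,000 | college | yes | Biden | ... |
| 57 | ? | NY | $19,000 | high school | yes | ? | ... |
| ? | M | CA | $102,000 | masters | no | Trump | ... |
| 41 | F | NV | $23,000 | ? | yes | Trump | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |

goals:
- detect demographic groups?
- find typical responses?
- identify related features?
- impute missing entries?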
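The last goal, imputing missing entries, can be sketched on the four visible rows of the table. This is a minimal illustration under simple assumptions, not a method taken from the slides: numeric gaps are filled with the column median and categorical gaps with the most common observed value. The helper name `impute_column` is hypothetical.

```python
from collections import Counter
from statistics import median

# The four visible rows of the politics table; None marks a missing entry ("?").
rows = [
    {"age": 29, "gender": "F", "state": "CT", "income": 53000,
     "education": "college", "voted": "yes", "support": "Biden"},
    {"age": 57, "gender": None, "state": "NY", "income": 19000,
     "education": "high school", "voted": "yes", "support": None},
    {"age": None, "gender": "M", "state": "CA", "income": 102000,
     "education": "masters", "voted": "no", "support": "Trump"},
    {"age": 41, "gender": "F", "state": "NV", "income": 23000,
     "education": None, "voted": "yes", "support": "Trump"},
]


def impute_column(rows, column):
    """Fill missing values in one column (hypothetical helper, simplest rule).

    Numeric columns get the median of the observed values;
    all other columns get the most frequent observed value.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = median(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    # Build new dicts so the original rows are left untouched.
    return [dict(r, **{column: fill}) if r[column] is None else r for r in rows]


imputed = rows
for col in rows[0]:
    imputed = impute_column(imputed, col)
```

With so few rows the fills are crude (the missing education level is a three-way tie, broken arbitrarily), which is a reminder that imputed values inherit whatever bias the fill rule has.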
How to get data for politics?
two major sources of data:
- census
- voter registration
data quality is critical!
- for hw0, you’ll respond to the census and register to vote, if eligible:
  - census eligibility: college students (including foreign) should be counted at the on-campus or off-campus residence where they sleep most of the time (even if you went home before Census Day last spring, and even if you’re a foreign citizen)
  - voter eligibility: US citizen 18+
Medicine
Data table: medicine
| age | gender | heart disease | statins? | ... |
|-----|--------|---------------|----------|-----|
| 29 | F | yes | no | ... |
| 57 | ? | no | no | ... |
| ? | M | no | no | ... |
| 41 | F | yes | yes | ... |
| ... | ... | ... | ... | ... |

- find similar patients?
- understand systemic healthcare needs?
- use symptoms to detect which patients have COVID-19?
- detect patients who had series of mini-strokes?
COVID projections: Cornell data
https://covid.cornell.edu/testing/dashboard/
Poll
(click “participants” and use Zoom reactions to respond to poll)
So far 0.2% of the Cornell population has had COVID-19. I think X% of the Cornell population will get COVID this semester, where X is
- (yes) < 0.5%
- (no) 0.5–1%
- (go slower) 1–5%
- (go faster) 5–10%
- (coffee) > 10%
Poll
(click “participants” and use Zoom reactions to respond to poll)
I think
- (yes) I will get COVID
- (no) I will not get COVID
- (coffee) I’ve already had COVID
Poll
(click “participants” and use Zoom reactions to respond to poll)
I think a vaccine will be widely available by
- (yes) November
- (no) January
- (slower) Spring 2021
- (faster) Summer 2021
- (coffee) later
- (down) never
Data for projections: simulation
- simulate model given parameter values
- learn parameter values to match data
https://github.com/ORIE4741/demos/blob/master/SIR.ipynb
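The two steps above can be sketched with a textbook SIR model. This is not the code from the linked notebook: the discrete-time update, the parameter names `beta` (infection rate) and `gamma` (recovery rate), the grid of candidate values, and the synthetic "observed" curve are all assumptions made for illustration.

```python
def simulate_sir(beta, gamma, s0=0.99, i0=0.01, steps=60):
    """Discrete-time SIR model on population fractions.

    Returns the infected fraction at each time step (including time 0).
    """
    s, i, r = s0, i0, 0.0
    infected = [i]
    for _ in range(steps):
        new_infections = beta * s * i
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries  # tracked for completeness; s + i + r stays at 1
        infected.append(i)
    return infected


def squared_error(a, b):
    """Sum of squared differences between two equal-length curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Step 1: simulate the model given parameter values.
# Here the simulation output doubles as a synthetic stand-in for real data.
true_beta, gamma = 0.30, 0.10
observed = simulate_sir(true_beta, gamma)

# Step 2: learn the parameter value that best matches the data,
# using a plain grid search over beta with gamma held fixed.
candidates = [k / 100 for k in range(5, 61)]  # 0.05, 0.06, ..., 0.60
best_beta = min(
    candidates,
    key=lambda b: squared_error(simulate_sir(b, gamma), observed),
)
```

Because the data here were generated by the same model, the search recovers the generating value of `beta`; with real case counts the fit would be approximate and sensitive to the modeling assumptions.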
Simulation: results
by varying parameters, we see
- with infrequent testing (weekly+), pandemic is nearly impossible to control
- people who go to big parties get COVID
  - if too many people go to big parties, Cornell shuts down
  - if a moderate fraction (e.g., 10%) of people go to big parties, classes shut down
  - if < 1% go to big parties, Cornell stays open
- if PPE is effective,
  - people who just go to classes are nearly always ok
  - people who go to small parties might get COVID
- if PPE is not effective,
  - people who just go to classes might get COVID
  - people who go to small parties have even odds of getting COVID
Simulation: assumptions
simulation is evocative, not realistic. assumes
- fluid limit: large population, no randomness
- only 3 groups that mix internally
- only 3 modes of contact: parties, classes, external (e.g., grocery shopping)
- no latency period
- actions don’t depend on infection rates
- quarantine is effective
- ...
Application areas
- health
- politics
- governance
- advertising
- retail
- ecommerce
- finance
- ...
Big
- NASA, 1997: “taxing the capacities of main memory, local disk, and even remote disk”
- OED, 2015: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges”
- 4 Vs: [figure: the 4 Vs of big data; image courtesy of Kim Minor @ IBM]
- 5th V: value
Big: our definition
Definition
An algorithm for big data is one with computational and memory requirements that scale linearly (or nearly linearly) in the size of the data.
why this definition? independent of
- hardware
- business
if you use only algorithms for big data, then you’re working with big data
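As one concrete instance of the definition, the mean and variance of a data stream can be computed in a single pass with constant memory, so the cost grows linearly with the number of values. The sketch below uses Welford's online algorithm, which is my choice of example rather than something named in the slides; by contrast, a method that compares every pair of examples would need work quadratic in the size of the data.

```python
def running_mean_and_variance(stream):
    """One pass over the data, constant memory (Welford's online algorithm).

    Returns the mean and the population variance of the values seen.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    variance = m2 / count if count else float("nan")
    return mean, variance


# Made-up values; an iterator stands in for data too large to hold in memory.
mean, variance = running_mean_and_variance(
    iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
)
```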
Messy
- noisy: some (or all) values suffer errors, inaccuracies, or malicious corruption
- missing: some values are missing, inconsistent, not recorded, or lost
- heterogeneous: values of many different types
  - continuous values (e.g., 4.2, π)
  - discrete values (e.g., 0, 4, 994)
  - nominal values (e.g., apple, banana, pear)
  - ordinal values (e.g., rarely, sometimes, often)
  - graphs or networks (e.g., person 1 is friends with person 2)
  - text (e.g., doctor’s note describing symptoms)
  - sets (e.g., items purchased)
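To make a few of the heterogeneous types above concrete, here is a minimal sketch of how nominal, ordinal, and set-valued entries might be turned into numbers before modeling. The encodings (one-hot, integer rank, multi-hot) are common conventions rather than anything prescribed by the slides, and the item catalog and function names are hypothetical.

```python
FRUITS = ["apple", "banana", "pear"]            # nominal: no order
FREQUENCIES = ["rarely", "sometimes", "often"]  # ordinal: ordered levels
CATALOG = ["milk", "eggs", "bread", "tea"]      # hypothetical universe of items


def encode_nominal(value, levels):
    """One-hot vector: 1 in the position of the observed level, 0 elsewhere."""
    return [1 if value == level else 0 for level in levels]


def encode_ordinal(value, levels):
    """Integer rank: keeps the order, but assumes equally spaced levels."""
    return levels.index(value)


def encode_set(items, universe):
    """Multi-hot vector: 1 for each item of the universe present in the set."""
    return [1 if item in items else 0 for item in universe]


# One example with a nominal, an ordinal, and a set-valued feature.
row = (
    encode_nominal("banana", FRUITS)
    + [encode_ordinal("often", FREQUENCIES)]
    + encode_set({"eggs", "tea"}, CATALOG)
)
```

Each encoding bakes in a modeling choice; for instance, the integer rank treats the gap between "rarely" and "sometimes" as equal to the gap between "sometimes" and "often".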
Learning
- machine learning?
- human learning?
- when data is big and messy, machine help is essential for human learning!
Data table
- $n$ examples (patients, respondents, households, assets)
- $d$ features (tests, questions, sensors, times)

$$
A = \begin{bmatrix}
a_{11} & \cdots & a_{1d} \\
\vdots & \ddots & \vdots \\
a_{n1} & \cdots & a_{nd}
\end{bmatrix}
$$

- $a_i$ is the $i$th row of $A$: feature vector for the $i$th example
- $a_{:j}$ is the $j$th column of $A$: values for the $j$th feature across all examples
- $a_{ij}$ is the $j$th feature of the $i$th example
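The notation maps directly onto indexing a table stored as a list of rows. The sketch below uses made-up values and hypothetical helper names (`row`, `column`, `entry`), with 1-based indices to match the slide.

```python
# A small data table with n = 3 examples (rows) and d = 4 features (columns).
# Values are made up for illustration.
A = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
n, d = len(A), len(A[0])


def row(A, i):
    """a_i: feature vector for the i-th example (1-indexed, as on the slide)."""
    return A[i - 1]


def column(A, j):
    """a_{:j}: values of the j-th feature across all examples (1-indexed)."""
    return [example[j - 1] for example in A]


def entry(A, i, j):
    """a_{ij}: j-th feature of the i-th example (1-indexed)."""
    return A[i - 1][j - 1]
```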
Supervised learning
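- identify one column of data that we want to predict

$$
A = \begin{bmatrix}
x_{11} & \cdots & x_{1,d-1} & y_1 \\
\vdots & \ddots & \vdots & \vdots \\
x_{n1} & \cdots & x_{n,d-1} & y_n
\end{bmatrix}
= \begin{bmatrix} X & y \end{bmatrix}
$$

- $x_i \in \mathcal{X}$ for $i = 1, \ldots, n$ are rows of $X$
- $y_i \in \mathcal{Y}$ for $i = 1, \ldots, n$ are entries of $y$
- we believe there is a mapping $f : \mathcal{X} \to \mathcal{Y}$ with $y_i \approx f(x_i)$
- our goal is to learn $f$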
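The setup can be sketched end to end on a tiny made-up table: split off the last column as $y$, then build some predictor $f$ from the pairs $(x_i, y_i)$. The nearest-neighbor rule used here is only a stand-in chosen for brevity, not the method the slides have in mind, and the function name is hypothetical.

```python
# Made-up data table: the last column is the one we want to predict.
A = [
    [1.0, 1.0, 0],
    [1.5, 2.0, 0],
    [5.0, 5.0, 1],
    [6.0, 4.5, 1],
]

# Split A = [X y]: rows of X are the inputs x_i, entries of y are the outputs y_i.
X = [example[:-1] for example in A]
y = [example[-1] for example in A]


def learn_nearest_neighbor(X, y):
    """Return a predictor f with f(x) = y_i for the training x_i closest to x."""
    def f(x):
        distances = [sum((a - b) ** 2 for a, b in zip(x, x_i)) for x_i in X]
        return y[distances.index(min(distances))]
    return f


f = learn_nearest_neighbor(X, y)
prediction = f([5.5, 5.0])  # a new input that was not in the table
```

This predictor reproduces every training label exactly, which says nothing yet about how well it does on new inputs.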
Example: supervised learning for credit decisioning
- goal: decide which credit card applicants should be approved
- input space: $\mathcal{X} = \mathbb{R}^d$; entries of $x \in \mathcal{X}$ correspond to fields in the credit application
  - e.g., salary, years in residence, outstanding debt, number of credit lines, ...
- output space: $\mathcal{Y} = \{+1, -1\}$
  - $+1$ means approve
  - $-1$ means reject
- data: $\mathcal{D} = (x_1, y_1), \ldots, (x_n, y_n)$: applications of previous customers, and credit approval decisions made by humans
Q: what are potential problems with using a model built with this data?
A: wrong objective: human decision may not be correct decision; covariate shift: future data may look unlike past data; ...
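A toy version of this setup, assuming a linear decision rule trained with the perceptron algorithm (my choice for illustration; the slides do not name a method). The applications and the past decisions below are invented, and only two of the example fields are used.

```python
# Made-up applications: (salary, outstanding debt), both in units of $10,000.
X = [[6, 1], [8, 2], [5, 0.5], [9, 1], [3, 3], [2, 2.5], [4, 4], [1, 1.5]]
# Made-up past human decisions: +1 = approve, -1 = reject.
y = [+1, +1, +1, +1, -1, -1, -1, -1]


def predict(w, b, x):
    """Linear decision rule: approve (+1) if w.x + b > 0, else reject (-1)."""
    score = sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
    return 1 if score > 0 else -1


def train_perceptron(X, y, max_epochs=1000):
    """Perceptron: nudge (w, b) toward each misclassified example until none remain."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if predict(w, b, x_i) != y_i:
                w = [w_k + y_i * x_k for w_k, x_k in zip(w, x_i)]
                b += y_i
                mistakes += 1
        if mistakes == 0:  # every past decision is reproduced
            break
    return w, b


w, b = train_perceptron(X, y)
decisions = [predict(w, b, x_i) for x_i in X]
```

The learned rule reproduces the human decisions it was trained on, which is exactly the concern raised in the answer above: if those decisions were wrong, the model learns to repeat them.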
Exercise: formalizing real problems
- identify a prediction goal
- identify the input space $\mathcal{X}$
- identify the output space $\mathcal{Y}$
- identify the data $\mathcal{D} = (x_1, y_1), \ldots, (x_n, y_n)$ you’d like to use
- what kinds of noise do you expect in the data?
Course objectives (I)
- plot
- predict
- cluster
- impute
- denoise
- recommend
- understand
Course objectives (II)
this course is about
- algorithms for big messy data
- learning to ask the right questions
at the end of the course, you should have learned
- at least one method to solve any problem
- machine learning is not magic; it’s math
- when not to trust your solution
the rest you can learn online...
Next steps
- ASAP:
  - enroll (or drop) (or get on wait list)
  - fill out course survey (provides access to Campuswire Q&A)
  - sign up for iClicker REEF
- Thursday 9/10/2020 9:30am: homework 0
links on course website: https://people.orie.cornell.edu/mru8/orie4741/
Questions?
https://docs.google.com/spreadsheets/d/1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk