big messy data: looking for the signal in noisy and biased ...€¦ · big messy data: looking for...

43
Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou CRA-W Undergraduate Town Hall Oct 4 th , 2018

Upload: others

Post on 25-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Big Messy Data: Looking for the Signal in Noisy and Biased Data

Alexandra Meliou

CRA-W Undergraduate Town Hall Oct 4th, 2018

Page 2: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Speaker & Moderator

Dr. Lori Pollock is a Professor in Computer and Information Sciences at University of Delaware. Her current research focuses on program analysis for building better software maintenance tools, software testing, energy-efficient software and computer science education. Dr. Pollock is an ACM Distinguished Scientist and was awarded the University of Delaware’s Excellence in Teaching Award and the E.A. Trabant Award for Women’s Equity.

Lori Pollock

Alexandra Meliou is an Assistant Professor in the College of Information and Computer Science, at the University of Massachusetts, Amherst.  Prior to that, she was a Post-Doctoral Research Associate at the University of Washington.  Alexandra received her PhD degree from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley.  She has received recognitions for research and teaching, including a CACM Research Highlight, an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Lilly Fellowship for Teaching Excellence.  Her research focuses on data provenance, causality, explanations, data quality, and algorithmic fairness.

Alexandra Meliou

Page 3: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

INSIDE THE GLISTENING RED CAVE of the patient’s abdomen, sur-geon Michael Stifelman carefully guides two robotic arms to tie knots in a piece of thread. He manipulates a third arm to drive a suturing needle through the fleshy mass of the patient’s kidney, stitching together the hole where a tumor used to be. The final arm holds the endoscope that streams visuals to Stifelman’s display screens. Each arm enters the body through a tiny incision about 5 millimeters wide. To watch this tricky procedure is to marvel at what can be achieved when robot and human work in tandem. Stifelman, who has done several thousand robot-assisted surgeries as director of NYU Langone’s Robotic Surgery Center, controls the robotic arms from a console. If he swivels his wrist and pinches his fingers closed, the instruments inside the patient’s body per-form the same exact motions on a much smaller scale. “The robot is one with me,” Stifelman says as his mechanized appendages pull tight another knot.

PRECISE AND DEXTEROUS

SURGICAL ROBOTS MAY

OUTDO HUMAN SURGEONS

By ELIZA STRICKLAND

Trusting Robots

32 | jun 2016 | north AmericAn | SPectrum.ieee.orG ILLUSTRATION BY Carl De Torres

06.SurgicalRobots.NA [P].indd 32 5/13/16 4:47 PM

Page 4: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education
Page 5: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

“What if data is biased?”

“What if data is dirty?”

Page 6: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education
Page 7: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education
Page 8: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

dataquality

Page 9: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education
Page 10: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Data cleaning: Automated, semi-automated, manual, crowd-sourced…

Page 11: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

but why is data dirty in the first place?

Page 12: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Structured information retrieved from unstructured web data

Page 13: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

TXT TBL DOM ANO

Web Sources

Extractor

Fusion

Extractor Extractor… …

Extraction System

prKBprKB

Example: Knowledge Vault [Dong14]3.0 billion extracted triples

More than 70% are wrong

a look at Knowledge Bases

Page 14: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

prKB

TXT TBL DOM ANO

Web Sources

Extractor

Fusion

Extractor Extractor… …

Extraction System

Traditional method:1. Identify errors2. Drop them

“Perfect” KB

traditional data cleaning

Page 15: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

prKB

but the errors are deeper…

TXT TBL DOM ANO

Web Sources

Extractor

Fusion

Extractor Extractor… …

Extraction System

Errors are systemic

Faulty information

Bad extraction

rules Rules don’t match the

source

Page 16: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

prKBprKB prKBprKB prKB

…continue producing bad data

TXT TBL DOM ANO

Web Sources

Extractor

Fusion

Extractor Extractor… …

Extraction System

Errors are systemic

Page 17: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

CHALLENGES

• Massive scaleSampling loses statistical strength and misses a lot of mistakes

• System complexityComplex processes, thousands extraction patterns

• High error rates

Page 18: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

DATA X-RAY

Example: Default value error(besoccer.com, date_of_birth, 1986-02-18)# Triples 630Error Rate 100%Context: Date of birth of athletes extracted from besoccer.com is set to default value 1986-02-18, due to copied html segments

Data X-Ray Works on simple meta-dataParallelizable in MapReduce

Page 19: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

“What if data is biased?”

“What if data is dirty?”

Page 20: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Software can make bad decisions. Software can discriminate!

Page 21: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

algorithms can exacerbate societal biases

Page 22: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education
Page 23: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education
Page 24: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

algorithms don’t provide the same service to all

automatic captions

Page 25: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Rachael Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions" in 2017 Workshop on Ethics in Natural Language Processing

algorithms don’t provide the same service to all

automatic captions

Page 26: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

algorithms don’t provide the same service to all

Joy Buolamwini https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms

Page 27: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

algorithms don’t provide the same service to all

Joy Buolamwini https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms

Page 28: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

algorithms don’t provide the same service to all

Joy Buolamwini https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms

Page 29: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Research in Algorithmic Bias

How to detect it How to measure it How to repair the algorithms How to repair the data

Page 30: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

How to Fail! A Guide to Anticipating and Overcoming Failures

Page 31: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

our successes are public

our failures are not

Page 32: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

what about schools I didn’t get into?

what about jobs I didn’t get?

what about awards and scholarships I got rejected for?

what about my rejected grants?

what about all my paper rejections?

Page 33: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Let’s demystify failure!

Page 34: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Let’s demystify failure!

some lessons learned from my personal experiences

Page 35: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

an exam failure

Page 36: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

an exam failure

Lesson: helped me shape a stronger background that I am

now proud of

Page 37: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

research failure

Page 38: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

research failure

Lesson: the slow down was temporary, and the core of what I learned guides my current work

Page 39: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

failure to network

Page 40: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

failure to network

Lesson: seek collaborators and mentors, don’t hesitate to ask and

share

Page 41: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

job search failure

Page 42: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

job search failure

Lesson: the slow down was temporary, and it allowed me to

pursue better opportunities

Page 43: Big Messy Data: Looking for the Signal in Noisy and Biased ...€¦ · Big Messy Data: Looking for the Signal in Noisy and Biased Data Alexandra Meliou ... and computer science education

Good practices in handling failures

Do not isolate yourself Seek mentors (also aside from failures!) Think about what pieces of a failed attempt are reusable Do not let it reflect on your abilities and prospects

easier said than done, but it is true! Confront the reasons

Remember: Even the most accomplished have failed numerous times