recent riselab developments in uc berkeley ds-100jegonzal/assets/slides/...recent developments in...

35
Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS [email protected] rise lab UC Berkeley DS-100 Principles and Techniques of Data Science &

Upload: others

Post on 05-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

RecentDevelopments in

Data Science @ Berkeley

Joseph E. GonzalezAsst. Professor in EECS

[email protected]

riselabUC Berkeley

DS-100Principles and Techniques of Data Science

&

Page 2: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Joseph E. GonzalezAsst. Professor in EECS

[email protected]

ds100.org

DS-100Principles and Techniques of Data Science

RecentDevelopments in

Data Science @ Berkeley

Page 3: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Course DevelopmentStarted active course development in Spring 2016

Ø Reference: CS 194-16 Introduction to Data ScienceØ Initially: focused more on toolsØ Eventually: fairly advanced many techniques and topics covered

Big DecisionsØ Intermediate level Ø Narrow scopeØ Minimize pre-req. chainØ Strong stats. perspective

DS-100

Principles and Techniques of Data Science

8Data UpperDiv.

DS-100

ds100.org

Page 4: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Our GoalsPrepare students for advanced Berkeley courses in data-management (CS186), machine learning (CS189), and statistics (Stat-154), by providing the necessary foundationand context

Enable students to start careers as data scientists by providing experience in working with real data, tools, and techniques

Empower students to apply computational and inferentialthinking to tackle real-world problems

Page 5: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

DS100 Created and Taught by Faculty and TAsWith Diverse Background & Perspectives

Joseph Gonzalez

JosephHellerstein

DeborahNolan

BinYu

AndrewDo

HenryMilner

SamLau

JosephSimonian

Page 6: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Data Science RequiresMany Skills Domain

Expertise

DataScience

Can’t cover everything in DS100.

Instead we coverØKey Concepts

Ø … some detailsØConnectionsØHow to learn …

Page 7: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Big Concepts in Data ScienceØ Data preparation and representation

Ø Efficient and scalable data processing

Ø Question formulation and experimental design

Ø Exploratory data analysis and visualization

Ø Modeling fitting and inference

Ø Machine learning techniques and overfitting

Ø Validation and hypothesis testing

Page 8: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Data Science Lifecycle

High-level description of the data science workflow

Ø Frame questions & design experiments

Ø Obtain and clean data

Ø Summarize and visualize data

Ø Inference and prediction

continuous process …

AskQuestion

ObtainData

UnderstandData

UnderstandWorld

Reports & Data-Products

Page 9: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Using Real ToolsØ Focus on Python programming language

Ø We will use various different technologiesØ Jupyter notebooks, pandas, numpy, matplotlib, SQL Server,

github, Wrangler, plotly, tableau, Spark?, …

Ø We won’t teach students everything …Ø Students will learn to read documentationØ Students will learn to teach themselves

Ø BETA WARNING: things will break …Ø Students will learn how to debugØ Students will learn how to get help (Piazza)

Page 10: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Syllabus

Week 1 Week 2 Week 3 Week 4 Week 5 Week 6

Week 7 Week 8 Week 9 Week 10 Week 11 Week 12

Week 13 Week 14 Week 15 Week 16

Review & Midterm

SpringBreak

RRR

Big Concepts in Data Science Working with Real Data and SQL

Basic StatisticalModeling

Model Estimation

ML Techniques

ML Techniques

Big Data Analytics

HW1

HW2

HW3

HW4

HW4

HW5

HW6

HW6

HW6

http://www.ds100.org/sp17/syllabus

DS-100

Page 11: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

The Data Science Major …We are working on it!

GoalsØ InterdisciplinaryØPersonalizedØ Technically deepØContextualizedØPragmatic

Page 12: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

RecentDevelopments in Data Science @ Berkeley

Joseph E. GonzalezAsst. Professor in EECS

[email protected]

Principles and Techniques of Data Science

ds100.org

DS-100&

Page 13: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Principles and Techniques of Data Science

DS-100&

RecentDevelopments in Data Science @ Berkeley

Joseph E. GonzalezAsst. Professor in EECS

[email protected]

Research …

Page 14: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Berkeley’s lab tradition

• 5-6 Projects addressing big problem

• Bringing faculty from different areas (in CS)14

Page 15: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Berkeley’s lab tradition

• Working for 5-6 years on a new major problem

• Bringing faculty from different areas15

Page 16: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Page 17: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Splash CoCoA

Page 18: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

From batch data to advanced analytics

AMP Lab

18

From live data to real-time decisionsRISE Lab

Page 19: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Learning

?

Page 20: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Learning

ConferencePapers

Page 21: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Learning ConferencePapers

Dashboards and Reports

Page 22: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Learning ConferencePapers

Dashboards and

Reports

Drive Actions

Page 23: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Learning Drive Actions

Page 24: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

why is challenging?Need to render low latency (< 10ms) predictions for complex

under heavy load with system failures.

Models Queries

Top

K

FeaturesSELECT * FROMusers JOIN items,click_logs, pagesWHERE …

Inference

Page 25: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Robust is criticalSelf “Parking” Cars Self “Driving” Cars Chat AIs

Inference

Page 26: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Application

Decision

Query

Learning Inference

Feedback

Page 27: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Training

Application

Decision

Learning Inference

Feedback

Timescale: hours to weeksOften re-run training Another area of focus in RISE

Page 28: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Why is challenging?Closing the Loop

Self ReinforcingFeedback Loops

Implicit and DelayedFeedback

d

dtWorld Changesat varying rates

Page 29: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

BigData

Big Model

Training

Application

Decision

Query

Learning Inference

Feedback

Responsive(~10ms)

Adaptive(~1 seconds)

Page 30: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Learning InferenceResponsive

(~10ms)Adaptive

(~1 seconds)

Secure

Page 31: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

AR/VR Systems Home Monitoring Voice Technologies Medical Imaging

Protect the data, the model, and the query

Intelligence in Sensitive Contexts

Page 32: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

Data

High-Value Data is Sensitive• Medical Info.• Home video• Finance

Models capture value in data • Core Asset • Sensitive

Queries can be as sensitive as the data

Protect the data, the model, and the query

Page 33: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

We are developing new technologies that will enables applications to make low-latency intelligent decision on live data with strong security guarantees.

Page 34: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

A few early projects …

Opaque: A Data Analytics Platform with Strong Security

AbstractAn increasing number of organizations are moving

their analytics workloads into the cloud to take advan-tage of the elasticity and cost savings. However, the riskof data breaches is hampering this trend. While hard-ware enclaves promise a much needed solution to dataconfidentiality and secure execution of arbitrary compu-tation, they still su↵er from a significant leakage vector:access pattern leakage. We propose Opaque, a distributeddata analytics platform supporting a wide range of querieswhile providing strong security guarantees for both accesspatterns and integrity protection.

To achieve this, Opaque introduces new distributedoblivious relational operators that hide access patterns,and novel query planning techniques to optimize for theseoperators. Opaque is implemented in Apache Spark usingthe Catalyst optimizer with minimal changes. While thestrong security in Opaque comes at a cost, with queriesbeing between 12-200x slower than their insecure coun-terpart, Opaque provides an improvement of three ordersof magnitude over state-of-the-art oblivious protocols.

1 IntroductionCloud-based big data platforms collect and analyze vastamounts of sensitive data such as user information (emails,social interactions, shopping history), medical data, andfinancial data. These systems extract value out of this datathrough advanced SQL [7], machine learning [28, 17], orgraph analytics [16] queries. However, these information-rich systems are also valuable targets for attacks [18, 35].

Ideally, we want to both protect data confidentialityand maintain its value by supporting the existing richstack of analytics tools. Recent developments on securehardware enclaves (such as Intel SGX [27] and AMDMemory Encryption [20]) promise support for arbitrarycomputation at processor speeds while protecting the data.Previous work such as Haven [9] and VC3 [37] haveshown that it is possible to run unmodified binaries andMapReduce jobs using enclaves.

Unfortunately, enclaves still su↵er from an importantattack vector: access pattern leakage [41, 31]. Theseattacks are of two kinds: at the memory level and at thenetwork level. Regarding the memory level, while a com-promised OS is unable to decrypt the data, it can inferinformation about the data by monitoring application pageaccesses. Previous work [41] has shown that an attackercan extract hundreds of kilobytes of data from confidentialdocuments during a single run of a spellcheck application,

Spark ExecutionCatalyst

o-filter

query optimization

SQL ML GraphOpaque

o-groupby o-join

Figure 1: Opaque e�ciently executes a wide range of distributeddata analytics tasks by introducing SGX enabled oblivious rela-tional operators that mask data access patterns and new queryoptimization techniques to reduce performance overhead.

as well as discernible outlines of jpeg images from animage processing application running inside the enclave.Regarding the network level, such access pattern leakageis also dangerous in the distributed data processing set-ting, as standard tasks (e.g., sorting or hash-partitioning)produce network tra�c patterns that reveal informationabout the data (e.g., key skew). Even if the messagessent over the network are encrypted, Ohrimenko et al [31]showed that an attacker who only observes a MapReducequery’s network tra�c patterns (the source and destina-tion of each message but not its content) can identify theage group, marital status, and place of birth for specificrows in a census database. Therefore, to truly securethe data, we need to make the computation oblivious, i.e.computation should not leak any access pattern.

In this paper, we introduce Opaque1, a distributed andoblivious data analytics platform. Utilizing Intel SGXhardware enclaves, Opaque provides strong security guar-antees: obliviousness and computation integrity.

One key question when implementing the obliviousfunctionality is: at what layer in the software stack shouldwe implement it? Implementing it at the application layerwill likely result in application-specific solutions that arenot widely applicable. Implementing it at the executionlayer, while very general, provides us little semanticsabout application beyond the execution graph, which sig-nificantly reduces our ability to optimize the implemen-tation. Thus, neither of these two natural approachesappears satisfactory.

Fortunately, recent developments and trends in big dataprocessing frameworks provide us with a compelling op-portunity: the query optimization layer. Previous workhas shown that the relational model can express a widevariety of big data workloads, including complex graph

1The name “Opaque” stands for Oblivious Platform for AnalyticQUEries, as well as opacity, meaning no sensitive information is visible.

Ray

Page 35: Recent riselab Developments in UC Berkeley DS-100jegonzal/assets/slides/...Recent Developments in Data Science @ Berkeley Joseph E. Gonzalez Asst. Professor in EECS jegonzal@berkeley.edu

How can I get involved in Research for RISE (or anywhere on campus)

1. Learn about the ongoing projects:Ø https://rise.cs.berkeley.eduØ https://github.com/ucbrise

2. Email faculty and grad students to see if they have openings or are looking for helpØ Have an idea for the project!

3. Try to attend seminars hosted by the labØ We will start posting these soon!

Joseph E. Gonzalez ([email protected])