exploring the future of big data at mit · mit big data initiative at csail project: big data...

MIT Big Data Initiative at CSAIL

Project: Big Data Living Lab
Exploring the Future of Big Data at MIT
April 2014

Proposal & Request for Data

Introduction

We now have the ability to collect and acquire digital information at an unprecedented rate across practically all aspects of our lives, including healthcare, financial transactions, social interactions, education, energy usage, transportation, environmental monitoring and so on. "Big Data" is about harnessing all of this digital information by combining and analyzing it in completely new ways to make better predictions and, ultimately, better decisions. Over the next decade Big Data has the potential to profoundly change the way we live, work and play.

Big Data also introduces unique challenges when it comes to managing and protecting personal privacy (MIT co-hosted a Workshop on Big Data Privacy with the White House in March 2014 to discuss these issues). Big Data privacy issues are complex, introducing a host of ethical, legal, policy and technical questions. How do we build on Big Data’s potential for good, while maintaining essential privacy protections? And how do we design future technologies, policies, and practices to get that balance right for society?

MIT is well positioned to take a leadership role in demonstrating not only how organizations can leverage data in the future, but also how we collect, manage, and use personal information, from setting appropriate policies to demonstrating systems that can implement them in practice. In terms of integrating, analyzing, and sharing data, MIT faces similar challenges to many organizations across different sectors, whether in industry or government. A Big Data testbed at MIT will allow us to demonstrate how data can be used to better understand and improve our community; collectively explore ways to address technical and privacy challenges; and demonstrate new approaches and solutions emerging from the research community.
MIT Big Data Living Lab Project

We propose building a testbed at MIT that demonstrates a unified interface to data for the MIT community and enables the community to demonstrate the value of big data across a variety of applications, creating a "Living Lab" at MIT (see Appendix I, Figures 1 and 2). The MIT Living Lab data platform will provide the ability to access, share and use data about MIT, for MIT, allowing MIT itself to be more data-driven. The MIT Big Data Living Lab will allow the MIT campus to serve as a microcosm for many Big Data efforts (whether in government, industry, other academic institutions or non-profits) and would enable us to:

• Lead by example, employing organizational best practices for collecting and managing personal information;
• Explore technical issues and test prototypes for large-scale access control, data integration, data governance, analytics, and visualization;
• Build and demonstrate privacy technology and privacy policy;
• Demonstrate the impact and benefits of big data with a plethora of new applications;
• Explore social implications of big data, e.g. understanding people's reaction to and use of this kind of data collection;
• Enable personal innovation and ownership, by providing members of the MIT community appropriate access to their own personal data;
• Demonstrate a system that will provide useful services to the MIT community, architected such that the suite of services is extensible.

If the project is successful, it will result in a number of visualizations and analyses of MIT life that we hope will provide value to the MIT community. Examples of the types of questions we might investigate:

Social Patterns:

• How much do people in different departments, labs and organizations co-mingle?
• What are the informal social relationships between different groups on campus?
• Which parts of the campus are under/over utilized?

Health and Wellness:

• How can we better promote wellness?
• Is exercise correlated with other aspects of student/employee performance?
• What are the patterns of use of MIT's athletic facilities?

Transportation:

• What are the patterns in how people get to/from campus, when, and via what routes?
• Are there opportunities for carpooling, improving transportation services, or offering new types of services?

Collaboration:

• Which groups at MIT have expertise in a particular technical area, or have worked on a particular research topic?
• What are the cross-departmental collaborations that are occurring, and are there ways we can strengthen such ties?

A key challenge for the MIT Living Lab is opening up repositories of information on campus that contain the data needed to answer these and many other questions. As part of the MIT Big Data Living Lab project, we will work with MIT administration, MIT IS&T and various other MIT departments across campus in a number of different capacities, including 1) as customers and 2) as data owners or data curators.

As "customers" we expect to collaborate with different people and groups at MIT who have a desire to explore data and gain new insights. For example, we have already started conversations with the Community Wellness Program at MIT Medical (Maryanne Kirkbride), MIT DAPER (Tim Mertz), and the MIT Office of Sustainability (Julie Newman) on what would be valuable output. We have also been working closely with IS&T on identifying repositories of data on campus, mapping out the different data curators, and determining how we will access the data (Stephen Buckley, Thomas Hardjono, Myra Hope Eskridge).

As data "owners" or "curators" we expect to work with different people and groups at MIT on requesting and accessing data. Our goal is to develop a unified set of interfaces to data on and about campus, while respecting the privacy and security of individuals recorded in those data sets. The Project aims to bring together many kinds of data from many different sources, including:

1) MIT data: Data generated and collected by MIT as an organization, for example, campus maps; card swipes; building maps; WiFi access points and video/CCTV data.
2) Personal Data: Data collected and stored by individual members of the MIT community, about themselves, which they control and then choose to share with certain groups for certain purposes, e.g., location (GPS) and "quantified self" metrics for activity monitoring (# of steps taken) and tracking behavior. The Big Data Living Lab team is now beta-testing a new app that will allow people at MIT to collect and store data with their smart phones, using the Open Personal Data Store (OpenPDS) architecture (Pentland, Media Lab) and the DataHub system (Madden, CSAIL).
3) External or Public Data: Data from other sources, for example, social media data (Twitter), weather data, events data, local city data.

Aggregating this diversity of data will allow us to derive patterns from disparate data types. Analyzing aggregate data, even anonymized, can reveal valuable insights about trends and patterns on campus.

In the initial phase of the Big Data Living Lab project, we will focus on two application areas to demonstrate proof-of-concept: “MIT Moves” (where do people spend time on campus, and what are the movement patterns?) and MIT Wellness or "MIT Quantified-Self" (what are the patterns in people’s activity levels, sleep patterns, eating habits, etc., by time or by subgroup on campus?). See Appendix I for further descriptions of these specific applications.

Request for MIT Data

For the Living Lab projects, we would like to request access to data from different groups within MIT, including the following:

For Phase One:

| Data Type | Description | MIT Data Owner | MIT/Relevant Policies | Purpose |
|---|---|---|---|---|
| 1) MIT Campus Maps | Campus maps including public spaces inside buildings | Facilities Information Systems, Department of Facilities | | for mapping and cross-referencing access points across campus |
| 2) Wireless access points | # of unique identifiers at each wireless access point | Operations and Infrastructure, IS&T | http://ist.mit.edu/network/rules (MIT only stores data for 30 days) | for understanding aggregate patterns of movement across campus |
| 3) Card swipes | card swipe data for all campus buildings and parking lots | Security and Emergency Management Office, Department of Facilities | http://web.mit.edu/semo/security/policies.html | for understanding aggregate patterns of movement across campus |
| 4) Campus CCTV Video | video from cameras on campus | Security and Emergency Management Office, Department of Facilities | http://web.mit.edu/semo/security/policies.html | for understanding aggregate patterns of movement across campus; raw video footage will be processed by CSAIL researchers (Fisher et al) and only the processed data (not actual video footage) will be shared |
| 6) Community Wellness at MIT Medical/GetFit Program/DAPER | fitness data tracked by participants | MIT Medical | work with existing programs as an "opt-in" | for better understanding staff, faculty, student, and MIT community exercise patterns and "wellness" |
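As a rough illustration of the kind of aggregate the wireless access point request describes (a count of unique identifiers per access point), the sketch below buckets observations by hour and drops very small groups. The record layout, the function name `unique_devices_per_ap`, the hourly bucketing and the `min_count` suppression threshold are all assumptions made for illustration; the proposal does not specify any of them.

```python
from collections import defaultdict
from datetime import datetime


def unique_devices_per_ap(records, min_count=5):
    """Count distinct device identifiers per (access point, hour).

    `records` is an iterable of (timestamp, access_point, device_id) tuples;
    this layout is hypothetical. Buckets with fewer than `min_count` distinct
    devices are dropped so that very small groups are not reported.
    """
    seen = defaultdict(set)
    for timestamp, access_point, device_id in records:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        seen[(access_point, hour)].add(device_id)
    return {
        bucket: len(devices)
        for bucket, devices in seen.items()
        if len(devices) >= min_count
    }


# Made-up sample observations, for illustration only.
sample = [
    (datetime(2014, 4, 1, 9, 5), "ap-lobby", "dev-a"),
    (datetime(2014, 4, 1, 9, 20), "ap-lobby", "dev-b"),
    (datetime(2014, 4, 1, 9, 40), "ap-lobby", "dev-a"),  # repeat visit, counted once
    (datetime(2014, 4, 1, 9, 50), "ap-lab", "dev-c"),
]
counts = unique_devices_per_ap(sample, min_count=2)
```

Only the aggregated counts leave the function; the individual identifiers stay inside it, which is one simple way to keep the shared output at the aggregate level the table describes.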

For Phase Two:

Provided the initial demonstrations are successful, we would like to access additional data. In some cases this will mean opening up access to data for individuals, and allowing them the option to share data with certain Living Labs applications. In some cases we will be looking to access anonymized data that will be studied in aggregate to explore correlations and patterns. Additional MIT data sources we would like to include:

• TechCash
• Registrar's Office
• Travel Operations
• Payroll/Financial Operations
• MIT Medical/Healthcare
• Infrastructure data for campus (roads, electricity, location of MIT buses and vehicles, etc.)

Request for Process

For the MIT Big Data Living Lab Project we want to establish an effective process for requesting, receiving and managing MIT data, and we want to ensure that data is shared and managed according to an agreed-upon set of Best Practices for Personal Information and Privacy (see the proposed list of principles for MIT adopting a Personal Data Bill of Rights in Appendix II). We suggest MIT establish an internal "Data Use Oversight and Review Panel" to approve Living Lab applications and studies. This would operate as a small group of faculty, administration officials and/or experts who can ensure safeguards while innovating practices during the initial Living Lab test phase.

As part of our commitment to responsible and proactive treatment of the data, we will request that the MIT data owners or "curators", with the assistance of the Living Lab team as needed, provide the data as required by the policies which govern it. For example, this treatment may include preconditioning the data to an agreed level of abstraction or other practices deemed appropriate under the circumstances.

To govern the oversight process, we propose establishing an "internal" Data Use Agreement (iDUA) to ensure that relevant expectations about data access and use are agreed. The following is an example framework for this iDUA:

Overview

• Project Purpose
• MIT Data Owner/Data Curator

Data Requested

• Description of data request [data schema, data types, time period, etc.]
• Will the data include Personally Identifiable Information (PII)?
• What are the privacy concerns, sensitivities or particular considerations?

Stewardship

• How will PII be managed?
• What anonymization (if required) will be done and who will be responsible for anonymizing the data (e.g., the MIT Data Owner or Living Lab Project)?
• Description of anonymization method and/or other agreed preconditioning
• Persons who will have access to the data
• Data delivery, storage method and protocols
• MIT requirements/policies governing lifecycle of data, e.g., limits on storage

Proposed Use

• What data will be shared publicly and how will it be shared?
• What output products will be generated from the data and how will these be shared publicly?
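One way to make the iDUA framework above easier to review consistently is to capture it as a structured record. The sketch below is only an illustration under stated assumptions: the class name `InternalDataUseAgreement`, its field names, and the completeness rules in `missing_items` are hypothetical and are not defined anywhere in this proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InternalDataUseAgreement:
    """Hypothetical record mirroring the iDUA headings listed above."""

    # Overview
    project_purpose: str
    data_owner: str
    # Data Requested
    data_description: str
    includes_pii: bool
    privacy_concerns: str = ""
    # Stewardship
    anonymization_method: str = ""
    anonymized_by: str = ""
    authorized_persons: List[str] = field(default_factory=list)
    storage_protocol: str = ""
    retention_limit_days: Optional[int] = None
    # Proposed Use
    public_outputs: List[str] = field(default_factory=list)

    def missing_items(self):
        """Return names of items still unanswered (illustrative rules only)."""
        missing = []
        if self.includes_pii and not self.anonymization_method:
            missing.append("anonymization_method")
        if self.includes_pii and not self.anonymized_by:
            missing.append("anonymized_by")
        if not self.authorized_persons:
            missing.append("authorized_persons")
        return missing


# Example draft using values taken from the Phase One data request.
draft = InternalDataUseAgreement(
    project_purpose="MIT Moves: aggregate movement patterns",
    data_owner="Operations and Infrastructure, IS&T",
    data_description="Unique identifiers per wireless access point",
    includes_pii=True,
)
```

A review panel could use something like `draft.missing_items()` as a checklist before a request is considered complete; which items are actually mandatory would be for the panel to decide.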

Request for Standardized Web-Based Interfaces to Data at MIT

MIT as an institution can achieve scalable access to data (both institutional data and community-generated data) by using the same strategy that Amazon, Google, Facebook and other data-rich Internet services have adopted in the past few years. This strategy is as follows:

• Standard Web Interfaces: Perform data creation, storage and access through a standard set of web-based data interfaces. This will make data available regardless of the clients that access the data (e.g. mobile or browser) or the services that publish the data. Our eventual goal is to standardize these interfaces across all organizations at MIT so that developers and users who need to access data can expect the same data interface and similar structure, regardless of the owner of the data or the services that publish it. This will involve documenting and publishing MIT-wide interface descriptions, so that any authorized user at MIT can use the service to access data.

• Common MIT-wide authentication & authorization infrastructure: Deploy a common infrastructure for authentication & authorization for all of MIT’s Web APIs so that users need only authenticate once and obtain the authorization tokens necessary to access data (i.e. Single Sign On). Today the industry standards for authorization for web interfaces are the OAuth2.0 token format and the OpenID-Connect (OIDC) protocol. MIT already possesses a good open source implementation of OAuth2.0 & OIDC, which is being integrated into the Touchstone authentication infrastructure.

• Empower the MIT community: Enable the community to create new kinds of applications. Interfaces with a delegated authorization model like OpenID Connect allow users to grant data access to applications, empowering community developers to use data traditionally inaccessible to the community in a manner that respects user security and Institute data policies.
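To make the idea of a standard web interface concrete, the following minimal sketch shows how a client might prepare a request that carries an OAuth 2.0 access token in the standard `Authorization: Bearer` header. The end-point URL, the placeholder token and the function name `build_data_request` are assumptions for illustration; the HTTPS check reflects the Appendix III requirement that Web APIs be served over HTTPS. The request is built but deliberately not sent.

```python
import urllib.request


def build_data_request(url, access_token):
    """Build (but do not send) an HTTPS GET carrying an OAuth 2.0 bearer token."""
    if not url.startswith("https://"):
        raise ValueError("data interfaces are expected to be served over HTTPS")
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


request = build_data_request(
    "https://example.mit.edu/facilities/buildings/",  # hypothetical end-point
    "PLACEHOLDER_TOKEN",                              # hypothetical token value
)
```

Because every service would accept the same header and URL conventions, the same helper could in principle be reused by mobile and browser-based clients alike.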

APPENDIX I: LIVING LAB PHASE 1

Figure 1. A high-level vision of the MIT Big Data Living Lab architecture.

Figure 2. Data at MIT: How to use data from and to the MIT community?

Applications

In the initial phase of the Living Lab Project we will focus on two application areas to demonstrate proof-of-concept: MIT Moves and MIT Wellness or MIT Quantified-Self.

1) "MIT Moves" Project will focus on understanding aggregate movement, patterns and flow of people on campus. Using aggregate data we can look at patterns in the movement of people around campus and how they change over time depending on different factors, such as events on campus, time of year, etc. Examples of questions we could investigate include:

• Which parts of campus are under/over utilized?
• What are patterns in where people congregate?
• What factors most impact patterns of movement?
• What are "typical" patterns of flow?
• What can we learn about campus safety?
• Can we better understand transportation needs/services on campus?
• How might this inform long term facilities and campus planning?
• What are the "traffic" patterns of how people move around campus?

Customer(s): 1) MIT Senior Leadership/Campus Planning

2) "MIT Quantified-Self" Project will focus on fitness and wellness, allowing individuals to collect basic activity metrics using their smart phones (which can be a far more powerful tracker than just a fitness tracker, e.g. FitBit) and allowing the use of aggregate data to understand patterns and trends in wellness across campus. For this project we are developing a customized version of the MIT app for the Living Lab Project, enabling users to collect and share specific types of personal data related to location and activity (movement, usage, sensors, etc.). Individual users will be able to collect, store and view their own personal data. Simple queries (or quizzes) could allow tracking of events, for example getting the flu, or monitoring stress level and happiness level. For users that opt in to sharing, certain data can be shared with selected groups of people (friends, classmates, colleagues, etc.) and aggregate (anonymized) data will allow us to look at patterns across campus. Examples of questions we could investigate include:

• How do students' activity levels change over the course of a semester?
• Can we correlate patterns in wellness with performance in other aspects of MIT life?
• Can we identify patterns in the spread of the flu on campus? Can we predict flu outbreaks?
• What motivates students to participate in tracking their wellness?

Customer(s): 1) MIT Community Wellness at MIT Medical and MIT DAPER 2) MIT students, researchers and staff

APPENDIX II: PRINCIPLES - MANAGING PERSONAL DATA

Proposed Principles for consideration in establishing a Personal Data Bill of Rights at MIT:

• Individual Control: MIT community members have a right to exercise control over what personal data organizations collect from them and how they use it.

• Transparency: MIT community members have a right to easily understand information about privacy and security practices.

• Respect for Context: MIT community members have a right to expect that organizations will collect, use, and disclose personal data in ways that are consistent with the context in which consumers provide the data.

• Security: MIT community members have a right to secure and responsible handling of personal data.

• Access and Accuracy: MIT community members have a right to access and correct personal data in usable formats, in a manner that is appropriate to the sensitivity of the data and the risk of adverse consequences to consumers if the data are inaccurate.

• Focused Collection: MIT community members have a right to reasonable limits on the personal data that companies working with MIT collect and retain.

• Accountability: MIT community members have a right to have personal data handled by companies with appropriate measures in place to assure they adhere to the MIT Privacy Bill of Rights.

APPENDIX III: MIT WEB-APIs for Data at MIT

MIT Big Data Platform and APIs: Design Notes

Thomas Hardjono, Justin Anderson, Sam Madden, Elizabeth Bruce

1. Summary

This document seeks to provide an overview of the authorization model for the MIT Big Data Living Lab platform for personal data stores at MIT. It is anticipated that a number of data repositories will be made available for access to the MIT community under the MIT Big Data at CSAIL (BigData@CSAIL) initiative.

Figure 1

A simplified interaction flow is shown in Figure 1:

• A member of the MIT community (User) seeks to access data that resides within one or more data-stores at MIT that participate in the BigData@CSAIL initiative.

• The User may be employing software, such as a web-application, to perform the actual access to the data-stores. Such a web-application aids the User in reading or viewing the data, since in most cases the raw data coming from the data-store may be too voluminous and too fine-grained.

• Any MIT data-store that participates in the BigData@CSAIL initiative must require the User to authenticate to the MIT authentication infrastructure (i.e. Touchstone) and obtain authorization from the MIT authorization infrastructure (i.e. OIDC).

2. Authentication and Authorization Requirements for BigData@CSAIL Platform

In order for a user (requesting party) to obtain access to any data stores belonging to the BigData@CSAIL Platform, there are a number of requirements on (i) the user and (ii) the data service that participates in the BigData@CSAIL Platform.

(a) MIT User and Client-side general requirements:

• MIT credentials: The user must be in possession of an MIT-issued identity and credential (e.g. Kerberos account).

• Touchstone authentication: The user must first authenticate to Touchstone@MIT (either directly, or through a redirect from the authorization server or from the data service).

• OIDC authorization: After authentication succeeds, the user must obtain authorization from the MIT OIDC server.

(b) MIT Data-store requirements for BigData@CSAIL initiative:

• RESTful Web APIs: MIT data-stores and other services that wish to participate in the BigData@CSAIL initiative must implement a RESTful Web-API (over HTTPS).

• Support OAuth2.0 and OIDC protocols: The Web-API must support authorization using the OAuth2.0 standard (RFC6749) and the OpenID-Connect (OIDC) protocol standard.

3. OAuth2.0 and OpenID-Connect (OIDC) Authorization: An Overview

The OAuth2.0 and OIDC protocols are today the industry standard for authorization for services accessible via RESTful Web APIs. OAuth2.0 addresses only authorization; OIDC adds an identity layer on top of it, but neither protocol prescribes how the user is actually authenticated. As such, they assume authentication has occurred using a separate mechanism and are in fact agnostic to the authentication strength/mechanism being deployed in the infrastructure. At MIT, the Touchstone authentication infrastructure supports authentication via passwords, via Kerberos and via X509 certificates. Touchstone also supports the issuance of SAML2.0 assertions (digitally signed), which allows Touchstone to provide Single-Sign-On (SSO) to sites that accept these assertions.

There are a number of basic steps required to obtain authentication and authorization to access the Web APIs (Figure 2):

• Step 1: The MIT User, using the MIT Mobile App or a browser, must authenticate to MIT Touchstone.

• Step 2: MIT Touchstone will return a ticket (in the case of Kerberos) or a SAML2.0 assertion (in the case of Single-Sign-On).

• Step 3: The User employs the Web Application (which acts as the OAuth2.0 Approved Client) to attempt access to the data store which is part of the BigData@CSAIL initiative.

• Step 4: In this case, since the User has not yet been authorized, the User is redirected to the MIT OIDC Authorization Server (AS) in order to obtain the necessary authorization (in the form of an OAuth2.0 Token).

• Step 5: In this step the User interacts with the OIDC Server, obtaining an OAuth2.0 Token that will be cached at the Web Application.

Figure 2
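The redirect in Step 4 can be sketched as the construction of an authorization request URL. This is a minimal illustration that assumes the OAuth 2.0 authorization code flow, which these notes do not specify; the endpoint, client identifier, redirect URI and scope name are all hypothetical, and only the query parameter names (`response_type`, `client_id`, `redirect_uri`, `scope`, `state`) and the `openid` scope come from the OAuth 2.0 and OIDC specifications.

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs


def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scopes):
    """Return (url, state) for an OpenID Connect authorization code request."""
    # Random value echoed back by the server; the client compares it on return
    # to guard against cross-site request forgery.
    state = secrets.token_urlsafe(16)
    query = urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(["openid", *scopes]),
        "state": state,
    })
    return f"{authorize_endpoint}?{query}", state


url, state = build_authorization_url(
    "https://oidc.example.mit.edu/authorize",  # hypothetical AS endpoint
    "living-lab-web-app",                      # hypothetical client id
    "https://app.example.mit.edu/callback",    # hypothetical redirect URI
    ["facilities.read"],                       # hypothetical scope
)
params = parse_qs(urlparse(url).query)
```

In Step 5 the Web Application would exchange the code returned to its redirect URI for a token and cache it; that exchange involves a network call to the Authorization Server and is left out of this sketch.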

4. MIT BigData@CSAIL Platform: Structure of Data Stores

In order to provide seamless and efficient access to data by the User, an MIT data-store that participates in the BigData@CSAIL initiative must observe the following.

(a) Common data directory structure: We expect data-stores to make available a Web-API that makes data available in the following format:

example.mit.edu/ServiceName/EndPointName/

where the ServiceName uniquely identifies the service at MIT (e.g. facilities) and the EndPointName uniquely identifies the Web end-point accessible to the Client.
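A client could validate and split such a URL as in the sketch below. The helper name `parse_endpoint` and the `buildings` end-point are hypothetical; `facilities` is the service name used as an example above, and the HTTPS requirement comes from Section 2.

```python
from urllib.parse import urlparse


def parse_endpoint(url):
    """Split a data-store URL into (host, service_name, endpoint_name)."""
    parsed = urlparse(url)
    # Ignore empty segments produced by leading and trailing slashes.
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.scheme != "https" or len(parts) != 2:
        raise ValueError(
            f"expected https://host/ServiceName/EndPointName/, got {url!r}"
        )
    service_name, endpoint_name = parts
    return parsed.netloc, service_name, endpoint_name


result = parse_endpoint("https://example.mit.edu/facilities/buildings/")
```

Keeping the path to exactly two segments is an assumption of this sketch; a real deployment might allow deeper resource paths beneath the end-point.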

(b) Web-discoverable configuration file:

We expect each service (i.e. ServiceName) will make available a well-known computer-readable configuration file under a uniform name that provides detailed instructions to the Client software as to how the Client must proceed. The well-known configuration file will list, among others:

• All the accessible end-points within the service.
• The required level of permissions (i.e. OAuth2.0 scopes).
• The format of the data within the Service.
• The structure of the directory in the Service where the data resides.
• Others TBD.
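Since the file format is still to be defined, the following is purely a sketch of what such a configuration file and a client-side check might look like. JSON as the format, every key name, the scope names and the function `accessible_endpoints` are assumptions for illustration, not part of these design notes.

```python
import json

# Hypothetical contents of a service's well-known configuration file.
CONFIG_TEXT = """
{
  "service": "facilities",
  "data_format": "application/json",
  "endpoints": [
    {"name": "buildings", "path": "/facilities/buildings/",
     "scopes": ["facilities.read"]},
    {"name": "rooms", "path": "/facilities/rooms/",
     "scopes": ["facilities.read", "facilities.rooms"]}
  ]
}
"""


def accessible_endpoints(config, granted_scopes):
    """Return the names of end-points whose required scopes are all granted."""
    granted = set(granted_scopes)
    return [
        endpoint["name"]
        for endpoint in config["endpoints"]
        if set(endpoint["scopes"]) <= granted
    ]


config = json.loads(CONFIG_TEXT)
allowed = accessible_endpoints(config, ["facilities.read"])
```

A Client that reads the file first can tell which scopes to request before attempting access, rather than discovering missing permissions through failed calls.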

5. MIT Mobile App Platform: Native Applications

m.mit.edu and the MIT Mobile iOS and Android apps will serve as a proof of concept for this style of authentication, authorization, and discoverability.

(a) Authorization Server: To keep things simple for the proof of concept, an instance of MIT KIT’s open source OIDC authorization server will be set up on m.mit.edu, the same host as the APIs used by MIT Mobile.

(b) APIs: The secure APIs on m.mit.edu will be updated to require an auth token from the Authorization Server instead of the current requirement of Touchstone alone.

(c) Discoverable configuration file: A configuration file will be added to m.mit.edu at a well-known URL.

(d) Approved Clients: MIT Mobile for iOS and Android will be updated to redirect users to OIDC and Touchstone in a web browser to access the APIs on a user’s behalf.

Once the proof of concept is in place, other OIDC clients may be created to pull data from the APIs on m.mit.edu, and APIs on other servers can redirect their clients to the Authorization Server on m.mit.edu. If that proves successful, the Authorization Server will be spun off into its own domain.
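Requiring an auth token, as described in (b), means each secure API must first read the token from the incoming request. The sketch below shows only that first step, assuming the token arrives in the standard `Authorization: Bearer` header and that headers are available as a plain dictionary; the function name is hypothetical, and checking the token against the Authorization Server is deliberately out of scope here.

```python
def extract_bearer_token(headers):
    """Return the bearer token from request headers, or None if absent or malformed."""
    value = headers.get("Authorization", "")
    # Split "Bearer <token>" into its scheme and token parts.
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


token = extract_bearer_token({"Authorization": "Bearer abc123"})
```

A request for which this returns `None` would be redirected to the Authorization Server, in line with the flow in Section 3.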
