
MIT Big Data Initiative at CSAIL

Project: Big Data Living Lab
Exploring the Future of Big Data at MIT
April 2014

Proposal & Request for Data

Introduction

We now have the ability to collect and acquire digital information at an unprecedented rate across practically all aspects of our lives, including healthcare, financial transactions, social interactions, education, energy usage, transportation, and environmental monitoring. "Big Data" is about harnessing all of this digital information by combining and analyzing it in completely new ways to make better predictions and, ultimately, better decisions. Over the next decade, Big Data has the potential to profoundly change the way we live, work and play.

Big Data also introduces unique challenges when it comes to managing and protecting personal privacy (MIT co-hosted a Workshop on Big Data Privacy with the White House in March 2014 to discuss these issues). Big Data privacy issues are complex, introducing a host of ethical, legal, policy and technical questions. How do we build on Big Data’s potential for good while maintaining essential privacy protections? And how do we design future technologies, policies, and practices to get that balance right for society?

MIT is well positioned to take a leadership role in demonstrating not only how organizations can leverage data in the future, but also how we collect, manage, and use personal information, from setting appropriate policies to demonstrating systems that can implement them in practice. In terms of integrating, analyzing, and sharing data, MIT faces challenges similar to those of many organizations across different sectors, whether in industry or government.

A Big Data testbed at MIT will allow us to demonstrate how data can be used to better understand and improve our community; collectively explore ways to address technical and privacy challenges; and demonstrate new approaches and solutions emerging from the research community.

MIT Big Data Living Lab Project

We propose building a testbed at MIT that demonstrates a unified interface to data for the MIT community and enables the community to demonstrate the value of big data across a variety of applications, creating a "Living Lab" at MIT (see Appendix I, Figures 1 and 2). The MIT Living Lab data platform will provide the ability to access, share and use data about MIT, for MIT, allowing MIT itself to be more data-driven.

The MIT Big Data Living Lab will allow the MIT campus to serve as a microcosm for many Big Data efforts (whether in government, industry, or other academic institutions and non-profits) and would enable us to:

• Lead by example, employing organizational best practices for collecting and managing personal information;
• Explore technical issues and test prototypes for large-scale access control, data integration, data governance, analytics, and visualization;
• Build and demonstrate privacy technology and privacy policy;
• Demonstrate the impact and benefits of big data with a plethora of new applications;
• Explore social implications of big data, e.g., understanding people's reaction to and use of this kind of data collection;
• Enable personal innovation and ownership by providing members of the MIT community appropriate access to their own personal data;
• Demonstrate a system that will provide useful services to the MIT community, architected such that the suite of services is extensible.

If the project is successful, it will result in a number of visualizations and analyses of MIT life that we hope will provide value to the MIT community. Examples of the types of questions we might investigate:

Social Patterns:

• How much do people in different departments, labs and organizations co-mingle?
• What are the informal social relationships between different groups on campus?
• Which parts of the campus are under- or over-utilized?

Health and Wellness:

• How can we better promote wellness?
• Is exercise correlated with other aspects of student or employee performance?
• What are the patterns of use of MIT's athletic facilities?

Transportation:

• What are the patterns in how people get to and from campus, when, and via what routes?
• Are there opportunities for carpooling, improving transportation services, or offering new types of services?

Collaboration:

• Which groups at MIT have expertise in a particular technical area, or have worked on a particular research topic?
• What cross-departmental collaborations are occurring, and are there ways we can strengthen such ties?

A key challenge for the MIT Living Lab is opening up repositories of information on campus that contain the data needed to answer these and many other questions. As part of the MIT Big Data Living Lab project, we will work with MIT administration, MIT IS&T and various other MIT departments across campus in a number of different capacities, including 1) as customers and 2) as data owners or data curators.

On the "customer" side, we expect to collaborate with people and groups at MIT who want to explore data and gain new insights. For example, we have already started conversations with the Community Wellness Program at MIT Medical (Maryanne Kirkbride), MIT DAPER (Tim Mertz), and the MIT Office of Sustainability (Julie Newman) on what would be valuable output. We have also been working closely with IS&T on identifying repositories of data on campus, mapping out the different data curators, and determining how we will access the data (Stephen Buckley, Thomas Hardjono, Myra Hope Eskridge).

On the data "owner" or "curator" side, we expect to work with people and groups at MIT on requesting and accessing data. Our goal is to develop a unified set of interfaces to data on and about campus, while respecting the privacy and security of individuals recorded in those data sets. The Project aims to bring together many kinds of data from many different sources, including:

1) MIT Data: Data generated and collected by MIT as an organization, for example, campus maps; card swipes; building maps; WiFi access points; and video/CCTV data.

2) Personal Data: Data collected and stored by individual members of the MIT community, about themselves, which they control and then choose to share with certain groups for certain purposes, e.g., location (GPS) and "quantified self" metrics for activity monitoring (# of steps taken) and tracking behavior. The Big Data Living Lab team is now beta-testing a new app that will allow people at MIT to collect and store data with their smart phones, using the Open Personal Data Store (OpenPDS) architecture (Pentland, Media Lab) and the DataHub system (Madden, CSAIL).

3) External or Public Data: Data from other sources, for example, social media data (Twitter), weather data, events data, and local city data.

Aggregating this diversity of data will allow us to derive patterns from disparate data types. Analyzing aggregate data, even when anonymized, can reveal valuable insights about trends and patterns on campus.

In the initial phase of the Big Data Living Lab project, we will focus on two application areas to demonstrate proof-of-concept: "MIT Moves" (where do people spend time on campus, and what are the movement patterns?) and MIT Wellness or "MIT Quantified-Self" (what are the patterns in people’s activity levels, sleep patterns, eating habits, etc., by time or by subgroup on campus?).
See Appendix I for further descriptions of these specific applications.

Request for MIT Data

For the Living Lab projects, we would like to request access to data from different groups within MIT, including the following:

For Phase One:

1) MIT Campus Maps
   Description: Campus maps including public spaces inside buildings
   MIT Data Owner: Facilities Information Systems, Department of Facilities
   MIT/Relevant Policies: (none listed)
   Purpose: For mapping and cross-referencing access points across campus

2) Wireless access points
   Description: # of unique identifiers at each wireless access point
   MIT Data Owner: Operations and Infrastructure, IS&T
   MIT/Relevant Policies: http://ist.mit.edu/network/rules (MIT only stores data for 30 days)
   Purpose: For understanding aggregate patterns of movement across campus

3) Card swipes
   Description: Card swipe data for all campus buildings and parking lots
   MIT Data Owner: Security and Emergency Management Office, Department of Facilities
   MIT/Relevant Policies: http://web.mit.edu/semo/security/policies.html
   Purpose: For understanding aggregate patterns of movement across campus

4) Campus CCTV Video
   Description: Video from cameras on campus
   MIT Data Owner: Security and Emergency Management Office, Department of Facilities
   MIT/Relevant Policies: http://web.mit.edu/semo/security/policies.html
   Purpose: For understanding aggregate patterns of movement across campus; raw video footage will be processed by CSAIL researchers (Fisher et al.) and only the processed data (not actual video footage) will be shared

6) Community Wellness at MIT Medical/GetFit Program/DAPER
   Description: Fitness data tracked by participants
   MIT Data Owner: MIT Medical
   MIT/Relevant Policies: Work with existing programs as an "opt-in"
   Purpose: For better understanding staff, faculty, student, and MIT community exercise patterns and "wellness"
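As an illustration of how the wireless access point data could be reduced to aggregate counts before any analysis, here is a minimal Python sketch. It assumes a hypothetical record layout of (device identifier, access point, hour); the function names, the salted SHA-256 pseudonymization, and the `min_count` suppression threshold are illustrative assumptions and are not methods specified in this proposal. Salted hashing alone is pseudonymization rather than full anonymization, so it is shown only as one possible preconditioning step.

```python
import hashlib
from collections import defaultdict


def pseudonymize(device_id: str, salt: str) -> str:
    """Replace a raw identifier with a salted SHA-256 digest (assumed scheme)."""
    return hashlib.sha256((salt + device_id).encode("utf-8")).hexdigest()


def unique_devices_per_access_point(records, salt, min_count=5):
    """Count distinct pseudonymized devices per (access point, hour).

    `records` is an iterable of hypothetical (device_id, access_point, hour)
    tuples. Cells with fewer than `min_count` distinct devices are dropped so
    that very small groups are not reported.
    """
    seen = defaultdict(set)
    for device_id, access_point, hour in records:
        seen[(access_point, hour)].add(pseudonymize(device_id, salt))
    return {
        cell: len(devices)
        for cell, devices in seen.items()
        if len(devices) >= min_count
    }


# Made-up sample observations: two distinct devices at one access point
# (one seen twice) and a single device at another.
sample_records = [
    ("aa:01", "AP-32-G", 9),
    ("aa:02", "AP-32-G", 9),
    ("aa:01", "AP-32-G", 9),
    ("aa:03", "AP-W20", 9),
]
sample_counts = unique_devices_per_access_point(
    sample_records, salt="demo-salt", min_count=2
)
```

With the sample above, only the cell for the first access point survives, because the second cell falls below the threshold; repeated sightings of the same device are counted once.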

For Phase Two:

Provided the initial demonstrations are successful, we would like to access additional data. In some cases this will mean opening up access to data for individuals, and allowing them the option to share data with certain Living Lab applications. In other cases we will be looking to access anonymized data that will be studied in aggregate to explore correlations and patterns. Additional MIT data sources we would like to include:

• TechCash
• Registrar's Office
• Travel Operations
• Payroll/Financial Operations
• MIT Medical/Healthcare
• Infrastructure data for campus (roads, electricity, location of MIT buses and vehicles, etc.)

Request for Process

For the MIT Big Data Living Lab Project, we want to establish an effective process for requesting, receiving and managing MIT Data, and we want to ensure that data is shared and managed according to an agreed-upon set of Best Practices for Personal Information and Privacy (see the proposed list of principles for MIT adopting a Personal Data Bill of Rights in Appendix II).

We suggest MIT establish an internal "Data Use Oversight and Review Panel" to approve Living Lab applications and studies. This would operate as a small group of faculty, administration officials and/or experts who can ensure safeguards while innovating practices during the initial Living Lab test phase. As part of our commitment to responsible and proactive treatment of the data, we will request that the MIT data owners or "curators", with the assistance of the Living Lab team as needed, provide the data as required by the policies which govern them. For example, this treatment may include preconditioning the data to an agreed level of abstraction or other practices deemed appropriate under the circumstances.

To govern the oversight process, we propose establishing an "internal" Data Use Agreement (iDUA) to ensure that expectations about data access and use are agreed. The following is an example framework for this iDUA:

Overview

• Project Purpose
• MIT Data Owner/Data Curator

Data Requested

• Description of data request [data schema, data types, time period, etc.]
• Will the data include Personally Identifiable Information (PII)?
• What are the privacy concerns, sensitivities or particular considerations?

Stewardship

• How will PII be managed?
• What anonymization (if required) will be done, and who will be responsible for anonymizing the data (e.g., the MIT Data Owner or the Living Lab Project)?
• Description of anonymization method and/or other agreed preconditioning
• Persons who will have access to the data
• Data delivery, storage method and protocols
• MIT requirements/policies governing lifecycle of data, e.g., limits on storage

Proposed Use

• What data will be shared publicly and how will it be shared?
• What output products will be generated from the data and how will these be shared publicly?
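The iDUA framework above could be captured as a structured record so that a review panel can see at a glance which items are still unanswered. The following Python sketch is one possible encoding; the class name, field names, and the `missing_items` check are hypothetical and are not part of the proposal.

```python
from dataclasses import dataclass, field, fields


@dataclass
class InternalDataUseAgreement:
    """Hypothetical record mirroring the example iDUA framework."""

    # Overview
    project_purpose: str = ""
    data_owner: str = ""
    # Data Requested
    data_description: str = ""
    includes_pii: bool = False
    privacy_concerns: str = ""
    # Stewardship
    pii_management: str = ""
    anonymization_method: str = ""
    anonymization_responsible: str = ""
    persons_with_access: list = field(default_factory=list)
    delivery_and_storage: str = ""
    lifecycle_policy: str = ""
    # Proposed Use
    public_data_sharing: str = ""
    output_products: str = ""

    def missing_items(self):
        """Return the names of framework items that are still blank."""
        missing = [
            f.name for f in fields(self) if getattr(self, f.name) in ("", [])
        ]
        # A PII management plan is only needed when the data includes PII.
        if not self.includes_pii:
            missing = [name for name in missing if name != "pii_management"]
        return missing


# A partially completed agreement with made-up values.
draft_idua = InternalDataUseAgreement(
    project_purpose="MIT Moves", data_owner="IS&T"
)
draft_missing = draft_idua.missing_items()
```

Here `draft_missing` lists every item other than the two that were filled in, the boolean PII flag, and the PII management plan, which is skipped because the draft declares no PII.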

Request for Standardized Web-Based Interfaces to Data at MIT

MIT as an institution can achieve scalable access to data (both institutional data and community-generated data) by using the same strategy that Amazon, Google, Facebook and other data-rich Internet services have adopted in the past few years.

This strategy is as follows:

• Standard Web Interfaces: Perform data creation, storage and access through a standard set of web-based data interfaces. This will make data available regardless of the clients that access the data (e.g., mobile or browser) or the services that publish the data. Our eventual goal is to standardize these interfaces across all organizations at MIT so that developers and users who need to access data can expect the same data interface and similar structure, regardless of the owner of the data or the services that publish them. This will involve documenting and publishing MIT-wide interface descriptions, so that any authorized user at MIT can use the service to access data.

• Common MIT-wide authentication & authorization infrastructure: Deploy a common infrastructure for authentication & authorization for all of MIT’s Web APIs so that users need only authenticate once to obtain the authorization tokens necessary to access data (i.e., Single Sign On). Today the industry standard for authorization for web interfaces is the OAuth2.0 framework together with the OpenID-Connect (OIDC) protocol. MIT already possesses a good open source implementation of OAuth2.0 & OIDC, which is being integrated into the Touchstone authentication infrastructure.

• Empower the MIT community: Enable the community to create new kinds of applications. Interfaces with a delegated authorization model like OpenID Connect allow users to grant data access to applications, empowering community developers to use data traditionally inaccessible to the community in a manner that respects user security and Institute data policies.

 

APPENDIX I: LIVING LAB PHASE 1

Figure 1. A high-level vision of the MIT Big Data Living Lab architecture.

Figure 2. Data at MIT: How to use data from and to the MIT community?

Applications

In the initial phase of the Living Lab Project we will focus on two application areas to demonstrate proof-of-concept: MIT Moves and MIT Wellness or MIT Quantified-Self.

1) The "MIT Moves" Project will focus on understanding the aggregate movement, patterns and flow of people on campus. Using aggregate data, we can look at patterns in the movement of people around campus and how they change over time depending on different factors, such as events on campus, time of year, etc. Examples of questions we could investigate include:

• Which parts of campus are under- or over-utilized?
• What are the patterns in where people congregate?
• What factors most impact patterns of movement?
• What are "typical" patterns of flow?
• What can we learn about campus safety?
• Can we better understand transportation needs/services on campus?
• How might this inform long-term facilities and campus planning?
• What are the "traffic" patterns of how people move around campus?

Customer(s):
1) MIT Senior Leadership/Campus Planning

2) The "MIT Quantified-Self" Project will focus on fitness and wellness, allowing individuals to collect basic activity metrics using their smart phones (which can be far more powerful than a dedicated fitness tracker, e.g., FitBit) and allowing the use of aggregate data to understand patterns and trends in wellness across campus.

For this project we are developing a customized version of the MIT app for the Living Lab Project, enabling users to collect and share specific types of personal data related to location and activity (movement, usage, sensors, etc.). Individual users will be able to collect, store and view their own personal data. Simple queries (or quizzes) could allow tracking of events, such as getting the flu, or monitoring of stress and happiness levels. For users who opt in to sharing, certain data can be shared with selected groups of people (friends, classmates, colleagues, etc.), and aggregate (anonymized) data will allow us to look at patterns across campus.

Examples of questions we could investigate include:

• How do students' activity levels change over the course of a semester?
• Can we correlate patterns in wellness with performance in other aspects of MIT life?
• Can we identify patterns in the spread of the flu on campus? Can we predict flu outbreaks?
• What motivates students to participate in tracking their wellness?

Customer(s):
1) MIT Community Wellness at MIT Medical and MIT DAPER
2) MIT students, researchers and staff

 


APPENDIX II: PRINCIPLES - MANAGING PERSONAL DATA

Proposed Principles for consideration in establishing a Personal Data Bill of Rights at MIT:

• Individual Control: MIT community members have a right to exercise control over what personal data organizations collect from them and how they use it.

• Transparency: MIT community members have a right to easily understandable information about privacy and security practices.

• Respect for Context: MIT community members have a right to expect that organizations will collect, use, and disclose personal data in ways that are consistent with the context in which they provide the data.

• Security: MIT community members have a right to secure and responsible handling of personal data.

• Access and Accuracy: MIT community members have a right to access and correct personal data in usable formats, in a manner that is appropriate to the sensitivity of the data and the risk of adverse consequences to them if the data are inaccurate.

• Focused Collection: MIT community members have a right to reasonable limits on the personal data that companies working with MIT collect and retain.

• Accountability: MIT community members have a right to have personal data handled by companies with appropriate measures in place to assure they adhere to the MIT Privacy Bill of Rights.

   


APPENDIX III: MIT WEB-APIs for Data at MIT

MIT Big Data Platform and APIs: Design Notes

Thomas Hardjono, Justin Anderson, Sam Madden, Elizabeth Bruce

1. Summary

This document seeks to provide an overview of the authorization model for the MIT Big Data Living Lab platform for personal data stores at MIT. It is anticipated that a number of data repositories will be made available for access to the MIT community under the MIT Big Data at CSAIL (bigdata@CSAIL) initiative.

Figure 1

A simplified interaction flow is shown in Figure 1:

• A member of the MIT community (User) seeks to access data that resides within one or more data-stores at MIT that participate in the BigData@CSAIL initiative.

• The User may be employing software, such as a web-application, to perform the actual access to the data-stores. Such a web-application aids the User in reading or viewing the data, since in most cases the raw data coming from the data-store may be too voluminous and too fine-grained.

• Any MIT data-store that participates in the BigData@CSAIL initiative must require the User to authenticate to the MIT authentication infrastructure (i.e., Touchstone) and obtain authorization from the MIT authorization infrastructure (i.e., OIDC).

2. Authentication and Authorization Requirements for BigData@CSAIL Platform

In order for a user (requesting party) to obtain access to any data stores belonging to the BigData@CSAIL Platform, there are a number of requirements on the part of (i) the user and (ii) the data service that participates in the BigData@CSAIL Platform.

(a) MIT User and Client-side general requirements:

• MIT credentials: The user must be in possession of an MIT-issued identity and credential (e.g., Kerberos account).
• Touchstone authentication: The user must first authenticate to Touchstone@MIT (either directly, or through a redirect from the authorization server or from the data service).
• OIDC authorization: After authentication succeeds, the user must obtain authorization from the MIT OIDC server.

(b) MIT Data-store requirements for BigData@CSAIL initiative:

• RESTful Web APIs: MIT data-stores and other services that wish to participate in the BigData@CSAIL initiative must implement a RESTful Web-API (over HTTPS).
• Support OAuth2.0 and OIDC protocols: The Web-API must support authorization using the OAuth2.0 standard (RFC6749) and the OpenID-Connect (OIDC) protocol standard.
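From the client's point of view, these requirements mean that each Web-API call travels over HTTPS and carries an OAuth2.0 access token. The Python sketch below shows one way a client might prepare such a call, assuming the token is sent as a bearer token in the `Authorization` header (the usage defined in RFC 6750). The function name, URL, and token value are placeholders, and the request is only constructed here, not sent.

```python
import urllib.request


def build_api_request(url: str, access_token: str) -> urllib.request.Request:
    """Prepare a GET request that presents an OAuth2.0 bearer token.

    Rejects non-HTTPS URLs, since participating Web-APIs are required to be
    served over HTTPS.
    """
    if not url.startswith("https://"):
        raise ValueError("Web-API calls must use HTTPS")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


# Placeholder URL and token, for illustration only.
example_request = build_api_request(
    "https://example.mit.edu/facilities/maps/", "abc123"
)
```

A caller would pass the prepared request to `urllib.request.urlopen` once it holds a real token issued by the authorization server.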

3. OAuth2.0 and OpenID-Connect (OIDC) Authorization: An Overview

The OAuth2.0 and the OIDC protocols are today the industry standard for authorization for services accessible via RESTful Web APIs. OAuth2.0 addresses authorization rather than authentication, and OIDC, which layers identity information on top of OAuth2.0, leaves the method of authenticating the user to the deployment. As such, both assume that authentication is performed by a separate mechanism and are agnostic to the authentication strength/mechanism being deployed in the infrastructure.

At MIT, the Touchstone authentication infrastructure supports authentication via passwords, via Kerberos and via X509 certificates. Touchstone also supports the issuance of SAML2.0 assertions (digitally signed), which allows Touchstone to provide the Single-Sign-On (SSO) feature to sites that accept these assertions.

There are a number of basic steps required to obtain authentication and authorization to access the Web APIs (Figure 2):

• Step 1: The MIT User, using the MIT Mobile App or a browser, must authenticate to MIT Touchstone.

• Step 2: MIT Touchstone will return a ticket (in the case of Kerberos) or a SAML2.0 assertion (in the case of Single-Sign-On).

• Step 3: The User employs the Web Application (which acts as the OAuth2.0 Approved Client) to attempt access to the data store which is part of the BigData@CSAIL initiative.

• Step 4: Since the User has not yet been authorized, the User is redirected to the MIT OIDC Authorization Server (AS) in order to obtain the necessary authorization (in the form of an OAuth2.0 Token).

• Step 5: The User interacts with the OIDC Server, obtaining an OAuth2.0 Token that will be cached at the Web Application.

Figure 2
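The redirect in Steps 4 and 5 can be sketched in Python as follows. This is a minimal illustration that assumes the OAuth2.0 authorization code grant; the design notes do not say which grant type is used, and the endpoint URLs, client identifier, and scope names below are placeholders. The exchange of the returned code for a token at the authorization server is not shown.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scopes):
    """Build the URL the User is redirected to, plus the anti-forgery state."""
    state = secrets.token_urlsafe(16)
    query = urlencode(
        {
            "response_type": "code",  # authorization code grant (assumed)
            "client_id": client_id,
            "redirect_uri": redirect_uri,
            # OIDC requests include the "openid" scope alongside API scopes.
            "scope": " ".join(["openid", *scopes]),
            "state": state,
        }
    )
    return f"{authorize_endpoint}?{query}", state


def extract_code(callback_url, expected_state):
    """Read the authorization code from the redirect back to the client."""
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch; discard this response")
    return params["code"][0]


# Placeholder endpoint, client, and scope values for illustration.
auth_url, auth_state = build_authorization_url(
    "https://oidc.example.mit.edu/authorize",
    "demo-client",
    "https://app.example.mit.edu/callback",
    ["facilities.read"],
)
```

The Web Application would then exchange the extracted code for an OAuth2.0 Token and cache it, as described in Step 5.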

4. MIT BigData@CSAIL Platform: Structure of Data Stores

In order to provide seamless and efficient access to data by the User, an MIT data-store that participates in the BigData@CSAIL initiative must observe the following.

(a) Common data directory structure:
We expect data-stores to provide a Web-API that makes data available in the following format:

    example.mit.edu/ServiceName/EndPointName/

where the ServiceName uniquely identifies the service at MIT (e.g., facilities) and the EndPointName uniquely identifies the Web end-point accessible to the Client.

(b) Web-discoverable configuration file:
We expect each service (i.e., ServiceName) to make available a well-known computer-readable configuration file under a uniform name that provides detailed instructions to the Client software as to how the Client must proceed. The well-known configuration file will list, among others:

• All the accessible end-points within the service.
• The required level of permissions (i.e., OAuth2.0 scopes).
• The format of the data within the Service.
• The structure of the directory in the Service where the data resides.
• Others TBD.
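To make the directory structure and the discoverable configuration file concrete, the Python sketch below parses a hypothetical JSON configuration and uses it to build an end-point URL and to check whether a token's scopes are sufficient. The design notes do not define the file's name, format, or field names; the JSON layout, the scope strings, and the helper functions here are assumptions for illustration.

```python
import json

# Hypothetical configuration content; every key and value is a placeholder.
EXAMPLE_CONFIG = """
{
  "service": "facilities",
  "data_format": "application/json",
  "endpoints": {
    "maps": {"required_scopes": ["facilities.maps.read"]},
    "access-points": {"required_scopes": ["facilities.ap.read"]}
  }
}
"""


def endpoint_url(host, config, endpoint):
    """Build host/ServiceName/EndPointName/ for an end-point listed in the config."""
    if endpoint not in config["endpoints"]:
        raise KeyError(f"unknown end-point: {endpoint}")
    return f"https://{host}/{config['service']}/{endpoint}/"


def missing_scopes(config, endpoint, granted_scopes):
    """Return the required scopes that the granted token does not carry."""
    required = set(config["endpoints"][endpoint]["required_scopes"])
    return sorted(required - set(granted_scopes))


service_config = json.loads(EXAMPLE_CONFIG)
maps_url = endpoint_url("example.mit.edu", service_config, "maps")
```

A Client could fetch such a file once per service, then call `missing_scopes` before each request to decide whether it needs to ask the User for further authorization.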

5. MIT Mobile App Platform: Native Applications

m.mit.edu and the MIT Mobile iOS and Android apps will serve as a proof of concept for this style of authentication, authorization, and discoverability.

(a) Authorization Server:
To keep things simple for the proof of concept, an instance of MIT KIT’s open source OIDC authorization server will be set up on m.mit.edu, the same host as the APIs used by MIT Mobile.

(b) APIs:
The secure APIs on m.mit.edu will be updated to require an auth token from the Authorization Server instead of the current requirement of Touchstone alone.

(c) Discoverable configuration file:
A configuration file will be added to m.mit.edu at a well-known URL.

(d) Approved Clients:
MIT Mobile for iOS and Android will be updated to redirect users to OIDC and Touchstone in a web browser to access the APIs on a user’s behalf.

Once the proof of concept is in place, other OIDC clients may be created to pull data from the APIs on m.mit.edu, and APIs on other servers can redirect their clients to the Authorization Server on m.mit.edu. If that proves successful, the Authorization Server will be spun off into its own domain.
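On the API side, requiring an auth token means each secure end-point checks the presented token before returning data. The Python sketch below shows the shape of such a check under simplifying assumptions: tokens are looked up in an in-memory table standing in for the Authorization Server, and the scope names and HTTP status mapping (401 for a missing, unknown, or expired token, 403 for an insufficient scope) are illustrative choices rather than details taken from the design notes.

```python
import time


def check_bearer_token(authorization_header, required_scope, issued_tokens, now=None):
    """Return an HTTP status code for a request to a protected end-point.

    `issued_tokens` maps token strings to {"scopes": set, "expires_at": float};
    it is a hypothetical stand-in for asking the Authorization Server.
    """
    now = time.time() if now is None else now
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        return 401  # no usable bearer token was presented
    grant = issued_tokens.get(token)
    if grant is None or grant["expires_at"] <= now:
        return 401  # unknown or expired token
    if required_scope not in grant["scopes"]:
        return 403  # valid token, but it lacks the required scope
    return 200


# Made-up token table for illustration.
demo_tokens = {"abc": {"scopes": {"mobile.read"}, "expires_at": 2000.0}}
demo_status = check_bearer_token("Bearer abc", "mobile.read", demo_tokens, now=1000.0)
```

In a deployment the lookup would be replaced by whatever validation the Authorization Server supports, but the decision order (token present, token valid, scope sufficient) would stay the same.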