Preventing Data Errors with Continuous Testing Kıvanç Muşlu Yuriy Brun Alexandra Meliou University of Washington University of Massachusetts


Post on 11-Apr-2022


Page 1: Preventing Data Errors with Continuous Testing

Preventing Data Errors with Continuous Testing

Kıvanç Muşlu, Yuriy Brun, Alexandra Meliou

University of Washington, University of Massachusetts

Page 2: Preventing Data Errors with Continuous Testing
Page 3: Preventing Data Errors with Continuous Testing

Data discrepancies

[Figure: example of discrepant time values: 6:15 PM, 6:15 PM, 6:22 PM, 9:40 PM, 8:33 PM, 9:54 PM]

Slide credit: Luna Dong and Divesh Srivastava

Page 4: Preventing Data Errors with Continuous Testing

“Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?”

Charles Babbage, from Passages from the Life of a Philosopher

Page 5: Preventing Data Errors with Continuous Testing

Dealing with software errors

• program analysis
• language features
• code reviews
• formal verification

… and

testing

Page 6: Preventing Data Errors with Continuous Testing

Key idea: Testing for data-intensive systems

• Identifying system failures caused by well-formed but incorrect data
• Using application-specific execution information
• Integrating into the system usage workflow

Bringing testing to the data domain

Page 7: Preventing Data Errors with Continuous Testing

How is data testing different?

• Data semantics
• Test query generation
• Timeliness
• Unobtrusive and precise interface

System administrators and users are not software engineers

Page 8: Preventing Data Errors with Continuous Testing

Data semantics

• Valid data can span multiple orders of magnitude
  – thwarts statistical outlier detection
• Semantics of nearly identical data differ vastly:

617  616

339
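The first point on this slide can be illustrated with a small sketch. It assumes a plain z-score rule as the "statistical outlier detection" and uses made-up prices; neither comes from the talk. The valid prices span several orders of magnitude, so the rule flags a legitimate expensive item while missing a value that was wrongly discounted.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold (illustrative rule)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical, all-valid prices spanning several orders of magnitude.
valid_prices = [950, 4_200, 12_000, 18_500, 47_000, 310_000]

# Introduce one well-formed but incorrect value: 18,500 wrongly reduced by 30%.
corrupted = list(valid_prices)
corrupted[3] = 12_950

flagged = zscore_outliers(corrupted)
print(flagged)  # the valid 310,000 is flagged; the wrong 12,950 is not
```

The incorrect value sits comfortably inside the range of valid data, which is why a check based on the data's meaning is needed rather than one based on its distribution.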

Page 9: Preventing Data Errors with Continuous Testing

Test query generation

• Administrators and users have not (yet) bought into testing
• Manually written tests will come
  – developers can ship these with the application
• But automatic generation is needed now
  – mine queries from code
  – record historical queries
  – adapt related work on database-use test generation
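As a rough illustration of "mine queries from code", the sketch below scans application source text for string literals that begin with SELECT and collects them as candidate test queries. The regular expression, the function name mine_queries, and the sample application are all assumptions made for this example; the slides do not say how the actual tool extracts queries.

```python
import re

# Matches a quoted string literal whose content starts with SELECT (illustrative only).
SELECT_LITERAL = re.compile(r"""(["'])\s*(SELECT\b.*?)\1""", re.IGNORECASE | re.DOTALL)

def mine_queries(source_code):
    """Return the distinct SELECT statements found in string literals, in order."""
    seen = []
    for match in SELECT_LITERAL.finditer(source_code):
        query = " ".join(match.group(2).split())  # normalize whitespace
        if query not in seen:
            seen.append(query)
    return seen

# Hypothetical application code to mine.
sample_app = '''
def cheapest(conn):
    return conn.execute("SELECT MIN(price) FROM cars").fetchone()

def count_sale(conn):
    return conn.execute("SELECT COUNT(*) FROM cars WHERE price < 15000").fetchone()

def add(conn, price):
    conn.execute("INSERT INTO cars (price) VALUES (?)", (price,))
'''

mined = mine_queries(sample_app)
print(mined)
```

Only read-only queries are collected here, on the assumption that a test query should observe the data without changing it.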

Page 10: Preventing Data Errors with Continuous Testing

Timeliness

• Data testing is a runtime activity
• Administrators and users don’t troubleshoot unless something goes wrong
• Learning about an error sooner means the error has less impact

Page 11: Preventing Data Errors with Continuous Testing

Unobtrusive and precise user interface

• Administrators and users may not understand test outcomes
• Tests must integrate into workflow
• Results must link to
  – actions that caused failure, or
  – data values relevant to failure

Page 12: Preventing Data Errors with Continuous Testing

Example scenario: car dealership

• Manager wants to put cars between $10K and $15K on a 30%-off sale
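A minimal sketch of how a data test could catch a mistake in this scenario follows. It uses Python's built-in sqlite3 as a stand-in database (the prototype described in the talk targets PostgreSQL), and the table layout, prices, and test query are hypothetical. The test pairs a query with its expected result: the most expensive car should still cost $30,000 during the sale.

```python
import sqlite3

def run_test(conn, query, expected):
    """A data test passes when the query's single result equals the expected value."""
    actual = conn.execute(query).fetchone()[0]
    return actual == expected

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, price INTEGER)")
conn.executemany(
    "INSERT INTO cars (price) VALUES (?)",
    [(8000,), (12000,), (14000,), (30000,)],
)

# Hypothetical test: cars outside the sale range keep their price.
TEST_QUERY = "SELECT MAX(price) FROM cars"
EXPECTED = 30000

passes_before = run_test(conn, TEST_QUERY, EXPECTED)

# Intended update: 30% off only for cars between $10K and $15K.
conn.execute(
    "UPDATE cars SET price = price * 7 / 10 WHERE price BETWEEN 10000 AND 15000"
)
passes_after_intended = run_test(conn, TEST_QUERY, EXPECTED)

# Erroneous update: the WHERE clause is forgotten, so every price drops by 30%.
conn.execute("UPDATE cars SET price = price * 7 / 10")
passes_after_error = run_test(conn, TEST_QUERY, EXPECTED)

print(passes_before, passes_after_intended, passes_after_error)
```

Both updates produce well-formed prices, so only the test's knowledge of what the data should mean distinguishes the intended change from the erroneous one.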

Page 13: Preventing Data Errors with Continuous Testing

Car dealership challenges

• Data semantics
  – reducing all prices by 30% seems like a valid change
• Test query generation
  – manager doesn’t know about writing tests
• Timeliness
  – manager doesn’t know to run tests
  – reporting delay causes financial losses
• Unobtrusive and precise interface
  – problem cannot be reported as “test failed”

Page 14: Preventing Data Errors with Continuous Testing

Continuous Data Testing (CDT)

• Data semantics and test query generation
  – mines source code for queries
  – allows manually written and history-mined tests
• Timeliness
  – runs tests continuously
  – optimizations to trigger proper tests at proper times
• Unobtrusive and precise interface
  – delegates problem to system designer
  – highlights data involved in tests

Page 15: Preventing Data Errors with Continuous Testing

Making CDT possible (optimizations)

• NaïveCDT: run tests continuously
• SimpleCDT: only run tests after updates
• SmartCDT: only relevant tests after relevant changes
• SmartCDTTC: test compression
• SmartCDTIT: incremental test query execution
• SmartCDTTC+IT: all of the above
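The difference between the SimpleCDT and SmartCDT strategies can be sketched as a test-selection function. The sketch assumes that "relevant" means a test reads a table that an update touched; the slides do not define relevance, and the actual prototype may use a finer granularity. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataTest:
    name: str
    tables: frozenset  # tables the test query reads (assumed relevance criterion)

def simple_cdt(tests, updated_tables):
    """SimpleCDT-style selection: run every test, but only after some update."""
    return list(tests) if updated_tables else []

def smart_cdt(tests, updated_tables):
    """SmartCDT-style selection: run only tests that read an updated table."""
    return [t for t in tests if t.tables & updated_tables]

tests = [
    DataTest("max_price", frozenset({"cars"})),
    DataTest("staff_count", frozenset({"employees"})),
    DataTest("sales_per_car", frozenset({"cars", "sales"})),
]

after_car_update = frozenset({"cars"})
simple_selected = [t.name for t in simple_cdt(tests, after_car_update)]
smart_selected = [t.name for t in smart_cdt(tests, after_car_update)]
print(simple_selected)
print(smart_selected)
```

Under this assumption, an update to one table skips every test that cannot be affected by it, which is where the savings over running all tests would come from.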

Page 16: Preventing Data Errors with Continuous Testing

CDT effectiveness

How effective is CDT at preventing data entry errors?

Do false positives reduce CDT’s effectiveness?

How do CDT and integrity constraints affect data entry speed?

Page 17: Preventing Data Errors with Continuous Testing

Crowd study: Amazon Mechanical Turk

Data came from real-world spreadsheets from data.gov and the EUSES corpus; tests came from formulae in the spreadsheets.

96 distinct users were asked to copy numeric values into the corresponding cells.

Group 1: control
No highlighting. Submitting errors allowed.

Group 2: CDT
Data involved in failing tests highlighted. Submitting errors allowed.

Group 3: CDT with false positives
Data involved in failing tests and 40% extra data highlighted. Submitting errors allowed.

Group 4: integrity constraints
Data involved in failing integrity constraints highlighted. Submitting errors not allowed.
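The highlighting in the CDT group can be pictured with the sketch below. It assumes that each test comes from a SUM formula and checks a recorded total against the sum of its input cells; when the check fails, every cell involved is highlighted. The cell names, values, and the function cells_to_highlight are hypothetical and are not taken from the study materials.

```python
def cells_to_highlight(sheet, sum_tests):
    """Return the set of cells involved in any failing SUM test.

    sheet maps a cell name to its numeric value; sum_tests maps a total cell
    to the list of input cells whose sum it should equal.
    """
    highlighted = set()
    for total_cell, input_cells in sum_tests.items():
        expected = sum(sheet[c] for c in input_cells)
        if sheet[total_cell] != expected:
            highlighted.add(total_cell)
            highlighted.update(input_cells)
    return highlighted

# Hypothetical tests derived from two SUM formulae.
sum_tests = {"B4": ["B1", "B2", "B3"], "C4": ["C1", "C2", "C3"]}

# Column B was copied correctly; in column C the user typed 70 instead of 7.
sheet = {
    "B1": 10, "B2": 20, "B3": 30, "B4": 60,
    "C1": 5, "C2": 70, "C3": 9, "C4": 21,
}

to_highlight = sorted(cells_to_highlight(sheet, sum_tests))
print(to_highlight)
```

The user sees which cells to recheck rather than a bare "test failed" message, which matches the interface requirement stated earlier in the talk.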

Page 18: Preventing Data Errors with Continuous Testing

[Figure: bar chart of corrected errors (%) and time to correct (sec) for the control, integrity constraints, CDT, and CDTFP groups; axis runs from 0 to 100]

CDT, even with false positives, successfully prevented data errors

CDT much faster than integrity constraints

Page 19: Preventing Data Errors with Continuous Testing

CDT efficiency

CDT’s effect on performance of performance-intensive applications

[Figure: overhead (%) at update frequencies of 0.1/min, 0.2/min, 0.33/min, 1/min, and 2/min for NaïveCDT, SimpleCDT, SmartCDT, SmartCDTTC, SmartCDTIT, and SmartCDTTC+IT; overhead axis runs from 0 to 18]

More frequent updates = more overhead

Even with frequent updates, overhead is manageable

Page 20: Preventing Data Errors with Continuous Testing

Related work

• Generating tests for systems that use databases

Chays, Shahid, and Frankl. Query-based test generation for database applications. DBTest 2008.
Khalek and Khurshid. Systematic testing of database engines using a relational constraint solver. ICST 2011.
Li and Csallner. Dynamic symbolic database application testing. DBTest 2010.
Pan, Wu, and Xie. Database state generation via dynamic symbolic execution for coverage criteria. DBTest 2011.
Pan, Wu, and Xie. Generating program inputs for database application testing. ASE 2011.

Page 21: Preventing Data Errors with Continuous Testing

Related work

• Generating tests for systems that use databases
• Continuous testing

Saff and Ernst. Reducing wasted development time via continuous testing. ISSRE 2003.
Saff and Ernst. An experimental evaluation of continuous testing during development. ISSTA 2004.

Page 22: Preventing Data Errors with Continuous Testing

Related work

• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback

Katzan Jr. Batch, conversational, and incremental compilers. The American Federation of Information Processing Societies 1969.
Muşlu, Brun, Holmes, Ernst, and Notkin. Speculative analysis of integrated development environment recommendations. OOPSLA 2012.
Boekhoudt. The big bang theory of IDEs. Queue 2003.

Page 23: Preventing Data Errors with Continuous Testing

Related work

• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
• Errors in spreadsheets

Badame and Dig. Refactoring meets spreadsheet formulas. ICSM 2012.
Barowy, Gochev, and Berger. CheckCell: Data debugging for spreadsheets. OOPSLA 2014.
Hermans and Dig. BumbleBee: A refactoring environment for spreadsheet formulas. FSE Tool Demo 2014.

Page 24: Preventing Data Errors with Continuous Testing

Related work

• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
• Errors in spreadsheets
• Data cleaning

Culotta and McCallum. Joint deduplication of multiple record types in relational data. CIKM 2005.
Domingos. Multi-relational record linkage. In Workshop on Multi-Relational Data Mining 2004.
Dong, Halevy, and Madhavan. Reference reconciliation in complex information spaces. SIGMOD 2005.
Li, Tziviskou, Wang, Dong, Liu, Maurino, and Srivastava. Chronos: Facilitating history discovery by linking temporal records. PVLDB 2012.
Kashyap and Sheth. Semantic and schematic similarities between database objects: A context-based approach. The VLDB Journal 1996.
Hernandez and Stolfo. The merge/purge problem for large databases. SIGMOD 1995.

Page 25: Preventing Data Errors with Continuous Testing

Related work

• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
• Errors in spreadsheets
• Data cleaning
• Understanding errors in databases

Meliou, Gatterbauer, Moore, and Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB 2010.
Meliou, Roy, and Suciu. Causality and explanations in databases. PVLDB 2014.
Wang, Dong, and Meliou. Data X-Ray: A diagnostic tool for data errors. SIGMOD 2015.
Khoussainova, Balazinska, and Suciu. Towards correcting input data errors probabilistically using integrity constraints. ACM International Workshop on Data Engineering for Wireless and Mobile Access 2006.

Page 26: Preventing Data Errors with Continuous Testing

Related work

• Generating tests for systems that use databases
• Continuous testing
• Effects of continuous feedback
• Errors in spreadsheets
• Data cleaning
• Understanding errors in databases
• A truckload of automated test generation work

Page 27: Preventing Data Errors with Continuous Testing

Contributions

• Four challenges of data testing:
  1. data semantics
  2. test generation
  3. timeliness
  4. interface
• Continuous Data Testing prototype for PostgreSQL
• Optimizations for which tests to run when
• CDT efficient and effective at preventing errors, even when low-quality tests result in false positives

https://bitbucket.org/ameli/contest