TRANSCRIPT
Preventing Data Errors with Continuous Testing
Kıvanç Muşlu (University of Washington), Yuriy Brun, Alexandra Meliou (University of Massachusetts)
Data discrepancies
[Slide: different sources report conflicting times for the same event: 6:15 PM, 6:22 PM, 8:33 PM, 9:40 PM, 9:54 PM]
slide credit: Luna Dong and Divesh Srivastava
“Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?”
Charles Babbage, from Passages from the Life of a Philosopher
Dealing with software errors
• program analysis
• language features
• code reviews
• formal verification
… and testing
Key idea: Testing for data-intensive systems
• Identifying system failures caused by well-formed but incorrect data
• Using application-specific execution information
• Integrating into the system usage workflow
Bringing testing to the data domain
How is data testing different?
• Data semantics
• Test query generation
• Timeliness
• Unobtrusive and precise interface
System administrators and users are not software engineers
Data semantics
• Valid data can span multiple orders of magnitude
– thwarts statistical outlier detection
• Semantics of nearly identical data differ vastly: 617, 616, 339
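The first point can be made concrete with a small sketch (not from the talk; the population values and the z-score threshold are illustrative assumptions): when valid values legitimately span several orders of magnitude, a standard z-score outlier test never flags a plausible-looking but wrong entry.

```python
# Sketch: why statistical outlier detection is thwarted when valid data
# spans multiple orders of magnitude. All values here are made up.
import statistics

# A data-entry typo drops a zero from one value (300,000 became 30,000),
# yet the wrong value sits comfortably inside the valid range.
corrupted = [1_200, 8_400, 55_000, 30_000, 2_700_000, 8_500_000]

def z_score_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Nothing is flagged: the huge legitimate spread inflates the standard
# deviation, so the erroneous 30,000 looks perfectly normal.
print(z_score_outliers(corrupted))  # []
```

The erroneous value is statistically unremarkable, which is why CDT relies on application-specific tests rather than generic outlier detection.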
Test query generation
• Administrators and users have not (yet) bought into testing
• Manually written tests will come
– developers can ship these with the application
• But automatic generation is needed now
– mine queries from code
– record historical queries
– adapt related work on database-use test generation
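One way to read "record historical queries" is as regression-style data tests: a query observed during normal operation, paired with its answer at recording time, becomes a test that the answer still holds. A minimal sketch of that idea, using sqlite3 and an invented `cars` table (this is an assumed design, not the paper's implementation):

```python
# Sketch (assumed design): a recorded historical query plus its answer
# at recording time becomes a regression-style data test.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO cars (price) VALUES (?)",
                 [(9_000,), (12_000,), (14_500,), (20_000,)])

# "Mined" from the application's query history:
recorded_query = "SELECT COUNT(*) FROM cars WHERE price BETWEEN 10000 AND 15000"
expected = conn.execute(recorded_query).fetchone()[0]  # answer when recorded

def run_data_test(connection):
    """Re-run the recorded query and check it still gives the same answer."""
    actual = connection.execute(recorded_query).fetchone()[0]
    return actual == expected

print(run_data_test(conn))  # True: the data still matches the recorded answer
```

Such tests require no test-writing effort from administrators, which is the point of the bullet above: automatic generation fills the gap until hand-written tests arrive.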
Timeliness
• Data testing is a runtime activity
• Administrators and users don't troubleshoot unless something goes wrong
• Learning about an error sooner means the error has less impact
Unobtrusive and precise user interface
• Administrators and users may not understand test outcomes
• Tests must integrate into workflow
• Results must link to
– actions that caused failure, or
– data values relevant to failure
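The last bullet can be sketched as a test that reports the offending rows rather than a bare pass/fail verdict (a hypothetical check, not the tool's actual interface; the table, prices, and sanity floor are invented):

```python
# Sketch (hypothetical interface): a failing data test reports the concrete
# rows it examined, instead of an opaque "test failed".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO cars (price) VALUES (?)",
                 [(5_600.0,), (8_400.0,), (17_500.0,)])

def check_min_price(connection, floor=6_000):
    """Fail if any price fell below a sanity floor, returning the
    offending rows so the user can see exactly which data is suspect."""
    rows = connection.execute(
        "SELECT id, price FROM cars WHERE price < ?", (floor,)).fetchall()
    return ("pass", []) if not rows else ("fail", rows)

status, offending = check_min_price(conn)
print(status, offending)  # fail [(1, 5600.0)]
```

Surfacing the relevant data values is what lets non-engineers act on a failure.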
Example scenario: car dealership
• Manager wants to put cars between $10K and $15K on a 30%-off sale
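The failure mode this scenario targets can be reproduced in a few lines (a sketch with an invented schema and prices, using sqlite3 in place of a real dealership database): the intended update has a price range, but the erroneous one omits the WHERE clause, and the database happily accepts it because every resulting value is well-formed.

```python
# Sketch of the dealership scenario: a well-formed but incorrect UPDATE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO cars (price) VALUES (?)",
                 [(8_000,), (12_000,), (14_000,), (25_000,)])

# Intended: 30% off only for cars priced between $10K and $15K, i.e.
#   UPDATE cars SET price = price * 0.7 WHERE price BETWEEN 10000 AND 15000
# Actual: the WHERE clause is forgotten, so EVERY price drops by 30%.
conn.execute("UPDATE cars SET price = price * 0.7")

# Each resulting price is still a perfectly valid number, so integrity
# constraints alone have nothing to complain about.
prices = [row[0] for row in conn.execute("SELECT price FROM cars ORDER BY id")]
print(prices)  # [5600.0, 8400.0, 9800.0, 17500.0]
```

This is exactly the "well-formed but incorrect data" that the key-idea slide says CDT is meant to catch.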
Car dealership challenges
• Data semantics: reducing all prices by 30% seems like a valid change
• Test query generation: the manager doesn't know about writing tests
• Timeliness: the manager doesn't know to run tests; reporting delay causes financial losses
• Unobtrusive and precise interface: the problem cannot be reported as "test failed"
Continuous Data Testing (CDT)
• Data semantics and test query generation: mines source code for queries; allows manually written and history-mined tests
• Timeliness: runs tests continuously; optimizations trigger the proper tests at the proper times
• Unobtrusive and precise interface: delegates the problem to the system designer; highlights data involved in tests
Making CDT possible (optimizations)
• NaïveCDT: run tests continuously
• SimpleCDT: only run tests after updates
• SmartCDT: only relevant tests after relevant changes
• SmartCDTTC: test compression
• SmartCDTIT: incremental test query execution
• SmartCDTTC+IT: all of the above
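The core of the SmartCDT idea, selecting only the tests relevant to a change, can be sketched as a mapping from each test to the tables its query reads (a deliberately simplified assumption; the real tool works at the level of queries over a PostgreSQL database, and the test names and tables below are invented):

```python
# Sketch (assumed simplification of SmartCDT): after an update, run only
# the tests whose queries read the tables that the update touched.

# Each test name maps to the set of tables its query reads.
tests = {
    "avg_price_sane": {"cars"},
    "orders_balance": {"orders", "payments"},
    "stock_nonneg":   {"inventory"},
}

def relevant_tests(updated_tables):
    """Select the tests whose read sets overlap the updated tables."""
    return sorted(name for name, reads in tests.items()
                  if reads & updated_tables)

# An UPDATE on `cars` triggers only the one test that reads `cars`;
# everything else is skipped, which is where the overhead savings come from.
print(relevant_tests({"cars"}))      # ['avg_price_sane']
print(relevant_tests({"payments"}))  # ['orders_balance']
```

Test compression and incremental query execution then reduce the cost of the tests that do run.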
CDT effectiveness
How effective is CDT at preventing data entry errors?
Do false positives reduce CDT's effectiveness?
CDT's and integrity constraints' effect on data entry speed
Data came from real-world spreadsheets from data.gov and the EUSES corpus; tests from formulae in the spreadsheets.
96 distinct users were asked to copy numeric values into the corresponding cells.
Crowd study: Amazon Mechanical Turk
Group 1: control. No highlighting; submitting errors allowed.
Group 2: CDT. Data involved in failing tests highlighted; submitting errors allowed.
Group 3: CDT with false positives. Data involved in failing tests plus 40% extra data highlighted; submitting errors allowed.
Group 4: integrity constraints. Data involved in failing integrity constraints highlighted; submitting errors not allowed.
[Chart: corrected errors (%) and time to correct (sec) for control, integrity constraints, CDT, and CDT with false positives]
CDT, even with false positives, successfully prevented data errors
CDT much faster than integrity constraints
CDT efficiency
[Chart: overhead (%) vs. update frequency (0.1/min to 2/min) for NaïveCDT, SimpleCDT, SmartCDT, SmartCDTTC, SmartCDTIT, and SmartCDTTC+IT; more frequent updates mean more overhead]
Even with frequent updates, overhead is manageable
CDT’s effect on performance of performance-‐intensive applicaVons
Related work
• Generating tests for systems that use databases
– Chays, Shahid, and Frankl. Query-based test generation for database applications. DBTest 2008.
– Khalek and Khurshid. Systematic testing of database engines using a relational constraint solver. ICST 2011.
– Li and Csallner. Dynamic symbolic database application testing. DBTest 2010.
– Pan, Wu, and Xie. Database state generation via dynamic symbolic execution for coverage criteria. DBTest 2011.
– Pan, Wu, and Xie. Generating program inputs for database application testing. ASE 2011.
• Continuous testing
– Saff and Ernst. Reducing wasted development time via continuous testing. ISSRE 2003.
– Saff and Ernst. An experimental evaluation of continuous testing during development. ISSTA 2004.
• Effects of continuous feedback
– Katzan Jr. Batch, conversational, and incremental compilers. The American Federation of Information Processing Societies 1969.
– Muslu, Brun, Holmes, Ernst, and Notkin. Speculative analysis of integrated development environment recommendations. OOPSLA 2012.
– Boekhoudt. The big bang theory of IDEs. Queue 2003.
• Errors in spreadsheets
– Badame and Dig. Refactoring meets spreadsheet formulas. ICSM 2012.
– Barowy, Gochev, and Berger. CheckCell: Data debugging for spreadsheets. OOPSLA 2014.
– Hermans and Dig. BumbleBee: A refactoring environment for spreadsheet formulas. FSE Tool Demo 2014.
• Data cleaning
– Culotta and McCallum. Joint deduplication of multiple record types in relational data. CIKM 2005.
– Domingos. Multi-relational record linkage. Workshop on Multi-Relational Data Mining 2004.
– Dong, Halevy, and Madhavan. Reference reconciliation in complex information spaces. SIGMOD 2005.
– Li, Tziviskou, Wang, Dong, Liu, Maurino, and Srivastava. Chronos: Facilitating history discovery by linking temporal records. PVLDB 2012.
– Kashyap and Sheth. Semantic and schematic similarities between database objects: A context-based approach. The VLDB Journal 1996.
– Hernandez and Stolfo. The merge/purge problem for large databases. SIGMOD 1995.
• Understanding errors in databases
– Meliou, Gatterbauer, Moore, and Suciu. The complexity of causality and responsibility for query answers and non-answers. PVLDB 2010.
– Meliou, Roy, and Suciu. Causality and explanations in databases. PVLDB 2014.
– Wang, Dong, and Meliou. Data X-Ray: A diagnostic tool for data errors. SIGMOD 2015.
– Khoussainova, Balazinska, and Suciu. Towards correcting input data errors probabilistically using integrity constraints. ACM International Workshop on Data Engineering for Wireless and Mobile Access 2006.
• A truckload of automated test generation work
Contributions
• Four challenges of data testing:
1. data semantics
2. test generation
3. timeliness
4. interface
• Continuous Data Testing prototype for PostgreSQL
• Optimizations for which tests to run when
• CDT efficient and effective at preventing errors, even when low-quality tests result in false positives
https://bitbucket.org/ameli/contest