topes: enabling end-user programmers to validate and reformat data christopher scaffidi committee:...

68
Topes: Enabling End-User Topes: Enabling End-User Programmers to Validate and Programmers to Validate and Reformat Data Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair) Institute for Software Research, Carnegie Mellon University Sebastian Elbaum Computer Science & Engineering, University of Nebraska-Lincoln Jim Herbsleb Institute for Software Research, Carnegie Mellon University

Post on 20-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

Topes: Enabling End-User Topes: Enabling End-User Programmers to Validate and Reformat Programmers to Validate and Reformat

DataData

Christopher Scaffidi

Committee:

Mary Shaw (chair) Institute for Software Research, Carnegie Mellon University

Sebastian Elbaum Computer Science & Engineering, University of Nebraska-Lincoln

Jim Herbsleb Institute for Software Research, Carnegie Mellon University

Brad Myers Human-Computer Interaction Institute, Carnegie Mellon University

Page 2: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

22

Target populationTarget population

• In 2012, there will be 90 million computer end users in American workplaces.

• Of these, at least 55 million will create spreadsheets, databases, web applications, or other programs.– Spreadsheets for computing budgets– Spreadsheets and databases for storing information– Web applications for collecting data from coworkers

And similar programs for automating a wide range of tedious or error-prone work tasks.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 3: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

33

Contextual inquiry:Contextual inquiry:What are the problems of end users?What are the problems of end users?

Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs, each)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 4: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

44

Lots of manual labor—Lots of manual labor—validating and reformatting stringsvalidating and reformatting strings

• Building a staff roster, merging data from web sites:– Had to scrutinize data to identify questionable values

(e.g.: CMU campus phone numbers are usually 268-xxxx but 269-xxxx might be right)

– Had to manually transform data to consistent format(e.g.: Put person names in Lastname, Firstname format)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 5: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

55

Another person’s task: validate web forms--Another person’s task: validate web forms--but he didn’t know JavaScript / regexpsbut he didn’t know JavaScript / regexps

Introduction Requirements Topes Tools Evaluation Conclusion

Page 6: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

66

Collaborations of programmers withCollaborations of programmers withwidely varying skills, interests, concernswidely varying skills, interests, concerns

• Interviewing creators of Hurricane Katrina “person locator” sites (helping survivors publish their status)

• 4 managers in IT firms, 1 student, 1 graphic designer

– 2 people each created a site on their own– 4 people collaborated with other programmers

(principally on site aggregation)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 7: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

77

Hurricane Katrina “Person Locator” site:Hurricane Katrina “Person Locator” site:Many inputs unvalidatedMany inputs unvalidated

Introduction Requirements Topes Tools Evaluation Conclusion

Page 8: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

88

Hurricane Katrina sites are not alone in Hurricane Katrina sites are not alone in lacking input validation.lacking input validation.

• Eg: Google Base web application–13 primary web forms –Even numeric fields accept unreasonable inputs (such as a salary of “-45”)

• If professional programmers can’t get this right, then it’s unsurprising that those 90 million end users also have so much trouble.

So many unvalidated inputs. So many data errors. So much time to find mistakes. So many millions of people laboriously reformatting data by hand.

We need a better way!

Introduction Requirements Topes Tools Evaluation Conclusion

Page 9: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

99

OutlineOutline

1. Requirements for a better model

2. Topes• Model for describing data• Tools for creating/using topes

3. Evaluations

4. Conclusion

Introduction Requirements Topes Tools Evaluation Conclusion

Page 10: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1010

Underlying problem: abstraction mismatchUnderlying problem: abstraction mismatch

• Tools support strings, integers, floats, maybe dates.

• Problem domain involves higher-level data categories:

– Person names “Scaffidi, Chris”, “Chris Scaffidi”

– CMU phone numbers “8-1234”, “x8-1234”

– CMU room numbers “WeH 4623”, “Wean 4623”

Introduction Requirements Topes Tools Evaluation Conclusion

Page 11: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1111

Approach: Create a new abstraction for Approach: Create a new abstraction for each category of dataeach category of data

• Like software “libraries,” implementations of these abstractions could be reused in many programs.

• Abstractions would need to include functions for:– Recognizing instances of the category

(for automating data validation)

– Transforming instances among various formats

(for automating data reformatting)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 12: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1212

1. Identify valid, invalid, and 1. Identify valid, invalid, and questionable valuesquestionable values

• Data is sometimes questionable… yet valid.– E.g.: an unusually long email address– In practice, person names and other proper nouns are never

validated with regexps… too brittle.– Life is full of corner cases and exceptions.

• If code can identify questionable data, then it can double-check the data:– Ask an application end user to confirm the input– Flag the input for checking by a system administrator– Compare the value to a list of known exceptions– Call up a server and see if it can confirm the value

Introduction Requirements Topes Tools Evaluation Conclusion

Page 13: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1313

2. Capture reformatting rules2. Capture reformatting rules

• Two different strings can be equivalent.– What if an end user types a date in the wrong format?– “Jan-3-2007” and “1/3/2007” mean the same thing because of

the category that they are in: date.– Sometimes the interpretation is ambiguous. In real life,

preferences and experience guide interpretation.

• If code can transform among formats, then it can put data in an unambiguous format as needed.– Display result so users can check/fix interpretation

Introduction Requirements Topes Tools Evaluation Conclusion

Page 14: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1414

3. User-extensibility3. User-extensibility

• Many kinds of data are organization-specific

• But users at those organizations know what the data values mean—take advantage of what they know…

• Users can describe the constrained parts of data.– Eg: CMU room numbers, “EDSH 303”, have a building name

and an internal room number– Valid data obeys intra- and inter-part constraints.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 15: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1515

4. Reusability across programming 4. Reusability across programming environments (“platforms”)environments (“platforms”)

• Validity does not depend on whether the string is in a spreadsheet or a webform or a database

• To validate a kind of data, people don’t want to write– JavaScript for webforms on the client side– C#/Java/PHP for webforms on the server side– Stored procedures for databases– VBScript for spreadsheets

Introduction Requirements Topes Tools Evaluation Conclusion

Page 16: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1616

Limitations of existing approachesLimitations of existing approaches

Types do not support questionable values

Grammars (eg: regexps, CFGs, Lapis) do not either, and cannot reformat

Tools to integrate heterogeneous databases require a professional DBA and are specific to database systems (ie: not spreadsheets, webforms, etc).

Cues, Forms/3, -calculus, Slate, etc, infer numerical constraints but not constraints on strings, and they are tied to specific programming platforms

Information extraction algorithms rely on grammatical cues that are absent during validation

Introduction Requirements Topes Tools Evaluation Conclusion

Page 17: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1717

TopesTopes

• A “tope” = a platform-independent abstraction that describes how to recognize and reformat instances of a data category

• Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain

Introduction Requirements Topes Tools Evaluation Conclusion

Page 18: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1818

A tope is a graph.A tope is a graph.Node = format, edge = transformationNode = format, edge = transformation

Notional representation for a CMU room number tope…

Formal building name& room number

Elliot Dunlap Smith Hall 225

Colloquial building name& room number

Smith 225

Building abbreviation& room number

EDSH 225

Introduction Requirements Topes Tools Evaluation Conclusion

Page 19: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

1919

A tope has functions for recognizing and A tope has functions for recognizing and transforming instances of a data categorytransforming instances of a data category• Each tope implementation has executable functions:

– 1 isa:string[0,1] function per format, for recognizing instances of the format (a fuzzy set)

– 0 or more trf:stringstring functions linking formats, for transforming values from one format to another

• Validation function:(str) = max(isaf(str))where f ranges over tope’s formats– Valid when (str) = 1– Invalid when (str) = 0– Questionable when 0 < (str) < 1

Introduction Requirements Topes Tools Evaluation Conclusion

Page 20: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2020

Common kinds of topes:Common kinds of topes:enumerations and proper nouns enumerations and proper nouns

• Multi-format Enumerations, e.g: US states– “New York”, “CA”, maybe “Guam”

• Open-set proper nouns, e.g.: company names– Whitelist of definitely valid names (“Google”), with

alternate formats (e.g. “Google Corp”, “GOOG”)– Augmented with a pattern for promising inputs that

are not yet on the whitelist

Introduction Requirements Topes Tools Evaluation Conclusion

Page 21: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2121

Two other common kinds of topes:Two other common kinds of topes:numeric and hierarchicalnumeric and hierarchical

• Numeric, e.g.: human masses– Numeric and in a certain range– Values slightly outside range might be questionable– Sometimes labeled with an explicit unit– Transformation usually by multiplication

• Hierarchical, e.g.: address lines– Parts described with other topes (e.g.: “100 Main St.”

uses a numeric, a proper noun, and an enum)– Simple isas can be implemented with regexps.– Transformations involve permutation of parts, lookup

tables, and changes to separators & capitalization.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 22: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2222

Role of good tool supportRole of good tool support

• Some simple isa functions could be implemented as– Enumerations– Regular expressions / formal grammars

• But for many topes, we also need to support questionable values and reformatting

• And usability can almost always be improved by tailoring the tools to the problem domain– Integrate with users’ familiar tools– Match the user interface to the problem’s structure

Introduction Requirements Topes Tools Evaluation Conclusion

Page 23: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2323

Topes in actionTopes in action

1. Users create data descriptions (abstract, user-friendly descriptions of data categories)

2. Users publish data descriptions on repositories.

3. Other users download data descriptions to cache.

4. System automatically generates tope implementations from data descriptions.

5. Tool add-ins help users browse their cache and associate topes with variables and input fields.

6. Add-ins get topes from local cache and call them at runtime to validate and reformat data.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 24: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2424

What the user seesWhat the user sees

Introduction Requirements Topes Tools Evaluation Conclusion

User highlights cellsClicks “New” button on our Validation toolbar

Page 25: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2525

System infers a boilerplate topeSystem infers a boilerplate topeand presents it for review and customizationand presents it for review and customization

Induction steps:1. Identify number & word parts2. Align parts based on punctuation3. Infer simple constraints on parts

2525 Introduction Requirements Topes Tools Evaluation Conclusion

Page 26: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2626

User gives names to the partsUser gives names to the partsand edits constraintsand edits constraints

Features• Part names• Value whitelists• Testing features• Soft constraints (never / rarely / often / almost always / always)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 27: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2727

System identifies typosSystem identifies typos

Introduction Requirements Topes Tools Evaluation Conclusion

Features• Targeted messages• Overridable• Filterable• Can add to “whitelist”• Integrated with Excel’s “reviewing” functionality

Checking inputs1. Convert description to CFG w/

constraints on productions2. Parse each input string3. For each constraint violation,

downgrade parse’s isa score

Page 28: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2828

Easy access to reformatting functionalityEasy access to reformatting functionality

Introduction Requirements Topes Tools Evaluation Conclusion

Reformatting string1. Parse with input format’s CFG2. For each part in target format,

a) Get node from parse treeb) Reformat node if needed (recurse)c) Concatenate (with separators if needed)

3. Validate result with target format’s CFG

Page 29: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

2929

Recommending topes based on label and Recommending topes based on label and examples-to-matchexamples-to-match

Introduction Requirements Topes Tools Evaluation Conclusion

Efficient recommendation• Only consider a tope if its instances could possibly have the “character content” of each example string.(eg.: could this have 12 letters & 1 space?)

Page 30: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3030

Search repository by Search repository by label and/or exampleslabel and/or examples

Note: many repositories will be organization-specific

Introduction Requirements Topes Tools Evaluation Conclusion

Page 31: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3131

Integration with Visual Studio.NETIntegration with Visual Studio.NET

Introduction Requirements Topes Tools Evaluation Conclusion

Features• Targeted messages• Overridable• Drag & drop code generation

Page 32: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3232

Other integrations to date:Other integrations to date:CoScripter, Robofox, XML/HTML libraryCoScripter, Robofox, XML/HTML library

Introduction Requirements Topes Tools Evaluation Conclusion

Page 33: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3333

Other integration underwayOther integration underway

Introduction Requirements Topes Tools Evaluation Conclusion

• RedRover– Spreadsheet auditing– They already support formula auditing– Goal: Using topes for checking strings

• LogicBlox– Decision-support– Helping users enter data & make decisions from it– Goal: Using topes for validating data– Goal: Using topes for data de-duplication

Page 34: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3434

EvaluationEvaluation

• Many evaluations rely on the EUSES Spreadsheet Corpus (collected by Univ. Nebraska) – In particular, 4250 spreadsheet columns that

contained at least 20 strings

• These evaluations generally use the F1 statistic as a measure of accuracy1. Get strings from the corpus

2. Manually validate the strings

3. Automatically validate the strings (eg: with topes)

4. Compute F1 to check agreementF1 = precision * recall / ( (precision + recall)/2 )

Introduction Requirements Topes Tools Evaluation Conclusion

Page 35: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3535

Evaluating accuracyEvaluating accuracy

• Implemented topes for spreadsheet data– Created 32 topes for the most common categories

• Covering 1199 columns, which was ~69% of the 1713 categorized columns, or ~28% of all 4250 columns

• Up to 5 formats per tope

– Compared to current practice• Validate w/ tope, simulate asking user on questionable inputs, F1=0.7• Validate w/ regexps or enumerations if available, but accept all inputs

when no regexp or enumeration is available, F1=0.19

– Tope-based validation was 3 times as accurate• Big benefit from supporting multi-format topes• Moderate benefit from validating currently-unvalidated categories • Small benefit from double-checking questionable values

(~ 3% of inputs)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 36: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3636

Evaluating reusabilityEvaluating reusability

• Reused spreadsheet-based topes on webform data– Downloaded data for 8 data categories on

Google Base and 5 in Hurricane Katrina website– Reused spreadsheet-based topes on the web data– Validation was even more accurate than on

spreadsheets

• F1=0.75 for Google Base, 0.92 for Hurricane Katrina• Website data had less formatting diversity than spreadsheets

Introduction Requirements Topes Tools Evaluation Conclusion

Page 37: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3737

Evaluating support for data cleaningEvaluating support for data cleaning

• Used topes to put web data into consistent formats– Again with the 5 columns in Hurricane Katrina website– Used transformation functions to put each string into

the most common format for that data category– Increased number of duplicate strings found by 10%

Introduction Requirements Topes Tools Evaluation Conclusion

Page 38: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3838

Evaluating usability for data validationEvaluating usability for data validation

• Users validating data with single-format topes– Between-subjects lab study– 8 users validated spreadsheet data with our tools;

for comparison, 8 users validated with Lapis patterns– Yes/no validation tasks (no questionable data)– Our tool users vs Lapis users

• Found three times as many typos (comparable F1 scores)• Were twice as fast• Reported significantly higher user satisfaction

– Our tool users vs users in earlier regexp study• Faster & more accurate

(Similar but not identical tasks: not statistically comparable)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 39: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

3939

Evaluating usability for data reformattingEvaluating usability for data reformatting

• Users reformatting data with multi-format topes– Within-subjects lab study– 9 users reformatted spreadsheet data by creating &

using topes; for comparison, they then did it manually– Effort of creating a tope “pays off” at only 47 strings

(further reuse is essentially “free”)– Every participant strongly preferred using our tools

instead of doing tasks manually

Introduction Requirements Topes Tools Evaluation Conclusion

Page 40: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4040

Evaluating tope recommendationsEvaluating tope recommendations

• Quickly recommend existing tope for data at hand– Supports keyword-based search + search-by-match

(eg: topes that match “888-555-1212”)– Evaluated by searching through topes for the 32 most

common data categories in EUSES spreadsheet corpus, using strings from corpus

– High accuracy: Recall over 80% (result set size = 5)– Adequate speed: User is likely to have a few dozen

topes on computer, taking under 1 sec to search

Introduction Requirements Topes Tools Evaluation Conclusion

Page 41: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4141

Closing the mismatch between Closing the mismatch between data abstractions and the real worlddata abstractions and the real world

• People often work with strings that are possibly-questionable instances of multi-format categories.

• These categories are application-agnostic and often common to many people.

• By capturing rules for validating and reformatting strings (including distinguishing questionable strings and multiple formats), topes…– Increase the accuracy of validation– Help users to accomplish validation and reformatting

activities quickly and effectively – Improve the reusability of validation code

Introduction Requirements Topes Tools Evaluation Conclusion

Page 42: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4242

Thank You…Thank You…

• To my committee and the entire EUSES Consortium for helpful suggestions

• To NSF for funding

Introduction Requirements Topes Tools Evaluation Conclusion

Page 43: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4343

ReferencesReferences

For more information on end users and topes- End users’ counts and needs: VL/HCC’05, VL/HCC’07- Topes model: ICSE’08- Format inferrence: ICEIS’07- Integration with other systems: WEUSE’08 & FSE’08- Our latest tools + usability validation: ISEUD’09 & IUI’09

For more information on some related work- Dependent types, eg: X. Ou, Dynamic Typing with Dependent Types, Tech Rpt TR-695-04, Princeton Univ, 2004

- Regexp induction, eg: K. Lerman, S. Minton. Learning the Common Structure of Data, Proc. AAAI, 2000.

- Lapis system: R. Miller, Lightweight structure in text, Tech Rpt CMU-CS-02-134, Carnegie Mellon Univ., 2002.

- SWYN regexp editor: A. Blackwell, See What You Need: Helping End-users to Build Abstractions, JVLC, 2001.

- Federated databases, eg: A. Sheth, J. Larsen, Federated database systems for managing distributed, heterogeneous, and autonomous databases, CSUR, 1990.

- ETL Tools, eg: E. Rahn, H. Do, Data Cleaning: Problems and Current Approaches, IEEE Data Eng. Bulletin, 2000.

- Potter’s Wheel: V. Raman, J. Hellerstein, Potter's Wheel: An Interactive Data Cleaning System, VLDB, 2001.

- Forms/3 : M. Burnett et al, End-user software engineering with assertions in the spreadsheet paradigm, ICSE, 2003.

- -calculus: M. Erwig, M. Burnett, Adding Apples and Oranges. Symp. Practical Aspects of Declarative Lang., 2002.

- Named entities, eg: Message Understanding Conference series.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 44: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4444

Professional programmers use lots of tricks Professional programmers use lots of tricks to simplify validation code. Eg: njtransit.comto simplify validation code. Eg: njtransit.com

Split inputs into many easy-to-validate fields.Who cares if the user has to type tabs now,or if he can’t just copy-paste into one field?

Make users pick from drop-downs.Who cares if it’s faster for users to type

“NJ” or “1/2007”?(Disclaimer: drop-downs sometimes are good!)

I implemented this site in 2003.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 45: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4545

Even with these tricks, writing validation is Even with these tricks, writing validation is still very time-consuming.still very time-consuming.

Overall, the site had over 1100 lines of JavaScript

just for validation….Plus equivalent server-side Java code (too bad code

isn’t platform-independent)

if (!rfcCheckEmail(frm.primaryemail.value)) return messageHelper(frm.primaryemail, "Please enter a valid Primary Email address.");var atloc = frm.primaryemail.value.indexOf('@');if (atloc > 31 || atloc < frm.primaryemail.value.length-33) return messageHelper(frm.primaryemail, "Sorry. You may only enter 32 characters or less for your email name\r\n”+ ”and 32 characters or less for your email domain (including @).");

Introduction Requirements Topes Tools Evaluation Conclusion

Page 46: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4646

That was worst case.That was worst case.Best case: reusable regexps.Best case: reusable regexps.

• Many IDEs allow the programmer to enter oneregular expression for validating each input field.– Usually, this drastically reduces the amount of code,

since most validation ain’t fancy.– Yet programmers don’t validate most inputs.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 47: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4747

Users’ spreadsheets are rife with Users’ spreadsheets are rife with formatting inconsistencies & other typosformatting inconsistencies & other typos

In one study by Univ Nebraska, nearly 40% of spreadsheet cell values were strings (not numbers or dates).

Part of an actual spreadsheet on Carnegie Mellon’s public web site

Introduction Requirements Topes Tools Evaluation Conclusion

Page 48: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4848

Evaluating expressivenessEvaluating expressiveness

• Implemented topes for common webform inputs– Instrumented web browsers of 4 administrative

assistants for 3 weeks– Logged strings that they typed into forms – in a

regexp-masked format e.g.: [email protected] [a-z]{4}[0-9]@[A-Z]{3}.[A-Z]{3}

– Also logged strings nearby to textfields– Semi-automatically grouped strings into categories

e.g.: project number, expense type, email address, zip code

– Implemented 14 most common topes– Found 22 probable typos in user inputs (0.5%)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 49: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

4949

Tope Development Environment (TDE)Tope Development Environment (TDE)

Introduction Requirements Topes Tools Evaluation Conclusion

User’s finished data description

Strings from applications

Stringsfrom

applications

Boilerplate data description

Error messages and reformatted strings

Tope implementation and data description

Error messages andreformatted strings

Data description (download)

Data description (upload)

End user applications Microsoft Excel – spreadsheets Visual Studio.NET – web forms Robofox – web macros Vegemite/Co-Scripter – web macros

...

Topei

Toped++ Topeg

Add-ins

Remote repositories

Local repository

Page 50: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5050

As a tool builder, what do I have to do so As a tool builder, what do I have to do so that people can use topes in my tool?that people can use topes in my tool?

You need to make an add-in1. Figure out what kind of fields you want to help your

users validate/reformat(eg: spreadsheets’ cells; webforms’ textboxes)

2. Download our open source C# or Java API (library)

3. In your tool’s UI, add buttons and other widgets so user can select a tope for the fields; in your event handler, call our API methods

4. At runtime, pass field’s value (a string) to our API methods to validate or reformat strings

5. Display validation error messages; update value in UI

Introduction Requirements Topes Tools Evaluation Conclusion

Page 51: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5151

Recognizing multiple formats and Recognizing multiple formats and questionable inputs raises accuracyquestionable inputs raises accuracy

Condition 4: Hypothetical user has to help on ~ 3% of inputs

Condition 1: Recall = 0 (fails to identify any invalid inputs)

Introduction Requirements Topes Tools Evaluation Conclusion

Page 52: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5252

User’s finished data description

Strings from applications

Stringsfrom

applications

Boilerplate data description

Error messages and reformatted strings

Tope implementation and data description

Error messages and

reformatted strings

Data description (download)

Data description (upload)

End user applications

Microsoft Excel – spreadsheets

Visual Studio.NET – web forms

Robofox – web macros

Vegemite/Co-Scripter – web macros

...

Topei

Toped++ Topeg

Add-ins

Remote repositories

Local repository

Page 53: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5353

Imagine a world where…Imagine a world where…

• Code can ask an oracle, “Is this a person name?”, and the oracle replies yes, no, almost definitely, probably not, and other shades of gray.

• Code allows input in any reasonable format, since the code can ask the oracle to put the input into the format that is actually needed.

• Regardless of whether they are working in spreadsheets, webforms, or other programming environment, end users can teach the oracle about a new data category by concisely stating its parts and constraints.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 54: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5454

Data errors reduce the usefulness of data.Data errors reduce the usefulness of data.

Even little typos impede data de-duplication.

Age is not useful for flying my helicopter to come rescue you.

Nor is a “city name” with 1 letter.

Introduction Requirements Topes Tools Evaluation Conclusion

Page 55: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5555

Page 56: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5656

3

14

55

90

0 10 20 30 40 50 60 70 80 90 100

professionalprogrammers

people who say they"do programming"

spreadsheet /database users

computer users

Projected population size (millions)

Page 57: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5757

Page 58: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5858

Page 59: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

5959

Page 60: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6060

A Word-like part that almost always contains 1-6 words that each always have 1-8 lowercase letters per word and only hyphens or ampersands between words:

#PART : #WORDLIST : COUNT(#WORD)>=1 && COUNT(#WORD)<=6 {90}#WORDLIST : #WORD | #WORD #SEP #WORDLIST#WORD : #CHLIST : COUNT(#CH)>=1 && COUNT(#CH)<=8 {100}#CHLIST : #CH | #CH #CHLIST #CH : a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z#SEP : - | &

Page 61: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6161

Page 62: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6262

Page 63: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6363

Page 64: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6464

Page 65: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6565

Page 66: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6666

Page 67: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6767

Page 68: Topes: Enabling End-User Programmers to Validate and Reformat Data Christopher Scaffidi Committee: Mary Shaw (chair)Institute for Software Research, Carnegie

6868