A Data Model to Support End-User Software Engineering
Christopher Scaffidi
Carnegie Mellon University
Questions for the panel
Some areas where I would appreciate suggestions:
• What aspects of this work would be of most interest to the ICSE community (in future research papers)?
• For any potential problems that you see in the work, what solutions can you suggest?
Target audience
• In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces.
• Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). [5]
• Both EUs and EUPs will benefit from the proposed research, though the proposed research is primarily aimed at EUPs (including EUs who become EUPs because of the research).
introduction ● prototype ● proposed work ● evaluation
Contextual inquiry: What are the problems of EUs and EUPs?
Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs, each) [3]
[9]
How can EUPs validate web forms if they do not know JavaScript?
Is the input valid? “EDSH 225”
Is the input nearly valid? “EDXH 225”
Does it just need reformatting? “Smith 225”
Or is it obviously badly invalid? “Robotics Institute”
Other tasks, other data, other problems
• When building a staff roster by merging data sources into a single spreadsheet, one of the EUs:
– Had to manually transform data to a consistent format (e.g.: Put person names in Lastname, Firstname format)
– Had to scrutinize data to identify questionable values that deserved double-checking (e.g.: A first name with 15 characters might be right)
– Had to manually check for (near-) duplicates (e.g.: “Scaffidi, Christopher” and “Scaffidi, Chris”)
• We and research collaborators identified many additional data validation and data reuse tasks that were poorly supported by existing tools. [3][7][9]
Underlying problem: abstraction mismatch
• Tools support strings, integers, floats, sometimes dates.
• Problem domain involves higher-level categories of data:
– University names: “Carnegie Mellon”, “CMU”
– Person names: “Scaffidi, Christopher”, “Chris Scaffidi”
– CMU phone numbers: “8-1234”, “x8-1234”
– CMU room numbers: “WeH 4623”, “Wean 4623”
• These data categories are:
– Human-readable
– Short (~ 1 input field)
– Multi-format
– Sometimes ambiguous / fuzzy (non-binary scale of validity)
– Often particular to certain groups of people
A New Direction: Create a new abstraction for each category of data
• Like software “libraries,” implementations of these abstractions could be reused in many programs.
• Abstractions would need to include functionality for:
– Recognizing instances of the category (for automating data validation)
– Transforming instances among various formats (for automating data reformatting)
– Testing instances for equality (for automating removal of duplicates)
A New Direction: Other requirements for abstractions
• EUPs over a range of programming expertise must be able to create custom new abstractions.
• Flexibility:
– Abstractions must capture fuzziness when recognizing instances of the category and when testing equivalence.
– EUPs must have the option of configuring abstractions to learn exceptional cases.
• Sharability:
– EUPs must still be able to share and find useful abstractions even as the number of abstractions grows.
Thesis
The proposed data model and development environment will enable end-user programmers to implement and share custom abstractions for flexibly recognizing, transforming and equivalence-testing values in categories of short, human-readable data.
The model and environment will help end-user programmers to more quickly and correctly validate and reuse data than is possible through currently practiced methods.
Topes
• Tope = an abstraction implementation for a data category
– Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain
• Topes in practice:
1. EUPs create new topes by using the basic tope editor (or by writing topes in another language, such as JavaScript)
2. EUPs publish topes on repositories.
3. Other EUs & EUPs download topes to their local cache.
4. Tool plug-ins let EUs & EUPs browse their local cache and associate topes with variables and input fields.
5. Plug-ins get topes from local cache and use them to recognize, transform, and equivalence-test data.
Related Work: Existing approaches do not meet the requirements.
• Regexps / grammars / data detectors recognize data but do not specify how to transform data
• Types:
– A value is or is not a valid instance of a type (non-fuzzy)
– If invalid at compilation, values cannot become valid at runtime
– Typed languages are probably difficult for EUPs who are uncomfortable with untyped scripting languages.
• Research on units (e.g.: Slate) and constraint systems (e.g.: Cues) typically only applies to numeric data in certain applications (e.g.: spreadsheets).
• And none of these has built-in support for helping users decide which abstractions to trust, so sharing is impeded.
Outline
• Introduction
• Related work
• Prototype
• Proposed work
• Evaluation
How could flexible formats be expressed?
Sample task: web form validation
The painful old way
• Drag widgets and validator onto page, select a regexp, customize if desired.
Sample task: web form validation
Results of the painful old way
• Invalid inputs cause a hard-coded message to appear.
Oops, forgot to enter a message at design-time.
• For valid inputs, no error message appears.
Hm, didn’t realize the area code was optional.
What if I want to allow campus phone numbers?
Sample task: web form validation
The wonderful new way
• Drag widgets and validator onto page, select a format, customize if desired.
Sample task: web form validation
Creating this format took 55 seconds
Sample task: web form validation
Results of the new way
• Invalid inputs cause a targeted message to appear.
• Inputs that violate an always or never constraint cannot be submitted to the server.
• Inputs that violate an often constraint cause a warning, which the application user can override.
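The three constraint strengths on this slide could be modeled roughly as follows. This is a minimal sketch under stated assumptions, not the prototype's implementation: the `Constraint` class, the `check` function, and the sample phone-number constraints are hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Constraint:
    strength: str                # "always", "never", or "often"
    test: Callable[[str], bool]  # predicate over the input text
    message: str                 # targeted message shown on violation

def is_violated(c: Constraint, value: str) -> bool:
    holds = c.test(value)
    # "never" is violated when the predicate holds; the others when it fails.
    return holds if c.strength == "never" else not holds

def check(value: str, constraints: List[Constraint]) -> dict:
    # Hard constraints (always/never) block submission; "often" only warns.
    errors = [c.message for c in constraints
              if c.strength in ("always", "never") and is_violated(c, value)]
    warnings = [c.message for c in constraints
                if c.strength == "often" and is_violated(c, value)]
    return {"can_submit": not errors, "errors": errors, "warnings": warnings}

# Hypothetical phone-number format, for illustration only.
PHONE = [
    Constraint("always", lambda s: re.fullmatch(r"[\d\- ()x]+", s) is not None,
               "Phone numbers contain only digits and punctuation."),
    Constraint("never", lambda s: s.startswith("-"),
               "A phone number never starts with a hyphen."),
    Constraint("often", lambda s: re.fullmatch(r"\d{3}-\d{3}-\d{4}", s) is not None,
               "Phone numbers usually look like 412-555-1212."),
]
```

With these assumed constraints, a campus number such as “8-1234” passes the hard constraints and only triggers the overridable “often” warning, while “Robotics Institute” is blocked.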
Prototype implementation
System block diagram
[Block diagram; labeled components: Spreadsheet, Microsoft Excel, Plug-in, Microsoft Visual Studio.NET, Plug-in, Format editor, Parser, Web application, Validator]
Expressiveness evaluation
• Four administrative assistants’ use of a web browser was logged for three weeks, resulting in nearly 6000 sample data values that they typed into web forms.
• Not logged verbatim: characters were generalized
– Eg: [email protected] → Aa{7}0@a{5}.a{3}
• We manually grouped values into 19 semantic families (eg: email address) based on each widget’s HTML name and the words visually near the widget
• Created and tested formats for 14 families (4250 values)
– Omitted: username/passwords and long blocks of “text”
– Inference & testing features were not used during format creation
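The character generalization used in the logs can be approximated with a short function. This sketch infers the rules from the one generalized value shown on the slide (uppercase becomes A, lowercase becomes a, digits become 0, runs are collapsed to a count, other characters are kept); the exact rules of the logging tool are an assumption, and the function names are hypothetical.

```python
from itertools import groupby

def char_class(ch: str) -> str:
    """Map a character to its class symbol; punctuation is kept as-is."""
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    if ch.isdigit():
        return "0"
    return ch

def generalize(value: str) -> str:
    """Replace characters by class symbols and collapse runs into symbol{n}."""
    out = []
    for symbol, run in groupby(char_class(ch) for ch in value):
        n = len(list(run))
        if symbol in ("A", "a", "0") and n > 1:
            out.append(f"{symbol}{{{n}}}")
        else:
            out.append(symbol * n)
    return "".join(out)
```

For instance, a made-up address such as `Xabcdefg1@gmail.com` generalizes to the same pattern shown on the slide, `Aa{7}0@a{5}.a{3}`.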
Expressiveness evaluation results
• 9 families needed 1 format each; 5 needed 2 formats each
• The only error attributable to editor expressiveness:
– 1 of the 4250 test values had a trailing period on a street type (in an address line)
– This particular version of the editor had no way to say that a part could contain a period but only at the end
• After support for multiple formats is added, the editor as a whole will be evaluated for usability.
[6]
Outline
• Introduction
• Related work
• Prototype
• Proposed work
• Evaluation
Generalizing the prototype:
A lightweight data model
+
A development environment to help EUPs create, share and use topes
Proposed data model
• 1 tope implementation contains executable functions:
– 1 isa: string → [0,1] function per format, for recognizing instances of the format
– 0 or 1 eqc: string × string → [0,1] function per format, for testing equivalence of two values in a format (default is a binary test for being exactly identical)
– 0 or more trf: string → string functions linking formats, for transforming values from one format to another
• A lightweight data model…
– Only contains 3 kinds of functions (isa/eqc/trf)
– These correspond to the operations that people had to keep performing manually in our studies.
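One way to picture the three kinds of functions is as plain typed callables grouped per format. The following is an illustrative sketch only: the `Format` and `Tope` classes, the method names, and the sample campus-phone tope are assumptions, not the proposed system's actual representation.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Isa = Callable[[str], float]       # string -> [0, 1]
Eqc = Callable[[str, str], float]  # string x string -> [0, 1]
Trf = Callable[[str], str]         # string -> string

def exact_match(a: str, b: str) -> float:
    """Default eqc: a binary test for being exactly identical."""
    return 1.0 if a == b else 0.0

@dataclass
class Format:
    isa: Isa                # exactly one isa per format
    eqc: Eqc = exact_match  # zero or one eqc per format (default supplied)

@dataclass
class Tope:
    formats: Dict[str, Format]
    # Zero or more trf functions, keyed by (source format, target format).
    trfs: Dict[Tuple[str, str], Trf] = field(default_factory=dict)

    def recognize(self, value: str) -> float:
        """Best isa score over all of the tope's formats."""
        return max(f.isa(value) for f in self.formats.values())

    def transform(self, value: str, src: str, dst: str) -> str:
        return self.trfs[(src, dst)](value)

# Hypothetical tope for the campus phone formats "8-1234" and "x8-1234".
campus_phone = Tope(
    formats={
        "short": Format(isa=lambda s: 1.0 if re.fullmatch(r"\d-\d{4}", s) else 0.0),
        "extension": Format(isa=lambda s: 1.0 if re.fullmatch(r"x\d-\d{4}", s) else 0.0),
    },
    trfs={
        ("short", "extension"): lambda s: "x" + s,
        ("extension", "short"): lambda s: s[1:],
    },
)
```

Here `campus_phone.transform("8-1234", "short", "extension")` yields `"x8-1234"`. The binary isa scores are a simplification; a fuzzier isa would return intermediate values for questionable inputs.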
Example tope
Notional representation
• An example tope for CMU room numbers
– 3 isa functions, up to 3 eqc functions, 4 trf functions
– A tope’s eqc and trf functions can be omitted if desired
Formal building name & room number: Elliot Dunlap Smith Hall 225
Building abbreviation & room number: EDSH 225
Colloquial building name & room number: Smith 225
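The room-number example could be written down roughly like this. It is a sketch under assumptions: the lookup table holds only the one building named on the slide, the 0.5 score for a plausible but unknown abbreviation is an arbitrary choice, and a single parameterized `trf` stands in for the four separate trf functions.

```python
import re

# One building from the slide; a real tope would need a fuller lookup table.
BUILDINGS = [
    {"formal": "Elliot Dunlap Smith Hall", "abbrev": "EDSH", "colloquial": "Smith"},
]

def _split(value: str):
    """Split 'building room' into (building, room); None if no trailing room number."""
    m = re.fullmatch(r"(.+) (\d{1,4})", value.strip())
    return (m.group(1), m.group(2)) if m else None

def isa(value: str, fmt: str) -> float:
    """Score in [0, 1] for how well value matches the given format."""
    parts = _split(value)
    if parts is None:
        return 0.0
    building, _room = parts
    if any(b[fmt] == building for b in BUILDINGS):
        return 1.0
    # Assumed score: right shape but unknown building is questionable, not invalid.
    if fmt == "abbrev" and re.fullmatch(r"[A-Z]{2,4}", building):
        return 0.5
    return 0.0

def trf(value: str, src: str, dst: str) -> str:
    """Rewrite value from format src to format dst via the building table."""
    parts = _split(value)
    if parts is None:
        raise ValueError(f"not a room number: {value!r}")
    building, room = parts
    for b in BUILDINGS:
        if b[src] == building:
            return f"{b[dst]} {room}"
    raise ValueError(f"unknown building: {building!r}")
```

Under these assumptions, “EDSH 225” scores 1.0 in the abbreviation format, “EDXH 225” scores 0.5, and “Smith 225” can be rewritten to “EDSH 225” with `trf("Smith 225", "colloquial", "abbrev")`.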
Proposed development environment
Functional decomposition diagram
[Diagram; labeled components: Basic Topes Editor, Repository Software, Publishing Tools, Search Tools, Development Environment, Plug-Ins]
EUPs implement topes in the basic topes editor (or JavaScript), then publish them in repositories.
Other EUs and EUPs search for topes, download them, then use them through plug-ins.
Proposed development environment
Enhanced basic topes editor
Proposed work
Enhancing the basic topes editor
• Extend isa support
– Improve error message generation
• Add trf support
– EUPs will specify a series of steps:
• Select a part, select an operator
• Operators: permutation, lookup, arithmetic, capitalization
– Add (regression) testing features to facilitate consistency
• Add eqc support
– For each part, EUPs will specify a comparison operator, returning a value in [0,1], and these will be multiplied.
• Operators: exactly identical, case-insensitive comparison, ~arithmetic distance, ~edit distance
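The per-part eqc scheme above, where each part has its own comparison operator and the scores are multiplied, might look like the sketch below. The slide does not say how edit distance is mapped into [0,1]; dividing by the longer string's length is an assumption, as are all function names.

```python
from typing import Callable, Dict

Comparator = Callable[[str, str], float]

def exactly_identical(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0

def case_insensitive(a: str, b: str) -> float:
    return 1.0 if a.lower() == b.lower() else 0.0

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Assumed normalization of edit distance into [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def eqc(a: Dict[str, str], b: Dict[str, str], ops: Dict[str, Comparator]) -> float:
    """Multiply the per-part comparison scores."""
    score = 1.0
    for part, op in ops.items():
        score *= op(a[part], b[part])
    return score
```

For example, comparing the person names from the earlier slide with `ops = {"last": case_insensitive, "first": edit_similarity}` gives a score strictly between 0 and 1 for “Scaffidi, Christopher” versus “Scaffidi, Chris”, which is the kind of fuzzy result a de-duplication step can threshold.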
Proposed development environment
Publishing tools
Proposed work
Publishing topes in repositories
• Clients will have a list of “known” repository servers
– Generally pre-configured to include a global server at CMU
– Organizations will configure clients to include the organizational server
– EUs and EUPs will be able to add new servers to their list
• To support publishing/searching, the repository will house meta-information about topes, including…
– a human-visible non-unique name & description
– an internally-used globally unique id (guid) based on the tope’s URL in the repository
Proposed development environment
Search tools
[Development environment diagram, with an additional label: Normalization]
Proposed work
Searching for relevant topes
• Search by keyword:
– Search tope name and description
– And match based on words that are visually near to topes
• Search by groups of people:
– Within an organization, or by author’s email domain
– Within spaces that are “group-private”
• Search by groups of topes:
– “If you liked this tope, you may also like XYZ”
– Similar to Amazon.com’s product recommendations
• Search by example:
– “Find me a tope that recognizes 412-555-1212”
– For efficiency, filter based on “signature” (\d{3}-\d{3}-\d{4})
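Search by example with a signature pre-filter could work along these lines. This is a hedged sketch: the slide gives only one signature, so the rule used here (digit runs become `\d{n}`, everything else is kept literally) is a guess, and the index layout and function names are hypothetical. The signature is used purely as a lookup key, not compiled as a regular expression.

```python
import re
from itertools import groupby
from typing import Callable, Dict, List, Tuple

def signature(example: str) -> str:
    """Collapse each run of digits to \\d{n}; keep other characters literally."""
    out = []
    for is_digit, run in groupby(example, key=str.isdigit):
        text = "".join(run)
        out.append(f"\\d{{{len(text)}}}" if is_digit else text)
    return "".join(out)

def search_by_example(example: str,
                      index: Dict[str, List[str]],
                      topes: Dict[str, Callable[[str], float]]) -> List[Tuple[str, float]]:
    """index maps signature -> tope ids; topes maps tope id -> isa function.

    Only topes whose signature matches are run, which avoids calling
    every isa function in the repository.
    """
    candidates = index.get(signature(example), [])
    scored = [(tid, topes[tid](example)) for tid in candidates]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: -p[1])

# Hypothetical repository contents, for illustration only.
TOPES = {"us-phone": lambda s: 1.0 if re.fullmatch(r"\d{3}-\d{3}-\d{4}", s) else 0.0}
INDEX = {r"\d{3}-\d{3}-\d{4}": ["us-phone"]}
```

With this toy index, `search_by_example("412-555-1212", INDEX, TOPES)` returns the one matching tope, and an example with a different signature returns an empty list without running any isa function.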
Proposed work
Searching for trustworthy topes
Evidence [8] | EUs and EUPs may trust topes: | Search features
Explicit formal roles | Created by their organization’s system administrators. | Search by tope author
Prior performance | From people who have previously supplied good topes. |
Model of motivation | From vendors that care about brand image. |
Group membership | From people who are known to have a similar background. |
Reputation | That earned anonymous votes of confidence. | Search by tope ratings (either anonymous or not)
References | That present a list of high-profile people who like the topes. |
Certification | That are inspected and certified by a third party. |
Social context | That are actively maintained, that is, for which improved versions are regularly available. | Search by tope publication date and execution platform
 | That are implemented in a familiar language/platform. |
Proposed development environment
Enhanced plug-ins
Proposed work
Enhancing plug-ins
• Target tools
– Microsoft Excel
– Microsoft Visual Studio.NET
– Robofox
• Operations supported
– Assertions: run isa on selected cells
– Transformation: run trf on selected cells
– De-duplication: run eqc on selected cells, cluster the cells
• Each will support basic editor topes & JavaScript topes
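The de-duplication operation, which runs eqc on selected cells and clusters them, could be sketched as below. The slides do not name a clustering method, so the greedy single-link approach and the 0.5 threshold are assumptions; `name_eqc` is a hypothetical stand-in for a tope's eqc function.

```python
from typing import Callable, List

def cluster(values: List[str],
            eqc: Callable[[str, str], float],
            threshold: float = 0.5) -> List[List[str]]:
    """Greedy single-link clustering.

    A value joins the first existing cluster containing any member it
    matches at or above the threshold; otherwise it starts a new cluster.
    Clusters are never merged afterwards, which keeps the sketch simple.
    """
    clusters: List[List[str]] = []
    for v in values:
        for c in clusters:
            if any(eqc(v, member) >= threshold for member in c):
                c.append(v)
                break
        else:
            clusters.append([v])
    return clusters

def name_eqc(a: str, b: str) -> float:
    """Hypothetical eqc for 'Lastname, Firstname' values."""
    if a == b:
        return 1.0
    last_a, _, first_a = a.partition(", ")
    last_b, _, first_b = b.partition(", ")
    same_last = last_a == last_b
    prefix = bool(first_a and first_b) and (
        first_a.startswith(first_b) or first_b.startswith(first_a))
    return 0.8 if same_last and prefix else 0.0
```

Applied to the cells “Scaffidi, Christopher”, “Myers, Brad”, and “Scaffidi, Chris”, this groups the two Scaffidi entries as probable duplicates and leaves the third cell on its own.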
Proposed work
Recognizing exceptions in plug-ins
• Tope creators might overlook values.
• From the standpoint of a tope format, these “normal” values are exceptional cases that need to be tolerated.
• Simple approach: Record a whitelist of exceptions
• More sophisticated: For each format, record exceptions, infer a format (new isa function), and average this function’s score with the raw function’s score
• Exceptional values can be incorporated into the tope in the local cache and/or, at EUP’s discretion, propagated to the repository of the tope’s master copy
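Both approaches to exceptions can be expressed as wrappers around a format's raw isa function. This is a rough sketch: the shape-based `infer_isa` is a crude stand-in for real format inference, and every name here is hypothetical.

```python
import re
from typing import Callable, Iterable, Set

Isa = Callable[[str], float]

def with_whitelist(raw_isa: Isa, exceptions: Set[str]) -> Isa:
    """Simple approach: recorded exceptions are always accepted."""
    return lambda value: 1.0 if value in exceptions else raw_isa(value)

def shape(value: str) -> str:
    """Crude shape: every digit becomes 0 and every letter becomes a."""
    return re.sub(r"[A-Za-z]", "a", re.sub(r"\d", "0", value))

def infer_isa(exceptions: Iterable[str]) -> Isa:
    """Stand-in for format inference: accept values shaped like an exception."""
    shapes = {shape(e) for e in exceptions}
    return lambda value: 1.0 if shape(value) in shapes else 0.0

def with_inferred_format(raw_isa: Isa, inferred_isa: Isa) -> Isa:
    """More sophisticated approach: average the raw and inferred scores."""
    return lambda value: (raw_isa(value) + inferred_isa(value)) / 2.0
```

If the raw isa only accepts numbers like 412-555-1212 and “8-1234” has been recorded as an exception, the whitelist accepts exactly that value, while the averaged version also gives partial credit (0.5 here) to unseen values of the same shape, such as “8-5678”.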
Outline
• Introduction
• Related work
• Prototype
• Proposed work
• Evaluation
Examples
Experiments
Field testing
Evaluation
Expressiveness – Identify test tasks based on previous studies; create topes for data involved in those tasks
Creation of topes by EUPs – Controlled experiment in which students & staff create topes
Usefulness for tasks – Controlled experiment in which students & staff use topes to perform the test tasks
Flexibility of topes – Test the topes created by participants on test data drawn from EUSES spreadsheet corpus
Sharability of topes – Field testing in which several dozen students & staff will install and use the environment
Referenced papers
Conference papers
[1] C. Scaffidi. Unsupervised Inference of Data Formats in Human-Readable Notation. Proceedings of 9th International Conference on Enterprise Information Systems (ICEIS'07), 2007, to appear.
[2] C. Scaffidi, K. Bierhoff, E. Chang, M. Felker, H. Ng, C. Jin. Red Opal: Product-Feature Scoring from Reviews. Proceedings of 8th ACM Conference on Electronic Commerce (ACMEC'07), 2007, to appear.
[3] C. Scaffidi, A. Cypher, S. Elbaum, A. Koesnandar, and B. Myers. Scenario-Based Requirements for Web Macro Tools. Submitted for publication, 2007.
[4] C. Scaffidi, A. Ko, B. Myers, M. Shaw. Dimensions Characterizing Programming Feature Usage by Information Workers. VL/HCC'06: Proceedings of the 2006 IEEE Symposium on Visual Languages and Human-Centric Computing, pp. 59-62, 2006.
[5] C. Scaffidi, M. Shaw, and B. Myers. Estimating the Numbers of End Users and End User Programmers. VL/HCC'05: Proceedings of the 2005 IEEE Symposium on Visual Languages and Human-Centric Computing, pp. 207-214, 2005.
Other papers
[6] C. Scaffidi, B. Myers, M. Shaw. The Topes Format Editor and Parser, Technical Report CMU-ISRI-07-104, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 2007.
[7] C. Scaffidi, B. Myers, and M. Shaw. Trial By Water: Creating Hurricane Katrina "Person Locator" Web Sites. In Leadership at a Distance: Research in Technologically-Supported Work (S. Weisband, ed), Lawrence Erlbaum, pp. 209-222, 2007.
[8] C. Scaffidi, M. Shaw. Toward a Calculus of Confidence. First International Workshop on the Economics of Software and Computation, co-located with ICSE'07, 2007, to appear.
[9] C. Scaffidi, M. Shaw, B. Myers. Games Programs Play: Obstacles to Data Reuse, 2nd Workshop on End User Software Engineering (WEUSE), 2006.
Thank You…
• …to the symposium committee/panel for the opportunity to present
• …to many people for helpful suggestions
• …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929)
Marwan Abi-Antoun, Robin Abraham, Matt Bass, Nels Beckman, Kevin Bierhoff, Alan Blackwell, Barry Boehm, Margaret Burnett, Owen Cheng, Ciera Christopher, Michael Coblenz, Allen Cypher, Uri Dekel, Sebastian Elbaum, Martin Erwig, George Fairbanks, Thomas Green, Josh Gross, Greg Hartman, Jim Herbsleb, John Hosking, Andy Ko, Thomas LaToza, Alon Lavie, Henry Lieberman, Larry Maccherone, Brad Myers, John Pane, Mary Beth Rosson, Mary Shaw, Jeff Stylos, Dean Sutherland, Steve Tanimoto, Susan Wiedenbeck
Survey of EUPs: Better data-manipulation features needed
• Asked 831 information workers about use of 23 features in 5 tools (eg: creating spreadsheet macros, database stored procedures, and web forms) [4][9]
• The most widely used features were related to manipulating linked structures of data (eg: database tables) rather than imperative or macro programming
• Yet respondents complained about these features:
– “Not always easy to move sturctured [sic] data or text”
– “Not always integrated a lot of data manipulation redundant”
– “Information entered inconsistently into database fields by different people leaves a lot of database cleaning”
Interviews of web site creators: Confirmation of specific problems
• Interviewed 6 people involved in creating “person locator” web sites after Hurricane Katrina [7][9]
• Many omitted data validation on web forms
– Hard to detect that “12 Years old” is an invalid street address (what would the regexp look like?)
• “Aggregator” sites were built to scrape and consolidate data from numerous person locator sites.
– Hard to transform data into a single consistent format
– Hard to identify probable duplicates in the merged data set
Sample task: validating person names
Customizing constraints in our prototype
• User can add/edit constraints
Benefits of the format editor
• Exotic regexp notation is replaced with sentence-like screen prompts.
• Soft constraints (“often”) are supported.
• Negation constraints (“never”) are supported.
• In terms of expressiveness, augmented context-free grammars > context-free grammars > regexps
But is the expressiveness adequate for common data?