a data model to help end user programmers manipulate and validate data christopher scaffidi carnegie...

19
A Data Model to Help A Data Model to Help End User Programmers End User Programmers Manipulate and Validate Manipulate and Validate Data Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Post on 22-Dec-2015

215 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

A Data Model to HelpA Data Model to HelpEnd User ProgrammersEnd User Programmers

Manipulate and Validate DataManipulate and Validate Data

Christopher Scaffidi

Carnegie Mellon University

ISRI SSSG Oct 2006

Page 2: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

MotivationMotivation22

Consider automating repetitive actions in a Consider automating repetitive actions in a web browser.web browser.

• Our recent contextual inquiry revealed that administrative assistants fill out many expense reports.

• Given a location and date, they used a government site to find the per diem rate.

Page 3: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

MotivationMotivation33

Web macros cannot automate this task.Web macros cannot automate this task.

Existing macro tools cannot convert from two-letter state abbreviation to full state name.

Page 4: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

MotivationMotivation44

Web macros cannot automate this task.Web macros cannot automate this task.

Nor can they convert dates from MM/DD/YYYY to Month DD.

Page 5: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

MotivationMotivation55

• Examples:– Dates

– Credit card numbers

– Person names

– Quantities of RAM

– Product codes

• Such data are – “bigger” than floats and strings.

– “smaller” than a database row.

– typically domain-specific.

– full of exceptional cases.

The world is full of “user-level” data.The world is full of “user-level” data.

– State names

– US phone numbers

– Bus route numbers

– Dewey decimal numbers

– Etc…

Page 6: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

ProblemProblem66

Tools do not “understand” user-level Tools do not “understand” user-level data such as states and dates.data such as states and dates.

• Limited support for data manipulation– Reformatting data in web macros or spreadsheets

– Transporting (transforming) data between applications

• Limited support for data validation– Are any values mistyped?

– Does the dataset contain duplicates?

• Information Week respondents complained more about data manipulation & interoperability problems than about software reliability problems!

Page 7: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

ProblemProblem77

To be useful, representations of these To be useful, representations of these data must meet 3 requirements.data must meet 3 requirements.

ExtensibilityDifferent people use different data.

Let users represent the data that they care about.

ShareabilityDifferent people sometimes use the same kinds of data.

Help end users find & evaluate representations of data.

FlexibilityData appear in many formats, with exceptions to every rule.

Support multiple formats, and permit exceptions.

Page 8: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Existing approachesExisting approaches88

Existing approaches do not meet the Existing approaches do not meet the requirements.requirements.

• Regexps / grammars / data detectors represent syntax, not semantics (e.g.: how to represent “FL” = “Florida”?)

• Research on units typically only apply to numeric data in certain applications (e.g.: spreadsheets).

• Knowledge systems (e.g.: ConceptNet) do not contain representations of data formats.

• OO and formal types are too difficult for many end users and typically disallow exceptions to type rules.

• Federated database systems deal with heterogeneous joins but require the attention of a professional DBA.

And each lacks built-in support for helping users decide whose code to trust.

Page 9: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Proposed modelProposed model99

A “tope” defines the basic semantics of a A “tope” defines the basic semantics of a single user-level data abstraction.single user-level data abstraction.

A “tope” is a pair of functions defined by a user:

• isa: string [0,1]

returns a context-independent estimate of the likelihood that the string is an instance of this tope

• eq: string x string [0, 1]

returns an estimate of the likelihood that the strings are equivalent, conditional on being instances of this tope

Topes will be defined in files and compiled, just like types.

Page 10: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Proposed modelProposed model1010

Reformatting functions would transform Reformatting functions would transform instances from one tope to another.instances from one tope to another.

Two topes are “isotopes” if instances of one can be reformatted into the other.

• fmt: string string

treats the input as an instance of one tope and returns an equivalent string that is in another format

Page 11: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Proposed meta-modelProposed meta-model1111

Repositories would permit sharing of topes.Repositories would permit sharing of topes.

• Topes could be implemented in arbitrary languages.• The binaries would be stored in “repositories”.

• Each user might subscribe to multiple repositories:– personal repository of custom topes

– university repository of organization-specific topes

– general repository of generic topes

• Users would be able to define new topes, search for topes, add them to their repository, and prune their repository of outdated topes.

Page 12: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Proposed meta-modelProposed meta-model1212

A meta-model would representA meta-model would representaspects of tope trustworthiness.aspects of tope trustworthiness.

• How do end users decide which topes to use?– Topes would be annotated with platform (e.g.: JDK1.5),

author names, and other meta information to facilitate finding and choosing topes.

• What if a tope consistently over-estimates “isa” scores?– Let the user give feedback, either positive or negative.– “Wrap” the tope and correct for its bias (renormalization).

• What other renormalization might be handy?– Words in the context might “cue” that certain tope instances

are nearby; “isa” functions for those topes may be more trustworthy in that context.

Page 13: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Tope implementationTope implementation1313

Macro tools would download topes on Macro tools would download topes on demand from repositories.demand from repositories.

Back to the macro example…

1. The macro tool retrieves topes.

2. The tool tries to infer a tope for each value in the macro.• The user could override this assignment, of course.

3. The tool can now automatically reformat data if needed

Page 14: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Tope implementationTope implementation1414

Most isa functions could be implemented Most isa functions could be implemented with an augmented context-free grammar.with an augmented context-free grammar.

• We logged data from information workers’ web browsers.

• It appears that most data can be recognized using probabilistic context-free grammars with constraints on the grammar terms.– E.g.: time HH : MM ap

HH, MM ## {MM >= 0 && MM <= 59 && … }

• I will need to…– Verify that an augmented grammar is expressive enough

– Identify what constraint primitives are necessary

Page 15: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Tope implementationTope implementation1515

Most equivalence and reformatting Most equivalence and reformatting functions are built from very few primitives.functions are built from very few primitives.

• Equivalence functions combine– Lookup in hard-coded tables

– Arithmetic and conditionals

– Numeric comparisons

– “Identicalness” comparison

– Case-insensitive comparison

• Reformatting functions combine– Lookup in hard-coded tables

– Arithmetic and conditionals

– Permutation

Page 16: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Tope implementationTope implementation1616

A prototype system will help usersA prototype system will help userscreate, compile, and share topes.create, compile, and share topes.

• I will need to provide a prototype system with…– A user-friendly editor for end users to define topes.

– A program that turns these definitions into binary modules.

– A repository server to store binaries and the meta-model.

• Remember: Sophisticated end users (or professionals) can fall back on an arbitrary language to create topes.

• The simple grammar language is for the common case.

Page 17: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Applications of topesApplications of topes1717

Equipping tools with topes will help users Equipping tools with topes will help users create programs of higher quality.create programs of higher quality.

• During end user programming, tools would download useful topes from repositories.

– Web macros could perform reformatting automatically.• Improved composability of web macros

– Spreadsheets could be checked for malformed values.• Improved correctness of spreadsheets

– Web applications could validate & reformat data from web.• Improved security of applications

Page 18: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

Applications of topesApplications of topes1818

Providing a tope system to users will Providing a tope system to users will improve data interoperability.improve data interoperability.

• When receiving data, users could define a new tope (or use an existing tope) and apply it to validate data.– Users could reformat values to a uniform format.

– Users could find and remove duplicate values.

• Data could carry along tope definitions, particularly if the representation is “secure” (e.g.: context-free grammar)– a form of self-describing data.

Page 19: A Data Model to Help End User Programmers Manipulate and Validate Data Christopher Scaffidi Carnegie Mellon University ISRI SSSG Oct 2006

1919

Thank You…Thank You…

• To Mary Shaw, Brad Myers, Robin Abraham, Allen Blackwell, Margaret Burnett, Michael Coblenz, Allen Cypher, Sebastian Elbaum, Martin Erwig, Josh Gross, John Hosking, Andy Ko, Andhy Koesnandar, Henry Lieberman, Ericka Orrick, John Pane, Mary Beth Rosson, Jeff Stylos, Steve Tanimoto, and others for helpful discussions.

• To NSF for funding