

Post on 26-Dec-2015


My Redneck Brother's Tire Size, and Other Unrelated Topes

Christopher Scaffidi

Carnegie Mellon University

Even when lives are at stake, people still make typos.

intro ● current practice ● real data ● topes ● closing vignette

Hurricane Katrina “Person Locator” web site

Data errors reduce the usefulness of data.

Even little typos impede data de-duplication.

Age is not useful for flying my helicopter to come rescue you.

Age belongs in the description/additional information field so I can recognize you.

And a “city name” with 1 letter is no use at all.

The website creators omitted input validation.

• Reasons: They thought…
– it would be too hard to write the validation.
– catching obviously wrong inputs would prevent collecting maybe-correct data.

This is the UI code for the web form where people could type in the data.

A RAD tool called CodeCharge Studio was used to create the UI.

Outline and main points

1. Current practice
– Currently, writing validation code is hard…

2. Real data
– because how do you express, “this is questionable data?”

3. Topes
– Topes can express that, and they’re easy to create, too.

4. Closing vignette: my brother’s truck tires
– Email me some vignettes of your own.

Programmers have lots of tricks to simplify writing validation code.

Split inputs into multiple easy-to-validate fields. Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field?

Make users pick from drop-downs. Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs are sometimes good!)

I implemented this code for NJTransit.com.

Even with these tricks, writing validation is still very time-consuming.

Overall, the site had over 1100 lines of JavaScript just for validation… plus equivalent server-side Java code (can’t trust some users).

Sample code below.

    // Reject addresses that fail the RFC syntax check.
    if (!rfcCheckEmail(frm.primaryemail.value))
        return messageHelper(frm.primaryemail, "Please enter a valid Primary Email address.");

    // Enforce the length limits on the email name and the email domain.
    var atloc = frm.primaryemail.value.indexOf('@');
    if (atloc > 31 || atloc < frm.primaryemail.value.length-33)
        return messageHelper(frm.primaryemail,
            "Sorry. You may only enter 32 characters or less for your email name\r\n" +
            "and 32 characters or less for your email domain (including @).");

That was worst case. Best case: reusable regexps.

• Many IDEs allow the programmer to enter one regular expression for validating each input field.
– Usually, this drastically reduces the amount of code, since most validation ain’t fancy.
– Unfortunately…

Regexps are a good bullet but not a silver bullet, so lots of data goes unvalidated.

• The world is full of programmers who can’t read regexps.
– Do a search on Google some time for “regular expressions” and read what people say in the forums.
– USA alone has over 55 million non-expert creators of web sites, databases, and spreadsheets (which have most of the same data problems that web sites do).

• Regexps only work for data where you can say, “Yes, this is definitely ok” or “No, this is definitely wrong”.
– What would a regexp for a valid company name look like?

So we did a preliminary review of real data needing validation.

• Sources:
– Comments from Information Week readers to a survey

– Observations of people as they created and used websites

– Many Hurricane Katrina sites

– Cursory browsing of the EUSES spreadsheet corpus

– Browsing around the web

– My own experience as a professional webapp developer

• We found 3 primary problems with regexps…

1. Real data doesn’t always conform well to the simple “binary” regexp model.

• Data is sometimes questionable… yet valid.
– Remember the suspiciously long NJTransit email address?
– In practice, person names and other proper nouns are never validated with regexps… too brittle.
– Life is full of corner cases and exceptions.

• If your code can identify questionable data, then it can double-check the data:
– Ask an application end user to confirm the input
– Flag the input for checking by a system administrator
– Compare the value to a list of known allowable exceptions
– Call up a server and see if it can confirm the value
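A minimal sketch of that double-checking idea, not the NJTransit code. The three grades, the function names, the exception list, and the reuse of 32 characters as a "suspicious" threshold are all assumptions made for illustration.

```javascript
// Hypothetical exception list of long-but-real addresses (assumed for this sketch).
const KNOWN_EXCEPTIONS = new Set([
  "a.very.long.but.real.mailbox.name.example@example.org",
]);

// Grade an address as "valid", "questionable", or "invalid" instead of yes/no.
function gradeEmail(value) {
  const at = value.indexOf("@");
  // No "@", nothing before it, or nothing after it: definitely wrong.
  if (at <= 0 || at === value.length - 1) return "invalid";
  // Suspiciously long name or domain: questionable, but possibly real.
  if (at > 32 || value.length - at > 32) return "questionable";
  return "valid";
}

// Decide what to do with the input; confirmWithUser is a caller-supplied callback.
function handleEmail(value, confirmWithUser) {
  const grade = gradeEmail(value);
  if (grade === "valid") return "accept";
  if (grade === "invalid") return "reject";
  // Questionable: check the exception list first, then ask the end user.
  if (KNOWN_EXCEPTIONS.has(value)) return "accept";
  return confirmWithUser(value) ? "accept-flagged" : "reject";
}
```

The point of the middle grade is that the form never has to choose between rejecting a real address and silently accepting a typo.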

2. Real data often can occur in multiple different formats.

• Two different strings can be equivalent.
– How many ways can you write a date?
– What if an end user types a date in the wrong format?
– “Jan-1-2007” and “1/1/2007” mean the same thing because of the category that they are in: date.
– Sometimes the interpretation is ambiguous. In real life, we use preferences and experience to guide interpretation.

• If your code can transform among formats (i.e., not just recognize formats with regexps), then it can put data in an unambiguous format of your choice.
– Display the result so users can fix interpretations if needed
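A small sketch of transforming among date formats rather than only recognizing them. The function names, the choice of YYYY-MM-DD as the unambiguous target, and the dayFirst preference flag are assumptions for illustration; the sketch does not range-check months or days.

```javascript
const MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"];

// Recognize "Jan-1-2007"; returns {year, month, day} or null.
function parseMonthNameDate(s) {
  const m = /^([A-Za-z]{3})-(\d{1,2})-(\d{4})$/.exec(s.trim());
  if (!m) return null;
  const month = MONTHS.indexOf(m[1].toLowerCase()) + 1;
  return month === 0 ? null : { year: Number(m[3]), month, day: Number(m[2]) };
}

// Recognize "1/1/2007". The order of the first two numbers is ambiguous,
// so the caller's preference (dayFirst) guides the interpretation.
function parseSlashDate(s, dayFirst = false) {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(s.trim());
  if (!m) return null;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return dayFirst ? { year: Number(m[3]), month: b, day: a }
                  : { year: Number(m[3]), month: a, day: b };
}

// Put any recognized input into one unambiguous format.
function toIsoDate(s, options = {}) {
  const d = parseMonthNameDate(s) || parseSlashDate(s, options.dayFirst);
  if (!d) return null;
  const pad = (n) => String(n).padStart(2, "0");
  return `${d.year}-${pad(d.month)}-${pad(d.day)}`;
}
```

Showing the converted value back to the user gives them the chance to correct a wrong interpretation, for example of "3/4/2007".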

3. The meaning of data is often tied to its “parts”, not directly to its characters.

• Real data often has parts, each with its own meaning.
– What are the parts of a date, 1/1/2007?

– Valid data conforms to intra- and inter-part constraints.

– Writing regexps requires you to translate constraints into a character sequence… tough in many cases, practically or truly impossible in others.

– No wonder most people can’t read or write regexps.

• If your code could succinctly state the parts, as well as mandatory and optional constraints on the parts, wouldn’t the code be easier to write and maintain?
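One way that might look for a date, as a sketch only: the parts are named, and the constraints are stated on the parts rather than on characters. The function names, the split into problems versus warnings, and the 1900-2100 "usual" year range are assumptions.

```javascript
// Last day of a month (month is 1-12); assumes four-digit years.
function daysInMonth(year, month) {
  // Day 0 of the following month is the last day of this month.
  return new Date(year, month, 0).getDate();
}

function checkDateParts({ year, month, day }) {
  const problems = []; // mandatory constraints that were violated
  const warnings = []; // optional (soft) constraints that were violated

  // Intra-part constraints: each part checked on its own.
  const monthOk = month >= 1 && month <= 12;
  if (!monthOk) problems.push("month must be 1-12");
  if (day < 1) problems.push("day must be at least 1");

  // Inter-part constraint: the allowed day depends on month and year.
  if (monthOk && day > daysInMonth(year, month)) {
    problems.push("day is too large for this month");
  }

  // Soft constraint: unusual years are questionable, not wrong.
  if (year < 1900 || year > 2100) warnings.push("year is unusual");

  return { problems, warnings };
}
```

A regexp would have to encode "February has 29 days only in leap years" as character sequences; here it is one comparison between parts.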

Imagine a world…

• Where your code could say to an oracle, “Is this input a company name?”, and the oracle would say yes, no, almost definitely, probably not, and other shades of gray.

• Where your code could accept an input in any reasonable format, since your code could ask the oracle to put the input into whatever format you actually want.

• Where you could teach the oracle about a new category of data by concisely stating the parts and constraints of that data.

Tope = an abstraction for a data category

• Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain

• Topes in practice:

1. People implement new topes by using the basic tope editor (or another language such as JavaScript)

2. People publish tope implementations on repositories.

3. People download tope implementations to local caches.

4. Tool plug-ins let people browse their local cache and associate topes with variables and input fields.

5. Plug-ins use tope implementations from local cache to recognize, transform, and equivalence-test data.

A tope is a graph. Node = format, edge = transformation

• A notional representation for a CMU room number tope.
– Note that edges (transformations) can be chained

Formal building name & room number: Elliot Dunlap Smith Hall 225

Building abbreviation & room number: EDSH 225

Colloquial building name & room number: Smith 225
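A sketch of the graph idea with chained edges, using the three room-number formats above. Which edges exist is an assumption (formal and abbreviation linked both ways, abbreviation and colloquial linked both ways), as are the names BUILDINGS, swapBuilding, EDGES, and transform.

```javascript
// One row per building; only this row comes from the slide.
const BUILDINGS = [
  { formal: "Elliot Dunlap Smith Hall", abbrev: "EDSH", colloquial: "Smith" },
];

// Build an edge that rewrites the building name from one format to another.
function swapBuilding(from, to) {
  return (value) => {
    for (const b of BUILDINGS) {
      if (value.startsWith(b[from] + " ")) {
        return b[to] + value.slice(b[from].length);
      }
    }
    return null; // building not recognized in the source format
  };
}

// Nodes are formats; each entry is an outgoing edge (a transformation).
const EDGES = {
  formal: { abbrev: swapBuilding("formal", "abbrev") },
  abbrev: {
    formal: swapBuilding("abbrev", "formal"),
    colloquial: swapBuilding("abbrev", "colloquial"),
  },
  colloquial: { abbrev: swapBuilding("colloquial", "abbrev") },
};

// Breadth-first search over formats, applying each edge along the way,
// so formats without a direct edge are still reachable by chaining.
function transform(value, from, to) {
  const queue = [{ format: from, value }];
  const seen = new Set([from]);
  while (queue.length > 0) {
    const current = queue.shift();
    if (current.format === to) return current.value;
    for (const [next, trf] of Object.entries(EDGES[current.format] || {})) {
      if (seen.has(next)) continue;
      const out = trf(current.value);
      if (out === null) continue;
      seen.add(next);
      queue.push({ format: next, value: out });
    }
  }
  return null;
}
```

With these edges, the formal name reaches the colloquial one only by passing through the abbreviation, which is the chaining the slide notes.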

A tope is a conceptual abstraction. A tope implementation is code you can run.

• Each tope implementation contains executable functions:
– 1 isa: string → [0,1] function per format, for recognizing instances of the format
– 0 or more trf: string → string functions linking formats, for transforming values from one format to another
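To make the two signatures concrete, here is a hypothetical isa and trf for the "building abbreviation & room number" format. The lookup table, the 0.5 score for an unknown building, and all names are assumptions, not the actual tope implementation format.

```javascript
// Hypothetical lookup table; only EDSH appears in the slides.
const BUILDING_NAMES = new Map([["EDSH", "Elliot Dunlap Smith Hall"]]);

// isa: string -> [0,1]. 1 = definitely this format, 0 = definitely not,
// values in between = questionable.
function isaAbbrevRoom(s) {
  const m = /^([A-Z]{2,5}) (\d{1,4})$/.exec(s);
  if (!m) return 0;                            // wrong shape
  return BUILDING_NAMES.has(m[1]) ? 1 : 0.5;   // right shape, unknown building
}

// trf: string -> string, from the abbreviation format to the formal format.
// Unknown abbreviations are passed through unchanged.
function trfAbbrevToFormal(s) {
  const [abbrev, room] = s.split(" ");
  return `${BUILDING_NAMES.get(abbrev) || abbrev} ${room}`;
}

// A tope implementation then bundles the functions by format.
const roomTopeImplementation = {
  formats: { abbrev: { isa: isaAbbrevRoom } },
  trfs: [{ from: "abbrev", to: "formal", trf: trfAbbrevToFormal }],
};
```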

Common kinds of topes: enumerations and proper nouns

• Multi-format, non-binary enums, e.g.: US states
– Fixed list of definitely valid names (e.g.: “Maryland”)
– Transformed to other formats via lookup tables (“MD”)
– Augmented with a list of unusual values that technically might be ok in some circumstances (“PR”)

• Open-set proper nouns, e.g.: Company names
– You certainly can’t list all of these
– Collect a whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corporation”, “GOOG”)
– Augment with a pattern for recognizing promising inputs that are not yet on the whitelist
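Both kinds can be sketched in a few lines. The tables below are deliberately tiny, and the 0.5 scores, the company-name pattern, and the function names are assumptions for illustration.

```javascript
// Enum tope: a fixed list plus a lookup table to another format.
const STATES = new Map([["Maryland", "MD"], ["New Jersey", "NJ"]]); // truncated list
// Unusual values that might be ok in some circumstances.
const UNUSUAL_STATES = new Map([["Puerto Rico", "PR"]]);

function isaStateName(s) {
  if (STATES.has(s)) return 1;
  if (UNUSUAL_STATES.has(s)) return 0.5; // questionable, not wrong
  return 0;
}

function stateNameToAbbreviation(s) {
  return STATES.get(s) || UNUSUAL_STATES.get(s) || null;
}

// Open-set proper noun tope: whitelist plus a pattern for promising inputs.
const COMPANY_WHITELIST = new Set(["Google", "Google Corporation", "GOOG"]);
const PROMISING_COMPANY = /^[A-Z][\w&.,' -]*$/; // assumed pattern

function isaCompanyName(s) {
  if (COMPANY_WHITELIST.has(s)) return 1;
  return PROMISING_COMPANY.test(s) ? 0.5 : 0;
}
```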

Two more common kinds of topes: numeric and hierarchical

• Numeric, e.g.: area codes
– Check that inputs are numeric and in a certain range
– Values outside the range might be valid but questionable
– Very rarely, numeric data are explicitly flagged with a unit

• Hierarchical, e.g.: address lines
– Parts are described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
– Each part has its own internal constraints; the hierarchical tope may add inter-part constraints.
– Simple isa functions can be implemented with regexps.
– Transformations involve permutation of parts, changes to separators, simple arithmetic, and lookup tables.
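A sketch of a hierarchical isa for an address line like “100 Main St.”, built from a numeric part, a proper-noun part, and an enum part. The ranges, the suffix list, the three-word split, and combining the part scores with Math.min are all assumptions.

```javascript
const STREET_SUFFIXES = new Set(["St.", "Ave.", "Rd.", "Blvd."]); // assumed list

// Numeric part: must be digits; very large values are questionable.
function isaHouseNumber(s) {
  if (!/^\d+$/.test(s)) return 0;
  const n = Number(s);
  if (n < 1) return 0;
  return n <= 99999 ? 1 : 0.5;
}

// Proper-noun part: a capitalized word is fine, anything else is questionable.
function isaStreetName(s) {
  if (/^[A-Z][a-z]+$/.test(s)) return 1;
  return /^\S+$/.test(s) ? 0.5 : 0;
}

// Enum part: must be on the list.
function isaStreetSuffix(s) {
  return STREET_SUFFIXES.has(s) ? 1 : 0;
}

// Hierarchical tope: the whole is only as convincing as its weakest part.
function isaAddressLine(s) {
  const parts = s.trim().split(/\s+/);
  if (parts.length !== 3) return 0;
  return Math.min(
    isaHouseNumber(parts[0]),
    isaStreetName(parts[1]),
    isaStreetSuffix(parts[2])
  );
}
```

Real address lines have multi-word street names and unit numbers; this sketch only covers the three-part shape of the slide's example.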

We have a tool to help people succinctly express common kinds of topes.

Features:

• Format inference
• Format/part names
• Soft constraints
• “isa” generation
• Testing features
• Format reusability

(Similar UI style for implementing trfs)

And we have plug-ins for using topes in web forms, databases, and spreadsheets

Visual Studio: drag-and-drop code generation

Microsoft Excel: buttons and menus

We have conducted a variety of evaluations.

• Expressiveness:
– We have implemented formats for dozens of kinds of data
(1) EUSES spreadsheet corpus
(2) Hurricane Katrina and Google Base website data
(3) logs of admin assistants’ web browsing
– … and topes were very effective at identifying data errors.

• Usability:
– Controlled experiment shows that our format editor enables admin assistants and master’s students to validate data more quickly and accurately than with Lapis patterns or with regexps.

For more details…

• Ask me for the papers on…
– Surveys and other studies of programmers and users

– The topes model

– Our user interfaces

– Our evaluations

• Ask me for the tools:
– Some modules are already open-sourced
– Modules have a clean API (if you just want a binary)
– The evaluations pointed out some places for improvement (when this is done, the rest will be open-sourced, too)

A closing vignette

• This vignette illustrates many of the characteristics of data that I mentioned today.

• I would value similar (true story) vignettes from you
– To help highlight what real data looks like

– To help communicate the concept of a tope

– To provide me with test cases for topes

My brother (Ben) and I hit a rake while driving his truck around the backyard.

We got a flat.

Fortunately, Ben knows A LOT about trucks… and their tires.

Observe the pencil marks on the tire, where my brother drew while explaining what the parts of the tire size meant.

(I tweaked the contrast on this image to make the lettering stand out.)

So Ben went online to order a tire.

Observe the red neck and boots.

My brother is very web savvy. He is an electrician by day, but he assembles computers and sets up web sites as a side job.

But even though Ben knows tires really well, he couldn’t implement tire size validation.

• Each part has meaning (cross section, sidewall aspect ratio, internal construction, etc).

• Though parts must be selected from a simple enumeration, there are inter-part constraints.

• “Questionable” sizes? Dunno. Maybe those that are reasonable but hard-to-find?
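As a rough illustration of those parts, here is a sketch that splits a common metric-style size such as “P215/65R15” into named parts. The size string is an example chosen for this sketch, not the size on Ben's tire; the function and field names are assumptions, and the sketch covers neither other size notations nor the inter-part constraints.

```javascript
// Construction codes used in metric-style tire sizes.
const CONSTRUCTIONS = { R: "radial", D: "diagonal (bias ply)", B: "belted bias" };

// Split a size like "P215/65R15" into its parts; returns null if the
// string does not have this shape.
function parseTireSize(s) {
  const m = /^(P|LT)?(\d{3})\/(\d{2})([RDB])(\d{2})$/.exec(s.trim().toUpperCase());
  if (!m) return null;
  return {
    service: m[1] || null,              // P = passenger, LT = light truck
    sectionWidthMm: Number(m[2]),       // cross section width
    aspectRatioPercent: Number(m[3]),   // sidewall height as % of width
    construction: CONSTRUCTIONS[m[4]],  // internal construction
    rimDiameterIn: Number(m[5]),
  };
}
```

Once the parts have names, a check such as "is this width actually sold for this rim diameter?" becomes a table lookup on parts instead of a longer regexp.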

Summary

• Real data is full of input errors.

• Real validation is currently hard to write.

• Topes enable accurate, convenient validation by capturing soft constraints and the multiple formats of real data.

• Please email me some vignettes of your own.

Thank you…

• … to lots of people (including Mary Shaw, Brad Myers, Jim Herbsleb, and Jonathan Aldrich) for encouraging feedback about topes and lots of suggestions.

• … to anybody who emails me a vignette.

• … to NSF and EUSES for funding (ITR-0325273 and CCF-0438929).