unsupervised inference of data formats in human-readable notation christopher scaffidi carnegie...

26
Unsupervised Inference Unsupervised Inference of of Data Formats in Data Formats in Human-Readable Notation Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

Post on 19-Dec-2015

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

Unsupervised Inference of Unsupervised Inference of Data Formats inData Formats in

Human-Readable NotationHuman-Readable Notation

Christopher Scaffidi

Carnegie Mellon University

Page 2: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

22

Target audienceTarget audience

• In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces.

• Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”).

• For professional programmers, programs are a deliverable.

• For EUPs, programs are a means to an end.

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 3: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

33

Current practice:Current practice:Storing data as stringsStoring data as strings

• Typical tools of EUPs:– Excel “text” cells

– Access/MSSQL “varchar” fields

– FrontPage/Dreamweaver “textfield” inputs

• Validation involves…– Learning an exotic new notation (VBScript, regexps, etc)

– Writing cumbersome expressions in that notation

• Most EUPs do not know these notations and have no time, interest, or incentive to learn the notations.

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 4: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

44

Our Topes system to date…Our Topes system to date…

• Formats are presented in human-readable notationin our format editor– Format = sequence of parts with constraints on parts

– Constraints can be “often” true (rather than “always”)

• The format is automatically converted to a context-free grammar, with constraints attached to productions.

• At runtime, our parser checks values against formats, returning a confidence in the range [0,1] for each value.

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 5: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

55

Needed: Format inferenceNeeded: Format inference

• Problem: To date, the system offers limited support for helping users to get started.

• Although users do not need to learn specialized notation, there is still the cognitive work of…– examining data

– breaking it into parts

– representing parts in the format

• Solution: Infer a boilerplate format from examples

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 6: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

66

Talk OutlineTalk Outline

• Motivation / Problem

• Solution– Overview of Topei

– Inference algorithm

• Evaluation

• Conclusion

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 7: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

77

PrototypePrototypeTask flow diagramTask flow diagram

Algorithm infers a format from cell

values

User reviews and customizes

format

User creates a format from

scratch

User loads an existing format

from a file

Plug-in flags cells that don’t match format

User highlights spreadsheet

cells

[1][6]

or

or

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 8: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

88

Sample task: validating a spreadsheetSample task: validating a spreadsheetwith the prototype we have builtwith the prototype we have built

• The second column is “supposed” to contain first names, but some outlier values containing initials have snuck in.

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 9: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

99

Sample task: validating a spreadsheetSample task: validating a spreadsheetCustomizing an inferred formatCustomizing an inferred format

• Inferred format is presented in editor with sentence-like prompts to improve human-readability

• User can specify meaningful names for parts

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 10: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1010

Sample task: validating a spreadsheetSample task: validating a spreadsheetCustomizing constraints in our prototypeCustomizing constraints in our prototype

• User can add/edit constraints

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 11: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1111

Sample task: validating a spreadsheetSample task: validating a spreadsheetFlagging potential errorsFlagging potential errors

• A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 12: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1212

Our algorithm has 2 phasesOur algorithm has 2 phases

Input: An array of strings

Phase 1: Identify format parts

Phase 2: Identify constraints on each part of each format

Output: An array of formats– Sorted according to how many examples they match

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 13: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1313

Phase 1: Identify format partsPhase 1: Identify format parts

• For each string, replace each character with its class, then collapse runs, generating a string “signature”– Supported character classes:

A uppercase letter

a lowercase letter

0 digit

• Example:[email protected] [email protected]

[email protected] [email protected]

[email protected] [email protected]

[email protected] [email protected]

[email protected] [email protected]

[email protected] [email protected]

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 14: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1414

Phase 1: Identify format partsPhase 1: Identify format parts

• Pack strings with identical signatures (often leads to significant performance improvement)

• Example:[email protected] [email protected]

[email protected] [email protected]

[email protected] <<packed with 1st, above>>

[email protected] [email protected]

[email protected] [email protected]

[email protected] [email protected]

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 15: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1515

Phase 1: Identify format partsPhase 1: Identify format parts

• Align signatures based on separators

• Example:[email protected] a @ a . a

[email protected] a0 @ a . a

[email protected] A @ A . a

[email protected] [email protected]

[email protected] [email protected]

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 16: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1616

Phase 1: Identify format partsPhase 1: Identify format parts

• Abstract to least general composite character class, yielding the parts of each format.

• Example (3 formats below):[email protected] a @ a . a

[email protected] a0 @ a . a

[email protected] A @ A . a

a0A aA a

[email protected] a @ a - a . a

[email protected] a . a @ a . a . a

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 17: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1717

Phase 2: Identify constraints on each partPhase 2: Identify constraints on each part

• Constrain each part’s contents to the character classes

• Require indicated separators before/after parts

• Infer an additional content constraint that is “often” true:– Must be in a set of 3 or fewer literals?

– Must be in a numeric range?

– Must start with or end with certain characters?

• A content constraint is inferred if it covers at least 95% of the examples supporting that format’s signature.

• Afterward, the user can review/customize format.

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 18: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1818

Evaluation as an outlier finderEvaluation as an outlier finder

• Outlier finding:– Infer a format from example values

– Use the inferred format to check the examples Reveals “outliers” that might contain typos or other errors

• Comparison algorithm: LapisLapis example@DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing@ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing@ShortYear is Number equal to /\d\d/ ignoring nothingDate is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 19: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

1919

Evaluation dataEvaluation dataDrawn from EUSES spreadsheet corpusDrawn from EUSES spreadsheet corpus

• 6288 US phone numbers in 37 columns– First cell in column contains “phone”

– And at least 20 cells have exactly 10 digits

– And at least 2/3 of cells have exactly 10 digits

• 1124 country names in 7 columns– First cell in column contains “country”

– And there are at least 20 cells

– And at least one cell contains “Portugal”

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 20: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2020

Run each algorithm (Topei & Lapis)Run each algorithm (Topei & Lapis)and compare their output to hand-labelingand compare their output to hand-labeling

• For determining “true outliers” in calculating accuracy:– Outlier phone numbers have an area code that is

not in service, or if they contain errant separators such as spaces not shared by most cells in the column.

– Outlier country names contain abbreviations, misspellings or a different name than the one usually used by English-speakers, except for a specific list of allowed exceptions that are very commonly used (e.g.: Brasil, US, UK) [note: allowing these exceptions hurts Topei’s accuracy]

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 21: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2121

Results: Topei’s precision/recall Results: Topei’s precision/recall exceed Lapis’sexceed Lapis’s

• Standard machine learning measures for outlier finding– Precision = # outliers found / # outliers claimed– Recall = # outliers found / # true outliers

Task Algorithm Precision (%) Recall (%)

Country Topei 56.5 94.6

Country Lapis 46.7 7.6

Phone Topei 97.7 99.8

Phone Lapis 44.0 2.4

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 22: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2222

Limitations & future workLimitations & future work

• Topei still makes mistakes:– Doesn’t infer constraints aggressively enough

– Doesn’t recognize non-ASCII chars in character classes

– Doesn’t handle formats with repeating parts

• Need deeper integration with EUPs’ tools

• Computational complexity:– Is intended to be O(# examples), seems to be true

– More careful verification needed

• Usability has not yet been evaluated in user study

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 23: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2323

Thank You…Thank You…

• …to you for your interest and attention

• …to INSTICC for the opportunity to present

• …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929)

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 24: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2424

Another example:Another example:Carnegie Mellon University phone #: 8-1234Carnegie Mellon University phone #: 8-1234

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 25: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2525

IntegrationIntegration

• Formats can be inferred from…– Spreadsheet cells

– Database queries (e.g.: Access/MSSQL)

– Arbitrary collection of text strings (via C#)

• Formats can then be used without modification in other venues, as well.– E.g.: infer a format from spreadsheet cells, then use it to

create a trigger for a database table

motivation ● overview ● algorithm ● evaluation ● conclusion

Page 26: Unsupervised Inference of Data Formats in Human-Readable Notation Christopher Scaffidi Carnegie Mellon University

2626

Related WorkRelated Work

• Many algorithms train a recognizer to notice features – See (Mitchell, 1997) for a summary

– Such algorithms do not infer a human-editable format.

• Others generate formats in specialized notation. – (Miller, 2001) (Blackwell, 2001) (Lerman, 2000),

(Lieberman, 2001) (Nardi, 1998)

– Regular expressions and CFGs have limited readability.

• Several tools recognize or manipulate some of the same kinds of data as Topei.– (Hong, 2006) (Pandit, 1997) (Stylos 2004)

– Custom formats are unsupported (only hardcoded formats)

motivation ● overview ● algorithm ● evaluation ● conclusion