learning semantic string transformations from examples

Post on 24-Feb-2016

Learning Semantic String Transformations from Examples

Rishabh Singh and Sumit Gulwani

FlashFill

Transformations
• Syntactic Transformations
– Concatenation of regular-expression-based substrings
– “VLDB2012” → “VLDB”
• Semantic Transformations
– More than just characters
– “1/5/2010” → “May 1st 2010”

Semantic Transformations
• Semantic information as relational tables
– 1 → January, 2 → February
• Learn table lookup queries
– VLOOKUP macro 2nd most problematic

Outline
• Lookup Transformations
• Lookup + Syntactic Transformations
• Case Studies

Table Lookup Transformations

Demo

Learning Framework

[Diagram: Input Strings → F → Output String, with candidate programs F1, …, Fn]

1. Domain-specific Language L
2. Algorithm to learn all Fs from (i,o)

Lookup Transformation Language

Example - Lookup

EmpRecord
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan
023-34-3254  6418   Mary Dina

Input v1     Output
044-58-3429  Steve Russell

Select(Name, EmpRecord, (SSN = v1))
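The lookup query on this slide can be read as "return the Name of the row whose SSN equals the input". A minimal Python sketch of that reading follows. The list-of-dicts table encoding and the `select` helper are assumptions made for illustration, not the FlashFill implementation.

```python
# Hypothetical in-memory encoding of the EmpRecord table from the slide.
EMP_RECORD = [
    {"SSN": "027-36-4557", "EmpId": "1254", "Name": "John Henry"},
    {"SSN": "034-83-7683", "EmpId": "2412", "Name": "William Johnson"},
    {"SSN": "044-58-3429", "EmpId": "1125", "Name": "Steve Russell"},
    {"SSN": "018-45-8949", "EmpId": "4257", "Name": "Ian Jordan"},
    {"SSN": "023-34-3254", "EmpId": "6418", "Name": "Mary Dina"},
]


def select(column, table, condition):
    """Return `column` of the unique row satisfying `condition`.

    `condition` maps column names to required values (a conjunction of
    equality predicates). Returns None if zero or several rows match.
    """
    matches = [row for row in table
               if all(row[c] == v for c, v in condition.items())]
    return matches[0][column] if len(matches) == 1 else None


v1 = "044-58-3429"
# Mirrors Select(Name, EmpRecord, (SSN = v1)) from the slide.
print(select("Name", EMP_RECORD, {"SSN": v1}))
```

Returning None on an ambiguous match is a design choice of this sketch: it keeps the lookup well defined only when the condition behaves like a key.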

Example – Transitive Lookup

ItemRec
ItemId  Item
ST-340  Stroller
BI-567  Bib
DI-328  Diapers
WI-989  Wipes
AS-469  Aspirator

PriceRec
ItemId  Price
ST-340  $145.67
BI-567  $3.56
DI-328  $21.45
WI-989  $5.12
AS-469  $2.56

Input v1  Output
Stroller  $145.67

Select(Price, PriceRec, (ItemId = Select(ItemId, ItemRec, Item = v1)))
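The transitive lookup chains two queries: the inner Select maps an item name to its ItemId, and the outer Select maps that ItemId to a price. A small Python sketch of the nesting is given here. The table encoding and the `select` and `price_of` helpers are hypothetical names used only for illustration.

```python
# Hypothetical encodings of the two tables shown on the slide.
ITEM_REC = [
    {"ItemId": "ST-340", "Item": "Stroller"},
    {"ItemId": "BI-567", "Item": "Bib"},
    {"ItemId": "DI-328", "Item": "Diapers"},
    {"ItemId": "WI-989", "Item": "Wipes"},
    {"ItemId": "AS-469", "Item": "Aspirator"},
]
PRICE_REC = [
    {"ItemId": "ST-340", "Price": "$145.67"},
    {"ItemId": "BI-567", "Price": "$3.56"},
    {"ItemId": "DI-328", "Price": "$21.45"},
    {"ItemId": "WI-989", "Price": "$5.12"},
    {"ItemId": "AS-469", "Price": "$2.56"},
]


def select(column, table, condition):
    """Return `column` of the unique row matching all equalities, else None."""
    matches = [row for row in table
               if all(row[c] == v for c, v in condition.items())]
    return matches[0][column] if len(matches) == 1 else None


def price_of(v1):
    """Select(Price, PriceRec, ItemId = Select(ItemId, ItemRec, Item = v1))."""
    item_id = select("ItemId", ITEM_REC, {"Item": v1})   # inner lookup
    return select("Price", PRICE_REC, {"ItemId": item_id})  # outer lookup


print(price_of("Stroller"))
```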

Learn Query

Synthesis Algorithm: GenerateStr_t
• Input: (input state, output string)
• Output: all conforming expressions
• Reachability algorithm from input strings

Strings reachable from input row: 044-58-3429

EmpRecord
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan

Nodes: $\eta_1$, $\eta_2$, $\eta_3$
$Progs[\eta_1] = \{v_1\}$

GenerateStr_t

Strings in table rows of visited nodes: 044-58-3429, 1125, Steve Russell

$B \equiv \{ \bigwedge_i C_i = \{ val^{-1}(T[C_i, r]) \} \}_j$

GenerateStr_t

… Repeat until k steps or fixpoint

GenerateStr_t

… Steve Russell

$\eta$, $Progs[\eta]$

GenerateStr_t
• Sound and k-complete
– t: number of reachable strings
– p: number of candidate keys
– m: maximum size of a candidate key

Data structure
• Maintains tree structure
– share common sub-expressions
• CNF of Boolean Conditionals
– independent column predicates

Intersect_t: D_t1 ∧ D_t2

Synthesize Procedure

Synthesize((i1,o1), …, (in,on))
  P = GenerateStr_t(i1,o1)
  for j = 2 to n:
    P’ = GenerateStr_t(ij,oj)
    P = Intersect_t(P’, P)
  return P
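The Synthesize loop generates all programs consistent with the first example and then intersects with the programs for each further example. The sketch below shows that loop in Python with two loud simplifications: programs are restricted to single lookups encoded as (output column, key column) pairs, and the sets are explicit rather than held in the succinct data structure from the talk. The `PEOPLE` table is made up for illustration and does not come from the slides.

```python
# Hypothetical table: the first row is ambiguous (Name == Nick), the second is not.
PEOPLE = [
    {"Id": "a1", "Name": "Ann", "Nick": "Ann"},
    {"Id": "b2", "Name": "Robert", "Nick": "Bob"},
]


def generate(inp, out, table):
    """All single-lookup programs (out_col, key_col) that map `inp` to `out`."""
    programs = set()
    for row in table:
        for key_col, key_val in row.items():
            if key_val != inp:
                continue
            for out_col, out_val in row.items():
                if out_col != key_col and out_val == out:
                    programs.add((out_col, key_col))
    return programs


def synthesize(examples, table):
    """Intersect the program sets of all input-output examples."""
    (first_in, first_out), *rest = examples
    p = generate(first_in, first_out, table)
    for inp, out in rest:
        p = p & generate(inp, out, table)   # plays the role of Intersect
    return p


print(synthesize([("a1", "Ann")], PEOPLE))                    # two candidates
print(synthesize([("a1", "Ann"), ("b2", "Robert")], PEOPLE))  # one candidate
```

The second example acts as a disambiguating input: it removes the program that happened to fit the first row by coincidence.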

Semantic String Transformations

Demo

Syntactic String Language [GulwaniPOPL11]

Combined Language

Syntactic manipulations over lookup outputs

Syntactic manipulations before indexing

Synthesis Algorithm:
– Reachability based on syntactic string matches
– Boolean conditionals

GenerateStr_u

Input: SSN: 044-58-3429

EmpRecord
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan

Output: Mr. Steve Russell

GenerateStr′_t

GenerateStr_u

Set of reachable strings: { “SSN: 044-58-3429”, “044-58-3429”, “1125”, “Steve Russell” }

GenerateStr_u

GenerateStr_s

{ “SSN: 044-58-3429”, “044-58-3429”, “1125”, “Steve Russell” }

Mr. Steve Russell

and in paper
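The combined algorithm first collects the strings reachable from the input, where a table row becomes reachable as soon as one of its cells occurs as a substring of a known string, and then explains the output syntactically in terms of those strings. The Python below is a rough sketch of the two phases. Both helper names are hypothetical, and the second phase only handles "constant prefix + reachable string", which is far narrower than the syntactic language of [Gulwani POPL11].

```python
EMP_RECORD = [
    {"SSN": "027-36-4557", "EmpId": "1254", "Name": "John Henry"},
    {"SSN": "034-83-7683", "EmpId": "2412", "Name": "William Johnson"},
    {"SSN": "044-58-3429", "EmpId": "1125", "Name": "Steve Russell"},
    {"SSN": "018-45-8949", "EmpId": "4257", "Name": "Ian Jordan"},
]


def reachable_strings(inp, table):
    """Input string plus all cells of rows matched via substring occurrence."""
    known = [inp]
    changed = True
    while changed:                      # repeat until no new string is found
        changed = False
        for row in table:
            cells = list(row.values())
            if any(cell in s for cell in cells for s in known):
                for cell in cells:
                    if cell not in known:
                        known.append(cell)
                        changed = True
    return known


def explain(output, known):
    """(constant prefix, reachable string) pairs whose concatenation is output."""
    return [(output[:-len(s)], s) for s in known if s and output.endswith(s)]


known = reachable_strings("SSN: 044-58-3429", EMP_RECORD)
print(known)
print(explain("Mr. Steve Russell", known))
```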

Experiments
• 50 benchmark problems
– 12 , 38
• ~10^20 consistent expressions
– Size of data structure: ~2000
• Performance: 96% in less than 1 second
• Ranking: at most 3 examples (95% 2 examples)

Related Work
• Matching strings for table joins
– Record Matching [Elmagarmid et al. 07, Koudas et al. SIGMOD06]
– Schema Matching [Dhamankar et al. SIGMOD04, Warren & Tompa VLDB06]
• Query Synthesis
– from representative view [Das Sarma et al. ICDT10, Tran et al. SIGMOD09]
• Text-editing by example
– QuickCode [Gulwani POPL11]
– SMARTedit [Lau et al. ML03], Simultaneous Editing [Miller et al. USENIX01]

Thanks!

 End-Users

Algorithm Designers

Software Developers

Large potential

Backup slides

Semantic String Transformations

Time (24 Hr)  Time (12 Hr)
0930          9:30 AM
1520          3:20 PM
1648
0830
1015
2010
1012
1425

=TEXT(C,"00 00")+0

Semantic String Transformations

Date        Formatted Date
06-03-2008  Jun 3rd, 2008
03-26-2010
08-01-2009
09-24-2007
05-14-2010
07-20-1998
10-24-2004
08-24-1972
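The date example combines syntactic splitting with semantic knowledge: the month number must be mapped to a month name and the day needs an ordinal suffix. The Python sketch below hand-writes one such transformation to show which pieces of background knowledge are involved. It is not a synthesized program. Only "Jun" and "3rd" appear on the slide, so the other month abbreviations, the MM-DD-YYYY reading of the input, and the function names are assumptions.

```python
# Assumed month table (only "06" -> "Jun" is confirmed by the slide).
MONTH = {
    "01": "Jan", "02": "Feb", "03": "Mar", "04": "Apr",
    "05": "May", "06": "Jun", "07": "Jul", "08": "Aug",
    "09": "Sep", "10": "Oct", "11": "Nov", "12": "Dec",
}


def ordinal_suffix(day):
    """English ordinal suffix: 1st, 2nd, 3rd, 4th, ..., 11th, 12th, 13th."""
    if 11 <= day % 100 <= 13:
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")


def format_date(text):
    """Turn 'MM-DD-YYYY' into e.g. 'Jun 3rd, 2008'."""
    month, day_text, year = text.split("-")     # syntactic step
    day = int(day_text)
    # Semantic steps: table lookup for the month, rule for the suffix.
    return f"{MONTH[month]} {day}{ordinal_suffix(day)}, {year}"


print(format_date("06-03-2008"))
```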

Idea 1: Share sub-expressions

T1
C1 C2 C3
s1 s2 s3

T2
C1 C2 C3
s2 s3 s4

T3
C1 C2 C3
s3 s4 s5

e ≡ Select(C2, T1, C1=v1), which evaluates to s2

Select(C3, T2, C1=e)

Select(C2, T3, C1=Select(C2,T2,C1=e))

YouTube Videos: French, Polish, Urdu, German, Serbian, Russian

http://bit.ly/flashfill

Idea 2: CNF conditionals

T
C1 C2 C3 … Cn Cn+1
s  s  s  …  s  t

v1 v2 … vm Out
s  s  …  s  t

Large number of consistent expressions

[Figure: No. of Consistent Expressions. x-axis: Benchmarks (1 to 49); y-axis: Number of expressions, log scale from 1 to 1E+036]

Succinct Representation

[Figure: Succinct Representation. x-axis: Benchmarks (1 to 49); y-axis: Size of Data Structure, ticks from 500 to 2,000]

Performance

[Figure: Running Time. x-axis: Benchmarks (1 to 46); y-axis: Running Time (in seconds), from 0.00 to 12.00]

Ranking

[Figure: Ranking Measure. x-axis: Number of I/O examples (1 to 3); y-axis: Number of Benchmarks, from 0 to 40]

Idea 2: CNF conditionals

$\{\{\eta_1, \eta_2\}, \eta_2, Progs\}$

$Progs[\eta_1] \equiv \{v_1, v_2, \cdots, v_m\}$

$Progs[\eta_2] = \{Select(C_{n+1}, T, \bigwedge_i C_i = \{s, \eta_1\})\}$

$m+1$

$\Theta((m+1)^n)$
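A short worked count may help connect the two quantities on this slide. It rests on one reading of the formulas, stated here as an assumption: each of the $n$ column predicates $C_i = \{s, \eta_1\}$ can independently compare its column with the constant $s$ or with any of the $m$ input variables collected in $\eta_1$.

```latex
% Assumption: the n predicates are chosen independently,
% each with (m + 1) alternatives: the constant s, or one of v_1, ..., v_m.
\[
  \underbrace{(m+1)\cdot(m+1)\cdots(m+1)}_{n \text{ predicates}}
  \;=\; (m+1)^n
  \quad\text{explicit conditionals,}
\]
\[
  \text{while the CNF form stores } n \text{ predicates } \{s,\eta_1\}
  \text{ and the } m \text{ entries of } Progs[\eta_1] \text{ once:}
  \quad 2n + m \text{ entries.}
\]
% Example: m = 2, n = 3 gives 3^3 = 27 conditionals but 2*3 + 2 = 8 stored entries.
```

Under that reading, keeping the column predicates independent is what avoids enumerating every combination.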

GenerateStr_t

$val(\eta)$: string value

$Progs[\eta]$: set of lookup programs to generate $val(\eta)$

$val^{-1}(s)$: node $\eta$ with $val(\eta) = s$

Related Work
• Record Matching
– Similarity functions for matching [Elmagarmid et al. 07, Koudas et al. SIGMOD06]
– Customizable similarity function [Arasu et al. VLDB09]
• Learning Schema Matches
– iMAP [Dhamankar et al. SIGMOD04]: concatenation of column strings using domain-specific knowledge
– [Warren & Tompa VLDB06]: concatenation of column substrings, single table

Related Work
• Query Synthesis [Das Sarma et al. ICDT10, Tran et al. SIGMOD09]
– Infer relation from large representative example view
– no joins or projections
• Text-editing using examples
– QuickCode [Gulwani POPL11]: string transformations
– SMARTedit [Lau et al. ML03], Simultaneous Editing [Miller et al. USENIX01]: programming by demonstration

General Framework
• A Domain-specific Transformation Language L
– Expressive and succinct
• Efficient data structures for sets of expressions
– Version-space algebra
• GenerateStr
– Set of all expressions conforming to an I-O example
• Intersect
– Intersect two sets of expressions

Example - Lookup

EmpRecord
SSN          EmpId  Name
027-36-4557  1254   John Henry
034-83-7683  2412   William Johnson
044-58-3429  1125   Steve Russell
018-45-8949  4257   Ian Jordan
023-34-3254  6418   Mary Dina

Input v1     Output
044-58-3429  Steve Russell
023-34-3254

Select(Name, EmpRecord, (SSN = v1))

Example – Transitive Lookups

ItemRec
ItemId  Item
ST-340  Stroller
BI-567  Bib
DI-328  Diapers
WI-989  Wipes
AS-469  Aspirator

PriceRec
ItemId  Price
ST-340  $145.67
BI-567  $3.56
DI-328  $21.45
WI-989  $5.12
AS-469  $2.56

Input v1   Output
Stroller   $145.67
Bib
Aspirator
Wipes

Select(Price, PriceRec, (ItemId = Select(ItemId, ItemRec, Item = v1)))

Data Structure

Data structure for expressions

Example

T1
C1 C2 C3
s1 s2 s3

T2
C1 C2 C3
s2 s3 s4

…

Ti-2
C1 C2 C3
si-2 si-1 si

Ti-1
C1 C2 C3
si-1 si si+1

Ti
C1 C2 C3
si si+1 si+2

… Tm

Input v1  Output
s1        sm

Sub-expression Sharing

[Diagram: node $\eta_i$ (string $s_i$) is reached from nodes $\eta_{i-1}$ (string $s_{i-1}$) and $\eta_{i-2}$ (string $s_{i-2}$)]

Sub-expression Sharing

$N(i) = N(i-1) + N(i-2)$

$N(i) = \Theta(2^i)$

$\{\{\eta_1, \eta_2, \cdots, \eta_m\}, \eta_m, Progs\}$

$Progs[\eta_1] \equiv \{v_1\}$

$Progs[\eta_2] = \{Select(C_2, T_1, C_1 = \{s_1, \eta_1\})\}$
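The recurrence says that the number of distinct expressions producing $s_i$ is the sum of the counts for the two strings it can be looked up from, which is Fibonacci-style exponential growth, while a shared representation needs only a couple of entries per node. The Python below compares the two numbers. The base cases N(1) = N(2) = 1 and the "at most two Select entries per node" accounting are assumptions of this sketch (constant keys such as s1 are ignored), not figures from the talk.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def num_expressions(i):
    """Explicit expression count, N(i) = N(i-1) + N(i-2).

    Assumed base cases: s1 is produced only by v1, and s2 only by a
    lookup in T1 keyed on s1, so N(1) = N(2) = 1.
    """
    if i <= 2:
        return 1
    return num_expressions(i - 1) + num_expressions(i - 2)


def shared_entries(m):
    """Entries stored when each node refers to its predecessor nodes.

    Node 1 stores v1, node 2 stores one Select, and every later node
    stores two Selects (one keyed on each of the two previous nodes).
    """
    return sum(1 if i <= 2 else 2 for i in range(1, m + 1))


for m in (5, 10, 20, 30):
    print(m, num_expressions(m), shared_entries(m))
```

The gap between the two columns grows quickly, which is the motivation for storing references to nodes instead of expanded sub-expressions.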

Intersect_t: D_t1 ∧ D_t2

Current State of the Art: Help forums

Observations
• Semantic string transformations
• Input-output examples based interaction
– New disambiguating inputs
• Add-in with the same interface
