

Page 1: Efficient Approximate String Matching with Synonyms and

Matemaattis-luonnontieteellinen tiedekunta (Faculty of Science)

PhD Defense

Efficient Approximate String Matching with Synonyms and Taxonomies

Pengfei Xu

Custos: Professor Jiaheng Lu, University of Helsinki

Opponent: Professor Jan Holub, Czech Technical University in Prague

Page 2: Efficient Approximate String Matching with Synonyms and

Lectio Praecursoria

§ Motivation, challenge, and research problems

§ Main contributions of this thesis

§ Impacts of this research

Page 3: Efficient Approximate String Matching with Synonyms and

Motivation

Typos… should be corrected

Page 4: Efficient Approximate String Matching with Synonyms and

Motivation

Q: A ≈ B and C ≈ D?

A: Yes.

Page 5: Efficient Approximate String Matching with Synonyms and

Motivation

We want data to be clean and consistent, but data in reality is often far from that.

Page 6: Efficient Approximate String Matching with Synonyms and

Research Problems

What we want: most similar records; all similar records

Page 7: Efficient Approximate String Matching with Synonyms and

Research Problems

What we want: most similar records; all similar records

What we have in data: typographic diffs; semantic diffs (synonym, taxonomy)

Examples shown: bodygauard, DeutscheBahn


Page 9: Efficient Approximate String Matching with Synonyms and

Research Questions

What we want: most similar records; all similar records

What we have in data: typographic diffs; semantic diffs (synonym, taxonomy)

Examples shown: bodygauard, DeutscheBahn

Research questions marked on the grid: RQ1, RQ2, RQ3, RQ4

Page 10: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ1)

RQ1: find most similar records given a query string w.r.t. synonyms

Twin Tries: least space cost, slow lookup speed

Expansion Trie: most space cost, fast lookup speed

Hybrid Trie: moderate space cost, moderate (on the faster side) lookup speed

Page 11: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ1)

Twin tries

§ Records and synonym rules are kept in separate tries

§ Saves space but slows down the lookup

Expansion trie

§ Attach synonyms directly to records

§ Uses more space but gives fast lookup

Hybrid trie

§ Select some rules and attach them to records; leave the others in a separate trie

§ Branch and bound to solve a knapsack problem

§ Moderate space (under a threshold) and lookup time
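To make the space-versus-speed trade-off more concrete, the following is a minimal Python sketch of the expansion-trie idea only: every synonym variant of a record is stored in the same trie and points back to the original record, so a lookup is a single trie walk. The class name `ExpansionTrie`, the rule format (a pair of substrings `(lhs, rhs)`), and the unranked result set are assumptions made for illustration; they are not taken from the thesis.

```python
from collections import defaultdict


class ExpansionTrie:
    """Toy trie that stores synonym variants next to the original records."""

    def __init__(self):
        self.children = defaultdict(ExpansionTrie)
        self.records = set()  # original records whose (variant) string ends here

    def _insert(self, key, record):
        node = self
        for ch in key:
            node = node.children[ch]
        node.records.add(record)

    def add(self, record, rules):
        """Index a record and each variant produced by one synonym rule."""
        self._insert(record, record)
        for lhs, rhs in rules:  # hypothetical rule format: replace lhs by rhs
            if lhs in record:
                self._insert(record.replace(lhs, rhs), record)

    def complete(self, prefix):
        """Return the original records reachable from the typed prefix."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        found, stack = set(), [node]
        while stack:
            current = stack.pop()
            found |= current.records
            stack.extend(current.children.values())
        return found


rules = [("Deutsche Bahn", "DB")]
trie = ExpansionTrie()
trie.add("Deutsche Bahn tickets", rules)
trie.add("Delta flights", rules)
print(trie.complete("DB"))  # {'Deutsche Bahn tickets'}
```

The extra variant strings are what makes this structure the most space-hungry of the three; a twin-trie design would instead keep the rules in their own trie and consult both during lookup.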

Page 12: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ1)

[Figures: space cost; lookup speed]

Page 13: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ2)

RQ2: find most similar records between two sets w.r.t. taxonomies

[Figure: Sorted List, LP-Sorted List, and Trie. Legend: intermediate node; real node in record]

Page 14: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ2)

1. Place two pointers on the first record

2. Advance the large pointer until the records are no longer similar

3. Advance the small pointer

4. Repeat Steps 2-3

5. Stop when both pointers reach the end
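The steps above read like a sliding window over a sorted list. A minimal Python sketch under that reading follows. The function name `sliding_window_pairs`, the `similar` predicate, and the toy "same first two taxonomy levels" test are all assumptions for illustration; the sketch also assumes the sort order keeps similar records contiguous, so the large pointer never has to move backwards.

```python
def sliding_window_pairs(records, similar):
    """Report similar pairs from a sorted list using two pointers.

    Assumption of this sketch: the sort order keeps similar records
    contiguous, so the large pointer only ever moves forward.
    """
    pairs = []
    n = len(records)
    large = 0
    for small in range(n):  # Steps 1 and 3: place / advance the small pointer
        large = max(large, small + 1)
        # Step 2: advance the large pointer until the records stop being similar
        while large < n and similar(records[small], records[large]):
            large += 1
        # Every record strictly between the pointers is similar to records[small]
        pairs.extend((records[small], records[k]) for k in range(small + 1, large))
    return pairs  # Step 5: both pointers have reached the end


def same_parent(a, b):
    """Toy predicate: two taxonomy paths agree on their first two levels."""
    return a[:2] == b[:2]


paths = sorted([(1, 5, 2), (2, 1, 1), (1, 5, 1), (1, 7, 3), (1, 5, 9)])
for pair in sliding_window_pairs(paths, same_parent):
    print(pair)
```

Because neither pointer moves backwards, the scan makes a linear number of pointer moves plus the work of emitting the reported pairs.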

Page 15: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ2)

The trie solution is able to skip visiting some records.

It uses ml (min length), the depth of the shallowest node among all descendants, to skip infeasible subtries.

[Figure: Trie. Legend: intermediate node; real node in record]

Page 16: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ2)

Prefix “1.5” in two tries leads to a similarity of at most 2 / max(3, 4) = 0.5.
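The arithmetic in the "1.5" example can be replayed with a small Python sketch. It assumes, as the example suggests, that the bound for a shared prefix is its depth divided by the larger of the two ml values. The nested-dict trie layout, the `is_record` marker, and the function names are hypothetical.

```python
def shallowest_record_depth(node, depth):
    """ml: depth of the shallowest record at or below this trie node."""
    best = depth if node.get("is_record") else float("inf")
    for label, child in node.items():
        if label != "is_record":
            best = min(best, shallowest_record_depth(child, depth + 1))
    return best


def similarity_upper_bound(prefix_depth, ml_s, ml_t):
    """Bound for record pairs whose common prefix ends at this depth."""
    return prefix_depth / max(ml_s, ml_t)


# Subtries hanging below the shared prefix "1.5" (depth 2) in the two tries.
below_s = {"2": {"is_record": True}}          # record 1.5.2, depth 3
below_t = {"3": {"1": {"is_record": True}}}   # record 1.5.3.1, depth 4

ml_s = shallowest_record_depth(below_s, 2)
ml_t = shallowest_record_depth(below_t, 2)
bound = similarity_upper_bound(2, ml_s, ml_t)
print(ml_s, ml_t, bound)  # 3 4 0.5
```

With a similarity threshold above 0.5, a search under these assumptions could skip pairing these two subtries at this prefix without visiting their records.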

Page 17: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ2)

[Figure: lookup speed]

Page 18: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ3)

RQ3: find all similar records between two sets w.r.t. taxonomies

Page 19: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ3)

§ Finding the maximal similarity is a bipartite matching problem (Hungarian algorithm)

§ Prefix filtering can be adapted to reduce the search space
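To illustrate why the maximal similarity is a matching problem, here is a small Python sketch that scores two records by the best one-to-one pairing of their elements. It uses brute force over permutations as a stand-in for the Hungarian algorithm, which is only viable for tiny records. The element similarity (longest common prefix over the longer path length) and all function names are assumptions, not the thesis's definitions.

```python
from itertools import permutations


def element_sim(a, b):
    """Assumed similarity of two non-empty taxonomy paths: LCP / longer length."""
    lcp = 0
    for x, y in zip(a, b):
        if x != y:
            break
        lcp += 1
    return lcp / max(len(a), len(b))


def best_matching_score(s, t):
    """Maximum total similarity over all one-to-one matchings (brute force)."""
    if len(s) > len(t):
        s, t = t, s
    best = 0.0
    for perm in permutations(t, len(s)):
        best = max(best, sum(element_sim(a, b) for a, b in zip(s, perm)))
    return best


s = [(1, 5, 2), (2, 1)]
t = [(2, 1), (1, 5, 3), (3,)]
print(best_matching_score(s, t))  # pairs (1,5,2)-(1,5,3) and (2,1)-(2,1)
```

The brute force tries every assignment, so its cost grows factorially; the Hungarian algorithm solves the same assignment problem in polynomial time.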

Page 20: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ3)

We extended the popular prefix filtering to allow multiple overlaps

§ Normal prefix filtering: if 𝑆 and 𝑇 have similarity 𝑡, they must share at least 1 prefix in the first (1 − 𝑡)|𝑆| + 1 (or (1 − 𝑡)|𝑇| + 1) prefixes

§ Our enhanced prefix filtering: if 𝑆 and 𝑇 have similarity 𝑡, they must share at least 𝑛 prefixes, each of which has a similarity of at least a given bound [formula not recoverable from the extracted text]
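The normal prefix filter stated above can be sketched in a few lines of Python. The sketch assumes a set-overlap similarity such as Jaccard and a global token order from rare to frequent; both are assumptions, as are the function names. The prefix length |S| − ⌈t|S|⌉ + 1 used in the code equals ⌊(1 − t)|S|⌋ + 1, the integer form of the expression on the slide.

```python
import math
from collections import Counter


def global_order(records):
    """Rank tokens from rare to frequent (ties broken alphabetically)."""
    freq = Counter(tok for rec in records for tok in set(rec))
    ranked = sorted(freq, key=lambda tok: (freq[tok], tok))
    return {tok: rank for rank, tok in enumerate(ranked)}


def prefix(record, t, order):
    """First |S| - ceil(t * |S|) + 1 tokens of the record in global order."""
    ranked = sorted(set(record), key=order.__getitem__)
    # The small epsilon guards against floating-point noise in t * |S|.
    k = len(ranked) - math.ceil(t * len(ranked) - 1e-9) + 1
    return set(ranked[:k])


def is_candidate(s, r, t, order):
    """Keep a pair only if the two prefixes share at least one token."""
    return not prefix(s, t, order).isdisjoint(prefix(r, t, order))


r1 = ["efficient", "string", "matching", "synonyms", "taxonomies"]
r2 = ["efficient", "string", "matching", "synonyms", "taxonomy"]
r3 = ["bipartite", "graph", "matching"]
order = global_order([r1, r2, r3])
print(is_candidate(r1, r2, 0.6, order))  # True: the pair goes on to verification
print(is_candidate(r1, r3, 0.6, order))  # False: pruned without verification
```

Pairs that pass the filter are only candidates; their real similarity still has to be computed. The enhanced filter described above would require several shared prefixes rather than one.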


Page 22: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ3)

Iterative Bernoulli sampling:

1. Pick small samples, run the filtering and join algorithm

2. Scale the time up to the whole dataset

3. Repeat the above steps and return the mean of all estimated values
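The three steps can be sketched in Python as follows. The slides do not say how the measured time is scaled, so the quadratic rule used here (divide by p squared, as for a self-join whose pair count shrinks with the square of the sampling rate) is purely an assumption. The function name, parameters, and the toy join are also hypothetical.

```python
import random
import statistics
import time


def estimate_join_time(records, join, p=0.1, rounds=5, seed=0):
    """Estimate the full running time of `join` from small Bernoulli samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(rounds):
        # Step 1: Bernoulli sample (keep each record independently with prob. p)
        sample = [rec for rec in records if rng.random() < p]
        start = time.perf_counter()
        join(sample)
        elapsed = time.perf_counter() - start
        # Step 2: scale up to the whole dataset (assumed quadratic scaling)
        estimates.append(elapsed / (p * p))
    # Step 3: return the mean of all estimates
    return statistics.mean(estimates)


def toy_self_join(sample):
    """Stand-in for the filtering and join algorithm: enumerate all pairs."""
    return [(a, b) for i, a in enumerate(sample) for b in sample[i + 1:]]


estimate = estimate_join_time(list(range(2000)), toy_self_join, p=0.1, rounds=3)
print(f"estimated full running time: {estimate:.4f} s")
```

Running such an estimate for several candidate settings is one way the cheapest setting could be chosen before committing to the full join.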

Page 23: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ4)

RQ4: find all similar records w.r.t. typos, synonyms, and taxonomies

Page 24: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ4)

What type of similarity should be used? → NP-hard problem: weighted MIS

Page 25: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ4)

Model the segmentation problem as a weighted MIS on a d-claw-free graph

§ Approximable in polynomial time

[Figure: a 4-claw and a (4+1)-claw-free graph]
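As an illustration of a polynomial-time heuristic for weighted MIS (maximum independent set), here is a greedy Python sketch: repeatedly take the heaviest remaining vertex and discard its neighbours. The slides do not say which approximation algorithm the thesis uses, so greedy is swapped in here only as a simple example. The graph encoding and names are hypothetical; vertices stand for candidate segments, and an edge joins two segments that cannot be chosen together.

```python
def greedy_weighted_mis(weights, edges):
    """Greedy heuristic: take the heaviest free vertex, block its neighbours."""
    neighbours = {v: set() for v in weights}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    chosen, blocked = [], set()
    for v in sorted(weights, key=weights.get, reverse=True):
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked |= neighbours[v]
    return chosen


# Four candidate segments; edges mark pairs that conflict with each other.
weights = {"a": 3, "b": 2, "c": 2, "d": 1}
edges = [("a", "b"), ("a", "c"), ("c", "d")]
picked = greedy_weighted_mis(weights, edges)
print(picked, sum(weights[v] for v in picked))  # ['a', 'd'] 4
```

The result is always an independent set, but it is not guaranteed to be optimal in general.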


Page 27: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ4)

The filtering framework

1. Generate pebbles for all similarities

[Figure: pipeline. Typo, Synonym, Taxonomy → Pebbles → Prefix Selection → Prefix Signature → Prefix Filtering → Candidates → Calculate similarity → Results]

Page 28: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ4)

The filtering framework

2. Select signature for prefix filtering

§ Heuristic: runs in quasi-linear time

§ DP: slower but produces shorter signatures

Page 29: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ4)

The filtering framework

3. Find record pairs having enough common prefixes

§ Allows finding multiple overlaps, with a sampling algorithm

Page 30: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ4)

The filtering framework

4. Verify the real similarity for each candidate

Page 31: Efficient Approximate String Matching with Synonyms and

Main Contributions (RQ4)

Results show a huge improvement over existing algorithms

Page 32: Efficient Approximate String Matching with Synonyms and

Research Impacts

§ Semantic knowledge, including synonyms and taxonomies, can be used for string matching tasks

§ Prefix filtering can be extended to increase the filtering power, accelerating the verification process

§ Multiple types of similarities can be integrated to discover more related records

Page 33: Efficient Approximate String Matching with Synonyms and

Publications

§ Top-k string auto-completion with synonyms, DASFAA 2017 (answers RQ1)

§ Efficient taxonomic similarity joins with adaptive overlap constraint, CIKM 2018 (answers RQ2)

§ Efficient string similarity join with taxonomy knowledge, Under review, KIS (answers RQ3)

§ Towards a unified framework for string similarity joins, VLDB 2019 (answers RQ4)

Page 34: Efficient Approximate String Matching with Synonyms and

Thesis

Pengfei Xu: Efficient Approximate String Matching with Synonyms and Taxonomies

University of Helsinki, Faculty of Science, Department of Computer Science

PhD Thesis, Series of Publications A, Report A-2021-2

ISSN 1238-8645

ISBN 978-951-51-6987-7 (paperback)

ISBN 978-951-51-6988-4 (PDF)

Unigrafia, Helsinki 2021

Available online for comments:

http://urn.fi/URN:ISBN:978-951-51-6988-4