Matemaattis-luonnontieteellinen tiedekuntaMatemaattis-luonnontieteellinen tiedekunta
Efficient Approximate String Matching with Synonyms and Taxonomies
Pengfei XuCustos: Professor Jiaheng Lu, University of Helsinki
Opponent: Professor Jan Holub, Czech Technical University in Prague
1
PhD Defense
Matemaattis-luonnontieteellinen tiedekunta
§ Motivation, challenge, and research problems§ Main contributions of this thesis§ Impacts of this research
Lectio Praecursoria
2
Matemaattis-luonnontieteellinen tiedekunta
Motivation
3
Typos…
… should be corrected
Matemaattis-luonnontieteellinen tiedekunta
Motivation
4
Q: A ≈ B and C ≈ D?A: Yes.
Matemaattis-luonnontieteellinen tiedekunta
We want the data be clean and consistent.But data in reality is often far from that
Motivation
5
Matemaattis-luonnontieteellinen tiedekunta
Research Problems
6
most similar records all similar recordsWhat we want:
Matemaattis-luonnontieteellinen tiedekunta
Research Problems
7
most similar records all similar records
typographic diffs semantic diffs
synonym
DeutscheBahn
taxonomy
What we want:
bodygauardWhat we have
in data:
Matemaattis-luonnontieteellinen tiedekunta
Research Problems
8
most similar records all similar records
typographic diffs semantic diffs
synonym
DeutscheBahn
taxonomy
What we want:
bodygauardWhat we have
in data:
Matemaattis-luonnontieteellinen tiedekunta
Research Questions
9
most similar records all similar records
typographic diffs semantic diffs
synonym
DeutscheBahn
taxonomy
What we want:
bodygauardWhat we have
in data:
RQ1
RQ2
RQ3
RQ4
Matemaattis-luonnontieteellinen tiedekunta
RQ1: find most similar records given a query string w.r.t. synonyms
Main Contributions (RQ1)
10
Twin TriesLeast space cost
Slow lookup speed
Expansion TrieMost space cost
Fast lookup speed
Hybrid TrieModerate space cost
Moderate (on the faster side) lookup speed
Matemaattis-luonnontieteellinen tiedekunta
Twin tries§ Records and synonym rules in separate trie§ Save space but slowdown the lookup
Expansion trie§ Attach synonyms directly to records§ Use more space but fast lookup
Hybrid trie§ Select some rules and attach them to records, left others
in a separate trie§ Branch and bound to solve a knapsack problem
§ Moderate space (under threshold) and lookup time
Main Contributions (RQ1)
11
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ1)
12
Space cost
Lookup speed
Matemaattis-luonnontieteellinen tiedekunta
RQ2: find most similar records between two sets w.r.t. taxonomies
Main Contributions (RQ2)
13
Sorted List LP-Sorted List
intermediate node
real node in record
Trie
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ2)
14
1. Two pointers on the first record
2. Advance the large pointer until not similar
3. Advance the small pointer
4. Repeat Steps 2-35. Stop when both pointers
reach the end
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ2)
15
Trie solution is able to skip visiting some records.
It makes uses of ml (min length) to tell the depth of the shallowest node among all descendants, skipping infesiable subtries.
intermediate node
real node in record
Trie
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ2)
16
Trie solution is able to skip visiting some records.
It makes uses of ml (min length) to tell the depth of the shallowest node among all descendants, skipping infesiable subtries.
intermediate node
real node in record
Prefix “1.5” in two tries lead to at most 2 / max(3, 4) = 0.5 similarity.
Trie
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ2)
17
Lookup speed
Matemaattis-luonnontieteellinen tiedekunta
RQ3: find all similar records between two sets w.r.t. taxonomies
Main Contributions (RQ3)
18
Matemaattis-luonnontieteellinen tiedekunta
§ Finding the maximal similarity is a bipartite matching problem (Hungarian algorithm)
§ Prefix filtering can be adapted to reduce the search space
Main Contributions (RQ3)
19
Matemaattis-luonnontieteellinen tiedekunta
We extended the popular prefix filtering to allow multiple overlaps§ Normal prefix filtering:
If 𝑆 and 𝑇 has similarity 𝑡, they must share at least 1 prefix in the first (1 − 𝑡) |𝑆| + 1 (or (1 − 𝑡) |𝑇| + 1 ) prefixes
§ Our enhanced prefix filtering:If 𝑆 and 𝑇 has similarity 𝑡, they must share at least 𝑛 prefix, each of which has a similarity at least ! " #$%&
' #$%&
Main Contributions (RQ3)
20
Matemaattis-luonnontieteellinen tiedekunta
We extended the popular prefix filtering to allow multiple overlaps§ Normal prefix filtering:
If 𝑆 and 𝑇 has similarity 𝑡, they must share at least 1 prefix in the first (1 − 𝑡) |𝑆| + 1 (or (1 − 𝑡) |𝑇| + 1 ) prefixes
§ Our enhanced prefix filtering:If 𝑆 and 𝑇 has similarity 𝑡, they must share at least 𝑛 prefix, each of which has a similarity at least ! " #$%&
' #$%&
Main Contributions (RQ3)
21
Matemaattis-luonnontieteellinen tiedekunta
Iterative Bernoulli sampling:1. pick small samples, run the filtering and join algorithm2. scale the time up to the whole dataset3. repeat above steps and return the mean of all
estimated values
Main Contributions (RQ3)
22
Matemaattis-luonnontieteellinen tiedekunta
RQ4: find all similar records w.r.t. typos, synonyms, and taxonomies
Main Contributions (RQ4)
23
Matemaattis-luonnontieteellinen tiedekunta
What type of similarity should be used?à NP-hard problem: weighted MIS
Main Contributions (RQ4)
24
Matemaattis-luonnontieteellinen tiedekunta
Model the segmentation problem as a weighted MIS on d-claw-free graph§ Approximable in polynomial time
Main Contributions (RQ4)
25
4+1-claw free graph4-claw
Matemaattis-luonnontieteellinen tiedekunta
Model the segmentation problem as a weighted MIS on d-claw-free graph§ Approximable in polynomial time
Main Contributions (RQ4)
26
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ4)
27
The filtering framework1. Generate pebbles for all similarities
Synonym TaxonomyTypo
Pebbles
Prefix Signature
Candidates
Results
Prefix Selection
Prefix Filtering
Calculate similarity
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ4)
28
The filtering framework1. Generate pebbles for all similarities2. Select signature for prefix filtering
Heuristicruns in a quasi-linear time
DPslower but produce shorter signatures
Synonym TaxonomyTypo
Pebbles
Prefix Signature
Candidates
Results
Prefix Selection
Prefix Filtering
Calculate similarity
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ4)
29
The filtering framework1. Generate pebbles for all similarities2. Select signature for prefix filtering3. Find record pairs having enough
common prefixes§ Allow to find multiple overlaps, with a
sampling algorithm
Synonym TaxonomyTypo
Pebbles
Prefix Signature
Candidates
Results
Prefix Selection
Prefix Filtering
Calculate similarity
Matemaattis-luonnontieteellinen tiedekunta
Main Contributions (RQ4)
30
The filtering framework1. Generate pebbles for all similarities2. Select signature for prefix filtering3. Find record pairs having enough
common prefixes4. Verify the real similarity for each
candidate
Synonym TaxonomyTypo
Pebbles
Prefix Signature
Candidates
Results
Prefix Selection
Prefix Filtering
Calculate similarity
Matemaattis-luonnontieteellinen tiedekunta
Result shows a huge improvement over existing algorithms
Main Contributions (RQ4)
31
Matemaattis-luonnontieteellinen tiedekunta
§ Semantic knowledge, including synonyms and taxonomies, can be used for string matching tasks
§ Prefix filtering can be extended to increase the filtering power, accelerating verification process
§ Multiple types of similarites can be intergrated to discover more related records
Research Impacts
32
Matemaattis-luonnontieteellinen tiedekunta
§ Top-k string auto-completion with synonyms, DASFAA 2017 (answers RQ1)
§ Efficient taxonomic similarity joins with adaptive overlap constraint, CIKM 2018 (answers RQ2)
§ Efficient string similarity join with taxonomy knowledge, Under review, KIS (answers RQ3)
§ Towards a unified framework for string similarity joins, VLDB 2019 (answers RQ4)
Publications
33
Matemaattis-luonnontieteellinen tiedekunta
Thesis
34
ISSN 1238-8645ISBN 978-951-51-6987-7 (PAPERBACK)
ISBN 978-951-51-6988-4 (PDF)UNIGRAFIA
HELSINKI 2021
UNIVERSITY OF HELSINKIFACULTY OF SCIENCE
EFFICIENT APPROXIMATE STRING MATCHING WITH SYNONYMS AND TAXONOMIESPENGFEI XU
PENGFEI XU | EFFICIENT APPROXIMATE STRING M
ATCHING WITH SYNO
NYMS AND TAXO
NOM
IES
DEPARTMENT OF COMPUTER SCIENCE PHD THESIS
A-2021-2
UNIVERSITY OF HELSINKIFACULTY OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCESERIES OF PUBLICATIONS A, REPORT A-2021-2
Available online for comments:
http://urn.fi/URN:ISBN:978-951-51-6988-4