Automating Entity Matching Model Development · jnwang/ppt/AutoML-EM-TR-Talk.pdf · 2021. 4. 22. · ...

Jiannan Wang · Automating Entity Matching Model Development · Simon Fraser University · April 14, 2021, Thomson Reuters · Pei Wang, Weiling Zheng, Jiannan Wang, Jian Pei. Automating Entity Matching Model Development. ICDE 2021, Chania, Greece.

Jiannan Wang

Automating Entity Matching Model Development

Simon Fraser University

April 14, 2021, Thomson Reuters

Pei Wang, Weiling Zheng, Jiannan Wang, Jian Pei. Automating Entity Matching Model Development. ICDE 2021, Chania, Greece.

SFU in the cloud

SFU Data Science Research Group
http://data.cs.sfu.ca

• Invented many famous data mining algorithms (e.g., FP-Growth, DBSCAN)

• Research Strengths: Cloud Databases, Data Preparation, Data Pricing, Data Security and Privacy, Recommender Systems

• Ranked 13th in databases and data mining in North America (source: csrankings.org)

Martin Ester (Joined in 2001)

Jian Pei (Joined in 2004)

Ke Wang (Joined in 2000)

Tianzheng Wang (Joined Fall 2018)

Jiannan Wang (Joined in 2016)

Democratizing AI

• Computing

• Algorithms

• Data: Data Prep is the bottleneck

What is Data Prep?

Data Lake → Data Preparation → Cooked Data

Why is Data Prep hard?

Data Lake → Cooked Data

• Data Discovery
• Data Profiling
• Data Extraction
• Data Normalization
• Data Enrichment
• Data Transformation
• Data Filtering
• Data Provenance
• Data Labeling
• Error Detection
• Schema Matching
• Deduplication
• Outlier Detection
• Imputation
• …

Two Promising Directions

1. Using advanced ML technologies (Today's Talk)
• Automated Machine Learning (AutoML)
• Active Learning and Self-training

2. Building open-source software (Next Wed's Talk, http://dataprep.ai)
• Ease of Use
• Fast
• All-in-one

Talk Outline

1. Entity Matching (EM)

2. Automate Model Development

3. Automate Data Labeling

4. Future Direction

Entity Matching (EM)

EM is central to data integration and cleaning

Entity Matching (EM)

ID   Product Name                                Price
r1   iPad Eight 128GB WiFi White                 $490
r2   iPad 8th generation 128GB WiFi White        $469
r3   iPhone 10th generation White 256GB          $545
r4   Apple iPhone 11th generation Black 256GB    $375
r5   Apple iPhone 10 256GB White                 $520

Matching Pairs: (r1, r2), (r3, r5)

Entity Matching Techniques

1. Similarity-based
• Similarity function (e.g., Jaccard)
• Threshold (e.g., 0.8)

Jaccard(r1, r2) = 0.9 ≥ 0.8 ✓
Jaccard(r3, r4) = 0.4 < 0.8 ✗

2. Learning-based

Matching: (r1, r2), (r3, r5)
Non-matching: (r1, r3), (r4, r5), (r3, r4)
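
The similarity-based technique above can be sketched in a few lines of Python. This is a minimal sketch, assuming lowercase word tokens as the tokenizer (the slide does not specify one, so its 0.9 and 0.4 should be read as illustrative values rather than what this code prints). The helper names `tokens`, `jaccard`, and `is_match` are hypothetical.

```python
def tokens(text):
    """Split a string into a set of lowercase word tokens (assumed tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # convention: two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_match(a, b, threshold=0.8):
    """Declare a match when the similarity reaches the threshold."""
    return jaccard(a, b) >= threshold


# Product names from the example table
r1 = "iPad Eight 128GB WiFi White"
r2 = "iPad 8th generation 128GB WiFi White"
r3 = "iPhone 10th generation White 256GB"
r4 = "Apple iPhone 11th generation Black 256GB"

print(jaccard(r1, r2), is_match(r1, r2))
print(jaccard(r3, r4), is_match(r3, r4))
```

With plain word tokens, r1 and r2 share only four of seven distinct words, so they fall below the 0.8 threshold even though they are a true match. A single hand-picked function and threshold is sensitive to such choices, which is part of the motivation for the learning-based approach.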

Manual Model Development

Our Goal

AutoML

Why AutoML? Reason 1: Tuning Matters

Abt-Buy Dataset (70 features)

Training Data → Feature Selection → Random Forest → EM Model

[Plots: F1 Score vs. # of selected features; F1 Score vs. max_features parameter value]

Why AutoML? Reason 2: Huge Tuning Space

Abt-Buy Dataset (70 features)

Training Data → Feature Selection → Random Forest → EM Model

Space Size: 70 × 70 = 4,900

20+ components, 20+ components, 10+ models: Huge Space
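
The size of the two-knob space (number of selected features, `max_features`) can be checked by enumeration. This is a minimal sketch, assuming both knobs range over 1..70; the ranges are an assumption inferred from the 70-feature dataset, and the variable names are arbitrary.

```python
from itertools import product

N_FEATURES = 70  # Abt-Buy feature count from the slide

# Assumed ranges: keep k features, then let each tree split consider m features.
selected_k = range(1, N_FEATURES + 1)
max_features = range(1, N_FEATURES + 1)

grid = list(product(selected_k, max_features))
print(len(grid))  # 70 * 70 = 4900 configurations
```

This covers only one feature selector and one model. Multiplying in the 20+ and 10+ alternative components at each pipeline stage, each with its own hyperparameters, is what makes exhaustive manual tuning impractical.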

Talk Outline

1. Entity Matching (EM)

2. Automate Model Development

3. Automate Data Labeling

4. Future Direction

What is AutoML?

Vision
• AutoML allows non-experts to make use of machine learning models and techniques

Scope
• Automate Data Preprocessing → Feature Engineering → Model Selection/Hyperparameter Tuning for Supervised Learning

[Figure: Training Data + target + metric; Model; Validation]

Will AutoML replace data scientists?

NO! AutoML lacks domain knowledge

How does AutoML work?

Three Steps

1. Search Space 𝑆
2. Search Strategy
3. Performance Estimation Strategy

[Figure: the Search Strategy draws a pipeline 𝑝 ∈ 𝑆 from the Search Space and passes it to the Performance Estimation Strategy, which returns a performance estimate of 𝑝; annotated with Domain Knowledge]
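
The three steps can be sketched as a simple loop. This is a minimal sketch, not the AutoML-EM or Auto-sklearn implementation: it assumes random search as the search strategy and a made-up scoring function standing in for real validation. The names `SEARCH_SPACE`, `sample_pipeline`, `estimate_performance`, and `automl_search` are hypothetical.

```python
import random

# Step 1: search space S (toy example; option names are placeholders).
SEARCH_SPACE = {
    "n_selected_features": list(range(1, 71)),
    "max_features": list(range(1, 71)),
}


def sample_pipeline(rng):
    """Step 2: search strategy. Here: uniform random sampling of p in S."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def estimate_performance(pipeline):
    """Step 3: performance estimation. A stand-in score in [0, 1]; a real
    system would train the pipeline and measure F1 on validation data."""
    k, m = pipeline["n_selected_features"], pipeline["max_features"]
    return 1.0 - (abs(k - 40) + abs(m - 10)) / 100.0


def automl_search(budget, seed=0):
    """Return the best pipeline found within the budget and its score."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(budget):
        p = sample_pipeline(rng)
        score = estimate_performance(p)
        if score > best_score:
            best, best_score = p, score
    return best, best_score


best, score = automl_search(budget=200)
print(best, round(score, 3))
```

Domain knowledge enters through `SEARCH_SPACE`: restricting or enriching what the search may pick changes the outcome without touching the search loop itself.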

How to adopt AutoML in EM?

Key Idea: Ingest domain knowledge through a careful search space design

Feature Generation: Magellan Features [1] vs. AutoML-EM Features

Model Selection: All Models vs. Random Forest

[1] Konda, Pradap, et al. "Magellan: Toward building entity matching management systems." Proceedings of the VLDB Endowment 9.12 (2016): 1197-1208.

Feature Generation

Magellan Features [1] vs. AutoML-EM Features

Magellan Features

ID   Product Name                            Price
r1   iPad Eight 128GB WiFi White             $490
r2   iPad 8th generation 128GB WiFi White    $469

5 features for Product Name
4 features for Price

Record Pair → Feature Vector: 9 Features

AutoML-EM Features

ID   Product Name                            Price
r1   iPad Eight 128GB WiFi White             $490
r2   iPad 8th generation 128GB WiFi White    $469

16 features for Product Name
4 features for Price

Record Pair → Feature Vector: 20 Features
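
To make the step from record pair to feature vector concrete, here is a minimal sketch that computes a handful of similarity features per attribute. The specific functions (word Jaccard, 3-gram Jaccard, exact match, and a `difflib` ratio for strings; exact match, absolute difference, and relative difference for numbers) are illustrative assumptions, not the actual Magellan or AutoML-EM feature sets, which the slides do not enumerate. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def _jaccard(sa, sb):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0


def _qgrams(s, q=3):
    """Set of character q-grams of a lowercase string."""
    s = s.lower()
    return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}


def string_features(a, b):
    """Illustrative similarity features for a text attribute."""
    return [
        _jaccard(set(a.lower().split()), set(b.lower().split())),
        _jaccard(_qgrams(a), _qgrams(b)),
        float(a.lower() == b.lower()),
        SequenceMatcher(None, a.lower(), b.lower()).ratio(),
    ]


def numeric_features(x, y):
    """Illustrative similarity features for a numeric attribute."""
    diff = abs(x - y)
    return [float(x == y), diff, diff / max(abs(x), abs(y), 1e-9)]


def pair_features(rec_a, rec_b):
    """Concatenate per-attribute features into one vector for the pair."""
    return (string_features(rec_a["name"], rec_b["name"])
            + numeric_features(rec_a["price"], rec_b["price"]))


r1 = {"name": "iPad Eight 128GB WiFi White", "price": 490.0}
r2 = {"name": "iPad 8th generation 128GB WiFi White", "price": 469.0}
print(pair_features(r1, r2))  # 4 name features + 3 price features
```

Adding more similarity functions per text attribute lengthens the vector in the same way, which is how a richer feature set gives the downstream model more signals to choose from.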

Experiments – setup & datasets

• AutoML-EM
  - Built on Auto-sklearn

• Methods for comparison
  - Magellan [1]: state-of-the-art library for EM model development
  - DeepMatcher [2]: state-of-the-art deep learning models for EM

• Datasets
  - Eight benchmark datasets

[1] Konda, Pradap, et al. "Magellan: Toward building entity matching management systems." VLDB 2016.
[2] Mudgal, Sidharth, et al. "Deep learning for entity matching: A design space exploration." SIGMOD 2018.

Feature Generation

• Magellan Features vs. AutoML-EM Features

AutoML-EM features outperform Magellan features by up to 11.1%

Can AutoML-EM Beat Human?

• Human vs. AutoML-EM

AutoML-EM beats human by an average of 5.8% in F1 Score

Deep Learning Based EM

Can AutoML-EM beat deep learning?

Dataset              DeepMatcher   AutoML-EM   Δ F1 Score
BeerAdvo-RateBeer    72.7          80.9        +8.2
DBLP-ACM             98.4          98.1        -0.3
DBLP-Scholar         94.7          94.6        -0.1
Fodors-Zagats        100.0         100.0       +0.0
Walmart-Amazon       66.9          79.9        +13.0
iTunes-Amazon        88.0          95.7        +7.7

AutoML-EM wins on structured data by up to 13%

Dataset              DeepMatcher   AutoML-EM   Δ F1 Score
Amazon-Google        69.3          63.8        -5.5
Abt-Buy              62.8          58.1        -4.7

Deep learning wins on textual data but NOT by a large margin
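
The Δ column is AutoML-EM minus DeepMatcher. A short sketch that recomputes it from the reported F1 scores and picks out the largest gain per group; the grouping into structured and textual datasets follows the two tables, and the variable names are arbitrary.

```python
# (DeepMatcher F1, AutoML-EM F1) as reported in the tables
structured = {
    "BeerAdvo-RateBeer": (72.7, 80.9),
    "DBLP-ACM": (98.4, 98.1),
    "DBLP-Scholar": (94.7, 94.6),
    "Fodors-Zagats": (100.0, 100.0),
    "Walmart-Amazon": (66.9, 79.9),
    "iTunes-Amazon": (88.0, 95.7),
}
textual = {
    "Amazon-Google": (69.3, 63.8),
    "Abt-Buy": (62.8, 58.1),
}


def deltas(results):
    """AutoML-EM minus DeepMatcher, rounded to one decimal like the table."""
    return {name: round(auto - deep, 1) for name, (deep, auto) in results.items()}


for group_name, group in (("structured", structured), ("textual", textual)):
    d = deltas(group)
    best = max(d, key=d.get)
    print(group_name, d, "largest delta:", best, d[best])
```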

Deep Learning vs. AutoML-EM

Compared on:
• Interpretability
• Time efficiency
• Performance on structured data

Takeaways

Innovation
• The first work to apply AutoML to EM

Key Findings
1. AutoML-EM beats human by a large margin
2. AutoML-EM outperforms deep learning on structured data
3. AutoML-EM is competitive with deep learning on textual data

Talk Outline

1. Entity Matching (EM)

2. Automate Model Development

3. Automate Data Labeling

4. Future Direction

Data Labeling

Active Learning

[Figures: Illustration; Workflow over Data, Labeled Data, and Model]

Self-training

1. Train model on labeled data
2. Use model to predict unlabeled data
3. Add predicted unlabeled data with high confidence to the training set

[Figure: predictions on unlabeled data are split into high-confidence and low-confidence groups; the high-confidence ones join the labeled data as new labeled data]
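
The three self-training steps can be sketched as a loop. This is a minimal sketch under loud assumptions: each example is a single similarity score in [0, 1], the "model" is a toy threshold placed midway between the class means, and confidence is the distance from that threshold. None of this is the talk's actual model; `fit_threshold`, `predict_with_confidence`, and `self_train` are hypothetical names.

```python
def fit_threshold(labeled):
    """Step 1: train. Toy model: threshold midway between the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_with_confidence(threshold, x):
    """Step 2: predict. Confidence grows with distance from the threshold."""
    label = 1 if x >= threshold else 0
    return label, abs(x - threshold)


def self_train(labeled, unlabeled, min_conf=0.25, max_rounds=5):
    """Step 3: move high-confidence predictions into the training set, repeat."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        threshold = fit_threshold(labeled)
        keep = []
        for x in unlabeled:
            label, conf = predict_with_confidence(threshold, x)
            if conf >= min_conf:
                labeled.append((x, label))  # pseudo-label, not human-verified
            else:
                keep.append(x)
        if len(keep) == len(unlabeled):
            break  # nothing new was added; stop early
        unlabeled = keep
    return fit_threshold(labeled), labeled, unlabeled


seed_labels = [(0.9, 1), (0.1, 0)]
pool = [0.95, 0.85, 0.55, 0.45, 0.2, 0.05]
threshold, labeled, remaining = self_train(seed_labels, pool)
print(threshold, len(labeled), remaining)
```

The examples that stay in `remaining` are the ones near the decision boundary, which self-training alone never labels.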

Active Learning + Self-training

[Figure: workflow over Data, Labeled Data, and Model, extended with model-inferred labels]
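
One plausible way to combine the two ideas in a single round is to route each unlabeled example by model confidence: the least confident examples go to a human (active learning), and the most confident predictions are accepted as model-inferred labels (self-training). The sketch below assumes that routing rule, a caller-supplied `predict` function returning `(label, confidence)`, and a human `oracle` callable. All names and thresholds are hypothetical, and the paper's actual procedure may differ.

```python
def combined_round(unlabeled, predict, oracle, self_conf=0.9, human_budget=2):
    """Run one round; return (human_labeled, model_labeled, still_unlabeled)."""
    scored = [(x, *predict(x)) for x in unlabeled]  # (example, label, confidence)

    # Active learning: ask the human about the least confident examples.
    by_uncertainty = sorted(scored, key=lambda t: t[2])
    to_human = [x for x, _, _ in by_uncertainty[:human_budget]]
    human_labeled = [(x, oracle(x)) for x in to_human]

    # Self-training: accept model-inferred labels above the confidence bar.
    model_labeled = [(x, label) for x, label, conf in scored
                     if conf >= self_conf and x not in to_human]

    taken = set(to_human) | {x for x, _ in model_labeled}
    still_unlabeled = [x for x in unlabeled if x not in taken]
    return human_labeled, model_labeled, still_unlabeled


# Toy stand-ins: examples are similarity scores; the "model" thresholds at 0.5.
def toy_predict(x):
    return (1 if x >= 0.5 else 0), min(1.0, 0.5 + abs(x - 0.5))


def toy_oracle(x):
    return 1 if x >= 0.6 else 0  # pretend ground truth differs near the boundary


pool = [0.98, 0.55, 0.48, 0.7, 0.03]
human, model, rest = combined_round(pool, toy_predict, toy_oracle)
print(human, model, rest)
```

In this toy run the human budget is spent on the two borderline scores, while the clear-cut ones are labeled for free by the model.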

Takeaways

Innovation
• The first work to combine active learning and self-training for EM

Key Findings
1. Our combined solution beats the active-learning-only solution
2. Our combined solution beats the self-training-only solution

Talk Outline

1. Entity Matching (EM)

2. Automate Model Development

3. Automate Data Labeling

4. Future Direction

The journey has just begun

Democratizing AI: Algorithms, Computing, Data Prep

Data Prep tasks:
• Entity Matching
• Error Detection
• Schema Matching
• Outlier Detection
• Imputation
• Data Discovery
• Data Profiling
• Data Extraction
• Data Normalization
• Data Enrichment
• Data Transformation
• …

AutoML + (Domain Knowledge)