www.data61.csiro.au
Automating Data Integration with Machine Learning
Bringing Your Data Together
Natalia Rümmele | Data Scientist
September 2016
Data Integration Problem
• Combine data from different sources
• Provide unified view of data
• Multiple datasets with common entities and content
• Siloed systems
• Sharing data across systems/schemas
• Handling legacy systems
• Handling different schemas
• Resource-intensive ETL…
• Can machine learning help?
Data Integration | Natalia Rümmele 2 |
Areas of Investigation
• Schema matching & mapping
  • Connecting datasets by establishing a global data model
• Entity Resolution
  • Detecting entities across databases
• Privacy Preserving Analytics
  • Learning the global data model without sharing data
• Data Quality
  • Semantic and Syntactic scoring
Relational Schema Matching
Schema Matching
• Goal: Automatically connect datasets across relational schemas
• Problem: Syntax vs Semantics
  • Requires humans to distinguish between similar columns
  • Machines get caught on syntax
[Figure: five example columns that should all receive the semantic type "Name"]
UserName: "Joe Blogs", "Jill Blogs", …
Names: "Blogs, Joe", "Blogs, Jill", …
People: "Blogs, J", "Blogs, J", …
_USERS_: "Blogs_Joe", "Blogs_Jill", …
IDs: "Joe Blogs", "Jill Blogs", …
Data Integration: Schema Matcher
• Can we automatically label columns with semantic types?
• Given multiple datasets and a set of semantic types, can we detect all columns of the same semantic type where:
  • Column names may be different
  • Common entries may not exist
  • Formatting issues may exist
[Figure: the Schema Matcher receives five example columns (UserName, Names, People, _USERS_, IDs) together with the semantic types Name, Address and Phone, and labels all five columns as Name.]
Machine Learning: Feature Vector
• Multi-class classification problem
• Represent column as a vector of features
• Classify column as one of the semantic types using an ML classifier
• Training data needed!
Example column "People": Joe Blogs, Jill Blogs, Fred Flogs, Fiona Flogs, George Glogs, Gini Glogs, Henry Hogs, ...
Features from Column Header + Table Name
• Nearest neighbours to class training labels
• Edit distance metrics
• WordNet distance
Features from Content
• Character frequencies
• Missing values
• Number of repeated values
• Entropy
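The content features listed above can be sketched as follows. This is a minimal illustration, not the Schema Matcher's actual feature set: the function name `content_features`, the feature names, and the exact definitions (for example, counting distinct values that occur more than once as "repeated values") are assumptions made for this example.

```python
import math
from collections import Counter


def content_features(values):
    """Turn one column (list of cell values, None = missing) into numeric features.

    Hypothetical feature definitions, chosen only to illustrate the slide.
    """
    present = [str(v) for v in values if v is not None and str(v).strip() != ""]
    n_total = len(values)
    text = "".join(present)
    n_chars = len(text)

    # Shannon entropy (in bits) of the character distribution of the column.
    char_counts = Counter(text)
    entropy = 0.0
    for count in char_counts.values():
        p = count / n_chars
        entropy -= p * math.log2(p)

    value_counts = Counter(present)
    return {
        "frac_missing": (n_total - len(present)) / n_total if n_total else 0.0,
        # Number of distinct values that appear more than once.
        "n_repeated_values": sum(1 for c in value_counts.values() if c > 1),
        "frac_digits": sum(ch.isdigit() for ch in text) / n_chars if n_chars else 0.0,
        "frac_letters": sum(ch.isalpha() for ch in text) / n_chars if n_chars else 0.0,
        "char_entropy": entropy,
    }


people = ["Joe Blogs", "Jill Blogs", None, "Joe Blogs"]
phones = ["99110002", "45878723"]
print(content_features(people))
print(content_features(phones))
```

A name column and a phone column end up far apart in this feature space (letters versus digits), which is what lets a classifier separate them even when the headers are uninformative.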
Machine Learning: Cost Matrix
• Class imbalance: unknown class over-represented (unlabeled columns)
• Class resampling strategies: oversampling/undersampling to mean, etc.
• Cost-sensitive learning by introducing asymmetric costs of misclassifications
Cost matrix (rows: actual semantic type, columns: predicted semantic type)

Actual \ Predicted | Unknown | Name | Address | Phone
Unknown            | 0       | 1    | 1       | 1
Name               | 40      | 0    | 1       | 1
Address            | 40      | 1    | 0       | 1
Phone              | 40      | 1    | 1       | 0
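One common way to use such a cost matrix is to pick the label with the lowest expected cost instead of the most probable label. The slides do not say how the costs are applied, so the decision rule below is an assumption; only the matrix values come from the slide, and `min_expected_cost_label` is a hypothetical name.

```python
TYPES = ["Unknown", "Name", "Address", "Phone"]

# COST[actual][predicted], copied from the slide's cost matrix.
COST = [
    [0, 1, 1, 1],   # actual Unknown
    [40, 0, 1, 1],  # actual Name
    [40, 1, 0, 1],  # actual Address
    [40, 1, 1, 0],  # actual Phone
]


def min_expected_cost_label(probs):
    """probs[i] is the estimated probability that the actual type is TYPES[i].

    Returns the label with the lowest expected misclassification cost,
    together with the expected cost of every possible prediction.
    """
    n = len(TYPES)
    expected = [sum(probs[a] * COST[a][p] for a in range(n)) for p in range(n)]
    best = min(range(n), key=expected.__getitem__)
    return TYPES[best], expected


# The classifier is 90% sure the column is Unknown, yet the asymmetric
# costs make "Name" the cheaper prediction.
label, costs = min_expected_cost_label([0.9, 0.1, 0.0, 0.0])
print(label, costs)
```

With these costs, predicting Unknown only pays off when the classifier is very confident that the column really is unlabeled, which counteracts the over-represented unknown class.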
Machine Learning: Bagging
• The number of columns is usually much smaller than the number of rows in a dataset
• Can use bagging to build many smaller samples by subsampling
• Also addresses class imbalance
[Figure: the "People" column (Joe Blogs, Jill Blogs, Fred Flogs, Fiona Flogs, George Glogs, Gini Glogs, Henry Hogs, ...) is subsampled into five smaller "People" columns of three values each, for example (Henry Hogs, Joe Blogs, Henry Hogs).]
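The subsampling step can be sketched as below. This is an illustration under assumptions: the function name `bag_column`, the default bag count and size, and sampling with replacement are choices made for this example (the figure shows one sample containing the same value twice, which is consistent with replacement, but the slides do not state it).

```python
import random


def bag_column(values, n_bags=5, bag_size=3, seed=0):
    """Build several small samples ("bags") from one labelled column.

    Each bag can then be treated as its own training example that carries
    the semantic type of the parent column.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [[rng.choice(values) for _ in range(bag_size)] for _ in range(n_bags)]


people = [
    "Joe Blogs", "Jill Blogs", "Fred Flogs", "Fiona Flogs",
    "George Glogs", "Gini Glogs", "Henry Hogs",
]
for bag in bag_column(people):
    print(bag)
```

Drawing the same number of bags for every semantic type is one way this also evens out class imbalance: a rare type and a common type contribute equally many training examples.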
Data Integration: Schema Matcher
• Schema Matcher is trained on examples and user feedback
• System improves as predictions are corrected
[Figure: the Schema Matcher first labels the columns People (Joe Blogs, Jill Blogs, …) and Addr (15 Something St, 74A Another Rd, …) as Address and Address; after the prediction is corrected it labels them as Name and Address.]
Data Integration: Schema Matcher
• User supplies semantic types and datasets
• User labels some example sets
• Predictions are generated, user corrects
• User adds more data
• Repeat
[Figure: datasets with the columns People (Joe Blogs, Jill Blogs, …) and Addr (15 Something St, 74A Another Rd, …); the semantic types Address, Name, Phone and Email; a generated prediction of Name; and further datasets with People and Addr columns added by the user.]
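The label, predict, correct, repeat loop can be sketched with a deliberately tiny stand-in for the real classifier. `ToyMatcher` is hypothetical and only compares column headers with `difflib.SequenceMatcher`; the Schema Matcher described in these slides uses a trained ML classifier over many more features.

```python
from difflib import SequenceMatcher


class ToyMatcher:
    """Toy stand-in: predicts the type of the most similar labelled header."""

    def __init__(self, semantic_types):
        self.semantic_types = set(semantic_types)
        self.examples = {}  # column header -> semantic type

    def label(self, header, semantic_type):
        """Record a user-supplied label or a user correction."""
        if semantic_type not in self.semantic_types:
            raise ValueError(f"unknown semantic type: {semantic_type}")
        self.examples[header] = semantic_type

    def predict(self, header):
        if not self.examples:
            return "Unknown"
        best = max(
            self.examples,
            key=lambda known: SequenceMatcher(None, header.lower(), known.lower()).ratio(),
        )
        return self.examples[best]


matcher = ToyMatcher(["Name", "Address", "Phone", "Email"])
matcher.label("Addr", "Address")      # user labels an example
print(matcher.predict("People"))      # wrong guess: only Address is known so far
matcher.label("People", "Name")       # user corrects the prediction
print(matcher.predict("People"))      # now predicted as Name
```

Each correction becomes a new training example, so later predictions on similar columns improve without any change to the code.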
Applications
• Linking and connecting data
• Class-wide transforms
• Relabelling columns to unified naming convention
• Labelling no-header datasets
• Merging tables by semantic type
• A component for further semantic modelling
[Figure: columns labelled with semantic types]
UserName (Joe Blogs, Jill Blogs, …) → Name
Location (Melbourne, Perth, …) → City
Contact (99110002, 45878723, …) → Phone
_USERS_ (Blogs_Joe, Blogs_Jill, …) → Name
Number (98712533, 34598734, …) → Phone
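Two of the applications above, relabelling columns to a unified naming convention and merging tables by semantic type, can be sketched as follows. The grouping of the five example columns into two tables and the helper names `relabel` and `merge_tables` are assumptions for illustration; the predicted types are taken from the figure.

```python
def relabel(table, predicted_types):
    """Rename the columns of a table (dict: header -> values) to semantic types."""
    return {predicted_types[header]: values for header, values in table.items()}


def merge_tables(tables):
    """Stack tables row-wise, aligning columns by semantic type.

    A table that lacks a type contributes None for each of its rows.
    """
    types = []
    for table in tables:
        for sem_type in table:
            if sem_type not in types:
                types.append(sem_type)
    merged = {sem_type: [] for sem_type in types}
    for table in tables:
        n_rows = len(next(iter(table.values()))) if table else 0
        for sem_type in types:
            merged[sem_type].extend(table.get(sem_type, [None] * n_rows))
    return merged


# Assumed grouping of the example columns into two source tables.
table_a = {
    "UserName": ["Joe Blogs", "Jill Blogs"],
    "Location": ["Melbourne", "Perth"],
    "Contact": ["99110002", "45878723"],
}
table_b = {
    "_USERS_": ["Blogs_Joe", "Blogs_Jill"],
    "Number": ["98712533", "34598734"],
}
predicted = {
    "UserName": "Name", "Location": "City", "Contact": "Phone",
    "_USERS_": "Name", "Number": "Phone",
}
merged = merge_tables([relabel(table_a, predicted), relabel(table_b, predicted)])
print(merged)
```

The merged Name column still mixes two formats (Joe Blogs and Blogs_Joe), which is where class-wide transforms would come in.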
Semantic Modelling
Implicit Semantics
• The semantic meaning of a dataset is more than a column label
[Figure: one dataset with five columns]
Name: Joe Blogs, Jill Blogs, …
BirthDate: 21-05-1986, 97-12-1990, …
City: Perth, Adelaide, …
State: WA, SA, …
Workplace: Data61, CSIRO, …
Person and City: lives-in? born-in? works-in?
Making Implicit Explicit
• The semantic meaning of a dataset is more than a column label
• There are usually relationships implied between the columns
[Figure: columns mapped to data properties and classes]
UserName (Joe Blogs, Jill Blogs, …) → Name → Person
Location (Melbourne, Perth, …) → Name → City
Contact (99110002, 45878723, …) → Number → Phone
_USERS_ (Blogs_Joe, Blogs_Jill, …) → Name → Person
Number (98712533, 34598734, …) → Number → Phone
Relationships: has-a, lives-in, has-a
Semantic Model
[Figure: semantic model of one dataset]
UserName (Joe Blogs, Jill Blogs, …) → Name → Person
Location (Melbourne, Perth, …) → Name → City
Contact (99110002, 45878723, …) → Number → Phone
Relationships: has-a, lives-in
Semantic Modelling
• For this we need an ontology or graph schema
• Can come from:
  • Built up iteratively from definitions
  • Pre-defined domain ontologies
  • Downloaded ontologies from Semantic Web
* M. Taheriyan et al., “A graph-based approach to learn semantic descriptions of data sources”, ISWC 2013.
Ontology
[Figure: example ontology*]
• Class Node: Abstract Concept
• Data Node: Properties
• Relationship: Relationship between Concepts
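The three kinds of ontology element can be represented with a small data structure. This is a minimal sketch, not an RDF or OWL implementation: the `Ontology` class and its method names are hypothetical, and the example classes and relationships are taken from the Person, City and Phone examples elsewhere in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    classes: set = field(default_factory=set)              # class nodes
    data_properties: dict = field(default_factory=dict)    # class -> property names
    relationships: set = field(default_factory=set)        # (subject, predicate, object)

    def add_class(self, name):
        self.classes.add(name)
        self.data_properties.setdefault(name, set())

    def add_property(self, cls, prop):
        self.add_class(cls)
        self.data_properties[cls].add(prop)

    def add_relationship(self, subject, predicate, obj):
        self.add_class(subject)
        self.add_class(obj)
        self.relationships.add((subject, predicate, obj))

    def relationships_between(self, subject, obj):
        """All predicates that can link the two classes, in sorted order."""
        return sorted(p for s, p, o in self.relationships if s == subject and o == obj)


onto = Ontology()
onto.add_property("Person", "Name")
onto.add_property("City", "Name")
onto.add_property("Phone", "Number")
for predicate in ("lives-in", "born-in", "works-in"):
    onto.add_relationship("Person", predicate, "City")
onto.add_relationship("Person", "has-a", "Phone")
print(onto.relationships_between("Person", "City"))
```

The three candidate relationships between Person and City are exactly the ambiguity a semantic model has to resolve for a given dataset.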
RDB2RDF Schema Matching
• Given an ontology and a set of known semantic models, can we generate a semantic model for a new dataset?
[Figure: three example datasets with their semantic models, combined with the ontology (+)]
UserName, Contact → Name, Number → Person, Phone
UserName, Location, Contact → Name, Name, Number → Person, City, Phone
UserName, Location → Name, Name → Person, City
Constructing Semantic Model
• Map all possible semantic types matched for columns onto the Ontology
• Need to find most likely semantic model – subgraph which covers matched semantic types
• Minimum Cost Steiner Tree Problem (approximate)
[Figure: semantic model built on the ontology]
UserName (Joe Blogs, Jill Blogs, …) → Person name → Person
Organization (Data2Decisions, Data61, …) → Org name → Organization
Relationship: Person worksFor Organization
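A Steiner tree connects a chosen set of nodes (here, the matched semantic types) through a graph at low total cost. The slides name the problem and say it is solved approximately but do not name an algorithm, so the sketch below uses a simple shortest-path heuristic as a stand-in: grow a tree from one terminal and repeatedly attach the nearest remaining terminal by its shortest path. The graph, its edge costs, and all function names are hypothetical.

```python
import heapq


def add_edge(graph, a, b, cost):
    """Add an undirected weighted edge to an adjacency-dict graph."""
    graph.setdefault(a, {})[b] = cost
    graph.setdefault(b, {})[a] = cost


def nearest_paths(graph, sources):
    """Multi-source Dijkstra: distance and predecessor for every reachable node."""
    dist = {s: 0 for s in sources}
    prev = {}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def approx_steiner_tree(graph, terminals):
    """Return a set of edges (frozensets) connecting all terminals."""
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals[1:])
    while remaining:
        dist, prev = nearest_paths(graph, tree_nodes)
        nearest = min(remaining, key=lambda t: dist.get(t, float("inf")))
        if nearest not in dist:
            raise ValueError("terminals are not connected")
        node = nearest
        while node not in tree_nodes:  # walk the path back to the tree
            parent = prev[node]
            tree_edges.add(frozenset((node, parent)))
            tree_nodes.add(node)
            node = parent
        remaining -= tree_nodes
    return tree_edges


def build_example_graph():
    """Hypothetical ontology graph with made-up edge costs."""
    graph = {}
    add_edge(graph, "Person", "Person.name", 1)
    add_edge(graph, "Organization", "Organization.name", 1)
    add_edge(graph, "Person", "Organization", 1)  # worksFor
    add_edge(graph, "Person", "City", 1)          # lives-in
    add_edge(graph, "City", "City.name", 1)
    add_edge(graph, "Organization", "City", 3)
    return graph


tree = approx_steiner_tree(build_example_graph(), ["Person.name", "Organization.name"])
print(sorted(tuple(sorted(edge)) for edge in tree))
```

For the two matched types in the figure, the resulting tree passes through Person and Organization and uses the worksFor edge, which corresponds to the semantic model shown on the slide.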
Advantages
Not just better understanding of your data…
• Better entity resolution
• Easier integration to graph databases or merging between relational databases
• Enables more sophisticated and accurate merging
• More powerful search
Advanced Search
• The ontology allows transitive and subclass relationships
• Searches can associate new columns e.g. City -> State -> Country
• A search for something in a country can also proceed to the subclasses
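A transitive search of this kind can be sketched as follows. The city and state values are taken from the example dataset earlier in the slides; the country "Australia", the pairing of people with cities, and the names `LOCATED_IN`, `ancestors` and `people_in` are assumptions added for illustration.

```python
# Transitive "located in" chain: City -> State -> Country.
LOCATED_IN = {
    "Perth": "WA",
    "Adelaide": "SA",
    "WA": "Australia",  # country level is an assumption for this sketch
    "SA": "Australia",
}


def ancestors(place):
    """Follow the located-in relationship transitively, nearest region first."""
    chain = []
    seen = set()
    while place in LOCATED_IN and place not in seen:  # 'seen' guards against cycles
        seen.add(place)
        place = LOCATED_IN[place]
        chain.append(place)
    return chain


def people_in(region, lives_in):
    """People whose city is the region itself or lies somewhere inside it."""
    return sorted(
        person
        for person, city in lives_in.items()
        if city == region or region in ancestors(city)
    )


lives_in = {"Joe Blogs": "Perth", "Jill Blogs": "Adelaide"}
print(people_in("WA", lives_in))
print(people_in("Australia", lives_in))
```

The dataset itself only stores a city per person; the state and country levels come from the ontology, which is what lets the search associate columns that were never in the original table.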
Graph Database
• The ontology can act as the intermediary between graph databases and relational databases
• Functions as a graph schema and global (unified) schema of integrated datasets
[Figure: example tables with the columns People (Joe Blogs, Jill Blogs, …) and Addr (15 Something St, 74A Another Rd, …)]
Summary
Data Integration
• Schema Matcher
  • Relational schema
  • Find semantically similar columns across data sets
• Semantic Modelling
  • Graph schema
  • Find semantically similar columns + relationships between them
Open Questions
• Can we learn column transformations?
• Complex column matches
  • One-to-many matches (split?)
  • Many-to-one matches (concatenate?)
• Applications in entity resolution
• Applications in search
www.csiro.au
Data Platforms Team Engineering and Design
Thank you
Alex Collins
Stephen Hardy
Yuriy Tyshetskiy
Natalia Rümmele
Maybe you?