idigbio data integration with self-referential schemas · talend open studio (tos) for data...

13
iDigBio Data Integration with Self-Referential Schemas Sarfaraz Soomro Andréa Matsunaga José Fortes Presented By Alex Thompson TDWG Conference

Upload: others

Post on 10-Aug-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

iDigBio Data Integration with Self-Referential SchemasSarfaraz Soomro

Andréa Matsunaga

José Fortes

Presented By

Alex Thompson

TDWG Conference

Page 2: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

Data Integration The branch of computer science that deals with

combining data from several disparate data sources into a mediated schema is called Data Integration

ExtractTransformLoad

Mediated Schema

http://icons.iconarchive.com/, http://www.cagrid.org, http://en.wikipedia.org

2

Page 3: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

Hierarchies

3

GeographyID Name ParentID Rank

1 Earth Planet

5 North America 1 Continent

198 United States 5 Country

1237 Florida 198 State

1250 Georgia 198 State

4371 Alachua County 1237 County

4380 Hall County 1250 County

4600 Gainesville 1237 City

4670 Gainesville 4380 CityCity

County

State

Country

Continent

Planet Earth

North America

United States

Florida

Alachua County

Gainesville

Georgia

Hall County

Gainesville

Ranked Self-Referential Representation

Planet Continent Country State County City

Earth

Earth North

America

Earth North

America

United

States

Earth North

America

United

States

Florida

Earth North

America

United

States

Georgia

Earth North

America

United

States

Florida Alachua

County

Earth North

America

United

States

Georgia Hall

County

Earth North

America

United

States

Florida Gainesville

Earth North

America

United

States

Georgia Hall

County

Gainesville

Flat Representation

• Trees• Only single element as a parent

Page 4: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

GeographyID Name ParentID Rank

1 Earth Planet

5 North America 1 Continent

198 United States 5 Country

1237 Florida 198 State

1250 Georgia 198 State

4371 Alachua County 1237 County

4380 Hall County 1250 County

4600 Gainesville 1237 City

4670 Gainesville 4380 City

Self-Referential to Flat Transformation

4

GeographyID Name ParentID Rank

1 Earth Planet

1 Continent

5 Country

198 State

198 State

1237 County

1250 County

1237 City

4380 City

GeographyID Planet Continent Country State County City

1 Earth

5 Earth North America

198 Earth North America United States

1237 Earth North America United States Florida

1250 Earth North America United States Georgia

4371 Earth North America United States Florida Alachua County

4380 Earth North America United States Georgia Hall County

4600 Earth North America United States Florida Gainesville

4670 Earth North America United States Georgia Hall County Gainesville

5 North America

198 United States

1250 Georgia

4380 Hall County

Alachua County4371

Gainesville4600

Gainesville4670

Florida1237

Page 5: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

GeographyID Planet Continent Country State County City

1 Earth

5 Earth North America

198 Earth North America United States

1237 Earth North America United States Florida

1250 Earth North America United States Georgia

4371 Earth North America United States Florida Alachua County

4380 Earth North America United States Georgia Hall County

4600 Earth North America United States Florida Gainesville

4670 Earth North America United States Georgia Hall County Gainesville

GeographyID Name ParentID Rank

1 Earth Planet

5 North America 1 Continent

198 United States 5 Country

1237 Florida 198 State

1250 Georgia 198 State

4371 Alachua County 1237 County

4380 Hall County 1250 County

4600 Gainesville 1237 City

4670 Gainesville 4380 City

GeographyID Planet Continent Country State County City

1 Earth

5 Earth North America

198 Earth North America United States

1237 Earth North America United States Florida

1250 Earth North America United States Georgia

4371 Earth North America United States Florida Alachua County

4380 Earth North America United States Georgia Hall County

4600 Earth North America United States Florida Gainesville

4670 Earth North America United States Georgia Hall County Gainesville

Flat to Self-Referential Transformation

5

Page 6: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

Motivation for Self-Referential Transformations Data sources in iDigBio are in varied schemas Ranked self-referential schemas are a recurring pattern in these schemas

Geography trees Taxonomy authority trees Geology trees

Required to either convert Flat schemas to ranked self-referential schemas Ranked self-referential schemas to flat schemas

These are specialized transformations and require special attention to detail A museum migrating from Symbiota to Specify database Exporting your Specify data for publishing to GBIF Ingesting data from a third party provider like GBIF

Specify provides free of charge data transformation services to the community! Due to the complicated nature of these transformations, it is time-

consuming to produce these transformations Wait time could potentially vary from 6 months to an year

Page 7: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

Achieving Data Transformation Talend Open Studio (TOS) for Data Integration

An Open Source tool for Data Integration Drag & Drop visual interface for components Data transformation jobs are designed as workflows using many

components chained together Built in components for

Data extraction from multiple formats Data processing Data export to multiple formats

Custom components and TOS built-in components can be re-used with many possibilities

Examples Custom Component’s output to a MySQL, MSSQL, Oracle or other supported databases Custom Component’s output to a XML File Custom Component’s output to an Excel File Custom Component’s output to a JSON File Custom Component’s output to a variety of other data processing components available in

TOS

7

Page 8: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

Talend Open Studio

8

• Drag & drop a component.• Input component

• Drag & drop an Output component• Processing component• Arrows indicate data flow

Page 9: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

Specify to Symbiota Example with Talend

9

Specify Software An open source tool for collections management.

Uses self-referential schema to store geography, taxonomy, geology, etc.

Symbiota Specimen based web-tools for bio-collaboration.

Uses flat schema to store occurrence, taxonomy and geography data.

Specify to Symbiota is a Hierarchical to Flat transformation We show next how our custom components developed for

Talend can be used along with other built-in components to achieve this transformation

Page 10: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory 10

• Drag & Drop Components

• These tables are joined to collect specimen info

• iDigBio built custom components for Hierarchical to Flat Transformation in Talend

• One flattens the taxonomy tree

• Other one flattens the geography tree

• Run the job

Specify to Symbiota with Talend

Flattens Taxon Tree

Flattens Geography Tree

Page 11: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

Specify to Symbiota with Talend

11

Job finishes in 18.5 Sec on a Windows 7 x64 Enterprise

Intel Core2Duo @ 3GHz 2.99 GHz

4GB Memory

Talend is multi-threaded

Output Records: 16,821

Input Records: 185,907

Can accomplish similar transformation in other direction as well

Effort Reduced for Flat to Hierarchical TransformationTable Actual SQL Query

(# Chararcters)Input to Talend (#Characters)

PercentageReduction

Taxon 30,449 328 99%

Geography 2,078 128 94%

Page 12: iDigBio Data Integration with Self-Referential Schemas · Talend Open Studio (TOS) for Data Integration An Open Source tool for Data Integration Drag & Drop visual interface for components

Advanced Computing and Information Systems laboratory

Conclusion Custom components built by iDigBio for Talend help in

achieving

Hierarchical to Flat Schema Transformation

Flat to Hierarchical Schema Transformation

Custom components can be used easily with other built in components

Schema transformation approach is general and can be adapted for databases from other domains as well

Talend jobs can be re-used as templates

12