idigbio data integration with self-referential schemas · talend open studio (tos) for data...
TRANSCRIPT
iDigBio Data Integration with Self-Referential SchemasSarfaraz Soomro
Andréa Matsunaga
José Fortes
Presented By
Alex Thompson
TDWG Conference
Advanced Computing and Information Systems laboratory
Data Integration The branch of computer science that deals with
combining data from several disparate data sources into a mediated schema is called Data Integration
ExtractTransformLoad
Mediated Schema
http://icons.iconarchive.com/, http://www.cagrid.org, http://en.wikipedia.org
2
Advanced Computing and Information Systems laboratory
Hierarchies
3
GeographyID Name ParentID Rank
1 Earth Planet
5 North America 1 Continent
198 United States 5 Country
1237 Florida 198 State
1250 Georgia 198 State
4371 Alachua County 1237 County
4380 Hall County 1250 County
4600 Gainesville 1237 City
4670 Gainesville 4380 CityCity
County
State
Country
Continent
Planet Earth
North America
United States
Florida
Alachua County
Gainesville
Georgia
Hall County
Gainesville
Ranked Self-Referential Representation
Planet Continent Country State County City
Earth
Earth North
America
Earth North
America
United
States
Earth North
America
United
States
Florida
Earth North
America
United
States
Georgia
Earth North
America
United
States
Florida Alachua
County
Earth North
America
United
States
Georgia Hall
County
Earth North
America
United
States
Florida Gainesville
Earth North
America
United
States
Georgia Hall
County
Gainesville
Flat Representation
• Trees• Only single element as a parent
Advanced Computing and Information Systems laboratory
GeographyID Name ParentID Rank
1 Earth Planet
5 North America 1 Continent
198 United States 5 Country
1237 Florida 198 State
1250 Georgia 198 State
4371 Alachua County 1237 County
4380 Hall County 1250 County
4600 Gainesville 1237 City
4670 Gainesville 4380 City
Self-Referential to Flat Transformation
4
GeographyID Name ParentID Rank
1 Earth Planet
1 Continent
5 Country
198 State
198 State
1237 County
1250 County
1237 City
4380 City
GeographyID Planet Continent Country State County City
1 Earth
5 Earth North America
198 Earth North America United States
1237 Earth North America United States Florida
1250 Earth North America United States Georgia
4371 Earth North America United States Florida Alachua County
4380 Earth North America United States Georgia Hall County
4600 Earth North America United States Florida Gainesville
4670 Earth North America United States Georgia Hall County Gainesville
5 North America
198 United States
1250 Georgia
4380 Hall County
Alachua County4371
Gainesville4600
Gainesville4670
Florida1237
Advanced Computing and Information Systems laboratory
GeographyID Planet Continent Country State County City
1 Earth
5 Earth North America
198 Earth North America United States
1237 Earth North America United States Florida
1250 Earth North America United States Georgia
4371 Earth North America United States Florida Alachua County
4380 Earth North America United States Georgia Hall County
4600 Earth North America United States Florida Gainesville
4670 Earth North America United States Georgia Hall County Gainesville
GeographyID Name ParentID Rank
1 Earth Planet
5 North America 1 Continent
198 United States 5 Country
1237 Florida 198 State
1250 Georgia 198 State
4371 Alachua County 1237 County
4380 Hall County 1250 County
4600 Gainesville 1237 City
4670 Gainesville 4380 City
GeographyID Planet Continent Country State County City
1 Earth
5 Earth North America
198 Earth North America United States
1237 Earth North America United States Florida
1250 Earth North America United States Georgia
4371 Earth North America United States Florida Alachua County
4380 Earth North America United States Georgia Hall County
4600 Earth North America United States Florida Gainesville
4670 Earth North America United States Georgia Hall County Gainesville
Flat to Self-Referential Transformation
5
Advanced Computing and Information Systems laboratory
Motivation for Self-Referential Transformations Data sources in iDigBio are in varied schemas Ranked self-referential schemas are a recurring pattern in these schemas
Geography trees Taxonomy authority trees Geology trees
Required to either convert Flat schemas to ranked self-referential schemas Ranked self-referential schemas to flat schemas
These are specialized transformations and require special attention to detail A museum migrating from Symbiota to Specify database Exporting your Specify data for publishing to GBIF Ingesting data from a third party provider like GBIF
Specify provides free of charge data transformation services to the community! Due to the complicated nature of these transformations, it is time-
consuming to produce these transformations Wait time could potentially vary from 6 months to an year
Advanced Computing and Information Systems laboratory
Achieving Data Transformation Talend Open Studio (TOS) for Data Integration
An Open Source tool for Data Integration Drag & Drop visual interface for components Data transformation jobs are designed as workflows using many
components chained together Built in components for
Data extraction from multiple formats Data processing Data export to multiple formats
Custom components and TOS built-in components can be re-used with many possibilities
Examples Custom Component’s output to a MySQL, MSSQL, Oracle or other supported databases Custom Component’s output to a XML File Custom Component’s output to an Excel File Custom Component’s output to a JSON File Custom Component’s output to a variety of other data processing components available in
TOS
7
Advanced Computing and Information Systems laboratory
Talend Open Studio
8
• Drag & drop a component.• Input component
• Drag & drop an Output component• Processing component• Arrows indicate data flow
Advanced Computing and Information Systems laboratory
Specify to Symbiota Example with Talend
9
Specify Software An open source tool for collections management.
Uses self-referential schema to store geography, taxonomy, geology, etc.
Symbiota Specimen based web-tools for bio-collaboration.
Uses flat schema to store occurrence, taxonomy and geography data.
Specify to Symbiota is a Hierarchical to Flat transformation We show next how our custom components developed for
Talend can be used along with other built-in components to achieve this transformation
Advanced Computing and Information Systems laboratory 10
• Drag & Drop Components
• These tables are joined to collect specimen info
• iDigBio built custom components for Hierarchical to Flat Transformation in Talend
• One flattens the taxonomy tree
• Other one flattens the geography tree
• Run the job
Specify to Symbiota with Talend
Flattens Taxon Tree
Flattens Geography Tree
Advanced Computing and Information Systems laboratory
Specify to Symbiota with Talend
11
Job finishes in 18.5 Sec on a Windows 7 x64 Enterprise
Intel Core2Duo @ 3GHz 2.99 GHz
4GB Memory
Talend is multi-threaded
Output Records: 16,821
Input Records: 185,907
Can accomplish similar transformation in other direction as well
Effort Reduced for Flat to Hierarchical TransformationTable Actual SQL Query
(# Chararcters)Input to Talend (#Characters)
PercentageReduction
Taxon 30,449 328 99%
Geography 2,078 128 94%
Advanced Computing and Information Systems laboratory
Conclusion Custom components built by iDigBio for Talend help in
achieving
Hierarchical to Flat Schema Transformation
Flat to Hierarchical Schema Transformation
Custom components can be used easily with other built in components
Schema transformation approach is general and can be adapted for databases from other domains as well
Talend jobs can be re-used as templates
12
Advanced Computing and Information Systems laboratory
Questions?
13