Talend Open Studio Data Integration
Talend Open Studio ETL tool, Talend Profiler and Data Management.
www.robertomarchetto.com
Data Integration
Data Integration involves combining data residing in different sources and providing the
user with a unified view of the data.
Data Management combines different disciplines to manage data as a valuable resource.
Talend
● Talend is a company focused on Data Integration and Data Management solutions
● Named a "Cool Vendor" by Gartner (2010)
● Present in more than 12 locations around the world
● Fast-growing company
Talend Open Studio
● Open Source, professional tool
● Draw procedures by linking components; each component performs an operation
● DB vendor-specific optimized components
● Produces fully editable Java (or Perl) code
● Deployment as small, fast compiled Java or as a Web Service
● Eclipse-based IDE, excellent flexibility
● BI platform independent, DB vendor independent
Automatic code generation, different deployment options
Extraction, Transformation, Loading
● ETL is a common process in Data Integration
● Extract: read data from different data sources (databases, flat files, spreadsheet files, web services, etc.)
● Transform: convert the data into a form that can be placed in another container (databases, web services, files, etc.). Cleaning, computations and verifications are also performed
● Load: write the data in the target format
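The three ETL steps can be sketched in a few lines of Python. This is only an illustrative sketch, not code generated by Talend: the sample CSV text, the `customers` table, and the function names `extract`, `transform`, `load` and `run_etl` are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a flat file.
SOURCE_CSV = "id,name,amount\n1, Alice ,10.5\n2,Bob,\n3,Carol,7\n"


def extract(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim names, convert types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # verification step: skip incomplete records
        cleaned.append((int(row["id"]), row["name"].strip(), float(amount)))
    return cleaned


def load(rows, connection):
    """Load: write the transformed rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    connection.commit()


def run_etl(connection):
    """Run the three steps in order and return what ended up in the target."""
    load(transform(extract(SOURCE_CSV)), connection)
    return connection.execute(
        "SELECT id, name, amount FROM customers ORDER BY id"
    ).fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads the two complete records (Alice and Carol) and skips the one with a missing amount.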
Tutorial, Destination data (Data Warehouse)
Tutorial, Metadata
● Talend requires a preliminary definition of the metadata
● As with strong typing in programming languages, a strong metadata definition often leads to fast, robust and maintainable applications
● ..demo..
Tutorial, Talend jobs basics
● Place components on the designer
● Link components to build a transformation
● Main type of link: row flow
● Schema metadata is propagated and must be coherent
● ..demo..
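The idea of components linked by a row flow, with a schema that travels along the link and must stay coherent, can be approximated in plain Python. This is a conceptual sketch only and does not use any Talend API: the schemas, the `check_row` helper, and the `source`, `add_length`, `sink` and `run_job` functions are hypothetical names chosen for the example.

```python
# Hypothetical schemas: ordered lists of (column name, Python type).
INPUT_SCHEMA = [("id", int), ("name", str)]
OUTPUT_SCHEMA = [("id", int), ("name", str), ("name_length", int)]


def check_row(row, schema):
    """Raise if a row does not match the declared schema."""
    if list(row) != [name for name, _ in schema]:
        raise ValueError(f"columns {list(row)} do not match schema")
    for name, expected in schema:
        if not isinstance(row[name], expected):
            raise TypeError(f"column {name!r} is not {expected.__name__}")
    return row


def source(records):
    """Input component: emits rows that follow INPUT_SCHEMA."""
    for record in records:
        yield check_row(dict(record), INPUT_SCHEMA)


def add_length(rows):
    """Transformation component: adds a column, so its output schema grows."""
    for row in rows:
        yield check_row({**row, "name_length": len(row["name"])}, OUTPUT_SCHEMA)


def sink(rows):
    """Output component: collects the rows it receives."""
    return list(rows)


def run_job(records):
    # Linking components: each one consumes the row flow of the previous one.
    return sink(add_length(source(records)))
```

A row whose `id` is a string instead of an integer makes `run_job` fail with a `TypeError`, which mirrors the requirement that schemas on both ends of a link agree.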
Extensibility, community plugins
● Many official components
● Components for every task released by the community
● Geospatial components, log analysis, Google analytics, data encryption, etc
And now.. reports, dashboards, OLAP, Geoanalysis, KPIs..
What about data quality?
● Customer A is present 5 times with different names
● Null values can skew statistical measures such as the mean
● Duplicated records
● Blank values
● Some records can contain errors (e.g. -1 field values)
● Some records can be garbage
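The problems listed above can be measured with a small profiling script. The following Python sketch is illustrative only: the sample records, the `normalize` rule for matching customer names, and the choice of -1 as the error marker are assumptions for the example, not part of any Talend profiling tool.

```python
from statistics import mean

# Hypothetical records: (customer name, order value). -1 marks a bad value.
RECORDS = [
    ("ACME Ltd", 100.0),
    ("Acme Ltd.", 100.0),
    ("acme ltd", None),
    ("Beta Srl", -1),
    ("", 50.0),
    ("Beta Srl", 200.0),
]


def normalize(name):
    """Reduce a name to a comparison key: lowercase, no punctuation or extra spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def quality_report(records):
    """Count common data quality problems and show their effect on the mean."""
    names = [normalize(name) for name, _ in records]
    values = [value for _, value in records]
    valid = [v for v in values if v is not None and v >= 0]
    return {
        "blank_names": sum(1 for n in names if not n),
        "null_values": sum(1 for v in values if v is None),
        "sentinel_values": sum(1 for v in values if v == -1),
        "distinct_customers": len({n for n in names if n}),
        # Treating nulls as 0 and keeping -1 pulls the mean down.
        "naive_mean": mean(0 if v is None else v for v in values),
        "clean_mean": mean(valid),
    }
```

With this sample, `quality_report(RECORDS)` finds two distinct customers behind five differently spelled names, and the mean of the valid values (112.5) is clearly higher than the naive mean that counts the null as 0 and keeps the -1.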
What about data storage size?
● Some fields can be oversized for the data they contain
● Sometimes fields are related and one can be calculated from another
● Some keys or values are never used
● When data grows, garbage grows
● Data storage is not free (disks, electricity, backups, DB licenses)
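Oversized and unused fields can be spotted by comparing the declared column width with what is actually stored. The Python sketch below is a minimal illustration under stated assumptions: the column names, the declared widths and the sample rows are all hypothetical, and a real check would read them from the database catalog and the table itself.

```python
# Hypothetical column definitions: name -> declared width in characters.
DECLARED_WIDTHS = {"country_code": 50, "city": 100, "notes": 255}

# Hypothetical sample rows from the table being profiled.
ROWS = [
    {"country_code": "IT", "city": "Bolzano", "notes": None},
    {"country_code": "DE", "city": "Berlin", "notes": None},
    {"country_code": "IT", "city": "Trento", "notes": None},
]


def profile_columns(rows, declared_widths):
    """Compare declared width with the longest value actually stored."""
    report = {}
    for column, declared in declared_widths.items():
        values = [row[column] for row in rows if row[column] is not None]
        longest = max((len(v) for v in values), default=0)
        report[column] = {
            "declared": declared,
            "longest": longest,
            "never_used": not values,  # True when every row holds None
        }
    return report
```

Here `country_code` is declared with 50 characters but never holds more than 2, and `notes` is never used at all, so both are candidates for resizing or removal.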
Data is "the black gold" that can produce knowledge
● Data is a resource from which you can extract knowledge
● A lot of data can be distilled into concise information
● Data storage is not free, and a lot of data can slow a system down
● Data cleansing is a central process in statistical analysis and Data Mining