ETL
Extract Transform Load
Introduction to ETL
• ETL is used to migrate data from one database to another, to form data marts and data warehouses, and to convert databases from one format or type to another
Process of ETL
• Extract
  – Process of reading data from a database
• Transform
  – Process of converting the extracted data from its previous form into the form it needs to be in
  – Done by applying rules, using lookup tables, or combining the data with other data
• Load
  – Process of writing the data into the target database
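The three steps above can be sketched in Python. This is a minimal illustration, not PDI itself; the table and column names are hypothetical, and SQLite stands in for the source and target databases:

```python
import sqlite3

# Set up a hypothetical source database for the sketch.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, "alice", "us"), (2, "bob", "de")])

# Extract: read rows from the source database.
rows = source.execute("SELECT id, name, country FROM customers").fetchall()

# Transform: convert each row into the form the target needs,
# here by title-casing names and translating coded country values
# through a lookup table.
countries = {"us": "United States", "de": "Germany"}
transformed = [(cid, name.title(), countries.get(code, "Unknown"))
               for cid, name, code in rows]

# Load: write the transformed rows into the target database.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
target.executemany("INSERT INTO customers VALUES (?, ?, ?)", transformed)
target.commit()

print(target.execute("SELECT * FROM customers").fetchall())
# → [(1, 'Alice', 'United States'), (2, 'Bob', 'Germany')]
```

In a real migration the connections would point at two different database servers, but the extract → transform → load shape stays the same.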
Operations of Transform
• Selecting only certain columns to load
• Translating coded values
• Encoding free-form values
• Sorting
• Joining data from multiple sources
• Aggregation
• Splitting a column into multiple columns
• Deriving new calculated values
• …
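Several of these transform operations can be shown together in one short Python sketch. The rows, columns, and lookup values below are made up purely for illustration:

```python
# Hypothetical order rows from one source.
orders = [
    {"id": 1, "cust": "C1", "status": "S", "amount": 40.0, "internal": "x"},
    {"id": 2, "cust": "C2", "status": "P", "amount": 25.0, "internal": "y"},
    {"id": 3, "cust": "C1", "status": "P", "amount": 10.0, "internal": "x"},
]
customers = {"C1": "Alice Smith", "C2": "Bob Jones"}   # a second source to join with
status_lookup = {"S": "shipped", "P": "pending"}       # lookup table for coded values

transformed = []
for o in orders:
    name = customers[o["cust"]]                  # joining data from another source
    first, last = name.split(" ", 1)             # splitting a column into two
    transformed.append({
        "id": o["id"],                           # selecting only certain columns
        "first": first, "last": last,            # ("internal" is dropped)
        "status": status_lookup[o["status"]],    # translating coded values
        "amount_eur": round(o["amount"] * 0.9, 2),  # deriving a calculated value
    })

transformed.sort(key=lambda r: r["id"])          # sorting

# Aggregation: total amount per customer.
totals = {}
for o in orders:
    totals[o["cust"]] = totals.get(o["cust"], 0.0) + o["amount"]
print(totals)   # → {'C1': 50.0, 'C2': 25.0}
```

In PDI each of these operations would be a separate step on the canvas; here they are inlined to keep the sketch small.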
Pentaho Data Integration
• Pentaho Data Integration (PDI, also called Kettle) is a tool for ETL processes.
• Download: http://sourceforge.net/projects/pentaho/files/Data Integration/
• Two parts of PDI
  – Transformation: a transformation performs the ETL process itself
  – Job: a job is used to run transformations
Transformation
• Definition

Term         Description
Value        Values are part of a row and can contain any type of data: strings, floating-point numbers, integers, dates, or boolean values
Row          A row consists of 0 or more values that are processed together as a single entry
Input Stream A stack of rows that enters a step
Hop          A graphical representation of one or more data streams between two steps; a hop always represents the output stream of one step and the input stream of another
Note         Descriptive text that can be added to a transformation
[Diagram: a transformation in Spoon showing Steps connected by Hops, with a Note; the left panel lists the main step components: All components, Input, Output, Transformation]
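The row/step/hop vocabulary can be sketched in Python by treating each step as a generator and a hop as the stream of rows handed from one step to the next. The step names below are hypothetical, not PDI's:

```python
# Each step consumes an input stream of rows and yields an output stream;
# chaining the generators plays the role of the hops between steps.
def input_step():
    # A row is a sequence of values (strings, numbers, dates, booleans, ...).
    yield ("alice", 30)
    yield ("bob", 17)

def filter_step(rows):
    # A transform step: pass through only the rows that meet a condition.
    for name, age in rows:
        if age >= 18:
            yield (name, age)

def output_step(rows):
    # An output step: "write" the rows to a target (here, just a list).
    return list(rows)

result = output_step(filter_step(input_step()))
print(result)   # → [('alice', 30)]
```

The generators make the streaming nature of a transformation visible: rows flow through the steps one at a time rather than being materialized all at once.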
Job
• Definition

Term      Description
Job Entry A part of a job that performs a specific task
Hop       A graphical link between two job entries that defines the path of execution through the job
Note      Descriptive text that can be added to a job
[Diagram: a job in Spoon showing Job Entries connected by Hops, with a Note]
Components of PDI
• Spoon
  – GUI tool to design the ETL process transformations
  – Creates jobs which automate the database update process
  – Performs the typical data-flow functions, including reading, validating, refining, transforming, and writing data
• Pan
  – Application to run data transformations designed in Spoon
• Kitchen
  – Application to execute jobs in batch mode, usually on a schedule
• Carte
  – A web server that allows remote monitoring of running PDI ETL processes through a web browser
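As a rough sketch, Pan, Kitchen, and Carte are invoked from the command line along the following lines. The file paths are hypothetical, and the exact options should be checked against the documentation of your PDI version:

```shell
# Pan: run a transformation designed in Spoon (hypothetical .ktr path)
./pan.sh -file=/opt/etl/load_customers.ktr -level=Basic

# Kitchen: run a job in batch mode, e.g. from cron (hypothetical .kjb path)
./kitchen.sh -file=/opt/etl/nightly_update.kjb -level=Basic

# Carte: start the web server so running processes can be monitored remotely
./carte.sh localhost 8080
```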
Features of PDI
• Simple visual designer (graphical ETL tool)
• Dynamic transformations
• Integrated debugger for testing and tuning job execution
• Drag-and-drop integration
  – Rich library of pre-built components to access data
  – Integration with zero coding required
• Powerful administration and management
• Data profiling and data quality
  – Identify data that fails to comply with business rules and standards
  – Manage data quality through partners such as Human Inference
• Support for any big data source