Download - Pd i 2000 Lectures
-
8/17/2019 Pd i 2000 Lectures
1/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com.
July 2011
en a o a a n egra onforDatabase Developers
-
8/17/2019 Pd i 2000 Lectures
2/623
Welcome Agenda
● Audience and prerequisites
● Learning objectives
● Class process
● Course outline
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 2
-
8/17/2019 Pd i 2000 Lectures
3/623
Audience and Course Prerequisites
● Intended audience
— Course targeted for database administrators and databasedevelopers.
— Portions of the course assumes knowledge of SQL and relationaldatabase concepts.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 3
— There are no Pentaho Training prerequisites for this course.
-
8/17/2019 Pd i 2000 Lectures
4/623
Learning Objectives
● At the end of the course, you should understand:
— The basic architecture and features of Pentaho Data Integration.
— The concept and features of the advanced Pentaho Data Integration
Enterprise Edition.
— How PDI supports you in the Agile BI approach.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 4
-
8/17/2019 Pd i 2000 Lectures
5/623
Learning Objectives
● At the end of the course, you should be able to
— Load and write data from and to different data sources
— Join data from different sources
— Use PDI and ETL design patterns (like restartable solutions)
— Influence the performance aspects of databases and transformations
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 5
—
— Schedule jobs and transformations
— Use logging, monitoring and error handling features of PDI
— Load, transform, and create complex XML structures
— Use scripting (JavaScript, Formula, Java) in transformations
— Apply clustering and partitioning solutions for high volumes
-
8/17/2019 Pd i 2000 Lectures
6/623
Learning Objectives
● What are your objectives?
● What are you expectations?
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 6
-
8/17/2019 Pd i 2000 Lectures
7/623
Course Process● Daily schedule:
— 9:00 am – 5:00 pm
— 1 hour lunch break -noon
— 15 minute morning break -10:30— 15 minute afternoon break -3:30
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 7
● e course s a com na on o ec ure, emo an a s.
● Feel free to ask questions or to seek clarification!
● Online survey will be provided for feedback and suggestions.
-
8/17/2019 Pd i 2000 Lectures
8/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com.
Pentaho Data IntegrationOverview
-
8/17/2019 Pd i 2000 Lectures
9/623
Pentaho Data Integration (PDI) Introduction
● PDI is the product associated with the
KETTLE open source project:
— KETTLE is open source software that
makes up the core of PDI EnterpriseEdition.
— PDI Enterprise Edition is ‘whole
roduct’.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 9
• Professional technical support
• Maintenance releases
• EE-only features including
enterprise security integration,scheduling, and more
• Documentation
● Member of the Pentaho BI Suite
-
8/17/2019 Pd i 2000 Lectures
10/623
Use as a BI Platform Component
● PDI Jobs and Transformations can be
run in Pentaho BI Platform:
— PDI 3.X Jobs/Transforms execute inPentaho BI Platform 1.7.X andbeyond.
— PDI 4.X Jobs/Transforms execute in
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 10
en a o a orm . . anbeyond.
— For example: Fill a Pentaho Reportwith data from a Transformation.
— Details in a separate module.
-
8/17/2019 Pd i 2000 Lectures
11/623
Enterprise Edition (EE) Data Integration Server
Standalone without the BI Platform
PDI Enterprise Edition Architecture:
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 11
-
8/17/2019 Pd i 2000 Lectures
12/623
Enterprise Edition (EE) Data Integration Server
Primary features and functions:
● Execution: Executes ETL jobs and transformations using the Pentaho Data
Integration Engine.
● Security: Allows you to manage users and roles (default security) or integrate
security to your existing security provider such as LDAP or Active Directory.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 12
ETL jobs and transformations. This includes full revision history on content and
features such as sharing and locking for collaborative development
environments.
● Scheduling: Provides the services allowing you to schedule and monitorscheduled activities on the Data Integration Server from within the Spoon design
environment.
-
8/17/2019 Pd i 2000 Lectures
13/623
The Enterprise Console
● Provides a thin client for managing deployments of Pentaho Data
Integration Enterprise Edition including:
— Management of Enterprise Edition licenses.
— Monitoring and controlling activity on a remote Pentaho Data
Integration server.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 13
—
Transformations.
-
8/17/2019 Pd i 2000 Lectures
14/623
The KETTLE Project
● What is KETTLE?
— Recursive acronym much like GNU (GNU’s Not Unix).
— K ettle Extraction Transformation Transportation & Loading
Environment.● Created out of frustration with other solutions
— Custom built PL/SQL, C-SQL (embedded SQL), hacked VB solutions
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 14
— ommerc a pro uc s:• Oracle Warehouse Builder• Information Builders iWay• SQL Server DTS• Data Mirror
-
8/17/2019 Pd i 2000 Lectures
15/623
PDI (KETTLE) History
● Early Years: 2001 – 2005
— KETTLE project started by 2001 by Matt Casters
— Years focused on easy-to-use, maintainability and deployability
● KETTLE 2.3: December 2005
— First LGPL-licensed open source version
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 15
● Project acquired by Pentaho: April 2006
— Provided paid staff of developers
— Offered support and services to customers
-
8/17/2019 Pd i 2000 Lectures
16/623
PDI (KETTLE) History
● Pentaho Data Integration 2.4: February 2007
— Parallel processing support
— Multi-tab interface for developers editing multiple transformations
— Integration of transformation design and job execution userinterfaces
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 16
.
— Enhanced MySQL database support
— Transformation Explorer for organizing and accessing transformations
-
8/17/2019 Pd i 2000 Lectures
17/623
PDI (KETTLE) History
● Pentaho Data Integration 3.0: Nov 2007
— ETL Developer productivity— Rapid, community-fueled evolution
— Performance and scalability• Clean separation of data and
metadata Reduced Java object creation
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 17
• Clustering improvements:Support for multiple step copiesData re-partitioningDynamic cluster schemas
• Faster flat file reading:
Use of non-blocking I/O (NIO) to read large blocks at a timeParallel file readingSupport for lazy conversionSimplified algorithms
-
8/17/2019 Pd i 2000 Lectures
18/623
PDI (KETTLE) History
● Pentaho Data Integration 3.1: September 2008— Enterprise Console for remote monitoring and performance trend
analysis
— Ease of use improvements including a consolidated log, executionhistory, step performance graph results panel
— Numerous new steps, job entries and expanded data source support
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 18
● Pentaho Data Integration 3.2: May 2009— Dynamic Clustering - dynamically distribute execution to available
cluster/cloud nodes
— Named Parameters
— Usability improvements - annotate hops, more visual feedback— Over 20 new Transformation Steps/Job Entries with numerous
updates to existing steps
-
8/17/2019 Pd i 2000 Lectures
19/623
PDI (KETTLE) History
● Pentaho Data Integration 4.0: June 2010— Data Modeling and Visualization Perspectives (Pentaho Agile BI)
— EE Data Integration Server with scheduling, revision history andmore
— Usability improvements
— New steps/Job entries
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 19
● Pentaho Data Integration 4.1: November 2010— Hadoop Integration (Enterprise Edition only)
— One click disable/enable of steps downstream of a hop
— Metadata Injection (experimental first steps)
— Many new steps and job entries
-
8/17/2019 Pd i 2000 Lectures
20/623
PDI Version 4.2 (July 2011)
● Graphical performance and progress feedback for transformations● Metadata Injection
● Report bursting by the Pentaho Reporting Output step
● Automatic Documentation step● Talend Job Execution job entry
● Single Threader step for parallel performance tuning of large
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 20
transformations
● Allow a job to be started at a job entry of your choice
● The XML Input Stream (StAX) step to read huge XML files at optimal
performance
● The Get ID from Slave Server step allows multi-host or clustered
transformations to get globally unique integer IDs from a slave server… 4.2 continued on next page …
-
8/17/2019 Pd i 2000 Lectures
21/623
PDI Version 4.2
● Carte improvements1. reserve next value range from a slave sequence service
2. allow parallel (simultaneous) runs of clustered transformations
3. list (reserved and free) socket reservations service
4. new options in XML for configuring slave sequences
5. allow time-out of stale objects using environment variableKETTLE_CARTE_OBJECT_TIMEOUT_MINUTES
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 21
● Memory tuning of logging back-end with1. KETTLE_MAX_LOGGING_REGISTRY_SIZE
2. KETTLE_MAX_JOB_ENTRIES_LOGGED
3. KETTLE_MAX_JOB_TRACKER_SIZE allowing for flat memory usage fornever ending ETL in general and jobs specifically.
… 4.2 continued on next page …
-
8/17/2019 Pd i 2000 Lectures
22/623
PDI Version 4.2
● Repository Import/Export1. Export at the repository folder level
2. Export and Import with optional rule-based validations
3. Import command line utility allow for rule-based (optional)import of lists of transformations, jobs and repository exportfiles: http://wiki.pentaho.com/display/EAI/Import+User+Documentation
● ETL Metadata Injection
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 22
1. Retrieval of rows of data from a step to the “metadatainjection” step
2. Support for injection into the “Excel Input” step
3. Support for injection into the “Row normaliser” step
4. Support for injection into the “Row Denormaliser” step
● And many more new steps and job entries
-
8/17/2019 Pd i 2000 Lectures
23/623
Why Pentaho Data Integration?
● Ease of use:
— 100% metadata driven (define WHAT you want to do, not HOW to doit)
— No extra code generation means lower complexity— Simple setup, intuitive graphical designers and easy to maintain
● Flexibility:
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 23
— Never forces a certain path on the user— Pluggable architecture for extending functionality
-
8/17/2019 Pd i 2000 Lectures
24/623
Why Pentaho Data Integration?
● Modern Standards-based Architecture
— 100% Java with broad, cross platform support— Over 100 out-of-the-box mapping objects (steps and job entries)
— Enterprise class performance and scalability
● Lower total cost of ownership (TCO)
— No license fees
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 24
— Short implementation cycles— Reduced maintenance costs
-
8/17/2019 Pd i 2000 Lectures
25/623
Pentaho Data Integration Adoption
● Wide range of production deployments:
— Small and medium-sized companies
— Large enterprises
● Rapid product evolution
— Driven by Pentaho investment
— Includes significant community
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 25
contributions• ‘Contribution-friendly’architecture
• Natural fit for additional data
sources, targets andtransformations
-
8/17/2019 Pd i 2000 Lectures
26/623
Common Uses
● Data warehouse population:
— Built-in support for slowly changing
dimensions, junk dimensions and other data
warehouse concepts.
● Export of database(s) to text-file(s) or other
databases.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 26
● Import of data into databases, ranging fromtext-files to Excel© spreadsheets.
● Data migration between database applications.
-
8/17/2019 Pd i 2000 Lectures
27/623
Common Uses
● Exploration of data in existing database
(tables, views, synonyms, …).
● Information enrichment by looking up data
in various information stores (databases,
text-files, Excel© spreadsheets, …).
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 27
conditions in data transformations.
● Application integration.
-
8/17/2019 Pd i 2000 Lectures
28/623
Example
21
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 28
2 … or DB2, Teradata, Microsoft, Sybase, MySQL, PostgreSQL, Ingres, etc.
1 … or PeopleSoft, Siebel, JD Edwards, Axapta, Navision, SugarCRM, Compiere, and others
-
8/17/2019 Pd i 2000 Lectures
29/623
Agile BI
● Modeling and Visualization perspectives
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 29
-
8/17/2019 Pd i 2000 Lectures
30/623
Agile BI
● Modeling and Visualization perspectives.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 30
-
8/17/2019 Pd i 2000 Lectures
31/623
EE Data Integration Server
● Enterprise Repository and Content Management:— New repository based on JCR (Content Repository API for Java)— Improved Repository Browser— Enterprise security:
• Configurable Authentication including support for LDAP and MSAD• Task permissions to control what actions a user/role can performsuch as read/execute content, create content and administersecurity
—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 31
previous revisions of a Job or Transformation.— Ability to lock Transformations/Jobs for editing.— ‘Recycling bin’ concept for working with deleted files.
-
8/17/2019 Pd i 2000 Lectures
32/623
EE Data Integration Server
● Scheduling
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 32
-
8/17/2019 Pd i 2000 Lectures
33/623
Since PDI 4.0 & 4.2: Usability
● Hover-over menus & simplify the
connection of steps
● Graphical indicators on hops representing
the flow of information between steps.
● New activity indicators on jobs and steps (since 4.2) help highlight current
activity and bottlenecks during execution.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 33
-
8/17/2019 Pd i 2000 Lectures
34/623
Since PDI 4.0: Improved Logging
● Internal object IDs (small API change)
● Logging channels (GUIDs)
● Step logging
● Sniffing (debugging, data lineage)
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 34
-
8/17/2019 Pd i 2000 Lectures
35/623
Pentaho Data Integration Components
● Spoon
— Graphical environment for modeling— Transformations are metadata models
describing the flow of data— Jobs are workflow-like models for
coordinating resources, execution anddependencies of ETL activities
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 35
● Pan
— Command line tool for executingtransformations modeled in Spoon
● Kitchen
— Command line tool for executing jobsmodeled in Spoon
Spoon Interface – Designing a Transformation
Job Example
-
8/17/2019 Pd i 2000 Lectures
36/623
Pentaho Data Integration Components
● Carte
— Lightweight web\HTTP server for remotelyexecuting Jobs and Transformations.
— Carte accepts XML containing the transformation toexecute and the execution configuration.
— Enables remote monitoring
—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 36
— Remote servers running Carte are referred to asSlave Servers.
-
8/17/2019 Pd i 2000 Lectures
37/623
Enterprise Edition (EE) Data Integration Server
● Enterprise Edition alternative to Carte providing
— Execution and remote monitoring (can act as master/slave similar to Carte)
— Integrated scheduling
— Enterprise Security options— Enhanced content management including revision history and locking
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 37
-
8/17/2019 Pd i 2000 Lectures
38/623
Repository: the Metadata Store
● Kettle can store metadata in
— XML files
— RDBMS repository
— Enterprise repository
● Objects stored in repository
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 38
—
— Transformations
— Jobs
— Schemas
— User and profile definitions● Repository supports
collaborative development
-
8/17/2019 Pd i 2000 Lectures
39/623
How to start the user interface?
● Start Spoon.bat (Windows) or Spoon.sh (Linux, MacOS) in the Kettle
folder.
● The command launch-designer.bat /.sh is also possible in the archive
based installation.
● For our trainings we do not use the repository, all training data will be
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 39
s ore n e e sys em as - rans orma ons or - es o s n
XML format.
● Another option would be to start via Java Web Start:
— Latest version loaded via the Internet (usable for configuration
management).— JNLP-files (Java Network Launching Protocol) located in the
KETTLE/webstart folder.
-
8/17/2019 Pd i 2000 Lectures
40/623
Transformations
● Transformations are a network of logical tasks (Steps):
— Read a flat file
— Filter it
— Sort it— Load it into MySQL
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 40
-
8/17/2019 Pd i 2000 Lectures
41/623
Steps – Job Entries
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 41
-
8/17/2019 Pd i 2000 Lectures
42/623
Hops - within Transformations
● Are data pathways that connect steps together
● Allow schema metadata to be passed from step to step
● Determine the flow of data through the steps
● Example: The pathway for all data and the true and false path from a Filter
rows step.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 42
-
8/17/2019 Pd i 2000 Lectures
43/623
Hops – Data Movement: Copy or Distribute?
● Specify if data can either be copied or distributed between multiple hops
leaving a step (right click on a step and select ‘Data Movement’).
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 43
H O h T i T f i
-
8/17/2019 Pd i 2000 Lectures
44/623
Hops – Other Types in Transformations
● Info Steps: When data is retrieved (pulled) from another step.
● Error handling steps: When error handling is enabled.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 44
H ithi J b
-
8/17/2019 Pd i 2000 Lectures
45/623
Hops - within Jobs
● Are defining the execution sequence for job entries
● There are three types of hops within Jobs (right click on the hop):
— Unconditional
— Follow when result is true
— Follow when result is false
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 45
D t Fl Th di h i
-
8/17/2019 Pd i 2000 Lectures
46/623
Data Flow, Threading mechanism
● All Steps run are started and run in parallel:— Initialization sequence is not predictable
● PDI takes care of the correct data flow
— ‘Pulling and pushing’ data from step to step● PDI capable to process an unlimited number of rows:
— Steps vary on execution speed and memory consumption
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 46
— Set the threshold on number of rows will wait to be processed bynext step:
• If this number of waiting rows is reached, the source step waitsfor room.
• When there is room to process, more rows are put into the datastream.
Transformation Properties / Miscellaneous / Number of rows in
rowset
Th di h i
-
8/17/2019 Pd i 2000 Lectures
47/623
Threading mechanism
● There are additional options that influence the threading mechanism inTransformation Properties/Miscellaneous
1. Manage thread priorities: If there
is not much to do for a thread(step), put the thread to sleep for
some milliseconds. This reduces
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 47
2. (Serial) Single Threaded: This allows to build thread pools in combination
with sub-transformations. It will not work if any step is getting or putting rowsfrom/to more than one step (e.g. the Stream lookup step).
3. KETTLE properties: KETTLE_BATCHING_ROWSET,
KETTLE_ROWSET_GET/PUT_TIMEOUT
and is enabled by default.
Values Metadata and Data
-
8/17/2019 Pd i 2000 Lectures
48/623
Values, Metadata and Data
● Values are like columns in data rows:
— Composed of the metadata and the data.
● PDI version 3.0 separated metadata and data
— Metadata is only transported with the first data row.— All subsequent data rows reference to this metadata.
● PDI ma s database JDBC data t es to PDI data
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 48
— Implementation can (and often is) different from database todatabase.
Values Metadata and Data: Formatting etc
-
8/17/2019 Pd i 2000 Lectures
49/623
Values, Metadata and Data: Formatting etc.
● Metadata is used for formatting when:
— Data is presented e.g. in a preview.
— Data is written to the outside world, e.g. to a text- or XML-file.
● Metadata is NOT used for formatting when:— Data is just loaded from one table and written to another table.
● Metadata is used to create SQL-statements to know the field types, length etc.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 49
● Metadata is used to check the data for the right data type.
● Note: A change in metadata does not change the data, e.g.
— A modification of length does not change/truncate the data.
— A new formatting does not change the data.
● But, when you modify the type in the ‘Select Values’ step, the data isconverted to a new data type (e.g. from a String to a Number with the given
formatting).
Numeric Data Types
-
8/17/2019 Pd i 2000 Lectures
50/623
Numeric Data Types
● Number (double, floating point)
— Double precision can handle only 15 significant digits
— Is sometimes not exact
To avoid this, use BigNumber but will have performance drop● Integer
— Is optimal for storing and processing data from a performance
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 50
viewpoint.
● BigNumber
— Offers a extremely high precision level but needs more memory andCPU then the other numerical values.
● Note: Formatting Number Data Types is done by default with the
pattern #.#;-#.# that means with only one digit precision.
-
8/17/2019 Pd i 2000 Lectures
51/623
Other Data Types
-
8/17/2019 Pd i 2000 Lectures
52/623
Other Data Types
● Date
— Includes date and time— Note: Formatting Date Data Types is done by default with the
pattern yyyy/MM/dd HH:mm:ss.SSS
● Boolean
— True or false, representation in the database depends on databaseboolean su ort.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 52
— Note: Please see the following option to turn this on when yourdatabase supports this: Database connection/Advanced/Supportsboolean data type
● Binary
— Can hold any binary data like pictures, used mainly for BLOB data.● Serializable
— An object to transfer from/to specific steps (internally only).
Data Types and Silent Conversions
-
8/17/2019 Pd i 2000 Lectures
53/623
Data Types and Silent Conversions
● Until version 2.5.x, data types mostly converted silently.
— Can lead to some undiscovered type mismatches.
● PDI 3.0
— Data types are strictly checked
— No silent conversions take place (with some exceptions forcompatibility)
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 53
● Example for a type conflict:— Samples directory: Denormalizer - Simple example.ktr
Lazy Conversion
-
8/17/2019 Pd i 2000 Lectures
54/623
Lazy Conversion
● Lazy conversion is a delayed conversion of data types
— Provides a performance boost.
● Conversion takes place only where it is really needed
— Ideally at output steps
— Sometimes not at all—reading from text-file and writing back totext-file
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 54
— If output format is the same as the input format, no conversion
● Steps support lazy conversion
— Specifically: CSV File Input, Fixed File Input and Table Input
— Other steps support it transparently
Lazy Conversion
-
8/17/2019 Pd i 2000 Lectures
55/623
Lazy Conversion
● Binary form of data can cause issues:
— For example: sorting on binary form ignores character set sortingsequence.
● New feature in the ‘Select Values’ step
— Covered in more detail in a separate module
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 55
— Converts binary data to and from binary character data
Handling the User Interface
-
8/17/2019 Pd i 2000 Lectures
56/623
Handling the User Interface
● Main tree
— Lists all open Transformations and Jobs and their contents.
● Core objects and favorite Steps/Job entries
— Core is toolbox with all the available Steps/Job entries (plug-ins arein bold).
— Favorites are static ‘most’ used steps.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 56
● Notes— Can appear anywhere on graphical view.
— Right clicking on the canvas and selecting Add Note.
● Options and settings
— Options are valid for the entire PDI environement.
— Settings are valid for a particular transformation or job.
Handling the User Interface
-
8/17/2019 Pd i 2000 Lectures
57/623
Handling the User Interface
● Draw new hops
— Other methods:• Middle or scroll wheel button, click on the first ste and dra onto the
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 57
second.
• Use SHIFT+Click and drag from one step to another• Select 2 steps, right click on one of them and select ‘New hop’.• Drag ‘Hops’ onto the canvas.
● Inserting a step (or job entry) between others:
— Move the step over the arrow until the arrow becomes drawn in bold— Release the mouse-button
● Window sizes:
— May have to resize some dialogs to all parameters
Handling the User Interface
-
8/17/2019 Pd i 2000 Lectures
58/623
g
● Click right on the first column in any dialog table (grid) for a list of allthe options.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 58
Handling the User Interface
-
8/17/2019 Pd i 2000 Lectures
59/623
g
● Click right on a Step for a list of all the options in the context menu.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 59
Run and Preview
-
8/17/2019 Pd i 2000 Lectures
60/623
● Can execute the entire Transformation or just preview a particular step.— Preview also possible by selecting a step and pressing the F10 key.
● Need at least two steps connected with a hop to run or preview
● Closing a preview (vs. choosing the stop/get more rows options) willleave the Transformation in a paused execution state. If you attempt to
restart the Transformation, it will tell you it can not be started twice.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 60
● Note: Preview may be destructive.— Subsequent steps are also initiated (could cause truncation of target
table).— Rows passed to subsequent steps; not stopped at the previewed step.
— You may temporarily disable the hop after the step to preview.
Log View
-
8/17/2019 Pd i 2000 Lectures
61/623
● Shows statistics associated with execution of a Transformation.
● Used to understand performance and to check the results.
● Logging can be very granular – down to the row level if needed.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 61
-
8/17/2019 Pd i 2000 Lectures
62/623
Usage of ‘Safe Mode’ at design time
-
8/17/2019 Pd i 2000 Lectures
63/623
● Performing a union on data with different layouts generates warning
— The name of field number 1 is not the same as in the first rowreceived: you're mixing rows with different layout. Field[ThisIsAStringValue String(10)] does not have the same name as field[ThisIsANumberValue Number].
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 63
Analyzing Errors
-
8/17/2019 Pd i 2000 Lectures
64/623
2010/05/18 16:38:00 - Generate Rows.0 - ERROR (version 4.0.0-…) : Couldn't parse Integer
field [WrongType] with value [abc] -->org.pentaho.di.core.exception.KettleValueException:
● Log entries include
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 64
— Step (Generate Rows) generating the error— Detailed information with the PDI version (useful when submittingcases)
— Any stack traces (useful for bug tracking due to program errors)
— Error lines are in red since 4.0 (easier to find)● Use Show Error Lines to find errors easier
Debugging
-
8/17/2019 Pd i 2000 Lectures
65/623
● Introduced in PDI 3.0
● Provides condition break points
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 65
-
8/17/2019 Pd i 2000 Lectures
66/623
Special PDI Application Files
-
8/17/2019 Pd i 2000 Lectures
67/623
● User-specific files found in .kettle directory of user’s home directory
— kettle.properties: Default properties file for variables
— shared.xml: Default shared objects file
— db.cache: The database cache for meta data— repositories.xml: The local repositories file
— .spoonrc: User interface settings, last opened transformation/job …
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 67
— .languageChoice: User language (delete to revert language)
● Some temporary logs stored also in temp folder :
— Usually cleaned by Java Virtual Machine (VM).
— May need to be cleaned (deleted) due to defects in VM.— Directory of temporary logs determined by OS and VM
(for example: typically C:\Windows\Temp on Windows©).
Special PDI Application Files – KETTLE_HOME
-
8/17/2019 Pd i 2000 Lectures
68/623
● The HOME directory may change depending on the user who is logged on. As aresult, the configuration files that control the behavior of PDI Jobs and
Transformations are different from user to user. Setting the KETTLE_HOME
variable can be performed system wide by the operating system or just before
the start of PDI using a shell script or batch. For example: use the SET
command.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 68
● Point the KETTLE_HOME to the directory that contains the .kettle directory. The
.kettle gets appended by PDI. For example: when you have stored the common
files in C:\Pentaho\Kettle\common\.kettle you need to set the KETTLE_HOME
variable to C:\Pentaho\Kettle\common.
● When running PDI from the Pentaho BI Platform, please see the Knowledge Basefor setting the variable.
Pentaho Data Integration
-
8/17/2019 Pd i 2000 Lectures
69/623
The ‘Conceptual’ Model – any questions?
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 69
Pentaho Data Integration - Conceptual model
-
8/17/2019 Pd i 2000 Lectures
70/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 70
-
8/17/2019 Pd i 2000 Lectures
71/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com.
Connections, Inputs and Outputs
Database Connections
-
8/17/2019 Pd i 2000 Lectures
72/623
● Multiple database connections to different databases can be created.
● With a PDI repository:
— Defined connections readily available to transformations and jobs.
— Connection information for the repository itself is stored inrepositories.xml.
● Without a PDI repository:
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 72
— Connection definition contained in a single Transformation or Job.
— Can share connection definitions in subsequent Transformations andJobs.
Database Connections
-
8/17/2019 Pd i 2000 Lectures
73/623
● Available database connections appear in the Main Tree.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 73
● Choose Share in the context menu of any connection to share it.
— Shared connections appear in bold.
Database Connections
-
8/17/2019 Pd i 2000 Lectures
74/623
● General database connection options
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 74
Access Via JDBC
-
8/17/2019 Pd i 2000 Lectures
75/623
● Pentaho Data Integration ships with the most suitable JDBC drivers for
the listed databases.
● Additional drivers can by added to the lib/libext directory.
— Use Generic tab of connection dialog to use unlisted drivers.
— Permits connections to non-listed databases.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 75
● Existing drivers can be replaced in the lib/libext directory.
● Special database issues and experiences with different JDBC drivers can
be found in the Pentaho Data Integration Wiki:
— http://wiki.pentaho.com/display/EAI/Special+database+issues+and+
experiences
Other Access Methods
-
8/17/2019 Pd i 2000 Lectures
76/623
● ODBC connections are possible:
— ODBC connections must be defined in Windows.
— ODBC connections made via ODBC-JDBC-Bridge.
• Some limitations of the SQL syntax
• Generally slower than JDBC due to additional layer.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 76
● Use a JNDI connection to connect to a datasource defined in an
application server like JBoss or Webshere.
● Plugin specific access methods are supplied by a specific database
driver (like SAP R/3 or PALO connections).
Advanced Database Connections
-
8/17/2019 Pd i 2000 Lectures
77/623
● Pooling— Useful for performance tuning
— Can limit number of connections for database client licensing reasons
● Driver Specific Options (Options tab)
— Pass additional parameters to the drivers
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 77
— Allows driver specific tuning for performance
— Database vendor documentation is available by clicking help button
Advanced Database Connections
-
8/17/2019 Pd i 2000 Lectures
78/623
● Clustering— Enables clustering for the database connection and creation
connections to data partitions.
— Requires partition ID, name and port of host, and user name and
password● Identifiers (Advanced tab)
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 78
—
— Post Connection SQL
Quoting
-
8/17/2019 Pd i 2000 Lectures
79/623
● Quoting is used when reserved names or special characters are used.
— For example: Field names, ‘sum’, ‘V.A.T.’, ‘overall sales’.
● PDI has an internal list of reserved names for most of the supported
database types.
● PDI’s automatic quoting can be overridden.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 79
— Feedback on quoting is always welcome to improve quoting
algorithms.
Database Explorer
-
8/17/2019 Pd i 2000 Lectures
80/623
● In the toolbar:
● In the connection context menu:
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 80
DB Cache: The Metadata Cache
-
8/17/2019 Pd i 2000 Lectures
81/623
● Metadata for fields each connection and SQL statement is cached in
internal file ‘db.cache’.
● Metadata cache must be refreshed.
— Refreshed automatically when the table is changed in the PDI
context (from the SQL statement window).
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 81
● Manual refreshing of cache maybe necessary.
— Typically detected by missing fields or mismatches in ‘Show
input/output fields’ of a step.
Impact Analysis
-
8/17/2019 Pd i 2000 Lectures
82/623
● What impact does the Transformation have on the used databases?
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 82
SQL Editor
-
8/17/2019 Pd i 2000 Lectures
83/623
● Creates the needed DDL for the output steps related to a databasetable, often CREATE statements for tables or indices.
● SQL button in the toolbar creates all needed DDL for tables.
● No automatic mechanism to alter tables when the layout changed.
— For example: A field type from a source table is changed
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 83
— can e eas y an manua y c ange .
Steps Covered In This Section
-
8/17/2019 Pd i 2000 Lectures
84/623
● Input Steps Output Steps Special Dimension Steps
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 84
Text File Input
-
8/17/2019 Pd i 2000 Lectures
85/623
● Reads a large number of different text files, including CSV files generated by
spreadsheets.
● Text File Input Options:
— Filename specification options: Files may be added to ‘selected files’ list.
— Accept filenames from previous step: Filename can come from any source.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 85
— .
— Error handling: Allows user to define how to react to errors.— Filters: Allows user to specify the lines to be skipped in the text file.
— Fields: Allows user to define characteristics of the fields.
— Formats: Includes formatting of number and date fields.
CSV File Input
-
8/17/2019 Pd i 2000 Lectures
86/623
● Reads a CSV file format only.
● Due to the internal processing, this step is much faster.
● Options are a subset of Text File Input
— NIO buffer size: Set the buffer size used by the Java I/O classes(NIO, improved performance in the areas of buffer management).
—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 86
.
Fixed File Input
-
8/17/2019 Pd i 2000 Lectures
87/623
● Reads a fixed file format only.
● Due to the internal processing, this step is much faster.
● Options are a subset of Text File Input
— NIO buffer size and lazy conversion options are identical to the CSVfile input options.
—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 87
cluster with many workers processing on large file.
Table Input
-
8/17/2019 Pd i 2000 Lectures
88/623
● Reads information from a database, using a connection and SQL.
● Table Input Options
— Step Name: The name has to be unique in a single Transformation.
— Connection: The database connection used to read data from.
— S L: The statement used to read information from the database
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 88
connection; may be any query.
— Insert data from step: The input step name where parameters for
the SQL come from, if appropriate.
— Limit: Sets the number of lines that are read from the database.
Excel Input
R d i f i f E l fil
-
8/17/2019 Pd i 2000 Lectures
89/623
● Reads information from one or more Excel files.
● The options provided by PDI GUI for accepting excel inputs include:
— Step Name: Name of the step.
— File Tab: To define the filenames, with variable support.
— Sheet Tab: To define the sheet(s) to import.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 89
— Fields Tab: To specify the fields that need to be read from the
Excel files.
— Error handling Tab: Allows user to define how to react when error
is encountered.
— Content Tab: Includes the sub options of Header, No empty rows,Stop on empty rows, Field name, Sheet name field, Row number
field and Limit.
Access Input
-
8/17/2019 Pd i 2000 Lectures
90/623
● Reads information from one or more Access® files.
● No ODBC connection necessary.
— Allows Access® files to be read on non-Windows platforms.
● Access Input Options:
— Ste Name: Name of the ste .
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 90
— File Tab: To define the filenames, with variable support.— Content Tab: Specify the table name and the inclusion of the file
name, table name, row number and limit.
— Fields Tab: Specify the fields that need to be read from the Accessfiles.
XBase Input
R d d t f t t f DBF fil d i t ll d th XB
-
8/17/2019 Pd i 2000 Lectures
91/623
● Reads data from most types of DBF file derivates called the XBasefamily. (dBase III/IV, Foxpro, Clipper, ...)
● Options:
— Step name: Unique name (in transformation) of the step.— Filename: Name of XBase file with variable support.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 91
— Limit size: Only read this number of rows; zero means unlimited.
— Add rownr?: Adds a field to the output with the specified name that
contains the row number.
Apache Virtual File System (VFS) support
-
8/17/2019 Pd i 2000 Lectures
92/623
● Apache VFS allows references to files from virtual any location.
— Apache VFS support available in the Pentaho Platform and Pentaho
Analysis.
● All file names are treated as URIs.
— file:///somedir/somefile.txt
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 92
— zip:http://somehost/downloads/somefile.zip
— http://myusername@somehost/index.html— sftp://myusername:mypassword@somehost/pub/dl/somefile.tgz— webdav://somehost:8080/dist
● Further information: http://commons.apache.org/vfs/filesystems.html
Generate Rows
O b f d f l b ll
-
8/17/2019 Pd i 2000 Lectures
93/623
● Outputs a number of rows, default is empty but optionally containing a
number of static fields.
● Options
— Step name: Unique name (in Transformation) of the step.
— Limit: The number of rows user wants to out ut.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 93
—
Fields: Static fields user might want to include in the output row.
-
8/17/2019 Pd i 2000 Lectures
94/623
LDAP Input
R d d f LDAP
-
8/17/2019 Pd i 2000 Lectures
95/623
● Reads data from a LDAP server.
● Options
— Host: Hostname or IP address of the LDAP server.
— Port: The TCP port to use, typically 389.
— User Authentication: Enable to pass authentication credentials to
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 95
.
— Username/Password: For authenticating with the LDAP server.
— Search base: Location in the directory from which the LDAP searchbegins.
— Filter String: The filter string for filtering the results.
— Fields: Define the return fields and type.
De-serialize/Serialize to/from File
R d d it d t d PDI t d t t th f d t fil
-
8/17/2019 Pd i 2000 Lectures
96/623
● Read and write data and PDI metadata together from and to a file.
● Use Cases
— Transfer data from one Transformation to another where the memory
is not sufficient to hold the amount of data.— Save data and metadata to be processed at another time.— Transfer data to another user with no need to reanal ze the
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 96
metadata (instead of text files).
Text File Output
● Exports data to a variety of different text file formats including CSV
-
8/17/2019 Pd i 2000 Lectures
97/623
● Exports data to a variety of different text file formats, including CSV.
● Options
— Extension
— Append— Separator— Enclosure
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 97
— Header/Footer
— Zipped— Include step number/date/time in file name— Encoding— Right pad fields
— Split every or ‘n’ row(s)
Table Output
● Insert (only) information in a database table
-
8/17/2019 Pd i 2000 Lectures
98/623
● Insert (only ) information in a database table.
● Options
— Target table
— Commit size— Truncate table— I nore insert errors
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 98
— Partition data over tables
— Use batch update for inserts— Return auto-generated key— Name auto-generated key field— Is the name of the table defined in a field
Insert / Update
● Automates simple merge processing:
-
8/17/2019 Pd i 2000 Lectures
99/623
● p g p g— Look up a row using one or more lookup keys.— If a row is not found, insert the new row.— If found and targeted fields are the identical, do nothing.
— If found and targeted fields are not identical, update the row.● Options
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 99
—
— Connection— Target table— Commit size— Keys
— Update fields— Do not perform any updates (If used, operates like Table Output,but without any Insert errors caused by duplicate keys).
Update
● Same as the Insert / Update step except no insert is performed in thedatabase table
-
8/17/2019 Pd i 2000 Lectures
100/623
Same as the p step except no insert is performed in thedatabase table.— ONLY updates are performed.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 100
-
8/17/2019 Pd i 2000 Lectures
101/623
Excel Output
● Exports data to an Excel® file
-
8/17/2019 Pd i 2000 Lectures
102/623
p● Options
— Sheet name— Protect sheet with a password— Use a template (e.g. with a preformatted sheet)— Append or override the contents of the template
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 102
Access Output
● Exports data to an Access® file
-
8/17/2019 Pd i 2000 Lectures
103/623
p
— ODBC not required
— Can be used on non-Windows platforms
● Options
— Database filename (.mdb)
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 103
— — Target table— Create table— Commit size
XML Output
● Writes rows from any source to one or more XML files
-
8/17/2019 Pd i 2000 Lectures
104/623
● Options
— File name
— Extension— Include stepnr in file name— Include date in file name
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 104
— Include time in file name
— Split every N rows— Parent XML element— Row XML element— Fields
— Zipped— Encoding
Dimension Update / Lookup
● Implements slowly changing dimensions (Type I and Type II)
-
8/17/2019 Pd i 2000 Lectures
105/623
● Can be used for updating a dimension table and for looking up values in
a dimension. (Lookup, if not found, then update/insert)
● Each entry in the dimension table has the following fields— Technical key: The primary (surrogate) key of the dimension—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 105
— Start of date range: Field containing validity starting date
— End of date range: Field containing the validity ending date— Keys: Business keys used in source systems; used for lookup
functionality.— Fields: Actual information of a dimension and can be set individually
to update all versions or to add a new version when a new valueappears.
-
8/17/2019 Pd i 2000 Lectures
106/623
-
8/17/2019 Pd i 2000 Lectures
107/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com.
Introduction to the Training Data
Introduction to the Training Data
● Represents fictitious company Steel WheelsBuys collectable model cars trains trucks etc
-
8/17/2019 Pd i 2000 Lectures
108/623
— Buys collectable model cars, trains, trucks, etc.from manufacturers
— Sells to distributors across the globe
● Data adapted from the sample data provided by
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 108
Eclipse’s BIRT project.
● pentaho_oltp database has many tables:
— Offices, Employees, Customers, Products,
Orders, Orderdetails, Payments
pentaho_oltp
Pentaho Training Data: Tables● Offices
— 7 offices worldwide (San Francisco, Boston, NYC, Paris, Tokyo,
-
8/17/2019 Pd i 2000 Lectures
109/623
( , , , , y ,
Sydney, London)
— Headquartered in San Francisco, CA
— Each office is assigned to a sales territory (APAC, NA, EMEA or
JAPAN)
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 109
● Employees
— 23 employees: 6 Execs and 17 Sales Reps
— Each assigned to one of the seven offices
— Sales Reps also assigned to a number of customers (distributors)— New Sales Reps (that are still in training) don’t have assigned
customers
Pentaho Training Data: Tables● Customers
— Steel Wheels has 122 customers worldwide
-
8/17/2019 Pd i 2000 Lectures
110/623
— Approximately 20 of those new customers without a sales rep any orders
— Each has a credit limit which determines maximum outstanding balance
● Products
— 110 unique models purchased from 13 vendors
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 110
— , , ,
Trucks and Buses, Planes, Ships, Trains.
— Additionally models are classified based on their scale (e.g. 1:18, 1:72 etc.)
— Cost paid and MSRP (suggested retail price)
● Payments
— Customers make payments on average 2-3 weeks after they place an order.— In some cases one payment covers more than 1 order.
Pentaho Training Data: Tables
● Orders
— 2560 orders, which span the period from 1/1/2000 to 12/31/2007
-
8/17/2019 Pd i 2000 Lectures
111/623
, p p
— Each in a given state: In Process, Shipped, Cancelled, Disputed,Resolved, or On Hold.
● OrderDetails
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 111
— r er ne ems re ec nego a e pr ce an quan y per pro uc
— Training data has 23,640 OrderDetails
pentaho_oltp
Database Schema OLTP (Source Database)Database schema is in $PENTAHO_TRAINING/data/models
-
8/17/2019 Pd i 2000 Lectures
112/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 112
Database Schema OLAP (Target Database)Database schema is in $PENTAHO_TRAINING/data/models
-
8/17/2019 Pd i 2000 Lectures
113/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 113
-
8/17/2019 Pd i 2000 Lectures
114/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com.
Data Warehouse Steps
Target Database Design (your Data Warehouse)
● Prior to defining mappings from sources, the target database must bedesigned.
-
8/17/2019 Pd i 2000 Lectures
115/623
g
● Staging tables and/or file format designs are often identical to source
format to simplify extract processing.● Target star schemas must be designed to meet analytical processing
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 115
nee s as we as eas ty o oa rom ent e source ata.
Source to Target Mapping
● Identifies how each target table/column will be populated from the
sources.
-
8/17/2019 Pd i 2000 Lectures
116/623
● Includes details of the following:
— Source table/column or how the value is otherwise derived
— Data types/lengths and any format Transformation
— Special cleansing or Transformation logic
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 116
— Exception handling
● This document will aid in creation of the actual programmingspecifications for the ETL developers or in creation of instructions
(technical meta data) for the ETL tool.
Dimensional Design in your Data WarehouseBest Practice Analytical Database Design (Kimball, et al.)
EmployeeProduct
Geography
-
8/17/2019 Pd i 2000 Lectures
117/623
Product
Sales Facts
Customer Time
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 117
Fact Table Dimension Table
● A fact table contains items that you wantto measure. For example:
— Revenue
— Amount sold
— Average price
● Metrics are the values you are trying to
report.
● Dimensions are the ways that you want tolook at the data. For example:
— By customer
— By date
— By product
● Dimensions provide context in reports
(grouping, labels, filters, etc.).
Dimensional models are often called Star-Schemas.
Why Fact- and Dimension tables?
● The idea behind this separation is that you have a large fact table with
much smaller dimension tables joined to it.
-
8/17/2019 Pd i 2000 Lectures
118/623
● Fact tables are huge compared with the dimensions. They are usually
huge compared with anything you have had in an OLTP database (e.g.
when you store historical data or add data from external sources).
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 118
with a speedy access.
● Fact and dimension tables should be joined by integers.
Slowly Changing Dimensions● In a dimension table the value of an attribute may change over time.
● You may need to reference the old and new value.
-
8/17/2019 Pd i 2000 Lectures
119/623
● Examples:
— You have a product dimension with the price as an attribute. The
price changes over time. You may want to calculate the profit forcertain periods by the product price for a certain time. Then you
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 119
.
— You have a customer dimension with the location (e.g. defined by
the zip code) as an attribute. Customers can relocate and getanother zip code. The geographic organized sales team madebudgets for their customers and want to compare this with theactual. What happens when a customer relocates into another sales
region?● You can solve this easy by storing your dimensions with the concept of
Slowly Changing Dimensions.
Slowly Changing Dimensions
● ‘Type 1’ Dimension:
— New information overwrites the old information.— Old information is not saved, it is lost.
-
8/17/2019 Pd i 2000 Lectures
120/623
O ,— Can only be used in applications where maintaining a chronicle of data is not
essential; used for update only.
● ‘Type 2’ Dimension:— New information is appended to the old information.—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 120
, .— Can be used in applications where maintaining a chronicle of data is
required so that changes in a data warehouse can be tracked.● ‘Type 3’ Dimension:
— New information is saved alongside the old information.— Old information is partially saved.
— Additional columns are created to show the time from which the newinformation has taken effect.— Enables view of facts in both current state and “what-if” past states of
dimensional values
Dimension Update / Lookup
● Implements slowly changing dimensions: Type 1 and Type 2
● Can be used for updating a dimension table and for looking up values in a
di i L k if f d h d /i
-
8/17/2019 Pd i 2000 Lectures
121/623
dimension. Lookup, if not found, then update/insert.
● In dimension implementation each entry in the dimension table has the
following fields:
— Technical key: This is the primary (surrogate) key of the dimension.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 121
— Version field: Shows the version of the dimension entry (a revision number).
— Start of date range: This is the fieldname containing validity starting date.
— End of date range: This is the fieldname containing the validity ending date.
— Keys: These are business keys used in source systems such as customer no,product id, etc. These are used for lookup functionality.
— Fields: These fields contain the actual information of a dimension and can be
set individually to update all versions or to add a new version when a newvalue appears.
Example for Slowly Changing Dimensions
● We create a new dimension for the product from the pentaho_oltp
database into the pentaho_olap database.
-
8/17/2019 Pd i 2000 Lectures
122/623
OLAPOLTP
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 122
Example for Slowly Changing Dimensions
● The dimension step added the fields:
— productid: The technical key
version: The version of the dimension row as a reference
-
8/17/2019 Pd i 2000 Lectures
123/623
— version: The version of the dimension row as a reference
— date_from, date_to: The valid date range for the actual dimension
row.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 123
-
8/17/2019 Pd i 2000 Lectures
124/623
Example for Slowly Changing Dimensions
● The data within our dim_product table looks like this:
-
8/17/2019 Pd i 2000 Lectures
125/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 125
● Look at the first line: This is the default row returned when you lookup
a dimension and the key is not found. This row (for not found entries) is
created automatically by PDI with null values.
Example for Slowly Changing Dimensions
● Now we have a price change for two products:
-
8/17/2019 Pd i 2000 Lectures
126/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 126
Example for Slowly Changing Dimensions
● The price change came from a price file where the other information
like productname failed, so they are missing in the new versions:
-
8/17/2019 Pd i 2000 Lectures
127/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 127
● We can look it up before the dimension load.
Example for Slowly Changing Dimensions
● You have to uncheck the ‘Update the dimension?’ box.
-
8/17/2019 Pd i 2000 Lectures
128/623
● And enter the fields you want to retrieve (mind the change of the
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 128
columns, especially ‘Type of return field’ instead of ‘Type of dimension
update’):
Example for Slowly Changing Dimensions
● The handling of the Stream Datefield.
-
8/17/2019 Pd i 2000 Lectures
129/623
● When the date field is empty, the actual date and time for lookup and
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 129
new inserts is used.
● When you have a date field with a ‘valid from’ date, you can use thishere.
Example for Slowly Changing Dimensions
● Example for a different date used.
-
8/17/2019 Pd i 2000 Lectures
130/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 130
Example for Slowly Changing Dimensions
● If you do a lookup of a dimension and want to include the effective
version and the date ranges of your found row, add them to fields:
-
8/17/2019 Pd i 2000 Lectures
131/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 131
-
8/17/2019 Pd i 2000 Lectures
132/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com.
Lookups
Lookups
● The Lookup feature of PDI accesses a data source to find values per adefined matching criteria, i.e. key.
● The following steps have lookup functionality in PDI:
-
8/17/2019 Pd i 2000 Lectures
133/623
— Commonly Used
• Database Lookup• Stream Lookup
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 133
• Merge Join
— Others• Database Join
• Call database procedure
• Dimension Update/Lookup
• Combination Update/Lookup• HTTP Lookup
Database Lookup
● Lookup attributes from a single table based on a key-matching criteria
● Options for performing database lookup include:
— Lookup table: The name of the table where the lookup is done
-
8/17/2019 Pd i 2000 Lectures
134/623
Lookup table: The name of the table where the lookup is done.
— Enable cache: This option caches database lookups for the duration ofthe Transformation.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 134
• Ena ing t is option can increase per ormance.
• Danger: If other processes are changing values in the table do notset this option.
— Load all data from table: Preload the complete data in memory at the
initialization phase. This can replace a Stream Lookup step incombination with a Table Input step and is faster.
Database Lookup (cont)
SELECT
-
8/17/2019 Pd i 2000 Lectures
135/623
ATTRIB1 as ‘FullName’
FROM
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 135
_
WHERE
ID =
-
8/17/2019 Pd i 2000 Lectures
136/623
Stream Lookup (Example)
● The following Transformation add information coming from a text-file todata coming from a database table:
-
8/17/2019 Pd i 2000 Lectures
137/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. •US and Worldwide: +1 (866) 660-7555 | Slide 137
● ‘B’ is the source step. It is where the in-memory lookup stream resides.
Merge Join
● Takes TWO sorted streams and performs a traditional JOIN on EQUALITY
— INNER = Only output a row when the key is in both streams
— LEFT OUTER = Output a row even if there is no matching key in 2nd
Step
-
8/17/2019 Pd i 2000 Lectures
138/623
Step
— RIGHT OUTER = Output a row even if there is no matching key in 1
st
Step
— FULL OUTER = Output a row regardless of matching
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 138
● Options provided by PDI GUI for merge join include:
— First Step: Step to refer to as the 1st
— Second Step: Step to refer to as the 2nd
— Keys for 1st: The key fields from the 1st Stream
— Keys for 2nd: The key fields from the 1st Stream
— Join Type: The key fields from the 1st Stream
Merge Join (cont)
-
8/17/2019 Pd i 2000 Lectures
139/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 139
FULLOUTER
Database Join
● Options provided by Kettle GUI for database join procedure include:
— SQL: The SQL query to launch towards the database.
— Number of rows to return: 0 means all, any other number limits the
-
8/17/2019 Pd i 2000 Lectures
140/623
number of rows.
— Outer join?: When checked, will always return a single record for eachinput stream record, even if the query did not return a result.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 140
● The parameters to use in the query.
—Parameters noted as ? in the query—Order of fields in parameter list must match the order of the ? in the
query.
Call Database Procedure
● Executes a database procedure (or function) and gets the result(s) back.
● Options for call database procedure include:
— Proc-name: Name of the procedure or function to call.
-
8/17/2019 Pd i 2000 Lectures
141/623
— Enable auto-commit: This can be used to perform updates in the
database using a specified procedure. The user can either have thechanges done using auto-commit or by disabling this. If auto-commit is
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 141
disabled, a single commit is performed after the last row is received
by this step.
— Result name:
• When calling a database function this field needed.
• When calling a database procedure this field must not be
entered.
Call Database Procedure (cont)
● Result type: Type of result of function call. Not used in case of a
procedure.
● Parameters: List of parameters that the procedure or function needs:
-
8/17/2019 Pd i 2000 Lectures
142/623
p p
● Field name: Name of the field.● Direction: Can be either IN (input only), OUT (output only), INOUT (value
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 142
s c ange on e a a ase .
● Type: Used for output parameters so that Kettle knows what comes back.
Call Database Procedure (cont)
● Other buttons available in this step are:
—“Find it...” button: Searches the specified database connection for
available procedures and functions.
-
8/17/2019 Pd i 2000 Lectures
143/623
—“Get fields” button: Fills in all the fields in the input streams to make
the process easier by deleting the lines that are not needed and re-
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 143
.
Dimension Lookups
● Uses the same dimension step used for updating● Same fields, same setup
● Stream Datefield – stream_date between EFFECT and EXPIRE
-
8/17/2019 Pd i 2000 Lectures
144/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 144
HTTP Lookup
● Covered in Web Service module
-
8/17/2019 Pd i 2000 Lectures
145/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 145
-
8/17/2019 Pd i 2000 Lectures
146/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com.
Field Transformations Part 1
Field Transformations
● Field Transformations are steps that operate at the field level within a
stream record.● The step types covered in this section include:
— Select ValuesCalculator
-
8/17/2019 Pd i 2000 Lectures
147/623
— Calculator
— Add Constants— Null If
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 147
Select Values
● This step type is used to:
— Select/remove fields from the process stream.— Rename fields— Specify/change the length and/or precision of fields.
● 3 Tabs are provided:
-
8/17/2019 Pd i 2000 Lectures
148/623
● 3 Tabs are provided:
— Select and Alter: Specify the exact order and name in which the fieldshave to be placed in the output rows.
—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 148
rows.
— Meta-data: Change the name, type, length and precision (the meta-data) of one or more fields.
Select Values (cont)
● Options provided for this step include:
— Step name: Name of the step. This name has to be unique in a singleTransformation.
— Attributes that can be changed for a given field:• Fieldname: The fieldname to select or change
-
8/17/2019 Pd i 2000 Lectures
149/623
• Rename to: To be left blank if rename not required.• Length: Number has to be entered to specify the length (-1: no length
specified).
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 149
-precision.
Calculator
● Provides a list of functions that can be executed on field values.
● An important advantage Calculator has over custom JavaScript scripts is
that the execution speed of Calculator is many times that of a script.
B id th t (Fi ld A Fi ld B d Fi ld C) th l d
-
8/17/2019 Pd i 2000 Lectures
150/623
● Besides the arguments (Field A, Field B and Field C) the user also needs
to specify the return type of the function.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 150
values were calculated. This is useful for removing temporary values.
Calculator (cont)
● The list of functions supported by
the calculator includes commonlyused mathematical and datefunctions.
-
8/17/2019 Pd i 2000 Lectures
151/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 151
Add Constants
● Adds constants to a stream.
● The use is very simple:
— Specify the name
— Enter value in the form of a string
-
8/17/2019 Pd i 2000 Lectures
152/623
Enter value in the form of a string
— Specify the formats to convert the value into the chosen data type.
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 152
Null If
● If the string representation of a field is equal to a specified value, then
the output value is set the null (empty).
-
8/17/2019 Pd i 2000 Lectures
153/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 153
-
8/17/2019 Pd i 2000 Lectures
154/623
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com.
Set Transformations
Set Transformations
● Set Transformations are steps that operate on the entire set of data
within a stream.
● The operations operate across all rows and not strictly within a row
● The steps covered in the section include:
-
8/17/2019 Pd i 2000 Lectures
155/623
● The steps covered in the section include:
— Filter Rows
—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 155
— Join Rows
— Merge Rows— Unique Rows
— Aggregate Rows
— Group By
Filter Rows● Filter rows based upon conditions and comparisons with full boolean logic
supported.
● Output can be diverted into 2 streams: Records which pass (true) the condition
and records which fail (false).
-
8/17/2019 Pd i 2000 Lectures
156/623
● Often used to:— Identify exceptions that must be written to a bad file
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 156
— Branc trans ormation ogic i sing e source as two interpretations
● The options provided for this step include:
— Send ‘true’ data to step: Which step receives those rows which pass the
condition.
— Send ‘false’ data to step: Which step receives those rows which fail the
condition.
Sort Rows
● Sort rows based upon specified fields, including sub sorts, in ascending
or descending order.
● The options provided for this step include:
— A list of fields and whether they should be sorted ascending or not.
-
8/17/2019 Pd i 2000 Lectures
157/623
A list of fields and whether they should be sorted ascending or not.
— Sort directory: This is the directory in which the temporary files are
stored when needed. The default is the standard tem orar director
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 157
for system.
— Sort size: The more rows you can store in memory, the faster thesort. Eliminating need for temp files reduced costly disk I/O.
— The TMP-file prefix: Choose a recognizable prefix to identify the
files when they show up in the temp directory.
Join Rows
● Produces combinations of all rows on the input streams.
The options provided by PDI on this feature include:
— Step name: Name of the step; name has to be unique.
— Main step to read from: Specifies the step to read the most data
-
8/17/2019 Pd i 2000 Lectures
158/623
Ma step to ead o : Spec es t e step to ead t e ost data
from. This step is not cached or spooled to disk, the others are.
— The condition: User can enter a com lex condition to limit the
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 158
number of output rows. If empty, the result is a cartesian product.
— Temp directory: Specify the name of the directory where the systemstores temporary files.
— Temporary file prefix: This is the prefix of the temporary files that
will be generated.
— Max. cache size: The number of rows to cache before the systems
reads data from temporary files.
Merge Join
● The ‘Merge Join’ step performs a classic merge join between data sets
with data coming from two different input steps. Join options include
INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.
The options provided by PDI on this feature include:
-
8/17/2019 Pd i 2000 Lectures
159/623
p p y
— Step name: Name of the step; name has to be unique.
—
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 159
.
— Second Step: Specify the second input step to the merge join.
— Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER
— Keys for 1st step: Specify the key fields on which the incoming data
is sorted.
— Keys for 2nd step: Specify the key fields on which the incoming datais sorted.
Sorted Merge
● The ‘Sorted Merge’ step merges rows coming from multiple input steps
providing these rows are sorted themselves on the given key fields.
The options provided by PDI on this feature include:
— Fields: Specify the fieldname and sort direction
-
8/17/2019 Pd i 2000 Lectures
160/623
p y
(ascending/descending).
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 160
Merge Rows
● Compares and merges two streams of data
— Reference Stream— Compare Stream
● Mostly used to identify deltas in source data when no timestamp is
-
8/17/2019 Pd i 2000 Lectures
161/623
available— Reference Stream = The previously loaded data
• © 2011, Pentaho. All Rights Reserved. www.pentaho.com. • US and Worldwide: +1 (866) 660-7555 | Slide 161
—
● Usage note: Ensure streams are sorted by comparison key fields
● The output row is marked as follows:
— “identical”: The key was found in both streams and the values tocompare were identical.
— “changed”: The key was found in both streams but one or morevalues is different.
— “new”: The key was not found in the reference stream.— “deleted”: The key was not found in the compare stream.
Unique Rows● Removes duplicates from the input stream.
● Usage Note: Only consecutive records will be compared for duplicates, thus thestream must be sorted by comparison fields.
● The options provided for this step include:
-
8/17/2019 Pd i 2000 Lectures
162/623
�