ppdm data loading best practice

Post on 15-Dec-2014

2.241 Views

Category:

Technology

6 Downloads

Preview:

Click to see full reader

DESCRIPTION

A review of data loading best practice, based on PPDM data from the Oil & Gas industry.

TRANSCRIPT

www.etlsolutions.com

Data LoadingBest PracticePPDM Association 2012 Data Symposium

Agenda

Data loading challenges

Best practices in tools and

methodology

We’ll be taking a look at loading data into PPDM, but much of this applies to generic data loading too

Agenda

Data loading challenges

Best practices in tools and

methodology

We’ve been listening to the Data Manager’s perspective

o PPDM Conference, Houston

o PNEC, Houstono Data Managers’

challenges:– Education– Certification– Preserving

knowledge– Process

Data management is difficult and important

Different data movement scenarios

Data

Migration

LoadingIntegration

But all require mapping rules for best practice

The business view of data migration can be an issue

• Often started at the end of a programme• Seen as a business issue (moving the filing

cabinet), not technical• However, the documents in the filing cabinet

need to be read, understood, translated to the new system; obsolete files need to be discarded

Different data migration methodologies are available

PDM (Practical Data Migration)

• Johny Morris• Training course• PDM certification• Abstract• V2 due soon

Providers• Most companies

providing data migration services/products have a methodology

• Ours is PDM-like, but more concrete and less abstract

Agenda

Data loading challenges

Best practices in tools and

methodology

Methodology

Core migration

As an example, our methodology

Landscape analysis

Data discovery

Data modelling

Data assurance

Data review

Data cleansing

Requirements analysis

Project scoping

Configuration

Migration design

Testing design

Migration development

Testing development

Execution

Review

Legacy decommissioning

Firstly, review the legacy landscape

Legacy

SAP

ApplicationReport

Archive

Access DBExcel

VBA

Satellites

Eradicate failure points

Beware the virtual waterfall process

Signoff

MigrateAgile DevelopmentRequirements

Agenda

Data loading challenges

Best practices in tools and

methodology

Rules

• In data migration, integration or loading, one area of commonality is the link between source and target

• This requires design, definition, testing, implementation and documentation

• The aim is automated loading of external data into a common store

• This requires best practice

Rules are required

PPDM

Best practice: A single version of truth

PPDM 3.8

• So for each of these data loaders we want a single version of truth

• Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work

• We want to remove boiler plate components that are only indirectly related to the business rules by which data is loaded

• Let’s look at what goes into a data loader and where the duplication and unnecessary work comes from...

The PPDM physical model

• PPDM comes to us as a physical projection, rather than a logical model – maps directly to a relational database

• Access therefore via SQL, PL/SQL; low level detail is important i.e. how relationships are implemented (e.g. well header to borehole)

• Considerations to access: primary keys, foreign keys, data types – conversions, maximum lengths. Load order required by FKs – PPDM Load of the rings, relationships – cardinality etc

• SQL: only know at runtime, so turnaround can be slow

• All of this metadata is available in machine readable format, so we should use it

External data sources

• Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, database

• The data model is unlikely to be as rich as PPDM, but there is some definition of the content e.g. Excel workbooks have a tabular layout with column titles, worksheets are named

• It can be hard to find people with the relevant skills - you sometimes see ad hoc, non-standard implementations because the developer used whatever skills he/she had: perl, python, xslt, sql

• So the next clue is that we should use the model information: what elements, attributes and relationships are defined, rather than details of how we access it

• Abstract out the data access layer; don’t mix data access with the business rules required to move them into PPDM

PPDM 3.8

Challenges with domain expert mapping rules• A common step for defining how a data source is to be loaded is for a domain expert to

write it up in Excel• Not concerned with data access, but some details will creep in, e.g. specifying an xpath• When lookups, merging/splitting values, string manipulation, conditional logic appear,

the description can become ambiguous• Also note the duplication: the model metadata is being written in the spreadsheet; if the

model changes, the spreadsheet needs to be manually updated

Challenges with developer mapping rules• The example here probably wouldn’t pass a code inspection, but it does illustrate the

type of issues that can arise• Firstly, duplication: this is reiterating the Excel rules – they need to match up, but while

a domain expert might follow the simple example previously, low level code can be tricky to discuss

• Secondly, metadata is again duplicated: the names of the tables and columns appear in the SQL statements, the max length of the name column is checked

• Thirdly, boiler plate code: select/update/insert conditional logic• Fourthly, data access code appears in the rules• Finally, the code becomes hard to maintain as the developer moves on to other roles

Documentation of mapping rules

Word document for sign-off

Data Management record How data was loaded Stored in your MDM

data storeCan be queried

PPDM mapping tables

• Here is where you do require some duplication

• Tests are stories:• Define what the system

should do• If it does, the system is good

enough if the tests are complete

• If we use a single version of truth to generate tests, the tests will duplicate errors, not find them

Test artifacts

Agenda

Data loading challenges

Best practices in tools and

methodology

Tools

Use tools

• Use available metadata• Abstract out data access layer• Higher level DSL for the mapping

rules:• Increase team communication

– developer/business• Reduce boiler plate code

• One definition:• Replace Excel and code• Generate documentation

An example of a graphical tool: Altova MapForce

• Tools such as Talend, Mule DataMapper and Altova MapForce take a predominantly graphical approach

• The metadata loaded on the left and right (source/target) with connecting lines

• In addition to the logic gates for more complex processing, code snippets can be added to implement most business logic

• Issues:• Is it really very easy to read? The

example here is a simple mapping; imagine PPDM well log curve, reference data tables etc

• It isn’t easy to see what really happens: a+b versus an “adder” – e.g. follow the equal() to Customers – what does that actually do?

• But: can generate documentation and executable from that single definitive mapping definition

• Typing errors etc are mostly eliminated

ETL Solutions’ Transformation Manager

• An alternative is to use a textual DSL: again the metadata has been loaded

• No data access code• Metadata is used extensively: for example

warnings, primary key for identification; relationships

• Typing errors are checked at designtime, and model or element changes affecting the code are quickly detected e.g. PPDM 3.8 to 3.9

• Rels used to link transforms: a more logical view with no need to understand underlying constraints; complexity of the model doesn’t matter, as the project becomes structured naturally

• FK constraints used to determine load order• Metadata pulled in directly from the source

e.g. PPDM, making use of all the hard work put in by the PPDM Association

Generated documentation

One of the many questions a data manager has about the data he/she manages:

Keeping the PPDM data manager happy

PPDM 3.8

Data lineage: How did this data get here?

PPDM provides tables to record data lineage

Transformation Manager can generate documentation for the PPDM metadata module

Agenda

Data loading challenges

Best practices in tools and

methodology

Project management

Key points

• Be aware• Look at data migration

methodologies• Select appropriate

components

• Look for and remove large risky steps

• Start early• Ensure correct

resources will be available

• No nasty budget surprises

• Use tools• Build a happy virtual team

• Did you know about these tables?

• Who uses them? • How do you use them? • What features would be truly

useful in a data loader tool?

Questions

www.etlsolutions.com

Raising data management

standards

www.etlsolutions.com

Contact us for more information:

Karl Glenn, Business Development Directorkg@etlsolutions.com+44 (0) 1912 894040

Read more on our website:

http://www.etlsolutions.com/what-we-do/oil-and-gas/

Images from Free Digital Photos freedigitalphotos.net

top related