introduction to metadata management by bob panic
Data about Data – The complexities of Metadata
…a Whitepaper by Bob Panic, Solutions Architect – Business Applications, IT
Infrastructure, Corporate Data Systems, Cloud Solutions, Security
Architecture…
Web: www.bobpanic.com e: [email protected] m: +61(0) 424 102 603
Metadata Context - introduction by Bob Panic
The babble around metadata is now growing like a howling wind in a small, locked room and has
in particular been a topic of discussion in the media due to the fact that governments are
introducing data retention laws and “metadata legislation”.
Metadata has become a topic of conversation without anyone really understanding what
metadata is and what it does (or does not) cover. This whitepaper is quite complex and covers
a considerable spectrum of metadata topics. I have tried to compile a whitepaper
that is broad and, I hope, easier for the reader to understand.
At the end of the day discussions around data and metadata are technical in nature and you
cannot escape this in the context of this whitepaper. I always find when I present this topic to
groups in a workshop format, the message is a little better understood as the human element
comes into play and active feedback is encouraged.
Digital DNA – the data of metadata
I find that when discussing technical matters, a human interpretation is best. The best way for
me to explain metadata so that it can be easily understood, rightly or wrongly, is in the context of
DNA.
DNA is a highly complex topic, yet most people have a good understanding of what DNA does
and how it relates to life and us as individuals, humans, animals, carbon based lifeforms if you will.
Metadata can be thought of as “Digital DNA”. It is the best way that I can relate it to a context
that could be commonly understood. I find even the summary “Data about Data” to be a
complex phrase for the average person to understand, and I hope that Digital DNA might be a
better context.
And really, thinking about the “collection of metadata”, as most governments refer to the
collection and storage of varied data sources, I think Digital DNA is an apt description for
Metadata.
What is Data?
“…any product of a digital technology can be referred to as data…”
The above is my personal statement. I think some will find fault with it, and I am sure most can summarise
the concept better, but put in the simplest way that I can, I think the above statement is pretty
spot on.
Metadata – or Digital DNA – is the detailed “machine code” embedded within that digital
product, data, and can contain information on the who, what, where, why and how that
particular data was “born” and its particular travels through “life”.
…Metadata is as detailed, and as rich, as the digital system that produced the data was
designed to make it…
You may need to read the above statement a few times to understand what Metadata can “do”
(or more accurately contain) and what it can “not do”.
Examples Please! …The Evolution of the digital Photo…
1. Imagine it is the early 2000s, not that long ago, but an eon in the life of a digital product. You
have just unwrapped your new digital camera, the super awesome (and totally imagined)
1 megapixel ZoomTastic 2000, the very first digital camera of its kind! You look at
your $2,000 investment and you know that this is the future; you may never need another
camera ever again! Film is buried forever, the price of silver drops to an all-time low…
You place your massive 1 MB memory card into the camera, it’s the new “film”
don’t you know, and you go to the most beautiful rose in your garden. You take what will be
the first of many, many digital photos.
This digital photo is a product of your digital camera = DATA!
So now you take your digital photo back home to your computer and load it into a special bit
of software, let’s call it “Photo House”, this software allows you to view the most beautiful photo
of a rose ever taken in a matter of seconds. No time waiting on processing or development of
film.
This is the start of the instant photography revolution… and you are proud as punch…
But as you look at the photo on your computer screen, displayed by the help of your special
“Photo House” editing software, you see bits of additional information you have never seen
before. It’s not displayed in the actual photo but “hidden”. It contains the time, the date, the
f-stop, the ISO (sensitivity of the “film”) and the speed of the shutter: all little numbers and details
that most people would not understand, but you can understand because you are a
professional and this critical data enhances your understanding of the digital photograph
taken by this fancy new digital camera. This is = METADATA!
2. …and it’s now 2015. You are still a great professional photographer, you love to photograph
everything. Your tool of choice is the fantastically imagined “20 Megapixel Pineapple
YouSmartPhone 2015 ™”, fresh off the factory floor from East Bolivia.
So what has changed in the 15 years since you had your first digital camera?
a. This has been the 15th such digital camera…
b. The picture is 20x bigger in size (bigger DATA file)
c. And the information stored within the file is just stunning: GPS co-ordinates, time, date,
names and addresses, color modeling information, various topographical and location
based information, altitude, even the heartbeat of the photographer taken in the
moment the shutter was pressed, all stored – more METADATA
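For the technically minded, the two cameras above can be sketched in a few lines of Python. This is only an illustration: the field names and values are made up for the example (they are not real EXIF tag names), and the "pixels" are a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalPhoto:
    """A photo file: the pixels are the DATA, everything else is METADATA."""
    pixels: bytes                                 # the image itself (data)
    metadata: dict = field(default_factory=dict)  # data about the data


# Hypothetical photo from the early-2000s camera: basic exposure details only.
photo_2000 = DigitalPhoto(
    pixels=b"...",
    metadata={"date": "2000-03-01", "time": "10:15", "f_stop": 2.8,
              "iso": 100, "shutter_speed": "1/250"},
)

# Hypothetical photo from the 2015 smartphone: the same details plus much more.
photo_2015 = DigitalPhoto(
    pixels=b"...",
    metadata={**photo_2000.metadata, "date": "2015-03-01",
              "gps": (-37.81, 144.96), "altitude_m": 31,
              "owner_name": "A. Photographer", "heart_rate_bpm": 72},
)

# Fields the newer device records that the older one was never designed to.
new_fields = sorted(set(photo_2015.metadata) - set(photo_2000.metadata))
print(new_fields)
```

The photo (data) looks much the same in both cases; what has grown is the dictionary travelling with it.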
Privacy vs. Security
The term metadata, like the now overused term “The Cloud”, is becoming a generally
overused word, a vernacular to describe something complex without anyone really being able
to understand what it means. This is the trap when technical terms become common topics of
discussion in cafés and general conversation. Fear and mistrust set in when complex
technologies, poorly understood by general society, are bandied about by politicians who wrap
superlatives like “national security and legislation” into the same sentence as metadata.
Big Data or Big Brother?
Society fears a government that intrudes on our private lives, and when governments legislate the
capture and storage of our private information we wonder, what will they find?
The Government has my data…so what?
So what “secrets” would a government know about “me” from my metadata you might ask?
The basic answer: quite a bit. In fact, a considerable map of your life would emerge and could be
plotted out from your “metadata”. But before you cry out in anger and disgust, remember that
anyone can know a lot about you just from your personal social media profile. So why the fuss?
The fear is that armed with a highly detailed spectrum of information about an individual,
governments would deal with the social sea that you inhabit - the social and private data you
create (or that is created about you) - in a manner to prevent freedoms and democratic rights.
Possible? Yes. Probable? Unlikely.
Historically, data theft, identity theft or misrepresentation has been the domain of the criminal sector, a
dark world where your stolen credit card details are used for nefarious means. The most likely
scenario, however, is that a prepubescent tween, hidden in the basement of his parents’ home in
some remote and dull outer suburban wasteland, bored and with little social life, has hacked a
poorly secured gaming or smart TV network and stolen parts of your private life, happily charging
soft porn Hustler subscriptions to your Myer One card without your knowledge or approval.
And this is where the average person gets scared. If the Russian mafia (or more likely a 13 year old)
can hack into my private details and run rampant, what would, or could, the government do?
As I write this, the debate on metadata collection is still heating up in Australia and the proposed
legislation has already got 30-odd amendments to it. One such amendment is that a journalist’s
data is exempt from metadata retention laws.
Sounds great and noble, but how do we distinguish “my” metadata from a journalist’s metadata?
In fact who can claim to be a journalist in 2015?
I am a blogger, am I not considered a creator of copy and journalistic prose?
What about hate speech bloggers? They too could be considered journalists. Oh, sorry we are
talking about “professional/paid” journalists! Ok, that discounts the funds hate speech bloggers
get from the Ku Klux Klan to support hate speech web sites and online newspapers…and by the
way…fundamentalist groups don’t publish articles do they? Nor blog?
Let’s get real. At a given point in time, in the collection and storage of metadata, you can
determine neither the origin nor the producer of the original data source. Have a long think about
that last sentence.
So what that means is that significant smarts, either human or computational algorithms, would
need to be applied before data context is determined: who created it, who is the intended
recipient, what was the intended purpose…etc.
Before metadata can be interpreted it needs to be gathered en masse and stored en masse, and to the
eyes of the average beholder it will look like a mess!
But even before all this, all this collection and storage of “Big Data”, something fundamental
needs to be clearly understood about all this metadata:
As mentioned earlier, metadata is as good, and as detailed, as the “system” that has been
designed to create the data in the first place. A bit hard to understand I know, so back to my
digital camera example:
In 2000 the digital camera, as a direct limitation of the technology at that specific given time in
history, and from the specific manufacturer and even as a result of budgetary restrictions of the
specific model of digital camera, was designed to provide a limited amount of information as part
of its core function. If you were to grab a digital photo file from the year 2000 and you were to
look at its metadata, you might not be able to find specific information on location, time, or who
took the digital photo. The technology of the time just could not create that much detail.
In 2015, conversely, with the demands of consumers and violent competition between brands and
the vast leaps of technology, digital cameras have morphed into smart phones and these devices
create a myriad of metadata for each and every digital photo taken. More digital photos are
taken by iPhones than any other digital photographic device or system, and the metadata
contained within each photo is mind boggling.
So what about 2020? This is an interesting flight of fancy for future gazing. We see glimpses of
this 2020 future today in both social media sites like Facebook and our smartphones. One such
technology/feature: facial recognition!
Try the following experiment. Grab a smart phone that is not more than 12 months old and take a
photo of a person standing in front of a busy street. The smartphone will instantly detect the face,
automatically focus on the face (separating it from other objects in the background/foreground)
and the result: you take a nice sharp photo of your loved one with your snazzy smart phone. Very
smart indeed.
I am hoping you are now seeing where I am going with this line of thinking. So you now look to
post your lovely photo of your loved one on your Facebook or social media account. Can you
see what happens? Facebook (or any other social media site for that matter) will automatically
detect that the uploaded photo contains a face/person and it asks you if you want to
create a tag, and name the loved one or individual in your photo. And as the dutiful slave to
technology you are, you freely oblige by identifying the person by name.
This is an example of how rich metadata gets created. So far no evil government intervention is
required. The manufacturers create digital products that create digital outputs, and metadata is
generated. As the human interface to this created data, we then add context and a value that
goes beyond a simple, private photo taken for our personal enjoyment.
So by 2020 we would have created so much additional information about every single
photograph ever taken, that our smartphones will automatically identify and tag specific
individuals, no need for any human intervention. This could then be referred to as Super Rich Data.
By 2025 companies and governments could collect so much super rich “big data” that we could
potentially take a photo in the busy intersection of Times Square in New York, and each and every
face within that photo could be recognised and tagged, automatically. That is not big brother,
or evil governments spying on our every move (a nod to “Person of Interest” TV show here), this is
society generated rich metadata - just what the government needs to fight fundamentalist
insurgencies, hate speech and terrorism…
But Wait there is more…!
Metadata is not just the domain of digital photography. There is the internet, online advertising,
Facebook and Social Media Metadata, internet browsing history, social tagging, email trails and
communications, mobile phone tracking and GPS enabled devices, Internet of Things, data
storage, cloud technologies, service oriented software, public data sharing and the endlessly
growing list of software and digital hardware creating digital DNA, a sea of metadata
breadcrumbs scattered throughout the virtual world that lead straight back to your front door.
Now on to the technical complexities of metadata. The below has been compiled and edited to
provide a detailed glimpse into the complex nature of data and machine metadata.
Metadata Management
Metadata Management Life Cycle
The Metadata Management Life Cycle defines the various phases associated with the end-to-end
metadata management process, starting from planning, through maintenance, to retirement of
metadata.
Governance and Planning
Governance and Planning involves initial planning, defining the objectives for metadata
management process, identification of owners and associated roles and responsibilities for each
of the stakeholders.
The ability to ingest and explore any data, including structured, semi-structured and unstructured
data, is critical to getting the most out of corporate data and data warehouses. Given this usage,
it is challenging to enforce a strict control and governance regime on the data being ingested
into the Data Warehouse environments, and hence Governance of Metadata is of relatively lesser
significance in this context.
Metadata Content
Metadata content defines the types of metadata that need to be captured as part of the
metadata management process.
Metadata Capture Strategy
Metadata capture strategy defines the process and/or tools that need to be used for capturing
the required metadata. Strategy for metadata capture can include multiple tools/approaches
based on the type of data and feasibility constraints. The strategy outlines the guidelines for using
an appropriate tool or mechanism for identified use cases.
Type of Metadata | Definition / Description
Business Metadata | Business Metadata defines the data in the Warehouse in user friendly terms. Business Metadata captures ‘what’ data is stored in the Warehouse, ‘where’ the data is sourced from, ‘how’ the data is used and its relationship to other data in the Warehouse.
Technical Metadata | Technical Metadata defines the data, objects and processes in the Warehouse from a technical point of view. Technical Metadata captures system metadata – such as tables, data elements, indices, partitions in a relational database, files stored in the cluster, security classification for the data elements etc.
Operational Metadata | Operational Metadata (or sometimes also referred to as the Process Metadata) is the data about the processes in the Warehouse. Operational Metadata captures process schedules, frequency of batch processes, status summary and usage statistics for various processes etc.
Business Rules & Transformation Rules | Business Rules and Transformation Rules related metadata capture the rules applied on data elements during the data acquisition, data ingestion or data extraction and loading processes in the Data Warehouse. In some cases, this metadata can also be used to dynamically process and load the source data feeds into the Data Warehouse.
System Statistics | System Statistics related metadata captures data related to system resource utilisation for proactive monitoring and maintenance within a Data Warehouse environment.
Metadata for Downstream Process | Metadata for downstream processes captures the ‘Technical Metadata’ including mapping of data elements from the Warehouse to downstream processes or applications such as BI tools, analytical models or any other downstream applications.
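The table above notes that Transformation Rules metadata can, in some cases, be used to dynamically process and load a source feed. A rough Python sketch of that idea follows. The rule layout, the column names and the three transforms are all invented for illustration; a real Warehouse would hold these rules in its metadata repository.

```python
# Transforms the loader knows how to apply (an assumed, tiny catalogue).
TRANSFORMS = {"upper": str.upper, "strip": str.strip, "to_int": int}

# Hypothetical Transformation Rules metadata: one rule per target column.
RULES = [
    {"target": "customer_name", "source": "NAME", "transform": "strip"},
    {"target": "country_code", "source": "CTRY", "transform": "upper"},
    {"target": "order_count", "source": "ORDERS", "transform": "to_int"},
]


def load_row(source_row, rules):
    """Build one target row purely from the rule metadata."""
    return {
        rule["target"]: TRANSFORMS[rule["transform"]](source_row[rule["source"]])
        for rule in rules
    }


target_row = load_row({"NAME": "  Ada ", "CTRY": "au", "ORDERS": "3"}, RULES)
print(target_row)
```

The point of the design is that adding or changing a column mapping means editing metadata (the RULES list), not the loading code.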
Metadata Model and Integration
Metadata Modelling defines the data modelling strategy for the metadata repository. Metadata
Integration defines the approach for integration of various types of metadata including
integration from various metadata repositories, if applicable.
Metadata Visibility
Metadata Visibility defines the processes associated with enabling access to the metadata
elements, types of analyses and use-cases for usage of metadata by end-users.
Metadata Standards and Quality
Metadata Standards and Quality have been of relatively lesser significance in the past compared
to the other phases in the context of Data Warehouse planning and commissioning. Metadata is
created once and is occasionally used by a limited set of users. Hence Organisations typically do
not invest in tracking or enhancing the quality of metadata captured, either through an
automated process or through a manual process. However, as the gathering and use of metadata
grows and is governed by state and federal laws, standards will need to be strengthened and full
auditability will need to be proved to ensure the “sanctity” of core metadata repositories.
Maintenance and Retirement
Maintenance and Retirement define the following aspects associated with metadata
management processes:
• Purging and archival of obsolete metadata (Operational Metadata for example)
• Restructuring and enhancements to the Metadata Model
• Processes and Governance for ensuring accuracy and timeliness of the metadata
  captured with on-going changes and project releases
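As a small illustration of the purging and archival point, the sketch below splits Operational Metadata records into "keep" and "archive" by age. The 90 day retention period and the `captured_at` field name are assumptions made for the example, not recommendations.

```python
from datetime import datetime, timedelta


def split_for_archival(records, now, retention_days=90):
    """Return (keep, archive): records newer or older than the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    keep = [r for r in records if r["captured_at"] >= cutoff]
    archive = [r for r in records if r["captured_at"] < cutoff]
    return keep, archive


# Hypothetical operational metadata records.
records = [
    {"id": 1, "captured_at": datetime(2015, 5, 20)},
    {"id": 2, "captured_at": datetime(2015, 1, 1)},
]
keep, archive = split_for_archival(records, now=datetime(2015, 6, 1))
print(len(keep), "kept,", len(archive), "to archive")
```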
Metadata Content - Detail
This section details the list of recommended metadata data elements that need to be captured
for various types of Metadata as part of the Metadata Management strategy for the environment.
Business Metadata
Following are the recommended data elements that need to be captured for
the Business Metadata. The Conceptual Model and Logical Model information is also stored in the
Business Metadata for ease of use and to support impact analysis for any business
changes.
Metadata Data Elements | Level
Source Feed Business Name | Source Feed
Source Feed Business Description | Source Feed
Source Feed Usage | Source Feed
Source Feed Group Name | Source Feed
External Data Source Indicator | Source Feed
Source Host Code Name | Source Feed
Source Feed Business Owner / Contact | Source Feed
Source Feed Technical Contact | Source Feed
Source Column Business Name | Source Column
Source Column Business Description | Source Column
Target File Business Name | Target File
Target File Business Description | Target File
Target File Usage | Target File
Subject Area | Target File
Data Security Classification | Target File
Target Column Business Name | Target Column
Target Column Business Description | Target Column
Target Column Synonym(s) | Target Column
Technical Metadata
Following are the recommended Technical Metadata data elements that need to be captured
for the ODS, Data Warehouse, Data Marts and Source Systems. These should be captured for all sources,
targets and extracts provided.
Level | Metadata Data Elements
Source Feed | Source Feed Name
Source Feed | Source Database Name
Source Feed | Source Table Technical Name
Source Feed | Source Data File Name
Source Feed | Source Feed Group Name
Source Feed | Source Host Type
Source Feed | Source System Code Name
Source Feed | Source Feed Format Type
Source Feed | Source File Layout Definition (XSD / JSON etc.)
Source Feed | Source Trigger File Name
Source Feed | Source Trigger File Type and Format
Source Feed | Source Encryption Method
Source Feed | Source Feed Profile Path
Source Feed | Source Feed Delivery Frequency
Source Feed | Exception Days for the Source Feed
Source Feed | Expected Delivery Time of the Source Feed
Source Feed | Expected Number of Records
Source Feed | Number of Columns (Source Feed)
Source Column | Source Column Technical Name
Source Column | Source Column Data Format
Source Column | Source Column Data Type
Source Column | Source Column Data Length
Source Column | Required / Optional (NULL) Indicator
Target File | Target File Name
Target File | Target File Format Type
Target File | Target File Layout Definition (XSD / JSON etc.)
Target File | HDFS Location (Directory Path)
Target File | Target Data Security (ARD Role)
Data Source | Ingestion Method / Extraction Method
Target File | Archive Location
Target File | Target Encryption Method
Target Object | Target Resource Size
Target File / Table | Update Frequency
Target File / Table | Update Type
Target Column | Target Column Technical Name
Target Column | Target Column Data Format
Target Column | Target Column Data Type
Target Column | Target Column Data Length
Target Column | Expression / Transformation (Source – Target)
Column | Column Delimiter Used
Column | System of Record / System of Reference
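To show why the Source Column elements above (technical name, data type, data length, required indicator) are worth capturing, here is a minimal Python sketch that uses them to check an incoming row. The class, the two supported data types and the sample columns are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SourceColumn:
    """A subset of the Source Column technical metadata listed above."""
    technical_name: str
    data_type: str      # "string" or "integer" in this sketch
    data_length: int
    required: bool


def validate_row(row, columns):
    """Return a list of human-readable problems found in one source row."""
    problems = []
    for col in columns:
        value = row.get(col.technical_name)
        if value is None or value == "":
            if col.required:
                problems.append(f"{col.technical_name}: required value missing")
            continue
        text = str(value)
        if col.data_type == "integer" and not text.lstrip("-").isdigit():
            problems.append(f"{col.technical_name}: expected integer")
        if len(text) > col.data_length:
            problems.append(f"{col.technical_name}: longer than {col.data_length}")
    return problems


# Hypothetical column definitions for a customer feed.
columns = [
    SourceColumn("cust_id", "integer", 6, True),
    SourceColumn("name", "string", 5, False),
]
print(validate_row({"cust_id": "12a", "name": "Alexander"}, columns))
```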
Operational Metadata
Following are the data elements recommended to be captured as part of the Operational
Metadata. The Operational Metadata captured does not vary based on the source system or the
type of the source data.
Operational Metadata data elements can be classified into 2 broad categories, Data
Movement and Data Usage, for each of the source data types.
Following are the recommended Operational Metadata data elements that need to be
captured.
Metadata Data Elements Structured Unstructured
Data Movement Metadata
Source Feed Delivery Time SLA
Source Feed Delivery Time (Actual)
Source Feed Exception Indicator
Source Feed Exception Details
Number of Records Received
Expected Number of Columns
Actual Number of Columns Received
Data Load Rule Name
Data Load Rule Threshold Type
Data Load Rule Failure Value
Data Load Rule Last Failure Date and Time
Business Date
Last Data Load Date and Time
Data As of Date
Job Name
Job Description
Job Location
Job Type (Batch / Real-Time etc.)
Job Execution Frequency
Job Execution Start Time
Job Execution End Time
Job Status
Job Completion Time SLA
Job Execution Exception Indicator
Job Execution Exception Type
Job Execution Exception Details
Number of Success Records
Number of Exception Records
Number of Rejected Records
Data Usage Metadata
Access Count
Last Access Date and Time
Last Access User / Process
Number of Queries / Extractions
Last Extraction Date and Time
Output Protocol (FTP, Tumbleweed etc.)
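A few of the Data Movement elements above can be pictured as one record per job run. The Python sketch below is only an illustration: the class is hypothetical, and treating the total as the sum of success, exception and rejected records is my assumption, not something the list above states.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class JobRun:
    """One job execution, using a handful of the elements listed above."""
    job_name: str
    start_time: datetime
    end_time: datetime
    completion_sla: datetime      # Job Completion Time SLA
    success_records: int
    exception_records: int
    rejected_records: int

    @property
    def total_records(self):
        # Assumption: every record ends up in exactly one of the three counts.
        return self.success_records + self.exception_records + self.rejected_records

    @property
    def duration_minutes(self):
        return (self.end_time - self.start_time).total_seconds() / 60

    @property
    def sla_met(self):
        return self.end_time <= self.completion_sla


run = JobRun("load_customers",
             start_time=datetime(2015, 3, 1, 2, 0),
             end_time=datetime(2015, 3, 1, 2, 45),
             completion_sla=datetime(2015, 3, 1, 3, 0),
             success_records=980, exception_records=15, rejected_records=5)
print(run.job_name, run.total_records, run.duration_minutes, run.sla_met)
```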
Business Rules and Transformation Rules
Following are the recommended Business Rules and Transformation Rules related Metadata data
elements that need to be captured.
Metadata Data Elements File Level Column Level
Rule Name
Rule Type
Rule Level Name
Rule Threshold Type
Alert Threshold Value
Abort Threshold Value
Rule Default Value
Trigger Field Name
Rule Filter Condition
Rule Parameter Name
Rule Parameter Value
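The Alert and Abort threshold elements above suggest a two-step check: warn when a rule fails "a little", stop the load when it fails "a lot". A minimal Python sketch follows. The threshold types ("percent" and "count"), the return values and the sample rule are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class LoadRule:
    rule_name: str
    threshold_type: str     # "percent" or "count" (assumed values)
    alert_threshold: float  # Alert Threshold Value
    abort_threshold: float  # Abort Threshold Value


def evaluate(rule, failed, total):
    """Return "OK", "ALERT" or "ABORT" for a number of failed records."""
    if rule.threshold_type == "count":
        measured = failed
    else:
        # Percentage of failed records; an empty feed counts as 0% failed.
        measured = 100.0 * failed / total if total else 0.0
    if measured >= rule.abort_threshold:
        return "ABORT"
    if measured >= rule.alert_threshold:
        return "ALERT"
    return "OK"


# Hypothetical rule: alert at 1% bad records, abort the load at 5%.
rule = LoadRule("customer_id_not_null", "percent", 1.0, 5.0)
print(evaluate(rule, failed=20, total=1000))
```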
System Statistics
Following are the recommended System Statistics that need to be captured. The metadata data
elements listed are high level statistics, each of which can comprise one or more detailed statistics. The
detailed list of system statistics that can be captured depends on the Operating System,
monitoring tools used etc. The table below provides examples of detailed statistics for each
category.
Metadata Data Elements | Examples
CPU Utilisation | CPU Utilisation of System Processes, CPU Utilisation of Applications / Users, CPU Idle Time etc.
Memory Utilisation | Total Physical Memory, Memory used for Swap, Memory Used for Caching etc.
Storage Utilisation | Total Space Available, Utilised Space
I/O Utilisation | Number of Transfers per Second, Data Reads (kB/s), Data Writes (kB/s), I/O Wait Time, Reads per Second, Writes per Second etc.
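As one concrete example, the Storage Utilisation row can be captured with nothing more than the Python standard library. This is a sketch only: the dictionary keys are my own naming, and a real environment would more likely rely on its monitoring tools.

```python
import shutil
import time


def storage_utilisation(path="."):
    """Capture total and utilised space for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)   # named tuple: total, used, free (bytes)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "captured_at": time.time(),   # when the statistic was sampled
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "used_percent": round(percent, 1),
    }


stats = storage_utilisation(".")
print(stats)
```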
Metadata Capture Strategy
In the context of the Data Warehouse, Metadata is captured only in the production environment.
The approach or strategy for capturing the Metadata for the Warehouse can be broadly classified
into the following categories:
• Metadata capture for structured data
• Metadata capture for semi-structured / unstructured data sources
• Metadata capture for downstream processes from Warehouse
The following table summarises the metadata capture strategy by type of Metadata
Metadata Type | Options
Business Metadata | Sourced from Commercial BI Metadata Repository; Manual Capture
Technical Metadata | Sourced from Commercial BI Metadata Repository; Auto-Capture (from system tables / repositories); Manual Capture
Operational Metadata | Published to Metadata Repository; Auto-Capture (from Application Repositories)
Business Rules & Transformation Rules | Custom Manual Capture (through the portal)
System Statistics | Auto-Capture
Metadata for Downstream Processes | Manual Capture
Business Metadata
Business metadata provides the data definition for each of the data elements processed and
loaded into the Warehouse. The metadata management process should provide a mechanism
for manual capture of Business Metadata during the design phase.
Following are the general guidelines for capturing the Business Metadata:
• For structured data sourced
  o If the Business Metadata is available within the Source Metadata Repository, the
    required data elements should be sourced and loaded into the Data Warehouse
    Metadata Repository
  o If the Business Metadata is not available within the Source Metadata Repository,
    the data owner responsible for the movement of the data from Source to Data
    Warehouse should provide the business metadata. The metadata can be
    captured manually using a customised template used for the Metadata Management
    process.
    ▪ Data Stewards or Analysts responsible for capturing (creating) the business
      metadata should be able to upload the metadata through a self-service
      portal. This would enable authentication and authorisation for the users
      capturing or creating the metadata.
    ▪ Alternatively, Data Stewards or Analysts can be provided with a UI on the
      portal for creating the business metadata that cannot be sourced
      programmatically.
• For any other source data feeds and target objects (in all cases), business metadata
  should be captured using the manual capture process. When the data is captured
  through the manual process
  o Metadata certified, validated and released
The table below captures the details of metadata capture by layer for Business Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Access Layer | Design Phase | Manual Capture | Business Analysts
Data Storage Layer | Design Phase | Manual Capture | Business Analysts
Technical Metadata
Technical metadata captures the details of how, what and where the data elements are stored
within the Data Warehouse environments. Given the multitude of options for modelling and storing
the various types of data in a Data Warehouse, the Technical Metadata captured varies based
on the type of data being sourced or ingested into the Data environment.
The table below captures the details of metadata capture by layer for Technical Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Access Layer | Design Phase | Auto-Capture | Data Stewards
Data Access Layer | Design Phase | Manual Capture | Data Stewards
Data Landing Layer | Design Phase | Auto-Capture | Data Stewards
Data Integration Layer | Design Phase | Manual Capture | Data Stewards
Data Storage Layer | Design / Development Phase | Auto-Capture | Data Stewards
Data Storage Layer | Design Phase | Manual Capture | Data Stewards
Operational Metadata
Operational Metadata captures data from the auditing and logging for data acquisition, data
transformation and loading processes, BI usage data, details around data integration job and
report execution times etc.
The approach and guidelines for capturing the Operational Metadata depend on the type of
operational data being captured and can be broadly classified into the following categories:
• Operational Metadata for Data Movement
• Operational Metadata for BI and Analytics
The Metadata Management process implemented should capture the Operational Metadata for
data movement during the actual job execution. The metadata should be captured
programmatically without any manual intervention. Operational Metadata for Data Usage,
however, can be extracted on a periodic basis and can be scheduled.
Metadata Repository
• An Operational Metadata repository should be created for the Data Warehouse
• It is recommended to implement a metadata repository at least for Operational
  Metadata irrespective of the Data Modelling strategy adopted
• If an integrated Metadata Repository is implemented, the Operational Metadata can be
  part of the repository (subject area approach)
Guidelines
Following are the general guidelines for capturing Operational Metadata for Data Movement:
• A common approach is used for capturing Operational Metadata for structured,
  semi-structured and unstructured data
• Metadata capture should be event driven and required data elements should be
  published into the metadata repository as soon as the data movement process / cycle
  completes
• Data Ingestion, Data Extraction and the Data Load processes should have a mechanism
  to publish the required data elements into the Operational Metadata repository
  o The data elements may either be published using pre and post processing scripts for
    the batch processes
  o Alternatively, a control script can continuously monitor the batch process and
    publish the required data elements into the operational metadata repository
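The "pre and post processing" option above can be sketched in Python as a wrapper that records the start of a job, lets the job run, and then publishes the outcome as soon as the job finishes. This is an illustration under stated assumptions: the `job_run` table, its columns and the in-memory SQLite database standing in for the Operational Metadata repository are all hypothetical.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone


def create_repository():
    """Stand-in for the Operational Metadata repository (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_run (job_name TEXT, start_time TEXT, end_time TEXT,"
        " status TEXT, exception_details TEXT)"
    )
    return conn


@contextmanager
def tracked_job(conn, job_name):
    """Publish job metadata when the wrapped data movement step completes."""
    start = datetime.now(timezone.utc).isoformat()      # pre-processing step
    status, details = "SUCCESS", None
    try:
        yield
    except Exception as exc:
        status, details = "FAILED", str(exc)
        raise
    finally:
        end = datetime.now(timezone.utc).isoformat()    # post-processing step
        conn.execute("INSERT INTO job_run VALUES (?, ?, ?, ?, ?)",
                     (job_name, start, end, status, details))
        conn.commit()


repo = create_repository()
with tracked_job(repo, "load_customers"):
    pass                                  # the real data load would run here
try:
    with tracked_job(repo, "load_orders"):
        raise ValueError("bad file")      # simulate a failed load
except ValueError:
    pass

rows = repo.execute(
    "SELECT job_name, status, exception_details FROM job_run ORDER BY rowid"
).fetchall()
print(rows)
```

Because the publishing happens in the wrapper, the metadata is captured programmatically and no one has to remember to log a job by hand.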
Following are the general guidelines for capturing Operational Metadata for BI and Analytics:
• Operational Metadata for BI and analytics will be primarily sourced from the application
  repositories
• Metadata capture can be batch oriented, with ability to support intra-day batches
The table below captures the details of metadata capture by layer for Operational Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Integration Layer | Data Movement | Auto-Capture |
Data Storage Layer | Post Go-Live, on regular basis | Auto-Capture |
Business Rules & Transformation Rules
Business Rules and Transformation Rules applied to the data sourced into the Data environment
are always captured through a custom manual process. This section provides the general guidelines
for capturing the Business Rules and / or Transformation Rules based on the type of Data.
Structured Data
• Business Rules and Transformation Rules should be captured as separate rules
• Applicable Business Rules and Transformation Rules should be captured at both Source
  Table level as well as Source Column Level
• Linkage between the Business Rules and Transformation Rules should be established
  through the source object
• Multiple rules may be associated with a given Source Table or Source Column
• Rules may either be captured and stored in the metadata repository (database) or
  maintained as Excel files associated with the source object
Semi-Structured / Unstructured Data
• Business Rules and Transformation Rules should be captured as separate rules
• Rules should be captured at source feed level
• Multiple rules may be associated with a given source feed
• It is recommended to capture the rules using Excel files associated with the source objects
  o Business rules can be optional at field level
  o Transformation rules applicable to field level may be captured in the Excel files
Business Rules and Transformation Rules related metadata is dependent on the Technical
Metadata for the source data feeds or source data elements. In order to ensure data quality and
accuracy of the metadata, it is recommended to capture the business rules and transformation
rules metadata through a UI on the portal with following checks and balances
Source data feeds and data elements should be pre-populated from the Technical
Metadata available in the metadata repository
End-users should not be able to edit or modify the source data elements
UI can have basic validations to ensure mandatory metadata elements are captured
UI should also have a provision to allow users to upload a file with the rules either at source
data feed level or at source data element level
Users should be able to edit – update or delete any rules entered through the UI
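As a rough sketch of the checks listed above, the function below validates one rule entry before it is saved. The field names, the MANDATORY_FIELDS tuple and the sample element names are assumptions made for illustration; the whitepaper does not prescribe them. The set of known source elements stands in for the list pre-populated from the Technical Metadata, which end-users can select from but not edit.

```python
# Hypothetical mandatory metadata elements for a rule entered through the UI.
MANDATORY_FIELDS = ("source_element", "rule_type", "rule_text")


def validate_rule_submission(submission: dict, known_elements: set) -> list:
    """Return a list of validation errors; an empty list means the entry is accepted."""
    errors = []
    # Basic validation: every mandatory element must be present and non-blank.
    for name in MANDATORY_FIELDS:
        if not str(submission.get(name, "")).strip():
            errors.append(f"missing mandatory field: {name}")
    # The source element must come from the pre-populated Technical Metadata.
    element = submission.get("source_element")
    if element and element not in known_elements:
        errors.append(f"unknown source element: {element}")
    return errors
```

A submission that names an element outside the pre-populated list is rejected, which keeps the rules metadata consistent with the Technical Metadata it depends on.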
The table below captures the details of metadata capture by layer for Business Rules and
Transformation Rules related Metadata:
Layer | When | Metadata Capture Strategy | Responsible Party
Data Integration Layer | Design Phase | Manual Capture (Custom Process) | -
System Statistics
System Statistics for the Warehouse environment should be captured using automated capture
from the system logs or through the use of system monitoring tools and utilities.
Following are the general guidelines for capturing System Statistics
System statistics should always be captured using an automated process
Key utilisation statistics such as CPU or memory utilisation should be tracked continuously
Utilisation statistics for other resources such as storage may be captured on a periodic
basis
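One way to picture the split between continuous and periodic capture is a simple polling schedule. The sketch below is an assumption about how such a collector might be organised, not a description of any particular monitoring tool. The interval values and function names are hypothetical. Only the storage figure is actually read here, using Python's standard shutil.disk_usage call.

```python
import shutil

# Hypothetical sampling intervals in seconds: short for CPU and memory
# ("continuous" tracking), long for storage (periodic capture).
INTERVALS = {"cpu": 10, "memory": 10, "storage": 3600}


def due_metrics(tick_seconds: int, intervals: dict) -> list:
    """Return the metrics whose interval divides the current tick evenly."""
    return sorted(name for name, every in intervals.items() if tick_seconds % every == 0)


def storage_used_fraction(path: str = ".") -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

With these intervals, a collector ticking once per second samples CPU and memory every ten seconds and storage once an hour.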
The table below captures the details of metadata capture by layer for System Statistics:
Layer | When | Metadata Capture Strategy | Responsible Party
Data Landing Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
Data Integration Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
Data Storage Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
Metadata for Downstream Processes
Metadata for the downstream processes comprises business metadata for the target objects and
technical metadata for the target objects, including the lineage from the warehouse / Hadoop to the
downstream data repositories (data marts, Hive, HBase etc.), BI tools or analytical models. This
metadata is required to enable complete lineage analysis from the source systems to the target
applications.
Following are the general guidelines for capturing the metadata for downstream processes
Business Analysts or the data stewards responsible for moving the data from the Data
Warehouse to the downstream applications should be primarily responsible for capturing
the Business Metadata elements
Technical SMEs / technical point-of-contact for the downstream applications should be
primarily responsible for capturing the Technical Metadata including the lineage
metadata
Any business rules and transformation rules applied should be captured at both Entity and
Attribute level
The table below captures the details of metadata capture by layer for Downstream Processes:
Layer | When | Metadata Capture Strategy | Responsible Party
Data Storage Layer | Design Phase | Manual Capture | Business Analysts, Data Analysts, Data Stewards
Metadata Modelling and Integration
Metadata modelling defines the approach or data modelling strategy for the metadata
repository. This section describes various options for metadata modelling and provides a
comparative analysis between each of the options.
Metadata Refresh
Metadata Refresh defines the process and frequency for capturing and updating the metadata
on an on-going basis. The processes and frequency of Metadata refresh vary based on the type
of the Metadata and the environment for which Metadata is being captured and refreshed.
The table below provides a consolidated view of the Metadata refresh strategy for each of the
environments
Type of Metadata | Description
Business Metadata | Metadata is “created”. Initial metadata is captured during the Design Phase. Metadata needs to be updated whenever there is a change to the source data feed or target structures, enforced as part of the code release process.
Technical Metadata | Metadata is “created”. Metadata that needs to be captured manually is created during the Design Phase. Metadata captured using an automated process is initially created during the development phase and certified before code release. Metadata needs to be updated whenever there is a change to the source data feed or target structures, enforced as part of the code release process.
Operational Metadata | Data Movement related Operational Metadata is captured using an event-driven approach, but on an ad-hoc basis. Data Usage related Operational Metadata can be captured on an as-needed basis (optional).
Business Rules and Transformation Rules | Rules related Metadata should be “created”. Initial metadata should be created after the Technical Metadata is sourced into the repository. Metadata should be updated on a continuous basis, as and when there is a need for change, using the custom manual approach defined.
System Statistics | Captured using an automated process on an as-needed basis. Needs to be captured and maintained on a regular basis only if required (for example, for a usage-based charge-back mechanism).
Metadata for Downstream Processes / Applications | For any downstream applications designed, metadata should be “created” in the environment. Metadata should be captured during the Design Phase.
Metadata Visibility
Visibility or access to the Metadata captured for the Data Warehouse should be enabled only
through a standard intranet portal. The portal should provide the following functionalities
Provide a layer of abstraction for the metadata capture, integration and storage aspects
Ability to authenticate users accessing the portal
o It is assumed that there is no need for user authorisation (data security)
Ability to search on the metadata captured, using any of the use-cases identified
o Provide a layer of abstraction between the User Interface and the underlying data
elements on which the search operation is performed. For example – a basic
search on UI for table name could perform a search on table technical name,
table business name, table business description and the source data file name.
o Provide ability to perform advanced search using a combination of search criteria.
For example – search for a given table name within a subject area for a given
Market.
o Pagination of the search results for better readability
o Ability to sort the search results on predefined criteria including search relevance
(this use case may need further discussion and elaboration)
o Should provide ability to export the search results to Excel for offline analysis
Ability to establish data lineage for data entities and elements within the Data Warehouse
o Should support bi-directional lineage analysis
o Completeness and quality of data lineage information will be dependent on the
accuracy and completeness of the metadata captured – either through
automated process or through the manual capture process
Ability to generate and view standard operational reports
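The search behaviour described above, where one search box on the UI covers several underlying fields and the results are sorted and paginated, could be sketched as follows. The TableEntry fields follow the example given in the text (technical name, business name, business description, source data file name). The class, the function names and the sort order are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    technical_name: str
    business_name: str
    business_description: str
    source_file_name: str


# Fields a basic "table name" search looks at behind the scenes.
SEARCH_FIELDS = ("technical_name", "business_name", "business_description", "source_file_name")


def basic_search(entries: list, term: str) -> list:
    """Case-insensitive substring match across all search fields, sorted by technical name."""
    needle = term.casefold()
    hits = [
        entry for entry in entries
        if any(needle in getattr(entry, name).casefold() for name in SEARCH_FIELDS)
    ]
    return sorted(hits, key=lambda entry: entry.technical_name)


def paginate(results: list, page: int, page_size: int) -> list:
    """Return one page of results; pages are numbered from 1."""
    start = (page - 1) * page_size
    return results[start:start + page_size]
```

Sorting by technical name is only a placeholder. Sorting by search relevance, which the text flags for further discussion, would need a scoring function in place of the sort key.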
Following are the general guidelines with respect to the Metadata Visibility
End users of the metadata (data analysts, for example) should never be given direct
access to the metadata repository, whether database tables or the Excel files within the
Data Warehouse
Only system administrators and technical SMEs for the Data Warehouse may have direct
access to the metadata repository including the physical storage
Access to metadata environments should be enabled through separate user interfaces
– separate portals, sub-sites etc.
User Groups and Associated Usage
This section captures the details of the target user groups who would need access to the portal
and their associated usage of the portal, in each of the environments
Metadata Analysis & Usage
The Metadata Repository portal supports the following types of analysis and usage of the
metadata captured.
Lineage Analysis
Lineage analysis is one of the key requirements for the proposed Metadata Management solution.
The metadata captured should support the following types of lineage analysis
For structured data extracted from Source, the metadata in the metadata repository
should support bi-directional lineage analysis from the tables in Source / Warehouse
to the Data Warehouse or to any downstream applications fed from the Data
Warehouse
o The metadata should support lineage analysis at table and column level
o For each of the tables / Files from Source, the System of Record information for the
original source feed may be made available as additional information. However,
the lineage from the original source data feed to the Source Files/ tables will be
out of scope for lineage analysis
o The completeness of lineage metadata will be dependent on the process
implemented for capturing the metadata for downstream processes / applications
For semi-structured or unstructured data sources, the metadata captured should support
lineage analysis as follows
o Bi-directional lineage analysis at object level (web files, video files etc.)
o For data sources like IVR where each transaction can potentially contain an audio
file, lineage analysis should capture the linkage of audio files to the transaction and
the source feed
o For structural metadata captured as part of unstructured data sources, the
metadata should support lineage analysis at column (data element) level
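Bi-directional lineage can be thought of as walking a directed graph in either direction. The sketch below is a minimal illustration under that assumption. The LineageGraph class and the node names are hypothetical, and a node may stand for a table, a column or an object such as an audio file.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph of 'data flows from source to target' edges."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        # Store the edge in both directions so either traversal is cheap.
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start: str, edges: dict) -> set:
        """Breadth-first traversal returning every node reachable from `start`."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def downstream_of(self, node: str) -> set:
        """Impact analysis: everything fed, directly or indirectly, by `node`."""
        return self._walk(node, self._downstream)

    def upstream_of(self, node: str) -> set:
        """Provenance: everything `node` is derived from."""
        return self._walk(node, self._upstream)
```

As the text notes, the answers are only as complete as the edges captured. A downstream application whose metadata was never recorded simply does not appear in the traversal.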
Data Usage Analysis
Data usage analysis primarily provides the ability to track what data within the Warehouse is being
used, the frequency of usage and the access log of end-users accessing the data. It helps to
identify how frequently data elements are accessed, to improve the data modelling and to
restructure the data to provide easier and quicker access to end-users.
Data usage analysis requires the Data Usage related operational metadata to be captured as
part of the metadata management process. Some of this operational metadata for structured
data can be captured through automated processes, either from the system logs or system tables.
However, for semi-structured or unstructured data, capturing operational metadata may require
some level of tracking at the operating system level and is subject to feasibility, specific use case
requirements and the decision to track user activity at such a detailed level.
BI Usage Analysis
Operational Metadata required for supporting BI usage analysis will be primarily sourced from the
application metadata repositories. BI usage analysis helps to understand user behavior on BI
tools and applications and thus to identify potential opportunities for redesign and / or
optimisation.
Following are some examples of analyses typically performed on BI Usage
Number of users executing reports on a daily / weekly basis
Average number of reports executed on a daily / weekly basis
Number of times a report is run in the last x days
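The three example analyses above can be computed from a simple report execution log. The sketch below assumes the log is available as (user, report, run date) records. That record shape and the function names are hypothetical, since the actual layout depends on the BI tool's own metadata repository.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta


def users_per_day(log: list) -> dict:
    """Number of distinct users executing reports on each day."""
    users = defaultdict(set)
    for user, _report, day in log:
        users[day].add(user)
    return {day: len(names) for day, names in users.items()}


def average_reports_per_day(log: list) -> float:
    """Average executions per day, counting only days with at least one run."""
    per_day = Counter(day for _user, _report, day in log)
    return sum(per_day.values()) / len(per_day) if per_day else 0.0


def runs_in_last_days(log: list, report: str, today: date, days: int) -> int:
    """Number of times `report` ran in the `days` days up to and including `today`."""
    cutoff = today - timedelta(days=days)
    return sum(1 for _user, name, day in log if name == report and cutoff < day <= today)
```

Weekly figures follow the same pattern, with the run date replaced by its ISO week as the grouping key.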
Audit Analysis
Audit analysis requires Operational Metadata to be captured for the data integration and load
processes. Audit analysis primarily helps to understand the effectiveness of the data movement
and data loading processes and helps to identify potential opportunities for redesign and / or
optimisation.
Examples of audit analysis reports are as follows:
Average execution times for batch processes, by subject areas
Long running jobs at the potential risk of missing data loading SLAs (for proactive tuning)
Jobs exceeding the average execution times on a daily / weekly basis
Average number of errors or exceptions on a periodic basis
Frequently occurring errors or exceptions by Source Feed or Subject Area
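Two of the reports listed above can be sketched from job run records of the form (job, subject area, execution seconds). That record shape is an assumption, as is the choice to compare each run with the average for its own subject area. A real implementation might instead compare each job with its own history.

```python
from collections import defaultdict
from statistics import mean


def average_time_by_subject_area(runs: list) -> dict:
    """Average execution time in seconds for batch processes, by subject area."""
    buckets = defaultdict(list)
    for _job, area, seconds in runs:
        buckets[area].append(seconds)
    return {area: mean(values) for area, values in buckets.items()}


def jobs_exceeding_average(runs: list) -> list:
    """Jobs with at least one run longer than their subject area's average."""
    averages = average_time_by_subject_area(runs)
    return sorted({job for job, area, seconds in runs if seconds > averages[area]})
```

Jobs returned by the second function are candidates for proactive tuning before they put the data loading SLAs at risk.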
Metadata Maintenance and Retirement
The Metadata Maintenance and Retirement process will be closely related to and dependent on the
Governance and Planning for Metadata. For the Warehouse, the Metadata Maintenance and
Retirement strategy needs to cater to the differences in target audience, data movement
strategy and data retention strategy for each of these environments.
Following are the general guidelines for Metadata Maintenance and Retirement:
Metadata will be captured only for the ‘Shared’ Area
No metadata will be captured or maintained for user specific directories (‘Private’ Area)
Metadata capture and updates for any metadata captured using a manual or custom
process need to be enforced as part of the code release checklist, and the metadata
should be up to date at any given time
Technical metadata captured using automated process also should be maintained
completely and accurately for all objects
Following metadata captured using an automated process may be refreshed on a need
basis
o Operational Metadata
o System Statistics
When data is purged, all metadata associated with that data / data objects should also
be purged from the metadata repository
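The purge guideline can be sketched as a filter over the repository. The sketch assumes each metadata entry records the data object it describes under a data_object key. Both that key and the dictionary layout are hypothetical stand-ins for whatever the repository actually stores.

```python
def purge_metadata(repository: dict, purged_objects: set) -> dict:
    """Return a copy of the repository without entries for purged data objects."""
    return {
        key: entry
        for key, entry in repository.items()
        if entry["data_object"] not in purged_objects
    }
```

Returning a new dictionary rather than deleting in place keeps the original available for an audit record of what was retired.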