data explorer

22
Ihr Logo Data Explorer - A data profiling tool

Upload: kalpesh1908

Post on 04-Dec-2014

86 views

Category:

Documents


2 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Data explorer

Ihr Logo

Data Explorer- A data profiling tool

Page 2: Data explorer

Your Logo

Agenda Introduction

Existing System

Limitations of Existing System

Proposed Solution

Project Scope

Block Diagram

Implementation

Technology

Hardware and Software Requirements

Features and Benefits

Future Enhancement

Data Explorer – A Data Profiling Tool

Page 3: Data explorer

Your Logo

Introduction (1/2)

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data.

Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data.  Profiling helps to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata. 

The purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not. 

The result of the analysis is used both strategically, to determine suitability of the candidate source systems and give the basis for an early go/no-go decision, and tactically, to identify problems for later solution design, and to level sponsors’ expectations.

Data Profiling

Data Explorer – A Data Profiling Tool

Page 4: Data explorer

Your Logo

Introduction (2/2)

Find out whether existing data can easily be used for other purposes

Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category

Give metrics on data quality, including whether the data conforms to particular standards or patterns

Assess the risk involved in integrating data for new applications, including the challenges of joins

Assess whether metadata accurately describes the actual values in the source database

Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.

Pourpose of Data Profiling

Data Explorer – A Data Profiling Tool

Page 5: Data explorer

Your Logo

Existing System

Initially the data Profiling activities used to be done by writing complicated SQL queries

This would be comfortable for analyst or user who knows to write SQL queries

Many of us do not know the proper syntax and format for writing SQL queries

To overcome this, Data Profiling tools were introduced

Data Profiling Tools, to a some extent overcome the limitations for writing complex queries

All types of profiling activities were not supported by the tools

User has to understand and learn how to use the tool

Data Explorer – A Data Profiling Tool

Page 6: Data explorer

Your Logo

Limitations of Existing System

Development time is more.

Need to understand the functionality for developing the queries.

Results needs to be exported to excel or notepad for anlysis

Traditional Approach

Complex User Interface

Limited Functionality.

Setup and Installation.

License Cost.

Minimum Server Requirements

SQL Queries Existing Tools

Data Explorer – A Data Profiling Tool

Page 7: Data explorer

Your Logo

Proposed Solution

Developing an Application performing all the types of profiling

Easy to use interface

Minimum system requirements

Feature to export the profiling results data to excel

Additional feature to indicate the Data Quality i.e. Data Quality Indicator

Supporting multiple Databases like Oracle 10g, Oracle 11g, MS SQL Server 2005, MS SQL Server 2008, My SQL etc

Integrating Data Quality to correct erroneous, inconsistent and inaccurate data

Data Explorer – A Data Profiling Tool

Page 8: Data explorer

Your Logo

Project Scope Keeping the Time Line and other factors in mind, the project will currently support only

MS SQL Server

Also the project will have following types of Profiling:

Column Profiling

Empty Column Analysis

Null Rule Analysis

Constant Analysis

Frequency Analysis

Uniqueness Analysis

Primary/Composite Key Analysis

Integrating Data Quality

Data Explorer – A Data Profiling Tool

Page 9: Data explorer

Your Logo

Architecture Diagram

Data Explorer – A Data Profiling Tool

Analysis TeamAnalysis Team ManagementManagementBusiness UsersBusiness Users

Data ExplorerData ExplorerData

ProfilingCentral

Metadata Repository

Capture Issues

and Notes

MS SQL ServerMS SQL Server Other DatabasesOther Databases

Reporting

Page 10: Data explorer

Your Logo

Implementation

The project will be implemented module wise.

Project will be having different modules. Each module will be developed individually and Unit Tested

After completion of all the modules and unit testing, all the modules will be integrated and System Integration Testing will be performed

There will be separate modules for Databases retrieval from server, Tables retrieval after selecting a database, Columns retrieval after selecting a Table

There will be separate module for each type of profiling discussed.

Data Explorer – A Data Profiling Tool

Page 11: Data explorer

Your Logo

Implementation - Profiling Details

Column Profiling

This will help in discovering total no of records, null percentage, unique percentage, minimum and maximum value in the column, documented data type etc.

Constant Analysis

This will help in discovering those columns which has less than 4 and greater than 0 distinct values.

Null Rule Analysis

This will help in finding all the columns in a table which has 100% NULL values

Data Explorer – A Data Profiling Tool

Page 12: Data explorer

Your Logo

Implementation - Profiling Details

Unique Analysis

This will help in finding all the columns in table which has 100% uniqueness.

Primary Key / Composite Key Analysis

It will help us to find out the possible primary or composite key columns which can be have unique combination.

Frequency Analysis

This will help in finding the no. of distinct values in the columns and the no. of time the value is repeated in a column.

Data Explorer – A Data Profiling Tool

Page 13: Data explorer

Your Logo

Implementation – Data Quality

Data Unification

Before Data Unification Profiling results

Data Explorer – A Data Profiling Tool

Column Column Value Count

Gender Male 50

M 10

His 5

male 60

Total 125

Column Column Value Count

Country USA 10

U.S.A 60

United States of America 2

US 20

Total 92

Page 14: Data explorer

Your Logo

Implementation – Data Quality

Data Unification

After Data Unification Profiling results

Data Explorer – A Data Profiling Tool

Column Column Value Count

Gender Male 125

Column Column Value Count

Country U.S.A 92

Page 15: Data explorer

Your Logo

Implementation – Data Quality

NULL Removal

Before Null Removal profiling results

Data Explorer – A Data Profiling Tool

Column Null %

Country 30

Column Column Value Count

Country India 50

U.S.A 20

NULL 30

Total 100

Page 16: Data explorer

Your Logo

Implementation – Data Quality

NULL Removal

After Null Removal Profiling results

Data Explorer – A Data Profiling Tool

Column Null %

Country 0

Column Column Value Count

Country India 50

U.S.A 20

N.A. 30

Total 100

NULL value defaulted to N.A. (Not Available)

Page 17: Data explorer

Your Logo

Technology

Data Explorer will be developed on .NET platform using C# as a coding language.

.NET is Microsoft platform for developing advanced and Robust applications

.NET supports a wide range of library classes which eases the development efforts and hence more time can be utilized in other activities

.NET is called Language Independent Platform as it support 4 native languages and 21 non-native languages.

Native Languages are a Microsoft created languages i.e. C#. VB.Net. J#, VC++

Non-Native or Non Microsoft Languages supported are Pearl, Ruby on Rails etc

Data Explorer – A Data Profiling Tool

Page 18: Data explorer

Your Logo

Hardware and Software Requirements

Data Explorer – A Data Profiling Tool

Data Explorer

HARDWARE SOFTWARE

• Pentium Core 2 Duo processor or above

• 2 GB RAM • 20 GB HDD • Printer • Router for Internet

Connection

• Windows 2000/ Windows XP/ Windows Vista/ Windows 7

• Microsoft .NET Framework 3.5

• Microsoft Visual Studio 2008

Page 19: Data explorer

Your Logo

Features Supports multiple databases like MS SQL Server, Oracle

Different types of profiling like

Column Profiling

Constant Analysis

Unique Analysis

Null Rule Analysis

Frequency Analysis

Empty Column Analysis

Primary / Composite Key Analysis

Quickly Analyze and validate data issues

Data Quality improvement

Data Explorer – A Data Profiling Tool

Page 20: Data explorer

Your Logo

Benefits

Quick discovery of data issues

No more writing of queries to profile data

Time efficient

Shorten the implementation cycle of major projects

Improve understanding of data for the users

Discovering business knowledge

Improves data accuracy in corporate databases

Data Explorer – A Data Profiling Tool

Page 21: Data explorer

Your Logo

Future Enhancement

Data Explorer can be further extended to support unstructured or semi-structured data like flat files, .csv files

It can also be extended to support other relation data bases like MS Access, MySQL, Sybase etc Time efficient

It can also be enhanced by including Data Quality reports on top of Data Quality Results

There can be mechanism to store the profiling results so that it can be used or referred later at any point of time

Data Explorer – A Data Profiling Tool

Page 22: Data explorer

Ihr Logo

Thank You

Data Explorer – A Data Profiling Tool