sql 2012 dqs
DESCRIPTION
deck from DevTeach VTRANSCRIPT
Data Quality Services
@LynnLangit
Breakthrough Insights = Better BIP
ower
Vie
w
New
sem
antic
serv
er m
odel
for S
SA
S
Dat
a Q
ualit
y S
ervi
ces
Mas
ter D
ata
Ser
vice
s
Col
umn-
stor
e
Inde
x (1
0x-
100x
fast
er)
Sem
antic
S
earc
h &
Fi
leta
ble
What is Data Quality Services?
A set of tools and services that allow domain experts to improve Data
Quality• Produces result set with suggested improvements• Does NOT change source data
Why Use DQS?
• Manually define, match , cleanseSME input
• Programmatically “”, then manually approve
• Can ‘learn’
Machine Cleansing
• Can incorporate 3rd party data• Can integrate with other data processes
(SSIS)Integration
When to use DQS (scenarios)Issue Detail
Completeness Is all information present?
Conformity Is all data in the correct format?
Consistency Do values represent the same meaning?
Accuracy Do data objects represent their real-world values?
Validity Do data values fall within acceptable ranges?
Duplication Are there multiple copies of the same data?
DQS Architecture
Installing DQS
SQL Server 2012
BI Edition
Enterprise edition
Not installed by default
Client / Server / SSIS task
Grant 1 of 3 DQS roles on the DQS_Main db
Make your data accessible for SQL operations
Enable TCP/IP for remote DQS
Post Install
Must run ‘DQS Server Installer’ post SQL Install
Do MDS integration
DQS CU1
DQS Components on SQL Server 2012
Data Quality Services client interface
How to Use DQS?
List of Basic Steps• Create/Refine/Use a Knowledge Base• Perform a Data Quality Evaluation• Generate output (results)
• List of Components• DQS Server• DQS Client(s)
How to Use DQS? Start with the KB
Knowledge Bases
• Can use included KB• Can refine included KB• Can create KB from
source data• Can manually create
KB
Parts of DQS KB – Domain Management
Adding Domain Values
• Correct• Error• Invalid
More on Domain Values
• Link as synonyms• Set as leading value
Regular or Composite Domains
More about KB Domain Management
• Domain Properties – Description, Language…• Reference Data – relate to 3rd party data• Domain Rules – RegEx/length, etc…rule-based• Domain Values – shows substitute values• Term-Based Relations – common word corrections
Parts of DQS KB – Knowledge Discovery
Parts of DQS – Knowledge Discovery – 1/2
Step two – Running Discovery
Parts of DQS KB – Knowledge Discovery – 2/2
Step three – Correcting Values
DQS KB – Creating a Matching Policy – 1/3
• Select data to be matched for each domain
DQS KB – Creating a Matching Policy – 2/3
• Create matching rules per domains• Similar
• set similarity score, when matching score < 60• For numbers, set threshold (% or int)• For dates, set threshold (DD, MM or YY)
• Exact – identical values (score of 100)• Configure Weight, must sum to 100
• Can configure Prerequisites
DQS KB – Creating a Matching Policy -3/3• Test matching rules per domains
• Click ‘Start’• Review ‘Matching Results’ tabs to compare one or more results
Matching – See Results
Matching is usually performed AFTER cleansing and is focused on identifying (and removing) duplicates
More Matching Output
Using the DQS KB to do Data Cleaning
• Create or Open a Data Quality Project• Map the DQS KB to the new data• Perform Cleansing• Manage / View Results• Export corrected results
DQS Project -- Cleansing
DQS Cleaning in Process…
DQS Cleaning complete
DQS Cleaning – Manage Results
DQS Output file Information
Export file column names (with option to include "Data and Cleansing Info“)
XXX_Source - original source column value XXX_Output - clean column value XXX_Reason - reason column value was either valid or invalid XXX_Confidence - column confidence percentage returned by the DQS
server algorithms XXX_Status - column processing status (i.e. Correct, New, Invalid, etc.)
DQS Administration - General
DQS Administration – Reference Data
DQS Administration - Logging
DQS Integration
List of Integration Points• API? – not at this time• SSIS task• MDS (Master Data Management)
DQS Cleansing Task in SSIS
DQS Cleansing Task in SSIS - mapping
For each input column define columns for • Source – contains input values• Output – contains correct or
corrected or invalid output values• Status – contains auto suggest,
correct, invalid or new
Running Package Status
DQS SSIS Task
DQS SSIS Task Complex Example
What is Master Data Management?
Defining MDS• Central repository for data• Rule-based• Can work with DQS
Types of Data Quality Projects
• Exact matches (WHERE = WHERE <> WHERE IN)• LIKE (%string matching)
T-SQL scripts (boolean match)
• CONTAINSFull-text matching
(semantic word match)
• SEMANTICSIMIALARITIESTABLESemantic Search (semantic phrase match)
• List belowSSIS tasks - (transactional,
multi-valued matching)
• Knowledge Base - rules/matches• Data Quality project - clean / correct dataDQS (KB matching)
• Versioned Entities, Attributes and RulesMDS (One view of truth)
New since RC0
• Use knowledge import from projects back to your knowledge base (KB) with Cleanse2KB
• Use the Office speller as part of the DQS client• Use Composite Domain rules
to correct values to detect rules violations
• Import values from Excel import values together with their synonyms
• Use unstructured composite domain values? KB parsing is a new feature that takes advantage of your knowledge for a
more accurate parsing
• Modify server log settings through the client UI
Performance Information
Resources
DQS Team Blog - here
DQS video – here
DQS on TechNet - here
More samples – here
DQS videos (playlist) - here
www.Develop.com
Next Steps
• Install DQS
• Create a KB
• Try out Data Cleansing
Related Session(s)
• SQL BI SQL 366 - Understanding Analysis Services in SQL Server 2012 SQL 422 – Integrating Spreadsheets with Enterprise Data SQL 245 - Why Data Warehousing Projects Fail
www.TeachingKidsProgramming.org
Do a Recipe Teach a Kid (Ages 10 ++)Microsoft SmallBasic Free Courseware (recipes)
Keep up with Data
Follow me @LynnLangit
RSS my blog www.LynnLangit.com
Hire me• To help build your BI/Big Data
solution• To teach your team next gen BI
with SQL Server 2012