datprof test data management (data privacy & data subsetting) - english
TRANSCRIPT
Test Data ManagementHarald Kikkers, Maarten Urbach & Bert Nienhuis
DATPROF
Data IntegrationTest Data Management
• Dutch Software supplier
• Founded in 1998
• Partners: ITCG, Sogeti, …
…and many more!
MANYORGANISATIONSUSE MULTIPLE COPIES OFPRODUCTION DATABASES
PURPOSES:• TESTING
• DEVELOPMENT
• OUTSOURCING
• MARKETING
• TRAINING
Agile Development • Building the right product• Room for change• Every 2-4 weeks working increments of the software• Progress in development
Test Automation
How to test all these iterations?
And… what data to use?
?
Team 1 Team 2 Team 3
6 TB 500 GBProduction
10 GB
6 TB 500 GBTest
10 GB 6 TB 500 GBDevelopment
10 GB
Total19,53 TB
Team 1 Team 2 Team 3
6 TB 500 GBProduction
10 GB
6 TB 500 GBTest
10 GB 6 TB 500 GBDevelopment
10 GB
Total19,53 TB
Team 1 Team 2 Team 3
Test
Team 1 Team 2 Team 3
Development
Team 1 Team 2 Team 3
6 TB 500 GBProduction
10 GB
6 TB 500 GB
Development
10 GB
6 TB 500 GBTest
10 GB
6 TB 500 GB
Development
10 GB
6 TB 500 GBTest
10 GB
6 TB 500 GB
Development
10 GB
6 TB 500 GBTest
10 GB
Total45,57 TB
Team 1 Team 2 Team 3
6 TB 500 GBProduction
10 GB
600 GB 50 GB
Development
1 GB
600 GB 50 GBTest
1 GB
600 GB 50 GB
Development
1 GB
600 GB 50 GBTest
1 GB
600 GB 50 GB
Development
1 GB
600 GB 50 GBTest
1 GB
Total10.4 TB
10 % Subset 10 % Subset 10 % Subset
Development
Test
Development
Test
Development
Test
How to protectsensitive customer data?
Test Test Test
Development Development Development
Minimize data usage
Save on hardware & infra
Reduce throughput times
Efficient data management
Protect customer information
Comply with regislation
Prevent brand damage
Maintain competitive advantages
Subsetting AnonymizingAdvantages of subsetting data Advantages of scrambling & masking data
DBA Tools ETL Suites100$ tools IBM, Informatica, Oracle
DBA Tools ETL Suites?
DBA Tools ETL Suites
- User Experience
- Default templates
- Easy to maintain
- Smart functionality
- Chain support
DBA Tools ETL Suites
Production Test/DevelopmentSource Database Target Database
Data model classification
Subset – Process dataExample: Customers, Orders, Contracts, Invoices, Transactions
Full – Master dataExample: Application data, configuration, master tables
Embty – Logging, non relevant historyExample: Logging tables, temp tabellen
Determine data to be subsetted
Chain of systemsMethod for deriving consistent subsets from multiple systems
Production Test/Development
Start FilterAll customers from The
Netherlands
Start FilterAll orders from customers in
the previous subset.
ImportMeta data Classification Deployment
Anonymization of sensitive data
- Bank account balance
- Dept
- Medication
- Illness
- Religion
- Political preference
- Salary
- Phone history
- Et cetera…
- Name
- Date of birth
- Bank account number
- Social security number
- Adress
- Insurance number
- Cellphone number
- Et cetera..
Personal data
Identifying Characteristics
“Any information relating to an identified or identifiable natural person ("data subject")Source: Data Protection Directive - Directive 95/46/EC
Techniques
ShuffleShuffle values within same column
ConditionalManipulate specified rows+
First name Last name Type
John
Max
Joe
Clark
Smith
Williams
DATPROF
Customer
Customer
Customer
Company
321First name Last name Type Comment E-Mail
John
Max
Joe
Smith
Williams
Clark
BlankDelete values from columns
ScrambleReplace existing characters
“Brother of J. Clark”
“Has dept”
Customer
Customer
Customer
CompanyDATPROF
Nr. First name Last name Type Co.. E-mail Date of Birth
John
Max
Joe
Smith
Williams
Clark
DATPROF
123
Customer
Customer
Customer
Company
321
789
456
First dayChange dates to first day within same month and year
01-02-1954
01-11-1984
01-03-1974
Postal code
Date of Birth 1st day of month 1st day of year
87% 3.7% 0.04%Source: research anonimity by Prof. Dr. Latanya Sweeney (Harvard University)
x.xxxxx@xxxx...
Xxxxx_xxx@xx...
x_xx@XxxXxxx...
Nr. First name Last name Type .. E-mail Date of birth
123
321
789
01-02-1954
01-11-1984
01-03-1974
Look-upReplace values with values from a lookup table
James
Adrian
Thomas
John
Max
Joe
First names
Chris
Thomas
James
Ruben
Adrian
Michael
David
Reference data
Smith
Williams
Clark
DATPROF
Customer
Customer
Customer
Company
x.xxxxx@xxxx...
Xxxxx_xxx@xx...
x_xx@XxxXxxx...
Nr. First name Last name Type Comment E-mail Date of birth
Thomas
James
Adrian
Smith
Williams
Clark
DATPROF
123
Customer
Customer
Customer
Company
321
789
456
01-02-1954
01-11-1984
01-03-1974
ExpressionUse custom made functions
Scrambled [email protected]
Scrambled
Scrambled
ImportMeta data
Define masking rules 3. Deployment