
Testing ETL: Don’t dig a tunnel with a toothpick

Iain McCowatt
http://exploringuncertainty.com
[email protected]
@imccowatt | imccowatt

© Can Stock Photo Inc. / frenta

Introduction: ETL
• What is ETL?
• Where is it used?
• How does it work?

Introduction: Challenges
• No GUI
• Tester skills
• Paucity of off-the-shelf tools

Introduction: Scope of Presentation
• Functional testing of ETL, primarily for accuracy
• Logic testing of ETL, i.e. verification rather than validation

Context: Types of ETL Logic

Logic Type      | Description                                                                    | Extract | Transform | Load
Selection       | Selection of source data                                                       |    ✓    |     ✓     |
Mapping         | Simple transfer of data from source field to destination field                |         |     ✓     |
Transformation  | Manipulation / normalization of data from source to destination               |         |     ✓     |
Load            | Load strategy (e.g. append/merge); referential integrity                      |         |           |  ✓
Exception       | Management of exceptions (invalid data, invalid lookups, orphan records etc.) |         |     ✓     |  ✓
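For illustration only, a minimal SQL sketch (assuming hypothetical SRC_CUSTOMER and TGT_CUSTOMER tables, not anything from the presentation) of how several of these logic types can appear in a single statement:

  -- Hypothetical example: selection, mapping, transformation and exception
  -- handling combined in one INSERT ... SELECT.
  INSERT INTO tgt_customer (cust_id, cust_name, country_code)
  SELECT src.cust_id,                    -- mapping: simple field transfer
         UPPER(TRIM(src.cust_name)),     -- transformation: normalization
         NVL(src.country_code, 'UNK')    -- exception: default for missing data
  FROM   src_customer src
  WHERE  src.status = 'ACTIVE';          -- selection: subset of source rows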

Context: Other Factors
• Project constraints: time, cost, quality
• Size
• Complexity
• Types of load (e.g. full, delta etc.)
• Data volume
• Degree of specification
• Tester skills

Test Strategy: Data

Approach     | Indications                          | Contraindications
Production   | Need to test with volume; need noise | Data privacy issues
Synthetic    | Need to cover logic                  | Little visibility into the logic
Conditioned  | Need volume, noise AND coverage      | Data privacy issues; little visibility into the logic

Test Strategy: Oracles

Approach                 | Indications                                                  | Contraindications
Source Data              | Minimal selection & transformation                           | Any complexity
Previous Implementation  | Regression testing; previous implementation available        | Significant change from previous implementation
Baseline Data            | Regression testing; baseline data available                  | Significant change from previous implementation
Parallel Implementation  | No appropriate Oracle is available, e.g. new implementation  | Significant time and cost constraints; testers lack development skills

Test Strategy: Verification

Approach              | Indications                                                                            | Contraindications
Profiling             | Logic is of minimal complexity                                                         | Data integrity is critical
Visual Diff           | Data integrity is critical                                                             | Large data volume; multiple types of load; time or cost is critical
Automated Comparator  | Data integrity is critical; testing with large data volumes or multiple types of load | Testers lack development skills
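As a rough illustration of the Profiling approach (hypothetical TGT_CUSTOMER table and columns, not from the presentation), profiling checks summary statistics rather than individual records, which is why it only suits logic of minimal complexity:

  -- Profile the target; the same figures would be derived from the source for comparison.
  SELECT COUNT(*)                                              AS row_count,
         COUNT(DISTINCT cust_id)                               AS distinct_keys,
         SUM(CASE WHEN country_code IS NULL THEN 1 ELSE 0 END) AS null_country_codes,
         MIN(created_date)                                     AS earliest_record,
         MAX(created_date)                                     AS latest_record
  FROM   tgt_customer;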

Case Study: Context
• Previous release overran: time critical
• Regulatory project: quality (data integrity) critical
• Significant transformation logic (>130 tables, >950 fields, >2200 conditions), some highly complex, but reasonably specified
• Multiple types of load with different logic (full, delta)
• Preference for testing with production data

Case Study: Strategy

Data         | Oracle                   | Verification
Production   | Source Data              | Profiling
Synthetic    | Previous Implementation  | Visual Diff
Conditioned  | Baseline Data            | Automated Comparator
             | Parallel Implementation  |

Case Study: Strategy

Data           | Oracle                   | Verification
Production     | Source Data              | Profiling
Synthetic      | Previous Implementation  | Visual Diff
Conditioned ✓  | Baseline Data            | Automated Comparator
               | Parallel Implementation  |

• Client preference for production data, and reliability concerns, suggested a need for “noisy” data
• Complexity of the transformations drove the need for coverage
• Conditioned data was the compromise

Case Study: Strategy

Data           | Oracle                     | Verification
Production     | Source Data                | Profiling
Synthetic      | Previous Implementation    | Visual Diff
Conditioned ✓  | Baseline Data              | Automated Comparator
               | Parallel Implementation ✓  |

• Complex transformations ruled out Source Data as an Oracle
• Significant changes from the previous release ruled out Previous Implementation and Baseline Data
• With no other suitable Oracle, a Parallel Implementation was selected despite time and cost constraints

Case Study: Strategy

Data           | Oracle                     | Verification
Production     | Source Data                | Profiling
Synthetic      | Previous Implementation    | Visual Diff
Conditioned ✓  | Baseline Data              | Automated Comparator ✓
               | Parallel Implementation ✓  |

• The importance of data integrity ruled out Profiling
• Given time and cost constraints, the data volumes and the repetition across different types of load ruled out Visual Diff
• An Automated Comparator was selected as the more efficient option

Case Study: Approach

[Diagram: data flows from Source through the Extract, Transformation and Load processes (via extracted and staged data) to the Target; a Parallel Implementation generates Expected Results that are reconciled against the output of the transformation and load steps.]

Case Study: Approach

[Same diagram as above, annotated as follows.]

• Due to data privacy concerns, all test data was scrubbed
• A client (internal) team tested the extract process
• Source data was used to build the expected result set for the transformation process; both source and target (pre-load) were used to build the expected result set for the load process (a sketch follows this list)
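A minimal sketch of that load-step idea, assuming hypothetical STG_CUSTOMER (staged data), TGT_CUSTOMER_PRELOAD (pre-load snapshot of the target) and EXP_CUSTOMER (expected results) tables and a simple merge-style load strategy, not the project’s actual logic:

  -- Expected results for the load step are derived from BOTH the staged source
  -- rows and the pre-load target, mirroring a merge/append load strategy.
  INSERT INTO exp_customer (cust_id, cust_name)
  SELECT stg.cust_id, stg.cust_name          -- new or changed rows from staging
  FROM   stg_customer stg
  UNION ALL
  SELECT pre.cust_id, pre.cust_name          -- untouched rows carried forward
  FROM   tgt_customer_preload pre
  WHERE  NOT EXISTS (SELECT 1
                     FROM   stg_customer s
                     WHERE  s.cust_id = pre.cust_id);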

Case Study: Approach

[Same diagram as above, annotated as follows.]

Note: the extract process was tested by another team.

Reconciliation is a comparison of the expected result data set generated by the parallel implementation and the actual result data set generated by the ETL tool. Records in the relative complements (i.e. mismatches) indicate potential problems.

[Venn diagram: Expected Data Set and Actual Data Set; the non-overlapping regions are the mismatches.]
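In SQL terms the relative complements amount to two set differences; a minimal sketch, assuming hypothetical EXP_CUSTOMER (expected) and TGT_CUSTOMER (actual) tables with identical column lists:

  -- Expected rows the ETL tool did not produce:
  SELECT * FROM exp_customer
  MINUS
  SELECT * FROM tgt_customer;

  -- Rows the ETL tool produced that were not expected:
  SELECT * FROM tgt_customer
  MINUS
  SELECT * FROM exp_customer;

Rows returned by either query are the mismatches to investigate.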

Case Study: Process

Mine Data → Condition Data → Determine Expected Results → Execute ETL → Reconcile Results → Isolate and Report Bugs

Case Study: Tool
• Some things should be automated; other things should not
• A custom tool was created using Oracle (SQL, PL/SQL)
• This serves as both a parallel implementation and a comparator
• A generic framework was used, enabling:
  – Easier addition of new transformations
  – Control over the scope of any given execution (a sketch of the driver idea follows this list)
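A minimal PL/SQL sketch of such a driver, assuming a hypothetical ETL_TEST_CONTROL table that names an expected-result procedure per target table and flags whether it is in scope for the current run (not the project’s actual framework):

  -- Run only the expected-result generators that the control data puts in scope.
  BEGIN
    FOR c IN (SELECT target_table, er_procedure
              FROM   etl_test_control
              WHERE  in_scope = 'Y')
    LOOP
      EXECUTE IMMEDIATE 'BEGIN ' || c.er_procedure || '; END;';
      DBMS_OUTPUT.PUT_LINE('Generated expected results for ' || c.target_table);
    END LOOP;
  END;
  /

In a scheme like this, a new transformation is added by registering another procedure in the control table, without touching the driver.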

Case Study: Tool

[Architecture diagram: a Driver, fed by Control Data, invokes stored procedures that make up the Parallel Implementation and the Comparator. The Parallel Implementation reads Source and Target data and generates Expected Results and Tracking Data; the Comparator reconciles Expected Results against the actual results and records Mismatch Data.]

Case Study: Tool

[Same diagram as above, annotated as follows.]

• The driver controls the scope of execution: expected result (ER) generation, reconciliation, or both; and which tables / fields / conditions are covered
• The parallel implementation stored procedures are responsible for expected result generation
• Expected result generation is not dependent on a specific set of input data
• Decision coverage data is used to guide data conditioning (a sketch follows this list)
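A minimal sketch of the tracking idea, assuming hypothetical SRC_CUSTOMER and ETL_TEST_TRACKING tables and made-up condition identifiers: as expected results are generated, the condition that applied to each record is logged, yielding decision coverage data.

  -- Record which specified condition each expected-result row exercised.
  INSERT INTO etl_test_tracking (target_table, record_key, condition_id)
  SELECT 'TGT_CUSTOMER',
         src.cust_id,
         CASE WHEN src.country_code IS NULL THEN 'COND_002'   -- default-country branch
              ELSE 'COND_001'                                  -- straight mapping branch
         END
  FROM   src_customer src
  WHERE  src.status = 'ACTIVE';

Conditions with no tracking rows after a run point to gaps in the conditioned data.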

Case Study: Tool

[Same diagram as above, annotated as follows.]

• The comparator stored procedures are responsible for reconciling expected and actual results
• The “secret sauce”: tracking data is used to pinpoint the conditions for which mismatches are detected
• The final mismatch report identifies which conditions have been executed, and which appear to have problems
• Reconciliation is performed in two phases (a sketch follows this list):
  – Key matching, to identify key-level mismatches
  – Field matching of the intersection, to identify field-level mismatches
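A minimal sketch of the two phases, again assuming hypothetical EXP_CUSTOMER and TGT_CUSTOMER tables keyed on cust_id:

  -- Phase 1: key matching, to find missing or unexpected records.
  SELECT cust_id FROM exp_customer
  MINUS
  SELECT cust_id FROM tgt_customer;

  -- Phase 2: field matching over the key intersection, to find field-level mismatches.
  SELECT e.cust_id,
         e.cust_name AS expected_name,
         t.cust_name AS actual_name
  FROM   exp_customer e
  JOIN   tgt_customer t ON t.cust_id = e.cust_id
  WHERE  DECODE(e.cust_name, t.cust_name, 1, 0) = 0;   -- DECODE treats two NULLs as equal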

Case Study: Tool
• On a shared, “run-of-the-mill” development box, the tool was able to reconcile more than 10 million cells (rows × columns) per minute, enabling execution and reconciliation of a complete load in a matter of hours
• This use of “large scale mechanized checking” enabled a degree of coverage not possible on the previous release, where a more manually intensive approach was applied

Case Study: A Note on Tools
• This is an example of a heavyweight Oracle and Comparator, designed for reuse and high performance with large data volumes
  – Other projects may require neither
• Similar tools can be constructed very quickly using off-the-shelf database software
  – For an earlier project, MS Access was used to construct a parallel implementation in a matter of days, and a comparator in under an hour

Case Study: Team
• Data literate; a mix of manual testers and traditional automators
• Prior SQL but no prior PL/SQL experience
• Ready, willing and able to learn whatever needed to be learned in order to test
• Ready to challenge each other, and to be challenged

Case Study: Lessons Learned
• The tool was “just a tool”: human factors were more important, e.g.:
  – Availability of information
  – Responsibilities and handoffs
  – Communication of changes

Case Study: Lessons Learned
• The tool, and the testing, were heuristic in nature:
  – Tests (whether manual or automated) can have bugs in them: tests that do not reveal problems don’t “pass”, they simply do not reveal problems
  – Our checks were only as good as the requirements and design, and were blind to requirement and design bugs
  – We were forced to use composite keys where true keys could not be replicated, increasing our blindness to certain logic bugs
  – In some cases, conditions were not mutually exclusive, making isolation of any corresponding mismatch less precise (i.e. at a field rather than a condition level)

Questions?