data integration” ought to be automated! · • data integration is one of the most challenging...
TRANSCRIPT
![Page 1: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/1.jpg)
© CDISC 2017 1
Presented by Sandy Lei and Simson AlexJohnson & Johnson
2017 CDISC International Interchange
“Data Integration” Ought to be Automated!
![Page 2: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/2.jpg)
© CDISC 2017
Data Integration• Data integration is one of the most challenging
support activities during compound development cycle. Data harmonization: clean, map (per study)
(may be omitted for highly standardized data)
Data stacking/pooling Derivation
• Treatment (combine dose groups)• MedDRA Up-version• Analysis visits (consolidate)
• Example: ISS/ISE is an ad-hoc/static based data pooling required for most submissions.
2
![Page 3: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/3.jpg)
© CDISC 2017
Data Integration Issues• Same Information but data is provided differently:
Dataset: ADES/ADEXSUM (Exposure Summary) Variable: WEIGHTBL/BLWTKG (Baseline Weight (kg)) Value/Code: Serum Bilirubin (umol/L)/Bilirubin (umol/L) Resolved by mapping Prevented by common practices, standardization
• Same data represented or defined differently: Same variable with various label, length, type Same variable/value/code with various
definition/computational method Resolved by re-define or re-derive the variable Prevented by common practices, standardization
• Analysis challenge: variable/record selection Resolved by analysis results metadata
3
![Page 4: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/4.jpg)
© CDISC 2017
Data IntegrationPossible truncation due to length
4
data adsl01;set lib1.adsl lib2.adsl lib3.adsl;
run;WARNING: Multiple lengths were specified for the variable AGEGR1 by input data set(s). This may cause truncation of data.WARNING: Multiple lengths were specified for the variable RACEGR1 by input data set(s). This may cause truncation of data.WARNING: Multiple lengths were specified for the variable STRATA by input data set(s). This may cause truncation of data.NOTE: There were 504 observations read from the data set LIB1.ADSL.NOTE: There were 910 observations read from the data set LIB2.ADSL.NOTE: There were 614 observations read from the data set LIB3.ADSL.NOTE: The data set WORK.ADSL01 has 2028 observations and 83 variables.
Variable value may or may not get truncated depending on the variable length defined in lib1.
![Page 5: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/5.jpg)
© CDISC 2017
Data IntegrationRedefine Length
/*** ADSL dataset ***/data adsl01;
length AGEGR1 $20 RACEGR1 $20 STRATA $30;set lib1.adsl lib2.adsl lib3.adsl;
run;
5
Simple fix by redefine variable lengths in a “length” statement. However, it requires to find all variable lengths from each
study’s metadata.
![Page 6: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/6.jpg)
© CDISC 2017
Data IntegrationError caused by variable types
6
33 data adae01;34 set lib1.adae lib2.adae lib3.adae;ERROR: Variable AELLTCD has been defined as both character and numeric.35 run;NOTE: The SAS System stopped processing this step because of errors.WARNING: The data set WORK.ADAE01 may be incomplete. When this step was stopped there were 0 observations and 132 variables.
31 data adae01;32 set lib1.adae (rename=(AELLTCD=AELLTCD_CHAR))33 lib2.adae (rename=(AELLTCD=AELLTCD_CHAR))34 lib3.adae (in=in3 rename=(AELLTCD=AELLTCD_NUM));35 if in3 then AELLTCD_CHAR=put(AELLTCD_NUM, best12.);36 AELLTCD=input(AELLTCD_CHAR,best12.);37 run;NOTE: There were 106 observations read from the data set LIB1.ADAE.NOTE: There were 1502 observations read from the data set LIB2.ADAE.NOTE: There were 491 observations read from the data set LIB3.ADAE.NOTE: The data set WORK.ADAE01 has 2099 observations and 133 variables.
![Page 7: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/7.jpg)
© CDISC 2017
Data Integration with Enhanced SAS Macro
• Data integration macro: Input: study folder paths Output: folder path for integrated datasets rules:
• Length: define maximum length among each variable across studies
• Label: most common, specific, standard• Type: convert numeric into character
when both defined for the same variableAutomate data integration by defining your own
business rules.
7
![Page 8: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/8.jpg)
© CDISC 2017
Data Integration with Enhanced SAS Macro
8
%dbpooln1(compdir=&filesrv\prd\compound\abc,studydir=s003 s004 s005,indtsdir=adam adam adam,where=where memname in
(‘ADSL’, ‘ADDS’,‘ADEXS’ ‘ADCM’ ‘ADTTE’,’ADEFF’),
outdir=&compdir.\ise001\adam);
Aid tools to review in spreadsheets or pdf: Across-study metadata Subset (several subjects each study)
No more proc contents/dictionary, distinct values!
![Page 9: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/9.jpg)
© CDISC 2017
Examples of Data Integration from BIG METADATA/DATA:
• Placebo treated subject data for compound abc.• Placebo responder rate for indication xyz.• List of studies contained “Hemoglobin A1c”,
“Systolic Blood Pressure”, “FEV1”, etc.• Distribution of race, sex, age group, and
comorbidity by subject enrollment year• Analysis data for time to first major bleeding
event from 22 phase 2/3 studies.• Hy’s Law cases by compound.• …
9
![Page 10: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/10.jpg)
© CDISC 2017
Automated Data Integration a web-based platform via user interfaces
• J&J Patient Data Warehouse (PDW) One centralized database (metadata, data) Browse, search and retrieve data (on-demand) via a
web-based platform 2013~present 445 studies (1189 SAS data folders) from 17
compounds : load as is (any SAS datasets) Application user interfaces
• Trial Catalog (trial related documents)• Study data selection• Metadata Browser• Data Integration (on-demand)• Subject Sub-setting
10
![Page 11: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/11.jpg)
© CDISC 2017
PDW Database Tables
11
Dataset Variable Value Parameter Value
Metadata*
TA, indication, phase, data type, data standard, data location (path), etc.
SAS Data Metadata
Patient Data from Clinical Studies*
Study Information Metadata
*PDW metadata are designed to facilitate across-study metadata views for user to search, browse, select and export data of interest.
Patient Data ValuesFrom SAS tables
11
![Page 12: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/12.jpg)
© CDISC 2017 12
Example Clinical Trial Data : ADVS (Vital Signs Analysis Data)
USUBJID AVISITN AVISIT PARAM AVAL AVALC BASE BASEC CHG PCHGSTUDY-01-300001 1BASELINE WEIGHT (kg) 92.5STUDY-01-300001 1BASELINE HEIGHT (cm) 170.2STUDY-01-300001 1BASELINE BODY MASS INDEX (kg/m2) 31.9STUDY-01-300001 1BASELINE BMI CLASSIFICATION 3OBESE >=30STUDY-01-300001 1BASELINE TEMPERATURE (c) 36.2STUDY-01-300001 1BASELINE STANDING PULSE RATE (bpm) 80STUDY-01-300001 1BASELINE SUPINE PULSE RATE (bpm) 76STUDY-01-300001 1BASELINE STANDING SBP (mmHg) 118STUDY-01-300001 1BASELINE SUPINE SBP (mmHg) 120STUDY-01-300001 1BASELINE STANDING DBP (mmHg) 80STUDY-01-300001 1BASELINE SUPINE DBP (mmHg) 80STUDY-01-300001 1BASELINE PULSE (STANDING-SUPINE) (bpm) 4STUDY-01-300001 1BASELINE SBP (STANDING-SUPINE) (mmHg) -2STUDY-01-300001 1BASELINE DBP (STANDING-SUPINE) (mmHg) 0STUDY-01-300001 1BASELINE PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 2NOSTUDY-01-300001 1BASELINE PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 2NOSTUDY-01-300001 2WEEK 4 WEIGHT (kg) 84.8 92.5 -7.7 -8.324STUDY-01-300001 2WEEK 4 BODY MASS INDEX (kg/m2) 29.3 31.9 -2.6 -8.15STUDY-01-300001 2WEEK 4 BMI CLASSIFICATION 2OVERWEIGHT 25-<30 3OBESE >=30 -1 -33.333STUDY-01-300001 2WEEK 4 TEMPERATURE (c) 36.8 36.2 0.6 1.657STUDY-01-300001 2WEEK 4 STANDING PULSE RATE (bpm) 80 80 0 0STUDY-01-300001 2WEEK 4 SUPINE PULSE RATE (bpm) 76 76 0 0STUDY-01-300001 2WEEK 4 STANDING SBP (mmHg) 132 118 14 11.864STUDY-01-300001 2WEEK 4 SUPINE SBP (mmHg) 128 120 8 6.667STUDY-01-300001 2WEEK 4 STANDING DBP (mmHg) 78 80 -2 -2.5STUDY-01-300001 2WEEK 4 SUPINE DBP (mmHg) 78 80 -2 -2.5STUDY-01-300001 2WEEK 4 PULSE (STANDING-SUPINE) (bpm) 4 4 0 0STUDY-01-300001 2WEEK 4 SBP (STANDING-SUPINE) (mmHg) 4 -2 6 -300STUDY-01-300001 2WEEK 4 DBP (STANDING-SUPINE) (mmHg) 0 0 0STUDY-01-300001 2WEEK 4 PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 2NO 2NO 0 0STUDY-01-300001 2WEEK 4 PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 2NO 2NO 0 0
![Page 13: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/13.jpg)
© CDISC 2017 13
PDW Dataset Metadata
MEMNAME MEMLABEL FILESIZE NOBS CRDATE MODATE
ADVS Vital Signs Analysis Data 105775104 232322 4/14/2017 4/14/2017
![Page 14: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/14.jpg)
© CDISC 2017 14
PDW Variable Metadata MEMNAME NAME LABEL TYPE FORMAT VARNUM LENGTHADVS ADT Actual Date of Vital Signs num DATE9. 2 8ADVS ADY Actual Day of Vital Signs num 3 8ADVS APHASE Analysis Phase char 16 20ADVS APHASEC Analysis Phase Code num 17 8ADVS ARFENDT Analysis Reference End Date num DATE9. 18 8ADVS ARFSTDT Analysis Reference Start Date num DATE9. 19 8ADVS ASTNRHI Vital Sign Significant Range High num 62 8ADVS ASTNRIND Vital Sign Significant N/L/H Indicator char 63 1ADVS ASTNRLO Vital Sign Significant Range Low num 61 8ADVS AVAL Numeric Value num 14 8ADVS AVALC Character Value char 15 50ADVS AVISIT Analysis Visit char 23 20ADVS AVISITN Analysis Visit Number num 5 8ADVS BASE Value at Baseline num 57 8ADVS BASEC Character Value at Baseline char 58 50ADVS CHG Change from Baseline num 59 8ADVS PARAM Parameter char 7 40ADVS PARAMCD Parameter Code char 6 8ADVS PCHG Percent Change from Baseline num 60 8ADVS STUDYID Study Id char 24 18ADVS SUBJID Subject Number char 25 8ADVS TRTP Treatment Group char 50 40ADVS TRTPN Treatment Group Code num BEST12. 49 8ADVS USUBJID Unique Subject Id char 1 31ADVS VISIT Visit char 29 28ADVS VISITNUM Visit Number num BEST12. 4 8ADVS VSSEQ Vital Signs Sequence Number num BEST12. 30 8
![Page 15: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/15.jpg)
© CDISC 2017 15
PDW Value Level Metadata DOMAIN VARIABLE CODE DECODE TYPEADVS ADT (17FEB2004,24MAY2005) numADVS ADY (-14,76) numADVS AVAL (-60,236.5) numADVS AVALC YES charADVS AVALC NO charADVS AVALC NORMAL <25 charADVS AVALC OVERWEIGHT 25-<30 charADVS AVALC OBESE >=30 charADVS PARAM WEIGHT (kg) charADVS PARAM HEIGHT (cm) charADVS PARAM BODY MASS INDEX (kg/m2) charADVS PARAM BMI CLASSIFICATION charADVS PARAM TEMPERATURE (c) charADVS PARAM STANDING PULSE RATE (bpm) charADVS PARAM SUPINE PULSE RATE (bpm) charADVS PARAM STANDING SBP (mmHg) charADVS PARAM SUPINE SBP (mmHg) charADVS PARAM STANDING DBP (mmHg) charADVS PARAM SUPINE DBP (mmHg) charADVS PARAM PULSE (STANDING-SUPINE) (bpm) charADVS PARAM SBP (STANDING-SUPINE) (mmHg) charADVS PARAM DBP (STANDING-SUPINE) (mmHg) charADVS PARAM PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 charADVS PARAM PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 char
![Page 16: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/16.jpg)
© CDISC 2017 16
PDW Parameter-Value Level Metadata
DOMAIN PARAMVAR PARAMCODE VALUEVAR VALUECODE DVALUEVAR DVALUECODEADVS PARAM BMI CLASSIFICATION AVAL 1 AVALC NORMAL <25
ADVS PARAM BMI CLASSIFICATION AVAL 2 AVALCOVERWEIGHT 25-<30
ADVS PARAM BMI CLASSIFICATION AVAL 3 AVALC OBESE >=30ADVS PARAM BODY MASS INDEX (kg/m2) AVAL (14.7,72.2) AVALCADVS PARAM DBP (STANDING-SUPINE) (mmHg) AVAL (-58,73) AVALCADVS PARAM HEIGHT (cm) AVAL (127,205) AVALCADVS PARAM PULSE (STANDING-SUPINE) (bpm) AVAL (-49,80) AVALC
ADVS PARAM PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 AVAL 1 AVALC YES
ADVS PARAM PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 AVAL 2 AVALC NO
ADVS PARAM PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 AVAL 1 AVALC YES
ADVS PARAM PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 AVAL 2 AVALC NOADVS PARAM SBP (STANDING-SUPINE) (mmHg) AVAL (-60,59) AVALCADVS PARAM STANDING DBP (mmHg) AVAL (20,161) AVALCADVS PARAM STANDING PULSE RATE (bpm) AVAL (38,176) AVALCADVS PARAM STANDING SBP (mmHg) AVAL (64,210) AVALCADVS PARAM SUPINE DBP (mmHg) AVAL (35,160) AVALCADVS PARAM SUPINE PULSE RATE (bpm) AVAL (40,137) AVALCADVS PARAM SUPINE SBP (mmHg) AVAL (67,210) AVALCADVS PARAM TEMPERATURE (c) AVAL (35,38.6) AVALCADVS PARAM WEIGHT (kg) AVAL (33,236.5) AVALC
![Page 17: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/17.jpg)
© CDISC 2017
PDW Data Integration Work Flow
Select Studies• TA/Indication/Compound• Phase 2/3• Reporting Effort, Type of Data
Select Datasets
• Demographic• Lab• Vital Sign• Concomitant Medication• Efficacy• …
Select Variables Add Conditions
Submit/Run
• All Variables• ARM= PLACEBO• Retrieve data
Get Integrated Data
• SAS• CSV• XPT
17
*PDW metadata are designed to facilitate across-study metadata views for user to search, browse, select and export data of interest.
![Page 18: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/18.jpg)
© CDISC 2017
PDW Data IntegrationVariable selection and condition set-up from variable values
18
![Page 19: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/19.jpg)
© CDISC 2017
PDW Metadata/Data Volume
Dataset>38K
Variable>1.1M
Value>30M
Parameter Value
Metadata*
445 studies, 1189 SAS folders
SAS Data Metadata
Patient Data from Clinical Studies*
Study Information Metadata
>30B rows(1B SAS records)
19
![Page 20: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/20.jpg)
© CDISC 2017
Challenges of BIG Data/Metadata• Volume• Framework/visualization design• Dynamic metadata: in need of dynamic data
query for multiple variables distinct values• Application run-time performance• Data integrity - alignment of data standard
• Datamart creation• Security and data governance• Results validation/verification• Cross-function collaboration• Demand on a strong and innovative partner• Resistance to change• Ideal: Metadata-Driven
20
![Page 21: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,](https://reader033.vdocuments.us/reader033/viewer/2022041922/5e6c6b01a6408e0fad52349d/html5/thumbnails/21.jpg)
© CDISC 2017 21
THANK YOU!