data integration” ought to be automated! · • data integration is one of the most challenging...

21
© CDISC 2017 1 Presented by Sandy Lei and Simson Alex Johnson & Johnson 2017 CDISC International Interchange Data Integration” Ought to be Automated!

Upload: others

Post on 12-Mar-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017 1

Presented by Sandy Lei and Simson AlexJohnson & Johnson

2017 CDISC International Interchange

“Data Integration” Ought to be Automated!

Page 2: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Data Integration• Data integration is one of the most challenging

support activities during compound development cycle. Data harmonization: clean, map (per study)

(may be omitted for highly standardized data)

Data stacking/pooling Derivation

• Treatment (combine dose groups)• MedDRA Up-version• Analysis visits (consolidate)

• Example: ISS/ISE is an ad-hoc/static based data pooling required for most submissions.

2

Page 3: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Data Integration Issues• Same Information but data is provided differently:

Dataset: ADES/ADEXSUM (Exposure Summary) Variable: WEIGHTBL/BLWTKG (Baseline Weight (kg)) Value/Code: Serum Bilirubin (umol/L)/Bilirubin (umol/L) Resolved by mapping Prevented by common practices, standardization

• Same data represented or defined differently: Same variable with various label, length, type Same variable/value/code with various

definition/computational method Resolved by re-define or re-derive the variable Prevented by common practices, standardization

• Analysis challenge: variable/record selection Resolved by analysis results metadata

3

Page 4: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Data IntegrationPossible truncation due to length

4

data adsl01;set lib1.adsl lib2.adsl lib3.adsl;

run;WARNING: Multiple lengths were specified for the variable AGEGR1 by input data set(s). This may cause truncation of data.WARNING: Multiple lengths were specified for the variable RACEGR1 by input data set(s). This may cause truncation of data.WARNING: Multiple lengths were specified for the variable STRATA by input data set(s). This may cause truncation of data.NOTE: There were 504 observations read from the data set LIB1.ADSL.NOTE: There were 910 observations read from the data set LIB2.ADSL.NOTE: There were 614 observations read from the data set LIB3.ADSL.NOTE: The data set WORK.ADSL01 has 2028 observations and 83 variables.

Variable value may or may not get truncated depending on the variable length defined in lib1.

Page 5: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Data IntegrationRedefine Length

/*** ADSL dataset ***/data adsl01;

length AGEGR1 $20 RACEGR1 $20 STRATA $30;set lib1.adsl lib2.adsl lib3.adsl;

run;

5

Simple fix by redefine variable lengths in a “length” statement. However, it requires to find all variable lengths from each

study’s metadata.

Page 6: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Data IntegrationError caused by variable types

6

33 data adae01;34 set lib1.adae lib2.adae lib3.adae;ERROR: Variable AELLTCD has been defined as both character and numeric.35 run;NOTE: The SAS System stopped processing this step because of errors.WARNING: The data set WORK.ADAE01 may be incomplete. When this step was stopped there were 0 observations and 132 variables.

31 data adae01;32 set lib1.adae (rename=(AELLTCD=AELLTCD_CHAR))33 lib2.adae (rename=(AELLTCD=AELLTCD_CHAR))34 lib3.adae (in=in3 rename=(AELLTCD=AELLTCD_NUM));35 if in3 then AELLTCD_CHAR=put(AELLTCD_NUM, best12.);36 AELLTCD=input(AELLTCD_CHAR,best12.);37 run;NOTE: There were 106 observations read from the data set LIB1.ADAE.NOTE: There were 1502 observations read from the data set LIB2.ADAE.NOTE: There were 491 observations read from the data set LIB3.ADAE.NOTE: The data set WORK.ADAE01 has 2099 observations and 133 variables.

Page 7: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Data Integration with Enhanced SAS Macro

• Data integration macro: Input: study folder paths Output: folder path for integrated datasets rules:

• Length: define maximum length among each variable across studies

• Label: most common, specific, standard• Type: convert numeric into character

when both defined for the same variableAutomate data integration by defining your own

business rules.

7

Page 8: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Data Integration with Enhanced SAS Macro

8

%dbpooln1(compdir=&filesrv\prd\compound\abc,studydir=s003 s004 s005,indtsdir=adam adam adam,where=where memname in

(‘ADSL’, ‘ADDS’,‘ADEXS’ ‘ADCM’ ‘ADTTE’,’ADEFF’),

outdir=&compdir.\ise001\adam);

Aid tools to review in spreadsheets or pdf: Across-study metadata Subset (several subjects each study)

No more proc contents/dictionary, distinct values!

Page 9: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Examples of Data Integration from BIG METADATA/DATA:

• Placebo treated subject data for compound abc.• Placebo responder rate for indication xyz.• List of studies contained “Hemoglobin A1c”,

“Systolic Blood Pressure”, “FEV1”, etc.• Distribution of race, sex, age group, and

comorbidity by subject enrollment year• Analysis data for time to first major bleeding

event from 22 phase 2/3 studies.• Hy’s Law cases by compound.• …

9

Page 10: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Automated Data Integration a web-based platform via user interfaces

• J&J Patient Data Warehouse (PDW) One centralized database (metadata, data) Browse, search and retrieve data (on-demand) via a

web-based platform 2013~present 445 studies (1189 SAS data folders) from 17

compounds : load as is (any SAS datasets) Application user interfaces

• Trial Catalog (trial related documents)• Study data selection• Metadata Browser• Data Integration (on-demand)• Subject Sub-setting

10

Page 11: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

PDW Database Tables

11

Dataset Variable Value Parameter Value

Metadata*

TA, indication, phase, data type, data standard, data location (path), etc.

SAS Data Metadata

Patient Data from Clinical Studies*

Study Information Metadata

*PDW metadata are designed to facilitate across-study metadata views for user to search, browse, select and export data of interest.

Patient Data ValuesFrom SAS tables

11

Page 12: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017 12

Example Clinical Trial Data : ADVS (Vital Signs Analysis Data)

USUBJID AVISITN AVISIT PARAM AVAL AVALC BASE BASEC CHG PCHGSTUDY-01-300001 1BASELINE WEIGHT (kg) 92.5STUDY-01-300001 1BASELINE HEIGHT (cm) 170.2STUDY-01-300001 1BASELINE BODY MASS INDEX (kg/m2) 31.9STUDY-01-300001 1BASELINE BMI CLASSIFICATION 3OBESE >=30STUDY-01-300001 1BASELINE TEMPERATURE (c) 36.2STUDY-01-300001 1BASELINE STANDING PULSE RATE (bpm) 80STUDY-01-300001 1BASELINE SUPINE PULSE RATE (bpm) 76STUDY-01-300001 1BASELINE STANDING SBP (mmHg) 118STUDY-01-300001 1BASELINE SUPINE SBP (mmHg) 120STUDY-01-300001 1BASELINE STANDING DBP (mmHg) 80STUDY-01-300001 1BASELINE SUPINE DBP (mmHg) 80STUDY-01-300001 1BASELINE PULSE (STANDING-SUPINE) (bpm) 4STUDY-01-300001 1BASELINE SBP (STANDING-SUPINE) (mmHg) -2STUDY-01-300001 1BASELINE DBP (STANDING-SUPINE) (mmHg) 0STUDY-01-300001 1BASELINE PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 2NOSTUDY-01-300001 1BASELINE PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 2NOSTUDY-01-300001 2WEEK 4 WEIGHT (kg) 84.8 92.5 -7.7 -8.324STUDY-01-300001 2WEEK 4 BODY MASS INDEX (kg/m2) 29.3 31.9 -2.6 -8.15STUDY-01-300001 2WEEK 4 BMI CLASSIFICATION 2OVERWEIGHT 25-<30 3OBESE >=30 -1 -33.333STUDY-01-300001 2WEEK 4 TEMPERATURE (c) 36.8 36.2 0.6 1.657STUDY-01-300001 2WEEK 4 STANDING PULSE RATE (bpm) 80 80 0 0STUDY-01-300001 2WEEK 4 SUPINE PULSE RATE (bpm) 76 76 0 0STUDY-01-300001 2WEEK 4 STANDING SBP (mmHg) 132 118 14 11.864STUDY-01-300001 2WEEK 4 SUPINE SBP (mmHg) 128 120 8 6.667STUDY-01-300001 2WEEK 4 STANDING DBP (mmHg) 78 80 -2 -2.5STUDY-01-300001 2WEEK 4 SUPINE DBP (mmHg) 78 80 -2 -2.5STUDY-01-300001 2WEEK 4 PULSE (STANDING-SUPINE) (bpm) 4 4 0 0STUDY-01-300001 2WEEK 4 SBP (STANDING-SUPINE) (mmHg) 4 -2 6 -300STUDY-01-300001 2WEEK 4 DBP (STANDING-SUPINE) (mmHg) 0 0 0STUDY-01-300001 2WEEK 4 PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 2NO 2NO 0 0STUDY-01-300001 2WEEK 4 PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 2NO 2NO 0 0

Page 13: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017 13

PDW Dataset Metadata

MEMNAME MEMLABEL FILESIZE NOBS CRDATE MODATE

ADVS Vital Signs Analysis Data 105775104 232322 4/14/2017 4/14/2017

Page 14: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017 14

PDW Variable Metadata MEMNAME NAME LABEL TYPE FORMAT VARNUM LENGTHADVS ADT Actual Date of Vital Signs num DATE9. 2 8ADVS ADY Actual Day of Vital Signs num 3 8ADVS APHASE Analysis Phase char 16 20ADVS APHASEC Analysis Phase Code num 17 8ADVS ARFENDT Analysis Reference End Date num DATE9. 18 8ADVS ARFSTDT Analysis Reference Start Date num DATE9. 19 8ADVS ASTNRHI Vital Sign Significant Range High num 62 8ADVS ASTNRIND Vital Sign Significant N/L/H Indicator char 63 1ADVS ASTNRLO Vital Sign Significant Range Low num 61 8ADVS AVAL Numeric Value num 14 8ADVS AVALC Character Value char 15 50ADVS AVISIT Analysis Visit char 23 20ADVS AVISITN Analysis Visit Number num 5 8ADVS BASE Value at Baseline num 57 8ADVS BASEC Character Value at Baseline char 58 50ADVS CHG Change from Baseline num 59 8ADVS PARAM Parameter char 7 40ADVS PARAMCD Parameter Code char 6 8ADVS PCHG Percent Change from Baseline num 60 8ADVS STUDYID Study Id char 24 18ADVS SUBJID Subject Number char 25 8ADVS TRTP Treatment Group char 50 40ADVS TRTPN Treatment Group Code num BEST12. 49 8ADVS USUBJID Unique Subject Id char 1 31ADVS VISIT Visit char 29 28ADVS VISITNUM Visit Number num BEST12. 4 8ADVS VSSEQ Vital Signs Sequence Number num BEST12. 30 8

Page 15: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017 15

PDW Value Level Metadata DOMAIN VARIABLE CODE DECODE TYPEADVS ADT (17FEB2004,24MAY2005) numADVS ADY (-14,76) numADVS AVAL (-60,236.5) numADVS AVALC YES charADVS AVALC NO charADVS AVALC NORMAL <25 charADVS AVALC OVERWEIGHT 25-<30 charADVS AVALC OBESE >=30 charADVS PARAM WEIGHT (kg) charADVS PARAM HEIGHT (cm) charADVS PARAM BODY MASS INDEX (kg/m2) charADVS PARAM BMI CLASSIFICATION charADVS PARAM TEMPERATURE (c) charADVS PARAM STANDING PULSE RATE (bpm) charADVS PARAM SUPINE PULSE RATE (bpm) charADVS PARAM STANDING SBP (mmHg) charADVS PARAM SUPINE SBP (mmHg) charADVS PARAM STANDING DBP (mmHg) charADVS PARAM SUPINE DBP (mmHg) charADVS PARAM PULSE (STANDING-SUPINE) (bpm) charADVS PARAM SBP (STANDING-SUPINE) (mmHg) charADVS PARAM DBP (STANDING-SUPINE) (mmHg) charADVS PARAM PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 charADVS PARAM PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 char

Page 16: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017 16

PDW Parameter-Value Level Metadata

DOMAIN PARAMVAR PARAMCODE VALUEVAR VALUECODE DVALUEVAR DVALUECODEADVS PARAM BMI CLASSIFICATION AVAL 1 AVALC NORMAL <25

ADVS PARAM BMI CLASSIFICATION AVAL 2 AVALCOVERWEIGHT 25-<30

ADVS PARAM BMI CLASSIFICATION AVAL 3 AVALC OBESE >=30ADVS PARAM BODY MASS INDEX (kg/m2) AVAL (14.7,72.2) AVALCADVS PARAM DBP (STANDING-SUPINE) (mmHg) AVAL (-58,73) AVALCADVS PARAM HEIGHT (cm) AVAL (127,205) AVALCADVS PARAM PULSE (STANDING-SUPINE) (bpm) AVAL (-49,80) AVALC

ADVS PARAM PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 AVAL 1 AVALC YES

ADVS PARAM PULSE(STD-SUP)>15 AND DBP(STD-SUP)<-10 AVAL 2 AVALC NO

ADVS PARAM PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 AVAL 1 AVALC YES

ADVS PARAM PULSE(STD-SUP)>15 AND SBP(STD-SUP)<-20 AVAL 2 AVALC NOADVS PARAM SBP (STANDING-SUPINE) (mmHg) AVAL (-60,59) AVALCADVS PARAM STANDING DBP (mmHg) AVAL (20,161) AVALCADVS PARAM STANDING PULSE RATE (bpm) AVAL (38,176) AVALCADVS PARAM STANDING SBP (mmHg) AVAL (64,210) AVALCADVS PARAM SUPINE DBP (mmHg) AVAL (35,160) AVALCADVS PARAM SUPINE PULSE RATE (bpm) AVAL (40,137) AVALCADVS PARAM SUPINE SBP (mmHg) AVAL (67,210) AVALCADVS PARAM TEMPERATURE (c) AVAL (35,38.6) AVALCADVS PARAM WEIGHT (kg) AVAL (33,236.5) AVALC

Page 17: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

PDW Data Integration Work Flow

Select Studies• TA/Indication/Compound• Phase 2/3• Reporting Effort, Type of Data

Select Datasets

• Demographic• Lab• Vital Sign• Concomitant Medication• Efficacy• …

Select Variables Add Conditions

Submit/Run

• All Variables• ARM= PLACEBO• Retrieve data

Get Integrated Data

• SAS• CSV• XPT

17

*PDW metadata are designed to facilitate across-study metadata views for user to search, browse, select and export data of interest.

Page 18: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

PDW Data IntegrationVariable selection and condition set-up from variable values

18

Page 19: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

PDW Metadata/Data Volume

Dataset>38K

Variable>1.1M

Value>30M

Parameter Value

Metadata*

445 studies, 1189 SAS folders

SAS Data Metadata

Patient Data from Clinical Studies*

Study Information Metadata

>30B rows(1B SAS records)

19

Page 20: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017

Challenges of BIG Data/Metadata• Volume• Framework/visualization design• Dynamic metadata: in need of dynamic data

query for multiple variables distinct values• Application run-time performance• Data integrity - alignment of data standard

• Datamart creation• Security and data governance• Results validation/verification• Cross-function collaboration• Demand on a strong and innovative partner• Resistance to change• Ideal: Metadata-Driven

20

Page 21: Data Integration” Ought to be Automated! · • Data integration is one of the most challenging support activities during compound development cycle. Data harmonization: clean,

© CDISC 2017 21

THANK YOU!