Data integration - tasks and issues from a programming perspective

There are multiple difficulties in data integration (ISS/ISE), especially with the quality of the data that feeds into the integration. Quality must be everyone's responsibility at all levels. After all, a study is only as good as its data and the practices used to interpret that data. As programmers, we have a unique role in this process: we review the study data at a detailed level over the course of a study, review and approve data input specifications for completeness, and work with the study team to consolidate and clean the data in preparation for analysis. With proper processes and planning, we can resolve data issues and questions early so they do not cause rework later in the programming process.

A core component of data integration is review of the data across all studies to ensure it combines in ways that are consistent and meaningful. One study may use different visit windowing than another, so visits may need to be mapped to a single common set. There may be specialty subject populations that require derivation of new visits or new tests, such as for Hy's Law analysis. Medical coding dictionaries may not be the same across studies, and coding may need to be up-versioned to a common dictionary version. Test codes and parameters may not be the same across studies or may not match the SAP. This is where PROC FREQ in SAS is incredibly helpful for reviewing categories, visits, and similar values across all studies to ensure there are no outliers or issues that need to be resolved. If specific issues can be pinpointed, they can be raised with the study team while data cleaning is ongoing; issues found after database lock are tricky at best. Here is an example of using PROC FREQ to check lab categories across studies:

proc freq data=adlb;
  tables parcat1*parcat2*studyid / list missing;
run;
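The same kind of cross-study review applies to test codes and parameters. As a minimal sketch, assuming the integrated lab data sit in ADLB with the standard PARAMCD and PARAM variables, listing every parameter by study makes mismatches against the SAP stand out:

proc sql;
  /* list each parameter code and name by study; mismatched codes or
     spellings across studies will appear as separate rows */
  select studyid, paramcd, param, count(*) as nrec
  from adlb
  group by studyid, paramcd, param
  order by paramcd, studyid;
quit;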

Since the final analysis data must meet CDISC standards for SDTM and ADaM, a review of the combined data via the Pinnacle 21 application is also advised. There are instances where data categories or test names appear one way in EDC but do not match the expected values in CDISC controlled terminology. Items that do not meet controlled terms are best corrected in the SDTM and ADaM specifications and updated via programming. The expectation is that the final analysis data will match CDISC standards or include documented reasons for any differences.
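As a minimal sketch of that kind of correction, assuming a lab category comes from EDC as "HEMATOLOGY PANEL" while controlled terminology expects "HEMATOLOGY" (both values are illustrative only), the recode can live in the SDTM program so it is documented and repeatable:

data lb;
  set lb;
  /* align the EDC value with CDISC controlled terminology;
     the specific values shown here are hypothetical */
  if lbcat = 'HEMATOLOGY PANEL' then lbcat = 'HEMATOLOGY';
run;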

Data integration may also include data from external sources, such as pharmacodynamic data. As programmers, we are often tasked with reviewing external data specifications to ensure all data needed is captured. Additionally, there are several checks we can build in as the data is incorporated into the study. It is usually good practice to have a test data file sent and processed to confirm the data is as expected, and to incorporate checks of the data as it is processed during SDTM programming. Things to check include whether all subjects are present in the subject-level data (DM domain), whether all parameters needed for analysis are present (a sketch of such a check follows the merge example below), and whether visit windowing and timepoints align with the protocol. Even with EDC data, it is best to check that all subjects are present in subject-level domains. This can easily be done through a DATA step merge using the IN= dataset option:

proc sort data=prod.adeg out=adeg;
  by studyid usubjid subjid;
run;

proc sort data=prod.adsl out=adsl;
  by studyid usubjid subjid;
run;

data check;
  merge adeg(in=a) adsl(in=b);
  by studyid usubjid subjid;
  ineg=a;          /* flag: record exists in ADEG */
  insub=b;         /* flag: subject exists in ADSL */
  if a and not b;  /* keep ADEG records with no matching subject in ADSL */
run;
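A similar programmatic check can confirm that every parameter required for analysis is present. A minimal sketch, assuming the SAP calls for the hypothetical ECG parameter codes listed in the DATALINES block:

data expected;
  /* parameter codes the SAP calls for (hypothetical values) */
  length paramcd $8;
  input paramcd $;
  datalines;
QTCF
HR
PR
;
run;

proc sql;
  /* report any expected analysis parameter with no records in ADEG */
  select e.paramcd
  from expected as e
  where e.paramcd not in (select distinct paramcd from adeg);
quit;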

As part of the data cleaning process, programmers may also be tasked with creating files to support the team, such as Excel or CSV files. These might be requested as part of a cohort review or in preparation for database lock. While not part of a typical statistical function, such files can support a critical team activity. SAS ODS, LIBNAME statements, and PROC EXPORT can be used to create these files as needed, and the team may want to make the code repeatable if data reviews will be frequent. Maintaining study blinding is a critical consideration in this process: certain columns may need to be removed from the supporting Excel/CSV file to preserve the blind during data review.
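A minimal PROC EXPORT sketch, assuming a listing of ADLB records has been requested and that actual treatment variables such as TRT01A and TRT01AN would unblind the team (the output path is illustrative):

proc export
  data=adlb(drop=trt01a trt01an)  /* drop actual treatment columns to preserve blinding */
  outfile="lab_review.csv"        /* illustrative output location */
  dbms=csv
  replace;
run;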

Data integration can be a daunting task, especially when the quality of a parent study is questionable. However, through the steps mentioned above and other programmatic checks, we can identify inconsistencies in the parent studies' data, make informed decisions, and ensure the highest quality of individual study data before it enters the integration. Putting in a little extra effort before the integration ensures that the PHASTAR quality standard is met during and after the integration process.