data management for research michael a. kohn, md, mpp january 4, 2005

64
Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Upload: marybeth-shepherd

Post on 03-Jan-2016

218 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Data Management for Research

Michael A. Kohn, MD, MPP

January 4, 2005

Page 2: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Assumptions about Students

• Actively involved in a clinical research study• Some experience with entering and maintaining

data in single-table spreadsheet or statistical software

• Some of you are here mainly to learn how to query an existing database

3 groups: ATCR/MCR, EPI non-ATCR/MCR, and CRC (GCRC or PCRC).

Page 3: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Housekeeping• Better of the 2 course websites: http://www.gcrc.ucsf.edu/PCRC/DBMSClass/DatabaseClass.htm

• Check Lab Sheet to confirm/sign up for a lab session. (Sign up for the Tuesday 8:15 and 9:15 labs is restricted to ATCR/MCR students.)

• Labs will be in S165A.• Bring a diskette, zip disk, or USB “memory stick,”

and your syllabus to labs. (No printing in labs.)• Syllabus and “Learn MS Access 2000” CD

Page 4: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Lab Instructors

Kari Mazurek (Course Administrator)Mike JarrettAndrew HighMandana Khalili(You will find that interacting with the lab

instructors, during labs, outside of labs, and via email, to be the most valuable part of this course.)

Page 5: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Course Objectives

Learn how to develop a multi-table, relational database for a research study. We will be using Microsoft Access, but we are familiar with other database software.

Learn how to query a database for monitoring and analyzing data in a research study.

Example: Infant Jaundice Study

Page 6: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Requirements

• Turn in all 4 assignments on time

• Fill out course evaluation.

Page 7: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

AssignmentsLab 1: Tables and Relationships 01/11 or 01/12Send Access file LastnameLab1.mdb to [email protected] by 1/17 at 5 pm.

We will work through these assignments in the labs, so you don’t need to have Access2000 at home.

Lab 2: Queries, Reports, Importing Data 01/18 or 01/19Save Access file as LastnameLab2.mdb Send to [email protected] by 1/24 at 5 pm.

Page 8: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Assignments (cont’d)

Optional (Required for ATCR.MCR): Write a sentence or two for the “Methods” section on inter-rater reliability. (Use Bland and Altman, BMJ 1996; 313:744)

Lab 3: Querying, Exporting, and Analyzing Data

Option A (Required for ATCR/MCR):

Determine if neonatal jaundice was associated with the 5-year neuropsychiatric scores and create a table, figure, or paragraph appropriate for the “Results” section of a manuscript summarizing the association.

Send assignment to [email protected] by 1/31 at 5 pm .

Page 9: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Assignments (cont’d)Lab 3: Querying, Exporting, and Analyzing Data 01/25 or 01/26

Option B (not for ATCR/MCR students):

Answer a research question of your own by querying an existing database. Display your results in a paragraph, table, or figure appropriate for presentation to others in your field.

Send assignment to [email protected] by 1/31 at 5 pm.

Page 10: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Assignments (cont’d)Class session 5 (not a lab): Planning and Budgeting for Data Management

Option A:

Write a one-page data management section for your research study protocol and create a budget for data management. (Please include an Access file for your database, if you have one. Also include a one-sentence summary of your study.)

Send assignment to [email protected] by 2/14 at 5 pm

Page 11: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Assignments (cont’d)Class session 5 (not a lab): Planning and Budgeting for Data Management

Option B:

Write a one-page description with a relationships diagram for the database with which you currently work.

Send assignment to [email protected] by 2/14 at 5 pm.

Page 12: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Data Management for Clinical Research

We know how to

• define the study population, the independent variables and the outcome variables;•measure these variables and anticipate problems with measurement;•analyze the results.

Page 13: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

The DBMS (Database Management System) is for

• entering and storing the measurements, • entering and storing the other information necessary to administer the study (subject contact information, exam schedules, reimbursement records, etc.),

• monitoring the study, and

• either analyzing the results or formatting the results for analysis.

Data Management for Clinical Research

Page 14: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Four Types of Research Database

1. Combination of paper files, Excel spreadsheets, and direct keyboard entry into the statistical analysis package.

2. Desktop multi-table relational database.

3. Client-Server multi-table relational database.

4. Internet database server.

Page 15: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

•Ease of data entry

•Automatic data validation

•Automatic error checking

•Alternative is a stack of paper forms

Advantages of a computerized database

Page 16: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Advantages of a Multi-Table Relational Database

•Eliminates redundancy•Ensures data integrity

Note: Unless you plan on doing your analysis long-hand, you always need a computer database of some sort (a Stata dataset or an Excel spreadsheet may be adequate); you don’t always need a multi-table relational DBMS (like Microsoft Access).

Page 17: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Collection of spreadsheet-like, two-dimensional tables.

Rows in Tables = Records

Columns in Tables = Attributes

Tables are related one-to-many, many-to-many, and one-to-one.

Multi-Table Relational Database

Page 18: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Jaundice and Infant Feeding Study

Cohort study to determine the 5-year neuropsychiatric sequelae of infants with neonatal jaundice or feeding disorders.

Page 19: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Methods:

Design-Cohort study.Setting-Single, urban medical centerSubjects-Infants with neonatal jaundice and randomly selected non-jaundiced infantsPredictor Variable-Presence or absence of jaundiceOutcome Variable- Neuropsychiatric score (ranging from 55 to 145) at age 5Analysis- ?

Infant Jaundice Study(Our fictional version of JIFee)

Page 20: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Infant Jaundice Study Data

1. Approximately 400 children2. 5 examiners (doctors)3. Approximately 700 neuropsychiatric examinations,

measuring weight, height, and “NPScore” (IQ)4. Some children to be examined more than once5. No examiner to see the same child twice6. If child died before age 5, store age and circumstances of

death

Page 21: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Infant Jaundice Study Table of Subjects = “Baby”

Row = Individual Infant

Columns = ID#, Name, DOB, Sex, Jaundice.

If one set of measurements per infant, put measurements in subject table.

This is a single-table database.

Table of Study Subjects

Page 22: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005
Page 23: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demonstration: Creating a Data Table

Label columns and enter rows of data in datasheet view

Page 24: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demonstration: Data Dictionary

Table design view:•field (=column) names, •data types, •definitions, •validation rules

(More on data types, free-text vs. coded responses, later)

Page 25: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005
Page 26: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demonstration

Disallowed values

Duplicate primary keys

This automatic error checking and data validation IS why you need to enter your data into a computer; it is NOT why you need a relational DBMS. Many single-table products (Filemaker Pro, SAS FSP, even Excel) can do error checking and data validation.

Page 27: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Acceptable table showing one set of exam results per participant.(BabyExamForFigure3)

Page 28: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demonstration: Same Table in Excel, Stata

• Excel

• Stata

• Etc

Rows = Records = Entities

Columns = Fields = AttributesAccess and Stata have a special row at the top for column headings (=field names); Excel just uses the first row.

Page 29: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Table of Study Subjects

Row = Individual Infant

Columns = ID#, Name, DOB, Sex, Jaundice

If some infants have more than one exam, what do you do?

Table of Study Subjects

Page 30: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Undesirable table showing multiple exam results per study participant.(BabyExamForFigure4)

Page 31: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demo

• Find highest IQ Score

• Find all exams done in April

Page 32: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Common Error

• If you find yourself creating multiple columns for the same measurement, e.g., Date1, Score1, Date2, Score2, Date3, Score3, …

• Or if your table is more than about 30 columns wide,

– It is time to restructure your table.

Page 33: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Undesirable table with participant-specific data duplicated for each exam. (Note problem with Helen’s DOB.)(ExamBabyForFigure5)

Page 34: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demo

• Find highest IQ Score

• Find all exams in a particular month

• What happened to Alejandro, Ryan, Zachary, and Jackson?

Page 35: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

If some infants have multiple exams,

“normalize” the records into two tables, one for subjects and one for examinations.

Normalization

Page 36: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Data normalized into two tables: one (“Baby”) with rows comprising subject-specific information; the other (“Exam”) with rows comprising exam-specific information. Note that Helen can only have one birth date. Subjects with no exams, e.g. Alejandro, still appear in the database. “SubjectID” functions as the primary key in the “Baby” table and as the foreign key in the “Exam” table.

Page 37: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Figure 7. Relationships diagram showing the one-to-many relationship between the table of subjects (“Baby”) and the table of measurements (“Exam”).

Page 38: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demonstration

Inability to create integrity violations with normalized tables.

This IS why you need a multi-table relational DBMS.

Page 39: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Analogy to Double Data Entry• Having different examiners see the same 5-year-

old to establish the interrater reliability of the IQ score is analogous to doing double data entry.

• The same table structures and relationships would exist if the objective were to check data entry off of paper forms.

• When entering data directly into on-screen forms (with their automatic range checks and validation routines), double data entry may not be necessary.

Page 40: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Table of Examiners

Neuropsychiatric outcomes are assessed by 5 different examiners (doctors)

May want to assess whether examiner characteristics (sex, specialty, age) affect neuropsychiatric scores

Doctor examines many children; each child may have more than one exam; but a child is never examined by the same doctor twice.

Page 41: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Table of examiners with multiple examiner-specific fields.

Page 42: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Figure 9. Undesirable table in which examiner-specific data is repeated with each examination. (Note that Dr. Novello is a female pediatrician for two examinations and a male internist for an exam in between.)

Page 43: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Figure 10. Normalization into two tables, one for exam-specific information and one for examiner-specific information. (Note that Dr. Novello cannot change specialty or gender between examinations.) “DocID” functions as a second foreign key in the “Exam” table. (The other foreign key is “SubjectID”.)

Page 44: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Figure 11. Relationships diagram showing the relationships between the table of subjects (Baby), the table of measurements (Exam) and the table of examiners (Doctor). The “Exam” table functions as a linkage or join table between “Baby” and “Doctor” creating a “many-to-many” relationship between study subjects and examiners.

Page 45: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

One-to-One Relationship: Infants and Deaths.

Page 46: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Figure 12. Some fields are subject specific but valued for only a few subjects. Maintaining columns for these fields in the table of subjects leads to empty fields and wasted space.

Page 47: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Figure 13. Creating a separate table with a one-to-one relationship eliminates the empty fields and wasted space.

Page 48: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Figure 14. The relationships diagram now includes a table (“Death”) with a one-to-one relationship with the table of subjects (“Baby”). A subject can only have one record in the one-to-one-related table, but the vast majority of subjects will not have any “Death” record.

Page 49: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Undesirability of Storing Calculated Values

Store raw data, not calculated fields, e.g., store dates and times; calculate intervals.

 

Storing a patient’s birth date allows calculation of his or her exact age on the date of a particular measurement.

Page 50: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Figure 15. Calculated fields such as “AgeInMonths” are undesirable. What if the birth date for SubjectID 2322 (Helen) is corrected in the “Baby” table?

Page 51: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Select Queries

Select queries (aka “Views”) organize, sort, filter, and display data.

Queries use Standard Query Language (SQL), but you don’t have to learn it, because of graphical query design tools.

A query can join data from two or more tables, display only selected fields, and filter for records that meet certain criteria.

Page 52: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demonstration

Age in months and BMI at exam of subjects who were examined in January and February of 2010.

Page 53: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Select Queries Produce “Table-Like” Results

Note that the result of a select query that joins two tables, displays only certain fields, selects rows based on special criteria, and calculates age and BMI still looks like a table in datasheet view.

But, remember that it is merely a “view” of data from the underlying tables.

Page 54: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

“Action Queries” Change Data

1) Update Query -- changes the values of specific fields in existing records*

2) Append Query -- adds new records (rows) to a table

3) Delete Query -- deletes records from a table

*Not covered in lab this year

Page 55: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Standard Data Entry Conventions

Several conventions for data entry and display have developed over time. Although most users of screen forms are not aware of these conventions, they have come to expect them subconsciously. For example, a series of mutually exclusive, collectively exhaustive choices is usually displayed as an “option group” consisting of several different “radio buttons”, whereas choices which are not mutually exclusive are displayed as check boxes.

N.B. An “option group” of mutually exclusive choices is a single column or field. A group of N check boxes represents N yes/no fields.

Page 56: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Use check boxes when options are not mutually exclusive. (5 fields)

Use radio buttons when options are mutually exclusive. (1 field)

Computer chart abstraction form showing two common data entry conventions.

Page 57: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Demonstration

Option group for examiner’s medical specialty

Page 58: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Guidelines for Data Management in Clinical Research

Establish the database tables, their rows and columns, and their relationships correctly at the outset.

  A poorly organized database makes data maintenance and retrieval nearly impossible. Make sure the data are normalized. The data structures should never require duplicate data entry or redundant storage.

? MS Genetics Example

Page 59: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Guidelines for Data Management in Clinical Research

Establish and follow naming conventions for columns and tables.  Short field names without spaces or underscores are convenient for programming, querying, and other manipulations. Instead of spaces or underscores, use “IntraCaps” (upper case letters within the variable name) to distinguish words, e.g. “StudyID”, “FName”, “FdDisord”, or “ExamDate”. Table names should be singular, e.g. “Baby” instead of “Babies”, “Exam” instead of “Exams”.

Page 60: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Guidelines for Data Management in Clinical Research

Obtain baseline demographic and clinical information about members of the study population from existing computer databases.

 

Avoid re-entering data which are already available (in digital formats) from other sources. In the JIFee Study, the patient demographic data and contact information are obtained from the hospital database. Computer systems can almost always produce text-delimited or fixed-column-width character files that the database management system can import.

Page 61: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

On-screen vs. paper formsMinimize the extent to which study measurements are recorded on paper

forms. Enter data directly into the computer database or move data from paper

forms into the computer database as close to the data collection time as possible.

When you define a variable in a computer database, you specify both its format and its domain or range of allowed values. Using these format and domain specifications, computer data entry forms give immediate feedback about improper formats and values that are out of range. The best time to receive this feedback is when the study subject is still on site.

You can always print out a paper copy of the screen form or a report of the exam/interview results once the data are collected.

Examples: ATM Machine’s printed transaction record, Gas Station’s printed receipt

Page 62: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Guidelines for Data Management in Clinical Research

Back up the database regularly and check the adequacy of the back up procedure by periodically restoring a file from the back up medium.

Page 63: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Desktop DBMS

Microsoft Access

Claris Filemaker Pro

Paradox

Microsoft Visual FoxPro

Dataease

The processing of records is done by the desktop. The server simply stores files (file server).

Page 64: Data Management for Research Michael A. Kohn, MD, MPP January 4, 2005

Client-Server DBMS

Microsoft SQL Server

Oracle

Informix

Sybase

The processing of records is done by the server. The desktop manages the screen, but passes queries on to the server. (Just to confuse things, MS Access can be a client for SQL Server, and other enterprise systems. The ultimate in “thin” clients is a browser (Internet Explorer). In this case, the server is an intranet or internet database server.)