Anonymity through Data cubes

Post on 23-Feb-2016

Linked2Safety Project (FP7-ICT-2011-7 – 5.3)

A NEXT-GENERATION, SECURE LINKED DATA MEDICAL INFORMATION SPACE FOR SEMANTICALLY-INTERCONNECTING ELECTRONIC HEALTH RECORDS AND CLINICAL TRIALS SYSTEMS ADVANCING PATIENTS SAFETY IN CLINICAL RESEARCH

12th International Conference on Bioinformatics and Bioengineering, Larnaka

Anonymity through Data cubes

Athos Antoniades

FP7, ICT-2011 – 5.3 Page 2

Introduction

• Why Share Data?
• What are the current legal and ethical limitations?
• How have scientists shared medical data so far?
• Key Problems
• Perturbation
• Cell Suppression

The Problem

Why share data:
• Replication Testing
• Statistical Power
• Multiple Testing Problem

Legal and Ethical Issues:
• Anonymization vs Pseudonymization
• Limitations derived from the consent form signed by subjects
• Other regional, study-specific, or subject-specific issues

How have scientists shared medical data: Contingency Table and Data Cube example

          aa    aA    AA
Case      U00   U01   U02
Control   U10   U11   U12
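A minimal sketch of how such a table could be built from subject-level records before sharing, assuming each record is a (status, genotype) pair. The function name `contingency_table` and the sample records are hypothetical and only illustrate the layout; the returned counts correspond to the cells U00..U12 above.

```python
from collections import Counter

STATUSES = ("Case", "Control")   # table rows
GENOTYPES = ("aa", "aA", "AA")   # table columns

def contingency_table(records):
    """Count subjects per (status, genotype) cell; rows follow STATUSES."""
    counts = Counter(records)
    return [[counts[(status, genotype)] for genotype in GENOTYPES]
            for status in STATUSES]

# Hypothetical subject-level records (made up for illustration).
records = [
    ("Case", "aa"), ("Case", "aA"), ("Case", "aA"),
    ("Control", "AA"), ("Control", "aa"), ("Control", "aa"),
]

table = contingency_table(records)
for status, row in zip(STATUSES, table):
    print(status, row)
```

Only the aggregated counts leave the data provider; the subject-level records stay behind.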

16 year old widow Problem

A paper that analyzes data from a specific study reports:

Marital Status

Age     Married  Widowed  Single
0-16    0        1        50
18-24   10       5        50
25-34   40       7        40
35~     60       15       20

Note the Widowed count of 1 for ages 0-16: a cell this small can point to a single individual.

Categorization Differences

Paper 1 that analyzes data from a specific study reports:

Marital Status

Age     Married  Widowed  Single
0-16    NA       NA       50
18-24   10       7        50
25-34   40       7        40
35~     60       15       20

Paper 2 that analyzes data from the same study reports:

Marital Status

Age     Married  Widowed  Single
0-16    NA       NA       50
18-25   10       8        50
26-35   45       7        40
36~     55       14       20
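One way to read this slide, which the slide itself does not spell out, is that overlapping but different age bands let a reader subtract one paper's table from the other's. The sketch below does this for the Widowed column only. It assumes "35~" means 35 and older and "36~" means 36 and older; the variable names are illustrative.

```python
# Widowed counts as reported by each paper, keyed by age band.
paper1_widowed = {"18-24": 7, "25-34": 7, "35~": 15}
paper2_widowed = {"18-25": 8, "26-35": 7, "36~": 14}

# Ages 18-25 minus ages 18-24 leaves only subjects aged exactly 25.
widowed_aged_25 = paper2_widowed["18-25"] - paper1_widowed["18-24"]

# Ages 35+ minus ages 36+ leaves only subjects aged exactly 35
# (assumption: "~" denotes an open-ended upper band).
widowed_aged_35 = paper1_widowed["35~"] - paper2_widowed["36~"]

print("widowed, aged 25:", widowed_aged_25)
print("widowed, aged 35:", widowed_aged_35)
```

Each paper suppresses its own small cells, yet the difference between the two still isolates very small groups.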

Perturbation and Cell Suppression

Original Data

Marital Status

Age     Married  Widowed  Single
0-16    0        1        50
18-24   10       7        50
25-34   40       7        40
35~     60       15       20

Perturbation (±1) and Cell Suppression (<5)

Marital Status

Age     Married  Widowed  Single
0-16    NA       NA       51
18-24   9        8        49
25-34   40       7        41
35~     61       14       21
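A minimal sketch of the two protections, under assumptions the slide does not confirm: noise is drawn uniformly from the integers in [-noise, noise], suppression is decided on the original count, and a suppressed cell is represented as `None` (the NA above). The function name `protect` is hypothetical.

```python
import random

def protect(table, noise=1, threshold=5, rng=None):
    """Suppress cells with count < threshold, perturb the rest by +-noise."""
    rng = rng or random.Random()
    protected = []
    for row in table:
        new_row = []
        for count in row:
            if count < threshold:
                new_row.append(None)  # cell suppression (NA)
            else:
                # Perturbation; clamp at zero so counts stay non-negative.
                new_row.append(max(0, count + rng.randint(-noise, noise)))
        protected.append(new_row)
    return protected

# Original counts from the slide (rows: 0-16, 18-24, 25-34, 35~).
original = [
    [0, 1, 50],
    [10, 7, 50],
    [40, 7, 40],
    [60, 15, 20],
]

for row in protect(original, noise=1, threshold=5, rng=random.Random(0)):
    print(row)
```

Because the noise is random, the perturbed values will generally differ from the ones shown on the slide; only the two suppressed cells are deterministic.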

Evaluation

• Most common parameters tested:
  Perturbation: [0], [-1,1], [-3,3], [-5,5], [-10,10]
  Cell Suppression: <0, <=1, <=3, <=5, <=10

• Standard main effect test using Chi Square

• Pearson’s Correlation Coefficient used to evaluate the deviation of each parameter combination from the original results.

• A priori defined threshold for Pearson’s correlation coefficient <=0.95.
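The evaluation loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the quantity being correlated is taken to be the chi-square statistic of each table, only perturbation is applied (how suppressed cells enter the test is not stated on the slide), and `TOY_TABLES` holds made-up counts.

```python
import math
import random

def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def perturb(table, noise, rng):
    """Add uniform integer noise in [-noise, noise] to every cell."""
    return [[max(0, c + rng.randint(-noise, noise)) for c in row]
            for row in table]

# Made-up case/control genotype tables, one per hypothetical variable.
TOY_TABLES = [
    [[30, 50, 20], [35, 45, 20]],
    [[10, 40, 50], [30, 40, 30]],
    [[25, 50, 25], [25, 50, 25]],
    [[5, 35, 60], [40, 40, 20]],
]

def evaluate(tables, noise_levels, seed=0):
    """Correlate original and perturbed test statistics per noise level."""
    rng = random.Random(seed)
    baseline = [chi_square(t) for t in tables]
    return {
        noise: pearson(baseline,
                       [chi_square(perturb(t, noise, rng)) for t in tables])
        for noise in noise_levels
    }

results = evaluate(TOY_TABLES, noise_levels=(0, 1, 3, 5, 10))
for noise, r in results.items():
    print(f"perturbation [-{noise},{noise}]: r = {r:.3f}")
```

Each noise level yields one correlation value, which can then be compared against the predefined threshold.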

Evaluating Parameters with a matrix of graphs

Linked2Safety’s Data Analysis Space

Objectives:
• Design and develop the data mining techniques and the scalable infrastructure for the identification of phenotypic and genetic associations related to adverse events.
• Develop new and implement existing state-of-the-art analytical approaches for genetic data.
• Define and implement the knowledge extraction and filtering mechanisms and the knowledge base.
• Integrate the knowledge base into a lightweight decision support system (Adverse events early detection mechanism).

Data Analysis Steps

Quality Control Subspace

Provides the tools for identifying and removing erroneous data or data that do not conform to the quality standards that a user might define.

Tools:
• Hardy-Weinberg Equilibrium Test
• Allele Frequency Test
• Missing Data Test
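As an illustration of the first tool, the sketch below runs a chi-square Hardy-Weinberg equilibrium test on the genotype counts of one biallelic marker. It is the textbook formulation with 1 degree of freedom, not necessarily what the platform implements; the function name `hwe_chi_square` is hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_aA, n_AA):
    """Return (chi-square statistic, p-value) for deviation from HWE."""
    n = n_aa + n_aA + n_AA
    p = (2 * n_AA + n_aA) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele a
    observed = (n_aa, n_aA, n_AA)
    expected = (q * q * n, 2 * p * q * n, p * p * n)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Genotype counts in exact Hardy-Weinberg proportions.
print(hwe_chi_square(25, 50, 25))
# No heterozygotes at all: strong deviation from equilibrium.
print(hwe_chi_square(50, 0, 50))
```

A marker whose p-value falls below a user-defined cutoff would be flagged for removal during quality control.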

Feature Selection Subspace

Provides the tools for removing redundant or irrelevant features from a dataset.

Tools:
• Rough Set Feature Selection
• Information Gain Feature Selection
• Chi Squared Feature Selection
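To make the information gain criterion concrete, here is a small sketch that scores one categorical feature against a class label. It uses the standard entropy-based definition; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy left after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (made up): one informative and one irrelevant feature.
labels = ["event", "event", "none", "none"]
informative = ["aa", "aa", "AA", "AA"]
irrelevant = ["x", "y", "x", "y"]

print(information_gain(informative, labels))
print(information_gain(irrelevant, labels))
```

Features whose gain is close to zero carry little information about the outcome and are candidates for removal.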

Single Hypothesis Testing Subspace

Provides the tools for performing single hypothesis testing on a dataset and testing for associations.

Tools:
• Pearson’s Chi Square Test
• Fisher’s Exact Test
• Odds Ratio
• Binomial Logistic Regression
• Linkage Disequilibrium
• Genetic Region Based Association Testing
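As one example from this list, the odds ratio for a 2x2 table can be sketched as below, with an approximate 95% confidence interval from the usual log-scale standard error. The function assumes all four cells are non-zero; its name and the sample counts are hypothetical.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and approximate confidence interval for the 2x2 table
    [[a, b], [c, d]] (e.g. exposed/unexposed cases over controls).
    Assumes every cell is non-zero."""
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Made-up counts: 20 exposed and 10 unexposed cases,
# 10 exposed and 20 unexposed controls.
estimate, lower, upper = odds_ratio(20, 10, 10, 20)
print(f"OR = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 suggests an association between exposure and outcome at roughly the 5% level.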

Data Mining Subspace

Provides the tools for performing data mining analyses on a dataset and extracting association rules.

Tools:
• Association Rules (apriori)
• Decision Trees with Percentage Split (C4.5)
• Decision Trees with Cross Validation (C4.5)
• Random Forest with Percentage Split
• Random Forest with Cross Validation
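The two measures that association rule mining builds on, support and confidence, can be sketched in a few lines. This is not the apriori algorithm itself, only the scoring of a single candidate rule; the function names and the toy subject profiles are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Toy subject profiles, each a set of observed attributes (made up).
profiles = [
    {"smoker", "hypertension"},
    {"smoker", "hypertension"},
    {"smoker"},
    {"hypertension"},
]

print(support(profiles, {"smoker", "hypertension"}))
print(confidence(profiles, {"smoker"}, {"hypertension"}))
```

An apriori-style search would enumerate itemsets whose support exceeds a minimum and keep the rules whose confidence is high enough.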

Data Analysis Space Interactions

Knowledge Extraction and Filtering Mechanism

Knowledge Extraction Mechanism
• This mechanism is responsible for storing statistically significant associations and important association rules in the Linked2Safety knowledge database.
• Has two steps: the logging system, and storing important knowledge.

Filtering Mechanism
• This mechanism allows users to insert or delete associations and association rules.

Adverse Event Early Detection Mechanism

• Uses the knowledge in the L2S knowledge base
• Runs in the background to identify new associations and association rules
• Reruns analyses when updated datasets are available
• Creates alerts for patient profiles associated with adverse events

Linked2Safety’s Data Analysis Platform

Linked2Safety’s Data Analysis Platform Workflow Screenshot

Patterns Discovery Common Variable Selection

Overlapping non-genetic data of at least 2 data providers.

Variables:

Age                                 Weight gain
Gender                              Headaches
BMI                                 Gastrointestinal symptoms
Smoking Ever                        Ophthalmological problems
Dyslipidemia                        Type of ophthalmological condition
Diabetes                            High blood pressure
Diabetes type I                     Heart conditions exist
Diabetes type II                    Type of heart condition
Anemia                              Hypertension
Depressive personality disorder     Myocardial infarction
Major depressive disorder           Stroke
Schizotypal personality disorder    Coronary heart disease

Conclusion and future work on utilizing data cubes

We were able to identify for a given dataset the maximum noise that can be added to the data without significantly affecting the outcomes.

The results presented are relevant only to MASTOS; for any other dataset, the analytical approach described must be repeated to determine the maximum noise that can be added.

Further investigation is necessary to identify the minimum parameter settings to satisfy legal and ethical requirements.

Who to Contact

Athos Antoniades

University of Cyprus

email: athos@cs.ucy.ac.cy
