A Data-Driven Approach for Improved Effective Classification in Predictive Toxicology ICCC 2006, Tallinn Dr. Daniel NEAGU, Dr. Gongde GUO

A Data-Driven Approach for Improved Effective Classification in Predictive Toxicology

ICCC 2006, Tallinn
Dr. Daniel NEAGU, Dr. Gongde GUO

Bradford, UK

Bradford, West Yorkshire

National Museum of Film and Television

School of Informatics, University of Bradford

Overview

Short Introduction to Predictive Toxicology Data and Models

The Current Context on Interspecies Data Extrapolation

Our Motivation and Approach

Algorithm for Data-driven Hybrid Classification

Model development

Case studies

Results and Conclusions

Predictive Data Mining

The process of data classification or regression whose goal is to obtain predictive models for a specific target, based on predictive relationships among a large number of input variables.

Classification defines characteristics of data and identifies a data item as a member of one of several predefined categorical classes.

Regression uses the existing numerical data values and maps them to a real-valued prediction (target) variable.
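To make the distinction concrete, the following is a minimal sketch that applies the same nearest-neighbour idea once for classification and once for regression. The training rows, the single descriptor and the helper names are hypothetical and purely illustrative; they are not taken from the datasets discussed in these slides.

```python
from collections import Counter

# Hypothetical training data: (descriptor value, toxicity class, toxicity value).
# The numbers are illustrative only and do not come from any real dataset.
TRAIN = [
    (0.5, "toxic", 1.2),
    (0.7, "toxic", 2.0),
    (2.9, "non-toxic", 40.0),
    (3.4, "non-toxic", 55.0),
    (3.1, "non-toxic", 47.0),
]


def k_nearest(x, k):
    """Return the k training rows whose descriptor is closest to x."""
    return sorted(TRAIN, key=lambda row: abs(row[0] - x))[:k]


def classify(x, k=3):
    """Classification: return the most common class among the k neighbours."""
    votes = Counter(row[1] for row in k_nearest(x, k))
    return votes.most_common(1)[0][0]


def regress(x, k=3):
    """Regression: return the mean target value of the k neighbours."""
    neighbours = k_nearest(x, k)
    return sum(row[2] for row in neighbours) / len(neighbours)


if __name__ == "__main__":
    print(classify(3.0))  # a categorical class label
    print(regress(3.0))   # a real-valued prediction
```

The two functions share the same neighbour search; only the way the neighbours' targets are aggregated differs (a vote over labels versus an average over values).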

Predictive Toxicology

Predictive toxicology is a multi-disciplinary science that requires close collaboration among toxicologists, chemists, biologists, statisticians and AI/ML researchers.

The goal of toxicity prediction is to describe the relationship between chemical properties and biological and toxicological processes: it relates features of a chemical structure to a property, effect or biological activity associated with the chemical.

Data in Predictive Toxicology

ML applications for Predictive Toxicology

The EC proposal for the REACH regulation indicates that the information requirements under REACH can be (partially) fulfilled by using scientifically valid (Q)SAR models.

To guide the validation of computer-based methods, five OECD principles for the validation of (Quantitative) Structure-Activity Relationships were adopted:
  a defined endpoint
  an unambiguous algorithm
  a defined domain of applicability
  appropriate measures of goodness-of-fit, robustness and predictivity
  a mechanistic interpretation, if possible

The Context for our Approach

Data from in vivo experiments:
  increased laboratory standards
  financial and social costs
  questionable outputs, given different initial conditions for tests and differing definitions of the output between experiments

In vitro generated data:
  reduces the costs of in vivo experiments
  dependent on artificial conditions
  focused on particular output measurements, without an integrated biological dependency and reaction

In silico data:
  depends on the computing and modelling resources
  far less expensive than the previous two
  one might define an inversely proportional relationship between data quality and data quantity

[Diagram: In Vivo Data, In Vitro Data, In Silico (Algorithms)]

Our Approach

Data availability: different chemical compounds are chosen and tested on different species for different purposes, and some of them are tested on more than one species for various experimental reasons:
  Sparse data sets
  Copyrighted
  Not homogeneous (endpoint, laboratory conditions, standards, measurement units)
  Distributed in time and sources

Further supporting experimental data for training classifiers are frequently limited and expensive.

Some endpoints show good correlations (e.g. aquatic toxicity measured for various fish species, daphnia, etc.).

Consequently, extrapolation methods can be used in regulatory toxicology to overcome these drawbacks.

The goal is to predict the toxic effects of chemical compounds on a particular species by considering both the toxicity values/classes of compounds that have been tested on that species and those of compounds tested on other species with correlated toxicity values/classes.

Multi-Classifier Systems

Different classifiers potentially offer complementary or at least additional information about patterns to be classified.

Various approaches to classifier combination:
  majority voting
  entropy-based combination
  Dempster-Shafer theory-based combination
  Bayesian classifier combination
  similarity-based classifier combination
  fuzzy inference
  gating networks
  statistical models
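As an illustration of the simplest entry in this list, majority voting can be sketched as below. The function name and the tie-breaking rule are assumptions made for this example; the slides do not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers.

    predictions: non-empty list of class labels, one per base classifier.
    Ties are broken in favour of the label that appears first in the list,
    which is an arbitrary choice made for this sketch.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


if __name__ == "__main__":
    # Three hypothetical base classifiers vote on one compound.
    print(majority_vote(["C1", "C2", "C2"]))
```

The other combination schemes in the list replace this simple count with weighted or probabilistic evidence, but they share the same interface: several base predictions in, one combined label out.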

Step 1: for each dataset, build a model on all instances with a predefined class label, and then use this model to predict any unclassified instances.

Step 2: for every pair of datasets, count the number of instances that have a predefined class label in both, and among them the numbers of exact matches, matches with distance 1 and matches with distance 2.

Step 3: find pairs of datasets with different endpoints whose toxicity classes are highly correlated, i.e. the rate of matches with distance ≤ 1 is greater than or equal to 90%.

Based on these investigations, and under the assumption that a high correlation exists between the classes that two datasets assign to the same chemical compounds, a hybrid integration scheme is proposed:

Step 4: for each dataset, build a model on the training set and use it to classify new instances. If the distance between the predicted class and the class of the same chemical compound in its most correlated dataset (with a different endpoint) is 2, give the new instance the class label of the latter.

We propose a Data-driven Multi-Classifier Model for correlated PT Data Sets

The Architecture of the Data-driven Multiple Classifier System for PT Interspecies Extrapolation

[Diagram: Model Training (Endpoint 1) and Model Training (Endpoint 2), each followed by Testing, with Descriptors and Class as the data columns]

The predicted class of an instance t: Ci

The class of an instance t (from the correlated endpoint): Cj

If d(Ci, Cj) ≤ δ, then C = Ci; otherwise C = f(Ci, Cj)
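The decision rule of the architecture can be sketched in a few lines. This is a minimal sketch under stated assumptions: classes are encoded as ordinal indices (1 = most toxic), d is the absolute difference of indices, δ defaults to 1, and f adopts the class from the correlated endpoint, which matches Step 4 of the algorithm for three classes. The function names and the handling of a missing Cj are hypothetical.

```python
def class_distance(ci, cj):
    """Distance between two ordinal class indices (assumed: 1 = most toxic)."""
    return abs(ci - cj)


def take_correlated(ci, cj):
    """Combination function f: adopt the class from the correlated endpoint."""
    return cj


def hybrid_class(ci, cj, delta=1, f=take_correlated):
    """Combine the predicted class ci with the correlated-endpoint class cj.

    cj is None when the compound has no class in the correlated dataset;
    in that case the prediction is kept unchanged (an assumption of this sketch).
    """
    if cj is None or class_distance(ci, cj) <= delta:
        return ci
    return f(ci, cj)


if __name__ == "__main__":
    print(hybrid_class(1, 2))     # distance 1: the prediction is kept
    print(hybrid_class(1, 3))     # distance 2: the correlated class is used
    print(hybrid_class(3, None))  # no correlated class: the prediction is kept
```

Passing f as a parameter keeps the rule open to other combination functions, since the slide only states C = f(Ci, Cj) without fixing f.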

Datasets

DEMETRA*

1. LC50 96h Rainbow Trout acute toxicity (ppm), 282 compounds
2. EC50 48h Water Flea acute toxicity (ppm), 264 compounds
3. LD50 14d Oral Bobwhite Quail (mg/kg), 116 compounds
4. LC50 8d Dietary Bobwhite Quail (ppm), 123 compounds
5. LD50 48h Contact Honey Bee (μg/bee), 105 compounds

*http://www.demetra-tox.net

Descriptors

Multiple descriptor types

Various software packages to calculate 2D and 3D attributes*

*http://www.demetra-tox.net

Model Development

Algorithms chosen for their representativeness and diversity, and for easy, simple and fast access:
  Bayes Networks (BN)
  Instance-Based Learning algorithm (IBL)
  Decision Tree learning algorithm (DT)
  Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
  Multi-Layer Perceptrons (MLPs)
  Support Vector Machine (SVM)

Experiments

1. For each dataset the most relevant descriptors were selected by considering the individual predictive ability of each descriptor along with the degree of redundancy between them: subsets of descriptors that are highly correlated with the class while having low intercorrelation were preferred.

2. A model based on all available training instances with predefined classes was built for each dataset and then used to predict unclassified instances.

3. The toxicity classes of the same chemical compounds were compared for two different endpoints. The difference between toxicity classes is measured by a distance function, defined for class labels in descending order of toxicity (C={c1, c2, ..., cm}, Toxicity(c1) ≥ Toxicity(c2) ≥ … ≥ Toxicity(cm)):

$$
\mathrm{Distance}(c_i, c_j) =
\begin{cases}
i - j, & \text{if } i \ge j \\
j - i, & \text{otherwise}
\end{cases}
$$

The pairs (Trout, Daphnia), (Bee, Dietary_Quail), (Dietary_Quail, Oral_Quail) are significantly correlated.
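The comparison of two endpoints can be sketched as follows: compute the class distance for every compound labelled in both datasets, count the matches per distance, and test the 90% criterion for distance ≤ 1 stated in Step 3 of the algorithm. Classes are assumed to be encoded as ordinal indices (1 = most toxic); the function names, the dictionary layout and the example compounds are hypothetical.

```python
from collections import Counter


def class_distance(ci, cj):
    """Distance between two ordinal class indices: the absolute index difference."""
    return abs(ci - cj)


def match_profile(classes_a, classes_b):
    """Count matches by distance over compounds labelled in both datasets.

    classes_a, classes_b: dicts mapping a compound identifier to a class index.
    Returns (number of shared compounds, Counter mapping distance -> count).
    """
    shared = classes_a.keys() & classes_b.keys()
    counts = Counter(class_distance(classes_a[c], classes_b[c]) for c in shared)
    return len(shared), counts


def is_correlated(classes_a, classes_b, max_distance=1, threshold=0.90):
    """True if the share of matches with distance <= max_distance meets the threshold."""
    n, counts = match_profile(classes_a, classes_b)
    if n == 0:
        return False
    close = sum(count for d, count in counts.items() if d <= max_distance)
    return close / n >= threshold


if __name__ == "__main__":
    # Hypothetical class labels for two endpoints; identifiers are made up.
    endpoint_a = {"cmpd_a": 1, "cmpd_b": 2, "cmpd_c": 3, "cmpd_d": 1}
    endpoint_b = {"cmpd_a": 1, "cmpd_b": 3, "cmpd_c": 3, "cmpd_e": 2}
    print(match_profile(endpoint_a, endpoint_b))
    print(is_correlated(endpoint_a, endpoint_b))
```

Only compounds present in both dictionaries contribute, which mirrors the restriction to instances that have a predefined class label in both datasets.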

Results

C1 stands for the highly toxic class; C2 stands for the medium toxic class; C3 stands for the non-toxic class; PTN is the percentage of toxic chemical compounds classified as non-toxic.
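One possible reading of PTN is sketched below. It assumes that classes are encoded as the indices 1 to 3, that "toxic" means a true class of C1 or C2, and that the percentage is taken over the truly toxic compounds; the slides do not state the denominator, so this choice and the function name are assumptions.

```python
def ptn(true_classes, predicted_classes, non_toxic=3):
    """Percentage of truly toxic compounds that are predicted as non-toxic.

    true_classes, predicted_classes: equal-length sequences of class indices.
    Returns 0.0 when there are no truly toxic compounds.
    """
    toxic = [(t, p) for t, p in zip(true_classes, predicted_classes) if t != non_toxic]
    if not toxic:
        return 0.0
    missed = sum(1 for _, p in toxic if p == non_toxic)
    return 100.0 * missed / len(toxic)


if __name__ == "__main__":
    # Hypothetical true and predicted classes for four compounds.
    print(ptn([1, 2, 3, 1], [3, 2, 3, 1]))
```

A lower PTN is the safer outcome in a regulatory setting, since it counts the toxic compounds that a model would wrongly clear.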

Conclusions

Regardless of whether an original classification method performs well or poorly, its counterpart that integrates the available correlation information obtained better performance.

In experiments on five toxicity datasets, the proposed hybrid classification system obtained better performance than each single classifier-based model.

Hybrid integration systems (IBL-HIS) reduced the percentage of toxic chemicals classified as non-toxic.

Acknowledgements

This work is part-funded by:
  EPSRC GR/T02508/01: Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach, http://pythia.inf.brad.ac.uk/
  EU FP5 Quality of Life DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of pesticide Residues in Agriculture, http://www.demetra-tox.net

Special thanks also to:
  Dr. Q. Chaudhry (CSL York)
  Dr. Mark Cronin (LJMU)

and PhD students:
  Ms. Ladan Malazizi, BSc, PhD student. Research Theme: Development of Artificial Intelligence-based in-silico toxicity models for use in pesticide risk assessment
  Mr. Paul Trundle, BSc, PhD student. Research Theme: Hybrid Intelligent Systems applied to predict Pesticide Toxicity
  Ms. Areej Shhab, BEng, MPhil. Research Theme: Applications of Machine Learning in Knowledge Discovery and Data Mining
  Mr. M. Craciun (University of Galati), BSc, MSc