e-LICO: An e-Laboratory for Interdisciplinary Collaborative research in data mining and data intensive sciences. October 12th, 2010. Delivering data mining to the Life Science Community. Simon Jupp, School of Computer Science, University of Manchester, United Kingdom


Page 1: E-LICO An e-Laboratory for Interdisciplinary Collaborative research in data mining and data intensive sciences October 12 th, 2010 Delivering data mining

e-LICO

An e-Laboratory for Interdisciplinary Collaborative research in data mining and data intensive sciences

October 12th, 2010

Delivering data mining to the Life Science Community

Simon Jupp

School of Computer Science

University of Manchester, United Kingdom

Page 2:

e-LICO project overview

Infrastructure to support collaborative, data mining enabled experimental research

Knowledge-driven planning of DM workflows

– Improve planning by meta-mining

Support research in data-intensive, knowledge-rich domains

– Systems biology use case

Page 3:

European Project

European Project, 9 partners (Month 20/36)

– Specialists from Data Mining, Semantic Web, Grid computing and Systems Biology

• University of Manchester, UK

• University of Geneva, Switzerland

• Inserm, France

• Jožef Stefan Institute, Slovenia

• NHRF, Greece

• Poznan University, Poland

• Rapid-I GmbH, Germany

• Ruđer Bošković Institute, Croatia

• University of Zurich, Switzerland

An EU-FP7 Collaborative Project (2009-2012) Theme ICT-4.4: Intelligent Content and Semantics

Page 4:

Problems…

Capturing the workflow

– Explanation

– Error detection / Repair

– Reproducibility

– Provenance

Steep learning curve

– Many operators to choose from

– Best combination of operators

– Hard for non-data-miners

Page 5:

Problems… and solutions (e-LICO planned workflows)

Develop “Intelligent Discovery Assistant” (IDA) for Data Analysis

– Automatically generate workflows by planning

– Assist the user in solving a DM task

– Structure workflows in workflow templates

– Self-improvement through Meta-Mining

Ontology based data model

– Adds semantics

– OWL/RDF based

– Data Mining Experiment Repository

Capturing the workflow

– Explanation

– Error Detection / Repair

– Reproducibility

– Provenance

Steep learning curve

– Many operators to choose from

– Best combination of operators

– Hard for non-data-miners

Page 6:

The e-LICO workflow

[Figure: the e-LICO workflow. Boxes: Input data; Ontology-based AI planner; Workflow execution engine; Publish and share; Meta-mining; Output: data, provenance and models. The stages are numbered 1 to 4.]

Page 7:

Ontology based AI planner


Page 8:

Hierarchical Task Network (HTN) planning

Set of Tasks to achieve possible Data Mining Goals

Tasks have an I/O specification and a set of associated Methods to achieve that task

Methods are composed of simpler Tasks/Methods

Some methods are Operators with Conditions and Effects

Example: my Task is ‘Data Mining With Evaluation’; my Goal is to get a workflow that does this Evaluation via Cross-Validation

Workflow planning
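The decomposition described on this slide can be illustrated with a minimal sketch. It is not the e-LICO planner: the task names CleanMV and 'Data Mining With Evaluation' come from the slides, but the method table, the operator names (ReplaceMissingValues, XValidation) and the state facts (has_missing_values, clean, evaluated_model) are hypothetical and chosen only to show how tasks expand into methods and how operators check conditions and apply effects.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical HTN domain: task -> alternative methods, each an ordered list of subtasks.
METHODS: Dict[str, List[List[str]]] = {
    "DataMiningWithEvaluation": [["CleanMV", "CrossValidation"]],
    "CleanMV": [["ReplaceMissingValues"], []],  # second method: nothing to clean
    "CrossValidation": [["XValidation"]],
}

# Hypothetical operators: name -> (conditions, added facts, deleted facts).
OPERATORS: Dict[str, Tuple[Set[str], Set[str], Set[str]]] = {
    "ReplaceMissingValues": ({"has_missing_values"}, {"clean"}, {"has_missing_values"}),
    "XValidation": ({"clean"}, {"evaluated_model"}, set()),
}


def plan(tasks: List[str], state: FrozenSet[str]) -> Optional[List[str]]:
    """Depth-first HTN decomposition; returns an operator sequence or None."""
    if not tasks:
        return []
    head, rest = tasks[0], tasks[1:]
    if head in OPERATORS:
        conditions, added, deleted = OPERATORS[head]
        if not conditions <= state:
            return None  # operator is not applicable in this state
        tail = plan(rest, (state - deleted) | added)
        return None if tail is None else [head] + tail
    # Compound task: try each method in turn and backtrack on failure.
    for subtasks in METHODS.get(head, []):
        result = plan(subtasks + rest, state)
        if result is not None:
            return result
    return None


if __name__ == "__main__":
    print(plan(["DataMiningWithEvaluation"], frozenset({"has_missing_values"})))
```

The planner returns the first decomposition that works, which corresponds to an applicable workflow rather than a good one; ranking alternatives is where meta-mining would come in.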

Page 9:

The Data Mining Workflow Ontology (DMWF)

Class | Description | Examples

IO Object | Input and output used by operators | Data, Model, Report

MetaData | Characteristics of the IO Objects | Attribute, AttributeType, DataColumn, DataFormat

Operator | DM operators | DataTableProcessing, ModelProcessing, Modeling, MethodEvaluation

Goal | A DM goal that the user could solve | DescriptiveModelling, PatternDiscovery, PredictiveModelling, RetrievalByContent

Task | A task is used to achieve a goal | CleanMV, CategorialToScalar, DiscretizeAll, PredictTarget

Method | A method is used to solve a task | CategorialToScalarRecursive, CleanMVRecursive, DiscretizeAllRecursive, DoPrediction

Page 10:

AI Planner

Brute force planning

Probabilistic Planning

– What will likely produce better results?

Case-based Planning

– How did we solve that previously?

DMOP (Workflow optimization ontology)

– Algorithm and Model selection given a particular task

– Meta-mining by abstraction and generalisation

Workflow Planning

Page 11:

Meta-Mining

Initially, the AI planner recommends applicable DM workflows, not necessarily good ones

Self-improves with experience through meta-mining

The meta-miner

– Applies DM techniques to meta-data from past DM experiments

– Extracts workflow patterns that are signatures of high predictive performance

The planner uses these workflow patterns to design and recommend promising workflows
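As a rough illustration of extracting workflow patterns from past experiments, the sketch below counts adjacent operator pairs and ranks them by the mean accuracy of the workflows that contain them. This is a deliberately simple stand-in, not the e-LICO meta-miner: the pattern definition (adjacent pairs), the `min_support` threshold, the operator names other than CleanMV and the accuracy values are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, str]


def pattern_scores(
    experiments: Sequence[Tuple[List[str], float]], min_support: int = 2
) -> List[Tuple[Pattern, float, int]]:
    """Rank adjacent-operator patterns by mean accuracy of the workflows using them.

    experiments: (operator sequence, accuracy) pairs from past runs.
    Returns (pattern, mean accuracy, support) tuples, best first.
    """
    scores: Dict[Pattern, List[float]] = defaultdict(list)
    for operators, accuracy in experiments:
        # Count each pattern at most once per workflow.
        for pattern in set(zip(operators, operators[1:])):
            scores[pattern].append(accuracy)
    ranked = [
        (pattern, mean(values), len(values))
        for pattern, values in scores.items()
        if len(values) >= min_support
    ]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Made-up experiment records, for illustration only.
PAST_EXPERIMENTS = [
    (["CleanMV", "Discretize", "NaiveBayes"], 0.75),
    (["CleanMV", "Discretize", "DecisionTree"], 0.875),
    (["CleanMV", "NaiveBayes"], 0.5),
    (["CleanMV", "NaiveBayes"], 0.625),
]

if __name__ == "__main__":
    for pattern, score, support in pattern_scores(PAST_EXPERIMENTS):
        print(pattern, score, support)
```

A planner could then prefer decompositions that contain highly ranked patterns, which is one way to read "uses these workflow patterns to design and recommend promising workflows".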

Page 12:

Workflow Execution


Page 13:

Workflow Execution

All operators in the ontology (200+) are exposed as SOAP- or REST-based Web Services

Plans are converted to a workflow execution language (SCUFL 2)

Provenance capture

– Execution times and intermediate models are returned to the planner

Taverna
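To make the idea of provenance capture concrete, here is a small sketch in plain Python. It does not use Taverna or SCUFL 2, and the `ProvenanceRecord` fields and `run_workflow` function are hypothetical names; it only shows the general pattern of timing each step and recording what it produced so that the information could be passed back to a planner.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class ProvenanceRecord:
    operator: str      # name of the step that ran
    seconds: float     # wall-clock execution time of the step
    output_type: str   # type name of the intermediate result


def run_workflow(
    steps: Sequence[Tuple[str, Callable[[Any], Any]]], data: Any
) -> Tuple[Any, List[ProvenanceRecord]]:
    """Run steps in order, returning the final result and a provenance log."""
    log: List[ProvenanceRecord] = []
    for name, operator in steps:
        start = time.perf_counter()
        data = operator(data)
        elapsed = time.perf_counter() - start
        log.append(ProvenanceRecord(name, elapsed, type(data).__name__))
    return data, log


if __name__ == "__main__":
    steps = [
        ("drop_missing", lambda rows: [r for r in rows if r is not None]),
        ("mean", lambda rows: sum(rows) / len(rows)),
    ]
    result, log = run_workflow(steps, [1.0, None, 3.0])
    print(result)
    for record in log:
        print(record)
```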

Page 14:

Workflow Publishing and Sharing


Page 15:

Workflow Publishing and Sharing

Workflows and data can be shared via myExperiment

Build a community of data miners

Set of re-usable workflows, data and workflow templates (packs)

Page 16:

Use case – Obstructive nephropathy

Demonstrated with Systems Biology Use Case

– Biomarker discovery and pathway modelling in the study of chronic kidney disease

– KUP challenge initiated (August 2010)

[Figure: use-case cycle. Labels: Expression data; KUP KB (RDF store); Text-mining / Image mining; New models and hypotheses; Further wet lab experiments]

Page 17:

Research Questions

How and when does a planner-based “Intelligent Discovery Assistant” help the end user?

Can we improve planning and suggest better workflows through meta-mining?

Can we plan complex workflows with Scientific Goals that answer biological questions?

– KUP goal is to construct diagnostic models that accurately connect the biological views to the severity of this pathology

Page 18:

Where are we now

Availability

http://www.e-lico.eu

1st year demo – http://www.youtube.com/watch?v=JtmqZfzyEKs

eProPlan plugin for Protégé 4.0

Ontologies available

Taverna

http://www.taverna.org.uk

RapidMiner

http://rapid-i.com

Page 19:

Summary

e-LICO: virtual laboratory for interdisciplinary collaborative research in data-mining

Ontology based AI planning of KDD workflows

Generic E-Science platform for DM

Application layer for Systems Biology

Page 20:

Acknowledgments

Robert Stevens (Manchester)

Alan Williams (Manchester)

Rishi Ramgolam (Manchester)

Jorg-Uwe Kietz (Zurich)

Melanie Hilario (Geneva)

E-LICO consortium