Session 28: Distributed Data Mining Research using Grids and Web Services
Author/Presenter: Peter Brezany
University of Vienna, Austria
11 July
Motivation
Balatonfüred, Hungary - 6th-18th July 2008
Business
Medicine
Scientific experiments
Simulations
Earth observations
Data and data exploration cloud
Outline
Motivation
Selected projects ←
Data mining model
Towards high productivity analytics
Parallel and distributed data mining and OLAP in GridMiner/ADMIRE projects
Future developments
Selected Projects
A Long-Term Biodiversity, Ecosystem and Awareness Research Network – ALTER-Net
Waste
Air
Soil
Water
Emission
Bio-diversity
Forests
Distributed Data
Distributed Data Mining
Flow Analysis
Geo-Statistic
Reporting
Popular Presentation
Prediction Models
Distributed Applications
…
Statistic
Common Ontology
Author: Kathi Schleidt
China-Austria Data Grid (CADGrid)
Main Idea: Medical Meridian Measurement Grid (M3G) for On-Line Diagnosis
The diabetes domain is the first domain to benefit substantially from the project results
Motivation
Meridian-Theory is an important part of Traditional Chinese Medicine (TCM)
Clinical practices of TCM (esp. acupuncture) have been guided by meridian theory for thousands of years
More than 4000 years of experience
Knowledge that we should not only use but also support with modern high-tech measurement and IT technologies
3-Dec-07 CADGrid 7
Meridian-Theory Basics (1)
According to TCM, the human body has 14 acupuncture meridians
Secret to our biological and medical knowledge
Each meridian has its main points called source points
Meridian-Theory Basics (2)
Using data mining techniques, correlations between these points can be identified, e.g., start-end point correlation and symmetric point correlation
If there is pain at one place along a meridian, a good effect can be achieved by treating another place on the same line
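As a minimal sketch of what identifying a correlation between two points could look like, the following computes the Pearson correlation coefficient between signal series recorded at two points. The readings are synthetic and the names start_point and end_point are hypothetical; the slides do not state which correlation measure or data format CADGrid uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic readings in arbitrary units, invented for illustration only.
start_point = [1.0, 2.0, 3.0, 4.0, 5.0]
end_point = [2.1, 3.9, 6.2, 8.0, 9.8]

# A value close to +1 indicates a strong positive linear relationship.
r = pearson(start_point, end_point)
```

A coefficient near zero would indicate no linear relationship between the two series; it says nothing about causation.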
Meridian-Theory Basics (3)
Meridians can transport physical, medical, and biological material and information
The characteristics (weaker or stronger output, time delay, …) gained by the analysis of electro-signals sensed from meridians have a strong relationship with the human body organs (heart, lung, brain, …)
Meridian Measurement Methods
1 Active
2 Passive
Active Measurement
Data-file
Down-flow-key-point
Up-flow-key-point
Human-body-meridian
Send
Measure
Record
Record
Up-flow point: lower electrical potential
Down-flow point: higher electrical potential
Fingers and toes: zero potential
Passive Measurement
Data-file
Measure
Record
to the ground of the instrument
Application 1
Non-invasive Glucose Measurement (NIGM)
Meridian Measurement Instrument
The First Prototype
NIGM Workflow
M3G Services for Diabetics: NIGM-Service – Model Setup
M3G Services for Diabetics: NIGM-Service – Use Model
M3G Services for Diabetics: NIGM-Service – Maintain Model
CADGrid Framework
Intelligence Base
Future Work
Extension to other domains
Brain Informatics domain
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
Data
CRISP-DM
Towards High Productivity Analytics
A Project Sponsored by
Motivation:
High Productivity Analytics
Our definition:
"A high productivity analytics system is one that delivers a high level of performance and guarantees a high level of accuracy of analytics models and other results extracted from analyzed data sets, while scoring equally well on other aspects such as usability, robustness, system management, and ease of programming."
High Productivity Analytics Research Agenda
High-performance services developed with high-productivity languages and tools
Efficient workflow management (building and execution)
Advanced GUI
Illustration on the GridMiner system
GridMiner Data Mining Model
Data
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
CRISP-DM, SPSS
Service Provider
Service Provider
Service Provider
Data provider
GridMiner User
Virtual Organization
GridMiner Conceptual Architecture
Data Warehouse
Knowledge
Cleaning and Integration
Selection and Transformation
Data Mining
Evaluation and Presentation
OLAP
Online Analytical Mining
OLAP Queries
Data and functional resources can be geographically distributed – focus on virtualization and large-scale data mining.
Motivation for large-scale data mining
[Figure: accuracy (vertical axis) versus sampled data size (horizontal axis), up to 100% of the available data size. Labels in the plot: (q0, m0), (qi, mi), "Assumed".]
qi - data quality
mi - modeling method
Service Parallelism Levels
Inter-Service & Intra-Service Parallelism
Hybrid Programming Model
SPMD – Single Program Multiple Data (used for programming multiprocessor architectures)
+ SSMD – Single Service Multiple Data
(introduced by us for programming service-oriented architectures)
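The common idea behind SPMD and SSMD, one program (or one service operation) applied to many data partitions, can be sketched as follows. This is only an illustration under stated assumptions: a thread pool stands in for processors or service instances, and partition, local_sum_of_squares, and run_on_partitions are hypothetical names, not part of GridMiner.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def local_sum_of_squares(chunk):
    """The single 'program' (or service operation) run on every chunk."""
    return sum(x * x for x in chunk)


def run_on_partitions(data, n_workers):
    """Apply the same operation to each partition, then combine the results."""
    chunks = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_sum_of_squares, chunks))
    return sum(partials)


total = run_on_partitions(list(range(10)), 3)
```

In an SSMD setting each worker would be an instance of the same service receiving its own data partition; the combination step stays the same.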
1. Construction of Decision Trees - SPRINT – Scalable PaRallelizable INduction of decision Trees
categorical
continuous
class
Splitting Attributes
The splitting attribute at a node is determined by the Gini index.
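A minimal sketch of choosing a split with the Gini index: gini(S) = 1 - sum of p_j^2 over the classes j, and a binary split is scored by the size-weighted Gini of its two sides, with lower being better. The toy records and the function names are invented for illustration; this is an in-memory sketch, not the out-of-core SPRINT implementation.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_j ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(left, right):
    """Size-weighted Gini index of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_numeric_split(values, labels):
    """Return (threshold, gini) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = gini_split(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data, invented for illustration: one continuous attribute and a class.
ages = [23, 17, 43, 68, 32, 20]
risk = ["high", "high", "high", "low", "low", "high"]
threshold, score = best_numeric_split(ages, risk)
```

The attribute whose best split has the lowest Gini value would be chosen as the splitting attribute at the node.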
Out-of-Core Algorithm
Phase 1 - Preparation
Phase 2 - Execution
2. Construction of Neural Networks
Error Back-Propagation
Input layer
Output layer
Hidden layer
+-
Desired output
Σ
Sum
Limiter: sigmoid function
Weighted inputs
Outputs
Node
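The node in the figure (weighted inputs, sum, sigmoid limiter) and the comparison with the desired output can be sketched for the simplest case of a single output node. This is the delta rule, i.e. the one-node special case of error back-propagation, not a full multi-layer implementation; the learning rate, inputs, and function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Sigmoid limiter applied to the weighted sum."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(weights, bias, inputs):
    """Weighted sum of the inputs followed by the sigmoid limiter."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def train_step(weights, bias, inputs, desired, lr=0.5):
    """One gradient-descent step on 0.5 * (desired - output) ** 2."""
    out = node_output(weights, bias, inputs)
    error = desired - out
    # Error signal propagated back through the sigmoid derivative.
    delta = error * out * (1.0 - out)
    new_weights = [w + lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias + lr * delta
    return new_weights, new_bias, error


weights, bias = [0.0, 0.0], 0.0
inputs, desired = [1.0, 0.5], 1.0
errors = []
for _ in range(200):
    weights, bias, error = train_step(weights, bias, inputs, desired)
    errors.append(error)
```

In a multi-layer network the same delta term is computed for the output layer and then propagated backwards to the hidden layer through the connecting weights.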
Parallel Algorithm
Challenges
Training real NNs is extremely computationally intensive.
Many practical NN applications (e.g., speech and face recognition) involve a large number of adjustable parameters and training patterns to achieve the needed accuracy.
Solution
Parallel training algorithms
Development of services running in high-performance hardware and software environments
Programming Environment: Titanium
The goals: performance, safety, and expressiveness.
A language that gives its users access to modern program structuring through the use of object-oriented technology and that enables them to write explicitly parallel code.
Based on a parallel SPMD model of computation with a global address space.
Titanium uses Java as its base but is not a strict extension of Java.
Compiler: Titanium C + communication
Overview of Distributed Solution
Master
Sub-master 0
Sub-master 1
Slave 0
Slave 1
Slave 2
Slave 0
Slave 1
Training Data for Sub-master 1
Data Distribution Scheme 1
Data Distribution Scheme 2
Training Data for Sub-master 0
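A rough sketch of the master / sub-master / slave hierarchy: training patterns are split among sub-masters, each sub-master splits its share among its slaves, and partial gradients are summed back up to the master. The slide does not say what the two data distribution schemes are, so contiguous block partitioning is an assumption here, as are the one-weight linear model and all function names. The calls run sequentially in this sketch; in a real deployment each slave would run on its own node.

```python
def split_blocks(items, n_parts):
    """Split items into n_parts contiguous blocks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    blocks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        blocks.append(items[start:end])
        start = end
    return blocks


def slave_partial_gradient(patterns, weight):
    """Gradient of the summed squared error of y = weight * x on one block."""
    return sum(-2.0 * x * (target - weight * x) for x, target in patterns)


def master_gradient(patterns, weight, slaves_per_submaster):
    """Distribute patterns over two levels and sum the partial gradients."""
    submaster_data = split_blocks(patterns, len(slaves_per_submaster))
    total = 0.0
    for data, n_slaves in zip(submaster_data, slaves_per_submaster):
        slave_blocks = split_blocks(data, n_slaves)
        # Each sub-master combines the results of its own slaves.
        total += sum(slave_partial_gradient(b, weight) for b in slave_blocks)
    return total


# Synthetic patterns (x, target) with target = 2 * x, for illustration only.
patterns = [(float(x), 2.0 * x) for x in range(10)]
# Sub-master 0 with three slaves and sub-master 1 with two, as in the figure.
gradient = master_gradient(patterns, weight=1.5, slaves_per_submaster=[3, 2])
```

Because the gradient of a summed error is the sum of per-pattern gradients, the hierarchical result equals the gradient computed on the undivided training set.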
The Parallel Implementation
VGE Client
VGE Server
VGE – Vienna Grid Environment
The Distributed & Parallel Implementation
VGE Client
VGE Server
3. On-Line Analytical Processing (OLAP)
a three-dimensional data cube
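As a small illustration of a three-dimensional data cube, the sketch below stores one measure per (product, region, year) cell and rolls the cube up along chosen dimensions. The dimensions, the fact tuples, and the function names are invented for this example and do not come from the slides.

```python
from collections import defaultdict

# Fact tuples (product, region, year, sales), invented for illustration.
facts = [
    ("tea", "north", 2007, 10),
    ("tea", "south", 2007, 5),
    ("coffee", "north", 2007, 7),
    ("tea", "north", 2008, 12),
    ("coffee", "south", 2008, 3),
]


def build_cube(rows):
    """Aggregate the measure into cells keyed by the three dimensions."""
    cube = defaultdict(int)
    for product, region, year, sales in rows:
        cube[(product, region, year)] += sales
    return dict(cube)


def roll_up(cube, keep):
    """Sum the cube over every dimension whose index is not in keep."""
    result = defaultdict(int)
    for cell, value in cube.items():
        result[tuple(cell[i] for i in keep)] += value
    return dict(result)


cube = build_cube(facts)
sales_by_product = roll_up(cube, keep=(0,))
sales_by_region_year = roll_up(cube, keep=(1, 2))
```

A typical OLAP query then reads from one of these aggregated views instead of scanning the raw fact tuples.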
Distributed OLAP – Aggregation of Compute and Storage Resources
Tuple Stream
OLAP Service
Virtual Cube
Sub Cube
Sub Cube
Slave 1
Slave 3
Master
Data
Sub Cube
Slave 2
Indexes
Index Service
query
answer (XML)
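The virtual cube idea can be sketched as a master that forwards a query to slaves, each holding one sub cube, and merges their partial aggregates. This is a simplified, single-process illustration: the Slave and Master classes are hypothetical, the index service and the XML answer format are not modeled, and merging by addition is only valid for distributive aggregates such as SUM or COUNT.

```python
from collections import defaultdict


class Slave:
    """Holds one sub cube: {(dim0, dim1, dim2): measure}."""

    def __init__(self, sub_cube):
        self.sub_cube = sub_cube

    def aggregate(self, keep):
        """Sum the local sub cube over the dimensions not in keep."""
        partial = defaultdict(int)
        for cell, value in self.sub_cube.items():
            partial[tuple(cell[i] for i in keep)] += value
        return dict(partial)


class Master:
    """Presents the slaves' sub cubes as one virtual cube."""

    def __init__(self, slaves):
        self.slaves = slaves

    def query(self, keep):
        """Collect partial aggregates from all slaves and merge them."""
        merged = defaultdict(int)
        for slave in self.slaves:
            for key, value in slave.aggregate(keep).items():
                merged[key] += value
        return dict(merged)


# Sub cubes with invented cells (product, region, year) -> sales.
master = Master([
    Slave({("tea", "north", 2007): 10, ("tea", "south", 2007): 5}),
    Slave({("coffee", "north", 2007): 7}),
    Slave({("tea", "north", 2008): 12, ("coffee", "south", 2008): 3}),
])
sales_by_product = master.query(keep=(0,))
```

The client sees only the merged answer, so the physical distribution of the sub cubes stays hidden behind the master.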
Workflow Composition Approaches
[Figure: three workflow composition schemes: Manual Composition; Automated Composition, Passive Approach; Automated Composition, Active Approach. Components shown: Workflow Editor, Workflow Composer, Workflow Description, Workflow Engine, Analytical Services, KB, Reasoner, Resources Monitoring.]
GridMiner Workflow Composition Editor
Computational Grid
Data Grid
Data Mining Grid
Semantic Grid – 1st Generation
Current Grids
Next-Generation Grid
Evolution of the Web
Knowledge Technologies
Evolution of HPCN
Mobile Services
Towards Next-Generation Grids