data mining - steve tanner, uah

31
Data Mining Data Mining Research and Applications Research and Applications Workshop on Cyberinfrastructure For Environmental Research and Education October 31, 2002 Steve Tanner Information Technology and Systems Center University of Alabama in Huntsville [email protected] 256.824.5143 www.itsc.uah.edu

Upload: tommy96

Post on 20-Nov-2014

743 views

Category:

Documents


1 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Data Mining - Steve Tanner, UAH

Data MiningData MiningResearch and ApplicationsResearch and Applications

Workshop on CyberinfrastructureFor Environmental Research and

EducationOctober 31, 2002

Steve TannerInformation Technology and Systems Center

University of Alabama in [email protected]

256.824.5143www.itsc.uah.edu

Page 2: Data Mining - Steve Tanner, UAH

What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?

What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?

How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?

How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?

Key Questions:Key Questions:

Page 3: Data Mining - Steve Tanner, UAH

Data MiningData Mining

Data Mining is an interdisciplinary field drawing from areas such as statistics, machine learning, pattern recognition and others

Automated discovery of patterns, anomalies, etc. from vast observational and model data sets

Derived knowledge for decision making, predictions and disaster response ADaM – Algorithm Development and Mining System

datamining.itsc.uah.edu

Page 4: Data Mining - Steve Tanner, UAH

Clustering Techniques– K Means– Isodata– Maximum

Pattern Recognition– Bayes Classifier– Minimum Distribution Classifier

Image Analysis– Boundary Detection– Cooccurrence Matrix– Dilation and Erosion– Histogram Operations– Polygon Circumscript– Spatial Filtering– Texture Operations

Genetic Algorithms Neural Networks Etc.

Techniques used for Data Techniques used for Data MiningMining

Data Mining systems usually involve a toolbox of many different techniques and a means for combining them

Page 5: Data Mining - Steve Tanner, UAH

Google – Complex algorithm sequence to decide order

Amazon.Com – Additional purchase suggestions

Credit Card Fraud– Event notification of odd usage

Typical Everyday Typical Everyday Encounters with Data Encounters with Data MiningMining

Most current Data Mining applications are text based. Text provides an easily readable source of heterogeneous data. Mining of scientific data sets is more complex.

Page 6: Data Mining - Steve Tanner, UAH

User Perspective and Data User Perspective and Data Perspective of the Data Perspective of the Data Mining ProcessMining Process

DataDataStoresStores

InformationInformation

AnalysisAnalysis

KnowledgeKnowledge

DecisionDecision

DatasetDataset

VolumeVolumeValueValue

Calibration Calibration & Navigation& Navigation

PreprocessingPreprocessing

TransformationTransformation

DatasetDatasetSpecific Specific AlgorithmsAlgorithms

DomainDomainSpecific Specific AlgorithmsAlgorithms

User PerspectiveUser Perspective Data PerspectiveData Perspective

DataData

Page 7: Data Mining - Steve Tanner, UAH

Scientific Scientific AnalysisAnalysis

Scientific Scientific AnalysisAnalysis

Harnesses human analysis Harnesses human analysis capabilitiescapabilities

– Highly creativeHighly creative

Based on theory and Based on theory and hypothesis formulationhypothesis formulation

– Physical basis is normally Physical basis is normally used for algorithmsused for algorithms

Drawing insights about the Drawing insights about the underlying phenomena underlying phenomena

Rapidly widening gap between Rapidly widening gap between data collection capabilities and data collection capabilities and the ability to analyze datathe ability to analyze data

Potential of vast amounts of Potential of vast amounts of data to be unuseddata to be unused

Provides automation of the Provides automation of the analysis process analysis process

Can be used for dimensionality Can be used for dimensionality reduction when manual reduction when manual examination of data is impossibleexamination of data is impossible

Can have limitationsCan have limitations

– May not utilize domain May not utilize domain knowledgeknowledge

– May be difficult to prove May be difficult to prove validity of the results validity of the results

There may not be a physical There may not be a physical basisbasis

Should be viewed as Should be viewed as complimentary tool and not a complimentary tool and not a replacement for scientific replacement for scientific analysisanalysis

Data Data MiningMiningData Data MiningMining

Page 8: Data Mining - Steve Tanner, UAH

Similarity between Data Mining Similarity between Data Mining and Scientific Analysis Processand Scientific Analysis Process

Page 9: Data Mining - Steve Tanner, UAH

Mining Framework (ADaM)– Complete System (Client and Engine)– Mining Engine (User provides its own client)– Application Specific Mining Systems– Operations Tool Kit– Stand Alone Mining Algorithms– Data Fusion

Distributed/Federated Mining– Distributed services– Distributed data– Chaining using Interchange Technologies

On-board Mining (EVE)– Real time and distributed mining– Processing environment constraints

Mining EnvironmentsMining Environments

Page 10: Data Mining - Steve Tanner, UAH

Using the Mining Framework: Using the Mining Framework: Focusing on the information in Focusing on the information in datadata

Using the Mining Framework: Using the Mining Framework: Focusing on the information in Focusing on the information in datadata

Page 11: Data Mining - Steve Tanner, UAH

TranslatedData

PreprocessedData

PreprocessedData

Patterns/ModelsPatterns/Models

ResultsResults

OutputGIF ImagesHDF Raster ImagesHDF Scientific Data SetsHDF-ESOPolygons (ASCII, DXF)SSM/I MSFC Brightness TempTIFF ImagesGeoTIFFOthers...

Preprocessing AnalysisClustering K Means Isodata MaximumPattern Recognition Bayes Classifier Min. Dist. ClassifierImage Analysis Boundary Detection Cooccurrence Matrix Dilation and Erosion Histogram Operations Polygon Circumscript Spatial Filtering Texture OperationsGenetic AlgorithmsNeural NetworksOthers…

Selection and Sampling Subsetting Subsampling Select by Value Coincidence SearchGrid Manipulation Grid Creation Bin Aggregate Bin Select Grid Aggregate Grid Select Find HolesImage Processing Cropping Inversion ThresholdingOthers...

Processing

InputPIP-2SSM/I PathfinderSSM/I TDRSSM/I NESDIS Lvl 1BSSM/I MSFC Brightness TempUS RainLandsatASCII GrassVectors (ASCII Text)HDFHDF-EOSGIFIntergraph RasterOthers...

The ADaM Processing ModelThe ADaM Processing Model

Raw DataRaw Data

Page 12: Data Mining - Steve Tanner, UAH

Iterative Nature of the Iterative Nature of the Data Mining ProcessData Mining Process

DATA

PREPROCESSING CLEANING

AndINTEGRATION

MINING SELECTIONAnd

TRANSFORMATION

DISCOVERY

KNOWLEDGEEVALUATION

AndPRESENTATION

Page 13: Data Mining - Steve Tanner, UAH

Distributed/Federated Mining: Distributed/Federated Mining: Meshing data and algorithms to Meshing data and algorithms to generate knowledgegenerate knowledge

Distributed/Federated Mining: Distributed/Federated Mining: Meshing data and algorithms to Meshing data and algorithms to generate knowledgegenerate knowledge

Page 14: Data Mining - Steve Tanner, UAH

ADaM : Mining Environment for ADaM : Mining Environment for Scientific DataScientific Data

• The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.

•contains over 120 different operations •Operations vary from specialized science data-set specific algorithms to various digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms and others

Page 15: Data Mining - Steve Tanner, UAH

Classification Based on Classification Based on Texture Features and Edge Texture Features and Edge DensityDensity

Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery

Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds

Comparison based on

– Accuracy of detection

– Amount of time required to classify

Page 16: Data Mining - Steve Tanner, UAH

Parallel Version of Cloud ExtractionParallel Version of Cloud Extraction

Laplacian FilterSobel Horizontal

FilterSobel Vertical

Filter

Energy Computation

Energy Computation

Energy Computation

Energy Computation

Classifier

GOES Image

Cloud Image

GOES images can be used to recognize cumulus cloud fields

Cumulus clouds are small and do not show up well in 4km resolution IR channels

Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors

Three edge detection filters are used together to detect cumulus clouds which lends itself to implementation on a parallel cluster

GOES Image Cumulus CloudMask

Page 17: Data Mining - Steve Tanner, UAH

Automated Data Analysis for Automated Data Analysis for Boundary Detection and Boundary Detection and QuantificationQuantification

Analysis of polar cap auroras in large volumes of spacecraft UV images

Science Rationale: Indicators to predict geomagnetic storm

– Damage satellites– Disrupt radio

connection Developing different

mining algorithms to detect and quantify polar cap boundary

Polar Cap Boundary

Page 18: Data Mining - Steve Tanner, UAH

Detecting SignaturesDetecting Signatures Science Rationale:

Mesocyclone signatures in Radar data are indicators of Tornadic activity

Developing an algorithm based on wind velocity shear signatures

– Improve accuracy and reduce false alarm rates

Page 19: Data Mining - Steve Tanner, UAH

Genetic Subtyping Genetic Subtyping Using Hierarchical Using Hierarchical ClusteringClustering

Biologists are interested in comparing DNA sequences to see how closely related they are to one another

Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure

Such trees show which organisms are most likely share common ancestors, and may provide information about how various subtypes of organisms evolved

This information is useful when studying disease causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways

Page 20: Data Mining - Steve Tanner, UAH

Mining on Data Ingest: Tropical Cyclone Detection

Advanced Microwave Sounding Unit (AMSU-A) Data

Calibration/Limb Correction/Converted to Tb

Mining Environment

Data Archive

Result

Results are placed on the web, made available to National Hurricane Center & Joint Typhoon Warning Center,

and stored for further analysis

Mining Plan:• Water cover mask to eliminate land• Laplacian filter to compute temperature

gradients• Science Algorithm to estimate wind

speed• Contiguous regions with wind speeds

above a desired threshold identified• Additional test to eliminate false positives• Maximum wind speed and location

produced

Hurricane Floyd

Further Analysis

KnowledgeBase

pm-esip.msfc.nasa.gov/

Page 21: Data Mining - Steve Tanner, UAH

AMSU ProductGeneration

TMI AMSU-A SSM/I SSM/T2

OrderStaging

PM-ESIPCatalog

AMSU-A Ingest

ADaM-basedProcessing

Distributed Data Stores

Output

ProcessSubset//Grid/Format

In-put

ADaM Servers

TMI Ingest andProduct Generation

Data Ingest & Processing

Custom Processing

Web Interfaces & Applications

AMSU-A Images

Temperature Trends

STT Application

Visualization & Exploration

FTP

Cyclone Winds

Data Ordering

Multiple Mining Environments:Multiple Mining Environments:Passive Microwave ESIP Information Passive Microwave ESIP Information

SystemSystem

Page 22: Data Mining - Steve Tanner, UAH

• Science data comes in: Different formats, types and structures Different states of processing (raw,

calibrated, derived, modeled or interpreted)

Enormous volumes

• Heterogeneity leads to data usability problems

• One approach: Standard data formats Difficult to implement and enforce Can’t anticipate all needs

Some data can’t be modeled or is lost in translation

• The cost of converting legacy data

• A better approach: Interchange Technologies

• Earth Science Markup Language

Interoperability: Accessing Interoperability: Accessing Heterogeneous DataHeterogeneous DataThe Problem

DATA FORMAT 1

DATA FORMAT 1

DATA FORMAT 2

DATA FORMAT 2

DATA FORMAT 3

DATA FORMAT 3

READER 1 READER 2

FORMATCONVERTER

ESML LIBRARY

APPLICATION

DATA FORMAT 1

DATA FORMAT 1

DATA FORMAT 2

DATA FORMAT 2

DATA FORMAT 3

DATA FORMAT 3

The Solution

APPLICATION

ESMLFILEESMLFILE

ESMLFILEESMLFILE

ESMLFILEESMLFILE

Page 23: Data Mining - Steve Tanner, UAH

Data

Chained Image Chained Image Processing ServicesProcessing Services

Data Files

ESML

WMS(Java/Windows)

Draw Image(PERL/C – Linux)

Data Files

KnowledgeBase

Service Chaining is used to integrate modules – or services – developed on distributed platforms and different languages for a single processing solution.

GeoCrop(Perl/Linux)

Resample(Perl/C – Linux)

Format(Perl/Linux)

Data StreamsCha

ined

Ser

vice

s

ESML Lib

Reader(Java/C+

Windows)

Page 24: Data Mining - Steve Tanner, UAH

Data Integration using Web Data Integration using Web Mapping ServicesMapping Services

Globe AMSU-A KnowledgeBase

ITSC

Coastlines

Countries

MCS Events

Cyclone EventsAMSU-A Channel 01

AMSU-A data overlaid with MCS and Cyclone events for September 2000, merged with world boundaries from Globe.

Page 25: Data Mining - Steve Tanner, UAH

Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.

Fused Displays Fused Displays from Multiple from Multiple ServersServers

Page 26: Data Mining - Steve Tanner, UAH

Model and Observation Data

FEATUREI

FEATUREII

FEATUREIII

FEATURE SET I

EVENT A

FEATUREX

FEATUREY

EVENT B

CO

NC

EP

TU

AL

L

EV

EL

CONCEPT MINING

DA

TA

FIL

EL

EV

EL

DECISION SUPPORT

MULTI-LEVEL MINING

Concept Hierarchy for Data Mining and Fusion

Page 27: Data Mining - Steve Tanner, UAH

On-Board Real-Time On-Board Real-Time Processing Processing Sensor Control/TargetingSensor Control/Targeting

• Anomaly detection

• Data Mining• Autonomous

Decision Making

• Immediate response

• Direct satellite to Earth delivery of results

EVE – Environment for On-board Processing

www.itsc.uah.edu/eve

Page 28: Data Mining - Steve Tanner, UAH

04/08/23 28

A Reconfigurable Web of Interacting Sensors

Ground NetworkGround Network

Ground Network

Military

Weather

Satellite Constellations

Communications

Page 29: Data Mining - Steve Tanner, UAH

Example Plan: Threshold Example Plan: Threshold events in AMSU-A Streaming events in AMSU-A Streaming Data Data

EVE

Page 30: Data Mining - Steve Tanner, UAH

Data Integration and Mining: From Global Information to Local Knowledge

Precision Agriculture

Emergency Response

Weather Prediction

Urban Environments

Page 31: Data Mining - Steve Tanner, UAH

What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?

What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?

How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?

How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?

Key Questions:Key Questions: