Chandrika Kamath and Imola K. Fodor Center for Applied Scientific Computing Lawrence Livermore National Laboratory Gatlinburg, TN March 26-27, 2002 Dimension Reduction and Sampling First SDM ISIC All-Hands Meeting UCRL. This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract W-7405-Eng-48.


Page 1

Chandrika Kamath and Imola K. Fodor

Center for Applied Scientific Computing
Lawrence Livermore National Laboratory

Gatlinburg, TN
March 26-27, 2002

Dimension Reduction and Sampling
First SDM ISIC All-Hands Meeting

UCRL. This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract W-7405-Eng-48.

Page 2

The SDM ISIC aims to minimize the effort researchers spend in managing their data

LLNL is participating in several of the tasks, including
  - data mining to improve the management of data

Problem: data from simulations and experiments is high dimensional (i.e., has many features)

Querying the features can help in understanding the data
  - but searching in a high-dimensional space is difficult

May want to cluster similar objects for efficient access
  - but clustering is expensive in high dimensions

We plan to address the problem of high dimensionality using techniques for dimension reduction and sampling originally developed in data mining.

Page 3

Our work on dimension reduction will help both data management and mining

Reducing the dimensions will improve
  - searching (task 3.1, LBNL)
  - clustering (task 2.1, ORNL)

Dimension reduction is expensive if there are many data items
  - use a sample of the data items
  - techniques for sampling in the presence of rare events

We will focus on climate and high-energy-physics (HEP) data
  - complements work at ORNL (climate) and LBNL (HEP)
  - but the techniques are applicable to other data as well
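The slides do not say which technique is used for sampling in the presence of rare events. As one possible illustration only, the sketch below keeps every item flagged as rare and draws a uniform random fraction of the rest. The function name `sample_keeping_rare`, the `is_rare` predicate, and the toy data are all hypothetical.

```python
import random


def sample_keeping_rare(items, is_rare, fraction, seed=0):
    """Keep every rare item and a uniform random fraction of the others."""
    rng = random.Random(seed)
    rare = [x for x in items if is_rare(x)]
    common = [x for x in items if not is_rare(x)]
    k = round(fraction * len(common))
    return rare + rng.sample(common, k)


# Hypothetical data: 1000 values, 10 of which are extreme "events".
data = [0.0] * 990 + [9.0] * 10
subset = sample_keeping_rare(data, lambda v: v > 5.0, fraction=0.1)
```

A plain uniform 10% sample would be expected to keep only about one of the ten rare values; here all ten survive, at the cost of over-representing them in the sample.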

We only report the 0.8 FTE of work funded under SciDAC; however, our data mining research is more extensive. See www.llnl.gov/casc/sapphire

Page 4

There are two different ways in which we can view dimension reduction

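Reduce the number of features representing a data item, so that each data item is described by p features instead of n:

$$(f_1, f_2, \ldots, f_n) \;\longrightarrow\; (f'_1, f'_2, \ldots, f'_p)$$

Reduce the number of basis vectors used to describe the data: if some of the coefficients $c_{ij}$ are small, they can be ignored

$$\mathrm{DataItem}_i = \sum_{j=1}^{N} c_{ij}\,\mathrm{BasisVector}_j$$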
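The second view can be illustrated with a small sketch. It assumes the basis vectors come from a singular value decomposition (PCA) of the centered data, which the slide does not specify, and it uses a hypothetical 99% variance threshold to decide which coefficients are small enough to drop. The data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 items with 10 features that really vary along 2 directions.
latent = rng.normal(size=(200, 2))
directions = rng.normal(size=(2, 10))
X = latent @ directions + 0.01 * rng.normal(size=(200, 10))

Xc = X - X.mean(axis=0)
# Rows of Vt are orthonormal basis vectors; U * s holds each item's coefficients.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coeffs = U * s

# Keep the smallest number of basis vectors that explains 99% of the variance.
energy = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(energy, 0.99) + 1)

# Each item is now approximated by k coefficients instead of 10 features.
X_approx = coeffs[:, :k] @ Vt[:k]
rel_error = np.linalg.norm(Xc - X_approx) / np.linalg.norm(Xc)
```

Ignoring the remaining basis vectors discards only the part of each data item carried by small coefficients, so `rel_error` stays small.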

Page 5

Our work on climate data focuses on reducing the number of basis vectors

Domain expert: Dr. Benjamin Santer (LLNL climate)

Climate scientists are interested in understanding the change in the earth's surface temperature

Simulated and observed data are mixtures of volcano, El Niño, and other effects

Our goal is to separate the signals corresponding to different effects
  - traditional approaches such as principal component analysis (PCA) have not worked
  - separation is difficult as the El Chichón and Pinatubo volcano eruptions coincided with El Niño events
  - our approach is to use independent component analysis (ICA)

Dimension reduction supporting scientific discovery

Page 6

The raw data consists of monthly temperatures on a 144x73 spatial grid at 17 vertical levels

[Figure: the raw data are passed through ICA to separate volcano, El Niño, and other effects]

January 1979 raw temperatures (Kelvin) on the 144x73 longitude by latitude grid at the 1000 hPa pressure level. Data from NCEP.

Page 7

Initially, we applied ICA to global monthly mean anomaly temperatures

[Figure: time series of global monthly mean anomalies, Jan 1979 - Dec 2000, at 17 vertical levels; level 1: 1000 hPa (lowest altitude); level 17: 10 hPa (highest altitude)]
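The slides do not describe how the global monthly mean anomalies were computed. A common recipe, shown here purely as an assumed sketch for a single vertical level, is to take a cosine-of-latitude weighted global mean for each month and then subtract the long-term mean of that calendar month. The function name and the synthetic input are hypothetical.

```python
import numpy as np


def global_monthly_anomalies(temps, lats_deg):
    """temps: (n_months, n_lat, n_lon), whole years starting in January."""
    weights = np.cos(np.deg2rad(lats_deg))       # grid cells shrink toward the poles
    zonal = temps.mean(axis=2)                   # average over longitude
    global_mean = (zonal * weights).sum(axis=1) / weights.sum()
    by_month = global_mean.reshape(-1, 12)       # one row per year
    climatology = by_month.mean(axis=0)          # mean January, mean February, ...
    return (by_month - climatology).ravel()


# Synthetic example: two years on a 73 x 144 grid, second year 0.5 K warmer.
lats = np.linspace(-90.0, 90.0, 73)
months = np.arange(24)
seasonal = 2.0 * np.sin(2 * np.pi * months / 12)
temps = 280.0 + seasonal[:, None, None] + np.zeros((24, 73, 144))
temps[12:] += 0.5
anoms = global_monthly_anomalies(temps, lats)
```

Removing the monthly climatology strips out the seasonal cycle, so what remains in this toy case is only the year-to-year shift.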

Page 8

Next, we ran experiments with simulated data to understand the behavior of ICA

[Figure panels: (i) two original sources; (ii) two mixed signals obtained by mixing the original sources; (iii) sources (ICs) recovered from (ii) by ICA]

ICA correctly estimates the shapes of the two independent components (ICs).

With additional processing, we can also estimate the relative contributions of the two ICs in the two mixed signals.
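An experiment of this kind can be sketched as follows. The slides do not name the ICA algorithm or software used, so this example substitutes a small symmetric FastICA with a tanh nonlinearity written in NumPy. The two sources (a sine wave and a volcano-like pulse), the eruption month, and the mixing matrix `A` are hypothetical stand-ins for the signals in the figure.

```python
import numpy as np


def fast_ica(X, n_iter=200, seed=0):
    """Symmetric FastICA (tanh nonlinearity). X has shape (n_signals, n_samples)."""
    X = X - X.mean(axis=1, keepdims=True)
    # Whiten: decorrelate the signals and scale them to unit variance.
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E / np.sqrt(d)) @ E.T @ X
    n = X.shape[0]
    # Start from a random orthonormal unmixing matrix.
    W = np.linalg.qr(np.random.default_rng(seed).normal(size=(n, n)))[0]
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        W = G @ Z.T / Z.shape[1] - np.diag((1 - G**2).mean(axis=1)) @ W
        # Symmetric decorrelation keeps the rows of W orthonormal.
        U, _, Vt = np.linalg.svd(W)
        W = U @ Vt
    return W @ Z


t = np.arange(264)                                   # 264 monthly samples
sine = np.sin(2 * np.pi * t / 44)
volcano = np.where(t >= 147, -np.exp(-(t - 147) / 12.0), 0.0)
S = np.vstack([sine, volcano])                       # (i) original sources
A = np.array([[1.0, 0.8], [0.5, 1.0]])               # hypothetical mixing matrix
X = A @ S                                            # (ii) mixed signals
ics = fast_ica(X)                                    # (iii) estimated ICs

# Absolute correlation of each source with its best-matching IC.
corr = np.abs(np.corrcoef(np.vstack([S, ics]))[:2, 2:])
match = corr.max(axis=1)
```

ICA recovers the sources only up to order, sign, and scale, which is why the comparison uses absolute correlations rather than direct differences.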

Page 9

Original decomposition of the two mixed signals (-): sine (--) and volcano (-.)

[Figure panels: (i) Signal 1; (ii) Signal 2]

Page 10

ICA decomposition of the two mixed signals (-): sine (--) and volcano (-.)

[Figure panels: (i) Signal 1; (ii) Signal 2]
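The slides do not spell out the additional processing used to estimate each IC's contribution to a mixed signal. One plausible approach, assumed here, is a least-squares fit of the mixed signal onto the estimated ICs. In this sketch the ICs are stand-ins built from the true shapes with arbitrary scale and sign, mimicking the ambiguity of ICA output; all signals and coefficients are hypothetical.

```python
import numpy as np

t = np.arange(264)
sine = np.sin(2 * np.pi * t / 44)
volcano = np.where(t >= 147, -np.exp(-(t - 147) / 12.0), 0.0)
mixed = 1.0 * sine + 0.8 * volcano             # one hypothetical mixed signal

# Stand-ins for ICA output: correct shapes, arbitrary scale and sign.
ics = np.vstack([-3.0 * sine, 0.5 * volcano])

# Solve mixed ~ ics.T @ coef in the least-squares sense.
coef, *_ = np.linalg.lstsq(ics.T, mixed, rcond=None)

# One row per IC, expressed in the units of the mixed signal.
contributions = coef[:, None] * ics
```

The product of coefficient and IC does not depend on the scale or sign of the IC, so the rows of `contributions` can be plotted against the mixed signal directly. Real ICA output is mean-centered, so in practice a constant term would be fitted as well.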

Page 11

ICA can also separate “noise” used as an extra component in the mixing

[Figure: 3 original sources are mixed into 3 mixed signals, from which ICA recovers 3 estimated ICs]

Page 12

Original decomposition of 3 mixed signals (-): El Niño (--), volcano (-.), and noise (..)

The cooling in the global series at the arrow is in fact a combination of an ENSO warming and a volcanic cooling. Without the volcano eruption, the El Niño warming would dominate, resulting in warmer global temperatures.

[Figure panels: (i) Signal 1; (ii) Signal 2; (iii) Signal 3]

Page 13

ICA decomposition of 3 mixed signals (-): El Niño (--), volcano (-.), and noise (..)

Although not perfect in terms of the exact amplitudes, ICA clearly separates the cooling effect of the volcano from the warming effect of El Niño.

[Figure panels: (i) Signal 1; (ii) Signal 2; (iii) Signal 3]

Page 14

Our future plans include work with HEP data and collaborators at ORNL and LBNL

Complete the work on the climate problem
  - our results with artificial data are encouraging
  - identify an appropriate ICA model for climate data

Make the ICA software accessible to SciDAC scientists

Try ICA and other dimension reduction techniques in the context of the STAR high-energy-physics data
  - reduce the number of features
  - investigate sampling to reduce computation
  - collaborate with LBNL (data, searching)

Investigate incremental PCA
  - monitor climate simulations using indices based on the principal components
  - collaborate with ORNL (data, clustering)
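The slides do not say which incremental PCA algorithm is planned. One simple reading, sketched here under that assumption, is to update a running mean and scatter matrix as each batch of simulation output arrives, recompute the principal components when needed, and project new samples onto them to obtain monitoring indices. The class name `RunningPCA` and the random data are hypothetical.

```python
import numpy as np


class RunningPCA:
    """Accumulates the mean and scatter matrix one batch at a time."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.scatter = np.zeros((n_features, n_features))

    def update(self, batch):
        m = batch.shape[0]
        batch_mean = batch.mean(axis=0)
        centered = batch - batch_mean
        delta = batch_mean - self.mean
        total = self.n + m
        # Standard pairwise merge of two scatter matrices.
        self.scatter += centered.T @ centered + np.outer(delta, delta) * self.n * m / total
        self.mean += delta * m / total
        self.n = total

    def components(self, k):
        """Return the k largest variances and their principal directions (rows)."""
        vals, vecs = np.linalg.eigh(self.scatter / (self.n - 1))
        order = np.argsort(vals)[::-1][:k]
        return vals[order], vecs[:, order].T


rng = np.random.default_rng(1)
data = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
pca = RunningPCA(4)
for start in range(0, 300, 50):          # data arrive in batches of 50 samples
    pca.update(data[start:start + 50])

vals, comps = pca.components(2)
index = (data[-1] - pca.mean) @ comps.T  # indices for the latest sample
```

Because the merge is exact, the accumulated covariance matches what a single pass over all the data would give, while only one batch needs to be held in memory at a time.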