Hyperspectral Data Classification
DESCRIPTION
Discusses some methods to classify HSI data.
TRANSCRIPT
CLASSIFICATION AND FEATURE SELECTION USING REMOTE SENSING DATA
MAHESH PAL
NATIONAL INSTITUTE OF TECHNOLOGY, KURUKSHETRA, INDIA
Remote Sensing Data
Panchromatic: one band.
Multispectral: many bands (these systems use sensors that detect radiation in a small number of broad wavelength bands).
Hyperspectral: large numbers of contiguous bands. A hyperspectral sensor collects many very narrow, contiguous spectral bands throughout the visible, near-infrared, mid-infrared and thermal infrared portions of the electromagnetic spectrum.
Landsat 7 ETM+ data (Multispectral)

Band number    Spectral range (µm)   Ground resolution (m)
1              0.450 - 0.515         30
2              0.525 - 0.605         30
3              0.630 - 0.690         30
4              0.750 - 0.900         30
5              1.550 - 1.750         30
6              10.40 - 12.50         60
7              2.090 - 2.350         30
Panchromatic   0.520 - 0.900         15

Between 0.45 - 2.35 µm: a total of six bands.
Images of the La Mancha (Spain) area acquired by the ETM+ sensor (30 m resolution).
The DAIS (Digital Airborne Imaging Spectrometer) Hyperspectral Sensor

Spectrometer   Bands (79)   Wavelength range (µm)
VIS/NIR        32           0.50 - 1.05
SWIR I         8            1.50 - 1.80
SWIR II        32           1.90 - 2.50
MIR            1            3.00 - 5.00
TIR            6            8.70 - 12.50

Between 0.502 - 2.395 µm: a total of 72 bands. Continuous bands at 10-45 nm bandwidth.
Images of the La Mancha (Spain) area from the DAIS hyperspectral sensor (5 m resolution).
Hyperspectral Imaging, Imaging Spectrometry, Imaging Spectroscopy
• Spectroscopy is the study of electromagnetic radiation.
• Imaging spectroscopy has been used in the laboratory by physicists and chemists for over 100 years.
• Imaging spectroscopy has many names in the remote sensing community, including imaging spectrometry and hyperspectral imaging.
• It acquires images in a large number of narrow, contiguous spectral bands, enabling the extraction of reflectance spectra at the pixel scale that can be compared directly with similar spectra measured in the field.
Importance of a Hyperspectral Sensor
• Provides spectral reflectance data in hundreds of bands rather than the few bands of multispectral data
  - Allows far more specific analysis of land cover
  - The emissivity levels of each band can be combined to form a spectral reflectance curve
• These sensors provide information in:
  - Visible region: vegetation, chlorophyll, sediments
  - Near infrared: atmospheric properties, cloud cover, vegetation and land cover transformation
  - Thermal infrared: sea surface temperature, forest fires, volcanoes, cloud height, total ozone
CLASSIFICATION
Land cover classification has been a major research area involving the use of remote sensing images.
The image classification process involves assigning pixels to classes according to the characteristics of the objects or materials.
It is a major input to GIS-based studies.
Several approaches are used for land cover classification.
CLASSIFICATION ALGORITHMS
Predictive accuracy
Computational cost
  o time to construct the model
  o time to use the model
Robustness
  o handling noise and missing values
Interpretability
  o understanding the insight provided by the model
Hyperspectral data classification
1. Provides greater detail on the spectral variation of targets than conventional multispectral systems.
2. The availability of large amounts of data represents a challenge to classification analyses.
3. Each spectral waveband used in the classification process should add an independent set of information.
4. However, the features are highly correlated, suggesting a degree of redundancy in the available information, which can have a negative impact on classification accuracy.
5. Requires a large pool of training data, which is quite costly to collect.
Various approaches for the appropriate classification of high dimensional data
1. Adoption of a classifier that is relatively insensitive to the Hughes effect (Vapnik, 1995).
2. Use of methods that effectively increase training set size, i.e. semi-supervised classification (Chi and Bruzzone, 2005), active learning, and use of unlabelled data (Shahshahani and Landgrebe, 1994).
3. Use of some form of dimensionality reduction procedure prior to the classification analysis.
[Diagram: Training samples feed a learning algorithm, which produces a model/function, also called the hypothesis; testing samples pass through the model to produce output values.]
The hypothesis can be considered a machine that provides predictions for test data.
SUPPORT VECTOR MACHINES (SVM)
Basic theory: 1965. Margin-based classifier: 1992. Support vector network: 1995.
Since 1998 the support vector network has been known as the Support Vector Machine (SVM), used as an alternative to the neural network.
First application in remote sensing: Gualtieri and Cromp (1998), for hyperspectral image classification.
SVM: structural risk minimisation (SRM), from the statistical learning theory proposed in the 1960s by Vapnik and co-workers.
SRM: minimise the probability of misclassifying unknown data drawn randomly.
Neural network: empirical risk minimisation, i.e. minimise the misclassification error on the training data.
SVM
Map the data from the original input feature space to a very high dimensional feature space.
The data becomes linearly separable, but the problem becomes computationally difficult.
A kernel function allows the SVM to work in the feature space without knowing the mapping or the dimensionality of the feature space.
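As a concrete illustration of the kernel idea, here is a minimal sketch (my own, not from the talk) of training an RBF-kernel SVM on hyperspectral pixels with scikit-learn; the data shapes and parameter values are hypothetical stand-ins.

```python
# Minimal sketch: RBF-kernel SVM on hyperspectral pixels (assumed data shapes).
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical data: n_pixels x n_bands reflectance matrix and integer class labels.
X = np.random.rand(800, 65)          # stand-in for real training pixels
y = np.random.randint(0, 8, 800)     # eight land cover classes

# The RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2) lets the SVM operate in a
# high-dimensional feature space without ever computing the mapping explicitly.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)
print("support vectors per class:", clf.named_steps["svc"].n_support_)
```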
Advantages
 Margin theory suggests no effect of the dimensionality of the input space.
 Uses a small number of the training data (called support vectors).
 QP solution, so no local minima.
 Not many user-defined parameters.
But with real data:
[Figure: Classification accuracy (%) versus number of features (5 to 65) for training set sizes of 8, 15, 25, 50, 75 and 100 pixels per class; accuracies span roughly 55% to 95%.]
Mahesh Pal and Giles M. Foody, 2010, Feature selection for classification of hyperspectral data by SVM. IEEE Transactions on Geoscience and Remote Sensing, 48(5), 2297-2306.
Disadvantages
 Designed for two-class problems; different methods are needed to create a multi-class classifier.
 Choice of kernel function and kernel-specific parameters.
 The kernel function should satisfy Mercer's theorem.
 Choice of the regularisation parameter C.
 Output is not naturally probabilistic.
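In practice the two-class limitation is handled by decomposing the problem into binary subproblems, and the probability issue by Platt scaling. A hedged sketch of both with scikit-learn (my illustration; the data are random stand-ins):

```python
# Sketch: turning a binary SVM into a multi-class, probability-emitting classifier.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(200, 65)
y = np.random.randint(0, 8, 200)

# One-vs-rest trains one binary SVM per class; probability=True adds Platt
# scaling (a logistic fit to the SVM scores), since SVM output is not probabilistic.
ovr = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
ovr.fit(X, y)
print(ovr.predict_proba(X[:3]).round(3))   # per-class posterior estimates
```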
Relevance Vector Machines (RVM)
 Based on a probabilistic Bayesian formulation of a linear model (Tipping, 2001).
 Produces a sparser solution than the SVM (i.e. fewer relevance vectors).
 Ability to use non-Mercer kernels.
 Probabilistic output.
 No need for the parameter C.
Major difference from SVM
• The selected points are anti-boundary (away from the decision boundary).
• Support vectors represent the least prototypical examples (closer to the boundary, difficult to classify).
• Relevance vectors are the most prototypical (more representative of the class).
Location of the useful training cases with SVM & RVM
Mahesh Pal and G. M. Foody, 2012, Evaluation of SVM, RVM and SMLR for accurate image classification with limited ground data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(5), 1344-1355.
Disadvantages
 Requires a large computational cost in comparison to the SVM.
 Designed for two-class problems, similar to the SVM.
 Choice of kernel.
 May have a problem of local minima.
Sparse Multinomial Logistic Regression (SMLR)
 The SMLR algorithm learns a multi-class classifier based on multinomial logistic regression.
 Uses a Laplacian prior on the weights of the linear combination of functions to enforce sparsity.
 SMLR performs feature selection and classification simultaneously.
 Somewhat closer to the RVM.
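SMLR itself is rarely packaged in mainstream libraries, but an L1-penalised multinomial logistic regression captures the same idea: the L1 penalty corresponds to the Laplacian prior and drives many weights exactly to zero, so feature selection happens during training. A sketch, with assumed toy data:

```python
# Sketch: sparsity via an L1 (Laplacian-prior-like) multinomial logistic model.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(400, 65)
y = np.random.randint(0, 8, 400)

# penalty="l1" corresponds to a Laplacian prior on the weights; the saga solver
# supports it for multinomial problems. Smaller C means stronger sparsity.
smlr_like = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
smlr_like.fit(X, y)

# Features whose weights are zero for every class were effectively deselected.
selected = np.flatnonzero(np.any(smlr_like.coef_ != 0.0, axis=0))
print("features retained:", selected.size, "of", X.shape[1])
```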
[Figure: Location of the useful training cases with SMLR; scatter plot of Band 5 (40-110) against Band 1 (70-100) for the wheat, sugar beet and oilseed rape classes.]
LOCATING USEFUL TRAINING SAMPLES
 The Mahalanobis distance between a sample and a class centroid is used.
 A small distance indicates that the sample lies close to the class centroid and so is typical of the class, while a large distance indicates that the sample is atypical.
 Can help to reduce the field work for ground truth collection, thus reducing project cost.
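A minimal sketch of this screening step (my own, with a hypothetical class-sample matrix); the pseudo-inverse guards against a singular covariance estimate:

```python
# Sketch: score training samples by Mahalanobis distance to their class centroid.
import numpy as np
from scipy.spatial.distance import mahalanobis

def atypicality(X_class):
    """Mahalanobis distance of each sample in X_class to the class centroid."""
    mu = X_class.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_class, rowvar=False))  # robust to singular cov
    return np.array([mahalanobis(x, mu, cov_inv) for x in X_class])

X_wheat = np.random.rand(100, 6)   # hypothetical samples of one class (6 bands)
d = atypicality(X_wheat)
# Small distance = typical of the class; large distance = atypical and, for a
# margin-based learner such as the SVM, potentially the more useful sample.
print("most typical sample index:", d.argmin(), " most atypical:", d.argmax())
```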
PRESENT WORK
Working with COST Action (European Cooperation in Science and Technology) TD1202, "Mapping and the citizen sensor", as a non-EU member.
1. Classification with imperfect/noisy data.
2. How SVM, RVM and SMLR work with noisy data.
3. Will be working on other classifiers: RF, ELM.
Two types of data noise: attribute noise and class noise.
We are dealing with class noise, which can arise from subjectivity, data-entry error, or inadequacy of the information used to label each class.
Possible solutions to deal with class noise include data cleaning and the detection and elimination of mislabelled training cases.
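For experiments such as those reported below, class noise can be simulated by flipping a chosen fraction of training labels to a different class; a sketch of one such injection routine (an assumed protocol, not necessarily the authors'):

```python
# Sketch: inject a given rate of class (label) noise into a training set.
import numpy as np

def add_class_noise(y, rate, n_classes, rng=None):
    """Randomly relabel approximately `rate` of the samples with a wrong class."""
    rng = np.random.default_rng(rng)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < rate
    # Draw a wrong label uniformly from the other classes (offset is never 0).
    offsets = rng.integers(1, n_classes, size=flip.sum())
    y_noisy[flip] = (y_noisy[flip] + offsets) % n_classes
    return y_noisy

y = np.random.randint(0, 8, 800)
for rate in (0.05, 0.10, 0.20, 0.40):
    y_noisy = add_class_noise(y, rate, n_classes=8, rng=0)
    print(f"{rate:.0%} requested, {np.mean(y_noisy != y):.1%} actually flipped")
```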
Accuracy with increasing class noise in the training data (the value in parentheses is the number of training samples retained by each model as relevance/support vectors or non-zero weights):

Error in data   0             5%            10%           15%           20%           25%           30%           35%           40%
RVM             88.00% (51)   88.22% (45)   87.11% (40)   87.78% (46)   87.33% (41)   87.56% (37)   86.44% (39)   85.56% (32)   84.00% (35)
SMLR            88.67% (83)   88.89% (91)   88.67% (85)   87.78% (82)   88.00% (89)   87.33% (80)   87.77% (78)   86.89% (86)   86.67% (72)
SVM             89.11% (203)  88.00% (259)  90.00% (310)  89.77% (339)  89.11% (369)  86.67% (409)  84.00% (432)  84.22% (447)  83.11% (490)
EXTREME LEARNING MACHINES (ELM)
 A neural network classifier.
 Uses one hidden layer only.
 No parameter except the number of hidden nodes.
 A kernel function can be used in place of the hidden layer by modifying the optimisation problem.
 Global solution (no local optima, unlike NN).
 Performance comparable to SVM and better than the back-propagation neural network.
 Multiclass.
 Very fast.
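The basic ELM is simple enough to write out in full: a fixed random hidden layer followed by a single least-squares solve for the output weights. A minimal sketch under assumed data shapes:

```python
# Sketch: a basic single-hidden-layer ELM with random, untrained hidden weights.
import numpy as np

class ELM:
    def __init__(self, n_hidden=200, rng=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(rng)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)   # fixed random projection + nonlinearity

    def fit(self, X, y):
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        T = np.eye(y.max() + 1)[y]            # one-hot targets
        # Only the output weights are learned, via the Moore-Penrose pseudoinverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta).argmax(axis=1)

X, y = np.random.rand(400, 65), np.random.randint(0, 8, 400)
print("training accuracy:", (ELM().fit(X, y).predict(X) == y).mean())
```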
Classification accuracy

Dataset   SVM (%)   KELM (%)
ETM+      88.37     90.33
ATM       92.50     94.06
DAIS      91.97     92.16

Computational cost

Dataset   SVM (sec)   KELM (sec)
ETM+      76.74       5.78
DAIS      40.78       1.02
ATM       1.30        0.17
Mahesh Pal, A. E. Maxwell and T. A. Warner, 2014, Kernel-based extreme learning machine for remote sensing image classification. Remote Sensing Letters.
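The kernel variant (KELM) compared above replaces the random hidden layer with a kernel matrix; the closed-form output weights beta = (I/C + K)^-1 T follow Huang's formulation. A sketch with toy data:

```python
# Sketch: kernel ELM (KELM) with a kernel matrix in place of the random hidden layer.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kelm_fit(X, y, C=100.0, gamma=0.1):
    K = rbf_kernel(X, X, gamma=gamma)
    T = np.eye(y.max() + 1)[y]                    # one-hot targets
    # Closed-form output weights: beta = (I/C + K)^-1 T
    return np.linalg.solve(np.eye(len(X)) / C + K, T)

def kelm_predict(X_train, beta, X_new, gamma=0.1):
    return (rbf_kernel(X_new, X_train, gamma=gamma) @ beta).argmax(axis=1)

X, y = np.random.rand(300, 65), np.random.randint(0, 8, 300)
beta = kelm_fit(X, y)
print("training accuracy:", (kelm_predict(X, beta, X) == y).mean())
```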
PRESENT WORK
Working on a sparse extreme learning machine (producing a sparse solution similar to the support vector machine).
Ensembles of extreme learning machines.
Also trying to understand the working of deep neural networks.
FEATURE REDUCTION
Two broad categories are feature selection and feature extraction.
Feature reduction may speed up the classification process by reducing the data set size.
May increase the predictive accuracy.
May increase the ability to understand the classification rules.
Feature selection selects a subset of the original features that maintains the information useful for separating the classes, removing redundant features.
FEATURE EXTRACTION
A number of techniques for feature extraction have been proposed, including principal components, the maximum noise fraction (MNF) transformation, and non-orthogonal techniques such as projection pursuit and independent component analysis.
MNF requires estimates of the signal and noise covariance matrices.
The different features provided by MNF are ranked by signal-to-noise ratio (the first MNF has the smallest value of the S-N ratio).
Results with DAIS data suggest that MNF may not be used effectively for dimensionality reduction.
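Of the extraction techniques listed, principal components is the easiest to demonstrate; the sketch below uses scikit-learn's PCA on an assumed pixels-by-bands matrix (MNF, which additionally needs a noise-covariance estimate, is not shown):

```python
# Sketch: feature extraction by principal components on a band-stacked image.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(5000, 65)        # hypothetical pixels x bands matrix

pca = PCA(n_components=10)          # keep the 10 highest-variance components
X_reduced = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))
# MNF differs in that it first whitens by an estimated noise covariance, so its
# components are ordered by signal-to-noise ratio rather than raw variance.
```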
Feature selection
Three approaches to feature selection are:
Filters: use a search algorithm to search through the space of possible features and evaluate each feature by using a filter such as correlation or mutual information.
Wrappers: use a search algorithm to search through the space of possible features and evaluate each subset by using a classification algorithm.
Embedded: some classification processes, such as the random forest or multinomial logistic regression, produce a ranked list of features during classification.
Filters
A large number of filter-based approaches are available in the literature. Some used with hyperspectral data are:
1. Correlation-based feature selection (CFS)
2. Minimum-Redundancy-Maximum-Relevance (mRMR)
3. Entropy
4. Fuzzy entropy
5. Signal-to-noise ratio
6. RELIEF
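As an example of the filter idea, mutual information between each band and the class label ranks features with no classifier in the loop; a sketch with assumed data:

```python
# Sketch: filter-style feature ranking by mutual information with the class label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.random.rand(800, 65)
y = np.random.randint(0, 8, 800)

mi = mutual_info_classif(X, y, random_state=0)   # one relevance score per band
ranking = np.argsort(mi)[::-1]                   # highest-relevance bands first
print("top 10 bands by mutual information:", ranking[:10])
```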
WRAPPER APPROACH
 The SVM-RFE approach uses the SVM as its base classifier.
 SVM-RFE uses the objective function ½||w||² as a feature ranking criterion to produce a list of features ordered by their discriminatory ability.
 The feature with the smallest ranking score is eliminated.
 SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from subsets of features, in order to derive a list of all features in rank order of value.
 A major drawback of wrapper methods is their high computational requirement.
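scikit-learn's RFE reproduces this backward elimination when given a linear-kernel SVM, whose squared weights supply the ranking criterion; a hedged sketch (toy data, with 13 retained features chosen to echo the results table later in the talk):

```python
# Sketch: SVM-RFE, recursively dropping the band with the smallest ranking score.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

X = np.random.rand(400, 65)
y = np.random.randint(0, 8, 400)

# A linear kernel is required so each feature has an explicit weight w_i.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=13, step=1)
rfe.fit(X, y)
print("selected bands:", np.flatnonzero(rfe.support_))
```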
EMBEDDED APPROACH
During the classification process some algorithms produce a ranked list of all features.
For example, two approaches based on the random forest and multinomial logistic regression classifiers can be used.
In contrast to the filter and wrapper approaches, the search for an optimal feature subset by an embedded approach is built into the classification algorithm itself.
Classification and the feature selection process cannot be separated.
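A sketch of the embedded route using a random forest, whose impurity-based importances fall out of the training run itself (toy data; no separate feature search is performed):

```python
# Sketch: embedded feature ranking from a random forest fit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(800, 65)
y = np.random.randint(0, 8, 800)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]   # by mean impurity decrease
print("top 10 bands:", ranking[:10])
```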
Data Set
1. DAIS 7915 sensor, flown by the German Space Agency on 29 June 2000.
2. The sensor acquires information in 79 bands at a spatial resolution of 5 m in the wavelength range 0.502 - 12.278 µm.
3. Seven features located in the mid- and thermal-infrared region and seven features from the spectral region 0.502 - 2.395 µm were removed due to striping noise.
4. An area of 512 pixels by 512 pixels with 65 features covering the test site was used.
Training and test data
1. Random sampling was used to collect training and test data using a ground reference image.
2. Eight land cover classes: wheat, water, salt lake, hydrophytic vegetation, vineyards, bare soil, pasture and built-up land.
3. A total of 800 training pixels and 3800 test pixels were used.
Feature selection algorithm        Number of used features   Accuracy (%)
None                               65                        91.76
Fuzzy entropy                      14                        91.68
Entropy                            17                        91.61
Signal-to-noise ratio              20                        91.68
Relief                             20                        88.61
SVM-RFE                            13                        91.89
mRMR                               37                        91.84
CFS                                17                        91.84
Random forest                      21                        92.08
Multinomial logistic regression    15                        92.76
PRESENT WORK
 How noise affects feature selection.
 Ensembles of feature selection methods.
 Stability of feature selection algorithms for hyperspectral data.