Hyperspectral Data Classification
DESCRIPTION
Discusses some methods to classify HSI data.
TRANSCRIPT
CLASSIFICATION AND FEATURE SELECTION USING REMOTE SENSING DATA
MAHESH PAL
NATIONAL INSTITUTE OF TECHNOLOGY, KURUKSHETRA, INDIA
Remote Sensing Data
Panchromatic: one band.
Multispectral: many bands (these systems use sensors that detect radiation in a small number of broad wavelength bands).
Hyperspectral: large numbers of contiguous bands. A hyperspectral sensor collects many very narrow, contiguous spectral bands throughout the visible, near-infrared, mid-infrared and thermal infrared portions of the electromagnetic spectrum.
Landsat 7 ETM+ data (Multispectral)

Band number    Spectral range (µm)   Ground resolution (m)
1              0.450 - 0.515         30
2              0.525 - 0.605         30
3              0.630 - 0.690         30
4              0.750 - 0.900         30
5              1.550 - 1.750         30
6              10.40 - 12.50         60
7              2.090 - 2.350         30
Panchromatic   0.520 - 0.900         15

Between 0.45 - 2.35 µm: a total of six bands.
Images of the La Mancha (Spain) area acquired by the ETM+ sensor (30 m resolution).
The DAIS (Digital Airborne Imaging Spectrometer) Hyperspectral Sensor

Spectrometer   Bands (79)   Wavelength range (µm)
VIS/NIR        32           0.50 - 1.05
SWIR I         8            1.50 - 1.80
SWIR II        32           1.90 - 2.50
MIR            1            3.00 - 5.00
TIR            6            8.70 - 12.50

Between 0.502 - 2.395 µm: a total of 72 bands. Continuous bands at 10-45 nm bandwidth.
Images of the La Mancha (Spain) area from the DAIS hyperspectral sensor (5 m resolution).
Hyperspectral Imaging, Imaging Spectrometry, Imaging Spectroscopy
• Spectroscopy is the study of electromagnetic radiation.
• Imaging spectroscopy has been used in the laboratory by physicists and chemists for over 100 years.
• Imaging spectroscopy has many names in the remote sensing community, including imaging spectrometry and hyperspectral imaging.
• It acquires images in a large number of narrow, contiguous spectral bands, enabling the extraction of reflectance spectra at the pixel scale that can be compared directly with similar spectra measured in the field.
Importance of a Hyperspectral Sensor
• Provides spectral reflectance data in hundreds of bands rather than the few bands of multispectral data
  - Allows far more specific analysis of land cover
  - The emissivity levels of each band can be combined to form a spectral reflectance curve
• These sensors provide information in:
  - Visible region: vegetation, chlorophyll, sediments
  - Near infrared: atmospheric properties, cloud cover, vegetation and land cover transformation
  - Thermal infrared: sea surface temperature, forest fires, volcanoes, cloud height, total ozone
CLASSIFICATION
Land cover classification has been a major research area involving the use of remote sensing images.
The image classification process involves assigning pixels to classes according to the characteristics of the objects or materials.
It is a major input to GIS-based studies.
Several approaches are used for land cover classification.
CLASSIFICATION ALGORITHMS
Predictive accuracy
Computational cost
  o time to construct the model
  o time to use the model
Robustness
  o handling noise and missing values
Interpretability
  o understanding the insight provided by the model
Hyperspectral data classification
1. Provides greater detail on the spectral variation of targets than conventional multispectral systems.
2. The availability of large amounts of data represents a challenge to classification analyses.
3. Each spectral waveband used in the classification process should add an independent set of information.
4. However, the features are highly correlated, suggesting a degree of redundancy in the available information, which can have a negative impact on classification accuracy.
5. Requires a large pool of training data, which is quite costly to collect.
Various approaches for the appropriate classification of high dimensional data
1. Adoption of a classifier that is relatively insensitive to the Hughes effect (Vapnik, 1995).
2. Use of methods that effectively increase training set size, i.e. semi-supervised classification (Chi and Bruzzone, 2005), active learning, and use of unlabelled data (Shahshahani and Landgrebe, 1994).
3. Use of some form of dimensionality reduction procedure prior to the classification analysis.
[Diagram: Training samples feed a learning algorithm, which produces a model/function, also called the hypothesis; testing samples pass through the model to produce output values.]
The hypothesis can be considered a machine that provides predictions for test data.
SUPPORT VECTOR MACHINES (SVM)
Basic theory: 1965. Margin-based classifier: 1992. Support vector network: 1995.
Since 1998 the support vector network has been known as the Support Vector Machine (SVM), used as an alternative to the neural network.
First application in remote sensing: Gualtieri and Cromp (1998), for hyperspectral image classification.
SVM: structural risk minimisation (SRM), from the statistical learning theory proposed in the 1960s by Vapnik and co-workers.
SRM: minimise the probability of misclassifying unknown data drawn randomly.
Neural network: empirical risk minimisation, i.e. minimise the misclassification error on the training data.
SVM
Map the data from the original input feature space to a very high dimensional feature space.
The data becomes linearly separable, but the problem becomes computationally difficult.
A kernel function allows the SVM to work in the feature space without knowing the mapping or the dimensionality of the feature space.
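As a concrete illustration of the kernel idea, here is a minimal sketch (my own, not from the talk) of training an RBF-kernel SVM on hyperspectral pixels with scikit-learn; the data shapes and parameter values are hypothetical stand-ins.

```python
# Minimal sketch: RBF-kernel SVM on hyperspectral pixels (assumed data shapes).
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical data: n_pixels x n_bands reflectance matrix and integer class labels.
X = np.random.rand(800, 65)          # stand-in for real training pixels
y = np.random.randint(0, 8, 800)     # eight land cover classes

# The RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2) lets the SVM operate in a
# high-dimensional feature space without ever computing the mapping explicitly.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)
print("support vectors per class:", clf.named_steps["svc"].n_support_)
```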
Advantages
 Margin theory suggests no effect of the dimensionality of the input space.
 Uses a small number of the training data (called support vectors).
 QP solution, so no local minima.
 Not many user-defined parameters.
But with real data:
[Figure: Classification accuracy (%) versus number of features (5 to 65) for training set sizes of 8, 15, 25, 50, 75 and 100 pixels per class; accuracies span roughly 55% to 95%.]
Mahesh Pal and Giles M. Foody, 2010, Feature selection for classification of hyperspectral data by SVM. IEEE Transactions on Geoscience and Remote Sensing, 48(5), 2297-2306.
Disadvantages
 Designed for two-class problems; different methods are needed to create a multi-class classifier.
 Choice of kernel function and kernel-specific parameters.
 The kernel function should satisfy Mercer's theorem.
 Choice of the regularisation parameter C.
 Output is not naturally probabilistic.
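In practice the two-class limitation is handled by decomposing the problem into binary subproblems, and the probability issue by Platt scaling. A hedged sketch of both with scikit-learn (my illustration; the data are random stand-ins):

```python
# Sketch: turning a binary SVM into a multi-class, probability-emitting classifier.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(200, 65)
y = np.random.randint(0, 8, 200)

# One-vs-rest trains one binary SVM per class; probability=True adds Platt
# scaling (a logistic fit to the SVM scores), since SVM output is not probabilistic.
ovr = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
ovr.fit(X, y)
print(ovr.predict_proba(X[:3]).round(3))   # per-class posterior estimates
```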
Relevance Vector Machines (RVM)
 Based on a probabilistic Bayesian formulation of a linear model (Tipping, 2001).
 Produces a sparser solution than the SVM (i.e. fewer relevance vectors).
 Ability to use non-Mercer kernels.
 Probabilistic output.
 No need for the parameter C.
Major difference from SVM
• The selected points are anti-boundary (away from the decision boundary).
• Support vectors represent the least prototypical examples (closer to the boundary, difficult to classify).
• Relevance vectors are the most prototypical (more representative of the class).
Location of the useful training cases with SVM & RVM
Mahesh Pal and G. M. Foody, 2012, Evaluation of SVM, RVM and SMLR for accurate image classification with limited ground data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(5), 1344-1355.
Disadvantages
 Requires a large computational cost in comparison to the SVM.
 Designed for two-class problems, similar to the SVM.
 Choice of kernel.
 May have a problem of local minima.
Sparse Multinomial Logistic Regression (SMLR)
 The SMLR algorithm learns a multi-class classifier based on multinomial logistic regression.
 Uses a Laplacian prior on the weights of the linear combination of functions to enforce sparsity.
 SMLR performs feature selection and classification simultaneously.
 Somewhat closer to the RVM.
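SMLR itself is rarely packaged in mainstream libraries, but an L1-penalised multinomial logistic regression captures the same idea: the L1 penalty corresponds to the Laplacian prior and drives many weights exactly to zero, so feature selection happens during training. A sketch, with assumed toy data:

```python
# Sketch: sparsity via an L1 (Laplacian-prior-like) multinomial logistic model.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(400, 65)
y = np.random.randint(0, 8, 400)

# penalty="l1" corresponds to a Laplacian prior on the weights; the saga solver
# supports it for multinomial problems. Smaller C means stronger sparsity.
smlr_like = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
smlr_like.fit(X, y)

# Features whose weights are zero for every class were effectively deselected.
selected = np.flatnonzero(np.any(smlr_like.coef_ != 0.0, axis=0))
print("features retained:", selected.size, "of", X.shape[1])
```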
[Figure: Location of the useful training cases with SMLR; scatter plot of Band 5 (40-110) against Band 1 (70-100) for the wheat, sugar beet and oilseed rape classes.]
LOCATING USEFUL TRAINING SAMPLES
 The Mahalanobis distance between a sample and a class centroid is used.
 A small distance indicates that the sample lies close to the class centroid and so is typical of the class, while a large distance indicates that the sample is atypical.
 Can help to reduce the field work for ground truth collection, thus reducing project cost.
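A minimal sketch of this screening step (my own, with a hypothetical class-sample matrix); the pseudo-inverse guards against a singular covariance estimate:

```python
# Sketch: score training samples by Mahalanobis distance to their class centroid.
import numpy as np
from scipy.spatial.distance import mahalanobis

def atypicality(X_class):
    """Mahalanobis distance of each sample in X_class to the class centroid."""
    mu = X_class.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_class, rowvar=False))  # robust to singular cov
    return np.array([mahalanobis(x, mu, cov_inv) for x in X_class])

X_wheat = np.random.rand(100, 6)   # hypothetical samples of one class (6 bands)
d = atypicality(X_wheat)
# Small distance = typical of the class; large distance = atypical and, for a
# margin-based learner such as the SVM, potentially the more useful sample.
print("most typical sample index:", d.argmin(), " most atypical:", d.argmax())
```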
PRESENT WORK
Working with COST Action (European Cooperation in Science and Technology) TD1202, "Mapping and the citizen sensor", as a non-EU member.
1. Classification with imperfect/noisy data.
2. How SVM, RVM and SMLR work with noisy data.
3. Will be working on other classifiers: RF, ELM.
Two types of data noise: attribute noise and class noise.
We are dealing with class noise, which can arise from subjectivity, data-entry error, or inadequacy of the information used to label each class.
Possible solutions to deal with class noise include data cleaning and the detection and elimination of mislabelled training cases.
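For experiments such as those reported below, class noise can be simulated by flipping a chosen fraction of training labels to a different class; a sketch of one such injection routine (an assumed protocol, not necessarily the authors'):

```python
# Sketch: inject a given rate of class (label) noise into a training set.
import numpy as np

def add_class_noise(y, rate, n_classes, rng=None):
    """Randomly relabel approximately `rate` of the samples with a wrong class."""
    rng = np.random.default_rng(rng)
    y_noisy = y.copy()
    flip = rng.random(len(y)) < rate
    # Draw a wrong label uniformly from the other classes (offset is never 0).
    offsets = rng.integers(1, n_classes, size=flip.sum())
    y_noisy[flip] = (y_noisy[flip] + offsets) % n_classes
    return y_noisy

y = np.random.randint(0, 8, 800)
for rate in (0.05, 0.10, 0.20, 0.40):
    y_noisy = add_class_noise(y, rate, n_classes=8, rng=0)
    print(f"{rate:.0%} requested, {np.mean(y_noisy != y):.1%} actually flipped")
```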
Accuracy with increasing class noise in the training data (the value in parentheses is the number of training samples retained by each model as relevance/support vectors or non-zero weights):

Error in data   0             5%            10%           15%           20%           25%           30%           35%           40%
RVM             88.00% (51)   88.22% (45)   87.11% (40)   87.78% (46)   87.33% (41)   87.56% (37)   86.44% (39)   85.56% (32)   84.00% (35)
SMLR            88.67% (83)   88.89% (91)   88.67% (85)   87.78% (82)   88.00% (89)   87.33% (80)   87.77% (78)   86.89% (86)   86.67% (72)
SVM             89.11% (203)  88.00% (259)  90.00% (310)  89.77% (339)  89.11% (369)  86.67% (409)  84.00% (432)  84.22% (447)  83.11% (490)
EXTREME LEARNING MACHINES (ELM)
 A neural network classifier.
 Uses one hidden layer only.
 No parameter except the number of hidden nodes.
 A kernel function can be used in place of the hidden layer by modifying the optimisation problem.
 Global solution (no local optima, unlike NN).
 Performance comparable to SVM and better than the back-propagation neural network.
 Multiclass.
 Very fast.
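The basic ELM is simple enough to write out in full: a fixed random hidden layer followed by a single least-squares solve for the output weights. A minimal sketch under assumed data shapes:

```python
# Sketch: a basic single-hidden-layer ELM with random, untrained hidden weights.
import numpy as np

class ELM:
    def __init__(self, n_hidden=200, rng=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(rng)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)   # fixed random projection + nonlinearity

    def fit(self, X, y):
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        T = np.eye(y.max() + 1)[y]            # one-hot targets
        # Only the output weights are learned, via the Moore-Penrose pseudoinverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta).argmax(axis=1)

X, y = np.random.rand(400, 65), np.random.randint(0, 8, 400)
print("training accuracy:", (ELM().fit(X, y).predict(X) == y).mean())
```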
Classification accuracy

Dataset   SVM (%)   KELM (%)
ETM+      88.37     90.33
ATM       92.50     94.06
DAIS      91.97     92.16

Computational cost

Dataset   SVM (sec)   KELM (sec)
ETM+      76.74       5.78
DAIS      40.78       1.02
ATM       1.30        0.17
Mahesh Pal, A. E. Maxwell and T. A. Warner, 2014, Kernel-based extreme learning machine for remote sensing image classification. Remote Sensing Letters.
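The kernel variant (KELM) compared above replaces the random hidden layer with a kernel matrix; the closed-form output weights beta = (I/C + K)^-1 T follow Huang's formulation. A sketch with toy data:

```python
# Sketch: kernel ELM (KELM) with a kernel matrix in place of the random hidden layer.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kelm_fit(X, y, C=100.0, gamma=0.1):
    K = rbf_kernel(X, X, gamma=gamma)
    T = np.eye(y.max() + 1)[y]                    # one-hot targets
    # Closed-form output weights: beta = (I/C + K)^-1 T
    return np.linalg.solve(np.eye(len(X)) / C + K, T)

def kelm_predict(X_train, beta, X_new, gamma=0.1):
    return (rbf_kernel(X_new, X_train, gamma=gamma) @ beta).argmax(axis=1)

X, y = np.random.rand(300, 65), np.random.randint(0, 8, 300)
beta = kelm_fit(X, y)
print("training accuracy:", (kelm_predict(X, beta, X) == y).mean())
```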
PRESENT WORK
Working on a sparse extreme learning machine (producing a sparse solution similar to the support vector machine).
Ensembles of extreme learning machines.
Also trying to understand the working of deep neural networks.
FEATURE REDUCTION
Two broad categories are feature selection and feature extraction.
Feature reduction may speed up the classification process by reducing the data set size.
May increase the predictive accuracy.
May increase the ability to understand the classification rules.
Feature selection selects a subset of the original features that maintains the information useful for separating the classes, removing redundant features.
FEATURE EXTRACTION
A number of techniques for feature extraction have been proposed, including principal components, the maximum noise fraction (MNF) transformation, and non-orthogonal techniques such as projection pursuit and independent component analysis.
MNF requires estimates of the signal and noise covariance matrices.
The different features provided by MNF are ranked by signal-to-noise ratio (the first MNF has the smallest value of the S-N ratio).
Results with DAIS data suggest that MNF may not be used effectively for dimensionality reduction.
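Of the extraction techniques listed, principal components is the easiest to demonstrate; the sketch below uses scikit-learn's PCA on an assumed pixels-by-bands matrix (MNF, which additionally needs a noise-covariance estimate, is not shown):

```python
# Sketch: feature extraction by principal components on a band-stacked image.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(5000, 65)        # hypothetical pixels x bands matrix

pca = PCA(n_components=10)          # keep the 10 highest-variance components
X_reduced = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))
# MNF differs in that it first whitens by an estimated noise covariance, so its
# components are ordered by signal-to-noise ratio rather than raw variance.
```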
Feature selection
Three approaches to feature selection are:
Filters: use a search algorithm to search through the space of possible features and evaluate each feature by using a filter such as correlation or mutual information.
Wrappers: use a search algorithm to search through the space of possible features and evaluate each subset by using a classification algorithm.
Embedded: some classification processes, such as the random forest or multinomial logistic regression, produce a ranked list of features during classification.
Filters
A large number of filter-based approaches are available in the literature. Some used with hyperspectral data are:
1. Correlation-based feature selection (CFS)
2. Minimum-Redundancy-Maximum-Relevance (mRMR)
3. Entropy
4. Fuzzy entropy
5. Signal-to-noise ratio
6. RELIEF
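As an example of the filter idea, mutual information between each band and the class label ranks features with no classifier in the loop; a sketch with assumed data:

```python
# Sketch: filter-style feature ranking by mutual information with the class label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.random.rand(800, 65)
y = np.random.randint(0, 8, 800)

mi = mutual_info_classif(X, y, random_state=0)   # one relevance score per band
ranking = np.argsort(mi)[::-1]                   # highest-relevance bands first
print("top 10 bands by mutual information:", ranking[:10])
```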
WRAPPER APPROACH
 The SVM-RFE approach uses the SVM as its base classifier.
 SVM-RFE uses the objective function ½||w||² as a feature ranking criterion to produce a list of features ordered by their discriminatory ability.
 The feature with the smallest ranking score is eliminated.
 SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from subsets of features, in order to derive a list of all features in rank order of value.
 A major drawback of wrapper methods is their high computational requirement.
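scikit-learn's RFE reproduces this backward elimination when given a linear-kernel SVM, whose squared weights supply the ranking criterion; a hedged sketch (toy data, with 13 retained features chosen to echo the results table later in the talk):

```python
# Sketch: SVM-RFE, recursively dropping the band with the smallest ranking score.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

X = np.random.rand(400, 65)
y = np.random.randint(0, 8, 400)

# A linear kernel is required so each feature has an explicit weight w_i.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=13, step=1)
rfe.fit(X, y)
print("selected bands:", np.flatnonzero(rfe.support_))
```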
EMBEDDED APPROACH
During the classification process some algorithms produce a ranked list of all features.
For example, two approaches based on the random forest and multinomial logistic regression classifiers can be used.
In contrast to the filter and wrapper approaches, the search for an optimal feature subset by an embedded approach is built into the classification algorithm itself.
Classification and the feature selection process cannot be separated.
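A sketch of the embedded route using a random forest, whose impurity-based importances fall out of the training run itself (toy data; no separate feature search is performed):

```python
# Sketch: embedded feature ranking from a random forest fit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(800, 65)
y = np.random.randint(0, 8, 800)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]   # by mean impurity decrease
print("top 10 bands:", ranking[:10])
```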
Data Set
1. DAIS 7915 sensor, flown by the German Space Agency on 29 June 2000.
2. The sensor acquires information in 79 bands at a spatial resolution of 5 m in the wavelength range 0.502 - 12.278 µm.
3. Seven features located in the mid- and thermal-infrared region and seven features from the spectral region 0.502 - 2.395 µm were removed due to striping noise.
4. An area of 512 pixels by 512 pixels with 65 features covering the test site was used.
Training and test data
1. Random sampling was used to collect training and test data using a ground reference image.
2. Eight land cover classes: wheat, water, salt lake, hydrophytic vegetation, vineyards, bare soil, pasture and built-up land.
3. A total of 800 training pixels and 3800 test pixels were used.
Feature selection algorithm        Number of used features   Accuracy (%)
None                               65                        91.76
Fuzzy entropy                      14                        91.68
Entropy                            17                        91.61
Signal-to-noise ratio              20                        91.68
Relief                             20                        88.61
SVM-RFE                            13                        91.89
mRMR                               37                        91.84
CFS                                17                        91.84
Random forest                      21                        92.08
Multinomial logistic regression    15                        92.76
PRESENT WORK
 How noise affects feature selection.
 Ensembles of feature selection methods.
 Stability of feature selection algorithms for hyperspectral data.