transforms and other prestidigitations—or new twists in imputation
Post on 15-Jan-2016
Albert R. Stage
Imputation:
• To use what we know “everywhere”, which may be useful but is not very interesting in itself: the X’s,
• To fill in detail that is prohibitive to obtain except on a sample: the Y’s,
• By finding surrogates based on similarity of the X’s.
Topics
• Measures of similarity (a few in particular)
• Alternative MSN distance function leading to some improved estimates
• Transformations that improve resolution
– On the X-side (known everywhere)
– On the Y-side (known for sample only)
Distance measures for interval and ratio scale variables (Podani 2000)
• Euclidean/Mahalanobis
• Chord
• Angular
• Geodesic
• Manhattan
• Canberra
• Clark
• Bray-Curtis
• Marczewski-Steinhaus
• 1-Kulczynski
• Pinkham-Pearson
• Gleason
• Ellenberg
• Pandeya
• Chi-square
• 1-Correlation
• 1-similarity ratio
• Kendall difference
• Faith intermediate
• Uppsala coefficient
Distance measures for binary variables (Podani 2000)
Symmetric for 0/1:
• Simple matching
• Euclidean
• Rogers-Tanimoto
• Sokal-Sneath
• Anderberg I
• Anderberg II
• Correlation
• Yule I
• Yule II
• Hamann
Asymmetric for 0/1:
• Baroni-Urbani-Buser I
• Baroni-Urbani-Buser II
• Russell-Rao
• Faith I
• Faith II
Ignore 0:
• Jaccard
• Sorenson
• Chord
• Kulczynski
• Sokal-Sneath II
• Mountford
Distance function in matrix notation
$$D^2_{iu} = \min_i \left[ (X_i - X_u)\, W\, (X_i - X_u)' \right]$$
– Where, for
• Euclidean distance: $W = I$ (identity matrix)
• Mahalanobis distance: $W$ = inverse covariance matrix
• MSN (1995): $W = \Gamma \Lambda^2 \Gamma'$ with:
$\Gamma$ = matrix of coefficients of canonical variates
$\Lambda$ = diagonal matrix of canonical correlations
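The quadratic-form distance above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the five reference observations, and the target are all hypothetical, and only the Euclidean and Mahalanobis choices of W are shown.

```python
import numpy as np

def squared_distances(x_target, X_ref, W):
    """Return (X_i - X_u) W (X_i - X_u)' for every reference row X_i."""
    diff = X_ref - x_target                      # one row per reference observation
    return np.einsum("ij,jk,ik->i", diff, W, diff)

def nearest_reference(x_target, X_ref, W):
    """Index and squared distance of the most similar reference observation."""
    d2 = squared_distances(x_target, X_ref, W)
    i = int(np.argmin(d2))
    return i, float(d2[i])

# Hypothetical data: 5 reference observations, 2 X-variables, 1 target.
X_ref = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 50.0], [5.0, 40.0]])
x_u = np.array([2.5, 22.0])

W_euclid = np.eye(2)                                   # W = I
W_mahal = np.linalg.inv(np.cov(X_ref, rowvar=False))   # W = inverse covariance

print(nearest_reference(x_u, X_ref, W_euclid))
print(nearest_reference(x_u, X_ref, W_mahal))
```

Any other symmetric weight matrix can be passed as W in the same way; the choice of W is the only thing that changes between the distance functions listed.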
Why Weight with Canonical Analysis?
• Not degraded by non-informative X’s
[Figure: Effect of adding 2 random X's (number of original X's = 21). Y-axis: RMSE relative to Mahalanobis. Series: Mahalanobis, MSN.]
Why Weight with Canonical Analysis?
• Not affected by non-informative Y’s if number of canonical pairs is determined by test of significance on rank.
[Figure: Effect of adding 2 random Y's (number of original Y's = 15). Y-axis: RMSE relative to Mahalanobis. Series: MSN0, MSN0 with 2 random Y's, MSN1, MSN1 with 2 random Y's.]
Comparison of MSN Distance Functions
• Moeur and Stage 1995
– Assumes Y’s are “true”
– Searches for closest linear combination of Y’s
– Set of near neighbors sensitive to lower order canonical correlations
• Stage 2003
– Assumes Y’s include measurement error
– Searches for closest linear combination of predicted Y’s
– Set of near neighbors less sensitive to random elements “swept” into lower order canonical correlations
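New regression alternative:
$$d_{ij}^2 = (X_i - X_j)\, \Gamma \Lambda \left[ (I - \Lambda^2) \right]^{-1} \Lambda \Gamma' \, (X_i - X_j)'$$
$\Lambda$ is the diagonal matrix of canonical correlations for the k retained canonical pairs.
[Equation residue: a diagonal matrix for $W^{1/2}$ with nonzero entries for the first k canonical pairs and zeros elsewhere, where k is set from the s available pairs using the PROPVAR threshold.]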
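A minimal sketch of how the two canonical-correlation weight matrices could be built with NumPy. It assumes the 1995 form weights each canonical pair by its squared canonical correlation and the regression alternative by the squared correlation divided by one minus the squared correlation. The function names and the simulated data are hypothetical, and this is not the MSN program's own code.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def canonical_weights(X, Y, k):
    """Return (Gamma, lam, W_1995, W_regression) from the first k canonical pairs."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    Sxx_ih = inv_sqrt(Sxx)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, lam, _ = np.linalg.svd(Sxx_ih @ Sxy @ inv_sqrt(Syy), full_matrices=False)
    Gamma, lam = (Sxx_ih @ U)[:, :k], lam[:k]      # X-side canonical coefficients
    W_1995 = Gamma @ np.diag(lam**2) @ Gamma.T
    W_regression = Gamma @ np.diag(lam**2 / (1.0 - lam**2)) @ Gamma.T
    return Gamma, lam, W_1995, W_regression

# Simulated data (hypothetical): 4 X-variables, 3 Y-variables, 200 observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
B = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.3]])
Y = X[:, :2] @ B + rng.normal(size=(200, 3))

Gamma, lam, W_old, W_new = canonical_weights(X, Y, k=2)
print(lam)
```

With k = 1 the two matrices differ only by a scalar factor, so the ranking of neighbors is the same under either form.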
Effect of change:
• No change if only first canonical pair is used.
• Regression alternative gives more relative weight to higher correlated pairs.
• Effects on root-mean-square error of imputation are mixed, e.g. in the following three data sets:
Statistics for three data sets
                        Utah    Tally Lake    User’s Guide
Canonical pairs (s)        9             8               7
Number of Y’s             15             8              17
Number of X’s (p)         12            20               7
Number of obs. (n)      1076           847             197
n/(p*s+s)               13.3          5.04            3.52
Change in Relative Weights Depends on $\lambda^2$
                        Utah                 Tally Lake              User’s Guide
Canonical pair   $\lambda^2$  Rel. wgt.   $\lambda^2$  Rel. wgt.   $\lambda^2$  Rel. wgt.
                              New/old                  New/old                  New/old
1                   0.465       1.00         0.626       1.00         0.691       1.00
2                   0.159       0.64         0.348       0.57         0.454       0.57
3                   0.125       0.61         0.327       0.56         0.247       0.41
4                   0.042       0.56         0.227       0.49         0.219       0.40
Total               0.863                    1.861                    1.823
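The relative-weight column can be checked with a short calculation. The sketch assumes the old weight for a canonical pair is its squared canonical correlation v and the new weight is v / (1 - v), with the ratio scaled so the first pair equals 1. The function name is hypothetical. For the Utah squared correlations this gives 1.00, 0.64, 0.61, 0.56, matching the table.

```python
def relative_weight_new_over_old(lam_sq):
    """Ratio of new to old weight per canonical pair, scaled so the first pair is 1."""
    ratios = [1.0 / (1.0 - v) for v in lam_sq]   # (v / (1 - v)) / v
    return [r / ratios[0] for r in ratios]

# Squared canonical correlations for the first four Utah pairs, from the table.
utah = [0.465, 0.159, 0.125, 0.042]
for v, r in zip(utah, relative_weight_new_over_old(utah)):
    print(f"{v:.3f}  {r:.2f}")
```

Because the ratio is (1 - v_1) / (1 - v_k), the down-weighting of later pairs is strongest when the first squared correlation is large.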
[Figure: three panels, each showing proportional change in Msqr (New/Old) for Prop. Var = 0.99 and Prop. Var = 0.90. Panels: Utah FIA Data; Tally Lake, Montana; MSN User's Guide Example.]
Transforming X-variables
• To predict discrete classes of modal species composition (MSC) with Euclidean or Mahalanobis distance.
• To predict continuous variables of species composition
Euclidean vs. Cosine (Spectral angle)
[Figure: Target Obs., Ref. A and Ref. B plotted on Variable 1 vs. Variable 2, contrasting Euclidean distance with spectral angle.]
Euclidean distance function with cosine transformation
$$\cos(a_{ij}) = \frac{\sum_{k=1}^{p} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{p} x_{ik}^2 \, \sum_{k=1}^{p} x_{jk}^2}}$$
Let: $Z_i = X_i / \sqrt{X_i' X_i}$
$$d_{ij}^2 = (Z_i - Z_j)'\, I\, (Z_i - Z_j) = 2\,(1 - \cos(a_{ij}))$$
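The identity between Euclidean distance on unit-length vectors and the spectral angle can be verified numerically. The following is a small sketch with made-up values; the function name and the three observations are hypothetical.

```python
import numpy as np

def cosine_transform(X):
    """Scale each row to unit length: Z_i = X_i / sqrt(X_i' X_i)."""
    X = np.asarray(X, dtype=float)
    return X / np.sqrt((X * X).sum(axis=1, keepdims=True))

# Hypothetical observations; row 1 is row 0 multiplied by 3 (same direction).
X = np.array([[10.0, 20.0, 5.0],
              [30.0, 60.0, 15.0],
              [20.0, 5.0, 40.0]])

Z = cosine_transform(X)
cos_a = Z @ Z.T                                            # cos(a_ij) for all pairs
d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)    # squared Euclidean on Z

print(np.allclose(d2, 2.0 * (1.0 - cos_a)))
```

Rows 0 and 1 differ only in magnitude, so their distance after the transformation is zero even though their untransformed Euclidean distance is large.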
Effect of using cosine transformation of TM data on classification accuracy*
Attribute (distance)                          Untransformed   Cosine trans.
Plant Assoc. Grp. (Oregon)** (Mahalanobis)            0.340           0.363
Modal Spp. Comp. (Oregon)** (Mahalanobis)             0.276           0.335
Modal Spp. Comp. (Minn.)*** (Euclidean)               0.320           0.328
* Kappa statistics  ** TM data  *** TM+ Enhanced data
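For readers unfamiliar with the accuracy measure, here is a brief sketch of a kappa calculation from a confusion matrix. It assumes the statistic reported is Cohen's kappa, which the slides do not state explicitly; the function name and the counts are hypothetical.

```python
import numpy as np

def cohen_kappa(confusion):
    """Cohen's kappa: agreement beyond chance for a square confusion matrix."""
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    p_observed = np.trace(c) / n
    # Chance agreement from the row and column marginal totals.
    p_expected = (c.sum(axis=0) * c.sum(axis=1)).sum() / n**2
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical counts: rows are observed classes, columns are imputed classes.
confusion = [[20, 5],
             [10, 15]]
print(cohen_kappa(confusion))
```

A value of 0 means agreement no better than chance and 1 means perfect agreement.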
Transforming the Y-variables
• Variance considerations: want homogeneity
• And a logical functional form for Y = f(X)
– Transformations of species composition
• Logarithm of species basal area
• Percent basal area by species
• Cosine spectral angle
• Logistic
– Evaluated by predicting discrete Plant Association Group (PAG), Users’ Guide example data (Oregon)
[Figure: Proportion of Species A, Species B (0 to 1) against Elevation (0 to 120). Series: Spp A north, Spp A south, Spp B north, Spp B south.]
Composition transformations:
• Logistic: $\ln[(\text{Total BA} - \text{spp BA}) / \text{spp BA}] = \ln(\text{Total BA} - \text{spp BA}) - \ln(\text{spp BA})$
Represented in MSN by two separate variables.
• Cosine Spectral Angle: $\text{spp BA} / \sqrt{\sum (\text{spp BA})^2}$
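The composition transformations can be written out for a single plot as follows. This is a sketch only: the function name and basal-area values are hypothetical, and it assumes every species has positive basal area and that at least two species are present, since the logarithms are undefined otherwise.

```python
import numpy as np

def composition_transforms(ba):
    """Transform per-species basal area (BA) for one plot; all BA must be > 0."""
    ba = np.asarray(ba, dtype=float)
    total = ba.sum()
    return {
        "ln_ba": np.log(ba),                                  # logarithm of species BA
        "pct_ba": 100.0 * ba / total,                         # percent BA by species
        "cosine": ba / np.sqrt((ba**2).sum()),                # cosine spectral angle
        "logistic_terms": (np.log(total - ba), np.log(ba)),   # the two separate variables
        "logistic": np.log(total - ba) - np.log(ba),          # their difference
    }

# Hypothetical plot with three species.
out = composition_transforms([30.0, 50.0, 20.0])
for name in ("pct_ba", "cosine", "logistic"):
    print(name, np.round(out[name], 3))
```

Keeping the two logistic terms as separate variables lets the canonical analysis weight them independently instead of forcing a coefficient of one on their difference.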
[Figure: Predicting Plant Assoc. Grp., Users' Guide data (Std Error Kappa = 0.06). Y-axis: Kappa statistic (0 to 0.5). X-axis: Transformations of species volumes (Mahal, ln BA, BA%, Cos trans, Logistic).]
[Figure: Species volumes transformed to cosine of spectral angle, Tally Lake, Montana. Y-axis: RMSE ln(sp vol) relative to Mahalanobis. X-axis (Variables): T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover. Series: cosine transform of spp vol.]
[Figure: Augmented by two "instrumental" variables. Y-axis: RMSE ln(sp vol) relative to Mahalanobis. X-axis (Variables): T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover. Series: cosine transform of spp vol; adding tot vol and crown cov to spectral transform of spp vol.]
[Figure: Gaussian (logarithmic) vs. Logistic. Y-axis: RMSE ln(sp vol) relative to Mahalanobis. X-axis (Variables): T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover. Series: ln of spp volumes; logistic transform of spp vol (two term).]
[Figure: Comparing transformations. Y-axis: RMSE ln(sp vol) relative to Mahalanobis. X-axis (Variables): T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover. Series: cosine transform of spp vol; logistic transform of spp vol (two term).]
[Figure: Comparing transformations, all four alternatives. Y-axis: RMSE ln(sp vol) relative to Mahalanobis. X-axis (Variables): T CuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover. Series: cosine transform of spp vol; adding tot vol and crown cov to spectral transform of spp vol; ln of spp volumes; logistic transform of spp vol (two term).]
Implications of transforming
• Imputed value derived from the neighbor, not directly from the model as in regression.
• Neighbor selection may be improved by transforming Y’s and X’s.
• Multivariate Y’s can resolve some indeterminacies from functions having extreme-value points (maxima or minima).
MSN Software Now Includes Alternative Distance Functions:
• Both canonical-correlation based distance functions.
• Euclidean distance on normalized X’s.
• Mahalanobis distance on normalized X’s.
• You supply a weight matrix of your derivation.
• K-nearest neighbors identification.
So ??
• Of the many methods available for imputation of attributes, no one alternative is clearly superior for all data sets.
Software Availability
• E-mail: ncrookston@fs.fed.us
• On the Web: http://forest.moscowfsl.wsu.edu/gems/msn.html
• In print: Crookston, N.L., Moeur, M. and Renner, D.L. 2002. User’s guide to the Most Similar Neighbor Imputation Program Version 2. Gen. Tech. Rpt. RMRS-GTR-96. Ogden, UT: USDA Rocky Mountain Research Station. 35 p.