transforms and other prestidigitations—or new twists in imputation

Post on 15-Jan-2016



Albert R. Stage

Imputation:

• To use what we know “everywhere” (useful, but not very interesting in itself): the X’s,

• To fill in detail that is prohibitive to obtain except on a sample: the Y’s,

• By finding surrogates based on similarity of the X’s.
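The three bullets above can be sketched in a few lines. This is a minimal illustration only, assuming plain squared Euclidean distance on the X's and a single nearest neighbor; the function name `impute_nearest` and the toy numbers are hypothetical and are not taken from the MSN software.

```python
import numpy as np

def impute_nearest(x_ref, y_ref, x_target):
    """Give each target the Y's of its most similar reference observation."""
    # Squared Euclidean distance between every target and every reference.
    diff = x_target[:, None, :] - x_ref[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)
    nearest = np.argmin(d2, axis=1)  # index of the surrogate for each target
    return y_ref[nearest], nearest

# Hypothetical toy data: three reference stands with known X's and Y's.
x_ref = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_ref = np.array([[10.0], [20.0], [30.0]])
# Two target stands where only the X's are known.
x_target = np.array([[0.9, 1.2], [4.0, 4.5]])

y_imputed, idx = impute_nearest(x_ref, y_ref, x_target)
print(idx.tolist())                # → [1, 2]
print(y_imputed.ravel().tolist())  # → [20.0, 30.0]
```

The imputed Y's are copied whole from the chosen reference observation, so they keep the joint structure found in real sample data.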

Topics

• Measures of similarity (a few in particular)

• Alternative MSN distance function leading to some improved estimates

• Transformations that improve resolution
– On the X-side (known everywhere)
– On the Y-side (known for sample only)

Distance measures for interval and ratio scale variables (Podani 2000)

• Euclidean/Mahalanobis
• Chord
• Angular
• Geodesic
• Manhattan
• Canberra
• Clark
• Bray-Curtis
• Marczewski-Steinhaus
• 1-Kulczynski
• Pinkham-Pearson
• Gleason
• Ellenberg
• Pandeya
• Chi-square
• 1-Correlation
• 1-similarity ratio
• Kendall difference
• Faith intermediate
• Uppsala coefficient

Distance measures for binary variables (Podani 2000)

Symmetric for 0/1
• Simple matching
• Euclidean
• Rogers-Tanimoto
• Sokal-Sneath
• Anderberg I
• Anderberg II
• Correlation
• Yule I
• Yule II
• Hamann

Asymmetric for 0/1
• Baroni-Urbani-Buser I
• Baroni-Urbani-Buser II
• Russell-Rao
• Faith I
• Faith II

Ignore 0
• Jaccard
• Sorenson
• Chord
• Kulczynski
• Sokal-Sneath II
• Mountford

Distance function in matrix notation

$$D^2_{iu} = \min_i \left[ (X_i - X_u)\, W\, (X_i - X_u)' \right]$$

– Where, for
• Euclidean distance: $W = I$ (identity matrix)
• Mahalanobis distance: $W$ = inverse covariance matrix
• MSN (1995): $W = \Gamma \Lambda^2 \Gamma'$ with:
  $\Gamma$ = matrix of coefficients of canonical variates
  $\Lambda$ = diagonal matrix of canonical correlations
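The three choices of weight matrix can be compared side by side in a short sketch. It assumes the MSN (1995) weight is the canonical-coefficient matrix times the squared canonical correlations times its transpose, and it computes the canonical analysis with a Cholesky-plus-SVD route chosen here for brevity. The function names and simulated data are hypothetical; this is not the MSN program's own code.

```python
import numpy as np

def canonical_weights(x, y):
    """Return (gamma, lam): X-side canonical coefficients and canonical correlations."""
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    sxx = xc.T @ xc / (n - 1)
    syy = yc.T @ yc / (n - 1)
    sxy = xc.T @ yc / (n - 1)
    lx = np.linalg.cholesky(sxx)  # sxx = lx @ lx.T
    ly = np.linalg.cholesky(syy)  # syy = ly @ ly.T
    # Whitened cross-covariance; its singular values are the canonical correlations.
    k = np.linalg.solve(lx, sxy) @ np.linalg.inv(ly).T
    u, lam, _ = np.linalg.svd(k, full_matrices=False)
    gamma = np.linalg.solve(lx.T, u)  # coefficients on the original X scale
    return gamma, lam

def quad_dist2(xi, xu, w):
    """Squared distance (xi - xu) W (xi - xu)'."""
    d = xi - xu
    return float(d @ w @ d)

# Simulated data (assumption): only the first X carries information about the Y's.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = np.column_stack([x[:, 0] + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])

gamma, lam = canonical_weights(x, y)
w_euclid = np.eye(3)
w_mahal = np.linalg.inv(np.cov(x, rowvar=False))
w_msn = gamma @ np.diag(lam ** 2) @ gamma.T

for name, w in [("Euclidean", w_euclid), ("Mahalanobis", w_mahal), ("MSN", w_msn)]:
    print(name, quad_dist2(x[0], x[1], w))
```

In this setup the diagonal of `w_msn` should be dominated by the entry for the first X, which is one way to see why non-informative X's contribute little to the MSN distance.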

Why Weight with Canonical Analysis?

• Not degraded by non-informative X’s

[Figure: Effect of adding 2 random X's (number of original X's = 21). Vertical axis: RMSE relative to Mahalanobis. Series: Mahalanobis, MSN.]

Why Weight with Canonical Analysis?

• Not affected by non-informative Y’s if the number of canonical pairs is determined by a test of significance on rank.

[Figure: Effect of adding 2 random Y's (number of original Y's = 15). Vertical axis: RMSE relative to Mahalanobis. Series: MSN0, MSN0 with 2 random Y's, MSN1, MSN1 with 2 random Y's.]

Comparison of MSN Distance Functions

• Moeur and Stage 1995
– Assumes Y’s are “true”
– Searches for closest linear combination of Y’s
– Set of near neighbors sensitive to lower order canonical correlations

• Stage 2003
– Assumes Y’s include measurement error
– Searches for closest linear combination of predicted Y’s
– Set of near neighbors less sensitive to random elements “swept” into lower order canonical correlations

New regression alternative:

$$d_{ij}^2 = (X_i - X_j)\, \Gamma \Lambda \left[ (I - \Lambda^2) \right]^{-1} \Lambda \Gamma' \, (X_i - X_j)'$$

$\Lambda$ is the diagonal matrix of canonical correlations for k = …

[Garbled in extraction: a diagonal matrix W and a rule involving sums over i = 1, …, k and i = 1, …, s together with PROPVAR.]

Effect of change:

• No change if only first canonical pair is used.

• Regression alternative gives more relative weight to higher correlated pairs.

• Effects on Root-Mean-Square Error of imputation are mixed, e.g. in the following three data sets:

Statistics for three data sets

                        Utah    Tally Lake    User’s Guide
Canonical pairs (s)        9             8               7
Number of Y’s             15             8              17
Number of X’s (p)         12            20               7
Number of obs. (n)      1076           847             197
n/(p*s+s)               13.3          5.04            3.52

Canonical    Utah                  Tally Lake            User’s Guide
pair         λ²      Rel. wgt.     λ²      Rel. wgt.     λ²      Rel. wgt.
                     new/old               new/old               new/old
1            0.465   1.00          0.626   1.00          0.691   1.00
2            0.159   0.64          0.348   0.57          0.454   0.57
3            0.125   0.61          0.327   0.56          0.247   0.41
4            0.042   0.56          0.227   0.49          0.219   0.40
Total        0.863                 1.861                 1.823
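The new/old column can be reproduced from the first numeric column under one reading of the slides. The assumptions are that the first column holds the squared canonical correlations, that the 1995 weight for a pair is that squared correlation, that the regression alternative divides it by one minus the squared correlation, and that ratios are scaled so the first pair equals 1. The function name is hypothetical.

```python
import numpy as np

def relative_weight_ratio(lam2):
    """New/old weight ratio per canonical pair, scaled so the first pair is 1."""
    lam2 = np.asarray(lam2, dtype=float)
    old = lam2                 # assumed 1995 weight: lambda^2
    new = lam2 / (1.0 - lam2)  # assumed regression alternative: lambda^2 / (1 - lambda^2)
    ratio = new / old          # simplifies to 1 / (1 - lambda^2)
    return ratio / ratio[0]

utah = [0.465, 0.159, 0.125, 0.042]  # first four pairs, Utah column of the table
print(np.round(relative_weight_ratio(utah), 2))  # → [1.   0.64 0.61 0.56]
```

Under the same assumptions the Tally Lake and User's Guide columns come out within 0.01 of the tabulated values, which is consistent with the squared correlations being rounded to three decimals. Because the ratio depends only on one minus the squared correlation, pairs with higher correlation gain relative weight.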

Change in Relative Weights Depends on λ²

[Figure: Utah FIA Data. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

[Figure: Tally Lake, Montana. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

[Figure: MSN User's Guide Example. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

Transforming X-variables

• To predict discrete classes of modal species composition (MSC) with Euclidean or Mahalanobis distance.

• To predict continuous variables of species composition

Euclidean vs. Cosine (Spectral angle)

[Figure: axes Variable 1 and Variable 2. Labels: Ref. A, Ref. B, Target Obs., Euclidean, Spectral angle.]

Euclidean distance function with cosine transformation

$$\cos(a_{ij}) = \frac{\sum_{k=1}^{p} x_{ik}\, x_{jk}}{\sqrt{\sum_{k=1}^{p} x_{ik}^2 \; \sum_{k=1}^{p} x_{jk}^2}}$$

Let: $$Z_i = X_i / \sqrt{X_i' X_i}$$

$$d_{ij}^2 = (Z_i - Z_j)'\, I\, (Z_i - Z_j) = 2\,(1 - \cos(a_{ij}))$$
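A quick numerical check of this identity, as a sketch with made-up vectors: after scaling each observation to unit length, squared Euclidean distance equals twice one minus the cosine of the spectral angle. The function name `cosine_transform` is hypothetical.

```python
import numpy as np

def cosine_transform(x):
    """Scale each row to unit length: Z_i = X_i / sqrt(X_i' X_i)."""
    norms = np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))
    return x / norms

# Made-up observations: rows 0 and 1 point the same way but differ in magnitude.
x = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [4.0, 3.0]])
z = cosine_transform(x)

# Squared Euclidean distance between transformed rows 0 and 2.
d2 = np.sum((z[0] - z[2]) ** 2)
# Cosine of the angle between the untransformed rows 0 and 2.
cos_a = x[0] @ x[2] / np.sqrt((x[0] @ x[0]) * (x[2] @ x[2]))

print(round(float(d2), 6), round(float(2 * (1 - cos_a)), 6))  # → 0.08 0.08
print(np.allclose(z[0], z[1]))                                # → True
```

Rows 0 and 1 collapse to the same point after the transformation, so the distance responds to the mix of the variables and ignores their overall magnitude.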

Effect of using cosine transformation of TM data on classification accuracy*

Attribute                                     Untransformed    Cosine trans.
Plant Assoc. Grp. (Oregon)** (Mahalanobis)    0.340            0.363
Modal Spp. Comp. (Oregon)** (Mahalanobis)     0.276            0.335
Modal Spp. Comp. (Minn.)*** (Euclidean)       0.320            0.328

* Kappa statistics  ** TM data  *** TM+ Enhanced data

Transforming the Y-variables

• Variance considerations: want homogeneity
• And a logical functional form for Y = f(X)
– Transformations of species composition
  • Logarithm of species basal area
  • Percent basal area by species
  • Cosine spectral angle
  • Logistic
– Evaluated by predicting discrete Plant Association Group (PAG), Users’ Guide example data (Oregon)

[Figure: Proportion of Species A, Species B. Horizontal axis: Elevation. Vertical axis: proportion (0 to 1). Series: Spp A north, Spp A south, Spp B north, Spp B south.]

Composition transformations:

• Logistic:

$$\ln\left[\frac{\text{Total BA} - \text{spp BA}}{\text{spp BA}}\right] = \ln(\text{Total BA} - \text{spp BA}) - \ln(\text{spp BA})$$

Represented in MSN by two separate variables.

• Cosine Spectral Angle:

$$\frac{\text{spp BA}}{\sqrt{\sum (\text{spp BA})^2}}$$
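The composition transformations can be sketched for a small table of species basal areas. The example assumes every species has positive basal area in every stand; zeros make the logarithms undefined, and how they should be handled is not stated in the slides. The array values and function names are hypothetical.

```python
import numpy as np

def logistic_terms(spp_ba):
    """Two-variable form of the logistic: ln(Total BA - spp BA) and ln(spp BA)."""
    total = spp_ba.sum(axis=1, keepdims=True)
    return np.log(total - spp_ba), np.log(spp_ba)

def percent_ba(spp_ba):
    """Percent basal area by species within each stand."""
    return 100.0 * spp_ba / spp_ba.sum(axis=1, keepdims=True)

def cosine_spectral(spp_ba):
    """spp BA divided by the square root of the summed squared spp BA."""
    return spp_ba / np.sqrt(np.sum(spp_ba ** 2, axis=1, keepdims=True))

# Hypothetical basal areas: rows are stands, columns are species.
spp_ba = np.array([[10.0, 30.0],
                   [20.0, 20.0]])

ln_rest, ln_spp = logistic_terms(spp_ba)
logit = ln_rest - ln_spp  # the single-variable logistic form
print(np.round(logit, 4))
print(percent_ba(spp_ba))
print(np.round(cosine_spectral(spp_ba), 4))
```

Keeping `ln_rest` and `ln_spp` as separate Y-variables mirrors the note that the logistic is represented by two variables; their difference recovers the one-variable form.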

[Figure: Predicting Plant Assoc. Grp. - Users' Guide data (Std Error Kappa = 0.06). Horizontal axis: Transformations of species volumes (Mahal, ln BA, BA%, Cos trans, Logistic). Vertical axis: Kappa statistic.]

[Figure: Species volumes transformed to cosine of spectral angle - Tally Lake, Montana. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: Cosine transform of spp vol.]

[Figure: Augmented by two "instrumental" variables. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: Cosine transform of spp vol; adding tot vol and crown cov to spectral transform of spp vol.]

[Figure: Gaussian (logarithmic) vs. Logistic. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: ln of spp volumes; logistic transform of spp vol (two term).]

[Figure: Comparing transformations. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: Cosine transform of spp vol; Logistic transform of spp vol (two term).]

[Figure: Comparing transformations. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: cosine transform of spp vol; adding tot vol and crown cov to spectral transform of spp vol; ln of spp volumes; logistic transform of spp vol (two term).]

Implications of transforming

• Imputed value derived from the neighbor, not directly from the model as in regression.

• Neighbor selection may be improved by transforming Y’s and X’s.

• Multivariate Y’s can resolve some indeterminacies from functions having extreme-value points (maxima or minima).

MSN Software Now Includes Alternative Distance Functions:

• Both canonical-correlation based distance functions.

• Euclidean distance on normalized X’s.

• Mahalanobis distance on normalized X’s.

• You supply a weight matrix of your derivation.

• K-nearest neighbors identification.

So ??

• Of the many methods available for imputation of attributes, no one alternative is clearly superior for all data sets.

Software Availability

• E-mail: ncrookston@fs.fed.us

• On the Web: http://forest.moscowfsl.wsu.edu/gems/msn.html

• In print: Crookston, N.L., Moeur, M. and Renner, D.L. 2002. User’s guide to the Most Similar Neighbor Imputation Program Version 2. Gen. Tech. Rpt. RMRS-GTR-96. Ogden, UT: USDA Rocky Mountain Research Station. 35 p.
