transforms and other prestidigitations—or new twists in imputation

Albert R. Stage


Post on 15-Jan-2016


Page 1: Transforms and other prestidigitations—or new twists in imputation

Albert R. Stage

Page 2

Imputation:

• To use what we know about “everywhere” that may be useful but not very interesting (the X’s),

• To fill in detail that is prohibitive to obtain except on a sample (the Y’s),

• By finding surrogates based on similarity of the X’s.
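The three bullets above amount to a nearest-neighbor lookup, which can be sketched as follows. This is a minimal illustration, not the MSN program: the function name `impute_nearest`, the plain Euclidean distance, and the tiny data set are all assumptions made for the example.

```python
import math

def impute_nearest(reference, targets):
    """Impute Y for each target by copying Y from the most similar reference.

    reference: list of (x_vector, y_vector) pairs, where both X and Y were observed.
    targets:   list of x_vectors, where only X is known.
    """
    imputed = []
    for x_t in targets:
        # The surrogate is the reference observation whose X's are closest.
        nearest = min(reference, key=lambda ref: math.dist(ref[0], x_t))
        imputed.append(nearest[1])
    return imputed

# Hypothetical data: two X's known everywhere, one Y known on the sample only.
reference = [((1.0, 2.0), (10.0,)), ((5.0, 5.0), (50.0,)), ((9.0, 1.0), (90.0,))]
targets = [(1.5, 2.5), (8.0, 2.0)]
result = impute_nearest(reference, targets)
print(result)
```

The imputed values are always observed Y vectors taken from the sample, never model predictions.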

Page 3

Topics

• Measures of similarity (a few in particular)

• Alternative MSN distance function leading to some improved estimates

• Transformations that improve resolution
– On the X-side (known everywhere)
– On the Y-side (known for sample only)

Page 4

Distance measures for interval and ratio scale variables (Podani 2000)

• Euclidean/Mahalanobis
• Chord
• Angular
• Geodesic
• Manhattan
• Canberra
• Clark
• Bray-Curtis
• Marczewski-Steinhaus
• 1-Kulczynski
• Pinkham-Pearson
• Gleason
• Ellenberg
• Pandeya
• Chi-square
• 1-Correlation
• 1-similarity ratio
• Kendall difference
• Faith intermediate
• Uppsala coefficient
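A few of the listed measures can be sketched in plain Python. The formulas below are the commonly used textbook definitions; Podani (2000) may scale some of them differently, so the function names and conventions (for example, skipping zero-denominator terms in Canberra) should be read as assumptions.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Terms where both values are zero are skipped (a common convention).
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def bray_curtis(x, y):
    # Assumes non-negative data such as abundances.
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def chord(x, y):
    # Euclidean distance after scaling each vector to unit length.
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return euclidean([a / nx for a in x], [b / ny for b in y])

# Hypothetical vectors pointing in the same direction but differing in magnitude.
u = [3.0, 4.0, 0.0]
v = [6.0, 8.0, 0.0]
print(euclidean(u, v), manhattan(u, v), canberra(u, v), bray_curtis(u, v), chord(u, v))
```

The chord distance is zero for these two vectors while the Euclidean distance is not, which is the property the later cosine (spectral angle) slides exploit.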

Page 5

Distance measures for binary variables (Podani 2000)

Symmetric for 0/1
• Simple matching
• Euclidean
• Rogers-Tanimoto
• Sokal-Sneath
• Anderberg I
• Anderberg II
• Correlation
• Yule I
• Yule II
• Hamann

Asymmetric for 0/1
• Baroni-Urbani-Buser I
• Baroni-Urbani-Buser II
• Russell-Rao
• Faith I
• Faith II

Ignore 0
• Jaccard
• Sørensen
• Chord
• Kulczynski
• Sokal-Sneath II
• Mountford
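Most of the binary coefficients are built from the four counts of a 2×2 agreement table. The sketch below shows four of them in their usual similarity form (a distance is commonly taken as one minus the similarity); the function names and the example vectors are hypothetical, and Podani's exact variants may differ.

```python
def agreement_counts(x, y):
    """Counts for two 0/1 vectors: a = both 1, b = only x is 1, c = only y is 1, d = both 0."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = agreement_counts(x, y)
    return (a + d) / (a + b + c + d)   # joint absences count as agreement

def russell_rao(x, y):
    a, b, c, d = agreement_counts(x, y)
    return a / (a + b + c + d)         # joint absences count against similarity

def jaccard(x, y):
    a, b, c, d = agreement_counts(x, y)
    return a / (a + b + c)             # joint absences ignored

def sorensen(x, y):
    a, b, c, d = agreement_counts(x, y)
    return 2 * a / (2 * a + b + c)     # joint absences ignored, matches weighted double

x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 0, 1, 1]
print(agreement_counts(x, y), simple_matching(x, y), russell_rao(x, y), jaccard(x, y), sorensen(x, y))
```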

Page 6

Distance function in matrix notation

$D^2_{iu} = \min_i \left[ (X_i - X_u)\, W\, (X_i - X_u)' \right]$

– Where, for
• Euclidean distance: W = I (identity matrix)
• Mahalanobis distance: W = inverse covariance matrix
• MSN (1995): $W = \Gamma \Lambda^2 \Gamma'$, with:
$\Gamma$ = matrix of coefficients of canonical variates
$\Lambda$ = diagonal matrix of canonical correlations
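The quadratic form above is the same for every choice of W, so a short sketch can show how the weight matrix alone changes which reference observation is selected. The helper names, the reference data, and `W_custom` are hypothetical; computing the canonical-correlation weights themselves is not attempted here.

```python
def quad_form_distance(x_i, x_u, W):
    """Squared distance (x_i - x_u) W (x_i - x_u)' for row vectors given as lists."""
    diff = [a - b for a, b in zip(x_i, x_u)]
    p = len(diff)
    # Row vector times W, then times the column vector.
    diff_W = [sum(diff[r] * W[r][c] for r in range(p)) for c in range(p)]
    return sum(diff_W[c] * diff[c] for c in range(p))

def identity(p):
    return [[1.0 if r == c else 0.0 for c in range(p)] for r in range(p)]

def most_similar(references, x_u, W):
    """Index of the reference observation that minimizes the squared distance to target x_u."""
    return min(range(len(references)),
               key=lambda i: quad_form_distance(references[i], x_u, W))

# Hypothetical reference X's and one target.
references = [[1.0, 2.0], [4.0, 2.5], [0.0, 6.0]]
target = [3.0, 5.0]

d_euclid = quad_form_distance(references[0], target, identity(2))
nearest_euclid = most_similar(references, target, identity(2))

# An arbitrary diagonal W that down-weights the first X and up-weights the second.
W_custom = [[0.1, 0.0], [0.0, 4.0]]
nearest_custom = most_similar(references, target, W_custom)
print(d_euclid, nearest_euclid, nearest_custom)
```

With the identity matrix the second reference is nearest; with `W_custom` the third one is, although the data are unchanged.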

Page 7

Why Weight with Canonical Analysis?

• Not degraded by non-informative X’s

[Figure: Effect of adding 2 random X's (number of original X's = 21). Vertical axis: RMSE relative to Mahalanobis. Series: Mahalanobis, MSN.]

Page 8

Why Weight with Canonical Analysis?

• Not affected by non-informative Y’s if the number of canonical pairs is determined by a test of significance on rank.

[Figure: Effect of adding 2 random Y's (number of original Y's = 15). Vertical axis: RMSE relative to Mahalanobis. Series: MSN0, MSN0 with 2 random Y's, MSN1, MSN1 with 2 random Y's.]

Page 9

Comparison of MSN Distance Functions

• Moeur and Stage 1995
– Assumes Y’s are “true”
– Searches for closest linear combination of Y’s
– Set of near neighbors sensitive to lower-order canonical correlations

• Stage 2003
– Assumes Y’s include measurement error
– Searches for closest linear combination of predicted Y’s
– Set of near neighbors less sensitive to random elements “swept” into lower-order canonical correlations

Page 10

New regression alternative:

$d_{ij}^2 = (X_i - X_j)\, \Gamma \Lambda \left[ (I - \Lambda^2) \right]^{-1} \Lambda \Gamma' \, (X_i - X_j)'$

$\Lambda$ is the diagonal matrix of canonical correlations.

[Matrix display not recoverable from the transcript: a diagonal matrix W of ones and zeros and a rule involving sums over the first k of s canonical pairs compared with a PROPVAR threshold.]

Page 11

Effect of change:

• No change if only the first canonical pair is used.

• Regression alternative gives more relative weight to the more highly correlated pairs.

• Effects on root-mean-square error of imputation are mixed, e.g. in the following three data sets:

Page 12

Statistics for three data sets

                       Utah    Tally Lake    User’s Guide
Canonical pairs (s)    9       8             7
Number of Y’s          15      8             17
Number of X’s (p)      12      20            7
Number of obs. (n)     1076    847           197
n/(p*s+s)              13.3    5.04          3.52
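The last row is plain arithmetic on the rows above it and can be recomputed directly. With the values as transcribed, the Tally Lake and User's Guide entries reproduce (5.04 and 3.52), while the Utah inputs give about 9.2 rather than the 13.3 shown; the transcript does not show which Utah figure differs from the original slide. The function name is hypothetical.

```python
def obs_per_term(n, p, s):
    """Ratio n / (p*s + s) from the table: observations over s*(p + 1)."""
    return n / (p * s + s)

# (n, p, s) copied from the table as transcribed.
datasets = {
    "Utah": (1076, 12, 9),
    "Tally Lake": (847, 20, 8),
    "User's Guide": (197, 7, 7),
}

ratios = {name: round(obs_per_term(n, p, s), 2) for name, (n, p, s) in datasets.items()}
for name, value in ratios.items():
    print(name, value)
```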

Page 13

Change in Relative Weights Depends on λ²

Canonical    Utah                 Tally Lake           User’s Guide
pair         λ²      Rel. wgt.    λ²      Rel. wgt.    λ²      Rel. wgt.
                     new/old              new/old              new/old
1            0.465   1.00         0.626   1.00         0.691   1.00
2            0.159   0.64         0.348   0.57         0.454   0.57
3            0.125   0.61         0.327   0.56         0.247   0.41
4            0.042   0.56         0.227   0.49         0.219   0.40
Total        0.863                1.861                1.823
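The new/old columns can be reproduced from the squared canonical correlations under one reading of the two distance functions: the 1995 function weights pair k by its squared correlation, the regression alternative weights it by that value divided by one minus that value, and both are expressed relative to the first pair. That reading is an assumption, as is the function name, but it matches the tabulated values to within rounding of the displayed inputs.

```python
def relative_weight_change(sq_corr):
    """New/old weight ratio per canonical pair, scaled so the first pair equals 1."""
    old = list(sq_corr)                          # assumed 1995 weights
    new = [v / (1.0 - v) for v in sq_corr]       # assumed regression-alternative weights
    ratio = [n / o for n, o in zip(new, old)]    # equals 1 / (1 - v)
    return [round(r / ratio[0], 2) for r in ratio]

utah = relative_weight_change([0.465, 0.159, 0.125, 0.042])
tally_lake = relative_weight_change([0.626, 0.348, 0.327, 0.227])
users_guide = relative_weight_change([0.691, 0.454, 0.247, 0.219])
print(utah)
print(tally_lake)
print(users_guide)
```

The ratio for pair k reduces to (1 − first squared correlation) / (1 − k-th squared correlation), so lower-order pairs lose the most relative weight when the first correlation is high.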

Page 14

[Figure: Utah FIA Data. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

Page 15

[Figure: Tally Lake, Montana. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

Page 16

[Figure: MSN User's Guide Example. Vertical axis: Proportional change in Msqr (New/Old). Series: Prop. Var = 0.99, Prop. Var = 0.90.]

Page 17

Transforming X-variables

• To predict discrete classes of modal species composition (MSC) with Euclidean or Mahalanobis distance.

• To predict continuous variables of species composition.

Page 18

Euclidean vs. Cosine (Spectral angle)

[Figure: axes Variable 1 and Variable 2; points Ref. A, Ref. B, and Target Obs.; the Euclidean distance and the spectral angle between target and references are marked.]

Page 19

Euclidean distance function with cosine transformation

$\cos(a_{ij}) = \dfrac{\sum_{k=1}^{p} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{p} x_{ik}^2 \, \sum_{k=1}^{p} x_{jk}^2}}$

Let: $Z_i = X_i / \sqrt{X_i' X_i}$

$d_{ij}^2 = (Z_i - Z_j)' \, I \, (Z_i - Z_j) = 2\,(1 - \cos(a_{ij}))$
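The identity on this slide (squared Euclidean distance between unit-scaled vectors equals twice one minus the cosine of the angle) can be checked numerically. The sketch below assumes the unit scaling divides each vector by the square root of its sum of squares; the vectors and function names are hypothetical.

```python
import math

def unit_scale(x):
    """Scale a vector to unit length (the cosine transformation)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Hypothetical spectral values for two observations.
x_i = [40.0, 30.0, 20.0]
x_j = [10.0, 25.0, 5.0]

lhs = sq_euclidean(unit_scale(x_i), unit_scale(x_j))
rhs = 2.0 * (1.0 - cosine(x_i, x_j))

# Doubling every value of x_i (e.g. overall brightness) leaves the transformed vector unchanged.
same_direction = sq_euclidean(unit_scale(x_i), unit_scale([2.0 * v for v in x_i]))
print(lhs, rhs, same_direction)
```

Because the transformed distance depends only on the angle, any ordinary Euclidean nearest-neighbor search applied to the scaled vectors ranks neighbors by spectral angle.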

Page 20

Effect of using cosine transformation of TM data on classification accuracy*

Attribute                                     Untransformed    Cosine trans.
Plant Assoc. Grp. (Oregon)** (Mahalanobis)    0.340            0.363
Modal Spp. Comp. (Oregon)** (Mahalanobis)     0.276            0.335
Modal Spp. Comp. (Minn.)*** (Euclidean)       0.320            0.328

* Kappa statistics   ** TM data   *** TM+ Enhanced data
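For readers unfamiliar with the accuracy measure, the sketch below computes a kappa statistic from a confusion matrix of observed versus imputed classes. It assumes the unweighted Cohen's kappa; the slides do not say which variant was used, and the matrix is made-up illustration data, not a result from the talk.

```python
def cohen_kappa(confusion):
    """Unweighted Cohen's kappa for a square confusion matrix (rows: observed, columns: imputed)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[r][c] for r in range(k)) for c in range(k)]
    # Agreement expected by chance from the marginal totals.
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical two-class example.
confusion = [[20, 5],
             [10, 15]]
kappa = cohen_kappa(confusion)
print(round(kappa, 3))
```

Here 70% of cases agree but 50% agreement is expected by chance, so kappa is 0.4.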

Page 21

Transforming the Y-variables

• Variance considerations: want homogeneity
• And a logical functional form for Y = f(X)
– Transformations of species composition
  • Logarithm of species basal area
  • Percent basal area by species
  • Cosine spectral angle
  • Logistic
– Evaluated by predicting discrete Plant Association Group (PAG), Users’ Guide example data (Oregon)

Page 22

[Figure: Proportion of Species A, Species B (0 to 1) versus Elevation (0 to 120). Series: Spp A north, Spp A south, Spp B north, Spp B south.]

Page 23

Composition transformations:

• Logistic:
= ln[(Total BA − spp BA) / spp BA]
= ln(Total BA − spp BA) − ln(spp BA)
Represented in MSN by two separate variables.

• Cosine Spectral Angle:
= $\text{spp BA} / \sqrt{\sum (\text{spp BA})^2}$
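Both composition transformations are short enough to sketch. The code assumes the cosine version divides each species' basal area by the square root of the summed squares over species, as in the cosine transformation of the X's, and it returns the logistic transform as its two logarithmic terms because the slide says MSN carries them as two variables. Function names and basal-area values are hypothetical.

```python
import math

def logistic_terms(spp_ba, total_ba):
    """The two variables whose difference is ln[(Total BA - spp BA) / spp BA].

    Undefined when the species is absent (spp_ba == 0) or is the whole stand.
    """
    return math.log(total_ba - spp_ba), math.log(spp_ba)

def cosine_spectral(ba_by_species):
    """Scale the species basal-area vector to unit length."""
    norm = math.sqrt(sum(b * b for b in ba_by_species))
    return [b / norm for b in ba_by_species]

# Hypothetical basal areas for three species on one plot.
ba = [20.0, 30.0, 60.0]
total = sum(ba)

first_term, second_term = logistic_terms(ba[0], total)
logistic = first_term - second_term
z = cosine_spectral(ba)
print(logistic, z)
```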

Page 24

[Figure: Predicting Plant Assoc. Grp., Users' Guide data (Std. error of kappa = 0.06). Horizontal axis: Transformations of species volumes (Mahal, ln BA, BA%, Cos trans, Logistic). Vertical axis: Kappa statistic (0 to 0.5).]

Page 25

[Figure: Species volumes transformed to cosine of spectral angle, Tally Lake, Montana. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: Cosine transform of spp vol.]

Page 26

[Figure: Augmented by two "instrumental" variables. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: Cosine transform of spp vol; adding tot vol and crown cov to spectral transform of spp vol.]

Page 27

[Figure: Gaussian (logarithmic) vs. Logistic. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: ln of spp volumes; logistic transform of spp vol (two term).]

Page 28

[Figure: Comparing transformations. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: Cosine transform of spp vol; Logistic transform of spp vol (two term).]

Page 29

[Figure: Comparing transformations. Horizontal axis: Variables (TCuFt, L ln(V), DF ln(V), LP ln(V), ES ln(V), AF ln(V), PP ln(V), Crown Cover). Vertical axis: RMSE ln(sp vol) relative to Mahalanobis. Series: cosine transform of spp vol; adding tot vol and crown cov to spectral transform of spp vol; ln of spp volumes; logistic transform of spp vol (two term).]

Page 30

Implications of transforming

• The imputed value is derived from the neighbor, not directly from the model as in regression.

• Neighbor selection may be improved by transforming Y’s and X’s.

• Multivariate Y’s can resolve some indeterminacies from functions having extreme-value points (maxima or minima).

Page 31

MSN Software Now Includes Alternative Distance Functions:

• Both canonical-correlation-based distance functions.

• Euclidean distance on normalized X’s.

• Mahalanobis distance on normalized X’s.

• A weight matrix of your own derivation.

• K-nearest-neighbors identification.
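Two of the listed options, Euclidean distance on normalized X's and k-nearest-neighbors identification, combine naturally in a short sketch. This is not the MSN program's code: "normalized" is assumed here to mean centering each X and dividing by its standard deviation, and the function names and data are hypothetical.

```python
import heapq
import math
import statistics

def normalize_columns(rows):
    """Center and scale each column; also return the means and standard deviations used."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]
    return scaled, means, sds

def k_nearest(reference_x, target_x, k):
    """Indices of the k reference observations closest to the target, nearest first."""
    scaled, means, sds = normalize_columns(reference_x)
    target = [(v - m) / s for v, m, s in zip(target_x, means, sds)]
    distances = [(math.dist(row, target), idx) for idx, row in enumerate(scaled)]
    return [idx for _, idx in heapq.nsmallest(k, distances)]

# Hypothetical X's on very different scales.
reference_x = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [10.0, 150.0]]
neighbors = k_nearest(reference_x, [2.5, 240.0], 2)
print(neighbors)
```

Returning several neighbors rather than one leaves the choice of how to combine or rank their Y's to the analyst.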

Page 32

So ??

• Of the many methods available for imputation of attributes, no one alternative is clearly superior for all data sets.

Page 33

Software Availability

• E-mail: [email protected]

• On the Web: http://forest.moscowfsl.wsu.edu/gems/msn.html

• In print: Crookston, N.L., Moeur, M. and Renner, D.L. 2002. User’s guide to the Most Similar Neighbor Imputation Program Version 2. Gen. Tech. Rpt. RMRS-GTR-96. Ogden, UT: USDA Rocky Mountain Research Station. 35 p.