
7/28/2019 Stats 2 Note 04

17.874 Lecture Notes
Part 2: Matrix Algebra


2. Matrix Algebra

2.1. Introduction: Design Matrices and Data Matrices

Matrices are arrays of numbers. We encounter them in statistics in at least three different ways. First, they are a convenient device for systematically organizing our study designs. Second, data sets typically take the form of rectangular arrays of numbers and characters. Third, algebraic and computational manipulation of data is facilitated by matrix algebra. This is particularly important for multivariate statistics.

Over the next 2 weeks or so we will learn the basic matrix algebra used in multivariate statistics. In particular, we will learn how to construct the basic statistical concepts (means, variances, and covariances) in matrix form, and to solve estimation problems, which involve solving systems of equations.

I begin this section of the course by discussing Design Matrices and Data Matrices.

Design Matrices are arrays that describe the structure of a study. Many studies involve little actual design; the design matrix is the data matrix. Experiments, quasi-experiments, and surveys often involve some degree of complexity and care in their design. For any study, a design matrix can be very useful at the start for thinking about exactly what data needs to be collected to answer a question.

There are several components to a design matrix. The rows correspond to units of analysis and unique factors in a study; the columns are variables that one would like to measure (even if you can't). At the very least one needs to set aside columns for the X variable and the Y variable. You might also consider an assignment probability (in an experiment or survey) or variable (this might be an instrumental variable in an observational study).

For example, in the early and mid-1990s I did a series of about two dozen experiments involving campaign advertising. Early on in those studies I wrote down a simple matrix for keeping the factors in the studies straight. Units were individuals recruited to participate


in the study. The factors were that some people saw an ad from a Democrat, some people saw an ad from a Republican, and some saw no ad. In addition, some ads were negative and some were positive. I wanted to know how these factors affected various outcomes, especially turnout and vote intention. Do candidates do better with positive or negative ads? Which candidates? Does turnout drop?

Simple Design Matrix

Source  Tone  Assignment  Turnout  Vote
D       +     .2          TD+      VD+
D       -     .2          TD-      VD-
R       +     .2          TR+      VR+
R       -     .2          TR-      VR-
0       0     .2          T0       V0

I found that writing down a matrix like this helped me to think about what the experiment itself could measure. I can measure the effect of negative versus positive ads on turnout by taking the following difference: [TD+ + TR+] - [TD- + TR-].

It also raised some interesting questions. Do I need the control group to measure all of the effects of interest?

Data Matrices implement our design matrices and organize our data. Data matrices begin with information about the units in our study (people, countries, years, etc.). They then contain information about the variables of interest. It really doesn't matter how you represent data in a database, except that most databases reserve blanks and "." for missing data.

Below are hypothetical data for 10 subjects for our experiments. The content of each variable is recorded so that it is most meaningful. Under the variable Source, "D" means Democratic candidate.


Simple Design Matrix

Subject  Source  Tone  Gender  Turnout  Vote
1        D       +     F       Yes      D
2        D       -     F       Yes      R
3        C       0     M       No       0
4        R       -     F       Yes      R
5        C       0     M       No       0
6        R       +     M       Yes      D
7        D       +     M       Yes      R
8        C       0     M       Yes      D
9        R       -     F       Yes      R
10       D       -     M       No       0

We can't easily analyze these data. We need numerical representations of the outcome variables Turnout and Vote and the control variables, if any (in this case Gender). For such categorical data we would need to assign indicator variables to identify each group. Exactly how we do that may depend on the sort of analysis we wish to perform. Source, for example, may lead to two indicator variables: Democratic Source and Republican Source. Each equals 1 if the statement is true and 0 otherwise. Tone might also lead to two different indicator variables.

Usually variables that are to be used in analysis are coded numerically. We might code D as 1, R as 2, and Control group as 0. Tone might be 1 for +, 0 for none, and -1 for negative. And so forth. This coding can be done during data analysis or during data entry. A word of caution: always err on the side of too rich a coding, rather than skimping on information available when you assemble your data.
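To make the indicator coding concrete, here is a small numpy sketch (the data values and variable names are invented for illustration; this is not part of the original notes):

```python
import numpy as np

# Hypothetical Source column for a few subjects (D = Democrat, R = Republican, C = control)
source = np.array(["D", "D", "C", "R", "C", "R"])

# Indicator (dummy) variables: each equals 1 if the statement is true, 0 otherwise
dem_source = (source == "D").astype(int)
rep_source = (source == "R").astype(int)

# Numeric coding of Tone: 1 for +, 0 for none, -1 for negative
tone = np.array(["+", "-", "0", "-", "0", "+"])
tone_num = np.where(tone == "+", 1, np.where(tone == "-", -1, 0))

print(dem_source)  # [1 1 0 0 0 0]
print(rep_source)  # [0 0 0 1 0 1]
print(tone_num)    # [ 1 -1  0 -1  0  1]
```

Note that the control group is identified implicitly: both indicators equal 0 for control subjects, which is the usual convention when a regression includes a constant.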

Matrices help us to analyze the information in a data matrix and help us think about the information contained (and not) in a study design. We will learn, for example, when the effects and parameters of interest can in fact be estimated from a given design.

2.2. Definitions

Vectors are the building blocks for data analysis and matrix algebra. A vector is a column of numbers. We usually denote a vector as a lowercase bold letter


(x, y, a) when we do not have a specific set of numbers at hand but are considering a vector in the abstract. For example,

x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}

In statistics, a vector is usually a variable. Geometrically, a vector corresponds to a point in space. Each element in the vector can be graphed in the "observation space." If data were, say, from a survey, we could graph how person 1 answered the survey, person 2, etc., as a vector.

We could take another sort of vector in data analysis: the row corresponding to the values of the variables for any unit or subject. That is, a row vector is a set of numbers arranged in a single row:

a = (a_1, a_2, a_3, ..., a_k)

The transpose of a vector is a rewriting of the vector from a column to a row or from a row to a column. We denote the transpose with an elongated apostrophe (′):

x′ = (x_1, x_2, x_3, ..., x_n)

A matrix is a rectangular array of numbers, or a collection of vectors. We write matrices with capital letters. The elements of a matrix are numbers.

X = \begin{pmatrix} x_{11} & x_{21} & \cdots & x_{k1} \\ x_{12} & x_{22} & \cdots & x_{k2} \\ x_{13} & x_{23} & \cdots & x_{k3} \\ \vdots & & & \\ x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix}

or X = (x_1, x_2, x_3, ..., x_k).


The dimensions of a matrix are the number of rows and columns. Above, X has n rows and k columns, so we say that it is an n x k dimension matrix, or just "n by k."

It is important to keep indexes straight because operations such as addition and multiplication work on individual elements.

The transpose of a matrix represents the reorganization of a matrix such that its rows become its columns.

X′ = \begin{pmatrix} x_1′ \\ x_2′ \\ x_3′ \\ \vdots \\ x_k′ \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ x_{31} & x_{32} & \cdots & x_{3n} \\ \vdots & & & \\ x_{k1} & x_{k2} & \cdots & x_{kn} \end{pmatrix}

Note: The dimension changes from n by k to k by n.

A special type of matrix is symmetric: the numbers above the diagonal mirror the numbers below the diagonal. This means that X = X′.

As with simple algebra, we will want to have the "numbers" 0 and 1 so that we can define division, subtraction, and other operators. A 0 vector or matrix contains all zeros. The analogue of the number 1 is called the Identity matrix. It has 1's on the diagonal and 0's elsewhere.

I = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & & \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}
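A quick numpy sketch (with made-up values, added here for illustration) of the transpose, identity, and symmetry facts above:

```python
import numpy as np

# A 3-by-2 matrix: 3 rows (observations), 2 columns (variables)
X = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])

# Transpose: rows become columns, so the dimension flips from 3x2 to 2x3
assert X.T.shape == (2, 3)

# The identity matrix behaves like the number 1: IX = X
I = np.eye(3)
assert np.allclose(I @ X, X)

# A symmetric matrix equals its own transpose
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])
assert np.array_equal(S, S.T)
```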

2.3. Addition and Multiplication


2.3.1. Addition

To add vectors and matrices we sum all elements with the same index.

x + y = \begin{pmatrix} x_1 + y_1 \\ x_2 + y_2 \\ x_3 + y_3 \\ \vdots \\ x_n + y_n \end{pmatrix}

Matrix addition is constructed from vector addition, because a matrix is a vector of vectors. For matrices A and B, A + B consists of first adding the vector of vectors that is A to the vector of vectors that is B and then performing vector addition for each of the vectors.

More simply, keep the indexes straight and add each element in A to the element with the same index in B. This will produce a new matrix C.

C = A + B = \begin{pmatrix} a_{11}+b_{11} & a_{21}+b_{21} & \cdots & a_{k1}+b_{k1} \\ a_{12}+b_{12} & a_{22}+b_{22} & \cdots & a_{k2}+b_{k2} \\ \vdots & & & \\ a_{1n}+b_{1n} & a_{2n}+b_{2n} & \cdots & a_{kn}+b_{kn} \end{pmatrix}

NOTE: A and B must have the same dimensions in order to add them together.
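The element-by-element rule, and the dimension requirement, can be checked in a short numpy sketch (values invented for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[10.0, 20.0],
              [30.0, 40.0]])

# Element-by-element addition: c_ij = a_ij + b_ij
C = A + B
print(C)  # [[11. 22.] [33. 44.]]

# Matrices must share dimensions; numpy raises an error otherwise
try:
    A + np.zeros((3, 2))
except ValueError:
    print("dimension mismatch")
```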

2.3.2. Multiplication

Multiplication takes three forms: scalar multiplication, inner product, and outer product. We will define the multiplication rules for vectors first, and then matrices.

A scalar product is a number times a vector or matrix. Let λ be a number:

λx = \begin{pmatrix} λx_1 \\ λx_2 \\ λx_3 \\ \vdots \\ λx_k \end{pmatrix}


The inner product of vectors is the sum of the products of elements of two vectors with the same index. That is, x′a = \sum_{i=1}^{n} x_i a_i. This operation is a row vector times a column vector. The first element of the row is multiplied times the first element of the column. The second element of the row is multiplied times the second element of the column, and that product is added to the product of the first elements. And so on. This operation returns a number. Note: the number of columns of the row vector must equal the number of rows of the column vector.

In statistics this operation is of particular importance. Inner products return sums of squares and sums of cross-products. These are used to calculate variances and covariances.

The outer product of two vectors is the multiplication of a column vector (dimension n) times a row vector (dimension k). The result is a matrix of dimension n by k. The element in the first row of the column vector is multiplied by the element in the first column of the row vector. This produces the element in row 1 and column 1 in the new matrix. Generally, the ith element of the column vector is multiplied times the jth element of the row vector to get the ijth element of the new matrix.

ab′ = \begin{pmatrix} a_1b_1 & a_1b_2 & \cdots & a_1b_k \\ a_2b_1 & a_2b_2 & \cdots & a_2b_k \\ \vdots & & & \\ a_nb_1 & a_nb_2 & \cdots & a_nb_k \end{pmatrix}

In statistics you will frequently encounter the outer product of a vector and itself when we consider the variance of a vector. For example, ee′ yields a matrix of the residuals squared and cross-products:

ee′ = \begin{pmatrix} e_1^2 & e_1e_2 & \cdots & e_1e_n \\ e_2e_1 & e_2^2 & \cdots & e_2e_n \\ \vdots & & & \\ e_ne_1 & e_ne_2 & \cdots & e_n^2 \end{pmatrix}

2.3.3. Determinants

The length of a vector x equals √(x′x). This follows immediately from the Pythagorean Theorem. Consider a vector of dimension 2. The square of the hypotenuse is the sum of the squares of the two sides. The square of the first side is the squared distance traveled on dimension 1 (x_1^2), and the square of the second side is the squared distance traveled on dimension 2 (x_2^2).

The magnitude of a matrix is the absolute value of the determinant of the matrix. The determinant is the (signed) area inscribed by the sum of the column vectors of the matrix. The determinant is only defined for a square matrix.

Consider a 2x2 matrix with elements on the diagonal only:

X = \begin{pmatrix} x_{11} & 0 \\ 0 & x_{22} \end{pmatrix}

The object defined by these vectors is a rectangle, whose area is the base times the height: x_{11}x_{22}.

For a 2x2 matrix the determinant is x_{11}x_{22} - x_{12}x_{21}. This can be derived as follows. The sum of the vectors (x_{11}, x_{12}) and (x_{21}, x_{22}) defines a parallelogram. And the determinant is the formula for the area of the parallelogram, signed by the orientation of the object. To calculate the area of the parallelogram, find the area of the rectangle with sides x_{11} + x_{21} and x_{12} + x_{22}. The parallelogram lies within this rectangle. The area inside the rectangle but not in the parallelogram can be further divided into two identical rectangles and two pairs of identical triangles. One pair of triangles is defined by the first vector and the second pair is defined by the second vector. The rectangles are defined by the first dimension of the second vector and the second dimension of the first vector. Subtracting off the areas of the triangles and rectangles leaves the formula for the determinant (up to the sign).

Note: If the first vector equalled a scalar times the second vector, the determinant would equal 0. This is the problem of multicollinearity: the two vectors (or variables) have the same values and they cannot be distinguished with the observed values at hand. Or, perhaps, they are really just the same variables.


Let us generalize this by considering, first, a "diagonal matrix":

X = \begin{pmatrix} x_{11} & 0 & 0 & \cdots & 0 \\ 0 & x_{22} & 0 & \cdots & 0 \\ 0 & 0 & x_{33} & \cdots & 0 \\ \vdots & & & & \\ 0 & 0 & 0 & \cdots & x_{nn} \end{pmatrix}

This is an n-dimensional box. Its area equals the product of its sides: x_{11}x_{22} \cdots x_{nn}.

For any matrix, the determinant can be computed by "expanding" the cofactors along a given row (or column), say i:

|X| = \sum_{j=1}^{n} x_{ij} (-1)^{i+j} |X_{(ij)}|,

where X_{(ij)}, called the ijth cofactor, is the matrix left upon deleting the ith row and jth column, and |X_{(ij)}| is the determinant of the ijth cofactor.

To verify that this works, consider the 2x2 case. Expand along the first row:

|X| = x_{11}(-1)^{1+1}|x_{22}| + x_{12}(-1)^{1+2}|x_{21}| = x_{11}x_{22} - x_{12}x_{21}
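The cofactor expansion can be implemented directly as a recursion. The sketch below is an illustration, not a practical algorithm (numpy's determinant routine is far more efficient); it expands along the first row and checks the answer against numpy:

```python
import numpy as np

def det_cofactor(X):
    """Determinant by cofactor expansion along the first row."""
    n = X.shape[0]
    if n == 1:
        return X[0, 0]
    total = 0.0
    for j in range(n):
        # X with row 0 and column j deleted (the cofactor matrix)
        minor = np.delete(np.delete(X, 0, axis=0), j, axis=1)
        total += X[0, j] * (-1) ** j * det_cofactor(minor)
    return total

X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
assert np.isclose(det_cofactor(X), np.linalg.det(X))

# Collinear columns make the determinant 0 (multicollinearity)
Y = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(det_cofactor(Y))  # 0.0
```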

2.4. Equations and Functions

2.4.1. Functions.

It is important to keep in mind what the input of a function is and what it returns (e.g., a number, a vector, a matrix). Generally, we denote functions as taking a vector or matrix as input and returning a number or matrix: f(y). Linear and Quadratic Forms:

Linear form: Ax
Quadratic form: x′Ax

2.4.2. Equations in Matrix Form.

As in simple algebra, an equation is such that the left side of the equation equals the right side. In vectors or matrices, the equality must hold element by element, and thus the two sides of an equation must have the same dimension.


A simple linear equation is:

Ax = c

or, for matrices:

AX = C

This defines an n-dimensional plane. A quadratic form defines an ellipsoid:

x′Ax = c

Given a specific number c, the values of the x_i's that solve this equation map out an elliptical surface.

2.5. Inverses.

We use division to rescale variables, as an operation (e.g., when creating a new variable that is a ratio of variables), and to solve equations.

The inverse of a matrix is the division operation. The inverse is defined as a matrix, B, such that

AB = I.

The general formula for solving this equation is

b_{ij} = \frac{(-1)^{i+j} |A_{(ji)}|}{|A|},

where |A_{(ji)}| is the determinant of the jith cofactor of A.

Consider the 2x2 case:

\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

This operation produces 4 equations in 4 unknowns (the b's):

a_{11}b_{11} + a_{12}b_{21} = 1
a_{21}b_{11} + a_{22}b_{21} = 0


a_{11}b_{12} + a_{12}b_{22} = 0
a_{21}b_{12} + a_{22}b_{22} = 1

Solving these equations:

b_{11} = \frac{a_{22}}{a_{11}a_{22} - a_{12}a_{21}}, b_{12} = \frac{-a_{12}}{a_{11}a_{22} - a_{12}a_{21}}, b_{21} = \frac{-a_{21}}{a_{11}a_{22} - a_{12}a_{21}}, b_{22} = \frac{a_{11}}{a_{11}a_{22} - a_{12}a_{21}}.

You can verify that the general formula above leads to the same solutions.
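The closed-form 2x2 solutions can be verified numerically (the values here are invented for the example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 4.0]])
a11, a12 = A[0]
a21, a22 = A[1]

det = a11 * a22 - a12 * a21  # 10.0

# The closed-form solutions for the b's derived above
B = np.array([[ a22, -a12],
              [-a21,  a11]]) / det

# B is the inverse: AB = I, and it matches numpy's inverse routine
assert np.allclose(A @ B, np.eye(2))
assert np.allclose(B, np.linalg.inv(A))
```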


2.6. Statistical Concepts in Matrix Form

2.6.1. Probability Concepts

We generalize the concept of a random variable to a random vector, x, each of whose elements is a random variable, x_i. The joint density function of x is f(x). The cumulative density function of x is the area under the density function up to a point a_1, a_2, ..., a_n. This is an n-fold integral:

F(a) = \int_{-\infty}^{a_n} \cdots \int_{-\infty}^{a_1} f(x) \, dx_1 \cdots dx_n

For example, suppose that x is uniform on the interval (0,1). This density function is: f(x) = 1 if 0 < x ≤ 1, and 0 otherwise.


The variance-covariance matrix of x is defined as the expected value of the outer product of the deviations of the vector from its mean:

E[(x-μ)(x-μ)′] = \begin{pmatrix} E[(x_1-μ_1)^2] & E[(x_1-μ_1)(x_2-μ_2)] & \cdots & E[(x_1-μ_1)(x_n-μ_n)] \\ E[(x_2-μ_2)(x_1-μ_1)] & E[(x_2-μ_2)^2] & \cdots & E[(x_2-μ_2)(x_n-μ_n)] \\ \vdots & & & \\ E[(x_n-μ_n)(x_1-μ_1)] & E[(x_n-μ_n)(x_2-μ_2)] & \cdots & E[(x_n-μ_n)^2] \end{pmatrix}

= \begin{pmatrix} σ_1^2 & σ_{12} & σ_{13} & \cdots & σ_{1n} \\ σ_{12} & σ_2^2 & σ_{23} & \cdots & σ_{2n} \\ \vdots & & & \\ σ_{1n} & σ_{2n} & σ_{3n} & \cdots & σ_n^2 \end{pmatrix} = Σ
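On sample data, averaging the outer products of the deviations from the mean produces the estimated variance-covariance matrix. A numpy sketch, using simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Draw n observations of a 2-dimensional random vector
x = rng.normal(size=(n, 2))

mu_hat = x.mean(axis=0)

# Average of the outer products of the deviations: (1/n) sum (x - mu)(x - mu)'
dev = x - mu_hat
Sigma_hat = sum(np.outer(d, d) for d in dev) / n

# Same answer as the vectorized formula dev'dev / n ...
assert np.allclose(Sigma_hat, dev.T @ dev / n)
# ... and as numpy's covariance routine (bias=True divides by n rather than n-1)
assert np.allclose(Sigma_hat, np.cov(x.T, bias=True))
```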

2.6.1.2. The Normal and Related Distributions

The first two moments are sufficient to characterize a wide range of distribution functions. Most importantly, the mean vector and the variance-covariance matrix characterize the normal distribution.

f(x) = (2π)^{-n/2} |Σ|^{-1/2} e^{-\frac{1}{2}(x-μ)′Σ^{-1}(x-μ)}

If the x_i are independent (0 covariance) and identical (same variances and means), then Σ simplifies to σ^2 I, and the normal becomes:

f(x) = (2π)^{-n/2} σ^{-n} e^{-\frac{1}{2σ^2}(x-μ)′(x-μ)}

A linear transformation of a normal random variable is also normal. Let A be a matrix of constants, c be a vector of constants, and x be a random vector distributed N(μ, Σ); then

Ax + c ∼ N(Aμ + c, AΣA′)

We may define the standard normal as follows. Let A be a matrix such that AA′ = Σ. Let c = μ. Let z be a random vector such that x = Az + μ. Then z = A^{-1}(x - μ). Because the new variable is a linear transformation of x, we know that the result will be normal. What are the mean and variance?

E[z] = E[A^{-1}(x - μ)] = A^{-1}E(x - μ) = 0


V(z) = E[(A^{-1}(x - μ))(A^{-1}(x - μ))′] = E[A^{-1}(x - μ)(x - μ)′A′^{-1}] = A^{-1}E[(x - μ)(x - μ)′]A′^{-1} = A^{-1}ΣA′^{-1} = I.

The last equality holds because AA′ = Σ.
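The construction x = Az + μ with AA′ = Σ can be illustrated with a Cholesky factor, which is one matrix A satisfying AA′ = Σ. A numpy sketch on simulated data (the specific μ and Σ are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# The Cholesky factor satisfies A A' = Sigma, so x = Az + mu is N(mu, Sigma)
A = np.linalg.cholesky(Sigma)
assert np.allclose(A @ A.T, Sigma)

z = rng.normal(size=(2000, 2))   # standard normal draws (one per row)
x = z @ A.T + mu                 # correlated draws with mean mu, variance Sigma

# Standardize back: z = A^{-1}(x - mu) recovers the standard normal draws
z_back = (x - mu) @ np.linalg.inv(A).T
assert np.allclose(z_back.mean(axis=0), 0, atol=0.1)
assert np.allclose(np.cov(z_back.T, bias=True), np.eye(2), atol=0.1)
```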

From the normal we derive the other distributions of use in statistical inference. The two key distributions are the χ² and the F.

The quadratic form of a normally distributed random vector will follow the χ² distribution. Assume a standard normal vector z. Then

z′z ∼ χ²_n

because this is the sum of the squares of n normal random variables with mean 0 and variance 1. Alternatively, if x is a normal random vector with mean μ and variance Σ, then (x-μ)′Σ^{-1}(x-μ) follows a χ² with n degrees of freedom.

The ratio of two quadratic forms will follow the F distribution. Let A and B be two matrices such that AB = 0. Let A and B have r_a and r_b independent columns, respectively (i.e., ranks). Let x be a random vector with variance σ²I; then

\frac{(x′Ax/σ²)/r_a}{(x′Bx/σ²)/r_b} ∼ F[r_a, r_b]

In statistical inference about regressions we will treat the vector of coefficients as a random vector, b. We will construct hypotheses about sets of coefficients from a regression, and express those in the form of the matrices A and B. The χ²-statistic amounts to comparing squared deviations of the estimated parameters from the hypothesized parameters. The F-statistic compares two models, i.e., two different sets of restrictions.

2.6.2.3. Conditional Distributions.


For completeness, I present the conditional distributions. These are frequently used in Bayesian statistical analysis; they are also used as a theoretical framework within which to think about regression. They won't be used much in what we do in this course.

To define conditional distributions we will need to introduce an additional concept, partitioning. A matrix can be subdivided into smaller matrices. For example, an n x n matrix A may be written as, say, four submatrices A_{11}, A_{12}, A_{21}, A_{22}:

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}

We can write the inverse of A as the inverse of the partitions as follows:

A^{-1} = \begin{pmatrix} A_{11}^{-1}(I + A_{12}FA_{21}A_{11}^{-1}) & -A_{11}^{-1}A_{12}F \\ -FA_{21}A_{11}^{-1} & F \end{pmatrix},

where F = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}. We can use the partitioning results to characterize the conditional distributions.

Consider two normally distributed random vectors x_1, x_2. The density function for x_1 is

f(x_1) ∼ N(μ_1, Σ_{11}).

The density function for x_2 is

f(x_2) ∼ N(μ_2, Σ_{22}).

Their joint distribution is

f(x_1, x_2) ∼ N(μ, Σ).

The variance matrix can be written as follows:

Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}

where Σ_{jk} = E[(x_j - μ_j)(x_k - μ_k)′] and j = 1, 2 and k = 1, 2.

The conditional distribution is arrived at by dividing the joint density by the marginal density of x_1, yielding

x_2 | x_1 ∼ N(μ_{2.1}, Σ_{22.1})

where μ_{2.1} = μ_2 + Σ_{21}Σ_{11}^{-1}(x_1 - μ_1) and Σ_{22.1} = Σ_{22} - Σ_{21}Σ_{11}^{-1}Σ_{12}.


2.6.2. Regression model in matrix form.

1. Linear Model

The linear regression model consists of n equations. Each equation expresses how the value of y for a given observation depends on the values of the x's and ε_i:

y_1 = β_0 + β_1X_{11} + β_2X_{21} + ... + β_kX_{k1} + ε_1
y_2 = β_0 + β_1X_{12} + β_2X_{22} + ... + β_kX_{k2} + ε_2
...
y_i = β_0 + β_1X_{1i} + β_2X_{2i} + ... + β_kX_{ki} + ε_i
...
y_n = β_0 + β_1X_{1n} + β_2X_{2n} + ... + β_kX_{kn} + ε_n

This can be rewritten using matrices. Let y be the random vector of the dependent variable and ε be the random vector of the errors. The matrix X contains the values of the independent variables. The rows are the observations and the columns are the variables. To capture the intercept, the first column is a column of 1's.

X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{k1} \\ 1 & x_{12} & \cdots & x_{k2} \\ 1 & x_{13} & \cdots & x_{k3} \\ \vdots & & & \\ 1 & x_{1n} & \cdots & x_{kn} \end{pmatrix}

Let β be a vector of k + 1 constants: β′ = (β_0, β_1, β_2, ..., β_k). Then Xβ produces a vector in which each entry is a weighted average of the values of X_{1i}, ..., X_{ki} for a given observation i, where the weights are the β's.

Xβ + ε = \begin{pmatrix} β_0 + β_1X_{11} + β_2X_{21} + \cdots + β_kX_{k1} + ε_1 \\ β_0 + β_1X_{12} + β_2X_{22} + \cdots + β_kX_{k2} + ε_2 \\ \vdots \\ β_0 + β_1X_{1i} + β_2X_{2i} + \cdots + β_kX_{ki} + ε_i \\ \vdots \\ β_0 + β_1X_{1n} + β_2X_{2n} + \cdots + β_kX_{kn} + ε_n \end{pmatrix}


With this notation, we can express the linear regression model more succinctly as:

y = Xβ + ε

2. Assumptions about Errors

A1. Errors have mean 0: E[ε] = 0.

A2. Spherical distribution of errors (no autocorrelation, homoskedastic): E[εε′] = σ²I.

A3. Independence of X and ε: E[X′ε] = 0.

3. Method of Moments Estimator

The empirical moments implied by Assumptions 1 and 3 are that

i′e = 0
X′e = 0

where e = y - Xb. We can use these to derive the method of moments estimator, b, i.e., the formula we use to make our best guess about the value of β based on the observed y and X.

Substitute the definition of e into the second equation. This yields

X′(y - Xb) = 0

Hence, X′y - X′Xb = 0, which we can rewrite as (X′X)b = X′y. This is a system of linear equations. X′X is a k x k matrix. We can invert it to solve the system:

b = (X′X)^{-1}X′y
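A minimal numpy sketch of the estimator (the data are invented for illustration; np.linalg.solve is used rather than an explicit inverse, which is numerically preferable):

```python
import numpy as np

# Tiny illustrative data set: n = 5 observations, 2 regressors plus a constant
X = np.array([[1.0,  2.0, 1.0],
              [1.0,  4.0, 3.0],
              [1.0,  6.0, 2.0],
              [1.0,  8.0, 5.0],
              [1.0, 10.0, 4.0]])
y = np.array([3.0, 7.0, 8.0, 14.0, 15.0])

# b solves the normal equations (X'X)b = X'y, i.e. b = (X'X)^{-1} X'y
b = np.linalg.solve(X.T @ X, X.T @ y)

# The moment conditions hold at b: X'e = 0 (the first row of X'e is i'e = 0)
e = y - X @ b
assert np.allclose(X.T @ e, 0)
```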


Conceptually, this is the covariance(s) between y and X divided by the variance of X, which is the same idea as the formula for bivariate regression. The exact formula for the coefficient on any given variable, say X_j, will not be the same as the formula for the coefficient from the bivariate regression of Y on X_j. We must adjust for the other factors affecting Y that are correlated with X_j (i.e., the other X variables).

Consider a case with 2 independent variables and a constant. For simplicity, let us deviate all variables from their means: this eliminates the constant term, but does not change the solution for the slope parameters. So the vector x_j is really x_j - x̄_j. Using the inner products of vectors to denote the sums in each element, we can express the relevant matrices as follows.

X′X = \begin{pmatrix} x_1′x_1 & x_1′x_2 \\ x_2′x_1 & x_2′x_2 \end{pmatrix}

(X′X)^{-1} = \frac{1}{(x_1′x_1)(x_2′x_2) - (x_1′x_2)^2} \begin{pmatrix} x_2′x_2 & -x_1′x_2 \\ -x_2′x_1 & x_1′x_1 \end{pmatrix}

X′y = \begin{pmatrix} x_1′y \\ x_2′y \end{pmatrix}

Calculation of the formula for b consists of multiplying the (X′X)^{-1} matrix times the vector X′y. This is analogous to dividing the covariance of X and Y by the variance of X.

b = \frac{1}{(x_1′x_1)(x_2′x_2) - (x_1′x_2)^2} \begin{pmatrix} x_2′x_2 & -x_1′x_2 \\ -x_2′x_1 & x_1′x_1 \end{pmatrix} \begin{pmatrix} x_1′y \\ x_2′y \end{pmatrix}

\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \frac{(x_2′x_2)(x_1′y) - (x_1′x_2)(x_2′y)}{(x_1′x_1)(x_2′x_2) - (x_1′x_2)^2} \\ \frac{(x_1′x_1)(x_2′y) - (x_1′x_2)(x_1′y)}{(x_1′x_1)(x_2′x_2) - (x_1′x_2)^2} \end{pmatrix}


2.7. Differentiation and Optimization

We will encounter differentiation in statistics in estimation, where we wish to minimize squared errors or maximize likelihood, and in developing approximations, especially Taylor's Theorem.

The primary optimization problems take one of the following two forms.

First, least squares consists of choosing the values of b that minimize the sum of squared errors with respect to β:

S = ε′ε = (y - Xβ)′(y - Xβ)

This is a quadratic form in β.

Second, maximum likelihood estimation consists of choosing the values of the parameters that maximize the probability of observing the data. If the errors are assumed to come from the normal distribution, the likelihood of the data is expressed as the joint density of the errors:

L(β, σ²) = (2π)^{-n/2} σ^{-n} e^{-\frac{1}{2σ²} ε′ε} = (2π)^{-n/2} σ^{-n} e^{-\frac{1}{2σ²}(y - Xβ)′(y - Xβ)}

This function is very non-linear, but it can be expressed as a quadratic form after taking logarithms:

ln(L) = -\frac{n}{2} ln(2π) - n \, ln(σ) - \frac{1}{2σ²}(y - Xβ)′(y - Xβ)

The objective in both problems is to find a vector b that optimizes the objective functions with respect to β. In the least squares case we are minimizing and in the likelihood case we are maximizing. In both cases, the functions involved are continuous, so the first and second order conditions for a maximum will allow us to derive the optimal b.

Although the likelihood problem assumes normality, the approach can be adapted to many different probability densities, and provides a fairly flexible framework for deriving estimates.


Regardless of the problem, though, maximizing likelihood or minimizing error involves differentiation of the objective function. I will teach you the basic rules used in statistical applications. They are straightforward but involve some care in accounting for the indexes and dimensionality.

2.7.1. Partial Differentiation: Gradients and Hessians.

Let y = f(x). The outcome variable y changes along each of the x_i dimensions. We may express the set of first partial derivatives as a vector, called the gradient:

\frac{∂y}{∂x} = \begin{pmatrix} ∂f(x)/∂x_1 \\ ∂f(x)/∂x_2 \\ \vdots \\ ∂f(x)/∂x_n \end{pmatrix}

In statistics we will encounter linear forms and quadratic forms frequently. For example, the regression formula is a linear form and the sum of squared errors is expressed as a quadratic form. The gradients for the linear and quadratic forms are as follows:

\frac{∂Ax}{∂x} = A′

\frac{∂x′Ax}{∂x} = (A + A′)x

\frac{∂x′Ax}{∂A} = xx′

    @AConsiderthelinearform. WebuildupfromthespecialcasewhereAisavectora. Then,

    Pny=a x= i=1aixi. Thegradient returnsavectorofcoecients:P0 @ n aixi 1 0 1

    i=1 @x1 a1PB n C B CB CB @ i=1 aixi C Ba2CB CB @x2 C@a0x B C B : C

    =B : C=B C =aB CB C B : C@x B : C B CB C@ A @ : AP :n

    aixi an@ i=1 @xn

    Note: aistransposed inthelinearform,butnotinthegradient.20
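These gradient rules can be checked against finite differences. The sketch below (an illustration, not part of the original notes) compares the analytic gradients of the linear and quadratic forms to a numerical approximation:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

def num_grad(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2 * h)
    return g

# Quadratic form: d(x'Ax)/dx = (A + A')x
g_quad = num_grad(lambda v: v @ A @ v, x)
assert np.allclose(g_quad, (A + A.T) @ x, atol=1e-4)

# Linear form with a vector a: d(a'x)/dx = a
a = rng.normal(size=3)
g_lin = num_grad(lambda v: a @ v, x)
assert np.allclose(g_lin, a, atol=1e-4)
```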


More generally, the gradient of y = Ax with respect to x returns the transpose of the matrix A. Any element, y_i, equals a row vector of A, a_i, times the column vector x. Differentiation of the ith element of y returns a column vector that is the transpose of the ith row vector of A. The gradient of each element of y returns a new column vector. As a result, the rows of A become the columns of the matrix of gradients of the linear form.

The Hessian is the matrix of second partial derivatives:

H = \begin{pmatrix} \frac{∂²f(x)}{∂x_1∂x_1} & \frac{∂²f(x)}{∂x_1∂x_2} & \cdots & \frac{∂²f(x)}{∂x_1∂x_n} \\ \frac{∂²f(x)}{∂x_2∂x_1} & \frac{∂²f(x)}{∂x_2∂x_2} & \cdots & \frac{∂²f(x)}{∂x_2∂x_n} \\ \vdots & & & \\ \frac{∂²f(x)}{∂x_n∂x_1} & \frac{∂²f(x)}{∂x_n∂x_2} & \cdots & \frac{∂²f(x)}{∂x_n∂x_n} \end{pmatrix}

Note: This is a symmetric matrix. Each column is the derivative of the function f with respect to the transpose of x. So the Hessian is often written:

H = [f_{ij}] = \frac{∂²y}{∂x∂x′}.

Hessians are commonly used in Taylor Series approximations, such as in the Delta Technique to approximate variances. The second order term of the Taylor Series approximation can be expressed as follows:

\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - x_{0i})(x_j - x_{0j}) f_{ij}(x_0) = \frac{1}{2}(x - x_0)′H(x_0)(x - x_0)

They are also important in maximum likelihood estimation, where the Hessian is used

to construct the variance of the estimator. For the case of a normally distributed random vector with mean μ and variance σ², the Hessian is

H = \begin{pmatrix} -\frac{n}{σ^2} & -\frac{1}{σ^4}\sum_{i=1}^{n}(x_i - μ) \\ -\frac{1}{σ^4}\sum_{i=1}^{n}(x_i - μ) & \frac{n}{2σ^4} - \frac{1}{σ^6}\sum_{i=1}^{n}(x_i - μ)^2 \end{pmatrix}

The expected value of the inverse of this matrix is the variance-covariance matrix of the estimated parameters.

2.7.2. Optimization.

At an optimum the gradient equals 0: in all directions the rate of change is 0. This condition is called the first order condition. To check whether a given


point is a maximum or a minimum requires checking second order conditions. I will not examine second order conditions, but you are welcome to read about and analyze them in the textbook.

The first order condition amounts to a system of equations. At an optimum, there is a vector that solves the minimization or maximization problem in question. That vector is arrived at by setting the gradient equal to zero and solving the system of equations.

Consider the least squares problem. Find the value of β = b that minimizes:

S = (y - Xβ)′(y - Xβ)

Using the rules of transposition, we can rewrite this as follows:

S = (y′ - β′X′)(y - Xβ) = y′y - β′X′y - y′Xβ + β′X′Xβ = y′y - 2y′Xβ + β′X′Xβ

The gradient of S with respect to β consists of derivatives of quadratic and linear forms. Using the rules above:

\frac{∂S}{∂β} = 2X′Xβ - 2X′y

At a minimum, the value b guarantees that the gradient equals 0:

\frac{∂S}{∂β} = 2X′Xb - 2X′y = 0

The solution in terms of b is

b = (X′X)^{-1}X′y

For the maximum likelihood problem, the first order conditions are:

\frac{∂ln(L)}{∂β} = -\frac{1}{2σ̂²}(2X′Xβ̂ - 2X′y) = 0

\frac{∂ln(L)}{∂σ} = -nσ̂^{-1} + σ̂^{-3}(y - Xβ̂)′(y - Xβ̂) = 0

The solution is

β̂ = (X′X)^{-1}X′y

σ̂² = \frac{1}{n}(y - Xβ̂)′(y - Xβ̂) = \frac{1}{n} e′e

where e is the vector of residuals.
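A numpy sketch verifying the first order conditions on simulated data (the data-generating values are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# beta-hat solves the same normal equations as least squares
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# sigma-hat^2 = e'e / n from the sigma first-order condition
sigma2_hat = (e @ e) / n

# The beta first-order condition: 2X'X beta-hat - 2X'y = 0
assert np.allclose(2 * X.T @ X @ beta_hat - 2 * X.T @ y, 0, atol=1e-6)

# The sigma first-order condition: -n/sigma-hat + e'e/sigma-hat^3 = 0
sigma_hat = np.sqrt(sigma2_hat)
assert np.isclose(-n / sigma_hat + (e @ e) / sigma_hat**3, 0)
```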
