LDA for Educational Data

Upload: kylenpayne

Post on 14-Apr-2018


  • 7/27/2019 LDA for Educational Data

    1/18

    Classification of Schools by Academic Achievement Measures

    Kyle N. Payne

    Group 3


    Stat 448 Final Project

    Kyle N. Payne

    INTRODUCTION

    In many applications it makes logical and practical sense to dichotomize continuous variables. In educational policy, for example, academic performance can be described practically as either high or low academic achievement. While it is reasonable to assume that dichotomizing a continuous variable causes a considerable loss of information (Cohen, 1983), a dichotomy is also considerably easier to interpret, which could help lawmakers, policy specialists, and others develop suitable educational policy. From an applied perspective, it is also logical to investigate the extent to which demographic variables predict the classification of schools in terms of academic achievement, and that is the subject of the following analysis.

    The dataset under study consists of math and reading scores from standardized tests administered annually to 3rd and 5th graders in the state of Illinois, as well as several demographic and economic variables. The standardized test in question, the Illinois Standards Achievement Test (ISAT), is intended to assess individual student achievement relative to the Illinois Learning Standards. The dataset contains data for cohorts of students measured at both 3rd and 5th grade from 1999 to 2011. Measurements are at the school level, with averages taken across students. The entire dataset consists of 69466 observations across 109 variables, of which 10 were created over the course of the analysis. These created variables consist of coding variables and averages of other variables across similar groups (such as 3rd and 5th grade). The cohort 1 data (training set) consists of 1783 observations across 109 variables, as does the cohort 2 data (test set). The data were compiled by faculty and staff at the University of Illinois department of Labor and Employment Relations. Note that some analyses are placed in the appendix for ease of reading.

    METHODS

    For my analysis, I chose to use quadratic discriminant function analysis to model the membership of elementary schools in Illinois in two classes: schools that obtain High Academic Achievement (HAA) and those that obtain Low Academic Achievement (LAA). The criterion for each class is decided in advance: for cohort 1, a school is coded 1 (HAA) if the proportion of its students that exceeded expectations on the ISAT (averaged across math and reading and across grades) is at or above 15%, and 0 (LAA) if it is below 15%. The scales for each grade and test subject were equal, which allowed for easy averaging across grades 3, 4, and 5 for each school, as well as across the two test types. The test scores are standardized, meaning that all schools are assessed in the same manner and the scores are relative to an Illinois state standard. The discriminant analysis was performed on the SAS 9.2 and SAS 9.3 platforms with the stepdisc and discrim procedures.

    I considered cohort 1 as the training set and used a stepwise model selection procedure to select an appropriate model from the space of possible models. The predictors considered are general demographic variables of interest, including the average number of low-income students per school, the student-teacher ratio, etc. For fitting the discriminant function, the variable on which the classification depends is academ_achieve, the proportion of students that exceed expectations on the ISAT, averaged across math and reading and across grades 3 and 5. The coding variable AA is of the form

    \[ AA = \begin{cases} 0, & \text{academ\_achieve} < 0.15 \\ 1, & \text{academ\_achieve} \ge 0.15 \end{cases} \]

    so AA is a dichotomized measure of the average school-wise performance on the ISAT. While the classes are not multivariate normally distributed, the quadratic discriminant function is relatively robust to non-normality. However, to assess the performance of the discriminant analysis relative to other methods, I have also used logistic regression to model the probability of schools being assigned to the two classes. This secondary analysis was done on the SAS 9.2 platform with the logistic procedure.
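The 15% coding rule described in the Methods can be sketched in a few lines of Python. This is only an illustration, not the SAS data step used in the analysis: the function name `code_aa`, the treatment of a proportion of exactly 0.15 as HAA, and the example school values are all assumptions made here.

```python
def code_aa(academ_achieve: float, threshold: float = 0.15) -> int:
    """Return 1 (HAA) if the proportion meets the threshold, else 0 (LAA)."""
    if not 0.0 <= academ_achieve <= 1.0:
        raise ValueError("academ_achieve must be a proportion in [0, 1]")
    return 1 if academ_achieve >= threshold else 0


# Hypothetical school-level proportions, for illustration only.
example_schools = {"school_a": 0.08, "school_b": 0.15, "school_c": 0.31}
coded = {name: code_aa(p) for name, p in example_schools.items()}
print(coded)
```

Keeping the threshold as a parameter makes it easy to check how sensitive the class sizes are to the choice of cut point.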

    RESULTS

    Section 1

    The stepdisc procedure was initially run with the following predictors:

    avg_stud_lowincome - The average number of low-income students per school
    chronic_truant_rate - The average proportion of chronic truancy per school
    avg_dist_tch_salary - The average teacher salary per district
    avg_perc_dist_tch_badegree - The average percent of teachers with bachelor's degrees per district
    avg_perc_dist_tch_madegree - The average percent of teachers with master's degrees per district
    bamaxpay_sched - The bachelor's degree maximum pay schedule per school
    mamaxpay_shed - The master's degree maximum pay schedule per school

    The procedure was carried out with a 0.05 selection level and a 0.05 significance level. Table 1.1 below shows the first part of the analysis, in which the predictors are entered into the model based upon their significance.


    Statistics for Entry, DF = 1, 1708

    Variable              R-Square   F Value   Pr > F        Tolerance
    avg_stud_lowincome    0.5345     1961.05   [truncated]


    Statistics for Entry, DF = 1, 1707

    Variable               Partial R-Square   F Value   Pr > F        Tolerance
    chronic_truant_rate    0.0020             3.41      0.0652        0.7956
    avg_dist_tch_salary    0.0142             24.60     [truncated]

    [Fragment of a further table header: F, Wilks' Lambda, Pr]


    Class Level Information

    AA   Variable Name   Frequency   Weight     Proportion   Prior Probability
    0    _0              842         842.0000   0.472238     0.500000
    1    _1              941         941.0000   0.527762     0.500000

    Table 1.5

    The two classes split the data nearly evenly, with roughly 47% of the schools in the LAA category and 53% in the HAA category. As seen in Table 1.7, the overall classification error rate is 0.1611 (16.11%), which consists of a misclassification rate of 0.2138 for the LAA class and 0.1084 for the HAA class.

    Number of Observations and Percent Classified into AA

    From AA   LAA            HAA             Total
    LAA       662 (78.62%)   180 (21.38%)    842 (100.00%)
    HAA       102 (10.84%)   839 (89.16%)    941 (100.00%)
    Total     764 (42.85%)   1019 (57.15%)   1783 (100.00%)
    Priors    0.5            0.5

    Table 1.6


    Error Count Estimates for AA

              LAA      HAA      Total
    Rate      0.2138   0.1084   0.1611
    Priors    0.5000   0.5000

    Table 1.7
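The error rates in Table 1.7 follow directly from the counts in Table 1.6: each class rate is the share of that class's schools assigned to the other class, and the total is the prior-weighted average of the class rates. A minimal Python sketch of that arithmetic is given here; the function name `error_rates` and the dictionary layout are illustrative choices, not part of the SAS output.

```python
def error_rates(confusion, priors):
    """Per-class error rates and their prior-weighted total.

    confusion[true_class][predicted_class] holds a count of schools.
    """
    rates = {}
    for true_class, row in confusion.items():
        n_class = sum(row.values())
        n_wrong = n_class - row[true_class]
        rates[true_class] = n_wrong / n_class
    total = sum(priors[c] * rates[c] for c in rates)
    return rates, total


# Counts taken from Table 1.6 (equal priors).
table_1_6 = {
    "LAA": {"LAA": 662, "HAA": 180},
    "HAA": {"LAA": 102, "HAA": 839},
}
rates, total = error_rates(table_1_6, {"LAA": 0.5, "HAA": 0.5})
print({c: round(r, 4) for c, r in rates.items()}, round(total, 4))
```

The same function can be called with proportional priors in place of 0.5 and 0.5 to obtain a total weighted by the observed class proportions.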

    Refitting the model with proportional priors, I obtained the same result of non-homogeneous variance between the two groups (Table 1.8), and therefore quadratic discriminant function analysis was again used. The MANOVA results are similar to those of the equal-prior analysis (Table 1.9).

    Chi-Square    DF   Pr > ChiSq
    177.132299    1    [truncated]

    Statistic        Value        F Value   Num DF   Den DF
    Wilks' Lambda    0.47971133   1931.65   1        1781
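With only two groups, the F value reported alongside Wilks' Lambda can be recovered from Lambda and the degrees of freedom, which is a useful check on a table reconstructed from damaged output. The sketch below assumes the standard two-group relation F = ((1 - Lambda) / Lambda) * (Den DF / Num DF); the function name `wilks_to_f` is illustrative.

```python
def wilks_to_f(wilks_lambda, num_df, den_df):
    """F statistic implied by Wilks' Lambda in the two-group case."""
    if not 0.0 < wilks_lambda <= 1.0:
        raise ValueError("Wilks' Lambda must lie in (0, 1]")
    return (1.0 - wilks_lambda) / wilks_lambda * den_df / num_df


# Values from the MANOVA output for the single predictor avg_stud_lowincome.
f_value = wilks_to_f(0.47971133, num_df=1, den_df=1781)
print(round(f_value, 2))
```

Under this assumption the result agrees with the reported F value of 1931.65 up to rounding.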


    Number of Observations and Percent Classified into AA

    From AA   LAA            HAA             Total
    LAA       652 (77.43%)   190 (22.57%)    842 (100.00%)
    HAA       99 (10.52%)    842 (89.48%)    941 (100.00%)
    Total     751 (42.12%)   1032 (57.88%)   1783 (100.00%)
    Priors    0.47224        0.52776

    Table 1.10

    Error Count Estimates for AA

              LAA      HAA      Total
    Rate      0.2257   0.1052   0.1621
    Priors    0.4722   0.5278

    Table 1.11

    The cross-validated error rate estimates (Table 1.12) are slightly higher than the resubstitution estimates, which are typically optimistic because each school is classified by a rule that was fit using that same school.

    Cross-Validated Error Count Estimates for AA

              LAA      HAA      Total
    Rate      0.2257   0.1063   0.1626
    Priors    0.4722   0.5278

    Table 1.12

    Because the purpose of the discriminant analysis is to use the training data to classify future data, I treated the cohort 1 data as a training set and the cohort 2 data as a test set. While neither data set is a completely random sample, we can treat cohort 2 as a test set under the assumption that the two cohorts do not differ systematically in the number of low-income students or in ISAT test scores. Using the cohort 1 data as the training set with proportional priors, the classification of cohort 2 is shown in Table 1.13 below. About 56.88% of cohort 2 is classified into the HAA class, compared with 57.88% of cohort 1 under the same priors (Table 1.10); both exceed the 52.78% of cohort 1 schools observed in the HAA class.

    Number of Observations and Percent Classified into AA

              LAA            HAA             Total
    Total     762 (43.12%)   1005 (56.88%)   1767 (100.00%)
    Priors    0.47224        0.52776

    Table 1.13

    Because the discriminant analysis is univariate, we can also view the classification visually. Figure 1.1 shows the predicted probability of being classified into the HAA group as a function of the average number of low-income students per school. Blue represents the HAA class and red represents the LAA class.

    Figure 1.1
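A curve like the one in Figure 1.1 can be approximated from summary statistics alone. The sketch below is not the SAS discrim computation; it assumes that a univariate quadratic discriminant rule amounts to comparing class-specific normal densities weighted by the priors, and it plugs in the class means and variances of avg_stud_lowincome reported in Appendix A1. The function names and the evaluation points are illustrative.

```python
import math


def normal_log_density(x, mean, variance):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def posterior_haa(x, params, priors):
    """Posterior P(HAA | x) from class-specific normal densities (univariate QDA)."""
    log_scores = {
        c: math.log(priors[c]) + normal_log_density(x, *params[c]) for c in params
    }
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    return weights["HAA"] / sum(weights.values())


# (mean, variance) of avg_stud_lowincome per class, as reported in Appendix A1.
params = {"LAA": (205.1981, 7091.38915), "HAA": (59.7202976, 2880.15587)}
priors = {"LAA": 0.5, "HAA": 0.5}

for x in (25, 100, 150, 250):  # illustrative numbers of low-income students
    print(x, round(posterior_haa(x, params, priors), 3))
```

Because the two class variances differ, the log odds are quadratic rather than linear in the predictor, which is what distinguishes the quadratic rule from the linear one.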

    Reviewing the assumptions of quadratic discriminant analysis, it is clear that there are several violations in this particular analysis. The distributions of the average number of low-income students for the LAA and HAA classes are both highly non-normal (Figure 1.2), which is a consequence of splitting the data into the two classes. However, I proceeded in the face of this because not all violations of assumptions are equally detrimental: while some make an analysis completely invalid, others only affect its precision and accuracy to a degree. The robustness of LDA and QDA to violations of normality was investigated by Sever, Lajovic & Rajer (2005). Their results indicate that the largest effect of non-normality on discriminant analysis is increased bias in the error count estimates. Skewness of the distribution appears to have little to no effect on discriminant analysis using LDA or QDA.

    Figure 1.2

    Section 2

    Because the classification scheme under study involves two classes, I also used logistic regression to model the log odds of a school being classified into either of the AA classes as a function of the average number of low-income students per school. Logistic regression is competitive with discriminant analysis for classification because it makes relatively few assumptions; in particular, it does not assume that the predictor is normally distributed within each class, so the non-normality noted above is not a violation. The generalized logit link function was used, as suggested in (Der & Everitt, 2002), due to the ordinal nature of the scale of the response. The test of the global null hypothesis (Table 2.1) and the maximum likelihood estimates (Table 2.2) are all significant. The asymptotic Wald chi-square value should be precise due to the large sample size.

    Testing Global Null Hypothesis: BETA=0

    Test                Chi-Square   DF   Pr > ChiSq
    Likelihood Ratio    1134.8846    1    [truncated]


    Figure 2.2

    Because the analysis is univariate, we can also view the logistic regression as the probability of a school being classified as an HAA school plotted against the average number of low-income students. Figure 2.3 shows the predicted probability of a school being classified into the HAA class as a function of the average number of low-income students per school.

    Figure 2.3
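To make the shape of such a curve concrete, the sketch below fits a one-predictor logistic regression by Newton-Raphson in plain Python. It is not the SAS logistic procedure, and it does not use the ISAT data: the observations are SYNTHETIC, drawn from normal distributions whose means and standard deviations are loosely based on the class summaries in Appendix A1, and the seed, sample sizes, and function names are assumptions made purely for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            break
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


# SYNTHETIC data for illustration only (not the ISAT data).
rng = random.Random(448)
xs = [max(0.0, rng.gauss(205.0, 84.0)) for _ in range(300)]   # LAA-like schools
ys = [0] * 300
xs += [max(0.0, rng.gauss(60.0, 54.0)) for _ in range(300)]   # HAA-like schools
ys += [1] * 300

b0, b1 = fit_logistic(xs, ys)
for x in (25, 100, 250):
    print(x, round(sigmoid(b0 + b1 * x), 3))
```

On data of this kind the fitted slope is negative, so the predicted probability of HAA falls as the number of low-income students rises, which is the pattern the figure describes.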

    We can also view measures of association between the predicted probabilities and the observed responses (Table 2.4). Each pair consists of one LAA school and one HAA school; the percent concordant is the percentage of such pairs in which the school observed in the HAA class has the higher predicted probability of HAA. The c statistic corresponds to the area under the ROC curve. It ranges from 0.5 to 1, where 0.5 reflects a model that predicts the response at random and 1 reflects a model that classifies the response perfectly. The classification appears to be relatively accurate.

    Association of Predicted Probabilities and Observed Responses

    Percent Concordant   90.8      Somers' D   0.818
    Percent Discordant   9.1       Gamma       0.819
    Percent Tied         0.1       Tau-a       0.408
    Pairs                792322    c           0.909

    Table 2.4
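The statistics in Table 2.4 are all functions of the concordant, discordant, and tied pair counts. The sketch below recomputes them from the rounded percentages in the table and the class sizes (842 LAA and 941 HAA schools), using the usual definitions of c, Somers' D, Gamma, and Tau-a. The function name is illustrative, and because the inputs are rounded percentages the results match the table only to about two or three decimals.

```python
def association_measures(pct_concordant, pct_discordant, pct_tied, n0, n1):
    """Rank-association measures from pair percentages and class sizes."""
    pairs = n0 * n1                      # one school from each class per pair
    nc = pct_concordant / 100.0 * pairs
    nd = pct_discordant / 100.0 * pairs
    nt = pct_tied / 100.0 * pairs
    n = n0 + n1
    return {
        "pairs": pairs,
        "c": (nc + 0.5 * nt) / pairs,
        "somers_d": (nc - nd) / pairs,
        "gamma": (nc - nd) / (nc + nd),
        "tau_a": (nc - nd) / (0.5 * n * (n - 1)),
    }


measures = association_measures(90.8, 9.1, 0.1, n0=842, n1=941)
for name, value in measures.items():
    print(name, round(value, 3))
```

The number of pairs, 842 x 941 = 792322, matches the table exactly, which confirms that pairs are formed across the two observed classes.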

    Section 3

    Comparing the two models, it appears that the discriminant analysis may give relatively biased predictions when compared to the logistic regression. This reflects the possible bias of the discriminant model due to the violations of normality. While the two models do deviate from each other in their predicted probabilities of being classified into the HAA class, their predictions are roughly similar (Figure 3.1).

    Figure 3.1


    Conclusion

    From the two analyses, we can paint a convincing picture: a higher average number of low-income students per school is associated with a lower probability of that school being classified into the High Academic Achievement class. Both models predict that schools with a high number of low-income students have a high probability of being classified as LAA, and therefore that those schools have a lower proportion of students that exceed expectations on the ISAT. Not only did the average number of low-income students per school classify schools well, it did so better than any other demographic predictor considered. The model selection process described in Section 1 of the Results is evidence of this, as avg_stud_lowincome had a partial R-square of 0.5345. This could provide a useful perspective for budgetary decisions, as the average number of low-income students explained much more variance than the average teacher salary per district, which had a partial R-square of 0.0142 (although this is a messy comparison, as there is variance in average teacher salary within a district). While this effect size may seem relatively small, it is actually quite high compared with the effect sizes commonly expected in social science. This also speaks to the general noisiness of the data.

    Further analysis could look at the relative performance of the discriminant model across each of the cohorts, or use a more sophisticated multivariate regression model in which ISAT scores for math and reading are multiple responses. Other methods could also be applied to the data, such as K-means clustering and non-parametric discriminant analyses.


    References

    Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7(3), 249-250.

    Der, G., & Everitt, B. S. (2002). A handbook of statistical analyses using SAS (2nd ed., p. 292). Boca Raton, FL: Chapman & Hall/CRC.

    Sever, M., Lajovic, J., & Rajer, B. (2005). Robustness of the Fisher's discriminant. Metodološki zvezki, 2(2), 239-242.


    Appendix:

    A1. Some univariate results for avg_stud_lowincome:

    LAA:

    Moments

    N                 842          Sum Weights        842
    Mean              205.1981     Sum Observations   172776.8
    Std Deviation     84.2103863   Variance           7091.38915
    Skewness          -0.6552029   Kurtosis           -0.8303315
    Uncorrected SS    41417329.4   Corrected SS       5963858.28
    Coeff Variation   41.0385799   Std Error Mean     2.90208156

    Basic Statistical Measures

    Location              Variability
    Mean     205.1981     Std Deviation         84.21039
    Median   231.7000     Variance              7091
    Mode     279.2000     Range                 300.00000
                          Interquartile Range   140.90000

    Goodness-of-Fit Tests for Normal Distribution

    Test                 Statistic           p Value
    Kolmogorov-Smirnov   D      0.1670009    Pr > D   [truncated]
    Cramer-von Mises     W-Sq   [truncated]
    Anderson-Darling     A-Sq   [truncated]


    HAA:

    Moments

    N                 941          Sum Weights        941
    Mean              59.7202976   Sum Observations   56196.8
    Std Deviation     53.6670837   Variance           2880.15587
    Skewness          1.18972537   Kurtosis           1.4619666
    Uncorrected SS    6063436.14   Corrected SS       2707346.52
    Coeff Variation   89.8640595   Std Error Mean     1.74949693

    Basic Statistical Measures

    Location              Variability
    Mean     59.72030     Std Deviation         53.66708
    Median   46.40000     Variance              2880
    Mode     0.00000      Range                 282.30000
                          Interquartile Range   74.10000

    Goodness-of-Fit Tests for Normal Distribution

    Test                 Statistic           p Value
    Kolmogorov-Smirnov   D      0.1328989    Pr > D   [truncated]
    Cramer-von Mises     W-Sq   [truncated]
    Anderson-Darling     A-Sq   [truncated]


    Statistics for Removal, DF = 1, 1707

    Variable              Partial R-Square   F Value   Pr > F
    avg_stud_lowincome    0.5411             2012.52   [truncated]

    Statistic        Value      F Value   Num DF   Den DF
    Wilks' Lambda    0.456281   677.64    3        1706