sas for actuarial science students zhaochen gao mentor

20
SAS for Actuarial Science Students Zhaochen Gao Mentor: David Varodayan University of Illinois at Urbana-Champaign

Upload: others

Post on 28-May-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SAS for Actuarial Science Students Zhaochen Gao Mentor

SASforActuarialScienceStudents

ZhaochenGaoMentor:DavidVarodayan

UniversityofIllinoisatUrbana-Champaign

Page 2: SAS for Actuarial Science Students Zhaochen Gao Mentor

Chapter1BasicsofSASinActuarialScienceThischapterwillfocusonthreebasicimplementationsofSASindataprocessing,includingtheDataReading,DataSubsetandDataValidatingandCleaning.Chapter1.1DataReadingInordertoanalyzethedatathroughSAS,atfirst,wehavetolearnabouthowtoreadtherawdatafilesthroughSAS.Youareexpectedtoimportdatafrommanydifferentfileformats.Usually,therearethreefileformats:1)SASDataSets,2)ExcelWorksheet3)DelimitedRawDataFiles.

ReadingSASDataSets libname________________________________;data_____________________________________;set__________________________________;…run;

ReadingExcelWorksheets libname________________________________;data_____________________________________;set__________________________________;…run;

ReadingDelimitedRawDataFiles data____________________________________;infile______________________________;input______________________________;…run;

ForSASDataSetsandExcelWorksheets:TheLIBNAMEstatementassignsalibraryreferencenametoaSASdatalibrary,ifapermanentdatasetisbeingreadorcreated.

LIBNAMElibref'physical-file-name'<options>;

TheDATAstatementcancreatetemporaryorpermanentdatasets,whichisfollowedbythenameofoutputdataset.TheSETstatementautomaticallyreadalltheobservationsfromtheinputdataset.ForDelimitedRawDataFiles:

1) StandardDelimitedRawData:

TheINFILEstatementaccessesthephysicalnameoftherawdatafile.

Page 3: SAS for Actuarial Science Students Zhaochen Gao Mentor

TheDLM=optioncanbeaddedtotheINFILEstatementtoidentifythespecificdelimiter.Forexample,dlm=’’(blankspace),dlm=’,’anddlm=’09’x(tab).TheINPUTstatementspecifieshowtoreadthedatafromrawdatafiles.Inputspecificationsincludecolumninput,formattedinputandlistinput.Wecanadd$signtoindicatethatwewanttostorethevariableasacharactervalueratherthananumericvalue.

TheLENGTHstatementdefinesthelengthofavariable.

LENGTHvariable(s)$length;

2) NonstandardDelimitedData

TheDSDoptionfortheINFILEstatementsetsthedefaultdelimitertoacomma,treatsconsecutivedelimitersasmissingvaluesandenablesSAStoreadvalueswithembeddeddelimitersifthevalueissurroundedbyquotationmarks.TheMISSOVERoptionrecognizesmissingvaluesattheendofarecord

INFILE'raw-data-file-name'DSDMISSOVER;Example:ForourdatasetforTitanic,it’sarawdataset.Wecanreaditthroughthisway.

Andwewillgettheoutput.

Page 4: SAS for Actuarial Science Students Zhaochen Gao Mentor

Atfirst,wedefinethelengthofeachvariable.AndwereadPassengerID,NameandSexascharactervalue.Sincethefirstlineofourrawdatasetisthenumberofeachvariables,wereadthesecondlineasourfirstobservation.Chapter1.2SubsetDataLearninghowtosubsetdataisquiteusefulwhenwereceivemassiveamountsofdatabutonlywanttoanalyzepartofit.TheWHEREstatementsubsetsobservationsthatmeetaparticularcondition.Theconditionscouldcontainsvariousoperators,includingthecomparison,arithmeticcalculation,orlogicaloperation.Charactervaluesmustbeenclosedinquotationmarksandarecasesensitive.Numericvaluesdonotusequotationmarks.ThesubsettingIFstatementcontinuesprocessingonlythoseobservationsthatmeetthecondition.AnalternativetothesubsettingIFstatementistheDELETEstatementinanIF-THENstatement.GeneralformoftheIF-THENDELETEstatement:TheDELETEstatementstopsprocessingthecurrentobservation.Example:Ifwewanttogetthedataaboutfemalesurvivorsinclass1,thenwecanwriteas:

*ThePRINTproceduredisplaysthedataportionofaSASdataset.Bydefault,PROCPRINTdisplaysthefollowing:allobservations,allvariablesandanObscolumnontheleftside.*TheCONTENTSprocedurewiththe_ALL_keywordproducesalistofalltheSASfilesinthedatalibrary.Afterprocessingthestepslistedabove,wecangetthesubsetlikethis:

Page 5: SAS for Actuarial Science Students Zhaochen Gao Mentor

Chapter1.3ValidatingandCleaningDataAtmosttimes,wecanobservedataerrorsinarawdatafile.Inordertoprocessthem,wehavetolearntheproceduresortechniquesforvalidatingdataandcleaningdata.WhenSASencountersadataerror,theseeventsoccur:-AnotethatdescribestheerrorisprintedintheSASlog.-Theinputrecord(contentsoftheinputbuffer)beingreadisdisplayedintheSASlog.-ThevaluesintheSASobservation(contentsofthePDV)beingcreatedaredisplayedintheSASlog.-AmissingvalueisassignedtotheappropriateSASvariable.Therearefourapproacheswecanusetovalidatethedata.PROCPRINTstepwithVARandWHEREstatements

Detectsinvalidcharacterandnumericvaluesbysubsettingobservationsbasedonconditions.

PROCFREQstepwithTABLESstatement

Detectsinvalidcharacterandnumericvaluesbylookingatdistinctvalues.

PROCMEANSstepwithVARstatement

Detectsinvalidnumericvaluesbyusingsummarystatistics.

PROCUNIVARIATEstepwithVARstatement

Detectsinvalidnumericvaluesbylookingatextremevalues.

1.PRINTProcedureTheVARstatementhelpsselectvariablestobeincludedinthereportandalsoshowstheorderofdisplayofthesevariables.Forexample,IfwecanexaminethemissingvaluesorinvaliddataforthePclassvariable(theobservationthatisnot1,2or3),wecandoitinthisway:

Page 6: SAS for Actuarial Science Students Zhaochen Gao Mentor

Andfortunately,wegetnothingappearedintheresults.SothereisnomissingvaluesorinvaliddataforTitanicdataset.2.FreqProcedureTheFreqstatementproducesthethefrequencyofdistinctivevaluesfortheacquiredvariables.

PROCFREQDATA=data_set;TABLESvariables;

RUN;Forourexample,wecanexamallthevariablesthroughoneFreqstatement.

*nlevelshelpsusgettheleveltableforourvariables.Thecolumnnamed‘Levels’showsthenumberofdistinctivevaluesforthecorrespondingvariable.The‘MissingLevels’showsthatwhetherornottherearemissingvalues.*noprintintheTablesstatementsuppressthedisplayofseparatetableforeachofvariable.Here’sanexamplefortheseparatetableofvariableAge.

Page 7: SAS for Actuarial Science Students Zhaochen Gao Mentor
Page 8: SAS for Actuarial Science Students Zhaochen Gao Mentor

Sothereare177missingvalues.Anditseemslikethatthefirstfiveobservationsarenoteveninteger.DataValidatingProcedure:

Page 9: SAS for Actuarial Science Students Zhaochen Gao Mentor

Weaddanewvariablenamed“AgeGroup”foranalyzingdatainthenextchapter.Chapter2AnalyzingdataThischapterwilldisplayhowtoanalyzethedatabychoosingdifferentvariablesanddrawingdifferentgraphs.Chapter2.1DataAnalysisviaVisualizationExampleIfwewanttofindtherelationshipbetweenclassofcabinsandthesurvivalrate,wecandrawthehistogramfordifferentclassandcomparetheirsurvivalrate.

Page 10: SAS for Actuarial Science Students Zhaochen Gao Mentor

Thenwegettheresults:

Page 11: SAS for Actuarial Science Students Zhaochen Gao Mentor

WecaneasilyobservethatthesurvivalrateforClass1ismuchhigherthanthatforClass2andClass3.Ifpeopleupgradetheirclass,theywillhavehighersurvivalrate.Ifwewanttocombinethehistogramsofthreeclass,thenwecanrunfollowingsteps:

Andgetthefinalresults:

Page 12: SAS for Actuarial Science Students Zhaochen Gao Mentor

Accordingtothecombinedhistogram,wecanstillobservethatthesurvivalrateforclass1ishighest.Andthedeathrateforclass3ishighest.Therefore,wecanconcludethatthesurvivalratedecreasesalongwiththedowngradeofclassstanding.Similarly,forothervariables,wecanalsodrawthehistogramsforotherimportantvaribles:

(1) Sex

Wecaneasilyobservethatthesurvivalrateforfemaleissignificantlyhigherthanmale.Thismaybeowingtothefactthatmostofmalegavetheirseatsonlifeboatstofemale.

(2)Embarked

Page 13: SAS for Actuarial Science Students Zhaochen Gao Mentor

WecanobservethatthesurvivalrateforEmbarkedlabeled“C”ishighestanddeathrateforEmbarkedlabeled“S”ishighest.

(2) AgeGroup

Page 14: SAS for Actuarial Science Students Zhaochen Gao Mentor

WecanobservethatthesurvivalrateforKidishighestandthedeathrateishighestforSenior.Maybeit’sowingtothatkidsarethepriority.However,itwassohardforseniorstogetthroughthisdisaster.

Chapter2.2FreqTableInordertoanalyzetherelationshipbetweenSurvivalRateandothervariables,let’sstartwithgettingsomedescriptivestatisticsofthevariablesofinterestusingfrequencytable.Example:

Results:

Page 15: SAS for Actuarial Science Students Zhaochen Gao Mentor

Accordingtofrequencytable,74.20%offemalesurvived.Butonly18.89%survived.And68.13%ofthesurvivedpeoplearefemale.

Accordingtothefrequencytable,62.96%ofClass1peoplesurvived.Only24.24%ofClass3peoplesurvived.

Accordingtothefrequencytable,55.36%ofEmbarked“C”survived.However,for“Q”and“S”,survivalratesareonly38.96%and33.70%.

Page 16: SAS for Actuarial Science Students Zhaochen Gao Mentor

Accordingtothefrequencytable,54.69%ofkidssurvived.However,foradultsandseniors,survivalratesareonly39.69%and36.67%.Chapter2.3LogisticRegressionIfwewanttoperformstatisticaltestonourdataset,thenvariableSurvivedwouldbethedependentvariable.However,thereareonlytwovaluesforSurvived,0and1.Therefore,wewoulduselogisticregressionmodel.Logisticregressionassumesthattheoutcomevariableisbinary(i.e.,codedas0and1).Example:

AboveweusePROCLOGISTICtoestimatealogisticregressionmodel.TheoutcomeSurvivedandthepredictorPclassarebothcategoricalvariablesandshouldbeindicatedassuchontheCLASSstatement.WecanspecifythebaselinecategoryforSurvivedusing(ref="1")andthereferencegroupforPclassusing(ref="1").ThePARAM=REFoptionontheCLASSstatementtellsSAStousedummycodingratherthaneffectcodingforthevariablePclass.PartofResults:

Page 17: SAS for Actuarial Science Students Zhaochen Gao Mentor

Intheoutputabove,thelikelihoodratiochi-squareof103.5471withap-value<0.0001tellsusthatourmodelasawholefitssignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwillincreaseby1.6704ifmovingfromthehighestlevelofPclass(Pclass==1)tothelowestlevelofPclass(Pclass==3).TheoveralleffectofPclasslislistedunder"Type3AnalysisofEffects",anditissignificant.Othervariables:(1)Sex

Thelikelihoodratiochi-squareof268.8512withap-value<0.0001tellsusthatourmodelasawholefitssignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwillincreaseby2.5137ifmovingfromthesexFemaletosexMale.TheoveralleffectofSexislistedunder"Type3AnalysisofEffects",anditissignificant.(2)Embarked

Thelikelihoodratiochi-squareof25.8653withap-value<0.0001tellsusthatourmodelasawholefitssignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwilldecreaseby0.8919ifmovingfromPclass'S'totheEmbarkedlabeled'C'.TheoveralleffectofPclasslislistedunder"Type3AnalysisofEffects",anditissignificant.(3)AgeGroup

Page 18: SAS for Actuarial Science Students Zhaochen Gao Mentor

Thelikelihoodratiochi-squareof5.7879withap-value=0.1224tellsusthatourmodelasawholefitsnotsosignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwillincreaseby0.6699ifmovingfromAgeGroup'Kid'toAgeGroup'Teenager'.Chapter2.4MultinomialLogisticRegressionIt’ssimilarwithsimplelogisticregressionmodel.Butwecanusemorethanonevariableasdependentvariablesrightnow.Example:

Thelikelihoodratiochi-squareof359.7668withap-value<0.0001tellsusthatourmodelasawholefitssignificantlybetterthananemptymodel.Therelativelogoddsofbeingdiedvs.survivedwillincreaseby1.9055ifmovingfromthehighestlevelofPclass(Pclass==1)tothelowestlevelofPclass(Pclass==3).Therelativelogoddsofbeingdiedvs.survivedwillincreaseby2.6419ifmovingfromthesexFemaletosexMale.TheoveralleffectsofPclasslandSexarelistedunder"Type3AnalysisofEffects",andbotharesignificant.Chapter2.5PredictedProbabilitiesWecanusepredictedprobabilitiestohelpusunderstandthemodel.WecancalculatepredictedprobabilitiesusingtheLSMEANSstatementandtheILINKoption.Formultinomialdata,LSMEANSrequiresglmratherthanreference(dummy)coding,eventhoughtheyareessentiallythesame,sobesuretore-specifythecodingontheclassstatement.However,glmcodingonlyallowsthelastcategorytobethereferencegroup(Survived=0andPclass=3)andwillignoreanyotherreferencegroupspecifications.BelowweuseLSMEANStocalculatethepredictedprobabilityofsurvivalateachlevelofPclass.Example:

e-RequeststhattheLmatrixcoefficientsfortheLSMEANSeffectsbedisplayed.

Page 19: SAS for Actuarial Science Students Zhaochen Gao Mentor

ilink-Requeststhatestimatesandtheirstandarderrorsinthe"LeastSquaresMeans"tablealsobereportedonthescaleofthemean(theinverselinkedscale).cl-Requeststhatt-typeconfidencelimitsbeconstructedforeachoftheLS-means.Theconfidencelevelis0.95bydefault;thiscanbechangedwiththeALPHA=option.PartofResults:

Thepredictedprobabilitiesareinthe"Mean"column.Thus,forPclass=3,weseethattheprobabilityofbeingdiedis0.7576.ToobtainpredictedprobabilitiesfortheSurvived(=1),wecanreversetheorderingofthecategoriesusingthedescendingoptionontheproclogisticstatement.Thiswillmake1thevalueforSurvivedand3thereferencegroupforPclass.

PartofResults:

HereweseetheprobabilityofbeingsurvivedwhenPclass=3is0.2424,whichiswhatwewouldhaveexpectedsince(1-0.7576)=0.2424,where0.7576istheprobabilitiesofbeingdiedunderthesameconditions.Chapter2.6Conclusion

Page 20: SAS for Actuarial Science Students Zhaochen Gao Mentor

ThisguidecoversbasicusageofSASondataanalysis.Itshowshowtoreaddata,validatedata,cleandata,drawhistogram,analyzedataandperformlogisticregression.