histogram - wikipedia, the free encyclopedia.pdf

Upload: mohan-shaha

Post on 08-Mar-2016

268 views

Category:

Documents


0 download

TRANSCRIPT

  • Histogram

    Firstdescribedby

    KarlPearson

    Purpose Toroughlyassesstheprobabilitydistributionofagivenvariablebydepictingthefrequenciesofobservationsoccurringincertainrangesofvalues

    HistogramFromWikipedia,thefreeencyclopedia

    Ahistogramisagraphicalrepresentationofthedistributionofdata.Itisanestimateoftheprobabilitydistributionofacontinuousvariable(quantitativevariable)andwasfirstintroducedbyKarlPearson.[1]Toconstructahistogram,thefirststepisto"bin"therangeofvaluesthatis,dividetheentirerangeofvaluesintoaseriesofsmallintervalsandthencounthowmanyvaluesfallintoeachinterval.Arectangleisdrawnwithheightproportionaltothecountandwidthequaltothebinsize,sothatrectanglesabouteachother.Ahistogrammayalsobenormalizeddisplayingrelativefrequencies.Itthenshowstheproportionofcasesthatfallintoeachofseveralcategories,withthesumoftheheightsequaling1.Thebinsareusuallyspecifiedasconsecutive,nonoverlappingintervalsofavariable.Thebins(intervals)mustbeadjacent,andusuallyequalsize.[2]Therectanglesofahistogramaredrawnsothattheytoucheachothertoindicatethattheoriginalvariableiscontinuous.[3]

    Histogramsgivearoughsenseofthedensityofthedata,andoftenfordensityestimation:estimatingtheprobabilitydensityfunctionoftheunderlyingvariable.Thetotalareaofahistogramusedforprobabilitydensityisalwaysnormalizedto1.Ifthelengthoftheintervalsonthexaxisareall1,thenahistogramisidenticaltoarelativefrequencyplot.

    Ahistogramcanbethoughtofasasimplistickerneldensityestimation,whichusesakerneltosmoothfrequenciesoverthebins.Thisyieldsasmootherprobabilitydensityfunction,whichwillingeneralmoreaccuratelyreflectdistributionoftheunderlyingvariable.Thedensityestimatecouldbeplottedasanalternativetothehistogram,andisusuallydrawnasacurveratherthanasetofboxes.

    AvariablebinwidthhistogramwasintroducedbyDenbyandMallows(2009)(http://pubs.research.avayalabs.com/pdfs/ALR2007003paper.pdf).ExamplesofthisaredisplayedonCensusbureaudatabelow.

    Anotheralternativeistheaverageshiftedhistogram(http://www.stat.rice.edu/~scottdw/stat550/HW/hw4/c05.pdf)whichisfasttocompute,andgetsasmoothcurveestimateofthedensitywithoutusingkernels.

    Thehistogramisoneofthesevenbasictoolsofqualitycontrol.[4]

    Histogramsareoftenconfusedwithbarcharts.Abarchartisaplotofcategoricalvariables,andthediscontinuityshouldbeindicatedbyhavinggapsbetweentherectangles.Oftenthisisneglectedwhichmayleadtoabarchartbeingconfusedforahistogram.

  • Anexamplehistogramoftheheightsof31BlackCherrytrees.

    Contents

    1Etymology2Examples3Mathematicaldefinition

    3.1Cumulativehistogram3.2Numberofbinsandwidth

    4Seealso5References6Furtherreading7Externallinks

    Etymology

    Theetymologyofthewordhistogramisuncertain.SometimesitissaidtobederivedfromtheGreekhistos'anythingsetupright'(asthemastsofaship,thebarofaloom,ortheverticalbarsofahistogram)andgramma'drawing,record,writing'.ItisalsosaidthatKarlPearson,whointroducedthetermin1891,derivedthenamefrom"historicaldiagram".[5]

    Examples

    Thisisatoyexample

  • Bin Count3.5 92.5 321.5 1090.5 1800.5 1321.5 342.5 43.5 9

    Thelanguageusedtodescribethepatternsinahistogramaresymmetric,skewedleftorright,unimodal,bimodalormultimodal.

    Symmetric,unimodal

    Skewedright

    Skewedleft

    Bimodal

    Multimodal

    Symmetric

    Itisagoodideatoplotyourdataonseveraldifferentbinwidthstolearnmoreaboutit.Hereisanexampleontipsgiveninarestaurant.

  • Tipsusinga$1binwidth,skewedright,unimodal

    Tipsusinga10cbinwidth,stillskewedright,multimodalwithmodesat$and50camounts,indicatesrounding,alsosomeoutliers

    Hereareacouplemoreexamples.

    PricesofhousessoldinAmesin2009,exhibitssomerightskew.

    Acesbyplayersinagrandslamtennistournament,facettedbygender.Therearemoreacesinthemensgame.

    TheU.S.CensusBureaufoundthattherewere124millionpeoplewhoworkoutsideoftheirhomes.[6]Usingtheirdataonthetimeoccupiedbytraveltowork,Table2belowshowstheabsolutenumberofpeoplewhorespondedwithtraveltimes"atleast30butlessthan35minutes"ishigherthanthenumbersforthecategoriesaboveandbelowit.Thisislikelyduetopeopleroundingtheirreportedjourneytime.Theproblemofreportingvaluesassomewhatarbitrarilyroundednumbersisacommonphenomenonwhencollectingdatafrompeople.

  • Histogramoftraveltime(towork),US2000census.Areaunderthecurveequalsthetotalnumberofcases.ThisdiagramusesQ/widthfromthetable.

    Databyabsolutenumbers

    Interval Width Quantity Quantity/width

    0 5 4180 836

    5 5 13687 2737

    10 5 18618 3723

    15 5 19634 3926

    20 5 17981 3596

    25 5 7190 1438

    30 5 16369 3273

    35 5 3212 642

    40 5 4122 824

    45 15 9200 613

    60 30 6461 215

    90 60 3435 57

    Thishistogramshowsthenumberofcasesperunitintervalastheheightofeachblock,sothattheareaofeachblockisequaltothenumberofpeopleinthesurveywhofallintoitscategory.Theareaunderthecurverepresentsthetotalnumberofcases(124million).Thistypeofhistogramshowsabsolutenumbers,withQinthousands.

  • Histogramoftraveltime(towork),US2000census.Areaunderthecurveequals1.ThisdiagramusesQ/total/widthfromthetable.

    Databyproportion

    Interval Width Quantity(Q) Q/total/width

    0 5 4180 0.0067

    5 5 13687 0.0221

    10 5 18618 0.0300

    15 5 19634 0.0316

    20 5 17981 0.0290

    25 5 7190 0.0116

    30 5 16369 0.0264

    35 5 3212 0.0052

    40 5 4122 0.0066

    45 15 9200 0.0049

    60 30 6461 0.0017

    90 60 3435 0.0005

    Thishistogramdiffersfromthefirstonlyintheverticalscale.Theareaofeachblockisthefractionofthetotalthateachcategoryrepresents,andthetotalareaofallthebarsisequalto1(thefractionmeaning"all").Thecurvedisplayedisasimpledensityestimate.Thisversionshowsproportions,andisalsoknownasaunitareahistogram.

    Inotherwords,ahistogramrepresentsafrequencydistributionbymeansofrectangleswhosewidthsrepresentclassintervalsandwhoseareasareproportionaltothecorrespondingfrequencies:theheightof

  • Anordinaryandacumulativehistogramofthesamedata.Thedatashownisarandomsampleof10,000pointsfromanormaldistributionwithameanof0andastandarddeviationof1.

    eachistheaveragefrequencydensityfortheinterval.Theintervalsareplacedtogetherinordertoshowthatthedatarepresentedbythehistogram,whileexclusive,isalsocontiguous.(E.g.,inahistogramitispossibletohavetwoconnectingintervalsof10.520.5and20.533.5,butnottwoconnectingintervalsof10.520.5and22.532.5.Emptyintervalsarerepresentedasemptyandnotskipped.)[7]

    Mathematicaldefinition

    Inamoregeneralmathematicalsense,ahistogramisafunctionmithatcountsthenumberofobservationsthatfallintoeachofthedisjointcategories(knownasbins),whereasthegraphofahistogramismerelyonewaytorepresentahistogram.Thus,ifweletnbethetotalnumberofobservationsandkbethetotalnumberofbins,thehistogrammimeetsthefollowingconditions:

    Cumulativehistogram

    Acumulativehistogramisamappingthatcountsthecumulativenumberofobservationsinallofthebinsuptothespecifiedbin.Thatis,thecumulativehistogramMiofahistogrammjisdefinedas:

    Numberofbinsandwidth

    Thereisno"best"numberofbins,anddifferentbinsizescanrevealdifferentfeaturesofthedata.GroupingdataisatleastasoldasGraunt'sworkinthe17thcentury,butnosystematicguidelinesweregiven[8]untilSturges'sworkin1926.[9]

    Usingwiderbinswherethedensityislowreducesnoiseduetosamplingrandomnessusingnarrowerbinswherethedensityishigh(sothesignaldrownsthenoise)givesgreaterprecisiontothedensityestimation.Thusvaryingthebinwidthwithinahistogramcanbebeneficial.Nonetheless,equalwidthbinsarewidelyused.

    Sometheoreticianshaveattemptedtodetermineanoptimalnumberofbins,butthesemethodsgenerallymakestrongassumptionsabouttheshapeofthedistribution.Dependingontheactualdatadistributionandthegoalsoftheanalysis,differentbinwidthsmaybeappropriate,soexperimentationisusuallyneededtodetermineanappropriatewidth.Thereare,however,varioususefulguidelinesandrulesofthumb.[10]

  • Thenumberofbinskcanbeassigneddirectlyorcanbecalculatedfromasuggestedbinwidthhas:

    Thebracesindicatetheceilingfunction.

    Squarerootchoice

    whichtakesthesquarerootofthenumberofdatapointsinthesample(usedbyExcelhistogramsandmanyothers).[11]

    Sturges'formula

    Sturges'formula[9]isderivedfromabinomialdistributionandimplicitlyassumesanapproximatelynormaldistribution.

    Itimplicitlybasesthebinsizesontherangeofthedataandcanperformpoorlyifn

  • WikimediaCommonshasmediarelatedtoHistograms.

    where isthesamplestandarddeviation.Scott'snormalreferencerule[14]isoptimalforrandomsamplesofnormallydistributeddata,inthesensethatitminimizestheintegratedmeansquarederrorofthedensityestimate.[8]

    FreedmanDiaconis'choice

    TheFreedmanDiaconisruleis:[15][8]

    whichisbasedontheinterquartilerange,denotedbyIQR.Itreplaces3.5ofScott'srulewith2IQR,whichislesssensitivethanthestandarddeviationtooutliersindata.

    ChoicebasedonminimizationofanestimatedL2[16]riskfunction

    where and aremeanandbiasedvarianceofahistogramwithbinwidth , and.

    Remark

    Agoodreasonwhythenumberofbinsshouldbeproportionalto isthefollowing:supposethatthedataareobtainedas independentrealizationsofaboundedprobabilitydistributionwithsmoothdensity.Thenthehistogramremainsequallyruggedas tendstoinfinity.If isthewidthofthedistribution(e.g.,thestandarddeviationortheinterquartilerange),thenthenumberofunitsinabin(thefrequency)isoforder andtherelativestandarderrorisoforder .Comparingtothenextbin,the

    relativechangeofthefrequencyisoforder providedthatthederivativeofthedensityisnonzero.Thesetwoareofthesameorderif isoforder ,sothat isoforder .

    Thissimplecubicrootchoicecanalsobeappliedtobinswithnonconstantwidth.

    Seealso

    DatabinningDensityestimation

    Kerneldensityestimation,asmootherbutmorecomplexmethodofdensityestimation

    FreedmanDiaconisrule

  • ImagehistogramParetochartSevenBasicToolsofQualityVoptimalhistograms

    References

    1. ^Pearson,K.(1895)."ContributionstotheMathematicalTheoryofEvolution.II.SkewVariationinHomogeneousMaterial".PhilosophicalTransactionsoftheRoyalSocietyA:Mathematical,PhysicalandEngineeringSciences186:343414.Bibcode:1895RSPTA.186..343P(http://adsabs.harvard.edu/abs/1895RSPTA.186..343P).doi:10.1098/rsta.1895.0010(http://dx.doi.org/10.1098%2Frsta.1895.0010).

    2. ^Howitt,D.andCramer,D.(2008)StatisticsinPsychology.PrenticeHall3. ^CharlesStangor(2011)"ResearchMethodsForTheBehavioralSciences".Wadsworth,CengageLearning.

    ISBN9780840031976.4. ^NancyR.Tague(2004)."SevenBasicQualityTools"(http://www.asq.org/learnaboutquality/sevenbasic

    qualitytools/overview/overview.html).TheQualityToolbox.Milwaukee,Wisconsin:AmericanSocietyforQuality.p.15.Retrieved20100205.

    5. ^M.EileenMagnello(December2006)."KarlPearsonandtheOriginsofModernStatistics:AnElasticianbecomesaStatistician"(http://www.rutherfordjournal.org/article010107.html).TheNewZealandJournalfortheHistoryandPhilosophyofScienceandTechnology.1volume.OCLC682200824(https://www.worldcat.org/oclc/682200824).

    6. ^US2000census(http://www.census.gov/prod/2004pubs/c2kbr33.pdf).7. ^Dean,S.,&Illowsky,B.(2009,February19).DescriptiveStatistics:Histogram.Retrievedfromthe

    ConnexionsWebsite:http://cnx.org/content/m16298/1.11/

    8. ^abcScott,DavidW.(1992).MultivariateDensityEstimation:Theory,Practice,andVisualization.NewYork:JohnWiley.

    9. ^abSturges,H.A.(1926)."Thechoiceofaclassinterval".JournaloftheAmericanStatisticalAssociation:6566.JSTOR2965501(https://www.jstor.org/stable/2965501).

    10. ^e.g.5.6"DensityEstimation",W.N.VenablesandB.D.Ripley,ModernAppliedStatisticswithS(2002),Springer,4thedition.ISBN0387954570.

    11. ^EXCEL2007:Histogram(http://cameron.econ.ucdavis.edu/excel/ex11histogram.html)12. ^OnlineStatisticsEducation:AMultimediaCourseofStudy(http://onlinestatbook.com/).ProjectLeader:David

    M.Lane,RiceUniversity(chapter2"GraphingDistributions",section"Histograms")13. ^DoaneDP(1976)Aestheticfrequencyclassication.AmericanStatistician,30:18118314. ^Scott,DavidW.(1979)."Onoptimalanddatabasedhistograms".Biometrika66(3):605610.

    doi:10.1093/biomet/66.3.605(http://dx.doi.org/10.1093%2Fbiomet%2F66.3.605).15. ^Freedman,DavidDiaconis,P.(1981)."Onthehistogramasadensityestimator:L2theory".Zeitschriftfr

    WahrscheinlichkeitstheorieundverwandteGebiete57(4):453476.doi:10.1007/BF01025868(http://dx.doi.org/10.1007%2FBF01025868).

  • WikimediaCommonshasmediarelatedtoHistogram.

    LookuphistograminWiktionary,thefreedictionary.

    16. ^Shimazaki,H.Shinomoto,S.(2007)."Amethodforselectingthebinsizeofatimehistogram"(http://www.mitpressjournals.org/doi/abs/10.1162/neco.2007.19.6.1503).NeuralComputation19(6):15031527.doi:10.1162/neco.2007.19.6.1503(http://dx.doi.org/10.1162%2Fneco.2007.19.6.1503).PMID17444758(https://www.ncbi.nlm.nih.gov/pubmed/17444758).

    Furtherreading

    Lancaster,H.O.AnIntroductiontoMedicalStatistics.JohnWileyandSons.1974.ISBN0471512508

    Externallinks

    JourneyToWorkandPlaceOfWork

    (http://www.census.gov/population/www/socdemo/journey.html)(locationofcensusdocumentcitedinexample)Smoothhistogramforsignalsandimagesfromafewsamples(http://www.mathworks.com/matlabcentral/fileexchange/30480histconnect)Histograms:Construction,AnalysisandUnderstandingwithexternallinksandanapplicationtoparticlePhysics.(http://quarknet.fnal.gov/toolkits/ati/histograms.html)AMethodforSelectingtheBinSizeofaHistogram(http://2000.jukuin.keio.ac.jp/shimazaki/res/histogram.html)Interactivehistogramgenerator(http://www.shodor.org/interactivate/activities/histogram/)Matlabfunctiontoplotnicehistograms(http://www.mathworks.com/matlabcentral/fileexchange/27388plotandcomparenicehistogramsbydefault)DynamicHistograminMSExcel(http://excelandfinance.com/histograminexcel/)Histogramconstruction(http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_ModelerActivities_MixtureModel_1)andmanipulation(http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_PowerTransformFamily_Graphs)usingJavaapplets,andcharts(http://www.socr.ucla.edu/htmls/SOCR_Charts.html)onSOCR

    Retrievedfrom"http://en.wikipedia.org/w/index.php?title=Histogram&oldid=641376957"

  • Categories: Statisticalchartsanddiagrams Qualitycontroltools EstimationofdensitiesNonparametricstatistics

    Thispagewaslastmodifiedon7January2015at08:42.TextisavailableundertheCreativeCommonsAttributionShareAlikeLicenseadditionaltermsmayapply.Byusingthissite,youagreetotheTermsofUseandPrivacyPolicy.WikipediaisaregisteredtrademarkoftheWikimediaFoundation,Inc.,anonprofitorganization.