Biometrics STAT 319 Chapter 1: Introduction

Upload: barrie-green

Post on 21-Jan-2016


  • Biometrics STAT 319 Chapter 1: Introduction

  • Student Password: A student's password is their 8-digit tech-id.

    Standard Form for User-IDs: The standard form for student user-ids is FLastName.Dxxx.yy, where
    - F is the first initial of the student
    - LastName is the last name of the student
    - D is the first letter of the department (M for Math, S for Stat, or H for Hons)
    - xxx is the course number
    - yy is the section number
    Note: all non-alphabetic characters are removed from a student's first and last names before forming the user-id.

    Drive structure: When logged on to a lab or computer classroom computer, drive H: and the My Documents folder refer to the same folder. This is true for faculty and students alike. For students, drive I: refers to the Class Files folder associated with the class. For faculty, drive I: refers to a folder containing folders for all classes taught by the instructor. Each class folder contains the Class Files folder (which the students see as drive I:) and a folder for each student in the class, where students can store their work. Instructors can place files that they want students to access in the Class Files folder; students cannot modify or delete files in that folder.

    STAT 319 Biometrics Spring 2009

  • Important Data Sources
    - Minnesota Department of Health: http://www.health.state.mn.us/stats.html
    - Centers for Disease Control and Prevention (CDC): http://www.cdc.gov/nchs/about/major/nhis/released200812.htm#4
    - Australian Bureau of Statistics: http://www.ausstats.abs.gov.au/ausstats/subscriber.nsf/0/3B1917236618A042CA25711F00185526/$File/43640_2004-05.pdf
    - National Wild Fish Health Survey: http://www.fws.gov/wildfishsurvey
    - Bureau of Justice Statistics: http://ojp.usdoj.gov/bjs/

  • 1.1 Overview

    Statistics is a collection of methods for
    - planning experiments
    - obtaining data (data are collected observations, such as measurements and survey responses)
    - organizing data
    - summarizing data (graphically and numerically)
    - analyzing data
    - interpreting results
    - presenting results, and
    - drawing conclusions or making inferences
    Statistics is a branch of mathematics.

  • Statistics was invented for studying randomness: a lack of order, purpose, cause, or predictability (per Wikipedia), without which the world would be of no interest.

    Examples of random phenomena:
    - Phelps won 8 gold medals
    - A six-sided die is rolled and lands on 4
    - It's going to rain tomorrow

    Randomness, Fuzziness and Uncertainty: Randomness creates uncertainty. On the other hand, randomness can be put to use. When estimating the proportion of current SCSU students who smoke, we can randomly survey 1000 students and use the survey responses as our data. How is randomness used? Why use it?

  • Population and Sample

    In the previous example, all SCSU students form a population, while the 1000 students surveyed form a sample. In general, a population is the complete collection of all items to be studied. These items can be human subjects, animals, machines, or even scores. A sample is a sub-collection of items selected from a population.

  • More about Samples

    A sample should represent the underlying population. Therefore, sample data must be collected in an appropriate way, such as through a process of random selection.

    A self-selected sample is one in which the respondents themselves decide whether to be included. USA Today often publishes results from surveys in which people with strong interests or opinions are more likely to participate. Such survey responses are not representative of the whole population. Valid conclusions based on a self-selected sample can be made only about the specific group of people who chose to participate.

    How large should a sample be? What are the appropriate ways to generate a sample?

  • Parameter and Statistic

    One of the important tasks of statistics is to estimate a quantity for a population. For example, we are interested in the proportion (denoted p) of voters who support presidential candidate X. Here the population consists of all qualified voters and the quantity of interest is p. Another quantity of interest is the average GPA (denoted μ) of all new SCSU students. Here the unknown quantities p and μ are called parameters.

  • Minnesota Teacher Characteristics and Average Salary

    Teacher characteristics, 2005-2006:
    - 97 percent of teachers are licensed
    - 50 percent have advanced degrees
    - 56 percent have taught more than 10 years

    Average salary: The average salary for a Minnesota public school teacher was $46,906 in 2005; there were 52,213 full-time equivalent teachers.

    Source: http://www.house.leg.state.mn.us/hrd/issinfo/tchrchar.htm

  • The following are all parameters:
    - the percent of teachers who are licensed
    - the percent who have advanced degrees
    - the percent who have taught more than 10 years
    - the average salary for a Minnesota public school teacher in 2005
    The true values of these parameters are all known from a census.

  • We now know that a parameter is a quantity associated with a population. To estimate a parameter, one usually takes a sample from the population. A quantity based on the sample, called a statistic, can be used to estimate the unknown population quantity.

    Example: Based on a sample of 877 surveyed executives, it is found that 45% of them would not hire someone with a typographic error on their job application. That figure of 45% is a statistic because it is based on a sample.
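The distinction between a parameter and a statistic can be illustrated with a small simulation. The sketch below assumes a made-up population of 10,000 students in which about 20% smoke; the population size, the 20% figure, the seed, and the variable names are all illustrative assumptions, not values from the course.

```r
# Hypothetical population: 10,000 students, coded 1 = smoker, 0 = non-smoker.
set.seed(319)                                # arbitrary seed for reproducibility
population <- rbinom(10000, size = 1, prob = 0.2)

p <- mean(population)                        # parameter: proportion in the whole population

srs <- sample(population, size = 1000)       # simple random sample, without replacement
p_hat <- mean(srs)                           # statistic: proportion in the sample

cat("parameter p     =", p, "\n")
cat("statistic p_hat =", p_hat, "\n")
```

In practice p is unknown and only p_hat is observed; here both are visible because the population itself was simulated.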

  • A parameter is a measurement describing some characteristic of a population. A statistic is a measurement describing some characteristic of a sample.

  • 1.2 Types of Data

    Data are observations that have been collected. Data can be
    - numerical, such as heights, weights, incomes, GPAs, and tumor counts, or
    - non-numerical, such as colors, genders, smoking status, and political affiliations.
    Numerical data are called quantitative data, which consist of numbers representing counts or measurements. Non-numerical data are called qualitative (or categorical) data, which can be separated into different categories.

  • Types of Data (cont'd)

    Quantitative data can be discrete or continuous. Discrete data are counts, such as the number of bacteria in a bottle of water. Continuous data are measurements that can assume any value over a continuous span, such as the amount of water in a bottle.

  • Four Levels of Measurement

    There are four levels of measurement.
    - Data are at the nominal level of measurement if they cannot be arranged in an ordering scheme, such as colors and genders.
    - Data are at the ordinal level of measurement if they are qualitative but can be arranged in an ordering scheme, such as letter grades.

  • Four Levels of Measurement (cont'd)

    - Data are at the interval level of measurement if they are quantitative but a zero does not mean "none", such as temperatures and years.
    - Data are at the ratio level of measurement if they are quantitative and a zero does mean "none", such as weights, heights, ages, and GPAs.
    For interval data, differences are meaningful, while ratios are meaningless. For example, 400°F is not twice as hot as 200°F. For ratio data, both differences and ratios are meaningful.

  • A Data Example

    This case study is an example of a clinical trial to assess the effectiveness of a new drug as part of a combination therapy (diet, exercise, and drug) to treat obesity.

  • 1.3 Design of Experiments

    An experiment is a study design in which experimental units are randomly assigned to treatments.

    Vocabulary:
    - Experimental units are the individuals on whom an experiment is performed. They are usually called subjects or participants when they are human.
    - A treatment is the process, intervention, or other controlled circumstance applied to randomly assigned experimental units.

  • Example of Experiment

    Over a 4-month period, among 30 people with bipolar disorder, patients who were given a high dose (10 g/day) of omega-3 fats from fish oil improved more than those given a placebo. Identify the experimental units and treatments used.

  • Observational Studies

    An observational study is one in which no manipulation of treatments is employed. In observational studies the researcher doesn't assign choices but observes outcomes. Observational studies are widely used in public health and marketing.

  • Example of Observational Studies

    (Blood pressure) In a test of roughly 200 men and women, those with moderately high blood pressure (averaging 164/89 mm Hg) did worse on tests of memory and reaction time than those with normal blood pressure. (Hypertension 36 [2000]: 1079)

  • Retrospective and Prospective Studies

    An observational study can be retrospective or prospective. A retrospective (or case-control) study is one in which subjects are first identified and then their previous conditions or behaviors are determined. A prospective (or cohort) study is one in which subjects are followed to observe future outcomes.

  • Case-Control Studies
    - outcome is measured before exposure
    - controls are selected on the basis of not having the outcome
    - good for rare outcomes
    - relatively inexpensive
    - smaller numbers required
    - quicker to complete
    - prone to selection bias
    - prone to recall/retrospective bias
    - related methods are risk (retrospective), chi-square 2 by 2 test, Fisher's exact test, exact confidence interval for odds ratio, odds ratio meta-analysis, and conditional logistic regression

    Source: http://www.statsdirect.com/help/basics/prospective.htm

  • Cohort Studies
    - outcome is measured after exposure
    - yields true incidence rates and relative risks
    - may uncover unanticipated associations with outcome
    - best for common outcomes
    - expensive
    - requires large numbers
    - takes a long time to complete
    - prone to attrition bias (compensate by using person-time methods)
    - prone to the bias of change in methods over time
    - related methods are risk (prospective), relative risk meta-analysis, risk difference meta-analysis, and proportions

  • Examples of Retrospective and Prospective Studies
    - A researcher obtains data about head injuries by examining hospital records from the past 5 years. (retrospective)
    - (Psychology of Trauma) A researcher plans to obtain data by following (to the year 2020) siblings of victims who perished in a terrorist attack. (prospective)

  • More Reading: http://altmed.creighton.edu/HIV/retrovspro.htm

  • Cross-sectional Study

    A cross-sectional study involves data collected at a single point in time, often using survey research methods.

    Example: The Centers for Disease Control (CDC) obtains current flu data by polling 3000 people this month.

  • Differences between Experiments and Observational Studies
    - Whether treatments are employed.
    - Experiments can study causal relationships, but observational studies cannot.
    For example, experiments can (but observational studies cannot) answer questions such as:
    - Does taking vitamin C reduce the chance of getting a cold?
    - Is this drug a safe and effective treatment for that disease?

  • Study Issues

    The results of observational studies are considered much less convincing than those of designed experiments, because they are much more prone to selection bias. Researchers attempt to compensate for this with more complicated statistical methods, such as propensity score matching.

    Experiments may be ruined by confounding. Confounding occurs when the effects of variables are mixed in such a way that the individual effects of the variables cannot be identified.

  • Example: 1000 people are treated with a vaccine designed to prevent Lyme disease, which is carried by ticks. If an early onset of cold weather causes the ticks to hibernate and the 1000 vaccinated subjects subsequently experience an unusually low incidence of Lyme disease, we don't know whether the lower disease rate is the result of an effective vaccine or of the early onset of cold weather. The effects of the vaccine and the effects of the cold weather have been mixed and cannot be distinguished. A better experimental design would take account of both the vaccine and the cold weather.

  • Controlling Effects of Variables

    Effects of variables can be controlled by using such devices as
    - blinding
    - blocking
    - randomization

  • Blinding

    In an experimental design to test the effectiveness of a vaccine, some subjects are given the vaccine, while others are given a placebo. A placebo effect occurs when an untreated subject reports an improvement in symptoms. Blinding can minimize a placebo effect. An experiment can be single-blinded or double-blinded.

  • Blocking

    Blocking is the arranging of experimental units in groups (blocks) that are similar to one another. For example, an experiment is designed to test a new drug on patients. In addition to the new drug treatment, a placebo is also administered to male and female patients in a double-blind trial. The sex of the patient is a blocking factor that accounts for treatment variability between males and females. Blocking reduces sources of variability and thus leads to greater precision.

  • Blocking

    Males                Females
    30 with treatment    30 with treatment
    30 with placebo      30 with placebo

  • Randomization

    A third device for controlling the effects of variables in an experiment is to randomly assign subjects to treatments. Randomization tends to balance treatment groups with respect to confounding variables.

    When assigning subjects, one approach is to use a completely randomized design (CRD), in which the assignment is made by a completely random process. For example, imagine that we have children, a coin, a vaccine, and a placebo. Flip the coin for each child; assign the child to the vaccine if the outcome is heads, otherwise to the placebo.

  • Randomization (cont'd)

    A CRD is not efficient when a blocking factor exists. A more efficient approach is to use a randomized (complete) block design (RCBD). In the previous example, we first form blocks of males and females. Then, within each block, we use a CRD. If the vaccine does affect males and females differently, the RCBD has a much better chance of detecting that difference.
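The two assignment schemes can be sketched in R. The example below assumes 60 male and 60 female subjects (the group sizes from the blocking table); the data frame, its column names, and the seed are hypothetical. The CRD column uses independent "coin flips", while the RCBD column is a variant that forces a balanced 30/30 split within each sex block by permuting a fixed set of labels, which is one common way to randomize within blocks rather than the only one.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical subjects: 60 males and 60 females.
subjects <- data.frame(id = 1:120, sex = rep(c("M", "F"), each = 60))

# CRD via coin flips: each subject independently gets vaccine or placebo.
subjects$crd <- sample(c("vaccine", "placebo"), nrow(subjects), replace = TRUE)

# RCBD: within each sex block, randomly permute a balanced set of labels.
subjects$rcbd <- NA_character_
for (s in unique(subjects$sex)) {
  rows <- which(subjects$sex == s)
  labels <- rep(c("vaccine", "placebo"), length.out = length(rows))
  subjects$rcbd[rows] <- sample(labels)   # random permutation of the labels
}

print(table(subjects$sex, subjects$crd))   # group sizes vary by chance
print(table(subjects$sex, subjects$rcbd))  # exactly 30 per cell
```

Comparing the two tables shows why blocking helps: the coin-flip design can leave the sexes unevenly split across treatments, while the blocked design cannot.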

  • Replication and Sample Size

    In addition to controlling the effects of variables, another key element of experimental design is the sample size. The larger the sample size in a treatment group, the easier it is to detect differences between treatments. Applying the same treatment to more than one subject is called replication. Replication increases the sample size. The subjects receiving the same treatment are called replicates.

  • Sampling Strategies

    Random sampling: obtaining a sample in such a way that each individual in the population has the same chance of being chosen. A sample thus obtained is called a random sample. A random sample of size n is called a simple random sample (SRS) if every possible sample of size n has the same chance of being chosen.

  • Example

    Picture a classroom with 36 students arranged in six rows of 6 students each. Consider two sampling schemes:
    (1) Write 1 to 36 on 36 slips of paper, a different number on each slip. Label the students 1 to 36. Put the 36 slips in a bag and shuffle. Take out 6 at random.
    (2) Roll a fair die and select the row of students corresponding to the outcome.
    Which scheme results in an SRS?
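One way to explore the classroom question is to simulate both schemes and count how many different samples each can produce. This is a sketch only; the seating matrix and variable names are illustrative assumptions.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Seat labels: row r of the matrix holds students 6(r-1)+1 through 6r.
seats <- matrix(1:36, nrow = 6, byrow = TRUE)

scheme1 <- sort(sample(1:36, 6))     # six slips drawn without replacement
scheme2 <- seats[sample(1:6, 1), ]   # one die roll picks an entire row

print(scheme1)
print(scheme2)

n_samples_scheme1 <- choose(36, 6)   # every 6-student subset is possible
n_samples_scheme2 <- nrow(seats)     # only the six rows are possible
print(c(n_samples_scheme1, n_samples_scheme2))
```

Under both schemes each student has a 1-in-6 chance of selection, but only under scheme (1) is every group of six students equally likely, which is what the SRS definition requires.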

  • Selecting a Simple Random Sample with a Random-Digit Table and Software

    The question: How can an auditor select 10 accounts to audit in a school district that has 60 accounts?

    Using random-digit tables:
    - Number the subjects in the sampling frame 01, 02, 03, ..., 60 (all numbers have 2 digits, as 60 does).
    - In a random-digit table, such as the one below, start from any row and any column you like, say row 2 and column 7. Select two digits at a time, discarding repeated numbers and those that are 00 or larger than 60. Continue until you have 10 numbers.

                Columns
    Rows    1-5      6-10     11-15    16-20
    1       30120    13850    81903    56587
    2       69696    81799    27328    33287
    3       17784    00005    25584    51364
    4       35821    49630    87686    53852
    5       75763    40570    04655    30679

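  • Answer:

    17, 99, 27, 32, 83, 32, 87, 17, 78, 40, 00, 05, 25, 58, 45, 13, 64, 35

    Discarding the out-of-range values (99, 83, 87, 78, 00, 64) and the repeated 32 and 17 leaves the 10 selected accounts: 17, 27, 32, 40, 05, 25, 58, 45, 13, 35.

    Tip: record the numbers in order as you read them so that repeats are easy to spot.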
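The table-reading procedure can be mirrored in R as a check. The sketch below types in the five table rows as digit strings, starts at row 2, column 7, reads two digits at a time, and keeps the first 10 distinct labels between 01 and 60. The variable names are illustrative, and it assumes reading continues from the end of one row to the start of the next.

```r
# Rows of the random-digit table, 20 digits per row.
digit_rows <- c("30120138508190356587",
                "69696817992732833287",
                "17784000052558451364",
                "35821496308768653852",
                "75763405700465530679")

start_row <- 2; start_col <- 7   # starting position used in the example
n_needed <- 10; n_accounts <- 60

# Read the digits as one stream beginning at (start_row, start_col).
stream <- paste(digit_rows[start_row:length(digit_rows)], collapse = "")
stream <- substring(stream, start_col)

# Split the stream into consecutive two-digit numbers.
starts <- seq(1, nchar(stream) - 1, by = 2)
pairs  <- as.integer(substring(stream, starts, starts + 1))

# Keep labels 01..60, drop repeats, stop after the first 10.
valid    <- pairs[pairs >= 1 & pairs <= n_accounts]
selected <- head(unique(valid), n_needed)
print(selected)   # 17 27 32 40 5 25 58 45 13 35
```

With software alone, `sample(1:60, 10)` draws an SRS of 10 account labels directly, without a table.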

  • Systematic Sampling

    A systematic random sample is one in which sample units are selected at specified intervals. A "random start" is required as a basis for selecting the units for the sample.

  • Selecting a Systematic Random Sample

    A table of random digits provides an objective method of selecting a "random start." For example, assume that a listing of 50,000 units represents the population from which a systematic random sample of 400 units is desired. The sample size, 400, is 400/50,000 or 1/125 of the population. From the table, select at random a number between 1 and 125 to begin the sample. If the number selected from the table is 64, the sample would consist of every 125th unit on the listing or in the file, beginning with the 64th unit. Thus, if the units in the population are numbered consecutively, the 64th, 189th, 314th, 439th, 564th, etc., units would be drawn as the sample. Such a sample is called a 1-in-125 systematic sample.

    Questions:
    (1) Do all the units have the same chance of being selected? If yes, what is the common probability?
    (2) How can the label of the last sample unit be determined? What is it?

    Adapted from www.warms.vba.va.gov/admin20/m20_2/Appc.doc

  • Example

    How can we sample 10 houses from a street of 123 houses?
    - Number the houses 001, 002, ..., 123.
    - Since 123/10 = 12.3, round down to 12, so every 12th house is chosen after a random starting point between 1 and 12 is chosen.
    - If the random starting point is 8, then the houses selected are the 8th, 20th, 32nd, 44th, 56th, 68th, 80th, 92nd, 104th, and 116th.
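Systematic selection is simple to express with `seq()`. The helper below, `systematic_sample`, is a hypothetical name introduced for illustration; it returns the positions of a 1-in-k systematic sample given a start between 1 and k. In practice the start would be drawn at random, for example with `sample(1:k, 1)`; here the starts from the two examples above are used so the results can be compared by hand.

```r
# Positions of a 1-in-k systematic sample from units 1..N, given a start in 1..k.
systematic_sample <- function(N, k, start) {
  stopifnot(start >= 1, start <= k)
  seq(from = start, to = N, by = k)
}

# Street of 123 houses, interval 12, start 8; keep the first 10 houses.
houses <- head(systematic_sample(N = 123, k = 12, start = 8), 10)
print(houses)        # 8 20 32 44 56 68 80 92 104 116

# Listing of 50,000 units, 1-in-125 sample, start 64.
units <- systematic_sample(N = 50000, k = 125, start = 64)
print(length(units)) # 400
print(max(units))    # 49939
```

The last label in the 1-in-125 sample is 64 + 399 × 125 = 49,939, which is one way to answer the second question on the earlier slide.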

  • Convenience Sampling

    Simply collect results that are very easy to get.

  • Stratified Sampling

    Subdivide the population into at least two different subgroups (called strata) that share the same characteristics (such as gender or age bracket), then draw a simple random sample from each stratum.

  • Example

    At a large university, a simple random sample of 5 female professors is selected and a simple random sample of 10 male professors is selected. The two samples are combined to give an overall sample of 15 professors. The overall sample is a stratified sample.

  • Example

    Olivia is planning to take a foreign language class. To research how satisfied other students are with their foreign language classes, she decides to take a sample of 20 such students. The university offers classes in four languages: Spanish, German, French, and Japanese. She will select a simple random sample of five students from each language.

  • Computing the Mean from a Stratified Sample

    Suppose that a population can be stratified into k groups (called strata) containing N_1, N_2, ..., N_k units, respectively. Suppose a stratified sample is selected, with n_1 units from stratum 1, ..., and n_k units from stratum k. Denote the sample means of the k strata by m_1, m_2, ..., m_k, respectively. Then the mean of the stratified sample is defined as

    $\bar{x}_{st} = \dfrac{N_1 m_1 + N_2 m_2 + \cdots + N_k m_k}{N_1 + N_2 + \cdots + N_k}$

    (Correction has been made.)
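As a numerical illustration, the sketch below computes a stratified mean using the standard population-weighted form, in which each stratum mean m_i is weighted by its stratum size N_i. That this is the formula intended on the slide is an assumption, and the stratum sizes and means are made-up values.

```r
# Hypothetical strata: population sizes N_i and stratum sample means m_i.
N <- c(200, 300, 500)   # units in each stratum (assumed values)
m <- c(3.1, 2.8, 3.4)   # sample mean within each stratum (assumed values)

# Population-weighted mean: sum(N_i * m_i) / sum(N_i).
stratified_mean <- sum(N * m) / sum(N)
print(stratified_mean)            # 3.16

# The built-in weighted.mean() gives the same result.
print(weighted.mean(m, w = N))
```

Weighting by N_i rather than by the sample sizes n_i keeps a stratum from being over-represented just because more units happened to be sampled from it.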

  • The SURVEYMEANS procedure in SAS: http://www.d.umn.edu/math/docs/saspdf/stat/chap61.pdf

  • Stratified Sampling: Advantages and Disadvantages

    Advantages
    - Better coverage of the population
    - Convenient to administer
    - More efficient

    Disadvantages
    - Appropriate strata are sometimes difficult to identify

  • Cluster Sampling

    First divide the population area into sections (called clusters), then randomly select some of those clusters, and then choose all the members from the selected clusters.

  • Example

    Suppose you are a representative from an athletic organization wishing to find out which sports Grade 11 students are participating in across Canada. It would be too costly and lengthy to survey every Canadian in Grade 11, or even a couple of students from every Grade 11 class in Canada. Instead, 100 schools are randomly selected from all over Canada.

  • Cluster Sampling: Advantages and Disadvantages

    Advantages
    - Saves time
    - Reduces cost
    - Does not require an accurate list of the whole population

    Disadvantages
    - Less likely to represent the whole population
    - No total control over the final sample size

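  • R Codes: Demonstrating SRS Techniques with Animations

    sample.srs = function(pop = 1:205, n = 20){
      s = floor(sqrt(length(pop)))                # side of the square grid
      x = cbind(sort(rep(1:s, s)), rep(1:s, s))   # grid positions of the first s^2 units
      y = length(pop) - s^2                       # leftover units, drawn along y = 0
      plot(x, xlim = c(1, s), ylim = c(0, s), pch = 20)
      points(cbind(1:y, 0), pch = 20)
      for (i in sample(pop, n)){
        # The transcript is cut off after "if (i"; the rest of this loop is a
        # reconstruction modeled on the two functions below, not the original.
        if (i <= s^2) points(x[i, 1], x[i, 2], col = "red", pch = 10, cex = 2)
        else points(i - s^2, 0, col = "red", pch = 10, cex = 2)
        Sys.sleep(0.05)
      }
    }
    sample.srs()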
  • R Codes: Demonstrating Cluster Sampling Techniques with Animations

    sample.cluster = function(pop = list(1:20, 1:30, 1:40, 1:50, 1:60), n = 3){
      len = sapply(pop, length)     # size of each cluster
      k = length(pop)               # number of clusters
      plot(1, 1, type = 'n', xlim = c(1, max(len)), ylim = c(1, k))
      for (i in 1:k){               # draw every unit, one row per cluster
        for (j in pop[[i]]) points(j, i, pch = 20)
      }
      x = sample(1:k, n)            # randomly select n whole clusters
      for (i in x){                 # highlight all units in the selected clusters
        for (j in pop[[i]]){
          points(j, i, col = "red", pch = 10, cex = 2); Sys.sleep(0.05)
        }
      }
    }
    sample.cluster()

  • R Codes: Demonstrating Stratified Sampling Techniques with Animations

    sample.stratified = function(pop = list(1:20, 1:30, 1:40, 1:50, 1:60), n = 2:6){
      len = sapply(pop, length)     # size of each stratum
      k = length(pop)               # number of strata
      plot(1, 1, type = 'n', xlim = c(1, max(len)), ylim = c(1, k))
      for (i in 1:k){               # draw every unit, one row per stratum
        for (j in pop[[i]]){
          points(j, i, pch = 20)
        }
      }
      for (i in 1:k) {
        s = sample(len[i], n[i])    # SRS of n[i] positions within stratum i
        for (j in s) {
          points(pop[[i]][j], i, col = "red", pch = 10, cex = 2); Sys.sleep(1)
        }
      }
    }
    sample.stratified()

  • Multistage Sample Designs

    A multistage sample design combines some of the five sampling schemes above.

  • Example

    In order to select a sample of undergraduate students in the United States, a simple random sample of four states is selected. From each of these states, a simple random sample of two colleges or universities is then selected. Finally, from each of these eight colleges or universities, a simple random sample of 20 undergraduates is selected. The final sample consists of 160 undergraduates.
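The three stages of this example can be sketched in R. The sampling frame below is entirely hypothetical: 50 states, 10 colleges per state, and 500 undergraduates per college are assumed so that the code is self-contained, and all names are illustrative.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Hypothetical frame: 50 states, 10 colleges per state, 500 students per college.
states <- paste0("state", 1:50)

chosen_states <- sample(states, 4)             # stage 1: SRS of 4 states
pieces <- list()
for (st in chosen_states) {
  colleges <- paste0(st, "_college", 1:10)
  for (cg in sample(colleges, 2)) {            # stage 2: SRS of 2 colleges per state
    students <- sample(1:500, 20)              # stage 3: SRS of 20 students per college
    pieces[[cg]] <- data.frame(college = cg, student = students)
  }
}
final_sample <- do.call(rbind, pieces)

print(nrow(final_sample))                      # 4 * 2 * 20 = 160
print(length(unique(final_sample$college)))    # 8
```

Only the colleges in the chosen states and the students in the chosen colleges ever need to be listed, which is the practical appeal of multistage designs.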

  • Example

    On a chilly spring afternoon, 10 lab sections of a statistics class all have full attendance. The 10 lab sections each have the same number of students enrolled. A class evaluation is about to be administered to some of the students. It has been decided to first randomly select 3 of the 10 lab sections and then give the evaluation to a simple random sample of one-fourth of the students in those sections.

  • Sampling Errors

    A sampling error is the difference between a sample result and the true population result. Such an error results from sample-to-sample variation. A non-sampling error occurs when the sample data are incorrectly collected, recorded, or analyzed.

  • Biometrics STAT 319-002 Chapter 2: Describing, Exploring, and Comparing Data

  • 2.1 Overview

    Important Characteristics of Data
    - Center: a representative value that indicates where the middle of the data set is located.
    - Variation: a measure of the amount that the data values vary among themselves.
    - Distribution: the nature or shape of the distribution of the data (such as bell-shaped, uniform, or skewed).
    - Outliers: sample values that lie very far away from the majority of the other sample values.

  • Descriptive Statistics and Inferential Statistics

    The numerical summaries and graphical summaries to be presented in this chapter are called descriptive statistics. Methods for making inferences about a population using sample data are called inferential statistics.

  • 2.2 Frequency Distributions

    A frequency distribution lists data values (individually for categorical data, or by groups or intervals for quantitative data), along with their corresponding frequencies (or counts).

    Vocabulary: The frequency for a particular category is the number of original values that fall into the category.

  • Example (categorical data)

    The array of grades of a statistics class is given below:
    B B A C B A C C B B A A F D B C A B C B B D
    The frequency distribution of grades is given in the table. Does the frequency distribution contain the same amount of information as the data does?

    Grade    Frequency
    A        5
    B        9
    C        5
    D        2
    F        1
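In R, a frequency distribution for categorical data can be tabulated with `table()`. The sketch below uses the 22 grades from the example; wrapping them in `factor()` with explicit levels is a choice made here so the categories print in A-to-F order.

```r
grades <- c("B", "B", "A", "C", "B", "A", "C", "C", "B", "B", "A",
            "A", "F", "D", "B", "C", "A", "B", "C", "B", "B", "D")

# Count how many grades fall into each category.
freq <- table(factor(grades, levels = c("A", "B", "C", "D", "F")))
print(freq)
```

The table records how many of each grade occurred but not the order in which they were listed.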

  • Example (quantitative data)

    The systolic blood pressures (SBP) of 20 men are given:
    93 104 105 108 109 112 114 115 117 119 119 120 121 123 127 130 135 139 139 158

    The frequency distribution of the data is given in the table. Here [90,100] means 90 to 100, inclusive, while (100,110] means 100 to 110, excluding 100.

    Does the frequency distribution contain the same amount of information as the data does?

    Tip: First sort the data from lowest to highest.

    SBP (Interval)    Frequency
    [90,100]          1
    (100,110]         4
    (110,120]         7
    (120,130]         4
    (130,140]         3
    (140,150]         0
    (150,160]         1

  • Terms Used with Frequency Distributions

    Classes are categories (for categorical data) or intervals (for quantitative data). Intervals should have the same length.

    For quantitative data, we have the following terms:
    - Lower class limits are the smallest numbers that can belong to the different classes.
    - Upper class limits are the largest numbers that can belong to the different classes.
    - Class boundaries are the numbers used to separate classes.
    - Class midpoints are the midpoints of the classes.
    - Class width is the common length of the classes.

  • Example

    Find the following for the frequency distribution below:
    - Classes:
    - Lower class limits:
    - Upper class limits:
    - Class midpoints:
    - Class width:

    SBP          Frequency
    [90,100]     1
    (100,110]    4
    (110,120]    7
    (120,130]    4
    (130,140]    3
    (140,150]    0
    (150,160]    1

  • Answer

    - Classes: 7 classes, [90,100], (100,110], ...
    - Lower class limits: 90, 100, 110, ..., 150
    - Upper class limits: 100, 110, ..., 160
    - Class midpoints: 95, 105, ..., 155
    - Class width: 10

    SBP          Frequency
    [90,100]     1
    (100,110]    4
    (110,120]    7
    (120,130]    4
    (130,140]    3
    (140,150]    0
    (150,160]    1

  • Procedure for Constructing a Frequency Distribution

    Step 1: Decide on the number of classes you want (5 to 25).
    Step 2: Calculate the class width and round up: class width ≈ (maximum - minimum) / (number of classes).
    Step 3: Determine the lower class limit of the first class. This number is either the lowest data value or a convenient value that is a little smaller.
    Step 4: Determine all other lower class limits using the lower class limit of the first class and the class width.
    Step 5: List all the lower class limits in a vertical column and proceed to enter the upper class limits, which are easily identified.
    Step 6: Enter the second column of frequencies.

  • Example

    Construct the frequency distribution for the systolic blood pressures (SBP) of 20 men
    93 104 105 108 109 112 114 115 117 119 119 120 121 123 127 130 135 139 139 158
    using 7 classes.

    We need to determine:
    - Number of classes:
    - Class width:
    - The lower limit of the first class:
    - Other lower limits:
    - Upper limits:
    - Frequencies:

  • Answer

    Construct the frequency distribution for the systolic blood pressures (SBP) of 20 men
    93 104 105 108 109 112 114 115 117 119 119 120 121 123 127 130 135 139 139 158
    using 7 classes.

    Solution
    - Number of classes: 7
    - Class width: (158 - 93) / 7 ≈ 9.3, rounded up to 10
    - The lower limit of the first class: 90
    - Other lower limits: 100, 110, 120, 130, 140, 150
    - Upper limits: 100, 110, ..., 160
    - Frequencies: see the table

    SBP          Frequency
    [90,100]     1
    (100,110]    4
    (110,120]    7
    (120,130]    4
    (130,140]    3
    (140,150]    0
    (150,160]    1
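The six-step procedure maps closely onto R's `cut()` and `table()`. The sketch below follows the steps for the SBP data; the variable names are illustrative. With `right = TRUE` and `include.lowest = TRUE`, the classes are [90,100], (100,110], and so on, so a boundary value such as 120 or 130 is counted in the class that ends at it. Setting `right = FALSE` would instead give classes of the form [a,b) and move those boundary values up one class.

```r
sbp <- c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
         119, 120, 121, 123, 127, 130, 135, 139, 139, 158)

n_classes <- 7                                        # Step 1
width <- ceiling((max(sbp) - min(sbp)) / n_classes)   # Step 2: 65 / 7 is about 9.3, round up
lower <- 90                                           # Step 3: convenient value below the minimum
breaks <- lower + width * 0:n_classes                 # Steps 4-5: 90, 100, ..., 160

# Step 6: assign each value to its class and count.
classes <- cut(sbp, breaks = breaks, right = TRUE, include.lowest = TRUE)
freq <- table(classes)
print(freq)
```

`hist(sbp, breaks = breaks)` would draw the corresponding histogram using the same class boundaries.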

  • Construct Frequency Distributions in Excel

    Data Analysis → Histogram → specify data ranges and upper class limits (bins).
    By default, Excel generates frequency distributions.

  • STAT 319 Biometrics Spring 2009

    STAT 319 Biometrics Spring 2009

  • Relative Frequency Distribution
    The relative frequency for a class is the proportion of all observations that fall in that class, usually expressed as a percent:
    relative frequency = (class frequency) / (sum of all frequencies)
    If the frequencies in a frequency distribution are replaced by relative frequencies, the resulting table is called a relative frequency distribution.

  • Examples

    SBP          Frequency   Relative Frequency
    [90,100)     1           1/20 = 5%
    [100,110)    4           4/20 = 20%
    [110,120)    6           6/20 = 30%
    [120,130)    4           4/20 = 20%
    [130,140)    4           4/20 = 20%
    [140,150)    0           0/20 = 0%
    [150,160)    1           1/20 = 5%

    Grades   Frequency   Relative Frequency
    A        5           5/22 =
    B        9           9/22 =
    C        5           5/22 =
    D        2           2/22 =
    F        1           1/22 =
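As a quick check on the arithmetic, the relative frequencies for the grades table can be computed in R. This is only a sketch; the object names `grades` and `rel_freq` are illustrative.

```r
# Grade counts from the table (22 students in total)
grades <- c(A = 5, B = 9, C = 5, D = 2, F = 1)

# relative frequency = frequency / sum of all frequencies
rel_freq <- grades / sum(grades)

# expressed as percents, rounded to one decimal place
round(100 * rel_freq, 1)
```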

  • Cumulative Frequency Distribution for a Quantitative Variable
    The cumulative frequency for a class is the sum of the frequencies for that class and all previous classes.
    A cumulative frequency distribution lists intervals of the form "less than or equal to x", along with the number of values falling in each interval.
    The values of x are chosen to be the upper class limits.

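  • Example

    SBP      Cumulative Frequency
    < 100    1
    < 110    5
    < 120    11
    < 130    15
    < 140    19
    < 150    19
    < 160    20 (the total)

    Because the SBP classes include their lower limit and exclude their upper limit, each row counts the values strictly below the stated limit.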

  • Example
    Construct the cumulative frequency distribution that corresponds to the given frequency distribution.

    Cholesterol of Men   Frequency
    [0,200]              1
    (200,400]            5
    (400,600]            11
    (600,800]            15
    (800,1000]           19
    (1000,1200]          19
    (1200,1400]          20

    Cholesterol of Men   Cumulative Frequency
    ≤ 200                1
    ≤ 400                6
    ≤ 600                17
    ≤ 800                32
    ≤ 1000               51
    ≤ 1200               70
    ≤ 1400               90 (total)

  • Cumulative Relative Frequency Distribution for a Quantitative Variable

    Cholesterol of Men   Cumulative Frequency   Cumulative Relative Frequency
    ≤ 200                1                      1/90 = 1.11%
    ≤ 400                6                      6/90 =
    ≤ 600                17                     17/90 =
    ≤ 800                32                     32/90 =
    ≤ 1000               51                     51/90 =
    ≤ 1200               70                     70/90 =
    ≤ 1400               90 (total)             90/90 = 100%
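The cumulative and cumulative relative frequencies for the cholesterol example can be reproduced with `cumsum`. This is a small sketch; the column names in the data frame are illustrative.

```r
# Class frequencies and upper class limits from the cholesterol table
freq  <- c(1, 5, 11, 15, 19, 19, 20)
upper <- seq(200, 1400, by = 200)

cum_freq <- cumsum(freq)             # running total of the frequencies
cum_rel  <- cum_freq / sum(freq)     # cumulative relative frequency

data.frame(upper_limit = upper,
           cum_freq    = cum_freq,
           cum_rel_pct = round(100 * cum_rel, 2))
```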

  • 2.3 Visualizing Data
    Graphs to be constructed:
    Histogram
    Ogive
    Dotplot
    Stem-and-leaf plot
    Pareto chart
    Pie chart
    Scatterplot
    Time-series graph

  • Histograms
    A histogram is a bar graph in which the horizontal scale represents classes/intervals of data values and the vertical scale represents frequencies (or relative frequencies). The heights of the bars correspond to the frequency (or the relative frequency) values, and the bars are drawn adjacent to each other without gaps.

  • Example
    Construct a histogram for the systolic blood pressures (SBP) of 20 men:
    93 104 105 108 109 112 114 115 117 119 119 120 121 123 127 130 135 139 139 158

    SBP          Frequency
    [90,100)     1
    [100,110)    4
    [110,120)    6
    [120,130)    4
    [130,140)    4
    [140,150)    0
    [150,160)    1

  • R Code
    SBP = c(93, 104, 105, 108, 109, 112, 114, 115, 117, 119,
            119, 120, 121, 123, 127, 130, 135, 139, 139, 158)
    # right = FALSE puts 120 and 130 in the class that starts at them,
    # so the bar heights match the frequency table (1, 4, 6, 4, 4, 0, 1)
    hist(SBP, breaks = seq(90, 160, 10), right = FALSE, col = 'green')

    Copy and paste this code into R to see the histogram.

  • Frequency Polygons
    A frequency polygon uses line segments connected to points located directly above class midpoint values. The line segments are extended to the right and left so that the graph begins and ends on the horizontal axis.

  • Frequency polygon of SBP
    Class midpoints: 94.5, 104.5, 114.5, 124.5, 134.5, 144.5, 154.5

    Midpoint   Frequency
    84.5       0
    94.5       1
    104.5      4
    114.5      6
    124.5      4
    134.5      4
    144.5      0
    154.5      1
    164.5      0

    [Figure: Frequency polygon; x-axis: SBP (class midpoints), y-axis: Frequency]

  • Ogive of SBP

    Classes    Midpoint   Frequency   Boundary   Cumulative Freq
               84.5       0           89.5       0
    90-99      94.5       1           99.5       1
    100-109    104.5      4           109.5      5
    110-119    114.5      6           119.5      11
    120-129    124.5      4           129.5      15
    130-139    134.5      4           139.5      19
    140-149    144.5      0           149.5      19
    150-159    154.5      1           159.5      20
               164.5      0

    [Figure: Ogive of SBP; x-axis: SBP (class boundaries), y-axis: Cumulative Freq]

  • Example: Dot Plot
    Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6
    [Figure: dot plot of the quiz scores on a number line from 1 to 10]
    Dot plot: shows a dot for each observation, placed just above the value on the number line for that observation.

  • Stem-and-Leaf Plots
    A stem-and-leaf plot is similar to a dot plot. Each observation is represented by a stem and a leaf.

  • Example: Stem-and-Leaf Plot
    Test scores for 12 students: 80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76
    Step 1: Sort the test scores: 45, 62, 74, 75, 76, 76, 80, 84, 87, 87, 96, 100
    Step 2: Place the scores in the corresponding stems and leaves (usually the last digit is the leaf).

    Stem   Leaves
    4      5
    5
    6      2
    7      4 5 6 6
    8      0 4 7 7
    9      6
    10     0
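The two steps can be sketched in R by splitting each score into its leading digits (stem) and last digit (leaf). This is an illustrative sketch rather than the built-in `stem()` function, whose layout can differ; the names `stems`, `leaves`, and `plot_rows` are assumptions made here.

```r
scores <- c(80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76)

sorted <- sort(scores)                            # Step 1: sort the scores
stems  <- factor(sorted %/% 10, levels = 4:10)    # Step 2: leading digits are the stem
leaves <- sorted %% 10                            #         last digit is the leaf

# one row of leaves per stem; a stem with no scores gets an empty row
plot_rows <- sapply(split(leaves, stems), paste, collapse = " ")
cat(sprintf("%3s | %s\n", names(plot_rows), plot_rows), sep = "")
```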

  • Example
    Quiz scores for 12 students: 8.0, 4.5, 10.0, 7.6, 8.4, 8.7, 9.6, 6.2, 7.5, 7.4, 8.7, 7.6
    Step 1: Sort the quiz scores: 4.5, 6.2, 7.4, 7.5, 7.6, 7.6, 8.0, 8.4, 8.7, 8.7, 9.6, 10.0
    Step 2: Place the scores in the corresponding stems and leaves (usually the last digit is the leaf).

    Stem   Leaves
    4.     5
    5.
    6.     2
    7.     4 5 6 6
    8.     0 4 7 7
    9.     6
    10.    0

  • Pareto Chart: a bar graph with categories ordered by their frequency, from the tallest bar to the shortest.

    Group    Percent (%)
    EPP      37.7
    PES      27.3
    ELDR     9.2
    Other    9
    EFA      5.7
    EUL      5.3
    UEN      3.7
    EDD      2
    Total    99.9

    [Figure: Pareto Chart; x-axis: Group, y-axis: Percentage]

  • Pie Charts
    Pie chart: a circle with one slice for each category. The size of a slice corresponds to the percentage of observations in the category.

  • Constructing Pareto Chart Using Excel

  • Scatterplots
    A scatterplot is a plot of paired (x, y) data with a horizontal x-axis and a vertical y-axis.

  • Time Series
    A time series is a data set collected over time.

  • 2.4 Measures of Center
    A measure of center is a value at the center or middle of a data set.
    Measures of center:
    Mean: the average value of the data points.
    Median: the middle value when a data set is arranged in order of magnitude.

  • Examples
    (1) Find the mean and median of the data: 12, 10, 4, 5, 1

    (2) Find the mean and median of the data: 12, 10, 4, 5, 1, 1000
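Both examples can be checked in R with the built-in `mean` and `median` functions. The sketch below only uses the data given on the slide; the names `x1` and `x2` are illustrative.

```r
x1 <- c(12, 10, 4, 5, 1)
x2 <- c(x1, 1000)            # same data plus one extreme value

c(mean = mean(x1), median = median(x1))   # mean 6.4, median 5
c(mean = mean(x2), median = median(x2))   # mean 172, median 7.5
```

Adding the single value 1000 moves the mean from 6.4 to 172, while the median only moves from 5 to 7.5.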

  • Mode
    The mode of a data set is the value that occurs most frequently.
    Examples:
    The data 2, 2, 3, 1, 5 have a mode of 2.
    The data 3, 1, 3, 1, 5, 0, 6 have two modes, 1 and 3.
    The data 2, 4, 6, 7, 0 have no mode (no value is repeated).

  • Rounding-off Rule
    To get more accurate results, carry as many decimal places as possible.

  • Weighted Means
    To calculate a final score, exam 1 accounts for 25%, exam 2 accounts for 35%, and the final exam accounts for 40%. Suppose a student scores 60 points on exam 1, 80 points on exam 2, and 90 points on the final exam. The final score is the weighted mean
    (60)(.25) + (80)(.35) + (90)(.40) = 79
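The same calculation can be done with R's `weighted.mean`. A minimal sketch, with illustrative names for the score and weight vectors:

```r
scores  <- c(exam1 = 60, exam2 = 80, final = 90)
weights <- c(0.25, 0.35, 0.40)      # the weights sum to 1

final_score <- weighted.mean(scores, weights)
final_score
```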

  • Mean or Median
    Means are sensitive to outliers, while medians are resistant.
    The mean is generally a good measure of center, but use the median when the data contain outliers.

  • Mean, Median, and Mode
    The distribution of data is symmetric.
    The distribution is skewed to the left.
    The distribution is skewed to the right.
    The pictures for these three cases are smoothed histograms.

  • 2.5 Measures of Variation
    Range of data: maximum − minimum
    Standard deviation: a measure of variation about the mean
    Variance: the square of the standard deviation
    All of these measure how concentrated (or spread out) the data are.

  • Calculation of Variances
    The variance of a set of observations is an average of the squared deviations from the mean. For a sample of n observations with mean $\bar{x}$,
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$

    The standard deviation s is the square root of the variance:
    $s = \sqrt{s^2}$

  • Example (Calculating the standard deviation s)
    Metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours.
    1792 1666 1362 1614 1460 1867 1439
    Find the mean first:
    $\bar{x} = (1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439)/7 = 11200/7 = 1600$

  • Example (Cont'd)

    Observations   Deviations   Squared deviations
    1792           192          36864
    1666           66           4356
    1362           -238         56644
    1614           14           196
    1460           -140         19600
    1867           267          71289
    1439           -161         25921
                   sum = 0      sum = 214870

    The variance: $s^2 = 214870/(7-1) \approx 35811.67$
    The standard deviation: $s = \sqrt{35811.67} \approx 189.24$
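The table can be reproduced step by step in R and compared with the built-in `var` and `sd`, which also divide by n − 1. This is a sketch with illustrative variable names.

```r
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

n    <- length(rates)
xbar <- mean(rates)           # 1600
dev  <- rates - xbar          # deviations from the mean; they sum to 0
ss   <- sum(dev^2)            # sum of squared deviations: 214870
s2   <- ss / (n - 1)          # sample variance
s    <- sqrt(s2)              # sample standard deviation

c(mean = xbar, variance = s2, sd = s)
```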

  • Variance and Standard Deviation of a Population
    For a population of N values with mean $\mu$:
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$, $\sigma = \sqrt{\sigma^2}$

  • Coefficient of Variation (CV)
    For a population, $CV = \frac{\sigma}{\mu} \cdot 100\%$

    For a sample, $CV = \frac{s}{\bar{x}} \cdot 100\%$

  • Example: Find the CV for the data: 2, 4, 1, 6, 7, 0, 3, 2.

    Mean = (2+4+1+6+7+0+3+2)/8 = 3.125
    Standard deviation = 2.416
    CV = 2.416/3.125 = 0.773 = 77.3%
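A short R check of this example, treating the data as a sample (so `sd` divides by n − 1, which is what gives 2.416):

```r
x <- c(2, 4, 1, 6, 7, 0, 3, 2)

cv <- sd(x) / mean(x)     # sample standard deviation over sample mean
round(100 * cv, 1)        # CV as a percent
```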

  • Using Standard Deviation As a Ruler
    If a value is within 2 standard deviations of the mean, the value is said to be usual.
    Then (this is called the range rule of thumb):
    The minimum usual value is mean − 2(standard deviation).
    The maximum usual value is mean + 2(standard deviation).

  • Example (Head Circumferences of Girls)
    Past results from the National Health Survey suggest that the head circumferences of two-month-old girls have a mean of 40.05 cm and a standard deviation of 1.64 cm.
    (1) Use the range rule of thumb to find the minimum and maximum usual head circumferences.
    (2) Determine whether a circumference of 42.6 cm would be considered unusual.
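One way to work this example in R is sketched below. The helper `is_unusual` is a hypothetical name introduced here for illustration; it simply applies the range rule of thumb.

```r
mean_hc <- 40.05    # mean head circumference (cm)
sd_hc   <- 1.64     # standard deviation (cm)

usual_min <- mean_hc - 2 * sd_hc    # 36.77 cm
usual_max <- mean_hc + 2 * sd_hc    # 43.33 cm

# a value is unusual if it falls outside [usual_min, usual_max]
is_unusual <- function(x) x < usual_min | x > usual_max
is_unusual(42.6)
```

Since 42.6 cm lies between 36.77 cm and 43.33 cm, it would not be considered unusual under this rule.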

  • Chebyshev's Theorem
    For any k > 1, at least 100(1 − 1/k^2)% of all values are within k standard deviations of the mean. This is true for any data set. For example, with k = 2, at least 75% of the values lie within 2 standard deviations of the mean.

  • 68-95-99.7 Rule for Data with a Bell-Shaped Distribution
    This rule is also called the empirical rule.
    It states that, for data sets having a distribution that is approximately bell-shaped:
    About 68% of all values fall within 1 standard deviation of the mean.
    About 95% of all values fall within 2 standard deviations of the mean.
    About 99.7% of all values fall within 3 standard deviations of the mean.

  • Example (Heights of Women)
    Heights of women have a bell-shaped distribution with mean 163 cm and standard deviation 6 cm. Then
    (1) About 68% of women have heights between 163 − 1(6) = 157 cm and 163 + 1(6) = 169 cm.
    (2) About 95% of women have heights between 163 − 2(6) = 151 cm and 163 + 2(6) = 175 cm.
    (3) About 99.7% of women have heights between 163 − 3(6) = 145 cm and 163 + 3(6) = 181 cm.

  • 2.6 Measures of Relative Standing
    A standard score, or z-score, is the number of standard deviations that a given value x is above or below the mean.
    For a given value x, its z-score is
    $z = \frac{x - \bar{x}}{s}$ (sample) or $z = \frac{x - \mu}{\sigma}$ (population)

  • Z-scores Can Be Used to Compare Values
    Example (Heights of Women)
    Heights of women have a bell-shaped distribution with mean 163 cm and standard deviation 6 cm. Then
    (1) A woman of 149 cm has a z-score of (149 − 163)/6 = −2.33
    (2) A woman of 169 cm has a z-score of (169 − 163)/6 = 1
    (3) A woman of 178 cm has a z-score of (178 − 163)/6 = 2.5

  • Z-Scores and Unusual Values
    Usual values have z-scores between −2 and 2, inclusive.
    Unusual values have z-scores greater than 2 or less than −2.
    If a value has a negative z-score, the value must be less than the mean. Similarly, if a value has a positive z-score, the value must be greater than the mean.

  • Example (Heights of Women)
    Heights of women have a bell-shaped distribution with mean 163 cm and standard deviation 6 cm. The height of a woman is 179 cm. Is she unusually tall?

    Solution:

    The z-score of 179 cm is (179 − 163)/6 = 2.67, which is greater than 2, so she is unusually tall relative to other women.

  • Example (Comparing Test Scores)
    Which is relatively better: a score of 85 on a biology test or a score of 45 on an economics test? Scores on the biology test have a mean of 90 and a standard deviation of 10. Scores on the economics test have a mean of 55 and a standard deviation of 5.
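A possible way to answer this in R is to convert both scores to z-scores and compare them. The helper name `z_score` is an assumption made for this sketch.

```r
# z-score: number of standard deviations a value lies from the mean
z_score <- function(x, mean, sd) (x - mean) / sd

z_bio  <- z_score(85, mean = 90, sd = 10)   # -0.5
z_econ <- z_score(45, mean = 55, sd = 5)    # -2

c(biology = z_bio, economics = z_econ)
```

Both scores are below their class means, but the biology score is only half a standard deviation below, so it is the relatively better score.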

  • Example (Conversion Between Scores)
    Many colleges and universities accept SAT or ACT scores for admission. Suppose that to have an SAT score in the top 25%, one needs to score at least 1750. If one took the ACT instead, how high would one need to score to be equivalent to those top 25% SAT scorers?
    For SAT scores, the mean is 1520 and the standard deviation is 250.
    For ACT scores, the mean is 20.8 and the standard deviation is 4.8.
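One reasonable reading of "equivalent" is "same z-score". Under that assumption, the conversion can be sketched in R as:

```r
# z-score of the SAT cutoff
z_sat <- (1750 - 1520) / 250        # 0.92

# ACT score with the same z-score (assumes equal z-scores mean equivalent scores)
act_equiv <- 20.8 + z_sat * 4.8     # 25.216
act_equiv
```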

  • Quantiles: Quartiles and Percentiles
    Example (SAT Scores)
    The SAT has three 800-point sections (math, critical reading, and writing). In addition to their score, students receive a number that is the percent of other test takers with lower scores. We are interested in two questions:
    What is the lowest score one should get to be among the top p percent? Say p = 5.
    If a student's score is x, what percent of test takers score less than or equal to x?

  • Percentiles
    The first question is related to percentiles.
    A percentile is the value below which a certain percent of observations fall. For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found.
    In general, the kth percentile of a sample or population is the value (or score) below which k percent of the observations may be found.

  • Quartiles
    If k = 25, the percentile is called the first quartile, denoted Q1.
    If k = 50, the percentile is called the second quartile, denoted Q2.
    If k = 75, the percentile is called the third quartile, denoted Q3.
    Note that Q2 is just the median. The middle 50% of values are between Q1 and Q3.
    Interquartile range: IQR = Q3 − Q1.
    Percentiles and quartiles are examples of quantiles.

  • Finding the kth Percentile
    Let
    n = number of data values
    L = locator that gives the position of a value (for example, the 13th value in the data sorted from smallest to largest has L = 13)
    Pk = kth percentile
    Then L = (k/100)*n.
    If L is a whole number, Pk = the average of the Lth value and the (L+1)th value in the sorted data.
    Otherwise, round L up to the next whole number, say M, and Pk = the Mth value in the sorted data.

  • Example (Cotinine Levels of 40 Smokers)
    0, 1, 1, 3, 17, 32, 35, 44, 48, 86, 87, 103, 112, 121, 123, 130, 131, 149, 164, 167, 173, 173, 198, 208, 210, 222, 227, 234, 245, 250, 253, 265, 266, 277, 284, 289, 290, 313, 477, 491
    Find P20, P25, P75, and the IQR.
    Solution: To find P20, we know n = 40 and k = 20. Then L = (k/100)*n = (20/100)*40 = 8.
    The 20th percentile P20 is then the average of 44 (the 8th value) and 48 (the 9th value), or 46.
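The locator rule from the previous slide can be written as a small R function and applied to the remaining percentiles. This is a sketch of that rule only; `percentile_locator` is a hypothetical name, and the function assumes 0 < k < 100.

```r
cotinine <- c(0, 1, 1, 3, 17, 32, 35, 44, 48, 86, 87, 103, 112, 121, 123,
              130, 131, 149, 164, 167, 173, 173, 198, 208, 210, 222, 227,
              234, 245, 250, 253, 265, 266, 277, 284, 289, 290, 313, 477, 491)

# kth percentile using the locator rule L = (k/100) * n  (assumes 0 < k < 100)
percentile_locator <- function(x, k) {
  x <- sort(x)
  L <- k / 100 * length(x)
  if (isTRUE(all.equal(L, round(L)))) {
    L <- round(L)
    (x[L] + x[L + 1]) / 2      # whole number: average the Lth and (L+1)th values
  } else {
    x[ceiling(L)]              # otherwise: round L up and take that value
  }
}

p20 <- percentile_locator(cotinine, 20)   # 46
p25 <- percentile_locator(cotinine, 25)   # 86.5
p75 <- percentile_locator(cotinine, 75)   # 251.5
iqr <- p75 - p25                          # 165
```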

  • Different Software Packages May Give Different Quantiles
    To find the 25th percentile of the data 1, 3, 6, 10, 15, 21, 28, 36:
    Using the formula, we get 4.5.
    SAS gives 4.5, too.
    Excel uses percentile(array, k) and gives 5.25.
    R gives the same result, 5.25: x = c(1, 3, 6, 10, 15, 21, 28, 36); quantile(x, 0.25)
    The secret: both Excel and R use linear interpolation, while SAS takes an average.
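R's `quantile` function has a `type` argument that selects among several definitions, so both answers can be reproduced in R. In the sketch below, `type = 2` (averaging at discontinuities) is used to reproduce 4.5; that it corresponds to the SAS setting used on the slide is an assumption, not something the slide states.

```r
x <- c(1, 3, 6, 10, 15, 21, 28, 36)

q_interp <- quantile(x, 0.25)             # default (type = 7): linear interpolation
q_avg    <- quantile(x, 0.25, type = 2)   # averaging at discontinuities

c(interpolation = unname(q_interp), averaging = unname(q_avg))
```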

  • 2.7 Exploratory Data Analysis (EDA)
    EDA is the process of using statistical tools, numerical or graphical, to investigate data sets in order to understand their important characteristics, such as the center, variation, distribution, and outliers.

  • The 5-Number Summary and Boxplots
    For a data set, the 5-number summary consists of the minimum value, Q1, Q2, Q3, and the maximum value.
    A (regular) boxplot is a graphical display of the 5-number summary.

  • Procedure for Constructing a Regular Boxplot
    (1) Find the 5-number summary.
    (2) Construct a scale with values that include the minimum and maximum data values.
    (3) Construct a box extending from Q1 to Q3, and draw a line in the box at the median.
    (4) Draw lines extending outward from the box to the minimum and maximum data values.

  • Procedure for Constructing a Modified Boxplot
    (1) Find the 5-number summary.
    (2) Construct a scale with values that include the minimum and maximum data values.
    (3) Construct a box extending from Q1 to Q3, and draw a line in the box at the median.
    (4) Draw lines extending outward from the box to the minimum and maximum data values within the fences formed by Q1 − 1.5*IQR and Q3 + 1.5*IQR.
    (5) Any data values outside the fences are treated as potential outliers and marked in the boxplot.
    Note: Most software gives modified boxplots.
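The fences in step (4) can be computed in R as sketched below, using the cotinine data from the percentile example. The sketch uses `fivenum`, which returns Tukey's hinges; in general these can differ slightly from quartiles found with the locator rule, although for these 40 values they coincide.

```r
cotinine <- c(0, 1, 1, 3, 17, 32, 35, 44, 48, 86, 87, 103, 112, 121, 123,
              130, 131, 149, 164, 167, 173, 173, 198, 208, 210, 222, 227,
              234, 245, 250, 253, 265, 266, 277, 284, 289, 290, 313, 477, 491)

five <- fivenum(cotinine)       # min, lower hinge, median, upper hinge, max
q1   <- five[2]
q3   <- five[4]
iqr  <- q3 - q1

fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)

# values outside the fences are flagged as potential outliers
outliers <- cotinine[cotinine < fences["lower"] | cotinine > fences["upper"]]
```

Here the fences are −161 and 499, so no cotinine level is flagged, even though 477 and 491 stand well apart from the rest.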

  • Interpretation: The following observations can be made.
    Means:
    Variation:
    Distributions:
    Outliers:

  • Biometrics STAT 319-002
    Chapter 3: Probability

  • False Positives and False Negatives
    Example: In clinical trials of a blood test for pregnancy, 99 women are randomly selected from a population of women who seek medical help in determining whether they are pregnant.

  • 80 subjects are true positives; 11 subjects are true negatives; 3 subjects are false positives; 5 subjects are false negatives.

    Pregnancy Test Results
                              Positive   Negative
    Subject is pregnant       80         5
    Subject is not pregnant   3          11

  • 3-1 Overview
    Rare Event Rule for inferential statistics: If, under a given assumption, the probability of a particular observed event is extremely small, we conclude that the assumption is probably not correct.

  • 3-2 Fundamentals
    An event is any collection of results or outcomes of a procedure.
    A simple event is an event that cannot be further broken down into simpler components.
    The sample space for a procedure consists of all possible simple events.

  • Examples of events and sample spaces
    In the procedure of rolling a die, the possible simple events are rolling a 1, rolling a 2, ..., rolling a 6. The sample space (S) is the collection of 1, 2, ..., 6, or S = {1, 2, ..., 6}. Is rolling an even number a simple event?
    In the procedure of selecting a ball from a bag that contains 5 balls, 2 red and 3 blue, the possible simple events are selecting a red ball and selecting a blue ball. The sample space (S) is S = {red, blue}.

  • Notation
    Events are denoted by upper case letters, such as A, B, C, and so on.
    P(A) denotes the probability that the event A happens.

  • Three Approaches to Defining a Probability
    Relative frequency approach: If the event A occurs k times among n trials, then the probability that A occurs, P(A), can be estimated as P(A) ≈ k/n. This approach is based on the Law of Large Numbers.
    Classical approach: Assume that a given procedure has n different simple events, each of which has an equal chance of occurring. If event A can occur in s of these n ways, then P(A) = s/n.
    Subjective probability: The probability of an event A, P(A), is estimated by an educated guess.

  • Law of Large Numbers (LLN)
    The relative frequency approach is based on the following theorem:

    Law of Large Numbers: As a procedure is repeated again and again, the relative frequency probability of an event tends to approach the actual probability.

  • Simulating LLN
    LLN can be simulated using Excel or R.
    Using R: copy the following R code and paste it into the R GUI.
    # simulate 1000 rolls of a fair die
    x = sample(1:6, 1000, replace = TRUE)
    # relative frequency of rolling a 1; it should be close to 1/6
    freq1 = sum(x == 1)/1000
    freq1

  • Examples of Probabilities: Indicate Which Approach You Are Using to Find Each Probability
    (1) A fair coin is tossed 1000 times, with 489 tosses landing heads. Estimate the probability of tossing heads.
    (2) Randomly select a card from a deck of 52 cards. What is the probability of selecting (a) a diamond, (b) a face card, (c) an ace, (d) a card that is not a club?
    (3) What is the probability that it will rain tomorrow?

  • Summary: Finding Probabilities Using the Classical Approach
    Step 1: Write the sample space to find the number of simple events, n.
    Step 2: Express the event for which you wish to find a probability in terms of simple events. Find the number of simple events in this event, k.
    Step 3: The probability is k/n.

  • More Examples of Finding Probabilities
    A couple plans to have 3 children. Find the probability of each event.
    (1) Among 3 children, there is exactly one girl.
    (2) Among 3 children, there are exactly two girls.
    (3) Among 3 children, all are girls.
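The classical approach can be illustrated in R by listing the sample space and counting. This sketch assumes that boys and girls are equally likely and that births are independent, which the slide does not state explicitly; the names `kids`, `n_girls`, and `p_girls` are illustrative.

```r
# Sample space: all 2 x 2 x 2 = 8 equally likely birth orders (B = boy, G = girl)
kids <- expand.grid(first = c("B", "G"), second = c("B", "G"),
                    third = c("B", "G"), stringsAsFactors = FALSE)

n_girls <- rowSums(kids == "G")      # number of girls in each simple event

# probability of 0, 1, 2, 3 girls = (number of simple events) / 8
p_girls <- table(factor(n_girls, levels = 0:3)) / nrow(kids)
p_girls
```

Under these assumptions the probabilities of exactly one, exactly two, and three girls are 3/8, 3/8, and 1/8.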

  • Basic Properties of Probability
    The probability of an impossible event is 0.
    The probability of an event that is certain to occur is 1.
    For any event A, 0 ≤ P(A) ≤ 1.

  • Complementary Events

  • Example
    Randomly select a card from a deck of 52 cards. What is the probability of selecting a card that is not an ace?

  • 3-3 Addition Rule
    A compound event is an event that combines two or more simple events.
    When tossing a die, the event "tossing an even number" is an example of a compound event.
    When finding the probability of a compound event, we need the addition rule, stated below.
    Addition Rule: Suppose that a compound event A can be expressed as B or C, that is, A = B or C. Then P(A) = P(B or C) = P(B) + P(C) − P(B and C).

  • P(B or C) = P(B) + P(C) − P(B and C)

    can be written as

    P(B ∪ C) = P(B) + P(C) − P(B ∩ C)

  • Example
    The following table summarizes blood groups and Rh types for 100 typical people.

                  Blood Group
    Rh Type       O    A    B    AB
    Positive      39   35   8    4
    Negative      6    5    2    1

    (1) If one person is randomly selected, find the probability of getting someone who is type Rh−.
    (2) If one person is randomly selected, find the probability of getting someone who is group B.
    (3) If one person is randomly selected, find the probability of getting someone who is group B or type Rh−.
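The three probabilities can be worked out from the table in R as sketched below; part (3) uses the addition rule. The matrix layout and object names are choices made for this illustration.

```r
# Counts from the table: rows are Rh types, columns are blood groups
blood <- matrix(c(39, 35, 8, 4,
                   6,  5, 2, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(Rh = c("Positive", "Negative"),
                                Group = c("O", "A", "B", "AB")))
n <- sum(blood)                                   # 100 people

p_neg       <- sum(blood["Negative", ]) / n       # (1) P(Rh-)
p_B         <- sum(blood[, "B"]) / n              # (2) P(B)
p_B_and_neg <- blood["Negative", "B"] / n         #     P(B and Rh-)

# (3) addition rule: P(B or Rh-) = P(B) + P(Rh-) - P(B and Rh-)
p_B_or_neg <- p_B + p_neg - p_B_and_neg
c(p_neg = p_neg, p_B = p_B, p_B_or_neg = p_B_or_neg)
```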

  • Special Case of the Addition Rule
    If two events B and C are disjoint (or mutually exclusive), meaning that they cannot occur simultaneously, then P(B ∪ C) = P(B) + P(C).

  • Examples of Disjoint Events
    Toss a die. Let A = "tossing a 1" and B = "tossing an even number". A and B are disjoint events.
    Select a card from a deck of 52. Let A = "a Jack", B = "a 4", and C = "not a club". A and B are disjoint, but neither A nor B is disjoint from C. Find (1) P(A or B), (2) P(A or C), (3) P(B or C).

  • 3-4 Multiplication Rule: Basics
    Notation
    P(A and B) = P(event A and event B both occur); P(A and B) can be written as P(A ∩ B).
    P(B | A) = P(event B occurs, given that event A has already occurred).
    Multiplication rule: P(A ∩ B) = P(A)P(B | A) = P(B)P(A | B).

  • Examples Using the Multiplication Rule
    Two balls are randomly selected from a bag of 10 balls, 4 red and 6 blue. If the balls are selected without replacement, find
    (1) the probability that the first ball selected is red and the second is blue.
    (2) the probability that the two balls selected are both blue.
    (3) the probability that the two balls selected are of different colors.

  • Independent Events
    Two events A and B are independent if the occurrence of one does not affect the probability of the occurrence of the other.
    If two events A and B are not independent, they are said to be dependent.
    How can we generalize the definition of independence to 3 or more events?

  • Multiplication Rule for Independent Events
    If A1, A2, ..., An are independent, then P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)P(A2)...P(An).
    In particular, if events A1, A2, and A3 are independent, then P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3).

  • Example
    Two balls are randomly selected from a bag of 10 balls, 4 red and 6 blue. If the balls are selected with replacement, find
    (1) the probability that the first ball selected is red and the second is blue.
    (2) the probability that the two balls selected are both blue.
    (3) the probability that the two balls selected are of different colors.
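The two ball-drawing examples (without and with replacement) differ only in what is left in the bag for the second draw. A sketch in R, where `two_draws` is a hypothetical helper written for this illustration:

```r
# Probabilities for two draws from a bag of red and blue balls
two_draws <- function(red, blue, replace) {
  n     <- red + blue
  n2    <- if (replace) n else n - 1          # balls available on the second draw
  blue2 <- if (replace) blue else blue - 1    # blue balls left after a blue is drawn

  p_rb <- (red / n) * (blue / n2)             # red first, then blue
  p_br <- (blue / n) * (red / n2)             # blue first, then red
  p_bb <- (blue / n) * (blue2 / n2)           # both blue

  c(red_then_blue = p_rb, both_blue = p_bb, different = p_rb + p_br)
}

without_repl <- two_draws(4, 6, replace = FALSE)   # 4/15, 1/3, 8/15
with_repl    <- two_draws(4, 6, replace = TRUE)    # 0.24, 0.36, 0.48
```

With replacement the two draws are independent, so each probability is a simple product of the single-draw probabilities.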

  • Toss a fair coin 10 times. Find the probability of tossing 10 heads.

  • 3-5 Multiplication Rule: Beyond the Basics
    The probability of "at least one": P(at least one) = 1 − P(none)
    Conditional probability: P(A | B) = P(A ∩ B)/P(B)
    Bayes' Theorem

  • Examples
    (1) Describe the complement of each event:
    (a) When 50 electrocardiograph units are shipped, all of them are free of defects.
    (b) When five different blood samples are obtained from donors, at least one of them has type O blood.
    (2) A couple plans to have 3 children. What is the probability of having at least one girl? Tip: Choose an appropriate sample space (use a tree diagram).

  • Example (Bayes' Theorem)
    The New York State Health Department reports a 10% rate of the HIV virus for the at-risk population. Under certain conditions, a preliminary screening test for the HIV virus is correct 95% of the time, both for HIV-positive and HIV-negative people. (Subjects are not told that they are HIV infected until additional tests verify the results.) One person is randomly selected from the at-risk population.
    a. What is the probability that the selected person has the HIV virus if it is known that this person has tested positive in the initial screening?
    b. What is the probability that the selected person tests positive in the initial screening if it is known that this person has the HIV virus?
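Part (a) is an application of Bayes' Theorem, and part (b) can be read directly from the problem statement. A sketch in R, where the names `prior`, `sens`, and `spec` are introduced here for the 10% rate and the two 95% accuracy figures:

```r
prior <- 0.10    # P(HIV) in the at-risk population
sens  <- 0.95    # P(test positive | HIV)
spec  <- 0.95    # P(test negative | no HIV)

# total probability of a positive test
p_pos <- prior * sens + (1 - prior) * (1 - spec)    # 0.14

# (a) Bayes' Theorem: P(HIV | positive)
p_hiv_given_pos <- prior * sens / p_pos             # about 0.679

# (b) P(positive | HIV) is given directly
p_pos_given_hiv <- sens                             # 0.95
```

The two conditional probabilities are quite different: a positive screening result corresponds to only about a 68% chance of infection, even though the test is correct 95% of the time.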

  • 3-6 Risks and Odds
    In a medical experiment, 401,974 children were injected with either the Salk vaccine or a placebo. After a period of follow-up, 33 children in the vaccine group and 115 children in the placebo group had developed paralytic polio.

                    Polio   No Polio   Total
    Salk Vaccine    33      200,712    200,745
    Placebo         115     201,114    201,229

    Questions: (1) P(Polio | Salk Vaccine) = ? (2) P(Polio | Placebo) = ?
    The difference and the ratio of the two probabilities (or incidence rates) are of more interest. They are called the absolute risk reduction and the relative risk, respectively.

  • Measures for Comparing Two Incidence Rates
    Consider the general follow-up study (prospective study):

                               Disease   Non-disease
    Treatment (or exposed)     a         b
    Control (not exposed)      c         d

    Let pt = P(Disease | Treatment) and pc = P(Disease | Control). Define
    Absolute Risk Reduction = |pt − pc| = |a/(a+b) − c/(c+d)|
    Relative Risk = pt / pc = [a/(a+b)] / [c/(c+d)]

  • Example
    Find
    (1) pt = P(Polio | Salk Vaccine) and pc = P(Polio | Placebo).
    (2) Absolute Risk Reduction and Relative Risk.

                    Polio   No Polio   Total
    Salk Vaccine    33      200,712    200,745
    Placebo         115     201,114    201,229
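A sketch of this calculation in R, using the counts from the table (the vector names are illustrative):

```r
polio <- c(salk = 33, placebo = 115)            # children who developed polio
total <- c(salk = 200745, placebo = 201229)     # children in each group

p <- polio / total                 # incidence rates pt (salk) and pc (placebo)

arr <- abs(p[["salk"]] - p[["placebo"]])   # absolute risk reduction
rr  <- p[["salk"]] / p[["placebo"]]        # relative risk

signif(c(pt = p[["salk"]], pc = p[["placebo"]], ARR = arr, RR = rr), 3)
```

The relative risk of roughly 0.29 says that the polio rate in the vaccine group was a little under a third of the rate in the placebo group.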

  • Odds Against or Odds in Favor of an Event
    Suppose that A is an event.
    The (actual) odds against event A are defined as P(A complement)/P(A).

    The (actual) odds in favor of event A are defined as P(A)/P(A complement).

    Odds are often expressed as the ratio of two integers.

  • Example
    (1) For those children treated with the Salk vaccine, find the odds in favor of developing polio.
    (2) For those children treated with the placebo, find the odds in favor of developing polio.
    (3) Find the ratio of the two odds, called the odds ratio.
    Solution: Let D = polio. We need to calculate P(D)/P(D complement) for each group.

                    Polio   No Polio   Total
    Salk Vaccine    33      200,712    200,745
    Placebo         115     201,114    201,229

  • Odds Ratio

    In a prospective study, the odds ratio (OR) is a measure of risk: the ratio of the odds for the treatment (or exposure) group to the odds for the control (or non-exposure) group.

                               Disease    Non-disease
    Treatment (or exposed)        a            b
    Control (not exposed)         c            d

    The odds ratio is (a/b)/(c/d) = (ad)/(bc). An odds ratio of 1 indicates no difference in risk between the two groups.
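
    As a check on the definition, the odds for each group and their ratio can be computed with exact fractions. This is a small sketch using the Salk vaccine counts from the earlier slides; the helper names `odds_in_favor` and `odds_ratio` are assumptions made for illustration.

```python
from fractions import Fraction


def odds_in_favor(p):
    """Odds in favor of an event with probability p: P(A) / P(A complement)."""
    return p / (1 - p)


def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for a 2x2 table with rows (a, b) and (c, d)."""
    return Fraction(a * d, b * c)


# Salk vaccine data: each row is (polio, no polio)
odds_vaccine = odds_in_favor(Fraction(33, 200_745))    # reduces to 33/200712
odds_placebo = odds_in_favor(Fraction(115, 201_229))   # reduces to 115/201114
ratio = odds_ratio(33, 200_712, 115, 201_114)

print(float(odds_vaccine), float(odds_placebo), float(ratio))
```

    Dividing the two odds and using the cross-product (ad)/(bc) give the same number, which is why the shortcut formula works.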

  • Odds Ratio Can be Obtained Retrospectively

    The odds ratio is defined prospectively, but it can also be obtained from a retrospective study. Specifically, suppose that a retrospective study is described by the following table, in which the total number of disease cases and the total number of non-disease cases are both fixed.

                                  Disease    Non-disease
    Smoker (Exposure)                140          532
    Non-smoker (Non-exposure)         21         1707
    Total                            161         2239

    P(Smoker | Disease) ≈ 140/161 = 87%
    P(Smoker | Non-Disease) ≈ 532/2239 = 23.8%

    The relative risk is NOT estimable from a retrospective study, but the odds ratio is. For disease, OR ≈ (140 × 1707)/(21 × 532) = 21.4.
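
    The arithmetic on this slide can be reproduced in a few lines. The sketch below computes the odds of smoking among cases and among controls, and compares their ratio with the cross-product (ad)/(bc); the variable names are illustrative only.

```python
# Counts from the retrospective smoking table
smoker_disease, smoker_healthy = 140, 532
nonsmoker_disease, nonsmoker_healthy = 21, 1707

# Column-conditional proportions, which a retrospective design can estimate
p_smoker_given_disease = smoker_disease / (smoker_disease + nonsmoker_disease)
p_smoker_given_healthy = smoker_healthy / (smoker_healthy + nonsmoker_healthy)

# Odds of smoking among cases and among controls, and their ratio
odds_cases = smoker_disease / nonsmoker_disease
odds_controls = smoker_healthy / nonsmoker_healthy
or_retro = odds_cases / odds_controls

# The same quantity written as the cross-product (ad)/(bc)
or_cross = (smoker_disease * nonsmoker_healthy) / (nonsmoker_disease * smoker_healthy)

print(round(p_smoker_given_disease, 3), round(p_smoker_given_healthy, 3))
print(round(or_retro, 1), round(or_cross, 1))
```

    The two ways of forming the odds ratio agree, which illustrates why it can be estimated even when the column totals are fixed by design.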

  • 3-8 Counting

    Counting rule: For a sequence of K events in which the first event can occur n1 ways, the second event can occur n2 ways, ..., and the Kth event can occur nK ways, the events together can occur a total of (n1)(n2)...(nK) ways.

  • Examples

    (1) DNA is made of nucleotides, each of which can contain any one of these nitrogenous bases: A (adenine), G (guanine), C (cytosine), T (thymine). If one of those four bases (A, G, C, T) must be selected three times to form a linear triplet (called a codon), how many different triplets are possible?
    (2) If a password must contain 6 characters, each a digit (0-9) or a letter (A-Z, a-z), how many passwords are possible? If passwords must start with a letter, how many such passwords are possible?
    (3) How many different ways are possible to arrange n different items in order?
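
    The counting rule turns each of these questions into a product. The sketch below is one way to work them in Python; it assumes the password question means 6 positions, each filled by one of 62 characters (10 digits plus 52 upper- and lower-case letters).

```python
import math

codons = 4 ** 3               # 4 bases chosen 3 times, order matters
passwords = 62 ** 6           # 62 choices in each of 6 positions
letter_first = 52 * 62 ** 5   # first position restricted to the 52 letters


def arrangements(n):
    """Number of ways to order n different items: n * (n-1) * ... * 1 = n!."""
    return math.factorial(n)


print(codons, passwords, letter_first, arrangements(5))
```

    The last question has the general answer n!, since there are n choices for the first position, n − 1 for the second, and so on.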

  • The Permutation Rule

    The number of permutations (or sequences) of r items selected from n available items without replacement is

    $_nP_r = \frac{n!}{(n-r)!}$

    If there are n items with n1 alike, n2 alike, ..., nk alike, the number of permutations of all n items is

    $\frac{n!}{n_1! \, n_2! \cdots n_k!}$

  • Examples

    (1) When testing a new drug, phase I involves only 20 volunteers, and the objective is to assess the drug's safety. To be cautious, you plan to treat the 20 subjects in sequence, so that any particularly adverse effect can stop the treatments before any other subjects are treated. If 30 volunteers are available, how many different sequences of 20 subjects are possible?
    (2) How many numbers can be formed using three 1s and four 2s?
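
    Both questions are permutation counts, the first without repeated items and the second with items that are alike. A minimal sketch follows; `math.perm` requires Python 3.8 or later, and the helper `arrangements_with_repeats` is a hypothetical name introduced here.

```python
import math

# Ordered sequences of 20 subjects chosen from 30 volunteers: 30!/(30-20)!
sequences = math.perm(30, 20)


def arrangements_with_repeats(*counts):
    """Permutations of n items where counts[i] items are alike: n!/(n1! n2! ... nk!)."""
    total = math.factorial(sum(counts))
    for c in counts:
        total //= math.factorial(c)   # exact: each partial quotient is an integer
    return total


numbers = arrangements_with_repeats(3, 4)   # three 1s and four 2s
print(sequences, numbers)
```

    The second count equals 7!/(3! 4!), because swapping two identical digits does not produce a new number.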

  • Combinations Rule

    The number of combinations of r items selected from n different items is

    $_nC_r = \frac{n!}{(n-r)! \, r!}$

  • Examples

    When testing a new drug on humans, a clinical trial is normally done in three phases. Phase I is conducted with a relatively small number of healthy volunteers. Let's assume that we want to treat 20 healthy humans with a new drug, and we have 30 suitable volunteers available. If 20 subjects are selected from the 30 that are available, and the 20 selected subjects are all treated at the same time, how many treatment groups are possible?
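
    Because the 20 subjects are treated at the same time, order does not matter and the count is a combination. The sketch below evaluates it two ways, with `math.comb` (Python 3.8 or later) and directly from factorials.

```python
import math

groups = math.comb(30, 20)   # unordered groups of 20 chosen from 30
by_formula = math.factorial(30) // (math.factorial(10) * math.factorial(20))

print(groups, groups == by_formula)
```

    Choosing the 20 subjects to treat is the same as choosing the 10 to leave out, so `math.comb(30, 20)` and `math.comb(30, 10)` give the same value.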

  • Permutation or Combination?

    When order matters, we have a permutation problem.
    When order does not matter, we have a combination problem.

    How many ways are possible to select 3 letters from the 26 letters a-z? This is a combination problem.
    How many ways are possible to select 3 letters from the 26 letters a-z and arrange them in a sequence? This is a permutation problem.

  • Find Probabilities Using Permutation and Combination

    A bag has 10 balls, 4 red and 6 blue. Randomly select 2 balls without replacement. Find the probability that
    (1) the first selection is red and the second is blue;
    (2) the two balls are of different colors.

    Solution:
    (1) Let's label the 10 balls, say 1-4 (red) and 5-10 (blue). The sample space contains all possible ordered pairs of 2 balls. The number of such pairs is n = ___. The event A = "first red and second blue" contains k = ___ of those pairs. Because all pairs in the sample space are equally likely, the classical probability formula is applicable. The probability of A is k/n = ___. Check your answer by calculating this probability using the multiplication rule of probability.
    (2) Let's label the 10 balls, say 1-4 (red) and 5-10 (blue). Let the sample space be S, which contains all possible combinations of 2 balls. These n = ___ combinations are equally likely. The event B = "two balls have different colors" is a subset of S and contains k = ___ combinations. So, P(B) = k/n = ___.
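
    One way to check answers to this exercise is to list the sample space by brute force. The sketch below labels the balls 0-9 (an arbitrary choice), counts ordered pairs for part (1) and unordered pairs for part (2), and compares part (1) with the multiplication rule.

```python
from fractions import Fraction
from itertools import combinations, permutations

balls = ["R"] * 4 + ["B"] * 6   # labels 0-3 are red, 4-9 are blue

# (1) Ordered selections: the sample space is all ordered pairs of distinct balls
ordered = list(permutations(range(10), 2))
red_then_blue = [(i, j) for i, j in ordered if balls[i] == "R" and balls[j] == "B"]
p_a = Fraction(len(red_then_blue), len(ordered))

# Check with the multiplication rule: P(red first) * P(blue second | red first)
p_a_rule = Fraction(4, 10) * Fraction(6, 9)

# (2) Unordered selections: the sample space is all combinations of 2 balls
unordered = list(combinations(range(10), 2))
different = [(i, j) for i, j in unordered if balls[i] != balls[j]]
p_b = Fraction(len(different), len(unordered))

print(p_a, p_a_rule, p_b)
```

    Enumeration is only practical for small sample spaces, but it makes the roles of n (size of the sample space) and k (size of the event) concrete.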

  • Biometrics STAT 319-002 Chapter 4: Discrete Probability Distributions

  • Homework # 2

    Due on ???

  • Homework: Using Hawkes Learning System

    Take tests for sections 5.1, 5.2, 5.3, and 5.5. Submit your certificates to D2L.

  • 4-1 Overview

    In this chapter and the next, we discuss some probability models. These models are useful for studying random phenomena.

    Probability models are statistical descriptions of random phenomena.
    No model is true or best, but a model can fit the data well (goodness-of-fit test). Model selection is an important issue.
    Probability models can be discrete or continuous.
    Discrete models are the focus of this chapter.

  • 4-2 Random variables

    A random variable is a variable taking different values with certain probabilities.

    Examples:
    (1) Let X denote the number of heads among 10 tosses of a fair coin. X can take on values of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Find P(X = 0) and P(X = 10). How to find P(X = 4)?
    (2) Let Y denote the number of dandelions per square meter. Then Y can be 0, 1, 2, 3, ... How to find P(Y = 0)?
    (3) A bag has 10 balls, 4 red and 6 blue. Randomly select 2 balls. Let X denote the number of red balls selected. Then X can assume values of 0, 1, and 2.
    (4) Randomly select a person from a group of persons. Let Z denote the height (in inches) of this person. Then Z can take any value that is greater than 0.

    The first 3 are examples of discrete random variables. The 4th is an example of a continuous random variable.
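
    For the coin-toss example, the questions can be answered with the counting rules from Chapter 3: each of the 2^10 head/tail sequences is equally likely, and the number of sequences with exactly k heads is a combination. The function name `p_heads` below is an assumption made for this sketch.

```python
from fractions import Fraction
import math


def p_heads(k, n=10):
    """P(X = k) for X = number of heads in n tosses of a fair coin."""
    return Fraction(math.comb(n, k), 2 ** n)


print(p_heads(0), p_heads(10), p_heads(4))
```

    Summing `p_heads(k)` over k = 0, ..., 10 gives 1, as it must for the probabilities of all possible values of X.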

  • Probability Distributions

    The probability distribution of a random variable gives the possible values of the variable, along with the probability of taking each value.

    A probability distribution can be given as a graph, a table, or a formula.

  • Example

    A bag has 10 balls, 4 red and 6 blue. Randomly select 2 balls. Let X denote the number of red balls selected. Then X is a random variable and can assume values of 0, 1, and 2.

    It's easy to verify that P(X = 0) = 1/3, P(X = 1) = 8/15, P(X = 2) = 2/15.

    The distribution of random variable X can be expressed as a graph, a table, or a formula.

    A table:

    X = k         0       1       2
    P(X = k)     1/3     8/15    2/15

    A formula:

    $P(X = k) = \frac{\binom{4}{k}\binom{6}{2-k}}{\binom{10}{2}}, \quad k = 0, 1, 2$
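
    The probabilities quoted in this example can be reproduced by counting combinations: choose k of the 4 red balls and the remaining 2 − k from the 6 blue balls, out of all ways to choose 2 of the 10 balls. The sketch below assumes the balls are drawn without replacement, which is what the stated values imply; `p_red` is a hypothetical helper name.

```python
from fractions import Fraction
import math


def p_red(k, red=4, blue=6, draws=2):
    """P(X = k): k red balls among `draws` balls drawn without replacement."""
    return Fraction(math.comb(red, k) * math.comb(blue, draws - k),
                    math.comb(red + blue, draws))


distribution = {k: p_red(k) for k in range(3)}
for k, p in distribution.items():
    print(k, p)
```

    The three probabilities add up to 1, so the table is a valid probability distribution.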

  • Probability Histogram

    To graph the distribution of a discrete random variable, we use the probability histogram. (Page 161)

  • Requirements for a Probability Distribution

    (1) All P(X = k) are between 0 and 1;
    (2) The sum of all probabilities is 1.

    Which of the following tables describes a probability distribution?

    X = k         0      1      2      3
    P(X = k)     0.2    0.3    0.1    0.4

    X = k         0      1      2      3
    P(X = k)     0.3    0.1    0.3    0.5
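
    The two requirements translate directly into a short check. This is a minimal sketch; the function name and the tolerance used to absorb floating-point rounding are assumptions, not part of the slides.

```python
import math


def is_probability_distribution(probs, tol=1e-9):
    """True if every probability is in [0, 1] and the probabilities sum to 1."""
    in_range = all(0 <= p <= 1 for p in probs)
    return in_range and math.isclose(sum(probs), 1.0, abs_tol=tol)


table_1 = [0.2, 0.3, 0.1, 0.4]
table_2 = [0.3, 0.1, 0.3, 0.5]
print(is_probability_distribution(table_1), is_probability_distribution(table_2))
```

    The first table passes both requirements. The second fails requirement (2) because its probabilities sum to 1.2.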

  • Mean, Variance, and Standard Deviation of a Probability Distribution

    Let μ = Mean, σ² = Variance, and σ = Standard Deviation.

    $\mu = \sum x \cdot P(x)$
    $\sigma^2 = \sum (x - \mu)^2 \cdot P(x)$
    $\sigma = \sqrt{\sigma^2}$

    The mean is also known as the expected value or expectation.

  • Example

    The following is the distribution of the number of boys (denoted X) in a family of 8 kids. Calculate the mean and standard deviation.

    x      P(x)
    0      0.004
    1      0.031
    2      0.109
    3      0.219
    4      0.273
    5      0.219
    6      0.109
    7      0.031
    8      0.004
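
    The mean and standard deviation for this table can be computed as probability-weighted sums, using mean = Σ x·P(x) and variance = Σ (x − mean)²·P(x). A minimal sketch follows. Note that the tabulated probabilities are rounded and sum to 0.999, so the results are approximate.

```python
import math

x_values = range(9)
probs = [0.004, 0.031, 0.109, 0.219, 0.273, 0.219, 0.109, 0.031, 0.004]

# Mean: probability-weighted average of the possible values
mean = sum(x * p for x, p in zip(x_values, probs))

# Variance: probability-weighted average squared distance from the mean
variance = sum((x - mean) ** 2 * p for x, p in zip(x_values, probs))
sd = math.sqrt(variance)

print(round(mean, 2), round(sd, 2))
```

    The results are close to 4 and 1.4, as expected for the number of boys among 8 children when boys and girls are equally likely.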

  • Identifying Unusual Results with the Range Rule of Thumb

    Recall the Range Rule of Thumb: 68-95-99.7

    Any value that is greater than μ + 2σ or less than μ − 2σ is said to be unusual.

    Find any unusual values for the previous example.
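
    Applied to the number-of-boys distribution from the previous example, the rule can be sketched as follows. The probabilities are repeated so that the snippet runs on its own, and the variable names are illustrative.

```python
import math

probs = [0.004, 0.031, 0.109, 0.219, 0.273, 0.219, 0.109, 0.031, 0.004]

mean = sum(x * p for x, p in enumerate(probs))
sd = math.sqrt(sum((x - mean) ** 2 * p for x, p in enumerate(probs)))

# Values outside [mean - 2*sd, mean + 2*sd] are flagged as unusual
low, high = mean - 2 * sd, mean + 2 * sd
unusual = [x for x in range(len(probs)) if x < low or x > high]

print(round(low, 2), round(high, 2), unusual)
```

    With a mean near 4 and a standard deviation near 1.4, the usual range runs from about 1.2 to about 6.8, so the values 0, 1, 7, and 8 fall outside it.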

  • Identifying Unusual Results with Probabilities

    Unusually high: x successes among n trials is an unusually high number of successes if P(x or more) is very small (such as 0.05 or less).

    Unusually low: x successes among n trials is an unusually low number of successes if P(x or fewer) is very small (such as 0.05 or less).

    Example: Refer to the previous example. Find all unusual results. Hint: you probably think that x = 7 is unusually high, so you calculate P(7 or more) = P(7) + P(8) = 0.035