statistics ch 1-5 notes (2)
Post on 05-Apr-2018
7/31/2019 Statistics Ch 1-5 Notes (2)
Chapter 1: Introduction - Defining the Role of Statistics in Business
Statistical Analysis: helps extract information from data and provides an indication of the quality of that information
Data Mining: combines statistical methods with computer science & optimization in order to help businesses make the best use of the information contained in large data sets
Probability: helps you understand risky and random events and provides a way of evaluating the likelihood of various potential outcomes
1.1 Why should you learn statistics?
o Advertising. Effective? Which commercial? Which markets?
o Quality control. Defect rate? Cost? Are improvements working?
o Finance. Risk: how high? How to control? At what cost?
o Accounting. Audit to check financial statements. Is error material?
o Other: economic forecasting, measuring and controlling productivity
1.2 What is statistics?
Statistics: the art and science of collecting and understanding data
o A complete and careful statistical analysis will summarize the general facts that apply to everyone and will also alert you to any exceptions.
1.3 The Five Basic Activities of Statistics
1. Design Phase: resolves the data-gathering issues below so that useful data will result
   a. Designing the Study: involves planning the details of data gathering. Can avoid the cost & disappointment of finding out too late that the data collected are not adequate to answer the important questions.
   b. The Population: the large group of people, firms, or other items you want to learn about
   c. The Sample: a smaller group that consists of some of the population
   d. Statistical Inference: the process of generalizing from the observed sample to the larger population
   e. The Random Sample: the best way to select a practical sample, to be studied in detail, from a population that is too large to be examined in its entirety
      i. Makes the selection process fair & free of selection bias, so the sample tends to be representative of the population
      ii. The randomness, introduced in a controlled way during the design phase, will help ensure validity of the statistical inferences drawn later
2. Exploring the Data: involves looking at your data set from many angles, describing it, and summarizing it. Exploration is the first phase once you have data to look at.
   a. Prepares for the formal analysis either by:
      i. verifying that the expected relationships actually exist in the data, thereby validating the planned techniques of analysis
      ii. finding some unexpected structure in the data that must be taken into account, thereby suggesting some changes in the planned analysis
3. Modeling the Data: a system of assumptions & equations is selected in order to provide a framework for further analysis.
   a. Model: a system of assumptions and equations that can generate artificial data similar to the data you are interested in, so that you can work with a few numbers (parameters) that represent the important aspects of the data
      i. Often, a model says that data equals structure plus random noise
         1. Data = Structure + Random Noise
4. Estimating an Unknown Quantity: produces a numerical summary of an unknown quantity, based on data. It is the best educated guess possible based on the available data. We all want (and often need) estimates of things that are just plain impossible to know exactly.
   a. Provides an indication of the amount of uncertainty or error involved in the guess, accounting for the consequences of random selection of a sample from a large population
   b. Confidence Interval: gives probable upper and lower bounds on the unknown quantity being estimated. Puts the estimate in perspective and helps you avoid the tendency to treat a single number as very precise when, in fact, it might not be precise at all.
   c. NOTES:
      i. Estimating an unknown: best guess based on data
      ii. Wrong, but by how much?
      iii. "We're 95% sure that the unknown is between ..."; never say 100%; wrong 5% of the time (but by how much?)
5. Hypothesis Testing: uses the data to help decide what the world is really like in some respect. It is the use of data in deciding between two (or more) different possibilities in order to resolve an issue in an ambiguous situation
   a. Produces a definite decision about which of the possibilities is correct, based on data
   b. Procedure is to collect data that will help decide among the possibilities and to use careful statistical analysis for extra power when the answer is not obvious from just glancing at the data
   c. Each hypothesis makes a definite statement, and it may be either true or false
   d. The result of a statistical hypothesis test is the conclusion that either the data support the hypothesis or they don't.
   e. NOTES:
      i. Hypothesis testing: data decide between two possibilities
      ii. Does it really work? Or is it just randomly better?
      iii. Whiter, brighter wash?
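The modeling and estimation activities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the structure value (100), the noise level (15), the sample size (50), and all function names are hypothetical, and the interval uses the common large-sample approximation of the sample mean plus or minus 1.96 standard errors.

```python
import math
import random
import statistics


def simulate_data(structure, noise_sd, n, rng):
    """Data = Structure + Random Noise: a fixed value plus Gaussian noise."""
    return [structure + rng.gauss(0.0, noise_sd) for _ in range(n)]


def estimate_with_interval(sample, z=1.96):
    """Return (estimate, lower, upper) for the mean of the population.

    Assumption: z = 1.96 gives an approximate 95% confidence interval,
    which is reasonable only for fairly large random samples.
    """
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, mean - z * standard_error, mean + z * standard_error


rng = random.Random(42)  # fixed seed so that repeated runs give the same numbers
population = simulate_data(structure=100.0, noise_sd=15.0, n=10_000, rng=rng)
sample = rng.sample(population, 50)  # simple random sample, without replacement
estimate, lower, upper = estimate_with_interval(sample)
print(f"Best guess: {estimate:.1f}")
print(f"About 95% sure the unknown mean is between {lower:.1f} and {upper:.1f}")
```

The printed interval puts the single best guess in perspective: a different random sample would give a somewhat different estimate and interval.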
1.4 Data Mining
Data Mining: a collection of methods for obtaining useful knowledge by analyzing large amounts of data, often by searching for hidden patterns.
o Goal of data mining is to obtain value from these vast stores of data in order to improve the company with higher sales, lower costs, and better products
o Marketing and sales: data can be mined for guidance on how (and when) to better reach customers in the future
o Finance: useful in forming and evaluating investment strategies and in hedging (or reducing) risk
o Product Design: answers what particular combinations of features customers are ordering in larger-than-expected quantities.
o Production
o Fraud Detection: the best methods of protection involve mining data to distinguish between ordinary and fraudulent patterns of usage, then using the results to classify new transactions, and looking carefully at suspicious new occurrences to decide whether or not fraud is actually involved
o Data Mining involves combining resources from many fields:
   Statistics: All of the basic activities of statistics are involved: a design for collecting the data, exploring for patterns, a modeling framework, estimation of features, and hypothesis testing to assess significance of patterns
   Computer Science: Efficient algorithms (computer instructions) are needed for collecting, maintaining, organizing, and analyzing data.
   Optimization: Helps achieve a goal, which might be very specific, such as maximizing profits, lowering production cost, finding new customers, developing profitable new product models, or increasing sales volume.
      Often accomplished by adjusting the parameters of a model until the objective is achieved
1.5 Probability
Probability: a "what if" tool for understanding risk and uncertainty. Shows you the likelihood, or chances, for each of the various potential future events, based on a set of assumptions about how the world works
o Probability is the inverse of statistics. Whereas statistics helps you go from observed data to generalizations about how the world works, probability goes the other direction.
o Probability works with statistics by providing a solid foundation for statistical inference.
[Diagram: How the world works -> What is likely to happen]
[Diagram: What happened -> What is likely to happen]
If you make assumptions about how the world works, then probability can help you figure out how likely various outcomes are and thus help you understand what is likely to happen. If you have data that tell you something about what
Chapter 1 Questions
1. Why is it worth the effort to learn about statistics?
   a. Answer for management in general
   b. Answer for one particular area of business of special interest to you
2. Skip
3. How should statistical analysis and business experience interact with each other?
4. What is statistics?
5. What is the design phase of a statistical study?
6. Why is random sampling a good method to use for selecting items for study?
7. What can you gain by exploring data in addition to looking at summary results from an automated analysis?
8. What can a statistical model help you accomplish? Which basic activity of statistics can help you choose an appropriate model for your data?
9. Are statistical estimates always correct? If not, what else will you need (in addition to the estimated values) in order to use them effectively?
10. Why is a confidence interval more useful than an estimated value?
11. Give two examples of hypothesis testing situations that a business firm would be interested in.
12. What distinguishes data mining from other statistical methods? What methods, in addition to those of statistics, are often used in data mining?
13. Differentiate between probability and statistics.
14. A consultant has just presented a very complicated statistical analysis, complete with lots of mathematical symbols and equations. The results of this impressive analysis go against your intuition and experience. What should you do?
15. Why is it important to identify the source of funding when evaluating the results of a statistical study?
Problems
6. Which of the five basic activities of statistics is represented by each of the following situations?
   a. A factory's quality control division is examining detailed quantitative information about recent productivity in order to identify possible trouble spots.
   b. A focus group is discussing the audience that would best be targeted by advertising, with the goal of drawing up and administering a questionnaire to this group.
   c. In order to get the most out of your firm's Internet activity data, it would help to have a framework or structure of equations to allow you to identify and work with the relationships in the data.
   d. A firm is being sued for gender discrimination. Data that show salaries for men and women are presented to the jury to convince them that there is a consistent pattern of discrimination and that such a disparity could not be due to randomness alone.
   e. The size of next quarter's gross national product must be known so that a firm's sales can be forecast. Since it is unavailable at this time, an educated guess is used.
Chapter 2: Data Structures - Classifying the Various Types of Data Sets
Data Set: consists of observations on items, typically with the same information being recorded for each item
Elementary Units: the items themselves
Data sets can be classified:
o By the number of pieces of information (variables)
o By the kind of measurement (numbers or categories)
o By whether or not the time sequence of recording is relevant
o By whether the information was newly created or had previously been created by others for their own purposes
2.1 How Many Variables?
Variable: a piece of information recorded for every item (its cost, for example)
o One = univariate data, two = bivariate data, & many = multivariate data
Univariate Data Sets (one-variable)
o Have just one piece of information recorded for each item
o Statistical methods summarize the basic properties and answer questions such as: what is a typical summary value, how diverse are these items, and do any individuals or groups require special attention?
o Examples: incomes of subjects in a marketing survey, number of defects in each TV set in a sample, interest rate forecasts of 25 experts, and the bond ratings of the firms in an investment portfolio
Bivariate Data Sets (two-variable)
o Have exactly two pieces of information recorded for each item
o In addition to summarizing each of these two variables separately as univariate data sets, statistical methods would also be used to explore the relationship between the two factors being measured.
o Answers: is there a simple relationship between the two, how strongly are they related, can you predict one from the other & if so with what degree of reliability, and do any individuals or groups require special attention?
o Examples: cost of production (1st variable) & number produced (2nd variable), price of one share of your firm's common stock (1st variable) & the date (2nd variable), and purchase or non-purchase of an item (1st variable) & whether an advertisement for the item is recalled (2nd variable)
Multivariate Data Sets (many-variable)
o Have three or more pieces of information recorded for each item
o Statistical methods summarize each variable separately, look at the relationship between any two variables, AND also look at the interrelationships among all the variables
o Answers: is there a simple relationship among them, how strongly are they related, can you predict one (the "special variable") from the others & if so with what degree of reliability, and do any individuals or groups require special attention?
o Examples: growth rate (special variable) and a collection of measures of strategy (the other variables), such as type of equipment, extent of investment, and management style, for each of a number of new entrepreneurial firms; and salary (special variable) and gender (recorded as male or female, 0/1), number of years of experience, job category, and performance record, for each employee
2.2 Quantitative Data: Numbers
o Meaningful numbers are numbers that directly represent the measured or observed amount of some characteristic or quality of the elementary units
   Include dollar amounts, counts, sizes, numbers of employees, and miles per gallon
   Exclude numbers that are merely used to code for or keep track of something else (like 1 = buy stock, 2 = sell stock, 3 = buy bond, 4 = sell bond)
o Quantitative Data: data consisting of meaningful numbers that represent quantities
Discrete Quantitative Data
o Discrete Variable: can assume values only from a list of specific numbers
   Example: the number of children in a household is a discrete variable
Continuous Quantitative Data
o Continuous Variable: any numerical variable that is not discrete
o The possible values form a continuum, such as the set of all positive numbers, all numbers, or all values between 0 and 100%
   Example: the actual weight of a candy bar marked "net weight 1.7 oz" is a continuous random variable; the actual weight might be 1.70235 or 1.69481 oz
Watch out for meaningless numbers
o Make sure the numbers are meaningful
2.3 Qualitative Data: Categories
o Qualitative Data: the data set tells you which one of several non-numerical categories each item falls into (b/c they record some quality that the item possesses)
   Ordinal: there is a meaningful ordering but no meaningful numerical assignment
      Can say first, second, third, and so on
      Can rank the data according to this ordering, and ranking will probably play a role in the analysis
      There is a median value (the middle one, once the data are put into order)
   Nominal: there is no meaningful order
      There are only categories
      There are no meaningful numbers to compute with, and no basis for ranking
      About all that can be done is to count and work with the percentage of cases falling into each category, using the mode (the category occurring most often) as a summary measure
2.4 Time-Series and Cross-Sectional Data
o Time-Series Data: the data values are recorded in a meaningful sequence, such as daily stock prices
o Cross-Sectional Data: the sequence in which the data are recorded is irrelevant, such as the first-quarter earnings of eight aerospace firms
   Another way of saying that no time sequence is involved; you simply have a cross-section, or snapshot, of how things are at one particular time
2.5 Sources of Data, including the Internet
o Primary Data: when you control the design of the data-collection plan (even if the work is done by others)
   More likely to be able to get exactly the information you want because you control the data-generating process
   Primary data sets are often expensive and time-consuming to obtain
o Secondary Data: when you use data previously collected by others for their own purposes
   Often inexpensive (or even free) and you might find exactly (or nearly) what you need
o To look for data on the Internet, most people use a search engine and specify some key words
   Still common for a search to fail to find the information you really want
Chapter 2 Questions
1. What is a data set?
2. What is a variable?
3. What is an elementary unit?
4. What are three basic ways in which data sets can be classified? (Hint: the answer is not univariate, bivariate and multivariate, but is at a higher level)
5. What general questions can be answered by analysis of:
   a. Univariate data
   b. Bivariate data
   c. Multivariate data?
6. In what way do bivariate data represent more than just two separate univariate data sets?
7. What can be done with multivariate data?
8. What is the difference between quantitative and qualitative data?
9. What is the difference between discrete and continuous quantitative variables?
10. What are qualitative data?
11. What is the difference between ordinal and nominal qualitative data?
12. Differentiate between time-series data and cross-sectional data.
13. Which are simpler to analyze, time-series or cross-sectional data?
14. Distinguish between primary and secondary data
Chapter 3: Histograms - Looking at the Distribution of Data
Histogram: a picture that gives you a visual impression of many of the basic properties of the data set as a whole
o Answers: what values are typical in this data set, how different are the numbers from one another, are the data values strongly concentrated near some typical value, what is the pattern of the concentration (do data values trail off at the same rate at lower values as they do at higher values), are there any special data values that might require special treatment, and do you have a single, homogeneous collection or are there distinct groupings within the data that might require separate analysis?
o Many standard methods of statistical analysis require that the data be approximately normally distributed
3.1 A List of Data
List of Numbers: the simplest kind of data set, representing some kind of information (a single statistical variable) measured on each item of interest (each elementary unit)
Number Line: a straight line with the scale indicated by numbers
o Used to visualize the relative magnitudes of a list of numbers
o The numbers need to be regularly spaced on a number line so that there is no distortion
3.2 Using a Histogram to Display the Frequencies
Histogram: displays the frequencies as a bar chart rising above the number line, indicating how often the various values occur in the data set
o Horizontal axis = measurements of the data set (dollars, # of people, miles/gallon, etc.)
o Vertical axis = how often these values occur
o An especially high bar indicates that many cases had data values at this position on the horizontal number line, while a shorter bar indicates a less common value
A histogram is a bar chart of the frequencies, not of the data
o The height of each bar in the histogram indicates how frequently the values on the horizontal axis occur in the data set (where values are concentrated & where they are scarce)
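To make the point that a histogram charts frequencies rather than the data themselves, here is a minimal Python sketch that counts how many values fall into each equal-width interval and draws the bars as text. The miles-per-gallon figures, the bin settings, and the function names are made up for illustration.

```python
def histogram_counts(values, start, bin_width, n_bins):
    """Count how many values fall in each interval [low, low + bin_width).

    Values outside the covered range are ignored in this sketch.
    """
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def render_histogram(counts, start, bin_width):
    """Return one text line per bar; the bar length is the frequency."""
    lines = []
    for i, count in enumerate(counts):
        low = start + i * bin_width
        lines.append(f"{low:>3}-{low + bin_width:<3} {'#' * count}")
    return lines


# Hypothetical miles-per-gallon measurements for ten cars
mpg = [18, 21, 22, 24, 25, 25, 27, 28, 31, 36]
counts = histogram_counts(mpg, start=15, bin_width=5, n_bins=5)
lines = render_histogram(counts, start=15, bin_width=5)
for line in lines:
    print(line)
```

The tallest bar (25 to 30) marks where the values are concentrated; the short bars at either end mark where they are scarce.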
3.3 Normal Distributions
Normal Distribution: an idealized, smooth, bell-shaped histogram with all of the randomness removed
o Represents an ideal data set that has lots of numbers concentrated in the middle of the range, with the remaining numbers trailing off symmetrically on both sides
o It is common for statistical procedures to assume that the data set is reasonably approximated by a normal distribution
o It is important to explore the data, by looking at a histogram, to determine whether or not it is normally distributed
   Especially important if a standard statistical calculation will be used that requires a normal distribution
3.4 Skewed Distributions and Data Transformation
Skewed Distribution: is neither symmetric nor normal because the data values trail off more sharply on one side than on the other
In business, skewness is often found in data sets that represent sizes using positive numbers
o Reason is that data values cannot be less than zero (imposing a boundary on one side), but are not restricted by a definite upper boundary
One of the problems with skewness in data is that many statistical methods require at least an approximately normal distribution
Transformation: a solution to skewness; makes a skewed distribution more symmetric. It replaces each data value by a different number (such as its logarithm) to facilitate statistical analysis
o The logarithm cannot be used if the data include a negative number or zero
o Logarithm: using the log often transforms skewness into symmetry because it stretches the scale near zero, spreading out all of the small values, which had been bunched together
   Base 10 (common logs) (*what we will use in this section)
   Base e (natural logs)
o The logarithm pulls in the very large numbers, reducing their difference from other values in the set, and stretches out the low values
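The effect of a base-10 log transformation can be checked numerically. The sketch below uses made-up revenue figures and a crude skewness indicator, (mean - median) / standard deviation, which is an assumption chosen for illustration and not a measure taken from these notes. The function names are also hypothetical.

```python
import math
import statistics


def log10_transform(values):
    """Replace each value by its base-10 logarithm.

    Raises ValueError for zero or negative values, where the log is undefined.
    """
    if any(v <= 0 for v in values):
        raise ValueError("logarithms need strictly positive data values")
    return [math.log10(v) for v in values]


def skew_indicator(values):
    """Crude indicator: how far the mean is pulled away from the median."""
    spread = statistics.stdev(values)
    return (statistics.mean(values) - statistics.median(values)) / spread


# Hypothetical firm revenues in $ millions: many small values, a few huge ones
revenues = [1, 2, 3, 5, 8, 12, 20, 45, 130, 900]
logged = log10_transform(revenues)

print(f"raw data:    indicator = {skew_indicator(revenues):.2f}")
print(f"log10 data:  indicator = {skew_indicator(logged):.2f}")
```

On the log scale the largest value (900) is no longer hundreds of units away from the rest, so the mean sits much closer to the median.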
3.5 Bimodal Distributions
It is important to recognize when a data set consists of two or more distinct groups so that they may be analyzed separately
o Can be seen in a histogram as a distinct gap between two cohesive groups of bars
Bimodal Distribution: when two clearly separate groups are visible in a histogram
o Has two modes, or two distinct clusters of data
May be an indication that the situation is more complex, or that extra care is required
o Should find out the reason for the two groups
o To count as separate, the groups must be large enough, individually cohesive, and either have a fair gap between them or else represent a large enough sample to be sure that the lower frequencies between the groups are not just random fluctuations
3.6 Outliers
Outliers: data values that don't seem to belong with the others because they are either far too big or far too small
How you deal with outliers depends on what caused them
o Two kinds: 1) mistakes and 2) correct but different data values
Dealing with outliers
o Mistakes: change the data value to the number it should have been in the first place
o Correct outliers are more difficult to deal with
   If it can be argued convincingly that the outliers do not belong to the general case under study, they may then be set aside so that the analysis can proceed with only the coherent data
      Must be able to convince any person for whom the report is intended
   Compromise solution: perform two different analyses, one with the outlier included and one with it omitted. By reporting the results of both analyses, you have not unfairly slanted the results
   Whenever an outlier is omitted, explain what you did and why, in order to inform others and protect yourself from any possible accusations
Why must outliers be addressed?
   It is difficult to interpret the detailed structure in a data set when one value dominates the scene and calls too much attention to itself
   Many of the most common statistical methods can fail when used on a data set that doesn't appear to have a normal distribution
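The compromise of analyzing the data twice, once with the outlier and once without it, might look like the following Python sketch. The data values and the function name are hypothetical, and the outlier is named by the analyst here; no automatic detection rule is implied.

```python
import statistics


def analyze_with_and_without(values, outlier):
    """Summarize the data twice: with the named outlier and with it set aside."""
    remaining = list(values)
    remaining.remove(outlier)  # removes one occurrence of the outlier
    return {
        "with outlier": {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        },
        "without outlier": {
            "mean": statistics.mean(remaining),
            "median": statistics.median(remaining),
        },
    }


# Hypothetical order sizes; 500 is the value under suspicion
orders = [52, 48, 55, 51, 49, 50, 53, 47, 500]
report = analyze_with_and_without(orders, outlier=500)
for label, summary in report.items():
    print(f"{label}: mean = {summary['mean']:.2f}, median = {summary['median']}")
```

Reporting both rows, together with a note on why 500 was set aside, lets the reader judge how much the conclusion depends on that one value.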
3.7 Data Mining with Histograms
The histogram is a useful tool for large data sets because you can see the entire data set at a glance
o Provides a visual impression of the data set, and with large data sets you will be able to see more of the detailed structure
One advantage of data mining with a large data set is that we can ask for more detail
o Can have more histogram bars by reducing the width of the bars
3.8 Histograms by Hand: Stem-and-Leaf
Stem-and-Leaf: the easiest way to construct a histogram by hand, in which the histogram bars are constructed by stacking numbers one on top of the other (or side-by-side).
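A stem-and-leaf display can also be sketched in Python. The version below assumes non-negative whole numbers, uses the tens digit (and anything above it) as the stem and the ones digit as the leaf, and stacks the leaves side by side; the data and the function name are made up for illustration.

```python
def stem_and_leaf(values):
    """Return the display as a list of text lines, one line per stem.

    Assumption: values are non-negative integers; stem = value // 10,
    leaf = value % 10. Stems with no data are still listed so that the
    shape of the distribution is not distorted.
    """
    leaves_by_stem = {}
    for value in sorted(values):
        leaves_by_stem.setdefault(value // 10, []).append(value % 10)

    lines = []
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


# Hypothetical test scores
scores = [47, 55, 61, 48, 50, 72, 49, 51, 64, 52, 53]
display = stem_and_leaf(scores)
for line in display:
    print(line)
```

Each row of leaves plays the role of a histogram bar, and the original data values can still be read off the display.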
Chapter 3 Questions
1. What is a list of numbers?
2. Name six properties of a data set that are displayed by a histogram.
3. What is a number line?
4. What is the difference between a histogram and a bar chart?
5. What is a normal distribution?
6. Why is the normal distribution important in statistics?
7. When a real data set is normally distributed, should you expect the histogram to be a perfectly smooth bell-shaped curve? Why or why not?
8. Are all data sets normally distributed?
9. What is a skewed distribution?
10. What is the main problem with skewness? How can it be solved in many cases?
11. How can you interpret the logarithm of a number?
12. What is a bimodal distribution? What should you do if you find one?
13. What is an outlier?
14. Why is it important in a report to explain how you dealt with an outlier?
15. What kinds of trouble do outliers cause?
16. When is it appropriate to set aside an outlier and analyze only the rest of the data?
17. Suppose there is an outlier in your data. You plan to analyze the data twice: once with and once without the outlier. What result would you be most pleased with? Why?
18. What is a stem-and-leaf histogram?
19. What are the advantages of a stem-and-leaf histogram?
Chapter 4: Landmark Summaries - Interpreting Typical Values and Percentiles
Summarization: using one or more selected or computed values to represent the data set
Discovering and identifying the features that the cases have in common are statistical activities because they treat the information as a whole
In statistics, one goal is to condense a data set down to one number (or two or a few numbers) that expresses the most fundamental characteristics of the data
Methods most appropriate for a single list of numbers:
o One: the average, median, and mode are different ways of selecting a single number that closely describes all the numbers in a data set
   Typical value, center, or location
o Two: a percentile summarizes information about ranks
o Three: the standard deviation is an indication of how different the numbers in the data set are from one another (also referred to as diversity or variability)
Outliers may be described separately. You can summarize a large group of data by 1) summarizing the basic structure of most of its elements and 2) making a list of any special exceptions
4.1 What is the Most Typical Value?
Typical Value: the ultimate summary of any data set is a single number that best represents all of the data values
o Average or Mean: can only be computed for meaningful numbers (quantitative data)
   The most common method for finding a typical value for a list of numbers, found by adding up all the values and then dividing by the number of items
   Excel's AVERAGE function can be used to find the average of a list of numbers
      =AVERAGE(A3:A7)
   The idea of an average is the same whether you view your list of numbers as a complete population or as a representative sample from a larger population; however, the notation differs slightly
      For an entire population, the convention is to use N to represent the number of items and to let μ (the Greek letter mu) represent the population mean value
   The average may be interpreted as spreading the total evenly among the elementary units (if you replaced each data value by the average, then the total would remain unchanged)
   Because the average preserves the total while spreading amounts out evenly, it is most useful as a summary when there are no extreme values (outliers) present and the data set is a more-or-less homogeneous group with randomness
      The average is the only summary measure capable of preserving the total
   Weighted Average
      Is like the average, except that it allows you to give a different importance, or weight, to each data item
      Gives you the flexibility to define your own system of importance when it is not appropriate to treat each item equally
      The weighted average may best be interpreted as an average to be used when some items have more importance than others; the items with greater importance have more of a say in the value of the weighted average
      It combines the known information about each group (from the sample) with better information about each group's representation (from the population rather than the sample); since the best information of each type is used, the result is improved
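Both ideas, that the average preserves the total and that a weighted average lets some items count for more, can be sketched in Python. The numbers and function names below are hypothetical; the weights stand in for each group's share of the population.

```python
def average(values):
    """Add up all the values and divide by the number of items."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per value")
    total_weight = sum(weights)
    return sum(w * v for v, w in zip(values, weights)) / total_weight


# The average preserves the total: replacing every value by the average
# leaves the sum unchanged.
costs = [3, 5, 10]
assert average(costs) * len(costs) == sum(costs)

# Hypothetical sample averages for two customer groups, weighted by each
# group's share of the population rather than its share of the sample.
group_averages = [20, 30]
population_shares = [0.25, 0.75]
print(average(group_averages))                              # treats groups equally
print(weighted_average(group_averages, population_shares))  # favors the larger group
```

Because the second group makes up three quarters of the population in this made-up example, the weighted average lands closer to 30 than the plain average does.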
o Median (halfway point): can be computed for ordered categories (ordinal data) or for numbers
   Median: the middle value; half of the items in the set are larger and half are smaller
      It must be in the center of the data and provide an effective summary of the list of data
   Find it by putting the data in order and then locating the middle value
      Might have to average the two middle values if there is no single value in the middle
      Ranks: associate the numbers 1, 2, 3, ..., n with the data values so that the smallest has rank 1, the next smallest has rank 2, and so forth up to the largest, which has rank n
      The median has rank (1+n)/2
   How does the median compare to the average?
      o When the data set is normally distributed, they will be close to one another since the normal distribution is so symmetric and has such a clear middle point
      o The average and the median will usually be a little different even for a normal distribution b/c each summarizes in a different way, and there is nearly always some randomness in real data
      o When the data set is not normally distributed, the median and average can be very different b/c a skewed distribution does not have a well-defined center point
      o Typically, the average is more in the direction of the longer tail or of the outlier than the median is because the average "knows" the actual values of these extreme observations, whereas the median "knows" only that each value is either on one side or on the other
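The rank rule for the median, rank (1+n)/2 with the two middle values averaged when that rank falls between two items, can be written out directly. This is a minimal sketch; the function name and the data are made up for illustration.

```python
import statistics


def median_by_rank(values):
    """Find the median as the value with rank (1 + n) / 2.

    If the rank is a whole number, return that data value; otherwise
    average the two values on either side of it.
    """
    if not values:
        raise ValueError("the median of an empty list is undefined")
    ordered = sorted(values)
    n = len(ordered)
    rank = (1 + n) / 2
    below = ordered[int(rank) - 1]  # ranks start at 1, list indexes at 0
    if rank == int(rank):
        return below
    above = ordered[int(rank)]
    return (below + above) / 2


# Hypothetical skewed data: one very large value in the upper tail
incomes = [1, 2, 3, 4, 100]
print("median :", median_by_rank(incomes))
print("average:", statistics.mean(incomes))
```

With these values the average is pulled toward the long upper tail, while the median stays with the bulk of the data.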
o Mode (most common category): can be computed for unordered categories (nominal data), ordered categories, or numbers
   Mode: the most common category, the one listed most often in the data set
   It is the only summary measure available for nominal qualitative data because unordered categories cannot be summed (as for the average) and cannot be ranked (as for the median)
      Easily found for ordinal data by ignoring the ordering of the categories and proceeding as if you had a nominal data set with unordered categories
   Is also defined for quantitative data (numbers), although the definition is ambiguous
      Can be defined as the value at the highest point of the histogram
      Slightly imprecise: there can be two tallest bars, and the mode depends on the construction of the histogram (the bar width and location will make some changes in the shape of the distribution, and the mode can change as a result)
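For categories, finding the mode is just a matter of counting. The Python sketch below returns every category tied for the highest count, which makes the possible ambiguity visible instead of hiding it; the function name and the sample categories are hypothetical.

```python
from collections import Counter


def modes(categories):
    """Return all categories that occur most often, in order of first appearance."""
    counts = Counter(categories)
    highest = max(counts.values())
    return [category for category, count in counts.items() if count == highest]


# Hypothetical nominal data: color chosen by each customer
colors = ["red", "blue", "red", "green", "blue", "red"]
print(modes(colors))

# A tie: two categories share the top count, so the mode is not unique
print(modes(["a", "b", "a", "b", "c"]))
```

The same function works for ordinal categories, since it simply ignores any ordering among them.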
Which summary should you use?
o The mode can be computed for any univariate data set (with some ambiguity for quantitative data)
o The average can be computed only from quantitative data (meaningful numbers)
o The median can be computed for anything except nominal data (unordered categories)

           Quantitative   Ordinal   Nominal
Average    Yes
Median     Yes            Yes
Mode       Yes            Yes       Yes

   For quantitative data, where all three summaries can be computed, how are they different?
      For a normal distribution, there is very little difference among the measures since each is trying to find the well-defined middle of that bell-shaped distribution
      With skewed data, there can be noticeable differences among them
   The average should be used when the data set is normally distributed, and in cases where the need to preserve or forecast total amounts is important, since the other summaries do not do this as well
   The median can be a good summary for skewed distributions since it is not distracted by a few very large data items
      It summarizes most of the data better than the average does in cases of extreme skewness
      Also useful when outliers are present because of its ability to resist their effects
      Useful with ordinal data (ordered categories), although the mode should be considered also
The mode must be used with nominal data (unordered categories) since the others cannot be computed
  Also useful with ordinal data (ordered categories) when the most represented category is important
o Biweight: a promising kind of estimate, a robust estimator, which manages to combine the best features of the average and the median
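The differences among the three summaries can be seen on a small skewed data set. The sketch below uses Python's standard `statistics` module; the `incomes` and `colors` lists are made-up illustrations, not data from these notes.

```python
from statistics import mean, median, mode

# Hypothetical, right-skewed data: most values are small, one is very large.
incomes = [30, 32, 35, 35, 38, 40, 250]

avg = mean(incomes)    # pulled upward by the single large value
mid = median(incomes)  # middle-ranked value, resists the large value
top = mode(incomes)    # most frequently occurring value

print("average:", avg)
print("median: ", mid)
print("mode:   ", top)

# The mode is the only one of the three that works on nominal categories.
colors = ["red", "blue", "red", "green"]
favorite = mode(colors)
print("mode of categories:", favorite)
```

With skewed data like this, the average lands well above the median, which is the situation in which the notes suggest the median as the better summary.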
4.2 What Percentile Is It?
Percentiles: summary measures expressing ranks as percentages from 0% to 100% rather than from 1 to n, so that the 0th percentile is the smallest number, the 100th percentile is the largest, the 50th percentile is the median, and so on
Used in two ways:
o 1) to indicate the data value at a given percentage (as in "the 10th percentile is $156,293")
o 2) to indicate the percentage ranking of a given data value (as in "John's performance, $296,994, was in the 55th percentile")
Extremes, Quartiles, and Box Plots
o One important use of percentiles is as landmark summary values
o You can use a few percentiles to summarize important features of the entire distribution
o The median is the 50th percentile since it is ranked halfway between the largest and smallest
o Extremes: the smallest and largest values (0th and 100th percentiles, respectively)
o Quartiles: defined as the 25th and 75th percentiles
  Are the data values ranked one-fourth of the way in from the smallest and largest values; there is some ambiguity as to exactly how to find them
o Five Number Summary: defined as the following set of five landmark summaries: smallest, lower quartile, median, upper quartile, and largest
  The smallest data value (the 0th percentile)
  The lower quartile (the 25th percentile, 1/4 of the way in from the smallest)
  The median (the 50th percentile, in the middle)
  The upper quartile (the 75th percentile, 3/4 of the way in from the smallest and 1/4 of the way in from the largest)
  The largest data value (the 100th percentile)
  The two extremes indicate the range spanned by the data, the median indicates the center, the two quartiles indicate the edges of the middle half of the data, and the position of the median between the quartiles gives a rough indication of skewness or symmetry
o Box Plot: a picture of the five-number summary
  Serves the same purpose as a histogram (provides a visual impression of the distribution) BUT in a different way
  Shows less detail and is more useful in seeing the big picture and comparing several groups of numbers without the distraction of every detail of each group
  The histogram is still preferable for a more detailed look at the shape of the distribution
o Detailed Box Plot: a box plot, modified to display the outliers, which are identified by labels
  Outliers: those data points (if any) that are far from the middle of the data set
  A larger data value will be declared to be an outlier if it is bigger than upper quartile + 1.5 × (upper quartile - lower quartile)
  A smaller data value will be declared to be an outlier if it is smaller than lower quartile - 1.5 × (upper quartile - lower quartile)
  In addition to displaying and labeling outliers, you may also label the most extreme cases that are not outliers
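The five-number summary and the outlier rule can be sketched in a few lines of Python. Because there is ambiguity in how quartiles are found, the sketch assumes the `method="inclusive"` convention of `statistics.quantiles`, which may differ slightly from the textbook's ranking rule. The function names and the `values` list are hypothetical.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (smallest, lower quartile, median, upper quartile, largest)."""
    ordered = sorted(data)
    # Assumption: the "inclusive" quartile convention; other conventions
    # can give slightly different quartiles for the same data.
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]

def find_outliers(data):
    """Flag values more than 1.5 x (upper - lower quartile) beyond a quartile."""
    _, q1, _, q3, _ = five_number_summary(data)
    spread = q3 - q1
    low_fence = q1 - 1.5 * spread
    high_fence = q3 + 1.5 * spread
    return [x for x in data if x < low_fence or x > high_fence]

values = [12, 15, 17, 18, 20, 21, 23, 25, 60]  # hypothetical data
summary = five_number_summary(values)
outliers = find_outliers(values)
print("five-number summary:", summary)
print("outliers:", outliers)
```

A detailed box plot would draw the box from the lower to the upper quartile and label the flagged values individually.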
The Cumulative Distribution Function Displays the Percentiles
o Cumulative distribution function: a plot of the data specifically designed to display the percentiles by plotting the percentages against the data values
  Percentages from 0% to 100% on the vertical axis and percentiles (data values) along the horizontal axis
o Has a vertical jump of height 1/n at each of the n data values and continues horizontally between data points
o Finding the Percentile Ranking for a Given Number:
  1) Find the data value along the horizontal axis in the cumulative distribution function
  2) Move vertically up to the cumulative distribution function. If you hit a vertical portion, move halfway up
  3) Move horizontally to the left and read the percentile ranking
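The three steps for reading a percentile ranking off the cumulative distribution function can be mimicked without drawing the plot. In this sketch, each data value contributes a jump of 1/n, and a value that coincides with a data point is counted as halfway up its jump. The function name is hypothetical, and the numbers are the five order values used in the range example in Chapter 5 of these notes.

```python
def percentile_ranking(data, value):
    """Percentile ranking (0 to 100) of `value`, read from the empirical CDF."""
    n = len(data)
    below = sum(1 for x in data if x < value)   # jumps fully to the left
    equal = sum(1 for x in data if x == value)  # jumps at the value itself
    # Count only half of any jump located exactly at `value`
    # ("if you hit a vertical portion, move halfway up").
    return 100 * (below + 0.5 * equal) / n

orders = [185, 246, 92, 508, 153]
rank_at_data_point = percentile_ranking(orders, 185)  # lands on a vertical jump
rank_between_points = percentile_ranking(orders, 200)  # lands on a flat portion
print(rank_at_data_point, rank_between_points)
```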
Chapter 4 Questions
1. What is summarization of a data set? Why is it important?
2. List and briefly describe the different methods for summarizing a data set.
3. How should you deal with exceptions when summarizing a set of data?
4. What is meant by a typical value for a list of numbers? Name three different ways of finding one.
5. What is the average? Interpret it in terms of the total of all values in the data set.
6. What is a weighted average? When should it be used instead of a simple average?
7. What is the median? How can it be found from its rank?
8. How do you find the median for a data set:
a. With an odd number of values?
b. With an even number of values?
9. What is the mode?
10. How do you usually define the mode for a quantitative data set? Why is this definition ambiguous?
11. Which summary measure(s) may be used on:
a. Nominal data?
b. Ordinal data?
c. Quantitative data?
12. Which summary measure is best for:
a. A normal distribution?
b. Projecting total amounts?
c. A skewed distribution when totals are not important?
13. What is a percentile? In particular, is it a percentage (e.g. 23%), or is it specified in the same units as the data (e.g. $35.62)?
14. Name two ways in which percentiles are used.
15. What are the quartiles?
16. What is the five-number summary?
17. What is a box plot? What additional detail is often included in a box plot?
18. What is an outlier? How do you decide whether a data point is an outlier or not?
19. Consider the cumulative distribution function:
a. What is it?
b. How is it drawn?
c. What is it used for?
d. How is it related to the histogram and the box plot?
Chapter 5: Variability Dealing with Diversity
We need statistical analysis because there is variability in data
Variability: the extent to which the data values differ from each other
o Diversity, uncertainty, dispersion, and spread have similar meanings
Three ways of summarizing the amount of variability in a data set:
o One: the standard deviation summarizes how far an observation typically is from the average
  If you multiply the standard deviation by itself, you find the variance
o Two: the range is quick and superficial and is of limited use. It summarizes the extent of the entire data set, using the distance from the smallest to the largest data value
o Three: the coefficient of variation is the traditional choice for a relative (as opposed to an absolute) variability measure and is used moderately often
  Summarizes how far an observation typically is from the average as a percentage of the average value, using the ratio of standard deviation to average
5.1 The Standard Deviation: The Traditional Choice
Standard Deviation: a number that summarizes how far away from the average the data values typically are
o Is the basic tool for summarizing the amount of randomness in a situation
EXAMPLE:
o If all numbers are the same
  5.5, 5.5, 5.5, 5.5
  The average will be X̄ = 5.5 and the standard deviation will be S = 0
Variability: Introduction
Also known as dispersion, spread, uncertainty, diversity, risk
Example data: 2, 2, 2, 2, 2, 2, 2
  Variability = 0
Example data: 1, 3, 2, 2, 1, 2, 3
  How much variability?
  Look at how far each data value is from the average, X̄ = 2:
  Deviations from average are -1, 1, 0, 0, -1, 0, 1
  Variability should be between 0 and 1
Examples
Stock market, daily change, is uncertain
  Not the same, day after day!
Risk of a business venture
  There are potential rewards, but possible losses
Uncertain payoffs and risk aversion
  Which would you rather have?
    $1,000,000 for sure
    $0 or $2,000,000, each outcome equally likely
  Both have the same average! ($1,000,000)
  Most would prefer the choice with less uncertainty
o Most data sets have some variability
  43.0, 17.7, 8.7, -47.4
  The average is the same, but the data values are different (and so is the standard deviation)
Deviations: the distances from the average (also called residuals), which indicate how far above the average (if positive) or below the average (if negative) each data value is
The standard deviation summarizes the deviations. You can't just take an average of them: since some are positive and some are negative, the end result would be zero, which is not helpful
o Instead, the standard method:
  1) Find the deviations by subtracting the average from each data value
  2) Find the square of each deviation (multiply it by itself) to eliminate the minus sign
  3) Add them up
  4) Divide the resulting sum by n-1 (this is the variance)
  5) Take the square root, which undoes the squaring you did earlier (this is the standard deviation)
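The five steps above can be sketched directly in Python and checked against the standard library's `statistics.stdev`, which also divides by n-1. The function name `sample_std_dev` is a hypothetical helper; the two data sets are the ones used in the example above.

```python
from math import sqrt
from statistics import stdev

def sample_std_dev(data):
    """Sample standard deviation, following the five steps one at a time."""
    n = len(data)
    average = sum(data) / n
    deviations = [x - average for x in data]  # 1) subtract the average
    squares = [d * d for d in deviations]     # 2) square each deviation
    total = sum(squares)                      # 3) add them up
    variance = total / (n - 1)                # 4) divide by n-1 (the variance)
    return sqrt(variance)                     # 5) take the square root

same = [5.5, 5.5, 5.5, 5.5]
varied = [43.0, 17.7, 8.7, -47.4]

s_same = sample_std_dev(same)
s_varied = sample_std_dev(varied)
print("all values equal:", s_same)
print("varied values:   ", round(s_varied, 2))
print("library check:   ", round(stdev(varied), 2))
```

Both data sets have an average of 5.5, yet only the first has a standard deviation of zero.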
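Standard Deviation S
Measures variability by answering:
  Approximately how far from average are the data values? (same measurement units as the data)
For a sample:
  $$S = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}}$$
For the population:
  $$\sigma = \sqrt{\frac{(X_1 - \mu)^2 + (X_2 - \mu)^2 + \cdots + (X_N - \mu)^2}{N}}$$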
Example
On the histogram:
  Average is located near the center of the distribution
  Standard deviation is a distance away from the average
  Standard deviation is the typical distance from average
[Figure: histogram of spending (horizontal axis: spending, vertical axis: frequency), with the average X̄ = 2.05 marked and one standard deviation S = 1.83 shown on each side of the average]
o The variance (the square of the standard deviation) is sometimes used as a variability measure in statistics, especially by those who work directly with the formulas, but the standard deviation is a better choice
  The variance contains no extra information and is more difficult to interpret in practice than the standard deviation
o In Excel =STDEV(B3:B6)
o The Standard Deviation for a Sample
Interpreting the Standard Deviation
o The standard deviation has a simple, direct interpretation: it summarizes the typical distance from average for the individual data values; the result is a measure of the variability of these individuals
o The standard deviation represents the typical deviation size: expect some data values to be less than one standard deviation from the average, while others will be more than one standard deviation away from the average (expect individuals to deviate to both sides of the average)
Normal Distribution and Std. Dev.
For a normal distribution only:
  2/3 of data within one standard deviation of the average (either above or below)
  95% for 2 std. devs.
  99.7% for 3
[Fig 5.1.3: normal curve marking the 2/3, 95%, and 99.7% regions, with one standard deviation shown on each side of the average]
Interpreting the Standard Deviation for a Normal Distribution
o When a data set is approximately normally distributed, the standard deviation has a special interpretation: approximately two-thirds of the data values will be within one standard deviation of the average, on either side of the average
o Expect to find about 95% of the data within two standard deviations from the average, with error rates often limited to 5%
o Expect nearly all of the data (99.7%) to be within three standard deviations from the average
o If the data are NOT normally distributed, the above percentages do not apply
  Since there are so many different kinds of skewed (or other non-normal) distributions, there is no single exact rule that gives percentages for any distribution
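The two-thirds, 95%, and 99.7% figures can be checked against an exact normal distribution. The sketch below uses `statistics.NormalDist` (available in Python 3.8 and later); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the average."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * share_within(k):.1f}%")
```

The one-standard-deviation figure comes out a little above two-thirds, which is why the notes say "approximately".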
The Sample and the Population Standard Deviations
o Two different, but related, kinds of standard deviation
  Sample Standard Deviation: for a sample from a larger population, denoted S
  Population Standard Deviation: for an entire population, denoted σ (lowercase Greek sigma)
  The sample standard deviation is slightly larger in order to adjust for the randomness of sampling
  To resolve any remaining ambiguity, proceed as follows: if in doubt, use the sample standard deviation
    Using the larger value is usually the careful, conservative choice since it ensures that you will not be systematically understating the uncertainty
  For computation, the only difference between the two methods is that you subtract 1 (dividing by n-1) for the sample standard deviation, but you do not subtract 1 (dividing by N) for the population (there are also some notation changes)
  The smaller the number of items (N or n), the larger the difference between the formulas (with reasonably large amounts of data, there is little difference between the two methods)
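How quickly the two formulas converge can be seen by computing both on a small and a large data set. In Python's standard library, `statistics.stdev` divides by n-1 (sample) and `statistics.pstdev` divides by N (population). The large data set here is simply the small one repeated, a made-up device for illustration.

```python
from statistics import pstdev, stdev

small = [185, 246, 92, 508, 153]  # 5 items
large = small * 200               # hypothetical: the same values, 1,000 items

# Ratio of sample to population standard deviation: sqrt(n / (n - 1))
ratio_small = stdev(small) / pstdev(small)
ratio_large = stdev(large) / pstdev(large)

print("n = 5:    sample is", round(100 * (ratio_small - 1), 1), "% larger")
print("n = 1000: sample is", round(100 * (ratio_large - 1), 2), "% larger")
```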
5.2 The Range: Quick and Superficial
Range: the largest minus the smallest data value; represents the size or extent of the data
o Range of data set (185, 246, 92, 508, 153)
  = Largest - Smallest
  = 508 - 92
  = 416
  In Excel: =MAX(orders)-MIN(orders)
Is a sensible measure of diversity for some purposes (such as describing the extent of the data or searching for errors)
Because of its sensitivity to the extremes, the range is not very useful as a statistical measure of diversity in the sense of summarizing the data set as a whole
o The range does not summarize the typical variability in the data but rather focuses too much attention on just two data values
  The standard deviation is more sensitive to all of the data and provides a better look at the big picture
  The range will always be at least as large as the standard deviation (the two are equal only when every data value is the same, so that both are 0)
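The range calculation above, and its comparison with the standard deviation, can be reproduced in Python as a rough equivalent of the Excel formula. The variable names are illustrative.

```python
from statistics import stdev

orders = [185, 246, 92, 508, 153]

# Largest minus smallest: depends on only two of the data values.
data_range = max(orders) - min(orders)

# The sample standard deviation uses every data value.
s = stdev(orders)

print("range:", data_range)
print("standard deviation:", round(s, 1))
```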
5.3 The Coefficient of Variation: A Relative Variability Measure
Coefficient of Variation: defined as the standard deviation divided by the average; it is a relative measure of variability, expressed as a percentage or proportion of the average
o Most useful when there are no negative numbers in the data set
  Note that the standard deviation is the numerator, as is appropriate because the result is primarily an indication of variability
o The coefficient of variation has no measurement units: it is a pure number, a proportion or percentage, whose measurement units have canceled each other in the process of dividing standard deviation by average
  Makes the coefficient of variation useful in those situations where you don't care about the actual (absolute) size of the differences, and only the relative size is important
o Using the coefficient of variation allows you to reasonably compare a large firm to a small firm to see which one has more variation on a size-adjusted basis
o Can be larger than 100% even with positive numbers; this could happen with a very skewed distribution or with extreme outliers (the situation is very variable with respect to the average value)
Coefficient of Variation = Standard Deviation / Average
For a sample: $$\text{Coefficient of Variation} = \frac{S}{\bar{X}}$$
For a population: $$\text{Coefficient of Variation} = \frac{\sigma}{\mu}$$
5.4 Effects of Adding to or Rescaling the Data
If a number is added to each data value, then this same number is added to the average, median, mode and percentiles to obtain the corresponding summaries for the new data set
If each data value is multiplied by a fixed number, the average, median, mode, percentiles, standard deviation and range are each multiplied by this same number to obtain the corresponding summaries for the new data set (the coefficient of variation is unaffected)
If the data values are multiplied by a factor c and an amount d is then added, X becomes cX + d
o The new average is c × (old average) + d; likewise for the median, mode and percentiles
o The new standard deviation is |c| × (old standard deviation), and the range is adjusted similarly (note that the added number, d, plays no role here)
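These rescaling rules are easy to verify numerically. The sketch below applies cX + d to a small data set and compares the new summaries with the rules; the choice c = -2 and d = 10 is arbitrary (a negative c shows why the standard deviation uses |c|), and the data values are borrowed from the portfolio example in these notes.

```python
from statistics import mean, median, stdev

def rescale(data, c, d):
    """Apply X -> c*X + d to every data value."""
    return [c * x + d for x in data]

original = [116, 83, 105, 113, 98]
c, d = -2, 10  # arbitrary illustration values
new = rescale(original, c, d)

# Average and median follow c * (old value) + d.
print(mean(new), c * mean(original) + d)
print(median(new), c * median(original) + d)

# Standard deviation and range are multiplied by |c|; d plays no role.
old_range = max(original) - min(original)
new_range = max(new) - min(new)
print(round(stdev(new), 4), round(abs(c) * stdev(original), 4))
print(new_range, abs(c) * old_range)
```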
Coefficient of Variation
A relative measure of variability
  The ratio: standard deviation divided by average
  For a sample: S/X̄
  For a population: σ/μ
No measurement units. A pure number. Answers:
  Typically, in percentage terms, how far are data values from average?
Useful for comparing situations of different sizes
  To see how variability compares after adjusting for size
Example: Portfolio Performance
You have invested $100 in each of 5 stocks
  Results: $116, 83, 105, 113, 98
  Average is $103, std. dev. is $13.21
Your friend has invested $1,000 in each stock
  Results: $1,160, 830, 1,050, 1,130, 980
  Average is $1,030, std. dev. is $132.10
Coefficients of variation are identical
  13.21/103 = 132.10/1,030 = 0.128 = 12.8%
Typically, results for these 5 stocks were approximately 12.8% from their average value
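The portfolio example can be reproduced with a short Python sketch. It assumes the sample standard deviation (dividing by n-1), which matches the $13.21 figure above; the function name `coefficient_of_variation` is a hypothetical helper.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation divided by the average (a pure number)."""
    return stdev(data) / mean(data)

yours = [116, 83, 105, 113, 98]
friends = [1160, 830, 1050, 1130, 980]

cv_yours = coefficient_of_variation(yours)
cv_friends = coefficient_of_variation(friends)

print("your std. dev.:    ", round(stdev(yours), 2))
print("friend's std. dev.:", round(stdev(friends), 2))
print("your CV:    ", round(100 * cv_yours, 1), "%")
print("friend's CV:", round(100 * cv_friends, 1), "%")
```

The standard deviations differ by a factor of 10, but the size-adjusted variability is the same.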
Chapter 5 Questions
1. What is variability?
2.a. What is the traditional measure of variability?
b. What other measures are also used?
3.a. What is a deviation from the average?
b. What is the average of all of the deviations?
4.a. What is the standard deviation?
b. What does the standard deviation tell you about the relationship between individual data values and the average?
c. What are the measurement units of the standard deviation?
d. What is the difference between the sample standard deviation and the population standard deviation?
5.a. What is the variance?
b. What are the measurement units of the variance?
c. Which is the more easily interpreted variability measure, the standard deviation or the variance? Why?
d. Once you know the standard deviation, does the variance provide any additional real information about the variability?
6. If your data set is normally distributed, what proportion of the individuals do you expect to find:
a. Within one standard deviation from the average?
b. Within two standard deviations from the average?
c. Within three standard deviations from the average?
d. More than one standard deviation from the average?
e. More than one standard deviation above the average? (be careful)
7. How would your answers to question 6 change if the data were not normally distributed?
8.a. What is the range?
b. What are the measurement units of the range?
c. For what purpose is the range useful?
d. Is the range a very useful statistical measure of variability? Why or why not?
9.a. What is the coefficient of variation?
b. What are the measurement units of the coefficient of variation?
10. Which variability measure is most useful for comparing variability in two different situations, adjusting for the fact that the situations have very different average sizes? Justify your choice.
11. When a fixed number is added to each data value, what happens to:
a. The average, median and mode?
b. The standard deviation and range?
c. The coefficient of variation?
12. When each data value is multiplied by a fixed number, what happens to:
a. The average, median and mode?
b. The standard deviation and range?
c. The coefficient of variation?