statistics notes 1 data_plots and summaries

Upload: andrewcliu

Post on 30-May-2018


  • 8/14/2019 Statistics Notes 1 Data_Plots and Summaries

    1/178


    About this Course

    Below is a link to the course website. Please visit and bookmark this site NOW.

    faculty.chicagobooth.edu/alan.bester/teaching/

    You can also find the course website on Chalk or Google "business statistics bester".

    Everything you need to know is in the lecture notes. Everything you need for the class is on the course website.

    http://faculty.chicagobooth.edu/alan.bester/teaching/

    About These Notes

    You will find links to data sets, examples, and other things we talk about throughout the notes.

    Due to the name change I've had to change all the links from chicagogsb.edu to chicagobooth.edu. If you find one (in the notes or on the website) that doesn't work, try changing gsb to booth in the URL.

    Yes, there are a lot of slides. I like to restate things and limit the number of concepts per slide. This course is actually about a small number of big ideas that we will develop throughout the quarter.


    Notes 1: Data: Plots and Summaries

    1. Data
    2. Looking at a Single Variable
        2.1 Tables
        2.2 Histograms
        2.3 Dotplots
        2.4 Time Series Plots
    3. Summarizing a Single Numeric Variable
        3.1 The Mean and Median
        3.2 The Variance and Standard Deviation
        3.3 The Empirical Rule
        3.4 Percentiles, quartiles, and the IQR
    4. Looking at Two Variables
        4.1 Categorical variables: the Two-way table
        4.2 Numeric variables: Scatter Plots
        4.3 Relating Numeric and Categorical variables
    5. Summarizing Bivariate Relations
        5.1 In Tables
        5.2 Covariance and Correlation
    6. Linearly related variables
        6.1 Linear functions
        6.2 Mean and variance of a linear function
        6.3 Linear combinations
        6.4 Mean and variance of a linear combination
    7. Linear Regression
    8. Pivot Tables (Optional)


    1. Data

    Here is some data (our sample):

    age sex soc edu Reg inc cola restE juice cigs antiq news ender friend simp foot
     67   2   3   1   3  12    1     0     1    0     1    0     0      0    0    0
     51   2   3   8   3  10    1     1     0    1     1    0     1      1    0    0
     63   2   3   1   2  13    1     1     0    1     1    0     1      0    0    0
     45   2   4   3   1  18    1     1     1    0     1    0     0      0    0    0
    ...
    (many more rows!)

    The data is from a large survey carried out by a marketing research company in Britain. (Marketing data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/bmrbxl.xls)

    Each row corresponds to a household. Each column corresponds to a different feature of the household. The features are called variables.

    The rows are called observations.

    Most data sets come in this form.

    A rectangular array.

    Rows are observations. Columns are variables.

    Variables are the fundamental object in statistics. They come in several types.
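To make the rows-and-columns picture concrete, here is a minimal Python sketch that stores the four sample rows from the data slide as a rectangular array. The names `COLUMNS`, `RAW_ROWS`, `observations`, and `column` are illustrative choices, not part of the course materials.

```python
# Variable names (columns) as they appear in the sample shown in the notes.
COLUMNS = ["age", "sex", "soc", "edu", "Reg", "inc", "cola", "restE",
           "juice", "cigs", "antiq", "news", "ender", "friend", "simp", "foot"]

# The four sample rows (observations) shown in the notes.
RAW_ROWS = [
    [67, 2, 3, 1, 3, 12, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [51, 2, 3, 8, 3, 10, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    [63, 2, 3, 1, 2, 13, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0],
    [45, 2, 4, 3, 1, 18, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
]

# Each observation becomes a dict keyed by variable name.
observations = [dict(zip(COLUMNS, row)) for row in RAW_ROWS]


def column(name):
    """Return all values of one variable (one column), in row order."""
    return [obs[name] for obs in observations]


print(observations[0]["age"])  # one cell: row 1, variable "age"
print(column("Reg"))           # one variable across all rows
```

Reading across a row gives everything known about one household; reading down a column gives one variable for every household.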


    The variable labeled "age" is simply the age (in years) of the respondent.

    This is a numeric variable. This variable has units, and averages are interpretable.

    In contrast, the variable "Reg" is the geographical region of the household. Each "number" is really just a code for a region:

    1 "Scotland"
    2 "North West"
    3 "North"
    4 "Yorkshire & Humberside"
    5 "East Midlands"
    6 "East Anglia"
    7 "South East"
    8 "Greater London"
    9 "South West"
    10 "Wales"
    11 "West Midlands"

    A variable like Reg is called categorical.

    Think of:
    numeric vs. categorical
    quantitative vs. qualitative


    Instead of using numbers we could have used text strings in the data file. That is, instead of

    Reg: 3, 3, 2, 1, ...

    we could have

    Reg: North, North, North_West, Scotland, ...

    But it is extremely common to use numeric codes.

    Another example: Which Democratic candidate do you support?

    1 = Hillary Clinton, 2 = John Edwards, 3 = Barack Obama, 4 = Bill Richardson


    The variable soc is categorical. It takes on codes 1-6, with meanings:

    1 "A"

    2 "B"

    3 "C1"

    4 "C2"

    5 "D"

    6 "E"

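    This is an ordered categorical variable. You can't think of it as a numerical measure, but the categories are ordered: A, B, ..., E. (In the standard British social grade scale, A is actually the highest grade and E the lowest, so code 1 is the top of the scale.)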

    Soc is ordered like age, but does not have units. It does not really make sense to compute the difference or to average two soc measurements. It does make sense to difference two ages.


    That pretty much covers it. Variables are either numeric, categorical, or ordered categorical.

    Of course a numeric variable is always ordered.

    For numeric variables we also have:

    A variable is discrete if you can list its possible values. Otherwise it is called continuous.


    For example, the amount of rainfall in the City of Chicago this month is usually thought of as being continuous.

    As a practical matter, any variable is discrete since we put it in the computer. What it comes down to is, if there are a lot of possible values, we think of it as continuous. (This is not really that important now; it will be later when we get to probability.)

    For example, you might think of age as continuous even though we measure it in years and can easily list its possible values.

    Number of children is more likely to be thought of as discrete.


    Again, a good rule when working with a numeric variable is to keep in mind the units in which it is measured.

    For example, age has units of years.

    Percentages, which are numeric, don't have units.

    But there are always units somewhere. For example, if we look at the percentage of income a household spends on entertainment, we are looking at one quantity measured in units of currency divided by another.


    Here are the definitions of all the variables in the survey data set:

    age: age in years
    sex: 1 means male, 2 means female
    soc: we saw this
    edu: education, terminal age of education

        1 "14 Or Under"
        2 "15"
        3 "16"
        4 "17"
        5 "18"
        6 "19"
        7 "20"
        8 "21 - 23"
        9 "24 Or Over"

    Reg: we saw this.


    inc: income

    VARIABLE LABELS V_842 "Total Family Income Before Tax".
    VALUE LABELS V_842

    1 "1,999 Or Less"
    2 "2,000 - 2,999"
    3 "3,000 - 3,999"
    4 "4,000 - 4,999"
    5 "5,000 - 5,999"
    6 "6,000 - 6,999"
    7 "7,000 - 7,999"
    8 "8,000 - 8,999"
    9 "9,000 - 9,999"
    10 "10,000 - 10,999"
    11 "11,000 - 11,999"
    12 "12,000 - 14,999"
    13 "15,000 - 19,999"
    14 "20,000 - 24,999"
    15 "25,000 - 29,999"
    16 "30,000 - 34,999"
    17 "35,000 - 39,999"
    18 "40,000 - 49,999"
    19 "50,000 Or Over"
    20 "Not Stated"

    Note: Both edu and inc could have been numeric, but are broken down into ranges. They are thus ordered categorical.

    This is extremely common; with income there are actually good reasons for doing this!


    cola, restE, juice, cigs indicate use of a product category.

    1 if you use it, 0 if you don't.

    This is called a dummy variable. 1 indicates something "happened", 0 if not.

    So, cigs=1 means you purchase cigarettes. restE means "restaurants in the evening".

    This is extremely common. Often in statistics we are interested in "does something happen?".

    Another example is approval ratings (1=approve). We will work with a lot of dummy variables this quarter.


    The rest of the variables in the marketing data represent TV shows. They are dummies: 1 if you watch, 0 if you don't.

    antiq: Antiques Roadshow
    news: BBC News
    enders: EastEnders
    friend: Friends
    simp: The Simpsons
    foot: "football" (soccer)

    A dummy variable can take on two values, 0 or 1. We use dummy variables to indicate something: 1 if that something happened, 0 if it did not.


    Now we can see that there are three types of variablesin the data set.

    (i) Demographics: age through income
    (ii) Product category usage
    (iii) Media exposure (TV shows)

    What is the point? Why collect this data?

    We want to see how product usage relates to demographics. What kind of people drink colas?

    We want to see how the media relates to product usage so that we can select the appropriate media to advertise in. If Friends viewers tend to drink colas, that might be a good place to advertise your cola.


    Important Note:

    You can always take a numeric variable and make it an ordered categorical variable by using bins.

    For example, instead of treating age as a numeric variable it is common to break it into ranges, for example:

    0-20: a1
    21-30: a2
    31-40: a3
    41-50: a4
    51-60: a5
    61-70: a6
    >70: a7


    The simplest case is a dummy variable:

    $$ d = \begin{cases} 1 & \text{if } x > a \\ 0 & \text{if } x \le a \end{cases} $$

    where x is numeric.

    For example, you could define someone to be "old" if older than 40 and "young" otherwise.

    d=1 then means "old" and d=0 means "young".
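The binning and the dummy-variable rule can be sketched in a few lines of Python. This is an illustration only: the function names `age_category` and `dummy` are hypothetical, the bin labels a1 to a7 follow the age ranges given in the notes, and the cutoff of 40 is the "old"/"young" example.

```python
# Age ranges from the notes as (upper bound, label); ages above 70 get "a7".
AGE_BINS = [(20, "a1"), (30, "a2"), (40, "a3"), (50, "a4"), (60, "a5"), (70, "a6")]


def age_category(age):
    """Map a numeric age to its ordered category label."""
    for upper, label in AGE_BINS:
        if age <= upper:
            return label
    return "a7"


def dummy(x, a):
    """Return 1 if x is strictly greater than the cutoff a, else 0."""
    return 1 if x > a else 0


ages = [67, 51, 63, 45]  # ages from the four sample rows on the data slide
print([age_category(v) for v in ages])
print([dummy(v, 40) for v in ages])  # 1 means "old", 0 means "young"
```

Binning throws away information (two people aged 41 and 50 become indistinguishable), which is the price paid for a simpler, categorical variable.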


    2. Looking at a Single Variable

    The most interesting thing in statistics is understanding how variables relate to each other.

    "Friends watchers tend to drink colas".

    "Smokers tend to get cancer".

    But it is still very important to get a sense of what variables are like on their own.

    Note: We'll use the term distribution informally to talk about what a variable looks like (what does a typical value look like, how spread out are its values, etc.). We will use the term more formally when we study probability.


    2.1 Tables

    To look at a categorical variable we use a table:

    soc  count
    1    28
    2    151
    3    310
    4    235
    5    156
    6    120

    We simply count how many of each category we have.

    Note: We have 1000 observations total, so the numbers in this table must add to 1000.

    How to make this table: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_counttable.htm
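The course builds this table in Excel; as a rough equivalent, the counting step can be sketched in Python with `collections.Counter`. The list `soc` below is a made-up ten-household example, not the survey data; the dictionary `table` copies the counts shown in the notes so that the "must add to 1000" check can be run.

```python
from collections import Counter

# Hypothetical soc codes for ten households (not the survey data).
soc = [3, 2, 3, 4, 1, 3, 5, 6, 4, 3]

# Count how many observations fall in each category.
counts = Counter(soc)
for code in sorted(counts):
    print(code, counts[code])

# The counts always add up to the number of observations.
print(sum(counts.values()) == len(soc))

# The same check for the count table given in the notes.
table = {1: 28, 2: 151, 3: 310, 4: 235, 5: 156, 6: 120}
print(sum(table.values()))
```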

    2.2 Histograms

    We take a numeric variable, break it down into categories, and then plot the table as on the previous slide. Remember, the height of each bar = # of observations or frequency in that category.

    [Figure: Histogram for age. Vertical axis: frequency (0 to 120); horizontal axis: Category.]

    A category labeled 35-40 means the interval (35,40], that is, ages greater than 35 and less than or equal to 40.
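The (35,40] convention, open on the left and closed on the right, can be made explicit in a short sketch. The function `histogram_counts`, the `ages` list, and the bin `edges` are all hypothetical and chosen only to show where boundary values land.

```python
def histogram_counts(values, edges):
    """Count values per bin, where bin k is the interval (edges[k], edges[k+1]]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(edges) - 1):
            if edges[k] < v <= edges[k + 1]:
                counts[k] += 1
                break
    return counts


ages = [35, 36, 40, 41, 45, 50]  # hypothetical ages, including boundary values
edges = [30, 35, 40, 45, 50]     # bins (30,35], (35,40], (40,45], (45,50]
print(histogram_counts(ages, edges))
```

Here an age of exactly 35 is counted in (30,35], while an age of exactly 40 is counted in (35,40]. The bar heights in a histogram are exactly these counts.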


    [Figure: Histogram for InterarrivalTime. Vertical axis: frequency (0 to 70).]


    Here's a histogram of monthly hedge fund returns from 1994 to 2005. Notice anything interesting?

    [Figure: Histogram of monthly hedge fund returns.]

    Source: Nicolas P. B. Bollen and Veronika K. Pool, "Do Hedge Fund Managers Misreport Returns? Evidence from the Pooled Distributions"; original data from Center for International Securities and Derivatives Markets, University of Massachusetts.


    Aside: Histograms can be displayed in different ways

    The observations here are starting players in the NFL (on offense). The numbers on the vertical axis correspond to rounds of the NFL draft, while the length of each blue bar is the percentage of starting players drafted at that position (forget the red bars). The plots on the right show only quarterbacks and fullbacks. (Source: http://www.footballoutsiders.com/2006/04/24/ramblings/nfl-draft/3828/)

    "Aside" or "Optional" on a slide means you are not responsible for the material on that slide on an exam!

    Don't worry, all of our histograms will be like the previous two slides.

    2.3 Dotplots

    It can be a hassle choosing the bins for a numeric variable.

    For discrete variables and/or small data sets, we can just put a dot on the number line for each value.

    (Beer data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/beer.dat)

    nbeerm: the number of beers male MBA students claim they can drink without getting drunk

    nbeerf: same for females

    Note (1): Unfortunately StatPro doesn't do dotplots. The dotplots in these slides were done in Minitab.

    Note (2): The beer data is text, not Excel format. Use Text to Columns.

    .

    : :: :

    . . : : : :

    . . : . : : :.: : : : . .

    +---------+---------+---------+---------+---------+-------

    nbeerm

    . .. . : : .

    +---------+---------+---------+---------+---------+-------

    nbeerf

    0.0 4.0 8.0 12.0 16.0 20.0

    Generally the males claim they can drink more; their numbers are centered or located at larger values.

    Note: The dot plot is giving you the same kind of information as the histogram.

    We call a point like this (one that sits far from the rest of the data) an outlier.


    2.4 Time Series Plots

    The survey data is what we call cross-sectional. The households in our survey are a (hopefully representative) cross section of all British households at a particular point in time.

    In cross-sectional data, order doesn't matter. We can sort our households by age, social grade, etc. and none of our results change as long as we keep each row intact.

    Other examples would be samples where every row corresponded to a firm, a plant, a machine...

    With a time series, each observation corresponds to a point in time.


    Daily data on the Dow Jones index (Dow data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/DJI.xls):

    Date      Open     High     Low      Close    Volume
    1-May-00  10749.4  11001.3  10622.2  10811.8   9663000
    2-May-00  10805.6  10932.5  10580.7  10731.1  10115000
    3-May-00  10732.2  10754.4  10345.2  10480.1   9916000
    4-May-00  10478.9  10631.5  10293.1  10412.5   9258000
    ...

    For time series data, the order of observations matters. (1-May-00 comes before 2-May-00, etc.)

    The easiest way to visualize time series data is often simply to plot the series in time order.

    [Figure: Time series plot of the Close series. Vertical axis: Close (7800 to 11400); horizontal axis: Date (5/1/2000 to 4/1/2002).]

    How to make this plot: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_tsplot.htm

    We could have data at various frequencies: daily, monthly, quarterly, annual.

    The kinds of patterns you will uncover can be very different depending on the frequency of the data.

    A current hot topic of research at Booth is "high frequency data".


    [Figure: Time series plot of monthly US beer production. Vertical axis: b_prod (12 to 20); horizontal axis: Index (10 to 70).]

    Do you see a pattern?

    Would we see this pattern if we looked at annual data?


    Time series plot of monthly returns on a portfolio of Canadian assets (Country Portfolio returns: http://faculty.chicagobooth.edu/alan.bester/teaching/data/conret.xls):

    [Figure: Time series plot of canada. Vertical axis: returns (-0.1 to 0.1); horizontal axis: Index (20 to 100).]

    On the vertical axis we have returns. On the horizontal axis we have time.

    Do you see a pattern?

    Here is the histogram of the Canadian returns.

    [Figure: Two histograms of canada (vertical axis: Frequency, 0 to 30), drawn with different numbers of bins.]

    Notes:

    (i) The histogram does not depend on the time order.

    (ii) The appearance of the histogram depends on the number of bins. Too many bins makes the histogram appear spiky.


    Taken from David Greenlaw, Jan Hatzius, Anil Kashyap, and Hyun Shin, US Monetary Policy Forum Report No. 2, 2008 (http://faculty.chicagogsb.edu/anil.kashyap/research/MPFReport-final.pdf)

    Be careful. What pattern do you see in this series?

    How about now?

    Time series plots are also used to compare patterns across different variables over time, and sometimes to see the impact of past events (be very careful there, too).

    From same paper as the previous slide.


    3. Summarizing a Single Numeric Variable

    We have looked at graphs. Suppose we are now interested in having numerical summaries of the data rather than graphical representations.

    Two important features of any numeric variable are:

    1) What is a typical or average value?

    2) How spread out or variable are the values?


    The mean and median capture a typical value. The variance/standard deviation capture the spread.

    For example, we saw that the men tend to claim they can drink more.

    How can we summarize this?

    .

    : :

    : :

    . . : : : :

    . . : . : : :.: : : : . .

    +---------+---------+---------+---------+---------+-------nbeerm

    . .. . : : .

    +---------+---------+---------+---------+---------+-------

    nbeerf

    0.0 4.0 8.0 12.0 16.0 20.0


    Monthly returns on Canadian portfolio and Japanese portfolio.

    They seem to be centered roughly at the same place but Japan has more spread.

    How can we summarize this?


    3.1 The Mean and Median

    We will need some notation.

    Suppose we have n observations on a numeric variable which we call "x":

    $$ x_1, x_2, x_3, \ldots, x_n $$

    Here $x_1$ is the first number and $x_n$ is the last number. n is the number of numbers, or the number of observations. You may also hear it referred to as the sample size.

    $x_i$ is the value of x associated with the i-th observation (row).


    Here, x is just a name for the set of numbers; we could just as easily use y. In a real data set we would use a meaningful name like "age".

    x
    5   (x1)
    2
    8   (x3)
    6
    2

    n=5

    Sometimes the order of the observations means something.

    In our return data the first observation corresponds to the first time period. In the survey data, the order did not matter.


    The sample mean is just the average of the numbers x:

    $$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\text{sum}}{n} $$

    We often use the symbol $\bar{x}$ to denote the mean of the numbers x.

    We call it "x bar".


    Here is a more compact way to write the same thing.

    Consider

    $$ x_1 + x_2 + \cdots + x_n $$

    We use a shorthand for it (it is just notation):

    $$ \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n $$

    This is summation notation.


    Using summation notation we have:

    The sample mean:

    $$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
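The summation in the sample mean is just a loop that adds one value per observation. A minimal sketch, assuming the five-number example from the earlier slide reads 5, 2, 8, 6, 2 (the function name `sample_mean` is illustrative):

```python
def sample_mean(xs):
    """Sample mean: (1/n) times the sum of x_i for i = 1..n."""
    n = len(xs)
    total = 0.0
    for x_i in xs:  # one term of the sum per observation
        total += x_i
    return total / n


x = [5, 2, 8, 6, 2]  # assumed reading of the n=5 example in the notes
print(sample_mean(x))
```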


    Graphical interpretation of the sample mean

    Here are the dot plots of the beer data for women and men.

    Which group claims to be able to drink more?

    Character Dotplot

    . . . . : : .
    +---------+---------+---------+---------+---------+-------
    nbeerf

    .
    : :
    : :
    . . : : : :
    . . : . : : : . : : : : .
    +---------+---------+---------+---------+---------+-------
    nbeerm

    0.0 2.5 5.0 7.5 10.0 12.5

    In some sense, the men claim to drink more. To summarize this we can compute the average value for each group (men and women).

    Note: I deleted the outlier. I do not believe him!


    Mean of nbeerf = 4.2222

    Mean of nbeerm = 7.8625

    Character Dotplot

    . . . . : : .

    +---------+---------+---------+---------+---------+-------nbeerf

    .

    : :: :

    . . : : : :

    . . : . : : : . : : : : .

    +---------+---------+---------+---------+---------+-------nbeerm

    0.0 2.5 5.0 7.5 10.0 12.5

    On average women claim they can drink 4.2 beers. Men claim they can drink 7.9 beers.

    In the picture, I think of the mean as the center of the data (4.2 for nbeerf, 7.86 for nbeerm).

    How to calculate these means: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_beerexample.htm

    More on summation notation (take this as an aside)

    Let us look at summation in more detail.

    $$ \sum_{i=1}^{n} x_i $$

    means that for each value of i, from 1 to n, we add to the sum the value indicated, in this case $x_i$.


    To understand how it works let us consider some examples.

    x     y     year
    0.07  0.11  1
    0.06  0.05  2
    0.04  0.09  3
    0.03  0.03  4

    Think of each row as an observation on both x and y. To make things concrete, think of each row as corresponding to a year and let x and y be annual returns on two different assets.

    In year 1 asset x had return 7%. In year 4 asset y had return 3%.


    (here, we do not sum over all observations: we sum only over the second and the third observation).

    compute x bar.

    compute y bar.


    For each value of i, we can add in anything we want:

    $$ \sum_{i=1}^{4} (x_i - \bar{x})(y_i - \bar{y}) = (.02)(.04) + (.01)(-.02) + (-.01)(.02) + (-.02)(-.04) $$

    How to do these calculations using Excel: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_ssfunc.htm
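The course does these calculations in Excel; the same sums can be sketched in Python using the four-year x and y returns from the table in the notes. Which variable the "second and third observation" sum used is not recoverable from the slide, so summing x there is an assumption.

```python
x = [0.07, 0.06, 0.04, 0.03]  # annual returns on asset x, years 1-4
y = [0.11, 0.05, 0.09, 0.03]  # annual returns on asset y, years 1-4
n = len(x)

# Sample means: sum over all i from 1 to n, divided by n.
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum over only the second and third observations (i = 2, 3).
# Python lists are 0-indexed, so these are positions 1 and 2.
partial_sum_x = sum(x[i] for i in range(1, 3))

# "We can add in anything we want": here, products of deviations from the means.
cross = sum((x[i] - x_bar) * (y[i] - y_bar) for i in range(n))

print(round(x_bar, 4), round(y_bar, 4))
print(round(partial_sum_x, 4))
print(round(cross, 6))
```

The deviations of x from its mean are .02, .01, -.01, -.02 and those of y are .04, -.02, .02, -.04, which are exactly the factors in the expression in the notes.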

    The median

    After ordering the data, the median is the middle value of the data. If there is an even number of data points, the median is the average of the two middle values.

    Example

    1,2,3,4,5    Median = 3
    1,1,2,3,4,5  Median = (2+3)/2 = 2.5
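The two-case definition of the median translates directly into code. A minimal sketch (the function name `median` is illustrative; Python's standard library also provides `statistics.median`):

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 2, 3, 4, 5]))
print(median([1, 1, 2, 3, 4, 5]))
```

Note that the data must be sorted first; the median of 5, 3, 1, 4, 2 is still 3.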


    Mean versus median

    Although both the mean and the median are good measures of the center of a distribution of measurements, the median is less sensitive to extreme values.

    The median is not affected by extreme values, since only the middle of the ordered data enters its computation; how far out the largest and smallest measurements lie does not matter.

    Example

    1,2,3,4,5    Mean: 3   Median: 3
    1,2,3,4,100  Mean: 22  Median: 3


    If data is right skewed the mean will be bigger than the median. You can think of this as the extreme right tail observations pulling the mean upward.

    For the bank interarrival data:

    Summary measures for selected variables

            InterarrivalTime
    Mean    4.163
    Median  2.779

    [Figure: Histogram for InterarrivalTime. Vertical axis: frequency (0 to 70).]


    Median or Mean?

    At Booth professors are rated by students from 1-5 in several categories. In the past only the mean rating was reported.

    Some faculty members believe the median should be reported instead. This was actually a major debate at a faculty meeting a few years ago.

    What difference would this make?

    In fact, Booth now reports the mean and median, along with a histogram of all the ratings!


    The Mean of a Dummy Variable

    Consider the "simpsons" variable in the survey data set. Does it make sense to take the mean?

    Summary measures for selected variables

           simpsons
    Count  1000.000
    Mean   0.181

    The sum of the 1's and 0's will equal the number of respondents who watch The Simpsons.

    So the mean is the fraction of respondents who watch.


    So, in general, the average of a dummy gives the fraction of times that whatever dummy=1 signals happens (multiply by 100 for a percentage).

    Another example: if a poll is conducted about a particular candidate where 1=approval, 0=disapproval, then the sample mean is the candidate's approval rating.

    This may seem obvious, but we will get a lot of use out of this idea throughout the quarter.
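A quick way to convince yourself: build a column of 1's and 0's and average it. The list `simp` below is constructed, not taken from the survey file; it simply contains 181 ones and 819 zeros so that it matches the Count (1000) and Mean (0.181) reported for the Simpsons variable.

```python
# Hypothetical dummy column: 181 viewers (1) and 819 non-viewers (0).
simp = [1] * 181 + [0] * 819

n = len(simp)
watchers = sum(simp)  # adding up 1's and 0's counts the 1's
mean = watchers / n   # so the mean is the fraction of 1's

print(n, watchers, mean)
```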



    3.2 The Variance and Standard Deviation

    The mean and the median give us information about the central tendency of a set of observations, but they shed no light on the dispersion, or spread, of the data.

    Example: Which data set is more variable?

    5,5,5,5,5  Mean: 5
    1,3,5,8,8  Mean: 5

    If these were portfolio returns (in percent), means are average returns. What else might we want to measure?


    The Sample Variance

    . . . .

    -+---------+---------+---------+---------+---------+-----x

    . . . .

    -+---------+---------+---------+---------+---------+-----y

    0.030 0.045 0.060 0.075 0.090 0.105

    The y numbers are more spread out than the x numbers. We want a numerical measure of variation or spread.

    The basic idea is to view variability in terms of distance between each measurement and the mean:

    $$ x_i - \bar{x} $$


    . . . .

    -+---------+---------+---------+---------+---------+-----x

    . . . .

    -+---------+---------+---------+---------+---------+-----y

    0.030 0.045 0.060 0.075 0.090 0.105

    Overall, the distances from the mean for the x numbers are smaller than those for the y numbers.


    We cannot just look at the distance between each measurement and the mean. We need an overall measure of how big the differences are (i.e., just one number like in the case of the mean).

    Also, we cannot just sum the individual distances because the negative distances cancel out with the positive ones, giving zero always (Why?).

    The average squared distance would be

    $$ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
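One way to see why the signed distances always sum to zero uses only the definition of the sample mean, which says that the sum of the $x_i$ equals $n\bar{x}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Squaring each distance before averaging removes the signs, so the cancellation no longer happens.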


    So, the sample variance of the x data is defined to be:

    Sample variance:

    $$ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

    We use n-1 instead of n for technical reasons that will be discussed later (and because Excel does it this way).

    Think of it as the average squared distance of the observations from the mean.


    Questions

    1) What is the smallest value a variance can be?

    2) What are the units of the variance?

    It is helpful to have a measure of spread which is in the original units. The sample variance is not in the original units. We now introduce a measure of dispersion that solves this problem: the sample standard deviation.


    The sample standard deviation

    It is defined as the square root of the sample variance (easy).

    The sample standard deviation:

    $$ s_x = \sqrt{s_x^2} $$

    The units of the standard deviation are the same as those of the original data.
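Both definitions are short enough to compute by hand or in a few lines of code. The sketch below applies them to the four-year x and y returns from the summation example earlier in the notes; the function names `sample_variance` and `sample_sd` are illustrative.

```python
import math


def sample_variance(xs):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x_i - x_bar) ** 2 for x_i in xs) / (n - 1)


def sample_sd(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]
print(sample_variance(x), sample_sd(x))
print(sample_variance(y), sample_sd(y))
```

The y returns have the larger standard deviation, which matches the picture: the y values are more spread out around their mean than the x values. A data set with no variation at all, such as 5,5,5,5,5, has variance zero.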


    Example 2 (graphical)

    The sample standard deviation for the y data is bigger than that for the x data.

    This numerically captures the fact that y has more variation about its mean than x.


    [Character dotplots of the canada and japan returns on a common axis from -0.160 to 0.240]

    Variable   N     Mean      StDev
    canada     107   0.00907   0.03833
    japan      107   0.00234   0.07368

    The standard deviations measure the fact that there is more spread in the Japanese returns.


    3.3 The Empirical Rule

    We now have two numerical summaries for the data:

    $\bar{x}$: where the data is
    $s_x$: how spread out, how variable the data is

    The mean is pretty easy to interpret (some sort of center of the data).

    We know that the bigger $s_x$ is, the more variable the data is, but how do we really interpret this number?

    What is a big $s_x$, what is a small one?


    The empirical rule will help us understand $s_x$ and relate the numerical summaries back to our plots.

    Empirical Rule

    For mound shaped data:

    Approximately 68% of the data is in the interval

    $$(\bar{x} - s_x,\ \bar{x} + s_x) = \bar{x} \pm s_x$$

    Approximately 95% of the data is in the interval

    $$(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$$

    We can see this on a histogram of the Canadian returns.

    $\bar{x} = .00907$, $s_x = .03833$

    [Histogram for canada, with dotted lines at $\bar{x} \pm s_x$ and dashed lines at $\bar{x} \pm 2s_x$]

    The empirical rule says that roughly 95% of the observations are between the dashed lines and roughly 68% between the dotted lines.

    Looks reasonable.
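Here is a small sketch, in Python, of how one might check the empirical rule numerically. I do not have the actual Canadian returns in these notes, so the data below is simulated from a mound-shaped (normal) distribution with a similar mean and spread; that choice, the seed, and the helper name `share_within` are all assumptions made for illustration. The two intervals at the bottom do use the Canadian summary numbers quoted above.

```python
import random
import statistics


def share_within(values, k):
    """Fraction of observations strictly inside mean +/- k standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = [v for v in values if mean - k * sd < v < mean + k * sd]
    return len(inside) / len(values)


# Simulated mound-shaped data (an assumption, not the course data set).
random.seed(0)
simulated = [random.gauss(0.009, 0.038) for _ in range(10_000)]

within_1 = share_within(simulated, 1)  # should land near 68%
within_2 = share_within(simulated, 2)  # should land near 95%

# Intervals implied by the Canadian summary statistics quoted in the notes.
x_bar, s_x = 0.00907, 0.03833
one_sd = (x_bar - s_x, x_bar + s_x)
two_sd = (x_bar - 2 * s_x, x_bar + 2 * s_x)
```

For data that is skewed or has several humps, the two shares can be quite different from 68% and 95%, which is why the rule is stated for mound shaped data only.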

    [Time series plot of canada against Index, with horizontal lines at $\bar{x}$, $\bar{x} + 2s_x$, and $\bar{x} - 2s_x$]

    Same thing viewed from the perspective of the time series plot.

    n=108, so 5% outside would be about 5 points.

    There are 4 points outside, which is pretty close.

    A little finance: comparing mutual funds

    Let us use the means and standard deviations to compare mutual funds. For 9 different assets we compute the means and standard deviations. Later, we plot the means versus the standard deviations.

    The assets are:

    #c1 - R22 Drefus (growth)
    #c2 - R30 Fidelity Trend fund (growth)
    #c3 - R55 Keystone Speculative fund (max capital gain)
    #c4 - R92 Putnam Income Fund (income)
    #c5 - R99 Scudder Income
    #c6 - R129 Windsor Fund (growth)
    #c7 - equally weighted market
    #c8 - value weighted market
    #c9 - tbill rate

    # sample period monthly returns 1:68 - 12-82

    Variable   N     Mean      StDev
    drefus     180   0.00677   0.04724
    fidel      180   0.00470   0.05659
    keystne    180   0.00654   0.08424
    Putnminc   180   0.00552   0.03008
    scudinc    180   0.00443   0.03597
    windsor    180   0.01002   0.04864
    eqmrkt     180   0.01082   0.06856
    valmrkt    180   0.00681   0.04800
    tbill      180   0.00598   0.00252

    The speculative fund (keystne) has a higher mean and standard deviation than the income fund (Putnminc).

    Later we'll see how to look at this information graphically.

    3.4 Percentiles, quartiles, and the IQR

    Again, this just applies to numeric variables.

    The 10th percentile is the number such that 10% of the values are less than it and 90% are bigger.

    The median is the 50th percentile.

    Percentiles are also known as quantiles.

    95th percentile, .95 quantile, and 95% quantile all mean the same thing.
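A quick sketch of computing percentiles in Python follows. The ages are made up (they are not the survey data), and software packages differ slightly in how they interpolate between observations, so treat the exact cut points as one convention among several. Here I assume the "inclusive" method of the standard library's `statistics.quantiles`.

```python
import statistics

# Hypothetical ages, already sorted for readability.
ages = [22, 25, 28, 31, 35, 40, 44, 48, 52, 57, 60, 63, 68, 71, 75]

# n=10 asks for the nine cut points that split the data into ten equal parts.
deciles = statistics.quantiles(ages, n=10, method="inclusive")

p10 = deciles[0]  # 10th percentile, i.e. the .10 quantile
p50 = deciles[4]  # 50th percentile, i.e. the median
p90 = deciles[8]  # 90th percentile, i.e. the .90 quantile

# Roughly 10% of the values should sit below the 10th percentile.
share_below_p10 = sum(a < p10 for a in ages) / len(ages)
```

With only 15 observations the share below the 10th percentile cannot be exactly 10%, which is the same "strictly less than vs. less than or equal to" fuzziness that shows up in any small sample.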

    Summary measures for selected variables

                       age
    Count              1000.000
    5th percentile       25.000
    10th percentile      28.000
    90th percentile      71.000
    95th percentile      75.000

    For the age variable in the survey data:

    5% of the 1000 age values are less than 25.

    90% of people in the sample are less than 71 years old.

    5% of the people in the sample are over 75 years of age.

    For now don't worry about strictly less than vs. less than or equal to.

    The first, second, and third quartiles are the 25th, 50th, and 75th percentiles.

    The interquartile range is the difference between the third and first quartile.

    Summary measures for selected variables

                          age
    Count                 1000.000
    Mean                    48.312
    Median                  48.000
    Standard deviation      15.718
    Variance               247.062
    First quartile          35.000
    Third quartile          60.000
    Interquartile range     25.000

    The interquartile range is used as a measure of spread (IQR is to variance as median is to mean).

    [Histogram for age, with counts from 0 to 120 on the vertical axis and the first quartile (35 years) marked]

    We can interpret quantiles graphically on the histogram. 25% of the area of the colored bars is to the left of the first quartile.

    The empirical rule is actually a statement about quantiles.

    What does it say? For a variable with a mound shaped histogram:

    What quantile is two standard deviations below the mean?   2.5%

    What quantile is one standard deviation above?   84%

    To see this yourself, draw the picture! We'll learn later that the empirical rule is based on a very important probability model.
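If drawing the picture is not convincing enough, the arithmetic can be sketched as follows. The one extra assumption is that a mound shaped histogram is roughly symmetric, so the observations falling outside an interval split evenly between the two tails.

```latex
\begin{aligned}
\text{share below } \bar{x} - 2s_x &\approx \tfrac{1}{2}\,(1 - 0.95) = 0.025 \\
\text{share below } \bar{x} + s_x &\approx 0.68 + \tfrac{1}{2}\,(1 - 0.68) = 0.84
\end{aligned}
```

So two standard deviations below the mean is roughly the 2.5% quantile, and one standard deviation above the mean is roughly the 84% quantile.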


    [Figure 3. Indexed Real Wages for Men by Percentile, 1967-1997: the 10th percentile (o), 50th percentile (+), and 90th percentile plotted by year, on an index scale from 0.90 to 1.30]

    Aside: We won't use percentiles much in this class, but above is an interesting time series plot of the 90th (top line), median (middle line), and 10th percentiles of real wages in the U.S. from the late 1960s to late 1990s. This widening income gap is a major concern for economists... or is it?

    Source: Murphy, Kevin and Finis Welch, "Wage Differentials in the 1990s: Is the Glass Half-full or Half-empty?"

    http://freakonomics.blogs.nytimes.com/2008/05/19/shattering-the-conventional-wisdom-on-growing-inequality/
    http://www.footballoutsiders.com/2006/04/24/ramblings/nfl-draft/3828/

    4. Looking at Two Variables

    While it is important to look at variables one at a time, many interesting business problems concern how two (or more) variables are related to each other.

    4.1 Categorical variables: the Two-way Table

    Let's look at the relationship between two categorical variables, x and y.

    If x has two categories and y has two as well, then there are four categories using both x and y.

    We can then just count the number of observations in each category.

    If x has r1 categories and y has r2, then we have r1*r2 possibilities. We can arrange these possibilities in a two-way table.

    This is the two-way table relating viewership of the simpsons with cola use. 146 of the 1000 view simpsons and consume colas.

    Raw counts:

                  simpsons
    colas         0     1     Grand Total
    0             387   35    422
    1             432   146   578
    Grand Total   819   181   1000

    Percent of total:

                  simpsons
    colas         0        1        Grand Total
    0             38.70%   3.50%    42.20%
    1             43.20%   14.60%   57.80%
    Grand Total   81.90%   18.10%   100.00%

    Percent of column:

    Count of colas   simpsons
    colas            0      1      Grand Total
    0                47%    19%    42%
    1                53%    81%    58%
    Grand Total      100%   100%   100%

    Percent of row:

    Count of colas   simpsons
    colas            0      1      Grand Total
    0                92%    8%     100%
    1                75%    25%    100%
    Grand Total      82%    18%    100%

    How to make these tables: http://gsbwww.uchicago.edu/fac/alan.bester/teaching/notes/n1_2waytable.htm

    A picture of the table:
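The four versions of the table all come from the same four raw counts. Here is a minimal sketch in Python of how each set of percentages is obtained; the dictionary layout and variable names are my own choices, and the course itself builds these tables in Excel.

```python
# Raw counts from the two-way table; keys are (colas, simpsons).
counts = {
    (0, 0): 387, (0, 1): 35,
    (1, 0): 432, (1, 1): 146,
}

total = sum(counts.values())

# Marginal totals: one per colas row, one per simpsons column.
row_totals = {c: sum(v for (ci, _), v in counts.items() if ci == c) for c in (0, 1)}
col_totals = {s: sum(v for (_, si), v in counts.items() if si == s) for s in (0, 1)}

# Each cell divided by the grand total, its row total, or its column total.
pct_of_total = {cell: v / total for cell, v in counts.items()}
pct_of_row = {(c, s): v / row_totals[c] for (c, s), v in counts.items()}
pct_of_column = {(c, s): v / col_totals[s] for (c, s), v in counts.items()}
```

For instance, `pct_of_column[(1, 1)]` is the share of simpsons viewers who consume colas, while `pct_of_row[(1, 1)]` is the share of cola consumers who view the simpsons. Same cell, different denominators, very different numbers.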
    [Stacked bar chart of counts (0 to 900) for simpsons = 0 and simpsons = 1, with each bar split by colas (0, 1)]

    A much higher fraction of the simpsons viewers consumes colas.

    4.2 Numeric variables: Scatter Plots

    For two numeric variables we have the scatter plot.

    Is the number of beers you can drink related to your weight?

    nbeer   weight
    12.0    192
    12.0    160
    5.0     155
    5.0     120
    7.0     150
    13.0    175
    4.0     100
    12.0    165
    12.0    165
    12.0    150
    .       .
    .       .
    .       .

    How are they related?

    Each row is an observation corresponding to a person.

    Each person has two numbers associated with him/her, # beers and weight.

    [Scatterplot of nbeer (vertical axis, 0 to 20) against weight (horizontal axis, 100 to 200), with one point circled]

    You can think of a scatterplot as a 2D dotplot. Each point corresponds to an observation: weight determines the position on the horizontal axis, nbeer on the vertical.

    Notice our outlier is back (circled)... and is he really an outlier?!

    In addition to relating two variables, a scatterplot also gives you all the information you'd get from a dotplot of either variable.

    [Scatterplot of nbeer (vertical axis, 0 to 20) against weight (horizontal axis, 100 to 200)]

    Imagine the dots on the scatterplot being pulled downward by gravity: you'd get a dotplot of weight!

    Same idea for nbeer, though the vertical axis can be a little harder to picture (Hint: rotate the paper).

    Sample Exam Question

    The sample mean of weight is

    (i) 105 (ii) 130 (iii) 155 (iv) 180

    The sample SD of weight is around 28, so roughly 68% of observations between 127 and 183 pounds.

    Example

    Are returns on a mutual fund related to market returns?

    [Scatterplot of windsor (vertical axis) against valmrkt (horizontal axis), both axes from -0.1 to 0.2]

    Each point corresponds to a month.

    Like the histogram, scatterplots can also be used with time series data, and the resulting plot does not depend on the time ordering.

    Example

    Here's another example of an outlier. This data is from a poker website that went through a major cheating scandal.

    [Scatterplot of WINRATE and VPIP]

    A similar scandal surfaced recently. Is the evidence as compelling?

    http://www.msnbc.msn.com/id/26563848/

    In finance we often use a different type of 2-D plot to compare asset returns. Here each point is a mutual fund. The horizontal and vertical location of each point reflects the sample standard deviation and sample mean of its returns within the same sample period.

    [Plot of Mean (vertical axis, 0.004 to 0.011) against StDev (horizontal axis, 0.00 to 0.09), with one labeled point per asset: tbill, valmrkt, eqmrkt, windsor, scudinc, Putnminc, keystne, fidel, drefus]

    If you're a fund manager, where do you want to be on this plot?

    Let us compare some countries (Country returns data).

    Based on monthly returns from 88 to 96.

    [Plot of Mean (vertical axis, 0.00 to 0.02) against StDev (horizontal axis, 0.03 to 0.08), with one labeled point per country: singapor, usa, japan, italy, honkong, germany, france, finalnd, canada, belgium, australi]

    http://faculty.chicagobooth.edu/alan.bester/teaching/data/conret.xls

    4.3 Relating a Numeric to a Categorical variable

    How do you plot a numeric variable vs a categorical variable?

    This is not so obvious.

    An easy thing to do is make the numeric variable categorical by binning it, like we did when making a histogram.

    Cigarette usage and age:

                  cigs
    age           0         1        Grand Total
    16-25         50.98%    49.02%   100.00%
    26-35         63.64%    36.36%   100.00%
    36-45         67.69%    32.31%   100.00%
    46-55         64.76%    35.24%   100.00%
    56-65         79.76%    20.24%   100.00%
    66-75         91.13%    8.87%    100.00%
    76-85         88.10%    11.90%   100.00%
    86-95         100.00%   0.00%    100.00%
    Grand Total   71.20%    28.80%   100.00%

    [Stacked bar chart of the row percentages above for each age group (16-25 through 86-95), with each bar split by cigs (0, 1)]

    Quick: what is the relationship between age and cigarette usage?

    Plots are a great way to identify patterns, but careful... How strong is the evidence?

    5.1 In Tables

    There does not seem to be a standard way to summarize the strength of the relationship in a table.

    Sometimes I use the difference between a marginal proportion and a conditional proportion.

                  simpsons
    colas         0        1        Grand Total
    0             38.70%   3.50%    42.20%
    1             43.20%   14.60%   57.80%
    Grand Total   81.90%   18.10%   100.00%

                  simpsons
    colas         0         1         Grand Total
    0             47.25%    19.34%    42.20%
    1             52.75%    80.66%    57.80%
    Grand Total   100.00%   100.00%   100.00%

    In this case it would be: |.578 - .8066| = .2286

    The difference between the percent of cola drinkers and percent of simpsons viewers that are cola drinkers.
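To see where the two proportions come from, here is the same calculation written out with the raw counts from the two-way table earlier in these notes (578 cola drinkers out of 1000 people; 146 cola drinkers among the 181 simpsons viewers):

```latex
\left|\,\frac{578}{1000} - \frac{146}{181}\,\right|
  = |\,0.578 - 0.8066\,| \approx 0.2286
```

The first fraction is the marginal proportion of cola drinkers; the second is the proportion of cola drinkers conditional on viewing the simpsons.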

    5.2 Covariance and Correlation

    In the beer data (beers vs weight) and mutual fund data (windsor vs valmrkt), it looks like there is a relationship.

    Even more, the relationship looks linear in that it looks like we could draw a line through the plot to capture the pattern.

    Covariance and correlation summarize how strong a linear relationship there is between two variables.

    In our first example weight and # beers were two variables. In our second example our two variables were two kinds of returns.

    In general, we think of the two variables as x and y.

    The sample covariance between x and y:

    $$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

    The sample correlation between x and y:

    $$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

    So, the correlation is just the covariance divided by the two standard deviations. What are the units?

    We will get some intuition about these formulae, but first let us see them in action. How do they summarize data for us? Let us start with the correlation.

    Correlation, the facts of life:

    $$-1 \le r_{xy} \le 1$$

    The closer r is to 1 the stronger the linear relationship is with a positive slope. When one goes up, the other tends to go up.

    The closer r is to -1 the stronger the linear relationship is with a negative slope. When one goes up, the other tends to go down.

    The correlations corresponding to the two scatter plots we looked at are:

    Correlation of valmrkt and windsor = 0.923

    Correlation of nbeer and weight = 0.692

    The larger correlation between valmrkt and windsor indicates that the linear relationship is stronger.

    [Scatterplot of windsor against valmrkt; scatterplot of nbeer against weight]

    Let us look at some more examples.

    [Scatterplot of y1 against x1] Correlation of y1 and x1 = 0.019

    [Scatterplot of y2 against x2] Correlation of y2 and x2 = 0.995

    [Scatterplot of y3 against x3] Correlation of y3 and x3 = 0.586

    [Scatterplot of y4 against x4] Correlation of y4 and x4 = -0.982

    [Scatterplot of y5 against x5] Correlation of y5 and x5 = 0.210

    IMPORTANT: Correlation only measures linear relationships (here the value is small but there is a strong nonlinear relationship between y5 and x5).
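To make the point concrete, here is a small sketch in Python with made-up data (not the x5 and y5 from the plot). The y values are an exact function of x, so the relationship could not be stronger, yet the correlation comes out to zero because the relationship is not linear. The function name `correlation` is my own.

```python
def correlation(xs, ys):
    """Sample correlation: covariance divided by the two standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = (sum((x - x_bar) ** 2 for x in xs) / (n - 1)) ** 0.5
    s_y = (sum((y - y_bar) ** 2 for y in ys) / (n - 1)) ** 0.5
    return s_xy / (s_x * s_y)


# Hypothetical data: y is completely determined by x, but through a parabola.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
```

The positive contributions on the right of the plot are cancelled by the negative ones on the left, so always look at the scatterplot before trusting a small correlation.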

    Example: The country data

    Which countries go up and down together? I have data on 23 countries. That would be a lot of plots!

    [Scatterplot of canada (vertical axis) against usa (horizontal axis), both axes from -0.1 to 0.1]

    The correlation matrix is a table of all sample correlations between each possible pair of a set of variables.

               australi  belgium  canada  finalnd  france  germany  honkong  italy
    belgium    0.189
    canada     0.507     0.357
    finalnd    0.387     0.183    0.386
    france     0.275     0.734    0.342   0.176
    germany    0.226     0.691    0.302   0.304    0.709
    honkong    0.334     0.301    0.558   0.355    0.359   0.339
    italy      0.159     0.367    0.334   0.389    0.352   0.465    0.261
    japan      0.251     0.418    0.271   0.307    0.421   0.318    0.219    0.426
    usa        0.360     0.429    0.651   0.264    0.501   0.372    0.429    0.240
    singapor   0.409     0.355    0.478   0.391    0.408   0.467    0.647    0.416

               japan     usa
    usa        0.246
    singapor   0.407     0.473

    Why is this blank?

    StatPro will also make the covariance matrix, which displays covariances with variances on the diagonal.

    Make this table in StatPro: http://faculty.chicagobooth.edu/alan.bester/teaching/notes/n1_statprosumstats.htm

    Understanding the covariance and correlation formulae

    How do these weird looking formulae for covariance and correlation capture the relationship?

    To get a feeling for this, let us go back to the simple example and compute covariance and correlation.

    x      y
    0.07   0.11
    0.06   0.05
    0.04   0.09
    0.03   0.03

    First, let us compute the covariance (which is a necessary ingredient to compute the correlation):

    $$\begin{aligned}
    \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
    &= \frac{1}{3}\big((.07-.05)(.11-.07) + (.06-.05)(.05-.07) + (.04-.05)(.09-.07) + (.03-.05)(.03-.07)\big) \\
    &= \frac{1}{3}\big(.02 \cdot .04 + .01 \cdot (-.02) + (-.01) \cdot .02 + (-.02) \cdot (-.04)\big) \\
    &= \frac{1}{3}(.0008 - .0002 - .0002 + .0008) \\
    &= \frac{1}{3}(.0012) = .0004
    \end{aligned}$$

    Each of the 4 points makes a contribution to the sum. Let us see which point does what.

    [Scatterplot of y against x, with lines at $\bar{x}$ and $\bar{y}$ dividing the plot into four regions labeled (I), (II), (III), and (IV)]

    $(x_1-\bar{x})(y_1-\bar{y}) = .02 \cdot .04 = .0008$
    $(x_2-\bar{x})(y_2-\bar{y}) = .01 \cdot (-.02) = -.0002$
    $(x_3-\bar{x})(y_3-\bar{y}) = (-.01) \cdot .02 = -.0002$
    $(x_4-\bar{x})(y_4-\bar{y}) = (-.02) \cdot (-.04) = .0008$

    Points in (I) have both x and y bigger than their means so we get a positive contribution to the covariance.
    Points in (III) have both x and y less than their means so we get a positive contribution to the covariance.
    In (II) and (IV) one of x and y is less than its mean and the other is greater so we get a negative contribution.

    The further out the point is, the bigger the contribution.
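The whole calculation for the four-point example can be sketched in a few lines of Python. The helper names below are mine; the data is the simple x and y example from these notes. The sketch also finishes the job by dividing by the two standard deviations to get the correlation.

```python
def sample_sd(values):
    """Sample standard deviation, dividing by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5


xs = [0.07, 0.06, 0.04, 0.03]
ys = [0.11, 0.05, 0.09, 0.03]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# One contribution per point: positive when x and y are on the same side
# of their means, negative when they are on opposite sides.
contributions = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]

s_xy = sum(contributions) / (n - 1)            # sample covariance
r_xy = s_xy / (sample_sd(xs) * sample_sd(ys))  # sample correlation
```

Two of the four contributions are positive and larger in size than the two negative ones, which is why both the covariance and the correlation end up positive.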

    [Scatterplot of windsor against valmrkt, annotated "Lots of positive contributions" in two regions and "just a few relatively small contributions" in the other two]

    We saw before that this mutual fund's returns are positively correlated with the market.

    The sign of the correlation contains the same information as the sign of the covariance (in fact, they have the same sign because the standard deviations are always positive).

    Positive sign: positive relationship
    Negative sign: negative relationship

    The correlation can be more informative, though, because it is unit-less (always between -1 and 1), by construction. Hence, it is a more easily interpretable measure of the strength of the relationship.

    Close to 1: strong positive relationship

    Close to -1: strong negative relationship
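Here is a rough sketch in Python of what "unit-less" buys us. It uses only the ten rows of the beer data displayed in these notes (not the full data set, so the correlation will not match the 0.692 reported earlier), and converts weight from pounds to kilograms. The conversion factor and the helper names are my additions.

```python
def covariance(xs, ys):
    """Sample covariance, dividing by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the two standard deviations."""
    s_x = covariance(xs, xs) ** 0.5  # the covariance of x with itself is its variance
    s_y = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (s_x * s_y)


# The ten displayed rows of the beer data (a subset of the full sample).
nbeer = [12.0, 12.0, 5.0, 5.0, 7.0, 13.0, 4.0, 12.0, 12.0, 12.0]
weight_lb = [192, 160, 155, 120, 150, 175, 100, 165, 165, 150]

LB_TO_KG = 0.4536  # approximate conversion factor
weight_kg = [w * LB_TO_KG for w in weight_lb]

cov_lb = covariance(nbeer, weight_lb)   # units: beers x pounds
cov_kg = covariance(nbeer, weight_kg)   # units: beers x kilograms
r_lb = correlation(nbeer, weight_lb)
r_kg = correlation(nbeer, weight_kg)
```

Changing the units rescales the covariance by the conversion factor, but the correlation is unchanged, which is exactly why it is easier to interpret.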

    6. Linearly Related Variables

    We have studied data sets that display some kind of relation between variables (the mutual fund returns and the market returns, for instance).

    Sometimes there is an exact linear relation between variables:

    y = c0 + c1 x

    In this linear relationship, c0 is called the intercept. c1 is called the slope.

    Suppose we had started with x and we already knew its sample mean and variance.

    Can we figure out the sample mean and variance of the new variable, y?

    6.1 Linear functions

    Example

    Suppose we have a sample of temperatures in Celsius and we convert them to Fahrenheit.

    fahr = 32 + (9/5) * cel

    cel   fahr
    10    50
    15    59
    20    68
    25    77
    40    104
    30    86
    50    122
    70    158

    How are the cel values related to the fahr values?

    Note that $\overline{cel} = 32.5$ and $s_{cel} = 20$.

    We could find $\overline{fahr}$ and $s_{fahr}$ using a spreadsheet.

    http://faculty.chicagobooth.edu/alan.bester/teaching/data/celfahr.xls

    Note: if we make a scatter plot of fahr versus cel, what do we see?

    [Scatterplot of fahr (vertical axis, 50 to 150) against cel (horizontal axis, 10 to 70)]

    Correlation of cel and fahr = 1.000

    The variable y is a linear function of the variable x if:

    $$y = c_0 + c_1 x$$

    In general, we like to use the symbols y and x for the two variables.

    $c_0$: the intercept
    $c_1$: the slope

    We think of the c's as constants (fixed numbers) while x and y vary.

    Example

    Suppose your client is a movie star. She has a deal which pays her a $10 million fee per movie + 10% of the gross ticket revenues.

    How is our star's income related to the gross?

    Let I denote income. Let G denote Gross.

    I = 10 + .1 G

    Note: Don't forget units! When we write it this way we need to make sure all our numbers are in millions of dollars.

    6.2 Mean and variance of a linear function


    Suppose y (i.e., each value of the variable y) is a linear function of x.

    How are the mean and variance (standard deviation) of y related to those of x?

    Let us look at our temperature example.

    Suppose we first multiply by (9/5) and then add 32.

    mul = 9/5 * cel
    fahr = 32 + mul
         = 32 + (9/5)*cel

    [Character dotplots of cel, mul, and fahr on a common axis from 0 to 150]

    Variable   Mean    StDev
    cel        32.50   20.00
    mul        58.5    36.0
    fahr       90.5    36.0

    Interpret

    When we multiply cel by 9/5 we affect (increase) both the mean and the standard deviation proportionally.

    If we add a constant (32 in our case) we simply increase the mean (by the value of the constant) but leave the overall dispersion unaffected.

    Sample mean and variance of a linear function

    Suppose

    $$y = c_0 + c_1 x$$

    Then,

    $$\bar{y} = c_0 + c_1 \bar{x}$$

    $$s_y^2 = c_1^2 s_x^2$$

    $$s_y = |c_1| s_x$$

    Example

    So, instead of using a spreadsheet, we could have used our linear formulas.

    We knew that fahr = 32 + (9/5) * cel, so y = fahr, x = cel, c0 = 32, and c1 = 9/5.

    Our handy linear formulas tell us:

    $$\overline{fahr} = c_0 + c_1 \cdot \overline{cel} = 32 + (9/5) \cdot 32.5 = 90.5$$

    $$s_{fahr} = |c_1| \cdot s_{cel} = |9/5| \cdot 20 = 36$$

    Of course, these are the same answers we got before!!
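As a sanity check, here is a short sketch in Python that does it both ways with the eight temperatures from the table: once by converting every value and summarizing, and once by applying the linear formulas to the Celsius summaries. The variable names are mine; the course itself does this in a spreadsheet.

```python
import statistics

cel = [10, 15, 20, 25, 40, 30, 50, 70]
c0, c1 = 32, 9 / 5

# The "spreadsheet" way: convert each observation, then summarize.
fahr = [c0 + c1 * c for c in cel]
mean_fahr_direct = statistics.mean(fahr)
sd_fahr_direct = statistics.stdev(fahr)

# The formula way: summarize cel once, then transform the summaries.
mean_cel = statistics.mean(cel)
sd_cel = statistics.stdev(cel)
mean_fahr_formula = c0 + c1 * mean_cel
sd_fahr_formula = abs(c1) * sd_cel
```

Both routes should give a mean of 90.5 and a standard deviation of 36, up to floating point rounding.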

    Aside: Why? (The hard way)

    $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad y_i = c_0 + c_1 x_i$$

    $$\begin{aligned}
    \bar{y} &= \frac{1}{n}\sum_{i=1}^{n}(c_0 + c_1 x_i) \\
            &= \frac{1}{n}\sum_{i=1}^{n} c_0 + c_1 \frac{1}{n}\sum_{i=1}^{n} x_i \\
            &= c_0 + c_1 \bar{x}
    \end{aligned}$$

    $$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

    $$\begin{aligned}
    s_y^2 &= \frac{1}{n-1}\sum_{i=1}^{n}\big(c_0 + c_1 x_i - (c_0 + c_1 \bar{x})\big)^2 \\
          &= \frac{1}{n-1}\sum_{i=1}^{n}(c_1 x_i - c_1 \bar{x})^2 \\
          &= c_1^2 \, \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = c_1^2 s_x^2
    \end{aligned}$$

    NOTE: This is way more math than we will typically need in this course.

    BUT you should know these formulas are properties of our summary statistics, not just some coincidence. AND they come up again when we do probability!

    Example

    Suppose our movie star made 10 pictures last year and the sample mean and sample variance of the gross on the films are 100 and 900, respectively.

    What are the sample mean and variance of the star's income?

    Gross   Income
    115.8   21.58
    128.9   22.89
    109.5   20.95
    127.1   22.71
    87.2    18.72
    111.2   21.12
    62.5    16.25
    129.4   22.94
    87.2    18.72
    41.2    14.12

    Each Income number is 10 + .1* the corresponding Gross number.

    See the file "moviestar1.xls": http://faculty.chicagobooth.edu/alan.bester/teaching/data/moviestar1.xls

    Remember, I = 10 + .1 G, so y = I, x = G, c0 = 10, and c1 = .1.

    So,

    $$\bar{I} = c_0 + c_1 \bar{G} = 10 + .1 \cdot 100 = 20$$

    $$s_I^2 = c_1^2 s_G^2 = (.1)^2 \cdot s_G^2 = 9$$

    The average of the Gross numbers = 100
    The sample variance of the Gross numbers = 900
    The standard deviation of the Gross numbers = 30

    The average of the Income numbers = 20                  (10 + .1*100 = 20)
    The sample variance of the Income numbers = 9           ((.1)^2 * 900 = 9)
    The standard deviation of the Income numbers = 3        (.1*30 = 3)

    [Scatterplot of Income (vertical axis, 14 to 24) against Gross (horizontal axis, 40 to 140)]
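The same check can be sketched in Python using the ten Gross numbers listed in the table. One caveat, based on my own arithmetic: the listed values average exactly 100, but their sample variance works out to roughly 899 rather than exactly 900, so the notes appear to be rounding. The point of the sketch is that the linear formulas hold exactly either way.

```python
import statistics

gross = [115.8, 128.9, 109.5, 127.1, 87.2, 111.2, 62.5, 129.4, 87.2, 41.2]
c0, c1 = 10, 0.1

# Income for each film, in millions of dollars.
income = [c0 + c1 * g for g in gross]

mean_gross = statistics.mean(gross)
var_gross = statistics.variance(gross)

# Apply the linear formulas to the Gross summaries...
mean_income_formula = c0 + c1 * mean_gross
var_income_formula = c1 ** 2 * var_gross

# ...and compare with summarizing the Income numbers directly.
mean_income_direct = statistics.mean(income)
var_income_direct = statistics.variance(income)
```

Note that the variance gets multiplied by the slope squared, so a 10% share of the gross carries only 1% of its variance.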

    Why are these formulas useful?

    We could always just type everything into a spreadsheet and use spreadsheet functions to get the answers.

    Really, though, the reason for these formulas will become apparent when we study probability, statistical inference, and regression. You cannot understand statistics or regression without a solid understanding of linear relationships.

    In other words, yes, I recognize these formulas are probably the least fun part of the course (and considering this is basic stats, that's saying something). But you absolutely must know them.

    Example

    Suppose x has mean 100 and standard deviation 10.

    What are the mean, standard deviation and variance of:

    (i) y = 2x?       (c0=0, c1=2)
    (ii) y = 5+x?     (c0=5, c1=1)
    (iii) y = 5-2x?   (c0=5, c1=-2)

    Answers:

            Mean   SD   Variance
    (i)     200    20   400
    (ii)    105    10   100
    (iii)   -195   20   400
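If you want to check your answers, the three formulas fit in one small function. This is only a sketch in Python; the function name `linear_summary` is something I made up for the occasion.

```python
def linear_summary(c0, c1, mean_x, sd_x):
    """Mean, standard deviation, and variance of y = c0 + c1 * x."""
    mean_y = c0 + c1 * mean_x
    sd_y = abs(c1) * sd_x  # the intercept drops out; the sign of the slope does not matter
    var_y = sd_y ** 2
    return mean_y, sd_y, var_y


mean_x, sd_x = 100, 10

answer_i = linear_summary(0, 2, mean_x, sd_x)     # y = 2x
answer_ii = linear_summary(5, 1, mean_x, sd_x)    # y = 5 + x
answer_iii = linear_summary(5, -2, mean_x, sd_x)  # y = 5 - 2x
```

Notice that (i) and (iii) have the same standard deviation and variance even though the slopes have opposite signs.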

    6.3 Linear combinations

    We may want a variable to be related to several others instead of just one. We will assume that Y is a function of X, Z, ... rather than just a function of X.

    When a variable y is linearly related to several others, we call it a linear combination.

    $$y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_k x_k$$

    We say, y is a linear combination of the x's.
    c0 is called the intercept or just the constant.
    ci is called the coefficient of xi.

    Example

    Suppose in addition to the flat $10 million fee and 10 percent of ticket revenues, our movie star also gets 5 percent of all sales of the soundtrack (on CD) released with the movie.

    How is the star's income related to the film's gross and CD sales (in millions of dollars)?

    Let I, G, C denote income, gross, and CD sales:

    $$I = 10 + .1\,G + .05\,C$$

    Here y = I, x1 = G, x2 = C, with c0 = 10, c1 = .1, and c2 = .05.

    Important example: Portfolios

    Suppose you have $100 to invest.

    Let x1 be the return on asset 1.

    If x1 = .1, and you put all your money into asset 1, then

    you will have $100*(1+.1) = $110 at the end of the period.

    Let x2 be the return on asset 2.

    If x2 = .15, and you put all your money into asset 2, then

    you will have $100*(1+.15) = $115 at the end of the period.

    Suppose you put 1/2 of your money into asset 1 and the other 1/2 of your money into asset 2. What will happen?

    At the end of the period you will have,

    .5*(100)*(1+.1) + .5*(100)*(1+.15) = 100*[ 1+(.5*.1)+(.5*.15) ]

    = 55 + 57.50 = $112.50

    So the return is (.5*.1) + (.5*.15) = .125

    In other words, when we put 1/2 of our money into asset 1 and the other 1/2 into asset 2, the return on the resulting portfolio is

    Rp = (1/2)*x1 + (1/2)*x2

    where Rp is the return on the portfolio, the first 1/2 is the investment in asset 1, and the second 1/2 is the investment in asset 2.

    The return on a portfolio is a linear combination of the returns on the individual assets.
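
    Here is a quick sketch of the same arithmetic, using the $100, the returns .1 and .15, and the half-and-half split from the example. The helper name end_of_period_wealth is made up for illustration.

```python
def end_of_period_wealth(initial, weights, returns):
    """Wealth after one period when `initial` is split across assets by `weights`."""
    return sum(w * initial * (1 + x) for w, x in zip(weights, returns))

initial = 100.0
weights = [0.5, 0.5]      # half of the money in each asset
returns = [0.10, 0.15]    # x1 and x2 from the example

wealth = end_of_period_wealth(initial, weights, returns)
portfolio_return = wealth / initial - 1                      # return earned on the $100
weighted_return = sum(w * x for w, x in zip(weights, returns))

print(round(wealth, 2), round(portfolio_return, 4), round(weighted_return, 4))
```

    The last two numbers agree: the return you actually earn is the weighted sum of the individual returns.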

    It turns out this is true in general. Suppose you have $M to invest in two assets with returns x1 and x2. Let w1 be the fraction of your wealth you choose to invest in asset 1:

    $$w_1 M (1 + x_1) + w_2 M (1 + x_2) = M (w_1 + w_2 + w_1 x_1 + w_2 x_2) = M (1 + w_1 x_1 + w_2 x_2)$$

    The portfolio return is:

    $$R_p = w_1 x_1 + w_2 x_2$$

    The portfolio return is a linear combination of the individual asset returns. The coefficients are the portfolio weights (fraction of wealth invested in each asset).

    Note: For this to work, we need w1 + w2 = 1

    Notice that the portfolio weights always sum up to one. (If I invest 30% of my wealth in asset 1, then I have to invest 70% of my wealth in asset 2.)

    When we're talking about portfolios, we use w1, w2, ... instead of c1, c2, ... to remind us that weights have to sum to one. Our linear formulas work the same way in either case. Most of the time when we do portfolios, we don't worry about the constant (c0=0).

    Question for those with some finance experience: Can portfolio weights be negative?

    Suppose we have m assets.

    The return on the ith asset is xi.

    Put fraction wi of your wealth into asset i.

    Your portfolio is determined by the portfolio weights wi.

    Then, the return on the portfolio is:

    $$R_p = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m = \sum_{i=1}^{m} w_i x_i$$

    Your portfolio return is always a linear combination of individual asset returns, with coefficients equal to the fraction of wealth invested.

    6.4 Mean and variance of a linear combination

    2 inputs:

    First, we consider the case where we have only two x's.

    Suppose

    $$y = c_0 + c_1 x_1 + c_2 x_2$$

    Then,

    $$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2$$

    $$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1 x_2}$$

    For linear combinations of 2 or more variables, variance also depends on the covariance between the x's!! More on this later.
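
    If you'd like to convince yourself that the two-input formulas really do match what a spreadsheet would give, here is a sketch that computes the mean and variance of y both ways. The data are made up for illustration, and the sample covariance is assumed to use the n-1 divisor (the comparison comes out the same with either divisor, as long as you use one consistently).

```python
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance with the n-1 divisor; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

# Made-up illustrative data (not from the notes)
x1 = [1.0, 2.0, 4.0, 7.0, 11.0]
x2 = [3.0, 1.0, 5.0, 2.0, 8.0]
c0, c1, c2 = 10.0, 0.1, 0.05

# Route 1: build every y value, then summarize it directly
y = [c0 + c1 * a + c2 * b for a, b in zip(x1, x2)]
direct_mean, direct_var = mean(y), cov(y, y)

# Route 2: use only the summaries of x1 and x2
formula_mean = c0 + c1 * mean(x1) + c2 * mean(x2)
formula_var = c1**2 * cov(x1, x1) + c2**2 * cov(x2, x2) + 2 * c1 * c2 * cov(x1, x2)

print(direct_mean, formula_mean)
print(direct_var, formula_var)
```

    Dropping the covariance term from formula_var would break the agreement whenever x1 and x2 move together.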

    Example

    For each film she does, our movie star makes $10 million plus 10% of gross ticket revenues and 5% of CD sales.

    Here is the data for ten movies she made last year, together with her income for each film:

    Gross        Cd         Income
    115.763100   5.412503   21.8
    128.904400   6.539900   23.2
    109.524600   5.878809   21.2
    127.133700   4.984490   23.0
    87.234720    3.544932   18.9
    111.248000   5.602628   21.4
    62.455030    3.954600   16.4
    129.397300   5.387244   23.2
    87.171460    5.092816   19.0
    41.167710    3.602078   14.3

    Remember,

    $$I = 10 + .1\,G + .05\,C$$

    So each number in the Income column equals 10 plus .1 times the Gross value plus .05 times the Cd value.

    Note: All numbers are in millions of $.
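
    As a sketch of the "type everything in" approach, the Income column can be rebuilt from the Gross and Cd columns with the income formula. The list names below are just labels chosen for this illustration.

```python
# Gross and Cd values for the ten movies (millions of $)
gross = [115.763100, 128.904400, 109.524600, 127.133700, 87.234720,
         111.248000, 62.455030, 129.397300, 87.171460, 41.167710]
cd = [5.412503, 6.539900, 5.878809, 4.984490, 3.544932,
      5.602628, 3.954600, 5.387244, 5.092816, 3.602078]

# Income = 10 + .1*Gross + .05*Cd, film by film
income = [10 + 0.1 * g + 0.05 * c for g, c in zip(gross, cd)]

print([round(i, 1) for i in income])
```

    Rounded to one decimal, these should line up with the Income numbers shown for the ten films.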

    Like before, we could type everything in and get the sample mean and variance of income using a spreadsheet.

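    But let's suppose, as her agent, we already knew that:

    $$\bar{G} = 100, \qquad \bar{C} = 5$$

    $$s_G = 30, \qquad s_C = 1$$

    $$r_{CG} = 0.8$$

    Like before, we know that:

    $$I = 10 + .1\,G + .05\,C$$

    with c0 = 10, c1 = .1, and c2 = .05.

    So:

    $$\bar{I} = c_0 + c_1 \bar{G} + c_2 \bar{C} = 10 + .1(100) + .05(5) = 20.25$$

    $$s_I^2 = c_1^2 s_G^2 + c_2^2 s_C^2 + 2 c_1 c_2 s_{CG}$$

    $$= (.1)^2 (30)^2 + (.05)^2 (1)^2 + 2(.1)(.05)(30)(1)(.8) = 9.24$$

    (For how we got s_CG = (30)(1)(.8), see next slide.)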
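
    The agent's calculation can be sketched in a few lines. The numbers are the ones given above; the variable names are just labels for this illustration. The only extra step is turning the correlation into a covariance before using the variance formula.

```python
c0, c1, c2 = 10.0, 0.1, 0.05        # coefficients in I = 10 + .1*G + .05*C
mean_G, mean_C = 100.0, 5.0         # sample means
s_G, s_C, r_CG = 30.0, 1.0, 0.8     # standard deviations and correlation

s_CG = s_C * s_G * r_CG             # covariance = sd * sd * correlation
mean_I = c0 + c1 * mean_G + c2 * mean_C
var_I = c1**2 * s_G**2 + c2**2 * s_C**2 + 2 * c1 * c2 * s_CG
sd_I = var_I ** 0.5

print(mean_I, round(var_I, 4), round(sd_I, 3))
```

    Almost all of the variance comes from the Gross term, with the covariance term adding a little more.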

    Reminder:

    Remember, we defined sample correlation as the covariance divided by the standard deviations:

    $$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

    So, if we know the correlation and both standard deviations, we can get back sample covariance:

    $$s_{xy} = s_x s_y r_{xy}$$

    So, if we know the sample standard deviations and either of correlation or covariance, we can figure out the other. We used this trick to calculate s_CG on the previous slide.

    Example (the country data again)

    Let us use our country data and suppose that we had put .5 into USA and .5 into Hong Kong. What would our returns have been?

    port = .5*honkong + .5*usa

    Here w1 (= c1) = .5 is the weight on honkong and w2 (= c2) = .5 is the weight on usa.

    honkong   usa     port
    0.02      0.04    0.030
    0.06      -0.03   0.015
    0.02      0.01    0.015
    -0.03     0.01    -0.010
    0.08      0.05    0.065
    ........

    For each month, we get the portfolio return as 1/2*hongkong + 1/2*usa.

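    The sample means are:

    $$\overline{\text{honkong}} = 0.02103, \qquad \overline{\text{usa}} = 0.01346$$

    The sample mean of our portfolio returns is:

    $$\overline{\text{port}} = w_1\,\overline{\text{honkong}} + w_2\,\overline{\text{usa}} = .5(.02103) + .5(.01346) = .01724$$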

    What if we had put 25% into USA and 75% into Hong Kong?

    port2 = .75*honkong + .25*usa

    Covariances

              honkong      usa          port2
    honkong   0.00521497
    usa       0.00103037   0.00110774
    port2     0.00416882   0.00104972   0.00338905

    To get the variance of port2, just use the SAME formula from the previous slide, except now with w1=.75 and w2=.25:

    (.75)^2*(.00521) + (.25)^2*(.00111) + (2)*(.25)*(.75)*(.00103) = .00339

    How do the returns on the w1=w2=.5 portfolio compare with those of Hong Kong and USA?

    [Figure: plot of Mean (vertical axis) against StDev (horizontal axis) for port, usa, and honkong]

    It looks like the mean for my portfolio is right in between the means of USA and Hong Kong.

    What about the standard deviation?

    The sample standard deviation is less than halfway between $s_{\text{usa}}$ and $s_{\text{honkong}}$. What happened?

    $$\overline{\text{port}} = .0172, \qquad s_{\text{port}} = .046$$
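
    To see the "less than halfway" effect in numbers, here is a sketch that uses the Hong Kong and USA variances and covariance from the covariance table and the two-input variance formula with w1 = w2 = .5. The variable names are just labels for this illustration.

```python
from math import sqrt

# Variances and covariance of the monthly returns, from the covariance table
var_hk, var_usa, cov_hk_usa = 0.00521497, 0.00110774, 0.00103037
w1 = w2 = 0.5

var_port = w1**2 * var_hk + w2**2 * var_usa + 2 * w1 * w2 * cov_hk_usa
s_port = sqrt(var_port)

# What the sd would be if it simply averaged like the mean does
halfway = 0.5 * sqrt(var_hk) + 0.5 * sqrt(var_usa)

print(round(s_port, 4), round(halfway, 4))
```

    The portfolio standard deviation comes out below the halfway point because the covariance between the two markets is smaller than it would be under perfect correlation.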

    Why is covariance important?

    Often useful to rewrite the variance formula as

    $$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1} s_{x_2} r_{x_1 x_2}$$

    (We just used the formula from the Reminder slide: $s_{x_1 x_2} = s_{x_1} s_{x_2} r_{x_1 x_2}$.)

    Remember, correlations are between -1 and 1! IF x1 and x2 are perfectly correlated (r=1), then

    $$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1} s_{x_2} = (c_1 s_{x_1} + c_2 s_{x_2})^2$$

    So in this case,

    $$s_y = c_1 s_{x_1} + c_2 s_{x_2}$$

    BUT in general, when c1 and c2 are positive (and r is less than 1),

    $$s_y < c_1 s_{x_1} + c_2 s_{x_2}$$

    The basic idea here is

    When we take averages, variance gets smaller.

    The smaller the correlation, the faster this happens.

    This is actually one of the most important ideas in statistics. We'll see it again!!

    It is also one of the most important ideas in finance, because it leads to diversification.
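
    To get a feel for how much the correlation matters, here is a sketch that averages two x's (c1 = c2 = .5) and computes the standard deviation of the average for several correlations. The inputs (both x's with standard deviation 1) are hypothetical, and the function name sd_of_average is made up for illustration.

```python
from math import sqrt

def sd_of_average(s1, s2, r, c1=0.5, c2=0.5):
    """Sd of y = c1*x1 + c2*x2 from the sds of x1, x2 and their correlation r."""
    var_y = c1**2 * s1**2 + c2**2 * s2**2 + 2 * c1 * c2 * s1 * s2 * r
    return sqrt(max(var_y, 0.0))   # guard against tiny negative rounding error

# Hypothetical inputs: both x's have standard deviation 1
for r in (1.0, 0.5, 0.0, -0.5, -1.0):
    print(r, round(sd_of_average(1.0, 1.0, r), 3))
```

    With r = 1 averaging does nothing for you. As r falls, the standard deviation of the average falls too, and with r = -1 the ups and downs cancel completely.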

    Example (Optional)

    y = .5x1 + .5 x2

    [Figure: scatter plot of x2 (vertical axis) against x1 (horizontal axis). At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]

    The variances and covariance are:

         x1          x2
    x1   1.334636
    x2   -1.208679   1.106238

    Then, the variance of y is

    0.0058105 = .5*.5*1.3346 + .5*.5*1.106 + 2*.5*.5*(-1.208679)

    Why is the variance of y so much smaller than those of the x's?

    Example (Optional)

    y = .5x1 + .5 x2

    [Figure: scatter plot of x2 (vertical axis) against x1 (horizontal axis). At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]

    The variances and covariance are:

         x1          x2
    x1   1.3870537
    x2   0.1976187   0.8247886

    Then, the variance of y is

    0.65175 = .5*.5*1.387 + .5*.5*.8248 + 2*.5*.5*.1976

    Why is the variance of y less than those of x1 and x2?

    3 inputs:

    Suppose

    $$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3$$

    Then,

    $$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3$$

    $$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + 2 c_1 c_2 s_{x_1 x_2} + 2 c_1 c_3 s_{x_1 x_3} + 2 c_2 c_3 s_{x_2 x_3}$$

    The formula for the sample mean is basically the same, just one more term because there's one more x.

    Note that there are now THREE covariance terms, one for each PAIR of x's.

    Example: Portfolio with 3 inputs

    port = .1*fidel+.4*eqmrkt+.5*windsor

    Covariances

              port         fidel        eqmrkt       windsor
    port      0.00306760
    fidel     0.00280224   0.00320210
    eqmrkt    0.00369384   0.00319150   0.00470021
    windsor   0.00261967   0.00241087   0.00298922   0.00236580

    $$s_{\text{port}}^2 = w_1^2 s_{\text{fidel}}^2 + w_2^2 s_{\text{eqmrkt}}^2 + w_3^2 s_{\text{windsor}}^2 + 2 w_1 w_2 s_{\text{fidel},\text{eqmrkt}} + 2 w_1 w_3 s_{\text{fidel},\text{windsor}} + 2 w_2 w_3 s_{\text{eqmrkt},\text{windsor}}$$

    .0030676 = (.1)*(.1)*.00320 + (.4)*(.4)*.00470 + (.5)*(.5)*.00236 + 2*[ (.1)*(.4)*.00319 + (.1)*(.5)*.00241 + (.4)*(.5)*.00299 ]

    Let us try a portfolio with three stocks. Let us go short on Canada (i.e., we borrow Canada to invest in the other stocks).

    port = -.5*canada + usa + .5*honkong

    [Figure: plot of Mean (vertical axis) against StDev (horizontal axis) for port, usa, honkong, and canada]

    Clearly, forming portfolios is an interesting thing to do!

    Aside: We can show (using our linear formulas) that all portfolios that can be formed with a given set of assets lie on a hyperbola in mean-s.d. space. Your investments class will call this the "portfolio possibilities curve" or just the "efficient frontier."

    Aside: Why would we form portfolios?

    Maybe the portfolio has a nice mean and variance (i.e. nice average return and nice risk).

    Because portfolio returns are linear combinations of returns on individual assets, we can apply our linear formulas to find the average return and risk of any possible portfolio as long as we know the means, variances, and covariances of the individual asset returns. These formulae are fundamental tools for those who really understand finance.

    And remember our "when we take averages, variance gets smaller" idea? In finance, that's known as diversification.

    Example (Optional)

    Cut from a Finance Textbook:

    K inputs (Optional): Suppose

    $$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_k x_k$$

    then,

    $$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3 + \cdots + c_k \bar{x}_k$$

    $$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + \cdots + c_k^2 s_{x_k}^2 + 2 \left[ \text{the sum of all the different covariance terms times the products of the c's} \right]$$

    I won't ask you to do calculations by hand for more than 3 inputs; this is just to give you an idea of what the formulas look like.
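
    Since nobody wants to do this by hand, here is a sketch of the K-input variance formula as a small function. It takes the coefficients and a table of variances and covariances, then adds up the squared-coefficient terms plus twice every distinct pair term. The function name linear_combination_variance is made up for illustration, and the numbers used to try it out are the fund covariances and weights from the 3-input portfolio example.

```python
from itertools import combinations

def linear_combination_variance(c, cov):
    """Variance of c[0]*x_1 + ... + c[k-1]*x_k given a k-by-k covariance table.

    cov[i][i] holds the variance of x_i and cov[i][j] the covariance of x_i, x_j.
    The constant c0 is left out because it does not affect the variance.
    """
    k = len(c)
    own = sum(c[i] ** 2 * cov[i][i] for i in range(k))
    # one term for each distinct PAIR of x's
    cross = sum(c[i] * c[j] * cov[i][j] for i, j in combinations(range(k), 2))
    return own + 2 * cross

# Covariances for fidel, eqmrkt, windsor (in that order)
cov = [
    [0.00320210, 0.00319150, 0.00241087],
    [0.00319150, 0.00470021, 0.00298922],
    [0.00241087, 0.00298922, 0.00236580],
]
weights = [0.1, 0.4, 0.5]

port_var = linear_combination_variance(weights, cov)
print(round(port_var, 7))
```

    With k inputs there are k(k-1)/2 pair terms, which is why the hand calculation gets tedious quickly.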

    7. Linear Regression

    This is data on 128 homes. (Housing data: http://faculty.chicagobooth.edu/alan.bester/teaching/data/MidCity.xls)

    x = size (square feet), y = price (dollars)

    [Figure: scatter plot of Price (vertical axis, 50000 to 225000) against SqFt (horizontal axis, 1400 to 2600)]

    Clearly, the data are correlated:

    Table of correlations

            SqFt    Price
    SqFt    1.000
    Price   0.553   1.000

    But what is the equation of the line you would draw through the data?

    Linear regression fits a line to the plot.

    When I "run a regression" I get values for the intercept and the slope.

    y = (intercept) + (slope) * x

    Regression coefficients

               Coefficient
    Constant   -10091.1299   (intercept)
    SqFt       70.2263       (slope)

    Here is the scatter plot with the line drawn through it.

    Looks reasonable!

    It turns out the formulas for the slope and the intercept are

    $$\text{slope} = \frac{s_{xy}}{s_x^2}$$

    $$\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$

    We'll see these later when we study regression. But it isn't that hard to see what they do!

    The slope formula takes covariance and standardizes it so that its units are (units of y)/(units of x).

    The intercept formula makes our line pass through the point $(\bar{x}, \bar{y})$.
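
    Here is a sketch of the slope and intercept formulas in code. The sizes and prices below are made up for illustration (they are not the MidCity data), and the function name fit_line is hypothetical.

```python
def mean(v):
    return sum(v) / len(v)

def fit_line(x, y):
    """Slope = s_xy / s_x^2 and intercept = ybar - slope*xbar."""
    xbar, ybar = mean(x), mean(y)
    n = len(x)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    slope = s_xy / s_xx
    intercept = ybar - slope * xbar
    return intercept, slope

# Made-up sizes (square feet) and prices (dollars)
sqft = [1500.0, 1800.0, 2000.0, 2300.0, 2600.0]
price = [95000.0, 120000.0, 128000.0, 155000.0, 170000.0]

intercept, slope = fit_line(sqft, price)
print(round(intercept, 2), round(slope, 4))

# The fitted line evaluated at the mean size gives back the mean price
print(intercept + slope * mean(sqft), mean(price))
```

    The slope comes out in dollars per square foot, which is the (units of y)/(units of x) point made above.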

    Regression and Prediction

    You have a house on the market with size = 2200 sqft.

    Can we predict at what price the house will sell?

    [Figure: histogram of Price (in $1,000s)]

    $$\overline{\text{Price}} = \$130.4\text{k}, \qquad s_{\text{Price}} = \$26.9\text{k}$$

    We might use the sample mean or median as our prediction. But this doesn't take size into account.
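
    One way to take size into account, sketched below, is to plug the size into the fitted line y = (intercept) + (slope) * x using the regression coefficients reported earlier (Constant = -10091.1299, SqFt = 70.2263). This is only an illustration of how those two numbers could be used; the function name predict_price is made up.

```python
INTERCEPT = -10091.1299   # "Constant" in the regression output
SLOPE = 70.2263           # "SqFt" coefficient, in dollars per square foot

def predict_price(sqft):
    """Predicted price in dollars from the fitted line."""
    return INTERCEPT + SLOPE * sqft

prediction = predict_price(2200)
print(round(prediction, 2))
```

    For a 2200 sqft house this gives a number noticeably above the overall sample mean of $130.4k, which makes sense if 2200 sqft is larger than the typical house in the data.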

    Summary of Regression

    Because they are using other information, the predictions we make are (hopefully!) better in some sense. One of the homework problems asks you to explore this.

    Most importantly, though, regression is based on the same concepts (sample means, standard deviations, and covariance) that we've studied in these notes. It's simply a new way to display (and use!) this information.

    There's nothing magical or mysterious about linear regression! If you understand the basics well, regression is both intuitive and incredibly useful.

    Limitations of Regression

    One thing to notice about regression is that it is not symmetric. As we've seen, the sample correlation (or covariance) between x and y is the same as between y and x.

    In regression, it matters which variable is on the left hand side of the "=" (the dependent variable). A regression with y = Size and x = Price gives a different answer.

    Remember:

    Correlation is not causation!

    Just because we regress y on x doesn't mean changes in x cause changes in y.
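
    To see the lack of symmetry concretely, here is a sketch that fits the slope both ways on a small made-up data set (not the MidCity data). The covariance is the same in either order, but the two slopes are not simply reciprocals of each other. In fact their product equals the squared correlation.

```python
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance with the n-1 divisor; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def slope(x, y):
    """Slope from regressing y on x."""
    return cov(x, y) / cov(x, x)

# Made-up data for illustration
size = [1500.0, 1800.0, 2000.0, 2300.0, 2600.0]   # square feet
price = [95.0, 120.0, 128.0, 155.0, 170.0]        # in $1,000s

r = cov(size, price) / (cov(size, size) * cov(price, price)) ** 0.5
price_on_size = slope(size, price)   # y = price, x = size
size_on_price = slope(price, size)   # y = size,  x = price

print(round(r, 4))
print(round(price_on_size, 5), round(1 / size_on_price, 5))
```

    The two printed slopes would match only if the correlation were exactly 1 or -1.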

    8. Pivot Tables (Optional)

    Up till now, we have tried to look at pairs of variables.

    Of course, it would be interesting to look at more than two at a time.

    The Pivot table utility in Excel uses tables to do this. But the tables can be "more than two way" and you can put a summary for another variable in each cell.

    The simple two way tables we looked at earlier were also created using pivot tables.

    This table attempts to look at 3 variables at the same time!!

    Average of cigs, by sex (rows) and age (columns):

    sex           16-25   26-35   36-45   46-55   >56    Grand Total
    1             0.42    0.42    0.37    0.35    0.16   0.28
    2             0.53    0.33    0.28    0.39    0.23   0.29
    Grand Total   0.49    0.36    0.32    0.37    0.19   0.29

    The cells are determined by a binned version of age and sex.

    In each cell is printed the average of the cigs dummy. This gives the percentage of smokers.

    In the age group 16-25, 53% of female respondents are smokers.

    What do you think is going on here?
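
    For those who prefer code to Excel, here is a sketch of what a pivot table like this does: put each respondent in a cell by sex and binned age, then average the cigs dummy within each cell. The respondents below are made up, the bin cutoffs are assumed from the column labels, and the function names are hypothetical.

```python
from collections import defaultdict

def age_bin(age):
    """Map an age to a column label like those in the table (cutoffs assumed)."""
    for upper, label in [(25, "16-25"), (35, "26-35"), (45, "36-45"), (55, "46-55")]:
        if age <= upper:
            return label
    return ">56"

def pivot_mean(records):
    """Average of the cigs dummy within each (sex, age bin) cell."""
    totals = defaultdict(lambda: [0, 0])        # cell -> [sum of cigs, count]
    for sex, age, cigs in records:
        cell = totals[(sex, age_bin(age))]
        cell[0] += cigs
        cell[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

# Made-up respondents: (sex, age, cigs) with cigs = 1 for a smoker
records = [(1, 20, 1), (1, 22, 0), (2, 19, 1), (2, 24, 1), (2, 40, 0), (1, 60, 0)]

table = pivot_mean(records)
for cell in sorted(table):
    print(cell, table[cell])
```

    Because cigs is a 0/1 dummy, each cell average is just the fraction of smokers in that cell.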

    Here is the pivot chart.

    [Figure: pivot chart of the average of cigs by sex (1, 2), with one series for each age group (16-25, 26-35, 36-45, 46-55, >56)]

    The Hockey Data

    We have data on every penalty called in the NHL from 95-96 to 2001-2002. Data below is a subsample of size 5000.

    oppcall   timespan   laghome   goaldiff   inrow2   laghomeT   inrowT
    0         14.75      0         -1         0        v          one
    0         6.90       1         2          0        h          one
    1         8.45       1         2          1        h          two
    0         11.75      0         0          0        v          one
    1         6.30       1         1          0        h          one
    1         3.33       1         -1         1        h          two
    1         5.93       0         -1         1        v          two
    ...

    Each row corresponds to a penalty. (Can't have first penalty in game.)

    oppcall = 1 if penalty switches; that is, if A is playing B and the last penalty was on B, then oppcall = 1 if this penalty is on A.

    timespan = time between penalties (mins)

    laghome = 1 if last penalty was on the home team

    goaldiff = lead of last penalized team

    inrow2 = 1 if last two penalties were on the same team

    laghomeT: h if laghome = 1

    inrowT: two if inrow2 = 1

    The table attempts to look at 4 variables at one time!!!!

    Average of oppcall goaldiff

    inrowT l