Lecture 1 Notes - Asal Aslemand
➢ Chapter 1. Stats Starts Here
➢ Chapter 2. Displaying and Describing Categorical Data
Data
▪ We can make sense of the world by making sense of data.
▪ “Data” is plural; “datum” is singular.
▪ Data are values along with their context.
▪ Data are any collection of numbers, characters, images, or other items that provide information about something.
▪ Data help us see the underlying truth and pattern.
▪ Nowadays, data mostly come in spreadsheet form (e.g., Excel).
▪ Data are presented in a table (rows and columns) like the example below.
Data Vary
For example, ask Canadians: to what extent do you agree or disagree with the statement “Learning new things is fun”?
• Let’s find out how Canadians responded to this statement in 2008.
• Source: Access and Support to Education and Training Survey, 2008.
Example of Canadian Data Set
CHASS: Computing in the Humanities and Social Sciences
Faculty of Arts & Science, University of Toronto
Copy and paste the link below into a new tab in your internet browser: http://www.chass.utoronto.ca/
From the left-side menu, click on Data Centre > U. of T. users
Click on SDA@CHASS (SDA: Survey Documentation and Analysis)
Click on Continue in English
Click on Access and Support to Education and Training Survey, 2008 (ASETS)
Click on Data
Click on Codebooks > SDA codebooks
Click on Sequential Variable List
Click on Attitudes Towards Learning
Click on item: al_g02 Learning new things is fun
What do you think about Canadians’ responses to this item?
Do you think they all agreed? Or only some of them?
Do you expect to obtain the same answers (responses) from a different selection of Canadians in 2008?
Do you expect to obtain the same responses from the same selected Canadians in 2018?
Statistics
Statistics is the science of data.
This involves collecting, classifying, analyzing, presenting, interpreting and communicating numerical information.
Statistics helps us:
• make sense of the world in everyday life by seeing past the underlying variation to find patterns and relationships (e.g., in health, politics, economics, education, environment, and social issues);
• become informed citizens by giving us the tools to understand, question, and interpret data;
• understand articles published in research journals and reports from government agencies and private industry.
Role of Statistics
Statistics has important roles in answering real questions like the following:
• How do we assess the risk of genetically engineered foods being considered by the Canadian Food Inspection Agency?
• How do we assess the safety and effectiveness of new drugs submitted to Health Canada for approval?
• How do we determine whether vitamin C really prevents illness?
• Which factors have the greatest impact on student performance in school?
• Which factors affect the quality of people’s health care?
• Which factors affect people’s decision to retire?
Improving Human Welfare in 2013 International Year of Statistics:
http://www.worldofstatistics.org/about-us/
Elements of Statistics
• An Individual Case (Experimental Unit) is an object about which we collect data.
Example: student, animal, transaction, event
• The cases are often a sample selected from some larger population that we would like to understand.
• A population is a set of individual cases that we are interested in studying.
Example: All students at University of Toronto
• A sample is a subset of the individual cases (units) of a population.
Example: All students in this class
• A representative sample exhibits characteristics typical of those possessed by the population.
• It is a kind of snapshot of the larger world.
• The most common way to satisfy the representative sample requirement is to select a random sample as it ensures that every subset of a fixed size in the population has the same chance of being selected.
Example: If we want to understand students’ experience at U of T, we need to randomly select students from the entire population of U of T students.
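A minimal sketch of simple random sampling, written in Python purely for illustration. The student ID numbers, the population size, the sample size, and the seed are all hypothetical. The standard library function `random.sample` draws a fixed-size subset without replacement, so every subset of that size has the same chance of being selected:

```python
import random

# Hypothetical population: ID numbers standing in for all students (not real data).
population = list(range(1, 1001))

# Fixed seed only so that the sketch is reproducible; omit it in a real study.
rng = random.Random(2024)

# Simple random sample of 50 students, drawn without replacement.
sample = rng.sample(population, k=50)

print(len(sample), "students selected, for example:", sorted(sample)[:5])
```

Sampling only from one class instead of the whole list would not give every U of T student a chance of selection, which is why the population list matters as much as the random draw.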
• Variation is the of statistics.
• Variation is the foundation of sound reasoning about the data.
• Statistical methods help explain the variation in the data; we model the variation in the data.
• A variable is a characteristic of an individual case (experimental or observational unit) in the population.
• A variable can take different values on different cases.
Example: An undergraduate student database.
Individual cases: Students of the university.
Variables: Gender, GPA, Program of Study, Year of Study, etc.
Context of Data Answers the Five W’s
When you plan a statistical study or explore data from someone else’s work, ask yourself the following questions:
• Who will be the cases in my study? How many individuals will be in my study?
• Why conduct this study? What purpose do the data have? Do I hope to answer some specific questions? Do I want to draw conclusions about individuals other than the ones I actually have data for? Are my data reliable?
• What? How many variables do the data contain? What are the exact definitions of these variables? In what unit of measurement is each variable recorded?
• When is an appropriate time to conduct my study?
• Where can I conduct my study?
In addition to the five W’s:
• How can I conduct my study? (e.g., use an instrument such as a validated survey)
Asal’s Example
I study students’ attitudes about statistics.
• Who: Undergraduate students
• Why: By understanding attitudes about statistics I aim to improve teaching and learning of statistics
• What: Students’ attitudes, their prior mathematics related experiences and achievement, their gender, their program of study, their year of study, and their statistics course outcome.
• Where: University of Toronto
• When: At the beginning and at the end of an introductory statistics course
• How: By administering the Survey of Attitudes Toward Statistics (SATS-36©) and linking students’ responses to students’ repository record from the Office of the Registrar
At the time of my study in 2016, I was the instructor for the course under study. I had a research assistant who administered the survey at both times and collected the data. She assigned a participant number to each student who participated in the study. The participant number is an “Identifier Variable”, which identifies individual cases. It was not included in the data analysis, but it helped to match students’ information regarding their attitudes toward statistics with their program and their year of study.
Types of Data
Classify Variables: Quantitative or Categorical
Quantitative variable:
• The measurement scale has numerical values that describe the amount of something.
• These variables must be accompanied by their units of measurement.
E.g., University GPA: (range from 1.0 to 4.0)
E.g., Hours of study: (0 to infinite!?)
• They may also arise from the process of counting.
E.g., The number of residents in the province of Ontario.
E.g., The number of siblings a person has.
Categorical variable:
• The measurement scale is a set of categories.
• The values determine what group or category individuals (cases) belong to.
• Counting is a natural way to summarize and learn about a categorical variable.
• Often called qualitative variables: distinct categories differ in their qualities, not in their numerical magnitude.
E.g., Program of Study: Environmental Sciences, Life Sciences, Social Sciences, and so on.
E.g., Canadian Provinces: Ontario, British Columbia, Alberta, and so on.
Why Classify Variables as Quantitative or Categorical?
• For application of different statistical methods.
• For obtaining appropriate graphs and summary statistics.
Example of a Quantitative Variable:
• Income of Canadian Citizens (in thousands of dollars).
We might be interested in average income of all Canadian Citizens.
Graphical Display: Histogram or Boxplot of distribution of income.
Example of a Categorical Variable:
• Canadian Provinces.
We might be interested in the number of Canadians living in each province (Count).
Graphical Display: Bar chart or Pie Chart
Convert a Quantitative Variable to a Categorical Variable
Simply break up the range of values into several intervals.
Example: Age
Distribution of on- and off-reserve First Nations people (single identity), by age group, 2011
Source: http://www.statcan.gc.ca/pub/89-653-x/2016010/tbl/tbl01-eng.htm
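The idea of breaking a range into intervals can be sketched as follows (Python, used here only for illustration). The ages and the interval boundaries are hypothetical and are not taken from the Statistics Canada table:

```python
from collections import Counter

# Hypothetical ages in years (illustrative only, not from the Statistics Canada table).
ages = [3, 17, 24, 25, 38, 41, 59, 64, 65, 80]

# Assumed intervals: (label, lower bound inclusive, upper bound exclusive).
age_groups = [
    ("0 to 14", 0, 15),
    ("15 to 24", 15, 25),
    ("25 to 64", 25, 65),
    ("65 and over", 65, float("inf")),
]

def age_to_group(age):
    """Return the label of the interval that contains the given age."""
    for label, low, high in age_groups:
        if low <= age < high:
            return label
    raise ValueError(f"age {age} is outside every interval")

# The quantitative variable (age) becomes a categorical one (age group).
group_counts = Counter(age_to_group(a) for a in ages)
for label, _, _ in age_groups:
    print(label, group_counts[label])
```

Each age falls into exactly one interval because the lower bounds are inclusive and the upper bounds exclusive; the resulting age groups keep their natural order, so the new variable is ordinal.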
Scales of Measurement
• Interval Scales:
For quantitative variables, equal intervals represent equal distances.
Example: Annual income (in thousands of dollars).
• The interval (distance) between $30,000 and $40,000 is $10,000.
• Purpose: We can compare outcomes in terms of how much larger or how much smaller one is than the other (e.g., which interval an annual income falls into).
• Nominal Scales:
For categorical variables, no level (category) is greater or smaller than any other level (category).
Example: Primary mode of transportation to school.
• Categories: automobile, bus, subway, bicycle, walk.
Scales of Measurement: Ordinal Scales
• A scale between the nominal and interval scales.
• Consists of categorical scales having a natural ordering of values.
• The levels form an ordinal scale.
Example: Social Class
• Categorical scale: upper, middle, lower.
Example: Political philosophy
• Categorical scale: very liberal, moderately liberal, slightly liberal, slightly conservative, moderately conservative, very conservative
Quantitative Aspects of Ordinal Scales
• The position of ordinal scales on the quantitative-qualitative (categorical) classification is fuzzy.
• Often, the methods used for their statistical analysis are the same as those for nominal (categorical) variables.
• In some cases, they could closely resemble interval scales for quantitative variables.
• Each level has a greater or smaller magnitude than another level.
• We can conduct a sensitivity analysis and check whether conclusions would differ in any significant way for other choices of scores.
• Example: Survey of Attitudes Toward Statistics (SATS-36©):
• SATS-36© items are ordinal (e.g., from strongly disagree to strongly agree)
• We might want to treat them as a quantitative variable (1, 2, 3, 4, 5, 6, 7; interval scale: distance is 1) to compute a mean score for an item (e.g., I will like statistics).
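A rough sketch of the sensitivity check described above, in Python for illustration only. The responses are hypothetical (they are not SATS-36© data), and the alternative "stretched" scoring is just one assumed choice among many:

```python
# Hypothetical responses to one 7-point item (illustrative only, not SATS-36 data).
responses = [7, 6, 6, 5, 7, 4, 6, 7, 5, 6]

def mean_score(values, scores):
    """Average the numeric score assigned to each ordinal response."""
    return sum(scores[v] for v in values) / len(values)

# Equally spaced scores: treats the ordinal levels as an interval scale.
equal_spacing = {level: level for level in range(1, 8)}

# An alternative, assumed scoring that stretches the two extreme categories.
stretched = {1: 0, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 8}

mean_equal = mean_score(responses, equal_spacing)
mean_stretched = mean_score(responses, stretched)

# Compare each mean with the neutral score of 4 under both scorings.
print(round(mean_equal, 2), round(mean_stretched, 2))
print(mean_equal > 4, mean_stretched > 4)
```

If the comparison with the neutral score comes out the same under both scorings, the conclusion does not hinge on treating the levels as equally spaced.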
Asal’s Example
I obtained a statistic (average) of 5.88 for students’ reported willingness to spend a great deal of effort to learn statistics, measured on a 7-point Likert scale (“1” indicates strong disagreement, “4” neutral, and “7” strong agreement).
Since 5.88 is above 4 (the neutral response), I concluded that, on average, students reported a willingness to spend a great deal of effort to learn statistics in their course.
Variables: Discrete or Continuous
Discrete Variables:
• Any variable with a finite (countable) number of possible values is discrete.
Examples:
• number of siblings for a person.
• number of people living in Ontario.
• ALL categorical variables (nominal or ordinal) are discrete, having a finite set of categories.
• Categories/levels are pre-determined for a categorical variable.
• Example: Social Class (Upper, Middle, Lower)
Continuous Variables:
• Any variable with an infinite continuum (no ending number) of possible real number values (e.g., a number with decimal points).
Examples:
• Time (in minutes) it takes to finish reading a book.
• Age of a person.
In summary:
Quantitative Variables:
• Have interval scales.
• Could be either continuous (e.g., age) or discrete (e.g., number of times dined at L'Espresso Bar Mercurio in the month of April).
Categorical Variables (always discrete):
• Nominal scales (e.g., mode of transportation to school: automobile, bus, subway, bicycle, walk).
• Ordinal scales (e.g., university letter grades: A, B, C, D).
Summarizing and Describing a Single Categorical Variable
• Recall our earlier example, responses to the ASETS (2008) survey item: Learning new things is fun.
Example: Learning new things is fun: A categorical variable.
Frequency Table:
• Count the number of cases corresponding to each category and put them into a table.
• Frequency table records the totals and uses the category names to label each row.
The table on the right describes the distribution of Canadian responses to the statement “Learning new things is fun”, because it names the possible categories and tells how frequently each occurs (how cases are distributed across the categories).
Example: 15,712 participants strongly agreed with the statement
Relative Frequency:
• Divide the count by the total number of cases. This gives the fraction (proportion) of the whole.
Example: 15712/23519 = 0.668
• Multiply the proportions by 100 to obtain the percentages.
Example: 0.668 x 100 = 66.8%
The majority (66.8%) of the respondents strongly agreed with the statement.
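The relative-frequency arithmetic can be sketched in a few lines (Python, for illustration only). Only the "strongly agree" count and the total come from the example; the other categories are lumped into a single derived count because their individual values are not listed in these notes:

```python
# Counts taken from the example above; the remaining responses are lumped together
# because the individual category counts are not listed in these notes.
total_cases = 23519
counts = {
    "Strongly agree": 15712,
    "All other responses": total_cases - 15712,
}

# Relative frequency = count / total; percentage = proportion * 100.
proportions = {category: n / total_cases for category, n in counts.items()}
percentages = {category: 100 * p for category, p in proportions.items()}

for category in counts:
    print(f"{category}: {counts[category]} ({percentages[category]:.1f}%)")
```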
Bar Chart:
• Displays the distribution of a categorical variable.
• Shows the frequency (count) for each category next to each other for easy comparison.
• The height of the bar shows the count for its category.
• It is better to have spaces between bars to indicate that these are freestanding bars that could be arranged into any order.
• The bars are the same width so their heights determine the areas.
• These areas are proportional to the counts in each category.
Note: Bar chart stays true to the Area Principle.
Area Principle:
The area occupied by a part of the graph should correspond to the magnitude of the value it represents.
Example: Learning new things is fun: A categorical variable.
Pie Chart:
• Displays the whole group of cases as a circle.
• It slices the circle into pieces whose sizes are proportional to the fraction of the whole in each category.
The majority (66.8%) of the respondents strongly agreed with the statement.
Example: Learning new things is fun: A categorical variable.
Contingency Tables:
• Classification with respect to two categorical variables.
• It helps determine whether two categorical variables are related (associated, dependent).
• Idea: Arrange the counts in a two-way table.
Example: A question on the General Social Survey (2009) on Victimization asked a random sample of 9689 Canadians about their opinion regarding Canadian criminal courts: “Are they doing a good job, an average job or a poor job of determining whether the accused or the person charged is guilty or not?”
The data are summarized in the two-way table below.
Exploring Relationships Between Two Categorical Variables
Opinion Regarding Criminal Courts at Sentencing, by Sex
Sex      Good   Average   Poor   Total
Male     1664   2237      705    4606
Female   1479   2794      810    5083
Total    3143   5031      1515   9689
• The table below is called a 2 x 3 (read as “2-by-3”) contingency table (two rows and three columns). It is a contingency table because it shows how the individuals are distributed along each variable, contingent on the value of the other variable.
• Subjects are classified according to both their sex and their opinion regarding Canada’s criminal courts at sentencing.
• Each cell of the table gives the count for a combination of values of the two variables.
Example: 1664 is the number of respondents who are male and think that the Canadian criminal courts are doing a good job at sentencing.
Contingency Table of Opinion Regarding Criminal Court at Sentencing and Sex of the Respondents
Sex      Good   Average   Poor   Total
Male     1664   2237      705    4606
Female   1479   2794      810    5083
Total    3143   5031      1515   9689
Read Data in R
Bar Plot of Sentencing Opinion by Sex
• Women are more likely than men to think that the criminal court is doing an average job at sentencing.
• There is not much of a difference between the sexes in the likelihood of thinking that the criminal court is doing a poor job at sentencing.
Finding Marginal Distribution
The margins of the table, on the right and at the bottom, give the totals.
Marginal Distribution of Sex:
• The percentage of respondents who are male:
(4606/9689) = 0.4754 x 100 ≅ 47.54%
• The percentage of respondents who are female:
(5083/9689) = 0.5246 x 100 ≅ 52.46%
Note: the two proportions add up to 1.
Marginal Distribution of Opinion about Criminal Court:
• The percentage of respondents who think that Canadian criminal courts are doing a good job at sentencing:
(3143/9689) = 0.3244 x 100 ≅ 32.44%
• The percentage of respondents who think that Canadian criminal courts are doing an average job at sentencing:
(5031/9689) = 0.5192 x 100 ≅ 51.92%
• The percentage of respondents who think that Canadian criminal courts are doing a poor job at sentencing:
(1515/9689) = 0.1564 x 100 ≅ 15.64%
Note: the three proportions add up to 1.
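The marginal calculations can be reproduced with a short sketch (Python, used here only for illustration; the variable names are my own). The cell counts are the ones in the contingency table:

```python
# Cell counts from the contingency table (rows: sex, columns: opinion).
table = {
    "Male":   {"Good": 1664, "Average": 2237, "Poor": 705},
    "Female": {"Good": 1479, "Average": 2794, "Poor": 810},
}
opinions = ["Good", "Average", "Poor"]

# Margins: row totals, column totals, and the grand total.
row_totals = {sex: sum(cells.values()) for sex, cells in table.items()}
col_totals = {op: sum(table[sex][op] for sex in table) for op in opinions}
grand_total = sum(row_totals.values())

# Marginal distributions as proportions of the grand total.
marginal_sex = {sex: n / grand_total for sex, n in row_totals.items()}
marginal_opinion = {op: n / grand_total for op, n in col_totals.items()}

for name, p in {**marginal_sex, **marginal_opinion}.items():
    print(f"{name}: {100 * p:.2f}%")
```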
Finding Joint Distribution: Overall Percentages
The joint distribution of two variables gives the proportion of all cases that belong to each combination of row and column categories.
Example: The percentage of respondents who are male and think that Canadian criminal courts are doing a good job at sentencing: (1664/9689) = 0.1717 x 100 ≅ 17.17%
Note: the six proportions add up to 1.
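As a sketch of the joint distribution (Python, for illustration only; the names are my own), each of the six cell counts is divided by the grand total:

```python
# Cell counts from the contingency table (rows: sex, columns: opinion).
table = {
    "Male":   {"Good": 1664, "Average": 2237, "Poor": 705},
    "Female": {"Good": 1479, "Average": 2794, "Poor": 810},
}
grand_total = sum(sum(cells.values()) for cells in table.values())

# Joint distribution: each cell count divided by the grand total.
joint = {
    (sex, opinion): count / grand_total
    for sex, cells in table.items()
    for opinion, count in cells.items()
}

for (sex, opinion), p in joint.items():
    print(f"{sex} and {opinion}: {100 * p:.2f}%")

# Every respondent falls in exactly one cell, so the proportions sum to 1.
total_proportion = sum(joint.values())
```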
Finding Conditional Distributions: Column Percentages
Describe the conditional distribution of sex for those who think that the criminal court is doing a good job at sentencing.
• The percentage of males among those who think that Canadian criminal courts are doing a good job at sentencing:
(1664/3143) = 0.5294 x 100 ≅ 52.94%
• The percentage of females among those who think that Canadian criminal courts are doing a good job at sentencing:
(1479/3143) = 0.4706 x 100 ≅ 47.06%
Note: the column proportions add up to 1.
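Column percentages condition on one column of the table. A minimal sketch (Python, for illustration only; `column_distribution` is a hypothetical helper name):

```python
# Cell counts from the contingency table (rows: sex, columns: opinion).
table = {
    "Male":   {"Good": 1664, "Average": 2237, "Poor": 705},
    "Female": {"Good": 1479, "Average": 2794, "Poor": 810},
}

def column_distribution(table, opinion):
    """Distribution of sex among respondents who gave the specified opinion."""
    column_total = sum(cells[opinion] for cells in table.values())
    return {sex: cells[opinion] / column_total for sex, cells in table.items()}

good = column_distribution(table, "Good")
for sex, p in good.items():
    print(f"{sex} among 'Good': {100 * p:.2f}%")
```

The denominator is the column total (3143), not the grand total, which is what distinguishes a conditional distribution from a joint one.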
Looking for Associations Between Two Variables
1. Describe the conditional distribution of opinion about criminal court at sentencing for males (Row Percentages).
The percentage of male respondents who think that the Canadian criminal courts are doing:
• a good job at sentencing is (1664/4606) = 0.3613 x 100 ≅ 36.13%
• an average job at sentencing is (2237/4606) = 0.4857 x 100 ≅ 48.57%
• a poor job at sentencing is (705/4606) = 0.1531 x 100 ≅ 15.31%
Note: the row proportions add up to 1.
2. Describe the conditional distribution of opinion about criminal court at sentencing for females (Row Percentages).
The percentage of female respondents who think that the Canadian criminal courts are doing:
• a good job at sentencing is (1479/5083) = 0.2910 x 100 ≅ 29.10%
• an average job at sentencing is (2794/5083) = 0.5497 x 100 ≅ 54.97%
• a poor job at sentencing is (810/5083) = 0.1594 x 100 ≅ 15.94%
Compare Row Percentages: Associations Between Two Variables
• Women (54.97%) are more likely than men (48.57%) to think that the criminal court is doing an average job at sentencing.
• There is not much of a difference between the sexes in the likelihood of thinking that the criminal court is doing a poor job at sentencing.
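The row-percentage comparison can be sketched as follows (Python, for illustration only; `row_distribution` is a hypothetical helper name). It computes the conditional distribution of opinion within each sex and prints the female-minus-male gap in percentage points:

```python
# Cell counts from the contingency table (rows: sex, columns: opinion).
table = {
    "Male":   {"Good": 1664, "Average": 2237, "Poor": 705},
    "Female": {"Good": 1479, "Average": 2794, "Poor": 810},
}

def row_distribution(cells):
    """Distribution of opinion within one row (one sex)."""
    row_total = sum(cells.values())
    return {opinion: count / row_total for opinion, count in cells.items()}

row_pct = {sex: row_distribution(cells) for sex, cells in table.items()}

# Difference (female minus male) in percentage points for each opinion.
for opinion in ["Good", "Average", "Poor"]:
    male = 100 * row_pct["Male"][opinion]
    female = 100 * row_pct["Female"][opinion]
    print(f"{opinion}: male {male:.2f}%, female {female:.2f}%, gap {female - male:+.2f}")
```

Large gaps between the rows suggest an association between sex and opinion; gaps near zero, as for the "Poor" category, suggest little difference.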
Side-by-side Bar Chart of Sentencing Opinion by Sex
Another Example: Distributions of Perceived Health by Sex
Source: Canadian Community Health Survey (CCHS, 2012)
• There were 1500 respondents.
• 832 of the respondents were female and 668 were male.
• The most common response among both females (315) and males (240) was “3 = Very Good”.
• There is not much difference in reported perceived health between males and females.
Variable(s): 2
Variable Names: Perceived Health, Sex
Variable Type:
• Perceived Health: Ordinal Categorical Variable
0 = Poor, 1 = Fair, 2 = Good, 3 = Very Good, 4 = Excellent
• Sex: Nominal Categorical Variable
Male, Female
Distributions of Perceived Health by Sex (CCHS, 2012)
Possible Lack of Association Between Two Variables
• There is not much difference in reported perceived health between males and females.
• There is no apparent association between reported perceived health and the sex of the subjects.
• Reported perceived health may be independent of (may not depend on) the sex of the subject.
• That is, the sex of the respondents does not appear to explain reported perceived health.
Distributions of Perceived Health (CCHS, 2012)
The largest share of the respondents (37%) perceived their health as very good.
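A quick check of the "Very Good" figures, using only the counts quoted in this example (Python, for illustration only; the variable names are my own). It compares the proportion within each sex with the overall proportion:

```python
# Counts reported above for the "3 = Very Good" category, with the group sizes.
very_good = {"Female": 315, "Male": 240}
group_size = {"Female": 832, "Male": 668}

# Conditional proportion of "Very Good" within each sex.
within_sex = {sex: very_good[sex] / group_size[sex] for sex in very_good}

# Overall (marginal) proportion of "Very Good" among all respondents.
overall = sum(very_good.values()) / sum(group_size.values())

for sex, p in within_sex.items():
    print(f"{sex}: {100 * p:.1f}%")
print(f"Overall: {100 * overall:.1f}%")
```

The two within-sex proportions sit close to each other and to the overall proportion, which is the pattern expected when the variables show little association.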
Exploring Relationships Between Two Categorical Variables
• Use either row or column percentages to compare the distributions.
• That is, find the conditional distribution of one variable within each level of another variable.
• When the distribution of one variable differs across the categories of another variable, we say that the variables are dependent (the variables are associated; the variables are related).
• When the distribution of one variable is the same for all categories of another variable, we say that the variables are independent (the variables are not associated; the variables are not related).
• Note that the points made above are an informal method of comparing distributions. In STA221, we will see a formal way of checking for independence (Test of Hypothesis regarding the independence of two variables).
Nice to meet you and see you soon ☺
Please bring your laptop to the next class for exploring RStudio.