Statistics 202: Statistical Aspects of Data Mining
Professor David Mease
Tuesday, Thursday 9:00-10:15 AM Terman 156
Lecture 5 = More of chapter 3
Agenda:
1) Announce TA office hours
2) Assign chapter 3 homework
3) Lecture over more of chapter 3 (section 3.3)
Announcement:
TA office hours for (almost) the entire semester are posted at
www.stats202.com/ta.html
which is now linked from www.stats202.com/course_info.html
which is linked from
www.stats202.com
under “Course Information”
Homework Assignment:
Chapter 3 Homework Part 1 is due Tuesday 7/17
Either email to me ([email protected]), bring it to class, or put it under my office door.
SCPD students may use email or fax or mail.
The assignment is posted at
http://www.stats202.com/homework.html
Introduction to Data Mining
by Tan, Steinbach, Kumar
Chapter 3: Exploring Data
Exploring Data
We can explore data visually (using tables or graphs) or numerically (using summary statistics)
Section 3.2 deals with summary statistics
Section 3.3 deals with visualization
We will begin with visualization
Note that many of the techniques you use to explore data are also useful for presenting data
Visualization
Page 105:
“Data visualization is the display of information in a graphical or tabular format.
Successful visualization requires that the data (information) be converted into a visual format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
The goal of visualization is the interpretation of the visualized information by a person and the formation of a mental model of the information.”
Example: Below are exam scores from a course I taught once.
Describe this data.
192 160 183 136 162
165 181 188 150 163 192
164 184 189 183 181 188
191 190 184 171 177 125
192 149 188 154 151 159
141 171 153 169 168 168
157 160 190 166 150
Note, this data is at www.stats202.com/exam_scores.csv
The Histogram
Histogram (Page 111):
“A plot that displays the distribution of values for attributes by dividing the possible values into bins and showing the number of objects that fall into each bin.”
Page 112 – “A Relative frequency histogram replaces the count by the relative frequency”. These are useful for comparing multiple groups of different sizes.
The corresponding table is often called the frequency distribution (or relative frequency distribution).
The function “hist” in R is useful.
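The frequency distribution and relative frequency distribution described above can also be tabulated directly. The following is a minimal sketch, assuming the first exam's scores are typed in by hand; the names `exam1`, `bins`, and `freq_table` and the use of `cut()` and `table()` are illustrative choices, not part of the lecture.

```r
# Exam 1 scores copied from the example above (40 values).
exam1 <- c(192, 160, 183, 136, 162,
           165, 181, 188, 150, 163, 192,
           164, 184, 189, 183, 181, 188,
           191, 190, 184, 171, 177, 125,
           192, 149, 188, 154, 151, 159,
           141, 171, 153, 169, 168, 168,
           157, 160, 190, 166, 150)

# Assign each score to a bin of width 10 between 120 and 200.
# cut() uses right-closed intervals such as (120,130] by default.
bins <- cut(exam1, breaks = seq(120, 200, by = 10))

# Frequency distribution: the number of scores in each bin.
freq <- table(bins)

# Relative frequency distribution: counts divided by the number of scores.
rel_freq <- freq / length(exam1)

freq_table <- data.frame(bin = names(freq),
                         frequency = as.vector(freq),
                         relative_frequency = as.vector(rel_freq))
print(freq_table)
```

Dividing by the group size is what makes groups of different sizes comparable: the relative frequencies of each group sum to 1 regardless of how many scores it contains.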
In class exercise #7: Make a frequency histogram in R for the exam scores using bins of width 10 beginning at 120 and ending at 200.
Answer:
> exam_scores<- read.csv("exam_scores.csv",header=F)
> hist(exam_scores[,1],breaks=seq(120,200,by=10), col="red", xlab="Exam Scores", ylab="Frequency", main="Exam Score Histogram")
[Figure: histogram titled "Exam Score Histogram"; x-axis "Exam Scores" with ticks from 120 to 200, y-axis "Frequency" with ticks from 0 to 12]
The (Relative) Frequency Polygon
Sometimes it is more useful to display the information in a histogram using points connected by lines instead of solid bars.
Such a plot is called a (relative) frequency polygon.
This is not in the book.
The points are placed at the midpoints of the histogram bins and two extra bins with a count of zero are often included at either end for completeness.
In class exercise #8: Make a frequency polygon in R for the exam scores using bins of width 10 beginning at 120 and ending at 200.
Answer:
> my_hist<-hist(exam_scores[,1], breaks=seq(120,200,by=10),plot=FALSE)
> counts<-my_hist$counts
> breaks<-my_hist$breaks
> plot(c(115,breaks+5), c(0,counts,0), pch=19, xlab="Exam Scores", ylab="Frequency",main="Frequency Polygon")
> lines(c(115,breaks+5),c(0,counts,0))
[Figure: plot titled "Frequency Polygon"; x-axis "Exam Scores" with ticks from 120 to 200, y-axis "Frequency" with ticks from 0 to 12]
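As a possible alternative to shifting the breaks by 5, the bin midpoints can be read from the `mids` component that `hist()` returns. This is only a sketch under that assumption; the names `exam1`, `h`, and `bin_width` are illustrative, and the scores are typed in by hand so the block runs on its own.

```r
# Exam 1 scores copied from the example data (40 values).
exam1 <- c(192, 160, 183, 136, 162,
           165, 181, 188, 150, 163, 192,
           164, 184, 189, 183, 181, 188,
           191, 190, 184, 171, 177, 125,
           192, 149, 188, 154, 151, 159,
           141, 171, 153, 169, 168, 168,
           157, 160, 190, 166, 150)

# Compute the histogram without drawing it.
h <- hist(exam1, breaks = seq(120, 200, by = 10), plot = FALSE)

# Place points at the bin midpoints and add one empty bin on each side.
bin_width <- 10
x <- c(h$mids[1] - bin_width, h$mids, h$mids[length(h$mids)] + bin_width)
y <- c(0, h$counts, 0)

# type = "o" draws the points and the connecting lines in one call.
plot(x, y, type = "o", pch = 19,
     xlab = "Exam Scores", ylab = "Frequency", main = "Frequency Polygon")
```

Using `h$mids` avoids hard-coding the offsets 115 and 5, so the same lines still work if the bin width or range changes (after updating `bin_width`).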
The Empirical Cumulative Distribution Function (Page 115)
“A cumulative distribution function (CDF) shows the probability that a point is less than a value.”
“For each observed value, an empirical cumulative distribution function (ECDF) shows the fraction of points that are less than this value.” (Page 116)
A plot of the ECDF is sometimes called an ogive.
The function “ecdf” in R is useful. Its plotting features are poorly documented in help(ecdf), but many examples are given there.
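To make the definition concrete, here is a small sketch showing that `ecdf()` returns a function that can be evaluated at any score. Note that R's `ecdf()` counts points less than or equal to the value, which differs slightly from the "less than" wording quoted above. The names `exam1`, `Fn`, `at_150`, and `by_hand` are illustrative.

```r
# Exam 1 scores copied from the example data (40 values).
exam1 <- c(192, 160, 183, 136, 162,
           165, 181, 188, 150, 163, 192,
           164, 184, 189, 183, 181, 188,
           191, 190, 184, 171, 177, 125,
           192, 149, 188, 154, 151, 159,
           141, 171, 153, 169, 168, 168,
           157, 160, 190, 166, 150)

# ecdf() returns a step function, not a vector of numbers.
Fn <- ecdf(exam1)

# Fraction of scores less than or equal to 150, computed two ways.
at_150 <- Fn(150)
by_hand <- mean(exam1 <= 150)

print(c(ecdf = at_150, by_hand = by_hand))
```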
In class exercise #9: Make a plot of the ECDF for the exam scores using the function “ecdf” in R.
Answer:
> plot(ecdf(exam_scores[,1]), verticals= TRUE, do.p = FALSE, main ="ECDF for Exam Scores", xlab="Exam Scores", ylab="Cumulative Percent")
[Figure: step plot titled "ECDF for Exam Scores"; x-axis "Exam Scores" with ticks from 120 to 200, y-axis "Cumulative Percent" with ticks from 0.0 to 1.0]
Comparing Multiple Distributions
If there is a second exam also scored out of 200 points, how will I compare the distribution of these scores to the previous exam scores?
187 143 180 100 180
159 162 146 159 173
151 165 184 170 176
163 185 175 171 163
170 102 184 181 145
154 110 165 140 153
182 154 150 152 185
140 132
Note, this data is at www.stats202.com/more_exam_scores.csv
Histograms can be used, but only if they are relative frequency histograms.
Relative Frequency Polygons are even better. You can use a different color/type line for each group and add a legend.
Plots of the ECDF are often even more useful, since they can compare all the percentiles simultaneously. These can also use different color/type lines for each group with a legend.
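Since an ECDF plot is effectively a picture of all the percentiles at once, a few of them can also be compared numerically. The following is a rough sketch, assuming both sets of scores are typed in by hand; the choice of percentiles and the names `exam1`, `exam2`, `probs`, and `comparison` are illustrative only.

```r
# Exam 1 scores (40 values) and Exam 2 scores (37 values) from the slides.
exam1 <- c(192, 160, 183, 136, 162,
           165, 181, 188, 150, 163, 192,
           164, 184, 189, 183, 181, 188,
           191, 190, 184, 171, 177, 125,
           192, 149, 188, 154, 151, 159,
           141, 171, 153, 169, 168, 168,
           157, 160, 190, 166, 150)
exam2 <- c(187, 143, 180, 100, 180,
           159, 162, 146, 159, 173,
           151, 165, 184, 170, 176,
           163, 185, 175, 171, 163,
           170, 102, 184, 181, 145,
           154, 110, 165, 140, 153,
           182, 154, 150, 152, 185,
           140, 132)

# Compare a handful of percentiles side by side.
probs <- c(0.10, 0.25, 0.50, 0.75, 0.90)
comparison <- data.frame(percentile = probs * 100,
                         exam1 = as.vector(quantile(exam1, probs)),
                         exam2 = as.vector(quantile(exam2, probs)))
print(comparison)
```

A table like this covers only the chosen percentiles; the ECDF plot shows the same comparison for every percentile in a single graph.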
In class exercise #10: Plot the relative frequency polygons for both the first and second exams on the same graph. Provide a legend.
Answer:
> more_exam_scores<- read.csv("more_exam_scores.csv",header=F)
> my_new_hist<- hist(more_exam_scores[,1], breaks=seq(100,200,by=10),plot=FALSE)
> new_counts<-my_new_hist$counts
> new_breaks<-my_new_hist$breaks
> plot(c(95,new_breaks+5),c(0,new_counts/37,0), pch=19,xlab="Exam Scores", ylab="Relative Frequency",main="Relative Frequency Polygons",ylim=c(0,.30))
> lines(c(95,new_breaks+5),c(0,new_counts/37,0), lty=2)
> points(c(115,breaks+5),c(0,counts/40,0), col="blue",pch=19)
> lines(c(115,breaks+5),c(0,counts/40,0), col="blue",lty=1)
> legend(110,.25,c("Exam 2","Exam 1"), col=c("black","blue"),lty=c(2,1),pch=19)
[Figure: plot titled "Relative Frequency Polygons"; x-axis "Exam Scores" with ticks from 100 to 200, y-axis "Relative Frequency" with ticks from 0.00 to 0.30; legend: Exam 2, Exam 1]
In class exercise #11: Plot the ECDF for both the first and second exams on the same graph. Provide a legend.
Answer:
> plot(ecdf(exam_scores[,1]), verticals= TRUE,do.p = FALSE, main ="ECDF for Exam Scores", xlab="Exam Scores", ylab="Cumulative Percent", xlim=c(100,200))
> lines(ecdf(more_exam_scores[,1]), verticals= TRUE,do.p = FALSE, col.h="red",col.v="red",lwd=4)
> legend(110,.6,c("Exam 1","Exam 2"), col=c("black","red"),lwd=c(1,4))
[Figure: step plot titled "ECDF for Exam Scores"; x-axis "Exam Scores" with ticks from 100 to 200, y-axis "Cumulative Percent" with ticks from 0.0 to 1.0; legend: Exam 1, Exam 2]
In class exercise #12: Based on the plot of the ECDF for both the first and second exams from the previous exercise, which exam has lower scores in general? How can you tell from the plot?
Visualizing Paired Numeric Data
The two sets of exam scores in the previous exercise were not paired. However, the data at www.stats202.com/exams_and_names.csv contains the same exam scores along with an identifier of the student. This data is paired.
For visualizing paired numeric data, scatter plots (Page 116) are extremely useful. These can be produced using the plot() command in R.
When the data set has two or more numeric attributes, examining scatter plots of all possible pairs is often useful. The function pairs() in R does this for you. The book calls this a scatter plot matrix (Page 116).
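As a quick illustration of `pairs()`, here is a minimal sketch using R's built-in `iris` data set, which is an assumption made for the example rather than data from this lecture; the name `numeric_columns` is illustrative.

```r
# iris ships with R; its first four columns are numeric measurements.
numeric_columns <- iris[, 1:4]

# One scatter plot for every pair of numeric attributes.
pairs(numeric_columns, pch = 19, main = "Scatter Plot Matrix")
```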
In class exercise #13: Use R to make a scatter plot of the exam scores at www.stats202.com/exams_and_names.csv with the first exam on the x-axis and the second exam on the y-axis. Scale the x-axis and y-axis both from 100 to 200. Add the diagonal line (y=x) to the plot. What does this plot reveal?
Answer:
data<-read.csv("exams_and_names.csv")
plot(data$Exam.1,data$Exam.2,xlim=c(100,200),ylim=c(100,200),pch=19,main="Exam Scores",xlab="Exam 1",ylab="Exam 2")
abline(c(0,1))
[Figure: scatter plot titled "Exam Scores"; x-axis "Exam 1" with ticks from 100 to 200, y-axis "Exam 2" with ticks from 100 to 200]
Labeling Points on a Scatter Plot
The R commands text() and identify() are useful for labeling points on the scatter plot.
In class exercise #14: Use the text() command in R to label the points for the students who scored lower than 150 on the first exam. Use the identify command to label the points for the two students who did better on the second exam than the first exam. Use the first column in the data set for the labels.
Answer:
text(data$Exam.1[data$Exam.1<150], data$Exam.2[data$Exam.1<150],labels=data$Student[data$Exam.1<150],adj=1)
identify(data$Exam.1,data$Exam.2, labels=data$Student)
[Figure: scatter plot titled "Exam Scores"; x-axis "Exam 1" with ticks from 100 to 200, y-axis "Exam 2" with ticks from 100 to 200; labeled points: Student #4, Student #23, Student #30, Student #5, Student #34]
Adding Noise to a Scatter Plot
When both variables are discrete, many points in a scatter plot may be plotted on top of one another, which hides how many observations share each location and can obscure the relationship.
A solution is to add a small amount of noise to the points so that they are jittered a little bit.
Note: If you have too many points to display cleanly on a scatter plot, sampling may also be helpful.
In class exercise #15: Add noise uniformly distributed on the interval -0.5 to 0.5 to both the x and y values in the graph in the previous exercise.
Answer:
data$Exam.1<-data$Exam.1+runif(40)-.5
data$Exam.2<-data$Exam.2+runif(40)-.5
(then same as before)
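The same idea can be sketched without hard-coding the number of rows. The data below are hypothetical discrete values invented for illustration (not the exam scores), and the names `x`, `y`, `x_noisy`, and `y_noisy` are assumptions for this example.

```r
set.seed(202)  # only so that the sketch is reproducible

# Hypothetical discrete data with many overlapping points.
x <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
y <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3)

# Add uniform noise on the interval -0.5 to 0.5 to each coordinate.
n <- length(x)
x_noisy <- x + runif(n, min = -0.5, max = 0.5)
y_noisy <- y + runif(n, min = -0.5, max = 0.5)

plot(x_noisy, y_noisy, pch = 19,
     xlab = "x (jittered)", ylab = "y (jittered)", main = "Jittered Scatter Plot")
```

Base R also provides `jitter()`; calling `jitter(x, amount = 0.5)` adds uniform noise on the same interval.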
[Figure: jittered scatter plot titled "Exam Scores"; x-axis "Exam 1" with ticks from 100 to 200, y-axis "Exam 2" with ticks from 100 to 200; labeled points: Student #4, Student #23, Student #30, Student #40, Student #5, Student #34]
Boxplots (Pages 114-115)
Invented by J. Tukey
A simple summary of the distribution of the data
Boxplots are useful for comparing distributions of multiple attributes or the same attribute for different groups
[Figure: boxplot annotated with the 10th, 25th, 50th, 75th, and 90th percentiles and an outlier]
Boxplots in R
The function boxplot() in R plots boxplots
By default, boxplot() in R extends the whiskers to the most extreme data points that are not outliers (the maximum and the minimum, if they are not outliers) instead of to the 10th and 90th percentiles as the book describes
[Figure: annotated boxplot showing an outlier and the 25th, 50th, and 75th percentiles, with the whisker ends labeled both as the 10th and 90th percentiles and as Minimum and Maximum]
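The difference between the two whisker conventions can be checked numerically. This is a minimal sketch, assuming the second exam's scores are typed in by hand; `boxplot.stats()` returns the five numbers that `boxplot()` draws by default, and the names `exam2`, `s`, `whiskers_r`, and `whiskers_book` are illustrative.

```r
# Exam 2 scores copied from the slides (37 values).
exam2 <- c(187, 143, 180, 100, 180,
           159, 162, 146, 159, 173,
           151, 165, 184, 170, 176,
           163, 185, 175, 171, 163,
           170, 102, 184, 181, 145,
           154, 110, 165, 140, 153,
           182, 154, 150, 152, 185,
           140, 132)

# stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker.
# out holds the points drawn individually as outliers.
s <- boxplot.stats(exam2)

whiskers_r <- s$stats[c(1, 5)]                    # R's default whisker ends
whiskers_book <- quantile(exam2, c(0.10, 0.90))   # the book's whisker ends

print(whiskers_r)
print(whiskers_book)
print(s$out)
```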
Boxplots (Pages 114-115)
Boxplots help you visualize the differences in the medians of multiple attributes relative to the variation
Example: The median value of Attribute A was 2.0 for men and 4.1 for women. Is this a “big” difference?
Maybe yes:
[Figure: boxplots of Attribute A for Men and Women; y-axis with ticks from 1 to 5]
Maybe no:
[Figure: boxplots of Attribute A for Men and Women; y-axis with ticks from -20 to 30]
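The two situations can be imitated with made-up numbers. The sketch below uses hypothetical data constructed so that both cases have medians of exactly 2.0 and 4.1 but very different spreads; none of these values come from the lecture, and all variable names are illustrative.

```r
# Symmetric offsets around zero, so adding them leaves the median unchanged.
offsets <- (-10:10) / 10

# Small spread: the groups barely overlap, so the difference looks big.
men_small   <- 2.0 + 0.5 * offsets
women_small <- 4.1 + 0.5 * offsets

# Large spread: the same medians, but the groups overlap almost entirely.
men_large   <- 2.0 + 20 * offsets
women_large <- 4.1 + 20 * offsets

# Draw the two comparisons side by side, then restore the plot settings.
old_par <- par(mfrow = c(1, 2))
boxplot(list(Men = men_small, Women = women_small),
        ylab = "Attribute A", main = "Maybe yes")
boxplot(list(Men = men_large, Women = women_large),
        ylab = "Attribute A", main = "Maybe no")
par(old_par)
```

The medians are identical in both panels; only the variation differs, which is why the difference has to be judged relative to the spread.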
In class exercise #16: Use boxplot() in R to make boxplots comparing the first and second exam scores in the data at www.stats202.com/exams_and_names.csv
Answer:
data<-read.csv("exams_and_names.csv")
boxplot(data[,2],data[,3],col="blue",main="Exam Scores",names=c("Exam 1","Exam 2"),ylab="Exam Score")
[Figure: boxplots titled "Exam Scores" for Exam 1 and Exam 2; y-axis "Exam Score" with ticks from 100 to 180]
Visualization in Excel
Up until now, we have done all the visualization in R
Excel also can make many different types of graphs. They are found under the “Insert” menu by selecting “Chart”
When using Excel to make graphs which anyone will see other than yourself, I strongly encourage you to change defaults such as the grey background.
Excel also has a nice tool for making tables and associated graphs called “PivotTable and PivotChart Report” under the “Data” menu.
In class exercise #17: Use “Insert” > “Chart” > “XY Scatter” to make a scatter plot of the exam scores at www.stats202.com/exams_and_names.csv
Put Exam 1 on the X axis and Exam 2 on the Y axis.
Answer:
[Figure: Excel XY scatter chart of Exam 2 versus Exam 1]
In class exercise #18: The data www.stats202.com/more_stats202_logs.txt contains access logs from May 7, 2007 to July 1, 2007. Use “Data” > “PivotTable and PivotChart Report” in Excel to make a table with the counts of
GET /lecture2=start-chapter-2.ppt HTTP/1.1
and
GET /lecture2=start-chapter-2.pdf HTTP/1.1
for each date. Which is more popular?
Answer:
| Date | GET /lecture2=start-chapter-2.pdf HTTP/1.1 | GET /lecture2=start-chapter-2.ppt HTTP/1.1 | Grand Total |
|---|---|---|---|
| 27-Jun-07 | 150 | 17 | 167 |
| 28-Jun-07 | 247 | 29 | 276 |
| 29-Jun-07 | 253 | 53 | 306 |
| 30-Jun-07 | 77 | 9 | 86 |
| 1-Jul-07 | 50 | 7 | 57 |
| Grand Total | 777 | 115 | 892 |
In class exercise #19: The data www.stats202.com/more_stats202_logs.txt contains access logs from May 7, 2007 to July 1, 2007. Use “Data” > “PivotTable and PivotChart Report” in Excel to make a table with the counts of the rows for each date in May.
Answer:
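| Date | Count |
|---|---|
| May-7 | 88 |
| May-8 | 88 |
| May-9 | 65 |
| May-10 | 179 |
| May-11 | 47 |
| May-12 | 67 |
| May-13 | 47 |
| May-14 | 59 |
| May-15 | 58 |
| May-16 | 107 |
| May-17 | 64 |
| May-18 | 93 |
| May-19 | 66 |
| May-20 | 104 |
| May-21 | 123 |
| May-22 | 75 |
| May-23 | 85 |
| May-24 | 81 |
| May-25 | 49 |
| May-26 | 60 |
| May-27 | 78 |
| May-28 | 66 |
| May-29 | 64 |
| May-30 | 69 |
| May-31 | 46 |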
In class exercise #20: Use “Insert” > “Chart” > “Line” in Excel to make a graph of the number of rows versus the date for the previous exercise.
Answer:
[Figure: Excel line chart titled "Stats 202 Logs"; x-axis "Date" from May-7 to May-31, y-axis "Access Count" with ticks from 0 to 200]