chapter 2 slides 2015


Upload: akmal-farhan

Post on 25-Dec-2015


DESCRIPTION

slides mech 305

TRANSCRIPT

Page 1: Chapter 2 Slides 2015

Chapter 2 Data Summary and Presentation

► Items to be covered in this presentation include:

Review basic statistics

Discuss samples and populations

Discuss presentation techniques

► Statistics are a foundation for engineering sciences.

► It “quantifies” the likelihood of an occurrence based on a summary of the past.

► “Statistics is the taming of randomness”

► Engineers need to understand the fundamentals of statistics.

Page 2: Chapter 2 Slides 2015

Statistical Analysis of Experimental Data

► Through statistics we want to quantify three points:

A single representative value for the data.

A value that represents the variation or spread in the data.

An interval within which the true value is expected to lie.

► Mean

► Median

► Mode

► Deviation

► Variance

► Standard Deviation

Page 3: Chapter 2 Slides 2015

Sample Statistics

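► The variance of a sample is given by:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$

► The variance for a population is:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

► The standard deviation = square root of the variance.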
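The two divisors can be compared numerically. The following is a minimal sketch, not part of the original slides; the function names and the data values are hypothetical and chosen only for illustration.

```python
import math

def sample_variance(data):
    """Sample variance s^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def population_variance(data):
    """Population variance sigma^2: squared deviations from the mean, divided by N."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

# Hypothetical measurements (assumption, not from the slides).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s2 = sample_variance(values)          # divisor n - 1 = 7
sigma2 = population_variance(values)  # divisor N = 8
s = math.sqrt(s2)                     # sample standard deviation

print(f"s^2 = {s2:.4f}, sigma^2 = {sigma2:.4f}, s = {s:.4f}")
```

For the same data the n - 1 divisor always gives the larger value, and the difference shrinks as n grows.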

Page 4: Chapter 2 Slides 2015

“n” or “n-1”

► In general, whenever a simple random sample is used, one needs to increase the size of the variance to compensate for the fact that the sample mean $\bar{x}$ does not exactly equal the true mean μ; hence n-1 is used as the divisor rather than n.

►The term degrees of freedom is also used to describe n-1.

►Recall that the sum of the deviations is always zero, so if one knows all the deviations ($x_i - \bar{x}$) except for one, the unknown one can easily be solved for. Only n-1 of the deviations are free to vary.

Page 5: Chapter 2 Slides 2015

Median

►The median is another central measure; to find it:

Arrange all the sample values in order (lowest to highest or highest to lowest).

If n is odd, the median is the value in the middle position ((n+1)/2).

If n is even, the sample median is the average of the values in positions (n/2) and ((n/2)+1).

►The determination of the median for a population is identical.
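The position rule for the median can be sketched as follows. This is an illustrative helper (the name `median_by_position` is hypothetical); for even n it averages the two middle values, at 1-based positions n/2 and n/2 + 1.

```python
def median_by_position(data):
    """Median found by sorting and picking the middle position(s)."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: average the values at 1-based positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_position([7, 1, 5]))     # odd n
print(median_by_position([7, 1, 5, 3]))  # even n
```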

Page 6: Chapter 2 Slides 2015

Mode and Outliers

►The mode is the value that occurs most frequently. There may be more than one mode in a data set.

►The range = maximum value – minimum value.

►Outliers are values that are well outside the other values in the data set. Some are legitimate values, but some may be due to error in measurement or recording.

Note: Summary statistics that use arithmetic methods and involve the entire data set are always affected by outliers; e.g. the mean and standard deviation. The median is affected very little by outliers.
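A short sketch of mode, range, and outlier sensitivity, using Python's standard `statistics` module. The data set and the added outlier value of 90 are hypothetical, chosen only to make the effect visible.

```python
import statistics

# Hypothetical data set with two modes (assumption, not from the slides).
data = [12, 13, 13, 14, 15, 15, 16]

modes = statistics.multimode(data)   # every value tied for most frequent
data_range = max(data) - min(data)   # maximum value minus minimum value

# Add one extreme value and see how far each central measure moves.
with_outlier = data + [90]
mean_shift = statistics.mean(with_outlier) - statistics.mean(data)
median_shift = statistics.median(with_outlier) - statistics.median(data)

print("modes:", modes, "range:", data_range)
print("mean shift:", mean_shift, "median shift:", median_shift)
```

The single extreme value moves the mean by several units while the median moves only slightly.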

Page 7: Chapter 2 Slides 2015

Quartiles

► The trimmed mean reduces the effect of outliers by calculating the mean of the data set when p% of the values at each end (highest and lowest) are trimmed.

► Quartiles divide the ranked data set into four equal-size groups, as closely as possible.

Arrange the data set in order.

First quartile (Q1): the value at position (n + 1)/4.

Third quartile (Q3): the value at position 3(n + 1)/4.

Q2, the value at position (n + 1)/2, is the median.

Definitions of quartiles vary. We will use these ones.

Quartiles are quite insensitive to outliers.

► Interquartile Range (IQR) is the difference between the upper and lower quartiles (IQR = Q3 – Q1), i.e. the range spanned by the middle 50% of the data. It highlights the variability in the data.
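The trimmed mean and the quartile positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4 can be sketched as below. The slides do not say what to do when a position is not a whole number or when p% of n is fractional; this sketch assumes linear interpolation between neighbouring values and rounding the trim count down. All names and data are hypothetical.

```python
def value_at_position(ordered, pos):
    """Value at 1-based position `pos` in sorted data.

    Fractional positions are linearly interpolated (an assumption).
    """
    if not 1 <= pos <= len(ordered):
        raise ValueError("position falls outside the data set")
    lower = int(pos)      # whole part of the position
    frac = pos - lower    # fractional part
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def quartiles(data):
    """Return (Q1, Q2, Q3) at positions k(n + 1)/4 for k = 1, 2, 3."""
    ordered = sorted(data)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

def trimmed_mean(data, p):
    """Mean after dropping p% of the values at each end (count rounded down)."""
    ordered = sorted(data)
    k = int(len(ordered) * p / 100)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)

sample = [3, 5, 7, 8, 9, 11, 13, 15, 16, 18, 20]  # hypothetical, n = 11
q1, q2, q3 = quartiles(sample)
iqr = q3 - q1
print(q1, q2, q3, iqr)
print(trimmed_mean([1, 2, 3, 4, 100], 20))
```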

Page 8: Chapter 2 Slides 2015

Percentiles

►The pth percentile divides a data set so that p% of the values are less than, and (100 – p)% are greater than, it. pth percentile = value at position p(n + 1)/100. Other definitions exist; this one works well except at the extremes.

Q1 = 25th percentile; median = 50th percentile; Q3 = 75th percentile.

►For qualitative data, the summary statistics above cannot be calculated because the values are names or labels. The only meaningful statistics are the frequencies and relative frequencies of values or groups of values.
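The remark that the percentile position rule p(n + 1)/100 breaks down at the extremes can be illustrated by computing positions only. This is a small sketch with a hypothetical sample size of n = 20; a position is treated as usable when it lies between 1 and n.

```python
def percentile_position(p, n):
    """1-based position of the pth percentile in a sorted sample of size n."""
    return p * (n + 1) / 100

n = 20  # hypothetical sample size

positions = {p: percentile_position(p, n) for p in (1, 25, 50, 75, 99)}

# A position below 1 or above n points outside the sorted data.
usable = {p: 1 <= pos <= n for p, pos in positions.items()}

for p, pos in positions.items():
    print(f"p = {p:2d}: position {pos:5.2f}, usable: {usable[p]}")
```

With n = 20 the 1st and 99th percentiles fall before the first and after the last ordered value, so the rule gives no answer there.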

Page 9: Chapter 2 Slides 2015

Summary Statistics

►For a sample, the values calculated above are known as statistics. For a population, the values are called parameters.

►We want to know the parameters of a population but it is impractical/impossible to access the entire population.

►That is why we collect a sample and use the descriptive statistics above to provide a cursory assessment, or we use inferential statistical methods to make estimates, test theories, or formulate models of the parameters.

Page 10: Chapter 2 Slides 2015

Data Plots

► The most effective method for reviewing data is through graphical methods.

► Engineers are “visual” people and therefore require “charts” to be transformed to “graphs.”

Page 11: Chapter 2 Slides 2015

Stem and Leaf Diagram

► A Stem and leaf plot is used for exploring data but is not used for formal reporting.

► It allows for a quick review to determine the median and mode.

► The method involves the following procedure:

Page 12: Chapter 2 Slides 2015

Stem and Leaf Diagram (textbook)

Page 13: Chapter 2 Slides 2015

Stem and Leaf Diagram (textbook)

Page 14: Chapter 2 Slides 2015

Stem and Leaf Diagram (textbook)

Page 15: Chapter 2 Slides 2015

Histogram

►Histograms & frequency distributions

1. Choose boundary points for the class intervals (cells or bins). Usually, intervals are the same width. Class limits must not overlap.

2. Find the frequency (number of data values) in each interval.

3. Calculate the relative frequencies (number of data values in an interval ÷ total number of data values).

4. If the class intervals are the same width, draw rectangles with heights equal to the frequencies or relative frequencies.

5. If the class intervals are not equal in width, draw rectangles with areas that represent the frequencies or relative frequencies.
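The steps above can be sketched without a plotting library. The data and bin edges are hypothetical. The sketch assumes each bin includes its lower edge and excludes its upper edge, except the last bin, which includes both; the slides do not specify a boundary convention.

```python
def frequency_table(data, edges):
    """Return (counts, relative frequencies, bar heights) for the given bin edges.

    Bar height = relative frequency / bin width, so each bar's area
    equals its relative frequency even when widths differ.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for x in data:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= x < edges[i + 1] or (is_last and x == edges[-1]):
                counts[i] += 1
                break
    relative = [c / len(data) for c in counts]
    heights = [r / (edges[i + 1] - edges[i]) for i, r in enumerate(relative)]
    return counts, relative, heights

# Hypothetical data and unequal-width bins (assumptions for illustration).
data = [1, 2, 2, 3, 5, 6, 7, 7, 8, 10]
edges = [0, 2, 4, 12]

counts, relative, heights = frequency_table(data, edges)
print("counts:  ", counts)
print("relative:", relative)
print("heights: ", heights)
```

With equal-width bins the heights are proportional to the counts, which is why step 4 can use the frequencies directly.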

Page 16: Chapter 2 Slides 2015

Histogram (example)

Page 17: Chapter 2 Slides 2015

Skewed or Symmetric

►A histogram is perfectly symmetric if its right half is a mirror image of its left half.

►Histograms that are not symmetric are referred to as skewed.

►A histogram with a long right tail is said to be skewed to the right, or positively skewed.

►A histogram with a long left tail is said to be skewed to the left, or negatively skewed.

Page 18: Chapter 2 Slides 2015

Unimodal, bimodal, and multimodal data sets

►A data set that has a histogram with only one peak is called unimodal. Example?

► If a data set has a histogram with two peaks, we say that it is bimodal. Example?

► If there are more than two peaks in the histogram of a data set, it is said to be multimodal. Example?

Page 19: Chapter 2 Slides 2015

Box Plot

► The box plot is a graphical display that simultaneously describes several important features of a data set, such as center, spread, departure from symmetry, and identification of observations that lie unusually far from the bulk of the data.

► The basic boxplot presents the median, Q1, Q3, the maximum and the minimum values of the data set.

► The width of the box is the interquartile range (Q3-Q1). The median is marked by a line in the box.

► Draw lines (whiskers) from the box to the most extreme data values that still lie within 1.5 IQR of the box, i.e. inside the fences.

- Lower fence = Q1 – 1.5(Q3 – Q1). Upper fence = Q3 + 1.5(Q3 – Q1)

► Identify each value outside the fences separately; these are the outliers.
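The fence and outlier rule can be sketched as follows. The quartiles are passed in rather than computed, the function name is hypothetical, and the data (with Q1 = 7 and Q3 = 16 taken at positions (n + 1)/4 and 3(n + 1)/4 for n = 11) are made up for illustration.

```python
def box_plot_summary(data, q1, q3):
    """Fences, whisker ends, and outliers for a box plot, given Q1 and Q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # Whiskers end at the most extreme values that are still inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Anything beyond a fence is plotted individually as an outlier.
        "outliers": sorted(x for x in data if x < lower_fence or x > upper_fence),
    }

# Hypothetical data (assumption, not from the slides).
sample = [3, 5, 7, 8, 9, 11, 13, 15, 16, 18, 40]
summary = box_plot_summary(sample, q1=7, q3=16)
print(summary)
```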

Page 20: Chapter 2 Slides 2015

Box Plot

Page 21: Chapter 2 Slides 2015

Box Plot

Page 22: Chapter 2 Slides 2015

Box Plots

► Comparative (side-by-side) boxplots

When we want to compare samples from more than one data set, we plot the boxplots side-by-side using the same scale.

This allows us to compare how the distributions differ between the data sets.

► Boxplots can be plotted horizontally or vertically. It is usual for comparative boxplots to be plotted vertically.

► Histograms are often used in formal reports. Boxplots are also seen in formal reports, but stem and leaf plots should not appear in formal reports.

Page 23: Chapter 2 Slides 2015

Example

►Tensile tests for a set of aluminum alloy specimens yield the following results: 15 30 51 20 17 19 20 32 17 15 23 19 15 18 16 22 29 15 13 15 ksi. Plot the data using a stem and leaf diagram, a histogram, and a box plot.

1 | 3 5 5 5 5 [Q1] 5 6 7 7 8 [Q2] 9 9
2 | 0 0 2 [Q3] 3 9
3 | 0 2
4 |
5 | 1

Median = 18.5

Mode = 15
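The stem and leaf diagram, median, and mode for the tensile data can be reproduced with a short script. This is a sketch only: the function name is hypothetical, and it assumes tens digits as stems and units digits as leaves, which suits two-digit ksi values.

```python
from collections import defaultdict
import statistics

# Tensile test results in ksi, as listed in the example.
strengths = [15, 30, 51, 20, 17, 19, 20, 32, 17, 15,
             23, 19, 15, 18, 16, 22, 29, 15, 13, 15]

def stem_and_leaf(data):
    """Return the diagram as a list of text lines (stem = tens, leaf = units)."""
    leaves = defaultdict(list)
    for x in sorted(data):
        leaves[x // 10].append(x % 10)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {leaf_text}".rstrip())
    return lines

diagram = stem_and_leaf(strengths)
median = statistics.median(strengths)
mode = statistics.mode(strengths)

print("\n".join(diagram))
print("Median =", median)
print("Mode =", mode)
```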

Page 24: Chapter 2 Slides 2015

Example

Page 25: Chapter 2 Slides 2015

Time Series Plots

►A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say x) and the horizontal axis denotes the time (which could be minutes, days, years, etc.).

► A time series or time sequence is a data set in which the observations are recorded in the order in which they occur.

► When measurements are plotted as a time series, one often sees trends, cycles, or other broad features of the data.

Page 26: Chapter 2 Slides 2015

Time Series

Page 27: Chapter 2 Slides 2015

Multivariate Data

►Multivariate data occurs when each data observation in the data set has two or more values that are possibly related.

►We can present bivariate data graphically using a scatterplot or an x-y diagram.

►From a scatterplot, we can assess the following aspects of the possible relationship between the two variables:

Direction: positive or negative

Strength: strong, medium, weak, no relationship

Linearity: linear or non-linear.

Page 28: Chapter 2 Slides 2015

Multivariate Data

Page 29: Chapter 2 Slides 2015

Multivariate Data

►Covariance and correlation measure the association or relatedness between variables for bivariate (x,y) data.

Given bivariate (x,y) data, calculate $\bar{x}$, $s_x$, $\bar{y}$, and $s_y$.

The sample covariance gives the direction of the association but does not give an indication of the comparative strength:

$S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)$

$\mathrm{Cov}(x,y) = s_{xy} = \frac{S_{xy}}{n-1}$
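The sample covariance can be computed two ways: from the deviations about the means, or from the raw sums. The sketch below checks that both give the same sum of cross products before dividing by n - 1. The function names and the paired data are hypothetical.

```python
def cross_product_sum(xs, ys):
    """S_xy from deviations: sum of (x_i - x_bar)(y_i - y_bar)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

def cross_product_sum_shortcut(xs, ys):
    """S_xy from raw sums: sum(x_i y_i) - (sum x_i)(sum y_i) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

def sample_covariance(xs, ys):
    """Sample covariance s_xy = S_xy / (n - 1)."""
    return cross_product_sum(xs, ys) / (len(xs) - 1)

# Hypothetical paired observations (assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s_xy_sum = cross_product_sum(xs, ys)
s_xy_sum_short = cross_product_sum_shortcut(xs, ys)
cov = sample_covariance(xs, ys)
print(s_xy_sum, s_xy_sum_short, cov)
```

A positive result indicates that y tends to increase with x, but its size depends on the units of both variables, which is why it says nothing about comparative strength.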

Page 30: Chapter 2 Slides 2015

Multivariate Data

► The sample correlation coefficient, r, gives the direction (+/–) and the comparative strength of the association ($-1 \le r_{xy} \le 1$):

$r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$

$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = (n-1)s_x^2$

$S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = (n-1)s_y^2$

A negative r means a negative slope. The closer r is to –1 or +1, the stronger the correlation.
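As a sketch, r can be computed from the sums of squares and cross products, and cross-checked against the equivalent form covariance / (s_x s_y). The data are hypothetical, and the function name is not from the slides.

```python
import math
import statistics

def sums_of_squares(xs, ys):
    """Return (S_xx, S_yy, S_xy) about the sample means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xx, s_yy, s_xy

# Hypothetical paired observations (assumption, not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s_xx, s_yy, s_xy = sums_of_squares(xs, ys)
r = s_xy / math.sqrt(s_xx * s_yy)

# Equivalent form: sample covariance divided by the two sample standard deviations.
cov = s_xy / (len(xs) - 1)
r_alt = cov / (statistics.stdev(xs) * statistics.stdev(ys))

print(f"r = {r:.4f}, r_alt = {r_alt:.4f}")
```

Because the units cancel in the ratio, r is dimensionless and always lies between -1 and +1.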

Page 31: Chapter 2 Slides 2015

Multivariate Data

► We can make some general remarks about the correlation coefficient, based on its magnitude |r|:

For |r| > 0.8 we have a strong correlation (0.9 and better to draw conclusions).

With 0.5 < |r| < 0.8, we have a moderate correlation.

With |r| < 0.5, we have a weak correlation.

Page 32: Chapter 2 Slides 2015

Example

► The following are results for tensile strength and hardness for a copper alloy. Is there a correlation?

Tensile Str. (kgf)   Brinell Hardness
106.2                35.0
106.3                37.2
105.3                39.8
106.1                35.8
105.4                41.3
106.3                40.7
104.7                38.7
105.4                40.2
105.5                38.1
105.1                41.6

[Figure: scatterplot "Tensile Strength and Hardness for Copper Specimens"; x-axis: Tensile Strength (kgf), y-axis: Brinell Hardness]
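One way to answer the question is to compute r for the ten pairs and compare its magnitude with the rule-of-thumb thresholds of 0.5 and 0.8. This is a sketch, not the worked solution from the slides; the function names are hypothetical, and values falling exactly on 0.5 or 0.8 are assigned to the lower category as an assumption.

```python
import math

# Paired results from the table: tensile strength and Brinell hardness.
tensile = [106.2, 106.3, 105.3, 106.1, 105.4, 106.3, 104.7, 105.4, 105.5, 105.1]
hardness = [35.0, 37.2, 39.8, 35.8, 41.3, 40.7, 38.7, 40.2, 38.1, 41.6]

def correlation(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def strength_label(r):
    """Classify |r| using the 0.5 and 0.8 rule-of-thumb thresholds."""
    size = abs(r)
    if size > 0.8:
        return "strong"
    if size > 0.5:
        return "moderate"
    return "weak"

r = correlation(tensile, hardness)
label = strength_label(r)
print(f"r = {r:.3f} ({label}, {'negative' if r < 0 else 'positive'} direction)")
```

For these data r comes out negative and only just above 0.5 in magnitude, so the association is at best moderate and falls well short of the 0.9 suggested for drawing conclusions.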

Page 33: Chapter 2 Slides 2015

Example