statistics revision - xtreme papers


Upload: syed-ali-raza

Post on 20-Apr-2015


Page 1: Statistics Revision - Xtreme Papers

Introduction

Charts and graphs are used to convey information in a way that lets a user get an immediate overall impression of the data. It should be clear exactly what a chart or graph is about, so it should have a fully descriptive heading or title if it is to be of any use.

Many students tend to be unsure of which type of graph or chart is appropriate and often use bar charts, just because they have more experience with them.

It is also clear that when using IT packages, such as Excel or OpenOffice, many tend to choose graphs that look good, regardless of whether they are appropriate or properly set up. They also do not give full attention to axes and scales. This is not good practice and will cost marks in coursework.

[In fact I suggest that you don’t use Excel to draw your graphs unless you are prepared to spend time and effort doing them properly. You will find it easier and quicker to do the graphs manually on graph paper (not on ordinary paper for GCSE). I know this is heresy. However, if you have a proper statistical package such as Fathom, Autograph, FX Draw/FX Stat, R, Minitab or SPSS etc., then use that. It is also possible to get “Add Ins” for Excel, but I haven’t tried any of them. The examination boards do encourage the use of ICT to draw graphs though, but please be careful.]

Now, remember that data tends to split into three main types:

Qualitative (Categorical): This is where the data is a type or category, for instance makes, colours or types of vehicle.

Quantitative:
Discrete: This is where data is counted. Examples include how many blue cars, how many apples on a tree, or how often a person watched a particular TV programme.
Continuous: This is where data is measured, using some kind of measuring instrument. Examples include length, weight and mass, force, capacity, electrical quantities, etc.

Sometimes it isn’t obvious whether some data is discrete or continuous. For instance time or money can be considered as one or the other. It depends on how it is used. For instance you might count how many seconds elapse between two events. If you allow fractions of a second to be appropriate then you should consider the time to be continuous; if fractions of a second are inappropriate then you might consider the time to be a discrete quantity. The same idea applies to dates. It depends upon what context you consider appropriate. Try to consider whether it is meaningful or allowable to have fractions of the time or date.

For each type of data you have to use the correct types of graphs. This is now extremely important, especially for the GCSE Statistics examination.


Pictograms Or Pictographs

Pictograms are a basic way of showing frequency data, especially for categorical items. This means that we are dealing with categories such as colour or type of object rather than numerical data. The main point is that there must be a key that shows how many of the item a particular symbol represents. This should be clearly identified. It is also common to use a fraction of the symbol to represent smaller frequencies. This can lead to confusion, especially if non-obvious fractions are used (as below).

Here is a pictogram that shows the distribution of shoe size among Year 8 boys. (Of course shoe size could be considered to be discrete data as well.) Clearly the modal category is shoe size 7. A shop would want to stock more of this size than any other.

We can see that for shoe size 8 there are five complete symbols and a half symbol. From this (each complete symbol standing for 10 boys) we estimate that there were 55 Year 8 boys who have shoe size 8.

If the fraction is not an obvious fraction of the key symbol then we must use our judgment. For instance on the row for shoe size 9 we can see that there is only a part of a figure. Perhaps this represents just two or three boys. Let's say 3. So the number of boys with shoe size 9 is 3.

Here is a frequency table that shows the information in the pictogram.

Shoe Size    No. of Y8 Boys
5            20
6            10
7            70
8            55
9            3
10           10
Total        168

Categorical Bar Chart And Vertical Line Graph

This type of chart is used to show the frequency or quantity of some item, for instance the number of blue cars. The individual types of data are called categories. Sometimes the categories could be actual numbers, for instance the number on the face of a dice when thrown.

The key things to consider are that there must be gaps between the bars (no matter what the science and geography departments say!) and that the labels should show what each bar represents. The vertical axis represents the frequency.

Here is a typical bar chart that shows the number of each colour of car that passed a school in a 10 minute period.


Although it is a personal choice, the bars should be about the same width as the gaps, or narrower, for the best effect.

To make it easier to draw, some people like to use a line-column graph, which is a bar chart with vertical lines instead of bars. Note that the vertical line is drawn in the middle of the category width.

Note also that if the frequency axis doesn’t start from zero it is best to put a little zigzag on the axis to highlight this. This is true for all other types of chart as well.

The modal category of the data is the category that has the largest frequency. In the case above it is Red. It is sometimes allowable to have two categories for the modal category (bi-modal), but never more than two. If there is no single highest frequency, that is, there are three or more bars sharing the highest frequency, then we say that there is no mode or no modal category.
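The rule for the modal category can be checked with a few lines of Python. This is only a sketch: the function name `modal_categories` is made up for illustration, the shoe-size frequencies come from the pictogram table earlier in these notes, and the car colour counts are invented purely to show the tie cases.

```python
def modal_categories(frequencies):
    """Return the modal categories using the convention described above.

    One or two categories sharing the highest frequency are reported;
    three or more sharing it means "no mode" (an empty list).
    """
    if not frequencies:
        return []
    highest = max(frequencies.values())
    modes = [category for category, freq in frequencies.items() if freq == highest]
    return modes if len(modes) <= 2 else []


# Shoe-size frequencies from the pictogram table earlier in these notes.
shoe_sizes = {5: 20, 6: 10, 7: 70, 8: 55, 9: 3, 10: 10}
print(modal_categories(shoe_sizes))

# Invented car colour counts, only to illustrate ties.
print(modal_categories({"Red": 9, "Blue": 9, "Silver": 4}))   # two modes (bi-modal)
print(modal_categories({"Red": 5, "Blue": 5, "Silver": 5}))   # three-way tie: no mode
```

Returning an empty list for "no mode" is just one possible design; it keeps the result type the same in every case.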

Discrete And Continuous Data

Discrete Data

For simple discrete data where there are only a few allowable values, such as number of children in a family, or number of passengers in a car, we can use a bar chart or a vertical line graph. The value of the item, e.g. 3 children, is used as a label for the bar or vertical line. Here are examples of both for the frequency table below:

No. of Children    Frequency
0                  3
1                  5
2                  8
3                  2
4                  1

Effectively there is no real difference between the two types of graph. Note, however, that for the vertical line graph the lines are drawn in the middle of the width allocated to the label or value. The number of children value is simply acting as a label here.

It is allowable to use a bar chart to show data for two or more situations so that they can be compared, but to my eyes they always look messy.

Continuous Data

The data set below shows a group of continuous data. This data is called continuous because the scale of measurement, distance, has meaning at all points between the numbers given, e.g. we can travel any distance between 1.2 and 1.8 miles.

Length of journey to work

Distance in miles 0.1 0.2 0.6 1.1 1.2 1.8 2.0 2.7 3.4 4.6 6.2 8.0 12.1 14.2 

Grouped Discrete Data Bar Chart

Sometimes, when working with datasets where a value is counted (i.e. discrete data), we have a problem if we use each data value as a category. For instance if we were counting the number of apples on a tree we might get a different value for each tree. If we had 100 trees we might need a bar chart with one hundred labels if each tree had a different number of apples on it!

In such cases it is better to group the data into intervals or partitions. Such a partition is called a class interval. It is important to note that the intervals must not overlap. So we can't have an interval from 0 - 10 and then from 10 - 20, because into which class would we put the number 10? So we break it up so that we have 0 - 9 and 10 - 19, or we could do 0 - 10, 11 - 20 for instance.

We do (usually) want to try and make the class intervals exactly the same width, although we might have to make the first or last one slightly larger. So for instance if we used 0 - 10, 11 - 20 etc. then the first interval is 11 numbers wide, and the rest are 10 numbers wide. This may be relevant or important in later calculations!
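As a rough sketch of the grouping idea, the Python below sorts counted values into non-overlapping classes of equal width (0 - 9, 10 - 19 and so on). The function names and the list of apple counts are invented for illustration; they are not the data behind the table in these notes.

```python
from collections import Counter


def class_interval(value, width=10):
    """Return the (lower, upper) class that a counted value falls into."""
    lower = (value // width) * width
    return (lower, lower + width - 1)


def group_counts(values, width=10):
    """Tally how many values fall into each class, in ascending class order."""
    tally = Counter(class_interval(v, width) for v in values)
    return dict(sorted(tally.items()))


# Made-up apple counts for ten trees, purely for illustration.
apples_per_tree = [3, 12, 9, 10, 19, 27, 31, 45, 10, 38]

for (low, high), freq in group_counts(apples_per_tree).items():
    print(f"{low} - {high}: {freq}")
```

Because each class runs from a multiple of 10 up to the number just before the next multiple, a value such as 10 can only ever land in one class.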


Here is an example of a frequency table showing the number of apples on a tree.

No. of apples / tree    Frequency
0 - 9                   3
10 - 19                 7
20 - 29                 2
30 - 39                 6
40 - 49                 1
50+                     1
Total                   20

We can now construct a bar chart as before. Again, note the gaps and note how the class intervals are used as labels, rather than as a scale. There is a heading showing clearly what the graph is about. (Although it could be argued that more information is really required to make it specific as to which orchard, what date etc.?)

The mode is now known as the modal class and in this case is 10-19 apples/tree. Note also that pupils tend to want to colour in each bar in a separate colour. This isn’t usually appropriate and tends to look unprofessional! It doesn’t matter where the bar actually starts, though I personally tend to start at the leftmost part of the class and go about a third to half way across. What does tend to look tacky is when only a very narrow space is left. Bar charts have gaps; make sure it is clear that they do.

Note how the class intervals are acting as a label for each bar. The bottom axis is not a scale.

When doing calculations from grouped data frequency tables remember that unless we have the original data we would estimate the mean etc. by using the middle of the class interval. (See the chapter on analyzing data for more details.)

We could, of course, have replaced the bar chart with a vertical line graph.
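The midpoint estimate of the mean mentioned above can be sketched in Python using the apples table. One assumption is needed and it matters: the open-ended "50+" class has no upper limit, so here it is treated as 50 - 59. That choice is mine, not something stated in these notes, and a different choice would change the estimate slightly.

```python
# (lower, upper, frequency) for each class in the apples table.
# ASSUMPTION: the open-ended "50+" class is treated as 50 - 59.
apple_classes = [
    (0, 9, 3),
    (10, 19, 7),
    (20, 29, 2),
    (30, 39, 6),
    (40, 49, 1),
    (50, 59, 1),
]


def estimated_mean(classes):
    """Estimate the mean by pretending every value sits at its class midpoint."""
    total_frequency = sum(freq for _, _, freq in classes)
    weighted_sum = sum((low + high) / 2 * freq for low, high, freq in classes)
    return weighted_sum / total_frequency


print(estimated_mean(apple_classes))  # estimated apples per tree
```

This is only an estimate because the original counts inside each class are unknown.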

 

Histograms

Histograms are similar to bar charts in that they show frequency, but this time of continuous (or measured) data. As such they are drawn differently. The main thing to note is that the horizontal axis is a scale rather than just labels for categories or class intervals. You will also see that a histogram does not have gaps between the bars.

Some people tend to consider a histogram to be such only if the class intervals are of different widths. If the class widths are the same they consider them to be bar charts. I prefer to use the word histogram for continuous data bar charts regardless.

It is also important to note that if the class intervals are of unequal size, then a frequency density has to be calculated and this is plotted on the vertical axis rather than frequency. This is because it is the area of the bar that represents the frequency rather than its height! The reason for this is that if we simply plot the frequency, then the area of a wide bar is exaggerated and this gives a false impression as to the relative size of the data represented.

It isn't hard to calculate frequency density - we just divide the frequency by the class width. (I find that I best remember this by remembering that the frequency is an area, and the area of a bar is its base times its height, which is the class width multiplied by the frequency density.)


Here is an equal class width histogram showing the heights of plants in a garden. Since the class widths are equal I can plot frequency. Note the scale at the bottom and that the units are stated.

Here is an example of an unequal width histogram. The first one (on the left) is drawn correctly using frequency density, and the one on the right is drawn incorrectly using frequency so that you can see how it over-emphasises the frequencies. I have also included the original frequency table.

Height of plants h (cm)    Frequency    Frequency Density
0 ≤ h ≤ 20                 8            8/20 = 0.4
20 < h ≤ 30                20           20/10 = 2
30 < h ≤ 40                15           15/10 = 1.5
40 < h ≤ 60                12           12/20 = 0.6
60 < h ≤ 100               10           10/40 = 0.25
Total                      65


Quite clearly the extremely large area from 60-100 in the incorrectly drawn histogram gives the impression that there is a lot of data in that portion, whereas it is actually quite sparsely populated.
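The frequency densities in the plant height table can be reproduced with a short Python sketch. The variable and function names are my own; the class boundaries and frequencies are taken from the table. The loop also shows that multiplying each density by its class width gives back the frequency, which is the "area equals frequency" idea.

```python
# (lower bound, upper bound, frequency) from the plant height table.
plant_classes = [
    (0, 20, 8),
    (20, 30, 20),
    (30, 40, 15),
    (40, 60, 12),
    (60, 100, 10),
]


def frequency_density(lower, upper, frequency):
    """Frequency density = frequency divided by class width."""
    return frequency / (upper - lower)


densities = [frequency_density(*cls) for cls in plant_classes]

for (lower, upper, freq), density in zip(plant_classes, densities):
    area = density * (upper - lower)  # bar area, which should match the frequency
    print(f"{lower}-{upper} cm: density {density}, area {area}, frequency {freq}")
```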

Frequency Polygons

Frequency Polygons for Grouped Continuous Data

If we want to show the data from two datasets on the same graph for comparison purposes we can use a frequency polygon. We could use two (or more) bars on the same graph, keep them narrow and use different colours, but often it is hard to read the graph and get an overall impression. We cannot use a histogram for this purpose. Sometimes you might see the bar chart turned so that the bars are horizontal and then the two sets of data are shown on opposite sides of a vertical axis, but again these tend to be difficult to interpret - though they might look pretty!

A frequency polygon joins the mid-points of the tops of the bars in a histogram. The main teaching point is that the lines should be straight, not curved.

[In the past it was felt reasonable to use a frequency polygon for discrete data also, but to use a dashed line to link the points instead of a solid one. The dashed line indicated that data could not be interpolated, that is, no intermediate values could be obtained. Now a vertical line graph is used for discrete data.

Personally, if I were comparing two datasets of discrete data, I would be quite happy to use a frequency chart to show both. It is certainly more effective than vertical lines! However, for the purposes of GCSE do what the examination board says.]
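Since a frequency polygon joins the mid-points of the tops of the histogram bars, the points to plot can be worked out directly from a grouped table. The sketch below uses the plant height table from the histogram section; because those classes have unequal widths, the height of each bar (and so of each polygon point) is taken to be the frequency density. The function name is invented for illustration.

```python
# (lower bound, upper bound, frequency) from the plant height table.
plant_classes = [
    (0, 20, 8),
    (20, 30, 20),
    (30, 40, 15),
    (40, 60, 12),
    (60, 100, 10),
]


def polygon_points(classes):
    """Return (class midpoint, bar height) pairs to be joined by straight lines.

    Bar height is the frequency density, as used for the histogram.
    """
    return [((low + high) / 2, freq / (high - low)) for low, high, freq in classes]


for midpoint, height in polygon_points(plant_classes):
    print(f"plot a point at h = {midpoint} cm, height {height}")
```

To compare two datasets, the same function could be applied to each table and both sets of points drawn on one pair of axes.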


 

Line Graphs

It is vital to make clear that line graphs are NOT the same as bar charts. They do not show frequency based data, but show how one quantity varies with changes in another. The thing that is changed by the experimenter is called the independent (or explanatory) variable; the thing that changes as a result is called the dependent (or response) variable (because it depends on the thing being changed).

Typical examples are found in science and geography especially. Students find it hard not to use a bar chart for everything!

Usually the independent variable is continuous and it is therefore proper to join the data points by a smoothly changing curve. If the independent variable is discrete then just the actual data points should be drawn and no curve or line should really be drawn between them. This is to stop users of the data thinking that they can interpolate information from the graph! (Interpolation means getting values from between given values by estimating or by more mathematical methods. Extrapolation is estimating values from beyond the given data.)

The graph below shows a typical (and made up) line graph. Normally, the arrows on the ends would not be drawn, but I wanted to show you that the graph can actually continue on in either direction.


Pie Charts

Children love pie charts, especially the 3D type produced by Excel etc. Unfortunately they tend to misunderstand the point of a pie chart, which is to show relative proportions, not frequencies!

The usual mistake with pie charts is not to provide any information showing what the whole pie chart represents, or what the individual slices represent. Without this information it is impossible to compare pie charts in a quantitative way.

When constructing a pie chart we need to work out fractions. We could have a fraction of 360 degrees (or a fraction of 100% if a percentage pie chart is used). The fraction is Value / Total.

Here is an example of how to construct a pie chart. Thirty children were asked to choose their favourite colour from a list of alternatives. Here are the (made up) results:

Colour    Frequency    Size of Angle (degrees)
Red       6            6/30 x 360 = 72
Green     4            4/30 x 360 = 48
Blue      11           11/30 x 360 = 132
Yellow    9            9/30 x 360 = 108
Total     30           360
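The angle calculation in the table is the same sum for every row, so it is easy to check with a short Python sketch. The names below are my own; the frequencies are the favourite colour results from the table.

```python
# Favourite colour frequencies from the table above.
favourite_colours = {"Red": 6, "Green": 4, "Blue": 11, "Yellow": 9}


def sector_angles(frequencies):
    """Angle for each sector: Value / Total of 360 degrees."""
    total = sum(frequencies.values())
    return {name: freq * 360 / total for name, freq in frequencies.items()}


angles = sector_angles(favourite_colours)
for colour, angle in angles.items():
    print(f"{colour}: {angle} degrees")

# The angles should add up to a full turn, which is a useful check.
print("Total:", sum(angles.values()))
```

For a percentage pie chart the same function would work with 100 in place of 360.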


When constructing the pie chart it is probably best to draw the first line as a vertical radius from the center upwards and to measure the first angle in a clockwise direction from this. The second angle is then measured from this new line, not from the original vertical line. Remember you don't need to measure the last section of pie, though it is a useful check.

Often pupils like to colour in the pieces of pie, but this can look garish, especially if printed off from Excel or OpenOffice.

It can also be beneficial to show either percentages, the actual values or degrees for each sector, but only if this won't clutter the chart.

Stem And Leaf Plots

A particularly nice way of initially presenting data is in the form of a stem and leaf plot. It is useful for organizing small amounts of data and for finding the median and range quickly.

Here is some data on the amount of change (young) pupils had in their pockets (as usual made up to illustrate a point!).

34, 73, 65, 23, 29, 76, 45, 54, 54, 65, 23, 84, 70, 30, 25, 26, 47, 65, 34, 15, 70, 50, 36

Firstly we draw a vertical bar and on the stem side we write down the numbers 0-9.


Now we go through our list of numbers and write them down on the leaf side. The key to this is that for the number 39 the stem is the 3 and the leaf is the 9, so 3|9 is 39.

If we now have the number 38 we write 3|9 8.

If we then had the number 30 we would write 3|9 8 0, and so on.

Now we make sure that the leaf parts are ordered in ascending order (small to large)


This is the completed Stem and Leaf diagram. Note that there must be a key.

Using this diagram we can see that the smallest value is 15p (the 1|5) and the largest value is 84p (the 8|4). The range is therefore 84 - 15 = 69p. The median is the middle value. There are 23 values, so the median is the ½(23+1)th item = 12th item, which is 47. Note you only count the leaves, not the stem and leaves.

You could work out the quartiles as well if you wanted to. To do this for the lower quartile we calculate the ¼(n+1)th value and locate that. (Note that for GCSE it is likely that this will result in a whole number. If not then you are going to need to interpolate by going the required fraction of the way into the interval between the two values. So for instance if you needed the 3.25th value and the 3rd value was 6 and the 4th value was 9, you would need to find 6 + ¼(9 - 6) = 6.75.) For the upper quartile we could either find the ¾(n+1)th value or use the lower quartile position, but start at the highest value and count backwards!

The mode is the item with the most occurrences. In this case several values occur twice (23, 34, 54 and 70) and 65 occurs three times, so 65 is the modal value.

Can we calculate the mean? Yes, but only by adding up all the values and dividing by 23.

We can also use a stem and leaf plot to show two datasets at once for comparison, with the leaves of one group going to the left of the stem and the other to the right.

Some datasets can be awkward to put into a stem and leaf plot. Stem and leaf diagrams are okay if the data has nice values, but if the data is something like 24, 305, 87, 2.6, then it isn’t really feasible to make the stem and leaf work properly.
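The stem and leaf working above can be checked with a small Python sketch. It uses the pocket change data from these notes; the function names are invented for illustration, and `value_at_position` implements the ¼(n+1) style of position with the simple interpolation described above (other textbooks and software packages use slightly different quartile rules).

```python
from collections import Counter, defaultdict

# Pocket change data (pence) from the notes above.
change = [34, 73, 65, 23, 29, 76, 45, 54, 54, 65, 23, 84,
          70, 30, 25, 26, 47, 65, 34, 15, 70, 50, 36]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units) in ascending order."""
    plot = defaultdict(list)
    for value in sorted(values):
        plot[value // 10].append(value % 10)
    return dict(plot)


def value_at_position(ordered, position):
    """Value at a 1-based position, interpolating if the position is fractional."""
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    upper = ordered[whole]
    return lower + fraction * (upper - lower)


for stem, leaves in stem_and_leaf(change).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
print("Key: 3 | 4 means 34p")

ordered = sorted(change)
n = len(ordered)
median = value_at_position(ordered, (n + 1) / 2)
lower_quartile = value_at_position(ordered, (n + 1) / 4)
upper_quartile = value_at_position(ordered, 3 * (n + 1) / 4)
data_range = ordered[-1] - ordered[0]
mode = Counter(change).most_common(1)[0][0]
print(median, lower_quartile, upper_quartile, data_range, mode)
```

With 23 values all three positions are whole numbers, so no interpolation is needed here; the interpolation branch only comes into play for positions such as 3.25.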

Cumulative Frequency Step Polygon For Simple Discrete Data

Suppose we count the number of goals per football match in the group stages of a major competition.

We add an extra column to the frequency table called the Cumulative Frequency column. In this column we start off with the first frequency, i.e. 6. We then add to this the next number in the frequency column, i.e. 6 + 15 = 21. Then we add the 8 to this total to get 21 + 8 = 29 and so on.

What this gives us is the total number of items equal to or less than the number of Goals/Match. So there were 37 matches where the total number of goals was equal to, or less than, 3.

Number of Goals/Match    Frequency    Cumulative Frequency
0                        6            6
1                        15           21
2                        8            29
3                        8            37
4                        8            45
5                        2            47
6                        2            49
7                        1            50
Total                    50

What this does is allow us to find out in which class the half-way point of the data values lies. Half of 50 is 25, and the cumulative frequency first reaches 25 in the 2 goals/match class (it rises from 21 to 29 there). The point where this happens is the median, so the median value occurs in the 2 goals/match class. Half of the values are less than this, half the values are more than this point.

Now we plot the points and from each point we draw a step, down to the next point.

In order to find the median value we look at the total number of values we have. In this case there were 50 matches played. We add one to this and then halve it, to get 25.5, and from this point on the cumulative frequency axis (the vertical axis) we draw a line across to the steps and then down. If the number of items is large enough we can simply use ½n rather than ½(n+1).

The median value is thus 2 goals/match.

We can also find the lower and upper quartiles (or Q1 and Q3). Really we should find the ¼(n+1)th and the ¾(n+1)th values, but since in this case n is quite large we can just use the ¼n and ¾n values without losing too much accuracy.

We halve the half-way position (25) to get 12.5 and from this point on the vertical cumulative frequency axis we go across and down. This gives the lower quartile. For the upper quartile, we add the half value and this quarter value (25 + 12.5 = 37.5) and again go across and down to get the upper quartile.

Q1 is 1 goal/match and Q3 is 4 goals/match. The Interquartile range is therefore 4 – 1 = 3 goals per match.
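The cumulative frequency column and the "go across and down" readings can be reproduced in Python. This is a sketch using the goals table from these notes; the helper name `goals_at_position` is my own. Going across to the steps and then down amounts to finding the first class whose cumulative frequency reaches the required position.

```python
from itertools import accumulate

# Goals per match and their frequencies, from the table above.
goals = [0, 1, 2, 3, 4, 5, 6, 7]
frequencies = [6, 15, 8, 8, 8, 2, 2, 1]

# Running total of the frequencies gives the cumulative frequency column.
cumulative = list(accumulate(frequencies))
n = cumulative[-1]


def goals_at_position(position):
    """First goals/match value whose cumulative frequency reaches the position."""
    for value, running_total in zip(goals, cumulative):
        if running_total >= position:
            return value
    raise ValueError("position is beyond the total frequency")


median = goals_at_position((n + 1) / 2)   # position 25.5
q1 = goals_at_position(n / 4)             # position 12.5
q3 = goals_at_position(3 * n / 4)         # position 37.5

print("Cumulative frequencies:", cumulative)
print("Median:", median, "Q1:", q1, "Q3:", q3, "IQR:", q3 - q1)
```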

Cumulative Frequency Graphs

These graphs are very useful for estimating the median and inter-quartile range of grouped data. The data type can either be simple discrete data, grouped discrete data or continuous data. They can also be useful for comparing distributions.

Continuous Data Cumulative Frequency Polygon

The cumulative frequency graph for continuous data is almost exactly the same as for grouped discrete data. Here is some data collected on the heights of some pupils in a form.

Height - h (cm)    Freq    Cumulative Frequency
120 < h ≤ 130      1       1
130 < h ≤ 140      4       5
140 < h ≤ 150      7       12
150 < h ≤ 160      8       20
160 < h ≤ 170      4       24
170 < h ≤ 180      2       26
Total              26

Here is a plot of this data. Note that each cumulative frequency is plotted against the upper bound of its class, which is effectively the top of the class interval.


Again, we can use this graph to estimate the median and the quartiles.

For the median we chose ½ of (26+1), which is 13.5. For Q1 we used ½ of 13.5, which is 6.75, and for Q3 we added these together to get the ¾ point, so 13.5 + 6.75 = 20.25. (For GCSE Maths you would get away with ½n, ¼n and ¾n, i.e. 13, 6.5 and 19.5.)

Reading from the graph, the median is about 151cm, Q1 is about 141cm and Q3 is about 160cm. Therefore the interquartile range is 160 - 141 = 19cm.

This gives us some indication of the spread or variability of the data. A larger number would indicate that the data is more spread out. This would also be shown on the graph by a less steep curve. It is useful to know this if you plot two datasets on the same graph.

If you do plot two polygons on the same graph, you might find it useful to convert the cumulative frequencies to percentage cumulative frequencies. This is easy to do, as seen below.

Height - h (cm)    Frequency    Cumulative Frequency    % Cumulative Frequency
120 < h ≤ 130      1            1                       1/26 x 100 = 4%
130 < h ≤ 140      4            5                       5/26 x 100 = 19%
140 < h ≤ 150      7            12                      12/26 x 100 = 46%
150 < h ≤ 160      8            20                      20/26 x 100 = 77%
160 < h ≤ 170      4            24                      24/26 x 100 = 92%
170 < h ≤ 180      2            26                      26/26 x 100 = 100%
Total              26

(Percentages rounded to the nearest whole %.)

The median is now at the 50% mark, Q1 at the 25% mark and Q3 at the 75% mark. The advantage of this is that it is really easy to compare two datasets that have different numbers of data items in them.

Often questions are asked about what proportion are less than a certain amount or more than a certain amount. To use the graph to calculate these we draw a dotted line from the horizontal data axis up to the curve and then across. If we want the number less than this amount we can simply read off the value. However, if we want the number that is larger than this amount we need to subtract the value we read off from the total number of items.
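Reading values from a cumulative frequency polygon is straight-line interpolation between the plotted points, so it can be sketched in Python. The data is the pupil height table from these notes. Two things here are my own assumptions rather than anything stated above: the polygon is taken to start at (120, 0), the lower bound of the first class, and the function names are invented. Values calculated this way will differ a little from readings taken by eye from a hand-drawn graph.

```python
from itertools import accumulate

# Upper class bounds and frequencies from the pupil height table.
upper_bounds = [130, 140, 150, 160, 170, 180]
frequencies = [1, 4, 7, 8, 4, 2]
lowest_bound = 120  # ASSUMPTION: polygon starts at (120, 0)

cumulative = list(accumulate(frequencies))
total = cumulative[-1]
percent_cumulative = [round(100 * cf / total) for cf in cumulative]

# Points of the polygon: (height, cumulative frequency).
points = [(lowest_bound, 0)] + list(zip(upper_bounds, cumulative))


def height_at(position):
    """Go across from a cumulative frequency and down to the height axis."""
    for (h0, c0), (h1, c1) in zip(points, points[1:]):
        if c0 <= position <= c1:
            return h0 + (position - c0) / (c1 - c0) * (h1 - h0)
    raise ValueError("position is outside the graph")


def count_up_to(height):
    """Go up from a height and across to the cumulative frequency axis."""
    for (h0, c0), (h1, c1) in zip(points, points[1:]):
        if h0 <= height <= h1:
            return c0 + (height - h0) / (h1 - h0) * (c1 - c0)
    raise ValueError("height is outside the graph")


print("Percentage cumulative frequencies:", percent_cumulative)
print("Median estimate:", height_at((total + 1) / 2))
print("Q1 estimate:", height_at((total + 1) / 4))
print("Q3 estimate:", height_at(3 * (total + 1) / 4))
print("Pupils taller than 160cm:", total - count_up_to(160))
```

The last line shows the "more than" case: read off the number up to 160cm and subtract it from the total.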

Cumulative Frequency Curve – The Ogive

If we take the (ac)cumulating process to the extreme, and if we had a large number of data items, we would find that the points get closer and closer to each other, giving us a curve rather than a polygon. This S-shaped curve is called an ogive (pronounced oh ghyve, not oh give). We use it in just the same way as a polygon, but of course it will be much more accurate. If you wanted, you could connect the points on a cumulative frequency polygon by your own estimated curve, but this is not recommended at GCSE level.

Box And Whisker Plots

Box and whisker plots are useful in showing at a quick glance the salient features of a distribution of data (dataset), and especially in directly comparing two or more datasets.

There are two types: those that show the median and interquartile range and those that show the mean and standard deviation. I have shown both types on this diagram although normally we would only use one type in a comparison situation.

The basic features are that we have a line from the minimum point to the first quartile (or the mean - 1 standard deviation) and a line from the third quartile (or mean + 1 standard deviation) to the maximum point. These lines are the whiskers. If there are any obvious outliers or rogue values that do not seem to be part of the normal dataset values we (might) indicate these by using a dot. A box is then drawn and a line showing the median (or mean) value is used to split the box. For median-IQR box plots the line may divide the box asymmetrically; for the mean-s.d. plot the mean will split the box into two equal halves.

For the median-quartiles type of box plot an outlier or rogue value is defined to be any point that is below the lower quartile - 1.5 times the interquartile range, or above the upper quartile + 1.5 times the interquartile range.

For the mean, standard deviation type I don’t know what the definition is! In fact, it isn’t usual to use a box and whisker for mean/standard deviation, though I’m not sure why (possibly because it doesn’t show skewness).

As you can see it is very easy to use these diagrams to discuss the spread of data and also its average. The wider the box the more spread out the data.

Box and whisker plots can also be drawn vertically.

For a median-IQR boxplot the skewness of the data can be seen. If the median is closer to Q1 (LQ) than to Q3 (UQ) then the skew is positive; if the median is closer to Q3 (UQ) then the skew is negative.
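The outlier rule and the skewness check for a median-IQR box plot can be written as a short Python sketch. It uses the usual definition of an outlier as a value more than 1.5 times the interquartile range beyond either quartile. The function names and the example numbers are invented purely for illustration.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """List the values lying outside the outlier limits."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]


def skew_direction(q1, median, q3):
    """Positive skew if the median sits nearer Q1, negative if nearer Q3."""
    if median - q1 < q3 - median:
        return "positive"
    if median - q1 > q3 - median:
        return "negative"
    return "symmetric"


# Made-up quartiles and values, only to show the functions in use.
q1, median, q3 = 20, 25, 40
values = [5, 22, 25, 38, 90]

print(outlier_fences(q1, q3))
print(find_outliers(values, q1, q3))
print(skew_direction(q1, median, q3))
```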

Comparing Data – Scatter Graphs And Line Of Best Fit

It is often the case that we can collect more than one piece of data for an item. The classic example is the height and weight of a person. Other examples include the price and age of a car, the age at death and average number of cigarettes smoked by a person, the age and weight of a baby, the number of ice creams sold by an ice-cream van and the outside temperature, and so on.

We might sometimes feel that there is a relationship between the two things. We might expect that if a person is taller then they will weigh more, or if a car is old it will be worth less (unless it is a classic, veteran or vintage car). Sometimes we might not expect any relationship between the parameters. For instance we might compare a person's mathematics score and the time in which they can run 100m. We might be surprised if there was any relationship here.

The classic way of comparing such parameters for an item is to use a scatter graph. In this we have an axis for one parameter (say height) and use the other axis for the other parameter (say weight). For each item we then mark the intersection of their height and weight with a mark, usually a cross, which we call a scatter (point).


Here we have plotted a scatter for a person who weighs 68kg and has a height of 170cm.

We can now do this for a number of people and get a typical result like the following graph.

We can see from this diagram that there does seem to be some kind of relationship between the weight and height of people. In general the scatters seem to be going upwards from left to right.

We could write this in words and say:

The greater the weight the greater the height.

We say, using technical language, that we have a positive correlation.

You should get into the habit of writing statements about relationships in this way. You certainly need to describe the effect of increasing one parameter on the other and use the technical term. Correlation simply means relationship. So a positive correlation simply means we have a positive relationship. We use positive because the trend of the data is in a positive upwards direction - it is just a convention.

We could actually give the impression of this relationship by drawing a line on the graph which not only shows the direction the scatters seem to be forming, but which also tries to be an average of the direction the scatters are going in. We call this a line of best fit.


A few things to note about this line. Firstly, it doesn't have to start or seem to come from the origin. In this case we could argue that a person having zero weight would also have zero height, so we could draw the line from the origin. However we have no data points that close to the origin so we have no need to start it that close. The line is also drawn so that it is balancing out the distances of the scatters from the line. A piece of software (such as Excel's Trend option) would do this mathematically to get the average distance from the line as low as possible, but we can use our eye and judgment to get the best line possible. Of course this does mean that different people might judge the line differently and get different answers when they use it, but this is taken into account in examinations by allowing a range of acceptable values.

However, if we want to locate this line a little more precisely we can mark the double mean point. To do this we work out the mean value of both sets of data and then plot this point, perhaps putting a little circle round it to indicate that it is the mean point. Our line of best fit would now go through this point.

Note that we do not draw the line to go too far past the scatters. We could expect the line to follow the same pattern, but really we have no evidence to expect this, so we are cautious. If you do want to draw the line past the scatters then it might be best to use a dashed line to indicate this.

Note also that occasionally we get scatters that seem way off from the usual trend. In this diagram that particular scatter is marked along the bottom to the right. We call this an outlier. (It seems to be a small, but heavy person!) We don't force the line to include this point because it would bias our line. We also seem to have more scatters on one side of the line than on the other. This is okay because some of the scatters on the top of the line are further away from the line than those underneath. We are just trying to get a balanced line.

We can use a line of best fit to work out expected values of one parameter given the other parameter. For instance if I want to know the expected height for someone who weighs 75kg, I could draw a dashed line from this mark on the scale to the line of best fit and then horizontally as shown in the diagram below. Note that we draw to the line, not to a point that might exist.


We can see that we get a height of about 160cm. When we draw lines to the line of best fit such that we are inside the range of scatter points we say we are interpolating the results. Remember we are only obtaining an estimate of what we might expect. It isn't guaranteed. We are also only assuming that the relationship does in fact exist - it might not! It also might not be a linear (as shown by a straight line) relationship at all. If we want to try and find a value that lies beyond the range of the scatters, we have to assume that the straight line (linear) relationship continues (though it might not!) and, as mentioned, we can draw dashed lines to show this. Then we can extrapolate the answer. This of course may well be inaccurate. It is always better to work out values that lie inside the range of scatters than to go beyond it, i.e. it is better to interpolate than extrapolate.
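The difference between interpolating and extrapolating can also be sketched in a few lines of Python. This is only an illustration: the function name `estimate_from_line`, the line (gradient 2.0, intercept 10.0) and the weight range of 40kg to 90kg are all made-up assumptions, not values read from the diagrams.

```python
def estimate_from_line(x, gradient, intercept, x_min, x_max):
    """Read a value off a straight line of best fit and say how far to trust it.

    x_min and x_max are the smallest and largest x values among the scatters.
    """
    y = gradient * x + intercept
    if x_min <= x <= x_max:
        kind = "interpolated"   # inside the range of the scatters
    else:
        kind = "extrapolated"   # beyond the scatters, so treat with caution
    return y, kind


# Hypothetical line: height = 2.0 * weight + 10.0
# Hypothetical scatters: weights from 40 kg to 90 kg
print(estimate_from_line(75, 2.0, 10.0, 40, 90))  # (160.0, 'interpolated')
print(estimate_from_line(20, 2.0, 10.0, 40, 90))  # (50.0, 'extrapolated')
```

The arithmetic is the same in both cases. The label is simply a reminder that the second answer relies on the line continuing beyond the data we actually have.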

In this diagram we are extrapolating the weight of a person who is 40cm tall to get a weight of 20kg. Whether this is true is debatable.

Let's now look at the example of the price of a second-hand car compared with its age:


The distribution of scatters seems to be going down from left to right here. We could say that the greater the age of the car, the lower the price. This is a negative correlation. If we draw a line of best fit here we would get the following diagram:

Note that the lower left scatter is saying that we have a fairly new car that is quite cheap – it must be a Skoda! (Old joke, I know.)

Let us now look at the graph we might obtain if we plot the score in a mathematics test against the time for a person to run 100m.

In this case there is no discernible pattern to the scatters. They are all over the place. In this case we say there appears to be no correlation. I cannot and must not draw a line of best fit.


Notice how careful I am with my use of language. I cannot say with any certainty there is definitely no correlation at all, because I do not know that for certain. The data I have so far collected seems to suggest that there is no relationship, but perhaps more data would show a relationship, or perhaps the mathematics test I am using is not powerful enough to reveal the relationship. The same is true if there seems to be a positive or negative correlation. The scatter graph is only suggesting such a relationship. It might also be that any relationship is actually not linear but could be some other relationship such as quadratic, exponential, sinusoidal, or an even more complex function.

Pupils do find it hard to resist drawing a line of best fit even if this is clearly inappropriate - so be careful.

Further mathematical studies will reveal various mathematical techniques that can be used to formalise the use of lines of best fit - regression - and of calculating the degree of correlation by using correlation coefficients.
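For the curious, here is a rough Python sketch of what those further techniques look like. It assumes the usual least squares method for the regression line and the product-moment (Pearson) correlation coefficient. The car ages and prices are invented purely for illustration and the function names are my own.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def least_squares_line(xs, ys):
    """Gradient and y-intercept of the least squares line of best fit."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    gradient = sxy / sxx
    # Choosing the intercept this way forces the line through the mean point.
    intercept = y_bar - gradient * x_bar
    return gradient, intercept


def correlation_coefficient(xs, ys):
    """Product-moment correlation: near +1 or -1 is strong, near 0 is weak."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: age of car in years and price in pounds
ages = [1, 2, 3, 4, 5]
prices = [9000, 8000, 6500, 5000, 4000]

gradient, intercept = least_squares_line(ages, prices)
print(gradient, intercept)                             # -1300.0 10400.0
print(round(correlation_coefficient(ages, prices), 3))  # -0.997
```

With this invented data the gradient is negative and the coefficient is close to -1, which is what we would describe on a scatter graph as a strong negative correlation. The line also passes through the double mean point (3, 6500).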

Obtaining The Equation Of Line Of Best Fit

If we have a straight line then we can calculate an equation for this line from the graph and then use this equation as a formula.

We need two things: the gradient of the line and the y-intercept, or we can work out the equation from two points on the line.

For the gradient we need to draw a right-angled triangle onto the line of best fit and then work out a vertical side distance and a horizontal side distance. The gradient is the vertical distance divided by the horizontal distance.

The y-intercept is where the line of best fit (extended if needed) cuts through the vertical axis.

The material on graphs gives the details.
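As a quick illustration of the two-point method, here is a small Python sketch. The two points are hypothetical readings taken from a line of best fit, not from any diagram in these notes, and the function name `line_through` is my own.

```python
def line_through(point_a, point_b):
    """Gradient and y-intercept of the straight line through two points."""
    (x1, y1), (x2, y2) = point_a, point_b
    if x1 == x2:
        raise ValueError("A vertical line has no gradient or y-intercept.")
    # Vertical side of the triangle divided by the horizontal side
    gradient = (y2 - y1) / (x2 - x1)
    # Rearranging y = gradient * x + intercept at the first point
    intercept = y1 - gradient * x1
    return gradient, intercept


# Hypothetical points read off the line: (40 kg, 90 cm) and (80 kg, 170 cm)
gradient, intercept = line_through((40, 90), (80, 170))
print(gradient, intercept)  # 2.0 10.0

# The equation can now be used as a formula, e.g. for a weight of 75 kg
print(gradient * 75 + intercept)  # 160.0
```

It is best to pick two points that are well apart on the line, since small reading errors then have less effect on the gradient.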

Misleading Graphs

It is possible to convey false impressions. By changing the scale of the vertical axis one can exaggerate an increase (say in sales) or minimise a decrease (say in loss). Note that the graph on the right gives a better impression of increasing sales.
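To get a feel for how much the choice of scale matters, the short Python sketch below compares how tall two bars look when the vertical axis starts at different values. The sales figures (100 and 110) and the function name `apparent_ratio` are made up for illustration.

```python
def apparent_ratio(first_value, second_value, axis_start):
    """How many times taller the second bar looks than the first
    when the vertical axis begins at axis_start instead of zero."""
    return (second_value - axis_start) / (first_value - axis_start)


# Hypothetical sales rising from 100 to 110, a real increase of 10%
print(apparent_ratio(100, 110, 0))   # 1.1  (axis from zero: looks 10% taller)
print(apparent_ratio(100, 110, 95))  # 3.0  (axis from 95: looks three times taller)
```

The data has not changed at all. Only the starting point of the vertical axis has, which is why it always pays to check the scale before trusting the picture.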