gems basicstatistics concepts 23apr09

7/28/2019 GEMS BasicStatistics Concepts 23Apr09


GEMS Basic Statistics Concepts



Copyright 2009 Gemcom Software International Inc.

All Rights Reserved. This publication, or parts thereof, may not be reproduced in any form, by any method, in whole or in part, for any purpose.

Gemcom Software International Inc. makes no warranty, either expressed or implied, including but not limited to implied warranties of merchantability or fitness for a particular purpose, regarding these materials.

In no event shall Gemcom Software International Inc. be liable to anyone for special, collateral, incidental, or consequential damages in connection with or arising out of the use of these materials. The sole and exclusive liability to Gemcom Software International Inc., regardless of the form of action, shall not exceed the purchase price of the materials described herein.

Gemcom Software International Inc. reserves the right to revise and improve its products as it deems appropriate. This publication describes the state of this product at the time of publication for the version number stated, and may not reflect the product at all times in the future.

Gemcom Software International Inc.
Suite 1100, 1066 West Hastings Street
Vancouver, BC Canada V6E 3X1
Tel: +1 604.684.6550
Fax: +1 604.684.3541

Web site: www.gemcomsupport.com

Gemcom, the Gemcom logo, combinations thereof, and GEMS are trademarks of Gemcom Software International Inc.

Revision date: 4/23/2009



Table of Contents

Introduction
  Overview
  Requirements
  Workflow

General Summary
  Understand the Domains
  Validate the Input Data
  Understand Estimation Methods and Parameters
  Validate the Output Model

Domains
  Overview
  The Impact of Domains on Estimated Values

Composites

Basic Statistics
  Overview
  Descriptive statistics
    What is typical?
    How are they different?
  Histograms
  Cumulative Probability Plots
  Probability Plots
  Bimodal Distributions

Outliers
  Overview
  Outliers and Topcuts
  Methods of Determining a Topcut Value
    Histogram
    Confidence interval
    Percentile
    Experience



Introduction

Overview

Statistics helps build our understanding of the data we are working with. It is useful to know what values are typical, how different our samples are, whether there are any patterns or spatial relationships between our samples, and how these observations correlate with our geological knowledge. We use statistics for:

QAQC

Validating, comparing and combining domains

Validating resource models

Requirements

You should be able to do the following:

Create, edit and import data into a geological drillhole database

Manipulate drillhole data

Composite drillhole data

Interpret geological sections and plans

Create surface and solid models

Display and view data in 2D and 3D

Caution: If you do not have a good background in these subjects, many parts of this training may be difficult to follow.

Workflow

There can be many different workflows for applying basic statistics. Deciding which techniques to apply, and in which order, is usually something that only experience will teach you.

Here is a generic process for performing basic statistics:

1. Identify and model Domains

For each domain:

2. Create composites

3. Perform Basic statistics

4. Topcut Outliers

This would be followed by geostatistics, which would include the following:

5. Calculate Anisotropy and Variogram Parameters

6. Perform ID2 and Ordinary kriging

7. Validate the Model

Steps 5 to 7 are covered in the Variography Manual.



General Summary

In order to reduce estimation errors, you should:

understand the domains

validate the input data

understand estimation methods and parameters

validate the output model.

Understand the Domains

It is important to recognise separate regions, or domains, within a model. Once you have identified the domains, it is important to group all sample data contained within each domain into distinct subsets. After that, you can analyse each subset individually, and use data from each separate domain to make estimations within that domain.

Validate the Input Data

The saying "Garbage in = Garbage out" is certainly true in geostatistics. Although sampling theory and laboratory quality control practices are important concepts which impact the quality of any estimation made using a set of data values, these subjects are outside the scope of this tutorial.

Assuming that the quality of the data is as good as you're going to get, there are a couple of potentially hazardous characteristics of the data which you should look for: bimodalism and outliers. You can look for both of these features with a histogram. A data set is said to be unimodal if the histogram shows a single peak. If there are two peaks, the data is said to be bimodal. If you use some of the more common estimation techniques to create a model based on a bimodal distribution, it is likely to contain more estimation errors than a model created from a unimodal data set. Additionally, outliers (values which are significantly distant from the majority of the data) can cause estimation errors.

Understand Estimation Methods and Parameters

There are a large number of estimation methods, and a large number of parameters within each method. Before using a particular estimation method, you should have a good background in basic statistics, as well as basic geostatistical principles.

Using geostatistics can be likened to flying a jet plane. Although there are autopilot modes, where you just press a few buttons and something happens, it is important that the pilot understands the theory of aerodynamics in order to understand what impact a particular control has upon the end result.

Validate the Output Model

A final method you should use to check the quality of estimation is to take time to examine the output. Histograms of estimated values, contours of plans, and cross sections of block models, colour coded and rotated in three-dimensional space, are all methods which can be used to verify the output values.



Domains

Overview

One of the most important aspects of geostatistics is to ensure that any data set is correctly classified into a set of homogeneous domains. A domain is either a 2D or 3D region within which all data is related. Mixing data from more than one domain, or not classifying data into the correct domains, can often be the source of estimation errors.

You will learn about:

the impact of domains on estimated values

viewing and using domains.

The Impact of Domains on Estimated Values

Imagine that you are a meteorologist, and you are given three air temperatures measured at locations A, B, and C, as displayed below. Based on the values shown, what would you guess the temperature is at location X? Would you guess that the temperature at location X was greater than 25 degrees?

Using the information above, you may have the following thoughts:

1. Since location A is relatively distant from X, the value at A may have little or no influence on the estimated temperature at X.

2. Since locations B and C are about the same distance from X, they will probably have equal influence on the estimated temperature.

3. Given the previous two points, the temperature at X would probably be the average of the temperatures at B and C: (18 + 32) / 2 = 25 degrees

4. Since the influence of A has not been accounted for at all, and the estimate is exactly 25 degrees, it is difficult to say with certainty if the temperature at X is above 25 degrees.

Now consider the following: Imagine that you want to go to your favourite beach, but only if the temperature is 25 degrees or more. You have three friends who live near the beach you want to go to, and you call them up and ask each one what the temperature is at their home. You draw the map below, with the locations of each friend (A, B, and C) and the temperatures they give you. Your favourite beach is at location X. Note that the friend at location B lives high up in the mountains, while the friends at A and C live near the beach.



Using the information above, you may have the following thoughts:

1. The data from B can be ignored, because temperatures high up in the mountains are usually not good estimates of temperatures on the beach.

2. A and C are on the beach, so they can be used to guess the temperature at X.

3. Since X is between A and C on the map, the temperature at X will probably be somewhere between the temperature at A and the temperature at C.

4. Therefore, the temperature at X will be somewhere between 28 and 32 degrees.

5. Since the temperature range of 28 to 32 degrees is greater than the minimum value of 25 degrees, you would probably decide "Yes, I'm going to the beach!"

Compare this example with the first one. In both cases, all of the locations and temperatures are exactly the same. However, in the second case, when you took account of the domain which contains the data, you came up with a considerably different result. The point is that separating data into similar regions, or domains, is a very important part of making any geostatistical estimation.
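The two lines of reasoning can be sketched in a few lines of Python. This is only an illustration of the idea, not how GEMS performs estimation. The temperatures come from the example; the `near_x` flag and the `domain` labels are hypothetical names introduced here to encode "close to X" and "beach or mountain".

```python
# Temperatures from the example. "near_x" and "domain" are illustrative
# labels (assumptions), not fields of any GEMS data structure.
samples = [
    {"name": "A", "temp": 28, "domain": "beach", "near_x": False},
    {"name": "B", "temp": 18, "domain": "mountain", "near_x": True},
    {"name": "C", "temp": 32, "domain": "beach", "near_x": True},
]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# First scenario: ignore domains and average the samples closest to X (B and C).
naive_estimate = mean([s["temp"] for s in samples if s["near_x"]])

# Second scenario: use only samples from the same domain as X (the beach).
beach_temps = [s["temp"] for s in samples if s["domain"] == "beach"]
domain_range = (min(beach_temps), max(beach_temps))

print("Estimate ignoring domains:", naive_estimate)
print("Range using the beach domain only:", domain_range)
```

The same three values give an estimate of exactly 25 degrees when domains are ignored, and a range of 28 to 32 degrees when only the beach domain is used.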



Composites

Compositing is designed to ensure samples have comparable weight in the model (estimation). Samples inside a geological domain should be length weighted to create equal-length composite grades. Basic statistics on grade versus sample interval length can help determine the optimum sample length for compositing.
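As a rough illustration of length weighting, the sketch below builds fixed-length composites down a single hole. It is a simplified assumption-laden example, not the GEMS compositing algorithm: the function name `composite`, the `(from_m, to_m, grade)` tuple layout, and the sample values are all hypothetical, and domain boundaries, density and recovery weighting are ignored.

```python
def composite(intervals, length):
    """Length-weighted composites of a fixed length.

    intervals: list of (from_m, to_m, grade) tuples, sorted down the hole.
    Returns a list of (from_m, to_m, grade) composites.
    """
    if not intervals:
        return []
    end = intervals[-1][1]
    top = intervals[0][0]
    composites = []
    while top < end:
        bottom = min(top + length, end)
        weighted_sum = 0.0
        covered = 0.0
        for frm, to, grade in intervals:
            # Length of this sample that falls inside the composite.
            overlap = min(to, bottom) - max(frm, top)
            if overlap > 0:
                weighted_sum += grade * overlap
                covered += overlap
        if covered > 0:
            composites.append((top, bottom, weighted_sum / covered))
        top = bottom
    return composites


# Hypothetical hole: two 1 m samples followed by one 4 m sample.
hole = [(0.0, 1.0, 2.0), (1.0, 2.0, 4.0), (2.0, 6.0, 1.0)]
two_metre = composite(hole, 2.0)
for frm, to, grade in two_metre:
    print(f"{frm:.1f} - {to:.1f} m: {grade:.2f}")
```

In this example the single 4 m sample is split into two 2 m composites carrying the same grade, which is the kind of repeated value to keep in mind when compositing to a length shorter than the original samples.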

The composite length should be as close to the original sampling interval as possible to preserve the original standard. However, the new regular length of composites may be a projected length, for example by bench. Additional weighting factors may be used, such as density (SG) or a sample recovery factor. The aim is to have equivalently weighted samples for estimation. When compositing, database users should consider that the altered samples, or composites, may lose some of the properties of the original samples that are useful for QA/QC. For instance, detection limits will be affected and outliers will be brought more in line with the average. This is one of the reasons why the industry practice is to perform the topcut before compositing. In effect, this alters the data further, which is not always desirable. For example, if an ore deposit does have zones (pockets) of high grade represented by clusters of high-grade samples, cutting high grades will particularly hurt the quality of the estimation of those features.

A histogram showing the sample interval lengths helps identify the most common sampling interval. In the example above, we can see that 1m and 4m are the dominant sample lengths. Note that if the 4m samples are reduced to the size of the majority (1m), the grade continuity will be overly amplified by the repeated values from the 4m samples broken down into numerous smaller 1m samples. This is why, in many cases, compositing should be avoided or tailored to make large samples (4m in this case). However, using large composites instead of small ones can reduce the number of data points significantly. That, in turn, can alter the quality of estimation where grade changes rapidly, i.e., oversmoothing.



A scatter plot showing the relationship between length and grade is also useful. In the example above, we can see that the 1m sample length has ore grade values, whereas the 4m lengths most probably occur in areas of waste.

Read the Compositing manual for more information about different compositing methods.



Basic Statistics

Overview

One of the important preliminary steps in performing a geostatistical evaluation is to understand the statistical properties of the data. You will learn about:

descriptive statistics

histograms, cumulative frequency plots and probability curves

bimodal distributions

outliers.

Descriptive statistics

We use descriptive statistics to answer questions such as:

What are the typical values?

How variable is the data?

How skewed is our data - will extreme values bias our data?

How similar/different is each domain?

There are two types of statistics used to describe a set of data:

Measures of what is typical

Measures of difference

What is typical?

Measures of what is typical include:

Mean: the sum of the sample values divided by the number of samples.

Median: the middle value when the samples are sorted in order.

Mode: the most frequent sample value.

How are they different?

Measures of difference include:

Range: the difference between the maximum and minimum sample values.

Inter-quartile range: the difference between the upper and lower quartile values (the 75th and 25th percentiles) when the data is sorted in grade order.

Variance: the average squared difference between each sample and the mean grade
= sum((each sample value - mean)^2) / (number of samples - 1)

Standard deviation: the square root of the variance (this brings the number back into a grade sense).
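These measures can be computed with Python's standard `statistics` module, as in the sketch below. The data set is the small eight-value example used in the Bimodal Distributions section of this manual. Note that quartile conventions differ between software packages, so the inter-quartile range printed here (using the module's default method) may not match the value reported by another program.

```python
import statistics

data = [1, 3, 5, 5, 8, 8, 8, 9]

# Measures of what is typical
mean_value = statistics.mean(data)
median_value = statistics.median(data)
mode_value = statistics.mode(data)

# Measures of difference
value_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles, default method
iqr = q3 - q1
variance = statistics.variance(data)   # divides by (number of samples - 1)
std_dev = statistics.stdev(data)       # square root of the variance

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("range:", value_range)
print("inter-quartile range:", iqr)
print("variance:", variance)
print("standard deviation:", round(std_dev, 3))
```

`statistics.variance` uses the same (number of samples - 1) divisor as the formula above; `statistics.pvariance` would divide by the number of samples instead.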



Histograms

A histogram is a graphical version of a frequency table. It shows what proportion of cases fall into each of several non-overlapping intervals of some variable.

For example, a distribution of gold grades could be represented by the following table:

Gold (g/t)     Number of samples (frequency)

0.0 - 0.5      0
0.5 - 1.0      40
1.0 - 1.5      58
1.5 - 2.0      82
2.0 - 2.5      40
2.5 - 3.0      29
3.0 - 3.5      18
3.5 - 4.0      10
4.0 - 4.5      12
4.5 - 5.0      5
5.5 - 6.0      5
6.0 - 6.5      5
6.5 - 7.0      5
7.0 - 7.5      8
7.5 - 8.0      5

This same data can be displayed in a histogram as shown:
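As a stand-in for a plotted chart, the sketch below turns the frequency table into a simple text histogram and reports the proportion of samples in each interval. The rows are copied from the table exactly as printed; the function name `bar_line` and the scale of one `#` per two samples are arbitrary choices made for this illustration.

```python
# Frequency table copied from the text as printed: (lower, upper, count).
table = [
    (0.0, 0.5, 0), (0.5, 1.0, 40), (1.0, 1.5, 58), (1.5, 2.0, 82),
    (2.0, 2.5, 40), (2.5, 3.0, 29), (3.0, 3.5, 18), (3.5, 4.0, 10),
    (4.0, 4.5, 12), (4.5, 5.0, 5), (5.5, 6.0, 5), (6.0, 6.5, 5),
    (6.5, 7.0, 5), (7.0, 7.5, 8), (7.5, 8.0, 5),
]

total = sum(count for _, _, count in table)


def bar_line(lower, upper, count, scale=2):
    """One text row: interval, proportion of samples, and a bar of '#' marks."""
    proportion = count / total
    bar = "#" * (count // scale)
    return f"{lower:3.1f} - {upper:3.1f}  {proportion:6.1%}  {bar}"


for row in table:
    print(bar_line(*row))

# The interval with the most samples is the peak of the histogram.
modal_bin = max(table, key=lambda row: row[2])
print("Most frequent interval:", modal_bin[0], "-", modal_bin[1], "g/t")
```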

Cumulative Probability Plots

The cumulative probability plot is a summary of the proportion of samples that occur below each grade.

Grades associated with the steep part of the curve are the more frequent grades.
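A minimal sketch of how the cumulative proportions can be derived from a frequency table follows. The counts are those of the gold histogram table in the Histograms section, keyed by the upper edge of each interval; the helper name `proportion_below` is hypothetical.

```python
from itertools import accumulate

# (upper bin edge in g/t, number of samples), copied from the histogram table.
bins = [
    (0.5, 0), (1.0, 40), (1.5, 58), (2.0, 82), (2.5, 40),
    (3.0, 29), (3.5, 18), (4.0, 10), (4.5, 12), (5.0, 5),
    (6.0, 5), (6.5, 5), (7.0, 5), (7.5, 8), (8.0, 5),
]

counts = [count for _, count in bins]
total = sum(counts)

# Running total of samples divided by the overall total.
cumulative = [running / total for running in accumulate(counts)]


def proportion_below(grade):
    """Proportion of samples in intervals whose upper edge is at or below grade."""
    result = 0.0
    for (upper, _), cum in zip(bins, cumulative):
        if upper <= grade:
            result = cum
    return result


for (upper, _), cum in zip(bins, cumulative):
    print(f"below {upper:3.1f} g/t: {cum:6.1%}")
```

The largest jumps between successive rows occur between 1.0 and 2.0 g/t, which corresponds to the steep part of the curve.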



Probability Plots

A probability plot is a cumulative probability plot with the probability axis rescaled so that a normal distribution plots as a straight line.

If the plot presents as a straight line, then the data set reflects a normal distribution. If the grade scale is converted to a log scale, a straight line indicates a log-normal population. With the log scale making the line straight, it is possible to distinguish various domains by a change of slope. In general, this is easier to see on a probability plot than on a histogram. Hence the probability plot may be more useful for selecting indicators for analysis. Inflection points can represent different data populations.
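The coordinates of a probability plot can be sketched with the standard library alone. The example below is an illustration under stated assumptions: the grades are hypothetical, and the plotting position (rank - 0.5) / n is one common convention among several, not necessarily the one used by GEMS.

```python
import math
from statistics import NormalDist


def probability_plot_points(values, log_scale=False):
    """Return (normal score, grade) pairs for a probability plot.

    If the points fall close to a straight line, the data is roughly normal
    (or log-normal when log_scale is True).
    """
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for rank, value in enumerate(ordered, start=1):
        p = (rank - 0.5) / n               # plotting position (assumed convention)
        score = NormalDist().inv_cdf(p)    # standard normal quantile for p
        grade = math.log10(value) if log_scale else value
        points.append((score, grade))
    return points


# Hypothetical grades, roughly evenly spaced on a log scale.
grades = [0.4, 0.7, 1.1, 1.6, 2.5, 4.0, 6.3, 10.0]
points = probability_plot_points(grades, log_scale=True)
for score, grade in points:
    print(f"{score:6.2f}  {grade:6.2f}")
```

Plotting the second column against the first, and looking for changes of slope, is the visual check described above.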

Bimodal Distributions

Two characteristics which can potentially reduce the quality of your estimations are bimodalism and outliers.

The mode is the most commonly occurring value in a data set. For example, in the following data set, the number 8 is the mode:

1 3 5 5 8 8 8 9

Bimodal means that there are two relatively most common values which are not adjacent to one another. In the following data set, the numbers 2 and 8 are equally common, and the distribution is said to be bimodal:

1 2 2 2 3 5 5 8 8 8 9
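For small data sets like these two, the modes can be checked with `statistics.multimode` from the Python standard library (Python 3.8 or later). This only finds exact ties for the most frequent value; with real grade data, where values rarely repeat exactly, bimodality is normally judged from the peaks of a histogram instead.

```python
from statistics import multimode

unimodal = [1, 3, 5, 5, 8, 8, 8, 9]
bimodal = [1, 2, 2, 2, 3, 5, 5, 8, 8, 8, 9]

# multimode returns every value that shares the highest frequency.
modes_first = multimode(unimodal)
modes_second = multimode(bimodal)

print("Modes of the first data set:", modes_first)
print("Modes of the second data set:", modes_second)
```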

Imagine that you are studying the average specific gravity, or density, of rocks in a coal deposit. A histogram of all rock samples might look like this:



Any histogram which displays two humps, as in the example above, is said to be bimodal. The bimodal distribution in the example above can be explained by the fact that the data set is composed of coal samples as well as intervening sandstone and mudstone bands. The specific gravity values between 1 and 2 are representative of the coal, while specific gravity values between 2 and 3 represent the intervening rock.

Often the source of a bimodal distribution can be two domains being mixed into a single data set. In order to minimise estimation errors, you should make every attempt to separate any data set which has a bimodal distribution. In the example above, merely segregating the data based on rock type would result in two separate normal distributions.
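Segregating a mixed data set by rock type can be sketched as follows. The specific gravity values and the grouping of sandstone and mudstone into a single "rock" subset are hypothetical, chosen only to mirror the coal example; they are not data from this manual.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (rock type, specific gravity) samples.
samples = [
    ("coal", 1.3), ("coal", 1.4), ("sandstone", 2.5), ("coal", 1.5),
    ("mudstone", 2.6), ("coal", 1.4), ("sandstone", 2.4),
]


def split_by_domain(samples):
    """Group specific gravity values into 'coal' and 'rock' subsets."""
    groups = defaultdict(list)
    for rock_type, sg in samples:
        domain = "coal" if rock_type == "coal" else "rock"
        groups[domain].append(sg)
    return dict(groups)


groups = split_by_domain(samples)
mixed_mean = mean(sg for _, sg in samples)

print("Mean of the mixed data set:", round(mixed_mean, 2))
for domain, values in groups.items():
    print(domain, "mean:", round(mean(values), 2), "from", len(values), "samples")
```

The mean of the mixed set falls between the two humps and describes neither the coal nor the intervening rock, whereas each subset has a mean that is typical of its own material.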



Outliers

Overview

Outliers are data values that are much higher (or much lower) than most data values in a single domain. For a number of reasons, you should either "cut" them down (or up) to some value or remove them.

You will learn about:

outliers and topcuts

methods of determining a topcut value

Outliers and Topcuts

An outlier is a statistical term for a value which is significantly different from the majority of all other values in the data set. For example, in the following data set, the number 236 would be considered to be an outlier:

1 3 5 5 8 8 8 236

Outliers can cause "noisy" experimental variograms, which are difficult to model. Additionally, if used in an estimation, outliers can produce unrealistic results. One technique used to reduce the impact of outliers is to apply a cutoff, or "topcut", to them. In the example above, the value of 236 could be cut, or changed, to a value of 9:

1 3 5 5 8 8 8 9

Another alternative is to remove the outlier value(s).
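Both options can be written as one-line list operations. The sketch below uses the data set and the topcut value of 9 from the example above; the function names are hypothetical.

```python
def apply_topcut(values, topcut):
    """Replace every value above the topcut with the topcut itself."""
    return [min(value, topcut) for value in values]


def remove_above(values, topcut):
    """Drop every value above the topcut instead of cutting it."""
    return [value for value in values if value <= topcut]


data = [1, 3, 5, 5, 8, 8, 8, 236]

cut = apply_topcut(data, 9)
removed = remove_above(data, 9)

print("Original mean:", sum(data) / len(data))
print("Mean after topcut:", sum(cut) / len(cut))
print("Mean after removal:", sum(removed) / len(removed))
```

The single outlier raises the mean from under 6 to over 34, which shows how strongly one extreme value can influence a simple statistic.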

Another point of view on Topcuts!

Topcutting samples could be like saying those samples are no good and should be thrown out. It is a rather strong statement that must be justified, or not, by geological evidence. Geological interpretation is not a game with numbers. If outliers are anomalous values, and ore deposits are all anomalies, then all ore deposits are outliers.

One of the problems with altering an outlier's value to make it more normal is that it makes the data factually false, i.e., the grade will look more continuous and smooth than it is. Hence the estimation based on it will not be exact, and is perhaps less correct. For instance, Kriging automatically considers grade variation for each set of data per block and corrects the outliers by giving them a proper weight. If a block is surrounded by extremely high values, perhaps it is a high-value block, or ore. Only Kriging can process outliers in this way. The point is to have enough values to be able to understand what a normal value is at the block scale or at the larger domain scale.

Methods of Determining a Topcut Value

Some methods used to determine a topcut value use concepts such as:

Histogram

Confidence interval

Percentile

Experience

Histogram

The point at which the cumulative frequency curve "flattens out" can be used as the cutoff. In the case below, the curve appears to be fairly flat at a value of 25.

A histogram can also help visually substantiate the choice of a topcut value obtained by other means.



Confidence interval

A confidence interval is an estimated range of values which is likely to include a given percentage of the data values, assuming that the data is normally distributed. The calculation for the upper limit of a 95% confidence interval (CI) is:

95% CI = mean + (1.96 * standard deviation)

For example, if a data set has: mean = 6.49 and standard deviation = 9.30.

95% CI = 6.49 + (1.96 * 9.30)

95% CI = 24.718

For simplicity, you could choose to use the nearest integer value of 25 as your cutoff.
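The calculation above is easy to reproduce. The sketch below uses the mean and standard deviation from the example; the function name is hypothetical, and the 1.96 multiplier carries the same assumption of normally distributed data as the text.

```python
def upper_95_limit(mean, std_dev):
    """Upper limit of the 95% interval: mean + 1.96 standard deviations."""
    return mean + 1.96 * std_dev


limit = upper_95_limit(6.49, 9.30)
topcut = round(limit)

print("Upper 95% limit:", round(limit, 3))
print("Topcut rounded to the nearest integer:", topcut)
```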

Percentile

A percentile is the data value below which a given percentage of all other data values fall. Any given percentile value could be selected as the outlier cutoff, such as the 90th, 95th, or 99th percentile. For example, you could choose one of the following (from the previous Basic Statistics report):

90.0 Percentile: 22.5
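A percentile can be computed in several slightly different ways. The sketch below uses the nearest-rank definition, which is an assumption made for simplicity and may differ from the method behind the Basic Statistics report. The data is hypothetical (the integers 1 to 20) and does not reproduce the report values.

```python
import math


def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile: the smallest value with at least
    `percent` percent of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)
    rank = max(rank, 1)  # guard for very small percentages
    return ordered[rank - 1]


data = list(range(1, 21))  # hypothetical grades 1..20

candidates = {p: percentile_nearest_rank(data, p) for p in (90, 95, 99)}
for percent, value in candidates.items():
    print(f"{percent}th percentile: {value}")
```

Any of the candidate values could then be used as the topcut, in the same way as the values listed in the report.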




Experience

Topcut values are often chosen based on knowledge of a deposit.

If part of an ore zone has been mined, information from grade control samples and reconciliation studies may provide a good idea of what the maximum mined block value will be. Comparing sampling statistics to production results is another method of choosing a topcut. One can lower the topcut value until the model estimate matches the production grade. Again, this is experimental, not really scientific. As mine production will go through zones of different quality ore, the process will have to be repeated.

If the deposit has not yet been mined, information from similar deposits may be useful in determining the outlier cutoff.
