eart20170 computing, data analysis & communication skills

36
EART20170 Computing, Data Analysis & Communication skills Lecturer: Dr Paul Connolly (F18 – Sackville Building) [email protected] 2. Computing (Excel statistics/modelling) 2 lectures assessed practical work Course notes etc: http://cloudbase.phy.umist.ac.uk/peopl e/connolly Recommended reading: Cheeney. (1983) Statistical methods in Geology. George, Allen & Unwin

Upload: tamera

Post on 06-Jan-2016

27 views

Category:

Documents


1 download

DESCRIPTION

EART20170 Computing, Data Analysis & Communication skills. Lecturer: Dr Paul Connolly (F18 – Sackville Building) [email protected]. 2. Computing (Excel statistics/modelling) 2 lectures assessed practical work Course notes etc: http://cloudbase.phy.umist.ac.uk/people/connolly. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: EART20170 Computing, Data Analysis & Communication skills

EART20170 Computing, Data Analysis & Communication skills

Lecturer: Dr Paul Connolly (F18 – Sackville Building)[email protected]

2. Computing (Excel statistics/modelling)2 lecturesassessed practical work

Course notes etc: http://cloudbase.phy.umist.ac.uk/people/connolly

Recommended reading: Cheeney. (1983) Statistical methods in Geology. George, Allen & Unwin

Page 2: EART20170 Computing, Data Analysis & Communication skills

Plan

Two more lectures plus drop-in sessions in computer labs

Assessment handed out today and need to hand in by 16:00, Tuesday December 12th.

Page 3: EART20170 Computing, Data Analysis & Communication skills

Lecture 4

Using Microsoft excel

Cell referencing and naming datasets

Entering formulas

Worksheet formulas

Statistical functions and add-ins

Analysing the Gaussian and T-distributions in Excel.

Page 4: EART20170 Computing, Data Analysis & Communication skills

Using Microsoft Excel

Worksheet

Cell

Cell reference

Function bar

Basic functions

Page 5: EART20170 Computing, Data Analysis & Communication skills

Using Microsoft Excel

Can also `import’ text files

Entering data

Can name worksheets

Page 6: EART20170 Computing, Data Analysis & Communication skills

Cell referencing

Several different ways of referencing the information in a cell.

A1 reference style

Difference between absolute and relative references

Page 7: EART20170 Computing, Data Analysis & Communication skills

The A1 reference style

To refer to Use

The cell in column A and row 10 A10

The range of cells in column A and rows 10 through 20

A10:A20

The range of cells in row 15 and columns B through E

B15:E15

All cells in row 5 5:5

All cells in rows 5 through 10 5:10

All cells in column H H:H

All cells in columns H through J H:J

The range of cells in columns A through E and rows 10 through 20

A10:E20

Page 8: EART20170 Computing, Data Analysis & Communication skills

To refer to another worksheet

Example you can use: =MySheet!B1:B10

You can also use worksheet functions in the same way =Average(MySheet!B1:B10)

The worksheets must be in the same workbook

Page 9: EART20170 Computing, Data Analysis & Communication skills

Relative, absolute and mixed references

A relative cell reference, such as A1, is based on the relative position of the cell. If the position of the cell that contains the reference changes, the reference itself is changed.

An absolute cell reference, such as $A$1, always refers to a cell in a specific location. If the position of the cell that contains the formula changes, the absolute reference remains the same.

A mixed reference has either an absolute column and relative row, or absolute row and relative column. An absolute column reference takes the form $A1, $B1, and so on. An absolute row reference takes the form A$1, B$1, and so on.

Page 10: EART20170 Computing, Data Analysis & Communication skills

Relative references

Suppose I enter that cell b2 is equal to a1.

If I copy cell b2 to b3, the relative reference automatically adjusts to be the next cell relative to a1 – a2!

This is the default in Excel

Page 11: EART20170 Computing, Data Analysis & Communication skills

Absolute references

If I put in an absolute cell reference, the cell reference does not change when copied to other cells

Page 12: EART20170 Computing, Data Analysis & Communication skills

Mixed references

What happens if I copy a relative column, absolute row reference to the c3 cell?

The relative column reference adjusts, but the absolute column reference does not

Page 13: EART20170 Computing, Data Analysis & Communication skills

Naming datasets

You can name a dataset by selecting it with the mouse (left clicking and dragging) and entering the name in the `reference box’

The name must not have any `spaces’.

The name can be used as a reference in that worksheet.

Page 14: EART20170 Computing, Data Analysis & Communication skills

Aside: An example of plotting data in excel

There are many ways of plotting the data that get the same result

Usually you will do this by trial and error (i.e. create a plot and edit it

Say you wanted to plot two similar things on the same graph, one a line and one a histogram

Start by selecting all three columns and go to

insert->chart

Page 15: EART20170 Computing, Data Analysis & Communication skills

Aside: An example of plotting data in excel

In this case you want to plot a custom line and column plot

Select Custom types and find the line column graph

Page 16: EART20170 Computing, Data Analysis & Communication skills

Aside: An example of plotting data in excel

As you can see, Excel wants to plot all three columns, which we don’t want in this case

But you can remove the first column by clicking on the Series tab, highlighting the series you want to remove and clicking remove

Page 17: EART20170 Computing, Data Analysis & Communication skills

Aside: An example of plotting data in excel

Clicking next you can play around with many settings and annotate your plot

This requires some playing with to investigate all the options

Page 18: EART20170 Computing, Data Analysis & Communication skills

Aside: An example of plotting data in excel

As you can see the bar chart has large gaps which is not what we want

You can change this by right clicking on the bars and selecting Format Data Series…

On the next dialog, click on the options tab and set the Gap widths to 0

Page 19: EART20170 Computing, Data Analysis & Communication skills

Entering formulas

You can use Excel just like a calculator

This involves entering numerical expressions

For example: imagine I wanted to calculate the mean of 3 numbers: 2, 3 and 7

I could enter in cell a1: =(2+3+7)/3

Page 20: EART20170 Computing, Data Analysis & Communication skills

Entering formulas

A better way might be to enter the values in cells, a1, a2 and a3

This is still repetitive for large datasets, but it is useful for calculating things like t-values

But Excel has many worksheet functions…

Page 21: EART20170 Computing, Data Analysis & Communication skills

Worksheet functions

Functions are predefined formulas that perform calculations by using specific values, called arguments, in a particular order, or structure. Functions can be used to perform simple or complex calculations. For example, the `average’ function calculates the mean of data in a given range.

=average(a1:a10) =average(data1) Excel has a huge library of

functions like this. Using them takes practice but it is best to learn by doing examples.

Page 22: EART20170 Computing, Data Analysis & Communication skills

Worksheet functions

The easiest way to start using functions is to `insert’ one into the worksheet.

Have a cell highlighted and click on the insert->function tab.

Page 23: EART20170 Computing, Data Analysis & Communication skills

Worksheet functions

The insert function dialog pops-up.

You can either search for a function or find it yourself.

Usually you will want to use a certain category such as `statistical’ and find the function in that category

Page 24: EART20170 Computing, Data Analysis & Communication skills

Worksheet functions

Next you will get a dialog asking you to put in cell references for the function.

This dialog changes for different functions. For the above example using the `average’

function we could put in A1:A3 in the first text box and click OK.

The mean will appear in the cell.

Note that when you become efficient with excel you will use shortcuts for inserting functions and wont need to use this dialog.

Page 25: EART20170 Computing, Data Analysis & Communication skills

Statistical worksheet functions (hint, hint)

Average(a1:a3) [average(data1)] Stdev(a1:a3) [stdev(data1)] Var(a1:a3) [variance] Stdevp(a1:a3) [stdevp(data1)] Sqrt($a$1) [sqrt(a1)] ^2 [to the power] Sum(a1:a3) [sum(data1)] pearson(a1:a3,b1:b3)

[pearson(data1,data2)] Rsq(a1:a3,b1:b3) [rsq(data1,data2)] Quartile(a1:a20,0.25)

[quartile(data1,0.5] Mode(a1:a20) [mode(data1)] Normdist(a1,mean,std,TRUE) Norminv(p,mean,std) Tdist(a1,8,2) [students t-distribution] Tinv(p,8) [inverse of t-distribution] Ttest(data1,data2,tails,type)

[comparing two means] probability of rejecting null hypothesis (if this is less than significance level, reject null hypothesis). Type should equal 2.

Page 26: EART20170 Computing, Data Analysis & Communication skills

Analysing the Gaussian (normal) distribution

The function normdist has the following prototype: NORMDIST(x,mean,std,cumulative) It returns the value of a Gaussian distribution with

given mean and standard deviation at x. Cumulative is either set to `FALSE’ or `TRUE’ If it is set to FALSE the function will return the actual

value of the Gaussian distribution at x. If it is set to TRUE the function will return the

cumulative distribution at x (i.e. this is the table from lecture 3).

Confidence level of a given interval. Eg what is the probability of a value lying in the interval...

-2 -1.5 -1 -0.5 0 0.5 1 1.5 20

0.2

0.4

0.6

0.8

1A normal distribution, =0.82, =0.45

x

f(x)

normdist-FALSEnormdist-TRUE

Page 27: EART20170 Computing, Data Analysis & Communication skills

Analysing the Gaussian (normal) distribution

The function norminv has the following prototype:

NORMINV(p,mean,std) It returns the x value associated with the

cumulative probability of p Useful for assessing levels of significance. Eg

what are the limits on x at a given confidence level?

This method is used more frequently for the t-distribution

-2 -1.5 -1 -0.5 0 0.5 1 1.5 20

0.2

0.4

0.6

0.8

1A normal distribution, =0.82, =0.45

x

f(x)

normdist-FALSEnormdist-TRUE

Page 28: EART20170 Computing, Data Analysis & Communication skills

Analysing the t-distribution

You still need to remember the formulas for (1) estimating the interval for the mean; (2) testing the significance of the correlation coefficient; and (3) if two means are equal.

The function tdist has the following prototype: TDIST(x,df,tails) It returns the significance level (alpha) of a t-distribution

with given degrees of freedom. Tails is either set to 1 or 2. If it is set to 1 the function will return the accumulation of

probability from infinity to x. If it is set to 2 the function will return the accumulation of

probability in both tails. Not used too often

0 0.5 1 1.5 2 2.5 3 3.5 40

0.1

0.2

0.3

0.4

0.5

x

f(x)

A t-distribution, =5

tdist-non cumulativetdist-cumulative

Page 29: EART20170 Computing, Data Analysis & Communication skills

Analysing the t-distribution

The function tinv has the following prototype: TINV(alpha,df) It returns the critical value for the t-distribution

corresponding to a significance level, alpha. By default it is a two tailed confidence level,

but for a one tailed confidence level substitute 2x(alpha) for alpha.

Used in hypothesis testing.

0 0.5 1 1.5 2 2.5 3 3.5 40

0.1

0.2

0.3

0.4

0.5

x

f(x)

A t-distribution, =5

tdist-non cumulativetdist-cumulative

Page 30: EART20170 Computing, Data Analysis & Communication skills

Excel has a quick way of comparing two means: The student t-test. If they have the same length we can use the TTEST

function But rather than giving us the critical t-value, it gives

us a critical probability for rejection.

What if this is less than the significance value, alpha?

0 0.5 1 1.5 2 2.5 3 3.5 40

0.1

0.2

0.3

0.4

0.5

x

f(x)

A t-distribution, =5

tdist-non cumulativetdist-cumulative

T-critical

RejectionAcceptance

We reject the null hypothesis

Page 31: EART20170 Computing, Data Analysis & Communication skills

Random number generation

This is very useful in computational science. They are not really random numbers, they are

generated by an algorithm. But it is difficult to get random numbers on a

computer.

The worksheet function rand() generates random numbers between 0 and 1.

Hence to generate a normally distributed random number sequence, with a given mean and standard deviation we can use:

Norminv(rand(),mean,std) Over many generations, the variable will have the

given mean and standard deviation

-2 -1.5 -1 -0.5 0 0.5 1 1.5 20

0.2

0.4

0.6

0.8

1A normal distribution, =0.82, =0.45

x

f(x)

normdist-FALSEnormdist-TRUE

Page 32: EART20170 Computing, Data Analysis & Communication skills

Random number generation

0 500 1000 1500 2000 2500 3000 3500-1

-0.5

0

0.5

1

1.5

2true mean 0.50341true stdev 0.30652

x

A normally distributed random number with =0.5, =0.3

-2 -1 0 1 20

200

400

600

800

1000

Fre

quen

cy

x

Normally distributed random number

true mean 0.50341true stdev 0.30652

true mean 0.50341true stdev 0.30652

Over a large number of generations, the mean approaches that of the true random variable

Page 33: EART20170 Computing, Data Analysis & Communication skills

Next lecture

Propagation of errors.

Many people struggled with this on the test.

There is an easy way to do it with Excel.

Lets go back to our example measuring bed thickness. We know that x=12.10.3m and y=4.20.2m.

So if we generate many values of x and y with the above means and std, then add all these together we could calculate the mean and std of the result.

Page 34: EART20170 Computing, Data Analysis & Communication skills

Error propagation

Using a computer we can generate many normally distributed variables and therefore find the distribution of the answer.

Hence we can directly calculate the error (standard deviation) in the answer.

This is useful as it can be done for complicated equations with ease.

This is called the Monte Carlo method of propagating errors.

Page 35: EART20170 Computing, Data Analysis & Communication skills

One more thing

Worksheet functions can also take text arguments and return text.

If you are doing a hypothesis test you might want a statement that tells you to either accept or reject the null hypothesis

You can use the `IF’ construct

=IF(logical test,if true,if false)

E.g. IF(tvalue>tcrit,”Accept alternate hypothesis”,”Accept null hypothesis”).

IF can also return numeric values.

Page 36: EART20170 Computing, Data Analysis & Communication skills

Homework

Have a look at using some of the functions in excel.

Especially for manipulating cells and calculating means, standard deviations.

Try calculating z-values and t-values from the table from handout 3 in excel to see if you can get them correct.

This will help for Tuesdays practical labs.

REMEMBER: please check the student notice board for your allotted time.