GRAPHS & STATS

20 September 2014 Sherubtse Training

• More on scatterplots

• Exporting data

• Overview of statistics

• T-tests

What kinds of interesting questions can we ask? What graphs would we make to answer them?

HtWt Data

• Is there a difference in height between UWICE & SFS personnel? Does it differ for males vs. females?

• Is there a difference in weight between UWICE & SFS personnel? Does it differ for males vs. females?

• Is there a relationship between height and weight for UWICE personnel? How about for SFS personnel?

• Is there a relationship between height and weight for males? How about for females?

Just for fun...add a column of calculated data (BMI) and then summarize the data by SEX and INSTITUTE

HtWt$BMI <- [equation]
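A minimal sketch of the BMI step, assuming the height and weight columns are named cm and kg (as in the later plot() call) and the grouping columns are sex and institute. The six rows are invented for illustration and are not the workshop data. BMI is weight in kilograms divided by height in metres squared.

```r
# Hypothetical stand-in for the HtWt data frame (all values are invented)
HtWt <- data.frame(
  institute = factor(c("UWICE", "UWICE", "UWICE", "SFS", "SFS", "SFS")),
  sex       = factor(c("M", "F", "M", "F", "M", "F")),
  cm        = c(172, 158, 168, 160, 175, 163),
  kg        = c(68, 52, 71, 55, 80, 58)
)

# BMI = weight (kg) / height (m)^2; cm / 100 converts centimetres to metres
HtWt$BMI <- HtWt$kg / (HtWt$cm / 100)^2

# Mean BMI for each combination of sex and institute
bmi.summary <- aggregate(BMI ~ sex + institute, data = HtWt, FUN = mean)
print(bmi.summary)
```

aggregate() is only one option here; tapply() or table-style summaries would serve equally well.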

Create this scatterplot of heights vs. weights for just UWICE personnel

Alternative (formula) format for plot(): plot(kg ~ cm, data=UWICE)

Add a regression line to the UWICE scatterplot and determine if the linear relationship is significant

fit the model with lm(), e.g., lm.UWICE <- lm(kg ~ cm, data=UWICE), then inspect it with summary(lm.UWICE)

significant or not?

add the regression line: abline(lm.UWICE, col="red")

one way to add the p-value as text: text(locator(1), "p = 0.0016", cex=1.5, col="red")
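The regression steps can be sketched end to end as follows. The data are invented, so the p-value will not match the 0.0016 on the slide; the column names cm and kg and the object name lm.UWICE follow the slides. text() is given fixed coordinates instead of locator(1) so the script also runs without clicking on the plot.

```r
# Hypothetical UWICE subset (invented values, for illustration only)
UWICE <- data.frame(
  cm = c(155, 160, 163, 166, 170, 172, 175, 180),
  kg = c(50, 55, 54, 60, 63, 70, 68, 77)
)

# Scatterplot using the formula interface: y ~ x
plot(kg ~ cm, data = UWICE, pch = 16,
     xlab = "Height (cm)", ylab = "Weight (kg)")

# Fit the linear regression and look at slope, R-squared and p-value
lm.UWICE <- lm(kg ~ cm, data = UWICE)
print(summary(lm.UWICE))

# Pull the p-value for the slope out of the coefficient table
p.slope <- summary(lm.UWICE)$coefficients["cm", "Pr(>|t|)"]

# Add the fitted line, then annotate the plot with the p-value
abline(lm.UWICE, col = "red")
text(x = 160, y = 75, labels = paste("p =", signif(p.slope, 2)),
     cex = 1.5, col = "red")
```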

Add a thick dashed blue line at x = 170 to indicate which UWICE staff can receive special travel privileges

(HINT: use ?par to figure out the plot arguments for setting line type & line width)
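One possible answer, sketched with invented data: in ?par, lty sets the line type and lwd the line width, and abline(v = ...) draws a vertical line at a given x value.

```r
# Hypothetical UWICE data (invented values)
UWICE <- data.frame(cm = c(155, 163, 168, 172, 176, 181),
                    kg = c(50, 56, 61, 66, 70, 78))

plot(kg ~ cm, data = UWICE, pch = 16)

# lty = 2 is a dashed line; lwd = 3 makes it three times the default width
abline(v = 170, lty = 2, lwd = 3, col = "blue")

# How many persons are to the right of the line?
n.tall <- sum(UWICE$cm > 170)
print(n.tall)
```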

What are the sexes of the 3 tall persons who will get special privileges? Use identify() to label them:

identify(UWICE$cm, UWICE$kg, labels=UWICE$sex, n=3)

Do the same with the SFS scatterplot...(is the relationship significant?)

You can put all the data in a single graph with different institutions represented by different colors

METHOD #1

1) Figure out the lower & upper limits for x- and y-axes

2) Plot the first (e.g., UWICE) data and regression line, setting the xlim and ylim arguments

3) Add the additional data, using points()

4) Add the lines (abline()), using colors matching each institution's points

Add a Legend

'legend' is not an argument in plot(), so we add it as a separate line of code

legend(x="topleft", legend=c("UWICE","SFS"), fill=c("purple","blue"), inset=.02, bty="n")

what does each of these arguments mean?
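A sketch of Method #1 with the legend added, using invented data for both institutes (the data frame names UWICE and SFS are assumptions for illustration). The comments on the legend() call describe what each argument does.

```r
# Invented data for two institutes (illustration only)
UWICE <- data.frame(cm = c(158, 165, 170, 176, 182), kg = c(52, 60, 64, 71, 80))
SFS   <- data.frame(cm = c(150, 160, 168, 174, 179), kg = c(48, 54, 63, 66, 74))

# 1) Axis limits that cover both data sets
xlim <- range(c(UWICE$cm, SFS$cm))
ylim <- range(c(UWICE$kg, SFS$kg))

# 2) Plot the first data set and its regression line
plot(kg ~ cm, data = UWICE, col = "purple", pch = 16, xlim = xlim, ylim = ylim)
abline(lm(kg ~ cm, data = UWICE), col = "purple")

# 3) Add the second data set with points()
points(kg ~ cm, data = SFS, col = "blue", pch = 16)

# 4) Add its regression line in the matching colour
abline(lm(kg ~ cm, data = SFS), col = "blue")

# x      = where to put the legend (a keyword or a coordinate)
# legend = the text labels
# fill   = colour of the box drawn next to each label
# inset  = distance from the plot margin, as a fraction of the plot region
# bty    = box type around the legend; "n" means no box
legend(x = "topleft", legend = c("UWICE", "SFS"), fill = c("purple", "blue"),
       inset = 0.02, bty = "n")
```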

You can put all the data in a single graph with different institutions represented by different colors

METHOD #2

1) Figure out the lower & upper limits for x- and y-axes

2) Plot the full data, setting the xlim and ylim arguments and 2 colors & pch values*

3) Add the lines (abline()), using colors matching each institution's points

4) Add the legend

* col=c("blue","purple")[HtWt$institute], pch=c(16,15)[HtWt$institute]

How would you change the legend boxes to match the points in the scatterplot (UWICE = circle, SFS = square)?
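One way to approach this, sketched with invented data: give legend() the pch and col arguments instead of fill, so it draws plotting symbols rather than boxes. The named lookup vectors cols and pchs are a choice made for this sketch; indexing an unnamed vector by a factor, as in the footnote above, depends on the order of the factor levels instead.

```r
# Hypothetical HtWt data (invented values)
HtWt <- data.frame(
  institute = factor(c("UWICE", "UWICE", "UWICE", "SFS", "SFS", "SFS")),
  cm = c(158, 170, 182, 150, 168, 179),
  kg = c(52, 64, 80, 48, 63, 74)
)

# Named lookup vectors make the institute-to-style mapping explicit
cols <- c(UWICE = "purple", SFS = "blue")
pchs <- c(UWICE = 16, SFS = 15)   # 16 = filled circle, 15 = filled square
inst <- as.character(HtWt$institute)

plot(kg ~ cm, data = HtWt, col = cols[inst], pch = pchs[inst],
     xlim = range(HtWt$cm), ylim = range(HtWt$kg))

# One regression line per institute, in the matching colour
for (i in names(cols)) {
  abline(lm(kg ~ cm, data = HtWt[HtWt$institute == i, ]), col = cols[i])
}

# pch and col (instead of fill) make the legend show the point symbols
legend("topleft", legend = names(cols), pch = pchs, col = cols,
       inset = 0.02, bty = "n")
```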

Exporting Data

To transfer a matrix or data frame via clipboard: write.table(HtsWts, "clipboard", sep="\t"), then paste in Excel

...to a tab-delimited text file: write.table(HtsWts, "c:/mydata.txt", sep="\t")
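A small sketch of the file export, using an invented two-row data frame and a temporary folder instead of c:/ so it runs on any system. The row.names = FALSE and quote = FALSE arguments are optional additions (not on the slide) that usually give a cleaner file for Excel. Writing to "clipboard" is typically only possible on Windows.

```r
# Invented example data
HtsWts <- data.frame(cm = c(160, 172), kg = c(55, 68))

# Write a tab-delimited text file into R's temporary folder
out.file <- file.path(tempdir(), "mydata.txt")
write.table(HtsWts, out.file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back to check that the export worked
check <- read.delim(out.file)
print(check)
```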

Intro to Statistics

Sample Statistic

Population Parameter (e.g., Height)

DESCRIPTIVE STATISTICS: What is the mean height of our sample of persons?

INFERENTIAL STATISTICS: Is the mean height of our sample a good measure of true population height?

N, Mean, Standard Deviation, Standard Error, Confidence Interval

DESCRIPTIVE STATISTICS

• Mean: our best estimate of the true population mean

• Standard Deviation: our best estimate of the true population variability

N, Mean, Standard Deviation, Standard Error, Confidence Interval

INFERENTIAL STATISTICS

• Mean: our best estimate of the true population mean

• Standard Error & Confidence Interval: how good is our estimate (from the sample) of the true population mean?

Descriptive Statistics

Summarize the data we have collected:

• mean, median, mode

• range, variance, standard deviation, interquartile range

• graphical summaries of the data (e.g., histogram, boxplot)
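The summaries listed above map onto base R functions as sketched here; the heights are invented. R has no built-in function for the statistical mode, so the table() line is one common workaround.

```r
# Invented sample of heights in cm
heights <- c(158, 160, 163, 165, 165, 168, 170, 172, 175, 181)

# Measures of centre
h.mean   <- mean(heights)
h.median <- median(heights)
h.mode   <- as.numeric(names(which.max(table(heights))))  # most frequent value

# Measures of spread
h.range <- range(heights)   # minimum and maximum
h.var   <- var(heights)
h.sd    <- sd(heights)
h.iqr   <- IQR(heights)

print(c(mean = h.mean, median = h.median, mode = h.mode, sd = h.sd, IQR = h.iqr))

# Graphical summaries
hist(heights)
boxplot(heights)
```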

Why do we need it? It’s difficult to just look at raw data and understand what they mean

Inferential Statistics

Use a sample of data to make conclusions and predictions about the population we sampled from

Often used to determine if there are differences between populations or if a ‘treatment’ affected a population

Why do we need it? We often don’t have the time or money to collect data from the entire population we are interested in. For inferential statistics, conclusions are only reliable if we sampled properly!

Truth + Chance = Sample Statistic

We use the sample data to make our best prediction about the population (the data we don't have), and then quantify the chance that we’re wrong (standard errors & confidence intervals)

But no matter how fancy the statistics or how pretty the graphs, conclusions are only reliable if we sampled properly!

What is a Normal Distribution?

Does ‘Normal’ Exist?

Does ‘Non-Normal’ Exist?

When are Data Non-Normal?

When multiple processes or populations are combined in a single data set...

e.g., heights of children aged 5 - 12

When are Data Non-Normal?

When the population has many values close to zero or some other natural limit...

When are Data Non-Normal?

When some extreme values skew the population...(here, also bounded by zero)

e.g., THE SUPER RICH, COMPANY EXECUTIVES

When are Data Non-Normal?

When the data follow a process that naturally generates non-normal distributions

• POISSON DISTRIBUTION: Counts of rare events, e.g., accidents (lower bound of zero)

• EXPONENTIAL DISTRIBUTION: Population growth

• BINOMIAL DISTRIBUTION: Proportion (%) data

What Can We Do With Non-Normal Data?

• Check the data for errors; then

• Transform data to approximate a normal distribution; OR

• Apply nonparametric statistics
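A hedged sketch of the last two options, using simulated right-skewed data (the names site.a and site.b are made up). A log transformation is one common choice for positive, right-skewed data, and wilcox.test() is the usual nonparametric alternative to a two-sample t-test; which is appropriate depends on the data.

```r
set.seed(1)

# Simulated right-skewed measurements from two hypothetical sites
site.a <- rlnorm(20, meanlog = 1.0, sdlog = 0.8)
site.b <- rlnorm(20, meanlog = 1.5, sdlog = 0.8)

# Option 1: transform, then use a parametric test on the log scale
hist(log(site.a))                       # check the shape after transforming
res.log <- t.test(log(site.a), log(site.b))

# Option 2: leave the data alone and use a nonparametric (rank-based) test
res.rank <- wilcox.test(site.a, site.b)

print(res.log$p.value)
print(res.rank$p.value)
```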

What is the Standard Error?

Standard Error (SE): sd / sqrt(n)

• Standard deviation of the sample means

• Tells us if the sample mean is a good estimate of the true population mean

• Used to calculate the 95% confidence interval
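The idea that the SE is the standard deviation of the sample means can be checked by simulation. This sketch assumes a made-up normal population (mean 170, sd 10) and draws many samples of size 25.

```r
set.seed(42)

pop.mean <- 170   # assumed population mean (invented)
pop.sd   <- 10    # assumed population standard deviation (invented)
n        <- 25    # size of each sample

# Draw 5000 samples and keep the mean of each one
sample.means <- replicate(5000, mean(rnorm(n, mean = pop.mean, sd = pop.sd)))

# The standard deviation of those means...
sd.of.means <- sd(sample.means)

# ...should be close to sd / sqrt(n) = 10 / 5 = 2
theoretical.se <- pop.sd / sqrt(n)

# In practice we have one sample only, so we estimate SE from it
one.sample  <- rnorm(n, mean = pop.mean, sd = pop.sd)
se.estimate <- sd(one.sample) / sqrt(n)

print(c(sd.of.means = sd.of.means, theoretical = theoretical.se,
        from.one.sample = se.estimate))
```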

What is the 95% Confidence Interval?

• If we sample from the same population many times, 95% of the samples will have confidence intervals that include the true population parameter

• The true population parameter (e.g., mean) is likely to be within the 95%CI of a sample (if the samples are unbiased)

• A wide 95%CI tells us that our sample mean is not a very reliable estimate of the true mean. With wide 95%CI's, it is hard to know from the samples whether or not two populations are truly different

95% Confidence interval: mean ± 1.96 × SE

Are plant heights significantly different between control & fertilized treatments?

N = 30 per treatment: SIGNIFICANTLY DIFFERENT

Control: mean = 17.2, s = 2.1 (95%CI 16.4 – 18.0)

Fertilized: mean = 18.9, s = 2.2 (95%CI 18.1 – 19.7)

N = 5 per treatment: NOT SIGNIFICANTLY DIFFERENT

Control: 17.2 (95%CI 14.6 – 19.8)

Fertilized: 18.9 (95%CI 16.2 – 21.6)
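The intervals in this example can be reproduced from the summary statistics. This sketch uses the t distribution (qt()) rather than the fixed 1.96, which only applies to large samples; it also assumes the N = 5 case has the same means and standard deviations as the N = 30 case, which the slide does not state explicitly. The function name ci95 is made up.

```r
# 95% confidence interval for a mean, from summary statistics
ci95 <- function(m, s, n) {
  se         <- s / sqrt(n)                 # standard error
  half.width <- qt(0.975, df = n - 1) * se  # t multiplier instead of 1.96
  c(lower = m - half.width, upper = m + half.width)
}

control.30    <- ci95(17.2, 2.1, 30)
fertilized.30 <- ci95(18.9, 2.2, 30)
control.5     <- ci95(17.2, 2.1, 5)
fertilized.5  <- ci95(18.9, 2.2, 5)

print(round(rbind(control.30, fertilized.30, control.5, fertilized.5), 1))
```

With N = 30 the two intervals do not overlap; with N = 5 the same means and standard deviations give intervals wide enough to overlap almost completely.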

Exploratory Data Analysis

Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify:

• data errors/outliers

• if your data meet assumptions of parametric statistical tests

• interesting patterns

Before you do any statistical tests, you should already have an idea what the results will be

Errors in Data Collection / Entry

• Decimal in wrong place

• Same category spelled many ways

• Data collected in different measurement units

• Forgot to collect some data

• Numbers typed incorrectly when transferred from paper (sloppy handwriting, etc.)

OUTLIER: A data point that is much smaller or much larger than other data in the sample

Why do we care? A few outliers can change the sample mean, increase the variance of sample data, and change the p-value of a parametric statistical test

How do we find potential outliers in our data?

• Look at data ranges, histograms & boxplots

• For correlation & regression analyses, look at scatterplots
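A quick sketch of these checks on invented heights, where one value has its decimal in the wrong place. boxplot.stats() returns the points that a boxplot would draw as separate dots; these are only candidates to inspect, not values to delete automatically.

```r
# Invented heights (cm); the last value is a data-entry error (170.0 typed as 1700)
heights <- c(160, 162, 165, 167, 170, 171, 173, 175, 178, 1700)

# Data range: a maximum of 1700 cm is clearly suspicious
print(range(heights))

# Graphical checks
hist(heights)
boxplot(heights)

# Values beyond the boxplot whiskers (1.5 x IQR rule)
flagged <- boxplot.stats(heights)$out
print(flagged)
```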

For correlations/regressions, outliers may fall within the normal range of data...but plotting the scatterplot reveals outliers

A single outlier can change the regression equation and the significance of the relationship (example plots: p = 0.06 vs. p = 0.001)

WHAT SHOULD I DO WITH OUTLIERS?

[Flowchart] Questions (each answered YES / NO):

• Are data entered correctly?

• Are data from the population of inference?

[Flowchart] Actions:

• Remove outliers before analyzing data

• Transform data (if appropriate)

• Use nonparametric statistics

WHAT SHOULD I DO WITH TRUE OUTLIERS?*

[Flowchart] Analyze data with and without outliers, then ask: Do study conclusions change? (YES / NO)

[Flowchart] Outcomes:

• Keep outliers & report results

• Report & discuss both results

• Remove outliers & report results, but discuss your justification for removing the outliers

* ALWAYS KEEP GOOD RECORDS OF YOUR DATA EXPLORATION ACTIVITIES AND ANY CHANGES YOU MAKE TO THE ORIGINAL DATA!

OUTLIERS

It is wrong to remove outliers from analyses just because they don't fit with the other data!

Outliers can tell us interesting information about a population—conduct more research to understand what causes these unusual data.

Do data come from a normally distributed population?

• Sample data are assumed to represent the distribution of the population. Non-normal data are not 'wrong', they just represent processes that naturally generate other types of distributions.

• With small sample sizes, it can be difficult to tell if data come from a normally distributed population. Consider what you know about the underlying process.

Evaluating Normality

Understand which processes generate non-normal data, then...

• Visual assessment:

   o histograms

   o normal Q-Q plots

• Normality tests:

   o Shapiro-Wilk (shapiro.test())

   o Anderson-Darling (from pkg ‘nortest’)

   o Pearson chi-square (from pkg ‘nortest’)

   o Lilliefors (Kolmogorov-Smirnov) (from pkg ‘nortest’)

T-Test

T-Test: for determining if population means are different

Two-sample t-test: Compare the means of two independent groups (don’t need same sample sizes), e.g., is the mean height of Bhutanese college men different from that of USA college men?

Paired t-test: Compare the means of paired groups, e.g., is the mean weight of USA college men different before and after a 3-month exercise & diet program?
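A sketch of the paired case with invented before/after weights for six persons. In t.test(), paired = TRUE tells R that the two vectors are measurements on the same individuals, in the same order; the result is the same as a one-sample t-test on the differences.

```r
# Invented weights (kg) of the same six persons, before and after a program
before <- c(82, 90, 77, 95, 88, 79)
after  <- c(80, 86, 78, 90, 85, 77)

# Paired t-test: both vectors must be the same length and in the same order
paired.res <- t.test(before, after, paired = TRUE)
print(paired.res)

# Equivalent: one-sample t-test on the within-person differences
diff.res <- t.test(before - after)
```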

[Figure: grazed (G) and ungrazed (U) plots]

GRAZED    UNGRAZED
56        72
27        26
51        64
40        42
32        46

AVERAGE   41.2      50

Is the height of shrubs different in grazed and ungrazed areas?

TWO SAMPLE T-TEST

STEP ONE: Look at the data! (Make the point plot)

STEP TWO: Do the t-test: t.test(grazed, ungrazed)
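Both steps, using the grazed and ungrazed values from the table. stripchart() is one way to make a point plot (an assumption about which plot was intended). By default t.test() runs Welch's version of the test, which does not assume equal variances; that is why the degrees of freedom come out at about 7 rather than 8, consistent with the t(7) in the RESULTS text.

```r
# Data from the grazed / ungrazed table
grazed   <- c(56, 27, 51, 40, 32)
ungrazed <- c(72, 26, 64, 42, 46)

# STEP ONE: look at the data, one column of points per group
stripchart(list(grazed = grazed, ungrazed = ungrazed),
           vertical = TRUE, pch = 16, method = "jitter")

# Group means and standard deviations for reporting
print(c(mean.grazed = mean(grazed), sd.grazed = sd(grazed),
        mean.ungrazed = mean(ungrazed), sd.ungrazed = sd(ungrazed)))

# STEP TWO: two-sample t-test (Welch's test is the default)
res <- t.test(grazed, ungrazed)
print(res)
```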

[METHODS] "We conducted a two-sample t-test to compare vegetation biomass in grazed versus ungrazed plots."

[RESULTS] "We did not find a significant difference in vegetation biomass between grazed (M = 41.20; SD = 12.28) and ungrazed (M = 50.00; SD = 18.28) plots in this study, t(7) = -0.89; p = 0.40."

NOTE: Remember to report sample size for each treatment in METHODS.
