GRAPHS & STATS
20 September 2014 Sherubtse Training
• More on scatterplots
• Exporting data
• Overview of statistics
• T-tests
HtWt Data
What kinds of interesting questions can we ask? What graphs would we make to answer them?
• Is there a difference in height between UWICE & SFS personnel? Does it differ for males vs. females?
• Is there a difference in weight between UWICE & SFS personnel? Does it differ for males vs. females?
• Is there a relationship between height and weight for UWICE personnel? How about for SFS personnel?
• Is there a relationship between height and weight for males? How about for females?
Just for fun... add a column of calculated data (BMI) and then summarize the data by SEX and INSTITUTE
HtWt$BMI <- [equation]
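A minimal sketch of this exercise, assuming the standard BMI formula (weight in kg divided by the square of height in metres) and columns named cm, kg, sex and institute. The data values are made up for illustration and are not the training data set.

```r
# Hypothetical data (illustrative values only, not the real HtWt file)
HtWt <- data.frame(
  institute = factor(c("UWICE", "UWICE", "UWICE", "SFS", "SFS", "SFS")),
  sex       = factor(c("M", "F", "M", "F", "M", "F")),
  cm        = c(172, 158, 165, 160, 175, 168),
  kg        = c(68, 52, 60, 55, 80, 62)
)

# Assumed formula: BMI = kg / (height in metres)^2
HtWt$BMI <- HtWt$kg / (HtWt$cm / 100)^2

# Mean BMI for each combination of sex and institute
bmi.summary <- aggregate(BMI ~ sex + institute, data = HtWt, FUN = mean)
print(bmi.summary)
```

aggregate() is only one option here; tapply() would give the same means in a cross-table layout.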
Create this scatterplot of heights vs. weights for just UWICE personnel
Alternative format for plot(): plot(data=UWICE, kg~cm)
Add a regression line to the UWICE scatterplot and determine if the linear relationship is significant
lm.UWICE <- lm(kg~cm, data=UWICE)
summary(lm.UWICE)
significant or not?
add the regression line: abline(lm.UWICE, col="red")
one way to add the p-value as text: text(locator(1), "p = 0.0016", cex=1.5, col="red")
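The regression steps can be sketched end to end as follows. The UWICE values here are invented for illustration, so the p-value will not match the 0.0016 shown on the slide. Because locator() needs a mouse click, this sketch places the label at fixed coordinates instead.

```r
# Hypothetical UWICE data (illustrative values only)
UWICE <- data.frame(
  cm = c(155, 160, 162, 165, 168, 170, 172, 175, 178, 182),
  kg = c(50, 58, 55, 62, 60, 66, 70, 68, 75, 80)
)

# Fit weight as a linear function of height and inspect the fit
lm.UWICE <- lm(kg ~ cm, data = UWICE)
print(summary(lm.UWICE))

# p-value for the slope, taken from the coefficient table
p.slope <- summary(lm.UWICE)$coefficients["cm", "Pr(>|t|)"]

# Scatterplot, regression line, and p-value label in the top-left corner
plot(kg ~ cm, data = UWICE)
abline(lm.UWICE, col = "red")
text(x = min(UWICE$cm), y = max(UWICE$kg),
     labels = paste("p =", signif(p.slope, 2)),
     adj = c(0, 1), cex = 1.5, col = "red")
```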
Add a thick dashed blue line at x = 170 to indicate which UWICE staff can receive special travel privileges
(HINT: use ?par to figure out the plot arguments for setting line type & line width)
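One possible solution, using lty for line type and lwd for line width. The data are hypothetical, chosen so that three people are taller than 170 cm.

```r
# Hypothetical UWICE data (illustrative values only)
UWICE <- data.frame(
  cm = c(152, 155, 158, 160, 162, 165, 168, 172, 176, 180),
  kg = c(48, 52, 55, 54, 60, 59, 63, 70, 72, 78)
)

plot(kg ~ cm, data = UWICE)

# v = 170 draws a vertical line; lty = 2 is dashed; lwd = 3 makes it thick
abline(v = 170, lty = 2, lwd = 3, col = "blue")

# Rows to the right of the line
tall <- UWICE[UWICE$cm > 170, ]
print(tall)
```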
What are the sexes of the 3 tall persons who will get special privileges? Use identify() to label
identify(UWICE$cm, UWICE$kg, labels=UWICE$sex, n=3)
Do the same with the SFS scatterplot...(is the relationship significant?)
You can put all the data in a single graph with different institutions represented by different colors
METHOD #1
1) Figure out the lower & upper limits for x- and y-axes
2) Plot the first (e.g., UWICE) data and regression line, setting xlim() and ylim()
3) Add the additional data, using points()
4) Add the lines (abline()), using colors matching each institution's points
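The four steps of Method #1 might look like this. Both data frames are hypothetical, and the colors follow the legend used later in the deck (UWICE purple, SFS blue).

```r
# Hypothetical data for the two institutions (illustrative values only)
UWICE <- data.frame(cm = c(155, 160, 165, 170, 175, 180),
                    kg = c(50, 57, 61, 66, 72, 79))
SFS   <- data.frame(cm = c(150, 158, 163, 169, 174, 185),
                    kg = c(47, 52, 60, 63, 75, 84))

# 1) Axis limits that cover both data sets
xlim <- range(c(UWICE$cm, SFS$cm))
ylim <- range(c(UWICE$kg, SFS$kg))

# 2) First data set and its regression line
plot(kg ~ cm, data = UWICE, xlim = xlim, ylim = ylim, col = "purple", pch = 16)
abline(lm(kg ~ cm, data = UWICE), col = "purple")

# 3) Second data set added to the existing plot
points(SFS$cm, SFS$kg, col = "blue", pch = 15)

# 4) Regression line for the second data set, in the matching color
abline(lm(kg ~ cm, data = SFS), col = "blue")
```

Setting the limits in step 1 matters because plot() fixes the axes from the first data set; points added later are silently clipped if they fall outside.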
Add a Legend
'legend' is not an argument in plot(), so we add it as a separate line of code
legend (x="topleft", legend=c("UWICE","SFS"), fill=c("purple","blue"), inset=.02, bty="n")
what does each of these arguments mean?
You can put all the data in a single graph with different institutions represented by different colors
METHOD #2
1) Figure out the lower & upper limits for x- and y-axes
2) Plot the full data, setting xlim() and ylim() and 2 colors & pch values*
3) Add the lines (abline()), using colors matching each institution's points
4) Add the legend
* col=c("blue","purple")[HtWt$institute], pch=c(16,15)[HtWt$institute]
How would you change the legend boxes to match the points in the scatterplot (UWICE = circle, SFS = square)?
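One possible answer, combined with a sketch of Method #2. It assumes institute is a factor with the default alphabetical levels ("SFS", "UWICE"), so the first color and symbol go to SFS and the second to UWICE. The data are hypothetical. Passing pch and col to legend() instead of fill draws the plotting symbols rather than filled boxes.

```r
# Hypothetical combined data (illustrative values only)
HtWt <- data.frame(
  institute = factor(rep(c("UWICE", "SFS"), each = 6)),
  cm = c(155, 160, 165, 170, 175, 180, 150, 158, 163, 169, 174, 185),
  kg = c(50, 57, 61, 66, 72, 79, 47, 52, 60, 63, 75, 84)
)

# One color and symbol per factor level, in the order of levels(HtWt$institute)
cols <- c("blue", "purple")   # SFS = blue, UWICE = purple
pchs <- c(15, 16)             # SFS = square, UWICE = circle

# Indexing by the factor picks the right color and symbol for every row
plot(kg ~ cm, data = HtWt,
     xlim = range(HtWt$cm), ylim = range(HtWt$kg),
     col = cols[HtWt$institute], pch = pchs[HtWt$institute])

# One regression line per institution, in the matching color
for (i in seq_along(levels(HtWt$institute))) {
  d <- HtWt[HtWt$institute == levels(HtWt$institute)[i], ]
  abline(lm(kg ~ cm, data = d), col = cols[i])
}

# Legend drawn with point symbols instead of filled boxes
legend("topleft", legend = levels(HtWt$institute),
       col = cols, pch = pchs, inset = 0.02, bty = "n")
```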
Exporting Data
To transfer a matrix or data frame via clipboard: write.table(HtsWts, "clipboard", sep="\t"), then paste in Excel
...to a tab-delimited text file: write.table(HtsWts, "c:/mydata.txt", sep="\t")
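A small round-trip sketch of the file export. The data frame contents and the temporary file location are assumptions for illustration. row.names=FALSE and quote=FALSE are optional extras that often give a cleaner file for spreadsheets.

```r
# Hypothetical data frame (illustrative values only)
HtsWts <- data.frame(cm = c(172, 158, 165), kg = c(68, 52, 60))

# Write to a tab-delimited file in the session's temporary directory
out.file <- file.path(tempdir(), "mydata.txt")
write.table(HtsWts, out.file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back to confirm the export worked
check <- read.delim(out.file)
print(check)
```

The "clipboard" target is specific to R on Windows; writing to a file works on any platform.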
Intro to Statistics
Sample Statistic
Population Parameter (e.g., Height)
DESCRIPTIVE STATISTICS:
What is the mean height of our sample of persons?
INFERENTIAL STATISTICS:
Is the mean height of our sample a good measure of true population height?
DESCRIPTIVE STATISTICS
• N
• Mean: our best estimate of the true population mean
• Standard Deviation: our best estimate of the true population variability
INFERENTIAL STATISTICS
• Standard Error, Confidence Interval: how good is our estimate (from the sample) of the true population mean?
Descriptive Statistics
Summarize the data we have collected:
• mean, median, mode
• range, variance, standard deviation, interquartile range
• graphical summaries of the data (e.g., histogram, boxplot)
Why do we need it?
It’s difficult to just look at raw data and understand what they mean
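The summaries listed above map onto base R functions as sketched below. The heights vector is invented for illustration. R has no built-in function for the statistical mode, so the sketch takes the most frequent value from table().

```r
# Hypothetical sample of heights in cm (illustrative values only)
heights <- c(158, 160, 162, 165, 165, 168, 170, 172, 175, 181)

# Measures of centre
m.mean   <- mean(heights)
m.median <- median(heights)
m.mode   <- as.numeric(names(which.max(table(heights))))  # most frequent value

# Measures of spread
s.range <- range(heights)
s.var   <- var(heights)
s.sd    <- sd(heights)
s.iqr   <- IQR(heights)

print(c(mean = m.mean, median = m.median, mode = m.mode,
        var = s.var, sd = s.sd, IQR = s.iqr))

# Graphical summaries
hist(heights)
boxplot(heights)
```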
Inferential Statistics
Use a sample of data to make conclusions and predictions about the population we sampled from
Often used to determine if there are differences between populations or if a ‘treatment’ affected a population
Why do we need it?
We often don’t have the time or money to collect data from the entire population we are interested in. For inferential statistics, conclusions are only reliable if we sampled properly!
Truth + Chance = Sample Statistic
We use the sample data to make our best prediction about the population (the data we don't have), and then quantify the chance that we’re wrong (standard errors & confidence intervals)
But no matter how fancy the statistics or how pretty the graphs, conclusions are only reliable if we sampled properly!
What is a Normal Distribution?
Does ‘Normal’ Exist?
Does ‘Non-Normal’ Exist?
When are Data Non-Normal?
• When multiple processes or populations are combined in a single data set... (e.g., heights of children aged 5 - 12)
• When the population has many values close to zero or some other natural limit...
• When some extreme values skew the population... (here, also bounded by zero) [Figure labels: THE SUPER RICH; COMPANY EXECUTIVES]
• When the data follow a process that naturally generates non-normal distributions:
o POISSON DISTRIBUTION: counts of rare events, e.g., accidents (lower bound of zero)
o EXPONENTIAL DISTRIBUTION: population growth
o BINOMIAL DISTRIBUTION: proportion (%) data
What Can We Do With Non-Normal Data?
• Check the data for errors; then
• Transform data to approximate a normal distribution; OR
• Apply nonparametric statistics
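A sketch of the last two options, using simulated right-skewed data. The log transform and the Wilcoxon rank-sum test are chosen here only as common examples; the slide does not say which transformation or nonparametric test was intended.

```r
set.seed(42)

# Simulated right-skewed data (log-normal), bounded below by zero
x <- rlnorm(200, meanlog = 0, sdlog = 1)

# In a right-skewed sample the mean sits well above the median
print(c(mean = mean(x), median = median(x)))

# Option 1: transform. The log of log-normal data is normally distributed.
x.log <- log(x)
hist(x.log)

# Option 2: nonparametric test. Compare two skewed groups without assuming normality.
g1 <- rlnorm(30, meanlog = 0)
g2 <- rlnorm(30, meanlog = 0.5)
w  <- wilcox.test(g1, g2)
print(w)
```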
What is the Standard Error?
Standard Error (SE): sd / sqrt(n)
• Standard deviation of the sample means
• Tells us if the sample mean is a good estimate of the true population mean
• Used to calculate the 95% confidence interval
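The first bullet can be checked by simulation. In this sketch the population (mean 170, sd 10) and the sample size of 25 are arbitrary assumptions. The standard deviation of many sample means should land close to sd / sqrt(n) = 10 / 5 = 2.

```r
set.seed(1)
n <- 25

# Draw 5000 samples of size n and keep each sample's mean
sample.means <- replicate(5000, mean(rnorm(n, mean = 170, sd = 10)))

# Spread of the sample means; theory says about 10 / sqrt(25) = 2
sd.of.means <- sd(sample.means)
print(sd.of.means)

# In practice we have one sample and estimate the SE from it
one.sample <- rnorm(n, mean = 170, sd = 10)
se <- sd(one.sample) / sqrt(n)
print(se)
```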
What is the 95% Confidence Interval?
• If we sample from the same population many times, 95% of the samples will have confidence intervals that include the true population parameter
• The true population parameter (e.g., mean) is likely to be within the 95% CI of a sample (if the samples are unbiased)
• A large 95% CI tells us that our sample mean is not a very reliable estimate of the true mean
• With large 95% CIs, it is hard to know from the samples whether or not two populations are truly different
95% confidence interval: mean ± 1.96 × SE (for large samples)
Are plant heights significantly different between control & fertilized treatments?
N = 30 per treatment:
Control: mean = 17.2, s = 2.1 (95%CI 16.4 – 18.0)
Fertilized: mean = 18.9, s = 2.2 (95%CI 18.1 – 19.7)
SIGNIFICANTLY DIFFERENT
N = 5 per treatment:
Control: 17.2 (95%CI 14.6 – 19.8)
Fertilized: 18.9 (95%CI 16.2 – 21.6)
NOT SIGNIFICANTLY DIFFERENT
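The intervals in this example can be reproduced from the summary statistics. The sketch assumes the same standard deviations (2.1 and 2.2) apply to the N = 5 case, which the slide does not state, and uses the t critical value qt(0.975, n - 1) in place of 1.96. The two are nearly equal for large samples, but the t value is noticeably larger when N is small. ci95 is a helper name made up for this sketch.

```r
# 95% CI from a mean, standard deviation and sample size
# (ci95 is a hypothetical helper, not a base R function)
ci95 <- function(m, s, n) {
  se   <- s / sqrt(n)                 # standard error
  half <- qt(0.975, df = n - 1) * se  # half-width using the t critical value
  c(lower = m - half, upper = m + half)
}

ci.control.30 <- round(ci95(17.2, 2.1, 30), 1)
ci.fert.30    <- round(ci95(18.9, 2.2, 30), 1)
ci.control.5  <- round(ci95(17.2, 2.1, 5), 1)
ci.fert.5     <- round(ci95(18.9, 2.2, 5), 1)

print(rbind(ci.control.30, ci.fert.30, ci.control.5, ci.fert.5))
```

With N = 30 the two intervals do not overlap; with N = 5 they overlap heavily, which is why the same means lead to opposite conclusions.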
Exploratory Data Analysis
Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify:
• data errors/outliers
• if your data meet assumptions of parametric statistical tests
• interesting patterns
Before you do any statistical tests, you should already have an idea what the results will be
Errors in Data Collection / Entry
• Decimal in wrong place
• Same category spelled many ways
• Data collected in different measurement units
• Forgot to collect some data
• Numbers typed incorrectly when transferred from paper (sloppy handwriting, etc.)
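Several of these errors can be caught with quick checks in R. The data frame below is invented to contain typical problems, and the plausibility limits of 100 to 250 cm are an assumption chosen for adult heights.

```r
# Hypothetical field data with deliberate entry errors
field <- data.frame(
  institute = c("UWICE", "uwice", "UWICE ", "SFS", "sfs", "SFS"),
  cm        = c(172, 158, 1.65, 160, 175, 1680),
  stringsAsFactors = FALSE
)

# Same category spelled many ways: table() shows every distinct spelling
print(table(field$institute))

# Standardize case and strip stray spaces
field$institute <- toupper(trimws(field$institute))
print(table(field$institute))

# Wrong units or misplaced decimals: check the range, then flag implausible values
print(summary(field$cm))
suspect <- field[field$cm < 100 | field$cm > 250, ]
print(suspect)

# Missing data: count NAs per column
print(colSums(is.na(field)))
```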
OUTLIER: A data point that is much smaller or much larger than other data in the sample
Why do we care? A few outliers can change the sample mean, increase the variance of sample data, and change the p-value of a parametric statistical test
How do we find potential outliers in our data?
• Look at data ranges, histograms & boxplots
• For correlation & regression analyses, look at scatterplots
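As a rough screening aid, the points that boxplot() draws beyond its whiskers can also be listed with boxplot.stats(). The values below are hypothetical. The default rule (more than 1.5 times the box length beyond the box) only flags candidates for a closer look; it does not prove a value is wrong.

```r
# Hypothetical measurements with one suspicious value
x <- c(12, 14, 15, 15, 16, 17, 18, 45)

print(range(x))
hist(x)
boxplot(x)

# Values plotted beyond the boxplot whiskers
flagged <- boxplot.stats(x)$out
print(flagged)
```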
For correlations/regressions, outliers may fall within the normal range of data... but plotting the scatterplot reveals outliers
[Figure: scatterplot, p = 0.06]
A single outlier can change the regression equation and the significance of the relationship
[Figure: two scatterplots, p = 0.06 and p = 0.001]
WHAT SHOULD I DO WITH OUTLIERS?
[Flowchart; boxes listed without the connecting arrows]
• Are data entered correctly? (YES / NO)
• Are data from the population of inference? (YES / NO)
• Remove outliers before analyzing data
• Transform data (if appropriate)
• Use nonparametric statistics
WHAT SHOULD I DO WITH TRUE OUTLIERS?*
[Flowchart; boxes listed without the connecting arrows]
• Analyze data with and without outliers
• Do study conclusions change? (YES / NO)
• Keep outliers & report results
• Report & discuss both results
• Remove outliers & report results, but discuss your justification for removing the outliers
* ALWAYS KEEP GOOD RECORDS OF YOUR DATA EXPLORATION ACTIVITIES AND ANY CHANGES YOU MAKE TO THE ORIGINAL DATA!
OUTLIERS
It is wrong to remove outliers from analyses just because they don't fit with the other data!
Outliers can tell us interesting information about a population; conduct more research to understand what causes these unusual data.
Do data come from a normally distributed population?
• Sample data are assumed to represent the distribution of the population. Non-normal data are not 'wrong', they just represent processes that naturally generate other types of distributions.
• With small sample sizes, it can be difficult to tell if data come from a normally distributed population. Consider what you know about the underlying process.
Evaluating Normality
Understand which processes generate non-normal data, then...
• Visual assessment:
o histograms
o normal Q-Q plots
• Normality tests:
o Shapiro-Wilk (shapiro.test())
o Anderson-Darling (from pkg ‘nortest’)
o Pearson chi-square (from pkg ‘nortest’)
o Kolmogorov-Smirnov (from pkg ‘nortest’)
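The visual checks and the Shapiro-Wilk test need only base R. This sketch compares a simulated normal sample with a simulated skewed (exponential) one; the sample sizes and parameters are arbitrary assumptions.

```r
set.seed(7)

# Simulated samples: one normal, one strongly right-skewed
x.norm <- rnorm(100, mean = 170, sd = 8)
x.skew <- rexp(100, rate = 1)

# Visual assessment: histogram and normal Q-Q plot for each sample
op <- par(mfrow = c(2, 2))
hist(x.norm)
qqnorm(x.norm); qqline(x.norm)   # points should follow the line
hist(x.skew)
qqnorm(x.skew); qqline(x.skew)   # points curve away from the line
par(op)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw.norm <- shapiro.test(x.norm)
sw.skew <- shapiro.test(x.skew)
print(sw.norm)
print(sw.skew)
```

With small samples these tests have little power to detect non-normality, so the plots and knowledge of the underlying process still matter.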
Exploratory Data Analysis
Before jumping into statistical tests and p-values, LOOK AT YOUR DATA in spreadsheets and graphs to identify:
• data errors/outliers (for scatterplots, graph it!)
• if your data meet assumptions of parametric statistical tests
• interesting patterns
Before you do any statistical tests, you should already have an idea what the results will be
T-Test
T-Test
For determining if population means are different
Two-sample t-test
Compare the means of two independent groups (don’t need same sample sizes), e.g., is the mean height of Bhutanese college men different from that of USA college men?
Paired t-test
Compare the means of paired groups, e.g., is the mean weight of USA college men different before and after a 3-month exercise & diet program?
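A sketch of the paired case, using invented before/after weights for six people. A paired t-test is equivalent to a one-sample t-test on the within-person differences, which the last lines confirm.

```r
# Hypothetical weights (kg) for the same six people, before and after a program
before <- c(82, 90, 77, 85, 95, 88)
after  <- c(80, 86, 76, 83, 90, 87)

# paired = TRUE matches each 'before' value to the same person's 'after' value
res.paired <- t.test(before, after, paired = TRUE)
print(res.paired)

# Same test expressed as a one-sample t-test on the differences
res.diff <- t.test(before - after)
print(res.diff$statistic)
```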
Is the height of shrubs different in grazed and ungrazed areas?

          GRAZED   UNGRAZED
          56       72
          27       26
          51       64
          40       42
          32       46
AVERAGE   41.2     50
TWO SAMPLE T-TEST
STEP ONE: Look at the data! (Make the point plot)
STEP TWO: Do the t-test: t.test (grazed, ungrazed)
[METHODS] "We conducted a two-sample t-test to compare vegetation biomass in grazed versus ungrazed plots."
[RESULTS] "We did not find a significant difference in vegetation biomass between grazed (M = 41.20; SD = 12.28) and ungrazed (M = 50.00; SD = 18.28) plots in this study, t(7) = -0.89; p = 0.40."
NOTE: Remember to report sample size for each treatment in METHODS.
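Both steps can be run on the grazed and ungrazed values from the table. stripchart() is used here as one way to make the point plot. By default t.test() runs Welch's version of the test, which does not assume equal variances; that is why the degrees of freedom come out at about 7 rather than 8.

```r
# Shrub data from the grazed / ungrazed table
grazed   <- c(56, 27, 51, 40, 32)
ungrazed <- c(72, 26, 64, 42, 46)

# STEP ONE: look at the data (one point per plot, grouped by treatment)
stripchart(list(grazed = grazed, ungrazed = ungrazed),
           vertical = TRUE, method = "jitter", pch = 16)

# Means and standard deviations for the write-up
print(c(mean = mean(grazed), sd = sd(grazed)))
print(c(mean = mean(ungrazed), sd = sd(ungrazed)))

# STEP TWO: two-sample t-test (Welch's test by default)
res <- t.test(grazed, ungrazed)
print(res)
```

The t statistic, degrees of freedom and p-value for the RESULTS sentence are stored in res$statistic, res$parameter and res$p.value.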