### Using RStudio for bare-bones, quick-and-dirty stats analysis
# Author: Andreas Madlung
# Date: Jan 20 2016
# Disclaimer: This mini-tutorial is no replacement for a statistics course
# and assumes that the user knows which test to use and why.
# For publication quality stats analysis check
# assumptions more carefully than described here.
# A good website for help is http://rcompanion.org/rcompanion/d_06.html
###
#Download R from here: http://cran.fhcrc.org/
#Download R Studio Installer for Mac or Windows from here: https://www.rstudio.com/products/rstudio/download/
# If you are new to R it is recommended that you run through the tutorial once with the
# example data sets as described below. When you are ready to use your own data, save this script as
# a new file, edit the code and run it sequentially.
# Create a data set:
# Open an Excel file and save as .csv (Bio332_16_R_ttest.csv)
# To do that use "save as", then at the "Format" pull-down menu choose Windows comma separated (.csv)
# (both for Mac and PC)
# Note: Never use spaces for file or folder names. Use underscores_to_separate_words.
# Note: R is case sensitive. Make sure to use the correct spelling and upper/lower case.
# Type in data as follows:
# header (that is the first row): give columns descriptive short names
# order (important): all treatments go in the first column.
# their responses go in the second column as in the example below:
#
#species response
# apple 5
# apple 4.5
# apple 5.1
# apple 5.2
# pear 7.5
# pear 7.1
# pear 5.7
# pear 7
# Create a second data set in Excel (for the ANOVA analysis below)
# and save as Bio332_16_R_ANOVA.csv
#species response
# apple 5
# apple 4.5
# apple 5.1
# apple 5.2
# pear 7.5
# pear 7.1
# pear 5.7
# pear 7
# orange 10
# orange 11
# orange 11.3
# orange 12
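# OPTIONAL SKETCH (not part of the original workflow): if you prefer not to use Excel, the two
# example data sets above can also be typed directly into R and saved as .csv files.
# The object names (example_ttest, example_ANOVA, example_folder) and the use of R's temporary
# folder are assumptions made for illustration; replace tempdir() with your own folder.
example_ttest <- data.frame(
  species  = rep(c("apple", "pear"), each = 4),
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7)
)
example_ANOVA <- rbind(
  example_ttest,
  data.frame(species = rep("orange", 4), response = c(10, 11, 11.3, 12))
)
example_folder <- tempdir()   # assumed location, only for this sketch
write.csv(example_ttest, file.path(example_folder, "Bio332_16_R_ttest.csv"), row.names = FALSE)
write.csv(example_ANOVA, file.path(example_folder, "Bio332_16_R_ANOVA.csv"), row.names = FALSE)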
# You can now ask if the response is different between species.
# This mini-tutorial introduces you to 4 simple tests:
# For comparing two treatments: use a t-test (if normal) or a Mann-Whitney-U test (if non-normal)
# For comparing more than 2 treatments: use ANOVA plus Tukey post-hoc (if normal)
# or Kruskal-Wallis plus Nemenyi or Dunn post-hoc (if non-normal)
# WHICH TEST TO USE?
# If you can assume that your data are normally distributed, use a t-test. If the data are not
# normally distributed, use a Mann-Whitney-U test.
# If you don't know, the Mann-Whitney-U test is the safer choice, but you lose a bit of power
# to detect significant differences. If your sample size is larger than about n=25, use a t-test, because the
# "central limit theorem" applies and normality no longer matters as much.
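# OPTIONAL SKETCH (an addition, not part of the original tutorial): one common, informal way to
# check the normality assumption is a Shapiro-Wilk test on each group. With only a few data
# points per group this test has little power, so a non-significant result does not prove
# normality. The object names (example_norm, example_shapiro_p) are assumptions for illustration.
example_norm <- data.frame(
  species  = factor(rep(c("apple", "pear"), each = 4)),
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7)
)
# run shapiro.test() separately for each species and keep only the p-value
example_shapiro_p <- tapply(example_norm$response, example_norm$species,
                            function(v) shapiro.test(v)$p.value)
example_shapiro_p   # one p-value per species; p < 0.05 suggests non-normal data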
#IMPORTANT NOTE:
# This tutorial uses two test data sets with names that only apply to this test data set. You will need
# to change these names everywhere you find them to reflect your data set names. The same is true for the
# names of the treatments you use in your file.
# The "path" is the address or the location on your computer of the file you want to use.
# You will also have to adjust that every time you see a path in this script
# (which currently in this script points to a specific file on my computer, not yours).
# A quick way to find the path for PC users is to right click on the file and then select properties.
# The path is the "Location:". You can copy and paste it into R, but have to replace each "\" with "/".
# On a Mac you can open the terminal (search "terminal" on your computer), type pwd after the prompt
# and hit return. This prints the folder the terminal is currently in.
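# OPTIONAL SKETCH: R can also help you find and check a path. file.choose() opens a file
# browser and returns the full path of the file you click on (it only works interactively,
# so it is left commented out here). The folder used below (Desktop inside your home folder)
# and the name example_path are assumptions for illustration; adjust them to your computer.
# example_path <- file.choose()
example_path <- file.path(path.expand("~"), "Desktop", "Bio332_16_R_ttest.csv")
example_path                # shows the full path that R will use
file.exists(example_path)   # TRUE if R can find a file at that location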
# Import .csv file
# If using your own data you have to define the "path" to where your document is located on
# YOUR computer. In this example the file is on my (A. Madlung's) Desktop and
# called Bio332_16_R_ttest.csv.
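# stringsAsFactors = TRUE makes R treat the species column as a grouping factor
# (since R version 4.0 this is no longer the default).
test_data_ttest <- read.csv("/Users/amadlung/Desktop/Bio332_16_R_ttest.csv", header = TRUE, stringsAsFactors = TRUE)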
# call up the data file to see if it loaded correctly
test_data_ttest
# 1. UNPAIRED T-TEST (assuming normality) for exactly 2 samples
# Do an unpaired ("regular") t-test (asks if response between apples and pears is different)
t.test(response ~ species, data = test_data_ttest)
# Output interpretation:
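# By default R runs the Welch two-sample t-test, an unpaired t-test that does not assume equal
# variances in the two groups (which is why the degrees of freedom are usually not a whole number).
# You need the t-value, the degrees of freedom (df) and the p-value.
# The output also gives you the mean (average) of each group and the 95% confidence interval
# of the difference between the two means.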
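# OPTIONAL SKETCH: the values needed for a report can also be pulled out of the t-test result
# one by one instead of reading them off the printed output. The example data are typed in
# directly so that this block runs on its own; the object names are assumptions for illustration.
example_ttest_data <- data.frame(
  species  = factor(rep(c("apple", "pear"), each = 4)),
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7)
)
example_t <- t.test(response ~ species, data = example_ttest_data)
example_t$statistic   # t-value
example_t$parameter   # degrees of freedom (Welch-adjusted)
example_t$p.value     # p-value
example_t$estimate    # mean of each group
example_t$conf.int    # 95% confidence interval of the difference between the means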
# 2. Equivalent of UNPAIRED T-TEST (NOT assuming normality) for exactly 2 samples
# --> run Mann-Whitney-U instead of t-test as explained here:
# Do an unpaired ("regular") M-W-U test (which asks if the response between apples and pears
# is different in a non-normal sample.)
# Import data if you haven't already
test_data_ttest <- read.csv("/Users/amadlung/Desktop/Bio332_16_R_ttest.csv", header = TRUE, stringsAsFactors = TRUE)
# call up the data file to see if it loaded correctly
test_data_ttest
wilcox.test(response ~ species, data = test_data_ttest)
# Output interpretation:
# The unpaired "non-normal t-test" equivalent in R is called Wilcoxon rank sum test.
# You need the W-value and the p-value.
# Unlike the t-test, this output does not display the mean (average) or median of each group.
# You can ignore the warning regarding ties, which appears if you have several data points with the
# same value that cannot be unequivocally ranked. If you use the data set provided, this shouldn't
# be a problem, though.
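# OPTIONAL SKETCH: the W-value and p-value can be pulled out of the test result, and because
# the Wilcoxon output does not show a summary of each group, the group medians can be
# calculated separately. The example data are typed in directly so that this block runs on
# its own; the object names are assumptions for illustration.
example_mwu_data <- data.frame(
  species  = factor(rep(c("apple", "pear"), each = 4)),
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7)
)
example_w <- wilcox.test(response ~ species, data = example_mwu_data)
example_w$statistic   # W-value
example_w$p.value     # p-value
# median of each species
example_medians <- tapply(example_mwu_data$response, example_mwu_data$species, median)
example_medians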
# 3. ANOVA: test differences in means if you have more than 2 samples
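# stringsAsFactors = TRUE makes R treat the species column as a grouping factor
# (since R version 4.0 this is no longer the default).
test_data_ANOVA <- read.csv("/Users/amadlung/Desktop/Bio332_16_R_ANOVA.csv", header = TRUE, stringsAsFactors = TRUE)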
test_data_ANOVA
fit <- aov(response ~ species, data=test_data_ANOVA)
# If you learned how to interpret residual plots and q-q plots, you can use the
# next three lines of code, otherwise skip them.
layout(matrix(c(1,2,3,4),2,2)) # optional layout
plot(fit) # diagnostic plots
layout(matrix(c(1),1)) # return to one figure per pane
# Use this code to display the ANOVA results
summary(fit) # display Type I ANOVA table
# Use the next line only if you need a type III analysis
# most likely ignore (if you want to run it, uncomment the next line)
#drop1(fit,~.,test="F") # type III SS and F Tests
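# OPTIONAL SKETCH: the degrees of freedom, F-value and p-value can also be pulled out of the
# ANOVA table instead of being read off the printed summary. The example data are typed in
# directly so that this block runs on its own; the object names are assumptions for illustration.
example_aov_data <- data.frame(
  species  = factor(rep(c("apple", "pear", "orange"), each = 4)),
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7, 10, 11, 11.3, 12)
)
example_fit   <- aov(response ~ species, data = example_aov_data)
example_table <- summary(example_fit)[[1]]   # the ANOVA table as a data frame
example_table[1, "Df"]        # degrees of freedom between groups
example_table[2, "Df"]        # degrees of freedom within groups (residuals)
example_table[1, "F value"]   # F-value
example_table[1, "Pr(>F)"]    # p-value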
#ANOVA INTERPRETATION
# From the summary displayed you will need the degrees of freedom (Df), the F-value
# and the p-value. This is for the entire ANOVA. If p < 0.05, at least one of the samples is different
# from at least one other. Continue with the Tukey post-hoc test to determine which is different from which.
TukeyHSD(fit) # where "fit" comes from the previous command.
# INTERPRETATION OF TUKEY TEST:
# Look at each comparison and check the adjusted p-value for significance between pairs.
# Assign letters to the groups (here species) such that groups that do NOT share a letter
# are statistically significantly different. In this example all three species are
# different and would get the letters A, B, C.
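# OPTIONAL SKETCH: to check the pairs without reading the whole printout, the adjusted
# p-values can be pulled out of the Tukey result and filtered for significance. The example
# data are typed in directly so that this block runs on its own; the object names are
# assumptions for illustration, and the letters still have to be assigned by hand.
example_tukey_data <- data.frame(
  species  = factor(rep(c("apple", "pear", "orange"), each = 4)),
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7, 10, 11, 11.3, 12)
)
example_tukey_fit <- aov(response ~ species, data = example_tukey_data)
example_tukey     <- TukeyHSD(example_tukey_fit)$species   # one row per pairwise comparison
example_tukey[, "p adj"]                                   # adjusted p-value of each pair
# names of the pairs that differ significantly at the 0.05 level
example_sig_pairs <- rownames(example_tukey)[example_tukey[, "p adj"] < 0.05]
example_sig_pairs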
# 4. KRUSKAL-WALLIS: test differences between more than 2 samples in non-normal data sets
# (the test compares ranks, not means)
test_data_ANOVA <- read.csv("/Users/amadlung/Desktop/Bio332_16_R_ANOVA.csv", header = TRUE, stringsAsFactors = TRUE)
test_data_ANOVA
kruskal.test(response ~ species, data = test_data_ANOVA)
# INTERPRETATION
# You need the chi-squared value, the degrees of freedom, and the p-value for your report.
# Like in the ANOVA, the p-value tells you that there is at least one species that is different from
# at least one other. To know which it is you need to run a Dunn test or a Nemenyi test.
# Non-parametric tests like Dunn and Nemenyi are less powerful and might not find significance
# that a Tukey test would find in normally distributed data.
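# OPTIONAL SKETCH: the chi-squared value, degrees of freedom and p-value can be pulled out of
# the Kruskal-Wallis result one by one. The example data are typed in directly so that this
# block runs on its own; the object names are assumptions for illustration.
example_kw_data <- data.frame(
  species  = factor(rep(c("apple", "pear", "orange"), each = 4)),
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7, 10, 11, 11.3, 12)
)
example_kw <- kruskal.test(response ~ species, data = example_kw_data)
example_kw$statistic   # Kruskal-Wallis chi-squared value
example_kw$parameter   # degrees of freedom
example_kw$p.value     # p-value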
#POSTHOC TEST FOR PAIRWISE COMPARISON
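# The next step installs a stats package that you need for the Nemenyi test. You only need to install
# a package once per computer. Wait until R finishes spitting out red text before proceeding.
# Note: PMCMR has since been superseded by the PMCMRplus package. If the commands below fail in your
# R installation, use the DescTools option described further down.
install.packages("PMCMR")
library(PMCMR)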
posthoc.kruskal.nemenyi.test(x= test_data_ANOVA$response, g= test_data_ANOVA$species, method ="BH")
#For connoisseurs: Another way to run the previous line is the following option:
#attach(test_data_ANOVA)
#posthoc.kruskal.nemenyi.test(x= response, g= species, method ="BH")
#detach(test_data_ANOVA)
#INTERPRETATION:
# The numbers you get from the Nemenyi test are the p-values for each pair-wise comparison.
# Get p-values for each comparison and assign letters as described for the Tukey test in the ANOVA description.
# If Nemenyi is non-significant and you want to run a less stringent test, try the Dunn test below.
#OPTIONAL:
# If you don't want to use the PMCMR package (or you had trouble installing it) you can choose
# a different package to run the Nemenyi test or to run the Dunn test, which is an alternative to the
# Nemenyi test. If you want to do so you can try the next 4 lines
# of code to install two packages, then run Dunn or Nemenyi tests (must uncomment to run).
#install.packages("manipulate")
#library(manipulate)
#install.packages("DescTools")
#library(DescTools)
#NemenyiTest(x= test_data_ANOVA$response, g= test_data_ANOVA$species, dist="tukey")
#DunnTest(x= test_data_ANOVA$response, g= test_data_ANOVA$species, method ="fdr")
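# OPTIONAL SKETCH (an addition, not part of the original tutorial): if neither package can be
# installed, base R offers pairwise.wilcox.test(), which needs no extra package. Note that
# this is a different procedure from the Nemenyi and Dunn tests: it runs a Mann-Whitney-U
# test for every pair of groups and then corrects the p-values for multiple testing (here
# with the "BH" method). The example data are typed in directly so that this block runs on
# its own; the object names are assumptions for illustration.
example_pw_data <- data.frame(
  species  = factor(rep(c("apple", "pear", "orange"), each = 4)),
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7, 10, 11, 11.3, 12)
)
example_pw <- pairwise.wilcox.test(x = example_pw_data$response,
                                   g = example_pw_data$species,
                                   p.adjust.method = "BH")
example_pw$p.value   # table of adjusted p-values, one for each pair of species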
#GRAPHING:
#To graph the data in a box plot use the following code:
boxplot(response ~ species,
        data = test_data_ANOVA,
        ylab = "Response",
        xlab = "Species")
# Save the plot by going to Plots -> Save as PDF -> click on "Directory" and choose where to put the PDF
# then click save.
# If you want to add the letters to the categories it is easiest done by importing the PDF into a MS Word document
# (in Word: Insert -> Photo -> Picture from file, then browse to your PDF file,
# then use Format Picture tab, Wrap text tab, choose "tight")
# or, better, use Photoshop and add the letters there.
# If you want to change the labels of the category names you can either change them in the
# original .csv file with your raw data or pass new labels to boxplot() with the names argument
# (in the same order as the boxes appear).
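# OPTIONAL SKETCH: as an alternative to Word or Photoshop, the letters and new category
# labels can be added in R itself, and the plot can be saved as a PDF by code. The file name
# and folder (example_boxplot.pdf in R's temporary folder), the label spellings, the letters
# A, B, C, the y-axis range and the object names are all assumptions for illustration.
example_plot_data <- data.frame(
  species  = factor(rep(c("apple", "pear", "orange"), each = 4),
                    levels = c("apple", "pear", "orange")),   # sets the order of the boxes
  response = c(5, 4.5, 5.1, 5.2, 7.5, 7.1, 5.7, 7, 10, 11, 11.3, 12)
)
example_pdf <- file.path(tempdir(), "example_boxplot.pdf")
pdf(example_pdf)                                  # everything plotted from here on goes into the PDF
boxplot(response ~ species,
        data  = example_plot_data,
        names = c("Apple", "Pear", "Orange"),     # new labels, in the order of the boxes
        ylim  = c(4, 13.5),                       # leaves room above the boxes for the letters
        ylab  = "Response",
        xlab  = "Species")
# highest value of each species, used to place the letter just above each box
example_tops <- tapply(example_plot_data$response, example_plot_data$species, max)
text(x = 1:3, y = example_tops + 0.7, labels = c("A", "B", "C"))
dev.off()                                         # closes and saves the PDF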
#END OF THE TUTORIAL