Practical 4: Linear models - linear regression

As you have seen, the t-test is a valuable tool in our statistical toolbox, allowing us to address simple questions like whether the mean of a variable differs between two groups (or treatments). For some biologists, that is the only tool they need. But most questions in biology represent more complex problems. Linear Models (LMs) are a highly versatile set of statistical tools, widely used to address a whole range of biologically interesting questions. When applying LMs, we are typically asking questions such as: does eating more food cause fish to grow more rapidly? Or, what is the effect of rainfall on species diversity?

To introduce you to this class of models, we begin with "simple" linear regression. Linear regression is useful when you have two continuous variables: a predictor, or "X", variable, and a response, or "Y", variable. Our null hypothesis is that there is no relationship between X and Y - that is, that the slope of our regression line is zero.

In this practical you will:

  1. Perform linear regression analysis;
  2. Learn how to interpret the statistical details provided to you in the output from R;
  3. Learn how to check if your data meets the assumptions of linear regression;
  4. Apply your new-found analytical skills to ask whether the time of high tide predicts flock size in ruddy turnstones, and whether sepal length predicts petal length in irises.



4.1 Starting out nice and simple

We start with a highly simplified, hypothetical example. We want to explore the basis of linear regression analysis without the distraction of the "noise" of real biology.

First, let's read in our simple dataset:

Y <- c(-3, -2, -2, -1, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3)
X <- c(-2, -1.5, -1, -2, -1.5, -0.5, 0, -1, -0.5, 0.5, 1, 0, 0.5, 1.5, 2, 1, 1.5, 2)


To perform linear regression, we need each of our two variables to be measured on the same object, or individual, and we need multiple objects, each with this pair of observations. In our "simple" example data, each value (measurement) of "Y" has an associated value for "X", and we have these paired measurements for each of 18 different objects (individuals).
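You can see these pairs for yourself by binding the two vectors into a small table, one row per object (a quick check; the data.frame call is just for display):

data.frame(X, Y)          ## 18 rows: each object's X measurement paired with its Y measurement
nrow(data.frame(X, Y))    ## confirms that we have 18 objects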

Let's look at our data. It is customary that we plot the predictor variable on the x-axis (i.e., the horizontal axis) and the response variable on the y-axis (i.e., the vertical axis):

plot(X, Y)

## Let's customize that graph a little! 
par(                           ## Customize the graphical window 
   mar = c(5, 8, 4, 2),   ## Set margin sizes (in lines) 
   las = 1,                ## Print axis tick labels horizontally 
   xpd = TRUE)               ## Allow printing outside the graph region 

plot(X, Y,                  ## Generate a scatter plot of 'Y' versus 'X' 
   bty = "n",               ## Do not draw a box around the graph 
   font = 2,                ## Use 'bold face' font within the graph 
   cex = 2,                ## Double the size of all characters within the graph 
   pch = 16,                ## Make the plot symbol a black filled circle 
   cex.lab = 3)         ## Triple the size of the axis titles  
Looking at this plot, do you think that X predicts anything about Y?
Yes! Larger values of X tend to be associated with larger values of Y.

If you were to draw a "line of best fit", how many parameters would you need to define to draw that line?
Two; the y-intercept of the line, and the slope of the line.

Does this line explain everything about the relationship between Y and X?
No, otherwise all of the points would be on the line, rather than scattered around it. We can say that there is residual, unexplained, variance in Y at each value of X.



4.2 Fit a regression line

R comes with a function called lm (for "linear model") that performs regression analyses. This function calculates the sum of products of X and Y and the sum of squares of X, and from these it calculates the slope and the intercept.
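If you are curious where those two parameters come from, here is a minimal "by hand" sketch using the X and Y vectors defined above (the object names SSxy and SSx are just for illustration; lm does all of this for you):

SSxy <- sum((X - mean(X)) * (Y - mean(Y)))   ## sum of products of X and Y
SSx  <- sum((X - mean(X))^2)                 ## sum of squares of X
slope     <- SSxy / SSx                      ## slope of the line of best fit
intercept <- mean(Y) - slope * mean(X)       ## the line passes through (mean(X), mean(Y))
c(intercept = intercept, slope = slope)      ## compare these with the coefficients from lm below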

The lm function (and other R functions for analysing linear models) uses a ~ character (called a "tilde") to separate the response variable on the left from the predictor variable(s) on the right.

Model1 <- lm(Y ~ X)  ## Regress Y against X and store the result in an object called "Model1"
plot(Y ~ X)          ## We can also use ~ for plotting in the same way! Draw a plot where Y depends on X.
curve(cbind(1, x) %*% coef(Model1), add = TRUE, col = "red", lwd = 2) ## Use the coefficients 
## (intercept and slope) from Model1 to draw the regression line through the data
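As an aside, for a simple regression with a single predictor, the abline function can draw the same line directly from the model object; this is an alternative sketch, not part of the main exercise:

abline(Model1, col = "blue", lty = 2)   ## abline reads the intercept and slope from Model1 for us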

4.2.1 Understanding and using lm outputs

Above, you used the output from lm to plot a line of best fit using the curve function. To draw that line, curve requires instructions in the form of model coefficients (intercept and slope in this case); we can extract these from Model1 using coef(Model1). Using R "objects" in this way is extremely convenient, but it's important to understand what they contain.

Whenever R creates an object (e.g., the "Model1" object you just created), that object contains various elements depending on the function that generated it.

To determine what is contained within the object you just created, you can use the functions: names and/or str. Try running each of these lines of code one by one and looking at what happens.

Model1               ## Here are the results - the two parameters that define the line!
names(Model1)        ## List the elements of Model1
str(Model1)          ## List the elements of Model1 AND their attributes (this is the 'structure' of Model1)
Model1$coefficients  ## Look at the contents of "coefficients"
Model1$residuals     ## Look at the contents of "residuals"
residuals(Model1)    ## Another way of looking at the residuals
Model1[8]            ## Look at the 8th element
Model1$fitted.values ## Look at the contents of "fitted.values"
round(Model1$fitted.values, 1)    ## Round those numbers off to one decimal place

As you can see, there's a lot of information available to us about our model, Model1. Let's unpack it slowly.

What is the output of the first line of code telling you?
The regression coefficients: the y-intercept and the slope of the line!
Notice that these are the same numbers shown when you specifically call "Model1$coefficients"

What are "fitted" values?
The hypothetical values of Y we would observe had each point landed exactly on the line. You learned about these as "predicted" values in your lecture.
Notice how there are 18 fitted values, one for each object!

What are "residuals"?
The vertical distance between the observed Y value and its hypothetical "fitted", or predicted value. The observed value of Y can deviate above, or below its fitted value, resulting in positive or negative residual values.
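You can convince yourself of this decomposition with a quick check (using the Model1 object from above): each observed value of Y is simply its fitted value plus its residual.

all.equal(as.numeric(Model1$fitted.values + Model1$residuals), Y)   ## should return TRUE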



4.3 Calculating F-statistics and their degrees of freedom

In regression, the null hypothesis is: "the value of Y does not depend on the value of X", which is equivalent to: "the true slope of the line is zero". We test this hypothesis using ANOVA.

In this week's lecture, you were shown how to calculate an ANOVA table. The purpose is to calculate a test statistic called the "F-statistic".

In R, the job is done using the anova function:

anova(Model1)   ## Generate an ANOVA table
Why do we calculate the sum of the squared vertical distances, and, thus, the Sums of Squares - why not just sum the vertical distances across all of our objects?
If we didn't square them, the positive and negative distances would cancel out and their sum (and mean) would be ~zero! We need to know about the spread of objects - how far the objects are from their predicted values (residual SS) and how far the predicted values are from the mean of Y (regression SS).
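You can verify the first part of this answer for yourself: the residuals from Model1 sum to (essentially) zero, whereas their squares do not (a quick sketch):

sum(Model1$residuals)     ## effectively zero - the positive and negative residuals cancel out
sum(Model1$residuals^2)   ## the squared residuals do not cancel; this is the Residual SS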


The "Regression Sums of Squares (SS)" are the sum of squared deviations of the fitted values (the values that sit along the regression line) from the mean of Y. In our example, the anova output table names this row "X" because is it represents the variation in Y explained by X. If our predictor variable was called "Temperature", then this row would be called "Temperature" in the anova output table.

To calculate the SS of the regression for ourselves, we can first calculate the difference between each of the 18 fitted values and the mean of Y. Above we showed you that we can access the fitted values using Model1$fitted.values. We then square these differences and sum them.

To get the "Residual SS", the model has already calculated the residuals for us (Model1$residuals). So we just have to square these and then sum them!

## calculate the deviations of the fitted values from the mean of Y
## square these and then sum them
RegressionSS <- sum((Model1$fitted.values - mean(Y))^2)
RegressionSS ## View the regression sum of squares

## Square the residuals and then sum the squares
ResidualSS <- sum(Model1$residuals^2)
ResidualSS ## View the residual sum of squares

Did they match the ANOVA table produced by the anova function?

Notice in the code above that the order of the calculations is determined by the use of brackets. The calculations in the inner-most brackets happen first, and then progress through to the outer-most brackets.

Why does the X source of variation only have one degree of freedom in the ANOVA table?


The "fitted values" are constrained to sit on the line. Two points define a straight line and a regression line must pass through the mean of Y. Given this constraint, we have only one degree of freedom.

Degrees of freedom are the sample size minus the number of constraints upon the statistic. We can think of them as the extent to which the numbers in the sample are free to vary in their determination of the statistic. Say we calculate the mean of our sample - once we know that mean, then if we have the values of any 17 of our 18 objects, we can exactly determine the value of the 18th object. Because we must calculate the mean in order to calculate the deviations of each point from that mean (and, from those deviations, the Sums of Squares, SS), we have 17 df for the total data in our simple example.

For the slope, we can define a straight line with two points. We already know that the line must pass through the point defined by the means of X and Y, so we can estimate the slope from ONE point plus that mean point. Thus, we have only one degree of freedom.
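A quick numeric sketch of how these degrees of freedom relate to our sample size (nothing new here, just the arithmetic):

n <- length(Y)                     ## 18 objects in our simple example
c(regression_df = 1,               ## one df for the slope
  residual_df   = n - 2,           ## n minus the two estimated parameters (intercept and slope)
  total_df      = n - 1)           ## n minus one, because we estimated the mean of Y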


Look at your ANOVA output for Model1 again. Why are the Sums of Squares and Mean Squares of X the same, but for the "Residual" they are different?
The Mean Squares are the (Sum of Squares)/(degrees of freedom). For X, we divide 30/1 = 30, so the Sum of Squares and Mean Square are the same. For the Residuals, the degrees of freedom are n - 2 = 16, so the MS = SS/df = 12/16 = 0.75.


The degrees of freedom are incredibly valuable information in an ANOVA table. They help us to check that the analysis was correctly conducted (do the df match what we expected based on the sample size of the data?), and help us to check that we are interpreting correctly which effect is which (the Residual degrees of freedom are n-2, while the degrees of freedom for testing the slope will be 1).

Let's continue with re-creating the ANOVA table "by hand" and estimate the Mean Squares:

(RegressionMS <- RegressionSS/1)
(ResidualMS <- ResidualSS/(18-2))

Do they match the ANOVA that you generated above?

Once the two Mean Squares are correctly calculated, the "F-statistic" (also known as the Mean Squares Ratio) is straightforward: Divide the MS of the fitted values by the MS of the residuals:

\(F = \frac{s^2_u}{s^2_e}\), where \(s^2_u\) is the Regression Mean Square and \(s^2_e\) is the Residual Mean Square.

Then, we can ask R to compare that estimated value against the F-distribution to obtain a P-value:

(Fratio <- RegressionMS/ResidualMS) ## Calculate the F-statistic
1 - pf(Fratio, 1, 16)    ## Calculate the P-values for the statistic, based on the F-distribution, the F-ratio, and the numerator (regression) and denominator (residual) degrees of freedom

You can think of the F-statistic as the ratio of variance explained by the model to the variance not explained by the model. As explained variance increases relative to unexplained variance, the F-statistic increases. The F-statistic is useful because we know its frequency distribution when the null hypothesis is true (similar to the t-statistic). If the value (the F-statistic) is much greater than what we expect when the null hypothesis is true, then we reject the null hypothesis.
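To make "much greater than what we expect when the null hypothesis is true" concrete, we can ask R for the value that an F-statistic with 1 and 16 degrees of freedom would exceed only 5% of the time if the null hypothesis were true (a quick sketch):

qf(0.95, df1 = 1, df2 = 16)   ## roughly 4.5; our F-statistic of 40 is far beyond this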

Does rejection of the null hypothesis seem reasonable for this data?
Yes. The F-statistic is large (40 times more variation associated with the predictor than the residual) and the P-value is <<0.05.
This makes sense because there is a clear association between X and Y


t-tests can also be performed directly on the regression coefficients; we can "command" R to do this for us using summary:

summary(Model1)    ## Display t-tests for the regression coefficients

FYI: for simple linear regression (i.e., where there is a single predictor variable, X), the F-test is equivalent to the t-test for the slope of the line; F is just \(t^2\) in this case!
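You can check this relationship for our simple example (a quick sketch, using the Model1 object from above):

t_slope <- summary(Model1)$coefficients["X", "t value"]   ## the t-statistic for the slope
t_slope^2                                                 ## squaring it...
anova(Model1)$`F value`[1]                                ## ...recovers the F-statistic from the ANOVA table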

For our example, should we accept or reject the null hypothesis that the intercept is 0?
Accept (strictly, we fail to reject it). The estimate of the intercept is precisely 0.00 (that is, the positive and negative values of Y in our sample exactly cancel one another out), and we would have a very high probability of observing this estimate if our null hypothesis were true.

What do you notice about the last lines of output from the summary of our regression model?
It's our ANOVA output. So summary gives us much of the same information that anova gave us.



4.4 Analysing flock size in Ruddy Turnstones

Need a break?
click here if you're feeling bird-brained https://youtu.be/FHEa4jj3ZGw

Ruddy Turnstone data

The Ruddy Turnstone is a migratory shorebird that occurs all over the world, including Australia.

UQ's own Prof. Richard Fuller studied some of these birds inhabiting the coastline of north-eastern England.

Each day, through binoculars, Rich observed individual birds, and recorded the following variables (some are self-explanatory):

  1. 'BirdID': bird identifier [these birds are our objects in this datafile - each of the following variables was recorded for each of the individually identified focal birds];
  2. 'FlockID';
  3. 'HeadUps': the number of "head-ups" performed by each bird;
  4. 'FlockSize';
  5. 'TideState': tide state ("high" or "low");
  6. 'NumPecks': the number of "sand pecks" performed by each bird;
  7. 'TotalAggression': a record of bird aggression;
  8. 'TimeOfDay';
  9. 'TimeHighTide';
  10. 'DaysStartWinter': the number of days since the start of winter, and;
  11. 'Temperature'.

His database is provided for you in the file, Turnstone.csv.

We will use these data as an illustration of how linear regression is commonly used in biology.



4.5 Analysing the Turnstone data

Import the file Turnstone.csv into Rstudio.

Turnstone <- read.csv("Data/Turnstone.csv", header = TRUE)

There are several different questions that we could ask about causes of bird behaviour using this data; we focus on one:

Aim: Investigate whether the time of high tide ('TimeHighTide') had an effect on the size of the flock ('FlockSize') that a bird was observed in.

plot(FlockSize ~ TimeHighTide, data=Turnstone)        ## Simple scatter plot
fit <- lm(FlockSize ~ TimeHighTide, data=Turnstone)   ## Fit a linear model (regression line) to the data and store the results in the R object "fit"
curve(cbind(1, x) %*% coef(fit), add = TRUE, col = "red", lwd = 2)  ## Draw the regression line, calculated from the model fit parameters, on your scatterplot
Does the scatter plot suggest that flock size is statistically associated with time at high tide?
You might find this question hard to answer with confidence, but I think that you will agree that the plot of this data looks nothing like the clear, strong association we observed in our simple example.
Indeed, the scatter looks quite random, and our fitted regression line looks very close to being flat (slope = zero; remember, you can check the estimate of the slope in the output shown when you run summary(fit)). So, on balance, we might conclude from this plot and regression fit that the time of high tide does not affect the size of the flock a bird is observed in.
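If you want to check that slope estimate for yourself, you can pull it straight from the fitted model object (a quick sketch, assuming the fit object from above):

coef(fit)                    ## intercept and slope of the fitted regression line
summary(fit)$coefficients    ## the same estimates, with their standard errors, t-values and P-values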


Let's check our intuition about the patterns we can see in the data against the statistical test provided by the regression and ANOVA:

anova(fit)  ## Look at the ANOVA table
Does the ANOVA table provide evidence for a statistical association?
Yes!
The F-statistic is large, and the P-value is <0.05, providing statistical support for the alternative hypothesis that flock size is predicted by time at high tide.



4.6 Diagnostics

Although we can easily estimate our regression parameters of slope and intercept, to draw robust conclusions from hypothesis tests based on either the F or t distributions, we need to be mindful of several assumptions of regression that underlie the mathematics and calculation of the ANOVA table.

Key regression assumptions

  1. Errors (residuals) are normally distributed (i.e., they have a Gaussian distribution).

  2. Homogeneity of variance - there is no change in the spread of the residuals as the value of X changes.

  3. Independence - this assumption is one that we need to keep in mind when designing the experiment or survey, and not one that we can directly assess from the data. We need to ensure that the response value for one observation did not influence the response value of other observations. Two common situations when this assumption is violated are (1) when we take repeated measures on the same individual over time, or (2) if our observations are clustered in space. Violating this assumption can inflate our risk of Type I error, but the solution requires more sophisticated analyses than simple linear regression. You will learn more about this later in the course.

  4. X values are estimated without error. This assumption is often not met in biology (we will leave it aside for now), but in a well-designed experiment with sufficiently large sample sizes, it does not have a particularly concerning effect on our inferences.

  5. We also assume that the relationship between X and Y is linear! More about that later.

4.6.1 Normally distributed residuals

We can check our first assumption, that the residuals are normally distributed, using the qqnorm plot function. qqnorm generates a plot where the residuals are plotted against values they would have had if they were perfectly NORMAL. If the dots form a reasonably straight line, then we are OK.

qqnorm(fit$residuals,  ## Check if the regression residuals are normally distributed
       main = "Residuals of regression of flock size on high tide time")        
qqline(fit$residuals)  ## Add the 1:1 line for visualisation     
Is there any convincing evidence that our residuals are not normally distributed?
Yes
The dots do not form a straight line in the qqplot. Rather, some residuals are much larger than we would have expected if they were normally distributed.
Looking back at the plot of flock size against tide time, do you think we should have expected this problem?


What can we do?

When the distribution of residuals does not match our assumptions, sometimes we can transform the response variable, Y, to resolve this problem and allow us to safely apply our analysis. You can inspect the distribution of Y the same way that you did the residuals:

qqnorm(Turnstone$FlockSize, main = "Flock size")   ## Check if Y is normally distributed
qqline(Turnstone$FlockSize)   ## This just adds a 1:1 line to make it easier to visualise if the observed (y-axis) values correspond to the predicted (x-axis) values

What do you notice?

Deciding on what transformation to apply can be a matter of investigating what distribution your data follows; for many distributions there are standard ways to transform them to a normal distribution. When we work with certain types of variables that other researchers have used, we can sometimes follow their lead, and apply the transformation they have already decided is best.
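One simple way to inspect the shape of the response before deciding on a transformation is to plot histograms of the raw and transformed variable side by side (a sketch, assuming the Turnstone data are loaded):

par(mfrow = c(1, 2))                                                              ## two plots side by side
hist(Turnstone$FlockSize, main = "Raw flock size", xlab = "FlockSize")
hist(log(Turnstone$FlockSize), main = "Log flock size", xlab = "log(FlockSize)")
par(mfrow = c(1, 1))                                                              ## reset the plotting layout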

FlockSize is log-normally distributed. That means that if we apply a log transformation, we will have a normally distributed Y (and, hence, normally distributed residuals!). Let's see:

qqnorm(log(Turnstone$FlockSize))                ##log transform and plot against expected values of normal distribution
qqline(log(Turnstone$FlockSize))                ## add 1:1 line to plot
plot(log(FlockSize) ~ TimeHighTide, data=Turnstone)   ## Simple scatter plot of our hypothesis

NB: by default, log is the natural logarithm; log10 is the more familiar, base-10, logarithm. We could use either here; the values differ, but the fit to the normal distribution is the same.
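If you want to convince yourself that the two logarithms differ only by a constant multiplier (and so give identical qqnorm shapes), here is a quick check based on the change-of-base rule:

all.equal(log(Turnstone$FlockSize), log10(Turnstone$FlockSize) * log(10))   ## should return TRUE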

Does our response variable, Y, look like it follows a normal distribution now that we have transformed it?
Yes
The dots (datapoints) now sit very close to their expected values, clustering around the 1:1 line.


OK, so let's repeat the data analysis, but this time try using the natural logarithm of flock size.

fit2 <- lm(log(FlockSize) ~ TimeHighTide, data=Turnstone)

qqnorm(fit2$residuals)
qqline(fit2$residuals)
Has log transforming the response variable helped justify the assumption of normally distributed residuals?
Yes
The dots form a much straighter line. So "fixing" our response variable has "fixed" our residuals.
This is important - while we assume Y and the residuals are normally distributed, it is really the distribution of the residuals that impacts on our confidence in the results of the ANOVA, and so it is the distribution of the residuals that we really care about.


4.6.2 Homogeneity of variance

What about the other assumptions of linear regression? We can test the homogeneity of variance assumption by plotting the residuals against their fitted values. In essence, this is checking that how well our X explains our Y does not change with values of X.

## Check that residuals and the fitted values are independent, and that the residual variance is constant across fitted values
plot(fit2$fitted.values, fit2$residuals)
Is there evidence that the residuals do not have a constant variance?


No.
Again, you might feel a little unsure when you are encountering these checks for the first time, but the key thing to note here is that there is very little indication that the total range of the residuals (i.e., the spread of points on the y-axis) is much greater (or, conversely, smaller) for higher (or lower) fitted values.

If you look at the plot of the residuals against the fitted values from the analysis of the raw Turnstone data, you can see that we should have been somewhat concerned about the homogeneity of variance assumption - we only detected large residuals for high fitted values, leading to a greater spread of values (variance!) for high fitted values than for low fitted values: plot(fit$fitted.values, fit$residuals)
We focus on the plot of fitted values versus residuals because it is standard output for lm (you'll see what this means in the next practical). However, it can be easier to think about this homogeneity assumption when you look at the plot of residuals against X, because we are assuming homogeneity with reference to the distribution of X: plot(Turnstone$TimeHighTide, fit$residuals). Notice that this is the same pattern! So, when we have large positive values of TimeHighTide (i.e., our X), we have larger residuals (i.e., the observed value of Y, 'FlockSize', deviates much more from the predicted value) than when we have large negative values of X. Log-transformation of 'FlockSize' solves our problem with homogeneity of residuals: plot(Turnstone$TimeHighTide, fit2$residuals).
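To see that before-and-after comparison in a single figure, you could plot the two sets of residuals against TimeHighTide side by side (a sketch, assuming the fit and fit2 objects from above):

par(mfrow = c(1, 2))                               ## two plots side by side
plot(Turnstone$TimeHighTide, fit$residuals,
     main = "Raw FlockSize", xlab = "TimeHighTide", ylab = "Residuals")
plot(Turnstone$TimeHighTide, fit2$residuals,
     main = "log(FlockSize)", xlab = "TimeHighTide", ylab = "Residuals")
par(mfrow = c(1, 1))                               ## reset the plotting layout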


Let's look at the results:

plot(log(Turnstone$FlockSize) ~ Turnstone$TimeHighTide)
curve(cbind(1, x) %*% coef(fit2), add = TRUE, col = "red", lwd = 2)  ## fitted line
anova(fit2)
summary(fit2)
What do you conclude about the effect of time at high tide on the size of the flock that a ruddy turnstone is found in?
That time at high tide significantly influences flock size. Our F-value is large, and our P-value is small. The estimate of slope is positive, so we can conclude that ruddy turnstones are found in larger flocks as time of high tide increases.

4.7 Do it on your own

Now that you have run through an example using the Ruddy Turnstone data, try doing your own analysis with a different dataset (this will help you with future assessments!). This time, we will use the iris dataset. The iris dataset comes with R, so you will not need to load any other data into your workspace. Take a look:

head(iris)
str(iris)
The sepals in iris flowers are showy and look like petals, whereas in other flowering plant species, sepals are not showy. Now it's time to try your hand at writing a linear model and analysing it. Say a plant biologist studying irises suspects that sepal length in iris flowers causally explains variation in iris petal length. They want to analyse the iris dataset to answer the question: does sepal length predict petal length in irises?
Write out the necessary code, with comments as needed, in your script to answer the questions below:

What is the model formula (in R) to test the relationship?
If you need to, go back to section 4.2 for guidance

What is the null hypothesis for this analysis?
If you need to, go back to section 4.2 for guidance

What test will you use in order to accept or reject the null hypothesis?
If you need to, go back to section 4.3 and 4.5 for guidance


How will you test the assumptions of your model are met?
If you need to, go back to section 4.6 for guidance

How will you decide if you need to transform your data?
If you need to, go back to section 4.6 for guidance

4.8 List of useful functions

Here are a few useful functions that you learned today.

Linear modelling
lm; coef; residuals; anova; summary

Model diagnostics
qqnorm; qqline

Plotting
plot(lm); curve