Practical 6 Linear models - Multiple regression

A major goal of linear regression is to explain variation in our data. In simple linear regression, we explored whether (and how much) variance was explained by a single predictor variable. Often (usually!), we have identified more than one possible predictor variable.

Multiple regression fits a linear model with two or more predictor variables (e.g. \(X_1\), \(X_2\), \(X_3\), etc) to a single response variable (\(Y\)). These predictors may be two or more continuous variables, or a combination of continuous and categorical predictors. The latter is sometimes referred to as ANCOVA (analysis of covariance), but all of these are just linear models with at least one continuous predictor combined with other continuous or categorical predictors.
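In symbols, a multiple regression with two continuous predictors can be written as \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\), where \(\beta_0\) is the intercept, \(\beta_1\) and \(\beta_2\) are the slopes (one per predictor), and \(\varepsilon\) is the residual error. Adding more predictors simply adds more slope terms.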

When fitting our model to explain our data, we want to fit the best model! In doing this, we are guided by two competing criteria:

  1. Explain as much of the variation as possible, and;
  2. Choose the simplest model (the "Principle of Parsimony").

In this practical, we analyse flock size in Ruddy Turnstones, and consider the joint effects of several predictor variables together: "Temperature", time at high tide ("TimeHighTide"), and number of pecks ("NumPecks").

You will learn about:

  1. Interactions among variables in their effects on dependent variables;
  2. Correlations among predictor variables;
    1. Measuring correlations;
    2. Dealing with correlations;
  3. Model simplifications;
    1. Omnibus tests;
    2. Stepwise regression, and;
    3. Type II sum-of-squares.


Assumptions of multiple regression

In the previous practical, you learned that statistical tests (e.g., linear regression) may be untrustworthy when certain assumptions are violated. We must perform "diagnostic" checks of the model residuals to ensure that our analyses, and the conclusions that we draw from them, are robust.

Here are the assumptions of linear regression that we can check by inspecting the residuals of our analysis (the assumption of independence, which depends on sound experimental design, applies to multiple regression as well, but we will ignore this for now):

  1. The model residuals are normally distributed;

  2. Homogeneity of variance - the residuals have a constant variance along the length of a regression line (across the values of X);

  3. The relationship between the response (\(Y\)) and predictor (\(X_1\), \(X_2\), etc) variables is linear. Linear can simply mean that a straight line is a reasonable summary of the relationship between the response variable and the predictor variable(s). However, as you learned in your lecture, "linear" in linear models is more correctly defined as models in which \(Y\) is predicted by only a linear ("+" or "-") combination of parameters (intercepts, slopes); no parameter is multiplied or divided by another parameter, and no parameter appears as an exponent (see the examples just after this list).

  4. Predictor variables are assumed to NOT be strongly correlated - they should not be explaining the same variation in \(Y\). This is an assumption that only affects multiple regression, not simple linear regression.
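For example, \(Y = \beta_0 + \beta_1 X + \beta_2 X^2\) is still a linear model, because the parameters \(\beta_0\), \(\beta_1\) and \(\beta_2\) enter only as a "+" combination (even though \(X\) itself is squared), whereas \(Y = \beta_0 e^{\beta_1 X}\) is not, because the parameter \(\beta_1\) appears as an exponent.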


6.1 Choosing variables to include in multiple regressions

It's important to have clear research questions and hypotheses before choosing which predictor variables to include in multiple regressions. Remember, statistical models are just the tools we use to get answers to our questions!!! In addition to choosing predictors that make sense and are relevant to our questions, we also need to decide if interactions are required. In this prac we will run through some approaches for choosing which combination of predictor variables performs "best" in explaining variation in the response variable without being overly complex (we will explain the concept of "parsimony" below). As usual, sometimes there aren't clear answers and even the best statisticians in the world argue over different model selection "philosophies".

Open your CONS7008 R project and save a new blank script for today. Now, let's turn to the Turnstone data.

Turnstone <- read.csv("Data/Turnstone.csv", header = TRUE)

library(DHARMa) ## load this package for awesome diagnostics
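If you would like to confirm that the file loaded as expected before plotting, a quick (optional) look at its structure can help. This assumes the columns are named 'FlockSize', 'Temperature', 'TimeHighTide' and 'NumPecks', as used throughout this prac.

str(Turnstone)       ## check the number of rows and that each column has the expected type
summary(Turnstone)   ## check ranges and look for obvious oddities (e.g., missing values)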

We can use the wonderful pairs() function to eyeball the data. One way to specify the variables in a pairs() plot is using a formula, like this:

with(Turnstone, pairs(FlockSize ~ Temperature + TimeHighTide + NumPecks, upper.panel=panel.smooth))

The red lines are "smoothers" that snake their way through the data - they can be helpful to visualise relationships.

You will recall from the prac on simple linear regression that the FlockSize response variable is not normally distributed, and we can see from the pairs() plot that this will probably cause some issues as we move forward. For now, we will proceed to demonstrate a particular modelling process.


6.2 Diagnostic tests for multiple regression models

We performed a few diagnostic tests in the last practical (e.g., qqnorm). As you saw in the last practical, problems with the residuals can stem from problems with \(Y\). In multiple regression, our initial assessment of model assumptions should be made on the residuals of the most complicated model -- the model with the most predictors -- that we could fit. If the residuals are unsuitable, the same will apply to any simpler model applied to that same data, e.g., a model with some of the predictor variables excluded.

To start, run the following code to fit the "full" model. For now, don't worry too much about exactly what is being modelled (it's the "full" multiplicative model with all possible interactions among our \(X\), or predictors), that will become clear below when we begin to assess it for parsimony.

fullModel <- lm(FlockSize ~ Temperature * TimeHighTide * NumPecks, data=Turnstone)  ## The full model
## plot diagnostics using the "simulateResiduals" function from DHARMa
plot(simulateResiduals(fullModel))

The qqnorm plot isn't very convincing (the large residuals deviate well above what we would expect if they were normally distributed). On the right-hand plot, there also seems to be more vertical scatter for larger fitted values, just as we saw for simple linear regression.

Let's try two different data transformations: a log transformation (which you used for simple linear regression of 'FlockSize') and a sqrt (square root) transformation. The square root transformation is commonly used to reduce heteroscedasticity (a fancy term for heterogeneity of residual variances) in linear regression. As you saw last week, log transformation brought the extremely large values of flock size back toward the bulk of the data - square root transformation can do the same thing, but has a weaker effect.

Let's see:

## Fit models with different transformations of Y

## sqrt
sqrtFullModel <- lm(sqrt(FlockSize) ~ Temperature * TimeHighTide * NumPecks, data=Turnstone) ## sqrt transform
## look at diagnostics
plot(simulateResiduals(sqrtFullModel))

## log
logFullModel <- lm(log(FlockSize) ~ Temperature * TimeHighTide * NumPecks, data=Turnstone)   ## log transform
## look at diagnostics
plot(simulateResiduals(logFullModel))
Which transformation would be most appropriate? Why?
The log transformation
While the qqnorm plot is better for the sqrt transformed data relative to the untransformed (raw) data, the qqnorm plot for the log transformed response very nearly exactly matches the expectations of the normal distribution. Despite DHARMa's warnings in the right-hand plot for the log transformed response, the variation in the residuals (i.e., spread on the y-axis) is the most even for the log transformation. Let's proceed with the best option for now, even though it's not perfect. Statistical decision making can be difficult but it gets easier with experience.


Having obtained "passable" diagnostics, we are now in a position to believe the F-statistics and their P-values, as reported in the ANOVA table.

If a predictor variable is significantly associated with log('FlockSize'), is it necessarily also significantly associated with 'FlockSize' (untransformed)?
Yes
'FlockSize' and log('FlockSize') are monotonically related: as 'FlockSize' increases, so does log('FlockSize'). Transforming the data gives us confidence that we don't have an elevated risk of Type I error from the poorly behaved residuals, and it still allows us to interpret the associations between our predictors and the response variable just as we would have done for the untransformed data.
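If you want to convince yourself of this, here is a quick optional check: because the log transformation is monotonic (it preserves the ordering of the values), the rank (Spearman) correlation between 'FlockSize' and log('FlockSize') is exactly 1.

## A variable and its log are perfectly rank-correlated because log() is a monotonic transformation
cor(Turnstone$FlockSize, log(Turnstone$FlockSize), method = "spearman")   ## returns 1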

6.3 Interactions in multiple regression

In simple linear regression, we care only about the effect of \(X\) on \(Y\). In multiple regression, we care about the effect of \(X_1\) and \(X_2\) on \(Y\). Importantly, the effect of \(X_1\) on \(Y\) may depend on \(X_2\).

This dependence (in their effect on \(Y\)) among predictor variables is known as an interaction effect. You first came across interactions between categorical predictor variables with two-way ANOVA. While it might be difficult to visualise in your head at this stage, we can also fit interactions between two (or more!) continuous predictor variables, and between continuous and categorical predictor variables.

The figures below show the effects of one continuous predictor \(X\) and one categorical (\(Group\): blue/red) predictor on a continuous \(Y\) response variable. The effect of predictor variable \(X\) upon \(Y\) is evident as a positive slope of the line of best fit. In the image on the left, the effect of \(X\) on \(Y\) is identical for two groups (depicted by the blue and red lines); the increase in \(Y\) per unit \(X\) is the same for both groups (the lines are parallel). That is to say: there is no interaction between \(X\) and \(Group\).

In the image on the right, however, the effect of \(X\) is not the same for each group; \(Y\) increases much faster as \(X\) increases for the blue group than for the red group. The effect of \(X\) on \(Y\) depends on which group you look at. That is: there is an interaction between \(X\) and \(Group\).

When interactions are present, main effects are harder to interpret: In the figure on the right, which group has larger values of \(Y\)? It depends on where you look along the x-axis. We've provided a simple example here, with one continuous \(X\) variable and one categorical \(X\) variable. Interactions between continuous predictors are more difficult to visualise.

Interactions work by multiplying predictor variables together, and treating the result as a new predictor variable. This is what goes on "under the hood" whenever you include an interaction term in your regression model. A test of interaction is a test of significance of that new predictor variable.

The following code shows explicitly that R processes interaction terms in this way by multiplying Temperature and NumPecks together to form a new interaction variable:

Turnstone$Temperature_by_NumPecks_interaction <- Turnstone$Temperature * Turnstone$NumPecks  ## Add the new column to the data set
head(Turnstone)      ## Check that the new variable has been added

Now, analyse the data with "Temperature_by_NumPecks_interaction" treated as a main effect.

add_model <- lm(log(FlockSize) ~ Temperature + NumPecks + Temperature_by_NumPecks_interaction, data=Turnstone)
anova(add_model)   ## Inspect the ANOVA table

Now analyse the data again, but replace "Temperature_by_NumPecks_interaction" with an interaction term between "Temperature" and "NumPecks".

intxn_model <- lm(log(FlockSize) ~ Temperature + NumPecks + Temperature:NumPecks, data=Turnstone)
anova(intxn_model)      ## Inspect the ANOVA table

You see that the two models generate identical statistics! Therefore, including an interaction effect in a model (e.g., Temperature:NumPecks) is the same thing as multiplying two variables together and treating the product as a predictor variable in its own right.
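If you would rather not eyeball the two ANOVA tables, a quick optional check is to compare their F-statistics directly (a small sketch; any of the tables' columns could be compared in the same way):

## The F statistics (and everything else in the two tables) match exactly
all.equal(anova(add_model)$"F value", anova(intxn_model)$"F value")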

Now once more, but use * as follows:

star_model <- lm(log(FlockSize) ~ Temperature * NumPecks, data=Turnstone)
anova(star_model)        ## Display the ANOVA table
summary(star_model)      ## Inspect regression coefficients

NB: we used * above in our logFullModel model

Temperature + NumPecks + Temperature:NumPecks and Temperature * NumPecks are equivalent, but the latter makes coding slightly easier. When you type out all your own code, you'll appreciate the * shortcut.

The regression coefficient (the "Estimate") of the interaction term can be interpreted as follows: For each unit increase in the second predictor variable, the slope of the line of best fit between the first predictor variable and the response variable changes by a certain amount. This amount is given by the regression coefficient for the interaction term.
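For example, using the coefficients from 'star_model', you can reconstruct the slope of log('FlockSize') on 'NumPecks' at any particular temperature by hand (a quick optional sketch; 5 and 10 degrees C are arbitrary illustrative values):

## Slope of NumPecks = NumPecks coefficient + (Temperature x interaction coefficient)
coef(star_model)["NumPecks"] + 5  * coef(star_model)["Temperature:NumPecks"]   ## slope at 5 deg C
coef(star_model)["NumPecks"] + 10 * coef(star_model)["Temperature:NumPecks"]   ## slope at 10 deg C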

Is there evidence for an interaction between 'Temperature' and 'NumPecks'?
What is your general conclusion?

Yes, because the F-value is large, the t-value is lower than -2, and the corresponding P-values are quite small (<0.001).

The effect of 'NumPecks' on 'FlockSize' depends on the 'Temperature'.

6.3.1 Quickly visualising interactions

One way to visualise interactions between two continuous predictor variables is to plot the fitted relationships between one predictor and the response while holding the second predictor at different values. For example, we can plot relationships between NumPecks and log(FlockSize) for different values of Temperature. Let's try this using John Dwyer's favourite base R function, curve(). We can use the coefficients from 'add_model', 'intxn_model' or 'star_model' because they are equivalent. Let's use 'intxn_model' here.

## first, lets grab three logical values of Temperature; low, medium and high
(Temps_to_plot<-quantile(Turnstone$Temperature, p = c(0.1, 0.5, 0.9))) ## extract the 10th, 50th and 90th percentiles of Temperature

## Now make an "empty" plot between NumPecks and log(FlockSize)
with(Turnstone, plot(log(FlockSize) ~ NumPecks, type = "n"))

## plot the curve for the low temperature (2.5 deg C)
## remember, curve() knows what 'x' is in the plot above, so we use x here for NumPecks
curve(cbind(1,Temps_to_plot[1], x, x*Temps_to_plot[1])%*%coef(intxn_model), add = T, col = "skyblue", lwd=3)

## median temperature (6.5 deg C)
curve(cbind(1,Temps_to_plot[2], x, x*Temps_to_plot[2])%*%coef(intxn_model), add = T, col = "goldenrod", lwd=3)

## high temperature (11.3 deg C)
curve(cbind(1,Temps_to_plot[3], x, x*Temps_to_plot[3])%*%coef(intxn_model), add = T, col = "tomato", lwd=3)

## add a legend
legend("topleft", legend=c("2.5 deg C", "6.5 deg C", "11.3 deg C"),
       col=c("skyblue", "goldenrod", "tomato"), lty=1, lwd=2, cex=1)

This is probably a good opportunity to show you how to change the scale of an axis from linear to another scale. Instead of plotting log(FlockSize) (which some people find difficult to interpret), we can plot FlockSize on a log scale by using 'log = "y"' in the plot call. Then, we need to tell the curve() function what's going on. We do this by exponentiating each fitted line within curve(). Notice how we apply exp() after the fitted line is calculated (i.e. the exponentiation occurs after the line calculations so that the entire line is back-transformed).

with(Turnstone, plot(FlockSize ~ NumPecks, log="y", type = "n"))

## plot the curve for the low temperature (2.5 deg C)
curve(exp(cbind(1,Temps_to_plot[1], x, x*Temps_to_plot[1])%*%coef(intxn_model)), add = T, col = "skyblue", lwd=3)

## median temperature (6.5 deg C)
curve(exp(cbind(1,Temps_to_plot[2], x, x*Temps_to_plot[2])%*%coef(intxn_model)), add = T, col = "goldenrod", lwd=3)

## high temperature (11.3 deg C)
curve(exp(cbind(1,Temps_to_plot[3], x, x*Temps_to_plot[3])%*%coef(intxn_model)), add = T, col = "tomato", lwd=3)

## add a legend
legend("topleft", legend=c("2.5 deg C", "6.5 deg C", "11.3 deg C"),
       col=c("skyblue", "goldenrod", "tomato"), lty=1, lwd=2, cex=1)

Now we end up with a very similar looking plot as before but the y-axis is on the log scale! This means the numbers on the y-axis are actually flock sizes but the scale is no longer linear (look at the y-axis carefully). This is one way to help your audience interpret the biology more clearly from model fits involving log (and other) transformations.
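If you prefer not to build the design matrix by hand inside curve(), the same fitted lines can be produced with predict() on a grid of new data. Here is an equivalent optional sketch (the objects 'newdat', 'cols', 'sub' and 'i' are just illustrative names):

## Build a grid of NumPecks values crossed with the three temperatures
newdat <- expand.grid(NumPecks    = seq(min(Turnstone$NumPecks), max(Turnstone$NumPecks), length.out = 100),
                      Temperature = Temps_to_plot)
newdat$fit <- exp(predict(intxn_model, newdata = newdat))   ## back-transform fitted values to the flock-size scale

with(Turnstone, plot(FlockSize ~ NumPecks, log = "y", type = "n"))
cols <- c("skyblue", "goldenrod", "tomato")
for (i in 1:3) {
  sub <- newdat[newdat$Temperature == Temps_to_plot[i], ]   ## fitted line for one temperature
  lines(sub$NumPecks, sub$fit, col = cols[i], lwd = 3)
}
legend("topleft", legend = paste(round(Temps_to_plot, 1), "deg C"),
       col = cols, lty = 1, lwd = 2)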


6.4 Correlations among predictor variables

An assumption of multiple linear regression is that the predictor variables are independent (i.e., uncorrelated). In practice, predictors are often likely to share variation to some extent, but the more substantial this "collinearity" becomes, the lower our confidence in estimates of the regression parameters (slopes). Thus, while we might be confident that overall our model does a good job predicting \(Y\), we cannot say which of our predictors are contributing.

One manifestation of collinearity is that parameter estimates (and the conclusions that we draw based on them) may be dependent on the order in which the predictor variables are considered. In extreme cases, correlations among predictor variables can make the data analysis completely contradict itself!

The following R code generates three ANOVA tables from the Turnstone data. You will see that the conclusions to be drawn, based on these tables, contradict one another!

Fit a lm three times, but change the order of the predictor variables:

fullModel1 <- lm(log(FlockSize) ~ Temperature * TimeHighTide * NumPecks, data=Turnstone)  ## Model 1
fullModel2 <-  lm(log(FlockSize) ~ TimeHighTide * Temperature * NumPecks, data=Turnstone) ## Model 2
fullModel3 <- lm(log(FlockSize) ~ NumPecks * TimeHighTide * Temperature, data=Turnstone) ## Model 3

## Now have a look at the results
anova(fullModel1)   ## Results of Model 1
anova(fullModel2)   ## Results of Model 2
anova(fullModel3)   ## Results of Model 3
Are all of the statistics identical across the three ANOVA tables?


They are for the 3-way interaction, but not for the 2-way interactions. How frustrating!


This situation is highly unsatisfactory; we want one answer to a question, not several. The problem stems from correlations among predictor variables. Let's explore how to deal with this problem.


6.5 Diagnostics for a potential collinearity problem among predictor variables

6.5.1 Scatter plots

Scatter plots provide a simple, graphical display of statistical association among continuous variables. In addition to allowing us to visually detect strong correlations between pairs of variables, they also allow us to see any non-linear associations between them.

Let's use pairs again to see the scatterplot matrix.

with(Turnstone, pairs(log(FlockSize) ~ Temperature + TimeHighTide + NumPecks, lower.panel=NULL, panel = panel.smooth))     ## Produce a matrix of all pairwise scatter plots for the variables of interest

Here you can see each predictor's relationship with 'log(Flocksize)' (top row), and the predictors' relationships with each other (second and third row). Beware of over-interpreting these plots; each ignores all the other variables in the data.

The distributions of some of these variables are a bit messy, which can make it difficult to spot associations in these data, but there do seem to be some associations among variables. For example, 'TimeHighTide' and 'Temperature' are positively associated.

6.5.2 Correlation coefficient

This is a single statistic that applies to any pair of continuous variables. The "correlation coefficient" quantifies one particular thing about a scatter plot: the tendency of dots to drift upward (or downward) as your eyes scan the image from left to right. The correlation coefficient assumes a linear trend, and does not capture any information about non-linearity in the association. Its value lies between -1 and +1; +1 means perfect positive association (a straight-line relationship), zero means no association, and -1 means perfect negative association.

Want to play a game?
Click here to play Guess the Correlation: http://guessthecorrelation.com/


Run the following code and look at the correlation coefficients among the variables of the Turnstone data.

## create a subset of just the predictors of interest
Turnstone.subset<-Turnstone[,c("NumPecks", "TimeHighTide", "Temperature")]
cor(Turnstone.subset)             ## use the `cor` function from base R to generate the matrix of correlation coefficients
round(cor(Turnstone.subset), 3)   ## use the `round` function to make the matrix easier to inspect by rounding to 3 decimal places

NB: this "matrix" of correlation coefficients is symmetrical, just like the scatter plot.

What do you notice about the values on the diagonal of this matrix?
They are all exactly 1.
This is an estimate of the association between a variable and itself. Unsurprisingly, a variable is perfectly correlated with itself :-)

Is the correlation coefficient always an adequate summary of the association between two variables?


No.
First, it ignores non-linearity that may exist in a statistical association; see the explanation above. Second, the same correlation coefficient can apply to data that exhibit markedly different patterns of association when plotted in a scatter plot. For example, these plots all have a correlation coefficient of ~0.7.
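If you would like a concrete demonstration in R, the built-in 'anscombe' data set (Anscombe's quartet) contains four x-y pairs that all have almost exactly the same correlation coefficient (~0.82), yet look completely different when plotted. A quick optional sketch:

## All four pairs have a correlation of ~0.82, but their scatter plots differ wildly
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
## Plot them to see for yourself
par(mfrow = c(2, 2))
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
                    xlab = paste0("x", i), ylab = paste0("y", i))
par(mfrow = c(1, 1))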

Nonetheless, for the Turnstone data, the largest correlation estimate is for the pair of variables that appeared in the scatter plot to be associated ('TimeHighTide' and 'Temperature').

6.5.3 Variance inflation factor (VIF)

This statistic is built upon regression analysis. It is obtained by regressing a predictor variable (more precisely, one column of the model's design matrix, which may be a main effect or an interaction column) against all the other predictor columns, and then looking at the multiple \(R^2\) of that analysis:

\(vif = \frac{1}{1-R^2}\)

The following code shows how to manually calculate "vif" for 'Temperature'.

## Manually calculate 'vif' for Temperature (NB: R's 'vif' command does all this for you!)  
fit <- lm(Temperature ~ TimeHighTide + NumPecks                 ## All main effects 
    + Temperature:TimeHighTide                          ## All 2-way interactions 
    + Temperature:NumPecks + TimeHighTide:NumPecks  
    + Temperature:TimeHighTide:NumPecks, data=Turnstone)## 3-way interaction 

1/(1 - summary(fit)$r.squared)                            ## Calculate "vif" for the 'Temperature' main effect 

As a gauge of correlation, the "rule of thumb" is: When \(vif > 10\), correlation among predictor variables is large, and caution is required when interpreting the parameter estimates in a multiple regression analysis for any response variable.

There is one "vif" for each predictor variable in the model. In practice, we would use the vif function to calculate every possible "vif": one for each predictor variable.

Load the required package, car (you might need to install it first using, install.packages('car'))

library(car)         ## Load the library that contains the "vif" command
vif(logFullModel)    ## Look at the variance inflation factors

You should recognise the "vif" that you calculated above for temperature.

Do any of the "vifs" indicate strong correlation among predictor variables?
Yes.
There are many large "vifs" (>10).


The inverse of "vif" (i.e., \(\frac{1}{vif}\)) is called "tolerance" - we should be concerned when the tolerance is less than 0.1 (1/10).


6.6 Robust multiple regression and model simplification

In multiple regression, we often want to produce a regression model that best explains the variation in our response variable. The process is difficult when there are correlations among predictor variables because the statistical significance of terms in the regression model depends on their order in the model.

There are various strategies for dealing with this issue. Importantly, none of the solutions we will show you today are adequate when the correlations among predictor variables are very strong - we'll return to a solution for that problem later in the course.

6.6.1 Omnibus tests

We can test the significance of an entire block of terms simultaneously using a single F-test. This approach has the advantages that:

  1. It is unaffected by correlations among the predictor variables being considered;
  2. If it returns a non-significant result, that entire block of terms can be eliminated from the regression model without the need for further investigation;
  3. It greatly reduces the number of statistical tests needed to simplify the model. This is important because the more tests we run, the more likely it is we will have a false positive result - so minimising the number of statistical tests (and applying multiple-testing corrections) is an important goal in statistics.

The anova function can compare two models and produce a single F-test. The null hypothesis is that there is no difference between the residual variances of the two models (in other words, the null hypothesis is that the two models we compare explain the same amount of variance in the response variable, 'log(FlockSize)'). It does not matter how many terms the models differ by.

We begin by comparing the full model to a refit of that model without its 3-way interaction term. R provides a convenient command, update, that allows us to "tweak" lm models without having to re-write them. We will compare our full model from the beginning, logFullModel, with an updated model:

no_3_way <- with(Turnstone, update(logFullModel, ~ . - Temperature:TimeHighTide:NumPecks))  ## Refit without the 3-way interaction term
anova(logFullModel, no_3_way)                                              ## Compare the two models
Can we drop the 3-way interaction term from the regression model? What is your evidence for/against?
Yes.
The F-value is small and the P-value is large, suggesting that the 3-way interaction term does not explain very much variance in the data. The Residual Sums of Squares (RSS) differ only very slightly between the models, so there is little justification for the more complex model.


Now, let's look at whether any of the two-way interactions significantly predict variation in 'log(FlockSize)'. We need to compare the "no_3_way" model to a model without any two-way interactions (i.e., contains only "main effects").

## Refit the updated model ("no_3_way"), but exclude the 2-way interactions
mains_only <- with(Turnstone, update(no_3_way, ~ . - Temperature:TimeHighTide - Temperature:NumPecks - TimeHighTide:NumPecks))
anova(no_3_way, mains_only) ## Compare the two models
Can we drop all 2-way interaction terms from the regression model?
No.
This time, the F-value is large and the P-value is small, reflecting the large difference in RSS between the two models.


The omnibus test is great, but has a couple of disadvantages:

  1. We need to explicitly define the models to test. This is not really a problem - as you saw, it's a simple procedure.
  2. When we infer that a block of terms has an effect, we do not have evidence of which individual terms within that block might be important predictors.

6.6.2 Stepwise regression

Stepwise regression is a cyclic process of removing terms one-by-one, reassessing the fit of the model, re-adding terms that were deleted at a previous step, and repeating. The final model contains only terms that were above our statistical significance threshold. However, there is no guarantee that a significant term was not inadvertently dismissed along the way. The risk of missing important predictors increases with the degree of correlation among predictor variables.

Although this process sounds complex, there are functions that will take care of the fitting, removing/adding and refitting for us.

Here, we use the step function.

stepModel <- step(logFullModel, direction = "both")

The results are messy and complicated! step uses the Akaike information criterion (AIC) to decide if a model is a better (smaller AIC) or worse (bigger AIC) fit to the data. Notice that we start with the full multiplicative model, which has an AIC of -195.32. step has then removed the three-way interaction term - and the AIC got "better" (smaller: -197.13). This is the same conclusion that we drew with the omnibus test - the three-way interaction does not allow us to explain more variation in 'FlockSize'. AIC applies a parsimony penalty, and so fitting the three-way interaction was "worse" than not fitting it because it added complexity, but did not add any explanatory power!
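If you want to see where these AIC values come from, you can reproduce them with extractAIC(), which is what step() uses internally. (Note that extractAIC() drops an additive constant, so its values differ from those returned by the AIC() function, but differences between models are the same.) A quick optional check:

extractAIC(logFullModel)   ## edf and AIC of the full model (the second value should match the -195.32 reported by step)
extractAIC(no_3_way)       ## the model without the 3-way interaction has a smaller ("better") AIC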

step now has a new "Step" AIC value to compare models to. You can see that removing some terms (e.g., 'Temperature:TimeHighTide') causes a big increase in AIC, and results in a model that explains less variation in 'FlockSize' (i.e., has higher RSS than the other models).

Look at the ANOVA summary to see which model is best-supported:

anova(stepModel)    ## Look at the significance tests for the best model from `step`

## check diagnostics
plot(simulateResiduals(stepModel))
Do these results agree with those obtained using the omnibus tests?
Yes.
Two 2-way interactions are significant, while the 3-way interaction is not.


Here are a few things to remember when using stepwise model selection:

  1. Start with the full model;
  2. Remove terms that are not significant, in order of least significance;
  3. Do not remove main effects unless all interaction terms involving them have already been removed;
  4. When collinearity exists, refit the model at each step to reassess significance levels;
  5. If two models are not significantly different (i.e., they are equally good at explaining the data), choose the simpler model ("Principle of Parsimony");
  6. Do not violate the assumptions that underlie the statistical tests (P-values) involved in linear regression (always check the diagnostics!!).

6.6.3 Type II sum-of-squares

Above, we have been using "regular" sum-of-squares, which is called Type I sum-of-squares (also sometimes called sequential sum-of-squares). We saw that the significance of predictor variables depended on the order of the variables in the model.

These are "sequential" sum-of-squares because they work as follows:

  1. What is the effect of \(X_1\) on \(Y\)?
  2. Controlling for \(X_1\), what is the effect of \(X_2\) on \(Y\)?
  3. Controlling for \(X_1\) and \(X_2\), what is the effect of the interaction between \(X_1\) and \(X_2\) on \(Y\)?

In other words, when testing the effect of \(X_1\), the effect of \(X_2\) is not taken into account, but when testing the effect of \(X_2\), \(X_1\) is taken into account.

There is another way of calculating sum-of-squares, which makes the model output invariant to the order that variables are entered. Type II sum-of-squares are calculated for each term assuming it was listed last in the regression model. That is, what effect does \(X_1\) (or \(X_2\) or the interaction) have on \(Y\) after accounting for all other effects.

This means that the same ANOVA table would always be produced, regardless of the order of the terms in the model. Moreover, each term in the table is adjusted statistically for every other term. However, the various sum-of-squares will not add up to the total sum-of-squares. This is because the ANOVA table does not summarise all of the available variation in the data. Specifically, the covariation among predictor variables has been left out of the analysis.

The car package contains a version of the anova function: Anova (Note the capital "A") that generates type II sum-of-squares (instead of regular sum-of-squares).

Anova(logFullModel, type = "II")   ## Type II sum-of-squares (note the lower-case "type" argument)
## compare to the Type I (sequential) sum-of-squares table from anova()
anova(logFullModel)
Do these results agree with those obtained using the omnibus tests?
Yes.
This is quite surprising, given the different calculations involved.
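To convince yourself that Type II sums-of-squares really are invariant to the order of the terms, you can re-run Anova() on the re-ordered models from section 6.4 (a quick optional check; the two tables contain the same statistics for each term, only the row order differs):

## Unlike the sequential anova() tables, these two Type II tables agree term-by-term
Anova(fullModel1, type = "II")
Anova(fullModel2, type = "II")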

6.6.4 Summary

A general guideline that will always stand you in good stead when analysing complex data with multiple predictor or response variables is to be very wary of interpreting an effect as statistically significant or biologically meaningful when you obtain different evidence about that effect from different models.

Now that we have arrived at a robust analysis of the effect of temperature, number of pecks and time since high tide on flock size, what biological conclusions can you draw about the factors affecting whether a ruddy turnstone is found in a large or small flock?
We will discuss this as a class!

6.7 List of useful functions

Multiple Linear Regression
car::vif; stats::step; update; cor; car::Anova

Plotting
pairs

Model diagnostics
DHARMa::simulateResiduals