Practical 8 Generalised linear models (GLMs)

Generalised linear models are logical extensions of linear models. They are referred to as "generalised" because they apply the linear model "machinery" to a broader range of data types and error distributions than just the normal distribution. This is very useful because biological response variables often aren't normally distributed, or they have clear limits to the values they can take. For example, conservation scientists love counting tigers, but count data have certain properties that we need to respect. First, any regression lines that we fit to a count response variable shouldn't go below zero because you can't have -6 tigers! Second, counts are positive integers (positive numbers without decimals) that are often not normally distributed. Other distributions are usually more appropriate, such as the Poisson. GLMs can handle both of these peculiarities of count data (and other common types of non-normal data).

GLMs employ a couple of neat tricks. Just like a linear model, GLMs assign a linear relationship, referred to as the linear predictor, to a parameter of a probability distribution. In the case of a simple linear regression, the intercept and slope (the linear predictor) are assigned to \(\mu_{i}\) (refer to the lecture on linear regression). In GLMs, the linear predictor is assigned to parameters of OTHER distributions (e.g. \(\lambda_{i}\) for Poisson and \(p_{i}\) for binomial). It seems too good to be true, but it's not!

To make this work for different parameters from different distributions, GLMs use a link function. The link function effectively transforms the response variable inside the model so that a linear relationship can be fitted to the parameter. For the Poisson distribution, \(\lambda_{i}\) is linked to the linear predictor using the log link function. For the binomial distribution, \(p_{i}\) is linked to the linear predictor using the logit link function. This is NOT THE SAME as transforming the response variable beforehand and then just fitting a linear model, as you will see.
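To get an intuition for what the link functions are doing, here is a minimal sketch in R (the intercept, slope and x value are made-up numbers, purely for illustration):

## a made-up linear predictor: intercept + slope * x
eta <- -1 + 0.5 * 3
## Poisson GLM with log link: log(lambda) = eta, so lambda = exp(eta)
exp(eta)
## binomial GLM with logit link: logit(p) = eta, so p = plogis(eta) = exp(eta)/(1 + exp(eta))
plogis(eta)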

In this practical, we will explore three examples of GLMs and use graphing techniques to assess how well the GLMs fit our data.



8.1 An example with count data

Effects of temperature and fragmentation on rainforest beetle abundance

In this example the response variable is "counts": the number of individuals of a rare rainforest ground beetle (Nurus sp.) caught in each pitfall trap. There are two continuous predictor variables:

  1. Tmax = the maximum daily temperature (degrees C) for each location
  2. dist_to_edge = the distance from the forest edge (in metres) to the trap location within the forest (i.e. zero indicates the trap is right on the edge)


The data are available in the file: beetle_data.csv in the "Data" folder of our CONS7008 R project.

beetle_data<-read.csv("Data/beetle_data.csv", header = T)

First, get an overview of the relationships between the predictor and response variables, and the distribution of the response:

## visually explore the data
pairs(beetle_data)
## what can we say just from looking at the pairs plot?

## how is the "counts" variable distributed?
hist(beetle_data$counts) ## As Wanda Sykes would say, "that #$%& ain't normal!"

## what is the range?
range(beetle_data$counts)
## so we are looking at an integer variable (counts, with no negative numbers and no decimal places)
Are the relationships linear?


No, there seems to be some curvature in the relationships involving beetle counts.


What else do you notice?
The plots involving the count response seem to "flare out" at larger count values. This indicates that we probably can't rely on the normal distribution to "mop up" the residuals. Remember, the normal distribution assumes that a single residual standard deviation value can be used to represent the scatter all the way along the fitted relationship...but in this case the amount of scatter changes!!!


You may be tempted to try and log-transform the "counts" response and fit a linear model. Let's try this.

## install DHARMa and emmeans first if you haven't already
library(DHARMa) ## for great diagnostic plots for GLMs and mixed models
library(emmeans) ## for all kinds of reasons that will become clear

## log-transform and fit a linear model
log_beetle_lm<-lm(log(counts) ~ Tmax + dist_to_edge, data = beetle_data)
## do we meet the assumptions of linear modelling?
qqnorm(resid(log_beetle_lm))
qqline(resid(log_beetle_lm))

## or check diagnostics using the simulateResiduals() function from the DHARMa package
plot(simulateResiduals(log_beetle_lm))

There is some curvature in the tail of the residuals but nothing too troublesome. The DHARMa plots provide a bit more useful information and also indicate that this model fit is pretty good. In some instances, log-transforming and using a linear model is ok, but for counts it is preferable to use a discrete probability distribution like the Poisson, because the counts take discrete values (not continuous values). Discrete probability distributions describe the probability of getting specific counts, such as the probability of finding SEVEN beetles, or the probability of finding FIFTEEN beetles.
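To get a feel for what a discrete probability distribution provides, here is a small sketch using dpois() (the value of lambda is an assumed example, not something estimated from our data):

## if the expected count (lambda) were, say, 10 beetles per trap, the Poisson
## distribution gives the probability of observing specific counts
dpois(7, lambda = 10)  ## probability of finding exactly SEVEN beetles
dpois(15, lambda = 10) ## probability of finding exactly FIFTEEN beetles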

Let's try a GLM with Poisson error distribution and log link function.

## the log link is the default for Poisson but it's good practice to specify it explicitly in the brackets
beetle_glm<-glm(counts ~ Tmax + dist_to_edge, family = poisson(link = log), data = beetle_data)

summary(beetle_glm)

Before we pay too much attention to the coefficient estimates in the model summary we need to check a couple of things. You will recall from the lecture that Poisson and some types of binomial GLMs are prone to overdispersion. This occurs when the Poisson or binomial distribution doesn't "mop up" all of the scatter in the residuals - there is leftover unexplained scatter that isn't being accounted for. This is problematic because the standard errors of the coefficient estimates are then calculated assuming a smaller amount of unexplained variation than there actually is. In other words, we end up with smaller SEs than we should, which results in larger test statistics (e.g. larger t or z values) and smaller p-values. We run the risk of concluding that our predictor variables have more significant relationships with our response variable than is actually the case.

If the model is not overdispersed, the residual deviance should be roughly equal to the residual degrees of freedom. What do we find?


In this case the residual deviance = 176 and the residual DF = 147 so this model is not very overdispersed, which is great!
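You can pull these numbers straight out of the fitted model object; a quick sketch of the check (a ratio much greater than 1 suggests overdispersion):

## residual deviance and residual degrees of freedom from the fitted model
deviance(beetle_glm)
df.residual(beetle_glm)
## their ratio should be roughly 1 if the Poisson is mopping up the scatter
deviance(beetle_glm) / df.residual(beetle_glm)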


DHARMa has a test for overdispersion called testDispersion()

testDispersion(simulateResiduals(beetle_glm))
## red line is not in extreme location - all good!

## What about the other DHARMa diagnostics?
plot(simulateResiduals(beetle_glm))

The diagnostics look good. It is safe to proceed with further inference from the model summary!

summary(beetle_glm)
We have a significant negative slope estimate for Tmax and a significant positive slope estimate for dist_to_edge. But what do the numbers actually mean?


Remember, we used a log link function, therefore the coefficients are in units of log(counts). So for a 1\(^\circ\)C increase in Tmax we expect a decrease of 0.131530 in log(beetle counts). For a 1 m increase in distance to edge (1 m further into the forest) we expect an increase of 0.007373 in log(beetle counts). It's often easier to understand the biology if we plot the fitted relationships.
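Because the model uses a log link, exponentiating the slopes turns them into multiplicative effects on the original count scale; a quick sketch using the fitted model from above:

## multiplicative effects on the count scale
exp(coef(beetle_glm))
## e.g. exp(-0.131530) is roughly 0.88, so each 1 degree C increase in Tmax is
## expected to multiply beetle counts by about 0.88 (roughly a 12% decrease)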


In the following plots we will use the curve() function that you have already met. We will use Tmax on the x-axis for both plots, and fit two lines - one for a beetle pitfall trap close to the forest edge (50 m) and one for a beetle pitfall trap far from the forest edge (150 m).

We can plot the fitted relationships in two ways. First, we will plot them on the log scale, where the relationships should be linear (remember, the log link function links \(\lambda_{i}\) to the linear predictor "under the hood" of a Poisson GLM, so relationships will be linear on the log scale). Second, we can back-transform the fitted relationships to the original scale of beetle counts. This will help us understand the biology most clearly.

## we might as well output this nice plot to our "Outputs" folder in the CONS7008 R project!
pdf("Outputs/Beetle_counts_plot.pdf", height = 6, width = 11)
par(mfrow=c(1,2)) ## open a plot canvas with two panels (side-by-side)
## first, plot on "modelled" scale (log link so we use log scale)
with(beetle_data, plot(counts ~ Tmax, log="y", pch = 16, col=ifelse(dist_to_edge>100, "dodgerblue4", "goldenrod")))
## curve for Tmax at 50 m from edge
curve(exp(cbind(1,x,50)%*%coef(beetle_glm)), add=T, col = "goldenrod", lwd=3)
## curve for Tmax at 150 m from edge
curve(exp(cbind(1,x,150)%*%coef(beetle_glm)), add=T, col = "dodgerblue4", lwd=3)
## add a legend
text(x=24, y = c(1.5, 2), labels = c("50 m from edge", "150 m from edge"), col = c("goldenrod", "dodgerblue4"), pos=4)
mtext(side = 3, adj = 0, line=0.75, text="(a) Log scale (inside model)", cex=1.5)

## now plot on the original scale of the counts variable
with(beetle_data, plot(counts ~ Tmax, pch = 16, col=ifelse(dist_to_edge>100, "dodgerblue4", "goldenrod")))
## curve for Tmax at 50 m from edge
curve(exp(cbind(1,x,50)%*%coef(beetle_glm)), add=T, col = "goldenrod", lwd=3)
## curve for Tmax at 150 m from edge
curve(exp(cbind(1,x,150)%*%coef(beetle_glm)), add=T, col = "dodgerblue4", lwd=3)
mtext(side = 3, adj = 0, line=0.75, text="(b) Original scale (backtransformed)", cex=1.5)
dev.off()
What did we add to the curve command in the second plot to get the line to be fitted on the original scale of beetle counts?


We included exp() before the cbind and applied the exponentiation to the calculations of the fitted lines! In other words, we applied exp() to all of this cbind(1,x,150)%*%coef(beetle_glm). This applies the back-transform (from log(counts) to counts) to the entire fitted line.
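An equivalent (and often more convenient) way to get fitted values on the count scale is predict() with type = "response", which applies the inverse link for us. A small sketch using the fitted model (new_dat is a hypothetical grid of Tmax values spanning the observed range, at 150 m from the edge):

## fitted counts across the observed range of Tmax, at 150 m from the edge
new_dat <- data.frame(Tmax = seq(min(beetle_data$Tmax), max(beetle_data$Tmax), length.out = 20),
                      dist_to_edge = 150)
head(predict(beetle_glm, newdata = new_dat, type = "response"))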


The fitted lines in the second plot (on the original scale of beetle counts) don't look parallel. Does this mean that we fitted an interaction between Tmax and dist_to_edge?


No. Look at the model specification and you will see that we did not include an interaction. The fitted lines are no longer parallel when we back-transform from the log scale to the original scale. This is common with GLMs and it makes sense when you think about it: in this case exp() exaggerates larger counts more than smaller counts.
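A tiny numerical illustration of this (made-up values on the log scale):

## a constant difference of 1 on the log scale...
exp(c(1, 2)) ## ...is a difference of about 4.7 counts here (2.7 vs 7.4)
exp(c(3, 4)) ## ...but a difference of about 34.5 counts here (20.1 vs 54.6)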



8.2 An example with proportion data

Effects of different treatments on the germination of Rhodanthe manglesii seeds
Winter annual plant species occur in regions with Mediterranean climates (where most rain falls in winter and summers are hot and dry). In Australia, this includes southwest Western Australia and parts of South Australia (these are the same climates that produce good wines in Australia!). The annuals germinate when the cool winter rains arrive, live for three or four months, produce seeds and die. The seeds persist in the soil until the next growing season. To ensure that plants emerge when conditions are favourable for growth and reproduction (and not before), winter annuals have evolved some cool mechanisms. Many are physiologically dormant and require baking in the hot summer sun to alleviate this dormancy. This is referred to as "warm after-ripening". Other species are stimulated to germinate by smoke chemicals. Being able to "unlock" seeds from their dormancy remains a key challenge for seed scientists who are banking seeds of native species for conservation.

Research Aim: To determine the best method to stimulate germination of Rhodanthe manglesii

A germination trial was conducted in growth cabinets. Variable numbers of seeds (between 11 and 17 per dish) were assigned to petri dishes with wetted filter paper. Three treatments were randomly assigned to 25 petri dishes each (75 dishes total):

  1. Freshly collected seeds with no treatment (control)
  2. Seeds exposed to warm after-ripening (AR)
  3. Seeds exposed to smoke chemicals (smoke)

Response variable: in this example it is "n_germ" (count of seeds that germinated) but we need to account for "n_seeds_in_dish" (the number of seeds assigned to each dish), which is NOT THE SAME for each dish. As you will see below, we use cbind() to create the full response variable that does everything we need.


The data are available in the file: rhodanthe_data.csv in the "Data" folder of our CONS7008 R project.

## read in rhodanthe_data
rhodanthe_data<-read.csv("Data/rhodanthe_data.csv")

Your first instinct might be to calculate the proportion of seeds that germinated in each dish and use this pre-calculated proportion as the response variable in your analysis. Let's try this.

## make a proportion variable for the response variable
rhodanthe_data$germ_prop<- rhodanthe_data$n_germ / rhodanthe_data$n_seeds_in_dish

## re-level the treatment factor so that control is the "reference" level 
rhodanthe_data$treatment <- factor(rhodanthe_data$treatment , levels = c("control","AR", "smoke"))

## fit a linear model
rhodanthe_lm<-lm(germ_prop ~ treatment, data = rhodanthe_data)

## look at diagnostics using DHARMa
plot(simulateResiduals(rhodanthe_lm))

DHARMa tells us that the treatments differ in their variances and that this is problematic if we want to use a linear model.

There are at least two other problems with this approach. Do you know what they are?


First, we are fitting a linear model to a response variable that is bound between zero and one. This isn't ideal! Second, pre-calculating a proportion ignores the fact that some petri dishes had more seeds than others. We are more confident about a proportion that is calculated from 17 seeds than a proportion calculated from 11 seeds and we need to account for this in the model.


We can use a more appropriate response variable. First create a variable that is the number of seeds per dish that didn't germinate. This is simply the total number of seeds per dish minus the number of seeds that germinated! Then we can specify the model.

rhodanthe_data$n_no_germ <- rhodanthe_data$n_seeds_in_dish - rhodanthe_data$n_germ

## this time, the response variable isn't just a single variable!
## it's a two-column combination of n_germ and n_no_germ (successes and failures)!!!
rhodanthe_glm<-glm(cbind(n_germ, n_no_germ) ~ treatment, family="binomial", data = rhodanthe_data)

summary(rhodanthe_glm)

In this case the residual deviance = 471 and the residual DF = 72 so this model is VERY overdispersed!

What does DHARMa's testDispersion() say?

testDispersion(simulateResiduals(rhodanthe_glm))
## the red line is in extreme location!
## what should we do???

We have two options:

  1. use a quasi-binomial error distribution
  2. use a mixed-effects model and include an "observation-level" random effect

The quasibinomial model deals with overdispersion rather crudely by calculating a "dispersion parameter". The variance of each coefficient estimate is multiplied by this one parameter (so each SE is inflated by its square root), which has the effect of increasing the SEs (and therefore increasing the p-values), but it doesn't change the coefficient estimates themselves. It is often preferable to go with option 2 (mixed-effects model with observation-level random effect) because the coefficient estimates will be more precisely estimated. In option 2, we "mop up" the overdispersion using a random effect, which assumes a normal distribution. So the binomial distribution mops up what it can, and we use a normal distribution to mop up the rest. Wacko the chook!!!

## Option 1: quasibinomial
rhodanthe_glm2<-glm(cbind(n_germ, n_no_germ) ~ treatment, family="quasibinomial", data = rhodanthe_data)
summary(rhodanthe_glm2)
## this approach calculates a "dispersion parameter"
## for this model it is 6.06, indicating that there is about six times as much unexplained variation as the binomial distribution expects

## Option 2: mixed-effects model
## load lme4 package so that we can use the glmer() function
library(lme4)
## create a dish ID column (a unique number for each row of data, which correspond to unique petri dishes)
rhodanthe_data$dish_ID <- 1:nrow(rhodanthe_data)

## fit the model with "+ (1|dish_ID)" as the random effect
rhodanthe_glmer<-glmer(cbind(n_germ, n_no_germ) ~ treatment + (1|dish_ID), family="binomial", data = rhodanthe_data)

## did this work?
testDispersion(simulateResiduals(rhodanthe_glmer))
## Yes, this is a great solution!

## look at the model summary
summary(rhodanthe_glmer)

It is clear that AR and smoke are significantly different from the control, but are they different to each other? In this very common situation, the emmeans package is your friend! It stands for "estimated marginal means" and it provides an extremely flexible way to calculate different types of comparisons from fitted models. In this case we want to look at "pairwise" comparisons between treatments so we specify specs = pairwise~treatment.

emmeans(object = rhodanthe_glmer, specs = pairwise~treatment)
## The estimated differences are in logits - we will make more sense of the results below

So all treatments are different from each other! But how big are the differences on a scale that is biologically meaningful?

We can use emmeans to plot the model estimates (treatment means and confidence intervals) on the probability scale.

## emmeans to the rescue again!
plot(emmeans(rhodanthe_glmer, ~treatment, type = "response"), horizontal=F)
 ## we include type = "response" to plot on the probability scale, not the logit scale
Why are the confidence intervals on the emmeans() plot not symmetrical around each point when we use 'type = "response"'?


Remember, binomial GLMs link the probability parameter \(p_{i}\) to the linear predictor using the logit link function. So the model is linear on the scale of logits. When we back-transform to the probability scale we see some reassuring behaviour! Remember, the probability scale is bounded between zero and one. We see almost symmetrical bars for "AR" because it is near 0.5 on the probability scale (i.e. it is in the middle of the probability scale). As probabilities approach zero or one, the bars become asymmetrical to respect the bounds. We can see this especially for the smoke treatment!
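A small sketch of why this happens (the logit-scale values are made up, not taken from the model): a symmetrical interval on the logit scale becomes asymmetrical once it is squeezed through the inverse logit.

## an interval of +/- 1 around a logit of 0 (i.e. a probability of 0.5)
plogis(c(-1, 0, 1)) ## roughly symmetrical on the probability scale
## the same +/- 1 interval around a logit of -3 (a probability near zero)
plogis(c(-4, -3, -2)) ## squashed towards zero on the lower side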



8.3 An example with binary (0/1) data

Relating climate and soil to the occurrences of a vulnerable tree species
It might seem like an oxymoron, but there is such a thing as a dry rainforest. In the subtropics of Australia, these occur on fertile soils in drier parts of Southeast Queensland and northern New South Wales. Sadly, large areas of dry rainforest have been cleared for agriculture since European colonisation, leaving mostly small fragments. Another consequence of land clearing is that many plant and animal species that inhabit dry rainforests are of conservation concern. One such species is Cossinia australiana, a native member of the citrus family that persists in patches of dry rainforest from Rockhampton to Kingaroy (or "from Rocky to Kingas" as John would say). The Queensland Government is interested in knowing where to search for this species and provides you with occurrence data for 250 small patches of rainforest that they have already surveyed. They suspect that climate and soil are important factors influencing its distribution.

Research Aim: To identify if climate and soil type explain occurrences of Cossinia australiana throughout southeast QLD.

Given the importance of moisture availability in determining the climatic limits of many other rainforest tree species, you extract data on mean annual vapour-pressure deficit (how dry the atmosphere is), soil depth (m) and the geological category (basalt or granite) of each surveyed location.

Response variable: "occurrences" (binary: 1 = it was present and 0 = it was absent)


The data are available in the file: cossinia_data.csv in the "Data" folder of our CONS7008 R project.

## read in cossinia_data
cossinia_data<-read.csv("Data/cossinia_data.csv")

## look at a summary
summary(cossinia_data)

We can go ahead and fit a GLM with a binary response variable and three predictor variables. With binary responses there is no concern about overdispersion, so we do not need to check for this.

cossinia_glm<-glm(occurrences ~ soil_depth + vpd + geology, family="binomial", data = cossinia_data)

## we still need to check our diagnostics (other than overdispersion)
plot(simulateResiduals(cossinia_glm))

## safe to proceed to inference via summary
summary(cossinia_glm)
What can we infer from the model summary?
Remember, the estimates are in logits again (it is a binomial model with a logit link function). So the numbers might not make much biological sense yet. But the sign and significance of certain coefficients are very informative!
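If you want slightly more interpretable numbers straight away, exponentiating the coefficients gives odds ratios (the multiplicative change in the odds of occurrence per unit change in each predictor); a quick sketch:

## odds ratios for each predictor
exp(coef(cossinia_glm))
## values below 1 mean the odds of occurrence decrease as that predictor increases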


Like with the other GLMs we have fitted today, it helps to plot the data on the original scale (in this case the probability scale between zero and one) to get a better biological grasp of the relationships.

## John extracted the hexcodes for some colours in the Cossinia photograph...what a nerd.
cossinia_cols<-c("#469652", "#e4e782")

## Plot on the "observed" scale of the counts variable
pdf("Outputs/Cossinia_occurrences_plot.pdf", height = 6, width = 6)
with(cossinia_data, plot(jitter(occurrences, amount=0.05) ~ vpd, pch = 16, 
                         col=ifelse(geology=="basalt", cossinia_cols[1], cossinia_cols[2]),
                         ylab = "Probability of Cossinia occurrence",
                         xlab = "Mean dry-season vapour-pressure deficit (kPa)"))
## curve for Basalt
curve(plogis(cbind(1,mean(cossinia_data$soil_depth), x,0)%*%coef(cossinia_glm)), add=T, col = cossinia_cols[1], lwd=3)
## curve for Granite
curve(plogis(cbind(1,mean(cossinia_data$soil_depth), x,1)%*%coef(cossinia_glm)), add=T, col = cossinia_cols[2], lwd=3)
## add a legend
text(x=0.2, y = c(0.15, 0.225), labels = c("Basalt", "Granite"), col = cossinia_cols, pos=4)
dev.off()
Look at the figure we made. Where would you recommend the Queensland Government prioritises their searches for this species?
Look at the plot (in Outputs folder). You tell me!


What is the jitter function doing in the plot command?
Look at the plot (in Outputs folder). See if you can work it out.


Why did we include mean(cossinia_data$soil_depth) in the curve command?
Think back to multiple regression. We use the same approach when plotting relationships from any model with multiple continuous predictors.

What do the zero and one do after 'x' in the two curve commands?
Think back to multiple regression. We use the same approach when plotting relationships from any model with continuous AND categorical predictors.



8.4 List of useful functions

Here are a few useful functions that you learned today.

Conditional arguments
ifelse()

Statistical analysis
DHARMa::simulateResiduals(); DHARMa::testDispersion(); emmeans::emmeans(); lme4::glmer()

Plotting
curve(); jitter(); plot(emmeans())