Practical 13 Multiple Regression Revisited

In this practical, we build on what you learned in previous weeks of the course, combining multiple regression and PCA after revisiting basic summaries of multivariate data.

You have learned that correlated predictor variables are a problem for multiple regression analysis. PCA provides one way to deal with this problem, rotating the data to create a new set of predictor variables (scores along the PC axes) that are uncorrelated. The new set of variables can replace the old set in regression analysis.

In addition to solving the problem of multicollinearity among predictors, PCA may provide an additional benefit of reducing the number of predictor variables that we need to fit in the model ("freeing" some degrees of freedom, and simplifying the model). As you saw in the PCA prac, the new set of predictor variables (e.g., PC1) may have clear interpretations (e.g., they might represent "size"), which helps to provide clear interpretations in multiple regression.


13.1 Swimming performance in zebrafish

We return to the zebrafish.

Aim: To evaluate whether swimming speed ('SwimSpeed') can be explained by fish morphology, specifically:

  1. fish length ('TotalLength');
  2. tail depth ('TailDepth');
  3. body depth ('BodyDepth'); and/or
  4. head length ('HeadLength').

The data are available in the file: zebrafish.csv. Import the data file.
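
For example (assuming the file is in your working directory; adjust the path if not):

zebrafish <- read.csv("zebrafish.csv")  ## Import the data
head(zebrafish)                         ## Check the first few rows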

Recall that we found that male and female zebrafish were a bit different in terms of their size and shape. For simplicity, let's start by analysing just the females. First, subset the data.

zebrafishF <- zebrafish[zebrafish$Sex == "F", ]  ## Only keep rows where Sex == "F"

Let's start with a basic regression analysis, testing the overall effect of the morphology variables on swimming speed (an omnibus test):

model1 <- with(zebrafishF, lm(SwimSpeed ~ TotalLength + TailDepth + BodyDepth + HeadLength))  ## Full model
lm_empty <- with(zebrafishF, lm(SwimSpeed ~ 1))  ## Null model
qqnorm(model1$residuals) ## View qq plot to check assumption of Gaussian residuals
qqline(model1$residuals) 
plot(model1$fitted.values, model1$residuals)   ## Check the diagnostics for homogeneity of variance in residuals
anova(lm_empty, model1)        ## Look at the significance test

Does fish morphology significantly explain swimming speed?
Yes; the F-statistic is very large and the P-value very small. Remember: our omnibus test is asking whether any of these variables accounts for variation in swimming speed.


Are you confident about the accuracy of the P-value?
Yes; the diagnostic plots look good, with the residuals very close to the quantiles of the normal distribution, and the residuals look homogeneous across the range of fitted values.


Having discovered that fish morphology is associated significantly with swimming speed, can we delve deeper into the details to understand how? Do all four variables contribute equally to this effect on swimming speed, or do some variables contribute more than others?

Unfortunately, as you have seen previously (and will explore further below), these morphology variables are all highly correlated with one another. Collinearity (correlation) among the predictor variables means that we cannot be confident in interpreting the regression coefficients if we analyse all predictors in a multiple regression model.

As an example, run the following code, which puts 'TotalLength' last in the regression model, and then compare the results to what you observed in model1 (where 'TotalLength' was the first variable).

model2 <- with(zebrafishF, lm(SwimSpeed ~ TailDepth + BodyDepth + HeadLength + TotalLength))  ## Full model, different order to model1
anova(model1)  ## Look at the significance tests for model1
anova(model2)  ## Look at the significance tests for model2


What effect does 'TotalLength' have on swimming speed?
Who knows!? The two ANOVA tables are not the same and do not agree about the significance of the effect of 'TotalLength' on 'SwimSpeed'; 'TotalLength' is highly significant in the first model (where it was the first predictor), but not significant in the second model (where it was the last predictor).


If you wanted to (you shouldn't!), you could probably make any variable significant or not simply by changing the input order. That is very unsatisfactory and would leave your research wide open to criticism.

We can use the "variance inflation factor" ('vif') to evaluate the magnitude of this collinearity problem:

library(car)  ## Load the package that has the vif function and the Anova function
vif(model1)   ## Calculate the vif to measure collinearity among variables

Are any of these 'vifs' cause for concern?


Yes; two are much greater than 10, and another is close to 10.
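
If you are curious where these values come from, a VIF is 1 / (1 - R^2), where R^2 comes from regressing one predictor on all of the others. A by-hand sketch for 'TotalLength' (this should match the corresponding value from vif(model1)):

r2 <- summary(with(zebrafishF,
                   lm(TotalLength ~ TailDepth + BodyDepth + HeadLength)))$r.squared  ## R-squared from regressing 'TotalLength' on the other predictors
1 / (1 - r2)  ## The variance inflation factor for 'TotalLength'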


One option for dealing with collinearity among explanatory variables is to use type II sum-of-squares (via the 'Anova' function from the 'car' package, loaded above) instead of the default type I sum-of-squares.

Anova(model1, type = "II")

Based on this type II ANOVA table, does fish morphology significantly affect swimming speed?
No. All variables have a non-significant effect on 'SwimSpeed', with relatively small F-values and P-values > 0.05.

Do the results from the type II sum-of-squares agree with the results you obtained above from the omnibus test?


Not at all; the omnibus test suggested that the effect of morphology on swimming speed was highly significant, but none of the variables here are significantly associated with swimming speed.


The drawback in using type II sum-of-squares is that the ANOVA table does not include all of the information available in the data. Specifically, it excludes information about covariance among the explanatory variables: each sum-of-squares is what is "left over" after removal of all covariance.

In the case of the fish, the morphological variables share most of their information, and it seems that the "interesting" information about swimming speed is contained in the correlations among the morphological variables. Type II sum-of-squares misses most of the story!

How can we resolve this contradiction? PCA!


13.2 Variances, covariances, and correlations

Before we perform a PCA and re-analyse our data, let's first refresh our memory about associations among variables.

13.2.1 Variances/covariances

The variance-covariance matrix contains all the information needed for PCA. You can calculate individual variances, and covariances among variables, as follows:

var(zebrafishF$TailDepth)               ## Calculate the variance for 'TailDepth'
cov(zebrafishF$TailDepth, zebrafishF$TotalLength)  ## Calculate the covariance between 'TailDepth' and 'TotalLength'


...but it's easier to let R do the work (just as we did in the previous practical).

FishTraits <- with(zebrafishF, cbind(TotalLength, TailDepth, BodyDepth, HeadLength))  ## Bind the data together by column
COVmatrix <- var(FishTraits)  ## Calculate the covariance matrix
round(COVmatrix, 3)           ## Observe the covariance matrix
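
As an optional sanity check, an individual entry of the matrix should match the direct calculation from before:

COVmatrix["TailDepth", "TotalLength"]              ## Entry from the covariance matrix
cov(zebrafishF$TailDepth, zebrafishF$TotalLength)  ## The same covariance, calculated directly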


What is COVmatrix[3, 3]?
The variance associated with 'BodyDepth'; 2.179.


Which variables have the greatest and least variance, respectively?
COVmatrix[1, 1] = 18.7; COVmatrix[2, 2] = 0.342; 'TotalLength' and 'TailDepth'.


13.2.2 Correlation coefficients

The diagonals of 'COVmatrix' are not identical; that is, different variables have different variances. This makes it difficult to fully appreciate the association between two variables. A covariance of +2 could describe a large OR a small association between two variables, depending on the magnitude of the variances of each variable.

To resolve this, and assess how strongly associated these variables are, let's convert the covariance matrix, 'COVmatrix', to a correlation matrix using the function: cov2cor.

CORmatrix <- cov2cor(COVmatrix)  ## Convert the covariance matrix to a correlation matrix
round(CORmatrix,3)   ## Observe the correlation matrix
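
If you want to see where a correlation coefficient comes from, it is simply the covariance scaled by the product of the two standard deviations. A by-hand sketch for one pair of variables (this should match the corresponding entry of 'CORmatrix'):

cov(zebrafishF$TailDepth, zebrafishF$TotalLength) /
  (sd(zebrafishF$TailDepth) * sd(zebrafishF$TotalLength))  ## r = cov(x, y) / (sd(x) * sd(y))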


13.2.3 Scatter plots

Plot all of the morphology variables against one another.

pairs(FishTraits)  ## Observe all pairwise plots

Compare the correlation coefficients to the scatter plots.

Do the correlation coefficients corroborate what you see in the pairwise scatter plots?
Yes; the variables all have high correlation coefficients, close to +1, and the data are tightly clustered around a positive trend line for each pair of variables. Data for 'TotalLength' and 'HeadLength' are the most tightly clustered in the plot, and have the strongest correlation coefficient in the table.

Based on what you have seen so far, can you predict the shape of the 'barplot' of eigenvalues?
The first eigenvalue will be very large and explain almost all of the variance.

What does a covariance matrix tell you that a correlation matrix does not?

  1. The linear relationship among variables.
  2. The amount of variance in individual variables.
  3. The strength of the association among variables.
  4. The sign of the relationship among variables.
  5. Nothing, they are exactly the same thing, but one is scaled.

2. In a correlation matrix, all variances are standardised - variables are scaled so that their variances (the leading diagonal) are all equal to +1. In a covariance matrix, the diagonal elements are the estimates of the variance of each variable. Hence, covariance matrices show us the relative magnitudes of the variances of each variable. By contrast, correlation matrices are useful because they allow us to intuitively understand the strength of associations among variables.



13.3 Eigenvectors and eigenvalues

The following command extracts eigenvalues and eigenvectors:

PCA <- eigen(COVmatrix)  ## Eigen analysis of the covariance matrix for 'FishTraits'
round(PCA$values,3)   ## Observe the eigenvalues, showing only 3 decimal places (round) to make it easier to read
round(PCA$vectors,3)  ## Observe the eigenvectors
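
To check your earlier prediction, you could also draw the barplot of eigenvalues (the bar names and axis label here are our choices):

barplot(PCA$values / sum(PCA$values),               ## Proportion of variance per PC
        names.arg = c("PC1", "PC2", "PC3", "PC4"),  ## Label the bars
        ylab = "Proportion of variance explained")  ## Label the Y-axis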


What does the first eigenvalue tell you?
That most of the variance in fish morphology (98%) is associated with the first eigenvector: a single variable (PC1) would capture most of the "story".

Which eigenvalues are larger than the variances of the original variables?
Only the 1st eigenvalue is larger than the variances of most of the original variables, while the 2nd eigenvalue is just larger than the smallest of the original variances (0.342). This tells us that the rotation has effectively concentrated most of the original variance onto a single axis (the first eigenvector).

What does the first eigenvector tell you?
All traits load positively onto the first eigenvector (i.e. a positive association among all variables). The first and fourth traits have the highest loadings; these traits are 'TotalLength' and 'HeadLength'. Hence, we could think of 'PC1' as mostly being related to 'body size'.

The correlation between 'TotalLength' and 'HeadLength' is almost equal to +1; why is the loading bigger for 'TotalLength'?
The variance for 'TotalLength' is much larger than the variance for 'HeadLength'. Variance is often related to the mean of a variable: a variable with a larger mean, such as 'TotalLength' here, tends to have a larger variance than a variable with a smaller mean.
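
As an optional aside, if you wanted loadings that are not dominated by these differences in variance, you could repeat the eigen analysis on the correlation matrix (i.e., with every variable standardised to variance 1). A minimal sketch, using 'CORmatrix' from above:

PCAcor <- eigen(CORmatrix)     ## Eigen analysis of the correlation matrix instead
round(PCAcor$vectors[, 1], 3)  ## First eigenvector: loadings should now be more even across traits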



13.4 PC scores

We will now use the eigenvectors to construct a new set of uncorrelated variables, and then plot these variables (PC scores) against one another to confirm that they are uncorrelated:

PC_scores <- FishTraits %*% PCA$vectors  ## Calculate the PC scores for each fish
PC_scores <- as.data.frame(PC_scores)    ## Make it a data frame
names(PC_scores) <- c("PC1", "PC2", "PC3", "PC4")  ## Name the columns

pairs(PC_scores)          ## Plot the PC scores against one another
round(cov(PC_scores), 4)  ## Observe the covariance matrix for the PC scores


What do you notice about each pairwise scatter plot?
There is no association among variables.


What do you notice about the covariance matrix of 'PC_scores'?
The covariances are all equal to zero, and the diagonals are all equal to their respective eigenvalue.
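
You can confirm this numerically with a quick check (the differences should all be approximately zero):

round(diag(cov(PC_scores)) - PCA$values, 6)  ## Diagonal of the PC-score covariance matrix minus the eigenvalues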


Clearly, the PC scores have statistical advantages over the original four-trait set. We can now re-analyse our data using PC scores safe from any problem of collinearity. Remember, these are exactly the same data as we started with, just read differently because we rotated the data (or axes).

Building on what you know about eigenvector loadings, what kind of fish would have a large PC1 score?


Depending on whether the eigenvector loadings were all positive or negative, a large PC1 score would indicate that an individual was very long (if the loadings were positive) or very short (if the loadings were all negative).



13.5 Regression analysis!

After that refresher, we are now in a position to analyse the PC scores using regression analysis.

Let's regress swimming speed against the four PC scores (and check diagnostics):

model3 <- lm(zebrafishF$SwimSpeed ~ PC_scores$PC1 + PC_scores$PC2 + PC_scores$PC3 + PC_scores$PC4)
qqnorm(model3$residuals) ## View qq plot to check assumption of Gaussian residuals
qqline(model3$residuals) 
plot(model3$fitted.values, model3$residuals)   ## Check the diagnostics for homogeneity of variance in residuals
vif(model3)

Is there still a problem with collinearity among these four explanatory variables?
No; the 'vif' values are all equal to exactly +1.

Are the diagnostic plots OK?
As before, the diagnostic plots are not too bad.


Now look at the ANOVA table and judge whether the PC traits significantly affect swimming speed. Also, while you're at it, change the order of the explanatory variables and determine what effect that has on your interpretations:

anova(model3)  ## Statistical test for model3

model4 <- lm(zebrafishF$SwimSpeed ~ PC_scores$PC4 + PC_scores$PC2 + PC_scores$PC3 + PC_scores$PC1)  ## Change the order of the variables
anova(model4)  ## Statistical test for model4

What effect did changing the order of the explanatory variables have on your conclusions?
None! Because the PCs are uncorrelated, type I sum-of-squares will produce exactly the same results irrespective of the order of variables in your model.

What can you conclude about the association between PC scores and swimming speed?


'PC1' is highly significantly associated with 'SwimSpeed'.


In order to help you visualise this result, the following code will produce a scatter plot of 'SwimSpeed' against 'PC1'.

plot(PC_scores$PC1, zebrafishF$SwimSpeed)    ## Plot SwimSpeed against PC1
abline(model3$coefficients[1:2])  ## Add the regression line from the model

How much of the total relationship between fish morphology and swimming speed is represented by this single plot?
98%! Remember: PC1 captured 98% of the variance in the four morphological variables, so this single plot represents almost all of the morphological information in the data.

Does 'PC1' account for all of the variation in 'SwimSpeed'?
No; there is still a lot of residual variance (scatter of dots) around the regression line.


Although it is clear that 'PC1' and 'SwimSpeed' are related, how do we interpret the results in the context of the underlying biology?


13.6 Interpreting outputs from regression analyses of PC scores

Interpreting eigenvector loadings and PC scores is not easy.

In the last practical, we simulated data for two variables that were positively correlated. The first eigenvector had coefficients that were either (approximately) [0.7, 0.7], or [-0.7, -0.7].

Recall that these loadings represent the same eigenvector.

However, we need to be careful when interpreting PC scores; the signs of the eigenvector coefficients (and hence of the PC scores) do matter when PC scores are used as predictor or response variables in subsequent analyses.

For example, if we calculated PC scores using the eigenvector, [0.7, 0.7], then large, positive PC1 scores would mean that an individual had LARGE values of both variables. However, if we used the eigenvector, [-0.7, -0.7], to calculate PC scores, then large positive values of PC1 would mean that an individual had SMALL values of both variables.

The results of any analyses that you perform on PC scores will be identical whether you use [0.7, 0.7] or [-0.7, -0.7] to calculate the PC scores, but, if you are not careful, your interpretation could be exactly opposite to the truth!
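
Here is a minimal sketch of that sign flip, using the female fish and the PCA objects from above (the object names are our choices):

PC1_flipped <- FishTraits %*% (-PCA$vectors[, 1])     ## PC1 scores from the sign-flipped eigenvector
mod_orig <- lm(zebrafishF$SwimSpeed ~ PC_scores$PC1)  ## Regression on the original PC1 scores
mod_flip <- lm(zebrafishF$SwimSpeed ~ PC1_flipped)    ## Regression on the flipped PC1 scores
coef(mod_orig)                                 ## The slope has one sign here...
coef(mod_flip)                                 ## ...and the opposite sign here...
all.equal(fitted(mod_orig), fitted(mod_flip))  ## ...but the fitted values are identical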

Question: Did your PC1 vector loadings have positive or negative signs? Did your regression slope for PC1 have a positive or negative sign? Recalling that we interpreted PC1 as capturing variation in fish length, do "longer" or "shorter" fish swim fastest? Lock in your answer...

One simple way to check your interpretation is to plot principal component scores against a variable that loads strongly (i.e., has a high coefficient) onto the relevant eigenvector.

For example, plot 'PC1' against 'TotalLength':

plot(PC_scores$PC1 ~ zebrafishF$TotalLength)


What do large values of 'PC1' represent? What does that mean for your interpretation of the effect of 'PC1' on 'SwimSpeed'?


For the PCA your teaching team undertook, all loadings on PC1 were POSITIVE, and PC1 therefore had a POSITIVE relationship with 'TotalLength'. Therefore, for us, larger values of 'PC1' represent larger body sizes. We saw a NEGATIVE effect of 'PC1' on 'SwimSpeed'. Hence, we concluded that larger fish swim more slowly, and you should have, too.

If your loadings on PC1 were all NEGATIVE, and 'TotalLength' had a negative association with PC1... you should STILL have arrived at the same conclusion as us, because your regression coefficient of 'SwimSpeed' on PC1 would have been POSITIVE. That is, the biology of the effect of morphology on 'SwimSpeed' does not change, but how we see it can flip depending on the direction (relative to the 4-dimensional centroid) in which the PC1 vector was defined.


If you answered that fish size was positively related to swimming speed (i.e., larger fish swim faster), then you would have been exactly...wrong! But now you are armed with the tools to make the correct interpretations.


13.7 Regression of PCs versus individual variables

If only PC1 explains variation in swimming speed (with a very large F-ratio), and PC1 is essentially fish length, then wouldn't it be simplest to use just ONE of the measured morphological variables in the regression to resolve the collinearity problem? Yes; in this case, it probably would, because this is an extreme case of collinearity (remember: the correlation between 'TotalLength' and 'HeadLength' was 0.976, so 1 minus the coefficient of determination tells us that less than 5% of their variation was independent). In most situations, however, even when PC1 is an axis of overall size, it does a better job (it more robustly captures differences among objects) than any of the individually measured variables, and so more variance in the response is often accounted for by the PC than by any one of the original variables.
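
If you want to compare the two options for yourself, one quick check (using the female data and PC scores from above) is the R^2 from each simple regression:

summary(lm(zebrafishF$SwimSpeed ~ PC_scores$PC1))$r.squared        ## Variance in 'SwimSpeed' explained by PC1
summary(lm(SwimSpeed ~ TotalLength, data = zebrafishF))$r.squared  ## ...and by 'TotalLength' alone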


13.8 Sex and zebrafish swimming speed

So far we have only analysed the female data. Let's add the male data back in and run another PCA.

Re-load the data, zebrafish.csv.

Perform a PCA:

COVmatrix <- cov(zebrafish[, 6:9])  ## Calculate the covariance matrix
PCA <- eigen(COVmatrix)             ## Perform an eigen analysis
PC_scores <- as.matrix(zebrafish[, 6:9]) %*% PCA$vectors  ## Calculate the PC scores
PC_scores <- as.data.frame(PC_scores)    ## Make it a data frame
names(PC_scores) <- c("PC1", "PC2", "PC3", "PC4")  ## Change the column names

PCA$values/sum(PCA$values)  ## Look at the proportion of variance that each PC eigenvalue explains


As before, PC1 explains most of the variation in the measured fish morphology variables: An analysis of PC1 would tell us 96% of the story about the morphology that has been measured.

Run a regression analysis for the effects of 'PC1' and 'Sex' on 'SwimSpeed':

model5 <- lm(zebrafish$SwimSpeed ~ PC_scores$PC1 + zebrafish$Sex)  ## Fit the model
qqnorm(model5$residuals) ## View qq plot to check assumption of Gaussian residuals
qqline(model5$residuals) 
plot(model5$fitted.values, model5$residuals)   ## Check the diagnostics for homogeneity of variance in residuals
anova(model5)  ## Observe the ANOVA table


The effect of 'Sex' is marginally significant. Let's visualise.

FemaleRows <- zebrafish$Sex == "F"  ## Store the rows that have female data
MaleRows <- zebrafish$Sex == "M"    ## Store the rows that have male data

plot(PC_scores$PC1, zebrafish$SwimSpeed,                              ## Set up the plot
     type = "n",                                  ## Empty plot
     xlab = "First principal component (size)",   ## Label the X-axis
     ylab = "Swimming speed")                     ## Label the Yaxis

points(PC_scores$PC1[MaleRows], zebrafish$SwimSpeed[MaleRows], pch = 16, col="chocolate")                ## Plot the male data
points(PC_scores$PC1[FemaleRows], zebrafish$SwimSpeed[FemaleRows])  ## Plot the female data
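
You might also add a legend so the two sexes are easy to tell apart (the position and labels here are our choices):

legend("topright",                      ## Position the legend (arbitrary choice)
       legend = c("Males", "Females"),  ## Labels for the two groups
       pch = c(16, 1),                  ## Match the plotting symbols used above
       col = c("chocolate", "black"))   ## Match the colours used above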


Clearly, the relationship between 'SwimSpeed' and 'PC1' is somewhat complex. For females, there is a negative, linear relationship between the two variables. However, for males, it does not appear that there is any relationship between the two variables. That is to say, the effect of 'PC1' on 'SwimSpeed' depends on which sex you consider: There appears to be an interaction between 'Sex' and 'PC1'.

Refit the model with an interaction term included:

model6 <- lm(zebrafish$SwimSpeed ~ PC_scores$PC1 * zebrafish$Sex)  ## Fit the model with an interaction term
qqnorm(model6$residuals) ## View qq plot to check assumption of Gaussian residuals
qqline(model6$residuals) 
plot(model6$fitted.values, model6$residuals)   ## Check the diagnostics for homogeneity of variance in residuals
anova(model6)  ## Observe the ANOVA table


These residual plots are not particularly attractive. If you plot histograms of the fitted values for each sex (one way is sketched below), you can see that the male fitted values are "bunched up" relative to the female fitted values.
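
For example (the stacked layout is an arbitrary choice):

par(mfrow = c(2, 1))                                  ## Stack the two histograms
hist(model6$fitted.values[MaleRows], main = "Males")  ## Fitted values for males
hist(model6$fitted.values[FemaleRows], main = "Females")  ## Fitted values for females
par(mfrow = c(1, 1))                                  ## Reset the layout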

Are there any other problems with the data analysis?


Yes; the average PC scores for males and females differ. In other words, there is a correlation between 'Sex' and 'PC1'. You can see this in our plot of 'PC1' against swimming speed, where females tend to have larger values of 'PC1' and males tend to have lower values of 'PC1'. You also saw this in your PBL last week, in Q5 of the zebrafish case study.


Change the order of the variables in your model; does it affect your conclusions?
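
For example, you could refit the model with 'Sex' entered first (we have called the refit 'model7'):

model7 <- lm(zebrafish$SwimSpeed ~ zebrafish$Sex * PC_scores$PC1)  ## Same model, 'Sex' first
anova(model7)  ## Compare this ANOVA table with the one for model6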


The F- and P-values change for 'Sex' and 'PC1', which is not surprising given their correlation. However, the statistics for the interaction term do not change. We should only interpret the interaction term when drawing our conclusions anyway: we cannot conclude whether males or females swim faster; it depends on their value of 'PC1'.


We will not go any further with this analysis. This exercise was simply to show that you can use principal components just like any other variable, and that they are subject to the same rigour as any other variable.


13.9 List of useful functions

Here are a few useful functions that you learned today.

Multivariate normal (Gaussian) number generator
mvrnorm

Eigen analysis/principal components analysis
eigen; princomp

Covariance/correlation matrix
var; cov; cor; cov2cor