Practical 9 Data visualisation

In this practical, you will cover some essential elements of presenting data graphically. The emphasis is on learning base R code to enable you customise graphs to suit your needs.

All R graphics functions come with default settings that you can override to achieve your own design. In other words, graphs can be customised to suit the style of the particular assignment, magazine, or journal. So while the code to make plots may seem cumbersome, once you know the basics you can tailor your graphics for any situation.

In this prac we will NOT cover ggplot2 which comes with the tidyverse ensemble of packages. ggplot2 uses syntax that represents the grammar of graphics, which is completely different syntax to base R. Students sometimes move straight to ggplot2 without getting a good grasp on base R graphics. We recommend learning base R first, and then moving to ggplot2 so that you get the best of both worlds. There are many great guides to using ggplot2 on the Internet for you to follow in the future.

Finally, in this prac we will ask you to make some plots yourself using a built-in dataset in R called iris. Because it is built-in, you don't need to read in the dataset - R already knows what it is. This rather famous dataset includes measurements on the flowers of three different species of iris (click the link above to learn more).

9.1 Histograms

You should all be familiar with histograms by now. A histogram summarises a quantitative variable. It displays the sample frequency distribution of the variable.

Histograms visually summarize discrete (e.g. counts) or continuous (e.g. tree diameter) variables by calculating how many observations fall within different intervals (bins) of the data. So far we have used them to examine the distributions of model residuals when checking if our linear models meet the assumption of "normally distributed errors". To demonstrate how to make histograms in R we will first use data on fertilisation success.

Fertilisation data An experiment was designed to test the effects of copper concentration ($mg/L$) and sperm concentration on fertilisation success in the free-spawning marine invertebrate, Galeolaria caespitosa. The data are available in the file: preloaded_data_1.csv in the "Data" folder of our CONS7008 R project.

preloaded_data_1 <- read.csv("Data/preloaded_data_1.csv", header=TRUE)

Raw data are the number of eggs fertilized per 100. There are two additional variables, copper_concentration and sperm_concentration, have a look at these (one way is the head function) to familiarise yourself with the data.

The following code displays the histogram of fertilisation success in our sample.

hist(preloaded_data_1$fertilisation_success)   ## Basic (default) histogram

What do the various vertical rectangles represent?

The width is the interval ("bin") of a variable.
The height is how often the variable fell within that interval.

9.1.1 Extracting data from graphics commands

Most R graphics functions allow you to extract information and statistics about the image. This can be a useful source of data for customising the image:

H <- hist(preloaded_data_1$fertilisation_success)  ## Save information about the histogram 
names(H)                          ## Look at the elements of 'H' 
H$mids                            ## View the contents of an element

What are the exact heights of the rectangles in the histogram?

H$counts

9.1.2 Break points

The break points along the horizontal axis that define the beginning and end of the rectangles are called "breaks", and the spaces between them are called "bins". Setting the breaks is the same as setting the bin-width. It can have a dramatic effect upon the appearance of the histogram.

hist(preloaded_data_1$fertilisation_success, breaks = seq(0, 100, by = 20)) ## Fewer bins
## What did seq do? Try intervals of 10:
seq(0, 100, by = 10)  

## Try different sized bins!
hist(preloaded_data_1$fertilisation_success, breaks = c(0, 40, 60, 70, 80, 90, 100), col = "pink") 

## Play with colour
hist(preloaded_data_1$fertilisation_success, col = rainbow(10), border = "blue")

NB: For a full list of colours, type colors(). The hist function recognises arguments and parameters common across many graphics functions; some are found by typing ?hist, and others by typing ?par. NB: You can also use what are called hex codes; for example, col="#000000" would give all black bars, and col=c("#000000", "#H93240") would be a vector of two colours. You can check out coolors.co to easily find hexcodes for nice colour schemes. Similarly, you can extract hexcodes from images using Adobe color.

9.1.3 Labels

The following code gives examples of how you can customise the labelling:

hist(preloaded_data_1$fertilisation_success, main = "My histogram", xlab = "Fertilisation Success", ylab = "Frequency", col = "grey")

9.1.4 Overriding the defaults

You can achieve greater control over the final appearance of the graph by using various other functions that print over an existing graph. You can turn off the default settings, and then run custom code as desired. This gives you near total control over the appearance.

The following code uses this approach to customise the text on the same histogram:

## First, re-draw the histogram but remove some default settings
hist(preloaded_data_1$fertilisation_success, ylab = "", xlab = "", xaxt = "n", main = "")

## Add text at the XY coordinates: -15, 23
text(-15, 21, "Distribution of fertilization success in" ~ italic("Galeolaria caespitosa"), pos = 4, xpd = TRUE, cex = 1.3)

## Add text to the horizontal axis (side 1)
mtext("Fertilization success", side = 1, line = 2.4, cex = 1.1)

## Position some text above the vertical axis
text(-3.0, 15.5, "Frequency of \n fertilization \n success", pos = 3, xpd = T, cex = 1.1)

## Define locations for the x-axis
xloc <- seq(0, 100, by = 10)

## Print values along the x-axis at those locations
axis(side = 1, at = xloc + 5, labels = xloc, cex = 1.0, cex.axis = 1.2)

NB: text() places text anywhere within the graph while mtext( ) places text in one of the four margins.

What does "cex" mean and what does it do?

"Character extension"; expand the scale (size) of something.

9.1.5 Adding a curve to a graph

The lines function draws a straight line between two points. If points are close together, and there are a lot of them, you can use curve to create...curves!!!

The following code overlays a normal distribution with the same mean and standard deviation as the sample.

## First, calculate the mean and standard deviation of the 'fertilisation_success' variable
## Calculate the sample mean and store it as "av"
av <- mean(preloaded_data_1$fertilisation_success)
## Calculate the sample standard deviation and store it as "stdev"
stdev <- sd(preloaded_data_1$fertilisation_success)   
## Define the range and "steps" of curve
## for this we go from the smallest to the largest value in 40 even increments using "seq()"
x <- seq(min(preloaded_data_1$fertilisation_success), max(preloaded_data_1$fertilisation_success), length = 40) 

## Now, define the histogram again and plot it
H <- hist(preloaded_data_1$fertilisation_success)
## Calculate height (density) of the normal curve at each 'x' using "dnorm()"
y_dens <- dnorm(x, av, stdev)

## This next bit is tricky
## We need to re-scale the density values so that they are on the same scale as the histogram's Y-axis
## Which is counts of 'fertilisation_success' within bins
## so we need to multiply the densities by the bin width (10) and then by the total number of observations (75)
y_freq <- y_dens * diff(H$mids[1:2]) * nrow(preloaded_data_1) 

## Plot the curve
lines(x, y_freq, col = "red", lwd = 2)

9.1.6 Multiple graphs

Often, efficient and informative data presentation of complex information is best achieved by presenting multiple graphs together as panels in a single figure (more on that here). You can put more than one figure on a page in the following way:

par(mfrow = c(2, 1))  ## Sets up space for two plots side-by-side (2 rows, 1 column)
bins <- seq(0, 100, by = 20)
## first the histogram for low copper concentration
low_copper_subset <- preloaded_data_1[preloaded_data_1$copper_concentration == 0, ]
hist(low_copper_subset $fertilisation_success, breaks = bins, main = "Copper Concentration = 0 mg/L")

## now high copper concentration
high_copper_subset <- preloaded_data_1[preloaded_data_1$copper_concentration == 50, ]
hist(high_copper_subset$fertilisation_success, breaks = bins, main = "Copper Concentration = 50 mg/L")

dev.off()    ## Reset default

NB: if running the code returns the following error: Error in plot.new() : figure margins too large, then expand your plot window size

The code in these examples can be adapted to any R graphic, and not just to hist. Feel free to experiment, and apply them to other graphics functions that follow below.

How would you arrange to have a 3x2 array of graphs (6 graphs) on a page?

par(mfrow = c(3, 2))

Try it on your own: make two histograms, one of Sepal Length and one of Petal Width, in the iris dataset. Change the bin widths, the x- and y-axis titles, and customize the colours / border colours for one or both histograms. Then make a two-panelled figure that shows the two histograms side-by-side

you should be able to do this given what you've learned above!

9.2 Bar graphs

Bar graphs display the value of a statistic (on the Y-axis) across the levels a qualitative variable, i.e., one consisting of categorical categories (e.g., "Male" versus "Female", or "A" versus "B" versus "C", etc). The value can be anything; most commonly it is a mean or a count. The statistic is represented by the height of a "bar" placed over the name of the category.

Bar graphs are effective when you want to compare the magnitude of values among different groups, such as the count of individuals in each species of a plant community. In this example, each species name would be on the x-axis and count would be on the y-axis. Bar graphs are also effective for visually emphasizing change over time, for example, the number of plants in a community from year to year. In this example, year would be on the x-axis and the count would be on the y-axis.

The standard R function for making a bar plot is conveniently called barplot. The following code plots the mean "Fertilization Success" across the two levels of "Sperm concentration", and adds a 95% confidence interval to each mean.

## 1. Calculate the means for each group
## Split data into groups
groups <- split(preloaded_data_1$fertilisation_success, preloaded_data_1$sperm_concentration) 

## Calculate the mean for each group
means <- sapply(groups, mean)                                

## 2. Plotting the means
my.barplot <- barplot(means,          ## Set heights of bars
beside = TRUE,                        ## Put bars side-by-side
names.arg = c("High", "Low"),         ## Labels for the X-axis
ylab = "mean Fertilization Success",  ## Title for the Y-axis
xlab = "Sperm concentration",         ## Title for the X-axis
col = c("grey60", "grey90"),          ## Set colours for the bars
ylim = c(0, 90))                      ## Make the Y-axis go to 80

## 3. Add confidence intervals
## Calculate a variance within each group
var <- tapply(preloaded_data_1$fertilisation_success, preloaded_data_1$sperm_concentration, var)

## Calculate a sample size within each group
n <- tapply(preloaded_data_1$fertilisation_success, preloaded_data_1$sperm_concentration, length) 

## Convert variance to standard error, within each group
se <- sqrt(var/n) 

## Draw each error bar as a "double headed arrow" that has a 90 degree angle between "head" and "stem"
arrows(my.barplot, means - 1.96 * se, my.barplot, means + 1.96 * se, code = 3, angle = 90)

Try it on your own: make a bar chart that shows the number of individuals sampled in each species in the iris dataset

hint: use the table() function to get counts of each level in iris$Species

9.3 Scatter plots

Scatter plots are useful to visualise the relationship between two numeric variables.

A scatterplot is a 2D graph containing many scattered points. You have seen this before when using the pairs() function to visualize the relationships between two variables. Each dot represents a single observation, i.e., it identifies two measurements taken on that observation, by reference to the two axes.

In order to illustrate the idea, import the file FingerData.csv. Each row is data from a former UQ student: the length of the index and ring fingers on the right had. Run the following code and see a basic scatterplot of the two finger lengths:

FingerData <- read.csv("Data/FingerData.csv", header = TRUE)

plot(FingerData$index, FingerData$ring)  # Make the scatter plot

Customise the appearance by overriding default settings:

plot(FingerData$index, FingerData$ring,   
     main = "Plotting one finger against another",  
     xlab = "Index finger length (mm)",  
     ylab = "Ring finger length (mm)")

Customise the text and categorise the dots:

## Make the plot but turn off all settings (plot will be blank) 
plot(FingerData$index, FingerData$ring, 
type = "n", xaxt = "n", yaxt = "n", bty = "n", main = "", xlab = "", ylab = "")     

## Add points for females
with(FingerData, points(index[gender == 'f'], ring[gender == 'f'], pch = 1))                              
## Add points for males
with(FingerData, points(index[gender == 'm'], ring[gender == 'm'], col = "green", pch = 3))     

legend(55, 100, legend = c("Female", "Male"), col = c("black", "green"), pch = c(1, 3))  

## Manually print values along the horizontal axis
axis(side = 1, cex.axis = 1.2)                                                      

## Manually print values along the vertical axis
axis(side = 2, cex.axis = 1.2)                                                      

##  Manually add a title on the horizontal axis
mtext("Length of index finger (mm)", side = 1, line = 2.4, cex = 1.1)               

##  Manually add a title on the vertical axis
mtext("Length of ring finger (mm)", side = 2, line = 2.4, cex = 1.1)                

## Add main heading
text(70, 110, ~bold("Association between finger lengths"), xpd = TRUE, cex = 1.5)

The abline function allows you to add a line to a plot. We can use it in conjunction with the lm output to draw a line of best fit through our data. But as you have already seen, the curve() function is far more versatile and can be used to plot lines from multiple regressions and GLMs.

## fit a simple linear regression
finger_mod<-lm(ring ~ index, data=FingerData)

## Add a line of best fit (regression line)
abline(coef(finger_mod), col = "turquoise", lwd = 2) 

## or we could use the FAR MORE VERSATILE curve() function!!!
curve(cbind(1,x)%*%coef(finger_mod), add = T, col = "dodgerblue4", lwd = 3, lty=3)

9.3.1 Extensions to more than one quantitative variable

A "scatter plot matrix" provides an image of statistical association among several quantitative variables simultaneously. It is a collection of 2D scatter plots; one for each pair of variables.

Using a scatter plot matrix is especially helpful when you have measured many values on a population, and want to visualize the pairwise relationships among measurements. For example, you might imagine measuring many quantitative traits on individuals in a population of plants - say, height, branch number, leaf area, chlorophyll concentration, and root mass. A scatterplot matrix can show you the pairwise relationships of all these variables. A scatterplot matrix also helps you troubleshoot collinearity in your models. In our example, if we want to predict plant performance using traits, we could visualize how collinear our trait values are with each other to avoid the problems associated with collinearity.

In order to illustrate this idea, import the file OzonePollution.csv. The pairs command will generate a scatter plot matrix of the columns contained in the file.

OzonePollution <- read.csv("Data/OzonePollution.csv", header = TRUE)

pairs(OzonePollution)  ## Generate every possible 2D scatterplot among the columns

Does this collection of 2D plots summarise everything there is to know about the statistical association among these four variables?

No, it only summarises associations between each pair of variables. It tells us nothing about potential 2-way interactions (or more complex interactions).

Try it on your own challenge: Make a scatterplot matrix of the quantitative variables in iris

hint: how can you select a subset of the iris data to leave out the categorical variable?

9.4 Strip charts

Strip charts are a type of "scatter plot", but for which the X-axis refers to a categorical variable. Strip charts are powerful visualization tools because they can be used to show how raw data are distributed within each group. This can be helpful for the eye to see the spread and central tendency of the data in each group. Sometimes the points of a strip chart plot will crowd over the top of one another at each level of the categorical variable, like this:

plot(preloaded_data_1$copper_concentration, preloaded_data_1$fertilisation_success)

This is not as informative as it could be - it is difficult to see where the central tendency might be.

The stripchart function can solve this problem by imposing an artificial horizontal scatter upon the points at each level of the categorical variable.

stripchart(preloaded_data_1$fertilisation_success ~ preloaded_data_1$copper_concentration, method = "jitter", vertical = TRUE)

Does the amount of horizontal scatter for each group mean anything?

No, it is just for aesthetics.

Repeat the command many times, you will see that a different plot is created each time.

You can define combinations of groups across several categorical variables.

with(preloaded_data_1, stripchart(fertilisation_success ~ copper_concentration + sperm_concentration, method = "jitter", vertical = TRUE))

Notice that the syntax of stripchart is "regression-style": the response and explanatory variables are separated by a tilde (~). This makes it easier to define combinations of groups across several discrete variables.

Like any R graphic, you can customise the image to suit your particular needs. Many of the tricks already shown in this prac class will work with "stripchart". Here are a few ideas to get you started:

with(preloaded_data_1, stripchart(fertilisation_success ~ copper_concentration + sperm_concentration,  
          method = "jitter",                      ## Add random horizontal jitter to the data points 
          vertical = TRUE,                        ## Arrange the real data measurements vertically 
          col = c("blue", "blue", "darkgreen", "darkgreen"),   ## Choose plot symbol colours 
          pch = c(1, 3),                        ## Choose plot characters 
          xaxt = "n"))                          ## Turn off x-axis labelling

Now add explanatory text to improve the look:

copperLocations <- c(1.5, 3.5)                 ## Horizontal locations for the copper levels 
spermLocations <- 1:4                          ## Horizontal locations for the sperm levels 
copperLevels <- c("0 mg/L", "50 mg/L")         ## Define the text of the copper levels 
spermLevels <- c("High", "Low", "High", "Low") ## Define the text of the sperm levels 

## Print those values along the horizontal axis 
axis(side = 1, at = copperLocations, labels = copperLevels, tick = FALSE, line = 2.5, cex.axis = 1.2) 
axis(side = 1, at = spermLocations, labels = spermLevels, tick = FALSE, line = 0.1, cex.axis = 1.2)

Finally, we can add means and error bars to summarise how the mean changes from group to group.

## 1. Define the CIs to be plotted
groups <- with(preloaded_data_1, split(fertilisation_success, paste(sperm_concentration, as.factor(copper_concentration))))
means <- sapply(groups, mean)   ## Calculate mean of each subgroup
var <- sapply(groups, var)      ## Calculate a variance within each group
n <- sapply(groups, length)     ## Calculate a sample size within each group
se <- sqrt(var/n)               ## Convert variance to standard error, within each group

## 2. Set locations for the CIs
xvals <- 1:4

## 3. plot the means
points(xvals, means, pch=16, cex=1.5, col = "black")

## 4. Add CIs to the graph
arrows(xvals, means - 1.96 * se, xvals, means + 1.96 * se, length = 0.15, angle = 90, code = 3, col = "black")

Try it on your own challenge: Make a stripchart of Sepal.Length for each species in iris, and then add the means and standard errors to the chart

you should be able to do this given the examples above :)

9.5 Line graphs

Line graphs concern the relationship between two quantitative variables. Line graphs are nice when you want to emphasize a trend or pattern through time, such as the increase or decrease of population size through time, or the cycling of population size through time.

A classic example is the change of a mean across values of another variable. The mean is usually displayed with error bars, but beware of the assumptions you make about the distribution of the data when you plot means and error bars!!!

Below we make up some data to demonstrate. We will plot the size of organisms (through time) grown in "cleared" versus "crowded" treatments. We might expect the organisms in the crowded treatment to show less growth through time due to competition.

## Coordinates along the horizontal axis for the statistics 
Time <- 1:5

## Means to put at those points for two groups
mean_size_crowded <- c(7.4, 98, 851, 7222, 54347)
mean_size_cleared <- c(6.8, 104, 846, 9742, 102314)

## Standard errors for those means for two groups
se_crowded <- c(0.5958, 7.8905, 68.5185, 581.4815, 4375.7649)
se_cleared <- c(0.5475, 8.3736, 68.1159, 784.3800, 8237.8422)

## Error bars for those means (remember, 1.96 standard errors is the approximate boundary of the rejection zone in the t distribution!)
yupper_cleared <- mean_size_cleared + 1.96*se_cleared
ylower_cleared <- mean_size_cleared - 1.96*se_cleared
yupper_crowded <- mean_size_crowded + 1.96*se_crowded
ylower_crowded <- mean_size_crowded - 1.96*se_crowded

## Draw the line graph using the data for "mean_size_crowded"
plot(Time, mean_size_crowded, type = "b", col = "blue", ylim = c(0, 120000), bty = "l", ylab = "Mean size") 

## Add a second line graph to the image 
points(Time, mean_size_cleared, type = "b", col = "red") 

## Add error bars to the means appearing on the graph 
arrows(Time, yupper_cleared, Time, ylower_cleared, length = 0.05, angle = 90, code = 3, col = "red")
arrows(Time, yupper_crowded, Time, ylower_crowded, length = 0.05, angle = 90, code = 3, col = "blue")

## Add a legend 
legend(2, 100000,                         ## XY coordinates for the legend 
       legend = c("crowded", "cleared"),  ## Legend labels 
       col = c("blue", "red"),            ## Color of points 
       pch = c(21, 19)                    ### Point type 
)

Which part of the code causes the lines to be drawn that connect the means?

type = "b"

How would you modify the graph to remove the box around the graph?

bty = "I"

What is the difference between a line graph and a scatter plot?

Scatter plots show the raw data.

Line graphs display a summary of the raw data.

One common use for line graphs is to summarise the conclusion of an ANOVA; specifically, the effects of explanatory variables upon the mean of the response variable. The code that you would use is similar to the code already presented above. The difference concerns the width of the error bars. Above, you simply used the standard error. Here, we will use the variance extracted directly from the ANOVA table, as follows:

## First, fit a model
fit <-  lm(fertilisation_success ~ sperm_concentration * as.factor(copper_concentration), data=preloaded_data_1)

DF <- fit$df.residual     ## Extract the degrees of freedom 
VAR <- summary(fit)$sigma^2   ## Extract residual variance 

## Define the groups for which a summary statistic is to be calculated 
groups <- with(preloaded_data_1, split(fertilisation_success, paste(sperm_concentration, as.factor(copper_concentration))))

The remaining code is presented below, and uses the "DF" and "VAR" from above to generate the error bars. Parts of it you have already seen. It is now in a generic form that will work with any data analysis, regardless of the number and complexity of "groups". We will make use of this code in subsequent practical sessions.

## Set up the plot parameters
means <- sapply(groups, mean)    ## Calculate the mean of each subgroup 
n <- sapply(groups, length)      ## Calculate a sample size within each subgroup 
se <- sqrt(VAR/n)                ## Convert variance to standard error, within each subgroup 
t <- qt(0.975, df = DF)          ## Define a 95% confidence interval using the t-distribution 
width = t * se                   ## Set the width of each error bar 
xloc <- seq(1:length(groups))    ## Places along the X-axis to graph the means 
xlab <- names(groups)            ## Define X-axis labels 
library(gplots)                  ## Load the library containing the 'plotCI' program 

## Construct the plot
plotCI(xloc, means, uiw = width, gap = 0.1, xlim = c(min(xloc) - 0.5, max(xloc) + 0.5),  
xlab = "", xaxt = "n")
axis(side = 1, at = xloc, labels = xlab, cex.axis = 1.5)
lines(xloc, means)

Which setting of the plotCI function determines the width of the error bars?

uiw, which is the "upper interval width".

9.6 List of useful functions

Here are a few useful functions that you learned today.

Histograms
hist; bins;

Vectors
seq; diff; split; pairs

Plotting
rainbow; text; par; arrows; mtext; plot3d; stripchart; plotCI

Distributions
dnorm; sd; min; max; mean; deviance; qt