Practical 1 Getting started with R

Why learn R?

R is a professional statistical package. It is:

  1. The tool of choice for most professional biologists and scientists;
  2. FREE;
  3. The most up-to-date, all-purpose statistics package available: a huge group of users constantly update and add resources to R;
  4. Immensely powerful; you can use it to:
    1. manipulate and organise data;
    2. run statistical models, from simple to highly complex ones;
    3. handle multiple data sets simultaneously;
    4. produce beautiful graphics, and;
    5. build and run simulation models to understand and quantify biological dynamics.

Whatever your career path, R-literacy is an important skill to have on your CV!


1.1 Installing R and RStudio

We strongly encourage you to install R onto your personal computer. Install R for your operating system (Windows, macOS, or Linux).

An icon will appear on your desktop, and a collection of folders (directories) will also have appeared at the location you specified during the installation. The CRAN (Comprehensive R Archive Network) website contains a lot of useful information about R: manuals, documentation, FAQs, access to the R-Help archives, add-on packages for download, and more. The Internet is also pretty helpful!

When you first install R it comes with the "Base Package" that includes many essential functions. This base has been built on and extended by many practitioners, and we will use other packages to support our analyses. In coming weeks, we'll show you how to install and load these packages.

R is a very "old school" command-line driven application, but there are many wrappers available that drag R into the 21st century, and provide a more user-friendly interface. We will use RStudio, one of the most popular interfaces. Only install this AFTER you have installed R.


1.2 R Projects (inside RStudio)

Biologists, ecologists and conservation scientists who routinely analyse data are often working on more than one analysis at a time. RStudio lets us make a "mini universe" on our computer for each specific project we are working on. These are called "R projects". By now you should have saved and unzipped the CONS7008 R Project from Blackboard to a sensible location on your computer.

Go into that folder and click on the "CONS7008 R Project.Rproj" file (depending on your computer you may not see the ".Rproj" file extension). RStudio should open automatically (it might take a little while, be patient). In the lower right panel, click on "Files" and you will notice that you are "inside" the CONS7008 R Project folder. This will be the same on everyone's computer, which is convenient because now we don't have to worry about telling R where we are on our specific computers (e.g. "/Desktop/Johns_crap/"). Instead, when we are inside this mini universe, R knows where to find files that we want to import, or where to put stuff that we want to output (e.g. nice figures!).

1.2.1 Starting a fresh R script

The upper left panel of RStudio is where you write your R code into script files that can be saved. To start a new file, click the icon resembling a white page with a green + symbol in the top left-hand corner of the script panel.

Press File -> Save. Rename the file "Prac 1 R script.R" and save it inside the "R scripts" folder. Now we are ready to do some REAL R programming!

You type or paste commands into R scripts, edit them, and pipe them into R by clicking the "Run" button; it will run the command at which the cursor is currently positioned. If you want to run several commands consecutively, highlight them all, and then click "Run". There are also helpful keyboard shortcuts that make it easier to send commands to R. For Windows use Ctrl+Enter and for a Mac use Cmd+Return.

You can have multiple different R script files open at once but for now let's keep it nice and simple with a single R script.

1.3 Getting help

You can get information in R about any command by typing ? followed by the name of the command you want help on.

Try running this command:

?with
What happened? Did R do anything?
You should have seen the code appear in the bottom left panel ("Console"), and information about the "with" command appear in the "Help" tab in the bottom right panel.


When R responds to the ?, it usually displays an information page that gives example scripts at the bottom. You can cut and paste these scripts straight into the script editor, execute them, and see how they work. Another useful trick is to type example(command) where "command" is the name of the R command that you are trying to learn more about.

Now, try running this command:

??bty

The ?? searches the help pages of all your installed packages for a term, which is useful when you don't know the name of the command you need. Click on the "graphics::par" in the bottom right panel.

Scroll down to see the many graphical parameters that can be modified. What options does "bty=" take?
"o" (the default), "l", "7", "c", "u", or "]" - this sounds like nonsense, but each option tells R which sides of a scatterplot to draw a frame around (the frame resembles the shape of the character), and "n" draws no frame at all. Check out the options when you plot the "zebrafish" data below.


You can also "Google" your problems; just type "R" followed by your query. Problems with R are extensively discussed on Internet forums, so there is an extremely high chance that your problem has already been discussed and resolved.


1.4 Using R as a calculator

Copy into R and run these commands and observe their effects:

1 + 2 ## Add 1 and 2
sqrt(1 + 2)  ## Take the square root of the sum of 1 and 2
x <- (48 + 34)/5.5  ## Make an R object "x" to store the result of adding 48 and 34, and dividing by 5.5
x   ## Call the object "x" = show the result in the console

Some tips on coding:

  1. The <- tells R "run the commands at the base (-) of the arrow and store the output as an object with the name pointed to by the arrow (<)". If you do not supply the arrow and a name, the output will appear in the console, but will not be saved in your R environment.

  2. Anything you type into R after the hash symbol (#) is ignored - this is useful for adding notes on your code. Annotate your code to tell yourself what it is doing to help you remember for future use. Any code you share with others (whether for assessment or as a collaboration) should be well annotated so the purpose of each line of code is clear.

  3. R is case sensitive. For example, "Height", "HEIGHT", and "height" are completely different things. So take care when typing commands. Another thing that can trick some people is spelling - R typically follows American conventions (so, "color" not "colour").
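
A small sketch of tips 1 and 3 may help. The object names "height" and "Height" are made up for this example and do not come from any data set:

```r
height <- 170              ## Stored as an object; nothing is printed
Height <- 180              ## A different object, because R is case sensitive
height + 5                 ## No arrow and name: the result is printed but not stored
height                     ## Unchanged by the line above
identical(height, Height)  ## Are the two objects the same thing?
```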

What is the answer to: (9 + 3.3)/(10.1 - 4)?
2.0163934

1.5 Vectors, classes, and some basic operations

R has a concatenate command: c(). Concatenate means "join together as a sequence (or, "vector")". The stuff inside the brackets, separated by commas, is concatenated.

To pick and choose stuff within a vector, use [] as demonstrated below. Try this:

stuff <- c(6, "elephant", 4.5, 1, "cat")    ## Create a vector of "stuff" with numbers and characters
stuff                ## Show stuff in the R console
View(stuff)          ## Show stuff as a new tab in the upper left script panel
stuff[2]             ## This calls the second element of the vector, "stuff"
stuff[1]             ## And the first. Note the "" around the 6 - R has not recognised 6 as a number
class(stuff)         ## What type of class is "stuff"? Everything in "stuff" is recognised as a character
stuff[1] + stuff[4]  ## Error message because "6" and "1" are not recognised as numbers
stuff[2:4]           ## The 2nd, 3rd, 4th elements of "stuff"

Challenge: Write the code to make a new vector "Friends", concatenating the first names of your five closest friends (in no particular order!). Show the person next to you JUST the name of your third friend.

1.5.1 Changing between classes

A vector can only be one class. Because "cat" and "elephant" were in the "stuff" vector, "stuff" was recognised as a character vector, hence the "" around each element. The code stuff[1] + stuff[4] looks like 6 + 1 to us, but, as far as R was concerned, it might as well have been "cat" + "elephant".

It is possible to change between classes:

stuffNumeric <- as.numeric(stuff)         ## There were warnings because "cat" and "elephant" can't be converted to numbers
class(stuffNumeric)                       ## Now it's a numeric vector!
stuffNumeric                              ## See the NAs where "cat" and "elephant" were? R has replaced them with NA ("not available") because it couldn't convert them

stuffNumeric[1] + stuffNumeric[4]         ## Now this works as desired! 6 + 1 = 7

stuffFactor <- as.factor(stuff)           ## Change the type to "factor"
stuffFactor                               ## This is similar to "stuff", but there are now levels. This is IMPORTANT. Notice that the levels are sorted - here, numbers first and then letters.
stuffNumeric2 <- as.numeric(stuffFactor)  ## No warnings! Why?
stuffNumeric2                             ## This is now a vector of integers; everything has changed - each number is the position of that element's value among the sorted levels of the factor.

What is going on with "stuffNumeric2"?

The integers (numbers) are the "levels" of "stuffFactor", rather than the actual values (6, 4.5, 1) from the stuffFactor vector. 4.5 is no longer 4.5, but the 2nd level of stuffFactor, hence:

levels(stuffFactor)  ## What are the levels of "stuffFactor"
stuffNumeric[3]      ## is 4.5, but
stuffNumeric2[3]     ## is 2! AND,
stuffNumeric2[1] + stuffNumeric2[4]  ## no longer equals 7

In summary, classes are important and it is often necessary to switch between them, particularly for analysing data (e.g., to test for a difference between a "Treatment" and "Control" we need R to know that these are two levels of a factor, not characters). Be careful when changing class, particularly when changing from a factor to a numeric class.
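
If you ever need the original numbers back from a factor, one common approach is to convert to character first and then to numeric. The sketch below uses a made-up vector called "sizes" (not part of the practical's data):

```r
sizes <- as.factor(c("10", "2.5", "10", "7"))    ## Hypothetical measurements stored as a factor
as.numeric(sizes)                                ## The level codes, NOT the measurements
sizesNumeric <- as.numeric(as.character(sizes))  ## Convert to character first, then to numeric
sizesNumeric                                     ## The original values are recovered
```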


1.6 Data matrices

Most of our work will involve data sets with more than one variable. For example, we might be working with an explanatory variable and a response variable.

Let's make some new data vectors, with the same length (i.e., the same number of elements):

Owners <- c("Sally", "Harry", "Beth", "Roger") ## Vector of people's names
Pets <- c("dog", "lizard", "bird", "cat")      ## Vector of the type of pet each person has

Now let's combine these two vectors into one "matrix", and look at it:

OwnersPets <- cbind(Owners, Pets)  ## cbind concatenates these two vectors as columns
OwnersPets        ## Have a look at it
str(OwnersPets)   ## Check the class - here we've used "str" which is useful when you have multiple variables, or columns in your data
dim(OwnersPets)   ## the command "dim" returns the dimensions (rows and columns) of our object         
What are the dimensions of the matrix "OwnersPets"? Is this what you expected it to be?
4 2; that is, four rows and two columns


cbind has joined the two vectors together to form a "matrix". In R, we can work with vectors, matrices, or "data frames". These are similar, but have slightly different properties, and respond to different commands. This is useful to know, as sometimes the reason your code isn't working is because R thinks your data is in a different format to what you think it is (just like you saw with data classes).
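
One of those differences can be sketched with a small example: a matrix can only hold one class, whereas a dataframe (made with data.frame, which you will meet in the next step) keeps a class for each column. The "Ages" values here are made up purely for illustration:

```r
Owners <- c("Sally", "Harry", "Beth", "Roger")
Ages <- c(34, 51, 27, 45)                  ## Hypothetical ages, invented for this example

OwnersAges <- cbind(Owners, Ages)          ## A matrix can only hold one class...
class(OwnersAges[, "Ages"])                ## ...so the ages have become characters

OwnersAges.df <- data.frame(Owners, Ages)  ## A dataframe keeps a class for each column
class(OwnersAges.df$Ages)                  ## The ages are still numeric
```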

We can alternatively create a dataframe using data.frame:

OwnersPets.df <- data.frame(Owners, Pets)   ## add the vectors to a dataframe
OwnersPets.df         ## Have a look at it
str(OwnersPets.df)    ## Check the structure
How does the structure of "OwnersPets" differ from "OwnersPets.df"?
"OwnersPets" is a character matrix (str shows it as chr [1:4, 1:2], i.e., four rows by two columns of characters), while "OwnersPets.df" is a data frame arranged as four rows ("obs." or observations) and two columns (or variables).

1.7 Dataframes and viewing data

R comes with many inbuilt data sets. One, called "women", contains two columns called "height" and "weight". We will practise with this data set.

Copy each of the below commands to your R script, run each, and discuss what it does.

women
View(women)
women$height
women$weight
height ## You will get an error
with(women, height)
weight ## You will get another error
plot(women$height, women$weight)
plot(women$weight, women$height)
with(women, plot(height, weight, type = "l"))

plot(women$height, women$weight, type = "b", col = "red", bty = "l", cex = 2, xlab = "Height", ylab = "Weight")
What was the purpose of using with?
with tells R that "women" is the dataframe we are using, such that we could refer to 'height' and R knew where to access it from.

What does the "$" do?
This tells R that the height variable can be accessed in the women dataframe. Try running the code: women[, 2] - what is it equivalent to?


What does "cex" tell R to do?
Remember: you can get information by typing "??cex" into your code and running it



1.8 Importing your own data into R

Let's import a spreadsheet called zebrafish.csv into R.

A spreadsheet must be in the right format before it can be imported. Specifically:

  1. The data must be in "row/column" format, i.e.:
    1. rectangular;
    2. each column holds a separate variable, and;
    3. each row holds data from a separate object.
  2. There must be nothing in the spreadsheet but data (e.g., no extra notes, graphs, etc).
  3. Column names (headings) must be one word each (R doesn't like spaces).

zebrafish.csv is located in the "Data" subfolder within our CONS7008 R Project, so we can easily import the data like this:

zebrafish <- read.csv("Data/zebrafish.csv", header = TRUE)
Why did we include "Data/" in the name of the csv file?
R needs to know where to find things. All of the .csv files for this course are stored in the "Data" subfolder within our R Project, so we need to tell R to get "zebrafish.csv" from inside the "Data" folder.
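
If read.csv complains that it cannot find a file, a quick check like the one below may help. getwd, file.path and file.exists are base R functions; "csvPath" is just a name made up for this example:

```r
getwd()                                        ## The folder R is currently "inside"
csvPath <- file.path("Data", "zebrafish.csv")  ## Builds the path "Data/zebrafish.csv"
csvPath
file.exists(csvPath)                           ## TRUE if R can find the file from here
```

If the last line returns FALSE, you are probably not "inside" the R Project folder, or the file is not in the "Data" subfolder.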


Check that your data is now imported by typing these commands:

zebrafish               ## Look at the entire data set
names(zebrafish)        ## List the variables (the column names)
head(zebrafish, n = 4)  ## Look at the first 4 rows of the data
tail(zebrafish, n = 4)  ## Look at the last 4 rows of the data
dim(zebrafish)          ## How many rows and columns are there?
What is the "zebrafish" data? How many variables (columns) and objects (rows)? What are the variables?
dim() will tell you that there are 482 rows by 10 columns; you can see the column labels when the object is called (i.e., you just run the object name), or via the names(), head(), or tail() commands.

1.9 Classes within data frames

Above, you saw that all elements of a vector were treated as the same class (e.g., factor, numeric, character). In data frames, this applies to each column.

If R encounters a non-numeric character within a column it treats the entire column as a non-numeric variable (a character column, or a factor in R versions before 4.0), even if all other entries are numbers. Therefore, if you have accidentally typed a non-numeric character (even an accidental "space") into your data column, you have to go back to the data file and fix it before importing the data back into R.
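
To see this in action without touching a real file, the sketch below reads a tiny made-up data set directly from text (using the text argument of read.csv). The data, and the names "csvText", "fish", "lengthNumeric" and "badRows", are invented for this illustration:

```r
## A tiny made-up data set with a typo: the letter "O" instead of a zero in row 3
csvText <- "id,length
1,12.5
2,13.1
3,1O.2"
fish <- read.csv(text = csvText)
str(fish)                               ## "length" has not been read as numeric

## Find the offending rows: anything that cannot be converted becomes NA
lengthNumeric <- suppressWarnings(as.numeric(as.character(fish$length)))
badRows <- which(is.na(lengthNumeric))
badRows                                 ## The row(s) to fix in the data file
```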

As with vectors, it is possible to change the class of columns. You can use the "structure" function, str, to identify the class of each column.

In the present data set ("zebrafish"), 'sex_nr' is a categorical variable for "sex", but R does not recognise that because sex has been coded as numbers: 1 and 2. You need to tell R to treat them as categories using the as.factor function.

Try this:

str(zebrafish)              ## Note that sex_nr is an integer variable
sex_nr <- as.factor(sex_nr) ## Tell R to treat the variable sex_nr as a factor

It didn't work! Instead we get a scary Error in is.factor(x) : object 'sex_nr' not found message!

This is because we didn't tell R where to find the 'sex_nr' variable! We also didn't tell R where to put the revised version of the variable.

Try this instead:

zebrafish$sex_nr <- as.factor(zebrafish$sex_nr)
str(zebrafish)

That worked! The variable 'sex_nr' is now categorical (a factor), and also included within "zebrafish".

Even though it is more typing, it is often safest to use $ to ensure that R is very clear about what variable you are asking it to act on.

Another built-in R dataset is "iris". Type iris$ into your R console - a drop-down list of options will appear, telling you the names of each variable in the data set. What are the five variables in iris?
'Species'; 'Petal.Length'; 'Petal.Width'; 'Sepal.Length'; 'Sepal.Width'
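
You can also list the variables and their classes with code rather than the drop-down list. This is a minimal sketch using base R functions; "irisClasses" is a name made up for this example:

```r
names(iris)                         ## The column names
irisClasses <- sapply(iris, class)  ## The class of each column
irisClasses                         ## Four numeric columns and one factor
```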

1.10 Picking and choosing parts of a data set

Sometimes you will want to select (or identify) parts of a data set. Above, we used stuff[1] to get the first element of the "stuff" vector. For data frames (which have rows and columns, i.e., they are 2-dimensional, not 1-dimensional), we need two arguments. Again, we use the [], but slightly differently.

Copy the below code and run each line, and deduce what each command does.

zebrafish[2, 4]
zebrafish[, 2]
zebrafish[2, ]
zebrafish[, 1:3]
zebrafish[zebrafish$TotalLength > 30, ]
with(zebrafish, zebrafish[sex_nr == 1, ])
zebrafish$sex_nr
zebrafish[, "sex_nr"]

The > and == operators are basically TRUE/FALSE questions.

Hence, when subsetting the data like this: zebrafish[zebrafish$TotalLength > 30, ], R shows all of the rows where zebrafish$TotalLength > 30 is TRUE.
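
The same idea can be broken into steps using the built-in "women" data (so it runs without the zebrafish file). The names "isTall" and "tallWomen" are made up for this sketch:

```r
isTall <- women$height > 70   ## One TRUE/FALSE answer per row
isTall
sum(isTall)                   ## TRUE counts as 1, so this counts the matching rows
tallWomen <- women[isTall, ]  ## Keep only the rows where the answer is TRUE
tallWomen
```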

How many "zebrafish" have a 'TotalLength' less than 20? Of these, what is the longest 'TotalLength'?
6; 19.8


NB: == and = are different. == is a TRUE/FALSE question ("are these two values equal?"), whereas = assigns a value (much like <-) or sets an argument inside a function call (e.g., n = 4).

Picking and choosing parts of a data set is relatively easy if you remember a few basic rules:

DataSet[RowNumbers, ColumnNumbers]

DataSet$ColumnName

DataSet[RowNumbers, ]$ColumnName

DataSet[RowNumbers, "ColumnNames"]


What would these commands produce?
DataSet[, 1:2]
DataSet[1, ]
DataSet[1:5, ]$Col1Name
DataSet[1:5, "Col1Name"]

The first two columns, ALL rows.
The first row, ALL columns.
The first five rows, and the first column (named "Col1Name").
Exactly the same as the one before!
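
You can check that these forms really are equivalent using the built-in "women" data. The object names below are made up for this sketch:

```r
byDollar <- women[1:5, ]$height    ## First five rows, then the "height" column
byName   <- women[1:5, "height"]   ## The same thing, naming the column inside []
byNumber <- women[1:5, 1]          ## The same again, using the column number
identical(byDollar, byName)        ## Are the first two results the same?
identical(byName, byNumber)        ## And the last two?
```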

1.11 Graphing data

The graphics in R are customisable, and capable of anything you are likely to want (given enough time, effort, and patience). We will show you quite a few ways to plot data in this course - there is a wealth of expertise in plotting available via the R community on the internet.

To get started, run the code below, and see how it changes the plots.

pairs(zebrafish[, 6:9])       ## Plot variables against other variables (variables in columns 6, 7, 8, and 9)
plot(zebrafish$TotalLength, zebrafish$SwimSpeed)         ## Plot 'TotalLength' (x-axis) against 'SwimSpeed' (y-axis)
plot(zebrafish$TotalLength, zebrafish$SwimSpeed, type = "n",  ## Set up a blank graph with axes
     xlab = "Fish length (mm)", ylab = "Fish swimming speed")  ## Add axis labels

with(zebrafish, points(TotalLength[sex_nr == 1], SwimSpeed[sex_nr == 1]))               ## Insert the points for the 1st sex
with(zebrafish, points(TotalLength[sex_nr == 2], SwimSpeed[sex_nr == 2], col = "red"))  ## Insert the points for the 2nd sex in red

plot(zebrafish$TotalLength, zebrafish$SwimSpeed, bty = "l")          ## Change the box type: draw only the left and bottom sides
plot(zebrafish$TotalLength, zebrafish$SwimSpeed, xlab = "Fish length (mm)", 
     ylab = "Fish swimming speed", cex.axis = 1.5, cex.lab = 1.5)  ## The "cex" (= character expansion) arguments control the sizes of text and symbols

The plot command has many optional arguments, giving you finer control over the appearance (see ?par and ?plot for more details).

Is sex_nr 1 or 2 shown in red on the plot?
sex_nr 2, because of this line: points(TotalLength[sex_nr == 2], SwimSpeed[sex_nr == 2], col = "red")
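
As one illustration of these plotting arguments, the sketch below draws the same scatterplot once for each "bty" option listed on the ?par help page (including "n", which draws no box). It uses the built-in "women" data so that it runs without the zebrafish file; "boxTypes" and "oldPar" are names made up for this example:

```r
## Box types listed under "bty" on the ?par help page
boxTypes <- c("o", "l", "7", "c", "u", "]", "n")

oldPar <- par(mfrow = c(2, 4))  ## Split the plot window into a 2 x 4 grid
for(b in boxTypes){
  plot(women$height, women$weight, bty = b, main = paste("bty =", b))
}
par(oldPar)                     ## Restore the previous plot layout
```

If RStudio reports that the figure margins are too large, try making the "Plots" panel bigger before running the code.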

1.12 Looping

Looping is a very useful coding technique. A block of commands is executed some number of times, or until something "happens".

For example, perhaps we want to calculate unit grades by summing up assignment marks. We can write some code to do the calculation, and then run it "for" each student. These are "for loops".

Alternatively, we might be simulating population genetic data, and we want to know when an allele goes to fixation. We can simulate generations, and keep simulating "while" the allele is not fixed. These are called "while loops" (very imaginative).

NB: Loops use a particular type of bracket {}. While we've been encouraging you to run each line of code one by one to understand what they do, when you see a {, you must submit ALL the code up to and including the closing brace } - otherwise you "leave R hanging" - waiting on you to finish the command!

Here are examples of simple for loops:

## A simple for loop
for(i in 1:10){
  print(i)
}

## A for loop with an if/else statement
for(i in 1:10){
  if(i < 5){
    print("Hello")
  }else{
    print("Bye!")
  }
}
How many times did you say "Hello"? What would you change in the code to get it to say "Hello" 3 times?
4. change if(i < 5) to if(i < 4)


Here is a slightly more advanced for loop that does something useful!

x <- c(2.1, 4.44, 1.3, 7.32, 8)
Total <- 0
for(i in 1:length(x)){
  Total <- Total + x[i]
  if(i == length(x)){
    print(paste("The total is", Total))
  }
}
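
As a sanity check, the loop's running total can be compared with R's built-in sum function. This sketch repeats the loop without the printing step so that it runs on its own:

```r
x <- c(2.1, 4.44, 1.3, 7.32, 8)
Total <- 0
for(i in 1:length(x)){
  Total <- Total + x[i]    ## Add each element to the running total
}
all.equal(Total, sum(x))   ## Compare the loop result with the built-in sum()
```

In practice you would just use sum(x), but writing the loop out shows what a function like this does behind the scenes.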

Loops can be hard to understand at first, so don't worry if they are still a bit confusing. They are pretty useful so it's worth spending some time getting your head around the basic idea.
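
For comparison, here is a minimal sketch of the "while" loop described earlier. It is a made-up counting example (not the population genetics simulation): it keeps adding whole numbers until the running total reaches at least 20. The names "n" and "Total" are chosen just for this illustration:

```r
Total <- 0
n <- 0
while(Total < 20){      ## Keep going WHILE the total is below 20
  n <- n + 1
  Total <- Total + n    ## Add the next whole number
}
n       ## How many numbers were needed
Total   ## The first running total that reached 20 or more
```

Unlike a for loop, we did not need to know in advance how many times the block would run.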


1.13 Installing and loading packages

Throughout these practicals, we will need to run commands that R doesn't know how to do. We can make our own R functions to solve a task (you'll learn how to do that in the next practical!), but often someone else has already done that for us. People write functions and then compile them into packages for everyone to use. This is extremely helpful, and a large part of what makes R such a powerful tool.

Have a go at installing a package using the code below:

install.packages("praise")  ## This installs the package onto your computer in your default 'library' 

You can also install a package using the buttons under the "Packages" tab in the bottom right panel of RStudio.

After you have installed the package, if you want to run a function that is contained in that package, you first have to load the package:

library(praise)  ## This command loads the 'praise' package
require(praise)  ## This also loads the 'praise' package

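NB: library and require both load packages and you might see both around, but they are not quite identical: if the package is not installed, library stops with an error, whereas require gives a warning and returns FALSE.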

Now try running the praise() function. I was "exquisite" - what are you?

If a package is installed on your computer, you can run a function without loading a package like this: praise::praise().
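
To avoid re-installing a package every time a script is run, one possible approach is to check whether it is already installed first. The helper "isInstalled" below is a hypothetical function written for this sketch (it is not part of R); it relies on the base function requireNamespace:

```r
## A hypothetical helper (not part of R): TRUE if a package is installed
isInstalled <- function(pkg){
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

isInstalled("stats")                 ## "stats" comes with R
isInstalled("notARealPackage12345")  ## A made-up package name

## Install 'praise' only when it is missing (uncomment to use):
## if(!isInstalled("praise")) install.packages("praise")
```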


1.14 Saving and Tidying up

At any time (indeed frequently!) save your R script of commands and notes.

When you are finished with them, it is good practice to remove any variables you've created during the session from your R Environment. This can save confusion with any commands and objects you introduce later.

rm(zebrafish)       ## Removes the object altogether

You can get rid of everything by typing:

rm(list = ls())    ## remove all data and objects from R's memory

Notice how everything disappeared from your Environment in the top right panel of RStudio?

This can be really helpful in some situations where R has gotten confused; for example, if you have created multiple data frames that share variable names. Once you have done this though, you will need to go back and re-import the data, and re-run the code to create any R objects that you want to work with. This is why saving R scripts is so important! It saves all of the "instructions" for R and you can just run it again in a fresh session.
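
If you want to see exactly what rm does before trying it on real data, a small sketch with two throwaway objects may help ("tempA" and "tempB" are names made up for this example):

```r
tempA <- 1
tempB <- "two"
ls()             ## Lists the objects currently in your Environment
rm(tempA)        ## Remove just one of them
exists("tempA")  ## Is "tempA" still there?
exists("tempB")  ## "tempB" was not touched
```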

You can also use the "broom" button that you see at the right-hand end of the icons above the Console, Environment and Plots panels of RStudio - that will delete everything in those panels.


1.15 List of useful functions

Here are a few useful functions that you learned today.

Classes
as.character; as.numeric; as.integer; as.factor

Evaluating a command inside a dataframe (so that its columns can be referred to by name)
with

Data structure
head; tail; str; dim; names

Installing and loading packages
install.packages; library; require

Maths
sqrt; log; exp