Practical 2 Estimation and descriptive statistics

Our goal in science is to understand properties of populations. We want to know things like: how big is the average blue whale? And we want to test questions, such as: are whales in the Pacific Ocean bigger than whales in the Indian Ocean?
The challenge is that we cannot measure every single individual (or thing), we have to take a smaller "sample" from the population and measure "statistics" on this sample. When we call a statistic an "estimate", we mean that it represents some number that we have not measured directly. When we have an estimate of something, we also need to know how good that estimate is. Intuitively, if we take a sample of five blue whales and calculate the sample mean (average) body mass, we will not be very confident that this sample average (based on only five individuals) represents the average body mass of all blue whales in the ocean.
In this practical you will:
- Be introduced to some experimental data that you will use in this prac and the next;
- Calculate various "descriptive statistics" (estimated properties of populations);
- Learn how to write your own function to calculate a statistic;
- Learn how to perform calculations on a single sample of one population and on samples from different populations;
- Learn how to utilise simulations of data to explore the idea of "standard error", and how it relates to "standard deviation".
- Begin to learn how to address the question: "how can one judge the relative merits of separate studies that address the same question"?
2.1 The cure for cancer
Two research teams were both awarded $1M in funding to investigate the potential of several drugs to cure cancer. Both teams claim to have discovered a new drug, "Cureallaria", which is a cure for all cancers (as well as acne, hair loss, and yellow teeth!).
Below are the data collected by each research team, separated into control (no Cureallaria) and treatment (Cureallaria) groups. Each datum is the proportion of a group of 100 people that were cancer-free after being given either the treatment or the control.
Research Team "A"
Research team A used their preliminary $1M in funding to replicate the initial 200 person (100 control + 100 treatment) study as many times as possible. They were able to repeat their study 10 times.
Research Team "B"
Research team B were so impressed with the massive difference between their treatment and controls in the first round of experiments that they only replicated the experiment three times (for 100 control and 100 treatment individuals each time). Encouraged by these early findings, Team B spent the rest of the initial $1M of funding on advertising and media releases, telling the world about their incredible results, and preparing their business structures for massive cash inflows.
Based on these results, both Research Teams are requesting $10 million in further funding to develop their cures into a product for commercial distribution across the entire world. As the adviser to the Minister for Science you must present a summary of each research team's data so that she can decide which of the teams will receive the funding, and which will miss out.
2.2 Calculating descriptive statistics
Go into the CONS7008 R Project folder on your computer and click on the "CONS7008 R Project.Rproj" file (depending on your computer you may not see the ".Rproj" file extension). RStudio should open automatically (it might take a little while, be patient). In RStudio, go to File -> New File -> R Script and a blank R script will appear. Now press File -> Save. Rename the file "Prac 2 R script.R" and save it inside the "R scripts" folder. For today's prac, paste your scripts into this R script and run the commands from there (remember to save the script at the end of the session!).
The "Cureallaria" data already exists as a csv file in the "Data" subfolder within your R Project. We can read it in and explore it like this:
## read in the data and call it "cancer_data"
cancer_data<-read.csv("Data/Cureallaria.csv", stringsAsFactors = TRUE) ## using "stringsAsFactors = TRUE" forces R to convert character variables to factors (it is useful most of the time)
## look at the first six rows
head(cancer_data)
## look at a summary of the data
summary(cancer_data) ## this is very informative!
## look at a plot of the data - this is a box and whisker plot
par(mfrow=c(1,2))## set up a plotting canvas with 1 row and two columns
## plot Team A
with(subset(cancer_data, Team == "A"), plot(N_cured_from_100 ~ Treatment, main = "Team A"))
## plot Team B
with(subset(cancer_data, Team == "B"), plot(N_cured_from_100 ~ Treatment, main = "Team B"))
## close the plotting device and reset plotting parameters
dev.off()
2.3 Measures of "central tendency"
"Central tendency" is the most basic property that a data sample can have. Common measures of central tendency are mean and median, both alternative ways to locate the "middle" of the data.
With your tutor and peers, discuss what "central tendency" might mean. How would you measure it?
The tendency of data to be "close to" or "clump around" a particular value.
2.3.1 Calculating the mean and median
Run the code below to use the built-in R function for calculating the mean (called, unsurprisingly, mean!) to obtain the mean of each of the samples. Because the data are in "tidy" format (i.e., long format; more on "tidy" data later), we need to separate the data into each combination of Team and Treatment. There are many ways to do this, but here we use "subset".
NB: if you put () around a line of code, the output will be stored AND appear in the console!
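As a quick illustration of this trick (using a made-up vector, not the cancer data):

```r
## without the outer parentheses the result is stored silently;
## with them it is stored AND printed to the console
(x <- mean(c(1, 2, 3))) ## prints [1] 2
```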
subset_A_Control<-subset(cancer_data, Team=="A" & Treatment=="Control")
mean(subset_A_Control$N_cured_from_100)
subset_A_Cureallaria<-subset(cancer_data, Team=="A" & Treatment=="Cureallaria")
mean(subset_A_Cureallaria$N_cured_from_100)
subset_B_Control<-subset(cancer_data, Team=="B" & Treatment=="Control")
mean(subset_B_Control$N_cured_from_100)
subset_B_Cureallaria<-subset(cancer_data, Team=="B" & Treatment=="Cureallaria")
mean(subset_B_Cureallaria$N_cured_from_100)
You now have four means, one for each of the team's control and treatment groups.
Which of the 4 means is highest?
Team B "Cureallaria"
Repeat what you did above, but calculate the median instead of the mean, again using the built-in R function.
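One way to do this, reusing the four subsets created above, is simply to swap mean for median:

```r
## medians for each Team x Treatment combination
median(subset_A_Control$N_cured_from_100)
median(subset_A_Cureallaria$N_cured_from_100)
median(subset_B_Control$N_cured_from_100)
median(subset_B_Cureallaria$N_cured_from_100)
```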
Are either of these "central tendency" statistics enough to judge which team deserves the funding?
No; they all ignore the spread (scatter) within the data sample and the sample size! We will be less confident that our sample mean represents the actual population mean if it was calculated on a sample of five highly variable numbers than if it was calculated on a sample of 300 numbers with lower spread.
2.4 Measures of "spread"
Along with measuring the "middle" of a data sample, it is also important to measure the "spread" of a data sample around this centre. How far are the data values from the middle value? There are several measures of spread that we can calculate from our sample to "estimate" the spread of data in the population. We will use built-in functions in R to calculate these. You can check your lecture notes to refresh your memory of the formula for these summary statistics if you are interested.
sd(subset_A_Control$N_cured_from_100) ## Standard deviation, or "SD"
var(subset_A_Control$N_cured_from_100) ## Variance (SD squared)
range(subset_A_Control$N_cured_from_100) ## Range, simply the minimum and maximum value
Duplicate and edit the code to repeat these commands for Team B.
Should spread be measured in the same units as the raw data?
Not necessarily. The standard deviation is, however, in the same units as the raw data; as a result, variables with larger means often have larger SDs as well.
2.5 Calculating summary statistics across an entire data set
Above, we calculated summary statistics for each subset of the data. However, because our data are "tidy" (i.e., in long format), we can use shortcuts! One such shortcut is the "aggregate" function. Here we can calculate the mean of each combination of Team and Treatment in a single line of code!
## the aggregate function is SUPER handy for summarising information for different combinations of groups (here we use Team and Treatment as the groups)
aggregate(N_cured_from_100 ~ Team + Treatment, data = cancer_data, FUN = mean)
Now, instead of using mean as the function, try other functions, such as: FUN = var, FUN = sum, FUN = min, FUN = length. What do they do?
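As one example, FUN = length counts how many replicates each group contains; based on the study descriptions above, Team A's groups should each show 10 replicates and Team B's should show 3:

```r
## count the number of replicate experiments per group
aggregate(N_cured_from_100 ~ Team + Treatment, data = cancer_data, FUN = length)
```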
Based on all of the summary statistics you have calculated so far, which Research Team should the minister fund? Why?
We'll come back to this question in the Practical 3!
What benefit can you think of for Team A having conducted 10 experiments of 200 individuals each versus one big experiment with 2,000 people?
The expected mean values of each small versus one big experiment are the same, but we can estimate the uncertainty by having many small experiments.
2.6 Writing your own functions
While the base package (i.e., R itself) and other publicly available packages come with loads of useful functions (like mean and var), anyone can write an R function.
Below is an example of how you might create your own mean function.
The mean (\(\bar{x}\)) of a set of numbers is their sum (\(\sum x_i\)) divided by the number of samples (\(n\)):
\(\bar{x} = \frac{\sum x_i}{n}\)
We can use the built-in functions sum(x) and length(x) (assuming "x" is a vector of numbers).
Run this code, but remember that when you see a "{" you must submit ALL the code up to and including the closing brace "}" - otherwise you leave R hanging, waiting for you to finish the command!:
myMean <- function(x){ ## Define the function and its input; an arbitrary set of numbers ('x')
Total <- sum(x) ## Sum the elements of 'x' and store as 'Total'
SampleSize <- length(x) ## Count how many elements there are in 'x'
Mean <- Total/SampleSize ## Calculate the mean of 'x' by dividing the sum by the number of samples
return(Mean) ## Tell R what to 'spit out'
} ## End the definition of the function
The function, myMean, is now part of the "R armoury"; you created it! If you look over in the top right panel of your RStudio and scroll down, you will see a "Functions" sub-panel, with myMean listed there.
You can now use myMean in the same way as any other function:
NB: any function you define disappears when you end your session.
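For example, applied to one of the subsets created earlier, myMean should give exactly the same answer as the built-in function:

```r
myMean(subset_A_Control$N_cured_from_100) ## your function
mean(subset_A_Control$N_cured_from_100) ## the built-in function - the two should agree
```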
Here is an example of something useful that is not built in: A function to calculate another metric of central tendency of data, the "mode". This is simply the value within the data that occurs most frequently:
mode <- function(x){ ## Name the function ('mode') and its input: an arbitrary vector ('x')
x_table<-table(x) ## use the handy "table" function to calculate how many times each value is represented
as.numeric(names(which(x_table==max(x_table)))) ## extract the value(s) that are most numerous
} ## End function definition
Having defined it, you can now apply that function to calculate the mode of any set of numbers.
mode(subset_A_Control$N_cured_from_100) ## Use the function to calculate the mode of 'subset_A_Control'
Why does mode return ten values for "subset_A_Control"?
All of the values are unique! None are represented more than once!
What are the relative advantages of mean, median, and mode as statistics describing "central tendency"?
You could talk all day about this! Here's a question to get you started:
What happens to each if you have a large outlier, e.g. c(2, 1.9, 2.1, 2, 2.3, 1000)?
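You can check in R; note how differently the three statistics respond to a single extreme value (the mode calculation uses the mode function defined above):

```r
outlier_data <- c(2, 1.9, 2.1, 2, 2.3, 1000)
mean(outlier_data) ## dragged far from the "clump" by the outlier (~168.4)
median(outlier_data) ## barely affected: 2.05
mode(outlier_data) ## the most frequent value: 2
```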
2.7 Simulating data based on "population" parameters
The Australian Bureau of Statistics (ABS) reported that Australians aged 18-44 years have an average height of 1.704 m (SD = 0.066). Height is a normally (Gaussian) distributed variable. Height is a useful variable to think about because its frequency distribution is very well characterised. Indeed, human height is one of those rare variables where the number of recorded observations might approach the total number of individuals in the population.
Let's first simulate our "population". At the time, the ABS reported ~6 million people in our focal age group (18-44 yo). We can simulate a data sample that mimics the true distribution of height using the R function rnorm. This function simulates a normal distribution of our liking and takes random draws from it. It is possible to simulate distributions of other kinds in R too - we will examine these other types of distributions later in the course. For now, let's simulate our normal distribution with a mean of 1.704 and a standard deviation of 0.066:
## Simulate 6,000,000 data points with mean = 1.704 and sd = 0.066
SimData <- rnorm(n = 6000000, mean = 1.704, sd = 0.066)
Now, let's treat this simulated distribution as the TRUE population distribution (i.e., let's pretend that we actually measured all 6 million people ourselves!). We can now explore how taking different samples from a population influences our understanding of the population from which we sampled, and particularly our understanding of the "true" central value and spread of height.
Conveniently, R has a function called sample that takes random samples from a variable. Copy and run this code to take a sample of 10 individuals from our population (called "SimData"):
## "sample" 10 objects from that population, making sure to sample any object only once (replace = FALSE)
SimData10 <- sample(SimData, 10, replace = FALSE)
SimData10 ## Look at the heights of your 10 sampled individuals
## Calculate basic, descriptive statistics
mean(SimData10)
sd(SimData10)
Did your sample accurately represent the population?
You probably answered "no" to this question - we expect that any random sample of a population might not reflect the actual values of that population, and this is particularly true when our sample is small.
When you call the sample function (or the rnorm function), you get a different set of numbers each time. Therefore, if you re-run the code above, you will get slightly different means and standard deviations for each sample. Give it a go.
Start a new chunk of code to record your estimates. You should create two new objects, one for Means and one for SampleSize - use the c() function to concatenate the data for each. For example, I took five different samples of 10 individuals and estimated the mean of each:
Means <- c(1.7275, 1.6854, 1.8012, 1.7101, 1.6833)
SampleSize <- c(10, 10, 10, 10, 10)
Now do the same, and then take samples for a few larger sample sizes (e.g., 100, 1000, 10000). To be clear, repeat what you did for a sample size of 10, but now with sample sizes of 100, 1000, and 10000.
Hint: your "SampleSize" vector should look like this (five of each sample size):
SampleSize <- c(10, 10, 10, 10, 10, 100, 100, 100, 100, 100, 1000, 1000, 1000, 1000, 1000, 10000, 10000, 10000, 10000, 10000)
Your "Means" vector will have the same number of values (20 values).
Once you have taken all samples and calculated the 20 means, plot them against sample size:
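One simple way to make this plot (assuming your 20 sample means are stored in "Means" and the matching sample sizes in "SampleSize", as built above):

```r
## plot the sample means against sample size; a log x-axis spreads
## the sample sizes (10, 100, 1000, 10000) out evenly
plot(Means ~ SampleSize, log = "x")
abline(h = 1.704) ## add the true population mean for reference
```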
Each object within the "Means" vector is a sample mean. Is mean(Means) close to 1.704?
As the number of sample means increases, the distribution of sample means approximates a normal (Gaussian) distribution with the mean approximately equal to the mean of the population, \(\mu\), from which the samples were taken.
What is this property of a set of sample means?
CENTRAL LIMIT THEOREM!
Now let R do the work of storing the means:
Ns <- c(10, 100, 1000, 10000) ## Sample sizes for R to work with
Nreps <- 50 ## Number of reps per sample size (i.e. number of sample means calculated per sample size. Above we used 5, but we can use 50 here)
Results <- matrix(nrow = Nreps, ncol = length(Ns)) ## Set up a blank matrix to store results
colnames(Results) <- Ns
## this is called a "for loop" - we will explain!!!
for(i in 1:length(Ns)){ ## For each sample size
for(k in 1:Nreps){ ## For each replicate
Results[k, i] <- mean(sample(SimData, Ns[i], replace = FALSE)) ## Record the sample mean in the kth row based on the ith sample size
}
}
Results ## Look at the results
Each "sample" provides an "estimate" of the population mean (in this case, we know that the true mean is 1.704); the sample means are stored in "Results".
NB: we could have done exactly the same for the standard deviation - each "sample" provides an estimate of the true population standard deviation of 0.066.
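If you want to see this for yourself, the same for loop works with sd in place of mean (this sketch assumes "SimData", "Ns", and "Nreps" still exist from above):

```r
ResultsSD <- matrix(nrow = Nreps, ncol = length(Ns)) ## blank matrix for the SDs
colnames(ResultsSD) <- Ns
for(i in 1:length(Ns)){ ## For each sample size
  for(k in 1:Nreps){ ## For each replicate
    ResultsSD[k, i] <- sd(sample(SimData, Ns[i], replace = FALSE)) ## Record the sample SD
  }
}
colMeans(ResultsSD) ## each column mean should sit close to the true SD of 0.066
```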
Now, let's plot the results. We need the data in a slightly different format (which requires the "tidyverse" package; you may need to run install.packages("tidyverse")):
library(tidyverse) ## Load the library
## use the wonderful "pivot_longer" function
ResultsLong<-pivot_longer(as.data.frame(Results), everything(), names_to = "SampleSize", values_to = "SampleMean", names_transform = as.numeric)
## now plot the results
with(ResultsLong, plot(SampleMean ~ jitter(SampleSize, 0.5), col = SampleSize)) ## Plot the sample means by the sample sizes
abline(h = 1.704) ## Add the true population mean
NB: jitter(SampleSize, 0.5) jitters the values along the x-axis a little bit - otherwise all the points would fall on top of one another.
What do you notice about the estimates for the four different sample sizes?
They are ALL fairly close to the 1.704!
The sample means with n = 10000 are always very close to mean = 1.704.
The sample means with n = 10 bounce around a lot more, but are fairly close to mean = 1.704.
If you didn't know the mean and standard deviation of this "population", in which "estimate" of the mean would you be most confident? Hopefully the one with more samples - it is clearly more accurate!
Again: the distribution of sample means approximates a normal (Gaussian) distribution with the mean approximately equal to the mean of the population. We can plot this property...
The following code adds Gaussian density lines to the plots you have just seen (we also flip the x- and y-axes because it's just a little bit easier to visualise). Don't worry too much about what the code is doing, just think about what it produces:
## Prepare the density plots
D10 <- density(Results[, 1], adjust = 2)
D100 <- density(Results[, 2], adjust = 2)
D1000 <- density(Results[, 3], adjust = 2)
## Scale the density plots so that they are visible
D10$y <- D10$y * 2 + 10
D100$y <- D100$y * 2 + 100
D1000$y <- D1000$y * 2 + 1000
## now make the plot
with(ResultsLong, plot(SampleMean, SampleSize, col = SampleSize,
ylim = c(0, max(D1000$y)),
xlim = c(min(D10$x), max(D10$x))))
abline(v = 1.704)
lines(D10, col = 10)
lines(D100, col = 100)
lines(D1000, col = 1000)
You should see from these density plots that they approximate a normal distribution. The peaks of the distributions sit approximately on the population mean (Central Limit Theorem!!!), and the curve is much flatter (more spread out) for samples of 10 than for samples of 1000.
2.8 Standard error of the mean
The "standard error" quantifies the "level of uncertainty", or accuracy, of an "estimate" (e.g., the "estimated" population mean). The standard error of the mean can be thought of like this: it is the spread of possible values for the true population mean. It is related to the standard deviation, which measures the spread of the data itself, but it is not the same thing. The standard error is calculated by dividing the sample standard deviation by the square root of the sample size (\(n\)): \(SE = \frac{s}{\sqrt{n}}\), where \(s\) is the sample standard deviation.
The figure in the previous section showed that the "spread" of sample means was large when the sample size was 10, but small when the sample size was 10000.
Let's demonstrate this by sampling from our height data:
SimData10 <- sample(SimData, 10, replace = FALSE) ## Sample 10 individuals
SimData100 <- sample(SimData, 100, replace = FALSE) ## Sample 100 individuals
SimData1000 <- sample(SimData, 1000, replace = FALSE) ## Sample 1000 individuals
sd(SimData10) ## Calculate the standard deviation for SimData10
sd(SimData100) ## Calculate the standard deviation for SimData100
sd(SimData1000) ## Calculate the standard deviation for SimData1000
## The standard error is calculated as follows (there is no inbuilt function):
sd(SimData10)/sqrt(length(SimData10)) ## Calculate the standard error for SimData10
sd(SimData100)/sqrt(length(SimData100)) ## Calculate the standard error for SimData100
sd(SimData1000)/sqrt(length(SimData1000)) ## Calculate the standard error for SimData1000
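Because there is no built-in standard-error function, this is a natural place to write your own (following section 2.6). Here is one possible version, named "se" (the name is our choice, not a standard R function):

```r
se <- function(x){ ## 'x' is a vector of numbers
  sd(x)/sqrt(length(x)) ## standard deviation divided by the square root of n
}
se(SimData10) ## should match the value calculated "by hand" above
se(SimData1000)
```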
What happened to the standard deviation as the number of samples increased?
Not much hopefully! Remember that the sample standard deviation is simply an "estimate" of the true standard deviation of the population (although it should become more accurate with more samples).
What happened to the standard error as the number of samples increased?
The standard error decreased!
As you obtain more samples, you become more confident that the "sample" mean is close to the "population" mean. The range of "possible" population mean values decreases.
What is the difference between standard error and standard deviation?
Standard deviation simply describes the spread of the data.
Standard error describes your confidence in an estimate.
We will come back to this idea in the next practical, and use it to test questions such as, "which research team should we fund"?
2.9 List of useful functions
Here are a few useful functions that you learned today.
Measures of central tendency: mean; median
Measures of spread: sd; var
Other descriptive statistics: max; min; range; length; quantile
Apply functions to groups: aggregate
Binding by column and row: cbind; rbind
Convert objects: tidyverse::pivot_longer
Sampling: sample; rnorm