Practical 3 Introduction to hypothesis testing

One of the most common classes of questions asked in biology (including ecology and conservation science) is whether the mean of a sample differs from either a set value or the mean of some other sample. For example, does giving patients a specific drug cause a change in the value of a response variable of interest (e.g., survival)?

We can easily calculate the difference of the mean of our sample from a fixed value (e.g., zero) or between two means (e.g., treatment and control groups). But, the mean of our sample is just an estimate of the true mean of the population, and we know that there will be some (hopefully small!) error in our estimate.

To determine whether or not our mean differs from another value or the mean of another group, we need to take that error into account. In this prac, we are focused on reinforcing your understanding of the principles of hypothesis testing in science. We will use a specific analysis, the t-test, to really dissect the basis of testing biological hypotheses using any form of statistical test.

In this practical you will:

  1. explore the idea of a "test statistic" through a detailed study of the t-test;
  2. use the t-test to clarify the concepts of "null hypothesis" and "alternative hypothesis";
  3. use the t-test to clarify the concepts of "Type I error" and "Type II error", and;
  4. apply your knowledge to this question: "How can one judge the relative merits of separate studies that address the same question?" You will be able to finally decide whether to recommend that Team A or Team B should receive the additional $15M in funding to develop their cancer treatment drug, Cureallaria!

3.1 Hypothesis testing

Fisher calculating the probability of Bristol getting 8/8: WikiMedia Commons

A tea party
In an afternoon tea break with Ronald Fisher, phycologist Muriel Bristol claimed to be able to tell whether the tea or the milk was added first to a cup.

Fisher hypothesised that Bristol couldn't tell the difference, and he wanted to test his hypothesis.

If Fisher handed Bristol a cup of tea and asked her: "did the tea or milk go in first?" then Bristol would have a 50% chance of guessing correctly. If Fisher made two cups, there would still be a high chance that Bristol could guess correctly twice.

Fisher decided to test Bristol with eight cups, four of each variety, in a random order. One could then ask: what is the probability of Bristol correctly guessing the four that had her preferred milk-first tea? Fisher calculated this probability as 1 in 70 (4/8 * 3/7 * 2/6 * 1/5 = 24/1680 = 1/70 = 1.43%). While it is possible to guess all eight correctly, it is unlikely.
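
We can check Fisher's arithmetic in R; this little sketch simply reproduces the numbers above using base R's choose() function.

4/8 * 3/7 * 2/6 * 1/5   ## sequential guesses without replacement = 1/70 (about 1.43%)
1 / choose(8, 4)        ## equivalently: 1 of the 70 possible ways of choosing 4 cups from 8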

Needless to say, Bristol got all eight correct!

Fisher was convinced, and concluded that Bristol could tell the difference - he rejected his original ("null") hypothesis.

NB: This story summarises the hypothesis testing framework; the "tea" from the party is different to the "t" from the t-test. Fisher went on to become one of the most influential bio-statisticians of all time, among other things establishing the fundamentals of experimental design and ANOVA. We'll never know what you would be studying right now if Muriel Bristol hadn't been so (rightly!) confident in her ability to taste scalded milk!

3.1.1 Null and alternative hypotheses

Statistical hypothesis testing in biology (and other fields!) depends on two cornerstones. First, that we can state a null hypothesis. Typically, we frame the null hypothesis as there being "no difference" (e.g., the means of two treatments are the same). When we reject our null hypothesis, we interpret this as support in favour of an alternative hypothesis.

The alternative hypothesis is mutually exclusive with the null, stating that a population parameter is not the same as some hypothesised value (e.g., the means of two treatments differ). The alternative hypothesis could be that the population parameter is greater or smaller (one-sided) than the hypothesised value, or that the population parameter is different (either bigger or smaller: two-sided) to the hypothesised value.

What were the null and alternative hypotheses in Fisher's test?
Null hypothesis: Bristol cannot tell the difference.
Alternative hypothesis: Bristol can tell the difference.

Fisher rejected the null hypothesis that Bristol could not tell the difference.
Is the alternative hypothesis definitely true?

Never!
It is still possible, albeit unlikely, that Bristol guessed correctly 8 times. While a well-designed experiment can allow us to draw conclusions about the probability of observing the data we have if the null hypothesis was true, we cannot prove that the alternative is true.
We can think about statistical hypothesis testing as Fisher laid out - what is the probability of observing our sample data (8/8 correct guesses) if the null hypothesis is true? Here, if Bristol could not taste the difference between teas (i.e., the null hypothesis is true), the probability of correctly identifying all 8 cups is 1.43%

Recalling the cure for cancer studies that you analysed in Practical 2, what null hypothesis do you think that Team A had?
What null hypothesis has the Minister tasked you with addressing?

Within each study (Team A or Team B), researchers are testing the null hypothesis that mean survival is the same for the Control and Treatment groups, that is, the treatment has no effect on survival.
The Minister is asking you to consider which research team should be further funded. Your null hypothesis then is that the difference in mean survival between the Control and Treatment groups for Team A is the same as the difference in mean survival between the Control and Treatment groups for Team B. Only if we reject that null hypothesis do we have any basis for then asking "Why?" and delving into which team's preliminary results we trust to provide a better estimate of the average effect of the drug on survival.



3.2 Using test statistics

Once you have a hypothesis, a critical ingredient for testing it is a test statistic. A test statistic follows a known probability distribution (e.g., Gaussian, t-distribution, exponential). That is, test statistics have a special property: we know what kinds of values to expect of them when the null hypothesis is true. Because we know the probability distribution of a test statistic, we also know which values would be surprising if the null hypothesis were true - and observing such a surprising value gives us grounds to reject the null hypothesis.

This is all a bit abstract but you have encountered test statistics before, including the t-statistic (used in a t-test) and F-statistic (used in ANOVA).

Here is the calculation for the t-statistic for a one-sample test:

\(t = \frac{m - \mu}{s /\sqrt n}\)

\(m\) is the sample mean;
\(\mu\) is the "population" mean (a hypothetical value that you're testing against);
\(s\) is the sample standard deviation;
\(n\) is the sample size.
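
To make the formula concrete, here is a minimal sketch in R using made-up numbers (the vector x and the hypothesised mean mu0 below are hypothetical, not data from this practical):

x <- c(5.1, 4.8, 5.6, 5.3, 4.9)   ## a hypothetical sample
mu0 <- 5                          ## the hypothesised "population" mean we test against
t.by.hand <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))  ## m, mu, s and n from the formula
t.by.hand
t.test(x, mu = mu0)$statistic     ## R's inbuilt t-test gives the same value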

What value do you think t is expected to take when the null hypothesis is true?
Zero! The statistic is calculated by subtracting the value we are testing against from our population parameter estimate (i.e., the value of the thing we care about). The null hypothesis is typically "no difference"; here, if our null hypothesis was true, and the values were the same, the difference is equal to zero.
When the null hypothesis is exactly met, then the denominator of this equation does not matter - zero over anything will equal zero. However, our estimate of the test statistic is unlikely to ever be exactly zero, even when the null hypothesis is true.

If your sample mean is larger than the population mean you are comparing to, would your t-statistic be positive or negative?
Positive! You are subtracting the hypothetical population mean from your observed sample mean.

What is the relationship between sample size and the t-statistic?
Remember, we need to consider uncertainty due to small sample size in addition to just looking at the difference between mean values. Smaller sample size increases the denominator of the t-statistic (the standard error) and thus decreases the t-statistic. Therefore, smaller sample size indicates that the observed difference in mean values may not be so meaningful.
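
A quick numerical illustration (made-up numbers): for a fixed standard deviation, a larger sample gives a smaller standard error, and therefore a larger t for the same difference in means.

s <- 0.2          ## a hypothetical standard deviation
s / sqrt(5)       ## SE with n = 5
s / sqrt(50)      ## SE with n = 50: ten times the sample, roughly a third of the SE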



3.3 Estimating the test statistic

Let's estimate the t-statistic for the cancer drug trials. First, we need to input the data:

## Enter the data from Team A
ControlA <- c(0.2, 0.4, 0.5, 0.38, 0.6, 0.2, 0.8, 0.4, 0.4, 0.2)
TreatmentA <- c(0.6, 0.8, 0.7, 0.8, 0.7, 0.6, 0.3, 0.6, 0.5, 0.9)
teamA <- cbind.data.frame(ControlA, TreatmentA)

## Enter the data from Team B
ControlB <- c(0.3, 0.6, 0.2)
TreatmentB <- c(0.99, 0.7, 0.6)
teamB <- cbind.data.frame(ControlB, TreatmentB)

cancer.study <- cbind.data.frame(teamA, teamB)
What happened when you ran the code to create "cancer.study"?
It didn't work. Teams A and B had different numbers of replicate studies, and so different numbers of rows! R won't just add "NA" for the "missing" data from Team B.
Our solution to this problem will be to analyse them separately, for now.


Calculate the difference between the Treatment and Control means for each team's sample:

EffectA <- mean(teamA$TreatmentA) - mean(teamA$ControlA)  
## Difference in control and treatment means for team A

EffectB <- mean(teamB$TreatmentB) - mean(teamB$ControlB)   
## Difference in control and treatment means for team B
Which team (A or B) observed the biggest average effect of the treatment?
B: 0.397, versus the 0.242 observed by A


The value of the test statistic (t) depends not only on this difference in mean, but also on the standard error (the denominator in the equation: standard deviation divided by the square root of the sample size). Recollect what you have learned about the influence of sample size on the spread of estimates - as sample size increases the spread decreases.

The standard error (SE) tells us how confident we are in the estimate of the parameter. In this case, if our Treatment mean is different to the Control mean, but the variation in Treatment values is large (high SE), then we are unlikely to conclude that the treatment is actually different to the Control (we can't reject the null hypothesis).

To think about this further, let's plot the cancer study data. Some of this code will be familiar (you ran it in Practical 2), but other bits are more advanced:

teamA.mean <- apply(teamA, 2, mean) ## First, calculate the mean of each variable (treatment vs control)
teamA.sd <- apply(teamA, 2, sd) ## Then calculate the standard deviation of each variable
teamA.n<-nrow(teamA) ## calculate the sample size for teamA (number of rows)
teamA.se <- teamA.sd/(sqrt(teamA.n)) ## Calculate the se by dividing the sd by the square root of the sample size 
## Note that there is no inbuilt R function to calculate SE. 


## There is no inbuilt function we can call to plot mean +/- SE, so we need to write it out
n.se <- 2 ## height of error bars (as number of standard errors) - you can change this! See how far apart the means are.

## Set up plot initially without any points or lines (type = "n") to define x and y axis limits. 
## We will create custom x axis ticks (xaxt="n").
plot(x = c(0.5, 2.5),  ## The range of values along the x-axis
     y = range(c(teamA.mean + n.se * teamA.se, teamA.mean - n.se * teamA.se)), ## The range of values along the y-axis
     type = "n",     ## Do not plot any points or lines
     xlab = "Group", ## x-axis label
     ylab = "Survival (Mean +/- SE)",  ## y-axis label
     xaxt = "n") ## Suppress the default x-axis so we can add custom tick labels below

## Add x-axis ticks and their labels
axis(1, 1:2, labels = c("Control", "Treatment"))

# Plot the means on the plot
points(1:2, teamA.mean, pch = 19)

## Add error bars as "arrows" with heads at each end (code=3) and flat ends (angle=90) 
for(i in 1:2){
  arrows(x0 = i, ## The x value for the ith error bar
         y0 = teamA.mean[i] + n.se * teamA.se[i],  ## The upper value for the ith error bar
         y1 = teamA.mean[i] - n.se * teamA.se[i],  ## The lower value for the ith error bar
         angle = 90,
         code = 3)
}

Duplicate the code and change teamA to teamB to get the plot for teamB.
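
If you prefer not to copy and paste, one option (a sketch only, not required for the practical) is to wrap the plotting steps in a function; it assumes the data frame has the Control column first and the Treatment column second, as teamA and teamB do.

plot.team.means <- function(team.data, n.se = 2){
  team.mean <- apply(team.data, 2, mean)                       ## mean of each group
  team.se <- apply(team.data, 2, sd) / sqrt(nrow(team.data))   ## SE of each group
  plot(x = c(0.5, 2.5),
       y = range(c(team.mean + n.se * team.se, team.mean - n.se * team.se)),
       type = "n", xlab = "Group", ylab = "Survival (Mean +/- SE)", xaxt = "n")
  axis(1, 1:2, labels = c("Control", "Treatment"))
  points(1:2, team.mean, pch = 19)
  for(i in 1:2){
    arrows(x0 = i,
           y0 = team.mean[i] + n.se * team.se[i],
           y1 = team.mean[i] - n.se * team.se[i],
           angle = 90, code = 3)
  }
}

plot.team.means(teamA)   ## reproduces the plot above
plot.team.means(teamB)   ## the team B version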

Looking at the confidence we have in the mean of the Control versus Treatment, do you think that a t-test might reject or accept the null hypothesis? Does your conclusion depend on which team you are considering?
For each team, the Treatment mean does seem distinct from the Control mean (i.e., they are more than 2 SE apart), so it seems that a t-test might reject the null hypothesis.


Because t is calculated by dividing a difference by a standard error, the units of t are standard errors! We can therefore think of t values as the number of standard errors away from zero (no difference) that our observed difference lies. If t is extreme, then we can reject the null hypothesis.

The t-distribution for a range of degrees of freedom. The rather complicated function is given in the upper left corner.

Even though the t distribution changes depending on the degrees of freedom (as shown above), the bulk of the t distribution does not change that much with the degrees of freedom. Indeed, when the null hypothesis is true most values of t fall between -2 and +2. Therefore, as a "rule of thumb", when our sample mean is <2 standard errors away from the value we are comparing it to, we probably cannot reject the null hypothesis. In other words, our t statistic will likely fall within the "do not reject" zone of the t distribution.
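
If you would like to see this for yourself, here is a small simulation sketch (not part of the practical): we draw many samples for which the null hypothesis really is true, calculate t each time, and ask how often it lands between -2 and +2.

set.seed(1)                                 ## make the simulation reproducible
n <- 10                                     ## sample size for each simulated sample
t.vals <- replicate(10000, {
  x <- rnorm(n, mean = 0, sd = 1)           ## a sample for which the null (mean = 0) is true
  (mean(x) - 0) / (sd(x) / sqrt(n))         ## the one-sample t-statistic
})
mean(abs(t.vals) < 2)                       ## roughly 0.92 here (df = 9); closer to 0.95 for larger samples
hist(t.vals, breaks = 50)                   ## the simulated t-distribution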

Let's go ahead and check our "guess" about whether we can reject our null hypothesis of no effect for this cancer drug study.


3.4 One sample t-test

One way to test the null hypothesis (Treatment = Control) would be to treat the mean of the Control group as a fixed reference value, and test whether the Treatment mean equals that fixed reference value. This approach is known as a "one-sample" t-test because we have only one sample of interest.

t.stat <- (EffectA)/teamA.se[2]  ## Convert mean of the treatment group to "t-statistic"
t.stat                           ## Look at the value of the t-statistic

OK! Based on our rule-of-thumb, do we think that this is going to fall within the bulk of the t-distribution, or in the 5% tails? Is this value (in units of SE) consistent with what you expected from the plot?

To determine the probability of observing a t-value at least as extreme as the one that we have observed, we need to compare our observed value to the t-distribution for our degrees of freedom. R can do that for us.

t.test(TreatmentA, mu = mean(ControlA))  ## Use R's inbuilt function to implement this one-sample t-test. 
## Here, "mu" is a fixed point against which we test whether the observed TreatmentA mean is the same.
Should we reject our null hypothesis?
Yes!
Our t-statistic is larger than we would expect to see if our null hypothesis was true. Indeed, the probability of observing a t-value at least as extreme as 4.459783 (in either direction) is 0.0016 (i.e., only 0.16%!).
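
You can also calculate that probability yourself from the t-distribution using pt(); this sketch assumes t.stat and TreatmentA from the code above are still in your workspace.

deg.free <- length(TreatmentA) - 1                        ## degrees of freedom for the one-sample test
2 * pt(abs(t.stat), df = deg.free, lower.tail = FALSE)    ## probability of a t at least this extreme (two-sided)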



3.5 Two sample t-test

The previous analysis has an obvious problem: the mean of the Control group is not really a fixed constant. Rather, it is itself an estimate subject to sampling error. A more appropriate analysis would consider the difference between the sample means of the two groups as the test statistic subject to sampling error. That is, we need to estimate the SE of that difference, not just of TreatmentA mean. Fortunately, there is an easy formula for calculating a standard error for that statistic:

\(SE=\sqrt{SE_1^2 + SE_2^2}\)

seJoint <- sqrt(teamA.se[2]^2 + teamA.se[1]^2 )  ## Standard error of the difference between the two means
Tval <- (EffectA)/seJoint     ## Convert difference of means to "T"
Tval                          ## Look at the value of T
t.test(TreatmentA, ControlA)  ## Same thing using R's generic "t.test" program
Now that you are appropriately accounting for the uncertainty in both sample estimates (Control and Treatment), do you accept or reject the null hypothesis?
Reject!
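
A side note on R's t.test(): by default it runs the Welch version of the two-sample test, which does not assume the two groups have equal variances (its standard error is the seJoint we calculated above). If you want the classic pooled-variance t-test, set var.equal = TRUE:

t.test(TreatmentA, ControlA)                    ## default: Welch test, variances not assumed equal
t.test(TreatmentA, ControlA, var.equal = TRUE)  ## classic pooled-variance two-sample t-test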


Assumptions of the t-test

As with ALL statistical analysis, we make some assumptions when interpreting the results of a t-test. If our data violate these assumptions, our conclusions might be wrong (more on that below). We will not consider these assumptions further here; in future practicals in the course, you will learn how to check for violation of model assumptions.

  1. The "residuals" (the "unexplained" part of the data, i.e., the "background noise") are Gaussian ("normally distributed"). In future practicals you will learn how to test this assumption, which is common to most "parametric" test statistics. When the data are very non-normal in distribution (e.g., skewed, or containing extreme outlier values, or multi-modal), the results of the t-test can be unreliable. However, the t-test is quite robust to violations of this assumption.
  2. The variance of the residuals is equal everywhere in the data. That is, we assume that different groups have the same variance. Again, although very large differences in variance can make our test unreliable (prone to type I or type II errors; see below), t-tests can be quite robust to moderate differences in variance. Notably, the chance of making an error in your conclusions from a t-test is greater if the sample sizes of your groups are uneven. Therefore, it's always important to try to design your data collection to attain similar sample sizes for groups that you want to compare.
  3. The separate measurements are made on independent random samples of the population. This assumption must be considered at the experimental design stage, before data are collected. Violations of this assumption impact on the generality of our conclusions. If our sample is not random, then repeating the analysis with a different sample might give a different result, and our conclusions might not be general to the population. In this context "independent" means that the individual samples are not related to one another in some way (e.g., multiple measurements taken on the same person will be similar to one another). We will discuss independence more later in the course, and show you tools to deal with correlated samples.

3.6 Type I and type II errors

Imagine it's the dying seconds of the AFL grand final. Your team, the Hawthorn Hawks, lead by two points. The ball is in their defensive goal square. One of the opposition players is possibly fouled by being pushed in the back. The umpire can do one of two things: pay a free kick, or not pay a free kick.

There are four outcomes:

  1. the opposition player was pushed in the back AND the umpire pays a free kick (boo, but good work umpy, correct decision);

  2. the opposition player was not pushed in the back AND the umpire pays a free kick (boo, get your eyes checked umpy!);

  3. the opposition player was pushed in the back AND the umpire does not pay a free kick (YES!! You're a legend umpy - you'll help the Hawks win even though you were wrong);

  4. the opposition player was not pushed in the back AND the umpire does not pay a free kick (YES! Good work umpy).

The point is this: we can't make one decision and it be right all of the time!

When we, as bio-statisticians, accept or reject a null hypothesis, we are making a judgement. As professionals, we should generally make the right decision (just like the umpire - too many wrong decisions and they'd be out of a job!). However, sometimes, we might fail to reject the null hypothesis when it is false (false negative; type II error), or reject the null hypothesis when it is true (false positive; type I error).

Experimental design and approaches to data analysis aim to find the appropriate balance between the risk of type I versus type II errors. Since scientists are often 'trying' to show an effect, type I errors are typically considered to be the more serious offence - but this isn't always the case. Here is a particularly horrendous example of a Type II error:

Rofecoxib (Vioxx, Merck)
Rofecoxib was a non-steroidal anti-inflammatory drug developed by Merck. It was a booming drug in the late '90s, but there was some suggestion that it may cause heart attacks.

By December 2000, 21 placebo-controlled trials had been performed. Together they suggested that Rofecoxib was 2.18 times more likely than placebos to cause serious heart problems or death. However, the confidence interval for this relative risk ranged from 0.93 (i.e., slightly less than the placebo) to 5.81 (almost six times more likely), and the test of whether the relative risk was greater than one gave P = 0.07. Hence, Merck accepted the null hypothesis of no difference between treatment and control groups.

Larger experiments that followed indeed found Rofecoxib to be almost twice as deadly as placebos. By the time it was taken off the market in September 2004, 20 million Americans had taken the drug; 88,000 had serious heart attacks; 38,000 had died.

Ross et al 2009


If Bristol couldn't tell the difference between tea first or milk first, which type of error did Fisher make?
Type I error. He rejected the null hypothesis, even though it was true.


In relation to hypothesis testing, errors are, simply, when the wrong decision is made. The probability of making the wrong decision is referred to as an "error rate". What affects error rates?


3.7 Statistical power and error rates

We will explore the effects of the key factors, effect size (the magnitude of the difference we are trying to detect) and sample size, on type I and type II error rates using a "ShinyApp". ShinyApps are built using R! But we don't need to write or re-run any code ourselves. Many people have built ShinyApps, and they are particularly useful for exploring parameter spaces to learn statistics (as well as other things that depend on probability and sampling, like population genetics!)

Open the link below in a new browser:

https://mathr.math.appstate.edu/shiny/Power2/

Graph produced from a Shiny app written by Alan T. Arnholt. The red curve is the distribution when our null hypothesis is true; the blue curve is the distribution when the alternative hypothesis is true. The significance level of a t-test (\(\alpha = 0.05\); red shaded area) is the type I error rate. The probability of making a type II error is referred to as \(\beta\) (the light green shaded area). Power (\(1 - \beta\); shown in the blue shaded area) is the probability of correctly rejecting the null hypothesis. In other words, power is the probability of detecting a difference when there is indeed a difference.


Select the settings shown in the figure above:

  • alternative hypothesis is that the means are not equal;
  • t-test;
  • sample size = 10;
  • Significance = 0.05 - this alpha, \(\alpha\), value is defined prior to the analysis, and we will not vary it here.

Now let's set the other parameter values to investigate null hypotheses about the height of Australians aged 18-44 years. As you saw in Practical 2, the average height of this group is 1.704m and the standard deviation is 0.066m. Enter these as the "True mean (\(\mu_1\))" and "Population standard deviation (\(\sigma\))".

Now, set the "Hypothesised mean (\(\mu_0\))".

Is power (\(1-\beta\); blue shaded area) larger if the hypothesised mean is 1.670m or 1.785m?
It is larger if the hypothesised mean is 1.785m.
The power of the experiment to correctly reject the null hypothesis (i.e., avoid a type II error) depends on the effect size - how large the difference is between the true parameter value and the hypothesised (null) parameter value.

If you estimate average height to be 1.670m (set the "Hypothesised mean" to be 1.670), do you have higher power (\(1-\beta\); blue shaded area), when the "population standard deviation" is 0.400 or 0.066?
0.066
We again lost power to correctly reject the null hypothesis when the spread of the data was higher. Remember how we calculate the t-statistic. When the spread of the data is larger, the denominator of the t-statistic increases, and the value of t decreases.

If you estimate the average height to be 1.670m (set the "Hypothesised mean" to be 1.670) and the "population standard deviation" to 0.066, do you have greater or lesser power (\(1-\beta\); blue shaded area) when your sample size was 10 or 100?
sample of 100 has higher power
100 does not sound like a large sample size, but we have an impressive amount of power to detect this 3.4cm difference in average height, with a less than 1% type II error rate. In other words, with these effect and sample sizes, we would almost never fail to reject the null hypothesis (i.e., almost never make a type II error).
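
As a rough cross-check without the app, here is a sketch using the power.t.test() function that you will meet formally in the next section, treating this as a one-sample, two-sided test:

power.t.test(n = 100,                 ## sample size
             delta = 1.704 - 1.670,   ## difference between true and hypothesised means (m)
             sd = 0.066,              ## standard deviation (m)
             sig.level = 0.05,        ## type I error rate (alpha)
             type = "one.sample",
             alternative = "two.sided")  ## power should come out very close to 1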

Are type I and type II error rates both affected by sample size? (Hint: what happens to the sizes of the red and blue shaded areas when you change the sample size?)
Type II errors are affected, but type I errors are not!
The blue area changes, but the red area does not
Remember: we set \(\alpha\) prior to the analysis (and preferably, prior to data collection), so nothing about the experiment size, or observed effect size, can change the type I error rate.



3.8 The power of the t-test

The generic R function power.t.test calculates the power of a t-test for a difference between two means. Let's apply it to Team A's cancer study data.

power.t.test(                               ## Calculate the power of a t-test; whichever argument is set to NULL is the one R solves for
n = length(TreatmentA),                     ## Sample size of each group
delta = mean(TreatmentA) - mean(ControlA),  ## Difference between the means (effect size)
sd = sd(ControlA) ,                         ## Set a common standard deviation
sig.level = 0.05,                           ## Set a Type I error rate
power = NULL,                               ## Set the statistical power (1 - Type II error rate)
type = "two.sample",                        ## Request a T-test comparing means of two groups
alternative = "one.sided"                   ## Alternative to null hypothesis is Treatment > Control
)
What does "power = 0.8584562" mean?
Given an effect size of 0.242, and a sample size of 10 per group, the probability of correctly rejecting the null hypothesis is 0.8584562.


Input the data from Team B into the power.t.test function.

Which team has the higher probability of committing a type II error?
Team B!
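
One way to set up the Team B calculation (a sketch; the arguments mirror the Team A call above, using the Team B vectors entered earlier):

power.t.test(
n = length(TreatmentB),                     ## Team B only had 3 replicates per group
delta = mean(TreatmentB) - mean(ControlB),  ## Team B's observed effect size
sd = sd(ControlB),                          ## Set a common standard deviation
sig.level = 0.05,                           ## Set a Type I error rate
power = NULL,                               ## Leave as NULL so R solves for power
type = "two.sample",                        ## Request a T-test comparing means of two groups
alternative = "one.sided"                   ## Alternative to null hypothesis is Treatment > Control
)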


The following R code produces graphs that relate effect size and sample size to power for the cancer data. The first relates effect size to power, for a given sample size. Let's see how big the effect size needs to be in order to get an acceptably small Type II error probability.

EffectSize <- seq(0.0, 0.3, 0.01)     ## Set various effect sizes
Power <- vector()                     ## Request space to hold the power estimates
length(Power) = length(EffectSize)    ## Allocate space to hold the power estimates
for(i in 1:length(EffectSize)){       ## Loop through the effect sizes
Power[i] <- power.t.test(             ## Record the output at the "ith" place in "Power"
n = 10,                               ## Set the sample size of each group
delta = EffectSize[i],                ## Difference between the means (effect size)
sd = sd(ControlA) ,                   ## Set a common standard deviation
sig.level = 0.05,                     ## Set a Type I error rate
power = NULL,                         ## Set the statistical power (1 - Type II error rate)
type = "two.sample",                  ## Request a T-test comparing means of two groups
alternative = "one.sided")$power      ## Alternative to null hypothesis is Treatment > Control
}                                     ## Output the element called "power", then re-loop
plot(EffectSize, Power, type = "b")   ## Graph power against effect size
What does an effect size of 0.1 mean?
The difference in survival probability between control and treatment groups is 0.1. That is, the drug improves the probability of survival by 0.1 (10 percentage points).

How much power does Team A have to detect an effect size of 0.1?
Given their sample size (n = 10), about a 25% chance. In other words, the probability that Team A would fail to find this effect (type II error) would be 75%.


Now produce a second graph relating sample size to power. We set the effect size to the one "team A" actually obtained.

SampleSize <- seq(2, 30, 2)                 ## Set various sample sizes
Power <- vector()                           ## Request space to hold the power estimates
length(Power) = length(SampleSize)          ## Allocate space to hold the power estimates
for(i in 1:length(SampleSize) ){            ## Loop through the sample sizes
Power[i] <- power.t.test(                   ## Record the output at the "ith" place in "Power"
n = SampleSize[i],                          ## Set the sample size of each group
delta = mean(TreatmentA) - mean(ControlA),  ## Difference between the two means
sd = sd(ControlA) ,                         ## Set a common standard deviation
sig.level = 0.05,                           ## Set a Type I error rate
power = NULL,                               ## Set the statistical power (1 - Type II error rate)
type = "two.sample",                        ## Request a T-test comparing means of two groups
alternative = "one.sided")$power            ## Alternative to null hypothesis is Treatment > Control
}                                           ## Output the element called "power", then re-loop
plot(SampleSize, Power, type = "b")         ## Graph power against sample size
What sample size per group should leave Team A confident of not making a Type II error?
If we consider a low probability of type II error to be \(\beta = 0.05\) (i.e., power of 0.95), then a sample size of 15 or more per group would suffice.

If Team A had found an effect size of 0.397 (the effect size Team B found), what sample size per group should leave Team A confident of not making a Type II error?
Around six per group!! Looks like Team B should have invested a little bit more into sampling!


In practice, we cannot control how large the effect of the drug is - determining how much it affected survival was the purpose of the study! What we can control is the sample size, which reduces the standard error, improves the precision with which we estimate the group means, and so increases our power to detect a given effect. We often use evidence from similar studies, or a priori goals based on cost-benefit considerations, to decide what risk of Type II error we are prepared to take, and use that to determine our sample size.
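
power.t.test() can also be run "in reverse": leave n = NULL and specify the power you want, and R returns the required sample size per group. A sketch for Team A's observed effect size, assuming a target power of 0.95:

power.t.test(
n = NULL,                                   ## Leave as NULL so R solves for the sample size per group
delta = mean(TreatmentA) - mean(ControlA),  ## Effect size observed by Team A
sd = sd(ControlA),                          ## Set a common standard deviation
sig.level = 0.05,                           ## Set a Type I error rate
power = 0.95,                               ## Target power (Type II error rate of 0.05)
type = "two.sample",                        ## Request a T-test comparing means of two groups
alternative = "one.sided"                   ## Alternative to null hypothesis is Treatment > Control
)                                           ## n should round up to about 15 per group, as above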


3.9 List of useful functions

Here are a few useful functions that you learned today.

t-test
t.test; power.t.test