
When fitting any statistical model to data, what we’re really trying to do is find the ‘correct’ set of parameters. What ‘correct’ means is fairly subjective, but for generalised linear models we commonly use the maximum likelihood estimate – the set of parameters which gives the highest likelihood (the probability of observing the data given the model and those parameters).

Whilst there is a straightforward formula for calculating the maximum likelihood parameter estimates for standard linear models, it’s a bit more complicated for the generalised linear models which we use to analyse things like presence/absence or count data. In these cases, we can calculate the estimates by numerical optimisation – setting loose an algorithm which will quickly find the set of parameters with the highest likelihood. Below is a brief illustration of an optimisation routine at work on some normally distributed data, with R code at the end of the post.

First we’ll generate some fake data from the model y = a + bx, with a = -1 and b = 3, and add some normally-distributed random noise to it:
[Figure: the simulated data, with the true regression line]

Then we set the optimisation routine to work (using R’s optim function) to find the parameters that best fit the data. In the left panel below we look at the whole ‘parameter space’ – all of the different potential combinations of the parameters in which the algorithm is searching for the best estimates. In the background are contour lines showing the actual log-likelihood (it’s easier to deal with the really small numbers involved if we log the likelihood first) and in blue is the path of the optimisation algorithm, starting in the lower right and moving to the highest point. In the right panel are the lines fitted at each step of the algorithm. In each plot, the blue lines get more opaque as they approach their target.

[Figure: left – the path of the optimisation algorithm over the log-likelihood surface; right – the line fitted at each step]

The optimisation routine used here (Nelder-Mead) is slightly different to the one used in R’s glm function, but both give more or less the same result as the standard ordinary least squares approach.

Here’s the R code used to produce these figures:

# set up the data, fixing the random number generator for repeatable results
set.seed(1)

n <- 100
x <- rnorm(n)
y <- -1 + 3 * x + rnorm(n)

# and plot it
plot(y ~ x, type = 'n', cex.lab = 2, cex.axis = 1.3)
abline(-1, 3, lwd = 2, col = 'dark grey')
points(y ~ x, pch = 16)

# ~~~~~~~~~~~~~~~

# define the negative log-likelihood for the model (optim minimises by default),
# with a trace to record the algorithm's path
# (sd is fixed at 1 here, matching the simulated noise)
llik <- function(pars) {
  tr <<- rbind(tr, pars)
  -sum(dnorm(y, pars[1] + pars[2] * x, sd = 1, log = TRUE))
}

tr <- vector('numeric', len = 2)

# set up a grid of parameter values to plot the surface with
res <- 30
a <- seq(-5, 5, len = res)
b <- seq(-5, 5, len = res)
pars <- expand.grid(a, b)

# calculate the log likelihood at each place on the grid
ll <- apply(pars, 1, llik)

# re-set the trace
tr <- vector('numeric', len = 2)

# run the optimisation (optim uses Nelder-Mead by default), starting at a = 3, b = -4
o <- optim(c(3, -4), llik, hessian = TRUE)

# and plot the path of the optimisation routine
par(mfrow = c(1, 2), mar = c(5, 5, 4, 2))

contour(a, b, matrix(-ll, res), cex.axis = 1.3, cex.lab = 2, ylab = 'b',
        xlab = 'a', col = 'dark grey')
for(i in 3:nrow(tr)) lines(tr[i:(i - 1), ], col = rgb(0, 0, 1, i / nrow(tr)),
                           lwd = 3)

plot(y ~ x, type = 'n', cex.lab = 2, cex.axis = 1.3)
for(i in 2:nrow(tr)) abline(tr[i, ], col = rgb(0, 0, 1, i / nrow(tr)),
                            lwd = 2)
points(y ~ x, pch = 16, cex = 1.3)

# here are our parameter estimates
round(o$par, 3)

# and here are the parameter estimates from glm
# (which uses a different algorithm):
round(glm(y ~ x)$coefficients, 3)

# and lm, which uses Ordinary least squares
round(lm(y ~ x)$coefficients, 3)



Randomness is hard to come to terms with. Humans have long sought explanations for all of the uncertainties in life. Typically these random events have been assigned either to the gods or to personal fortune. Even the great mathematician and all-round character Girolamo Cardano, who made some of the earliest advances with probability theory, put much of randomness in dice games down to the players’ luck.

The development of Newtonian physics and causality led inevitably to a deterministic explanation of these apparently random events. Determinism tells us that everything happens for a reason, that every event is caused by something, that there is no true randomness – just stochasticity generated by a set of fundamental rules.

We can find examples of this in mathematics. The decimal expansion of pi, which starts 3.141592…, is infinite and its digits are (apparently) random. Here’s a random walk plotted using the digits of pi:

These numbers look and feel random, as random as anything in nature, but they are far from it. We have a simple set of rules to calculate pi and it always comes out the same!
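
If you fancy drawing something similar yourself, here’s a minimal sketch in R. The digit string is hard-coded and the digit-to-step mapping (up for 5–9, down for 0–4) is my own choice, so it won’t necessarily match the figure above:

# first 50 decimal digits of pi, hard-coded for illustration
digits <- as.integer(strsplit("14159265358979323846264338327950288419716939937510",
                              "")[[1]])

# map each digit to a step: up for 5-9, down for 0-4
steps <- ifelse(digits >= 5, 1, -1)

# the walk is just the cumulative sum of the steps
plot(cumsum(steps), type = 'l', lwd = 3, bty = 'n',
     xlab = 'digit of pi', ylab = 'position')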

This idea of a deterministic cause of apparent randomness is, of course, well acknowledged in ecology. It’s the basis for Bob May’s great work on chaotic behaviour in population dynamics. The core idea of this work is that, given certain conditions, even really simple population models can show apparently random fluctuations (code at the end):

So is this what happens in ecology and everything else: a set of simple rules generating a very complex, but fundamentally non-random, universe?

The next post in this series will take a look at quantum uncertainty and a potential source of true randomness in mathematics.

Here’s some R code for the logistic map:

# the logistic map x[i + 1] = r * x[i] * (1 - x[i]), here with r = 4 (which gives chaotic dynamics)
logisticMap <- function(start = 0.4, n = 100) {
	result <- numeric(n)
	result[1] <- start
	for (i in 2:n) {
		result[i] <- 4 * result[i - 1] * (1 - result[i - 1])
	}
	result
}

plot(logisticMap(), type = 'l', lwd = 3, bty = 'n')


It’s been a while since I wrote a post.

When I started the blog I was planning to get something up at least once a month, but haven’t found the time lately. This blogging lark requires more focus and dedication than I’d anticipated! Anyway, I’m hoping to get some stuff up in the coming month so please stay tuned.

As you may have already gathered: I love R. It’s invaluable as a tool for data analysis. People complain that it is a weird programming language and, well, it is. But it’s more than just a language – it’s a whole environment for doing data-centric research.

The greatest advantage of R, however, is its research community. It’s a network of some of the best statisticians, computer scientists and data analysts in the world who want to share their knowledge with others. Because of this there is a vast array of add-on packages available, enabling you to do just about anything to your data from within R. There’s also a whole load of free books, tutorials and documentation on methods for R users.

Open source software like R is an important step towards the ideal of open science – opening up the whole scientific process so that everyone can scrutinise it and learn from it. But there’s more to open science than just using open source software and publishing in open access journals. To reach a truly open science we need to make the data, methods and results available and accessible for everyone. We also need to communicate the research to people in other fields and the general public.

A number of recent developments extend R from being an environment for data analysis to one for sharing code and research results – a platform for open science.

The R function Sweave has been around for a while and can be used to produce PDF documents containing R code and outputs. This makes it relatively simple to produce documents like Jari Oksanen’s brilliant tutorial for ordination methods using the vegan package. The functionality of Sweave has recently been extended by the knitr package.

Sweave and knitr produce great-looking PDFs, but they’re a bit of a leap for those of us who aren’t familiar with LaTeX. It would also be great to publish to a more interactive and distributable format, say HTML. The recently released markdown package for R does just that: it produces great-looking HTML from a simple plain-text file. RStudio, who make the nice IDE of the same name for R, have even launched a new website for R users to share these HTML scripts with one another!
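
As a rough illustration of the workflow, here’s a minimal sketch using knitr’s knit2html (assuming the knitr and markdown packages are installed; the file name and contents are just made up for the example):

# write a plain-text file that mixes prose with R code chunks...
writeLines(c(
  "# A tiny report",
  "Some text explaining the analysis.",
  "```{r}",
  "summary(cars)",
  "plot(cars)",
  "```"
), 'report.Rmd')

# ...then knit it straight to HTML
library(knitr)
knit2html('report.Rmd')  # produces report.html in the working directory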

These tools offer a great way of creating highly readable and flexible versions of R scripts and sharing them with the world. This should make it even easier to share data analysis methods and move us towards more open science. It will be interesting to see how these tools tie in with the more traditional methods of science publishing. Will appendices containing all the code and plots needed to reproduce the analysis start appearing alongside published articles? Will it herald a move towards open notebook science for data scientists?

As I mentioned above open science isn’t just about sharing more of the scientific process among scientists, it’s also about sharing the process and the results and implications of our research with the wider public.

Slide show presentations are the way most scientists are used to sharing their research, and despite how many terrible slideshows we’ve all seen, they can be a great format for telling a story and illustrating it with graphics. PowerPoint is still the presentation software of choice for most scientists, but other programs are gaining ground. HTML5 is starting to look like a great option for creating presentations, and it has greater flexibility than PowerPoint when it comes to embedding different types of media, particularly interactive charts.

And of course you can now create HTML5 presentations from R markdown files. The slidify package looks like a really good way of doing this. Here’s a nice example of an HTML5 presentation created from R (hit F11 to make it full screen). And here’s an interactive data visualisation in HTML5 to whet your appetite.

on randomness: ecology

The kind of ecology that I do and am interested in involves dealing with a lot of randomness*. There’s all the random noise in the observational data I work with, and the random methods I use to find interesting patterns in it.

Recently I’ve been thinking a lot about the fundamental basis of this randomness: whether it’s just stochasticity arising from a deterministic system, or whether there’s actually some source of true randomness. There’s quite a lot I’d like to cover, so I’ll string this out over a number of posts.

As I see it, ecology is top of the pile of natural sciences when it comes to the amount of randomness we have to deal with. You could line the sciences up like in this XKCD:

except that instead of purity I’d rank them by the amount of randomness – or the noise to signal ratio – inherent in the systems studied in each discipline. I think of all the randomness in ecology as bubbling up from mathematics and physics through layers of increasing chemical and biological complexity into the big old mess that is ecological data.

Our aim as ecologists is to pick out the rules that drive these complex systems. We need simple rules so that we can understand them and apply them to other situations. So we spend most of our time sifting through noisy data looking for patterns.

The usual approach is to come up with a model that explains as much as possible of what we observe, whilst making the fewest assumptions. We can then work out what we would expect from the model, and anything that remains unexplained (so long as there’s no clear pattern in it) we call noise.

This noise that we shove to one side bothers me. Is it just stuff that we haven’t got round to explaining yet? Or is some of it inexplicable, real randomness that we won’t ever, can’t ever, pin down?

The next post in this series will be on determinism, then I’ll do stuff on quantum uncertainty and potential sources of true randomness in mathematics. I’ll try and tie it all up with a post full of philosophical ramblings about whether any of this is important.

*I’m using the term randomness to mean noise, or observed apparent randomness; I’ll go with stochasticity and ‘true randomness’ for the other interpretations.

(see update below!)

This is one of those science stories that the media love: a letter to the journal Current Biology suggesting that methane emissions from large sauropod dinosaurs may have had a large impact on the climate of the Mesozoic era. Or, as a tabloid headline: ‘dinosaur farts caused climate change’.

The authors took estimates of the number of sauropod dinosaurs in the Mesozoic era and an assumption about how much methane each dino produced, and made a back-of-the-envelope calculation of the total methane production from sauropod dinosaurs. They calculated this to be approximately equivalent to the total amount of methane emitted today.

This is a cool idea – that a few species of gut bacteria can have an enormous impact on the entire earth system. The letter raises an interesting question and doesn’t claim to be the be-all and end-all on the subject, but nevertheless I have a couple of issues with this work. Ordinarily I’d grumble and move on with my life, but I was having a day off with nothing better to do, so decided to dig a little deeper.

My first grumble (which I’m sure many people share) is the assumption of how much methane each dinosaur produced.

We can’t get a precise figure for this since we don’t have any way of working out the gut microflora of these dinosaurs. So we have to make an assumption. This is tricky as different species vary drastically in their methane production. Ruminant animals like cows produce loads of methane, whereas non-ruminants produce less. Kangaroos apparently produce very little and who knows about other species groups.

The BBC, along with most of the other major news sources, made a cock-up in their reporting of this story and stated that the authors got their figure by “scaling up the digestive wind of cows”. Not quite: the authors realised that dinosaurs wouldn’t have been ruminants, so they used figures for non-ruminant herbivores. In fact, the figure they used was from an experiment on rabbits and guinea pigs.

In other words they were assuming these sauropods to be the equivalent of 20-tonne guinea pigs roaming the Mesozoic landscape. This is likely to be more realistic than 20-tonne cows, but it isn’t great. What we’re looking for is a large herbivore that’s as similar as possible to a dinosaur. How about a giant tortoise?

This paper, published in 2011, on the methane output of tortoises – giant and not so giant – gives the parameters necessary to make a methane calculation assuming 20-tonne tortoises (weirdly, it’s actually cited in the dino paper, yet they chose to use the guinea pig data – is there a good reason?).

All you need to do is take parameter a and multiply it by the body mass of the animal (in kg) raised to the power b, i.e. litres of methane per day = a * M^b. So for a 20-tonne guinea pig that would be 0.18 * 20,000 ^ 0.97, giving 2,674 litres of methane per day. For a tortoise it would be 0.014 * 20,000 ^ 1.03, or only 376 litres of methane per day. A big difference.
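
Here’s that quick calculation in R (the function name is mine, just for illustration):

# daily methane production (litres) from the allometric relationship a * M^b,
# for a 20-tonne (20,000 kg) animal
methanePerDay <- function(a, b, mass = 20000) a * mass ^ b

methanePerDay(0.18, 0.97)    # with the guinea pig parameters
methanePerDay(0.014, 1.03)   # with the tortoise parameters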

In fact, if we plug this into their formula to calculate the annual mass of methane produced, we get the results in the figure below, based on Wilkinson et al.’s figure but with our extra tortoise-sauropod calculation added.

So this assumption has a massive impact on the results. If sauropod flatulence was more like that of tortoises than guinea pigs then their impact on climate is likely to have been much less. R code to do these calculations and produce this figure is given at the end of the post.

So that’s the first thing that bugged me about this article. The second is the lack of error bars on the figure. In order to come up with these estimates of methane production the authors took a load of ‘best guesses’ at the important parameters, like the total biomass of dinosaurs and how heavy each one was. They also used point estimates for the parameters a and b I mentioned above. All of these parameter estimates are uncertain and we should be able to translate our uncertainty in these parameters through to uncertainty in the final estimate. So that’s what we’ll do here.

To keep things simple we’ll just look at uncertainty in methane production of each animal and assume the estimates for the amount of vegetation and the number of brontosaurs it could support are fine.

We have a good measure of the uncertainty in parameters a and b, since they were calculated along with 95% confidence intervals; upper and lower bounds between which we can be 95% confident the real value lies. The parameter estimates and confidence intervals for ruminant mammals, non-ruminant mammals and tortoises are given in table 1 of the tortoise paper.

In order to feed these uncertainties through the equation and get our uncertainty in the methane estimate, we can use a Monte Carlo simulation. This is basically just defining a probability distribution for each of these parameters, randomly drawing from them and throwing those numbers at the equation. We’ll then get a whole load of different estimates of dino methane production, which will represent our uncertainty in the methane estimate.

Code to do this is at the bottom, and gives the figure below which includes 95% confidence intervals as error bars.

The take-home message from this is that there’s massive uncertainty in how much methane the dinosaurs kicked out. Those bars would be even wider if we’d included uncertainty in the biomass of sauropods too – the estimates for that vary from 80 to 672 tonnes of sauropod per square kilometre!

As an aside, the lead author of the paper in question, Dave Wilkinson, was one of my lecturers during my undergraduate degree at Liverpool John Moores University (that’s right, the second best university in the city of Liverpool). He was a fantastic lecturer, capable of explaining tricky aspects of ecological theory to hordes of hungover undergraduates with tremendous enthusiasm. He also does a cracking impression of a plant trying to escape from a herbivore.

Update 11/05/12:

I sent Dave Wilkinson an email containing this post and asking about the assumptions in the model, here’s a summary of his reply:

…there are good reasons for guessing sauropods and other large herbivorous dinosaurs fall closer to mammals than modern reptiles in their methane output… Ecologically an African savannah feels closer to herds of sauropods than anything reptiles do today. Also the mammal regression equation correctly predicts the (limited – for good reason!) measurements of elephant methane production. This is nearest extant species in size to sauropods (so it’s not really just scaling up from small rodents).

…in my view the strongest argument for using the mammal equation is microbiological. Imagine being a methanogen inside a sauropod and ask yourself if your environment is more similar to mammal or modern reptile...Because of their huge size most physiologists who have thought about these things think that sauropods are inertial homeotherms. So you have a nice warm and constant thermal environment in which to work on all that lovely plant matter. That sounds much closer to mammals.

This seems like a good justification for their mammal assumption, although the giant tortoises in the tortoise methane paper would be homeothermic to some extent too (even if they are much smaller than elephants). Re. error bars:

…we were only asked for that graph ‘for illustrative purposes’ just before publication… So it was never really intended to be part of the ‘hard science’. However given all the uncertainties in our toy model we feel that ‘error bars’ would only give a false sense of accuracy – we have no idea of the size of most of the errors!

As I have said to many journalists the title of our paper starts with the word ‘Could’ not the word ‘Did’.

# function to calculate annual dino methane production (Tg per year)
# defaults reproduce Wilkinson et al.'s calculation
# BM is sauropod biomass density (kg per km^2)

farts <- function(a = 0.18, b = 0.97, BM = 200000) {
	conv <- 0.717 * 10^-3   # conversion factor from litres to kg of methane
	area <- 75 * 10^6       # area of vegetation (km^2)
	days <- 365             # days in a year
	bronto <- 20000         # brontosaurus weight (kg)
	animals <- BM / bronto  # number of brontosaurs per km^2
	area * animals * days * conv * a * bronto^b / 10^9   # Tg per year
}

farts()              # with Wilkinson et al.'s giant guinea pigs
farts(0.014, 1.03)   # with giant tortoises

# figure 1: Wilkinson et al.'s figure with added tortoises

bar <- c(farts(0.014, 1.03), farts(), 75, 200, 550)
names(bar) <- c("Tortoise\nSauropods", "Guinea pig\nSauropods", "Cows", "Pre-industrial", "Modern")
par(mar = c(5, 8, 4, 4))
barplot(bar, horiz = TRUE, las = 1,
	col = c(rep(colours()[385], 2), rep(colours()[403], 3)),
	border = NA, space = 0.6,
	xlim = c(0, 600), xlab = "Methane (Tg per year)")

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Monte Carlo simulation to get error bars

# first an annoying fiddly bit:
# the a and b parameter estimates were obtained by first log transforming the data,
# so in order to draw parameter a from a normal distribution we need to log transform it too.
# we can then get the standard deviation by dividing the 95% interval half-widths by 1.96
# (taking the average of the two half-widths, since there are rounding errors)

# parameter a; for non-ruminant mammals:
log(0.181)	# mean
mean(c(log(0.181) - log(0.144), log(0.227) - log(0.181))) / 1.96	# sd

# for tortoises:
log(0.014)	# mean
mean(c(log(0.014) - log(0.009), log(0.023) - log(0.014))) / 1.96	# sd

# for b we don't need to log transform, just get the standard deviation as above
(0.97 - 0.92) / 1.96	# non-ruminant sd
(1.03 - 0.84) / 1.96	# tortoise sd

# now we can write functions to draw parameter values from these distributions
# (means and sds taken from the calculations above):

# 'a' parameter for guinea pig and tortoise
GPa <- function(n) exp(rnorm(n, -1.7093, 0.1161))
Ta <- function(n) exp(rnorm(n, -4.2687, 0.2394))

# 'b' parameter for guinea pig and tortoise
GPb <- function(n) rnorm(n, 0.97, 0.02551)
Tb <- function(n) rnorm(n, 1.03, 0.0969)

# and run 100,000 randomised simulations through the equation
n <- 100000
GPig <- farts(GPa(n), GPb(n))
Tort <- farts(Ta(n), Tb(n))

# now we can get median values and 95% confidence intervals around our estimates:
GPig_q <- quantile(GPig, c(0.025, 0.5, 0.975))
Tort_q <- quantile(Tort, c(0.025, 0.5, 0.975))

# figure 2: like figure 1 but with error bars

bar <- c(Tort_q[2], GPig_q[2], 75, 200, 550)
names(bar) <- c("Tortoise\nSauropods", "Guinea pig\nSauropods", "Cows", "Pre-industrial", "Modern")
par(mar = c(5, 8, 4, 4))
pts <- barplot(bar, horiz = TRUE, las = 1,
	col = c(rep(colours()[385], 2), rep(colours()[403], 3)),
	border = NA, space = 0.6,
	xlim = c(0, 1000), xlab = "Methane (Tg per year)")

# error bars (95% confidence intervals), with a tick at the median
arrows(Tort_q[1], pts[1], Tort_q[3], pts[1], angle = 90, code = 3, length = 0.1, col = colours()[180])
arrows(GPig_q[1], pts[2], GPig_q[3], pts[2], angle = 90, code = 3, length = 0.1, col = colours()[180])
lines(c(Tort_q[2], Tort_q[2]), c(pts[1] - 0.5, pts[1] + 0.5), col = colours()[180])
lines(c(GPig_q[2], GPig_q[2]), c(pts[2] - 0.5, pts[2] + 0.5), col = colours()[180])


I’ll probably use this as a way of venting stuff that isn’t really related to my day-to-day work. I think it will mostly be about statistics and mathematical biology, since those are the sorts of blogs that I read and the sorts of things I think about. There may well be some other sciency musings in here too, and I’m sure vector-borne disease stuff will make an appearance!

Expect a lot of half-baked ideas.