Monthly Archives: June 2012

As you may have already gathered: I love R. It’s invaluable as a tool for data analysis. People complain that it is a weird programming language and, well, it is. But it’s more than just a language – it’s a whole environment for doing data-centric research.

The greatest advantage of R, however, is that it’s a research community. It’s a network of some of the best statisticians, computer scientists and data analysts in the world who want to share their knowledge with others. Because of this there is a vast array of add-on packages available, enabling you to do just about anything to your data from within R. There’s also a whole load of free books, tutorials and documentation on methods for R users.

Open source software like R is an important step towards the ideal of open science – opening up the whole scientific process so that everyone can scrutinise it and learn from it. But there’s more to open science than just using open source software and publishing in open access journals. To reach a truly open science we need to make the data, methods and results available and accessible for everyone. We also need to communicate the research to people in other fields and the general public.

A number of recent developments extend R from being an environment for data analysis to one for sharing code and research results – a platform for open science.

The R function Sweave has been around for a while and can be used to produce PDF documents containing R code and outputs. This makes it relatively simple to produce documents like Jari Oksanen’s brilliant tutorial for ordination methods using the vegan package. The functionality of Sweave has recently been extended by the knitr package.

Sweave and knitr produce great looking PDFs, but they’re a bit of a leap for those of us who aren’t familiar with LaTeX. It would also be great to publish to a more interactive and distributable format, say HTML. The recently released markdown package for R does just that. Markdown produces great looking HTML from a simple plain text file. RStudio, who make the nice IDE of the same name for R, have even launched a new website for R users to share these HTML scripts with one another!

These tools offer a great way of producing highly readable and flexible versions of R scripts and sharing them with the world. This should make it even easier to share data analysis methods and move us towards more open science. It will be interesting to see how these tools tie in with the more traditional methods of science publishing. Will appendices containing all the code and plots needed to carry out the analysis start appearing with published articles? Will it herald a move towards open notebook science for data scientists?

As I mentioned above, open science isn’t just about sharing more of the scientific process among scientists; it’s also about sharing the process, the results and the implications of our research with the wider public.

Slideshow presentations are the way most scientists are used to sharing their research and, despite how many terrible slideshows we’ve all seen, they can be a great format for telling a story and illustrating it with graphics. PowerPoint is still the presentation software of choice for most scientists, but other programs are gaining ground. HTML5 is starting to look like a great option for creating presentations and it has greater flexibility than PowerPoint when it comes to embedding different types of media, particularly interactive charts.

And of course you can now create HTML5 presentations from R markdown files. The slidify package looks like a really good way of doing this. Here’s a nice example of an HTML5 presentation created from R (hit F11 to make it full screen). And here’s an interactive data visualisation in HTML5 to whet your appetite.

on randomness: ecology

The kind of ecology that I do and am interested in involves dealing with a lot of randomness*. There’s all the random noise in the observational data I work with, and there are the random methods I use to find interesting patterns in it.

Recently I’ve been thinking a lot about the fundamental basis of this randomness: whether it’s just stochasticity arising from a deterministic system, or whether there’s actually some source of true randomness. There’s quite a lot I’d like to cover, so I’ll string this out over a number of posts.
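As a toy illustration of how a completely deterministic rule can produce something that looks random, here is a small sketch in Python using the logistic map, a textbook population model. The growth rate `r = 4`, the starting values and the helper name `logistic_trajectory` are all arbitrary choices made for this illustration, not anything taken from real data.

```python
def logistic_trajectory(x0, r=4.0, steps=100):
    """Iterate the deterministic rule x -> r * x * (1 - x).

    Returns the starting value followed by every value visited.
    """
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting points differ by one part in ten billion.
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Absolute gap between the two runs at each step.
gaps = [abs(p - q) for p, q in zip(a, b)]

print("gap at the start:", gaps[0])
print("largest gap over the run:", max(gaps))
```

There is no random number generator anywhere in that code, and running it twice from the same starting value gives exactly the same sequence. Yet a tiny difference in the starting point grows until the two runs look unrelated, which is the sort of behaviour that can pass for noise in observed data.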

As I see it, ecology is top of the pile of natural sciences when it comes to the amount of randomness we have to deal with. You could order the sciences like in this XKCD:

except that instead of purity I’d rank them by the amount of randomness – or the noise-to-signal ratio – inherent in the systems studied in each discipline. I think of all the randomness in ecology as bubbling up from mathematics and physics through layers of increasing chemical and biological complexity into the big old mess that is ecological data.

Our aim as ecologists is to pick out the rules that drive these complex systems. We need simple rules so that we can understand them and apply them to other situations. So we spend most of our time sifting through noisy data looking for patterns.

The usual approach is to come up with a model that explains as much as possible of what we observe, whilst making the fewest assumptions. We can work out what we would expect from the model, and anything that is left unexplained (so long as there’s no clear pattern in it) we call noise.
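To make that concrete, here is a minimal sketch in Python (standard library only). The data are simulated rather than real observations: a made-up straight-line signal with Gaussian noise added on top. The helper name `fit_line` and the particular intercept, slope and noise level are assumptions chosen purely for illustration. The code fits a line by ordinary least squares and treats whatever the line doesn’t explain as noise.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Simulated observations: a known signal plus random noise.
rng = random.Random(42)  # fixed seed so the run is repeatable
xs = [float(i) for i in range(50)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

intercept, slope = fit_line(xs, ys)

# What the model expects, and what it leaves unexplained.
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Share of the variation in y that the model accounts for.
mean_y = sum(ys) / len(ys)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1.0 - ss_res / ss_tot

print("intercept:", intercept, "slope:", slope)
print("R squared:", r_squared)
```

In this simulation we know the residuals really are noise, because we put the noise in ourselves. With field data we only ever see the residuals, not where they came from.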

This noise that we shove to one side bothers me. Is it just stuff that we haven’t got round to explaining yet? Or is some of it inexplicable, real randomness that we won’t ever, can’t ever, pin down?

The next post in this series will be on determinism, then I’ll do stuff on quantum uncertainty and potential sources of true randomness in mathematics. I’ll try and tie it all up with a post full of philosophical ramblings about whether any of this is important.

*I’m using the term randomness to mean noise or observed apparent randomness; I’ll go with stochasticity and ‘true randomness’ for the other interpretations.