I alluded to this in my last entry, but there is another issue that should inform how statistical analysis gets done: the world of science would be a better place if the way people analyzed their data were made available to others when their work is published. I don't mean a prose description in the methods section of the paper. As someone who has tried to reproduce somebody else's results using only that information, I'd say it's pretty inadequate. Think about describing a piece of software in a text paragraph and how hard it would be to rewrite that code from the description alone. Papers about computational methods are often accompanied by source code of some sort, but papers focused on scientific results are much less likely to include it. It's hard to know exactly why this information isn't made available, but plain old laziness, and the fact that papers can get published without doing the extra work, are surely major factors.
GenePattern is one tool that has tried to fill this role by allowing pipelines of analysis modules to be published on its site. As of last check, however, this feature has been used for only four or five papers. Publishing a pipeline is easy if you did your analysis in GenePattern and not so easy otherwise, so I have to assume that not many people are using GenePattern for their primary analysis. Taverna is another tool that could fill this role, although I will leave a more detailed look at it for another post. An alternative is for the Matlab/SAS/S-plus code to be made available when the paper is published. The quality of code would vary widely, it would be in lots of different languages, and it might or might not be under version control at the source institution. For any of these mechanisms to become commonplace, journals would have to require it, and that's clearly not happening now.
So what's to be done? Given the wave of data coming out of genomic technologies, making published methods easily re-runnable on new data sets is going to be critical so that people don't waste a lot of time re-discovering good methods. I have several recommendations:
- Journals and funding agencies should require researchers to make enough information available for others to actually re-run their analyses. That means the code itself, or, for code that is already publicly available, the exact version and parameters used (a minimal sketch of what such a record might look like follows this list).
- Statisticians and informaticians should strive to make their code readable, modular and re-usable.
- Researchers should use standard, already-available methods whenever possible. Is it really worth using a custom method that produces only slightly better results than a standard one? If you do use a custom method, compare its results against the standard method anyway, so people know what the differences are.
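To make the first recommendation a little more concrete, here is a minimal sketch of what a machine-readable record of versions and parameters could look like. This is my own illustration, not anything a journal or GenePattern currently requires; it's written in Python for convenience, and the parameter names and file names are made up.

```python
# A rough sketch of a provenance record: the analysis script writes out the
# parameters, package versions, and input-file checksums it actually used,
# so someone else can re-run the same analysis later. The file names and
# parameter names here are hypothetical, for illustration only.
import hashlib
import json
import platform
import sys
from importlib import metadata


def sha1_of(path):
    """Checksum an input file so readers can verify they have the same data."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def package_versions(packages):
    """Record the installed version of each package the analysis depends on."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions


def write_provenance(params, input_files, packages, out_path="provenance.json"):
    """Dump everything needed to repeat the run, alongside the results."""
    record = {
        "python": sys.version,
        "platform": platform.platform(),
        "parameters": params,                            # thresholds, normalization choices, etc.
        "inputs": {p: sha1_of(p) for p in input_files},  # checksums of the raw data files
        "package_versions": package_versions(packages),
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    return record


if __name__ == "__main__":
    # Hypothetical parameter values; input files are taken from the command line.
    write_provenance(
        params={"normalization": "quantile", "fdr_cutoff": 0.05},
        input_files=sys.argv[1:],
        packages=["numpy"],
    )
```

A record like this, checked into version control next to the analysis code and published with the paper, would go a long way toward letting someone re-run the analysis on a new data set.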
A recent GenomeWeb article indicated that the Public Library of Science is considering some efforts to make software and data more widely available, though to me it sounds like baby steps.
This is not an easy issue, as the field of informatics is always evolving, often in tandem with the laboratory science. The obvious counter to my argument is that every experiment is different and requires new and different analysis techniques. But I would encourage researchers to put egos aside and think about how the world of science can benefit, not just their own careers and research. If more people changed their thinking, everyone's career and research could be enriched.