On Tuesday, Will FitzHugh blogged his thoughts about the big data conference that he attended last week, and in the post, he noted that "[a]s data sets grow, deriving useful knowledge from them becomes harder. This might seem counterintuitive, since scientific research often involves adding more data to be able to draw conclusions confidently."
This got me thinking about how researchers and clinicians are already using extremely large datasets, and how that might change when managing the huge amount of readily available data is no longer a hurdle -- as better data tools and methods evolve, and as the cost of DNA sequencing decreases.
- Good-bye Sample Sizes. If we could have all of the available genetic data about, for instance, prematurity, the confidence with which doctors could predict its occurrence would change the way that we approach the statistics behind predicting medical outcomes. Researchers would no longer have to hunt down enough individual cases of a particular condition for their research to be statistically significant. For instance, if millions of preemies' genomes and those of their parents were sequenced, there would be enough data to find the very specific genetic "spelling" of prematurity with a tremendous degree of accuracy.
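To make the sample-size point concrete, here is a rough sketch of how the margin of error on an estimated risk shrinks as a cohort grows (the 10% carrier rate and the cohort sizes are illustrative numbers, not real prematurity data):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of a 95% confidence interval for a proportion p
    estimated from n observations (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Suppose a genetic variant is carried by 10% of the cohort.
p = 0.10
for n in (1_000, 100_000, 10_000_000):
    print(f"n = {n:>10,}: estimate {p:.1%} +/- {margin_of_error(p, n):.4%}")
```

The uncertainty falls as one over the square root of the cohort size, which is why a jump from thousands of cases to millions changes what counts as a statistically detectable effect.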
- Good-bye Regulatory Red Tape. Not all of it, of course, but the potential to bolster clinical trials with mountains of accurately predictive genomic data could make the journey from lab to clinic safer, smoother, and faster. In an interview in advance of the Big Data Leaders Conference in Washington last week, Dr. Eric Perakslis, head of the Center for Biomedical Informatics at Harvard University, cited Thalidomide as an example of a bad drug that, today, would be prevented from going to market because information about adverse effects can be had "within hours rather than months or years."
- Hello, Prevention. Recently, the NY Times reported on the case of a girl with lupus for whom data played a role in her treatment following symptoms that looked like kidney failure. After diagnosing the lupus, her doctor recognized her symptoms as similar to those of other lupus patients who had suffered dangerous blood clots. The doctor consulted the hospital's patient databases, ran basic statistical analyses, and decided to add anti-clotting drugs to the girl's treatment. Now, imagine, if you will, that that young lupus patient's genome -- and those of millions of other lupus sufferers -- had been sequenced and the corresponding incidences of blood clots well documented and easily accessible.
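The kind of quick database check the doctor ran might look like this in miniature: compare the clot rate among similar lupus patients against the rest of the pediatric population. All counts below are invented for illustration, not taken from the actual hospital records or the Times article:

```python
import math

def clot_rate_comparison(clots_lupus, n_lupus, clots_other, n_other):
    """Compare two clot rates with an approximate two-proportion z-test.

    Returns (rate_lupus, rate_other, z), where a large positive z
    suggests clots are genuinely more common in the lupus cohort.
    """
    p1 = clots_lupus / n_lupus
    p2 = clots_other / n_other
    pooled = (clots_lupus + clots_other) / (n_lupus + n_other)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_lupus + 1 / n_other))
    return p1, p2, (p1 - p2) / se

# Hypothetical counts for illustration only.
p1, p2, z = clot_rate_comparison(clots_lupus=14, n_lupus=98,
                                 clots_other=35, n_other=2400)
print(f"clot rate in lupus cohort: {p1:.1%}, elsewhere: {p2:.1%}, z = {z:.1f}")
```

Even this back-of-the-envelope comparison is enough to justify a closer look at preventive anti-clotting treatment; with millions of sequenced genomes attached to such records, the same query could be stratified by genetic profile.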
Of course, none of this is without difficulty or controversy. Recently, MIT researcher Yaniv Erlich showed that de-identified genomic data could be patched together and re-identified. Dr. Erlich is one of the good guys and performed his experiment to show that even when it is surrounded by mountains of other de-identified data, an individual's genome can be unmasked.
The Times article cited above also notes that the doctor's quick thinking caused some not entirely unwarranted hand-wringing at her hospital. Data mining to treat a particular case can be good medicine, done in a patient's best interest. On the other hand, the doctor published her findings in the New England Journal of Medicine, possibly tipping her approach just over the edge of what could be considered research. If it were research, the doctor would have needed the data owners' permission to use their records.
Will FitzHugh is right: deriving useful information from growing datasets is hard. And not only for the reasons that I thought.