Last week, I attended a conference that focused on the big data needs of the pharmaceutical and biomedical industries. “Big data” -- like cloud computing or bioinformatics -- is one of those buzzwords that can be hard to define clearly. There’s confusion when people talk about it because they’re not always talking about the same thing.
A 2012 Gartner report (requires payment to download) describes it as data having one or more of the following characteristics: high volume, high velocity, and high variety.
Several of the conference speakers referred to these characteristics, so it’s clear that the Gartner definition has traction. The Wikipedia page lists veracity as another characteristic, and lately Gartner seems to have adopted it as well. Several speakers who emphasized ‘dirtiness’ as a major issue with real-world data also pointed to veracity as one of the defining characteristics.
To be honest, I can’t think of many projects of any scale in life sciences and healthcare that don’t have to deal with issues around the size of the data, the different sources of the data, the fact that new data gets generated all the time, the complex structure of the data, or the issue that some data can be missing or inaccurate. And this has been true for over a decade.
So what’s the fuss? I think it’s largely marketing, but I do see two trends: one is technical, and the other is more scientific.
The first is that data management needs are outstripping the relational database. Relational databases from companies like Oracle and Sybase, and from open source projects like MySQL and PostgreSQL, have served many projects extremely well for a long time. They have clearly defined structures, mathematical and logical underpinnings, and they come with a language, SQL, that is used to modify and query data. Oracle is worth hundreds of billions of dollars because relational databases have been broadly adopted across myriad industries.
But relational databases don’t manage massive amounts of data very well, and they perform even worse with data that have structures that change over time. Changing a database schema can be a lot of work, especially when there’s already a lot of data in it. Want to have a major impact on a software development project that uses a relational database? Just say something like, “Oh, I just realized that this project can be shared across multiple labs.”
To address this, there’s been a lot of recent activity around data management tools that improve performance (document databases such as MongoDB) and that deal with variety and change more seamlessly (RDF-based databases such as Openlink Virtuoso or Systap Bigdata). This is not to say that new tools are the solution to problems with large datasets, but they should enable us to ask questions we might not be able to ask now.
Which leads to my second point. As data sets grow, deriving useful knowledge from them becomes harder. This might seem counterintuitive, since scientific research often involves adding more data to be able to draw conclusions confidently. But think about a simple model where a data set is a giant table, or spreadsheet. The rows could be something like patients in a hospital. If you wanted to identify patients who got good care at the hospital by looking at quality measures across all the patients, you’d no doubt prefer to look at thousands of patients rather than twenty.
But the table also contains columns. And for patients in a hospital, the number of columns can be quite large. Of course you’ll have columns for name and date of birth, but you’ll also have dozens, or even hundreds, of columns for every time that patient visited the hospital. The columns would include things like lab results and prescriptions, which can change over time and be given multiple times. I don’t think it’s hard to imagine that such a table could have thousands of columns. And if the recent trend of wearable devices continues, millions of columns might be added to the table from those continuous streams of data.
Now think about answering a question like, “Which characteristics of a patient’s visit would allow me to predict whether that patient would return to the hospital within a week?” If you had a table with a couple of columns, you could easily look at the correlation between each of those columns and whether a patient came back. Each correlation would have a p-value associated with it, which indicates how likely you would be to see a correlation at least that strong by chance alone if the column had no real relationship to readmission. You would then be able to evaluate how seriously to take the results.
When you have thousands of columns, you can do the same thing, but you run into an issue called the multiple-testing problem. In a set of 1,000 patients, you might find that patients who were taking low-dose aspirin for heart health were more likely to be back at the hospital within a week. But you tested thousands of other medications, so you’re getting a lot of shots at it. At the usual p-value cutoff of 0.05, about one in twenty columns with no real effect will still look significant. It’s like bingo: somebody has to win. The comic strip at the end of this post provides a great explanation.
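To make the multiple-testing problem concrete, here is a minimal simulation sketch in Python. Everything in it is made up for illustration: the 1,000 patients, the 2,000 medications, the 20% readmission rate, and the 30% chance of taking any given drug are assumptions, and the function names are hypothetical. Readmission is assigned at random, so no medication has a real effect, yet a simple two-proportion test still flags some of them.

```python
import math
import random


def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    if n_a == 0 or n_b == 0:
        return 1.0
    pooled = (hits_a + hits_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        return 1.0
    z = (hits_a / n_a - hits_b / n_b) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(n_patients=1000, n_medications=2000, alpha=0.05, seed=0):
    """Count medications that look significant even though none has an effect."""
    rng = random.Random(seed)
    # Readmission is random, so it is unrelated to every medication column.
    readmitted = [rng.random() < 0.2 for _ in range(n_patients)]
    total_readmitted = sum(readmitted)
    false_positives = 0
    for _ in range(n_medications):
        # Each medication column is also random and independent of readmission.
        takes_drug = [rng.random() < 0.3 for _ in range(n_patients)]
        n_on = sum(takes_drug)
        n_off = n_patients - n_on
        hits_on = sum(1 for t, r in zip(takes_drug, readmitted) if t and r)
        hits_off = total_readmitted - hits_on
        if two_proportion_p_value(hits_on, n_on, hits_off, n_off) < alpha:
            false_positives += 1
    return false_positives


if __name__ == "__main__":
    print(count_false_positives(), "of 2000 medications look significant")
```

With these defaults, roughly 5% of the made-up medications should come out as "significant," which is exactly the bingo effect: test enough columns and some of them will win by luck.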
There are ways to address this problem. You can do p-value correction or use different methods entirely, such as Bayesian statistics, which incorporate prior knowledge when drawing inferences. You can also use data from other sources to avoid having to look at thousands of columns, but that requires more variety of data.
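As a rough sketch of what p-value correction can look like, the Python below implements two common approaches: Bonferroni, which divides the cutoff by the number of tests, and Benjamini-Hochberg, which controls the expected share of false discoveries instead. These are my choice of examples rather than anything the conference speakers prescribed, and the p-values in the usage section are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests whose p-value clears alpha divided by the number of tests."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag tests while controlling the expected false discovery rate."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


if __name__ == "__main__":
    p_values = [0.001, 0.045, 0.025, 0.012, 0.5]  # invented for illustration
    print("Bonferroni:        ", bonferroni(p_values))
    print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni is the more conservative of the two, so it tends to flag fewer columns. Which trade-off is right depends on how costly a false lead is compared with a missed one.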
Another way is to just increase the number of patients so that real effects still reach significance after correction. For example, genetic studies, where millions of genetic variants are being assessed, routinely look at 10,000 or 100,000 patients to get results. This costs more, obviously, since you need to generate more data, or pay for access to other data sets, but the benefit is a higher level of confidence in the results.
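To get a feel for why stricter cutoffs push studies toward more patients, here is a back-of-the-envelope sketch using a standard normal-approximation formula for comparing two proportions. The readmission rates (20% versus 25%), the 80% power target, and the one-million-test Bonferroni cutoff are all assumptions chosen for illustration, and `patients_per_group` is a hypothetical helper, not part of any library.

```python
import math
from statistics import NormalDist


def patients_per_group(p1, p2, alpha, power=0.8):
    """Approximate patients needed per group to detect p1 vs. p2 (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for the significance level
    z_power = NormalDist().inv_cdf(power)          # cutoff for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


if __name__ == "__main__":
    # One test at the usual cutoff versus a cutoff corrected for a million tests.
    single_test = patients_per_group(0.20, 0.25, alpha=0.05)
    million_tests = patients_per_group(0.20, 0.25, alpha=0.05 / 1_000_000)
    print("Per group, one test:         ", single_test)
    print("Per group, one million tests:", million_tests)
```

The same effect needs several times as many patients once the cutoff is corrected for a million tests, and smaller effects need far more than that.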
Those are my takeaways from spending two days talking big data with a lot of smart people. I’m really interested to hear from others about their take on big data - please comment on this post or email me at email@example.com.
Comic Credit: Significant by xkcd under a Creative Commons 2.5 license.