I enjoyed reading Neil Saunders’
blog post "Has our quest for completeness made things too complicated?" last week. In the blog, Neil advocates that capturing –omic data in a simple feature-probe-value triplet is a much more practical approach to the data integration problem than trying to work with data across a multitude of ontologies and schemas that have been developed for individual data types. Neil’s focus is to identify what’s the same across all these data sets and work with that very simple model rather than spend a lot of effort to come up with a complete model of everything.
So, to you Neil Saunders, mostly I say “Amen, brother! I’m with you.” My physics background also inclines me towards boiling a problem down to its essence – the bare facts upon which complex analyses can be built. As much as I believe this, however, I also believe that ontologies and other meta-data play an essential role in integrative analyses. Let me illustrate…
We were recently working with a client to integrate multiple -omic data sets (gene expression, GWAS, proteomics, etc.). As in Neil’s case, our first step was to extract the data from their individual repositories, ontologies, schemas, etc., and boil these very disparate data sets down to their greatest common divisor (GCD). “Greatest common divisor” is an apt term here: in selecting what went into our simplified model, we looked for what was “common” to all data types and to the analyses that would follow, and for what was “greatest” in the sense that, wherever possible, we wanted processed results rather than raw data. In Neil’s case, the GCD was the feature-probe-value triplet. For us, the GCD was a combination of probe, sample and value. It would take another blog post (or more) to go into this project and its design decisions, but for now I wanted to emphasize the similarity with Neil’s approach and perhaps holler another “Amen!”
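To make that concrete, here is a minimal sketch (in Python; the file names, column names and CSV format are invented for illustration) of what extracting each data type into that probe-sample-value model might look like:

```python
import csv

def to_gcd_records(path, probe_col, sample_col, value_col, source):
    """Flatten one -omic data set into simple probe/sample/value records.

    The column names and the CSV format here are illustrative only --
    each real repository needs its own small extraction step.
    """
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield {
                "source": source,                 # e.g. "expression", "gwas", "proteomics"
                "probe":  row[probe_col],         # platform-specific identifier
                "sample": row[sample_col],        # sample label as given by the source
                "value":  float(row[value_col]),  # processed result, not raw signal
            }

# Hypothetical inputs: one extractor call per data type, all landing in one model.
unified = []
unified += to_gcd_records("expression.csv", "probe_id", "sample_id", "log2_ratio", "expression")
unified += to_gcd_records("gwas.csv",       "snp_id",   "cohort",    "p_value",    "gwas")
unified += to_gcd_records("proteomics.csv", "peptide",  "sample_id", "abundance",  "proteomics")
```

The point is not the code itself, but that each source needs only a tiny, source-specific extraction step before everything lands in the same simple structure.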
But hold on! Before tossing all ontologies onto the scrap pile of failed good ideas (somewhere between the Apple Lisa and Crystal Pepsi, I imagine), let’s take a step back. After extracting the multiple, disparate data sets out of their various schemas and into a single, clean model, what can we do with them? The answer: not much. In our project, although the data was all in the same form, the domains were still very different. How do we relate the value we got from a SNP to one from a transcript? Should data for a sample labeled “breast cancer” be grouped with data from samples labeled “human mammary carcinoma”? The truth is that although we had a unified data model, the data was far from integrated.
So what did finally bring us to an integrated data set? Ontologies, thesauruses, mappings, etc. In our case, we used multiple sets of mapping data to take our primary data values from probes to a common feature: the gene (Entrez ID). We used nomenclatures like SNOMED and MeSH to normalize our samples. Only after leveraging this ontological information could we work with the data in any meaningful way.
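As a rough illustration of what that normalization step involves (the mapping tables below are tiny stand-ins for real platform annotation files and vocabulary services, and the specific IDs are just examples), probes get rewritten to Entrez gene IDs and free-text sample labels to a controlled term:

```python
# Tiny stand-in mapping tables; in practice these come from platform
# annotation files (probe -> Entrez gene) and vocabularies such as MeSH.
probe_to_entrez = {
    "202763_at": "836",     # example expression probe -> CASP3
    "rs1042522": "7157",    # example SNP -> TP53
}
sample_label_to_mesh = {
    "breast cancer": "D001943",            # MeSH: Breast Neoplasms
    "human mammary carcinoma": "D001943",  # same concept, different label
}

def normalize(record):
    """Rewrite a probe/sample/value record onto shared gene and sample vocabularies."""
    gene = probe_to_entrez.get(record["probe"])
    term = sample_label_to_mesh.get(record["sample"].lower())
    if gene is None or term is None:
        return None  # unmapped records need curation, not silent inclusion
    return {**record, "gene": gene, "sample_term": term}

# Two records that looked unrelated in the raw data now share a gene/term space.
print(normalize({"probe": "202763_at", "sample": "Breast Cancer", "value": 1.8}))
print(normalize({"probe": "rs1042522", "sample": "human mammary carcinoma", "value": 3.2e-5}))
```

Only after this step does it make sense to compare a SNP-level value with a transcript-level one, because both now hang off the same gene identifier.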
It didn’t stop there. Once we had the data mapped to common feature and sample terminologies, we used ontologies like GO to form and test hypotheses about biological function. Our plan going forward is to leverage disease ontologies, pathway information and other meta-data to go beyond simple lists of differential features towards true biological understanding. Although coming up with a good ontological model of the entities and relationships relevant in biology is hard and complex, it is still a good way to capture our biological knowledge in a form that can be applied directly in analyses.
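Here is a minimal sketch of the kind of hypothesis test this enables, assuming the data have already been mapped to Entrez genes as above: given a list of differential genes, a GO term’s gene set, and the measured background, a one-sided hypergeometric test asks whether the term is over-represented (the gene IDs below are placeholders, not results from our project):

```python
from scipy.stats import hypergeom

def go_term_enrichment(hits, term_genes, background):
    """One-sided hypergeometric test: are the differential genes ('hits')
    over-represented in a GO term's gene set, relative to the measured background?"""
    hits, term_genes, background = set(hits), set(term_genes), set(background)
    k = len(hits & term_genes & background)  # differential genes annotated to the term
    M = len(background)                      # all genes measured
    n = len(term_genes & background)         # term's genes among those measured
    N = len(hits & background)               # differential genes among those measured
    return hypergeom.sf(k - 1, M, n, N)      # P(X >= k)

# Placeholder Entrez IDs purely to show the call; real runs use GO annotation files.
measured = {str(i) for i in range(1, 1001)}
p = go_term_enrichment(
    hits={"836", "100", "672"},
    term_genes={"836", "672", "581", "596"},
    background=measured,
)
print(f"enrichment p-value: {p:.3g}")
```

Repeating that test across GO terms (with multiple-testing correction) is what turns a flat list of differential features into statements about biological function.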
Finally, lest we forget: although dealing with a different schema/ontology for each type of data is annoying, it is far better than the alternative of having no such schema/ontology in place. This was painfully clear in our project when we compared the relative uniformity of gene expression data in repositories such as GEO with the state of proteomic and even GWAS data.
To sum up, I agree, Neil – you’re on to something. Getting your different data sets into a simple, consistent model is the way to go. We shouldn’t try to build a complex schema/ontology to record all things for all data. However, once you’ve got the data in this simple form and are ready to move on to the analysis, I think you’ll find the ontologies to be indispensable.