Up at 5AM: The 5AM Solutions Blog

Big Data and the (Near) Future of Genomics

Posted on Tue, Nov 11, 2014 @ 03:00 PM

At ASHG 2014, I had the pleasure of listening to talks by Ajay Royyuru of IBM’s Computational Biology Center and David Glazer of Google during the ‘Separating Signal from Noise’ symposium.

Ajay Royyuru, the Director of the Computational Biology Center at IBM, explored the need to integrate the knowledge behind clinical diagnostics based on specific mutations, investigational panels such as those provided by FoundationOne, and research portals that catalog disease–variant associations, such as MyCancerGenome and cBioPortal. This is where IBM’s Watson, the supercomputer previously acclaimed for winning Jeopardy!, comes in.

IBM Watson’s Precision Oncology initiative is a cognitive computing system that summarizes information for patients, recommends treatment options, and matches patients to clinical trials. Given a genomic sample from a patient (either variants from a sequencing experiment or gene expression values), Watson uses its constantly growing knowledge base of journal publications, disease and variant databases, drug repositories, and clinical trials to build conceptual disease models and recommend actionable insights. A textual report can be generated on the fly. Prototypes are available at the New York Genome Center and the Memorial Sloan Kettering Cancer Center, among others, to further validate this evidence-based approach.
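To make the matching step concrete, here is a minimal sketch in Python of what looking up a patient’s variants against a curated evidence base might look like. To be clear, this is an illustration only: the knowledge-base entries, the evidence labels, and the trial placeholder are invented for this example and have nothing to do with Watson’s actual data or internals.

```python
# Toy knowledge base mapping variants to evidence. All entries below are
# hypothetical placeholders for illustration, not real clinical guidance;
# the trial ID is deliberately left as a placeholder.
knowledge_base = {
    "BRAF V600E": {"evidence": "approved therapy", "option": "targeted inhibitor"},
    "EGFR T790M": {"evidence": "clinical trial", "option": "trial NCT-XXXX"},
}

def recommend(patient_variants):
    """Return knowledge-base entries matching a patient's variants."""
    return {v: knowledge_base[v] for v in patient_variants if v in knowledge_base}

# One variant matches the toy knowledge base, the other does not.
report = recommend(["BRAF V600E", "TP53 R175H"])
print(report)
```

The real system, of course, does far more than a dictionary lookup: it reads unstructured literature and weighs conflicting evidence, which is where the cognitive computing claim lies.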

David Glazer, the Engineering Director at Google, highlighted that the “Big” part of “Big Data”, i.e. high-throughput genomics, is not the hard part anymore. Given Google’s processing power (~12 billion Google searches per month, 20 petabytes processed per day), he proposed new ways of applying Google’s algorithms to the genomics space. For example, Google’s BigQuery (based on Dremel) was used to explore genetic variation in the 1000 Genomes dataset. Exploring subsets of variants, stratifying populations, and computing breakdowns took a matter of minutes to hours (compared to days), shortening the cycle between question and answer. Applying PCA (Principal Component Analysis) to all variants in the dataset revealed clustering by population origin; the computation took 2 hours on 480 processors. In the last part of his talk, David put forward the Global Alliance for Genomics and Health, which focuses on standardizing data sharing, including benchmarking processes and API standards.
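For readers who haven’t seen it before, the PCA step itself fits in a few lines. Below is a minimal numpy sketch on a tiny made-up genotype matrix (rows are individuals, columns are variant sites, entries are alternate-allele counts); the two “populations” are invented for illustration, whereas the 1000 Genomes analysis ran over millions of real variants.

```python
import numpy as np

# Toy genotype matrix: 6 individuals x 6 variant sites, entries are
# alternate-allele counts (0, 1, or 2). The first three individuals and
# the last three differ systematically, mimicking two populations.
genotypes = np.array([
    [0, 1, 0, 2, 2, 2],
    [1, 0, 0, 2, 1, 2],
    [0, 0, 1, 2, 2, 1],
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 1, 0],
    [2, 2, 0, 0, 0, 1],
], dtype=float)

# Center each variant column, then take the SVD of the centered matrix;
# projecting onto the top singular vectors gives the principal components.
centered = genotypes - genotypes.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = U[:, :2] * S[:2]  # coordinates of each individual on PC1 and PC2

# Individuals from the same "population" land on the same side of PC1.
print(np.sign(pcs[:, 0]))
```

The hard part at scale is not the linear algebra but marshalling the variant data; that is exactly the gap that BigQuery-style infrastructure closes.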

None of what was proposed by IBM or Google is new. Plenty of groups have worked on the same problems and have similar solutions; performing PCA on a genomic dataset is something a first-year graduate student is introduced to. Where these tech giants have an edge, however, is in their computational power. Be it a supercomputer that can “process data like a human” or Google’s ability to scale to hundreds of processors, digesting large amounts of scattered, disorganized information to come up with actionable, personalized insights is a step in the right direction.

Last week, my colleague Luke Ward did a post on his takeaway about decoding the exome from ASHG 2014. You can read it here. Other recent articles on big data and clinical trials can be found here and here.

Infographic: © Copyright IBM Corporation 2014.

Tags: clinical trials, Big Data, Genomics

