Up at 5AM: The 5AM Solutions Blog

Done, Done, Done, and Done

Posted on Tue, Dec 15, 2009 @ 01:01 PM

In the agile world, the definition of done tends to center around backlog items and the iteration boundary. The difficulty I see with such definitions is that they are not granular enough for individual developers. Development work proceeds commit by commit, not user story by user story. Our best practice is for a developer to commit at least once a day, so we don't lose work to vagaries such as hard drive failures or accidental deletion. This has led us to multiple definitions that break down work from the user story to each commit. At 5AM, code can be "Done" in four senses of increasing rigor: (1) committing to a private branch, (2) committing to the mainline development branch, (3) resolving a subtask, and (4) resolving a backlog item. Here's what each definition entails:

Private branch commits: Commits to private branches can be made at any time, without restriction. The code doesn't even have to compile. This permits daily checkpointing and keeps us from losing work.

Mainline development commits: Commits here affect other team members, so we have several requirements. Every commit must be tied to a tracker item. Peer review is required, no matter how minor the change. Code must pass continuous integration (CI), which brings in things such as coding conventions and the regression suite (and the sombrero). These requirements ensure traceability, that CI always passes, and that at least two eyes have seen every piece of code.
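The tracker-item requirement can be enforced mechanically. Below is a minimal sketch of a `commit-msg` Git hook in Python; the `PROJ-123`-style ticket pattern is an invented example, not 5AM's actual convention.

```python
import re
import sys

# Hypothetical ticket pattern, e.g. "PROJ-123"; adjust to your tracker's IDs.
TRACKER_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

def has_tracker_reference(message: str) -> bool:
    """Return True if the commit message cites at least one tracker item."""
    return bool(TRACKER_PATTERN.search(message))

if __name__ == "__main__" and len(sys.argv) > 1:
    # Git invokes a commit-msg hook with the path to the message file.
    with open(sys.argv[1]) as f:
        if not has_tracker_reference(f.read()):
            sys.stderr.write("Commit rejected: no tracker item in message.\n")
            sys.exit(1)
```

Installed as `.git/hooks/commit-msg`, a check like this makes traceability automatic rather than a matter of reviewer vigilance.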

Resolving a subtask: Individual tracker items are the lowest level of granularity we typically talk about at daily standups. Resolving a subtask requires not only committed code, but also high-quality unit tests, removal of FIXMEs from the code, and capture of known limitations or escapes as additional subtasks. We don't measure velocity here, but we have found that the team gets a good sense of progress by watching subtasks move to resolved.

Resolving backlog items: This is where most scrum definitions of done center, and where the team measures velocity. Our definition is similar to others'. Beyond resolving all subtasks, resolving a backlog item requires that:
  • Functional and non-functional requirements have been met.
  • Integration tests exist.
  • Escaped or deferred functionality is captured.
  • In totality, the code is deployment-ready.

We're done, done, done, and done.

Mammography Screening - Don't Panic!

Posted on Mon, Dec 14, 2009 @ 01:01 PM

Surely you saw the newspaper articles about the recent government guidelines on breast cancer screening using mammography. There were breathless articles and letters to the editor complaining about how mammograms had saved lives and asking how a heartless government agency could say we shouldn't continue the practice. Congressional hearings were held with "debate split along party lines," so you know they really got down to the science of it.

The main idea that caused so much consternation was that women aged 40-49 should not routinely undergo mammograms to look for signs of breast cancer, reversing what previous guidelines had recommended. But let's look at what these new guidelines really say:
  • "The USPSTF recommends biennial screening mammography for women aged 50 to 74 years"
  • "The decision to start regular, biennial screening mammography before the age of 50 years should be an individual one and take patient context into account, including the patient's values regarding specific benefits and harms"

So it doesn't say that women in their 40s should not get mammograms. It says they should weigh the risks and benefits themselves (and with their doctor, obviously, although I wish it had said that explicitly) and make their own decision.

So why not get mammograms earlier? The term 'screening' is key here. A screening test is a test given when there's no prior evidence of, or specific risk for, a condition. A cholesterol test, for instance, is a screening test that looks for evidence of heart disease risk.

You have to remember that by its definition a screening test is given to a large number of people, only a small fraction of whom have the condition being tested for. So even if such a test is relatively accurate, there will be a large number of false positive results. In the case of mammography, a false positive (and I hesitate to use the word 'positive' here) is an abnormal result when in fact the patient does not have cancer. There's a useful statistic to quantify this called positive predictive value (PPV). PPV is the fraction of positive results that actually have the condition being tested for. For women aged 40-49, mammography has a PPV of 2-4%. That means only a small percentage of women in that age group who have an abnormal mammogram actually have cancer.
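The arithmetic behind a PPV that low is easy to check with Bayes' rule. In the sketch below, the prevalence, sensitivity, and specificity values are illustrative assumptions for a screening-age population, not figures from the guidelines:

```python
def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Positive predictive value: P(has condition | positive test), via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative values only: ~0.3% prevalence, 80% sensitivity, 90% specificity.
print(f"PPV = {ppv(0.003, 0.80, 0.90):.1%}")
```

With numbers in that ballpark the PPV comes out to a few percent, consistent with the 2-4% cited above: when almost everyone tested is healthy, even a modest false positive rate swamps the true positives.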

The critics of these guidelines have said this is fine since we want to catch as many cancers as early as possible. But you have to take into account that no test is without risk, and that more invasive procedures, such as biopsies, are done when a mammogram is abnormal. The mammogram and any subsequent procedures also have a monetary cost.

What the guidelines say is that women under 50 should get mammograms if they want to, or if they and their doctor feel they are at elevated risk of breast cancer for other reasons. If one has a family history of breast cancer, or a genetic risk factor, those would be good reasons to have earlier mammograms, in my non-medical opinion.

So this is a case of reasonable scientific conclusions being misinterpreted. However, I do think more research is needed on how many cases of cancer would be missed if mammograms were only done for women aged 40-49 who had some other risk factor. That would give women a more concrete sense of the risk they would run by not having mammograms before age 50.

The Case for Complexity

Posted on Mon, Dec 07, 2009 @ 01:00 PM

I enjoyed reading Neil Saunders's blog post "Has our quest for completeness made things too complicated?" last week. In the post, Neil argues that capturing –omic data in a simple feature-probe-value triplet is a much more practical approach to the data integration problem than trying to work with data across the multitude of ontologies and schemas that have been developed for individual data types. Neil's focus is to identify what's the same across all these data sets and work with that very simple model rather than spend a lot of effort trying to come up with a complete model of everything.

So, to you Neil Saunders, mostly I say "Amen, brother! I'm with you." My physics background also draws me towards boiling a problem down to its essence – the bare facts upon which complex analyses can be built. As much as I believe this, however, I also believe that ontologies and other meta-data play an essential role in integrative analyses. Let me illustrate…

We were recently working with a client to integrate multiple –omic data sets (gene expression, GWAS, proteomics, etc.). Similar to Neil, our first step was to extract the data from their individual repositories, ontologies, schemas, etc., and boil these very disparate data sets down to their greatest common divisor (GCD). "Greatest common divisor" is an apt term here: in selecting what data was included in our simplified model, we looked both for what was "common" to all data types and the analyses that would follow, and for what was "greatest" in the sense that wherever possible we wanted processed results over raw data. In Neil's case, his GCD was the feature-probe-value triplet. For us, the GCD was a combination of probe, sample and value. It would take another blog post (or more) to go into this project and its design decisions, but for now I wanted to emphasize the similarity with Neil's approach and perhaps holler another "Amen!"
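To make the "greatest common divisor" idea concrete, here is a minimal sketch of boiling two differently-shaped data sets down to (probe, sample, value) tuples. The field names and input records are hypothetical, not the project's actual schema:

```python
from typing import NamedTuple

class Observation(NamedTuple):
    """The simplified common model: one probe measured on one sample."""
    probe: str
    sample: str
    value: float

# Hypothetical source records in two very different native shapes.
expression_rows = [{"probe_id": "1007_s_at", "sample_id": "S1", "log2_expr": 7.2}]
gwas_rows = [("rs4680", "S1", 0.031)]  # (SNP, sample, p-value)

# Each source needs its own small extractor; the target model stays the same.
common = (
    [Observation(r["probe_id"], r["sample_id"], r["log2_expr"]) for r in expression_rows]
    + [Observation(snp, s, p) for snp, s, p in gwas_rows]
)
```

The per-source extraction code is throwaway; the payoff is that everything downstream sees a single, uniform shape.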

But hold on! Before tossing all ontologies onto the scrap pile of failed good ideas (somewhere between the Apple Lisa and Crystal Pepsi, I imagine), let's take a step back. After extracting the multiple, disparate data sets out of their various schemas and into a single, clean model, what can we do with it? The answer: not much. In our project, although the data was all in the same form, the domains were still very different. How do we relate the value we got from a SNP to one from a transcript? Should data for a sample labeled "breast cancer" be grouped with data from samples labeled "human mammary carcinoma"? The truth is that although we had a unified data model, the data was far from being integrated.

So what did finally bring us to an integrated data set? Ontologies, thesauruses, mappings, etc. In our case we used multiple sets of mapping data to take our primary data values from probes to a common feature – a gene (Entrez). We used nomenclatures like SNOMED and MeSH to normalize our samples. Only after leveraging this ontological information could we work with the data in any meaningful way.
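As a toy illustration of that mapping step, the sketch below lifts probe-level observations to genes and normalizes sample labels through a synonym table. The two dictionaries are invented stand-ins for real Entrez Gene and SNOMED/MeSH lookups:

```python
# Invented stand-ins for real mapping resources (Entrez, SNOMED, MeSH).
probe_to_gene = {"1007_s_at": "DDR1", "rs4680": "COMT"}          # probe/SNP -> gene symbol
sample_synonyms = {"human mammary carcinoma": "breast cancer"}   # label normalization

def normalize(probe: str, sample: str, value: float):
    """Map a raw (probe, sample, value) triple onto common gene and sample terms."""
    gene = probe_to_gene.get(probe)  # observations we cannot map are dropped
    label = sample_synonyms.get(sample.lower(), sample.lower())
    return (gene, label, value) if gene else None

raw = [("1007_s_at", "Breast Cancer", 7.2), ("rs4680", "human mammary carcinoma", 0.031)]
integrated = [t for t in (normalize(*r) for r in raw) if t is not None]
```

After this step the expression value and the SNP value finally live in the same vocabulary: both are keyed by gene symbol and by the single label "breast cancer", so they can actually be analyzed together.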

It didn’t stop there. Once we had the data mapped to common feature and sample terminology, we then utilized ontologies like GO to form and test hypotheses on biological function. Our plan going forward is to leverage disease ontologies, pathway information and other meta-data to move beyond simple lists of differential features towards true biological understanding. Although coming up with a good ontological model capturing the entities and relationships relevant in biology is hard and complex, it is still a good way to capture our biological knowledge in a form that can be directly applied in analyses.

Finally, lest we forget, although dealing with a different schema/ontology for each type of data is annoying, it is far better than the alternative of having no such schema/ontology in place. This was painfully clear in our project when dealing with the relative uniformity of gene expression data in repositories such as GEO compared to the state of proteomic and even GWAS data.

To sum up, I agree, Neil – you’re on to something. Getting your different data sets into a simple, consistent model is the way to go. We shouldn’t try to build a complex schema/ontology to record all things for all data. However, once you’ve got the data in this simple form and are ready to move on to the analysis, I think you’ll find the ontologies to be indispensable.

