Up at 5AM: The 5AM Solutions Blog

Science and Sharing: Not Compatible?

Posted on Thu, Oct 29, 2009 @ 12:57 PM

There's been some recent skepticism about how much researchers are really interested in making their data public. A recent article in PLoS One provided a disturbing, if small, example of how authors avoid giving others their data. You could argue that there is little motivation for authors to give others their data.

They definitely risk some harm. They risk others discovering their mistakes or finding interesting results they didn't find, and having those mistakes or missing results exposed in somebody else's paper.

In addition, they have little to gain. Mostly what people tout as the benefits for data sharing are benefits for everybody but the original author. At best one would expect to be a co-author on a study, but that seems to happen only when the subsequent paper is a true collaboration between the original data producers and the second set of researchers. A more likely result is that the original authors would be thanked in the acknowledgements, which most scientists consider to be pretty small potatoes. Another paper in PLoS One associates citation frequency with papers that release their data. Authors certainly pay attention to citations of their work, and there are web sites of the most cited papers that carry some weight, but I suspect this too is a pretty weak motivation for releasing data.

But nobody likes a blogger who bitches all the time and doesn't offer any constructive solutions. And I have two.

One is to financially compensate researchers for sharing their data. In practice, however, this could be complicated. If, as the second PLoS study shows, increased citations are associated with making data available, then could you use that as a surrogate measure and pay researchers based on the number of citations of their work? And who would pay them? Their employers? That doesn't seem very likely, although maybe universities could set up a pool of bonus money for this purpose. The funding agencies could pay them. You could set aside a percentage of grant money to be paid only if certain citation or sharing milestones are met. The NCI is trying to do this by making caBIG usage part of a lot of their grants. But they make the mistake of paying people first and then assuming they'll do it. They should withhold that money until the sharing is complete.

But I am skeptical that any kind of motivation like this will work very well. Researchers will always try to find a way to serve their own personal interests so will try to subvert or minimize the sharing they have to do.

My second idea is that we should separate the generation of large data sets from publication entirely. I know generating large sets of data is a sophisticated effort, but at some point we have to separate the repeatable, factory-like parts of science from the data analysis parts. Why couldn't the funding agencies go ahead and fund efforts to sequence samples and explicitly say that no publishing can come from this effort? Maybe if there are technology advances made they could be published, but the data cannot be a paper in itself. The data would have to be made freely available to anyone using an appropriate mechanism (and I realize this is a topic worth an entire blog posting by itself).

Then anyone can analyze the data and publish their results. Perhaps you could also make it a condition of using the data that you have to include authors from those that generated the data, but that is a tricky idea. I am happier with the idea of taking publishing out of it entirely to avoid the situation where data generators just go right out and collaborate with their favorite analysis group and publish a paper.

For this idea to work you'd have to find people willing to do this kind of work and not get recognized in scientific publications, at least in the traditional way. Would people be interested in this? I'm not sure, but I imagine if you paid them enough they'd be happy to. The question is whether the extra expense would be repaid by the benefits of the easy sharing of the results.

Shall we meet?

Posted on Thu, Oct 22, 2009 @ 12:54 PM

It's midnight, and my husband is brushing his teeth before bed. Me? I'm sitting here slogging through the 100+ emails I received today. It seems like a ton, but I guess that's probably the daily average. Why am I dealing with them at this hour? Because I spent the entire day in meetings.

And I mean the entire day. 8:15am through 5:30pm. No break for lunch (no lunch, in fact - Brent grabbed a sandwich for me but I never got around to eating it because of, well, meetings). I was even double- or triple-booked, having to miss a meeting because I was in another meeting.

All the meetings were good, or enjoyable. I learned a lot and contributed some knowledge here and there. This is certainly not a typical day for me (thankfully). But as a Myers-Briggs Introvert, who gains energy from being alone, a meeting-packed day is especially tough. Alas, I did need to attend all of my meetings today, but thankfully the meeting facilitators kept their meetings on schedule, because as each ended, I rushed off to the next. After a day like today, I needed to reflect on my own meeting rules (all of which I've broken at one point or another [or severally]).

Rule #0: No "good meetings." If the outcome of a meeting is that it was a "good meeting," the meeting was unnecessary. See Rules below.

Rule #1: No meetings without an agenda. If we can't list the things we want to discuss, why are we here?

Rule #2: Any meeting should have a desired outcome or outcomes. We're discussing the things listed on the agenda to what end?

Rule #3: A meeting should have resulting actions and key decisions that should be assigned and communicated as quickly as possible. These actions and decisions should align with the meeting objectives. (Thanks to colleagues here at the NHIN for awesome modeling on this point.)

Rule #4: Consider whether there are other ways to accomplish the objective. Sometimes, a conversation or email exchange can get us there quicker. Do we really need a status meeting? A corollary is to invite only those people who can achieve the desired outcome(s). Those who are interested can read the decisions and action items (Rule 3).

Rule #5: Respect. Stick to the agenda, keep the outcome in mind, and watch the clock. People are giving their time - make it worthwhile. (Also why I'm dealing with emails now - when I attend meetings, I'm there to pay attention, not check mail on my iPhone.)

Meetings have the potential to be effective, but don't we all lament that we meet too often? I hear that at Microsoft there are no meetings longer than 30 minutes. This time limit only means that some people attend 16 meetings a day rather than 8. Oh, joy.

We have important work to do. My colleagues and I are trying to make our program (our client's program) a smashing success. Part of this effort involves holding regular meetings with stakeholders and partners - a lot of them. I also prefer working sessions, which I suppose are quasi-meetings - collaborations whose outcome is something concrete.

At 5AM, we no longer hold regular management meetings, and I do miss engaging with my colleagues in meetings that sometimes ranged wildly and were always "good meetings," but I appreciate that extra hour each week, which allows and demands that each of us interact with more focus because we can't rely on those meetings.

A colleague has meeting requirements he calls the "Three P's" - that each meeting should have a plan, a process, and a payoff. Next time I'm in control of a meeting, I vow to follow my own rules. Anyone up for renewing your own meeting vows? I'd love to hear yours - please comment. I'll be sure to seriously consider your next invite....

Clever Code

Posted on Tue, Oct 20, 2009 @ 12:53 PM

As part of an internal effort to dabble with Scala, I sat down yesterday and solved a few of the Project Euler problems. It's been a bit since I did any math-type coding problems, and I enjoyed the hours of flow that coding usually brings me. By the end, I felt pretty good about my solutions and the new language acquisition. And then I searched for solutions, to see what better Scala programmers would have done. Whoa, Nellie!

There were two classes of differences between my solutions and the ones I found online. First, the online solutions used language constructs and APIs I wasn't familiar with. This is what I expected, and why I searched in the first place. But the other type of difference was what I'll call cleverness. Here are two examples:

My solution for Euler Problem #1:
    def euler1(max: Int): Int = {
      var result = 0;
      var i = 0;
      while (i < max) {
        if (i % 3 == 0 || i % 5 == 0) result += i;
        i = i + 1;
      }
      return result;
    }
    euler1(1000);
A more elegant solution, demonstrating familiarity with the language syntax and features that I was clearly not familiar with:
(1 until 1000).filter(n => n % 3 == 0 || n % 5 == 0).foldLeft(0)(_ + _)
As an example of cleverness, consider this code:
    object Euler005 extends App {
      def divBy(div: Int)(x: Int) = if (x % div == 0) x / div else x
      def reduceMuls(src: List[Int]): List[Int] = src match {
        case 1 :: rest => reduceMuls(rest)
        case a :: rest => a :: reduceMuls(rest map divBy(a))
        case _ => src
      }
      println((1 /: reduceMuls((1 to 20).toList))(_ * _))
    }
Quick: Without looking at the explanation, what is this code doing?

I admire the cleverness of this code. It uses the language features nicely, it is efficient and compact, and the explanation of how the author arrived here is well written. But I was really struck by the fact that this code is quite different from most of the code I encounter. My solution was to factor each number into a bag (method 1), then do a union of the bags (method 2), then multiply. The above solution is almost certainly faster, but absent clear and compelling performance reasons, I would strongly prefer the decomposed solution. I don't want to spend the time to be that clever in the first place, and I certainly don't want to spend the time to learn what the code does if it ever has to change.
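For comparison, here is a rough sketch of what the decomposed approach described above might look like. It is a reconstruction rather than the original solution: the names `factorBag`, `unionBags` and `smallestMultiple` are made up for illustration, and "union of the bags" is read as the multiset union, which keeps the larger multiplicity of each prime.

```scala
object Euler5Decomposed {
  // Method 1: factor n into a "bag" of primes, stored as prime -> multiplicity.
  def factorBag(n: Int): Map[Int, Int] = {
    var remaining = n
    var divisor = 2
    var bag = Map.empty[Int, Int]
    while (remaining > 1) {
      if (remaining % divisor == 0) {
        bag = bag.updated(divisor, bag.getOrElse(divisor, 0) + 1)
        remaining /= divisor
      } else {
        divisor += 1
      }
    }
    bag
  }

  // Method 2: union of two bags, keeping the larger multiplicity of each prime.
  def unionBags(a: Map[Int, Int], b: Map[Int, Int]): Map[Int, Int] =
    (a.keySet ++ b.keySet).map { p =>
      p -> math.max(a.getOrElse(p, 0), b.getOrElse(p, 0))
    }.toMap

  // Final step: multiply every prime in the merged bag, once per multiplicity.
  def smallestMultiple(max: Int): Long = {
    val merged = (1 to max).map(factorBag).foldLeft(Map.empty[Int, Int])(unionBags)
    merged.foldLeft(1L) { case (acc, (prime, count)) =>
      (1 to count).foldLeft(acc)((product, _) => product * prime)
    }
  }
}
```

Calling `Euler5Decomposed.smallestMultiple(20)` answers the same question as the clever version. It is longer, but each method can be read and tested on its own.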

Save the clever code for toy programming assignments, performance-critical blocks, or hacking. If it must exist in a project that needs to be maintained, please provide an explanation like the one the author of the above code did.

Going Out on a LIMS (Again)

Posted on Thu, Oct 08, 2009 @ 12:53 PM

A commenter on my last blog entry, Of Life and LIMS, had a few follow up questions to my post. This was a pleasant surprise for a couple of reasons. First, this means my readership is about 100% more than I had previously estimated (no worries, Mom, you're still my favorite). Second, to answer the commenter requires a blog entry in itself - so no searching for appropriate subject matter this week. I'll leave the tutorial on posting cat videos to YouTube for another day.

Let me address one of the questions from the comment.

What would be your definition of the unmet need? Is it not describable because everyone either is or believes themselves to be so unique?

Let me remind folks that in the blog I was discussing the LIMS (laboratory information management system) needs of smaller R&D labs, and let me start with what is not their unmet need. First, their unmet need is not a shrink-wrapped LIMS-in-a-box, which has so far remained elusive. I really don't think that software exists. Second, I don't believe that the unmet need is a scientist-friendly application that will be up and rolling after a simple install and a little configuration with an intuitive GUI. I've seen too many bench scientists struggle with just programming the instruments in their lab to think that having them set up a complex LIMS is a reasonable solution.

So what is the unmet need and is it describable? I think so. Although all LIMS are different, I believe that there are some common pieces that pretty much every LIMS needs, such as a database to record inventory, methods to control and record workflow, and simple reporting and user input tools. A full LIMS will require more than just these core components, but having these pieces available in a framework would already get you well down the path of a functioning LIMS. Framework is the key word here, because that's where I think the unmet need is. Just as frameworks such as Ruby on Rails and Django have made a web developer's job much easier by setting up the core components found in many sites, I believe an open LIMS framework would greatly decrease both the time and complexity required to set up a LIMS. Having a framework will not make LIMS development tractable to bench scientists, but will easily bring it within the capabilities of a bioinformatician or engineer lacking a formal software engineering background. Finally, having an open framework is key for encouraging developers to share add-ons and instrument-specific controllers and parsers that in turn will make the framework even better - the kind of positive feedback loop that takes a project from obscurity to adoption.
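To make "core components plus shareable add-ons" a little more concrete, here is a minimal sketch of how such a framework might be shaped. Everything in it is hypothetical: the names (`LimsSketch`, `Sample`, `WorkflowEvent`, `InstrumentParser`, `PlateReaderParser`) and the `key=value;key=value` instrument format are invented for illustration and are not taken from any existing LIMS project, and in-memory collections stand in for the database.

```scala
object LimsSketch {
  // A sample tracked in inventory; the fields are illustrative only.
  final case class Sample(id: String, location: String)

  // One recorded workflow step for a sample.
  final case class WorkflowEvent(sampleId: String, step: String)

  // Extension point: each instrument add-on turns raw output into named readings.
  trait InstrumentParser {
    def instrument: String
    def parse(raw: String): Map[String, Double]
  }

  final class Lims {
    private var inventory = Map.empty[String, Sample]
    private var history = Vector.empty[WorkflowEvent]
    private var parsers = Map.empty[String, InstrumentParser]

    // Inventory: register a sample so later steps can refer to it.
    def register(sample: Sample): Unit = inventory += (sample.id -> sample)

    // Workflow: record a step, refusing samples that were never registered.
    def record(sampleId: String, step: String): Unit = {
      require(inventory.contains(sampleId), s"unknown sample $sampleId")
      history :+= WorkflowEvent(sampleId, step)
    }

    // Add-ons: plug in a parser contributed for a specific instrument.
    def addParser(parser: InstrumentParser): Unit = parsers += (parser.instrument -> parser)

    def parse(instrument: String, raw: String): Option[Map[String, Double]] =
      parsers.get(instrument).map(_.parse(raw))

    // Reporting: the ordered list of steps a sample has been through.
    def report(sampleId: String): Vector[String] =
      history.filter(_.sampleId == sampleId).map(_.step)
  }

  // Example add-on for an imaginary plate reader that emits "well=value;well=value".
  object PlateReaderParser extends InstrumentParser {
    val instrument = "plate-reader"
    def parse(raw: String): Map[String, Double] =
      raw.split(";").toList.map(_.split("=")).collect {
        case Array(well, value) => well.trim -> value.trim.toDouble
      }.toMap
  }
}
```

The point of the sketch is the split: the framework owns inventory, workflow and reporting, while instrument support arrives through a small interface that others can implement and share.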

None of this is rocket science - in fact, a few efforts have been made before (e.g. The GnosisLIMS Project, Symphony, etc). To my knowledge, none of these have taken off, but I'm confident that eventually the right balance of architecture, functionality and marketing will be found. I only hope it comes before I have to deploy another LIMS myself.

Be an Expert

Posted on Wed, Oct 07, 2009 @ 12:52 PM

As I think about the good technical colleagues I've worked with over the years, one common quality they all share is that they are or were all experts at something: maybe a file format, JavaScript regular expressions, Hibernate, or a build system. It could be big or small, relevant to a current project or merely a source of curiosity and fun trivia, but my good technical people always seem to be an expert, and add to their expertise list over time. What's the relationship between expert and good technical colleague?

I want to take a moment to define expert as I'm using it here. Calling someone an expert is a relative judgment. I may be an expert among one group but a relative neophyte for the same topic in another. When I talk about "the expert," I mean the person who is the go-to for answers, the person who is recognized as authoritative for the group. Generally the expert either authored a bunch of relevant code in the past, or does reviews of the code now. They can answer easy or normal questions from memory. For obscure questions, they either know from past experience, point you at a relevant section of the docs, or solve the problem because they get fascinated. An expert is not an encyclopedia, nor do they know everything. But they know more than you.

All is a strong term. Are all good technical people experts? Maybe not, and I think this may be a distinction between coders and software engineers. In the normal course of coding we encounter problems no one in our local group has seen before. Solving the problem at hand gives the person experience, but not expertise. How someone generally reacts after solving the immediate problem separates good from not-so-good, experts from non-experts. Good software engineers (note: not 'coders') generally add extra unit tests or comments, go and investigate to see if the problem has a general solution, or take a few extra minutes to read up on existing documentation. And I think the most critical thing, the one that makes you an expert: communicating that knowledge back to the team. Look it up: the best way to learn is to teach. Two things happen by communicating back to the team. 1) You internalize the knowledge, and 2) Everyone else knows you know.

So, good individual software practice means communicating back to your team, and the act of communicating well makes you an expert.

New and Better Ways

Posted on Tue, Oct 06, 2009 @ 12:51 PM

Here at 5AM, we provide full medical insurance coverage for employees and their families – we cover 100% of premiums and deductibles. Job candidates and employees both understand this as a pretty nice, and unusual, benefit for a company of our size to offer.

Every year at this time, our insurance company notifies us of the impending end-of-contract rate hikes, and we work with our broker to shop around to ensure we’re competitive in the coverage and the cost.

This year, our non-profit health insurance provider notified us that they were raising their prices by 34%. No, that’s not a typo – 34%. This, without much explanation except the friendly reminder that they’re “always seeking new and better ways to contain costs” – raising rates 34% sure IS a good way to, er, contain costs.

In the past, the rate hikes we endured were typically in the 10-18% range. We didn’t budget this 34%, but we’re sticking to our guns and will continue to offer full coverage to our employees because we believe they and their families should have access to health care and that we as a company are in the position, and have somewhat of an imperative, to provide it.

Why do we feel obligated? Well, because one of our founding tenets is that collectively we have more power than we do individually. This holds true in how we perform our work here (in teams), and it theoretically holds true that we can group ourselves together to obtain better prices and benefits in all sorts of areas, including and especially health insurance. And we feel obligated because in our type of company, health insurance is not so much a benefit as a requirement, and the “benefit” comes in the quality of the offering and the amount the company will pay, not in the offering of the insurance itself.

In the confluence of this hoarding/avaricious rate hike, and with the current health care debate and its question (promise?) of a “public option,” I flashed back to nearly five years ago, when three rather large companies called on President Bush because they feared that the cost of providing health insurance to their employees would crush them.

Among its other burdens, GM pays nearly $6 billion per year on health insurance for its current and retired employees (adding about $1500 to the price of each car). It’s estimated that small businesses like ours will pay nearly $2.4 trillion over the next ten years in health care costs for their workers. We’ve been talking about this problem long enough, lining countless pockets and doing not much as the number of uninsured grows.

Aren’t we greater than the sum of our parts? Isn’t there power in a collective? We sure believe that at 5AM, and with another year of unexplained and apparently disproportional rate hikes to provide our employees with something so basic as medical care, we’re sure ready for new options.

Leaky Abstractions

Posted on Thu, Oct 01, 2009 @ 12:50 PM

This post is the first of several in which I would like to talk about the issue of leaky abstractions in software engineering. A leaky abstraction can be a broad and somewhat vague notion, but the essential idea is that you've created some model of the behavior of a complex system, where the model simplifies and hides some of the details of the system, yet at some point the simplification breaks down and those details poke through. When this happens, it is a big problem, as this simplification is what allows programmers to get their job done, by allowing them to work at the level of complexity appropriate to their task at hand. In this post I want to describe two examples of leaky abstractions I came across recently in my own programming.
The first is a fairly simple one. The product I work on has an API exposed as a set of remote EJBs; clients that invoke these EJBs can optionally provide login credentials so that the operations are invoked with the security privileges of the corresponding user. It turns out that, if the login credentials are incorrect, the client gets a ClassNotFoundException for a low-level MySQL class. The reason for this is that a FailedLoginException is thrown, but ultimately this contains a cause exception which references a MySQL class. Since the client does not have MySQL's JDBC jar on its classpath, deserializing that cause on the client side triggers the ClassNotFoundException.
This is a clear case of a leaky abstraction: the client is programming to the JAAS security API; it should not matter that ultimately the identity store behind it is backed by a MySQL database, and the client should not need to have the MySQL jar in this scenario. The fix here is simple: the thrown FailedLoginException should not include the underlying exception as a cause.
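
As a rough illustration of that fix (a sketch, not the product's actual code), the two variants might look like this. `FailedLoginException` is the real JAAS class from `javax.security.auth.login`; `DriverOnlyException`, `leaky`, `withoutCause` and the `log` callback are hypothetical names, with `DriverOnlyException` standing in for a driver class that exists only on the server's classpath.

```scala
import javax.security.auth.login.FailedLoginException

object LoginFailureSketch {
  // Hypothetical stand-in for a database-driver exception the client cannot load.
  final class DriverOnlyException(message: String) extends Exception(message)

  // Leaky version: the driver exception travels to the client as the cause.
  def leaky(lowLevel: Throwable): FailedLoginException = {
    val failure = new FailedLoginException("Login failed")
    failure.initCause(lowLevel)
    failure
  }

  // Fixed version: record the detail on the server and send only the message.
  def withoutCause(lowLevel: Throwable, log: Throwable => Unit): FailedLoginException = {
    log(lowLevel)
    new FailedLoginException("Login failed")
  }
}
```

The diagnostic detail is not thrown away in the second version; it simply stays on the side of the boundary that has the classes needed to interpret it.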
The second example is more far-reaching and involves a Hibernate query with a limit clause. The limit was set to 50 results, but the query would return only 49 results, even though I knew that there were in fact more than 50 matching records in the database. After some frustrating back and forth between the debugger and DbVisualizer, I found the root cause: The target class for the Hibernate query had a collection with an eager fetch strategy. This implied that, when the Hibernate query was actually translated into SQL and executed, there were potentially multiple rows corresponding to a single instance of the class, one for each element in the collection. The limit of 50 results, however, was passed down into the generated SQL layer as-is, which meant that the SQL result set had 50 rows, but by the time those were translated back into class instances, there were fewer than 50.
In fairness to Hibernate, its documentation recognizes this and explicitly warns that limit queries and eager fetching of collections do not work together. But this is really a cop-out. Hibernate is meant to be an abstraction that allows me to think of my data store in terms of class instances, and not worry, once I've done the mapping, about the actual rows in the database. Given that, when I specify a limit for a query, I want to think of that limit in terms of the objects, not the rows, and at least in this case, Hibernate doesn't let me do that. I can sympathize, as fixing this would definitely be non-trivial - Hibernate could conceivably detect that the number of object results was less than the limit, and issue an additional SQL query (or queries, since there would be no way to predict how many additional rows would need to be retrieved to get the right number of object results), or even not do a limit query at the database level, and limit the results on the Java side. But this would likely make performance worse.
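
The arithmetic behind the 49-versus-50 result can be reproduced without Hibernate at all. The following is a small simulation using plain Scala collections, under the assumption that exactly one object among the first 50 has two elements in its eagerly fetched collection; `Order`, `items`, `limitRows` and `limitObjects` are invented names, not anything from the real schema or from Hibernate's API.

```scala
object LimitLeak {
  final case class Order(id: Int, items: List[String])

  // 60 matching objects; only order 7 has two elements in its collection.
  val orders: List[Order] =
    (1 to 60).toList.map { id =>
      if (id == 7) Order(id, List("a", "b")) else Order(id, List("a"))
    }

  // What a join produces: one row per (order, item) pair.
  val joinedRows: List[(Int, String)] =
    for {
      order <- orders
      item <- order.items
    } yield (order.id, item)

  // Limit applied to rows, then rows collapsed back into distinct objects.
  def limitRows(limit: Int): List[Int] =
    joinedRows.take(limit).map(_._1).distinct

  // Limit applied to objects, which is what the caller had in mind.
  def limitObjects(limit: Int): List[Int] =
    orders.take(limit).map(_.id)
}
```

Here `LimitLeak.limitRows(50)` yields 49 distinct ids, because order 7 uses up two of the 50 rows, while `LimitLeak.limitObjects(50)` yields 50. The second function is the "limit on the Java side" option mentioned above, and in a real system it would mean pulling every joined row out of the database first.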
These are just two recent examples that I ran into directly, but others abound. The entire distributed object paradigm, with the notion that one could treat local and remote objects the same, is one giant leaky abstraction, and it was quickly realized that the only way to write performant code in this paradigm is to very much care about which objects were local and which were remote. The various web frameworks that try to shoehorn a Swing-like component model on top of HTTP are another case where the abstraction can feel akin to putting lipstick on a pig.
In future posts I would like to discuss what makes leaky abstractions more likely to occur, how one can try to avoid them, and what makes the software engineering domain particularly prone (or not) to leaky abstractions.