Thursday, October 31, 2013

Sequencing is the new pcr

- Gautham

I was wondering if the domination of high-throughput sequencing in genome science is an anomaly of the times, so I decided to look at the table of contents of old Genome Research issues. What was Genome Research about before the advent of all this sequencing?

http://genome.cshlp.org/content/2/1.toc
http://genome.cshlp.org/content/1/4.toc

Take a look! It was even more dominated by PCR then than it is dominated by HT sequencing now.

Wednesday, October 30, 2013

Better programming practices applied to a scientific plotting problem

- Gautham

It is easy to find "best programming practices" out on the web, but usually there aren't accompanied by an example. This is a real-life example of applying some good practices to turn a bewildering task into something actually pleasant.

One of the hardest aspects of making the worm paper was dealing with the complicated plots that had to be generated. The input data was of exactly two forms:

  • RNA-FISH data sets. A sample of worm embryos of some particular strain, grown under particular conditions, were fixed and imaged. For each embryo analyzed you have #nuclei, and the number of RNA-FISH spots in all fluorescent dye channels used in that experiment. Each dye channel corresponds to a different gene, and different data sets can involve different genes.
  • Embryo cell lineage data sets. Our collaborators Travis Walton and John Murray took movies of individual embryos from certain strains and conditions and determined the time of each cell division. We also incorporated published lineage data. The names of the cells in the embryo follow a standardized pattern. 
Since the C. elegans embryonic lineage is invariant, you can use the data set in 2 to effectively determine a time axis for RNA-FISH, and do things like determine whether expression of a gene starts before or after a certain cell's division (which was actually a key question asked by the paper). 

The goal is to make plot layouts like:
Output of the figure-1 producing code in MATLAB
You see here lineages versus time (bottom left), RNA versus time (bottom center) and versus nuclei (top right), and even a nuclei versus time plot (bottom right). RNA traces for different genes are shown, color-coded, and a specific lineage (the E lineage) is highlighted and the times of its divisions are highlighted in all the plots. Plus there's the shaded alternating grey bars.

My first crack at this thing was for our first submission. It involved very messy and long scripts in MATLAB. Producing plots required a lot of looking back and forth and copying and pasting. Changing which lineage data or which RNA-FISH data set was used to make the plot was a chore. When the reviews came back, and we had to consider many ways to present our data to make our point clear, we needed a more flexible way to make these plots. Here are some principles/tips that helped me:

Centralize access to raw data
This is related to a previous post on keeping together all data of the same type.

LINDATADIR = 'lineagedata/';
RNADATADIR = 'RNAdata/';

lindata = loadWormLineageData( LINDATADIR );
rnadata = loadAllWormRNAData( RNADATADIR );

>> rnadata
rnadata = 

         gut: {1x49 cell}
    othergut: {1x8 cell}
      muscle: {1x12 cell}

>> showIndexWormData(rnadata.gut)
1)   RT 101003
2)   15C 101011
3)   RT 101115
...

49)   N2 20C 110219 and 120309 gut

In Matlab, structs or arrays of structs are the container of choice for arbitrary data. Above I have a cell array of structs for each experiment, and each struct contains a string that has a descriptor of the experiment, which I can show with the showIndexWormData function. Similarly,
>> lindata
lindata = 

           n2: [1x1 struct]
         gg52: [1x3 struct]
         gg19: [1x3 struct]
         div1: [1x5 struct]
...
In this case lindata is itself a struct, where each field corresponds to a strain or source and contains a struct array of lineage data, which could represent data from different individual embryos.

Good code needs few comments
If you have to write many comments to explain what your code does, rewrite the code. Code that makes its intent clear is good code. The nightmare is not looking back at code and asking "how does this code work?" but rather "what was I trying to do here!?" 

Here is the top level script run to make the figure:
%% FIGURE 1 - Overview

% The end-3 and end-1 data is in 110219
% The elt-2 data is in 120216

rnadatacell = { ...
    getWormData( rnadata.gut, 'N2 20C 110219' ), ...
    getWormData( rnadata.gut, 'N2 20C 110219' ), ...
    getWormData( rnadata.gut, 'N2 120216 20C' )};

genenamecell = {'end3','end1','elt2'};
dyenamecell = {'alexa','tmr','cy'};
RNArangecell = {[-100 700],[-100 900],[-100 600]};

lineagedata = lindata.Bao_to20C(1);

figure(3);clf;
plotfig1_0114( rnadatacell, genenamecell, ...
    dyenamecell, RNArangecell, lineagedata)

set(gcf,'PaperPositionMode','auto')
print('DRAFT_fig1_0114.eps', '-depsc')

If clarity of intent is key, we are prepared to sacrifice all manner of other things, including speed, storage and efficiency. As you can see, the script really packages the variables I might be interested in changing when producing Fig. 1 and sends them off to a function that does the work.

One other good reason to have such self-commented code is that your code and your comments will never diverge due to laziness. If you rely on comments to understand your code once, you will have to be diligent from then on to make sure they are totally sync'ed up any time you come back and modify it. 

Use accessor and constructor functions liberally
This is an offshoot of ideas related to the way you write object-oriented programs. Marshall Levesque turned me onto writing "get" and "make" functions.

getWormData( rnadata.gut, 'N2 20C 110219' )

Fetches data by descriptor, rather than by number, filename, or some other obtuse thing. 

Here is the portion of the code in plotfig1_0114 that plots the three plots in the bottom center of the figure:

for i=1:length(rnadatacell) 
    axes(ht(i));
    xlim(RNArangecell{i})
    addtimeshading_t( getLinDatabycellname('ABa',lindata_adj) , ...
        10, brightcolor, darkcolor, 'y_t');
    plot_onewormseries_vs_t( rnadatacell{i}, lindata_adj,...
        getRNAslopes( genenamecell{i}),...
        getWormColors( genenamecell{i}), dyenamecell{i}, 'x_RNA__y_t') ;
    addLineageMarks_t( lindata_adj, Linsubsetfunc, 'y_t') ;
end

There are three 'get's here: getLinDatabycellname('ABa',lindata_adj) finds the time of cell division of cell ABa based on the lineage data and getWormColors( genenamecell{i}) returns a structure containing all the graphical properties I like to use when plotting that particular gene. This is how a consistent color is applied every time for a given gene (green for elt-2, for example). In my implementation, the table of graphical properties for all genes is hard coded into that function body and all it does is look up the entry for the desired gene. ( getRNAslopes is also a "get" involving a hard-coded table )

Incidentally, the low level functions accept things like the strings 'x_RNA__y_t' or 'y_t' to determine what is plotted on the x and y axes. This was important when we were considering all kinds of variations with this or that plot flipped. 

The lineage marks are added using the lineage data and a function that filters only the lineage I wanted to highlight (the E lineage including the EMS division):
addLineageMarks_t( lindata_adj, Linsubsetfunc, 'y_t')

Turns out that the filter itself was made with an accessor/constructor function:
Linsubsetfunc = makeLinsubsetter('EwEMS');

Here is a function that returns a function itself. MATLAB is a bit handicapped for this kind of thing since the only kind of function you can make interactively is an anonymous function one-liner, but it sufficed for my purpose:

function fh = makeLinsubsetter( lineagename )

% Return a function handle to an anonymous function that will filter for
% the specified lineage. 

switch lineagename
    case 'AB'
        fh = @(cellnames) find( strncmp('AB',cellnames,2) );

    case 'E'
        fh = @(cellnames) find( strncmp('E',cellnames,1) & ...
                                  ~ strcmp('EMS',cellnames) );

...

    case 'EwEMS'
        fh = @(cellnames) find( strncmp('E',cellnames,1));
...
    case 'notABorP0'
        fh = @(cellnames) find( ~ ( strncmp('AB',cellnames,2) | ...
            strncmp('P0',cellnames,2) | strncmp('NA',cellnames,2)));    
    case 'all'
        fh = @(cellnames) 1:length(cellnames);


Having a function like this was incredibly handy.

Avoid copy-paste code. Wrapper functions are nice.  

These two principles are closely related. For example, consider the plotting of RNA vs time (bottom center)  and the RNA vs nuclei (center left) in the figure. The function that plots an individual RNA vs time graph is:
function [ linehandles, plothandles ] = plot_onewormseries_vs_t( cdata, ...
        lindata, rnaslopes, graphicsspecs, colortoplot, plotmode)
...
...

embryoages = assignAgesFrom_lindata( numcells, ...
                                    numrna, rnaslopes, lindata) ; 
linehandles = plotGoodBadWormData( embryoages, numrna, ...
    has_clearthreshold, graphicsspecs, flip_rna_t) ;
end



On the other hand, the code that plots an individual RNA vs nuclei graph is:
function [linehandles, plothandles] = plot_onewormseries_vs_N( cdata, ...
                                graphicsspecs, colortoplot, plotmode)
...
...
linehandles = plotGoodBadWormData( numcells, numrna, ...
    has_clearthreshold, graphicsspecs, flip_rna_n) ;
end

Both functions ultimately call plotGoodBadWormData to actually draw the graphs, but the time version  passes in embryo ages, which it computes, rather than the number of cells (same as the number of nuclei). This way, plotGoodBadWormData does the work of plotting and applying all the graphics parameters and its code does not have to be duplicated. Meanwhile, the two plot_onewormseries... functions are really just wrappers. 

Ultimately, in all my worm code there are only three or four functions that do actual hard "work". 

More function, less scripting.

This is a corollary of avoiding code duplication. Think twice before you copy-paste! If you are using a language like R or python that allows you to write arbitrary functions interactively, there is no down-side to functionalizing. 

Hope some of these tips help you to not lose your mind, like I did on my first go at this stuff. We scientists are some pretty bad programmers because nobody else usually sees our code. But we have to see it and work with it. Good practices do make you more efficient, and also, happier. I trust you have enough good taste to avoid the extremes.

Tuesday, October 29, 2013

The design police

Design is everywhere these days. It’s got to be the Apple effect: they’ve definitely pushed design into the public consciousness, and now we’re stuck listening to these insufferable design-nerds anytime any website or software is updated anywhere, no matter how inconsequential the change. Perhaps it’s ironic that Apple itself is the target of most of this design-related hate-babble, with literally every aspect of their software being the subject of endless rants about stuff like kerning and “icon balance” and other stuff nobody cares about.

You know the real issue here? Apple already has great design. Design-police people: please turn your attention to the rest of us! We need you over here! Could iOS7 use slightly less transparency here or there? Maybe. But have you looked at the upenn.edu website anytime lately? Especially the functional parts that us members of the university have to use almost daily? Or our fancy new Concur expense reporting system that borders on useless? Trust me, there’s tons to do because practically the rest of the world needs a design overhaul. I wish the world I worked in most of the time was even remotely as well designed as iOS7 (or frankly, iOS1). And perhaps you could train us on figure making while you’re at it?

By the way, I should say that I do actually appreciate good design, despite it all (and I think iOS7 is actually pretty great). One of my favorite movies on the topic is Helvetica, about, you know, the font. Basically a bunch of design guys getting on there talking about how Helvetica is overused and boring and bad. One of my favorite parts of the movie, though, is when they interview these two design guys and one says something like: “Look, just make your text in Helvetica Bold and it will look good.” Amen.

Saturday, October 26, 2013

Product differentiation when there is no difference

(No, I don't mean this kind of product differentiation.)

I was just reading this article in the NYtimes.com about how sales of bottled water will soon surpass that of soda. Of course, this is utterly ridiculous. It's &*#$-ing water! You know, the stuff that flows from your tap almost for free? Whatever, I guess that's an old argument, and is seems that side of rationality has lost that one. But one of the interesting things in the article was about how all these water companies are trying to differentiate themselves to get an edge because the plain old bottled water market is so competitive that there's very little profit left:

The TalkingRain Beverage Company, a bottled water business that started in the Pacific Northwest, is getting out of the plain water business altogether because the economics are so bad. “The water business has become very commoditized,” said Kevin Klock, its chief executive.
(Umm, "very commoditized"?!?! Sorry, gotta say it again: It's &*#$-ing water!)

Which got me thinking about what really is a commodity. Even tap water does taste different in different places, and I suppose some bottles might be easier to open than others. Grains, chemicals, minerals–they can all come in different purities and grades. In fact, even in lab, we buy all sorts of different types of water for different purposes, including, umm, bottled RNase free water (oops!). So whatever the perfect commodity is, it would have to be something digital and thus inherently pure, like some form of information. But it can't just be all information. For instance, if you're selling some kind of information, you're probably selling information that's hard to come by or process or present in some way. So then I'm not sure if that qualifies as a commodity.

Which leaves money. Money is just a number, and it's equally valuable wherever you go or from whoever you get it; there is no different kind of money (there are financial schemes and categories of investments, but they are not money in and of itself). Of course, you might be wondering who would be selling such a "product". Like, who's out there saying "here's an authentic $10 bill, I'll sell it to you for $10.50!" Well, that's exactly what credit cards do, for instance. So how do different cards differentiate themselves? How do you differentiate yourselves when you're just giving people money? Seems like there are two ways they do it. One is psychological manipulation. These include making merchants pay more or less, which is essentially a hidden cost to the consumer, and those insidious schemes that play upon people's "just pay later" mentality. The other differentiator is the various bonus rewards schemes, which rely on other companies' real world products to provide differentiation.

But what surprises me is that they don't really seem to compete on cost. As in, they ALL have astronomical interest rates. Now, for something that is to my mind as commoditized as a commodity could be, that seems quite strange. Apparently, you can get bottled water for as little as 8 cents per bottle in bulk because competition has driven enormous efficiencies in the production of the "product" (i.e., bottle manufacturing). Why is there so little price competition in credit cards? Sure, on interest rates, they have to cover the folks who don't pay their bills, and on charging businesses, they have to cover the infrastructure costs.  Still, shouldn't one company be able to ultimately offer much lower rates than another? I think the clearest answer is cartel behavior. And indeed, they have been hit with many such fines for that sort of behavior in the past. I think it's because the only way to make money (and tons of it) in a purely commodity business is to cheat. In that way, you could view the price competition for water as a shining example of the highs and lows of capitalism: competition drives incredible optimization and efficiencies in a product whose very existence is essentially ridiculous.

Definitely don't want to step into any arguments about the pros and cons of capitalism and regulation. But I will offer up this funny anecdote that might provide fodder for both sides of the argument. I was in DC some time ago for a grant panel, and in the room, they had... bottled water, of course. But apparently, they weren't allowed to call it bottled water, most likely because of some well-intentioned rule against bottled water. So instead, they had these bottles specially labeled by the caterer as "refreshing drink" or something like that, which (ironically) probably cost more than just getting regular old bottled water. Sigh. I was in Canada for a conference recently, and they had a bunch of cups and a pitcher. Now that was refreshing.

Thursday, October 10, 2013

Some analogies for the Higgs field

I was just reading about the Nobel Prize in Physics on NYTimes.com, which this year was for the discovery of the Higgs boson, and came across the following analogy for how the Higgs imbues particles with mass:

"According to this model, the universe brims with energy that acts like a cosmic molasses, imbuing the particles that move through it with mass, the way a bill moving through Congress attracts riders and amendments, becoming more and more ponderous and controversial."

Umm, sure, that's a sensible analogy! :)

Anyway, that got me thinking of some more analogies with a political/ideological bent:

The Higgs field acts like a cosmic molasses, imbuing particles that move through it with mass...

...like a wolf traipsing carefree through the forest, only to get its paws progressively caught up in traps set illegally by hunters.

...like a gun that moves from dealer to dealer, impeded by endless background checks.

...like a all natural tomato growing for generations only to get progressively genetically modified by vicious genetic engineers.

...like a happy healthy child driven to autism by an extensive vaccine regimen.

...like a investment banker hampered by bureaucratic regulations.

...like a factory hit with repeated union strikes.

...like a group of workers progressively worked into the ground by management.

And some other non-political/ideological ones:

...like a PI who receives several e-mails a minute about how the latest seminar series has been canceled.

...like an unadorned Cheeto that picks up that cheesy powder on its trip through the Cheeto factory.

...like an adult working her or his way through life, only to be slowed by children, mortgage payments, and a variety of home maintenance projects.

Sunday, October 6, 2013

Impact factor and reading papers

I just read this paper on impact factor, and it makes the point that impact factor of a journal is a highly imperfect measure of scientific merit, but that the reason that the issue persists is that hiring committees fetishize publications in high impact journals. The authors say that this results in us scientists ceding our responsibilities for gauging scientific merit largely to a handful of professional editors, most of whom have very little experience as practicing scientists. All true. The real question is: why does this happen? Surely we as scientists would not have let things get so bad, especially when hiring decisions are so very important, right?

I think the issue comes down to the same basic issue that is straining the entire scientific system: the expansion of science and the accompanying huge proliferation in the amount of scientific information. Its a largely positive development, but it has left us straining to keep up with all the information that's out there. I think it's likely that the majority of faculty on hiring committees are just too busy to read many (if any) of the papers of the candidates they are evaluating, often because there's just not enough time in the day (and probably because most papers are so poorly written). So they rely on impact factor as some sort of gauge of whether the work has "impact". So far, pretty standard complaint. And the solutions are always the same boring standard platitudes like "We must take action and evaluate our colleagues work on its merits and not judge the book by its cover, blah blah blah." People have been saying this for years, and I think the reason this argument hasn't gone anywhere is that while it sounds like the right thing to say, it doesn't address the underlying problem in a realistic way.

For another way of looking at the issue, put yourself in the same position as this member of the hiring committee. What would you do if you had to read hundreds of papers, many highly technical but also not in your field of expertise, while running an entire lab and teaching and whatever else you're supposed to do? You would probably also look for some other, more efficient way to gauge the relative importance of these papers without actually reading them. That's why the research seminar is so useful, because it allows you to get a quick sense of what a scientist is all about. Which raises the question that I often have, which is why do we bother writing all these papers that nobody reads? My personal feeling is that we have to reevaluate how we communicate science, and that the paper format is just not suitable for the volume of information we are faced with these days. In our lab, we have slidecasts, which are 10 minute or less online presentations that I think are at least a step in the right direction. But I have some other ideas to extend that concept further. Anyway, whatever the solution, it needs to get here soon, because I think our ability to be responsible scientists will depend on it.

Saturday, October 5, 2013

Why are figures hard to read?

In a previous post, I talked about how figure legends are quite useless. But that got me thinking about figures more generally. Gautham and I have both been talking about how sometimes we read papers first without even looking at the figures. I like this style of reading because it allows you to pay attention to the authors' arguments without interruption. Of course, the problem is that you don't actually see the data they're referring to. But if you trust that their interpretation of their graph or picture or whatever is correct (otherwise they wouldn't be showing it, right?), then there's really no point in looking at the figure, right? Hmm...

So why am I avoiding figures? I think the reason is that there are a lot of cognitive hurdles to doing so. Many papers will have some sort of an interpretative statement, like this "Knocking down the blah-1 gene led to a substantial increase in cells walking to the side of the plate and turning purple (Fig. 2D)." Okay, but then you look at the figure, and it will probably have some sort of bar graphs of knockdown efficiency and some cell-walking-turning-purple assay readout. Then I have to think about what those charts mean and how they did the assay and whether that makes sense and... you get the idea. It requires a lot of thinking, and by the time you understand it, I defy anyone to go right back to the main text and regain their previous train of thought.

What to do? Well, this seems to be far less of a problem when you're giving a talk. Why? I think it's because you have the opportunity to spend time with the figure and explicitly show people what you want them to get out of it as an integrated part of your narrative. In a paper, I think one of the things we could do is to adopt things like sparklines (ala Tufte), which are inline graphics. Here's an example lifted straight out of Wikipedia: The Dow Jones index for February 7, 2006 sparkline which illustrates the fluctuations in the Down Jones index on February 7, 2006. Let's take our previous example. What if I said something like "Knocking down the blah-1 gene  led to a substantial increase in cells walking to the side of the plate and turning purple  . Simple and informative, and inline, so there's no cognitive interruption. (Note: they came out bigger than I wanted, but it's hard to deal with this pesky Blogger interface.) Does this have all the information that a figure would have? No. But it conveys the main point. And probably with slight larger graphics, you could include a lot of that information. As it is with most figures, there's so much missing information anyway in terms of how one actually made the figure that it's not like it's all that complete to begin with. The best way to deal with that is if we adopted a writing style in which we didn't just give an interpretation, but actually took the time to explain what we did. That's hard within the writing limitations imposed by the journals, though.