This post proposes a simple method to analyze gene hits from a screen in terms of GO categories, provides an example from the literature, and links to R code.
It is easy to find many very sophisticated algorithms and frameworks for analyzing the results of high throughput screens in the context of biological annotations to categories like the Gene Ontology (GO). The result is almost always p-values and enrichment scores for all GO categories, and it is common practice to then report the "most significant" (lowest p-value) terms. There are often a very large number of terms that meet reasonable significance criteria even after multiple hypothesis testing correction. This may be partly due to improvement in experimental methods, so that now even small effects result in statistically significant gene hits. Many comparative RNA-Seq experiments yield thousands of differential expression gene hits.
It is tough to get a feeling for the make-up of the hit gene list by a vanilla GO analysis because of the following problems:
- Screening by p-value tends to give large categories. This is for the same reason that screening differential expression by p-value tends to give genes with high average expression.
- Screening by enrichment gives very small categories. Again, for the same reason as screening differential expression by fold change tends to give genes with very low average expression.
- GO terms are not mutually exclusive, and hit lists tend to be made up of overlapping GO categories whose high rank on the list is due to a common set of hits.
The first two problems are common to any differential expression screen, but the overlap problem is unique to category analysis. Without exaggeration, these are the top 10 hits by p-value for a set of gene hits from a screen we did in the lab:
- multicellular organismal process
- anatomical structure morphogenesis
- system development
- multicellular organismal development
- anatomical structure development
- developmental process
- organ morphogenesis
- skeletal system development
- nervous system development
More sophisticated GO analyses take into account the graph structure of GO to try to get around this problem, but it seemed like something very simple could do the trick instead. Here is a proposal:
- Decide the scale of GO terms you want to consider at the outset. Only consider GO terms that have more than some minimum number of genes annotated to it. The larger the cutoff, the broader the picture you will get.
- Find the GO term with the highest concentration of hits. This category has the greatest number of hits in the screen per annotated gene.
- Assign these hit genes to that GO category. Remove these genes from further consideration.
- Repeat 2,3,4 but considering only hits that remain unassigned. Continue until all hits are assigned, or you have as many GO terms as you want. Usually, the parent biological_process category will eventually get selected and sweep up all remaining genes.
Step 1 forces you to explicitly choose your resolution. Step 2 seeks to find categories that are as likely as possible to be actually involved in the biology you are studying. Step 3 bravely (or foolishly) assigns genes as doing the most enriched function they are annotated for, and Step 4 eliminates the GO overlap problem. This kind of algorithm is commonly known as "greedy," so we refer to it sometimes as greedy GO.
To illustrate we used the data set generated by Marks et al. 2012 (PMID: 22541430). These authors looked at the transcriptome differences between mouse ES cells maintained in serum-containing media or media containing two small molecule signaling inhibitors. They found nearly 1500 genes significantly more highly expressed in 2i media than in serum, and present their own category analysis (PANTHER database) in their Fig. 1c. Using the BioConductor-maintained org.Mm.eg.db annotation database for mouse as a source for GO annotations, here is a breakdown by the greedy method of the "high in 2i" gene hits from Marks et al. Cell 2012:
|cellular lipid metabolic process||73||73||495|
|regulation of immune system process||49||54||393|
|cellular nitrogen compound biosynthetic process||40||44||328|
|ion transmembrane transport||41||45||374|
|carbohydrate metabolic process||43||58||414|
|neuron projection development||33||38||382|
These are not the top hits, or handpicked in any way. They are the complete output of our greedy algorithm. No p-value threshold has been set here. In fact, the method does not even compute p-values. Compare to Fig. 1c in the paper, and you'll see the categories mostly agree, and that we avoided duplicate entries for lipid metabolism. The approach can be augmented with p-values by running something like goseq, which calculates the p-values for every GO category.
The R code for the functions that do the computations is here:
A script that does an analysis of the Marks et al. data is below, showing a simple way to get GO annotations into the right format, by exploiting a function in the goseq package.