Gene Expression Informatics

Joel S. Bader, Ph.D.
jsbader@curagen.com
Cold Spring Harbor, Genome Informatics
October 17, 1999


Topics

  1. Gene expression experiments
  2. Downloading data sets
  3. Data mining and visualization
  4. Ideas you might try

1. Gene expression experiments

Understanding biological organization

Major progress: Genes, genome sequencing
Beachhead for post-genome era assault:
Gene function: associating genes with pathways, biological roles
Genetic variation: predictive, individualized medicine

Why cDNA?

Technologies

TaqMan, RT-PCR cDNA sequencing Differential Display Microarrays and Chips Generic Description

Top

2. Downloading data sets

People with catalogs of links

Our example data: Large-scale temporal gene expression mapping of CNS development
Link to the raw data and PDF. Reference: Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, Somogyi R (1998) Large-Scale Temporal Gene Expression Mapping of CNS Development. Proc Natl Acad Sci USA, 95:334-339.

Abstract: Examine gene function by looking for expression patterns that correlate with rat CNS developmental stages. Samples are 9 stages: embryonic (E11-21 days), post-natal (P0-P14), and adult (P90). Variables are 112 gene expression levels measured by RT-PCR.

The raw data from Somogyi's paper is available. For convenience, I have pre-parsed data using a perl parser that is also available.

As you go to different sites, you'll notice that each has its own format for saving information. You have to write an individual parser for each. Wouldn't it be nice to have a central repository for gene expression information in a standardized format that allowed you to make direct comparisons between different data sets? (See suggested projects.) The EBI has issued a press release stating their desire to do just this.

Top

3. Data mining and visualization

Simple idea: find genes that are up- or down-regulated in a particular comparison. Simple solution: sort a list. Quickly find that this information isn't sufficient.

Slightly more complex idea:

What is a good measure of expression profile similarity? Can we define a distance so that genes with similar profiles have a smaller distance, genes with different profiles have a larger distance? Think about how we measure sequence similarity: one way is to count the number of indels and subs to get a score.

Gene Expression Distances

Suppose we have genes A and B and look at several biological states numbered 1, 2, 3, ..., N:

GeneState 1State 2...State N
AA1A2...AN
BB1B2...BN

A1, A2, ... are numbers that tell us the expression level in some units (flourescence intensity, # of sequences in an EST database, intensity relative to some control, ...).

There are many ways to define a distance. A good reference is Multivariate Analysis by Mardia, Kent, and Bibby. Here are some typical choices to get D(A,B) between genes A and B.

MethodFormula Good pointsBad points
Euclidean distance D(A,B) = Sum (Ai - Bi)2,
i = 1 .. N
Easy to calculate Two genes that have exactly the same shape to their expression profiles but differ in amplitude can end up far apart. Genes with different profiles but that are both expressed at low levels can end up close together.
Pearson correlation coefficient distance corln coef C =
[Sum AiBi -
(1/N) Sum Ai Sum Bi] /
[SA SB]
SA = sqrt[Sum Ai2 -
(1/N) (Sum Ai)2]
SB = sqrt[Sum Bi2 -
(1/N) (Sum Bi)2]
Then D(A,B) = sqrt(2 - 2 C)
The correlation coefficient is a useful statistical measure. People will know (or at least pretend to know) what you're talking about. It fixes most of the problems with the Euclidean distance measure. A single large difference in the expression levels of A and B in a particular experiment (or a single bad data point) can dominate the measure, just as an outlier can through off a least squares line.
Spearman (rank-order) correlation Sort the expression values A1, A2, ..., AN, then replace each value by its order in the sorted list. For example, with (A1,A2,A3,A4) = (0.5, 0.7, 0.2, 1.3), the new values would be (A1,A2,A3,A4) = (2, 3, 1, 4). Do the same for the measurements of gene B, then follow the Pearson formulas. Non-parametric statistics are more robust than parametric statistics. Translation: this reduces the effect of outliers. Might be less sensitive than Pearson

Math Support: PDL

Perl has a great math package called the Perl Data Language (PDL) module that performs matrix operations quickly and easily. The CPU-intensive operations are performed in C for efficiency. You can read about PDL in Mastering Algorithms with Perl from O'Reilly. I hear they have a book devoted to PDL in the works.

Installing PDL

Generating distance

I've provided express.pl, a program using PDL to generate distances between genes. It expects to read from STDIN. To read the input from a file, type
perl express.pl < datafile
where datafile is the name of the file (for example, somogyi.txt). As output you get infile, a matrix of the distances between genes.

Suppose you wanted to look for similarities between biological states. One solution would be to take a matrix transpose before calculating the distances. The software already does this:
perl express.pl t < datafile
and here are the distances. This could be useful if the biological states are different drugs and you want to see which have the same mechanism of action.

Hierarchical Classification

A matrix of distances doesn't do much good. But you can use it to organize the genes according to similarity. One approach is hierarchical classification. There are simple algorithms that use pairwise distances to generate a branching tree. Here's a typical algorithm called neighbor joining:

  1. Start with each gene in its own cluster
  2. Pick the two closest clusters and join them
  3. Repeat until only one cluster remains
But how do we decide on the distance between clusters? There are three typical algorithms:
  1. Choose the shortest distance between pairs of genes in the two clusters (nearest neighbor, single linkage)
  2. Choose the average distance
  3. Choose the longest distance (complete linkage)
Phylip by Joe Felsenstein is a great program for doing this stuff. The module you'll need is neighbor_jsb.c, along with the phylip.h header and the makefile. I modified neighbor_jsb.c from the original neighbor.c to allow for longer sample names. Compile it with the command
make neighbor_jsb

I called the output of express.pl infile because that's the file that neighbor_jsb expects to read. Run neighbor_jsb and you get two files:

Important point: parentheses are part of the syntax of the treefile and really should be parsed out of sample names. I don't do this yet. You can load the treefile into a viewer like treeview (PCs and Macs only, and it barfs on parentheses). Here are the trees I got for the genes and the samples.
The samples. The early embryonic (11-15 days) branch off, as do the latest stages (14 days postnatal and adult). The late embryonic and early post-natal form a third cluster.
The genes
For comparison, here are the genes as clustered in Somogyi's paper by a similar method.


In a case like this, it's easy to define clusters, or waves, by eye. You can modify a hierarchical classification algorithm to build clusters by defining initial groups to contain all genes within a pre-determined distance of each other. In the figure below, also from the Somogyi paper, the temporal profiles are shown separately for each of the clusters.


Looking for pathways

We see that many genes have similar expression profiles. One hypothesis is that genes with similar functional roles will have similar expression profiles. In this example, instead of worrying about 112 individual genes we can think about 4 or 5 developmental waves.

Suppose that we could really define a set of factors F1, F2, ... so that the expression level of gene A is
A = L(A,1)F1 + L(A,2)F2 + ...,
the expression level of gene B is
B = L(B,1)F1 + L(B,2)F2 + ...,
and so on. This helps us understand what's going on if the number of factors we use is smaller than the number of genes. Also, the factors should end up having something to do with biological pathways or functional roles. (In the example we're using, this should be true because we can see by eye that 4 or 5 pathways describe the main features of the hierarchical classification.) Mathematically, our goal is to choose a set of factors F1, F2, ..., and a set of factor loadings L(A,1), L(A,2), ..., L(B,1), L(B,2), ..., that best explain the data.

The magical recipe is
L(A,i) = U(A,i)sqrt(Ei)
where U(A,i) is the ith eigenvector of the matrix HCH arranged in decreasing order of eigevalue Ei; C(A,B) is the correlation between genes A and B calculated from one of the recipes above, or set to -(1/2)D(A,B) where D is any distance metric, and H is the centering matrix with diagonal elements 1 - 1/N and off-diagonal elements -1/N with N the number of genes. When we multiply by the square root, it's called principal factor analysis. If we stop before the multiplication, it's called principal component analysis.

To get the principal components describing the genes, run
perl express.pl p < somogyi.txt
and to get the principal components describing the biological states,
perl express.pl tp < somogyi.txt

In this figure the biological states (embryonic days, post-natal days, and adult) are plotted according to the first and second principal factors. The first factor apparantly describes changes during embryonic development culminating at E18, and the second factor describes changes from E18 to Adult. Now we see why E18, E21, P0, and P7 were grouped together in the clustering.

Top

4. Ideas you might try

Top
Copyright 1999 Joel S. Bader jsbader@curagen.com