Gene Expression Informatics, Joel Bader

Gene Expression Informatics

Joel S. Bader, Ph.D.
jsbader@curagen.com
Cold Spring Harbor, Genome Informatics
October 17, 1999

Topics

Gene expression experiments
Downloading data sets
Data mining and visualization
- Hierarchical clustering
- Fuzzy clustering, self-organizing maps
- Geometrical representations
Ideas you might try

1. Gene expression experiments

Understanding biological organization

Phenotypes
Pathways
Proteins, drugs
mRNA, cDNA
Genes
Genetic variation, SNPs

Major progress: Genes, genome sequencing
Beachhead for post-genome era assault:
Gene function: associating genes with pathways, biological roles
Genetic variation: predictive, individualized medicine

Why cDNA?

Representative of the biological processes occurring in a cell
Technology exists (can't amplify proteins)

Technologies

TaqMan, RT-PCR

Few genes, many samples
Analogous to candidate gene approach

cDNA sequencing

Typical project: 1000-10,000 ESTs, multiple tissues
SAGE: 10x better coverage
You can just about download your own database for electronic northerns. See UniGene digital differential display. You can FTP Hs.data to associate library identifiers (LID) with each cluster, and Hs.Lib.info to associate each library with a tissue. This could be a project once you've done some work with databases.

Differential Display

Take cDNA pools, cut with REs, do electrophoretic separation, look for differences in band intensities (looks almost like a sequencing trace).
With a good enough database, can associate each band with an individual gene, generate a list of gene vs. expression level.
Without a database sequence, have to isolate and sequence to determine gene identity.
My company (CuraGen) uses this technology.

Microarrays and Chips

Major commercial providers: Incyte/Synteni, Affymetrix
Visualize the output of a single experiment as a list: (Gene, Expression level). Note that expression level might actually be relative to a matched control. This is a snapshot.
Build up data from multiple experiments

Generic Description

Start with a data matrix. Rows = Genes, Columns = Samples/Experiments, elements are gene expression measurements. Columns can be a time-series (gene expression movie), drug series, surgical treatments, genetic models, etc.
Identify genes by interesting patterns
Correlate genes with other biological information (promoters, known functions, disease association, ...)
Or, use gene expression pattern to learn about samples (pharmacogenomics)

2. Downloading data sets

Our example data: Large-scale temporal gene expression mapping of CNS development
Link to the raw data and PDF. Reference: Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, Somogyi R (1998) Large-Scale Temporal Gene Expression Mapping of CNS Development. Proc Natl Acad Sci USA, 95:334-339.

Abstract: Examine gene function by looking for expression patterns that correlate with rat CNS developmental stages. Samples are 9 stages: embryonic (E11-21 days), post-natal (P0-P14), and adult (P90). Variables are 112 gene expression levels measured by RT-PCR.

The raw data from Somogyi's paper is available. For convenience, I have pre-parsed data using a perl parser that is also available.

As you go to different sites, you'll notice that each has its own format for saving information. You have to write an individual parser for each. Wouldn't it be nice to have a central repository for gene expression information in a standardized format that allowed you to make direct comparisons between different data sets? (See suggested projects.) The EBI has issued a press release stating their desire to do just this.
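
To make the "one parser per site" point concrete, here is a minimal sketch of such a parser in Python. The file layout it expects (a tab-delimited header row of sample names followed by one row per gene) is an assumption made for illustration, not the actual format of the Somogyi data, and the names `parse_expression_table`, `geneA`, and `geneB` are hypothetical.

```python
import io


def parse_expression_table(handle):
    """Parse a tab-delimited expression table into (samples, genes, matrix).

    Assumed layout (hypothetical, chosen for this sketch):
      first data line : Gene<TAB>sample1<TAB>sample2...
      remaining lines : geneName<TAB>value1<TAB>value2...
    Blank lines and lines starting with '#' are skipped.
    """
    samples, genes, matrix = [], [], []
    header_seen = False
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if not header_seen:
            samples = fields[1:]  # column labels: samples/experiments
            header_seen = True
            continue
        if len(fields) != len(samples) + 1:
            raise ValueError(f"wrong number of columns for {fields[0]!r}")
        genes.append(fields[0])  # row label: gene name
        matrix.append([float(x) for x in fields[1:]])
    return samples, genes, matrix


# Made-up numbers, only to exercise the parser.
demo = io.StringIO(
    "Gene\tE11\tE13\tE15\n"
    "geneA\t0.9\t1.0\t0.7\n"
    "geneB\t0.0\t0.1\t0.4\n"
)
samples, genes, matrix = parse_expression_table(demo)
print(samples, genes, matrix)
```

Returning the row labels, column labels, and numeric matrix separately keeps the result close to the generic description above: rows are genes, columns are samples.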

3. Data mining and visualization

Simple idea: find genes that are up- or down-regulated in a particular comparison. Simple solution: sort a list. Quickly find that this information isn't sufficient.
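
The "sort a list" solution can be sketched in a few lines of Python. This is only an illustration: the log2 ratio as the measure of change, the pseudocount that guards against division by zero, and the function name `rank_by_log_ratio` are assumptions, not something prescribed by these notes.

```python
import math


def rank_by_log_ratio(genes, treated, control, pseudocount=1.0):
    """Return (gene, log2 ratio) pairs, most up-regulated first.

    The pseudocount is an assumption of this sketch; it keeps the ratio
    finite when an expression level is zero.
    """
    pairs = []
    for gene, t, c in zip(genes, treated, control):
        ratio = (t + pseudocount) / (c + pseudocount)
        pairs.append((gene, math.log2(ratio)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up expression levels for three hypothetical genes.
ranked = rank_by_log_ratio(["g1", "g2", "g3"], [7.0, 1.0, 3.0], [1.0, 7.0, 3.0])
for gene, log_ratio in ranked:
    print(gene, log_ratio)
```

A sorted list like this answers "what changed most in this one comparison" but says nothing about how a gene behaves across many samples, which is why the profile-based ideas that follow are needed.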

Slightly more complex idea:

Genes with similar functions should have similar expression profiles.
Sometimes the expression profile can tell us about function

What is a good measure of expression profile similarity? Can we define a distance so that genes with similar profiles have a smaller distance, genes with different profiles have a larger distance? Think about how we measure sequence similarity: one way is to count the number of indels and subs to get a score.

Gene Expression Distances

Suppose we have genes A and B and look at several biological states numbered 1, 2, 3, ..., N:

Gene	State 1	State 2	...	State N
A	A1	A2	...	AN
B	B1	B2	...	BN

A1, A2, ... are numbers that tell us the expression level in some units (fluorescence intensity, # of sequences in an EST database, intensity relative to some control, ...).

There are many ways to define a distance. A good reference is Multivariate Analysis by Mardia, Kent, and Bibby. Here are some typical choices to get D(A,B) between genes A and B.

Method	Formula	Good points	Bad points
Euclidean distance	D(A,B) = sqrt[Sum (Ai - Bi)²], i = 1 .. N	Easy to calculate	Two genes that have exactly the same shape to their expression profiles but differ in amplitude can end up far apart. Genes with different profiles but that are both expressed at low levels can end up close together.
Pearson correlation coefficient distance	Correlation coefficient C = [Sum AiBi - (1/N) Sum Ai Sum Bi] / [SA SB], where SA = sqrt[Sum Ai² - (1/N) (Sum Ai)²] and SB = sqrt[Sum Bi² - (1/N) (Sum Bi)²]. Then D(A,B) = sqrt(2 - 2 C)	The correlation coefficient is a useful statistical measure. People will know (or at least pretend to know) what you're talking about. It fixes most of the problems with the Euclidean distance measure.	A single large difference in the expression levels of A and B in a particular experiment (or a single bad data point) can dominate the measure, just as an outlier can throw off a least squares line.
Spearman (rank-order) correlation	Sort the expression values A1, A2, ..., AN, then replace each value by its order in the sorted list. For example, with (A1,A2,A3,A4) = (0.5, 0.7, 0.2, 1.3), the new values would be (A1,A2,A3,A4) = (2, 3, 1, 4). Do the same for the measurements of gene B, then follow the Pearson formulas.	Non-parametric statistics are more robust than parametric statistics. Translation: this reduces the effect of outliers.	Might be less sensitive than Pearson
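
The three distance choices in the table can be written out as a short Python sketch. It is not the code in express.pl; the function names are hypothetical, the Euclidean version uses the usual square-root form, and the ranking step assumes there are no tied values (ties would normally receive averaged ranks). A profile with zero variance would cause a division by zero in the Pearson formula and is not handled here.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Correlation coefficient C, following the sums given in the table."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    numerator = sum(x * y for x, y in zip(a, b)) - sum_a * sum_b / n
    s_a = math.sqrt(sum(x * x for x in a) - sum_a ** 2 / n)
    s_b = math.sqrt(sum(y * y for y in b) - sum_b ** 2 / n)
    return numerator / (s_a * s_b)


def pearson_distance(a, b):
    """D = sqrt(2 - 2C); the max() guards against tiny negative round-off."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * pearson_correlation(a, b)))


def ranks(values):
    """Replace each value by its 1-based position in sorted order (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_distance(a, b):
    """Rank both profiles, then apply the Pearson formulas to the ranks."""
    return pearson_distance(ranks(a), ranks(b))


print(ranks([0.5, 0.7, 0.2, 1.3]))  # the ranking example from the table
```

Two profiles with the same shape but different amplitude, such as (1, 2, 3) and (3, 5, 7), are far apart by the Euclidean measure but have a Pearson distance of essentially zero.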

Math Support: PDL

Perl has a great math package called the Perl Data Language (PDL) module that performs matrix operations quickly and easily. The CPU-intensive operations are performed in C for efficiency. You can read about PDL in Mastering Algorithms with Perl from O'Reilly. I hear they have a book devoted to PDL in the works.

Installing PDL

Go to the CPAN archive and download PDL (pronounced piddle), the perl data language module.
gunzip PDL.tar.gz; tar -xvf PDL.tar.gz
Read the INSTALL and DEPENDENCIES files. Unless you are eager for more work, edit perldl.conf to turn off WITH_3D, WITH_KARMA, WHERE_KARMA, and WITH_SLATEC by setting them to 0.
To use a local library located in /home/myname/perl/lib edit Makefile.PL and insert the line
use lib "/home/myname/perl";
To configure a local installation in /home/myname/perl/lib
perl Makefile.PL PREFIX=/home/myname/perl
After making the Makefile, do a make, then make test, then make install
If the documentation doesn't build properly, you can find it on the web. The installation should also have generated a local version in a directory called
.../site_perl/PDL/HtmlDocs/Index.html

Generating distance

I've provided express.pl, a program using PDL to generate distances between genes. It expects to read from STDIN. To read the input from a file, type
perl express.pl < datafile
where datafile is the name of the file (for example, somogyi.txt). As output you get infile, a matrix of the distances between genes.

Suppose you wanted to look for similarities between biological states. One solution would be to take a matrix transpose before calculating the distances. The software already does this:
perl express.pl t < datafile
and here are the distances. This could be useful if the biological states are different drugs and you want to see which have the same mechanism of action.
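
As an illustration of what a distance generator does, including the transpose trick, here is a minimal Python sketch. It is not a port of express.pl: the function names are hypothetical, Euclidean distance stands in for whichever measure you prefer, and the output layout assumes the standard PHYLIP square distance format (a count line, then one row per item with the name padded to 10 characters). The modified neighbor_jsb accepts longer names, and its exact field width is not stated here.

```python
import math


def transpose(matrix):
    """Swap rows and columns, so that samples become the rows."""
    return [list(column) for column in zip(*matrix)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(rows, distance=euclidean):
    """Symmetric matrix of pairwise distances between the rows."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = distance(rows[i], rows[j])
    return d


def format_square(names, d):
    """Assumed PHYLIP-style square layout: count line, then name + distances."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, d):
        values = " ".join(f"{value:.4f}" for value in row)
        lines.append(f"{name[:10].ljust(10)} {values}")
    return "\n".join(lines)


# Made-up data: two genes measured in two states.
data = [[0.0, 0.0], [3.0, 4.0]]
print(format_square(["geneA", "geneB"], distance_matrix(data)))
print(format_square(["state1", "state2"], distance_matrix(transpose(data))))
```

Because the distance code only ever sees "rows", comparing biological states instead of genes needs nothing more than the transpose.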

Hierarchical Classification

A matrix of distances doesn't do much good by itself. But you can use it to organize the genes according to similarity. One approach is hierarchical classification. There are simple algorithms that use pairwise distances to generate a branching tree. Here's a typical agglomerative algorithm, in the same family as neighbor joining:

Start with each gene in its own cluster
Pick the two closest clusters and join them
Repeat until only one cluster remains

But how do we decide on the distance between clusters? There are three typical choices:

Choose the shortest distance between pairs of genes in the two clusters (nearest neighbor, single linkage)
Choose the average distance (average linkage)
Choose the longest distance (complete linkage)
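
The join-the-closest-clusters loop and the three linkage rules can be sketched in plain Python as follows. This is a naive illustration for small data sets, not the algorithm implemented in Phylip; the function name `agglomerate`, the nested-tuple tree representation, and the toy distances are assumptions of the sketch.

```python
def agglomerate(names, dist, linkage="average"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the tree as nested tuples of names, e.g. (("a", "b"), "c").
    """
    combine = {
        "single": min,    # shortest pairwise distance
        "complete": max,  # longest pairwise distance
        "average": lambda values: sum(values) / len(values),
    }[linkage]

    # Start with each item in its own cluster: (subtree, member indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def cluster_distance(c1, c2):
        return combine([dist[i][j] for i in c1[1] for j in c2[1]])

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Join them and put the merged cluster back in the pool.
        merged = ((clusters[x][0], clusters[y][0]), clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters[0][0]


# Toy distances for four items spaced unevenly along a line.
names = ["a", "b", "c", "d"]
dist = [
    [0, 2, 5, 9],
    [2, 0, 3, 7],
    [5, 3, 0, 4],
    [9, 7, 4, 0],
]
print(agglomerate(names, dist, "single"))
print(agglomerate(names, dist, "complete"))
```

On this toy matrix the two linkage rules give different trees: single linkage chains c and then d onto the (a, b) pair, while complete linkage pairs c with d first. That is the kind of difference to look for when changing the linkage rule on real data.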

Phylip by Joe Felsenstein is a great program for doing this stuff. The module you'll need is neighbor_jsb.c, along with the phylip.h header and the makefile. I modified neighbor_jsb.c from the original neighbor.c to allow for longer sample names. Compile it with the command
make neighbor_jsb

I called the output of express.pl infile because that's the file that neighbor_jsb expects to read. Run neighbor_jsb and you get two files:

outfile, the tree in buaf format
treefile, the tree in treefile (Newick) format

Important point: parentheses are part of the syntax of the treefile and really should be parsed out of sample names. I don't do this yet. You can load the treefile into a viewer like treeview (PCs and Macs only, and it barfs on parentheses). Here are the trees I got for the genes and the samples.
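
One way to deal with the parentheses problem is to clean the names before they reach the tree file. The sketch below is a hypothetical helper, not part of express.pl or neighbor_jsb: it replaces characters that have special meaning in a Newick-style tree file with underscores and writes a nested-tuple tree as a tree string (topology only, no branch lengths).

```python
import re


def sanitize_name(name):
    """Replace parentheses, brackets, commas, colons, semicolons,
    quotes, and whitespace, all of which a tree-file parser may
    treat as syntax."""
    return re.sub(r"[()\[\],:;'\s]", "_", name)


def to_newick(tree):
    """Write a nested tuple of names as a Newick string (no branch lengths)."""
    def render(node):
        if isinstance(node, tuple):
            return "(" + ",".join(render(child) for child in node) + ")"
        return sanitize_name(node)
    return render(tree) + ";"


print(sanitize_name("gene (variant 2)"))
print(to_newick((("a", "b"), "c(1)")))
```

Replacing rather than deleting the offending characters keeps names that differ only in punctuation from collapsing into the same label quite as often, although collisions are still possible and worth checking for.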

The samples. The early embryonic stages (days 11-15) branch off, as do the latest stages (14 days postnatal and adult). The late embryonic and early postnatal stages form a third cluster.

The genes

For comparison, here are the genes as clustered in Somogyi's paper by a similar method.

In a case like this, it's easy to define clusters, or waves, by eye. You can modify a hierarchical classification algorithm to build clusters by defining initial groups to contain all genes within a pre-determined distance of each other. In the figure below, also from the Somogyi paper, the temporal profiles are shown separately for each of the clusters.
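
Grouping genes that lie within a pre-determined distance of each other can be read in more than one way. The sketch below takes the loosest reading, as an assumption: two genes belong to the same group if they are connected by a chain of neighbors, each closer than the cutoff. This is not necessarily the procedure used in the Somogyi paper, and the function name and toy distances are hypothetical.

```python
def threshold_groups(names, dist, cutoff):
    """Group items linked by chains of pairwise distances <= cutoff."""
    unassigned = set(range(len(names)))
    groups = []
    for start in range(len(names)):
        if start not in unassigned:
            continue
        unassigned.discard(start)
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i)
            # Collect neighbors first, then remove them from the pool.
            near = [j for j in unassigned if dist[i][j] <= cutoff]
            for j in near:
                unassigned.discard(j)
                stack.append(j)
        groups.append(sorted(names[i] for i in members))
    return groups


# Toy distances for four items spaced unevenly along a line.
names = ["a", "b", "c", "d"]
dist = [
    [0, 2, 5, 9],
    [2, 0, 3, 7],
    [5, 3, 0, 4],
    [9, 7, 4, 0],
]
print(threshold_groups(names, dist, cutoff=3))
print(threshold_groups(names, dist, cutoff=2))
```

A stricter reading would require every pair inside a group to be within the cutoff, which gives tighter groups; the choice mirrors the single versus complete linkage distinction in the tree-building step.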

Looking for pathways

We see that many genes have similar expression profiles. One hypothesis is that genes with similar functional roles will have similar expression profiles. In this example, instead of worrying about 112 individual genes we can think about 4 or 5 developmental waves.

Suppose that we could really define a set of factors F1, F2, ... so that the expression level of gene A is
A = L(A,1)F1 + L(A,2)F2 + ...,
the expression level of gene B is
B = L(B,1)F1 + L(B,2)F2 + ...,
and so on. This helps us understand what's going on if the number of factors we use is smaller than the number of genes. Also, the factors should end up having something to do with biological pathways or functional roles. (In the example we're using, this should be true because we can see by eye that 4 or 5 pathways describe the main features of the hierarchical classification.) Mathematically, our goal is to choose a set of factors F1, F2, ..., and a set of factor loadings L(A,1), L(A,2), ..., L(B,1), L(B,2), ..., that best explain the data.

The magical recipe is
L(A,i) = U(A,i)sqrt(Ei)
where U(A,i) is the component for gene A of the i-th eigenvector of the matrix HCH, with the eigenvectors arranged in decreasing order of eigenvalue Ei. C(A,B) is the correlation between genes A and B calculated from one of the recipes above, or set to -(1/2)D(A,B)² where D is any distance metric. H is the centering matrix, with diagonal elements 1 - 1/N and off-diagonal elements -1/N, where N is the number of genes. When we multiply by the square root, it's called principal factor analysis. If we stop before the multiplication, it's called principal component analysis.
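
The recipe can be tried on a toy example without any matrix library. The Python sketch below is an illustration under stated assumptions, not the PDL code in express.pl: the function names are hypothetical, C is built as minus one half of the squared separation of three made-up points on a line, and the eigenvectors are found by simple power iteration with deflation, which is only reasonable for small symmetric matrices whose leading eigenvalues are positive and well separated.

```python
import math


def double_center(c):
    """Return H C H: subtract row and column means, add back the grand mean."""
    n = len(c)
    row_mean = [sum(row) / n for row in c]
    col_mean = [sum(c[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[c[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def top_eigenpairs(m, k, iterations=500):
    """Leading (eigenvalue, unit eigenvector) pairs by power iteration."""
    n = len(m)
    work = [row[:] for row in m]
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.01 * i for i in range(n)]  # arbitrary starting vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        value = sum(v[i] * sum(work[i][j] * v[j] for j in range(n))
                    for i in range(n))
        pairs.append((value, v))
        # Deflate: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * v[i] * v[j]
    return pairs


def principal_factors(c, k=2):
    """loadings[i][A] = U(A, i+1) * sqrt(E_(i+1)); skip the sqrt for components."""
    loadings = []
    for value, vector in top_eigenpairs(double_center(c), k):
        scale = math.sqrt(max(value, 0.0))
        loadings.append([u * scale for u in vector])
    return loadings


# Three made-up points on a line at positions 0, 1, and 3.
positions = [0.0, 1.0, 3.0]
c = [[-0.5 * (p - q) ** 2 for q in positions] for p in positions]
first_factor = principal_factors(c, k=1)[0]
print(first_factor)
```

For points that really lie on a line, the first factor recovers their positions up to a shift and an overall sign, so differences between loadings reproduce the original separations. That is the sense in which a small number of factors can stand in for the full distance matrix.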

To get the principal components describing the genes, run
perl express.pl p < somogyi.txt
and to get the principal components describing the biological states,
perl express.pl tp < somogyi.txt

In this figure the biological states (embryonic days, post-natal days, and adult) are plotted according to the first and second principal factors. The first factor apparently describes changes during embryonic development culminating at E18, and the second factor describes changes from E18 to Adult. Now we see why E18, E21, P0, and P7 were grouped together in the clustering.

4. Ideas you might try

(Easy) Change express.pl to have command-line options to specify the input and output filenames. Try the Getopt::Long module.
(Easy) Change express.pl to call neighbor_jsb directly.
(Easy) Change express.pl to give principal factors and/or principal components, as specified by command-line options.
(Moderate) Change express.pl to try out different distance methods. Make this a command-line option.
(Moderate) Make a command-line version of neighbor_jsb where you can specify whether to use nearest distance, average distance, or longest distance for building the tree. See what this does to the trees.
(Moderate) Self-organizing maps can also be used to cluster and classify genes according to expression profile. A recent PNAS from MIT/Whitehead describes this method. Try it out.
(Moderate) Take a yeast expression set, do hierarchical clustering, and explore whether genes in the same cluster share a biological role. The MIPS database associates ORFs with biological function.
(Moderate) After you learn about CGI, make this program into a CGI script. Provide check-boxes for the various command-line options, a textarea or file upload to provide the data, and output the results in an easy-to-parse format.
(Challenging) Build an electronic northern database from UniGene. Download Hs.data, construct a table UNIGENE_LIB with primary key (UNIGENE_ID, LIBRARY_ID). Download library info, construct a table LIB_INFO with primary key LIBRARY_ID and additional columns LIBRARY_NAME and TISSUE. Then provide an interface for a user to select a number of tissues and the computer to generate a list of genes with significantly different expression levels. Or let the user provide a UniGene ID and find other genes that are co-expressed. Build a CGI front-end.
(Challenging) Build a server to hold gene expression profiles. Allow people to check data in and out, annotate the experiment (organism, tissue, treatment, ...), and make comparisons between different data sets. If you are interested in this, please send me your resume.
