MB620 Bioinformatics
University of New Haven
Instructor: Joel S. Bader
Class 8: Sequence Similarity Searching. ORF Finding.


Agenda


Master Outline

Genetics
Traits/Genes to Location Genetic and physical maps
Research Genetics mapping panel
Stanford mapping server
Traits/Genes to Experimental Organisms Jackson Laboratories
Trait/Gene Location Database OMIM, On-line Mendelian Inheritance in Man
Genomic DNA Analysis
Sequences to Contigs CuraTools
CAP, PHRAP
Contigs to mRNA Genscan
Grail
mRNA Analysis
DNA to Homologblastn, blastx, fasta
DNA to ProteinORF finders
NCBI ORF Finder
Protein Analysis

Sequence Similarity Searching

Where were we?

First step: has anyone seen a sequence like this before?
Why do this:

How to do this: sequence similarity searching

Sequence Alignment, Sequence Similarity Searching

Questions we will answer

There is one sequence similarity search you have almost certainly used, and it has virtually nothing to do with molecular biology. What is it? (Thinking of this example will usually provide good intuition.)

The Sequence Alignment problem: Take two strings of characters. Align the letters in one string with the letters in the second string, allowing gaps and mismatches in the alignment. The score of the alignment is the number of matched letters. What is the best alignment?

Example: matching co-evolved words

alignment: MOTHER  M OTHER   MOTHER
           . . ..  .  . ..   .
           MUTTER  MUTT ER   MU TTER
score:     4       4         1
There can be a tie for the best alignment.
How would you adjust the rules to make the first alignment best?
How would you adjust the rules to say that O-U isn't a bad mismatch?

Visual aid: dotplots

  M O T H E R
M .
U
T     .
T     .
E         .
R           .
Put a dot where letters match. What does an exact alignment look like?
  B I O I N F O R M A T I C S
B .
I   .   .               .
O     .       . 
I   .   .               .
N         .
F           .
O     .       . 
R               .
M                 .
A                   .
T                     .
I   .   .               .
C                         .
S                           .
Long diagaonal: filled
Off-diagonal: random matches
General description: Search for a long diagonal filled with dots.
Can't re-use letters in the alignment.
Alignments = paths where each step moves 1+ rows down and 1+ columns right
Score of path = # of dots
Needleman-Wunsch, Smith-Waterman: algorithms for finding the best complete path
Blast, Fasta: algorithms for finding the high-scoring segment pairs (HSPs) (look for local matches, then extend)

Describing sequence similarity: % sequence identity (SI), length of match

Blast Example: start with human insulin gene, take it to CuraTools
gb:GENBANK-ID:HUMINS01|acc:J00265 Human insulin gene
Reading blast output:

BLASTN 2.0a19MP-WashU [05-Feb-1998] [Build sol2.5-ultra 01:47:25 05-Feb-1998]

Reference:  Gish, Warren (1994-1997).  unpublished.
Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J.
Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol. 215:403-10.

Notice:  this program and its default parameter settings are optimized to find
nearly identical sequences rapidly.  To identify weak similarities encoded in
nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query=  HUMINS01 Human insulin gene, complete cds.
        (4044 letters)

Database:  /opt/database/public/blast/gbcomp
           565,116 sequences; 1,083,671,655 total letters.

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

gb:GENBANK-ID:HUMINS01|acc:J00265 Human insulin gene, com... 20220  0.0       1
gb:GENBANK-ID:HUMINSTHIG|acc:L15440 Homo sapiens tyrosine... 20210  0.0       1
gb:GENBANK-ID:HSINSU|acc:V00565 Human gene for preproinsu... 20082  0.0       1
gb:GENBANK-ID:HUMINSPR|acc:M10039 Human alpha-type insuli... 11244  0.0       1
gb:GENBANK-ID:PTPPINS|acc:X61089 P.troglodytes gene for p...  9328  0.0       2
gb:GENBANK-ID:E00022|acc:E00022 DNA coding for human prep...  8846  0.0       1
gb:GENBANK-ID:HSA9655|acc:AJ009655 Homo sapiens ins gene,...  6956  2.1e-308  1
gb:GENBANK-ID:CEPPINS|acc:X61092 C.aethiops gene for prep...  3821  3.2e-284  3
gb:GENBANK-ID:HUMINS1UA|acc:J00266 Human insulin gene: al...  4228  3.5e-185  1
gb:GENBANK-ID:ATRINS|acc:J02989 Owl monkey (A.trivirgatus...  3812  1.0e-176  2
gb:GENBANK-ID:HUMINSHVR|acc:M26868 Human hypervariable DN...  2752  1.6e-118  1
gb:GENBANK-ID:HUMINS1UB|acc:J00267 Human insulin gene: be...  2514  1.4e-107  1
gb:GENBANK-ID:A57040|acc:A57040 Sequence 2 from Patent WO...  2220  2.2e-99   2
gb:GENBANK-ID:I55872|acc:I55872 Sequence 2 from patent US...  2220  2.2e-99   2
gb:GENBANK-ID:S99616|acc:S99616 insulin {promoter} [human...  2220  2.2e-99   2
gb:GENBANK-ID:HSIGF2AP|acc:X05331 Human insulin/IGF-II in...  2145  1.8e-90   1
gb:GENBANK-ID:MMINSIIG|acc:X04724 Mouse preproinsulin gen...   759  1.2e-75   3
gb:GENBANK-ID:RATINSII|acc:J00748 Rat insulin II gene (in...   713  9.7e-67   4
gb:GENBANK-ID:MMINSIG|acc:X04725 Mouse preproinsulin gene...   692  4.1e-64   3
gb:GENBANK-ID:CLINSU|acc:V00179 Dog gene encoding insulin...   776  6.0e-64   3
gb:GENBANK-ID:RNINS2|acc:V01243 Rat gene for insulin 2 - ...   713  1.1e-56   3

Format of output

Blast vs. Fasta
blast: faster, less sensitive, better if there's a strong match
fasta: slower, more sensitive, better if there's only weak homology
blast has been improved (gaps, multiple regions), more sensitive, used more often.

Blast Programs
ProgramQueryDatabaseUse
blastnDNADNAexact matches
blastxtrDNA protein analyzing new DNA and ESTs
blastpprotein protein distant relationships
tblastnprotein trDNA finding gene homologs in genomic DNA
tblastxtrDNA trDNA sensitive EST analysis
Why might blastx give hits that blastn missed?
What does evolution conserve?

ORF Finders

Usually it's useful to convert to protein before further analysis. ORF = open reading frame (not necessarily MET to Stop)
ccgctttcgt ctccgtcctg ctgccgttac cgccgctgct gccgccgctt gcgtcccccg
ctccggtctg tggtgcagcc gggacccagg accatgtctc tgtctcgctc agaggagatg
caccggctca cggaaaatgt ctataagacc atcatggagc agttcaaccc tagcctccgg
aacttcatcg ccatggggaa gaattacgag aaggcactgg caggtgtgac gtatgcagcc
aaaggctact ttgacgccct ggtgaagatg ggggagctgg ccagcgagag ccagggctcc
aaagaactcg gagacgttct cttccagatg gctgaagtcc acaggcagat ccagaatcag
ctggaagaaa tgctgaagtc ttttcacaac gagctgctta cgcagctgga gcagaaggtg
gagctggact ccaggtatct gagtgctgcg ctgaagaaat accagactga gcaaaggagc
aaaggcgacg ccctggacaa gtgtcaggct gagctgaaga agcttcggaa gaagagccag
ggcagcaaga atcctcagaa gtactcggac aaggagctgc agtacatcga cgccatcagc
aacaagcagg gcgagctgga gaattacgtg tccgacggct acaagaccgc actgacagag
gagcgcaggc gcttctgctt cctggtggag aagcagtgcg ccgtggccaa gaactccgcg
gcctaccact ccaagggcaa ggagctgctg gcgcagaagc tgccgctgtg gcaacaggcc
tgtgccgacc ccagcaagat cccggagcgc gcggtgcagc tcatgcagca ggtggccagc
aacggcgcca ccctccccag cgccctgtcg gcctccaagt ccaacctggt catttccgac
cccattccgg gggccaagcc cctgccggtg ccccccgagc tggcaccgtt cgtggggcgg
atgtctgccc aggagagcac acccatcatg aacggcgtca caggcccgga tggcgaggac
tacagcccgt gggctgaccg caaggctgcc cagcccaaat ccctgtctcc tccgcagtct
cagagcaagc tcagcgactc ctactccaac acactccccg tgcgcaagag cgtgacccca
aaaaacagct atgccaccac cgagaacaag actctgcctc gctcgagctc catggcagcc
ggcctggagc gcaatggccg tatgcgggtg aaggccatct tctcccacgc tgctggggac
aacagcaccc tcctgagctt caaggagggt gacctcatta ccctgctggt gcctgaggcc
cgcgatggct ggcactacgg agagagtgag aagaccaaga tgcggggctg gtttcccttc
tcctacaccc gggtcttgga cagcgatggc agtgacaggc tgcacatgag cctgcagcaa
gggaagagca gcagcacggg caacctcctg gacaaggacg acctggccat cccacccccc

Remember: for genomic DNA, do gene prediction (Genscan/Grail) first!

Two long ORFs: what to do?
What does NCBI recommend?
What's the answer?

What about sequencing errors? ORF find the following sequence.

ccgctttcgt ctccgtcctg ctgccgttac cgccgctgct gccgccgctt gcgtcccccg
ctccggtctg tggtgcagcc gggacccagg accatgtctc tgtctcgctc agaggagatg
caccggctca cggaaaatgt ctataagacc atcatggagc agttcaaccc tagcctccgg
aacttcatcg ccatggggaa gaattacgag aaggcactgg caggtgtgac gtatgcagcc
aaaggctact ttgacgccct ggtgaagatg ggggagctgg ccagcgagag ccagggctcc
aaagaactcg gagacgttct cttccagatg gctgaagtcc acaggcagat ccagaatcag
ctggaagaaa tgctgaagtc ttttcacaac gagctgctta cgcagctgga gcagaaggtg
gagctggact ccaggtatct gagtgctgcg ctgaagaaat accagactga gcaaaggagc
aaaggcgacg ccctggacaa gtgtcaggct gagctgaaga agcttcggaa gaagagccag
ggcagcaaga atcctcagaa gtactcggac aaggagctgc agtacatcga cgccatcagc
aacaagcagg gcgagctgga gaattacgtg tccgacggct acaagaccgc actgacagag
gagcgcaggc gcttctgctt cctggtggag aagcagtgcg ccgtggccaa gaactccgcg
gcctaccact ccaagggcaa ggagctgctg gcgcagaagc tgccgctgtg gcaacaggcc
tgtgccgacc ccagcaagat cccggagcgc gcggtgcagc tcatgcagca ggtggccagc
aacggcgcca ccctccccag cgccctgtcg gcctccaagt ccaacctggt catttccgac
cccattccgg gggccaagcc cct ccggtg ccccccgagc tggcaccgtt cgtggggcgg
atgtctgccc aggagagcac acccatcatg aacggcgtca caggcccgga tggcgaggac
tacagcccgt gggctgaccg caaggctgcc cagcccaaat ccctgtctcc tccgcagtct
cagagcaagc tcagcgactc ctactccaac acactccccg tgcgcaagag cgtgacccca
aaaaacagct atgccaccac cgagaacaag actctgcctc gctcgagctc catggcagcc
ggcctggagc gcaatggccg tatgcgggtg aaggccatct tctcccacgc tgctggggac
aacagcaccc tcctgagctt caaggagggt gacctcatta ccctgctggt gcctgaggcc
cgcgatggct ggcactacgg agagagtgag aagaccaaga tgcggggctg gtttcccttc
tcctacaccc gggtcttgga cagcgatggc agtgacaggc tgcacatgag cctgcagcaa
gggaagagca gcagcacggg caacctcctg gacaaggacg acctggccat cccacccccc


Homework for Class 9
  1. Find the best ORF in the following sequence. You might have to correct sequencing errors. How long (in aa) is the ORF? What's the Accno of the closest protein?
    
    gcatcgggat atcactatgg agtctggtcg tgtg
    aaggat gtaaggcctt ttttaaaaga
    agcattcaag gacataatga ttatatttgt ccagctacaa atcag
    tgtac aatcgataaa
    aaccggcgca agagctgcca ggc ctgccga cttcggaagt gttacgaa gt gggaatggtg
    aagtgtggct cccggagaga gagatgtggg taccgccttg tgcggagaca gagaagtgcc
    gacgagcagc tgcactgtgc cggcaaggcc aagagaagtg gcggccacgc gccccgagtg
    cgggagctgc tgctgg acgc cctgagcccc gagcagctag tgctcaccct cctggaggct
    gagccgcccc atgtgctgat cagccgcccc agtgcgccct tcaccgaggc ctccatgatg
    atgtccctga ccaagttggc cgacaaggag ttggtacaca tgatcagctg ggccaagaag
    attcccggct ttgtgg agct cagcctgttc gaccagtac ggctcttgga gagctgttgg
    atggaggtgt taatgatggg gctgatgtgg cgctcaattg accaccccgg caagctcatc
    tttgctccag atcttgttct ggacagggat gaggggaaat gcgtagaagg aattctggaa
    atctttgaca tgctcctggc aactacttca aggtttcgag agttaaaact ccaacacaaa
    gaatatctct ggtcaaggc catgatcctg ctcaattcca gtatgtaccc tctggtcaca
    gcgacccagg atgctgacag cagccggaag ctggctcact tgctgaacgc cgtgaccgat
    gctttggttt gggtgattgc caagagcggc atctcctccc agcagcaatc catgcgcctg
    gctaacctcc tgatgctcct gtcccacgtc aggcatgcga gggcagaaaa ggcctctcaa
    acactcacct catttggaat gaagatggag actcttttgc ctgaagcaac gatggagcag
    tgaccctcta atcaactcgg tggcctaa ag aaaaatcttgggtaacattt tcacttcaat
    ttccctctgg gatcattgta atccatgaaa aaaataattt taaagaaaga gttaaaatac
    tttgaagtta gttatgtggt taaaaaccac cttcctttct attatcaatc caacaatttg
    ataactgtaa acgctaaagt gaagacggat tctcttcaga tggtctcctt aactgcccag
    ggcttgcaga tgtctcaccc atgaggggca ccaatgtaga aagctgaggc ttcatctact
    gatgagcttc actggtttcc cctgaggttt gtgctttggc agagaagggg aggaggggac
    tgggattgtg tggtcagctg tggctgccaa cagatgcagg ttaggaactg tgttcagtat
    cttccaataa gaaaggggaa atgccgatgc ctatcctctt tgtttaggta gaaagtaaaa
    
    
  2. Now take the same sequence and do a blastx. What are the top matches?
  3. Find the ORF in this sequence. Again, identify any sequencing errors.
    gcacatagcc agtatgacct ggagcgactg cgtgctgccc agaaacagct tgagagggaa
    caggagcacg tgc  gccggga ggcaga gcgg ctcagc cagc  ggcagacaga acgggacctg
    tgtcaggttt cccatccaca taccaagctgatgaggatcc catcgttcttccccagtcct
    gaggag cccccctcgcca tc tgcaccttcc atagccaaat cagggtcatt ggactcagaa
    ctttcagtgt ccccaaaa  ag gaacagcatctctcggacac acaaagataa ggggcctttt
    cacatactga gtcaaccag ccag  acaaacaaaggaccag aagggcagag ccaggcccct
    gcgtccacct ctgcctctac ccgcctgttt gggttaacaa agccaaagga aaagaaggag
    aaaaaaaaga agaacaaaac cagccgctct cagcccggtg atggtcccgc gtcagaagta
    tcagcagagg gtgaagagat cttctgctga
    
  4. What is the map position of this gene? Is it close to any disease loci?

Copyright 1999 Joel S. Bader jsbader@curagen.com