MB620 Bioinformatics at University of New Haven

MB620 Bioinformatics
University of New Haven
Instructor: Joel S. Bader
Class 8: Sequence Similarity Searching. ORF Finding.

Agenda

Sequence similarity searching
Dotplots
Postponed: Global Alignments, Needleman-Wunsch, Smith-Waterman, Blast, Fasta
blastn: DNA query vs. nucleotide database
blastx: DNA query vs. protein database
ORF Finding

Master Outline

Genetics
Traits/Genes to Location	Genetic and physical maps Research Genetics mapping panel Stanford mapping server
Traits/Genes to Experimental Organisms	Jackson Laboratories
Trait/Gene Location Database	OMIM, On-line Mendelian Inheritance in Man
Genomic DNA Analysis
Sequences to Contigs	CuraTools CAP, PHRAP
Contigs to mRNA	Genscan Grail
mRNA Analysis
DNA to Homolog	blastn, blastx, fasta
DNA to Protein	ORF finders NCBI ORF Finder
Protein Analysis

Sequence Similarity Searching

Where were we?

Found a piece of DNA or cDNA
Context: anywhere from positional cloning (lots of context) to shotgun sequencing (little context)
Characterize the information content

First step: has anyone seen a sequence like this before?
Why do this:

Learn from their results
Establish scientific priority (publication rights)
Establish intellectual property rights (patents)

How to do this: sequence similarity searching

Sequence Alignment, Sequence Similarity Searching

Questions we will answer

It's easy to say that two sequences are identical. But notice the word like. What does like mean?
Does likeness follow our intuition: identity, containments, overlaps, SNPs, sequencing errors, splice variants, common evolutionary ancestry (homologs, orthologs, paralogs). How do we measure likeness?
What's the difference between sequence similarity, %SI, and homology?
What are the applications that do this for us?
How relevant are the matched sequences?

There is one sequence similarity search you have almost certainly used, and it has virtually nothing to do with molecular biology. What is it? (Thinking of this example will usually provide good intuition.)

The Sequence Alignment problem: Take two strings of characters. Align the letters in one string with the letters in the second string, allowing gaps and mismatches in the alignment. The score of the alignment is the number of matched letters. What is the best alignment?

Example: matching co-evolved words

alignment: MOTHER  M OTHER   MOTHER
           . . ..  .  . ..   .
           MUTTER  MUTT ER   MU TTER
score:     4       4         1

There can be a tie for the best alignment.
How would you adjust the rules to make the first alignment best?
How would you adjust the rules to say that O-U isn't a bad mismatch?

Visual aid: dotplots

  M O T H E R
M .
U
T     .
T     .
E         .
R           .

Put a dot where letters match. What does an exact alignment look like?

  B I O I N F O R M A T I C S
B .
I   .   .               .
O     .       . 
I   .   .               .
N         .
F           .
O     .       . 
R               .
M                 .
A                   .
T                     .
I   .   .               .
C                         .
S                           .

Long diagaonal: filled
Off-diagonal: random matches
General description: Search for a long diagonal filled with dots.
Can't re-use letters in the alignment.
Alignments = paths where each step moves 1+ rows down and 1+ columns right
Score of path = # of dots
Needleman-Wunsch, Smith-Waterman: algorithms for finding the best complete path
Blast, Fasta: algorithms for finding the high-scoring segment pairs (HSPs) (look for local matches, then extend)

Describing sequence similarity: % sequence identity (SI), length of match

98+% SI over 50+bp: probably same gene
95+% SI: close homolog (same gene, different species)
80+% SI: close homolog
40+% SI: probable homolog
20+% SI: twilight zone

Blast Example: start with human insulin gene, take it to CuraTools
gb:GENBANK-ID:HUMINS01|acc:J00265 Human insulin gene
Reading blast output:

BLASTN 2.0a19MP-WashU [05-Feb-1998] [Build sol2.5-ultra 01:47:25 05-Feb-1998]

Reference:  Gish, Warren (1994-1997).  unpublished.
Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J.
Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol. 215:403-10.

Notice:  this program and its default parameter settings are optimized to find
nearly identical sequences rapidly.  To identify weak similarities encoded in
nucleic acid, use BLASTX, TBLASTN or TBLASTX.

Query=  HUMINS01 Human insulin gene, complete cds.
        (4044 letters)

Database:  /opt/database/public/blast/gbcomp
           565,116 sequences; 1,083,671,655 total letters.

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

gb:GENBANK-ID:HUMINS01|acc:J00265 Human insulin gene, com... 20220  0.0       1
gb:GENBANK-ID:HUMINSTHIG|acc:L15440 Homo sapiens tyrosine... 20210  0.0       1
gb:GENBANK-ID:HSINSU|acc:V00565 Human gene for preproinsu... 20082  0.0       1
gb:GENBANK-ID:HUMINSPR|acc:M10039 Human alpha-type insuli... 11244  0.0       1
gb:GENBANK-ID:PTPPINS|acc:X61089 P.troglodytes gene for p...  9328  0.0       2
gb:GENBANK-ID:E00022|acc:E00022 DNA coding for human prep...  8846  0.0       1
gb:GENBANK-ID:HSA9655|acc:AJ009655 Homo sapiens ins gene,...  6956  2.1e-308  1
gb:GENBANK-ID:CEPPINS|acc:X61092 C.aethiops gene for prep...  3821  3.2e-284  3
gb:GENBANK-ID:HUMINS1UA|acc:J00266 Human insulin gene: al...  4228  3.5e-185  1
gb:GENBANK-ID:ATRINS|acc:J02989 Owl monkey (A.trivirgatus...  3812  1.0e-176  2
gb:GENBANK-ID:HUMINSHVR|acc:M26868 Human hypervariable DN...  2752  1.6e-118  1
gb:GENBANK-ID:HUMINS1UB|acc:J00267 Human insulin gene: be...  2514  1.4e-107  1
gb:GENBANK-ID:A57040|acc:A57040 Sequence 2 from Patent WO...  2220  2.2e-99   2
gb:GENBANK-ID:I55872|acc:I55872 Sequence 2 from patent US...  2220  2.2e-99   2
gb:GENBANK-ID:S99616|acc:S99616 insulin {promoter} [human...  2220  2.2e-99   2
gb:GENBANK-ID:HSIGF2AP|acc:X05331 Human insulin/IGF-II in...  2145  1.8e-90   1
gb:GENBANK-ID:MMINSIIG|acc:X04724 Mouse preproinsulin gen...   759  1.2e-75   3
gb:GENBANK-ID:RATINSII|acc:J00748 Rat insulin II gene (in...   713  9.7e-67   4
gb:GENBANK-ID:MMINSIG|acc:X04725 Mouse preproinsulin gene...   692  4.1e-64   3
gb:GENBANK-ID:CLINSU|acc:V00179 Dog gene encoding insulin...   776  6.0e-64   3
gb:GENBANK-ID:RNINS2|acc:V01243 Rat gene for insulin 2 - ...   713  1.1e-56   3

Format of output

GenBank ID, Locus, Accno, definition
Score (related to number of matching bp)
E-Value: expected number of matches at least this good in a random db
Is high E-Value or low E-Value better?
Number of distinct HSP regions (large gaps)
Small gaps don't break an alignment

Blast vs. Fasta
blast: faster, less sensitive, better if there's a strong match
fasta: slower, more sensitive, better if there's only weak homology
blast has been improved (gaps, multiple regions), more sensitive, used more often.

Blast Programs

Program Query Database Use

blastn DNA DNA exact matches

blastx trDNA protein analyzing new DNA and ESTs

blastp protein protein distant relationships

tblastn protein trDNA finding gene homologs in genomic DNA

tblastx trDNA trDNA sensitive EST analysis

Why might blastx give hits that blastn missed?
What does evolution conserve?

Program	Query	Database	Use
blastn	DNA	DNA	exact matches
blastx	trDNA	protein	analyzing new DNA and ESTs
blastp	protein	protein	distant relationships
tblastn	protein	trDNA	finding gene homologs in genomic DNA
tblastx	trDNA	trDNA	sensitive EST analysis

ORF Finders

Usually it's useful to convert to protein before further analysis.

Information content
Biological relevance

ORF = open reading frame (not necessarily MET to Stop)

ccgctttcgt ctccgtcctg ctgccgttac cgccgctgct gccgccgctt gcgtcccccg
ctccggtctg tggtgcagcc gggacccagg accatgtctc tgtctcgctc agaggagatg
caccggctca cggaaaatgt ctataagacc atcatggagc agttcaaccc tagcctccgg
aacttcatcg ccatggggaa gaattacgag aaggcactgg caggtgtgac gtatgcagcc
aaaggctact ttgacgccct ggtgaagatg ggggagctgg ccagcgagag ccagggctcc
aaagaactcg gagacgttct cttccagatg gctgaagtcc acaggcagat ccagaatcag
ctggaagaaa tgctgaagtc ttttcacaac gagctgctta cgcagctgga gcagaaggtg
gagctggact ccaggtatct gagtgctgcg ctgaagaaat accagactga gcaaaggagc
aaaggcgacg ccctggacaa gtgtcaggct gagctgaaga agcttcggaa gaagagccag
ggcagcaaga atcctcagaa gtactcggac aaggagctgc agtacatcga cgccatcagc
aacaagcagg gcgagctgga gaattacgtg tccgacggct acaagaccgc actgacagag
gagcgcaggc gcttctgctt cctggtggag aagcagtgcg ccgtggccaa gaactccgcg
gcctaccact ccaagggcaa ggagctgctg gcgcagaagc tgccgctgtg gcaacaggcc
tgtgccgacc ccagcaagat cccggagcgc gcggtgcagc tcatgcagca ggtggccagc
aacggcgcca ccctccccag cgccctgtcg gcctccaagt ccaacctggt catttccgac
cccattccgg gggccaagcc cctgccggtg ccccccgagc tggcaccgtt cgtggggcgg
atgtctgccc aggagagcac acccatcatg aacggcgtca caggcccgga tggcgaggac
tacagcccgt gggctgaccg caaggctgcc cagcccaaat ccctgtctcc tccgcagtct
cagagcaagc tcagcgactc ctactccaac acactccccg tgcgcaagag cgtgacccca
aaaaacagct atgccaccac cgagaacaag actctgcctc gctcgagctc catggcagcc
ggcctggagc gcaatggccg tatgcgggtg aaggccatct tctcccacgc tgctggggac
aacagcaccc tcctgagctt caaggagggt gacctcatta ccctgctggt gcctgaggcc
cgcgatggct ggcactacgg agagagtgag aagaccaaga tgcggggctg gtttcccttc
tcctacaccc gggtcttgga cagcgatggc agtgacaggc tgcacatgag cctgcagcaa
gggaagagca gcagcacggg caacctcctg gacaaggacg acctggccat cccacccccc

Remember: for genomic DNA, do gene prediction (Genscan/Grail) first!

Two long ORFs: what to do?
What does NCBI recommend?
What's the answer?

What about sequencing errors? ORF find the following sequence.

ccgctttcgt ctccgtcctg ctgccgttac cgccgctgct gccgccgctt gcgtcccccg
ctccggtctg tggtgcagcc gggacccagg accatgtctc tgtctcgctc agaggagatg
caccggctca cggaaaatgt ctataagacc atcatggagc agttcaaccc tagcctccgg
aacttcatcg ccatggggaa gaattacgag aaggcactgg caggtgtgac gtatgcagcc
aaaggctact ttgacgccct ggtgaagatg ggggagctgg ccagcgagag ccagggctcc
aaagaactcg gagacgttct cttccagatg gctgaagtcc acaggcagat ccagaatcag
ctggaagaaa tgctgaagtc ttttcacaac gagctgctta cgcagctgga gcagaaggtg
gagctggact ccaggtatct gagtgctgcg ctgaagaaat accagactga gcaaaggagc
aaaggcgacg ccctggacaa gtgtcaggct gagctgaaga agcttcggaa gaagagccag
ggcagcaaga atcctcagaa gtactcggac aaggagctgc agtacatcga cgccatcagc
aacaagcagg gcgagctgga gaattacgtg tccgacggct acaagaccgc actgacagag
gagcgcaggc gcttctgctt cctggtggag aagcagtgcg ccgtggccaa gaactccgcg
gcctaccact ccaagggcaa ggagctgctg gcgcagaagc tgccgctgtg gcaacaggcc
tgtgccgacc ccagcaagat cccggagcgc gcggtgcagc tcatgcagca ggtggccagc
aacggcgcca ccctccccag cgccctgtcg gcctccaagt ccaacctggt catttccgac
cccattccgg gggccaagcc cct ccggtg ccccccgagc tggcaccgtt cgtggggcgg
atgtctgccc aggagagcac acccatcatg aacggcgtca caggcccgga tggcgaggac
tacagcccgt gggctgaccg caaggctgcc cagcccaaat ccctgtctcc tccgcagtct
cagagcaagc tcagcgactc ctactccaac acactccccg tgcgcaagag cgtgacccca
aaaaacagct atgccaccac cgagaacaag actctgcctc gctcgagctc catggcagcc
ggcctggagc gcaatggccg tatgcgggtg aaggccatct tctcccacgc tgctggggac
aacagcaccc tcctgagctt caaggagggt gacctcatta ccctgctggt gcctgaggcc
cgcgatggct ggcactacgg agagagtgag aagaccaaga tgcggggctg gtttcccttc
tcctacaccc gggtcttgga cagcgatggc agtgacaggc tgcacatgag cctgcagcaa
gggaagagca gcagcacggg caacctcctg gacaaggacg acctggccat cccacccccc

Homework for Class 9

Find the best ORF in the following sequence. You might have to correct sequencing errors. How long (in aa) is the ORF? What's the Accno of the closest protein?


gcatcgggat atcactatgg agtctggtcg tgtg
aaggat gtaaggcctt ttttaaaaga
agcattcaag gacataatga ttatatttgt ccagctacaa atcag
tgtac aatcgataaa
aaccggcgca agagctgcca ggc ctgccga cttcggaagt gttacgaa gt gggaatggtg
aagtgtggct cccggagaga gagatgtggg taccgccttg tgcggagaca gagaagtgcc
gacgagcagc tgcactgtgc cggcaaggcc aagagaagtg gcggccacgc gccccgagtg
cgggagctgc tgctgg acgc cctgagcccc gagcagctag tgctcaccct cctggaggct
gagccgcccc atgtgctgat cagccgcccc agtgcgccct tcaccgaggc ctccatgatg
atgtccctga ccaagttggc cgacaaggag ttggtacaca tgatcagctg ggccaagaag
attcccggct ttgtgg agct cagcctgttc gaccagtac ggctcttgga gagctgttgg
atggaggtgt taatgatggg gctgatgtgg cgctcaattg accaccccgg caagctcatc
tttgctccag atcttgttct ggacagggat gaggggaaat gcgtagaagg aattctggaa
atctttgaca tgctcctggc aactacttca aggtttcgag agttaaaact ccaacacaaa
gaatatctct ggtcaaggc catgatcctg ctcaattcca gtatgtaccc tctggtcaca
gcgacccagg atgctgacag cagccggaag ctggctcact tgctgaacgc cgtgaccgat
gctttggttt gggtgattgc caagagcggc atctcctccc agcagcaatc catgcgcctg
gctaacctcc tgatgctcct gtcccacgtc aggcatgcga gggcagaaaa ggcctctcaa
acactcacct catttggaat gaagatggag actcttttgc ctgaagcaac gatggagcag
tgaccctcta atcaactcgg tggcctaa ag aaaaatcttgggtaacattt tcacttcaat
ttccctctgg gatcattgta atccatgaaa aaaataattt taaagaaaga gttaaaatac
tttgaagtta gttatgtggt taaaaaccac cttcctttct attatcaatc caacaatttg
ataactgtaa acgctaaagt gaagacggat tctcttcaga tggtctcctt aactgcccag
ggcttgcaga tgtctcaccc atgaggggca ccaatgtaga aagctgaggc ttcatctact
gatgagcttc actggtttcc cctgaggttt gtgctttggc agagaagggg aggaggggac
tgggattgtg tggtcagctg tggctgccaa cagatgcagg ttaggaactg tgttcagtat
cttccaataa gaaaggggaa atgccgatgc ctatcctctt tgtttaggta gaaagtaaaa

Now take the same sequence and do a blastx. What are the top matches?

Find the ORF in this sequence. Again, identify any sequencing errors.

gcacatagcc agtatgacct ggagcgactg cgtgctgccc agaaacagct tgagagggaa
caggagcacg tgc  gccggga ggcaga gcgg ctcagc cagc  ggcagacaga acgggacctg
tgtcaggttt cccatccaca taccaagctgatgaggatcc catcgttcttccccagtcct
gaggag cccccctcgcca tc tgcaccttcc atagccaaat cagggtcatt ggactcagaa
ctttcagtgt ccccaaaa  ag gaacagcatctctcggacac acaaagataa ggggcctttt
cacatactga gtcaaccag ccag  acaaacaaaggaccag aagggcagag ccaggcccct
gcgtccacct ctgcctctac ccgcctgttt gggttaacaa agccaaagga aaagaaggag
aaaaaaaaga agaacaaaac cagccgctct cagcccggtg atggtcccgc gtcagaagta
tcagcagagg gtgaagagat cttctgctga