MB620 Bioinformatics
University of New Haven
Instructor: Joel S. Bader
Class 9: Multiple Sequence Alignments. Phylogenetics.


Agenda


Master Outline

Genetics
  Traits/Genes to Location: Genetic and physical maps
    Research Genetics mapping panel
    Stanford mapping server
  Traits/Genes to Experimental Organisms: Jackson Laboratories
  Trait/Gene Location Database: OMIM, On-line Mendelian Inheritance in Man
Genomic DNA Analysis
  Sequences to Contigs: CuraTools
    CAP, PHRAP
  Contigs to mRNA: Genscan
    Grail
mRNA Analysis
  DNA to Homolog: blastn, blastx, fasta
  DNA to Protein: ORF finders
    NCBI ORF Finder
Protein Analysis
  Protein homologs: blastp
  Conserved residues: multiple sequence alignment, clustal-w
  Evolutionary history: Phylip, Paup
  blastp for linguistics: AltaVista Babelfish

Protein Annotation

Where were we?

First step: has anyone seen a protein sequence like this before?
Why do this:

Last week: DNA sequence analysis
This week: Protein sequence analysis

We're working on the human genome project. We find a gene, identify its ORF, and translate the ORF to get a protein sequence. What does the protein do?

Here is the protein sequence:


mkgsiftlfl fsvlfaisev rskesvrlcg leyirtviyi cassrwrrhl egipqaqqae
tgnsfqlphk refseenpaq nlpkvdasge drlwggqmpt eelwkskkhs vmsrqdlqtl
cctdgcsmtd lsalc

First, find homologs by sequence similarity searching (blastp at NCBI).
Sequences producing significant alignments:                        (bits)  Value

ref|NP_005469.1|PINSL5|  insulin-like 5 >gi|4768935|gb|AAD29686....   282  7e-76
gb|AAD29687.1|AF133817_1  (AF133817) insulin-like peptide INSL5 ...   163  4e-40
gi|3851207  (AC005952) INL3_HUMAN; LEY-I-L; RELAXIN-LIKE FACTOR;...    39  0.014
gi|3719459  (AF094580) relaxin-like factor [Bos taurus]                38  0.024
sp|P51461|INL3_PIG  LEYDIG INSULIN-LIKE PEPTIDE PRECURSOR (LEY-I...    38  0.032
sp|P51460|INL3_HUMAN  LEYDIG INSULIN-LIKE PEPTIDE PRECURSOR (LEY...    37  0.055
pir||A26463  relaxin - spiny dogfish (fragments)                       37  0.055
sp||RELX_SQUAC_1  [Segment 1 of 2] RELAXIN                             37  0.072
bbs|179129  (S82815) RLF=relaxin-like factor/insulin homolog {co...    36  0.12
Is this relaxin or another insulin-like protein?
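
One way to read the hit list above is to split it at an E-value cutoff. The sketch below parses the nine summary lines exactly as shown and separates strong hits from weak ones. The cutoff of 1e-5 and the names `parse_hits` and `E_CUTOFF` are assumptions made for illustration, not part of the blastp output.

```python
# Summary lines copied from the blastp output above:
# description, bit score, E-value.
HITS = """\
ref|NP_005469.1|PINSL5|  insulin-like 5 >gi|4768935|gb|AAD29686....   282  7e-76
gb|AAD29687.1|AF133817_1  (AF133817) insulin-like peptide INSL5 ...   163  4e-40
gi|3851207  (AC005952) INL3_HUMAN; LEY-I-L; RELAXIN-LIKE FACTOR;...    39  0.014
gi|3719459  (AF094580) relaxin-like factor [Bos taurus]                38  0.024
sp|P51461|INL3_PIG  LEYDIG INSULIN-LIKE PEPTIDE PRECURSOR (LEY-I...    38  0.032
sp|P51460|INL3_HUMAN  LEYDIG INSULIN-LIKE PEPTIDE PRECURSOR (LEY...    37  0.055
pir||A26463  relaxin - spiny dogfish (fragments)                       37  0.055
sp||RELX_SQUAC_1  [Segment 1 of 2] RELAXIN                             37  0.072
bbs|179129  (S82815) RLF=relaxin-like factor/insulin homolog {co...    36  0.12
"""

E_CUTOFF = 1e-5  # hypothetical threshold, chosen only for this example


def parse_hits(text):
    """Return (description, bits, evalue) for each non-empty summary line."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The last two whitespace-separated fields are the score and E-value.
        desc, bits, evalue = line.rsplit(None, 2)
        hits.append((desc.strip(), float(bits), float(evalue)))
    return hits


hits = parse_hits(HITS)
strong = [h for h in hits if h[2] < E_CUTOFF]
weak = [h for h in hits if h[2] >= E_CUTOFF]

for desc, bits, evalue in strong:
    print(f"strong: {bits:5.0f} bits  E={evalue:g}  {desc[:40]}")
print(f"{len(weak)} weaker hits, all relaxin or relaxin-like")
```

With this cutoff only the two insulin-like 5 entries count as strong hits. The relaxin-family hits are much weaker, which is why the alignment and tree that follow are needed to settle the question.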

Multiple Sequence Alignment

>human-unknown
mkgsiftlfl fsvlfaisev rskesvrlcg leyirtviyi cassrwrrhl egipqaqqae
tgnsfqlphk refseenpaq nlpkvdasge drlwggqmpt eelwkskkhs vmsrqdlqtl
cctdgcsmtd lsalc
>mouse-insl-5
mkgptlalfl llvllavvev rsrqtvklcg ldyvrtviyi cassrwrrhl eghfhsqqae
trnylqlldr hepskktleh slpktdlsgq elvrdpqapk eglwelkkhs vvsrrdlqal
ccregcsmke lstlc
>human-insl-3a
mdprlpawal vllgpalvfa lgpaptpemr eklcghhfvr alvrvcggpr wstearrpat
ggdrellqwl errhllhglv adsnltlgpg lqplpqtshh hrhhraaatn parycclsgc
tqqdlltlcp y
>bull-insl-3
mdrrpltwal vllgpalaia lgpaaaqeap eklcghhfvr alvrlcggpr wsseedgrpv
aggdrellrw legqhllhgl masgdpvlvl apqplpqasr hhhhrratai nparhcclsg
ctrqdlltlc ph
>pig-insl-3
mdphpltwal vllgpalals rapapaqeap eklcghhfvr alvrlcggpr wspedgrava
ggdrellqwl egqhlfhglm asgdpmlvla pqpppqasgh hhhrraaatn parhcclsgc
trqdlltlcp h
>human-insl-3b
mdprlpawal vllgpalvfa lgpaptpemr eklcghhfvr alvrvcggpr wstearrpaa
ggdrellqwl errhllhglv adsnltlgpg lqplpqtshh hrhhraaatn parycclsgc
tqqdlltlcp y
>mouse-relaxin-like
mraplllmll algsalrspq ppearaklcg hhlvrtlvrv cggprwspea tqpvetrdre
llqwleqrhl lhalvadvdp aldpqlprqa sqrqrrsaat navhrccltg ctqqdllglc
ph
>marmoset-relaxin-like
mdprlpawal vllgpalvfa lgpaptpemr eklcghhfvr alvrvcggpl wstearrpva
agdgellqwl errhllyglv ansepapggp glqpmpqtsh hhrhrraaas nparycclsg
csqqdlltlc p
>human-relaxin
mprlflfhll efclllnqfs ravaakwkdd viklcgrelv raqiaicgms twskrslsqe
dapqtprpva eivpsfinkd tetiiimlef ianlppelka alserqpslp elqqyvpalk
dsnlsfeefk klirnrqsea adsnpselky lgldthsqkk rrpyvalfek ccligctkrs
lakyc
>salmonella-atp-binding
msqpllavng lmmrfgglla vnnvslelre reivsligpn gagkttvfnc ltgfykptgg
titlrerhle glpgqqiarm gvvrtfqhvr lfremtvien llvaqhqqlk tglfsgllkt
pafrraqsea ldraatwler igllehanrq asnlaygdqr rleiarcmvt qpeilmldep
aaglnpketk eldeliaelr nhhnttilli ehdmklvmgi sdriyvvnqg tplangtpee
irnnpdvira ylgea
>pan-relaxin-2
sravadswmd eviklcgrel vraqiaicgk stwskrslsq edapqtprpv aeivpsfink
dtetinmmse fvanlpqelk ltlsemqpal pqlqqyvpvl kdssllfeef kklirnrqse
aadsspselk ylgldthsrk krqlysalan kcchvgctkr slarfc
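
The sequences above are in FASTA format, with residues grouped in blocks of ten separated by spaces. Before passing them to another tool it can help to strip the spacing and check the lengths. The following is a minimal sketch; the function name `parse_fasta` is an assumption, and only two of the records are included to keep the example short.

```python
FASTA = """\
>human-unknown
mkgsiftlfl fsvlfaisev rskesvrlcg leyirtviyi cassrwrrhl egipqaqqae
tgnsfqlphk refseenpaq nlpkvdasge drlwggqmpt eelwkskkhs vmsrqdlqtl
cctdgcsmtd lsalc
>mouse-insl-5
mkgptlalfl llvllavvev rsrqtvklcg ldyvrtviyi cassrwrrhl eghfhsqqae
trnylqlldr hepskktleh slpktdlsgq elvrdpqapk eglwelkkhs vvsrrdlqal
ccregcsmke lstlc
"""


def parse_fasta(text):
    """Map each '>' header to its sequence, with spaces and newlines removed."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Drop the spaces between ten-residue groups and use upper case.
            records[name].append("".join(line.split()).upper())
    return {n: "".join(parts) for n, parts in records.items()}


seqs = parse_fasta(FASTA)
for name, seq in seqs.items():
    print(f"{name:15s} {len(seq):4d} residues, {seq.count('C')} cysteines")
```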

Go to CuraTools, paste in sequences, run Clustal-W

Multiple sequence alignment algorithm

CLUSTAL W (1.7) multiple sequence alignment

pan-relaxin-2               -------------------SRAVA----DSWMDEV-IKLCGRELV-RAQI
human-relaxin               MPRLFLFHLLEFCLLLNQFSRAVA----AKWKDDV-IKLCGRELV-RAQI
human-insl-3b               --MDPRLPAWALVLLGPALVFALG----PAPTPEMREKLCGHHFV-RALV
human-insl-3a               --MDPRLPAWALVLLGPALVFALG----PAPTPEMREKLCGHHFV-RALV
marmoset-relaxin-like       --MDPRLPAWALVLLGPALVFALG----PAPTPEMREKLCGHHFV-RALV
pig-insl-3                  --MDPHPLTWALVLLGPALALSRA----PAPAQEAPEKLCGHHFV-RALV
bull-insl-3                 --MDRRPLTWALVLLGPALAIALG----PAAAQEAPEKLCGHHFV-RALV
mouse-relaxin-like          -------MRAPLLLMLLALGSALR----SPQPPEARAKLCGHHLV-RTLV
mouse                       ------MKGPTLALFLLLVLLAVV----EVRSRQT-VKLCGLDYV-RTVI
human-unknown               ------MKGSIFTLFLFSVLFAIS----EVRSKES-VRLCGLEYI-RTVI
salmonella-atp-binding      --MSQPLLAVNGLMMRFGGLLAVNNVSLELREREI-VSLIGPNGAGKTTV
                                                 :           :    * * .   :: :

pan-relaxin-2               AICGKSTWSKRS-LSQEDAPQ-TPRPVAEIVPSFIN------KDTETINM
human-relaxin               AICGMSTWSKRS-LSQEDAPQ-TPRPVAEIVPSFIN------KDTETIII
human-insl-3b               RVCGGPRWSTE-----------ARRPAAGGDRELL-------QWLERRHL
human-insl-3a               RVCGGPRWSTE-----------ARRPATGGDRELL-------QWLERRHL
marmoset-relaxin-like       RVCGGPLWSTE-----------ARRPVAAGDGELL-------QWLERRHL
pig-insl-3                  RLCGGPRWSPE-----------DGRAVAGGDRELL-------QWLEGQHL
bull-insl-3                 RLCGGPRWSSE----------EDGRPVAGGDRELL-------RWLEGQHL
mouse-relaxin-like          RVCGGPRWSPE-----------ATQPVETRDRELL-------QWLEQRHL
mouse                       YICASSRWRRHL-EG-------HFHSQQAETRNYL-------QLLDRHEP
human-unknown               YICASSRWRRHL-EG-------IPQAQQAETGNSF-------QLPHKREF
salmonella-atp-binding      FNCLTGFYKPTGGTITLRERHLEGLPGQQIARMGVVRTFQHVRLFREMTV
                              *    :                 .        .       :       

pan-relaxin-2               MSEFVANLPQELKLTLSEMQPALPQLQQYVPVLKDSSLLFEEFKKLIRNR
human-relaxin               MLEFIANLPPELKAALSERQPSLPELQQYVPALKDSNLSFEEFKKLIRNR
human-insl-3b               LHGLVADSNLTLG-PG--LQP-LPQTSH-------------------HHR
human-insl-3a               LHGLVADSNLTLG-PG--LQP-LPQTSH-------------------HHR
marmoset-relaxin-like       LYGLVANSEPAPGGPG--LQP-MPQTSH-------------------HHR
pig-insl-3                  FHGLMASGDPMLV-LA--PQP-PPQASG-------------------HHH
bull-insl-3                 LHGLMASGDPVLV-LA--PQP-LPQASR-------------------HHH
mouse-relaxin-like          LHALVADVDPALD-----PQL-PRQAS---------------------QR
mouse                       SKKTLEHSLPKTDLSG-QELVRDPQAPK----------------EGLWEL
human-unknown               SEENPAQNLPKVDASG-EDRLWGGQMPT----------------EELWKS
salmonella-atp-binding      IENLLVAQHQQLKTGLFSGLLKTPAFRRAQSEALDRAATWLERIGLLEHA
                                                                            . 

pan-relaxin-2               QSEAADSSPSELKYLGLDTHSRKKRQLYSALANKCCHVGC--TKRSLARF
human-relaxin               QSEAADSNPSELKYLGLDTHSQKKRRPYVALFEKCCLIGC--TKRSLAKY
human-insl-3b               HHRAAATNP----------------------ARYCCLSGC--TQQDLLTL
human-insl-3a               HHRAAATNP----------------------ARYCCLSGC--TQQDLLTL
marmoset-relaxin-like       HRRAAASNP----------------------ARYCCLSGC--SQQDLLTL
pig-insl-3                  HRRAAATNP----------------------ARHCCLSGC--TRQDLLTL
bull-insl-3                 HRRATAINP----------------------ARHCCLSGC--TRQDLLTL
mouse-relaxin-like          QRRSAATNA----------------------VHRCCLTGC--TQQDLLGL
mouse                       KKHSVVSRR----------------D----LQALCCREGC--SMKELSTL
human-unknown               KKHSVMSRQ----------------D----LQTLCCTDGC--SMTDLSAL
salmonella-atp-binding      NRQASNLAYG--------------------DQRRLEIARCMVTQPEILML
                            : .:                                   *  :  .:   

pan-relaxin-2               C-------------------------------------------------
human-relaxin               C-------------------------------------------------
human-insl-3b               CPY-----------------------------------------------
human-insl-3a               CPY-----------------------------------------------
marmoset-relaxin-like       CP------------------------------------------------
pig-insl-3                  CPH-----------------------------------------------
bull-insl-3                 CPH-----------------------------------------------
mouse-relaxin-like          CPH-----------------------------------------------
mouse                       C-------------------------------------------------
human-unknown               C-------------------------------------------------
salmonella-atp-binding      DEPAAGLNPKETKELDELIAELRNHHNTTILLIEHDMKLVMGISDRIYVV
                                                                              

pan-relaxin-2               ----------------------------
human-relaxin               ----------------------------
human-insl-3b               ----------------------------
human-insl-3a               ----------------------------
marmoset-relaxin-like       ----------------------------
pig-insl-3                  ----------------------------
bull-insl-3                 ----------------------------
mouse-relaxin-like          ----------------------------
mouse                       ----------------------------
human-unknown               ----------------------------
salmonella-atp-binding      NQGTPLANGTPEEIRNNPDVIRAYLGEA
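
The `*` symbols in the consensus line mark columns where every sequence has the same residue. As a rough illustration, the sketch below takes the last 16 columns of the fourth alignment block (copied from the output above) and finds the fully conserved columns, first with all eleven sequences and then without the Salmonella sequence. The function name `conserved_columns` is an assumption, and this is a simplified check, not a reimplementation of Clustal W's consensus scoring.

```python
# Last 16 columns of the fourth alignment block, copied from the output above.
ROWS = {
    "pan-relaxin-2":          "CCHVGC--TKRSLARF",
    "human-relaxin":          "CCLIGC--TKRSLAKY",
    "human-insl-3b":          "CCLSGC--TQQDLLTL",
    "human-insl-3a":          "CCLSGC--TQQDLLTL",
    "marmoset-relaxin-like":  "CCLSGC--SQQDLLTL",
    "pig-insl-3":             "CCLSGC--TRQDLLTL",
    "bull-insl-3":            "CCLSGC--TRQDLLTL",
    "mouse-relaxin-like":     "CCLTGC--TQQDLLGL",
    "mouse":                  "CCREGC--SMKELSTL",
    "human-unknown":          "CCTDGC--SMTDLSAL",
    "salmonella-atp-binding": "LEIARCMVTQPEILML",
}


def conserved_columns(rows):
    """Return indices of columns with one identical residue and no gaps."""
    conserved = []
    for i, column in enumerate(zip(*rows.values())):
        if "-" not in column and len(set(column)) == 1:
            conserved.append(i)
    return conserved


all_seqs = conserved_columns(ROWS)
family_only = conserved_columns(
    {name: row for name, row in ROWS.items() if name != "salmonella-atp-binding"}
)
print("conserved with Salmonella:   ", all_seqs)
print("conserved without Salmonella:", family_only)
```

With all eleven sequences only one column in this slice is identical throughout, matching the single `*` in that part of the consensus line. Dropping the Salmonella sequence brings out the C-C...G-C pattern shared by the relaxin and insulin-like sequences.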
                                                        

Phylogenetic Analysis

Two programs in widespread use: Phylip (free), Paup (costs $$$). We'll use Phylip from CuraTools.

  11 Populations

Neighbor-Joining/UPGMA method version 3.572c


 Neighbor-joining method

 Negative branch lengths allowed


     +human-insl
  +--8  
  !  +human-insl
  !  
--9marmoset-r
  !  
  !                            +pan-relaxi
  !            +---------------1  
  !            !               +---human-rela
  !     +------4  
  !     !      !                 +--mouse     
  !     !      !       +---------2  
  !  +--6      +-------3         +-----human-unkn
  !  !  !              !  
  !  !  !              +-----------------------------------salmonella
  +--7  !  
     !  +---mouse-rela
     !  
     !  +pig-insl-3
     +--5  
        +bull-insl-


remember: this is an unrooted tree!

Between        And            Length
-------        ---            ------
   9             8              0.10701
   8          human-insl       -0.00199
   8          human-insl        0.02003
   9          marmoset-r        0.18388
   9             7              0.21611
   7             6              0.41042
   6             4              1.13516
   4             1              2.69704
   1          pan-relaxi       -0.11219
   1          human-rela        0.62211
   4             3              1.28907
   3             2              1.71628
   2          mouse             0.46292
   2          human-unkn        0.91501
   3          salmonella        6.00235
   6          mouse-rela        0.74337
   7             5              0.46742
   5          pig-insl-3        0.15327
   5          bull-insl-        0.04251

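Interpreting the cladogram/dendrogram/phylogeny: The human unknown is closest to the mouse sequence (insulin-like protein 5), so it is most likely an insulin-like 5 protein rather than relaxin. This agrees with the top blastp hit.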
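
"Closest" can be checked directly from the branch-length table: the distance between two leaves is the sum of the branch lengths along the path joining them. The sketch below builds the tree as a graph from the table above and ranks every leaf by its distance from human-unkn. The two leaves that PHYLIP truncated to the same name, human-insl, are given the suffixes #1 and #2 here so they stay distinct; those suffixes and the function names are assumptions for illustration.

```python
from collections import defaultdict

# (node, node, branch length), copied from the table above.
EDGES = [
    ("9", "8", 0.10701),
    ("8", "human-insl#1", -0.00199),  # suffix added: names were truncated
    ("8", "human-insl#2", 0.02003),   # suffix added: names were truncated
    ("9", "marmoset-r", 0.18388),
    ("9", "7", 0.21611),
    ("7", "6", 0.41042),
    ("6", "4", 1.13516),
    ("4", "1", 2.69704),
    ("1", "pan-relaxi", -0.11219),
    ("1", "human-rela", 0.62211),
    ("4", "3", 1.28907),
    ("3", "2", 1.71628),
    ("2", "mouse", 0.46292),
    ("2", "human-unkn", 0.91501),
    ("3", "salmonella", 6.00235),
    ("6", "mouse-rela", 0.74337),
    ("7", "5", 0.46742),
    ("5", "pig-insl-3", 0.15327),
    ("5", "bull-insl-", 0.04251),
]


def build_graph(edges):
    """Undirected adjacency map: node -> {neighbour: branch length}."""
    graph = defaultdict(dict)
    for a, b, length in edges:
        graph[a][b] = length
        graph[b][a] = length
    return graph


def distances_from(graph, start):
    """Path length from start to every node (a tree has one path per pair)."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in graph[node].items():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                stack.append(neighbour)
    return dist


graph = build_graph(EDGES)
leaves = [n for n in graph if len(graph[n]) == 1]
dist = distances_from(graph, "human-unkn")
ranked = sorted((dist[leaf], leaf) for leaf in leaves if leaf != "human-unkn")

for d, leaf in ranked:
    print(f"{leaf:14s} {d:8.5f}")
```

The mouse leaf comes out nearest, at 0.91501 + 0.46292 = 1.37793, while human-rela is 7.23951 away.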

Evolutionary Biology and Phylogenetics

Here are some myoglobins:
>hum-myo
mglsdgewql vlnvwgkvea dipghgqevl irlfkghpet lekfdkfkhl ksedemkase
dlkkhgatvl talggilkkk ghheaeikpl aqshatkhki pvkylefise ciiqvlqskh
pgdfgadaqg amnkalelfr kdmasnykel gfqg
>pig-myo
mglsdgewql vlnvwgkvea dvaghgqevl irlfkghpet lekfdkfkhl ksedemkase
dlkkhgntvl talggilkkk ghheaeltpl aqshatkhki pvkylefise aiiqvlqskh
pgdfgadaqg amskalelfr ndmaakykel gfqg
>whale-myo
vlsegewqlv lhvwakvead vaghgqdili rlfkshpetl ekfdrfkhlk teaemkased
lkkhgvtvlt algailkkkg hheaelkpla qshatkhkip ikylefisea iihvlhsrhp
gdfgadaqga mnkalelfrk diaakykelg yqg
>dog-myo
glsdgewqiv lniwgkvetd laghgqevli rlfknhpetl dkfdkfkhlk tedemkgsed
lkkhgntvlt alggilkkkg hheaelkpla qshatkhkip vkylefisda iiqvlqskhs
gdfhadteaa mkkalelfrn diaakykelg fqg
>penguin-myo
glndqewqqv ltmwgkvesd laghghavlm rlfkshpetm drfdkfrglk tpdemrgsed
mkkhgvtvlt lgqilkkkgh heaelkplsq thatkhkvpv kylefiseai mkviaqkhas
nfgadaqeam kkalelfrnd maskykefgf qg
>horse-myo
glsdgewqqv lnvwgkvead iaghgqevli rlftghpetl ekfdkfkhlk teaemkased
lkkhgtvvlt alggilkkkg hheaelkpla qshatkhkip ikylefisda iihvlhskhp
gdfgadaqga mtkalelfrn diaakykelg fqg



   6 Populations

Neighbor-Joining/UPGMA method version 3.572c


 Neighbor-joining method

 Negative branch lengths allowed


     +--pig-myo   
  +--3  
  !  +------hum-myo   
  !  
  !    +--------dog-myo   
--4----1  
  !    +-------------------------------------penguin-my
  !  
  !  +------horse-myo 
  +--2  
     +------------whale-myo 


remember: this is an unrooted tree!

Between        And            Length
-------        ---            ------
   4             3              0.02899
   3          pig-myo           0.05372
   3          hum-myo           0.12577
   4             1              0.07569
   1          dog-myo           0.16442
   1          penguin-my        0.63682
   4             2              0.03773
   2          horse-myo         0.10914
   2          whale-myo         0.20575

The long-branch problem: why do dog and penguin end up together?

Distance-based algorithms vs. Maximum parsimony algorithms

A Worked Example

English  one  two   three  four     five
German   ein  zwei  drei   vier     funf
French   un   deux  trois  quatre   cinq
Italian  un   due   tre    quattro  cinque
Spanish  un   dos   tres   cuatro   cinco



Similarities (# of common letters)
    E   G   F   I   S
E
G   6
F   4   5
I   5   6  15
S   5   6  13  14
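
The notes do not say exactly how common letters are counted. One possible rule, assumed here purely for illustration, is the longest common subsequence (letters shared in the same order) summed over the five number words. Under that assumption the three Romance-language entries come out as 15, 14, and 13, as in the table. The English and German entries may differ, since the original counting rule is not specified.

```python
from itertools import combinations

# Number words as written in the worked example above.
WORDS = {
    "E": ["one", "two", "three", "four", "five"],
    "G": ["ein", "zwei", "drei", "vier", "funf"],
    "F": ["un", "deux", "trois", "quatre", "cinq"],
    "I": ["un", "due", "tre", "quattro", "cinque"],
    "S": ["un", "dos", "tres", "cuatro", "cinco"],
}


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(lang_a, lang_b):
    """Sum of per-word LCS lengths (an assumed rule, not the notes' own)."""
    return sum(lcs_length(x, y) for x, y in zip(WORDS[lang_a], WORDS[lang_b]))


for a, b in combinations(WORDS, 2):
    print(a, b, similarity(a, b))
```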

Branching
(I,F)
S,(I,F)
(G,E)
(G,E),(S,(I,F))
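
The branching order above can be reproduced mechanically from the similarity table: repeatedly join the two most similar groups, scoring a pair of groups by the average similarity between their members. This is a minimal sketch of average-linkage (UPGMA-style) clustering applied to similarities instead of distances. The function names are assumptions, and the numbers are taken from the table as given.

```python
from itertools import combinations

# Pairwise similarities from the table above.
SIM = {
    ("E", "G"): 6,
    ("E", "F"): 4, ("G", "F"): 5,
    ("E", "I"): 5, ("G", "I"): 6, ("F", "I"): 15,
    ("E", "S"): 5, ("G", "S"): 6, ("F", "S"): 13, ("I", "S"): 14,
}


def leaf_sim(a, b):
    """Look up a pair in either order."""
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]


def members(cluster):
    """Flatten a nested tuple such as ('S', ('F', 'I')) into its leaves."""
    if isinstance(cluster, str):
        return [cluster]
    return members(cluster[0]) + members(cluster[1])


def average_sim(c1, c2):
    """Mean similarity over all leaf pairs drawn from the two clusters."""
    pairs = [(a, b) for a in members(c1) for b in members(c2)]
    return sum(leaf_sim(a, b) for a, b in pairs) / len(pairs)


def cluster(names):
    """Join the most similar pair until one cluster remains; return the joins."""
    clusters = list(names)
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average_sim(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append((c1, c2))
        merges.append((c1, c2))
    return merges


merges = cluster(["E", "G", "F", "I", "S"])
for step, merged in enumerate(merges, start=1):
    print(step, merged)
```

The joins come out in the same order as the list above: French with Italian first, then Spanish, then English with German, and finally the two groups together.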

              |------ English
   |----------|
   |          |------ German
---|
   |
   |
   |          
   |          |-- Spanish
   |----------|
              |   |- French
              |---|
                  |- Italian

Fun links: Tree of Life, http://phylogeny.arizona.edu/tree/phylogeny.html

Homework for Class 10
  1. Find 5 insulin-like human proteins (include insulin if you like) and build a phylogeny.
  2. Pick a protein and use it to determine the evolutionary branching of 3 birds, 3 reptiles, and 3 mammals.
  3. Pick a second protein and repeat the analysis, using the same species. What are the differences between the two trees? Are birds more closely related to reptiles or mammals?
  4. Extra credit. Pick 10 mammals and determine the evolutionary history using protein evolution. Find the full scientific names of the animals. Does the branching you determine match the branching in the scientific names?

Copyright 1999 Joel S. Bader jsbader@curagen.com