MB620 Bioinformatics
University of New Haven
Instructor: Joel S. Bader
Class 11: Protein Sequence Analysis. Orthologs. Motifs and Domains. Localization and Sorting. Structure Prediction.


Agenda


Master Outline

Genetics
  Traits/Genes to Location                Genetic and physical maps
                                          Research Genetics mapping panel
                                          Stanford mapping server
  Traits/Genes to Experimental Organisms  Jackson Laboratories
  Trait/Gene Location Database            OMIM, On-line Mendelian Inheritance in Man
Genomic DNA Analysis
  Sequences to Contigs                    CuraTools
                                          CAP, PHRAP
  Contigs to mRNA                         Genscan
                                          Grail
                                          tblastn (protein query, genomic database)
mRNA Analysis
  DNA to Homolog                          blastn, blastx, fasta
  DNA to Protein                          ORF finders
                                          NCBI ORF Finder
Protein Analysis
  protein homologs                        blastp
  conserved residues                      multiple sequence alignment, clustal-w
  evolutionary history                    Phylip, Paup
  blastp for linguistics                  AltaVista Babelfish
  Prokaryote homologs                     Clusters of Orthologous Groups
  Domains                                 Pfam
                                          Prosite
                                          Prodom
  Cellular localization                   Psort
  Secondary structure prediction          Consensus prediction
  Tertiary structure prediction           Swiss-Model
  Known folds                             SCOP
                                          CATH, DALI
  Structure similarity search             VAST
Bioinformatics Resources
  GenBank, OMIM, Blast, Entrez            NCBI
  Swiss-Prot, TrEMBL, Prosite             ExPASy

Protein Annotation

Where were we?

Motifs, Profiles, Domains

Evolution: conserved domains
Evidence: sequence similarity in multiple alignments

Use of profiles:

  1. Function to Structure
    Many genes, same function, this part is conserved, must be functional
  2. Structure to Function
    New gene, weak similarity to known sequences, what does it do?
    Gene with a mutation: is it relevant?
    Protein engineering: add or modify function

Example of a profile: Zinc finger protein from pfam

ADR1_YEAST/104-126    FVCE...VCT...RAFARQEHLKRHYRS...H
ADR1_YEAST/132-155    YPCG...LCN...RCFTRRDLLIRHAQK..IH
AGIE_RAT/269-291      YICE...ECG...IRCKKPSMLKKHIRT...H
AGIE_RAT/297-321      YVCK...LCN...FAFKTKGNLTKHMKSK.AH

consensus             YVCE...LCN...RAFKRK..LKKH.RS...H
second choice         .................KR....R..K.....
Profile, pattern, motif = sequence with substitution frequencies
Consensus = only show the most likely sequence, less information than profile
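
The difference between a profile and a consensus can be made concrete with a few lines of Python. This is a minimal sketch, not the method Pfam uses: it takes the four zinc finger rows shown above, counts symbols column by column (treating '.' as just another symbol), and then collapses each column to its most frequent symbol. The function names are made up for illustration, and ties are broken arbitrarily, so the result need not match the hand-curated consensus line at every position.

```python
from collections import Counter

# The four aligned zinc finger rows from the example; '.' marks a gap.
ALIGNMENT = [
    "FVCE...VCT...RAFARQEHLKRHYRS...H",
    "YPCG...LCN...RCFTRRDLLIRHAQK..IH",
    "YICE...ECG...IRCKKPSMLKKHIRT...H",
    "YVCK...LCN...FAFKTKGNLTKHMKSK.AH",
]

def build_profile(rows):
    """Return one {symbol: frequency} dict per alignment column."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({sym: n / len(rows) for sym, n in counts.items()})
    return profile

def consensus(profile):
    """Keep only the most frequent symbol in each column.

    Ties go to the symbol seen first when reading the rows top to bottom.
    """
    return "".join(max(col, key=col.get) for col in profile)

profile = build_profile(ALIGNMENT)
print(consensus(profile))   # one symbol per column
print(profile[0])           # full substitution frequencies for column 1
```

The profile keeps every observed frequency for a column, while the consensus throws away everything except the winner, which is the information loss the definition above refers to.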

Algorithms for using profiles to search

Building Motifs: Psi-Blast

How did we identify conserved regions?

Position-Specific Iterated Blast

Example:
MKVDLHVHSIVSKCSLNPKGLLEKFCIKKNIVPAICDHNKLTKL
NFAIPGEEIATNSGEFIGLFLTEEIPANLDLYEALDRVREQGALIYLPHPFDLNRRRS
LAKFNVLEEREFLKYVHVVEVFNSRCRSIEPNLKALEYAEKYDFAMAFGSDAHFIWEV
GNAYIKFSELNIEKPDDLSPKEFLNLLKIKTDELLKAKSNLLKNPWKTRWHYGKLGSK
YNIALYSKVVKNVRRKLNI
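
The core step of each Psi-Blast round, turning the aligned hits into position-specific scores that are then used for the next search, can be sketched as follows. This is a deliberately simplified illustration and not the Psi-Blast implementation: it assumes a uniform background frequency of 1/20 and a flat pseudocount, whereas the real program weights sequences and uses substitution-matrix-based pseudocounts. The seed columns are the first four positions of the zinc finger rows shown earlier, and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(rows, pseudocount=1.0):
    """Log-odds score for every amino acid at every column of an ungapped alignment."""
    n = len(rows)
    background = 1.0 / len(AMINO_ACIDS)      # assumption: uniform background
    pssm = []
    for column in zip(*rows):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_window(pssm, window):
    """Sum the position-specific scores for one candidate window."""
    return sum(col[aa] for col, aa in zip(pssm, window))

def best_hit(pssm, sequence):
    """Slide the matrix along the sequence; return (best score, 0-based start)."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

seed = ["FVCE", "YPCG", "YICE", "YVCK"]      # first four zinc finger columns
pssm = build_pssm(seed)
print(score_window(pssm, "YVCE"))            # residues seen in the seed score above zero
print(score_window(pssm, "AAAA"))            # unseen residues score below zero
```

In an iterated search, sequences scoring above a cutoff would be added to the alignment and the matrix rebuilt, which is how weak but consistent similarities accumulate over rounds.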

Prosite

Prosite = part of Swiss-Prot, has pre-defined motifs.
Two main functions:

Scan for pre-defined consensus patterns. Try the following:

MGKISSLPTQLFKCCFCDFLKVKMHTMSSSHLFYLALCLLTFTSSATAGP
ETLCGAELVDALQFVCGDRGFYFNKPTGYGSSSRRAPQTGIVDECCFRSC
DLRRLEMYCAPLKPAKSARSVRAQRHTDMPKTQKEVHLKNASRGSAGNKN
YRM

Among other features, find the following:

[5] PDOC00235 PS00262  INSULIN
Insulin family signature

            95-109 CCFRSCDLRRLEMYC 
More accession numbers!
PDOC = documentation on building the pattern
PS = Prosite information

Consensus pattern:

                C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Using the pattern, scan against Swiss-Prot to find homologs.
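
A Prosite consensus pattern maps almost directly onto a regular expression, which makes it easy to check a hit by hand. The sketch below converts the insulin family pattern above and scans the query sequence given earlier in this section. It is only an illustration and handles a subset of the Prosite syntax (single residues, [..] allowed sets, {..} forbidden sets, x wildcards, and (n) or (n,m) repeats); the function names are invented for this example.

```python
import re

def prosite_to_regex(pattern):
    """Translate a (subset of) Prosite pattern syntax into a Python regex."""
    element_re = re.compile(
        r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
    )
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

def scan(pattern, sequence):
    """Return (start, end, match) tuples, 1-based inclusive, overlaps allowed."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [
        (m.start(1) + 1, m.end(1), m.group(1))
        for m in regex.finditer(sequence.upper())
    ]

PATTERN = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"
QUERY = (
    "MGKISSLPTQLFKCCFCDFLKVKMHTMSSSHLFYLALCLLTFTSSATAGP"
    "ETLCGAELVDALQFVCGDRGFYFNKPTGYGSSSRRAPQTGIVDECCFRSC"
    "DLRRLEMYCAPLKPAKSARSVRAQRHTDMPKTQKEVHLKNASRGSAGNKN"
    "YRM"
)
print(prosite_to_regex(PATTERN))
print(scan(PATTERN, QUERY))   # [(95, 109, 'CCFRSCDLRRLEMYC')]
```

The single hit at residues 95-109 agrees with the Prosite scan result listed above. Note that the earlier CC at residues 14-15 does not match, because the cysteine required three positions later is missing.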

Prodom: Like Prosite, but French

They used Psi-Blast to build a motif database.

Pfam (more protein family searching!)

Again, try the protein sequences above.

Clusters of Orthologous Groups

General idea: protein sequence is conserved by evolution.
COGs = pre-analyzed ortholog families from completed genomes

Tour

Notice that Yeast often has pairs of homologs. Why might this be?

Here's a human sequence. Where does it fit in?

mtevgllsin lsinsthaal lpirydnrcr nmsqeqvaqk lakdpkpair frleqvvpaf
qdlvygwnrh evasvegdpv imksdgfpty hlacvvddhh mgishvlrgs ewlvstakhl
llyqalgwqp phfahlplll nrdgsklskr qgdvflehfa adgflpdsll diitncgsgf
aenqmgrtlp elitqfnltq vtchsalldl eklpefnrlh lqrlvsnesq rrqlvgklqv
lveeafgcql qnrdvlnpvy verilllrqg hicrlqdlvs pvysylwtrp avgraqldai
sekvdviakr vlg

Now try our old friend insulin. What happens and why?

malwmrllpl lallalwgpd paaafvnqhl cgshlvealy lvcgergffy tpktrreaed
lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn 

Phi-Blast: Combining Similarity Searching with Patterns

Pattern-Hit Initiated Blast

Cellular Localization: Psort

Predicting the function of a new protein

Where do proteins end up? What are the shipping instructions?
Location          Evidence
nucleus           nuclear localization signal
cytoplasm
plasma membrane   hydrophobic, membrane-spanning regions, glycosylation sites
secreted          signal peptide
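
One of the clues listed above, hydrophobic membrane-spanning regions, is easy to look for directly. The sketch below is not Psort's algorithm; it is the classic sliding-window hydropathy scan using the Kyte-Doolittle scale. The window length of 19 and the cutoff of 1.6 are the commonly quoted defaults for spotting candidate transmembrane helices and should be treated as assumptions to tune. The function name and the toy sequence are made up for illustration.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_windows(sequence, window=19, threshold=1.6):
    """Return (1-based start, mean hydropathy) for every window above the cutoff."""
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in segment) / window
        if mean >= threshold:
            hits.append((start + 1, round(mean, 2)))
    return hits

# Toy sequence: a charged flank, 19 leucines, another charged flank.
toy = "DEKR" * 5 + "L" * 19 + "DEKR" * 5
print(hydrophobic_windows(toy)[:3])
```

To try it on the example sequences in this section, paste a sequence in as one string with the position numbers and spaces removed. Runs of overlapping hits mark a candidate membrane-spanning segment; a single hydrophobic stretch at the very start of a protein is more likely a signal peptide than a transmembrane helix.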

Expert system

Psort: expert system for guessing cellular localization

Examples:

A secreted protein:
        1 malspflaav iplvlllsra ppsadtrttg hlcgkdlvna lyiacgvrgf fydptkmkrd
       61 tgalaaflpl ayaednesqd desiginevl kskrgiveqc chkrcsiydl enycn
A transmembrane receptor:
        1 mavaplrgal llwqllaagg aaleigrfdp ergrgaapcq aveipmcrgi gynltrmpnl
       61 lghtsqgeaa aelaefaplv qygchshlrf flcslyapmc tdqvstpipa crpmceqarl
      121 rcapimeqfn fgwpdsldca rlptrndpha lcmeapenat agpaephkgl gmlpvaprpa
      181 rppgdlgpga ggsgtcenpe kfqyveksrs caprcgpgve vfwsrrdkdf alvwmavwsa
      241 lcffstaftv ltfllephrf qyperpiifl smcynvysla fliravagaq svacdqeaga
      301 lyviqeglen tgctlvflll yyfgmasslw wvvltltwfl aagkkwghea ieahgsyfhm
      361 aawglpalkt iviltlrkva gdeltglcyv astdaaaltg fvlvplsgyl vlgssflltg
      421 fvalfhirki mktggtntek leklmvkigv fsilytvpat cvivcyvyer lnmdfwrlra
      481 teqpcaaaag pggrrdcslp ggsvptvavf mlkifmslvv gitsgvwvws sktfqtwqsl
      541 cyrkiaagra rakacrapgs ygrgthchyk aptvvlhmtk tdpslenpth l
A nuclear receptor:
        1 masredelrn cvvcgdqatg yhfnaltceg ckgffrrtvs ksigptcpfa gscevsktqr
       61 rhcpacrlqk cldagmrkdm ilsaealalr rakqaqrraq qtpvqlskeq eelirtllga
      121 htrhmgtmfe qfvqfrppah lfihhqplpt lapvlplvth fadintfmvl qvikftkdlp
      181 vfrslpiedq isllkgaave ichivlnttf clqtqnflcg plrytiedga rvgfqvefle
      241 llfhfhgtlr klqlqepeyv llaamalfsp drpgvtqrde idqlqeemal tlqsyikgqq
      301 rrprdrflya kllgllaelr sineaygyqi qhiqglsamm pllqeics

Protein Secondary Structure Prediction

Start with secondary structure prediction: alpha helix, beta sheet, loop
Accuracy: 70-80%
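
Assuming the accuracy figure above is the usual per-residue, three-state score (often called Q3), it is computed by comparing the predicted state of each residue with the state observed in a known structure. The sketch below shows the calculation; the two state strings are invented toy data (H = helix, E = strand, C = loop), not output from any real predictor.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must be the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Hypothetical toy example: one residue out of ten is predicted wrongly.
predicted = "HHHHEEEECC"
observed  = "HHHHEEECCC"
print(q3_accuracy(predicted, observed))   # 0.9
```

A score of 70-80% therefore means that roughly one residue in four or five is assigned to the wrong state, usually near the ends of helices and strands.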

Protein Tertiary Structure Prediction

Algorithm

Accuracy depends on sequence similarity

Structure Classification Systems

Purposes

Amino Acid Codes

The one letter codes for each of the amino acids do have rhyme and reason!

From Koni Stone, please send comments to koni@chem.csustan.edu

First, some reasons: The following amino acids all use the first initial as their code. These amino acids are either very common, or they have a unique first letter. To make the following list, use the mnemonic:
Give A Violin Lesson to Isabelle So That Cindy can Matriculate and wear a Professional Hat.

Now for some rhymes: And for the remnant:

Amino Acid Properties

Code    Letters AA Residue      Special properties
A       Ala     alanine         methyl functional group
C       Cys     cyst(e)ine      disulfide bonds for protein stability
D       Asp     aspartate       hydrophilic, COO(-) 
E       Glu     glutamate       hydrophilic, COO(-) 
F       Phe     phenylalanine   hydrophobic
G       Gly     glycine         no functional group
H       His     histidine       pKa of 6, used for acid-base chemistry
I       Ile     isoleucine      hydrophobic
K       Lys     lysine          hydrophilic, NH3(+)
L       Leu     leucine         hydrophobic
M       Met     methionine      start codon
N       Asn     asparagine      amide form of aspartate. found in asparagus
P       Pro     proline         causes turns in protein structure
Q       Gln     glutamine       amide form of glutamate
R       Arg     arginine        hydrophilic, like NH3(+)
S       Ser     serine          -OH group used in serine proteases
T       Thr     threonine       -OH group
V       Val     valine          hydrophobic
W       Trp     tryptophan      strong UV absorbance used to probe protein structure
Y       Tyr     tyrosine        strong UV absorbance, phosphorylated by tyrosine kinases
X               any

Homework for Class 10
  1. Pick 5 nuclear proteins, 5 cytoplasmic proteins, 5 membrane proteins, and 5 secreted proteins, and use Psort I and Psort II to predict where the proteins are localized. How many of each does each program get right? Which program does better?
  2. Help with the Human Genome Project!
    Try to analyze the following novel sequences predicted from human genomic information.
    What are the top homologs?
    What is the sequence similarity?
    What does the protein do?
    Where is it localized?
    Can you predict structure?
    How confident are you of the answer?
    MEPETLEARINRATNPLNKELDWASINGFCEQLNEDFEGPPLAT
    RLLAHKIQSPQEWEAIQALTVLETCMKSCGKRFHDEVGKFRFLNELIKVVSPKYLGSR
    TSEKVKNKILELLYSWTVGLPEEVKIAEAYQMLKKQGIVKSDPKLPDDTTFPLPPPRP
    KNVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEKISKRVNAIEEVN
    NNVKLLTEMVMSHSQGGAAAGSSEDLMKELYQRCERMRPTLFRLASDTEDNDEALAEI
    LQANDNLTQVINLYKQLVRGEEVNGDATAGSIP
    
    AFVTGLEEQLGKFGDKCIARGWDHQGDPLHKIQQDVAEHHKQIG
    NVLQIVESCSQLQGFQSEEVSPAEPASPGTPQQVKDKTLQESSFEDIMATRSSDWLRR
    PLGEDNQPETQLFWDKEPWFWHDTLTEQLWRIFAGMRILAHGELVLATAISSFTRHVF
    TCGRRGIKVWSLTGQVAEDRFPESHLPIQTPGAFLRTCLLSSNSRSLLTGGYNLASVS
    VWDLAAPSLHVKEQLPCAGLNCQALDANLDANLAFASFTSGVVRIWDLRDQSVVRDLK
    GYPDGVKSIVVKGYNIWTGGPDACLRCWDQRTIMKPLEYQFKSQIMSLSHSPQEDWVL
    LGMANGQQWLQSTSGSQRHMVGQKDSVILSVKFSPFGQWWASVGMDDFLGVYSMPAGT
    KVFEVPEMSPVTCCDVSSNNRLVVTGSGEHASVYQITY
    
    MDPNSILLSPQPQICSHLAEACTEGERSSSPPELDRDSPFPWSQ
    VPSSSPTDPEWFGDEHIQAKRARVETIVRGMCLSPNPLVPGNAQAGVSPRCPKKARER
    KRKQNLPTPQGLLMPAPAWDQGNRKGGPRVREQLHLLKQQLRHLQEHILQAAKPRDTA
    QGPGGCGTGKGPLSAKQGNGCGPRPWVVDGDHQQGTSKDLSGAEKHQESEKPSFLPSG
    APASLEILRKELTRAVSQAVDSVLQKFNRCITSQMIKWFSNFREFYYIQMEKSARQAI
    SDGVTNPKMLVVLRNSELFQALNMHYNKGNDFEISADHFSKLKTNFMQDLSVAVYSHL
    SAGRVYQLDNANSLQLTIAIGAPTSYMQKKLNFRL
    
    
  3. Memorize the table of amino acid codes and properties

Copyright 1999 Joel S. Bader jsbader@curagen.com