MB620 Bioinformatics at University of New Haven

MB620 Bioinformatics
University of New Haven
Instructor: Joel S. Bader
Class 11: Protein Sequence Analysis. Orthologs. Motifs and Domains. Localization and Sorting. Structure Prediction.

Agenda

Protein Sequence Analysis: Orthologs, Clusters of Orthologous Groups
Motifs and Domains. Prosite, Prodom, Pfam, ...
Motifs on-the-fly: Psi-Blast and Phi-Blast
Cellular localization: Psort
Structure prediction

Master Outline

Genetics
Traits/Genes to Location	Genetic and physical maps, Research Genetics mapping panel, Stanford mapping server
Traits/Genes to Experimental Organisms	Jackson Laboratories
Trait/Gene Location Database	OMIM, On-line Mendelian Inheritance in Man
Genomic DNA Analysis
Sequences to Contigs	CuraTools CAP, PHRAP
Contigs to mRNA	Genscan, Grail, tblastn (protein query, genomic database)
mRNA Analysis
DNA to Homolog	blastn, blastx, fasta
DNA to Protein	ORF finders, NCBI ORF Finder
Protein Analysis
protein homologs	blastp
conserved residues	multiple sequence alignment, clustal-w
evolutionary history	Phylip, Paup
blastp for linguistics	AltaVista Babelfish
Prokaryote homologs	Clusters of Orthologous Groups
Domains	Pfam, Prosite, Prodom
Cellular localization	Psort
Secondary structure prediction	Consensus prediction
Tertiary structure prediction	Swiss-Model
Known folds	SCOP, CATH, DALI
Structure similarity search	VAST
Bioinformatics Resources
GenBank, OMIM, Blast, Entrez	NCBI
Swiss-Prot, TrEMBL, Prosite	ExPASy

Protein Annotation

Where were we?

Genetic or physical mapping: 10 Mb of DNA
Gene finding: 1 mRNA, 1-10 Kb
mRNA to CDS (coding sequence) and ORF (open reading frame)
ORF to protein
protein to annotated protein
novel protein: homologs, motifs, domains
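
The "mRNA to CDS and ORF" and "ORF to protein" steps can be sketched in a few lines of Python. This is a minimal sketch, assuming the standard genetic code and the three forward reading frames only; the function names `translate` and `longest_orf` are hypothetical helpers, not part of any tool named in these notes.

```python
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (first, second and third base each cycle through T, C, A, G).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDDDEEEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = ("TAA", "TAG", "TGA")


def translate(cds):
    """Translate a coding sequence codon by codon ('*' = stop, 'X' = unknown)."""
    cds = cds.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, len(cds) - 2, 3))


def longest_orf(mrna):
    """Longest ATG...stop stretch in the three forward frames ('' if none)."""
    mrna = mrna.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                orf = mrna[start:i + 3]        # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best
```

A real ORF finder would also scan the reverse complement and allow other genetic codes; this sketch leaves both out.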

Motifs, Profiles, Domains

Evolution: conserved domains
Evidence: sequence similarity in multiple alignments

Use of profiles:

Function to Structure
Many genes, same function, this part is conserved, must be functional
Structure to Function
New gene, weak similarity to known sequences, what does it do?
Gene with a mutation: is it relevant?
Protein engineering: add or modify function

Example of a profile: Zinc finger protein from pfam

ADR1_YEAST/104-126    FVCE...VCT...RAFARQEHLKRHYRS...H
ADR1_YEAST/132-155    YPCG...LCN...RCFTRRDLLIRHAQK..IH
AGIE_RAT/269-291      YICE...ECG...IRCKKPSMLKKHIRT...H
AGIE_RAT/297-321      YVCK...LCN...FAFKTKGNLTKHMKSK.AH

consensus             YVCE...LCN...RAFKRK..LKKH.RS...H
second choice         .................KR....R..K.....

Profile, pattern, motif = sequence with substitution frequencies
Consensus = only show the most likely sequence, less information than profile
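
To make the difference between a profile and a consensus concrete, here is a minimal Python sketch that counts residues column by column in the four aligned zinc finger segments above. It assumes "." marks a gap and breaks ties arbitrarily; the function names are hypothetical. The consensus printed in the notes comes from the full Pfam family, so a four-sequence consensus will not match it at every column.

```python
from collections import Counter

# The four aligned zinc finger segments shown above ('.' = gap).
ALIGNMENT = [
    "FVCE...VCT...RAFARQEHLKRHYRS...H",
    "YPCG...LCN...RCFTRRDLLIRHAQK..IH",
    "YICE...ECG...IRCKKPSMLKKHIRT...H",
    "YVCK...LCN...FAFKTKGNLTKHMKSK.AH",
]


def column_frequencies(alignment, gap="."):
    """Profile: one {residue: frequency} dict per column, gaps ignored."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


def consensus(profile, gap="."):
    """Consensus: keep only the most frequent residue in each column."""
    return "".join(max(col, key=col.get) if col else gap for col in profile)


profile = column_frequencies(ALIGNMENT)
print(consensus(profile))   # one residue per column
print(profile[0])           # the profile keeps the alternatives as well
```

The first column shows what the consensus throws away: the profile records both Y and F with their frequencies, while the consensus keeps only Y.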

Algorithms for using profiles to search

Use consensus in a blast (misses information about likely substitutions)
Build a scoring matrix, slide along a sequence
Build a Hidden Markov Model (more complicated, like using a mix of scoring matrices)
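
The second option, "build a scoring matrix, slide along a sequence", can be sketched as follows. This is a minimal sketch under stated assumptions: the matrix is built from the ungapped 15-residue block of the zinc finger alignment above, the background is uniform over the 20 amino acids, and a pseudocount of 0.5 stands in for proper sequence weighting. The names `build_pssm` and `scan_pssm` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Ungapped block (alignment columns 14-28) from the zinc finger example above.
BLOCK = [
    "RAFARQEHLKRHYRS",
    "RCFTRRDLLIRHAQK",
    "IRCKKPSMLKKHIRT",
    "FAFKTKGNLTKHMKS",
]


def build_pssm(block, pseudocount=0.5):
    """Log-odds score (bits) for every amino acid at every block position."""
    n = len(block)
    background = 1.0 / len(AMINO_ACIDS)          # assumption: uniform background
    pssm = []
    for column in zip(*block):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / (n + pseudocount * len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def scan_pssm(pssm, sequence):
    """Slide the matrix along the sequence; return (start, score) per window."""
    width = len(pssm)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        results.append((start, score))
    return results
```

Usage: `max(scan_pssm(build_pssm(BLOCK), seq), key=lambda hit: hit[1])` gives the best-scoring window. Unlike a consensus search, a window is rewarded for any residue seen in a column, in proportion to how often it was seen.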

Building Motifs: Psi-Blast

How did we identify conserved regions?

start with a protein
find homologs (blastp)
multiple alignment (clustalw)
identify conserved region (pattern recognition)
add to a list

Position-Specific Iterated Blast

Take a protein, blastp vs. GenBank
Build a multiple alignment and a profile
Use the profile to search the database
Estimate the significance
Repeat
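
The loop above can be illustrated with a toy version. This is only a sketch of the iteration idea, under loud assumptions: all sequences are pre-aligned and the same length, the "profile" is plain column frequencies, the score is the mean frequency of a sequence's residues, and a fixed score threshold replaces the significance estimate. Real Psi-Blast uses gapped local alignments, log-odds matrices, and E-values. The names and the toy sequences are hypothetical.

```python
def column_frequencies(members):
    """Profile from equal-length sequences: {residue: frequency} per column."""
    return [{aa: col.count(aa) / len(members) for aa in set(col)}
            for col in zip(*members)]


def profile_score(profile, sequence):
    """Mean per-column frequency of the sequence's residues (1.0 = perfect)."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, sequence)) / len(profile)


def iterated_search(query, database, threshold=0.6, max_rounds=10):
    """Search, rebuild the profile from the hits, and repeat until stable."""
    hits = set()
    for _ in range(max_rounds):
        members = [query] + [database[name] for name in sorted(hits)]
        profile = column_frequencies(members)
        new_hits = {name for name, seq in database.items()
                    if profile_score(profile, seq) >= threshold}
        if new_hits == hits:       # converged: no change in the hit list
            break
        hits = new_hits
    return sorted(hits)


TOY_DATABASE = {
    "close": "ACDEYGKI",      # 6 of 8 residues identical to the query
    "remote": "AWDEYGKL",     # 4 of 8 identical to the query, 6 of 8 to "close"
    "unrelated": "MNPQRSTV",
}
print(iterated_search("ACDEFGHI", TOY_DATABASE, max_rounds=1))  # single pass
print(iterated_search("ACDEFGHI", TOY_DATABASE))                # iterated
```

The single pass finds only the close homolog. Once that hit is folded into the profile, the remote sequence scores above threshold in the next round, which is the point of iterating.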

Example:

MKVDLHVHSIVSKCSLNPKGLLEKFCIKKNIVPAICDHNKLTKL
NFAIPGEEIATNSGEFIGLFLTEEIPANLDLYEALDRVREQGALIYLPHPFDLNRRRS
LAKFNVLEEREFLKYVHVVEVFNSRCRSIEPNLKALEYAEKYDFAMAFGSDAHFIWEV
GNAYIKFSELNIEKPDDLSPKEFLNLLKIKTDELLKAKSNLLKNPWKTRWHYGKLGSK
YNIALYSKVVKNVRRKLNI

Prosite

Prosite = part of Swiss-Prot, has pre-defined motifs.
Two main functions:

Take a sequence, look for motifs
Take a motif, look for sequences that contain it

Scan for pre-defined consensus patterns. Try the following:

MGKISSLPTQLFKCCFCDFLKVKMHTMSSSHLFYLALCLLTFTSSATAGP
ETLCGAELVDALQFVCGDRGFYFNKPTGYGSSSRRAPQTGIVDECCFRSC
DLRRLEMYCAPLKPAKSARSVRAQRHTDMPKTQKEVHLKNASRGSAGNKN
YRM

Among other features, find the following:

[5] PDOC00235 PS00262  INSULIN
Insulin family signature

            95-109 CCFRSCDLRRLEMYC

More accession numbers!
PDOC = documentation on building the pattern
PS = Prosite information

Consensus pattern:

                C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Using the pattern, scan against Swiss-Prot to find homologs.
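
A Prosite pattern is close to a regular expression, so the scan can be sketched with Python's `re` module. This is a minimal sketch, assuming only the common pattern elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` and `>` anchors); the function names are hypothetical and this is not how the Prosite server itself is implemented.

```python
import re

# One pattern element: optional '<', a residue spec, optional (n) or (n,m), optional '>'.
ELEMENT = re.compile(r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)")


def prosite_to_regex(pattern):
    """Convert a Prosite pattern such as C-x(2)-{P}-[ST] into a regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        n_anchor, core, low, high, c_anchor = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if n_anchor else "") + core + repeat + ("$" if c_anchor else ""))
    return "".join(parts)


def scan_pattern(sequence, pattern):
    """Return (start, end, match) for every hit, 1-based and inclusive."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")  # lookahead: overlapping hits
    return [(m.start() + 1, m.start() + len(m.group(1)), m.group(1))
            for m in regex.finditer(sequence)]


INSULIN_PATTERN = "C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C"
SEQUENCE = (
    "MGKISSLPTQLFKCCFCDFLKVKMHTMSSSHLFYLALCLLTFTSSATAGP"
    "ETLCGAELVDALQFVCGDRGFYFNKPTGYGSSSRRAPQTGIVDECCFRSC"
    "DLRRLEMYCAPLKPAKSARSVRAQRHTDMPKTQKEVHLKNASRGSAGNKN"
    "YRM"
)
print(scan_pattern(SEQUENCE, INSULIN_PATTERN))
```

On the test sequence above this reports a single hit, residues 95-109 (CCFRSCDLRRLEMYC), the same span as the Prosite output.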

Prodom: Like Prosite, but French

They used Psi-Blast to build a motif database.

Take the insulin sequence, go to Prodom
Blast to find likely domains, sorted by E-value
What a nice summary of the homolog family!

Pfam (more protein family searching!)

Start with Swiss-Prot to get domains
Hand-adjust the alignments
Build Hidden Markov Models to define family (better than consensus!)
Good job at letting you know what's significant

Again, try the protein sequences above.

Clusters of Orthologous Groups

General idea: protein sequence is conserved by evolution.
COGs = pre-analyzed ortholog families from completed genomes

Tour

Click on Table, see list of COGs organized by function
Click on J for translation
Click on COG0008
Click on COG0130

Notice that Yeast often has pairs of homologs. Why might this be?

Here's a human sequence. Where does it fit in?

mtevgllsin lsinsthaal lpirydnrcr nmsqeqvaqk lakdpkpair frleqvvpaf
qdlvygwnrh evasvegdpv imksdgfpty hlacvvddhh mgishvlrgs ewlvstakhl
llyqalgwqp phfahlplll nrdgsklskr qgdvflehfa adgflpdsll diitncgsgf
aenqmgrtlp elitqfnltq vtchsalldl eklpefnrlh lqrlvsnesq rrqlvgklqv
lveeafgcql qnrdvlnpvy verilllrqg hicrlqdlvs pvysylwtrp avgraqldai
sekvdviakr vlg

Now try our old friend insulin. What happens and why?

malwmrllpl lallalwgpd paaafvnqhl cgshlvealy lvcgergffy tpktrreaed
lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn

Phi-Blast: Combining Similarity Searching with Patterns

Pattern-Hit Initiated Blast

Start with a sequence and a pattern that occurs in it
Use the pattern to find database entries that contain the pattern and are similar to the sequence around the pattern
Feed into Psi-Blast
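
The pattern-plus-similarity idea can be sketched in Python as follows. This is a toy illustration, not the Phi-Blast algorithm: the pattern is given directly as a regular expression, similarity is just the number of identical residues in a fixed flank on each side of the pattern, and there are no gaps or statistics. The function names, the `flank` and `min_identity` parameters, and the toy sequences are all hypothetical.

```python
import re


def flank_identity(a, b):
    """Count identical residues at matching positions of two flanks."""
    return sum(x == y for x, y in zip(a, b))


def phi_search(query, pattern_regex, database, flank=5, min_identity=4):
    """Keep entries that contain the pattern AND resemble the query around it."""
    qm = re.search(pattern_regex, query)
    if qm is None:
        raise ValueError("the pattern must occur in the query")
    q_left = query[max(0, qm.start() - flank):qm.start()]
    q_right = query[qm.end():qm.end() + flank]
    hits = []
    for name, seq in database.items():
        best = None
        for m in re.finditer(pattern_regex, seq):
            left = seq[max(0, m.start() - flank):m.start()]
            right = seq[m.end():m.end() + flank]
            # Left flanks are compared reversed so residues nearest the
            # pattern line up even when one flank is cut short.
            score = (flank_identity(q_left[::-1], left[::-1])
                     + flank_identity(q_right, right))
            best = score if best is None else max(best, score)
        if best is not None and best >= min_identity:
            hits.append((name, best))
    return sorted(hits, key=lambda hit: -hit[1])


TOY_DATABASE = {
    "hit": "TTGLVCASCEELWK",           # pattern present, flanks similar
    "pattern_only": "MMMMMCQQCMMMMM",  # pattern present, flanks unrelated
    "no_pattern": "AAGLVAPRAEELYQ",    # flanks similar, pattern absent
}
print(phi_search("AAGLVCPRCEELYQ", r"C.{2}C", TOY_DATABASE))
```

Only the entry that satisfies both conditions survives. An entry with the pattern but unrelated flanks, or with similar flanks but no pattern, is dropped.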

Cellular Localization: Psort

Predicting the function of a new protein

Where do proteins end up? What are the shipping instructions?

Location	Evidence
nucleus	nuclear localization signal
cytoplasm
plasma membrane	hydrophobic, membrane-spanning regions, glycosylation sites
secreted	signal peptide

Expert system

Look at the information experts use to guess localization
Build a set of rules modeled on what the experts do
Automate the process

Psort: expert system for guessing cellular localization
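
A rule-based guesser in the spirit of the Location/Evidence table can be sketched in Python. This is a toy expert system, not Psort's actual rule set: it uses the Kyte-Doolittle hydropathy scale, and the window sizes, cutoffs, the 30-residue N-terminal region, and the "four basic residues in a row" nuclear signal are illustrative assumptions. The function names are hypothetical.

```python
import re

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_starts(seq, window, cutoff):
    """Start positions (0-based) of windows whose mean hydropathy >= cutoff."""
    starts = []
    for i in range(len(seq) - window + 1):
        mean = sum(KYTE_DOOLITTLE.get(aa, 0.0) for aa in seq[i:i + window]) / window
        if mean >= cutoff:
            starts.append(i)
    return starts


def guess_location(sequence):
    """Apply the rules in order; the first rule that fires wins."""
    seq = re.sub(r"[^A-Za-z]", "", sequence).upper()   # drop digits and spaces
    # Rule 1: hydrophobic, membrane-spanning region beyond the N-terminus.
    if any(start >= 30 for start in hydrophobic_starts(seq, window=19, cutoff=1.6)):
        return "plasma membrane"
    # Rule 2: short hydrophobic stretch near the N-terminus (signal peptide).
    if hydrophobic_starts(seq[:30], window=10, cutoff=2.0):
        return "secreted"
    # Rule 3: cluster of basic residues (crude nuclear localization signal).
    if re.search(r"[KR]{4}", seq):
        return "nucleus"
    # Default: no shipping instructions found.
    return "cytoplasm"
```

The three example sequences below can be pasted straight into `guess_location`, position numbers included, and the guesses compared with Psort's. Expect disagreements: these rules ignore glycosylation sites, cleavage sites, and everything else an expert would check.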

Examples:

A secreted protein:
        1 malspflaav iplvlllsra ppsadtrttg hlcgkdlvna lyiacgvrgf fydptkmkrd
       61 tgalaaflpl ayaednesqd desiginevl kskrgiveqc chkrcsiydl enycn
A transmembrane receptor:
        1 mavaplrgal llwqllaagg aaleigrfdp ergrgaapcq aveipmcrgi gynltrmpnl
       61 lghtsqgeaa aelaefaplv qygchshlrf flcslyapmc tdqvstpipa crpmceqarl
      121 rcapimeqfn fgwpdsldca rlptrndpha lcmeapenat agpaephkgl gmlpvaprpa
      181 rppgdlgpga ggsgtcenpe kfqyveksrs caprcgpgve vfwsrrdkdf alvwmavwsa
      241 lcffstaftv ltfllephrf qyperpiifl smcynvysla fliravagaq svacdqeaga
      301 lyviqeglen tgctlvflll yyfgmasslw wvvltltwfl aagkkwghea ieahgsyfhm
      361 aawglpalkt iviltlrkva gdeltglcyv astdaaaltg fvlvplsgyl vlgssflltg
      421 fvalfhirki mktggtntek leklmvkigv fsilytvpat cvivcyvyer lnmdfwrlra
      481 teqpcaaaag pggrrdcslp ggsvptvavf mlkifmslvv gitsgvwvws sktfqtwqsl
      541 cyrkiaagra rakacrapgs ygrgthchyk aptvvlhmtk tdpslenpth l
A nuclear receptor:
        1 masredelrn cvvcgdqatg yhfnaltceg ckgffrrtvs ksigptcpfa gscevsktqr
       61 rhcpacrlqk cldagmrkdm ilsaealalr rakqaqrraq qtpvqlskeq eelirtllga
      121 htrhmgtmfe qfvqfrppah lfihhqplpt lapvlplvth fadintfmvl qvikftkdlp
      181 vfrslpiedq isllkgaave ichivlnttf clqtqnflcg plrytiedga rvgfqvefle
      241 llfhfhgtlr klqlqepeyv llaamalfsp drpgvtqrde idqlqeemal tlqsyikgqq
      301 rrprdrflya kllgllaelr sineaygyqi qhiqglsamm pllqeics

Protein Secondary Structure Prediction

Start with secondary structure prediction: alpha helix, beta sheet, loop
Accuracy: 70-80%

Protein Tertiary Structure Prediction

Algorithm

blastp to find neighbors of known structure
cluster to find closest evolutionary relative
align with known structure

Accuracy depends on sequence similarity

70% or more: probably ok
30% to 70%: might be ok
below 30%: who knows
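
The rule of thumb above can be written down directly. This is a minimal sketch, assuming the percentages refer to identity over the aligned (non-gap) columns of a pairwise alignment with the known structure; the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def model_confidence(identity):
    """Map percent identity to the rough confidence bins used in these notes."""
    if identity >= 70:
        return "probably ok"
    if identity >= 30:
        return "might be ok"
    return "who knows"


identity = percent_identity("ACDE-FG", "ACDQWFG")
print(round(identity, 1), model_confidence(identity))
```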

Structure Classification Systems

Purposes

Catalog of known folds
Families based on similar structure
Guess fold based on sequence similarity

Amino Acid Codes

The one letter codes for each of the amino acids do have rhyme and reason!

From Koni Stone, please send comments to koni@chem.csustan.edu

First, some reasons: The following amino acids all use the first initial as their code. These amino acids are either very common, or they have a unique first letter. To make the following list, use the mnemonic:
Give A Violin Lesson to Isabelle So That Cindy can Matriculate and wear a Professional Hat.

Glycine = G
Alanine = A
Valine = V
Leucine = L
Isoleucine = I
Serine = S
Threonine = T
Cysteine = C
Methionine = M
Proline = P
Histidine = H

Now for some rhymes:

Arginine = R. R we having fun yet?
Asparagine = N The kNights of Ne say "Ne".
Glutamine is a cute amine = Q
I say "glutamate"/a former vice president says "glutEmate" = E
ditto with AsparDic acid = D
Fenylalanine makes tasty Italian sausage = F
There are two rings in tryptophan, and there are two v's in W = W
Tyrosine = Y, Y? Because we love biochemistry.

And for the remnant:

Lysine = K. Lysine couldn't have L because leucine already claimed it, so it took K, the letter just before L in the alphabet.

Amino Acid Properties

Code    Letters Residue         Special properties
A       Ala     alanine         methyl functional group
C       Cys     cyst(e)ine      disulfide bonds for protein stability
D       Asp     aspartate       hydrophilic, COO(-) 
E       Glu     glutamate       hydrophilic, COO(-) 
F       Phe     phenylalanine   hydrophobic
G       Gly     glycine         no functional group
H       His     histidine       pKa of 6, used for acid-base chemistry
I       Ile     isoleucine      hydrophobic
K       Lys     lysine          hydrophilic, NH3(+)
L       Leu     leucine         hydrophobic
M       Met     methionine      encoded by the start codon (AUG)
N       Asn     asparagine      amide form of aspartate. found in asparagus
P       Pro     proline         causes turns in protein structure
Q       Gln     glutamine       amide form of glutamate
R       Arg     arginine        hydrophilic, positively charged guanidinium group
S       Ser     serine          -OH group used in serine proteases
T       Thr     threonine       -OH group
V       Val     valine          hydrophobic
W       Trp     tryptophan      strong UV absorbance used to probe protein structure
Y       Tyr     tyrosine        strong UV absorbance, phosphorylated by tyrosine kinases
X               any

Homework for Class 10

Pick 5 nuclear proteins, 5 cytoplasmic proteins, 5 membrane proteins, and 5 secreted proteins, and use Psort I and Psort II to predict where the proteins are localized. How many of each does each program get right? Which program does better?

Help with the Human Genome Project!
Try to analyze the following novel sequences predicted from human genomic information.
What are the top homologs?
What is the sequence similarity?
What does the protein do?
Where is it localized?
Can you predict structure?
How confident are you of the answer?

MEPETLEARINRATNPLNKELDWASINGFCEQLNEDFEGPPLAT
RLLAHKIQSPQEWEAIQALTVLETCMKSCGKRFHDEVGKFRFLNELIKVVSPKYLGSR
TSEKVKNKILELLYSWTVGLPEEVKIAEAYQMLKKQGIVKSDPKLPDDTTFPLPPPRP
KNVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEKISKRVNAIEEVN
NNVKLLTEMVMSHSQGGAAAGSSEDLMKELYQRCERMRPTLFRLASDTEDNDEALAEI
LQANDNLTQVINLYKQLVRGEEVNGDATAGSIP

AFVTGLEEQLGKFGDKCIARGWDHQGDPLHKIQQDVAEHHKQIG
NVLQIVESCSQLQGFQSEEVSPAEPASPGTPQQVKDKTLQESSFEDIMATRSSDWLRR
PLGEDNQPETQLFWDKEPWFWHDTLTEQLWRIFAGMRILAHGELVLATAISSFTRHVF
TCGRRGIKVWSLTGQVAEDRFPESHLPIQTPGAFLRTCLLSSNSRSLLTGGYNLASVS
VWDLAAPSLHVKEQLPCAGLNCQALDANLDANLAFASFTSGVVRIWDLRDQSVVRDLK
GYPDGVKSIVVKGYNIWTGGPDACLRCWDQRTIMKPLEYQFKSQIMSLSHSPQEDWVL
LGMANGQQWLQSTSGSQRHMVGQKDSVILSVKFSPFGQWWASVGMDDFLGVYSMPAGT
KVFEVPEMSPVTCCDVSSNNRLVVTGSGEHASVYQITY

MDPNSILLSPQPQICSHLAEACTEGERSSSPPELDRDSPFPWSQ
VPSSSPTDPEWFGDEHIQAKRARVETIVRGMCLSPNPLVPGNAQAGVSPRCPKKARER
KRKQNLPTPQGLLMPAPAWDQGNRKGGPRVREQLHLLKQQLRHLQEHILQAAKPRDTA
QGPGGCGTGKGPLSAKQGNGCGPRPWVVDGDHQQGTSKDLSGAEKHQESEKPSFLPSG
APASLEILRKELTRAVSQAVDSVLQKFNRCITSQMIKWFSNFREFYYIQMEKSARQAI
SDGVTNPKMLVVLRNSELFQALNMHYNKGNDFEISADHFSKLKTNFMQDLSVAVYSHL
SAGRVYQLDNANSLQLTIAIGAPTSYMQKKLNFRL

Memorize the table of amino acid codes and properties
