MB620 Bioinformatics
University of New Haven
Instructor: Joel S. Bader
Class 5: Biological Sequence Databases


Agenda


Gene Structure

HW question 1: fill in the table, give references

Genome Sizes
Organism        Genome Size (Mb)   Number of Genes   Avg. Gene Length
E. coli         4.6                2000              2000 bp
S. cerevisiae   ...                6000              ...
C. elegans      ...                ...               ...
H. sapiens      3300               100,000           30-50,000 bp
Humans have 100K genes = 100K proteins
Average protein is ~300 aa, average transcript is ~2000 bp.
HW question 2:
What percentage of the genome is expressed?
What percentage of the genome is translated?
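
A back-of-the-envelope sketch of the arithmetic behind HW question 2, using only the rough figures quoted above (3300 Mb genome, 100K genes, ~2000 bp transcript, ~300 aa protein). Reading "expressed" as mature transcript length and "translated" as 3 bases per amino acid is an assumption, and the function and constant names are illustrative only.

```python
def genome_fraction(n_genes: int, bp_per_gene: float, genome_bp: float) -> float:
    """Fraction of the genome covered by n_genes segments of bp_per_gene each."""
    return n_genes * bp_per_gene / genome_bp

# Rough figures from the notes above (not measured values).
GENOME_BP = 3300e6      # H. sapiens, 3300 Mb
N_GENES = 100_000       # "100K genes"
TRANSCRIPT_BP = 2000    # average transcript, ~2000 bp
PROTEIN_AA = 300        # average protein, ~300 aa

# Assumption: "expressed" = mature transcript, "translated" = 3 bases per residue.
expressed = genome_fraction(N_GENES, TRANSCRIPT_BP, GENOME_BP)
translated = genome_fraction(N_GENES, PROTEIN_AA * 3, GENOME_BP)

print(f"expressed:  {expressed:.1%}")   # about 6.1%
print(f"translated: {translated:.1%}")  # about 2.7%
```

Swapping in the 30-50,000 bp gene length from the table instead of the transcript length gives the fraction of the genome that is transcribed before splicing, which is a different (and much larger) number.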
 DNA: ...(promoter)...(transcr. start).............................................................(transcr. term.)...
pre-mRNA:             5' cap and UTR...AUG and exon1....intron......exon...........poly-A signal...
mRNA:                 5' cap and UTR...AUG and complete CDS....transl. stop...3' UTR...AUAAAAAAAAA
pro-protein:                          N-term, signal peptide, protein, C-term
active protein:                                               protein
Differences between prokaryotic and eukaryotic gene structure

DNA Sequence Databases

Why DNA and not protein as the primary database?
How DNA sequence is generated

Simplest data storage: flatfiles
Typical: FASTA

>A12345
CGTATTAGCTATACGTCGTACGCGTCATAATGGGCTTATAAGATAGC
ACTATTAGGTAAAATTATACGATATCGGC...
>M35832
TTCCAATGGACCGTACA...
Benefits: compact, transportable
Drawbacks: data integrity, multi-user
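
A minimal sketch of reading a FASTA flatfile, to make the format concrete. The sequences in the example above are truncated with "...", so the demo uses only the visible bases; `parse_fasta` is an illustrative helper, not part of any database toolkit. It also shows one small data-integrity check (duplicate identifiers) that a bare flatfile does not enforce.

```python
from typing import Dict, Iterable

def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line without an identifier")
            current_id = fields[0]  # first token after '>' is the identifier
            if current_id in records:
                raise ValueError(f"duplicate identifier: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            records[current_id] += line  # sequence may span several lines
    return records

# Visible portion of the example above (the original is truncated).
example = """\
>A12345
CGTATTAGCTATACGTCGTACGCGTCATAATGGGCTTATAAGATAGC
ACTATTAGGTAAAATTATACGATATCGGC
>M35832
TTCCAATGGACCGTACA
"""

records = parse_fasta(example.splitlines())
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```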

Data integrity: what information belongs with a DNA sequence?

Multi-User

Databases attempt to solve these problems
3 mirrored databases: GenBank (NCBI), EMBL, and DDBJ

All share information, exchange new sequences daily.

Go to NCBI
Click on Searching GenBank
Click on Entrez
Search the nucleotide database
Reading the results:

LOCUS       LEPR         3800 bp    mRNA            PRI       19-MAR-1999
DEFINITION  Homo sapiens leptin receptor (LEPR) mRNA.
ACCESSION   NM_002303
NID         g4504978
VERSION     NM_002303.1  GI:4504978
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3800)
  AUTHORS   Tartaglia,L.A., Dembski,M., Weng,X., Deng,N., Culpepper,J.,
            Devos,R., Richards,G.J., Campfield,L.A., Clark,F.T., Deeds,J.,
            Muir,C., Sanker,S., Moriarty,A., Moore,K.J., Smutko,J.S.,
            Mays,G.G., Woolf,E.A., Selent-Munro,C. and Tepper,R.I.
  TITLE     Identification and expression cloning of a leptin receptor, OB-R
  JOURNAL   Cell 83 (7), 1263-1271 (1995)
  MEDLINE   96128129
REFERENCE   2  (bases 1 to 3800)
  AUTHORS   Tartaglia,L.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (12-DEC-1995) Louis A. Tartaglia, Millennium
            Pharmaceuticals, 640 Memorial Drive, Cambridge, MA 02139
COMMENT     REFSEQ: This reference sequence was derived from U43168.
            PROVISIONAL RefSeq: This is a provisional reference sequence record
            that has not yet been subject to human review. The final curated
            reference sequence record may be somewhat different from this one.
FEATURES             Location/Qualifiers
     source          1..3800
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     gene            1..3800
                     /gene="LEPR"
                     /note="OBR"
                     /db_xref="MIM:601007"
                     /db_xref="LocusID:3953"
     CDS             194..3691
                     /gene="LEPR"
                     /note="OB-R"
                     /codon_start=1
                     /db_xref="MIM:601007"
                     /db_xref="LocusID:3953"
                     /product="leptin receptor"
                     /protein_id="NP_002294.1"
                     /db_xref="PID:g4504979"
                     /db_xref="GI:4504979"
                     /translation="MICQKFCVVLLHWEFIYVITAFNLSYPITPWRFKLSCMPPNSTY
                     DYFLLPAGLSKNTSNSNGHYETAVEPKFNSSGTHFSNLSKTTFHCCFRSEQDRNCSLC
...
                     ESGVLLTDKSRVSCPFPAPCLFTDIRVLQDSCSHFVENNINLGTSSKKTFASYMPQFQ
                     TCSTQTHKIMENKMCDLTV"
BASE COUNT     1154 a    715 c    778 g   1153 t
ORIGIN      
        1 ggcacgagcc ggtctggctt gggcaggctg cccgggccgt ggcaggaagc cggaagcagc
       61 cgcggcccca gttcgggaga catggcgggc gttaaagctc tcgtggcatt atccttcagt
...
     3721 tataatgggt aatataaagt gtaatagatt atagttgtgg gtgggagaga gaaaagaaac
     3781 cagagtccaa atttgaaaat 
//
Reading the lines: GenBank release notes
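
A short sketch of pulling two fields out of the LEPR record above and sanity-checking them. The excerpt is copied from the listing; the helper names are illustrative, and the regular expressions only handle a simple `start..end` CDS location. Treating the last codon of the CDS as the stop codon is an assumption.

```python
import re

# A few lines copied from the NM_002303 record shown above.
GENBANK_EXCERPT = """\
LOCUS       LEPR         3800 bp    mRNA            PRI       19-MAR-1999
ACCESSION   NM_002303
     CDS             194..3691
                     /gene="LEPR"
                     /product="leptin receptor"
BASE COUNT     1154 a    715 c    778 g   1153 t
"""

def cds_span(record: str):
    """Return (start, end) of the first simple 'start..end' CDS feature."""
    m = re.search(r"^ +CDS +(\d+)\.\.(\d+) *$", record, re.MULTILINE)
    if m is None:
        raise ValueError("no simple start..end CDS feature found")
    return int(m.group(1)), int(m.group(2))

def base_counts(record: str):
    """Return {'a': n, 'c': n, 'g': n, 't': n} from the BASE COUNT line."""
    m = re.search(r"^BASE COUNT(.*)$", record, re.MULTILINE)
    if m is None:
        raise ValueError("no BASE COUNT line found")
    return {base: int(n) for n, base in re.findall(r"(\d+)\s+([acgt])", m.group(1))}

start, end = cds_span(GENBANK_EXCERPT)
cds_length = end - start + 1          # positions are 1-based and inclusive
protein_length = cds_length // 3 - 1  # assumption: final codon is the stop

counts = base_counts(GENBANK_EXCERPT)
total = sum(counts.values())
gc_fraction = (counts["g"] + counts["c"]) / total

print(cds_length, protein_length)   # 3498 1165
print(total, f"{gc_fraction:.1%}")  # 3800 39.3%
```

The base counts add up to the 3800 bp on the LOCUS line, and the 1165 residues implied by the CDS agree with the length given in the Swiss-Prot entry for the same protein.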

Protein Sequence Databases

How is protein sequence obtained?

The best-curated protein sequence database is Swiss-Prot (http://expasy.hcuge.ch/), from the Swiss Institute of Bioinformatics.
2 parts: Swiss-Prot + TrEMBL
TrEMBL = translation of EMBL nucleotide sequences
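
To illustrate what "translation of nucleotide sequences" means computationally, here is a minimal sketch using the standard genetic code. It is not how TrEMBL is actually built (real entries are derived from annotated CDS features), and the demo DNA string is a hypothetical back-translation chosen to spell the first residues of the leptin receptor, not the real LEPR codons.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str, stop_at_stop: bool = True) -> str:
    """Translate DNA (or RNA) in frame 1; unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):  # trailing partial codon is ignored
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and stop_at_stop:
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical codons (illustration only, not taken from the LEPR record).
demo = "ATG ATT TGT CAA AAA TTT TGT TAA".replace(" ", "")
print(translate(demo))  # MICQKFC
```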

Benefits

Examples: again search for leptin receptor
Swiss-Prot Entry

ID   LEPR_HUMAN     STANDARD;      PRT;  1165 AA.
AC   P48357;
DT   01-FEB-1996 (REL. 33, CREATED)
DT   01-FEB-1996 (REL. 33, LAST SEQUENCE UPDATE)
DT   15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
DE   LEPTIN RECEPTOR PRECURSOR (OB RECEPTOR) (OB-R).
GN   LEPR OR OBR.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; MAMMALIA; EUTHERIA;
OC   PRIMATES; CATARRHINI; HOMINIDAE; HOMO.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=BRAIN;
RX   MEDLINE; 96128129. [NCBI, ExPASy, Israel, Japan]
RA   TARTAGLIA L.A., DEMBSKI M., WENG X., DENG N., CULPEPPER J.,
RA   DEVOS R., RICHARDS G.J., CAMPFIELD L.A., CLARK F.T., DEEDS J.,
RA   MUIR C., SANKER S., MORIARTY A., MOORE K.J., SMUTKO J.S.,
RA   MAYS G.G., WOOLF E.A., MONROE C.A., TEPPER R.I.;
RT   "Identification and expression cloning of a leptin receptor, OB-R.";
RL   CELL 83:1263-1271(1995).
RN   [2]
RP   SEQUENCE FROM N.A.
RA   THOMPSON D.B., OSSOWSKI V., SUTHERLAND J., APEL W.,
RA   BIESTERFELDT J.;
RL   SUBMITTED (OCT-1996) TO THE EMBL/GENBANK/DDBJ DATABASES.
CC   -!- FUNCTION: RECEPTOR FOR OBESITY FACTOR (LEPTIN).
CC   -!- SUBCELLULAR LOCATION: TYPE I MEMBRANE PROTEIN.
CC   -!- SIMILARITY: BELONGS TO THE CYTOKINE FAMILY OF RECEPTORS.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -
CC   the European Bioinformatics Institute.  There are no  restrictions on  its
CC   use  by  non-profit  institutions as long  as its content  is  in  no  way
CC   modified and this statement is not removed.  Usage  by  and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; U43168; AAA93015.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59263; AAB09673.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59248; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59249; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59250; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59252; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59253; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59254; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59255; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59256; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59257; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59258; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59259; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59260; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59261; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; U59262; AAB09673.1; JOINED. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   GeneCard; LEPR.
DR   MIM; 601007; -.
DR   PROSITE; PS00340; RECEPTOR_CYTOKINES_2; 2.
DR   PFAM; PF00041; fn3; 2.
DR   HSSP; P10912; 3HHR. [HSSP ENTRY / SWISS-3DIMAGE /
DR         PDB-ENTRY /  PDB-RASMOL / PDB-3DIMAGE]
DR   DOMO; P48357.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P48357.
DR   PRESAGE; P48357.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   OBESITY; RECEPTOR; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   CHAIN         ?   1165       LEPTIN RECEPTOR.
FT   DOMAIN        ?    841       EXTRACELLULAR (POTENTIAL).
FT   TRANSMEM    842    862       POTENTIAL.
FT   DOMAIN      863   1165       CYTOPLASMIC (POTENTIAL).
FT   CARBOHYD     23     23       POTENTIAL.
FT   CARBOHYD     41     41       POTENTIAL.
FT   CARBOHYD     56     56       POTENTIAL.
FT   CARBOHYD     73     73       POTENTIAL.
FT   CARBOHYD     81     81       POTENTIAL.
FT   CARBOHYD     98     98       POTENTIAL.
FT   CARBOHYD    187    187       POTENTIAL.
FT   CARBOHYD    206    206       POTENTIAL.
FT   CARBOHYD    276    276       POTENTIAL.
FT   CARBOHYD    347    347       POTENTIAL.
FT   CARBOHYD    397    397       POTENTIAL.
FT   CARBOHYD    433    433       POTENTIAL.
FT   CARBOHYD    516    516       POTENTIAL.
FT   CARBOHYD    624    624       POTENTIAL.
FT   CARBOHYD    659    659       POTENTIAL.
FT   CARBOHYD    670    670       POTENTIAL.
FT   CARBOHYD    688    688       POTENTIAL.
FT   CARBOHYD    697    697       POTENTIAL.
FT   CARBOHYD    728    728       POTENTIAL.
FT   CARBOHYD    750    750       POTENTIAL.
SQ   SEQUENCE   1165 AA;  132449 MW;  A63D1B74 CRC32;
     MICQKFCVVL LHWEFIYVIT AFNLSYPITP WRFKLSCMPP NSTYDYFLLP AGLSKNTSNS
     NGHYETAVEP KFNSSGTHFS NLSKTTFHCC FRSEQDRNCS LCADNIEGKT FVSTVNSLVF
     QQIDANWNIQ CWLKGDLKLF ICYVESLFKN LFRNYNYKVH LLYVLPEVLE DSPLVPQKGS
     FQMVHCNCSV HECCECLVPV PTAKLNDTLL MCLKITSGGV IFQSPLMSVQ PINMVKPDPP
     LGLHMEITDD GNLKISWSSP PLVPFPLQYQ VKYSENSTTV IREADKIVSA TSLLVDSILP
     GSSYEVQVRG KRLDGPGIWS DWSTPRVFTT QDVIYFPPKI LTSVGSNVSF HCIYKKENKI
     VPSKEIVWWM NLAEKIPQSQ YDVVSDHVSK VTFFNLNETK PRGKFTYDAV YCCNEHECHH
     RYAELYVIDV NINISCETDG YLTKMTCRWS TSTIQSLAES TLQLRYHRSS LYCSDIPSIH
     PISEPKDCYL QSDGFYECIF QPIFLLSGYT MWIRINHSLG SLDSPPTCVL PDSVVKPLPP
     SSVKAEITIN IGLLKISWEK PVFPENNLQF QIRYGLSGKE VQWKMYEVYD AKSKSVSLPV
     PDLCAVYAVQ VRCKRLDGLG YWSNWSNPAY TVVMDIKVPM RGPEFWRIIN GDTMKKEKNV
     TLLWKPLMKN DSLCSVQRYV INHHTSCNGT WSEDVGNHTK FTFLWTEQAH TVTVLAINSI
     GASVANFNLT FSWPMSKVNI VQSLSAYPLN SSCVIVSWIL SPSDYKLMYF IIEWKNLNED
     GEIKWLRISS SVKKYYIHDH FIPIEKYQFS LYPIFMEGVG KPKIINSFTQ DDIEKHQSDA
     GLYVIVPVII SSSILLLGTL LISHQRMKKL FWEDVPNPKN CSWAQGLNFQ KPETFEHLFI
     KHTASVTCGP LLLEPETISE DISVDTSWKN KDEMMPTTVV SLLSTTDLEK GSVCISDQFN
     SVNFSEAEGT EVTYEAESQR QPFVKYATLI SNSKPSETGE EQGLINSSVT KCFSSKNSPL
     KDSFSNSSWE IEAQAFFILS DQHPNIISPH LTFSEGLDEL LKLEGNFPEE NNDKKSIYYL
     GVTSIKKRES GVLLTDKSRV SCPFPAPCLF TDIRVLQDSC SHFVENNINL GTSSKKTFAS
     YMPQFQTCST QTHKIMENKM CDLTV
//
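
The entry above is line-oriented: each line starts with a two-letter code (ID, AC, DE, FT, ...) followed by its content. A minimal sketch of grouping lines by code and reading the FT (feature table) lines follows. The excerpt is copied from the entry above; the helper names are illustrative, and the fixed column offsets are an assumption based on this listing rather than on the format specification.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# A few lines copied from the LEPR_HUMAN entry shown above.
SWISSPROT_EXCERPT = """\
ID   LEPR_HUMAN     STANDARD;      PRT;  1165 AA.
AC   P48357;
DE   LEPTIN RECEPTOR PRECURSOR (OB RECEPTOR) (OB-R).
FT   SIGNAL        1      ?       POTENTIAL.
FT   TRANSMEM    842    862       POTENTIAL.
FT   CARBOHYD     23     23       POTENTIAL.
FT   CARBOHYD     41     41       POTENTIAL.
//
"""

def group_by_line_code(entry: str) -> Dict[str, List[str]]:
    """Map each two-letter line code to the list of its line contents."""
    grouped = defaultdict(list)
    for line in entry.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        code, content = line[:2], line[5:]
        if code.strip():  # sequence lines start with blanks and are skipped
            grouped[code].append(content)
    return dict(grouped)

def parse_position(token: str) -> Optional[int]:
    """Convert a position to int; '?' (unknown) becomes None."""
    return int(token) if token.isdigit() else None

def parse_features(ft_lines: List[str]) -> List[Tuple[str, Optional[int], Optional[int], str]]:
    """Split FT lines into (key, start, end, description) tuples."""
    features = []
    for content in ft_lines:
        fields = content.split(None, 3)
        if len(fields) < 3:
            continue  # not a key/start/end line
        key, start, end = fields[:3]
        description = fields[3] if len(fields) > 3 else ""
        features.append((key, parse_position(start), parse_position(end), description))
    return features

grouped = group_by_line_code(SWISSPROT_EXCERPT)
features = parse_features(grouped["FT"])
for feature in features:
    print(feature)
```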

Similar entry from TrEMBL

ID   Q92920      PRELIMINARY;      PRT;   958 AA.
AC   Q92920;
DT   01-FEB-1997 (TREMBLREL. 02, CREATED)
DT   01-FEB-1997 (TREMBLREL. 02, LAST SEQUENCE UPDATE)
DT   01-NOV-1998 (TREMBLREL. 08, LAST ANNOTATION UPDATE)
DE   LEPTIN RECEPTOR.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; MAMMALIA; EUTHERIA;
OC   PRIMATES; CATARRHINI; HOMINIDAE; HOMO.
RN   [1]
RP   SEQUENCE FROM N.A.
RA   BENNETT B.D., SOLAR G.P., YUAN J.Q., MATHIAS J., THOMAS G.R.,
RA   MATTHEWS W.;
RL   CURR. BIOL. 6:0-0(0).
DR   EMBL; U66496; AAB07496.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PFAM; PF00041; fn3; 2.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
SQ   SEQUENCE   958 AA;  109392 MW;  A6376C9C CRC32;
     MICQKFCVVL LHWEFIYVIT AFNLSYPITP WRFKLSCMPP NSTYDYFLLP AGLSKNTSNS
     NGHYETAVEP KFNSSGTHFS NLSKTTFHCC FRSEQDRNCS LCADNIEGKT FVSTVNSLVF
     QQIDANWNIQ CWLKGDLKLF ICYVESLFKN LFRNYNYKVH LLYVLPEVLE DSPLVPQKGS
     FQMVHCNCSV HECCECLVPV PTAKLNDTLL MCLKITSGGV IFQSPLMSVQ PINMVKPDPP
     LGLHMEITDD GNLKISWSSP PLVPFPLQYQ VKYSENSTTV IREADKIVSA TSLLVDSILP
     GSSYEVQVRG KRLDGPGIWS DWSTPRVFTT QDVIYFPPKI LTSVGSNVSF HCIYKKENKI
     VPSKEIVWWM NLAEKIPQSQ YDVVSDHVSK VTFFNLNETK PRGKFTYDAV YCCNEHECHH
     RYAELYVIDV NINISCETDG YLTKMTCRWS TSTIQSLAES TLQLRYHRSS LYCSDIPSIH
     PISEPKDCYL QSDGFYECIF QPIFLLSGYT MWIRINHSLG SLDSPPTCVL PDSVVKPLPP
     SSVKAEITIN IGLLKISWEK PVFPENNLQF QIRYGLSGKE VQWKMYEVYD AKSKSVSLPV
     PDLCAVYAVQ VRCKRLDGLG YWSNWSNPAY TVVMDIKVPM RGPEFWRIIN GDTMKKEKNV
     TLLWKPLMKN DSLCSVQRYV INHHTSCNGT WSEDVGNHTK FTFLWTEQAH TVTVLAINSI
     GASVANFNLT FSWPMSKVNI VQSLSAYPLN SSCVIVSWIL SPSDYKLMYF IIEWKNLNED
     GEIKWLRISS SVKKYYIHDH FIPIEKYQFS LYPIFMEGVG KPKIINSFTQ DDIEKHQSDA
     GLYVIVPVII SSSILLLGTL LISHQRMKKL FWEDVPNPKN CSWAQGLNFQ KMLEGSMFVK
     SHHHSLISST QGHKHCGRPQ GPLHRKTRDL CSLVYLLTLP PLLSYDPAKS PSVRNTQE
//
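
The Swiss-Prot and TrEMBL sequences above look identical over most of their length and then diverge. A small sketch for locating the first difference, using only the 15th row (residues 841-900) of each listing; the earlier rows appear identical in the two entries as printed. `first_difference` is an illustrative helper.

```python
def first_difference(a: str, b: str) -> int:
    """0-based index of the first mismatch; min length if one is a prefix."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))

# Row 15 of each sequence listing above; each row holds 60 residues,
# so this row starts at residue 14 * 60 + 1 = 841.
ROW_START = 841
swissprot_row = (
    "GLYVIVPVII SSSILLLGTL LISHQRMKKL FWEDVPNPKN CSWAQGLNFQ KPETFEHLFI"
).replace(" ", "")
trembl_row = (
    "GLYVIVPVII SSSILLLGTL LISHQRMKKL FWEDVPNPKN CSWAQGLNFQ KMLEGSMFVK"
).replace(" ", "")

offset = first_difference(swissprot_row, trembl_row)
position = ROW_START + offset  # 1-based residue number of the first mismatch
print(position)  # 892
```

Residue 892 falls inside the region the Swiss-Prot feature table marks as cytoplasmic (863-1165), so the two entries share the extracellular and transmembrane parts and differ only in the intracellular tail.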

Week 6 Homework

  1. Question 1 is stated above.
  2. Question 2 is stated above.
  3. GenBank releases new sequences to its website every day. But how often does it release an entirely new database update?
  4. Show a plot of the number of sequences in GenBank per year.
  5. Explain the meaning of every line for GenBank accession number AJ002495.
  6. Do the same for GenBank X02317.
  7. Do the same for K00065. What's the difference between these?
  8. Find the SWISS-PROT entry that corresponds to K00065. Explain the meaning of every line.
  9. How many human nucleotides are in GenBank? What fraction of the genome does this correspond to? What fraction of the genome has actually been sequenced? Explain why these numbers should be the same or different.
  10. Make a timeline showing the dates that complete genomes for model organisms have been sequenced, along with the size of the genome. Show at least 10 organisms.
  11. Extra credit: Choose a nucleotide database, a protein sequence database, and a protein structure database (we haven't talked about these, but PDB is a good example). Find out how many entries were in each database at the end of each year back to when the information starts. How do the growth rates compare?
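
For the growth-rate comparison in the extra-credit question, one possible way to put different databases on the same scale is an annual growth rate and the corresponding doubling time. A minimal sketch; the entry counts below are placeholders chosen for easy arithmetic, not real database sizes, and the function names are illustrative.

```python
import math

def annual_growth_rate(count_start: float, count_end: float, years: float) -> float:
    """Compound annual growth rate between two counts, e.g. 0.5 means +50%/year."""
    return (count_end / count_start) ** (1 / years) - 1

def doubling_time_years(rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)

# Placeholder counts (NOT real data): 1000 entries growing to 8000 in 3 years.
rate = annual_growth_rate(1000, 8000, 3)
print(f"growth rate:   {rate:.0%} per year")
print(f"doubling time: {doubling_time_years(rate):.1f} years")
```

Plotting the logarithm of the entry count against the year gives the same information graphically: a straight line means constant exponential growth, and a steeper slope means a shorter doubling time.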

Copyright 1999 Joel S. Bader jsbader@curagen.com