Organism | Genome Size (Mb) | Number of Genes | Avg. Gene Length (bp) |
---|---|---|---|
E. coli | 4.6 | 2,000 | 2,000 |
S. cerevisiae | ... | 6,000 | ... |
C. elegans | ... | ... | ... |
H. sapiens | 3,300 | 100,000 | 30,000-50,000 |
DNA: ...(promoter)...(transcr. start).............................................................(transcr. term.)...
pre-mRNA: 5' cap and UTR...AUG and exon1....intron......exon...........poly-A signal...
mRNA: 5' cap and UTR...AUG and complete CDS....transl. stop...3' UTR...AUAAAAAAAAA
pro-protein: N-term, signal peptide, protein, C-term
active protein: protein
Differences between prokaryotic and eukaryotic gene structure
Simplest data storage: flatfiles
Typical: FASTA
>A12345
CGTATTAGCTATACGTCGTACGCGTCATAATGGGCTTATAAGATAGC
ACTATTAGGTAAAATTATACGATATCGGC...
>M35832
TTCCAATGGACCGTACA...
Benefits: compact, transportable
Drawbacks: data integrity, multi-user
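A FASTA flatfile is simple enough to read with a few lines of code. The sketch below is one possible reader, not a reference implementation: the function name `parse_fasta` is made up for illustration, and the sample data reuses the two records shown above with the elided "..." tails dropped.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its sequence, joined across lines."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks[name] = []
        elif name is not None:
            # Sequence line: belongs to the most recent header.
            chunks[name].append(line)
    return {key: "".join(value) for key, value in chunks.items()}


example = """>A12345
CGTATTAGCTATACGTCGTACGCGTCATAATGGGCTTATAAGATAGC
ACTATTAGGTAAAATTATACGATATCGGC
>M35832
TTCCAATGGACCGTACA
"""

records = parse_fasta(example.splitlines())
for identifier, sequence in records.items():
    print(identifier, len(sequence), sequence[:10])
```

Note that nothing in the file says what the sequence is, where it came from, or who may change it, which is exactly the data integrity and multi-user problem.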
Data integrity: what information belongs with a DNA sequence?
Multi-User
Databases attempt to solve these problems
3 mirrored databases: GenBank (NCBI), EMBL, DDBJ
Go to NCBI
Click on Searching GenBank
Click on Entrez
Search the nucleotide database
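The same lookup can also be scripted instead of clicked. The sketch below only builds a request URL for NCBI's E-utilities `efetch` service, which is not part of the walk-through above and is introduced here as an assumption about how one might automate it. The helper name `genbank_fetch_url` is hypothetical; the accession `NM_002303` is the leptin receptor record discussed in these notes.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service (assumed to be reachable).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession: str) -> str:
    """Build a URL asking for one nucleotide record as GenBank flat-file text."""
    params = {
        "db": "nuccore",    # the nucleotide database
        "id": accession,    # accession number of the record
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return EUTILS_EFETCH + "?" + urlencode(params)


url = genbank_fetch_url("NM_002303")
print(url)
# The record itself could then be downloaded, for example with
# urllib.request.urlopen(url).read().decode(); that step needs network access.
```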
Reading the results:
LOCUS       LEPR         3800 bp    mRNA            PRI       19-MAR-1999
DEFINITION  Homo sapiens leptin receptor (LEPR) mRNA.
ACCESSION   NM_002303
NID         g4504978
VERSION     NM_002303.1  GI:4504978
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3800)
  AUTHORS   Tartaglia,L.A., Dembski,M., Weng,X., Deng,N., Culpepper,J.,
            Devos,R., Richards,G.J., Campfield,L.A., Clark,F.T., Deeds,J.,
            Muir,C., Sanker,S., Moriarty,A., Moore,K.J., Smutko,J.S.,
            Mays,G.G., Woolf,E.A., Selent-Munro,C. and Tepper,R.I.
  TITLE     Identification and expression cloning of a leptin receptor, OB-R
  JOURNAL   Cell 83 (7), 1263-1271 (1995)
  MEDLINE   96128129
REFERENCE   2  (bases 1 to 3800)
  AUTHORS   Tartaglia,L.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (12-DEC-1995) Louis A. Tartaglia, Millennium
            Pharmaceuticals, 640 Memorial Drive, Cambridge, MA 02139
COMMENT     REFSEQ: This reference sequence was derived from U43168.
            PROVISIONAL RefSeq: This is a provisional reference sequence
            record that has not yet been subject to human review. The final
            curated reference sequence record may be somewhat different from
            this one.
FEATURES             Location/Qualifiers
     source          1..3800
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     gene            1..3800
                     /gene="LEPR"
                     /note="OBR"
                     /db_xref="MIM:601007"
                     /db_xref="LocusID:3953"
     CDS             194..3691
                     /gene="LEPR"
                     /note="OB-R"
                     /codon_start=1
                     /db_xref="MIM:601007"
                     /db_xref="LocusID:3953"
                     /product="leptin receptor"
                     /protein_id="NP_002294.1"
                     /db_xref="PID:g4504979"
                     /db_xref="GI:4504979"
                     /translation="MICQKFCVVLLHWEFIYVITAFNLSYPITPWRFKLSCMPPNSTY
                     DYFLLPAGLSKNTSNSNGHYETAVEPKFNSSGTHFSNLSKTTFHCCFRSEQDRNCSLC
                     ...
                     ESGVLLTDKSRVSCPFPAPCLFTDIRVLQDSCSHFVENNINLGTSSKKTFASYMPQFQ
                     TCSTQTHKIMENKMCDLTV"
BASE COUNT     1154 a    715 c    778 g   1153 t
ORIGIN
        1 ggcacgagcc ggtctggctt gggcaggctg cccgggccgt ggcaggaagc cggaagcagc
       61 cgcggcccca gttcgggaga catggcgggc gttaaagctc tcgtggcatt atccttcagt
...
     3721 tataatgggt aatataaagt gtaatagatt atagttgtgg gtgggagaga gaaaagaaac
     3781 cagagtccaa atttgaaaat
//
Reading the lines: GenBank release notes
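Unlike FASTA, a GenBank record carries its annotation in labelled lines: a keyword at the start of the line, the value after it, and continuation lines indented under the value. The sketch below reads only the header, up to FEATURES, under those assumptions. The function name `parse_genbank_header` is hypothetical and the sample is a shortened copy of the LEPR record above. A keyword that occurs more than once, such as REFERENCE, would simply overwrite the earlier value in this simplified version.

```python
from typing import Dict, Optional


def parse_genbank_header(text: str) -> Dict[str, str]:
    """Collect keyword -> value pairs from the lines before FEATURES."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            break  # the feature table needs its own parser
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent >= 12 and current is not None:
            # Continuation line: append to the value of the last keyword.
            fields[current] += " " + line.strip()
        else:
            # Keyword line (sub-keywords such as ORGANISM are indented by 2).
            keyword, _, value = line.strip().partition(" ")
            fields[keyword] = value.strip()
            current = keyword
    return fields


sample = "\n".join([
    "LOCUS       LEPR         3800 bp    mRNA            PRI       19-MAR-1999",
    "DEFINITION  Homo sapiens leptin receptor (LEPR) mRNA.",
    "ACCESSION   NM_002303",
    "VERSION     NM_002303.1  GI:4504978",
    "SOURCE      human.",
    "  ORGANISM  Homo sapiens",
    " " * 12 + "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;",
    " " * 12 + "Eutheria; Primates; Catarrhini; Hominidae; Homo.",
    "FEATURES             Location/Qualifiers",
])

fields = parse_genbank_header(sample)
print(fields["ACCESSION"], "|", fields["DEFINITION"])
```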
How is protein sequence obtained?
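One common route is conceptual translation of the annotated coding sequence (CDS) of a nucleotide record. The sketch below shows the idea with the standard genetic code. The function names and the 15-base toy sequence are illustrative assumptions; the toy sequence was chosen to encode MICQ and is not taken from the LEPR record. The length check uses the CDS location 194..3691 from the GenBank entry above, assuming the location includes the stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence from its first base up to the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


def cds_protein_length(start: int, end: int) -> int:
    """Residues encoded by a CDS given 1-based inclusive coordinates, stop codon included."""
    return (end - start + 1) // 3 - 1


toy_protein = translate("ATGATTTGTCAATAA")  # ATG ATT TGT CAA TAA
lepr_length = cds_protein_length(194, 3691)
print(toy_protein, lepr_length)
```

The CDS spans 3498 bases, that is 1166 codons including the stop codon, so the predicted protein has 1165 residues.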
The best curated protein sequence database is
Swiss-Prot, http://expasy.hcuge.ch/
from the Swiss Institute of Bioinformatics
2 parts: Swiss-Prot + TrEMBL
TrEMBL = translation of EMBL nucleotide sequences
Benefits
Examples: again search for leptin receptor
Swiss-Prot Entry
ID   LEPR_HUMAN     STANDARD;      PRT;  1165 AA.
AC   P48357;
DT   01-FEB-1996 (REL. 33, CREATED)
DT   01-FEB-1996 (REL. 33, LAST SEQUENCE UPDATE)
DT   15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
DE   LEPTIN RECEPTOR PRECURSOR (OB RECEPTOR) (OB-R).
GN   LEPR OR OBR.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; MAMMALIA; EUTHERIA;
OC   PRIMATES; CATARRHINI; HOMINIDAE; HOMO.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   TISSUE=BRAIN;
RX   MEDLINE; 96128129.
RA   TARTAGLIA L.A., DEMBSKI M., WENG X., DENG N., CULPEPPER J.,
RA   DEVOS R., RICHARDS G.J., CAMPFIELD L.A., CLARK F.T., DEEDS J.,
RA   MUIR C., SANKER S., MORIARTY A., MOORE K.J., SMUTKO J.S.,
RA   MAYS G.G., WOOLF E.A., MONROE C.A., TEPPER R.I.;
RT   "Identification and expression cloning of a leptin receptor, OB-R.";
RL   CELL 83:1263-1271(1995).
RN   [2]
RP   SEQUENCE FROM N.A.
RA   THOMPSON D.B., OSSOWSKI V., SUTHERLAND J., APEL W.,
RA   BIESTERFELDT J.;
RL   SUBMITTED (OCT-1996) TO THE EMBL/GENBANK/DDBJ DATABASES.
CC   -!- FUNCTION: RECEPTOR FOR OBESITY FACTOR (LEPTIN).
CC   -!- SUBCELLULAR LOCATION: TYPE I MEMBRANE PROTEIN.
CC   -!- SIMILARITY: BELONGS TO THE CYTOKINE FAMILY OF RECEPTORS.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to license@isb-sib.ch).
CC   --------------------------------------------------------------------------
DR   EMBL; U43168; AAA93015.1; -.
DR   EMBL; U59263; AAB09673.1; -.
DR   EMBL; U59248; AAB09673.1; JOINED.
DR   EMBL; U59249; AAB09673.1; JOINED.
DR   EMBL; U59250; AAB09673.1; JOINED.
DR   EMBL; U59252; AAB09673.1; JOINED.
DR   EMBL; U59253; AAB09673.1; JOINED.
DR   EMBL; U59254; AAB09673.1; JOINED.
DR   EMBL; U59255; AAB09673.1; JOINED.
DR   EMBL; U59256; AAB09673.1; JOINED.
DR   EMBL; U59257; AAB09673.1; JOINED.
DR   EMBL; U59258; AAB09673.1; JOINED.
DR   EMBL; U59259; AAB09673.1; JOINED.
DR   EMBL; U59260; AAB09673.1; JOINED.
DR   EMBL; U59261; AAB09673.1; JOINED.
DR   EMBL; U59262; AAB09673.1; JOINED.
DR   GeneCard; LEPR.
DR   MIM; 601007; -.
DR   PROSITE; PS00340; RECEPTOR_CYTOKINES_2; 2.
DR   PFAM; PF00041; fn3; 2.
DR   HSSP; P10912; 3HHR.
DR   DOMO; P48357.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P48357.
DR   PRESAGE; P48357.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   OBESITY; RECEPTOR; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   CHAIN         ?   1165       LEPTIN RECEPTOR.
FT   DOMAIN        ?    841       EXTRACELLULAR (POTENTIAL).
FT   TRANSMEM    842    862       POTENTIAL.
FT   DOMAIN      863   1165       CYTOPLASMIC (POTENTIAL).
FT   CARBOHYD     23     23       POTENTIAL.
FT   CARBOHYD     41     41       POTENTIAL.
FT   CARBOHYD     56     56       POTENTIAL.
FT   CARBOHYD     73     73       POTENTIAL.
FT   CARBOHYD     81     81       POTENTIAL.
FT   CARBOHYD     98     98       POTENTIAL.
FT   CARBOHYD    187    187       POTENTIAL.
FT   CARBOHYD    206    206       POTENTIAL.
FT   CARBOHYD    276    276       POTENTIAL.
FT   CARBOHYD    347    347       POTENTIAL.
FT   CARBOHYD    397    397       POTENTIAL.
FT   CARBOHYD    433    433       POTENTIAL.
FT   CARBOHYD    516    516       POTENTIAL.
FT   CARBOHYD    624    624       POTENTIAL.
FT   CARBOHYD    659    659       POTENTIAL.
FT   CARBOHYD    670    670       POTENTIAL.
FT   CARBOHYD    688    688       POTENTIAL.
FT   CARBOHYD    697    697       POTENTIAL.
FT   CARBOHYD    728    728       POTENTIAL.
FT   CARBOHYD    750    750       POTENTIAL.
SQ   SEQUENCE   1165 AA;  132449 MW;  A63D1B74 CRC32;
     MICQKFCVVL LHWEFIYVIT AFNLSYPITP WRFKLSCMPP NSTYDYFLLP AGLSKNTSNS
     NGHYETAVEP KFNSSGTHFS NLSKTTFHCC FRSEQDRNCS LCADNIEGKT FVSTVNSLVF
     QQIDANWNIQ CWLKGDLKLF ICYVESLFKN LFRNYNYKVH LLYVLPEVLE DSPLVPQKGS
     FQMVHCNCSV HECCECLVPV PTAKLNDTLL MCLKITSGGV IFQSPLMSVQ PINMVKPDPP
     LGLHMEITDD GNLKISWSSP PLVPFPLQYQ VKYSENSTTV IREADKIVSA TSLLVDSILP
     GSSYEVQVRG KRLDGPGIWS DWSTPRVFTT QDVIYFPPKI LTSVGSNVSF HCIYKKENKI
     VPSKEIVWWM NLAEKIPQSQ YDVVSDHVSK VTFFNLNETK PRGKFTYDAV YCCNEHECHH
     RYAELYVIDV NINISCETDG YLTKMTCRWS TSTIQSLAES TLQLRYHRSS LYCSDIPSIH
     PISEPKDCYL QSDGFYECIF QPIFLLSGYT MWIRINHSLG SLDSPPTCVL PDSVVKPLPP
     SSVKAEITIN IGLLKISWEK PVFPENNLQF QIRYGLSGKE VQWKMYEVYD AKSKSVSLPV
     PDLCAVYAVQ VRCKRLDGLG YWSNWSNPAY TVVMDIKVPM RGPEFWRIIN GDTMKKEKNV
     TLLWKPLMKN DSLCSVQRYV INHHTSCNGT WSEDVGNHTK FTFLWTEQAH TVTVLAINSI
     GASVANFNLT FSWPMSKVNI VQSLSAYPLN SSCVIVSWIL SPSDYKLMYF IIEWKNLNED
     GEIKWLRISS SVKKYYIHDH FIPIEKYQFS LYPIFMEGVG KPKIINSFTQ DDIEKHQSDA
     GLYVIVPVII SSSILLLGTL LISHQRMKKL FWEDVPNPKN CSWAQGLNFQ KPETFEHLFI
     KHTASVTCGP LLLEPETISE DISVDTSWKN KDEMMPTTVV SLLSTTDLEK GSVCISDQFN
     SVNFSEAEGT EVTYEAESQR QPFVKYATLI SNSKPSETGE EQGLINSSVT KCFSSKNSPL
     KDSFSNSSWE IEAQAFFILS DQHPNIISPH LTFSEGLDEL LKLEGNFPEE NNDKKSIYYL
     GVTSIKKRES GVLLTDKSRV SCPFPAPCLF TDIRVLQDSC SHFVENNINL GTSSKKTFAS
     YMPQFQTCST QTHKIMENKM CDLTV
//
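Every line of a Swiss-Prot entry starts with a two-letter line code (ID, AC, DR, FT, ...), which makes the annotation easy to pick apart. The sketch below groups lines by that code and pulls positions out of the FT (feature table) lines. The function names are hypothetical, the sample is a handful of lines copied from the LEPR_HUMAN entry above, and features with an unknown position ("?") are skipped in this simplified version.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_line_code(text: str) -> Dict[str, List[str]]:
    """Group the content of each line under its two-letter line code."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue  # skip blank lines and the end-of-entry marker
        groups[line[:2]].append(line[2:].strip())
    return dict(groups)


def features(groups: Dict[str, List[str]]) -> List[Tuple[str, int, int]]:
    """Return (key, start, end) for FT lines whose positions are both known."""
    found = []
    for content in groups.get("FT", []):
        parts = content.split()
        if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
            found.append((parts[0], int(parts[1]), int(parts[2])))
    return found


sample = "\n".join([
    "ID   LEPR_HUMAN     STANDARD;      PRT;  1165 AA.",
    "AC   P48357;",
    "FT   SIGNAL        1      ?       POTENTIAL.",
    "FT   TRANSMEM    842    862       POTENTIAL.",
    "FT   CARBOHYD     23     23       POTENTIAL.",
    "FT   CARBOHYD     41     41       POTENTIAL.",
    "//",
])

entry = group_by_line_code(sample)
entry_features = features(entry)
tm_lengths = [end - start + 1 for key, start, end in entry_features if key == "TRANSMEM"]
glyco_sites = [start for key, start, end in entry_features if key == "CARBOHYD"]
print(entry["AC"], tm_lengths, glyco_sites)
```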
Similar entry from TrEMBL
ID   Q92920     PRELIMINARY;      PRT;   958 AA.
AC   Q92920;
DT   01-FEB-1997 (TREMBLREL. 02, CREATED)
DT   01-FEB-1997 (TREMBLREL. 02, LAST SEQUENCE UPDATE)
DT   01-NOV-1998 (TREMBLREL. 08, LAST ANNOTATION UPDATE)
DE   LEPTIN RECEPTOR.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; MAMMALIA; EUTHERIA;
OC   PRIMATES; CATARRHINI; HOMINIDAE; HOMO.
RN   [1]
RP   SEQUENCE FROM N.A.
RA   BENNETT B.D., SOLAR G.P., YUAN J.Q., MATHIAS J., THOMAS G.R.,
RA   MATTHEWS W.;
RL   CURR. BIOL. 6:0-0(0).
DR   EMBL; U66496; AAB07496.1; -.
DR   PFAM; PF00041; fn3; 2.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
SQ   SEQUENCE   958 AA;  109392 MW;  A6376C9C CRC32;
     MICQKFCVVL LHWEFIYVIT AFNLSYPITP WRFKLSCMPP NSTYDYFLLP AGLSKNTSNS
     NGHYETAVEP KFNSSGTHFS NLSKTTFHCC FRSEQDRNCS LCADNIEGKT FVSTVNSLVF
     QQIDANWNIQ CWLKGDLKLF ICYVESLFKN LFRNYNYKVH LLYVLPEVLE DSPLVPQKGS
     FQMVHCNCSV HECCECLVPV PTAKLNDTLL MCLKITSGGV IFQSPLMSVQ PINMVKPDPP
     LGLHMEITDD GNLKISWSSP PLVPFPLQYQ VKYSENSTTV IREADKIVSA TSLLVDSILP
     GSSYEVQVRG KRLDGPGIWS DWSTPRVFTT QDVIYFPPKI LTSVGSNVSF HCIYKKENKI
     VPSKEIVWWM NLAEKIPQSQ YDVVSDHVSK VTFFNLNETK PRGKFTYDAV YCCNEHECHH
     RYAELYVIDV NINISCETDG YLTKMTCRWS TSTIQSLAES TLQLRYHRSS LYCSDIPSIH
     PISEPKDCYL QSDGFYECIF QPIFLLSGYT MWIRINHSLG SLDSPPTCVL PDSVVKPLPP
     SSVKAEITIN IGLLKISWEK PVFPENNLQF QIRYGLSGKE VQWKMYEVYD AKSKSVSLPV
     PDLCAVYAVQ VRCKRLDGLG YWSNWSNPAY TVVMDIKVPM RGPEFWRIIN GDTMKKEKNV
     TLLWKPLMKN DSLCSVQRYV INHHTSCNGT WSEDVGNHTK FTFLWTEQAH TVTVLAINSI
     GASVANFNLT FSWPMSKVNI VQSLSAYPLN SSCVIVSWIL SPSDYKLMYF IIEWKNLNED
     GEIKWLRISS SVKKYYIHDH FIPIEKYQFS LYPIFMEGVG KPKIINSFTQ DDIEKHQSDA
     GLYVIVPVII SSSILLLGTL LISHQRMKKL FWEDVPNPKN CSWAQGLNFQ KMLEGSMFVK
     SHHHSLISST QGHKHCGRPQ GPLHRKTRDL CSLVYLLTLP PLLSYDPAKS PSVRNTQE
//
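The TrEMBL entry (958 AA) and the Swiss-Prot entry (1165 AA) start with the same residues and differ towards the C-terminus. A quick way to locate where two such sequences part ways is to measure their common prefix. The sketch below does this on two 20-residue segments copied from the region where the records above stop agreeing; the function name `common_prefix_length` is hypothetical, and a real comparison would use the full sequences or a proper alignment tool.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Number of leading positions at which the two sequences are identical."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Segments taken from the two entries, spaces removed.
swissprot_segment = "CSWAQGLNFQKPETFEHLFI"  # from LEPR_HUMAN (P48357)
trembl_segment = "CSWAQGLNFQKMLEGSMFVK"     # from Q92920

shared = common_prefix_length(swissprot_segment, trembl_segment)
print(shared, swissprot_segment[shared:], trembl_segment[shared:])
```

Here the segments agree for 11 residues and then diverge, with the TrEMBL sequence ending shortly afterwards, which is consistent with a shorter variant of the same receptor.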