Genetics | |
---|---|
Traits/Genes to Location |
Genetic and physical
maps Research Genetics mapping panel Stanford mapping server |
Traits/Genes to Experimental Organisms | Jackson Laboratories |
Trait/Gene Location Database | OMIM, On-line Mendelian Inheritance in Man |
Genomic DNA Analysis | |
Sequences to Contigs | CuraTools CAP, PHRAP | Contigs to mRNA | Genscan Grail |
mRNA Analysis | |
DNA to Homolog | blastn, blastx, fasta |
DNA to Protein | ORF finders NCBI ORF Finder |
Protein Analysis |
Where were we?
First step: has anyone seen a sequence like this before?
Why do this:
How to do this: sequence similarity searching
Questions we will answer
There is one sequence similarity search you have almost certainly used, and it has virtually nothing to do with molecular biology. What is it? (Thinking of this example will usually provide good intuition.)
The Sequence Alignment problem: Take two strings of characters. Align the letters in one string with the letters in the second string, allowing gaps and mismatches in the alignment. The score of the alignment is the number of matched letters. What is the best alignment?
Example: matching co-evolved words
alignment: MOTHER M OTHER MOTHER . . .. . . .. . MUTTER MUTT ER MU TTER score: 4 4 1There can be a tie for the best alignment.
Visual aid: dotplots
M O T H E R M . U T . T . E . R .Put a dot where letters match. What does an exact alignment look like?
B I O I N F O R M A T I C S B . I . . . O . . I . . . N . F . O . . R . M . A . T . I . . . C . S .Long diagaonal: filled
Describing sequence similarity: % sequence identity (SI), length of match
Blast Example: start with human insulin gene, take it to CuraTools
gb:GENBANK-ID:HUMINS01|acc:J00265 Human insulin gene
Reading blast output:
BLASTN 2.0a19MP-WashU [05-Feb-1998] [Build sol2.5-ultra 01:47:25 05-Feb-1998] Reference: Gish, Warren (1994-1997). unpublished. Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10. Notice: this program and its default parameter settings are optimized to find nearly identical sequences rapidly. To identify weak similarities encoded in nucleic acid, use BLASTX, TBLASTN or TBLASTX. Query= HUMINS01 Human insulin gene, complete cds. (4044 letters) Database: /opt/database/public/blast/gbcomp 565,116 sequences; 1,083,671,655 total letters. Smallest Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N gb:GENBANK-ID:HUMINS01|acc:J00265 Human insulin gene, com... 20220 0.0 1 gb:GENBANK-ID:HUMINSTHIG|acc:L15440 Homo sapiens tyrosine... 20210 0.0 1 gb:GENBANK-ID:HSINSU|acc:V00565 Human gene for preproinsu... 20082 0.0 1 gb:GENBANK-ID:HUMINSPR|acc:M10039 Human alpha-type insuli... 11244 0.0 1 gb:GENBANK-ID:PTPPINS|acc:X61089 P.troglodytes gene for p... 9328 0.0 2 gb:GENBANK-ID:E00022|acc:E00022 DNA coding for human prep... 8846 0.0 1 gb:GENBANK-ID:HSA9655|acc:AJ009655 Homo sapiens ins gene,... 6956 2.1e-308 1 gb:GENBANK-ID:CEPPINS|acc:X61092 C.aethiops gene for prep... 3821 3.2e-284 3 gb:GENBANK-ID:HUMINS1UA|acc:J00266 Human insulin gene: al... 4228 3.5e-185 1 gb:GENBANK-ID:ATRINS|acc:J02989 Owl monkey (A.trivirgatus... 3812 1.0e-176 2 gb:GENBANK-ID:HUMINSHVR|acc:M26868 Human hypervariable DN... 2752 1.6e-118 1 gb:GENBANK-ID:HUMINS1UB|acc:J00267 Human insulin gene: be... 2514 1.4e-107 1 gb:GENBANK-ID:A57040|acc:A57040 Sequence 2 from Patent WO... 2220 2.2e-99 2 gb:GENBANK-ID:I55872|acc:I55872 Sequence 2 from patent US... 2220 2.2e-99 2 gb:GENBANK-ID:S99616|acc:S99616 insulin {promoter} [human... 2220 2.2e-99 2 gb:GENBANK-ID:HSIGF2AP|acc:X05331 Human insulin/IGF-II in... 2145 1.8e-90 1 gb:GENBANK-ID:MMINSIIG|acc:X04724 Mouse preproinsulin gen... 759 1.2e-75 3 gb:GENBANK-ID:RATINSII|acc:J00748 Rat insulin II gene (in... 713 9.7e-67 4 gb:GENBANK-ID:MMINSIG|acc:X04725 Mouse preproinsulin gene... 692 4.1e-64 3 gb:GENBANK-ID:CLINSU|acc:V00179 Dog gene encoding insulin... 776 6.0e-64 3 gb:GENBANK-ID:RNINS2|acc:V01243 Rat gene for insulin 2 - ... 713 1.1e-56 3
Format of output
Blast vs. Fasta
blast: faster, less sensitive, better if there's a strong match
fasta: slower, more sensitive, better if there's only weak homology
blast has been improved (gaps, multiple regions), more sensitive,
used more often.
Blast Programs
Program | Query | Database | Use |
---|---|---|---|
blastn | DNA | DNA | exact matches |
blastx | trDNA | protein | analyzing new DNA and ESTs |
blastp | protein | protein | distant relationships |
tblastn | protein | trDNA | finding gene homologs in genomic DNA |
tblastx | trDNA | trDNA | sensitive EST analysis |
ccgctttcgt ctccgtcctg ctgccgttac cgccgctgct gccgccgctt gcgtcccccg ctccggtctg tggtgcagcc gggacccagg accatgtctc tgtctcgctc agaggagatg caccggctca cggaaaatgt ctataagacc atcatggagc agttcaaccc tagcctccgg aacttcatcg ccatggggaa gaattacgag aaggcactgg caggtgtgac gtatgcagcc aaaggctact ttgacgccct ggtgaagatg ggggagctgg ccagcgagag ccagggctcc aaagaactcg gagacgttct cttccagatg gctgaagtcc acaggcagat ccagaatcag ctggaagaaa tgctgaagtc ttttcacaac gagctgctta cgcagctgga gcagaaggtg gagctggact ccaggtatct gagtgctgcg ctgaagaaat accagactga gcaaaggagc aaaggcgacg ccctggacaa gtgtcaggct gagctgaaga agcttcggaa gaagagccag ggcagcaaga atcctcagaa gtactcggac aaggagctgc agtacatcga cgccatcagc aacaagcagg gcgagctgga gaattacgtg tccgacggct acaagaccgc actgacagag gagcgcaggc gcttctgctt cctggtggag aagcagtgcg ccgtggccaa gaactccgcg gcctaccact ccaagggcaa ggagctgctg gcgcagaagc tgccgctgtg gcaacaggcc tgtgccgacc ccagcaagat cccggagcgc gcggtgcagc tcatgcagca ggtggccagc aacggcgcca ccctccccag cgccctgtcg gcctccaagt ccaacctggt catttccgac cccattccgg gggccaagcc cctgccggtg ccccccgagc tggcaccgtt cgtggggcgg atgtctgccc aggagagcac acccatcatg aacggcgtca caggcccgga tggcgaggac tacagcccgt gggctgaccg caaggctgcc cagcccaaat ccctgtctcc tccgcagtct cagagcaagc tcagcgactc ctactccaac acactccccg tgcgcaagag cgtgacccca aaaaacagct atgccaccac cgagaacaag actctgcctc gctcgagctc catggcagcc ggcctggagc gcaatggccg tatgcgggtg aaggccatct tctcccacgc tgctggggac aacagcaccc tcctgagctt caaggagggt gacctcatta ccctgctggt gcctgaggcc cgcgatggct ggcactacgg agagagtgag aagaccaaga tgcggggctg gtttcccttc tcctacaccc gggtcttgga cagcgatggc agtgacaggc tgcacatgag cctgcagcaa gggaagagca gcagcacggg caacctcctg gacaaggacg acctggccat cccacccccc
Remember: for genomic DNA, do gene prediction (Genscan/Grail) first!
Two long ORFs: what to do?
What does NCBI recommend?
What's the answer?
What about sequencing errors? ORF find the following sequence.
ccgctttcgt ctccgtcctg ctgccgttac cgccgctgct gccgccgctt gcgtcccccg
ctccggtctg tggtgcagcc gggacccagg accatgtctc tgtctcgctc agaggagatg
caccggctca cggaaaatgt ctataagacc atcatggagc agttcaaccc tagcctccgg
aacttcatcg ccatggggaa gaattacgag aaggcactgg caggtgtgac gtatgcagcc
aaaggctact ttgacgccct ggtgaagatg ggggagctgg ccagcgagag ccagggctcc
aaagaactcg gagacgttct cttccagatg gctgaagtcc acaggcagat ccagaatcag
ctggaagaaa tgctgaagtc ttttcacaac gagctgctta cgcagctgga gcagaaggtg
gagctggact ccaggtatct gagtgctgcg ctgaagaaat accagactga gcaaaggagc
aaaggcgacg ccctggacaa gtgtcaggct gagctgaaga agcttcggaa gaagagccag
ggcagcaaga atcctcagaa gtactcggac aaggagctgc agtacatcga cgccatcagc
aacaagcagg gcgagctgga gaattacgtg tccgacggct acaagaccgc actgacagag
gagcgcaggc gcttctgctt cctggtggag aagcagtgcg ccgtggccaa gaactccgcg
gcctaccact ccaagggcaa ggagctgctg gcgcagaagc tgccgctgtg gcaacaggcc
tgtgccgacc ccagcaagat cccggagcgc gcggtgcagc tcatgcagca ggtggccagc
aacggcgcca ccctccccag cgccctgtcg gcctccaagt ccaacctggt catttccgac
cccattccgg gggccaagcc cct ccggtg ccccccgagc tggcaccgtt cgtggggcgg
atgtctgccc aggagagcac acccatcatg aacggcgtca caggcccgga tggcgaggac
tacagcccgt gggctgaccg caaggctgcc cagcccaaat ccctgtctcc tccgcagtct
cagagcaagc tcagcgactc ctactccaac acactccccg tgcgcaagag cgtgacccca
aaaaacagct atgccaccac cgagaacaag actctgcctc gctcgagctc catggcagcc
ggcctggagc gcaatggccg tatgcgggtg aaggccatct tctcccacgc tgctggggac
aacagcaccc tcctgagctt caaggagggt gacctcatta ccctgctggt gcctgaggcc
cgcgatggct ggcactacgg agagagtgag aagaccaaga tgcggggctg gtttcccttc
tcctacaccc gggtcttgga cagcgatggc agtgacaggc tgcacatgag cctgcagcaa
gggaagagca gcagcacggg caacctcctg gacaaggacg acctggccat cccacccccc
gcatcgggat atcactatgg agtctggtcg tgtg aaggat gtaaggcctt ttttaaaaga agcattcaag gacataatga ttatatttgt ccagctacaa atcag tgtac aatcgataaa aaccggcgca agagctgcca ggc ctgccga cttcggaagt gttacgaa gt gggaatggtg aagtgtggct cccggagaga gagatgtggg taccgccttg tgcggagaca gagaagtgcc gacgagcagc tgcactgtgc cggcaaggcc aagagaagtg gcggccacgc gccccgagtg cgggagctgc tgctgg acgc cctgagcccc gagcagctag tgctcaccct cctggaggct gagccgcccc atgtgctgat cagccgcccc agtgcgccct tcaccgaggc ctccatgatg atgtccctga ccaagttggc cgacaaggag ttggtacaca tgatcagctg ggccaagaag attcccggct ttgtgg agct cagcctgttc gaccagtac ggctcttgga gagctgttgg atggaggtgt taatgatggg gctgatgtgg cgctcaattg accaccccgg caagctcatc tttgctccag atcttgttct ggacagggat gaggggaaat gcgtagaagg aattctggaa atctttgaca tgctcctggc aactacttca aggtttcgag agttaaaact ccaacacaaa gaatatctct ggtcaaggc catgatcctg ctcaattcca gtatgtaccc tctggtcaca gcgacccagg atgctgacag cagccggaag ctggctcact tgctgaacgc cgtgaccgat gctttggttt gggtgattgc caagagcggc atctcctccc agcagcaatc catgcgcctg gctaacctcc tgatgctcct gtcccacgtc aggcatgcga gggcagaaaa ggcctctcaa acactcacct catttggaat gaagatggag actcttttgc ctgaagcaac gatggagcag tgaccctcta atcaactcgg tggcctaa ag aaaaatcttgggtaacattt tcacttcaat ttccctctgg gatcattgta atccatgaaa aaaataattt taaagaaaga gttaaaatac tttgaagtta gttatgtggt taaaaaccac cttcctttct attatcaatc caacaatttg ataactgtaa acgctaaagt gaagacggat tctcttcaga tggtctcctt aactgcccag ggcttgcaga tgtctcaccc atgaggggca ccaatgtaga aagctgaggc ttcatctact gatgagcttc actggtttcc cctgaggttt gtgctttggc agagaagggg aggaggggac tgggattgtg tggtcagctg tggctgccaa cagatgcagg ttaggaactg tgttcagtat cttccaataa gaaaggggaa atgccgatgc ctatcctctt tgtttaggta gaaagtaaaa
gcacatagcc agtatgacct ggagcgactg cgtgctgccc agaaacagct tgagagggaa caggagcacg tgc gccggga ggcaga gcgg ctcagc cagc ggcagacaga acgggacctg tgtcaggttt cccatccaca taccaagctgatgaggatcc catcgttcttccccagtcct gaggag cccccctcgcca tc tgcaccttcc atagccaaat cagggtcatt ggactcagaa ctttcagtgt ccccaaaa ag gaacagcatctctcggacac acaaagataa ggggcctttt cacatactga gtcaaccag ccag acaaacaaaggaccag aagggcagag ccaggcccct gcgtccacct ctgcctctac ccgcctgttt gggttaacaa agccaaagga aaagaaggag aaaaaaaaga agaacaaaac cagccgctct cagcccggtg atggtcccgc gtcagaagta tcagcagagg gtgaagagat cttctgctga